Indexing PDF documents

This forum is only for questions or discussions about working with the mojoPortal source code in Visual Studio, obtaining the source code from the repository, developing custom features, etc. If your question is not along these lines this is not the right forum. Please try to post your question in the appropriate forum.

Please do not post questions about design, CSS, or skinning here. Use the Help With Skins Forum for those questions.

This forum is for discussing mojoPortal development

You can monitor commits to the repository from this page. We also recommend that developers subscribe to email notifications in the developer forum, as important announcements are occasionally made there.

Before posting questions here you might want to review the developer documentation.

This thread is closed to new posts. You must sign in to post in the forums.
3/18/2010 7:50:04 AM
Total Posts 72

Indexing PDF documents

Hi Joe,

I have a requirement to index the PDF documents stored in Shared Files modules in mojoPortal.

My plan is to use something like iTextSharp to extract the text of each document and then pass it to Lucene for indexing.

Could you give me a few details on how Lucene is integrated into mojoPortal, and perhaps some pointers to the files I should look at?

I saw an HtmlIndexContentBuilderProvider, so I thought I might implement a PDFIndexContentBuilderProvider. Since file extensions are stored, it should be easy enough to retrieve all the PDFs.

Because the documents are large, however, I would probably want indexing to be a manually triggered process, as I certainly don't want them indexed during the day.

Many thanks,
Ben
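
For readers following along, the plan in this post (select files by extension, extract their text, feed it to an index, and run the whole thing only on demand) can be sketched roughly as below. This is an illustration in Python, not mojoPortal code: `find_pdfs`, `build_index`, and `search` are hypothetical names, the `extract_text` argument stands in for whatever PDF text extractor is used (iTextSharp in the plan above), and the small in-memory inverted index stands in for Lucene.

```python
import re
from collections import defaultdict
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Set


def find_pdfs(root: Path) -> List[Path]:
    """Return every file under root whose extension is .pdf (case-insensitive)."""
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(
    paths: Iterable[Path], extract_text: Callable[[Path], str]
) -> Dict[str, Set[str]]:
    """Map each term to the set of file paths whose extracted text contains it.

    extract_text is an assumption: a stand-in for a real PDF text extractor.
    Nothing here is scheduled; the index is built only when this is called,
    which matches the idea of a manually triggered, off-hours run.
    """
    index: Dict[str, Set[str]] = defaultdict(set)
    for path in paths:
        for term in tokenize(extract_text(path)):
            index[term].add(str(path))
    return dict(index)


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the paths that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))
```

Passing the extractor in as a parameter keeps the expensive part (reading large PDFs) separate from the indexing logic, so either side can be swapped out or tested on its own.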

3/20/2010 7:07:07 AM
Total Posts 18439

Re: Indexing PDF documents

Hi Ben,

I'm not really sure of a good strategy for indexing PDF documents directly; most solutions I've seen require an external indexing server. One of my customers had the same need, so I added a description field to the Shared Files feature. They copy the text content from the PDF and paste it into the description as plain text so that it can be indexed by our search engine. They keep the description hidden and use it only for this purpose, to make their PDF files come up in search results.

Hope it helps,

Joe
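
To make the workaround in this reply concrete, here is a minimal sketch in Python. It is not mojoPortal's implementation: `SharedFile` and `search_files` are hypothetical names, and a simple term match stands in for the real search engine. The point it illustrates is that the search only ever looks at the pasted description text, never at the PDF itself.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass
class SharedFile:
    """A shared file plus a plain-text description (hypothetical model)."""

    file_name: str
    description: str = ""  # text pasted from the PDF; hidden from visitors


def _terms(text: str) -> Set[str]:
    """Lower-case the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_files(files: List[SharedFile], query: str) -> List[str]:
    """Return names of files whose description contains every query term."""
    wanted = _terms(query)
    if not wanted:
        return []
    return [f.file_name for f in files if wanted <= _terms(f.description)]
```

A file with an empty description never matches, which is why the approach depends on someone pasting the text in for each PDF.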
