Parsing a pdf file...

This forum is only for questions or discussions about working with the mojoPortal source code in Visual Studio, obtaining the source code from the repository, developing custom features, etc. If your question is not along these lines this is not the right forum. Please try to post your question in the appropriate forum.

Please do not post questions about design, CSS, or skinning here. Use the Help With Skins Forum for those questions.

This thread is closed to new posts. You must sign in to post in the forums.
2/9/2011 1:06:27 PM
Total Posts 7

Parsing a pdf file...

I have a module where the user can upload various files.  My goal is to parse them so that their contents can be viewed in the web page.  I have managed to parse Word documents and RTF documents just fine using the Microsoft.Office.Interop.Word assembly.

However, I would also like to support PDF files.  I have found various libraries online that can parse PDF files, but none of them has worked very well for me.  The one that shows the most promise is PDFBox:  http://www.codeproject.com/KB/string/pdf2text.aspx

However, I can't seem to get this to work.  I keep getting the error: "The invoked member is not supported in a dynamic assembly."

My module sits in mojoPortal.Features.UI. I've added references to the IKVM.GNU.Classpath and PDFBox-0.7.2.dll libraries and placed IKVM.Runtime.dll in the 'bin' folder of the UI project.

I was just wondering if this is easily solved, or if you have had experience with another solution.

I don't necessarily need to maintain the formatting, but I need at least the text.  (None of the PDF files are going to be image scans, so OCR isn't needed.)

2/10/2011 1:45:40 PM
Total Posts 70

Re: Parsing a pdf file...

Hi wantlesspower,

I have never used PDFBox, but a free PDF parser sounds very interesting. Please let us know about the solution in this thread.

In general, though, I would recommend checking that the post-build event xcopy commands in your module's project are correct. You should ensure that all required DLL files are copied into mojoPortal's bin folder after each rebuild of your project. This process is explained, for example, here as Step 3.
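As a rough illustration of the post-build step described above, the event might look something like the following. This is only a sketch: the DLL names are the ones mentioned earlier in the thread, and the destination folder ($(SolutionDir)Web\bin\) is an assumption about the solution layout that may differ in your checkout.

```bat
rem Post-build event command line (module project properties, Build Events tab).
rem Assumption: the mojoPortal web project lives in $(SolutionDir)Web.
rem Assumption: the three DLLs end up in the module's output folder ($(TargetDir)).
xcopy /y "$(TargetDir)IKVM.Runtime.dll" "$(SolutionDir)Web\bin\"
xcopy /y "$(TargetDir)IKVM.GNU.Classpath.dll" "$(SolutionDir)Web\bin\"
xcopy /y "$(TargetDir)PDFBox-0.7.2.dll" "$(SolutionDir)Web\bin\"
```

The /y switch suppresses the overwrite prompt, so the copy runs unattended on every rebuild.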

BTW, are you sure that the Microsoft.Office.Interop.Word processes are correctly killed on the remote host?

Regards, Igor

2/12/2011 10:53:27 AM
Total Posts 70

Re: Parsing a pdf file...


BTW: I haven't touched the Java world for a long time, so it was more convenient for me to use the precompiled .NET binaries.

You can find the precompiled .NET PDFBox 1.2.1 binaries here.  FYI: the latest PDFBox release is 1.4.0.

Well, it works fine in my tests; just don't forget to add post-build events to copy the required DLLs (in the module project properties).
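For reference, text extraction with the IKVM-compiled PDFBox binaries might look roughly like the sketch below. It assumes the PDFBox 1.x assemblies and their IKVM dependencies are referenced by the module; the class name PdfTextExtractor and its method are hypothetical, made up for illustration. PDFBox 1.x uses the org.apache.pdfbox namespaces, whereas the older 0.7.2 build uses org.pdfbox instead.

```csharp
using org.apache.pdfbox.pdmodel;   // PDDocument (PDFBox 1.x namespace)
using org.apache.pdfbox.util;      // PDFTextStripper

// Hypothetical helper class, not part of mojoPortal or PDFBox.
public static class PdfTextExtractor
{
    // Returns the plain text of the PDF at the given path (no formatting kept).
    public static string ExtractText(string path)
    {
        PDDocument doc = null;
        try
        {
            // Load the document from disk.
            doc = PDDocument.load(path);

            // The stripper walks the pages and collects their text content.
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(doc);
        }
        finally
        {
            // PDDocument holds file resources, so close it even on failure.
            if (doc != null)
            {
                doc.close();
            }
        }
    }
}
```

A call such as PdfTextExtractor.ExtractText(Server.MapPath("~/Data/sample.pdf")) would then return the text for display; the path here is only a placeholder.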

Regards, Igor
