How to parse PDF files

While extending the indexing solution for an intranet built on the DotLucene full-text search library, I decided to add support for PDF files. DotLucene can only handle plain text, however, so the PDF files had to be converted first.

After hours of Googling I found a reasonable solution that uses "pure" .NET; at least, it has no dependencies other than a few IKVM.NET assemblies. Before getting to the solution, let's look at the other approaches I tried.

Using Adobe PDF IFilter

Using Adobe PDF IFilter requires:

  1. Using unreliable COM interop to drive the IFilter interface (the combination of IFilter COM and Adobe PDF IFilter is especially troublesome), and
  2. A separate installation of Adobe IFilter on the target system. This can be painful if you need to distribute your indexing solution to someone else.

Read more about using IFilter in Microsoft Office Documents Parsing.

Using iTextSharp

iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It focuses primarily on creating PDFs rather than reading them, but some classes, notably PdfReader, do let you read a PDF. Extracting the text from the hierarchy of objects is not an easy task, though: PDF is not a simple format, and the PDF Reference alone is a 7 MB PDF file, compressed. I was able to get to PdfArray, PdfBoolean, PdfDictionary and other objects, but after some hours of trying to resolve PdfIndirectReference I gave up and threw away the iTextSharp-based parser.

Finally: PDFBox

PDFBox is another Java PDF library. It is also ready to use with the original Java Lucene (see LucenePDFDocument).

Fortunately, there is a .NET version of PDFBox, built with IKVM.NET. Just download the PDFBox package; the .NET assemblies are in its bin directory.

Using PDFBox in .NET requires adding references to the PDFBox and IKVM.NET assemblies from that package, and copying IKVM.Runtime.dll to your application's bin directory.

Using PDFBox to parse PDFs is fairly easy:

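// Assumes: using org.pdfbox.pdmodel; using org.pdfbox.util; (PDFBox 0.7.x namespaces)
private static string parseUsingPDFBox(string filename)
{
    PDDocument doc = PDDocument.load(filename);
    try
    {
        // Extract the text of the whole document as a single string.
        return new PDFTextStripper().getText(doc);
    }
    finally { doc.close(); } // always release the document, even on failure
}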
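
When indexing a whole folder, one unreadable PDF should probably not abort the run. The following is a minimal sketch of that idea, not part of the original solution: `PdfBatch` and `ExtractAll` are hypothetical names, and the extractor is passed in as a delegate so the block compiles without PDFBox. In practice you would pass `parseUsingPDFBox` from above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of PDFBox or DotLucene.
public static class PdfBatch
{
    // Runs `extract` on every *.pdf directly inside `folder` (no recursion).
    // Returns path -> extracted text for the files that parsed; failures are
    // appended to `errors` instead of being rethrown.
    public static Dictionary<string, string> ExtractAll(
        string folder,
        Func<string, string> extract,
        IList<string> errors)
    {
        var texts = new Dictionary<string, string>();
        foreach (string path in Directory.GetFiles(folder, "*.pdf"))
        {
            try
            {
                texts[path] = extract(path);
            }
            catch (Exception ex)
            {
                // Record the failure and continue with the next file.
                errors.Add(path + ": " + ex.Message);
            }
        }
        return texts;
    }
}
```

A call from the class that defines `parseUsingPDFBox` might look like `PdfBatch.ExtractAll(folder, parseUsingPDFBox, errors)`, after which each returned string can be handed to the indexer as plain text.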

The required assemblies add up to almost 16 MB.

The speed is not so bad: parsing the U.S. Copyright Act PDF (1.4 MB) took about 7 seconds.


Last Updated 12 Dec 2005 | Copyright © CodeProject, 1999-2009