First Posted 1 Dec 2005

Converting PDF to Text in C#

By Dan Letecky | 11 Dec 2005 | Article
Parsing PDF files in .NET using PDFBox and IKVM.NET (managed code).

How to parse PDF files

While extending an intranet indexing solution built on the DotLucene full-text search library, I decided to add support for PDF files. DotLucene can only handle plain text, so the PDF files had to be converted first.

After hours of Googling I found a reasonable solution that uses "pure" .NET: the only dependencies are a few IKVM.NET assemblies. Before we get to the solution, let's take a look at the other approaches I tried.

Using Adobe PDF IFilter

Using Adobe PDF IFilter requires:

  1. COM interop to work with the IFilter interface, which is unreliable (the combination of IFilter COM and Adobe PDF IFilter is especially troublesome), and
  2. A separate installation of Adobe IFilter on the target system. This can be painful if you need to distribute your indexing solution to someone else.

Read more about using IFilter in Microsoft Office Documents Parsing.

Using iTextSharp

iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It is primarily focused on creating PDFs rather than reading them, but some classes, PdfReader in particular, allow you to read a PDF. Extracting the text from the hierarchy of objects is not an easy task, though: PDF is not a simple format, and the PDF Reference is itself a 7 MB compressed PDF file. I was able to get to PdfArray, PdfBoolean, PdfDictionary and other objects, but after some hours of trying to resolve PdfIndirectReference I gave up and threw away the iTextSharp-based parser.

Finally: PDFBox

PDFBox is another Java PDF library. It is also ready to use with the original Java Lucene (see LucenePDFDocument).

Fortunately, there is a .NET version of PDFBox, created using IKVM.NET. Just download the PDFBox package; the .NET assemblies are in the bin directory.

Using PDFBox in .NET requires adding references to:

  • PDFBox-0.7.2.dll
  • IKVM.GNU.Classpath.dll

and copying IKVM.Runtime.dll to the bin directory.

Using PDFBox to parse PDFs is fairly easy:

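using org.pdfbox.pdmodel;   // PDDocument
using org.pdfbox.util;      // PDFTextStripper

private static string parseUsingPDFBox(string filename)
{
    PDDocument doc = PDDocument.load(filename);
    try
    {
        return new PDFTextStripper().getText(doc);
    }
    finally { doc.close(); }  // release the file even if extraction fails
}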
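For an indexing job you usually need to convert a whole folder rather than one file. The following is a minimal sketch of that, not part of PDFBox: the class name PdfBatchConverter and the method ConvertDirectory are hypothetical, and the extractor is passed in as a delegate so that parseUsingPDFBox (or any other function that maps a PDF path to text) can be plugged in.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of PDFBox or IKVM.NET.
public static class PdfBatchConverter
{
    // Converts every *.pdf directly inside sourceDir to a .txt file in targetDir.
    // 'extract' maps a PDF path to its text, e.g. parseUsingPDFBox.
    // Returns the number of files that were converted successfully.
    public static int ConvertDirectory(string sourceDir, string targetDir,
                                       Func<string, string> extract)
    {
        Directory.CreateDirectory(targetDir);
        int converted = 0;
        foreach (string pdfPath in Directory.GetFiles(sourceDir, "*.pdf"))
        {
            string text;
            try
            {
                text = extract(pdfPath);
            }
            catch (Exception ex)
            {
                // One unreadable PDF should not abort the whole batch.
                Console.Error.WriteLine("Skipping " + pdfPath + ": " + ex.Message);
                continue;
            }
            string txtPath = Path.Combine(targetDir,
                Path.GetFileNameWithoutExtension(pdfPath) + ".txt");
            File.WriteAllText(txtPath, text);
            converted++;
        }
        return converted;
    }
}
```

A call might look like PdfBatchConverter.ConvertDirectory(@"C:\docs", @"C:\docs\text", parseUsingPDFBox); the paths here are placeholders. Skipping failed files instead of rethrowing is a design choice that suits unattended indexing runs; a stricter tool might prefer to stop on the first error.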

The size of the required assemblies adds up to almost 16 MB:

  • IKVM.GNU.Classpath.dll (7 MB)
  • IKVM.Runtime.dll (360 kB)
  • PDFBox-0.7.2.dll (8 MB)

The speed is acceptable: parsing the U.S. Copyright Act PDF (1.4 MB) took about 7 seconds.
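To reproduce a measurement like this on your own documents, a sketch along the following lines may help. The class ExtractionTimer and its method are hypothetical names introduced for illustration; the extractor is again passed as a delegate, so parseUsingPDFBox can be supplied unchanged.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for illustration; not part of PDFBox or IKVM.NET.
public static class ExtractionTimer
{
    // Runs 'extract' once on 'path', returns the extracted text and
    // reports the elapsed wall-clock time through 'elapsed'.
    public static string TimeExtraction(string path, Func<string, string> extract,
                                        out TimeSpan elapsed)
    {
        Stopwatch watch = Stopwatch.StartNew();
        string text = extract(path);
        watch.Stop();
        elapsed = watch.Elapsed;
        return text;
    }
}
```

Keep in mind that the first call in a process also pays for assembly loading and JIT compilation, so timing a second run of the same file probably gives a fairer picture of steady-state speed.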

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

About the Author

Dan Letecky

Czech Republic

My open-source ASP.NET 2.0 controls:
 
DayPilot - Outlook-like calendar/scheduling control
DayPilot MonthPicker - Light-weight month picker
MenuPilot - Hover context menu


Comments and Discussions

 
I can't find a PDFBox-0.7.x.dll with a strong name. (e40s, 11:39 26 Dec '08)
Why is it you all are able to load a PDFBox-0.7.x.dll into the GAC? Are you compiling your own PDFBox-0.7.x.dll with a .snk? If so, from what PDFBox-0.7.x.dll source? If not, where can I locate, for download, a strongly named PDFBox-0.7.[2 or 3 or whatever].dll?
 
I've tried PDFBox versions .2 and .3 and my gacutil.exe fails on adding either assembly to the cache.
 
But you guys appear to have no problem with that. I'm using .NET SDK v2.0.
 
BTW, is anybody even using the GAC? Or are you allowed to just drop these DLLs directly into a directory path and start compiling the PDF sample?
 
Thanks for any replies.
 
modified on Friday, December 26, 2008 8:10 PM

Last Updated 12 Dec 2005
Article Copyright 2005 by Dan Letecky