Converting PDF to Text in C#

By Dan Letecky, 17 Aug 2012
 

Warning    

June 29, 2012: It turns out that, although the IKVM.NET bridge adds some overhead, this is still one of the best ways to parse PDF files in .NET.

The article and the Visual Studio project have been updated to work with the latest PDFBox version (1.7.0). You can also download the project with all dependencies included (resolving the dependencies proved to be a bit tricky).

How to parse PDF files     

When extending an intranet indexing solution built on the Lucene.NET library (formerly DotLucene), I decided to add support for PDF files. But Lucene.NET can only handle plain text, so the PDF files had to be converted first.

After hours of Googling I found a reasonable solution that uses "pure" .NET - at least there are no dependencies other than a few IKVM.NET assemblies. Before we get to the solution, let's take a look at the other approaches I tried.

Using Adobe PDF IFilter  

Using Adobe PDF IFilter requires: 

  1. Unreliable COM interop to work with the IFilter interface (the combination of IFilter COM and Adobe PDF IFilter is especially troublesome), and
  2. A separate installation of Adobe IFilter on the target system. This can be painful if you need to distribute your indexing solution to someone else.

Using iTextSharp  

iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It is primarily focused on creating PDFs rather than reading them, but some classes allow you to read a PDF - most notably PdfReader. Extracting the text from the hierarchy of objects is not an easy task, though (PDF is not a simple format; the PDF Reference is itself a 7 MB compressed PDF file). I was able to get to PdfArray, PdfBoolean, PdfDictionary and other objects, but after some hours of trying to resolve PdfIndirectReference I gave up and threw away the iTextSharp-based parser.

Finally: PDFBox  

PDFBox is another Java PDF library. It is also ready to be used with the original Java Lucene (see LucenePDFDocument).  

Fortunately, there is a .NET version of PDFBox that is created using IKVM.NET (just download the PDFBox package). 

Using PDFBox in .NET requires adding references to:  

  • IKVM.OpenJDK.Core.dll 
  • IKVM.OpenJDK.SwingAWT.dll 
  • pdfbox-1.7.0.dll 

and copying the following files to the bin directory:

  • commons-logging.dll 
  • fontbox-1.7.0.dll 
  • IKVM.OpenJDK.Util.dll 
  • IKVM.Runtime.dll 
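
Because a missing assembly tends to surface only at run time, it can help to verify the deployment up front. The following is a minimal sketch, assuming the seven file names listed above; the class name PdfBoxDependencyCheck and the method FindMissing are hypothetical helpers, not part of PDFBox or IKVM.NET.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PdfBoxDependencyCheck   // hypothetical helper, not part of PDFBox
{
    // File names taken from the two lists above (PDFBox 1.7.0 build).
    private static readonly string[] RequiredFiles =
    {
        "IKVM.OpenJDK.Core.dll",
        "IKVM.OpenJDK.SwingAWT.dll",
        "pdfbox-1.7.0.dll",
        "commons-logging.dll",
        "fontbox-1.7.0.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll"
    };

    // Returns the names of required files that are not present in binDirectory.
    public static List<string> FindMissing(string binDirectory)
    {
        var missing = new List<string>();
        foreach (string name in RequiredFiles)
        {
            if (!File.Exists(Path.Combine(binDirectory, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

Calling PdfBoxDependencyCheck.FindMissing(AppDomain.CurrentDomain.BaseDirectory) at startup returns an empty list when everything is in place.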

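Using PDFBox to parse PDFs is fairly easy:

private static string parseUsingPDFBox(string filename)
{
    // PDDocument and PDFTextStripper come from the IKVM-compiled pdfbox assembly.
    PDDocument doc = PDDocument.load(filename);
    try
    {
        PDFTextStripper stripper = new PDFTextStripper();
        return stripper.getText(doc);
    }
    finally
    {
        // Always release the document, even if text extraction throws.
        doc.close();
    }
}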
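
To show the same calls in context, here is a sketch of a small console program that converts one PDF into a text file next to it. The class name PdfToTextTool is hypothetical, and the two org.apache.pdfbox namespaces are an assumption: they mirror the Java package layout of PDFBox 1.x, which is how IKVM-compiled assemblies normally expose Java packages.

```csharp
using System;
using System.IO;
using org.apache.pdfbox.pdmodel;   // PDDocument (assumed namespace, mirrors the Java package)
using org.apache.pdfbox.util;      // PDFTextStripper (assumed namespace for PDFBox 1.x)

public static class PdfToTextTool   // hypothetical name
{
    public static int Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("Usage: PdfToTextTool <file.pdf>");
            return 1;
        }

        string pdfPath = args[0];
        // Write the output next to the input, e.g. report.pdf -> report.txt.
        string txtPath = Path.ChangeExtension(pdfPath, ".txt");

        PDDocument doc = PDDocument.load(pdfPath);
        try
        {
            string text = new PDFTextStripper().getText(doc);
            File.WriteAllText(txtPath, text);
        }
        finally
        {
            // Release the document even if extraction or writing fails.
            doc.close();
        }

        Console.WriteLine("Wrote " + txtPath);
        return 0;
    }
}
```

The project needs the assembly references listed earlier for this to compile.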

The size of the required assemblies adds up to almost 18 MB:

  • IKVM.OpenJDK.Core.dll (4 MB) 
  • IKVM.OpenJDK.SwingAWT.dll (6 MB) 
  • pdfbox-1.7.0.dll  (4 MB)  
  • commons-logging.dll (82 kB)  
  • fontbox-1.7.0.dll (180 kB)  
  • IKVM.OpenJDK.Util.dll (2 MB)  
  • IKVM.Runtime.dll (1 MB) 

The speed is acceptable: parsing the U.S. Copyright Act PDF (5.1 MB) took about 13 seconds.
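
A timing like this can be reproduced with System.Diagnostics.Stopwatch. The sketch below is one possible way to do it, not the measurement code used for the figure above; ExtractionTimer and Measure are hypothetical names, and the extraction routine is passed in as a delegate so the helper itself has no PDFBox dependency.

```csharp
using System;
using System.Diagnostics;

public static class ExtractionTimer   // hypothetical helper
{
    // Runs extract(filename) once and reports the elapsed wall-clock time.
    public static string Measure(Func<string, string> extract, string filename, out TimeSpan elapsed)
    {
        Stopwatch watch = Stopwatch.StartNew();
        string text = extract(filename);
        watch.Stop();
        elapsed = watch.Elapsed;
        return text;
    }
}
```

With the method shown earlier, a call would look like ExtractionTimer.Measure(parseUsingPDFBox, "sample.pdf", out elapsed), where sample.pdf is a placeholder path. The first call in a process probably includes one-time assembly loading and JIT cost, so a second run may give a more representative number.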

History

  • June 20, 2012 - Updated to work with the latest PDFBox release 

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Dan Letecky
Czech Republic
My open-source AJAX controls:
 
DayPilot
DayPilot MVC
DayPilot Java
Outlook-Like Calendar/Scheduling Controls

Article Copyright 2005 by Dan Letecky