Converting PDF to Text in C#

By Dan Letecky

Parsing PDF files in .NET using PDFBox and IKVM.NET (managed code).
C#, Windows, .NET, Visual Studio, Dev

Posted: 1 Dec 2005
Updated: 12 Dec 2005

How to parse PDF files

While extending the indexing solution for an intranet built on the DotLucene full-text search library, I decided to add support for PDF files. DotLucene can only handle plain text, however, so the PDF files had to be converted first.

After hours of Googling I found a reasonable solution that uses "pure" .NET: it has no dependencies other than a few IKVM.NET assemblies. Before we get to that solution, let's take a look at the other approaches I tried.

Using Adobe PDF IFilter

Using Adobe PDF IFilter requires:

  1. Using COM interop to work with the IFilter interface, which is unreliable (the combination of IFilter COM and Adobe PDF IFilter is especially troublesome), and
  2. A separate installation of Adobe PDF IFilter on the target system. This can be painful if you need to distribute your indexing solution to someone else.

Read more about using IFilter in Microsoft Office Documents Parsing.

Using iTextSharp

iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It focuses primarily on creating PDFs rather than reading them, but some classes, notably PdfReader, allow you to read a PDF. Extracting the text from the resulting hierarchy of objects is not an easy task, though: PDF is not a simple format, and the PDF Reference is itself a 7 MB compressed PDF file. I was able to get to PdfArray, PdfBoolean, PdfDictionary and other objects, but after some hours of trying to resolve PdfIndirectReference I gave up and threw away the iTextSharp-based parser.

Finally: PDFBox

PDFBox is another Java PDF library. It is also ready to use with the original Java Lucene (see LucenePDFDocument).

Fortunately, there is a .NET version of PDFBox, built using IKVM.NET. Just download the PDFBox package; the .NET assemblies are in its bin directory.

Using PDFBox in .NET requires adding references to:

  • PDFBox-0.7.2.dll
  • IKVM.GNU.Classpath.dll

and copying IKVM.Runtime.dll to the bin directory.

Using PDFBox to parse PDFs is fairly easy:

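// Requires: using org.pdfbox.pdmodel; using org.pdfbox.util;
private static string parseUsingPDFBox(string filename)
{
    PDDocument doc = PDDocument.load(filename);
    try
    {
        // Extract the text of all pages as a single string.
        return new PDFTextStripper().getText(doc);
    }
    finally { doc.close(); } // always release the underlying file
}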
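
If only part of a document is needed, the same approach can be narrowed to a page range. The following is a minimal sketch, not part of the original solution: the class name PdfPageText and the method ExtractPages are hypothetical, and it assumes that the PDFTextStripper in your PDFBox build exposes setStartPage and setEndPage with 1-based, inclusive page numbers.

```csharp
using System;
using org.pdfbox.pdmodel;
using org.pdfbox.util;

public static class PdfPageText
{
    // Hypothetical helper: returns the text of pages firstPage..lastPage
    // (assumed to be 1-based and inclusive).
    public static string ExtractPages(string filename, int firstPage, int lastPage)
    {
        if (firstPage < 1)
            throw new ArgumentOutOfRangeException("firstPage");
        if (lastPage < firstPage)
            throw new ArgumentOutOfRangeException("lastPage");

        PDDocument doc = PDDocument.load(filename);
        try
        {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(firstPage); // first page to extract
            stripper.setEndPage(lastPage);    // last page to extract
            return stripper.getText(doc);
        }
        finally
        {
            // Close the document even if extraction throws.
            doc.close();
        }
    }
}
```

For example, PdfPageText.ExtractPages("manual.pdf", 1, 3) would return the text of the first three pages, which can be useful when indexing only a summary or a cover section.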

The size of the required assemblies adds up to almost 16 MB:

  • IKVM.GNU.Classpath.dll (7 MB)
  • IKVM.Runtime.dll (360 kB)
  • PDFBox-0.7.2.dll (8 MB)

Performance is acceptable: parsing the U.S. Copyright Act PDF (1.4 MB) took about 7 seconds.
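
To reproduce a timing like this on your own files, a small wrapper around System.Diagnostics.Stopwatch (.NET 2.0 or later) is enough. This is only a sketch: ParseTimer, Extractor and Measure are hypothetical names, and any extraction method with a matching signature, such as parseUsingPDFBox, can be passed in.

```csharp
using System;
using System.Diagnostics;

public static class ParseTimer
{
    // Any method that takes a file name and returns the extracted text.
    public delegate string Extractor(string filename);

    // Runs the extractor once and returns the elapsed wall-clock time;
    // the extracted text is handed back through the out parameter.
    public static TimeSpan Measure(Extractor extract, string filename, out string text)
    {
        Stopwatch watch = Stopwatch.StartNew();
        text = extract(filename);
        watch.Stop();
        return watch.Elapsed;
    }
}
```

A call might look like: string text; TimeSpan elapsed = ParseTimer.Measure(parseUsingPDFBox, "document.pdf", out text);. The first call in a process may include one-time assembly loading and JIT compilation, so measuring a second run as well gives a fairer picture of steady-state speed.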

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Dan Letecky


My open-source ASP.NET 2.0 controls:

DayPilot - Outlook-like calendar/scheduling control
DayPilot MonthPicker - Light-weight month picker
MenuPilot - Hover context menu

Location: Czech Republic

Copyright 2005 by Dan Letecky