
Converting PDF to Text in C#

Dan Letecky, 10 Mar 2014
Parsing PDF files in .NET using PDFBox and IKVM.NET (managed code).

Update

February 27, 2014: This article originally described parsing PDF files using PDFBox. It has been extended to include samples for IFilter and iTextSharp.

The article and the Visual Studio project are updated and work with the latest PDFBox version (1.8.4). It's also possible to download the project with all dependencies (resolving the dependencies proved to be a bit tricky).

How to Parse PDF Files

There are several main methods for extracting text from PDF files in .NET:

  • Microsoft IFilter interface and Adobe IFilter implementation.
  • iTextSharp
  • PDFBox

None of these PDF parsing solutions is perfect. We will discuss all these methods below.

Adobe PDF IFilter

To parse PDF files using the IFilter interface you need the following:

Sample code:

using IFilter;

// ...

public static string ExtractTextFromPdf(string path) {
  return DefaultParser.Extract(path); 
} 

Download a sample project:

If you are using the PDF IFilter that comes with Adobe Acrobat Reader, you will need to name your application's executable "filtdump.exe"; otherwise the IFilter interface will return the E_NOTIMPL error code. See more at Parsing PDF Files using IFilter [squarepdf.net].

Disadvantages:

  1. It relies on COM interop to call the IFilter interface, which is unreliable (the combination of IFilter COM and the Adobe PDF IFilter is especially troublesome).
  2. A separate installation of Adobe IFilter on the target system. This can be painful if you need to distribute your indexing solution to someone else.
  3. You have to use "filtdump.exe" file name for your application with the latest PDF IFilter implementation that comes with Acrobat Reader.
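
The file-name restriction is easy to miss at deployment time, so a start-up check can help. The following is a minimal sketch, not part of the sample project: the class and method names (IFilterHostCheck, HasRequiredName, WarnIfMisnamed) are hypothetical, and the case-insensitive comparison is an assumption based on Windows file-name rules rather than on documented Adobe behavior.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class IFilterHostCheck
{
    // File name the article says the Adobe Reader IFilter requires.
    const string RequiredName = "filtdump.exe";

    // Compares only the file-name part of the path.
    // Assumption: the comparison can ignore case, as Windows file names do.
    public static bool HasRequiredName(string exePath)
    {
        return string.Equals(Path.GetFileName(exePath), RequiredName,
                             StringComparison.OrdinalIgnoreCase);
    }

    // Prints a warning if the current process runs under a different name.
    public static void WarnIfMisnamed()
    {
        string exePath = Process.GetCurrentProcess().MainModule.FileName;
        if (!HasRequiredName(exePath))
        {
            Console.Error.WriteLine(
                "Warning: executable is '" + Path.GetFileName(exePath) +
                "'; the Adobe Reader IFilter may return E_NOTIMPL unless it is named '" +
                RequiredName + "'.");
        }
    }
}
```

Calling IFilterHostCheck.WarnIfMisnamed() before the first call to DefaultParser.Extract would surface the problem early instead of leaving an unexplained E_NOTIMPL failure.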

iTextSharp

iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It is primarily focused on creating and not reading PDFs but it supports extracting text from PDF as well.

Sample code:

using System.Text; // StringBuilder
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// ...
 
public static string ExtractTextFromPdf(string path)
{
  using (PdfReader reader = new PdfReader(path))
  {
    StringBuilder text = new StringBuilder();

    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
    }

    return text.ToString();
  }
} 

Credit: Member 10364982

Download a sample project:

You may consider using LocationTextExtractionStrategy to get better precision.

public static string ExtractTextFromPdf(string path)
{
  using (PdfReader reader = new PdfReader(path))
  {
      StringBuilder text = new StringBuilder();

      for (int i = 1; i <= reader.NumberOfPages; i++)
      {
          // Create a fresh strategy for every page: a reused instance keeps the
          // text chunks of earlier pages and returns them again (duplicated text).
          ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
          string thePage = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
          string[] theLines = thePage.Split('\n');
          foreach (var theLine in theLines)
          {
              text.AppendLine(theLine);
          }
      }
      return text.ToString();
  }
}

Credit: Member 10140900

Disadvantages of iTextSharp:

  1. Licensing: iTextSharp is released under the AGPL license, which may not suit your project.

PDFBox

PDFBox is another Java PDF library. It is also ready to be used with the original Java Lucene (see LucenePDFDocument).

Fortunately, there is a .NET version of PDFBox that is created using IKVM.NET (just download the PDFBox package).

Using PDFBox in .NET requires adding references to:

  • IKVM.OpenJDK.Core.dll
  • IKVM.OpenJDK.SwingAWT.dll
  • pdfbox-1.8.4.dll

and copying the following files to the bin directory:

  • commons-logging.dll
  • fontbox-1.8.4.dll
  • IKVM.OpenJDK.Util.dll
  • IKVM.Runtime.dll

Using the PDFBox to parse PDFs is fairly easy:

using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

// ...

private static string ExtractTextFromPdf(string path)
{
  PDDocument doc = null;
  try {
    // Load the document and extract the text of all pages.
    doc = PDDocument.load(path);
    PDFTextStripper stripper = new PDFTextStripper();
    return stripper.getText(doc);
  }
  finally {
    // Always release the document, even if parsing fails.
    if (doc != null) {
      doc.close();
    }
  }
}

Download a sample project:

The size of the required assemblies adds up to almost 18 MB:

  • IKVM.OpenJDK.Core.dll (4 MB)
  • IKVM.OpenJDK.SwingAWT.dll (6 MB)
  • pdfbox-1.8.4.dll (4 MB)
  • commons-logging.dll (82 kB)
  • fontbox-1.8.4.dll (180 kB)
  • IKVM.OpenJDK.Util.dll (2 MB)
  • IKVM.Runtime.dll (1 MB)
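
The sizes above are rounded, so the footprint of an actual deployment is worth checking directly. The following is a minimal sketch that sums the sizes of the listed assemblies in an output directory; the class name DependencyFootprint and the method TotalBytes are hypothetical and not part of the sample project.

```csharp
using System;
using System.IO;
using System.Linq;

static class DependencyFootprint
{
    // Sums the sizes (in bytes) of the named files found in the directory.
    // Files that are not present are skipped rather than treated as errors.
    public static long TotalBytes(string directory, params string[] fileNames)
    {
        return fileNames
            .Select(name => new FileInfo(Path.Combine(directory, name)))
            .Where(file => file.Exists)
            .Sum(file => file.Length);
    }
}
```

For example, TotalBytes(AppDomain.CurrentDomain.BaseDirectory, "IKVM.OpenJDK.Core.dll", "IKVM.OpenJDK.SwingAWT.dll", "pdfbox-1.8.4.dll", "commons-logging.dll", "fontbox-1.8.4.dll", "IKVM.OpenJDK.Util.dll", "IKVM.Runtime.dll") / (1024.0 * 1024.0) gives the total in megabytes for the build at hand.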

The speed is acceptable: parsing the U.S. Copyright Act PDF (5.1 MB) took about 13 seconds.

Thanks to bobrien100 for suggesting improvements.

Disadvantages:

  1. IKVM.NET Dependencies (18 MB)
  2. Speed (especially the IKVM.NET warm-up time)
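
To see how much of the parsing time is IKVM.NET warm-up, the same file can be parsed twice in one process and the two runs compared. The following is a minimal sketch using System.Diagnostics.Stopwatch; the class name ExtractionTimer and the method MeasureColdAndWarm are hypothetical, and the extractor is passed in as a delegate so that any of the ExtractTextFromPdf methods shown in this article can be measured.

```csharp
using System;
using System.Diagnostics;

static class ExtractionTimer
{
    // Runs the extractor twice on the same file.
    // Index 0 holds the first (cold) run, index 1 the second (warm) run.
    public static TimeSpan[] MeasureColdAndWarm(Func<string, string> extract, string path)
    {
        var timings = new TimeSpan[2];
        for (int run = 0; run < 2; run++)
        {
            Stopwatch watch = Stopwatch.StartNew();
            extract(path); // result is discarded, only the duration matters
            watch.Stop();
            timings[run] = watch.Elapsed;
        }
        return timings;
    }
}
```

A call such as ExtractionTimer.MeasureColdAndWarm(ExtractTextFromPdf, path) returns both durations; the difference between them is a rough estimate of the one-time start-up cost, although file-system caching also favors the second run.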

Related information

History

  • March 10, 2014 - IFilter file name limitations added, iTextSharp sample extended
  • February 27, 2014 - Samples for IFilter and iTextSharp added.
  • February 24, 2014 - Updated to work with the latest PDFBox release (1.8.4)
  • June 20, 2012 - Updated to work with the latest PDFBox release (1.7.0)

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Dan Letecky

Czech Republic
My open-source event calendar/scheduling AJAX controls:
 
DayPilot for JavaScript/HTML5/jQuery
DayPilot for ASP.NET
DayPilot for MVC
DayPilot for Java

Comments and Discussions

 
Bug: Pdf conversion to text is not happening (Member 8674123, 4-Jul-12 0:34)
Re: Pdf conversion to text is not happening (Dan Letecky, 4-Jul-12 10:43)
This seems to describe the situation quite precisely:
 
http://www.mail-archive.com/pdfbox-dev@incubator.apache.org/msg01812.html
 
Especially:
 
> 1. I do get a lot of unsupported/disabled Operation info messages from
> the logger (Appendix A)
> What do they mean for me? some parts not read? Do I have to wory about
> something?
 
Yes, pdfbox doesn't support every operation yet. Some are seldom, some
are not that important and others will lead to an incomplete rendering
or whatever you try to do with the pdf.
 
> 2. Sometimes I get problems with corrupted stream (Appendix A) though
> rather seldom ..from files perfectly viewable in Acrobat Reader? I
> assume Reader ist just more error resilent and files has some bugs?
 
Yes, that assumption is right.
Last Updated 10 Mar 2014
Article Copyright 2005 by Dan Letecky
Everything else Copyright © CodeProject, 1999-2014