Converting PDF to Text in C#

19 Apr 2015 | CPOL | 3 min read
Parsing PDF files in .NET using PDFBox and IKVM.NET (managed code).

Update

April 20, 2015: The article and the Visual Studio project have been updated to work with the latest PDFBox version (1.8.9). The project can also be downloaded with all dependencies included (resolving them proved to be a bit tricky).

February 27, 2014: This article originally described parsing PDF files using PDFBox. It has been extended to include samples for IFilter and iTextSharp.

How to Parse PDF Files

There are three main ways to extract text from PDF files in .NET:

  • Microsoft IFilter interface with the Adobe IFilter implementation
  • iTextSharp
  • PDFBox

None of these PDF parsing solutions is perfect. Each is discussed below.

1. Parsing PDF using Adobe PDF IFilter

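To parse PDF files through the IFilter interface, you need a PDF IFilter implementation installed on the machine (such as the Adobe PDF IFilter) and a managed wrapper around the IFilter COM interface.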

Sample code:

using IFilter;

// ...

public static string ExtractTextFromPdf(string path) {
  return DefaultParser.Extract(path); 
} 

If you use the PDF IFilter that ships with Adobe Acrobat Reader, your executable must be named "filtdump.exe"; otherwise the IFilter interface returns the E_NOTIMPL error code. See more at Parsing PDF Files using IFilter [squarepdf.net].

Disadvantages:

  1. It relies on COM interop to drive the IFilter interface, which is unreliable (the combination of IFilter COM and the Adobe PDF IFilter is especially troublesome).
  2. Adobe IFilter must be installed separately on the target system. This can be painful if you need to distribute your indexing solution to someone else.
  3. With the latest PDF IFilter implementation that comes with Acrobat Reader, your application's executable must be named "filtdump.exe".
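
The file-name restriction described above can be checked at startup so that a misnamed executable fails with a clear message instead of E_NOTIMPL. This is a minimal sketch, assuming the restriction depends only on the executable's file name; `HostNameCheck`, `IsFiltdumpHost` and `CurrentExePath` are hypothetical names and are not part of any IFilter library.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class HostNameCheck
{
    // Hypothetical helper: true when the executable at exePath is named
    // filtdump.exe (compared case-insensitively).
    public static bool IsFiltdumpHost(string exePath)
    {
        string fileName = Path.GetFileName(exePath);
        return string.Equals(fileName, "filtdump.exe", StringComparison.OrdinalIgnoreCase);
    }

    // Returns the full path of the main module of the running process.
    public static string CurrentExePath()
    {
        using (Process current = Process.GetCurrentProcess())
        {
            return current.MainModule.FileName;
        }
    }
}
```

A caller could run `HostNameCheck.IsFiltdumpHost(HostNameCheck.CurrentExePath())` before the first extraction and log a warning when it returns false.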

2. Parsing PDF using iTextSharp

iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It focuses primarily on creating PDFs rather than reading them, but it supports extracting text from PDF files as well.

Sample code:

using System.Text; // StringBuilder
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// ...
 
public static string ExtractTextFromPdf(string path)
{
  using (PdfReader reader = new PdfReader(path))
  {
    StringBuilder text = new StringBuilder();

    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
    }

    return text.ToString();
  }
} 

Credit: Member 10364982

You may consider using LocationTextExtractionStrategy, which orders the extracted text by its position on the page, to get better precision.

public static string ExtractTextFromPdf(string path)
{
  using (PdfReader reader = new PdfReader(path))
  {
      StringBuilder text = new StringBuilder();

      for (int i = 1; i <= reader.NumberOfPages; i++)
      {
          // Create a new strategy for each page: an instance keeps the text
          // it has already collected, so reusing it would repeat earlier pages.
          ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
          string thePage = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
          string[] theLines = thePage.Split('\n');
          foreach (var theLine in theLines)
          {
              text.AppendLine(theLine);
          }
      }
      return text.ToString();
  }
}

Credit: Member 10140900
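
When the text is going to be indexed or searched page by page, it can be handy to keep the pages separate instead of concatenating them. The following is a minimal sketch under that assumption; `PdfPageText` and `ExtractPages` are hypothetical names, while the iTextSharp calls are the same ones used in the samples above.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfPageText
{
    // Returns one string per page, in page order (index 0 holds page 1).
    public static IList<string> ExtractPages(string path)
    {
        var pages = new List<string>();

        using (PdfReader reader = new PdfReader(path))
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                // A new strategy per page, so text collected for earlier
                // pages is not carried over into this page's result.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                pages.Add(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
            }
        }

        return pages;
    }
}
```

This needs a reference to the iTextSharp assembly, just like the samples above.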

Disadvantages of iTextSharp:

  1. Licensing, if the AGPL license does not suit your project.

3. Parsing PDF using PDFBox

PDFBox is another Java PDF library. It also integrates with the original Java Lucene (see LucenePDFDocument).

Fortunately, there is a .NET version of PDFBox that is created using IKVM.NET (just download the PDFBox package).

Using PDFBox in .NET requires adding references to:

  • IKVM.OpenJDK.Core.dll
  • IKVM.OpenJDK.SwingAWT.dll
  • pdfbox-1.8.9.dll

and copying the following files to the bin directory:

  • commons-logging.dll
  • fontbox-1.8.9.dll
  • IKVM.OpenJDK.Text.dll
  • IKVM.OpenJDK.Util.dll
  • IKVM.Runtime.dll
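
Because a missing assembly typically only surfaces as a load error when the first PDF is parsed, it may help to verify the files up front. This is a minimal sketch, assuming all eight files listed above end up in one directory; `PdfBoxDependencies` and `FindMissing` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PdfBoxDependencies
{
    // File names taken from the two lists above (PDFBox 1.8.9 build).
    public static readonly string[] RequiredFiles =
    {
        "IKVM.OpenJDK.Core.dll",
        "IKVM.OpenJDK.SwingAWT.dll",
        "pdfbox-1.8.9.dll",
        "commons-logging.dll",
        "fontbox-1.8.9.dll",
        "IKVM.OpenJDK.Text.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll"
    };

    // Returns the names of required files that are not present in directory.
    public static IList<string> FindMissing(string directory)
    {
        var missing = new List<string>();
        foreach (string name in RequiredFiles)
        {
            if (!File.Exists(Path.Combine(directory, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

At startup, `PdfBoxDependencies.FindMissing(AppDomain.CurrentDomain.BaseDirectory)` returns an empty list when everything is in place.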

Using the PDFBox to parse PDFs is fairly easy:

using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

// ...

private static string ExtractTextFromPdf(string path)
{
  PDDocument doc = null;
  try {
    doc = PDDocument.load(path);
    PDFTextStripper stripper = new PDFTextStripper();
    return stripper.getText(doc);
  }
  finally {
    if (doc != null) {
      doc.close();
    }
  }
}  
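
To extract only part of a document, PDFTextStripper can be limited to a page range with setStartPage and setEndPage. This is a minimal sketch, assuming the same PDFBox 1.8.x IKVM build as above; `PdfBoxPageRange` and `ExtractPages` are hypothetical names.

```csharp
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

public static class PdfBoxPageRange
{
    // Extracts text from pages firstPage..lastPage (1-based, inclusive).
    public static string ExtractPages(string path, int firstPage, int lastPage)
    {
        PDDocument doc = null;
        try
        {
            doc = PDDocument.load(path);
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(firstPage);
            stripper.setEndPage(lastPage);
            return stripper.getText(doc);
        }
        finally
        {
            // Always release the document, even if extraction throws.
            if (doc != null)
            {
                doc.close();
            }
        }
    }
}
```

For example, `PdfBoxPageRange.ExtractPages(path, 2, 5)` would return the text of pages 2 through 5.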

The size of the required assemblies adds up to about 18 MB:

  • IKVM.OpenJDK.Core.dll (4 MB)
  • IKVM.OpenJDK.SwingAWT.dll (6 MB)
  • pdfbox-1.8.9.dll (4 MB)
  • commons-logging.dll (82 kB)
  • fontbox-1.8.9.dll (180 kB)
  • IKVM.OpenJDK.Text.dll (800 kB)
  • IKVM.OpenJDK.Util.dll (2 MB)
  • IKVM.Runtime.dll (1 MB)

The speed is acceptable: parsing the U.S. Copyright Act PDF (5.1 MB) took about 13 seconds.

Thanks to bobrien100 for improvement suggestions.

Disadvantages:

  1. IKVM.NET Dependencies (18 MB)
  2. Speed (especially the IKVM.NET warm-up time)
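
To see how much of the parsing time is one-time warm-up, one option is to time two consecutive calls in the same process and compare them. This is a minimal sketch using System.Diagnostics.Stopwatch; `ExtractionTimer` and `Measure` are hypothetical names, and the delegate stands in for any of the ExtractTextFromPdf methods shown in this article.

```csharp
using System;
using System.Diagnostics;

public static class ExtractionTimer
{
    // Runs extract(path) once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Func<string, string> extract, string path)
    {
        Stopwatch watch = Stopwatch.StartNew();
        extract(path);
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Calling `ExtractionTimer.Measure(ExtractTextFromPdf, path)` twice in a row gives a first value that includes start-up cost and a second value that does not, so the difference is a rough estimate of the warm-up time.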

History

  • April 20, 2015 - Updated to work with the latest PDFBox release (1.8.9)
  • November 27, 2014 - Updated to work with the latest PDFBox release (1.8.7)
  • March 10, 2014 - IFilter file name limitations added, iTextSharp sample extended
  • February 27, 2014 - Samples for IFilter and iTextSharp added.
  • February 24, 2014 - Updated to work with the latest PDFBox release (1.8.4)
  • June 20, 2012 - Updated to work with the latest PDFBox release (1.7.0)

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).


Written By
Czech Republic
My open-source event calendar/scheduling web UI components:

DayPilot for JavaScript, Angular, React and Vue

Comments and Discussions

Question: Can I retain formatting?
StealthNinja007, 20-Aug-12 14:27
Howdy,
This is a useful article. However, I might have called this process a text extractor, not converter, as all formatting is lost except for newlines.

I would like to be able to convert to HTML or SGML. This article says I can use an -html tag, so I am going to try that out:
http://java.dzone.com/articles/converting-pdf-html-using

If anybody already has experience with this, with either PDFBOX or another tool such as PDFtoHTML, I would appreciate some leads!

Happy progging,