Converting PDF to Text in C#

By Dan Letecky, 17 Aug 2012
 

Warning    

June 29, 2012: It turns out that, although the IKVM.NET bridge adds some overhead, this is still one of the best ways to parse PDF files in .NET.

The article and the Visual Studio project have been updated to work with the latest PDFBox version (1.7.0). You can also download the project with all dependencies included (resolving them proved to be a bit tricky).

How to parse PDF files     

When extending the indexing solution for an intranet built using the Lucene.NET library, I decided to add support for PDF files. But Lucene.NET (formerly DotLucene) can only index plain text, so the PDF files had to be converted to text first.

After hours of Googling I found a reasonable solution that uses "pure" .NET, or at least has no dependencies other than a few IKVM.NET assemblies. Before we get to the solution, let's take a look at the other approaches I tried.

Using Adobe PDF IFilter  

Using Adobe PDF IFilter requires: 

  1. Unreliable COM interop to handle the IFilter interface (the combination of IFilter COM and Adobe PDF IFilter is especially troublesome), and
  2. A separate installation of Adobe IFilter on the target system. This can be painful if you need to distribute your indexing solution to someone else. 

Using iTextSharp  

iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It is primarily focused on creating PDFs, not reading them, but some classes do allow you to read a PDF, most notably PdfReader. Extracting the text from the hierarchy of objects is not an easy task, however (PDF is not a simple format; the PDF Reference is itself a 7 MB compressed PDF file). I was able to get to PdfArray, PdfBoolean, PdfDictionary and other objects, but after some hours of trying to resolve PdfIndirectReference I gave up and threw away the iTextSharp-based parser.

Finally: PDFBox  

PDFBox is another Java PDF library. It is also ready to be used with the original Java Lucene (see LucenePDFDocument).  

Fortunately, there is a .NET version of PDFBox that is created using IKVM.NET (just download the PDFBox package). 

Using PDFBox in .NET requires adding references to:  

  • IKVM.OpenJDK.Core.dll 
  • IKVM.OpenJDK.SwingAWT.dll 
  • pdfbox-1.7.0.dll 

and copying the following files to the bin directory:

  • commons-logging.dll 
  • fontbox-1.7.0.dll 
  • IKVM.OpenJDK.Util.dll 
  • IKVM.Runtime.dll 
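
Missing assemblies tend to surface only at run time, so it can help to check the output folder up front. The following is a minimal sketch, not part of the original project: the class name PdfBoxDeployment and the method FindMissingAssemblies are made up for illustration, and the file names are simply the ones listed above for the PDFBox 1.7.0 build.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class PdfBoxDeployment
{
    // File names copied from the two lists above (PDFBox 1.7.0 build).
    static readonly string[] RequiredAssemblies =
    {
        "IKVM.OpenJDK.Core.dll",
        "IKVM.OpenJDK.SwingAWT.dll",
        "pdfbox-1.7.0.dll",
        "commons-logging.dll",
        "fontbox-1.7.0.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll"
    };

    // Hypothetical helper: returns the required files that are absent from 'directory'.
    public static List<string> FindMissingAssemblies(string directory)
    {
        var missing = new List<string>();
        foreach (string name in RequiredAssemblies)
        {
            if (!File.Exists(Path.Combine(directory, name)))
                missing.Add(name);
        }
        return missing;
    }
}
```

Calling PdfBoxDeployment.FindMissingAssemblies(AppDomain.CurrentDomain.BaseDirectory) at startup would list whatever still needs to be copied. Other PDFBox or IKVM.NET versions may need a different set of files.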

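Using PDFBox to parse PDFs is fairly easy:

// Requires: using org.apache.pdfbox.pdmodel; using org.apache.pdfbox.util;
private static string parseUsingPDFBox(string filename)
{
    PDDocument doc = PDDocument.load(filename);
    try
    {
        // Extract the text of all pages.
        PDFTextStripper stripper = new PDFTextStripper();
        return stripper.getText(doc);
    }
    finally
    {
        // Release the document even if extraction throws.
        doc.close();
    }
}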
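
If only part of a document is needed, the stripper can be limited to a page range. This is a minimal sketch, assuming the IKVM-compiled pdfbox-1.7.0.dll exposes the same setStartPage and setEndPage methods as the Java PDFTextStripper (page numbers are 1-based and the end page is inclusive). The names PdfPageRange and ExtractPages are hypothetical.

```csharp
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

static class PdfPageRange
{
    // Hypothetical helper: extracts text from pages firstPage..lastPage (1-based, inclusive).
    public static string ExtractPages(string filename, int firstPage, int lastPage)
    {
        PDDocument doc = PDDocument.load(filename);
        try
        {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(firstPage); // first page to include
            stripper.setEndPage(lastPage);    // last page to include
            return stripper.getText(doc);
        }
        finally
        {
            // Release the document even if extraction throws.
            doc.close();
        }
    }
}
```

For example, PdfPageRange.ExtractPages("report.pdf", 2, 4) would return the text of pages 2 through 4 of a (hypothetical) report.pdf.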

The size of the required assemblies adds up to almost 18 MB:

  • IKVM.OpenJDK.Core.dll (4 MB) 
  • IKVM.OpenJDK.SwingAWT.dll (6 MB) 
  • pdfbox-1.7.0.dll  (4 MB)  
  • commons-logging.dll (82 kB)  
  • fontbox-1.7.0.dll (180 kB)  
  • IKVM.OpenJDK.Util.dll (2 MB)  
  • IKVM.Runtime.dll (1 MB) 

The speed is not so bad: Parsing the U.S. Copyright Act PDF (5.1 MB) took about 13 seconds. 
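
To reproduce a measurement like this on your own files, the extraction call can be wrapped in a Stopwatch. This is only a sketch: ExtractionTimer and Measure are made-up names, and the delegate can wrap any extraction method, such as parseUsingPDFBox above.

```csharp
using System;
using System.Diagnostics;

static class ExtractionTimer
{
    // Hypothetical helper: runs 'extract' once and reports how long it took.
    public static string Measure(Func<string> extract, out TimeSpan elapsed)
    {
        Stopwatch watch = Stopwatch.StartNew();
        string text = extract();
        watch.Stop();
        elapsed = watch.Elapsed;
        return text;
    }
}
```

A call could look like ExtractionTimer.Measure(() => parseUsingPDFBox(path), out elapsed), where path and elapsed are your own variables. The first call in a process will probably include one-time startup costs, so timing a second run may give a fairer picture.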

History

  • June 20, 2012 - Updated to work with the latest PDFBox release 

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Dan Letecky
Czech Republic
My open-source AJAX controls:
 
DayPilot
DayPilot MVC
DayPilot Java
Outlook-Like Calendar/Scheduling Controls


Comments and Discussions

Question: share my method | danny rough (member) | 8 May '13 - 17:03
Thanks for sharing your way to convert PDF to text in C#. I have always looked for suitable software to do that, but failed. You reminded me that I can just do it myself. I used to be a developer, and I believe there is no software on the internet that is both good to use and free. If a good converter is employed, our work will be much more relaxed. I can give you some advice: a friend once recommended this PDF to text converter to me; it can be used from C#, .NET, or VB.NET. If you don't want to do it yourself, you can use this. Thanks again for sharing, good luck.
http://www.rasteredge.com/how-to/csharp-imaging/pdf-convert-text/
Question: Greek Character | Weslley (member) | 29 Apr '13 - 22:38
Hi Dan,

I need to convert a PDF with Greek characters. I'm trying to use your example, but without success. Do I need to set something? How can I define it?

Another question: my document has spaces between lines, or blank lines. Can I create the text file with these blank lines?

Another question (sorry): I have more than one page. Will the application convert them?
 
Thanks a lot!
Question: Need some info | Tridip Bhattacharjee (member) | 26 Mar '13 - 4:56
Suppose there are many images, tables, etc. What will happen then?
tbhattacharjee

Answer: Re: Need some info | Dan Letecky (member) | 26 Mar '13 - 9:14
It should extract text from tables but you will need to use OCR to read text from images.
--
My open-source AJAX controls:
DayPilot - Calendar/Scheduling Control for ASP.NET WebForms
DayPilot for MVC - Calendar/Scheduling Control for ASP.NET MVC
DayPilot for Java - Calendar/Scheduling Control for Java

General: My vote of 5 | Humayun Kabir Mamun (member) | 14 Mar '13 - 23:38
Very Helpful
General: Re: My vote of 5 | Dan Letecky (member) | 26 Mar '13 - 9:15
Thanks!

Question: retain format? | Kalpana Volety (member) | 11 Jan '13 - 7:22
Does it retain the formatting of the text? For example, is bold text retained as bold?
 
Kalpana Volety
PDF to Text
Suggestion: Pdf to text conversion in c# | HighCommand (member) | 18 Dec '12 - 8:22
We can also convert PDF to text with a free utility (pdf to text).

Here is the demonstration:
pdf to text in asp.net
Question: To know the coordinates of each extracted word | tjimenez (member) | 6 Dec '12 - 6:44
Hello,
First of all, congratulations on the project. Great work!
I would like to ask you a question:
When we are extracting the text, is it possible to know the coordinates of each word?
I mean knowing the bounding box, for instance: upper left x,y and lower right x,y.
Question: Getting "the invoked member is not supported in a dynamic assembly" exception | royk123 (member) | 1 Nov '12 - 18:14
I get this exception only on Win 7.
On XP it runs w/o errors.
 

When I run the code from VS with either debug or release - it runs smoothly.
However, when I install my project using a setup project - The pdf2text code fails with the following exception: "the invoked member is not supported in a dynamic assembly"
 
I've added a reference to all of the DLLs that come with pdf2text, and made sure they are also included in the setup project.
 
Any idea?
 
tnx
Question: How to convert only a range of pages from a PDF file | mayur.ce (member) | 1 Nov '12 - 2:17
Hi,

Is it possible to convert only specific pages to text, not the whole PDF file, with your code?
Please suggest.
Question: Exception | guton (member) | 28 Oct '12 - 7:27
Hi Dan,

The code works, but I get these exceptions.
What am I doing wrong?
thanks
 
A first chance exception of type 'ClassNotFoundException' occurred in IKVM.Runtime.dll
A first chance exception of type 'java.lang.ClassNotFoundException' occurred in IKVM.Runtime.dll
A first chance exception of type 'System.IO.FileNotFoundException' occurred in mscorlib.dll
A first chance exception of type 'ClassNotFoundException' occurred in IKVM.Runtime.dll
A first chance exception of type 'java.lang.ClassNotFoundException' occurred in IKVM.Runtime.dll
A first chance exception of type 'java.lang.NoClassDefFoundError' occurred in commons-logging.dll
A first chance exception of type 'System.TypeInitializationException' occurred in mscorlib.dll
A first chance exception of type 'ClassNotFoundException' occurred in IKVM.Runtime.dll
A first chance exception of type 'java.lang.ClassNotFoundException' occurred in IKVM.Runtime.dll
A first chance exception of type 'java.lang.NoSuchMethodException' occurred in IKVM.OpenJDK.Core.dll
A first chance exception of type 'System.IO.FileNotFoundException' occurred in mscorlib.dll
'Pdf2Text.vshost.exe' (Managed (v2.0.50727)): Loaded 'C:\Users\t\Desktop\Pdf2Text\Pdf2Text\bin\Debug\fontbox-1.7.0.dll'
A first chance exception of type 'System.IO.FileNotFoundException' occurred in mscorlib.dll
A first chance exception of type 'System.IO.FileNotFoundException' occurred in mscorlib.dll
A first chance exception of type 'System.IO.FileNotFoundException' occurred in mscorlib.dll
A first chance exception of type 'System.IO.FileNotFoundException' occurred in mscorlib.dll
A first chance exception of type 'System.IO.FileNotFoundException' occurred in mscorlib.dll
A first chance exception of type 'ClassNotFoundException' occurred in IKVM.Runtime.dll
A first chance exception of type 'java.lang.ClassNotFoundException' occurred in IKVM.Runtime.dll

Question: Can I retain formatting? | StealthNinja007 (member) | 20 Aug '12 - 14:27
Howdy,
This is a useful article. However, I might have called this process a text extractor, not converter, as all formatting is lost except for newlines.
 
I would like to be able to convert to HTML or SGML. This article says I can use an -html tag, so I am going to try that out:
http://java.dzone.com/articles/converting-pdf-html-using
 
If anybody already has experience with this, with either PDFBOX or another tool such as PDFtoHTML, I would appreciate some leads!
 
Happy progging,
General: My vote of 4 | Sreenath Kalahasti (member) | 20 Aug '12 - 7:46
Thanks for the article.
Bug: Pdf conversion to text is not happening | Member 8674123 (member) | 4 Jul '12 - 0:34
org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EI
General: Re: Pdf conversion to text is not happening | Dan Letecky (member) | 4 Jul '12 - 10:43
This seems to describe the situation quite precisely:
 
http://www.mail-archive.com/pdfbox-dev@incubator.apache.org/msg01812.html
 
Especially:
 
> 1. I do get a lot of unsupported/disabled Operation info messages from
> the logger (Appendix A)
> What do they mean for me? some parts not read? Do I have to wory about
> something?
 
Yes, pdfbox doesn't support every operation yet. Some are seldom, some
are not that important and others will lead to an incomplete rendering
or whatever you try to do with the pdf.
 
> 2. Sometimes I get problems with corrupted stream (Appendix A) though
> rather seldom ..from files perfectly viewable in Acrobat Reader? I
> assume Reader ist just more error resilent and files has some bugs?
 
Yes, that assumption is right.
--
My open-source ASP.NET 2.0 controls:
DayPilot - Outlook-like calendar/scheduling control
DayPilot MonthPicker - Light-weight month picker
MenuPilot - Hover context menu

General: Re: Pdf conversion to text is not happening | Member 8674123 (member) | 4 Jul '12 - 21:11
Actually, my PDF is read-only, i.e., it does not have permission to copy. I think that may be one of the reasons.
Question: Error occurring in vb.net | Steve.Brown (member) | 2 Jul '12 - 20:32
Looks like a great article but I get the following error (in vb.net) at the following line. Any ideas?
 
Return stripper.getText(doc)
 
Could not load file or assembly 'IKVM.OpenJDK.Text, Version=7.0.4335.0, Culture=neutral, PublicKeyToken=13235d27fcbfff58' or one of its dependencies. The system cannot find the file specified.
Steve

Answer: Re: Error occurring in vb.net | Dan Letecky (member) | 2 Jul '12 - 21:30
You should download the full PDFBox .NET package and copy IKVM.OpenJDK.Text.dll to the folder with your application.
 
You can also add it as a reference in the VS project so it gets copied to the output folder automatically.

General: Re: Error occurring in vb.net | Steve.Brown (member) | 2 Jul '12 - 21:31
Thanks for the rapid response :) Problem solved!
Steve

General: Getting empty output? [modified] | ii_noname_ii (member) | 2 Jul '12 - 0:03
Hello,
 
tested a bit..
The few pdf docs I've tested so far, give me just a few blank spaces, but no more content...
Any ideas?
 
*Edit*
Finally found one pdf that worked...
 
What to do when most files give me no (or empty strings) output??

modified 2 Jul '12 - 6:13.

General: Re: Getting empty output? | Dan Letecky (member) | 2 Jul '12 - 21:22
Please see here:
 
How come I am not getting any text from the PDF document?
 
If the PDF really contains text (and not just images, which is often the case if you scan a document to PDF) then it will be a limitation of PDFBox.

General: Re: Getting empty output? | ii_noname_ii (member) | 2 Jul '12 - 23:40
Yeah, probably...Thx.
Question: Another open source useless toy | aquant (member) | 23 May '12 - 22:48
The library simply doesn't work.
1. I need to copy the file FontBox-0.1.0-dev.dll to your release folder to avoid throwing errors.
2. Even after I did that, getText returns a series of "\n\r\n\r....." instead of text.
I tested many different PDF files and there is no difference: always the same string of 'end-of-page' characters.

This is a general property of all open source software. Tons of documentation, 'oohs and aahs', and the result is always the same: DOESN'T WORK ;)
General: Re: Another open source useless toy [modified] | abdurahman ibn hattab (member) | 29 Jun '12 - 19:27
Linux Kernel, GCC and LLVM do work. They are the great exceptions from this rule.

modified 3 Jul '12 - 5:50.

Last Updated 17 Aug 2012
Article Copyright 2005 by Dan Letecky
Everything else Copyright © CodeProject, 1999-2013