
Extract text from PDF in C# (100% .NET)

By Zollor

A simple class to extract plain text from PDF documents with iTextSharp.
C#, Windows, .NET, Visual Studio, Dev

Posted: 20 May 2006
Updated: 20 May 2006

Introduction

This is a 100% .NET solution to extract text from PDF documents.

Background

Dan Letecky posted a nice article showing how to extract text from PDF documents in C# using PDFBox. Although his solution works well, it has a drawback: the additional libraries it requires total almost 16 MB. With iTextSharp, the required additional libraries total only 2.3 MB.

Using the code

To use this solution in your project, do the following:

  • Add references to: itextsharp.dll and SharpZiplib.dll
  • Add the PDFParser.cs class to your project.

Then you can use the newly added class in the following way:

   // Create an instance of the PDFParser class
   PDFParser pdfParser = new PDFParser();

   // Extract the text
   String result = pdfParser.ExtractText(pdfFile);

I also created a small console application that uses the class and shows the progress of the conversion. Keep in mind that when you extract text from large PDF files, keeping all of the resulting text in memory is not the best approach; in those cases you should write the extracted text to the output file after parsing each page.
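The page-by-page approach can be sketched roughly as follows. This is only an illustration, not the code from the download: PageTextWriter and the extractPage callback are hypothetical names that stand in for whatever per-page parsing you use.

```csharp
using System;
using System.IO;

public static class PageTextWriter
{
    // Writes the text of each page to 'output' as soon as it has been
    // extracted, so only one page of text is held in memory at a time.
    // 'extractPage' is a hypothetical callback (an assumption of this
    // sketch) that returns the text of the given 1-based page number.
    public static void WritePages(int pageCount, Func<int, string> extractPage, TextWriter output)
    {
        for (int page = 1; page <= pageCount; page++)
        {
            output.Write(extractPage(page));
            output.WriteLine();

            // Push the page out to the underlying file before parsing the next one.
            output.Flush();
        }
    }
}
```

With a StreamWriter opened on the output file as the last argument, the memory used should stay close to the size of the largest single page rather than the whole document.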

How does it work?

My code is based on the algorithm in the C program ExtractPDFText. I use iTextSharp's PdfReader class to obtain the content of every page, which is stored deflate-compressed in the PDF, and then a simple function, ExtractTextFromPDFBytes, extracts the text from that page content.
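To give an idea of what extracting text from page content involves, here is a heavily simplified sketch. It is not the ExtractTextFromPDFBytes function from the download; ContentStreamSketch is a hypothetical name, and the input is assumed to be an already decompressed content stream. It collects the literal strings "( ... )" that appear between the BT (begin text) and ET (end text) operators and ignores hex strings, font encodings and text positioning.

```csharp
using System;
using System.Text;

public static class ContentStreamSketch
{
    // Returns the contents of all literal strings found inside BT ... ET
    // text objects, with a line break appended after each text object.
    public static string ExtractText(byte[] content)
    {
        var text = new StringBuilder();
        var token = new StringBuilder(); // current operator or operand
        bool inTextObject = false;       // true between BT and ET
        int depth = 0;                   // parenthesis nesting inside a string

        for (int i = 0; i < content.Length; i++)
        {
            char c = (char)content[i];

            if (depth > 0)
            {
                // Inside a literal string.
                if (c == '\\' && i + 1 < content.Length)
                {
                    // \( \) and \\ yield the escaped character itself;
                    // every other escape is copied through undecoded.
                    char next = (char)content[++i];
                    if (next != '(' && next != ')' && next != '\\')
                        Append(text, inTextObject, '\\');
                    Append(text, inTextObject, next);
                }
                else if (c == '(')
                {
                    depth++;
                    Append(text, inTextObject, c);
                }
                else if (c == ')')
                {
                    depth--;
                    if (depth > 0) Append(text, inTextObject, c);
                }
                else
                {
                    Append(text, inTextObject, c);
                }
                continue;
            }

            // Outside strings: whitespace and these delimiters end a token.
            if (char.IsWhiteSpace(c) || c == '(' || c == '[' || c == ']' || c == '<' || c == '>')
            {
                EndToken(token, text, ref inTextObject);
                if (c == '(') depth = 1;
            }
            else
            {
                token.Append(c);
            }
        }

        EndToken(token, text, ref inTextObject);
        return text.ToString();
    }

    private static void Append(StringBuilder text, bool inTextObject, char c)
    {
        if (inTextObject) text.Append(c);
    }

    private static void EndToken(StringBuilder token, StringBuilder text, ref bool inTextObject)
    {
        string op = token.ToString();
        token.Length = 0;

        if (op == "BT") inTextObject = true;
        else if (op == "ET") { inTextObject = false; text.Append('\n'); }
    }
}
```

For the content "BT /F1 12 Tf (Hello) Tj ET" this returns "Hello" followed by a line break. A real parser also has to decide where to put spaces and line breaks based on the positioning operators, which this sketch does not attempt.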

Further improvements

Although the code worked well for me, I didn't find in Adobe's PDF reference how to parse special characters. If someone knows how to do this, please post it and I will update the class.
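One part of this that the PDF reference does describe is the set of escape sequences allowed inside literal strings: \n, \r, \t, \b, \f, \(, \), \\ and \ddd (one to three octal digits). The sketch below decodes those. It may cover only part of what "special characters" means here: PdfLiteralString is a hypothetical helper, not part of the PDFParser class, and it treats each decoded byte as a Latin-1 character, whereas a complete solution would map it through the encoding of the current font.

```csharp
using System;
using System.Text;

public static class PdfLiteralString
{
    // 'raw' is the text between the outer parentheses of a literal string.
    public static string Unescape(string raw)
    {
        var sb = new StringBuilder(raw.Length);

        for (int i = 0; i < raw.Length; i++)
        {
            char c = raw[i];
            if (c != '\\' || i + 1 >= raw.Length)
            {
                sb.Append(c);
                continue;
            }

            char next = raw[++i];
            switch (next)
            {
                case 'n': sb.Append('\n'); break;
                case 'r': sb.Append('\r'); break;
                case 't': sb.Append('\t'); break;
                case 'b': sb.Append('\b'); break;
                case 'f': sb.Append('\f'); break;

                // A backslash at the end of a line continues the string
                // on the next line; nothing is added to the result.
                case '\r':
                    if (i + 1 < raw.Length && raw[i + 1] == '\n') i++;
                    break;
                case '\n':
                    break;

                default:
                    if (next >= '0' && next <= '7')
                    {
                        // Octal character code with one to three digits.
                        int value = next - '0';
                        int digits = 1;
                        while (digits < 3 && i + 1 < raw.Length
                               && raw[i + 1] >= '0' && raw[i + 1] <= '7')
                        {
                            value = value * 8 + (raw[++i] - '0');
                            digits++;
                        }
                        sb.Append((char)(value & 0xFF));
                    }
                    else
                    {
                        // \( \) \\ and any unknown escape: keep the character.
                        sb.Append(next);
                    }
                    break;
            }
        }

        return sb.ToString();
    }
}
```

For example, the raw string a\(b\)\\c decodes to a(b)\c, and \101\102 decodes to AB.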

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Zollor



Occupation: Web Developer
Location: Romania

Copyright 2006 by Zollor
Everything else Copyright © CodeProject, 1999-2008