Extract Text from PDF in C# (100% .NET)

By Zollor

A simple class to extract plain text from PDF documents with iTextSharp
C#, Windows, .NET, Visual Studio, Dev
Posted: 20 May 2006

Introduction

This is a 100% .NET solution to extract text from PDF documents.

Background

Dan Letecky posted a nice piece of code showing how to extract text from PDF documents in C# based on PDFBox. Although his solution works well, it has one drawback: the required additional libraries are almost 16 MB in size. With iTextSharp, the required additional libraries take up only 2.3 MB.

Using the Code

To use this solution in your projects, follow these steps:

  • Add references to itextsharp.dll and SharpZiplib.dll
  • Add the PDFParser.cs class to your project

Then you can use the newly added class in the following way:

// create an instance of the pdfparser class
PDFParser pdfParser = new PDFParser();
   
// extract the text
String result = pdfParser.ExtractText(pdfFile);

I also created a small console application which uses the class and shows the progress of the conversion. Please keep in mind that if you extract text from big PDF files, keeping all of the resulting text in memory is not the best solution. In these cases, you should write the extracted text to the output file after parsing each page.
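The page-by-page approach for big files could look roughly like the sketch below. It is only an illustration under assumptions: PagewiseWriter, WritePages and the extractPage delegate are hypothetical names that are not part of PDFParser.cs, and the pages are passed in as a lazy sequence so that the sketch does not depend on any library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PagewiseWriter
{
    // Hypothetical helper: writes each page's text as soon as it has been
    // extracted, so only one page of text is held in memory at a time.
    // Returns the number of pages written.
    public static int WritePages(
        IEnumerable<byte[]> pages,            // page content, produced lazily
        Func<byte[], string> extractPage,     // per-page text extraction
        TextWriter output)                    // e.g. a StreamWriter on a file
    {
        int count = 0;
        foreach (byte[] page in pages)
        {
            output.WriteLine(extractPage(page));
            output.Flush();                   // push the page to the file now
            count++;
        }
        return count;
    }
}
```

With iTextSharp, the page sequence could be produced by an iterator that calls reader.GetPageContent(i) for i from 1 to reader.NumberOfPages, and output could be a StreamWriter opened on the target text file.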

How Does It Work?

My code is based on ExtractPDFText, an algorithm implemented in C. I use iTextSharp's PdfReader class to read the content of every page, which is normally stored Deflate-compressed in the PDF file. A simple function, ExtractTextFromPDFBytes, then extracts the text from those page bytes.
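To give an idea of what extracting text from page bytes involves, here is a minimal sketch. It is not the ExtractTextFromPDFBytes function from PDFParser.cs; ContentStreamSketch and ExtractLiteralStrings are hypothetical names. It assumes the bytes are an already decompressed content stream and that the text is stored as literal strings in parentheses, as in (Hello) Tj. Hex strings, font encodings and text positioning are ignored.

```csharp
using System;
using System.Text;

public static class ContentStreamSketch
{
    // Hypothetical helper: collects the literal strings "( ... )" found in a
    // decoded PDF content stream and concatenates them.
    public static string ExtractLiteralStrings(byte[] content)
    {
        var text = new StringBuilder();
        int depth = 0;                         // nesting level of parentheses

        for (int i = 0; i < content.Length; i++)
        {
            char c = (char)content[i];

            if (depth == 0)
            {
                if (c == '(') depth = 1;       // a string literal starts here
                continue;                      // operators and numbers are skipped
            }

            if (c == '\\' && i + 1 < content.Length)
            {
                // Escaped character: copy it as-is so that \( and \) do not
                // change the nesting level. Other escapes are not decoded here.
                text.Append((char)content[++i]);
            }
            else if (c == '(')
            {
                depth++;                       // balanced inner parenthesis
                text.Append(c);
            }
            else if (c == ')')
            {
                depth--;
                if (depth > 0) text.Append(c); // the outermost ')' ends the string
            }
            else
            {
                text.Append(c);
            }
        }
        return text.ToString();
    }
}
```

Because the sketch does not look at the operators between the strings, the text of separate lines is not separated by spaces or line breaks. A fuller implementation would also have to interpret the text-positioning operators.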

Further Improvements

Although the code worked well for me, I could not find in Adobe's PDF reference how to parse special characters. If someone knows how to do this, please post it and I will update the class.
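One part of the special-character problem is the backslash escapes that PDF literal strings use, including octal codes such as \351. The sketch below shows how these could be decoded. PdfStringEscapes and Decode are hypothetical names, not part of PDFParser.cs, and mapping each decoded byte directly to a character with the same code is an assumption: the correct mapping depends on the encoding of the font in use, which this sketch does not look at.

```csharp
using System;
using System.Text;

public static class PdfStringEscapes
{
    // Hypothetical helper: decodes the backslash escapes of a PDF literal
    // string (the text between the outer parentheses).
    public static string Decode(string raw)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < raw.Length; i++)
        {
            char c = raw[i];
            if (c != '\\' || i + 1 >= raw.Length) { sb.Append(c); continue; }

            char e = raw[++i];
            switch (e)
            {
                case 'n': sb.Append('\n'); break;
                case 'r': sb.Append('\r'); break;
                case 't': sb.Append('\t'); break;
                case 'b': sb.Append('\b'); break;
                case 'f': sb.Append('\f'); break;
                case '\n': break;              // backslash + line break: continuation
                case '\r':                     // same, for CR or CR LF line ends
                    if (i + 1 < raw.Length && raw[i + 1] == '\n') i++;
                    break;
                default:
                    if (e >= '0' && e <= '7')
                    {
                        // One to three octal digits give a single byte value
                        int value = e - '0';
                        int digits = 1;
                        while (digits < 3 && i + 1 < raw.Length
                               && raw[i + 1] >= '0' && raw[i + 1] <= '7')
                        {
                            value = value * 8 + (raw[++i] - '0');
                            digits++;
                        }
                        // Assumption: byte value equals character code
                        sb.Append((char)(value & 0xFF));
                    }
                    else
                    {
                        sb.Append(e);          // \( \) \\ and unknown escapes
                    }
                    break;
            }
        }
        return sb.ToString();
    }
}
```

Decoding the escapes is only a first step. Characters outside the font's basic range, for example in Chinese documents, generally require the font's encoding or character map to be taken into account as well.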

History

  • 20th May, 2006: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Zollor

Occupation: Web Developer
Location: Romania

Last Updated: 20 May 2006
Editor: Deeksha Shenoy
Copyright 2006 by Zollor