
Extract Text from PDF in C# (100% .NET)

20 May 2006, CPOL
A simple class to extract plain text from PDF documents with iTextSharp


This is a 100% .NET solution to extract text from PDF documents.


Dan Letecky posted some nice code showing how to extract text from PDF documents in C# based on PDFBox. Although his solution works well, it has one drawback: the required additional libraries total almost 16 MB. With iTextSharp, the required additional libraries total only 2.3 MB.

Using the Code

To use this solution in your projects, follow these steps:

  • Add references to itextsharp.dll and SharpZiplib.dll
  • Add the PDFParser.cs class to your project

Then you can use the newly added class in the following way:

// create an instance of the pdfparser class
PDFParser pdfParser = new PDFParser();
// extract the text
String result = pdfParser.ExtractText(pdfFile);

I also created a small console application which uses the class and shows the progress of the conversion. Please keep in mind that when you extract text from big PDF files, keeping all of the resulting text in memory is not the best solution. In these cases, you should write the extracted text to the output file after parsing each page.
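The page-by-page approach to large files can be sketched as follows. This is only an illustration and not part of PDFParser: the class name PageTextWriter and the method WritePages are hypothetical, and the sketch assumes that the text of each page is available as a sequence of strings, ideally produced lazily so that only one page is parsed at a time.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the article's PDFParser class.
public static class PageTextWriter
{
    // Writes each page's text to the output as soon as it is produced,
    // so the full document text never has to be held in memory.
    // Returns the number of pages written.
    public static int WritePages(IEnumerable<string> pageTexts, TextWriter output)
    {
        int pages = 0;
        foreach (string pageText in pageTexts)
        {
            output.WriteLine(pageText);
            output.Flush();
            pages++;
        }
        return pages;
    }

    // Convenience overload that streams to a UTF-8 text file.
    public static int WritePages(IEnumerable<string> pageTexts, string outputPath)
    {
        using (var writer = new StreamWriter(outputPath, false, Encoding.UTF8))
        {
            return WritePages(pageTexts, writer);
        }
    }
}
```

The memory benefit only appears if the sequence passed in is lazy, for example an iterator method that uses yield return to extract one page per step. Passing a fully populated list would still work, but it would bring back the memory problem.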

How Does It Work?

My code is based on the algorithm of ExtractPDFText, which is written in C. I use iTextSharp's PdfReader class to get the decompressed (inflated) content of every page, and then a simple function, ExtractTextFromPDFBytes, to extract the text from that page content.
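To give a rough idea of what scanning page content for text involves, here is a much simplified sketch. It is not the article's ExtractTextFromPDFBytes: the names ContentStreamSketch and ExtractLiteralStrings are hypothetical, the input is assumed to be the already decompressed bytes of one page, and the sketch only collects literal strings written in parentheses. It ignores hexadecimal strings, text positioning operators, and font encodings.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch, not the article's implementation.
public static class ContentStreamSketch
{
    // Collects the raw contents of every literal string "( ... )" found in
    // the page content. Escape sequences are copied verbatim (backslash
    // included), and balanced nested parentheses are kept as text.
    public static List<string> ExtractLiteralStrings(byte[] content)
    {
        var result = new List<string>();
        var current = new StringBuilder();
        int depth = 0;

        for (int i = 0; i < content.Length; i++)
        {
            char c = (char)content[i];

            if (depth == 0)
            {
                // Outside a string: skip everything until an opening parenthesis.
                if (c == '(') depth = 1;
                continue;
            }

            if (c == '\\' && i + 1 < content.Length)
            {
                // Keep the escape sequence untouched for a later decoding step.
                current.Append(c).Append((char)content[++i]);
            }
            else if (c == '(')
            {
                depth++;
                current.Append(c);
            }
            else if (c == ')')
            {
                depth--;
                if (depth == 0)
                {
                    // Closing parenthesis of the outermost string.
                    result.Add(current.ToString());
                    current.Clear();
                }
                else
                {
                    current.Append(c);
                }
            }
            else
            {
                current.Append(c);
            }
        }
        return result;
    }
}
```

A real extractor also has to look at the operators that follow each string to decide where spaces and line breaks belong, which this sketch makes no attempt to do.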

Further Improvements

Although the code worked well for me, I could not find in Adobe's PDF Reference how to parse special characters. If someone knows how to do this, please post it and I will update the class.
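One part of the special character question is covered by the escape sequences that the PDF specification defines for literal strings: \n, \r, \t, \b, \f, \(, \), \\, and \ddd with up to three octal digits. The sketch below decodes those. Whether this covers the special characters meant above is an assumption. The names PdfStringEscapes and Decode are hypothetical, and mapping each octal code directly to a character with the same value is a simplification, since the real meaning of a code depends on the font's encoding.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the article's PDFParser class.
public static class PdfStringEscapes
{
    // Decodes the backslash escapes of a PDF literal string.
    // A lone backslash at the very end of the input is kept as-is.
    public static string Decode(string raw)
    {
        var sb = new StringBuilder(raw.Length);
        for (int i = 0; i < raw.Length; i++)
        {
            char c = raw[i];
            if (c != '\\' || i + 1 >= raw.Length)
            {
                sb.Append(c);
                continue;
            }

            char next = raw[++i];
            switch (next)
            {
                case 'n': sb.Append('\n'); break;
                case 'r': sb.Append('\r'); break;
                case 't': sb.Append('\t'); break;
                case 'b': sb.Append('\b'); break;
                case 'f': sb.Append('\f'); break;
                case '\r':
                    // Backslash + end-of-line is a line continuation: emit nothing.
                    if (i + 1 < raw.Length && raw[i + 1] == '\n') i++;
                    break;
                case '\n':
                    break;
                default:
                    if (next >= '0' && next <= '7')
                    {
                        // Up to three octal digits give a character code.
                        int code = next - '0';
                        int digits = 1;
                        while (digits < 3 && i + 1 < raw.Length &&
                               raw[i + 1] >= '0' && raw[i + 1] <= '7')
                        {
                            code = code * 8 + (raw[++i] - '0');
                            digits++;
                        }
                        // Assumption: the code is used directly as the character value.
                        sb.Append((char)(code & 0xFF));
                    }
                    else
                    {
                        // Covers \( \) \\ and any other character:
                        // the backslash is dropped and the character is kept.
                        sb.Append(next);
                    }
                    break;
            }
        }
        return sb.ToString();
    }
}
```

Text in other scripts usually needs more than this, because the string bytes then have to be mapped through the font's encoding or its ToUnicode table, which is outside the scope of this sketch.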


  • 20th May, 2006: Initial post


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Web Developer
Romania
No Biography provided

Comments and Discussions

Alternate Solution (kaaskop, 7-May-11 4:44)
The iTextSharp.dll that is included with this project bombed when I ran the program on a test file that I created with Acrobat X. The latest version of iTextSharp works better. The program itself sort of works with PDF files created with ABBYY, but it does not interpret all the tokens correctly. The result is unwanted spaces within the text. While looking for an explanation of the tokens that are embedded in the stream, I came across a project whose source compiles to a program that not only extracts the text, but also lists the dictionary and content stream. The only drawback is that you have to copy and paste 26 files into Visual Studio, since I have not been able to find a download link, but it does what I needed it to do and more.

Article Copyright 2006 by Zollor