Posted 20 May 2006

Extract Text from PDF in C# (100% .NET)

A simple class to extract plain text from PDF documents with iTextSharp


This is a 100% .NET solution to extract text from PDF documents.


Dan Letecky posted a nice article on how to extract text from PDF documents in C# based on PDFBox. Although his solution works well, it has a drawback: the required additional libraries total almost 16 MB. With iTextSharp, the required additional libraries total only 2.3 MB.

Using the Code

To use this solution in your projects, follow these steps:

  • Add references to itextsharp.dll and SharpZiplib.dll
  • Add the PDFParser.cs class to your project

Then you can use the newly added class in the following way:

// create an instance of the pdfparser class
PDFParser pdfParser = new PDFParser();
// extract the text
String result = pdfParser.ExtractText(pdfFile);

I also created a small console application which uses the class and shows the progress of the conversion. Please keep in mind that if you extract text from large PDF files, keeping all the resulting text in memory is not the best solution. In these cases, you should write the extracted text to the output file after parsing each page.
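
The page-by-page approach can be sketched roughly as follows. This is only an illustration, not part of PDFParser: the names PagewiseExtraction, WriteTextPageByPage, pageCount and extractPage are hypothetical, and extractPage stands in for whatever call returns the text of one page.

```csharp
using System;
using System.IO;
using System.Text;

public static class PagewiseExtraction
{
    // pageCount and extractPage are assumptions made for this sketch;
    // they stand in for the values and calls a PDF library would provide.
    public static void WriteTextPageByPage(
        int pageCount, Func<int, string> extractPage, string outFile)
    {
        using (var writer = new StreamWriter(outFile, false, Encoding.UTF8))
        {
            for (int page = 1; page <= pageCount; page++)
            {
                // Only one page of text is held in memory at a time;
                // it is written out before the next page is parsed.
                writer.WriteLine(extractPage(page));
            }
        }
    }
}
```

Because the writer is disposed by the using block, the text written so far is flushed to disk even if parsing a later page throws.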

How Does It Work?

My code is based on the algorithm in the C implementation ExtractPDFText. I use iTextSharp's PdfReader class to get the content stream of every page (stored deflated in the PDF file), then a simple function, ExtractTextFromPDFBytes, to extract the text from the page content.
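
To give an idea of what such a scan looks like, here is a minimal sketch. It is not the article's ExtractTextFromPDFBytes; the names PageTextSketch and ExtractLiteralStrings are hypothetical. It assumes the page content has already been decompressed and that the text is stored as single-byte literal strings in parentheses, which is far from true for every PDF.

```csharp
using System;
using System.Text;

public static class PageTextSketch
{
    // Collects the literal strings "( ... )" found in a decompressed
    // PDF content stream. Hypothetical helper, not the article's code.
    public static string ExtractLiteralStrings(byte[] content)
    {
        var text = new StringBuilder();
        int depth = 0;          // nesting level of parentheses
        bool escaped = false;   // previous byte was a backslash

        foreach (byte b in content)
        {
            char c = (char)b;
            if (depth == 0)
            {
                // Outside a string: skip operators and operands.
                if (c == '(') depth = 1;
                continue;
            }
            if (escaped)
            {
                // Only \n is translated; other escapes (octal codes,
                // \r, \t) are copied without decoding in this sketch.
                text.Append(c == 'n' ? '\n' : c);
                escaped = false;
            }
            else if (c == '\\') escaped = true;
            else if (c == '(') { depth++; text.Append(c); }
            else if (c == ')')
            {
                depth--;
                if (depth > 0) text.Append(c);
                else text.Append(' ');   // separate adjacent strings
            }
            else text.Append(c);
        }
        return text.ToString().Trim();
    }
}
```

Note that the sketch puts a space after every string, so a word split across several strings (as kerned text often is) comes out with spaces inside it. It also ignores hexadecimal strings and font encodings.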

Further Improvements

Although the code worked well for me, I could not find in Adobe's PDF Reference how to parse special characters. If someone knows how to do this, please post it and I will update the class.
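
As a possible starting point, the PDF Reference's description of literal strings defines backslash escapes: \n, \r, \t, \b, \f, \(, \), \\, an octal character code of one to three digits (\ddd), and a backslash at the end of a line as a line continuation. The sketch below decodes them. The names PdfEscapeSketch and Unescape are hypothetical, and mapping each code directly to a character is an assumption: the glyph a code really stands for depends on the font's encoding.

```csharp
using System;
using System.Text;

public static class PdfEscapeSketch
{
    // Decodes the escape sequences inside one PDF literal string
    // (the text between the outer parentheses). Hypothetical helper.
    public static string Unescape(string raw)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < raw.Length; i++)
        {
            char c = raw[i];
            if (c != '\\' || i == raw.Length - 1) { sb.Append(c); continue; }

            char next = raw[++i];
            switch (next)
            {
                case 'n': sb.Append('\n'); break;
                case 'r': sb.Append('\r'); break;
                case 't': sb.Append('\t'); break;
                case 'b': sb.Append('\b'); break;
                case 'f': sb.Append('\f'); break;
                case '\r':
                    // Backslash + end of line: the string continues.
                    if (i + 1 < raw.Length && raw[i + 1] == '\n') i++;
                    break;
                case '\n':
                    break;
                default:
                    if (next >= '0' && next <= '7')
                    {
                        // Up to three octal digits give one character code.
                        int code = next - '0';
                        int digits = 1;
                        while (digits < 3 && i + 1 < raw.Length &&
                               raw[i + 1] >= '0' && raw[i + 1] <= '7')
                        {
                            code = code * 8 + (raw[++i] - '0');
                            digits++;
                        }
                        // Assumption: the code is used as the character.
                        sb.Append((char)(code & 0xFF));
                    }
                    else
                    {
                        // \( \) \\ and unknown escapes: drop the backslash.
                        sb.Append(next);
                    }
                    break;
            }
        }
        return sb.ToString();
    }
}
```

For example, Unescape(@"a\(b\) \101") yields "a(b) A", because octal 101 is the code for the letter A.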


History

  • 20th May, 2006: Initial post


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Web Developer
Romania


Comments and Discussions

Question: Too simplistic - why I voted 1
atlaste, 10-Jul-13 23:28
Sorry to be the bearer of bad news for all the good intentions. I know the PDF standard quite well, even implemented it a couple of times, and to be honest this is *not* what you want. Extracting text from PDF is a *very hard* thing to implement, not something to take lightly. To name a few issues with this code:

- Different styles of encodings are not supported; it's not just ASCII out there!
- Unicode cmaps are not implemented; you'll get gibberish, lots of it
- Different ways to encode strings are not supported
- Characters are positioned absolutely in PDF; you cannot just grab them and hope you end up with text, you need some type of OCR-like text merging

... I can go on for quite a while here ... Yes, there's a reason all implementations are megabytes in size.

If you really want to extract text from PDF, read the standard and then, if you're still up to it, start coding. Then download a couple of thousand PDFs from the internet, see all your code go to hell, fix all the issues and go on. This simplistic "solution" will just give you lots and lots of bad and unpredictable results.
Article Copyright 2006 by Zollor