Extract Text from PDF in C# (100% .NET)

Zollor, 20 May 2006, CPOL
A simple class to extract plain text from PDF documents with iTextSharp

Introduction

This is a 100% .NET solution to extract text from PDF documents.

Background

Dan Letecky posted some nice code showing how to extract text from PDF documents in C# based on PDFBox. Although his solution works well, it has a drawback: the required additional libraries total almost 16 MB. With iTextSharp, the required additional libraries total only 2.3 MB.

Using the Code

To use this solution in your projects, follow these steps:

  • Add references to itextsharp.dll and SharpZiplib.dll
  • Add the PDFParser.cs class to your project

Then you can use the newly added class in the following way:

// create an instance of the PDFParser class
PDFParser pdfParser = new PDFParser();

// extract the text
string result = pdfParser.ExtractText(pdfFile);

I also created a small console application that uses the class and shows the progress of the conversion. Keep in mind that if you extract text from large PDF files, keeping all of the resulting text in memory is not the best solution; in such cases, you should write the extracted text to the output file after parsing each page.
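The page-by-page approach can be sketched as follows. This is a minimal illustration, not the console application mentioned above: the class name PageTextWriter, the method WritePages, and the extractPage callback (a page number in, that page's text out) are hypothetical names introduced here, and the callback stands in for whatever per-page extraction you use.

```csharp
using System;
using System.IO;

public static class PageTextWriter
{
    // Writes each page's text to the output as soon as it is produced,
    // so only one page of text is held in memory at a time.
    // extractPage is a hypothetical callback: 1-based page number -> page text.
    public static int WritePages(int pageCount, Func<int, string> extractPage, TextWriter output)
    {
        int written = 0;
        for (int page = 1; page <= pageCount; page++)
        {
            string text = extractPage(page);
            output.WriteLine(text);
            output.Flush(); // push the page out before parsing the next one
            written++;

            // simple progress report
            Console.WriteLine("Processed page {0} of {1}", page, pageCount);
        }
        return written;
    }
}
```

In a real program, output would typically be a StreamWriter opened on the destination file inside a using block, so the file is closed even if extraction fails partway through.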

How Does It Work?

My code is based on the algorithm in ExtractPDFText, which is written in C. I use iTextSharp's PdfReader class to get the decompressed (inflated) content of every page, and then a simple function, ExtractTextFromPDFBytes, to extract the text from that page content.
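To give a feel for what such a function has to do, here is a simplified sketch. It is not the ExtractTextFromPDFBytes function from PDFParser.cs; the names ContentStreamSketch and ExtractLiteralStrings are hypothetical. It relies only on two facts about PDF page content: text objects are delimited by the BT and ET operators, and literal strings are written in parentheses. It assumes a single-byte encoding, ignores hex strings and text positioning, and leaves escape sequences other than \(, \) and \\ undecoded.

```csharp
using System;
using System.Text;

public static class ContentStreamSketch
{
    // Collects the literal strings "( ... )" found in decompressed page
    // content and starts a new line after each ET (end of text object).
    public static string ExtractLiteralStrings(byte[] content)
    {
        var text = new StringBuilder();
        var token = new StringBuilder(); // current token outside strings
        int depth = 0;                   // parenthesis nesting level

        for (int i = 0; i < content.Length; i++)
        {
            char c = (char)content[i];

            if (depth > 0) // inside a literal string
            {
                if (c == '\\' && i + 1 < content.Length)
                {
                    char next = (char)content[++i];
                    if (next == '(' || next == ')' || next == '\\')
                        text.Append(next);              // escaped delimiter
                    else
                        text.Append('\\').Append(next); // left undecoded
                }
                else if (c == '(') { depth++; text.Append(c); }
                else if (c == ')') { depth--; if (depth > 0) text.Append(c); }
                else text.Append(c);
                continue;
            }

            if (c == '(')
            {
                EndToken(token, text);
                depth = 1; // a literal string starts here
            }
            else if (char.IsWhiteSpace(c) || c == '[' || c == ']' || c == '<' || c == '>')
            {
                EndToken(token, text);
            }
            else
            {
                token.Append(c);
            }
        }

        EndToken(token, text);
        return text.ToString();
    }

    // Appends a line break when the finished token is the ET operator.
    private static void EndToken(StringBuilder token, StringBuilder text)
    {
        if (token.ToString() == "ET") text.Append('\n');
        token.Length = 0;
    }
}
```

For example, the content `BT /F1 12 Tf (Hello) Tj ( World) Tj ET` yields the line "Hello World". A fuller implementation would also look at operators such as Td and T* to decide where spaces and line breaks belong.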

Further Improvements

Although the code worked well for me, I could not find in Adobe's PDF Reference how to parse special characters. If someone knows how to do this, please post it and I will update the class.
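One part of this problem, assuming "special characters" includes backslash escapes inside literal strings, is covered by the PDF string syntax: \n, \r, \t, \b, \f, \(, \), \\, and \ddd (one to three octal digits) are escape sequences, a backslash followed by an end-of-line continues the string on the next line, and a backslash before any other character is ignored. The sketch below decodes these. PdfStringEscapes and Decode are hypothetical names and are not part of PDFParser.cs. It maps octal codes directly to characters with the same numeric value, which is an assumption; the correct mapping depends on the font's encoding.

```csharp
using System;
using System.Text;

public static class PdfStringEscapes
{
    // Decodes backslash escapes in the raw body of a PDF literal string
    // (the text between the outer parentheses).
    public static string Decode(string raw)
    {
        var sb = new StringBuilder(raw.Length);
        for (int i = 0; i < raw.Length; i++)
        {
            char c = raw[i];
            if (c != '\\') { sb.Append(c); continue; }
            if (i + 1 >= raw.Length) break; // lone trailing backslash: ignore

            char next = raw[++i];
            switch (next)
            {
                case 'n': sb.Append('\n'); break;
                case 'r': sb.Append('\r'); break;
                case 't': sb.Append('\t'); break;
                case 'b': sb.Append('\b'); break;
                case 'f': sb.Append('\f'); break;
                case '(':
                case ')':
                case '\\': sb.Append(next); break;
                case '\r': // backslash + end-of-line: line continuation
                    if (i + 1 < raw.Length && raw[i + 1] == '\n') i++;
                    break;
                case '\n': break; // line continuation
                default:
                    if (next >= '0' && next <= '7')
                    {
                        // up to three octal digits
                        int value = next - '0';
                        int digits = 1;
                        while (digits < 3 && i + 1 < raw.Length
                               && raw[i + 1] >= '0' && raw[i + 1] <= '7')
                        {
                            value = value * 8 + (raw[++i] - '0');
                            digits++;
                        }
                        // assumption: code value equals the character code
                        sb.Append((char)(value & 0xFF));
                    }
                    else
                    {
                        sb.Append(next); // unknown escape: drop the backslash
                    }
                    break;
            }
        }
        return sb.ToString();
    }
}
```

This does not address characters that are special because of the font encoding (for example, fonts with custom or multi-byte encodings); those need the font's encoding information and are outside the scope of this sketch.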

History

  • 20th May, 2006: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).


About the Author

Zollor
Web Developer
Romania


Last Updated 20 May 2006
Article Copyright 2006 by Zollor