Licence CPOL
First Posted 20 May 2006

Extract Text from PDF in C# (100% .NET)

By Zollor | 20 May 2006
A simple class to extract plain text from PDF documents with iTextSharp

Introduction

This is a 100% .NET solution to extract text from PDF documents.

Background

Dan Letecky posted a nice piece of code showing how to extract text from PDF documents in C# using PDFBox. Although his solution works well, it has a drawback: the additional libraries it requires total almost 16 MB. With iTextSharp, the required additional libraries are only 2.3 MB.

Using the Code

To use this solution in your projects, follow these steps:

  • Add references to itextsharp.dll and SharpZiplib.dll
  • Add the PDFParser.cs class to your project

Then you can use the newly added class in the following way:

// create an instance of the pdfparser class
PDFParser pdfParser = new PDFParser();
   
// extract the text
String result = pdfParser.ExtractText(pdfFile);

I also created a small console application that uses the class and shows the progress of the conversion. Keep in mind that when you extract text from large PDF files, holding all of the resulting text in memory is not the best approach. In such cases, write the extracted text to the output file after parsing each page.
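The page-by-page idea can be sketched as follows. This is a minimal illustration, not the code from the download: PagewiseWriter, WritePages, and the extractPage callback are hypothetical names, and the callback stands in for whatever produces the text of a single page.

```csharp
using System;
using System.IO;

public static class PagewiseWriter
{
    // extractPage is a hypothetical callback: given a 1-based page number,
    // it returns the text of that page.
    public static void WritePages(int pageCount, Func<int, string> extractPage, TextWriter output)
    {
        for (int page = 1; page <= pageCount; page++)
        {
            // Only one page of text is held in memory at a time.
            output.Write(extractPage(page));
            output.Write(' ');
            // Push the page out to the underlying file before parsing the next one.
            output.Flush();
        }
    }
}
```

In a real program, output would typically be a StreamWriter opened on the target file. With iTextSharp, the page count and the raw page bytes would presumably come from PdfReader.NumberOfPages and PdfReader.GetPageContent; check both against the version of the library you reference.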

How Does It Work?

My code is based on the algorithm of ExtractPDFText, which is written in C. I use iTextSharp's PdfReader class to extract the deflated content of every page, and then a simple function, ExtractTextFromPDFBytes, to extract the text from that page content.
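To give a feel for what such a function has to do, here is a heavily simplified sketch. It is not the ExtractTextFromPDFBytes from the download: ContentStreamSketch and its methods are hypothetical names. It assumes the page bytes are already decompressed, and it only collects literal strings in parentheses that appear between the BT and ET operators of a text object. Hex strings, font encodings, text positioning, and comments are ignored.

```csharp
using System;
using System.Text;

public static class ContentStreamSketch
{
    // Collects the literal strings "( ... )" found between BT and ET
    // in an already-decompressed page content stream.
    public static string ExtractLiteralText(byte[] content)
    {
        var result = new StringBuilder();
        bool inTextObject = false;
        int i = 0;
        while (i < content.Length)
        {
            if (content[i] == (byte)'(')
            {
                // Strings outside a text object are skipped (sink is null).
                i = ReadLiteral(content, i + 1, inTextObject ? result : null);
                continue;
            }
            if (IsOperatorAt(content, i, "BT")) { inTextObject = true; i += 2; continue; }
            if (IsOperatorAt(content, i, "ET"))
            {
                inTextObject = false;
                result.Append('\n'); // one output line per text object
                i += 2;
                continue;
            }
            i++;
        }
        return result.ToString();
    }

    // Reads a literal string starting just after its opening '(' and returns
    // the index just past the matching ')'. The byte after a backslash is
    // copied as-is, which is only right for \( \) and \\ (no \n or octal decoding).
    private static int ReadLiteral(byte[] content, int start, StringBuilder sink)
    {
        int depth = 1;
        int i = start;
        while (i < content.Length)
        {
            char c = (char)content[i];
            if (c == '\\' && i + 1 < content.Length)
            {
                if (sink != null) sink.Append((char)content[i + 1]);
                i += 2;
                continue;
            }
            if (c == '(') depth++;
            else if (c == ')')
            {
                depth--;
                if (depth == 0) return i + 1;
            }
            if (sink != null) sink.Append(c);
            i++;
        }
        return i; // unterminated string: stop at the end of the data
    }

    // True if the operator token sits at index i, delimited by white space
    // or by the start/end of the data. This is a simplification of PDF tokenizing.
    private static bool IsOperatorAt(byte[] content, int i, string op)
    {
        if (i + op.Length > content.Length) return false;
        for (int k = 0; k < op.Length; k++)
            if (content[i + k] != (byte)op[k]) return false;
        bool startOk = i == 0 || IsWhite(content[i - 1]);
        int end = i + op.Length;
        bool endOk = end == content.Length || IsWhite(content[end]);
        return startOk && endOk;
    }

    // The six PDF white-space characters: NUL, HT, LF, FF, CR, SP.
    private static bool IsWhite(byte b)
    {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }
}
```

The sketch joins the pieces of a TJ array without spaces and emits a line break at each ET. A fuller implementation would also look at the positioning operators to decide where spaces and line breaks belong.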

Further Improvements

Although the code worked well for me, I could not find in Adobe's PDF reference how to parse special characters. If anyone knows how to do this, please post it and I will update the class.
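One piece of this that the PDF reference does specify is the set of backslash escape sequences inside literal strings: \n, \r, \t, \b, \f, \(, \), \\, an octal character code of one to three digits (\ddd), and a backslash at the end of a line as a line continuation. The sketch below decodes those. It may cover only part of what "special characters" means here: the resulting values are still character codes that depend on the font's encoding, which this example does not handle. PdfLiteralEscapes and Decode are hypothetical names, not part of the PDFParser class.

```csharp
using System;
using System.Text;

public static class PdfLiteralEscapes
{
    // Decodes the body of a PDF literal string (the text between the outer
    // parentheses). Each character of the result holds one byte value.
    public static string Decode(string raw)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < raw.Length)
        {
            char c = raw[i++];
            if (c != '\\') { sb.Append(c); continue; }
            if (i >= raw.Length) break; // trailing lone backslash: ignore it
            char e = raw[i++];
            switch (e)
            {
                case 'n': sb.Append('\n'); break;
                case 'r': sb.Append('\r'); break;
                case 't': sb.Append('\t'); break;
                case 'b': sb.Append('\b'); break;
                case 'f': sb.Append('\f'); break;
                case '\r':
                    // Backslash + end of line is a line continuation (CR or CR LF).
                    if (i < raw.Length && raw[i] == '\n') i++;
                    break;
                case '\n':
                    break; // line continuation (LF)
                default:
                    if (e >= '0' && e <= '7')
                    {
                        // Octal code: up to three digits in total.
                        int value = e - '0';
                        int digits = 1;
                        while (digits < 3 && i < raw.Length && raw[i] >= '0' && raw[i] <= '7')
                        {
                            value = value * 8 + (raw[i++] - '0');
                            digits++;
                        }
                        sb.Append((char)(value & 0xFF));
                    }
                    else
                    {
                        // Covers \( \) \\ ; for any other character the backslash is dropped.
                        sb.Append(e);
                    }
                    break;
            }
        }
        return sb.ToString();
    }
}
```

For example, Decode applied to the raw text a\(b\)\101 yields a(b)A, because octal 101 is character code 65.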

History

  • 20th May, 2006: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).

About the Author

Zollor

Web Developer

Romania


Article Copyright 2006 by Zollor