Extract Text from PDF in C# (100% .NET)

20 May 2006, CPOL
A simple class to extract plain text from PDF documents with iTextSharp

Introduction

This is a 100% .NET solution to extract text from PDF documents.

Background

Dan Letecky posted some nice code showing how to extract text from PDF documents in C# based on PDFBox. His solution works well, but it has one drawback: the additional libraries it requires total almost 16 MB. With iTextSharp, the required additional libraries total only 2.3 MB.

Using the Code

To use this solution in your projects, follow these steps:

  • Add references to itextsharp.dll and SharpZiplib.dll
  • Add the PDFParser.cs class to your project

Then you can use the newly added class in the following way:

C#
// create an instance of the pdfparser class
PDFParser pdfParser = new PDFParser();
   
// extract the text
String result = pdfParser.ExtractText(pdfFile);

I also created a small console application that uses the class and shows the progress of the conversion. Keep in mind that for big PDF files, holding all the extracted text in memory is not the best solution; in those cases, write the extracted text to the output file after parsing each page.
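The page-by-page approach could look roughly like the sketch below. It is not part of PDFParser: the PageTextWriter class and the extractPage callback are hypothetical names, and the callback stands in for whatever code returns the text of a single page.

```csharp
using System;
using System.IO;
using System.Text;

public static class PageTextWriter
{
    // Writes each page's text to outputPath as soon as it is extracted,
    // so the full document text is never accumulated in one string.
    // extractPage is a hypothetical callback: given a 1-based page number,
    // it returns the text of that page.
    public static void WritePages(int pageCount, Func<int, string> extractPage, string outputPath)
    {
        using (var writer = new StreamWriter(outputPath, false, Encoding.UTF8))
        {
            for (int page = 1; page <= pageCount; page++)
            {
                writer.WriteLine(extractPage(page));
            }
        }
    }
}
```

A call such as PageTextWriter.WritePages(pageCount, ExtractOnePage, "output.txt") would then stream the result to disk, where ExtractOnePage is whatever per-page extraction routine you have.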

How Does It Work?

My code is based on the algorithm in the C program ExtractPDFText. I use iTextSharp's PdfReader class to read the content of every page (stored deflated in the PDF), then pass the resulting bytes to a simple function, ExtractTextFromPDFBytes, which pulls the text out of the page content.
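To give a feel for what scanning a page's content involves, here is a heavily simplified sketch. It is not the ExtractTextFromPDFBytes function from the class; ContentStreamSketch and ExtractLiteralStrings are made-up names. It assumes the bytes are already decompressed and only collects literal strings written as ( ... ), ignoring hex strings, font encodings, and text-positioning operators.

```csharp
using System;
using System.Text;

public static class ContentStreamSketch
{
    // Collects the literal strings "( ... )" found in a decompressed
    // page content stream, separated by single spaces.
    public static string ExtractLiteralStrings(byte[] content)
    {
        var result = new StringBuilder();
        int depth = 0; // nesting depth of parentheses; 0 means outside a string

        for (int i = 0; i < content.Length; i++)
        {
            char c = (char)content[i];
            if (depth == 0)
            {
                if (c == '(') depth = 1; // a literal string starts here
                continue;                // everything else outside strings is skipped
            }

            if (c == '\\' && i + 1 < content.Length)
            {
                // Backslash escape: decode \n, \r, \t and keep any other
                // escaped character as-is (covers \( \) and \\).
                // Octal escapes such as \053 are NOT decoded in this sketch.
                char e = (char)content[++i];
                switch (e)
                {
                    case 'n': result.Append('\n'); break;
                    case 'r': result.Append('\r'); break;
                    case 't': result.Append('\t'); break;
                    default: result.Append(e); break;
                }
            }
            else if (c == '(')
            {
                depth++;               // balanced nested parenthesis inside a string
                result.Append(c);
            }
            else if (c == ')')
            {
                depth--;
                if (depth == 0) result.Append(' '); // string ended
                else result.Append(c);
            }
            else
            {
                result.Append(c);
            }
        }
        return result.ToString().TrimEnd();
    }
}
```

For the content "BT /F1 12 Tf (Hello) Tj (World \(x\)) Tj ET" this returns "Hello World (x)". Real PDF files need much more than this, which is why the class also looks at operators such as Tj, T* and ET.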

Further Improvements

Although the code worked well for me, I could not find in Adobe's PDF reference how to parse special characters. If someone knows how to do this, just post it and I will update the class.

History

  • 20th May, 2006: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Web Developer
Romania

Comments and Discussions

 
Question: not work (cutithongtin, 1-Aug-11 14:45)
Question: Doesn't work. (sasirekam, 29-Jun-11 19:13)
General: Alternate Solution (kaaskop, 7-May-11 3:44)
General: Re: Alternate Solution (Wizdave05, 2-Feb-12 9:04)
General: Re: Alternate Solution (Member 8641242, 13-Feb-12 15:45)
General: Re: Alternate Solution (Member 9094948, 14-Aug-12 13:45)
General: Re: Alternate Solution (James Henze, 29-Nov-13 5:39)
General: (Solved) Error when reading some document (page missing) (Lord TaGoH, 8-Apr-11 0:13)
Thank you very much for your code!
You saved me on my current project when PDFBox failed to extract the text!

I did run into a problem reading some pages of some documents, though,
because in the method ExtractTextFromPDFBytes you call:
if (CheckToken(new string[] {"'", "T*", "\""}, previousCharacters))
{
         resultString += "\n";
}


But CheckToken takes for granted that all tokens are at least 2 characters long, so a one-character token such as "'" fails on token[1]:
if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
   (recent[_numberOfCharsToKeep - 2] == token[1]) &&
...


Checking whether the token is 1 character long or longer solves the problem.

You need to replace the CheckToken method with this one:
private bool CheckToken(string[] tokens, char[] recent)
{
    foreach (string token in tokens)
    {
        if (token.Length > 1)
        {
            // Two-character token at n-3 and n-2, with whitespace at n-4 and n-1
            if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
                (recent[_numberOfCharsToKeep - 2] == token[1]) &&
                ((recent[_numberOfCharsToKeep - 1] == ' ') ||
                (recent[_numberOfCharsToKeep - 1] == 0x0d) ||
                (recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
                ((recent[_numberOfCharsToKeep - 4] == ' ') ||
                (recent[_numberOfCharsToKeep - 4] == 0x0d) ||
                (recent[_numberOfCharsToKeep - 4] == 0x0a))
                )
            {
                return true;
            }
        }
        else
        {
            // One-character token at n-2, so its leading whitespace is at n-3
            // (the posted version checked n-4 here)
            if ((recent[_numberOfCharsToKeep - 2] == token[0]) &&
                ((recent[_numberOfCharsToKeep - 1] == ' ') ||
                (recent[_numberOfCharsToKeep - 1] == 0x0d) ||
                (recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
                ((recent[_numberOfCharsToKeep - 3] == ' ') ||
                (recent[_numberOfCharsToKeep - 3] == 0x0d) ||
                (recent[_numberOfCharsToKeep - 3] == 0x0a))
                )
            {
                return true;
            }
        }
    }
    return false;
}
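
The index arithmetic above can also be derived from the token's length, which avoids a separate branch per length. The following is only a sketch of that idea, not code from the article or from the comment: TokenCheck, IsDelimiter and EndsWithToken are hypothetical names, and it assumes, like CheckToken, that the newest character sits at the end of the array and that tokens are delimited by a space, CR or LF.

```csharp
using System;

public static class TokenCheck
{
    // Whitespace characters treated as token delimiters (space, CR, LF).
    static bool IsDelimiter(char c)
    {
        return c == ' ' || c == (char)0x0d || c == (char)0x0a;
    }

    // Returns true if 'recent' ends with: delimiter, token, delimiter.
    public static bool EndsWithToken(char[] recent, string token)
    {
        int end = recent.Length - 1;    // position of the trailing delimiter
        int start = end - token.Length; // position of the token's first character
        if (token.Length == 0 || start < 1) return false;

        if (!IsDelimiter(recent[end]) || !IsDelimiter(recent[start - 1])) return false;

        for (int i = 0; i < token.Length; i++)
        {
            if (recent[start + i] != token[i]) return false;
        }
        return true;
    }
}
```

With this helper, CheckToken would reduce to a loop that returns true as soon as EndsWithToken(recent, token) succeeds for one of the tokens.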

General: Re: (Solved) Error when reading some document (page missing) [modified] (JBress, 22-Jun-11 7:40)
General: New line problem (dejan19, 4-Jan-11 3:53)
General: not working on scanned image pdf file (gaurav.ipec, 16-Dec-10 18:52)
General: Re: not working on scanned image pdf file (dr_csci, 10-Jan-11 4:07)
General: My vote of 5 (stefan_lahnor, 25-Nov-10 22:54)
General: My vote of 1 (Aaron Craft, 27-Oct-10 7:16)
Answer: Support for Unicode strings [modified] (Vasiliy Zverev, 29-Sep-10 6:25)
General: Re: Support for Unicode strings (gulak, 28-Jan-11 23:19)
General: Re: Support for Unicode strings (Vasiliy Zverev, 29-Jan-11 9:02)
General: Re: Support for Unicode strings (gulak, 31-Jan-11 3:47)
General: text position (user3782475, 9-Sep-10 1:51)
Question: char " - ascii (34) (doomelo, 30-Aug-10 23:54)
General: [My vote of 2] Technical mistake (umairaslam22, 30-Aug-10 21:52)
General: Re: [My vote of 2] Technical mistake (mdimad, 27-Sep-10 10:14)
General: Doesn't work (Hale McBraske, 17-Aug-10 10:35)
General: My vote of 1 (mycode.mycode@rocketmail.com, 13-Aug-10 21:34)
General: My vote of 3 (DotnetSniper, 9-Aug-10 19:23)
