
Extract Text from PDF in C# (100% .NET)

By Zollor, 20 May 2006
 

Introduction

This is a 100% .NET solution to extract text from PDF documents.

Background

Dan Letecky posted a nice piece of code showing how to extract text from PDF documents in C# based on PDFBox. Although his solution works well, it has a drawback: the required additional libraries total almost 16 MB. With iTextSharp, the required additional libraries total only 2.3 MB.

Using the Code

In order to use this solution in your projects, you need to do the following steps:

  • Add references to itextsharp.dll and SharpZiplib.dll
  • Add the PDFParser.cs class to your project

Then you can use the newly added class in the following way:

// create an instance of the pdfparser class
PDFParser pdfParser = new PDFParser();
   
// extract the text
String result = pdfParser.ExtractText(pdfFile);

I also created a small console application that uses the class and shows the progress of the conversion. Keep in mind that if you extract text from big PDF files, keeping all the resulting text in memory is not the best solution; in those cases, you should write the extracted text to the output file after parsing each page.
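The page-by-page advice can be sketched roughly as follows. This is only an illustration, not part of the downloadable class: PageByPageWriter, WritePages and the extractPage callback are hypothetical names, and the callback stands in for whatever returns the text of a single page.

```csharp
using System;
using System.IO;

public static class PageByPageWriter
{
    // Writes each page's text to 'output' as soon as it is extracted, so that
    // only one page of text is held in memory at a time.
    // 'extractPage' is a hypothetical callback: given a 1-based page number,
    // it returns the text of that page.
    public static void WritePages(int pageCount, Func<int, string> extractPage, TextWriter output)
    {
        for (int page = 1; page <= pageCount; page++)
        {
            output.WriteLine(extractPage(page));
            output.Flush(); // push this page out before parsing the next one
        }
    }
}
```

In the setting of this article, the page count and the per-page bytes would presumably come from iTextSharp's PdfReader, and the output would be a StreamWriter on the target text file.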

How Does It Work?

My code is based on the algorithm of the C program ExtractPDFText. I use iTextSharp's PdfReader class to get the decompressed (inflated) content of every page, and then a simple function, ExtractTextFromPDFBytes, to extract the text from that page content.
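To give a feel for the idea, here is a deliberately simplified sketch of pulling text out of page content bytes. It is not the article's ExtractTextFromPDFBytes: ContentStreamSketch and ExtractLiteralStrings are hypothetical names, and the sketch assumes the bytes are already decompressed and only looks at (...) literal strings, ignoring BT/ET text objects, hex strings, octal escapes and positioning operators.

```csharp
using System.Text;

public static class ContentStreamSketch
{
    // Collects the characters found inside (...) literal strings.
    // Simplification: an escaped character is copied as-is, so "\n" yields 'n'.
    public static string ExtractLiteralStrings(byte[] content)
    {
        StringBuilder result = new StringBuilder();
        int depth = 0;         // nesting level of parentheses
        bool escaped = false;  // true right after a backslash inside a string

        foreach (byte b in content)
        {
            char c = (char)b;
            if (depth == 0)
            {
                if (c == '(') depth = 1; // a literal string starts
                continue;                // everything else outside strings is skipped
            }
            if (escaped) { result.Append(c); escaped = false; }
            else if (c == '\\') escaped = true;
            else if (c == '(') { depth++; result.Append(c); }
            else if (c == ')')
            {
                depth--;
                if (depth > 0) result.Append(c);
                else result.Append(' ');  // separate consecutive strings
            }
            else result.Append(c);
        }
        return result.ToString().TrimEnd();
    }
}
```

For example, the ASCII bytes of "BT /F1 12 Tf (Hello) Tj (World) Tj ET" yield "Hello World".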

Further Improvements

Although the code worked well for me, I didn't find in Adobe's PDF reference how to parse special characters. So if someone knows how to do this, just post it and I will update the class.

History

  • 20th May, 2006: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Zollor
Web Developer
Romania

Comments and Discussions

 
QuestionHow it work!?memberreza2168116 Apr '13 - 6:51 
Can you create a tutorial on how to use your code?
It does not work when I open it with Visual Studio 2010.
Questionlayoutmembertmac1210 Mar '13 - 23:32 
Works, but it doesn't maintain the layout of the PDF. :(
QuestionThank you!memberJoseph guidry8 Jan '13 - 9:15 
Thanks, you're a life saver. :cool:
SuggestionPdf to text conversion in asp.netmemberHighCommand18 Dec '12 - 8:24 
We can also convert PDF to text with a free utility (pdftotext).
 
Here is the demonstration:
pdf to text in asp.net
BugFound bugmemberMunissoR24 Apr '12 - 1:57 
In the foreach loop of the CheckToken method you try to access token[1], which does not exist if the length of the token is 1 (e.g. when you are checking for ' and ").
AnswerRe: Found bugmemberfborelli4 Jul '12 - 8:33 
private bool CheckToken(string[] tokens, char[] recent)
{
    foreach (string token in tokens)
    {
        // Skip single-character tokens (' and "): token[1] does not exist for them.
        if (token.Length > 1)
        {
            // Match [whitespace][token[0]][token[1]][whitespace] at the end of recent.
            if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
                (recent[_numberOfCharsToKeep - 2] == token[1]) &&
                ((recent[_numberOfCharsToKeep - 1] == ' ') ||
                (recent[_numberOfCharsToKeep - 1] == 0x0d) ||
                (recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
                ((recent[_numberOfCharsToKeep - 4] == ' ') ||
                (recent[_numberOfCharsToKeep - 4] == 0x0d) ||
                (recent[_numberOfCharsToKeep - 4] == 0x0a))
                )
            {
                return true;
            }
        }
    }
    return false;
}

GeneralGreat Post, Works Great!memberMember 202264516 Apr '12 - 5:13 
You did everyone a big favor documenting the text parsing; great job!!! People should have done this for iTextSharp long ago. Thank you!!! :)
Nevin House
Programmer/Developer

QuestionGreat post!memberEric Castellon9 Apr '12 - 9:35 
Great, my vote of 5. In fact, I solved a problem with this.
 
Thanks!!!
GeneralMy vote of 5memberbrinda roy20 Feb '12 - 23:41 
good
GeneralMy vote of 1membermjkhan78620 Jan '12 - 21:35 
code is not working
Questionhow to export data from excel to PDF ?membernimolZero28 Aug '11 - 6:10 
Dear all,
 
Can anybody please point me to a way to export data from Excel to PDF,
 
or to export data from a DataGridView to PDF with support for Unicode strings?
 
It feels very complicated; I have been trying for several days but can't find any solution.
 
Please help me :^)
Questionnot workmembercutithongtin1 Aug '11 - 14:45 
Very simple feedback: it does not work! ;-P
QuestionDosn't work.membersasirekam29 Jun '11 - 19:13 
I couldn't extract text from a PDF. I just downloaded the code and included the DLL file and the PDFSharper file in my project, but I didn't get any string in the output file.
 
Anybody please help me.
 
Thanks in advance..
GeneralAlternate Solutionmemberkaaskop7 May '11 - 3:44 
The iTextSharp.dll that is included with this project bombed when I ran the program on a test file that I created with Acrobat X. The latest version of iTextSharp works better. The program itself sort of works with PDF files created with ABBYY, but it does not interpret all the tokens correctly; the result is unwanted spaces within the text. While looking for an explanation of the tokens that are embedded in the stream, I came across http://www.java2s.com/Open-Source/CSharp/PDF/iTextSharp/iTextSharp/text/pdf/parser/Catalogparser.htm. It has source that compiles to a program that not only extracts the text but also lists the dictionary and content stream. The only drawback is that you have to copy and paste 26 files into Visual Studio, since I have not been able to find a download link, but it does what I needed it to do and more.
GeneralRe: Alternate SolutionmemberWizdave052 Feb '12 - 9:04 
This code is actually in the latest version of iTextSharp (5.1.3.0) and is much simpler to use than the code in this project (and works more reliably too). Here's a simple class that uses iTextSharp:
 
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
 
public class PdfTextParser
{
    // Returns the text of a single page (pageNumber is 1-based).
    public string ExtractTextFromPDFPage(string pdfFile, int pageNumber)
    {
        PdfReader reader = new PdfReader(pdfFile);
        try
        {
            return PdfTextExtractor.GetTextFromPage(reader, pageNumber);
        }
        finally
        {
            // Always release the file, even if the extraction throws.
            reader.Close();
        }
    }
}

GeneralRe: Alternate SolutionmemberMember 864124213 Feb '12 - 15:45 
Wow, thank you! This works perfectly and requires quite a bit less code.
GeneralRe: Alternate SolutionmemberMember 909494814 Aug '12 - 13:45 
Works for me! :thumbsup:
General(Solved) Error when reading some document (page missing)memberLord TaGoH8 Apr '11 - 0:13 
Thank you very much for your CODE! :thumbsup:
You saved my ass on my current project when PDFBox failed to extract the text! :cool:
 
I did run into a problem reading some pages of some documents, though,
because in the code (method ExtractTextFromPDFBytes) you call:
if (CheckToken(new string[] {"'", "T*", "\""}, previousCharacters))
{
         resultString += "\n";
}
 
But CheckToken takes for granted that ALL tokens are at least 2 characters long:
if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
   (recent[_numberOfCharsToKeep - 2] == token[1]) &&
...
 
Checking whether the token is 1 character long or more solves the problem.
 
You need to replace the CheckToken method with this one:
private bool CheckToken(string[] tokens, char[] recent)
        {
            foreach(string token in tokens)
            {
                if (token.Length > 1)
                {
                    if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
                        (recent[_numberOfCharsToKeep - 2] == token[1]) &&
                        ((recent[_numberOfCharsToKeep - 1] == ' ') ||
                        (recent[_numberOfCharsToKeep - 1] == 0x0d) ||
                        (recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
                        ((recent[_numberOfCharsToKeep - 4] == ' ') ||
                        (recent[_numberOfCharsToKeep - 4] == 0x0d) ||
                        (recent[_numberOfCharsToKeep - 4] == 0x0a))
                        )
                    {
                        return true;
                    }
                }
                else
                {
                    if ((recent[_numberOfCharsToKeep - 2] == token[0]) &&
                        ((recent[_numberOfCharsToKeep - 1] == ' ') ||
                        (recent[_numberOfCharsToKeep - 1] == 0x0d) ||
                        (recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
                        ((recent[_numberOfCharsToKeep - 4] == ' ') ||
                        (recent[_numberOfCharsToKeep - 4] == 0x0d) ||
                        (recent[_numberOfCharsToKeep - 4] == 0x0a))
                        )
                    {
                        return true;
                    }
                }
            }
            return false;
        }

GeneralRe: (Solved) Error when reading some document (page missing) [modified]memberJBress22 Jun '11 - 7:40 
Thanks for pointing this out
 
The CheckToken method seems to find [whitespace][token_chars][whitespace] at the end of recent
 
So when tokens[i] contains 2 characters (from -4 to -1) :
	(
		(recent[_numberOfCharsToKeep - 4] == ' ')
		|| (recent[_numberOfCharsToKeep - 4] == 0x0d)
		|| (recent[_numberOfCharsToKeep - 4] == 0x0a)
	)
	&& (recent[_numberOfCharsToKeep - 3] == token[0])
	&& (recent[_numberOfCharsToKeep - 2] == token[1])
	&& (
		(recent[_numberOfCharsToKeep - 1] == ' ')
		|| (recent[_numberOfCharsToKeep - 1] == 0x0d)
		|| (recent[_numberOfCharsToKeep - 1] == 0x0a)
	)
 
But when tokens[i] contains 1 character, it should be (from -3 to -1) :
	(
		(recent[_numberOfCharsToKeep - 3] == ' ')
		|| (recent[_numberOfCharsToKeep - 3] == 0x0d)
		|| (recent[_numberOfCharsToKeep - 3] == 0x0a)
	)
	&& (recent[_numberOfCharsToKeep - 2] == token[0])
	&& (
		(recent[_numberOfCharsToKeep - 1] == ' ')
		|| (recent[_numberOfCharsToKeep - 1] == 0x0d)
		|| (recent[_numberOfCharsToKeep - 1] == 0x0a)
	)
 
Right ?
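
The [whitespace][token_chars][whitespace] rule discussed in this thread could also be written once for any token length, which avoids separate index arithmetic for 1-character and 2-character tokens. The following is only a sketch under that reading of the rule: TokenCheck, EndsWithToken and IsWhitespace are hypothetical names, not part of the article's class, and the buffer length is taken from the array instead of _numberOfCharsToKeep.

```csharp
public static class TokenCheck
{
    private static bool IsWhitespace(char c)
    {
        return c == ' ' || c == (char)0x0d || c == (char)0x0a;
    }

    // Returns true when 'recent' ends with [whitespace][token][whitespace]
    // for any of the given tokens, whatever the token length.
    public static bool EndsWithToken(string[] tokens, char[] recent)
    {
        int last = recent.Length - 1;                 // index of the trailing whitespace
        foreach (string token in tokens)
        {
            int start = last - token.Length;          // index of the token's first char
            if (token.Length == 0 || start < 1) continue; // buffer too short for this token
            if (!IsWhitespace(recent[last]) || !IsWhitespace(recent[start - 1])) continue;

            bool match = true;
            for (int k = 0; k < token.Length && match; k++)
                match = recent[start + k] == token[k];
            if (match) return true;
        }
        return false;
    }
}
```

With a 2-character token the checked positions are -4 to -1 from the end of the buffer, and with a 1-character token they are -3 to -1, as described above.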

modified on Wednesday, June 22, 2011 1:47 PM

GeneralNew line problemmemberdejan19dejan194 Jan '11 - 3:53 
Hi all,
 
Is there any way to know where a new line starts in the PDF file? I got the text from the PDF file, but I need to know where the lines end.
 
I tried to check whether this
char c = (char) input [i];
is a newline,
but without success.
Is there a way to do it?
 
BR,
Dejan
Generalnot working on scaned image pdf filemembergaurav.ipec16 Dec '10 - 18:52 
This solution does not work on scanned-image PDF files; the output is a blank text file.
GeneralRe: not working on scaned image pdf filememberdr_csci10 Jan '11 - 4:07 
Indeed it should return a blank file since your scanned pdf document is really an image. There will not be any text data in the PDF for this program to extract. What you are really looking for is OCR:
 
http://en.wikipedia.org/wiki/Optical_character_recognition
GeneralMy vote of 5memberstefan_lahnor25 Nov '10 - 22:54 
works great for me
GeneralMy vote of 1memberajc27 Oct '10 - 7:16 
The code segment provided does NOT match the pdfParser class supplied in the source code.
AnswerSupport for Unicode strings [modified]memberVasiliy Zverev29 Sep '10 - 6:25 
PDF stores ASCII strings (one byte per character) as is, in parentheses: (string). Unicode strings (two bytes per character), however, are stored as hex strings in angle brackets: <0073007400720069006e0067>. Every 4 hex digits encode one Unicode character.
I have added decoding of such hex strings to ExtractTextFromPDFBytes(), along with a new utility function, GetCharFromHex(), which converts 4 hex digits to a Unicode char. I build the result string in a StringBuilder instead of a String to improve performance, and I included support for ANSI characters encoded as octal codes (thanks to MrVeloso). I've also removed the try/catch in ExtractTextFromPDFBytes() because it masks any program or logic problems and can lead to losing text without any warning.
#region GetCharFromHex
/// <summary>
/// convert 4 (or less) hex digits to unicode character.
/// Hex digits are stored at the end of previousCharacters.
/// </summary>
private char GetCharFromHex(char[] previousCharacters, int hexDigits)
{
    short code = 0;
    code = Convert.ToInt16(new string(previousCharacters, previousCharacters.Length - hexDigits, hexDigits), 16);
    // if there are less digits than 4, we add zeros to the right (e.g. 'ab' -> 'ab00'). This is done according to PDF specification.
    while (hexDigits < 4)
    {
        code <<= 4;
        hexDigits++;
    }
    return (char)code;
}
#endregion
 
#region ExtractTextFromPDFBytes
/// <summary>
/// This method processes an uncompressed Adobe (text) object 
/// and extracts text.
/// </summary>
/// <param name="input">uncompressed</param>
/// <returns></returns>
public string ExtractTextFromPDFBytes(byte[] input)
{
    if (input == null || input.Length == 0) return "";
 
    StringBuilder resultString = new StringBuilder(4096);
 
    // Flag showing if we are currently inside a text object
    bool inTextObject = false;
 
    // Flag showing if the next character is literal 
    // e.g. '\\' to get a '\' character or '\(' to get '('
    bool nextLiteral = false;
 
    // () Bracket nesting level. Text appears inside ()
    int bracketDepth = 0;
    // is in hex string: <1234>
    bool inHexString = false;
    // number of hex digits read while in hex string.
    int hexDigits = 0;
 
    // Keep previous chars to extract numbers etc.:
    char[] previousCharacters = new char[_numberOfCharsToKeep];
    for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';
 
    for (int i = 0; i < input.Length; i++)
    {
        char c = (char)input[i];
 
        if (inTextObject)
        {
            // Position the text
            if (bracketDepth == 0 && !inHexString)
            {
                if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
                    resultString.Append("\n\r");
                else
                {
                    if (CheckToken(new string[] { "'", "T*", "\"" /*"*/ /*fix code highlighting on codeproject*/ }, previousCharacters))
                        resultString.Append("\n");
                    else
                    {
                        if (CheckToken(new string[] { "Tj" }, previousCharacters))
                            resultString.Append(' ');
                    }
                }
            }
 
            // End of a text object, also go to a new line.
            if (bracketDepth == 0 && !inHexString &&
                CheckToken(new string[] { "ET" }, previousCharacters))
            {
 
                inTextObject = false;
                if (resultString.Length > 0 && resultString[resultString.Length - 1] != ' ')
                    resultString.Append(' ');
            }
            else
            {
                // Start outputting text
                if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
                {
                    bracketDepth = 1;
                }
                else
                {
                    // Stop outputting text
                    if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
                    {
                        bracketDepth = 0;
                    }
                    else
                    {
                        if (c == '<' && !inHexString && bracketDepth == 0)
                        { // Start of hex string
                            inHexString = true;
                            hexDigits = 0;
                        }
                        else
                        {
                            if ((c == '<' || c == '>') && inHexString) // end of hex string e.g. <1234>
                                // or this is not hex string but dictionary e.g. <<dict>>
                            {
                                if (hexDigits > 0) // convert last char
                                    resultString.Append(GetCharFromHex(previousCharacters, hexDigits));
                                inHexString = false;
                            }
                            else
                                // Just a normal text character:
                                if (bracketDepth == 1)
                                {
                                    if (c == '\\' && !nextLiteral)
                                    { // start of escaped character
                                        nextLiteral = true;
                                    }
                                    else
                                    {
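                                        if (nextLiteral && c >= '0' && c <= '7')
                                        { // character is an octal escape of 1 to 3 digits (\d, \dd or \ddd)
                                            int octalValue = c - '0';
                                            for (int n = 1; n < 3 && i + 1 < input.Length && input[i + 1] >= '0' && input[i + 1] <= '7'; n++)
                                                octalValue = octalValue * 8 + (input[++i] - '0');
                                            c = (char)(octalValue & 0xFF); // high-order overflow is ignored
                                        }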
                                        
                                        if (((c >= ' ') && (c <= '~')) ||
                                            ((c >= 128) && (c < 255)))
                                        {
                                            resultString.Append(c);
                                        }
 
                                        nextLiteral = false;
                                    }
                                }
                                else
                                    if (inHexString)
                                    {
                                        if (hexDigits == 4)
                                        { // ready to extract next unicode character (4 hex digits)
                                            resultString.Append(GetCharFromHex(previousCharacters, hexDigits));
                                            hexDigits = 0;
                                        }
                                        hexDigits++; // new hex digit c is not added to previousCharacters yet, but it will be added below,
                                            // so I increment hexDigits after the check for 4.
                                    }
                        }
                    }
                }
            }
        }
 
        // Store the recent characters for 
        // when we have to go back for a checking
        for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
        {
            previousCharacters[j] = previousCharacters[j + 1];
        }
        previousCharacters[_numberOfCharsToKeep - 1] = c;
 
        // Start of a text object
        if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
        {
            inTextObject = true;
        }
    }
    return resultString.ToString();
}
#endregion
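
As a standalone illustration of the hex-string rule described in this comment (4 hex digits per character, a short final group padded with zeros on the right), here is a small sketch. HexStringSketch and Decode are hypothetical names and not part of the posted code. The sketch assumes, as the comment does, that each group of 4 digits is one 16-bit character, and it does not skip whitespace inside the brackets.

```csharp
using System;
using System.Text;

public static class HexStringSketch
{
    // Decodes the digits between '<' and '>' four at a time,
    // treating every group as one 16-bit character code.
    public static string Decode(string hexString)
    {
        string digits = hexString.Trim('<', '>');
        StringBuilder result = new StringBuilder(digits.Length / 4 + 1);
        for (int i = 0; i < digits.Length; i += 4)
        {
            string group = digits.Substring(i, Math.Min(4, digits.Length - i));
            group = group.PadRight(4, '0'); // short final group: pad with zeros on the right
            result.Append((char)Convert.ToUInt16(group, 16));
        }
        return result.ToString();
    }
}
```

With the example from the comment, Decode("<0073007400720069006e0067>") returns "string".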

modified on Wednesday, September 29, 2010 12:47 PM

Last Updated 20 May 2006
Article Copyright 2006 by Zollor