Converting PDF to Text in C#

22 Oct 2012 | CPOL | 3 min read
This is an alternative for "Converting PDF to Text in C#"

Introduction 

This article demonstrates how to use the iTextSharp .NET library to convert a PDF file to text.

Background 

It seems like I was always searching for a better way to convert a PDF file to text (so I could edit it, parse it with regex, etc.). And we are not talking about a couple of pages of PDF here: I was receiving daily reports in PDF format that were 200-300 pages long.

I started with a Python library that I found to do the PDF-to-text conversion. This seemed like a good choice because I was planning on using Python to parse the PDF anyway. Unfortunately, converting a single 200+ page PDF with this method was taking on the order of several minutes (on a pretty fast machine). Unacceptable.

Code Project to the rescue! My next solution was the original article regarding PDF-to-text that used PDFBox. By using this method, my PDF conversion went down from a couple minutes to about 10 seconds (again for a 200+ page PDF). All good, right?

Well...it was a great improvement to be sure. But something about knowing that the code was piggybacking on the Java VM and that it was, therefore, slower than the Java version rubbed me the wrong way. So between that and the fact that I am a huge nerd when I get a free weekend, I decided to revisit the potential solutions listed in the original article. I was able to convert the Java source code that uses the iText library and utilize the iTextSharp version of this same library. The end result is that I can now convert a 250 page PDF file to text in less than a second. 

Using the code 

This is the full C# code for my project. As I mentioned in the Background section, I just converted it from Java, so you may see a little design weirdness. Nevertheless, the code is quite short. And the resulting app is fast!

This code references a few of the iTextSharp DLLs. I have included them in the project download files, but you can also find them on SourceForge (for future updates, etc.).

C#
using System;
using System.IO; 
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class ParsingPDF {
 
    static string PDF;
    static string TEXT2;
 
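    /**
     * Parses the PDF using PRTokeniser and writes every string token to a text file
     * @param src  the path to the original PDF file
     * @param dest the path to the resulting text file
     */
    public void parsePdf(String src, String dest)
    {
        PdfReader reader = new PdfReader(src);
        try
        {
            // 'using' flushes and closes the writer even if parsing throws
            using (StreamWriter output = new StreamWriter(new FileStream(dest, FileMode.Create)))
            {
                int pageCount = reader.NumberOfPages;
                for (int pg = 1; pg <= pageCount; pg++) // PDF pages are 1-based
                {
                    // read the raw content stream of the page and tokenize it
                    byte[] streamBytes = reader.GetPageContent(pg);
                    // This byte[] constructor matches the iTextSharp DLLs bundled with the article;
                    // later iTextSharp releases may expect a RandomAccessFileOrArray here instead.
                    PRTokeniser tokenizer = new PRTokeniser(streamBytes);
                    while (tokenizer.NextToken())
                    {
                        // keep only string tokens, i.e. the operands of the text-showing operators
                        if (tokenizer.TokenType == PRTokeniser.TokType.STRING)
                        {
                            output.WriteLine(tokenizer.StringValue);
                        }
                    }
                }
            }
        }
        finally
        {
            reader.Close(); // release the source PDF
        }
    }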
 
    /**
     * Main method.
     */
    static void Main(string[] args)
    {
        if (args.Length < 1 || args.Length > 2)
        {
            // the output file is optional, so it is shown in square brackets
            Console.WriteLine("USAGE: ParsePDF infile.pdf [outfile.txt]");
            return;
        }
        else if (args.Length == 1)
        {
            PDF = args[0];
            TEXT2 = Path.GetFileNameWithoutExtension(PDF) + ".txt";
        }
        else
        {
            PDF = args[0];
            TEXT2 = args[1];
        }

        try
        {
            // Stopwatch gives finer resolution than DateTime.Now for sub-second timings
            System.Diagnostics.Stopwatch timer = System.Diagnostics.Stopwatch.StartNew();

            ParsingPDF example = new ParsingPDF();
            example.parsePdf(PDF, TEXT2);

            timer.Stop();
            Console.WriteLine("Parsing completed in {0:0.00} seconds.", timer.Elapsed.TotalSeconds);
        }
        catch (Exception ex)
        {
            Console.WriteLine("ERROR: " + ex.Message);
        }
    } // Main

    // Nested helper class; parsePdf above does not use it.
    public class MyTextRenderListener : IRenderListener
    {
        /** The print writer to which the information will be written. */
        protected StreamWriter output;

        /**
         * Creates a RenderListener that will look for text.
         */
        public MyTextRenderListener(StreamWriter output)
        {
            this.output = output;
        }

        public void BeginTextBlock()
        {
            output.Write("<");
        }

        public void EndTextBlock()
        {
            output.WriteLine(">");
        }

        public void RenderImage(ImageRenderInfo renderInfo)
        {
        }

        public void RenderText(TextRenderInfo renderInfo)
        {
            output.Write("<");
            output.Write(renderInfo.GetText());
            output.Write(">");
        }
    } // class MyTextRenderListener
} // class ParsingPDF
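
To make the token loop in parsePdf less of a black box, here is a minimal sketch of the underlying idea: a page's content stream is a sequence of operators and operands, and the visible text sits in string operands such as `(Daily report)`. The class `ContentStreamSketch`, the method `ExtractLiteralStrings`, and the sample stream are all hypothetical and written only for illustration. This is not how PRTokeniser is implemented, and it deliberately ignores hex strings, octal escapes, and font encodings.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ContentStreamSketch
{
    // Returns the literal strings "(...)" found in a PDF content stream.
    // Simplified on purpose: hex strings <...>, octal escapes and font
    // encodings are not handled.
    public static List<string> ExtractLiteralStrings(string content)
    {
        var result = new List<string>();
        int i = 0;
        while (i < content.Length)
        {
            if (content[i] != '(') { i++; continue; }

            var sb = new StringBuilder();
            int depth = 1; // balanced parentheses may nest inside a string
            i++;           // skip the opening '('
            while (i < content.Length && depth > 0)
            {
                char c = content[i++];
                if (c == '\\' && i < content.Length)
                {
                    sb.Append(content[i++]); // keep the escaped character as-is
                }
                else if (c == '(') { depth++; sb.Append(c); }
                else if (c == ')') { depth--; if (depth > 0) sb.Append(c); }
                else { sb.Append(c); }
            }
            result.Add(sb.ToString());
        }
        return result;
    }

    public static void Main()
    {
        // Hand-written fragment: Tj shows one string, TJ shows an array of strings.
        string page = "BT /F1 12 Tf 72 712 Td (Daily report) Tj [(Pa) -20 (ge 1)] TJ ET";
        foreach (string s in ExtractLiteralStrings(page))
            Console.WriteLine(s);
    }
}
```

Note that the `TJ` array in the sample yields two separate strings, `Pa` and `ge 1`. Writing one token per line, as parsePdf does, can therefore split words and lines in the output wherever the PDF producer chose to break the text into pieces.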

Points of Interest 

It was interesting to see some Java code again. I haven't done anything serious in Java for over five years, but what struck me was how close the Java code is to the C# code. This made the conversion relatively easy.

I initially planned to incorporate the Task Parallel Library to try to speed up the results, but that was before I realized that the non-parallel version was performing in under half a second. Just for fun, I may look into the TPL anyway. It would be a good learning exercise, and it would be interesting to see how TPL performs. In the case of a many-page PDF document, I'm sure TPL is not going to launch hundreds of threads, for example. But how many will it launch?

So I'll keep this article updated if I pursue the TPL version. Also, I want to implement this in my new favorite baby: F#. 
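
As a rough starting point for that experiment, the sketch below runs a per-page loop through `Parallel.For` and counts how many distinct threads took part. It is not the article's code: `ParallelPagesSketch`, `ExtractAll`, and `ExtractPage` are hypothetical names, and `ExtractPage` is a stand-in that only simulates work. Whether a single PdfReader can safely be shared between threads is an open question that this sketch does not address.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelPagesSketch
{
    // Hypothetical stand-in for per-page text extraction; replace with real parsing.
    static string ExtractPage(int pageNumber)
    {
        Thread.SpinWait(200000); // simulate some CPU-bound work
        return "text of page " + pageNumber;
    }

    // Processes pages 1..pageCount in parallel, returns the texts in page order,
    // and reports how many distinct threads ran at least one iteration.
    public static string[] ExtractAll(int pageCount, int maxParallel, out int threadsUsed)
    {
        var pages = new string[pageCount];
        var threadIds = new ConcurrentDictionary<int, bool>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        // Parallel.For's upper bound is exclusive, and PDF pages are 1-based.
        Parallel.For(1, pageCount + 1, options, pg =>
        {
            threadIds.TryAdd(Thread.CurrentThread.ManagedThreadId, true);
            pages[pg - 1] = ExtractPage(pg); // each iteration writes only its own slot
        });

        threadsUsed = threadIds.Count;
        return pages;
    }

    public static void Main()
    {
        int used;
        string[] pages = ExtractAll(250, Environment.ProcessorCount, out used);
        Console.WriteLine("{0} pages processed on {1} thread(s)", pages.Length, used);
    }
}
```

Storing each result in an array slot indexed by page number keeps the output in page order without locking. `MaxDegreeOfParallelism` caps the number of concurrent iterations; the number of distinct threads observed depends on the thread pool and the machine, so the printed count will vary from run to run.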

History 

Original version posted Oct 22, 2012.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer HellaC0der
United States
I specialize in creating software for the trading industry. I have worked for over 10 years developing software for trading options, commodity futures and equities. I have also been a trader on the floor of the CBOE and CBOT in Chicago, and I try to blend my trading experience and software development knowledge to create trading software that traders can understand and relate to.

I work in both California and the Chicago area. If you have any questions or software projects related to trading, feel free to contact me.

Education:
BS Computer Science, Purdue University
MBA with Finance, Loyola University Chicago
