Converting PDF to Text in C#

hatman70, 22 Oct 2012 CPOL
This is an alternative for "Converting PDF to Text in C#"

Introduction 

This article demonstrates how to use the iTextSharp .NET library to convert a PDF file to text.

Background 

It seems like I was always searching for a better way to convert a PDF file to text (so I could edit it, parse it with regex, etc.). And we are not talking about a couple of pages of PDF here: I was receiving daily reports in PDF format that were 200-300 pages in length.

I started with a Python library that I found to do the PDF-to-text conversion. This seemed like a good choice because I was planning on using Python to parse the PDF anyway. Unfortunately, converting a single 200+ page PDF with this method was taking on the order of several minutes (on a pretty fast machine). Unacceptable.

Code Project to the rescue! My next solution was the original article regarding PDF-to-text that used PDFBox. By using this method, my PDF conversion went down from a couple minutes to about 10 seconds (again for a 200+ page PDF). All good, right?

Well...it was a great improvement to be sure. But something about knowing that the code was piggybacking on the Java VM and that it was, therefore, slower than the Java version rubbed me the wrong way. So between that and the fact that I am a huge nerd when I get a free weekend, I decided to revisit the potential solutions listed in the original article. I was able to convert the Java source code that uses the iText library and utilize the iTextSharp version of this same library. The end result is that I can now convert a 250 page PDF file to text in less than a second. 

Using the code 

This is the full C# code for my project. As I mentioned in the Background section, I converted it from Java, so you may see a little design weirdness. Nevertheless, the code is quite short. And the resulting app is fast!

This code references a few of the iTextSharp DLLs. I have included them in the project download files, but you can also find them on SourceForge (for future updates, etc.).

using System;
using System.IO; 
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class ParsingPDF {
 
    static string PDF;
    static string TEXT2;
 
    /**
     * Parses the PDF using PRTokeniser
     * @param src  the path to the original PDF file
     * @param dest the path to the resulting text file
     */
    public void parsePdf(String src, String dest)
    {
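        // The reader is closed in the finally block, and the using statement
        // flushes and disposes the writer even if parsing throws.
        PdfReader reader = new PdfReader(src);
        try
        {
            using (StreamWriter output = new StreamWriter(new FileStream(dest, FileMode.Create)))
            {
                int pageCount = reader.NumberOfPages;
                for (int pg = 1; pg <= pageCount; pg++)
                {
                    // Read the raw content stream of the page and tokenize it.
                    // Note: some later iTextSharp releases reportedly changed this
                    // constructor so that it no longer accepts a byte[]; adjust it
                    // if this line does not compile.
                    byte[] streamBytes = reader.GetPageContent(pg);
                    PRTokeniser tokenizer = new PRTokeniser(streamBytes);
                    while (tokenizer.NextToken())
                    {
                        // Only string operands carry text; skip operators and numbers.
                        if (tokenizer.TokenType == PRTokeniser.TokType.STRING)
                        {
                            output.WriteLine(tokenizer.StringValue);
                        }
                    }
                }
            }
        }
        finally
        {
            reader.Close();
        }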
    }
 
    /**
     * Main method.
     */
    static void Main(string[] args)
    {
        if (args.Length < 1 || args.Length > 2)
        {
            Console.WriteLine("USAGE: ParsePDF infile.pdf [outfile.txt]");
            return;
        }
        else if (args.Length == 1)
        {
            PDF = args[0];
            TEXT2 = Path.GetFileNameWithoutExtension(PDF) + ".txt";
        }
        else
        {
            PDF = args[0];
            TEXT2 = args[1];
        }

        try
        {
            DateTime t1 = DateTime.Now;

            ParsingPDF example = new ParsingPDF();
            example.parsePdf(PDF, TEXT2);

            DateTime t2 = DateTime.Now;
            TimeSpan ts = t2 - t1;
            Console.WriteLine("Parsing completed in {0:0.00} seconds.", ts.TotalSeconds);
        }
        catch (Exception ex)
        {
            Console.WriteLine("ERROR: " + ex.Message);
        }
    } // Main

    public class MyTextRenderListener : IRenderListener
    {
        /** The print writer to which the information will be written. */
        protected StreamWriter output;

        /**
         * Creates a RenderListener that will look for text.
         */
        public MyTextRenderListener(StreamWriter output)
        {
            this.output = output;
        }

        public void BeginTextBlock()
        {
            output.Write("<");
        }

        public void EndTextBlock()
        {
            output.WriteLine(">");
        }

        public void RenderImage(ImageRenderInfo renderInfo)
        {
        }

        public void RenderText(TextRenderInfo renderInfo)
        {
            output.Write("<");
            output.Write(renderInfo.GetText());
            output.Write(">");
        }
    } // class MyTextRenderListener
} // class ParsingPDF
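
Note that parsePdf above never calls the MyTextRenderListener class, and that the tokenizer writes raw string operands, so font encodings and line layout may not survive in the output. The following is a minimal sketch, assuming an iTextSharp 5.x release that ships PdfReaderContentParser and PdfTextExtractor in iTextSharp.text.pdf.parser. It shows one way the listener could be driven, plus the library's own text extraction strategy. The class name ExtractionSketch and its method names are hypothetical and are not part of the project download.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class for illustration only.
public static class ExtractionSketch
{
    // Feeds every page of the PDF through an IRenderListener,
    // for example an instance of ParsingPDF.MyTextRenderListener.
    public static void DumpWithListener(string src, IRenderListener listener)
    {
        PdfReader reader = new PdfReader(src);
        try
        {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            for (int pg = 1; pg <= reader.NumberOfPages; pg++)
            {
                parser.ProcessContent(pg, listener);
            }
        }
        finally
        {
            reader.Close();
        }
    }

    // Returns the text of all pages using the library's simple
    // extraction strategy instead of the raw tokenizer.
    public static string ExtractText(string src)
    {
        PdfReader reader = new PdfReader(src);
        try
        {
            StringBuilder sb = new StringBuilder();
            for (int pg = 1; pg <= reader.NumberOfPages; pg++)
            {
                sb.AppendLine(PdfTextExtractor.GetTextFromPage(
                    reader, pg, new SimpleTextExtractionStrategy()));
            }
            return sb.ToString();
        }
        finally
        {
            reader.Close();
        }
    }
}
```

A caller could, for instance, open a StreamWriter on the output file in a using block and pass new ParsingPDF.MyTextRenderListener(writer) to DumpWithListener. Whether either route is as fast as the tokenizer loop on a 250 page report has not been measured here.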

Points of Interest 

It was interesting to see some Java code again. I haven't done anything serious in Java for over five years, but what struck me was how close the Java code is to the C# code. This made the conversion relatively easy.

I initially planned to incorporate the Task Parallel Library (TPL) to try to speed up the results, but that was before I realized that the non-parallel version was performing in under half a second. Just for fun, I may look into the TPL anyway. It would be a good learning exercise, and it would be interesting to see how the TPL performs. In the case of a many-page PDF document, I'm sure the TPL is not going to launch hundreds of threads, for example. But how many will it launch?
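
One way to explore the thread-count question is to record which managed threads a Parallel.For loop actually uses. The sketch below does not touch iTextSharp at all: Thread.SpinWait stands in for parsing a page, and the class name TplThreadProbe and the method CountThreads are hypothetical. Whether a PdfReader can be safely shared across threads is a separate question that this sketch does not address.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical probe for illustration only.
public static class TplThreadProbe
{
    // Runs one simulated work item per page and returns how many distinct
    // managed threads executed them. A maxParallelism of 0 or less leaves
    // the decision to the TPL's default scheduler.
    public static int CountThreads(int pageCount, int maxParallelism)
    {
        var threadIds = new ConcurrentDictionary<int, byte>();
        var options = new ParallelOptions();
        if (maxParallelism > 0)
        {
            // Caps how many iterations run concurrently.
            options.MaxDegreeOfParallelism = maxParallelism;
        }

        // Pages are 1-based in iTextSharp, so iterate 1..pageCount.
        Parallel.For(1, pageCount + 1, options, pg =>
        {
            threadIds.TryAdd(Thread.CurrentThread.ManagedThreadId, 0);
            Thread.SpinWait(100000); // stand-in for parsing page pg
        });

        return threadIds.Count;
    }

    public static void Main()
    {
        Console.WriteLine("Default scheduler: {0} thread(s)", CountThreads(250, 0));
        Console.WriteLine("Capped at 2:       {0} thread(s)", CountThreads(250, 2));
    }
}
```

The numbers depend on the machine and on the thread pool's state, so no expected output is given. MaxDegreeOfParallelism limits concurrency rather than the total number of distinct threads seen over the life of the loop, so the capped run may still report more than two thread IDs.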

So I'll keep this article updated if I pursue the TPL version. Also, I want to implement this in my new favorite baby: F#. 

History 

Original version posted Oct 22, 2012.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

hatman70
Software Developer HellaC0der
United States
I specialize in creating software for the trading industry. I have worked for over 10 years developing software for trading options, commodity futures and equities. I have also been a trader on the floor of the CBOE and CBOT in Chicago, and I try to blend my trading experience and software development knowledge to create trading software that traders can understand and relate to.
 
I work in both California and the Chicago area. If you have any questions or software projects related to trading, feel free to contact me.
 
Education:
BS Computer Science, Purdue University
MBA with Finance, Loyola University Chicago

Last Updated 22 Oct 2012
Article Copyright 2012 by hatman70