Convert PDF file content into string using C#

By rk_prabakar, 18 May 2012
 

Introduction

Hello friends, this is my first article on CodeProject.com. It shows how to read content from a PDF file and convert it into a string using C#.

Background

This was assigned to me as a task. I Googled it and finally solved it with some simple code. I'm sure this code will be very helpful for beginners.

Using the code

The following steps will guide you to read content from a PDF file:

  1. To start, you need to download itextsharp-all-5.2.1, which can be downloaded from here.
  2. Extract the whole archive (including the archives nested inside the itextsharp-all-5.2.1 folder) to a local directory.
  3. You have successfully completed the initial step in the process. Hurrah!

    Now open Microsoft Visual Studio. I used Microsoft Visual C# 2010 Express.

  4. Create a new project: New Project --> Windows Forms Application --> give the project a name (I named mine PDF_To_Text).
  5. Add itextsharp.dll (from the extracted itextsharp-all-5.2.1 download) as a reference.
  6. To do so, select the Project menu --> Add Reference --> Browse tab --> select itextsharp.dll from the local directory.

  7. Place a RichTextBox control on the form. It gets the default name richTextBox1, which the code below relies on.
  8. Now paste the following code in Form1.cs:

    using System;
    using System.Collections.Generic;
    using System.ComponentModel;
    using System.Data;
    using System.Drawing;
    using System.Linq;
    using System.Text;
    using System.Windows.Forms;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    namespace WindowsFormsApplication1
    {
        public partial class Form1 : Form
        {
            public Form1()
            {
                InitializeComponent();
                // The verbatim string (@"...") keeps the backslash in the path as a literal character.
                ExtractTextFromPDFPage(@"c:\sample.pdf", 1);
            }

            // Reads the text of one page (page numbers start at 1) and shows it in richTextBox1.
            public void ExtractTextFromPDFPage(string pdfFile, int pageNumber)
            {
                PdfReader reader = new PdfReader(pdfFile);
                try
                {
                    richTextBox1.Text = PdfTextExtractor.GetTextFromPage(reader, pageNumber);
                }
                finally
                {
                    // Release the file even if the extraction fails.
                    reader.Close();
                }
            }
        }
    }

    Look how simple it is!

  9. Now build the solution with Ctrl+Shift+B, or by choosing Build Solution from the Build menu.
  10. Once the build succeeds, run the application by pressing F5.
  11. The content of the requested page is converted into text and displayed in the RichTextBox control.

That's it, you have successfully converted a PDF file into text.

Note

Here c:\sample.pdf is where I kept my PDF file, so update the path to point to your own file. The second parameter is the number of the page to convert; page numbers start at 1.
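The method above converts a single page. As a minimal sketch of how the whole document could be turned into one string, the loop below walks every page using the reader's NumberOfPages property. The class name PdfTextHelper and the method name ExtractAllText are made up for illustration and are not part of iTextSharp or of the original code.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class; the name is an assumption for this sketch.
public static class PdfTextHelper
{
    // Returns the text of every page in the PDF as a single string.
    public static string ExtractAllText(string pdfFile)
    {
        PdfReader reader = new PdfReader(pdfFile);
        try
        {
            StringBuilder text = new StringBuilder();
            // Page numbers are 1-based, so the last page is NumberOfPages itself.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }
            return text.ToString();
        }
        finally
        {
            // Release the file even if a page fails to parse.
            reader.Close();
        }
    }
}
```

It could be called from the form as richTextBox1.Text = PdfTextHelper.ExtractAllText(@"c:\sample.pdf");. Opening the reader once and reusing it for all pages avoids re-reading the file for each page.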

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

rk_prabakar
Software Developer (Junior) Scintel
India
Member
There are only 10 types of people in this programming world:
those who know binary and those who don't.


Comments and Discussions

 
Question: more pages (saurabh49parikh, member, 27 Aug '12 - 8:03)
Thank you.
You have done a great job.
But how can I read more than one page with this code?
Answer: Re: more pages (rk_prabakar, member, 15 Oct '12 - 18:46)
Sorry for the late response.
Try the following code:
public Form1()
{
    InitializeComponent();
    // Get the page count once, then call the function for every page (page numbers start at 1).
    PdfReader reader = new PdfReader(@"c:\sample.pdf");
    int count = reader.NumberOfPages;
    reader.Close();
    for (int i = 1; i <= count; i++)
        ExtractTextFromPDFPage(@"c:\sample.pdf", i);
}
And then just append the content to the RichTextBox control:
public void ExtractTextFromPDFPage(string pdfFile, int pageNumber)
{
    PdfReader reader = new PdfReader(pdfFile);
    string text = PdfTextExtractor.GetTextFromPage(reader, pageNumber);
    try { reader.Close(); }
    catch
    {
        // Exception handler here
    }
    // Append the content that was read to the RichTextBox control, or any other control that you want
    richTextBox1.Text += text;
}
I hope this works; I haven't tried it on my machine. I'm just giving some idea about it. :rolleyes: Simple, isn't it? ;-P
Thanks and Regards,
 
RK_PRABAKAR

Question: no text in rich text box (mlan sopno, member, 9 Jun '12 - 0:04)
I did exactly as described above, but I don't get any output. Help me.
Answer: Re: no text in rich text box (rk_prabakar, member, 28 Aug '12 - 23:01)
You must be missing something at some point. Please post your code.
Thanks and Regards,
 
RK_PRABAKAR

Question: Helpful post (Member 2872978, member, 29 May '12 - 6:50)
This article is helpful. The itextsharp library is very nice and extremely helpful for the difficult task of extracting text out of a pdf file.
Answer: Re: Helpful post (rk_prabakar, member, 29 May '12 - 18:57)
Thanks for your comments.
Thanks and Regards,
 
RK_PRABAKAR

General: My vote of 1 (stooboo, member, 19 May '12 - 3:33)
I agree with Tom, I don't think this is an article.
 
Suggestion: please look into using NuGet (iTextSharp is just one of the packages on there); it will make your life a lot easier in the future.
General: Re: My vote of 1 (stooboo, member, 19 May '12 - 3:36)
Sorry, I just saw http://stackoverflow.com/questions/4566908/how-can-i-use-nuget-with-visual-c-sharp-express and it appears NuGet isn't available within 'Express'. You might still be able to use it from the command line, though.
 
(If you're working as a developer it's likely that you'll be using at least 'VS 2010 Professional' though which does have nuget)
Question: [My vote of 1] Not really an article (Tom Clement, mentor, 18 May '12 - 6:58)
Hi rk,
 
First of all, I really like your impulse to share what you've learned with a broader audience. It's the spirit of Code Project and what makes it such a useful site for collaboration.
 
That said, this article doesn't really provide much help for people. The most fundamental thing you need to do in an article is help people grow and learn. What this article does, though, is just lay out some rote steps for using a tool written by someone else. Maybe I'm missing something, but there doesn't seem to be anything to learn here. Even if fleshed out with more information, it seems more like a tip than an article.
 
I don't even like this as a utility program for accomplishing the task. To improve that, you'd want to have a button or menu item that brings up an OpenFile dialog with a filter set to *.pdf and use that to identify the file to convert (rather than hard wire it into the program). You'd want to offer the ability to write the converted text back out to a file. As an article, you'd want to at least talk about why you used a rich text box (there's no indication in the article text that the output of this utility is RTF).
 
I wish I could give you something more encouraging, but unfortunately my impression is that even fixed up, there wouldn't be enough here for an article. So I'd encourage you to continue programming, and when you get an insight and do a program or utility that achieves something a bit more significant, interesting and novel, go through this exercise again --- write an article and share it.
Tom Clement
Serena Software, Inc.
www.serena.com
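
As a rough illustration of the suggestion above (an OpenFile dialog filtered to *.pdf instead of a hard-wired path, plus writing the converted text back out to a file), a button handler might look like the sketch below. The button name openButton, the handler name, and the choice to save a .txt file next to the PDF are assumptions made for illustration; none of them come from the article.

```csharp
using System;
using System.IO;
using System.Windows.Forms;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace WindowsFormsApplication1
{
    public partial class Form1 : Form
    {
        // Hypothetical Click handler for a button named openButton added in the designer.
        private void openButton_Click(object sender, EventArgs e)
        {
            using (OpenFileDialog dialog = new OpenFileDialog())
            {
                // Show only PDF files in the dialog.
                dialog.Filter = "PDF files (*.pdf)|*.pdf";
                if (dialog.ShowDialog(this) != DialogResult.OK)
                {
                    return; // The user cancelled the dialog.
                }

                PdfReader reader = new PdfReader(dialog.FileName);
                try
                {
                    // Convert the first page, as in the article's code.
                    richTextBox1.Text = PdfTextExtractor.GetTextFromPage(reader, 1);
                }
                finally
                {
                    reader.Close();
                }

                // Assumption: write the text next to the PDF, with a .txt extension.
                string textFile = Path.ChangeExtension(dialog.FileName, ".txt");
                File.WriteAllText(textFile, richTextBox1.Text);
            }
        }
    }
}
```

This fragment belongs in Form1.cs alongside the existing constructor, with the handler wired to the button's Click event. A SaveFileDialog could replace the fixed .txt path if the user should choose the output location.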
 


Last Updated 18 May 2012
Article Copyright 2012 by rk_prabakar
Everything else Copyright © CodeProject, 1999-2013