How to extract formatted text from PDF in C#

Question

Hello Experts,
I am developing a web-based application through which users upload a PDF document. I need to extract several details from that PDF and, after analysing the data, show the result on a web page. I have googled a lot and found several articles that helped me extract text using iTextSharp, PDFBox and others, as well as many similar questions on CodeProject and Stack Overflow.
I managed to get the text page by page, but it was not formatted, so I could not perform any operations on the extracted data. Is there any way to extract the text line by line and column by column?

Thank you

Posted 15-Oct-13 1:04am

Kalpesh Bhadra

Updated 30-Oct-17 0:19am

Comments

Sergey Alexandrovich Kryukov 15-Oct-13 14:48pm

http://www.whathaveyoutried.com so far?
—SA

David_Wimbley 16-Oct-13 23:33pm

What operations are you trying to perform that require the text to be formatted? Also, if you need formatted text, you're better off turning your PDFs into thumbnail images rather than trying to grab the text.

Kalpesh Bhadra 17-Oct-13 1:11am

Hi David,
The client will upload their "Form16", a document issued by a company to its employees that contains the employee's personal information, TDS, HRA, allowances and other details needed to file their income tax return. All these details are laid out in tabular form. I need to retrieve them programmatically, store them in the database, and perform some mathematical operations on them.

David_Wimbley 17-Oct-13 14:30pm

Can you not do something regex wise?

Say they have a form that looks like this inside the PDF:

First Name: David
Last Name: Wimbley
Address: 100 Main Street
City: Your Town
State: FL

But when you extract the text from the PDF, it looks like this:

First Name: David Last Name: Wimbley Address: 100 Main Street City: Your Town State: FL

Could you not use a regex to grab all the text between "First Name:" and "Last Name:" in order to get the applicant's first name?

Then repeat the same for every other field you need out of the PDF.

I've messed with PDFs a bit, and I think the only way to truly keep the formatting is to dig down into the lower-level PDF formatting, which could be painful (but I could well be wrong). If you instead build a parser that uses regex to pull out the information you need, it could be less of a headache.
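The regex idea above can be sketched roughly as follows. This is only a minimal illustration: the class name FormFieldParser and the label list are assumptions taken from the sample form in this comment, and it assumes no field value itself contains one of the labels followed by a colon.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FormFieldParser
{
    // Labels copied from the sample form above; a real form needs its own list.
    private static readonly string[] Labels =
        { "First Name", "Last Name", "Address", "City", "State" };

    public static Dictionary<string, string> Parse(string flatText)
    {
        // Alternation of all escaped labels, used as the "next field" boundary.
        string anyLabel = string.Join("|", Array.ConvertAll(Labels, Regex.Escape));
        var fields = new Dictionary<string, string>();

        foreach (string label in Labels)
        {
            // Capture lazily after "Label:" up to the next known label or end of input.
            string pattern = Regex.Escape(label)
                + @"\s*:\s*(.*?)\s*(?=(?:" + anyLabel + @")\s*:|$)";
            Match m = Regex.Match(flatText, pattern, RegexOptions.Singleline);
            if (m.Success)
            {
                fields[label] = m.Groups[1].Value;
            }
        }
        return fields;
    }
}
```

Calling FormFieldParser.Parse on the flattened sample line should give "David" for First Name, "Wimbley" for Last Name, "100 Main Street" for Address, "Your Town" for City and "FL" for State. Fields that are missing from the text are simply left out of the dictionary, so the caller can detect them.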

Kalpesh Bhadra 18-Oct-13 1:16am

Thank you David,
I got it. I can do this with regex; it is painful, but I will try to deal with it somehow. Again, thank you so much. :)

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Member 11346579 · Answer 1 · 2017-10-30T00:19:00

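// Requires the iTextSharp 5.x assembly; place the method inside a class.
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public string ReadPdfFile(string path)
        {
            // StringBuilder avoids repeated string concatenation across pages.
            StringBuilder text = new StringBuilder();

            PdfReader pdfReader = new PdfReader(path);
            try
            {
                // iTextSharp page numbers start at 1.
                for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                {
                    // SimpleTextExtractionStrategy returns text in content-stream order;
                    // LocationTextExtractionStrategy orders it by position on the page instead.
                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    // AppendLine keeps the text of consecutive pages from running together.
                    text.AppendLine(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
                }
            }
            finally
            {
                // Release the file even if extraction throws.
                pdfReader.Close();
            }

            // The caller can assign the result to a UI control such as txtInput.Text.
            return text.ToString();
        }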
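
Once the text has been extracted, it can be processed line by line, as the question asks. The following is a rough sketch rather than a complete solution: the class and method names are hypothetical, and it assumes the extracted text keeps one table row per line with the amount at the end of the row, which depends on how the PDF was produced.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class ExtractedTextLines
{
    // Splits extracted text into trimmed, non-empty lines.
    public static List<string> SplitLines(string extractedText)
    {
        var lines = new List<string>();
        string[] rawLines = extractedText.Split(
            new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string raw in rawLines)
        {
            string line = raw.Trim();
            if (line.Length > 0)
            {
                lines.Add(line);
            }
        }
        return lines;
    }

    // Tries to read a trailing amount such as "1,20,000.00" or "45000" from a line.
    public static bool TryGetTrailingAmount(string line, out decimal amount)
    {
        amount = 0m;
        Match m = Regex.Match(line, @"(\d[\d,]*(?:\.\d+)?)\s*$");
        // Strip grouping commas before parsing with the invariant culture.
        return m.Success && decimal.TryParse(
            m.Groups[1].Value.Replace(",", ""),
            NumberStyles.AllowDecimalPoint,
            CultureInfo.InvariantCulture,
            out amount);
    }
}
```

For example, a made-up row such as "Gross Salary 1,20,000.00" would yield the amount 120000.00, while a row with no trailing number returns false so it can be skipped or logged.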