Hello Experts,
I am developing a web-based application through which a user will upload a PDF document. I need to extract several details from that PDF and, after analysing the data, show the result on a web page. I have googled a lot and found several articles that helped me extract text using iTextSharp, PDFBox and others, as well as many similar questions asked on CodeProject and Stack Overflow.
I managed to get the text page by page, but it was not formatted, so I could not perform operations on the data extracted from the PDF. Is there any way to extract the text line by line or column by column?

Thank you
Updated 30-Oct-17 0:19am
Comments
Sergey Alexandrovich Kryukov 15-Oct-13 14:48pm    
David_Wimbley 16-Oct-13 23:33pm    
What operations are you trying to perform that require your text to be formatted? Also, if you need formatted text, you're better off turning your PDFs into thumbnail images rather than trying to grab the text.
Kalpesh Bhadra 17-Oct-13 1:11am    
Hi David,
The client will upload their "Form16", a document issued by a company to its employees, which contains the employee's personal info, TDS, HRA, allowances and other details needed to file their income tax return. All these details are laid out in tabular form. I need to retrieve them programmatically, store them in the database, and perform some mathematical operations on them.
David_Wimbley 17-Oct-13 14:30pm    
Can you not do something regex-wise?

Say they have a form that looks like this inside the PDF:

First Name: David
Last Name: Wimbley
Address: 100 Main Street
City: Your Town
State: FL

But when you go to extract the text from the PDF, it looks like this:

First Name: David Last Name: Wimbley Address: 100 Main Street City: Your Town State: FL

Could you not use a regex to capture all the text between "First Name:" and "Last Name:" in order to get the applicant's first name?

Then repeat the same to get all the fields you need out of the PDF.

I've messed with PDFs a bit, and I think the only way to truly keep the formatting is to dig down into the lower-level PDF format, which could be painful (though I could well be wrong). If you instead build a parser that uses regex to pull out the information you need, it could be less of a headache.
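
A minimal sketch of the regex idea above, assuming the extracted text really does come out as one flat run of "Label: value" pairs and that the full list of labels is known in advance. The class name FormFieldParser, the method ExtractField and the label list are hypothetical names for illustration; a real Form 16 would need its own labels and probably extra handling for tables.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FormFieldParser
{
    // Labels taken from the sample form above (an assumption, not a Form 16 layout).
    private static readonly string[] Labels =
        { "First Name", "Last Name", "Address", "City", "State" };

    // Returns the text between "<label>:" and the next known label (or the
    // end of the text), or null when the label does not occur.
    public static string ExtractField(string text, string label)
    {
        // Build an alternation of all known labels, escaped for use in a regex.
        string anyLabel = string.Join("|",
            Array.ConvertAll<string, string>(Labels, Regex.Escape));

        // Lazy capture that stops right before the next "<label>:" or at the end.
        string pattern = Regex.Escape(label) + @":\s*(.*?)\s*(?=(?:" + anyLabel + @"):|$)";

        Match match = Regex.Match(text, pattern, RegexOptions.Singleline);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

With the flattened sample line above as input, ExtractField(text, "First Name") returns "David" and ExtractField(text, "Address") returns "100 Main Street". Stopping at the next known label, rather than at a fixed following label, means each field can be looked up on its own.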
Kalpesh Bhadra 18-Oct-13 1:16am    
Thank you david,
I got it. I can do this with regex as it is painful bt somehow will try to deal with it. aganin thank you so much. :)

1 solution

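using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Extracts the text of every page in the PDF at the given path.
public string ReadPdfFile(string path)
{
    StringBuilder text = new StringBuilder();
    PdfReader pdfReader = new PdfReader(path);
    try
    {
        // iTextSharp page numbers are 1-based.
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            // Use a fresh strategy per page so earlier pages are not repeated.
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            // AppendLine keeps a line break between consecutive pages.
            text.AppendLine(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
        }
    }
    finally
    {
        // Release the file even if extraction throws.
        pdfReader.Close();
    }

    string result = text.ToString();
    txtInput.Text = result; // txtInput is a text box on the poster's form
    return result;
}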
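
Since the question asks for line-by-line output, here is a hedged variant of the method above. It assumes iTextSharp 5.x, where LocationTextExtractionStrategy orders text chunks by their position on the page and separates lines with '\n'. The names PdfLineReader, ReadLines and SplitIntoLines are hypothetical, and tabular layouts such as Form 16 may still need extra work to recover columns.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfLineReader
{
    // Splits extracted page text into trimmed, non-empty lines.
    public static List<string> SplitIntoLines(string pageText)
    {
        var lines = new List<string>();
        foreach (string raw in pageText.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string line = raw.Trim();
            if (line.Length > 0) lines.Add(line);
        }
        return lines;
    }

    // Returns the lines of every page, in page order.
    public static List<string> ReadLines(string path)
    {
        var allLines = new List<string>();
        PdfReader reader = new PdfReader(path);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Location-based strategy: groups chunks into lines by position.
                string pageText = PdfTextExtractor.GetTextFromPage(
                    reader, page, new LocationTextExtractionStrategy());
                allLines.AddRange(SplitIntoLines(pageText));
            }
        }
        finally
        {
            reader.Close();
        }
        return allLines;
    }
}
```

Each returned line can then be passed to whatever field parser is used, one line at a time, instead of working on the whole page as a single string.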
 
Comments
Richard MacCutchan 30-Oct-17 6:23am    
FOUR years too late.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)