Convert a PDF into an Excel file

Question

I have been emailed a PDF file which is a set of tables.
I want to manipulate and analyse this data set, but...

I can copy and paste, but it becomes an image on the Excel worksheet.
Or
I can copy the data and it becomes unreadable (not human-readable, that is; it looks like the characters of a binary file).

My third option is to print them out and get someone to retype them in.

I am pretty sure there is a fourth option...

Can someone help?

Posted 19-Sep-11 8:38am

Dalek Dave

Comments

Sergey Alexandrovich Kryukov 19-Sep-11 19:22pm

What, without programming?! I don't think so; please see my solution.
--SA

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Sergey Alexandrovich Kryukov · Accepted Answer · 2011-09-19T13:21:00

Solution 1

Dalek, the option is to read the PDF in your own code, written in Java or C# (or some other .NET language). Then you can either present the PDF data as text that you can import into Excel, or create an Excel file directly using Microsoft Office interop. In the latter case it's better to use .NET (well, in all cases it's better to use .NET, it's just that some people would prefer Java :-)). Apparently, another input should be the mapping rules you want to apply when generating the output.

Why did I mention just Java and .NET? Because the most recommended product is iText for Java; see http://en.wikipedia.org/wiki/IText[^], http://itextpdf.com/[^].

There is also a .NET port called iTextSharp, see http://sourceforge.net/projects/itextsharp/[^].

I believe you can find out from MSDN how to work with Excel in .NET. How about that?
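To make the "present the PDF data as text and apply mapping rules" step more concrete, here is a minimal sketch. It is written in Python only for brevity and does not use iText; it assumes the text has already been extracted from the PDF by whatever library you choose. The sample text, the names `EXTRACTED_TEXT`, `text_to_rows` and `rows_to_csv`, and the rule "two or more spaces separate columns" are all hypothetical; real tables will need their own rules.

```python
import csv
import io
import re

# Hypothetical sample: text as a PDF library might return it for one table,
# with columns separated by runs of two or more spaces.
EXTRACTED_TEXT = """\
Part No    Description        Qty
A-100      Resistor 10k       250
B-220      Capacitor 4.7 uF   40
"""

# Assumed mapping rule: a run of 2+ whitespace characters is a column break.
COLUMN_GAP = re.compile(r"\s{2,}")


def text_to_rows(text):
    """Split each non-empty line into cells using the column-gap rule."""
    return [COLUMN_GAP.split(line.strip())
            for line in text.splitlines() if line.strip()]


def rows_to_csv(rows):
    """Serialize rows as CSV text, a format Excel can open directly."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = text_to_rows(EXTRACTED_TEXT)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Saving `csv_text` to a `.csv` file gives something Excel can import without any interop. The weak point is the mapping rule: a cell that itself contains a double space, or an empty cell, would break this simple split.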

—SA

Posted 19-Sep-11 13:21pm

Sergey Alexandrovich Kryukov

Updated 19-Sep-11 13:23pm

v2

Comments

Dalek Dave 20-Sep-11 3:20am

Thanks for the help.
I was hoping it was possible to do a direct transfer, but despite much googling there was nothing helpful.
(If I buy the Adobe package it can be done, but I refuse to do that on principle.)

Sergey Alexandrovich Kryukov 20-Sep-11 10:41am

Well, this is explainable. The Acrobat format is pretty much unstructured compared to that of a spreadsheet: it is just a collection of fragments of text and graphics pre-rendered on the page, essentially an electronic representation of a set of printed pages. There is no "natural" mapping to a spreadsheet, or even to plain text, that could be used as a universal default. When, say, database data is placed in a PDF, data entropy grows and structural information gets lost. There are commercial products providing "data recognition" in PDFs.

Business scenarios that use such a product (or rather integrate it into their software and workflow) become totally idiotic. Some company produces electronic components and publishes a catalog. This company itself uses, say, an Oracle database of components and other data, but the catalog is published as PDF. Some other company buys a complex integration package with such PDF recognition tools built in and uses it, with great difficulty, to recognize table-like data from the PDF and... put it in another database, perhaps the same kind of Oracle database, but now on the side of the users of those electronic components. All this comes with great effort, risk and data loss.

I know this because I co-authored a metadata-driven technology used in a middleware product, used mostly for data and workflow integration in business. At least our product treated all endpoint protocols and some custom conversion units as plug-ins; many companies ordered ad-hoc products from scratch. I called the business of our company "acting as a parasite on someone's stupidity". :-)
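To illustrate what "a collection of fragments pre-rendered on the page" means for recovering a table, here is a rough Python sketch. The fragment list, the `(x, y, text)` shape, the tolerance value and the function name `fragments_to_rows` are all hypothetical; it assumes y grows toward the top of the page, and a real PDF library will hand you its own structures.

```python
# Hypothetical fragments: (x, y, text) in page units, in arbitrary order,
# the way positioned text might come out of a PDF page.
FRAGMENTS = [
    (200.0, 700.2, "Qty"),
    (50.0, 700.0, "Part No"),
    (50.0, 685.1, "A-100"),
    (200.0, 684.9, "250"),
    (50.0, 670.0, "B-220"),
    (200.0, 670.0, "40"),
]

# Assumed rule: fragments within 2 units vertically belong to the same row.
ROW_TOLERANCE = 2.0


def fragments_to_rows(fragments, tolerance=ROW_TOLERANCE):
    """Cluster fragments into rows by y, then order each row left to right."""
    rows = []  # each entry: (y of first fragment in the row, [(x, text), ...])
    # Visit fragments from the top of the page down (largest y first).
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Sort the cells of each row by x and keep only the text.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


table = fragments_to_rows(FRAGMENTS)
print(table)
```

Nothing in the fragments says "this is a table"; rows and columns are guessed purely from positions, which is why there is no universal default mapping and why the tolerance has to be tuned per document.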

By the way, are you accepting my answer formally? I don't think you will find much more helpful stuff :-)
--SA