I have been emailed a PDF file which is a set of tables.
I want to manipulate and analyse this data set, but...

I can copy and paste, but it becomes an image on the Excel worksheet.
Or
I can copy the data and it becomes unreadable (not human-readable, that is; it looks like the characters of a binary file).

My third option is to print them out and get someone to retype them in.

I am pretty sure there is a fourth option...


Can someone help?
Comments
Sergey Alexandrovich Kryukov 19-Sep-11 19:22pm    
What, without programming?! I don't think so; please see my solution.
--SA

1 solution

Dalek, the option is to read the PDF in your own code, written in Java or C# (or some other .NET language). Then you can present the PDF data as text you can import into Excel, or create an Excel file directly using Microsoft Office interop; in that case it's better to use .NET (well, in all cases it's better to use .NET, just some people would prefer Java :-)). Apparently, another input should be the mapping rules you want to apply when generating the output.

Why did I mention just Java and .NET? Because the most recommended product is iText for Java, see http://en.wikipedia.org/wiki/IText, http://itextpdf.com/.

There is also a .NET port called iTextSharp, see http://sourceforge.net/projects/itextsharp/.

I believe you can find out from MSDN how to work with Excel in .NET. How about that?
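To give an idea of what the "mapping rules" step might look like, here is a minimal sketch. It is written in Python only for brevity (the same logic carries over to Java or C#), and it assumes the text has already been pulled out of the PDF line by line with a library such as iText. The sample lines, the function names and the "two or more spaces separate columns" rule are all assumptions made for illustration; a real PDF will probably need its own rule.

```python
import csv
import io
import re

# Hypothetical sample: lines as a PDF text extractor might return them for
# one table, with columns separated by runs of spaces.
SAMPLE_LINES = [
    "Part No    Description        Price",
    "A-100      Resistor 10k       0.05",
    "A-200      Capacitor 1uF      0.12",
]


def split_columns(line):
    """Assumed mapping rule: two or more whitespace characters separate columns."""
    return re.split(r"\s{2,}", line.strip())


def lines_to_csv(lines):
    """Turn extracted text lines into CSV text that Excel can import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in lines:
        if line.strip():  # skip blank lines between table rows
            writer.writerow(split_columns(line))
    return buffer.getvalue()


if __name__ == "__main__":
    print(lines_to_csv(SAMPLE_LINES))
```

Saving the result with a .csv extension should be enough for Excel to open it as a worksheet; single spaces inside a cell (like "Resistor 10k") survive because only runs of two or more spaces are treated as column breaks.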

—SA
 
Comments
Dalek Dave 20-Sep-11 3:20am    
Thanks for the help.
I was hoping it was possible to do a direct transfer, but despite much googling there was nothing helpful.
(If I buy the Adobe package it can be done, but I refuse to do that on principle.)
Sergey Alexandrovich Kryukov 20-Sep-11 10:41am    
Well, this is explainable. The Acrobat format is pretty much unstructured compared to that of a spreadsheet; it is just a collection of fragments of text and graphics pre-rendered on the page, essentially an electronic representation of a set of printed pages. There is no "natural" mapping to a spreadsheet, or even to plain text, which could be used as a universal default. When, say, database data is placed in a PDF, data entropy grows and structural information gets lost. There are commercial products providing "data recognition" in PDFs.
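To illustrate why the table structure has to be guessed back, here is a rough sketch in Python. It assumes each text fragment comes out of the PDF as an (x, y, text) tuple; that tuple layout, the function name and the tolerance value are hypothetical, not the output format of any particular library. Fragments whose y coordinates are close enough are treated as one row, and each row is then ordered left to right by x.

```python
def fragments_to_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into table rows, guessing rows from y."""
    rows = []  # each entry: [y of the first fragment in the row, [fragments]]
    # In default PDF coordinates y grows upward, so descending y reads
    # the page from top to bottom.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - frag[1]) <= y_tolerance:
            rows[-1][1].append(frag)
        else:
            rows.append([frag[1], [frag]])
    # Within a row, sort by x so cells come out left to right.
    return [[text for _, _, text in sorted(row)] for _, row in rows]


if __name__ == "__main__":
    # Hypothetical fragments, deliberately out of order and slightly misaligned.
    sample = [
        (200.0, 700.5, "0.05"),
        (50.0, 700.0, "A-100"),
        (50.0, 680.0, "A-200"),
        (200.0, 679.6, "0.12"),
    ]
    for row in fragments_to_rows(sample):
        print(row)
```

The tolerance is exactly the kind of guess that goes wrong on real documents: too small and one row splits in two, too large and neighbouring rows merge.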

Business settings that use such a product (or rather integrate it into their software and workflow) become totally idiotic. Some company produces electronic components and publishes a catalog. This company itself uses, say, an Oracle database of components and other data. But the catalog is published in PDF. Some other company buys a complex integration package with such PDF recognition tools built in and uses it, with great difficulty, to recognize table-like data from the PDF and... put it into another database, perhaps the same kind of Oracle database, but now on the side of the users of those electronic components; all this with great effort, risk and data loss.

I know this because I co-authored a metadata-driven technology used in a middleware product, which was used mostly for data and workflow integration in business. At least our product treated all endpoint protocols and some custom conversion units as plug-ins; many companies ordered ad-hoc products from scratch. I called the business of our company "acting as a parasite on someone's stupidity". :-)

By the way, are you accepting my answer formally? I don't think you will find much more helpful stuff :-)
--SA

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


