I have been emailed a PDF file which is a set of tables.
I want to manipulate and analyse this data set, but...

I can copy and paste, but it becomes an image on the Excel worksheet.
I can copy the data, but it becomes unreadable (not human-readable, that is; it looks like the characters of a binary file).

My third option is to print them out and get someone to retype them in.

I am pretty sure there is a fourth option...

Can someone help?
Posted 19-Sep-11 8:38am
Dalek Dave433.3K
SAKryukov 19-Sep-11 19:22pm
What, without programming?! I don't think so; please see my solution.

1 solution


Solution 1

Dalek, the option is to read the PDF in code written in Java or C# (or some other .NET language). Then you can present the PDF data as text that you can import into Excel, or create an Excel file directly using Microsoft Office interop; in that case it's better to use .NET (well, in all cases it's better to use .NET, just some people would prefer Java :-)). Apparently, the other input you need is a set of mapping rules to apply when generating the output.

Why did I mention just Java and .NET? Because the most recommended product is iText for Java, see [^], [^].

There is also a .NET port called iTextSharp, see [^].

I believe you can find out from MSDN how to work with Excel in .NET. How about that?
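
To make the "text you can import in Excel" step a bit more concrete, here is a minimal Java sketch. It assumes the lines of text have already been extracted from the PDF by some library (that part is not shown), and it assumes one particular mapping rule: columns are separated by runs of two or more spaces. The class name PdfTextToCsv and its methods are hypothetical, made up for this illustration; a real document would likely need different rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns lines of text already extracted from a PDF
// into CSV rows that Excel can open. Not part of iText or any other library.
class PdfTextToCsv {

    // Assumed mapping rule: a run of two or more whitespace characters
    // separates columns; a single space stays inside the cell.
    static List<String> toCsv(List<String> extractedLines) {
        List<String> csv = new ArrayList<>();
        for (String line : extractedLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines between table rows
            }
            List<String> cells = new ArrayList<>();
            for (String cell : trimmed.split("\\s{2,}")) {
                cells.add(quote(cell));
            }
            csv.add(String.join(",", cells));
        }
        return csv;
    }

    // Quote a cell only when it contains a comma, a double quote, or a line break.
    static String quote(String cell) {
        if (cell.contains(",") || cell.contains("\"") || cell.contains("\n")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for text extracted from a PDF table.
        List<String> lines = Arrays.asList(
            "Part No   Description        Price",
            "R-100     Resistor, 10k      0.05");
        for (String row : toCsv(lines)) {
            System.out.println(row);
        }
    }
}
```

Saving the resulting lines to a .csv file gives something Excel can open directly, without going through Office interop. The splitting rule is the fragile part and would have to be tuned per document.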

Dalek Dave 20-Sep-11 3:20am
Thanks for the help.
I was hoping it was possible to do a direct transfer, but despite much googling there was nothing helpful.
(If I buy the Adobe package it can be done, but I refuse to do that out of principle).
SAKryukov 20-Sep-11 10:41am
Well, this is explainable. The Acrobat (PDF) format is pretty much unstructured compared to that of a spreadsheet: it is just a collection of fragments of text and graphics pre-rendered on the page, essentially an electronic representation of a set of printed pages. There is no "natural" mapping to a spreadsheet, or even to plain text, that could be used as a universal default. When, say, database data is placed in a PDF, data entropy grows and structural information gets lost. There are commercial products providing "data recognition" in PDFs.
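
To illustrate what "a collection of fragments of text pre-rendered on the page" means for table recovery, here is a rough Java sketch. It assumes a PDF library has already given you each piece of text together with its page coordinates; the Fragment class, the toRows method and the tolerance value are all hypothetical and not taken from any real library. Rows are guessed by grouping fragments with nearly the same y coordinate, and cells are ordered by x.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: rebuild table rows from positioned text fragments.
class FragmentRows {

    // One piece of text with its position on the page (assumed model).
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Fragments whose y lies within `tolerance` of the first fragment of a row
    // are treated as the same row; cells in a row are ordered left to right.
    static List<List<String>> toRows(List<Fragment> fragments, double tolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // In default PDF coordinates y grows upward, so sort by descending y
        // to go from the top of the page to the bottom.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> -f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        double rowY = 0;
        for (Fragment f : sorted) {
            if (rows.isEmpty() || Math.abs(f.y - rowY) > tolerance) {
                rows.add(new ArrayList<>());
                rowY = f.y; // the first fragment anchors the row
            }
            rows.get(rows.size() - 1).add(f);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up fragments, deliberately out of reading order.
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment(300, 700.0, "Price"));
        page.add(new Fragment(300, 680.0, "0.05"));
        page.add(new Fragment(50, 700.5, "Part"));
        page.add(new Fragment(50, 680.0, "R-100"));
        System.out.println(toRows(page, 2.0));
    }
}
```

Even this toy version has to guess at a tolerance, and it knows nothing about merged cells, wrapped text or multi-column pages, which is exactly the structural information that gets lost.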

The business situations in which such a product is used (or rather integrated into a company's software and workflow) become totally idiotic. Some company produces electronic components and publishes a catalog. This company itself uses, say, an Oracle database of components and other data. But the catalog is published as PDF. Some other company buys a complex integration package that includes such PDF recognition tools and uses it, with great difficulty, to recognize table-like data in the PDF and... put it in another database, perhaps the same kind of Oracle database, but now on the side of the users of those electronic components; all this with great effort, risk and data loss.

I know this because I co-authored a metadata-driven technology used in a middleware product, which was used mostly for data and workflow integration in business. At least our product treated all endpoint protocols and some custom conversion units as plug-ins; many companies ordered ad-hoc products built from scratch. I called the business of our company "acting as a parasite on someone's stupidity". :-)

By the way, are you accepting my answer formally? I don't think you will find much more that is helpful :-)

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Last Updated 19 Sep 2011
Copyright © CodeProject, 1999-2017