I have a stack of 4 filled-in PDF forms, all copies of the same blank form. I would like to process all the PDF files in a folder.

I would like to end up with one of the following:
(a) an Excel file,
(b) a comma-separated values (CSV) file with quote marks around cell entries,
(c) a pipe-separated ("|") file, or
(d) a tab-separated file,
any of which can be used in statistical packages.

The first row of the target file would be the names of the fields. Then there would be 1 row from each of the PDF files.
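The layout described above (a header row of field names, then one row per PDF) can be sketched with Python's standard csv module. This is only a sketch under assumptions: the function name write_table and the idea that each form has already been reduced to a dict of field name to value are hypothetical, not something from the original post.

```python
import csv

def write_table(rows, out_path, delimiter=",", quote_all=True):
    """Write one header row plus one row per form.

    `rows` is assumed to be a list of dicts (one per PDF) mapping
    field name -> value. This helper is hypothetical.
    """
    # Build the header from every field name seen, in first-seen order.
    fieldnames = []
    for row in rows:
        for name in row:
            if name not in fieldnames:
                fieldnames.append(name)

    quoting = csv.QUOTE_ALL if quote_all else csv.QUOTE_MINIMAL
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=fieldnames,
            delimiter=delimiter,
            quoting=quoting,
            restval="",  # fields missing from a form become empty cells
        )
        writer.writeheader()
        writer.writerows(rows)
    return fieldnames
```

Passing delimiter="|" or delimiter="\t" gives the pipe-separated or tab-separated variants, and quote_all=True puts quote marks around every cell entry.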

What I have tried:

A year and a half ago, I worked out enough Python to get the data into a scrambled text file, but I got stuck at that point and gave up.
The scrambled file looked a lot like an Algol heap from the mid-1970s.

I was hoping someone had already worked this out, and I wanted to find out whether it had been done before jumping back into it.

I do not see how to attach files on this forum, but I can supply a PDF file that has not been filled in, 4 PDF files that have been filled in, and an example of what I want to end up with.
Updated 16-Aug-20 17:26pm
RedDk 16-Aug-20 14:24pm    
Open the PDF in Acrobat Pro and save the "table" content as XML. That format can be imported directly into an Excel spreadsheet using the Excel app itself (multiple versions have allowed this, and I can't imagine the latest incarnation wouldn't as well). There's a bit of a disconnect between pure XML and the MS flavor of XML, so be forewarned that retabulating the tables can be hit or miss. Also try searching Stack Overflow for the words Excel, XML, and PDF.
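If the XML route is taken and the form data is exported as XFDF (Adobe's XML form-data format), the file could also be read directly in Python instead of going through Excel. The sketch below assumes XFDF specifically, which the comment above does not state; the function name xfdf_to_row is hypothetical.

```python
import xml.etree.ElementTree as ET

# XFDF elements live in this XML namespace.
XFDF_NS = "{http://ns.adobe.com/xfdf/}"

def xfdf_to_row(xml_text):
    """Turn one XFDF document into a dict of field name -> value.

    Assumes the export is XFDF; nested fields are flattened into
    dotted names such as "address.city". Hypothetical helper.
    """
    root = ET.fromstring(xml_text)
    row = {}

    def walk(parent, prefix):
        for field in parent.findall(XFDF_NS + "field"):
            name = prefix + field.get("name", "")
            value = field.find(XFDF_NS + "value")
            if value is not None:
                row[name] = value.text or ""
            # Recurse into child fields, if any.
            walk(field, name + ".")

    fields = root.find(XFDF_NS + "fields")
    if fields is not None:
        walk(fields, "")
    return row
```

Each returned dict corresponds to one row of the target file, keyed by field name.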

1 solution

If it were C#, I'd be suggesting iTextSharp, but since it's Python, I'd use a library, possibly one from this list, to do the text extraction. That really is the hardest nut to crack. I'm pretty sure there is Python code or a library out there to build the file in your choice of CSV, TSV, PSV, etc.
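As one possible take on the extraction step, here is a rough sketch that loops over a folder and reads the form fields from each PDF. It assumes the third-party pypdf package, which this answer does not name, and that the forms are standard fillable AcroForms; the helper names read_fields_pypdf and collect_rows are made up for illustration.

```python
from pathlib import Path

def read_fields_pypdf(pdf_path):
    """Return {field name: value} for one filled-in form.

    Assumption: pypdf is installed (pip install pypdf) and the PDF
    is a fillable AcroForm. Imported lazily so that the folder loop
    below can be used with any other extractor.
    """
    from pypdf import PdfReader

    fields = PdfReader(str(pdf_path)).get_fields() or {}
    row = {}
    for name, field in fields.items():
        value = field.get("/V")  # "/V" holds the filled-in value, if any
        row[name] = "" if value is None else str(value)
    return row

def collect_rows(folder, read_fields=read_fields_pypdf):
    """Build one dict per *.pdf file in `folder`, sorted by file name."""
    rows = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        row = {"source_file": pdf_path.name}  # extra column to trace each row
        row.update(read_fields(pdf_path))
        rows.append(row)
    return rows
```

The resulting list of dicts can then be written out with the standard csv module (for example csv.DictWriter), choosing a comma, tab, or pipe as the delimiter. Checkbox and radio-button values may come back as PDF names such as "/Yes" and might need cleaning up.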

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)