I have a large batch of scanned PDF documents.

I want to read each scanned PDF document and generate XML from its content.

Then I want to update the content of the PDF from the modified XML file.

How can I do this?
Posted 19-Dec-12 2:18am
Abhishek Pant 19-Dec-12 9:13am

Solution 1

Hire people to do this for you. :)
This is a really challenging task, not one for a "quick answers" kind of forum. There are commercial applications for such tasks, but in general it can't be done with 100% accuracy.

What you need:
- an OCR engine (assuming the image quality is good enough and there is no handwriting); some scanners already add an OCR text layer on top of the scanned image
- one or more patterns that map text to XML elements based on position or some metadata (assuming your documents come in a limited number of types)
- document type recognition logic
- content validation logic, to give you a clue how well the automatic process performed
- editing a PDF is a separate problem. If the scanner did not OCR the scanned images, you cannot edit the image itself; you have to place the new text over the original
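
The position-based mapping in the list above might look roughly like the sketch below. It assumes the OCR engine has already produced a list of words with page coordinates; the `Word` and `Region` types, the `INVOICE_TEMPLATE` regions, the element names, and the sample words are all hypothetical and only illustrate the idea. Real OCR output usually has jittery coordinates, so a real implementation would need to group words into lines more carefully.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR'd word with the page position of its top-left corner (hypothetical shape)."""
    text: str
    x: float
    y: float


@dataclass
class Region:
    """A rectangle on the page whose text belongs to one XML element."""
    element: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word):
        return self.left <= word.x <= self.right and self.top <= word.y <= self.bottom


def words_to_xml(words, template, root_name="document"):
    """Build an XML tree with one child element per template region."""
    root = ET.Element(root_name)
    for region in template:
        inside = [w for w in words if region.contains(w)]
        # Reading order: top to bottom, then left to right.
        inside.sort(key=lambda w: (w.y, w.x))
        ET.SubElement(root, region.element).text = " ".join(w.text for w in inside)
    return root


# Hypothetical template for one document type.
INVOICE_TEMPLATE = [
    Region("invoice_number", left=400, top=40, right=600, bottom=70),
    Region("customer", left=50, top=100, right=350, bottom=140),
]

# Made-up OCR output, deliberately out of reading order.
ocr_words = [
    Word("Ltd", 120, 110),
    Word("Acme", 60, 110),
    Word("INV-0042", 420, 50),
    Word("Page", 300, 800),  # falls outside every region and is ignored
]

xml_text = ET.tostring(words_to_xml(ocr_words, INVOICE_TEMPLATE), encoding="unicode")
print(xml_text)
```

Each document type would need its own template, which is why the document type recognition step comes before this one.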

But these are only the basic concepts. Such a task is really hard: many months of full-time work, and at the end you will still have special cases where the automatic handling does not work, so you have to add some user interaction, which means you will need a user interface too.
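
As a rough illustration of the content validation idea, the sketch below checks extracted field values against per-field patterns and reports anything that should go to a person for manual review. The field names and the patterns in `RULES` are assumptions made up for this example, not something from a real product.

```python
import re

# Hypothetical rules: one expected pattern per extracted field.
RULES = {
    "invoice_number": re.compile(r"INV-\d{4}"),
    "date": re.compile(r"\d{2}-\d{2}-\d{4}"),
}


def validate(fields, rules=RULES):
    """Return {field: problem} for every field that is missing or looks wrong."""
    problems = {}
    for name, pattern in rules.items():
        value = fields.get(name, "").strip()
        if not value:
            problems[name] = "missing"
        elif not pattern.fullmatch(value):
            problems[name] = f"unexpected value: {value!r}"
    return problems


good = {"invoice_number": "INV-0042", "date": "19-12-2012"}
bad = {"invoice_number": "INV-OO42", "date": ""}  # letter O instead of zero, a typical OCR slip

print(validate(good))
print(validate(bad))
```

Documents that come back with a non-empty result are the ones that would be routed to the user interface for correction.
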
lewax00 19-Dec-12 10:26am
That sums it up pretty well. I work on a product with a similar feature, and I can add that even if the PDF is not scanned in (i.e. the text can be stripped from it) they still aren't easy to process. PDF is only good for one thing: printable documents that don't change based on the reader. They are terrible as a data source.

Solution 2

Adobe Acrobat 9 Pro (v9.5.2) does a good job of making .xml out of .pdf. It has an option in the Save dialog to save as "XML 1.0", with settings for:

Encoding, bookmark generation, tag generation ...

There are also Image File Settings:

Generate images, use sub-folder, output format (TIFF, JPG, PNG), even downsampling ...

So as complicated as "disassembling" a .pdf can be (speaking from personal experience), Adobe is the original "fonter" and "printer", and this app lets them package their proprietary knowledge both formidably and somewhat successfully.

The only downside: $$$.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Last Updated 19 Dec 2012
Copyright © CodeProject, 1999-2017
All Rights Reserved. Terms of Service