Document processing has been used for decades in the financial and insurance industries. This second part of the overview covers Request Driven Extraction (RDE) as the next step beyond plain OCR analysis.
OCR is a powerful and popular technique for reading paper-based documents. Today’s OCR systems are no longer restricted to reading running text. They also recognize higher-level layout structures such as lists and tables. So why do we need extra table extraction?
Here is an example: Let’s say you want to export table data to an Excel file. You have three paper documents with the very same table layout (your phone bills, for example). If you are not willing to type in every character manually, you can scan the documents and run an OCR analysis. But since OCR cannot ‘know’ that all documents contain the same table layout, in the worst case you will get three different table formats.
That is where RDE comes in. With RDE you create a single table pattern for all documents and let the machine produce the corresponding results. After the process, you have one and the same information model for every document - a small but very important difference.
To show RDE technology in a simple way, I created the TableExtractor. This application can be seen as an expanded version of the MODI example from Document Processing Part I. Again, MS Office 2003 is required. The new feature is the 'Table Capture Frame', a semi-transparent tool window for defining your own table requests. The next steps guide you through the whole process of table extraction:
The implementation neither relies on special tricks nor introduces groundbreaking new design patterns. Instead, I want to draw your attention to the underlying object model.
For the application, we design a simple document model: a hierarchy of four layout element classes (documents, pages, lines, words). We don’t use the MODI object model this time, because we need line elements, which the MODI model does not provide. After the OCR process is done, we generate an instance of our model by converting the MODI objects. At this point we do not generate lines yet. That is because lines are created later by a clustering algorithm that groups words into lines.
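The four-level hierarchy can be sketched as follows. This is a minimal illustration in Python, not the classes used in the TableExtractor: all class and field names (`Word`, `Line`, `Page`, `Document`, the bounding-box fields) are assumptions, and the conversion from MODI objects is left out. A page starts with a flat list of words and an empty list of lines, matching the idea that lines are only produced later by clustering.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    """A recognized word with its bounding box in page coordinates (hypothetical fields)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


@dataclass
class Line:
    """A group of words that share one text line."""
    words: List[Word] = field(default_factory=list)

    @property
    def text(self) -> str:
        # Read the words from left to right.
        ordered = sorted(self.words, key=lambda w: w.left)
        return " ".join(w.text for w in ordered)


@dataclass
class Page:
    """A page holds the raw words; lines stay empty until clustering runs."""
    words: List[Word] = field(default_factory=list)
    lines: List[Line] = field(default_factory=list)


@dataclass
class Document:
    pages: List[Page] = field(default_factory=list)
```

Keeping `words` on the page and filling `lines` later is one possible way to postpone line segmentation until a table area has been selected.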
To represent our knowledge about the table, we create a table request. This table request contains column requests. Each column request stores its width relative to the width of the table.
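A table request of this kind might look like the sketch below. The names `TableRequest`, `ColumnRequest`, `relative_width` and `column_bounds` are hypothetical, as is the choice to store the table area as a rectangle; the only idea taken from the text is that each column carries a width relative to the table. The helper turns the relative widths into absolute X ranges for a given table rectangle.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ColumnRequest:
    name: str
    relative_width: float  # share of the table width, e.g. 0.25 for a quarter


@dataclass
class TableRequest:
    # Table rectangle in page coordinates (assumed to come from the capture frame).
    left: int
    top: int
    right: int
    bottom: int
    columns: List[ColumnRequest]

    def column_bounds(self) -> List[Tuple[float, float]]:
        """Return the absolute (x_start, x_end) range of every column."""
        total = sum(c.relative_width for c in self.columns)
        table_width = self.right - self.left
        bounds = []
        x = float(self.left)
        for column in self.columns:
            # Normalize by the total so the widths need not sum to exactly 1.
            width = table_width * column.relative_width / total
            bounds.append((x, x + width))
            x += width
        return bounds
```

Because the widths are relative, the same request can be applied to scans of different resolution, as long as the table rectangle is placed correctly on each page.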
The extraction process is implemented in two simple steps. In the first step, the table’s lines are clustered from the selected word elements. This clustering performs a so-called Hough transform. Wherever words have overlapping projections on the Y-axis, they are combined into a line element. That’s the reason why we don’t do global line segmentation: because of noise elements (like OCR errors), Hough transforms work better when restricted to small areas. The second step iterates through all generated lines and assigns the contained words to columns, using simple intersection criteria.
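The two steps can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not the TableExtractor code: it groups words whose Y ranges overlap into lines, then assigns each word to the column whose X range it intersects most. The `Word` class and the function names `cluster_lines` and `split_columns` are hypothetical, and "largest overlap wins" is only one possible intersection criterion.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Word:
    text: str
    left: float
    top: float
    right: float
    bottom: float


def cluster_lines(words: Sequence[Word]) -> List[List[Word]]:
    """Step 1: combine words with overlapping Y projections into lines."""
    lines: List[List[Word]] = []
    band_bottom = 0.0  # lower edge of the line currently being built
    for word in sorted(words, key=lambda w: w.top):
        if lines and word.top < band_bottom:
            # The word starts inside the current band, so it joins that line.
            lines[-1].append(word)
            band_bottom = max(band_bottom, word.bottom)
        else:
            lines.append([word])
            band_bottom = word.bottom
    return lines


def split_columns(line: Sequence[Word],
                  bounds: Sequence[Tuple[float, float]]) -> List[str]:
    """Step 2: put each word into the column it overlaps most on the X axis."""
    cells: List[List[str]] = [[] for _ in bounds]
    for word in sorted(line, key=lambda w: w.left):
        overlaps = [max(0.0, min(word.right, hi) - max(word.left, lo))
                    for lo, hi in bounds]
        best = max(range(len(bounds)), key=overlaps.__getitem__)
        if overlaps[best] > 0:  # words outside every column are dropped
            cells[best].append(word.text)
    return [" ".join(cell) for cell in cells]
```

A caller would pass only the words inside the selected table area to `cluster_lines` and then run `split_columns` on every resulting line, with the column ranges derived from the table request.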
This article describes a very basic table model. There are plenty of ways to extend this model. Just to give you an idea, I have listed a few examples. Please be aware that some of these points can get very complex and that their development currently keeps a lot of people busy.