Document processing has been used for decades in the financial and insurance industries. This second part of the overview covers Request Driven Extraction (RDE) as the next step beyond plain OCR analysis.
OCR is a powerful and popular technique for reading paper-based documents. Today’s OCR systems are no longer restricted to reading running text passages; they also recognize higher-level layout structures such as lists and tables. So why do we need extra table extraction?
Here is an example. Let’s say you want to export table data to an Excel file, and you have three paper documents with the very same table layout (your phone bills, for example). If you are not willing to type in every character manually, you can scan the documents and run an OCR analysis. But since OCR cannot ‘know’ that all documents contain the same table layout, you may get (in the worst case) three different table formats.
That is where RDE comes in. With RDE you define one table pattern for all documents and let the machine produce corresponding results. After the process, you have one single information model for every document. That is a small but very important difference.
The TableExtractor - A User's Manual
To show RDE technology in a simple way, I created the TableExtractor. This application can be seen as an expanded version of the MODI example from Document Processing Part I. Again, MS Office 2003 is required. The new feature is the 'Table Capture Frame', a semi-transparent tool window for customizing your personal table requests. The following steps guide you through the whole process of table extraction:
- Open an image document.
- Press the OCR button to get plain document text.
- Select a table you want to extract by using the red selection area.
- Press the Adjust button to show the Table Capture Frame. By default, the table request contains only a single column.
- To customize your table request, choose Add Columns and resize them by dragging the column headers.
- Press Capture to extract the table.
- Export the table result to a file.
- Open a new document. You can reuse the table request you have already customized.
The TableExtractor – Technical Aspects
The implementation neither includes special tricks nor provides groundbreaking new design patterns. Instead, I want to draw your attention to the underlying object model.
The Document Model
For the application, we design a simple document model: a hierarchy of four layout element classes (documents, pages, lines, and words). We don’t use the MODI object model this time, because we need line elements, which the MODI model does not provide. After the OCR process is done, we generate an instance of our model by converting the MODI objects. At this point we do not generate lines yet, because the clustering algorithm that groups words into lines is run later, on the selected table area only.
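The hierarchy described above can be sketched in a few lines. This is only an illustration in Python, not the application's actual classes: the names `Word`, `Line`, `Page`, `Document`, and `build_document`, as well as the tuple layout of the OCR input, are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Word:
    """A recognized word with its bounding box in pixel coordinates."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


@dataclass
class Line:
    """A text line; created later by clustering, not during conversion."""
    words: List[Word] = field(default_factory=list)


@dataclass
class Page:
    words: List[Word] = field(default_factory=list)  # filled right after OCR
    lines: List[Line] = field(default_factory=list)  # stays empty until clustering


@dataclass
class Document:
    pages: List[Page] = field(default_factory=list)


# Hypothetical stand-in for one converted OCR word:
# (text, left, top, right, bottom)
RawWord = Tuple[str, int, int, int, int]


def build_document(ocr_pages: List[List[RawWord]]) -> Document:
    """Convert raw OCR output into the document model without creating lines."""
    doc = Document()
    for raw_words in ocr_pages:
        doc.pages.append(Page(words=[Word(*raw) for raw in raw_words]))
    return doc
```

Keeping `lines` empty after conversion mirrors the design decision in the text: words are stored per page first, and line elements only appear once a table area has been selected.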
The Request Model
To represent our knowledge about the table, we create a table request. This table request contains column requests. Each column request stores its width relative to the width of the table.
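As a rough sketch, such a request model could look like the following Python code. The class names `ColumnRequest` and `TableRequest`, the `name` attribute, and the helper `column_bounds` are hypothetical and only serve to show how relative widths can be turned into absolute column boundaries once the table area is known.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ColumnRequest:
    name: str              # hypothetical label, e.g. "Date" or "Amount"
    relative_width: float  # share of the table width taken by this column


@dataclass
class TableRequest:
    columns: List[ColumnRequest] = field(default_factory=list)

    def column_bounds(self, table_left: float,
                      table_width: float) -> List[Tuple[float, float]]:
        """Map the relative widths onto a concrete table area.

        Returns one (left, right) pair per column. The widths are
        normalized first, so they do not have to sum up to exactly 1.
        """
        total = sum(c.relative_width for c in self.columns)
        bounds = []
        x = table_left
        for column in self.columns:
            width = table_width * column.relative_width / total
            bounds.append((x, x + width))
            x += width
        return bounds
```

Because only relative widths are stored, the same request can be applied to a table selection of any size, which is what makes it reusable across documents with the same layout.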
Table Extraction Process
The extraction process is implemented in two simple steps. In the first step, the table’s lines are clustered from the selected word elements. This clustering performs a so-called Hough transformation: wherever words have overlapping projections on the Y-axis, they are combined into a line element. That is also why we don’t do global line segmentation: because of noise elements (such as OCR errors), Hough transformations work better when restricted to small areas. The second step iterates through all generated lines and assigns the contained words to columns, using simple intersection criteria.
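The two steps can be illustrated with a minimal Python sketch. It is a simplified stand-in, not the application's code: line clustering is reduced to a plain interval-overlap test on the Y-axis, and the intersection criterion is assumed to be "largest horizontal overlap with a column". The names `Word`, `cluster_lines`, `split_columns`, and `extract_table` are made up for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    left: float
    top: float
    right: float
    bottom: float


def cluster_lines(words: List[Word]) -> List[List[Word]]:
    """Step 1: group words whose projections on the Y-axis overlap."""
    lines: List[List[Word]] = []
    current_bottom = 0.0
    for word in sorted(words, key=lambda w: w.top):
        if lines and word.top < current_bottom:
            # Overlaps the vertical extent of the current line: same line.
            lines[-1].append(word)
            current_bottom = max(current_bottom, word.bottom)
        else:
            lines.append([word])
            current_bottom = word.bottom
    for line in lines:
        line.sort(key=lambda w: w.left)  # restore reading order
    return lines


def split_columns(line: List[Word],
                  bounds: List[Tuple[float, float]]) -> List[str]:
    """Step 2: assign each word to the column it overlaps most."""
    if not bounds:
        return []
    cells: List[List[str]] = [[] for _ in bounds]
    for word in line:
        overlaps = [min(word.right, hi) - max(word.left, lo)
                    for lo, hi in bounds]
        best = max(range(len(bounds)), key=lambda i: overlaps[i])
        if overlaps[best] > 0:  # words outside every column are dropped
            cells[best].append(word.text)
    return [" ".join(cell) for cell in cells]


def extract_table(words: List[Word],
                  bounds: List[Tuple[float, float]]) -> List[List[str]]:
    """Run both steps on the words inside the selected table area."""
    return [split_columns(line, bounds) for line in cluster_lines(words)]
```

Note that the overlap test chains words together, so a single tall noise element can merge two neighboring lines. This is one reason why restricting the clustering to the selected table area helps.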
This article describes a very basic table model. There are plenty of ways to extend it; to give you an idea, I have listed a few examples. Please be aware that some of these points can get very complex and that their development currently keeps a lot of people busy.
- Multiple line requests: In our example request model, only one type of line is defined. You may allow alternative line types in one single table.
- AutoCorrection: You may add data format templates (e.g. regular expressions) to the column request model. That enables you to detect or correct OCR errors in the text content.
- Column Order: You may expand the model for different column orders and optional columns.
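To make the AutoCorrection idea a little more concrete, here is a small Python sketch of a format template attached to a column. Everything in it is an assumption for illustration: the function name `check_cell`, the substitution table `DIGIT_FIXES` of letters commonly confused with digits, and the example pattern for amounts. A real system would need a far more careful error model.

```python
import re
from typing import Tuple

# Hypothetical table of letters that OCR may return in place of digits.
DIGIT_FIXES = str.maketrans({
    "O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8",
})


def check_cell(text: str, pattern: str) -> Tuple[str, bool]:
    """Validate a cell against the column's format template.

    Returns (value, ok). If the raw text does not match, one correction
    attempt is made by replacing look-alike letters with digits. This only
    makes sense for numeric columns; text columns should skip this step.
    """
    if re.fullmatch(pattern, text):
        return text, True
    candidate = text.translate(DIGIT_FIXES)
    if re.fullmatch(pattern, candidate):
        return candidate, True
    return text, False  # flag the cell for manual review


# Example template for an amount column: digits, a dot, two decimals.
AMOUNT = r"\d+\.\d{2}"
```

A cell that still fails after the correction attempt is returned unchanged with `ok` set to `False`, so the error is detected even when it cannot be repaired automatically.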