Without proper preparation, converting a scanned image to text via OCR can produce substandard results. Without the proper tools, preparing a scanned image for OCR can be a complex, if not an impossible task. LEADTOOLS Document Imaging Suite solves both of these problems by providing functions that get the image from the scanner, removes nonessential elements that could cause problems for OCR and finally converts the image to a text or document format with the industry leader OCR engine.
In this article I’ll create an application that allows you to choose a scanner, adjust the scanner’s settings to produce images optimal for OCR, scan from the device, OCR the documents, and save recognized text out to disk as a searchable PDF.
The sample in this article was compiled with Visual Studio 2005 using C# and LEADTOOLS Document Imaging Suite version 15 with the OCR PDF plug-in or the LEADTOOLS v15 .NET evaluation with OCR runtime.
I’ve created a Windows application with three buttons to keep this simple.
- Select Scanner – Allows you to choose a scanning device on your local machine.
- Scan – Initiates the scanning process.
- Save – Saves the results of the scanned images after having been OCR’d.
Under the hood, the LEADTOOLS .NET classes perform the bulk of the work. We’ll walk through the code in the order that it’s executed, starting with the form load event.
I first unlock the support for some of the Document Imaging Suite features. These functions only have to be called once (typically in a startup routine) and the features they unlock are then available for the life of the process. If you are using the LEADTOOLS evaluation, you do not have to call these functions, as all functionality is available.
Next we create each object. The OCR object (
RasterDocumentEngine) and the scanning object (
TwainSession) are created globally, as we’ll need them in multiple functions. The rest of the objects are used to clean the images as they are scanned into the application. They are created globally to avoid having to create and destroy them over and over for each page scanned.
For both the scanning object and OCR object, you must call the
StartUp function before you can begin using them.
I link a function to the
AcquirePage event in the scanning object. This event is called for each page captured by the scanner.
In this sample, we are saving out the text that the OCR object has generated from the images (recognized text) as a searchable PDF (PDF Image with text underneath). We also set the
RecognitionDataFileName to a file in the user’s temp directory. This file is used by the OCR engine to store the recognized text before it is converted to a final format, such as Microsoft Word, Excel, PDF, etc. Each time you OCR an image, it appends the recognized text to this file. This would allow you to append multiple documents together even if you restart your machine in between scans. To opt out of this option, simply delete this file prior to starting the recognition process.
Each document clean object is then initialized to values that are optimal for most scanned bitonal images.
btnSelectScanner_Click event, simply call
TwainSession::SelectSource with an empty string to display the
SelectSource dialog. This dialog is populated by the Twain Source Manager found in the twain32.dll file.
If you would like to select a scanning device without showing this dialog, simply pass the name of the device for the parameter in the
Before we begin the scan, you'll want to set up the scanner to produce images that are optimal for OCR. We set the X and Y resolution to 300 and set the bits per pixel to one, which essentially tells the scanning device to scan in black and white.
_twSession.Acquire begins the scanning process. In this sample, we passed "None" as a parameter, which means that no other user interface will appear before the scanner begins capturing. You can also pass "Show" to show the scanner’s dialog, which will allow the user to have the final say on the settings used.
Here is the code that does what was just described:
At this point, the images are being scanned and the
AcquirePage event is being called for each page scanned. This event is covered further down. Once the scan is complete, we call
AutoOrientPage for each page in the OCR. If a page was scanned up-side-down, this function will rotate it back to right-side-up. Next, we delete the recognition data file if it exists and then recognize all of the pages.
_twSession_AcquirePage event is called for each page that is scanned. The scanned image is given to you in the
TwainAcquirePageEventArgs::Image parameter. In this event, we clean up the image using each of the document clean-up classes created and set up in the form load event. Once the image is clean, we add it to the OCR object where it is later converted to editable text and stored in the recognition data file.
Lastly, I save the results from the OCR to disk. As you remember, I set up the OCR to output the results as a PDF file. The OCR will take the data in the recognition data file and convert it to a searchable PDF file.
Below are the results of an image before cleaning versus a cleaned image from the LEADTOOL’s OCR:
This is just one way you can implement a workflow of scanning, cleaning, OCRing and archiving. Should you have barcodes or patch codes on the scanned document, you can use LEADTOOL’s barcode functionality to detect them in the
AcquirePage event and perform additional logic based on the results.
Required Software to Build this Sample
In order to run this sample on your machine, you can download LEADTOOLS free 60 day evaluation.
Have questions about this sample? Contact our expert support team for free evaluation support!