Scan an Image, Clean It, OCR It and Save It

Create an application that allows the user to choose a scanner, adjust the scanner’s settings to produce images optimal for OCR, scan from the device, OCR the documents, and save recognized text out to disk as a searchable PDF.

Editorial Note

This article is in the Product Showcase section for our sponsors at CodeProject. These reviews are intended to provide you with information on products and services that we consider useful and of value to developers.

Introduction

Without proper preparation, converting a scanned image to text via OCR can produce substandard results. Without the proper tools, preparing a scanned image for OCR can be a complex, if not impossible, task. LEADTOOLS Document Imaging Suite solves both of these problems by providing functions that get the image from the scanner, remove nonessential elements that could cause problems for OCR, and finally convert the image to a text or document format with an industry-leading OCR engine.

In this article I’ll create an application that allows you to choose a scanner, adjust the scanner’s settings to produce images optimal for OCR, scan from the device, OCR the documents, and save recognized text out to disk as a searchable PDF.

Environment

The sample in this article was compiled with Visual Studio 2005 using C# and LEADTOOLS Document Imaging Suite version 15 with the OCR PDF plug-in or the LEADTOOLS v15 .NET evaluation with OCR runtime.

The Code

I’ve created a Windows application with three buttons to keep this simple.

ScanCleanOCR/image001.jpg
  • Select Scanner – Allows you to choose a scanning device on your local machine.
  • Scan – Initiates the scanning process.
  • Save – Saves the OCR results for the scanned images to disk.

Under the hood, the LEADTOOLS .NET classes perform the bulk of the work. We’ll walk through the code in the order that it’s executed, starting with the form load event.

ScanCleanOCR/image002.jpg
ScanCleanOCR/image003.jpg

I first unlock the support for some of the Document Imaging Suite features. These functions only have to be called once (typically in a startup routine) and the features they unlock are then available for the life of the process. If you are using the LEADTOOLS evaluation, you do not have to call these functions, as all functionality is available.

Next we create each object. The OCR object (RasterDocumentEngine) and the scanning object (TwainSession) are created globally, as we’ll need them in multiple functions. The rest of the objects are used to clean the images as they are scanned into the application. They are created globally to avoid having to create and destroy them over and over for each page scanned.

For both the scanning object and OCR object, you must call the StartUp function before you can begin using them.

I link a handler function to the AcquirePage event of the scanning object. This event is raised for each page captured by the scanner.
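The per-page event pattern described above can be sketched without any scanner attached. The `DemoScanner` and `PageEventArgs` types and the `PageAcquired` event below are hypothetical stand-ins written for this illustration; they are not LEADTOOLS classes, and the real TwainSession signatures may differ.

```csharp
using System;

// Hypothetical stand-ins for illustration only; these are NOT LEADTOOLS types.
class PageEventArgs : EventArgs
{
    private readonly int _pageNumber;
    public PageEventArgs(int pageNumber) { _pageNumber = pageNumber; }
    public int PageNumber { get { return _pageNumber; } }
}

class DemoScanner
{
    // Raised once for every page the simulated device delivers.
    public event EventHandler<PageEventArgs> PageAcquired;

    public void Acquire(int pageCount)
    {
        for (int page = 1; page <= pageCount; page++)
        {
            // Copy the delegate first so a late unsubscribe cannot cause a null call.
            EventHandler<PageEventArgs> handler = PageAcquired;
            if (handler != null)
                handler(this, new PageEventArgs(page));
        }
    }
}

class EventDemo
{
    public static int PagesSeen = 0;

    // In the real application, this is where each page would be cleaned
    // and handed to the OCR object.
    public static void OnPageAcquired(object sender, PageEventArgs e)
    {
        PagesSeen++;
        Console.WriteLine("Received page " + e.PageNumber);
    }

    static void Main()
    {
        DemoScanner scanner = new DemoScanner();
        scanner.PageAcquired += new EventHandler<PageEventArgs>(OnPageAcquired);
        scanner.Acquire(3);
        Console.WriteLine("Pages handled: " + PagesSeen);
    }
}
```

Because the handler is attached once, before scanning starts, the scan loop itself never needs to know what happens to each page.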

In this sample, we save the text that the OCR object generates from the images (the recognized text) as a searchable PDF (a PDF image with text underneath). We also set RecognitionDataFileName to a file in the user’s temp directory. The OCR engine uses this file to store the recognized text before it is converted to a final format such as Microsoft Word, Excel, or PDF. Each time you OCR an image, the recognized text is appended to this file, which lets you combine multiple documents even if you restart your machine between scans. To start fresh instead, simply delete this file before starting the recognition process.
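As a minimal sketch of the file handling only, the following uses standard System.IO calls to build a path in the user’s temp directory and delete any leftover file so that a new session does not append to old results. The file name `ScanCleanOcrDemo.rdf` and the `RecognitionData` helper are assumptions made for this example, not names from the sample project.

```csharp
using System;
using System.IO;

static class RecognitionData
{
    // Hypothetical file name, chosen for this sketch only.
    public const string FileName = "ScanCleanOcrDemo.rdf";

    // Builds the full path inside the current user's temp directory.
    public static string GetPath()
    {
        return Path.Combine(Path.GetTempPath(), FileName);
    }

    // Deletes the file if present. Returns true if a stale file was removed.
    public static bool Reset(string path)
    {
        if (!File.Exists(path))
            return false;
        File.Delete(path);
        return true;
    }
}

class RecognitionDemo
{
    static void Main()
    {
        string path = RecognitionData.GetPath();

        // Simulate leftovers from an earlier session.
        File.WriteAllText(path, "text from a previous scan");

        bool removed = RecognitionData.Reset(path);
        Console.WriteLine("Stale data removed: " + removed);
        Console.WriteLine("File still exists: " + File.Exists(path));
    }
}
```

The resulting path is what you would assign to RecognitionDataFileName; skipping the `Reset` call keeps the appending behavior described above.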

Each document clean object is then initialized to values that are optimal for most scanned bitonal images.

ScanCleanOCR/image004.jpg

In the btnSelectScanner_Click event, simply call TwainSession::SelectSource with an empty string to display the Select Source dialog. This dialog is populated by the TWAIN Source Manager found in the twain_32.dll file.

ScanCleanOCR/image005.jpg

If you would like to select a scanning device without showing this dialog, simply pass the name of the device as the parameter to the SelectSource function.

Before beginning the scan, you’ll want to set up the scanner to produce images that are optimal for OCR. We set the X and Y resolution to 300 and the bits per pixel to one, which essentially tells the scanning device to scan in black and white.
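To get a feel for what these settings mean in practice, here is a small calculation of the pixel dimensions and uncompressed size of one page. It assumes a US Letter page (8.5 x 11 inches), a resolution expressed in dots per inch, and rows padded to whole bytes; the `PageSize` helper is hypothetical, and actual driver or file sizes will differ with padding and compression.

```csharp
using System;

static class PageSize
{
    // Converts a physical length to pixels at the given resolution.
    public static int Pixels(double inches, int dpi)
    {
        return (int)Math.Round(inches * dpi);
    }

    // Uncompressed size in bytes; each row is padded to a whole number of bytes.
    public static long RawBytes(int width, int height, int bitsPerPixel)
    {
        long bytesPerRow = ((long)width * bitsPerPixel + 7) / 8;
        return bytesPerRow * height;
    }
}

class PageSizeDemo
{
    static void Main()
    {
        int dpi = 300;
        int width = PageSize.Pixels(8.5, dpi);    // assumed Letter width
        int height = PageSize.Pixels(11.0, dpi);  // assumed Letter height

        Console.WriteLine("Page: " + width + " x " + height + " pixels");
        Console.WriteLine("1 bpp:  " + PageSize.RawBytes(width, height, 1) + " bytes");
        Console.WriteLine("24 bpp: " + PageSize.RawBytes(width, height, 24) + " bytes");
    }
}
```

Under these assumptions a bitonal page is about 1 MB uncompressed versus about 25 MB for 24-bit color, which is one practical reason to scan text documents at one bit per pixel.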

Next, _twSession.Acquire begins the scanning process. In this sample, we pass "None" as a parameter, which means that no other user interface will appear before the scanner begins capturing. You can also pass "Show" to show the scanner’s dialog, which gives the user the final say on the settings used.

Here is the code that does what was just described:

ScanCleanOCR/image006.jpg

At this point, the images are being scanned and the AcquirePage event is raised for each page scanned. This event is covered further down. Once the scan is complete, we call AutoOrientPage for each page in the OCR object. If a page was scanned upside down, this function rotates it right side up. Next, we delete the recognition data file if it exists and then recognize all of the pages.

The _twSession_AcquirePage handler is called for each page that is scanned. The scanned image is passed to you in TwainAcquirePageEventArgs::Image. In this handler, we clean up the image using each of the document clean-up classes created and set up in the form load event. Once the image is clean, we add it to the OCR object, where it is later converted to editable text and stored in the recognition data file.

ScanCleanOCR/image007.jpg

Lastly, I save the results from the OCR to disk. As you may recall, I set up the OCR object to output the results as a PDF file. The OCR engine takes the data in the recognition data file and converts it to a searchable PDF file.

ScanCleanOCR/image008.jpg

Results

Below are the results for an image before cleaning versus a cleaned image from the LEADTOOLS OCR:

ScanCleanOCR/image009.jpg

Conclusion

This is just one way you can implement a workflow of scanning, cleaning, OCRing and archiving. Should you have barcodes or patch codes on the scanned document, you can use the LEADTOOLS barcode functionality to detect them in the AcquirePage event and perform additional logic based on the results.

Required Software to Build this Sample

To run this sample on your machine, you can download the free 60-day LEADTOOLS evaluation.

Support

Have questions about this sample? Contact our expert support team for free evaluation support!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Last Updated 2 May 2008
Article Copyright 2008 by Travis Montgomery