Converting Scanned Document Images to Searchable PDFs with OCR

By Bill Bither

Demonstrates the use of Atalasoft's DotImage GlyphReader OCR to enable .NET applications to digitize paper documents as searchable PDFs that can be indexed by search engines.
C++, C#, VB, Windows, .NET, Visual Studio, WinForms, Architect, Dev
Posted: 1 Dec 2006
Updated: 14 Dec 2006
This article is in the Product Showcase section for our sponsors at The Code Project. These reviews are intended to provide you with information on products and services that we consider useful and of value to developers.


Introduction

From health records, tax forms, and insurance claims to old memos, magazines, and books, businesses are digitizing paper every day. With the advent of better search technology, having searchable text for all these documents is an obvious win. The common way to get it is to use OCR (Optical Character Recognition) to translate the images into a document format that indexers already understand. The drawback is that we often lose the layout, images, and color of the original. And since no OCR is perfect, we still need the original image to be able to fix mistakes. What we want is a document format that looks like the original images to a human reader but looks like plain text to an indexer. And when we copy from the image, we want text put on the clipboard. This is the promise of the searchable PDF.

In a searchable PDF, the original scanned image is retained so any human can read the document. The text extracted via OCR is placed behind the image so that search indexers can see it and Acrobat Reader lets us select it as text. The ubiquity of desktop and enterprise search, ever-increasing OCR accuracy, and mass adoption of PDF are a powerful combination that makes searchable PDFs an ideal format for storing digitized paper.

This article demonstrates how simple it is to develop an application that generates searchable PDFs from scanned documents. The resulting files can be indexed by Google, SharePoint, Microsoft desktop search, and other applications that index PDF documents.

To help build this application, Atalasoft publishes an OCR framework that simplifies working with industry-leading OCR engines and our own highly accurate engine, GlyphReader. A free 30-day evaluation of the Atalasoft DotImage Document Imaging SDK, including the OCR module, GlyphReader, and all other add-ons, can be downloaded from atalasoft.com.

Using our framework, these steps are handled for you:

  1. Decompress the image.
  2. Pre-process the image to make OCR more accurate (for example, by cleaning or deskewing it).
  3. OCR the image to extract the text.
  4. Re-encode the image in a choice of formats, including CCITT Group 4, JBIG2, JPEG, or JPEG 2000, to keep the file size as small as possible.
  5. Construct a PDF with the image and the extracted text, with each word accurately positioned behind the appropriate place in the image.

Atalasoft's OCR framework includes a flexible Translator interface for producing output from the recognition process. For example, TextTranslator is available out of the box and generates a text stream. The Searchable PDF Module includes the PdfTranslator, which generates either text-only PDFs or image-with-hidden-text PDFs. Both are "searchable", but the latter includes the original image and is what we are going to use.

This article uses the following 2-page color TIFF as the source document to OCR. Shown here are lower-resolution images of the original scanned TIFF (a recent white paper from Atalasoft that was printed and then scanned in color).

Extracting the Text into a Text File

Let's start with a method that simply extracts the text into a file. First, we create an ImageSource object to efficiently handle multi-page image files. Then we create the OCR engine, initialize it, translate the images to the desired MIME type, and shut down the engine.

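void MakeText(string inFile, string outFile)
{
    // The image source handles multi-page image files efficiently.
    using (FileSystemImageSource fis = 
           new FileSystemImageSource(new string[1] { inFile }, true))
    {
        GlyphReaderEngine ocr = new GlyphReaderEngine();
        ocr.Initialize();
        try
        {
            // Recognize the pages and write the result as plain text.
            ocr.Translate(fis, "text/plain", outFile);
        }
        finally
        {
            // Shut the engine down even if recognition throws.
            ocr.ShutDown();
        }
    }
}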

The resulting text file obviously does not look at all like the original document, but it does contain the text. It also isn't stored in the same file as the image. We can do better.

Creating the Searchable PDF

For the next code sample, we'll use a PdfTranslator to create a searchable PDF. To do this we need to:

  1. Create an instance of the PdfTranslator
  2. Set its OutputType to TextUnderImage (to create a searchable PDF)
  3. Add it to the OcrEngine's Translators collection (since it's an add-on, it doesn't come pre-registered)
  4. Use the engine to translate with the output MIME type set to "application/pdf"

Here's the code:

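void MakePdf(string inFile, string outFile)
{
    using (FileSystemImageSource fis = 
           new FileSystemImageSource(new string[1] { inFile }, true))
    {
        GlyphReaderEngine ocr = new GlyphReaderEngine();

        // Keep the scanned image visible and place the recognized text under it.
        PdfTranslator pdfTrans = new PdfTranslator();
        pdfTrans.OutputType = PdfTranslatorOutputType.TextUnderImage;

        // The PDF translator is an add-on, so register it with the engine.
        ocr.Translators.Add(pdfTrans);

        ocr.Initialize();
        try
        {
            ocr.Translate(fis, "application/pdf", outFile);
        }
        finally
        {
            // Shut the engine down even if translation throws.
            ocr.ShutDown();
        }
    }
}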

The result is a high-quality searchable PDF! When the PDF is opened in Acrobat Reader (see screenshot below), all text in the document can be selected as real text, even though the visible part of the PDF is the actual color raster image.

The OCR Engine and PDF Translator handle all the details required to deskew the image, store it, produce accurate OCR, compress the image, accurately place the recognized text under the right part of the image, and generate the PDF document.

Simply having this file on your filesystem is enough for Google Desktop Search or Windows Desktop Search to index the document's text, while the document itself still looks exactly like the original.
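
If you have a whole folder of scans to put in front of a desktop indexer, the same conversion can be applied file by file. The following is only a rough sketch and not part of the attached sample: BatchConverter, ConvertFolder, and the PageConverter delegate are hypothetical names introduced here, and the "*.tif" pattern is an assumption about how the scans are named. The delegate stands in for a method with the same shape as MakePdf, so the sketch itself uses nothing beyond System.IO.

```csharp
using System;
using System.IO;

// Hypothetical delegate with the same shape as MakePdf(inFile, outFile).
public delegate void PageConverter(string inFile, string outFile);

public static class BatchConverter
{
    // Converts every file matching "*.tif" in inDir and writes a PDF with
    // the same base name into outDir. Returns the number of files handed
    // to the converter.
    public static int ConvertFolder(string inDir, string outDir, PageConverter convert)
    {
        // Creates the output folder if needed; does nothing if it already exists.
        Directory.CreateDirectory(outDir);

        int converted = 0;
        foreach (string inFile in Directory.GetFiles(inDir, "*.tif"))
        {
            string pdfName = Path.GetFileNameWithoutExtension(inFile) + ".pdf";
            convert(inFile, Path.Combine(outDir, pdfName));
            converted++;
        }
        return converted;
    }
}
```

With MakePdf declared static, a call could look like BatchConverter.ConvertFolder(scanDir, pdfDir, MakePdf), where scanDir and pdfDir are placeholder paths. Each file gets its own engine start-up and shutdown this way, which is simple but may not be the fastest option for large batches.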

Product Requirements

To add searchable PDF generation to your applications, you will need the following products from Atalasoft:

  • DotImage Document Imaging SDK
  • OCR GlyphReader Engine Module (runtimes are additional)
  • OCR Searchable PDF Module (includes 20 runtimes)

Everything is included in the DotImage SDK, which you can download and evaluate free for 30 days. Be sure to request Evaluation Licenses for the required products. Attached to this article are the resulting PDF and the C# 2.0 source code for a simple console application whose first argument is the input image file and whose second argument is the resulting searchable PDF file.
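
For readers who want to see the shape of such a console wrapper without opening the download, here is a minimal sketch of the argument handling. It is not the attached source: SearchablePdfConsole, Run, the PdfMaker delegate, and the exit codes are all hypothetical choices made for this illustration. The delegate stands in for MakePdf so that the sketch does not depend on the SDK.

```csharp
using System;
using System.IO;

// Hypothetical delegate standing in for the article's MakePdf(inFile, outFile).
public delegate void PdfMaker(string inFile, string outFile);

public static class SearchablePdfConsole
{
    // Returns a process exit code: 0 on success, 1 for bad usage,
    // 2 for a missing input file, 3 if the conversion throws.
    public static int Run(string[] args, PdfMaker makePdf)
    {
        if (args == null || args.Length != 2)
        {
            Console.Error.WriteLine("Usage: <input image file> <output PDF file>");
            return 1;
        }

        string inFile = args[0];   // first argument: the scanned image
        string outFile = args[1];  // second argument: the searchable PDF

        if (!File.Exists(inFile))
        {
            Console.Error.WriteLine("Input file not found: " + inFile);
            return 2;
        }

        try
        {
            makePdf(inFile, outFile);
        }
        catch (Exception ex)
        {
            Console.Error.WriteLine("Conversion failed: " + ex.Message);
            return 3;
        }

        Console.WriteLine("Wrote " + outFile);
        return 0;
    }
}
```

In a real program, Main would simply forward to it, for example static int Main(string[] args) { return SearchablePdfConsole.Run(args, MakePdf); }, with MakePdf declared static. Checking the input path up front gives a clearer message than letting the image source fail on a missing file.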

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Bill Bither


The founder and CEO of Atalasoft, provider of Document and Photo Imaging Toolkits for Microsoft .NET Developers and Document Imaging and Viewing for SharePoint.
Occupation: Founder
Location: United States

Editor: Sean Ewington
Copyright 2006 by Bill Bither