Redact Sensitive Information within Scanned Documents using OCR and Pattern Recognition

Tom Setzer, Greg Freeland

1 Dec 2008

Learn how to find, redact, or replace text patterns you define after converting scanned images into searchable documents. Hide sensitive personal information like social security numbers and credit card numbers to protect privacy. Pegasus Imaging’s SDKs and this sample project show you how.

This article is in the Product Showcase section for our sponsors at CodeProject. These articles are intended to provide you with information on products and services that we consider useful and of value to developers.

Introduction

OCR combined with a powerful approximate regular expression engine can capture and search data from text on images that would otherwise be lost. Even in today’s digital age, many companies still rely on paper documents. In order to bridge the gap, Optical Character Recognition (OCR) captures the data on those paper documents and brings that data into the digital workspace. OCR technology is very useful in a number of different instances and you can create solutions that are even more powerful by adding regular expression search with approximate matching to the OCR technology. Searchable document creation, capturing bank check amounts, getting dollar amounts from an invoice, redaction of sensitive data, and indexing documents for subsequent search are just a few of the typical uses for OCR and regular expression search.

In this article, we review some of the existing problems where this technology can provide a solution. We also give an overview of the technology used to create solutions for these problems. Finally, we demonstrate the power of this combined technology by implementing one of the use cases. The associated sample code and a trial download of Pegasus Imaging’s full-page OCR SDK can be found here.

The following use cases are common examples of where OCR is used.

Searchable Document Creation

When documents exist as images, either as digital fax or as scanned documents, they are not in a format that is easy to search. OCR converts the image of text into actual searchable text. You can combine this text with the original image in PDF files or XPS files. This is useful if you need to preserve the original image for legal reasons, such as when a signature is present on the image, but you also need to search the text. Google Desktop and Windows Desktop Search will index these OCR-created PDF files and XPS files, allowing you to find desired documents through routine text searches. Full-page OCR solutions, such as OCR Xpress, are best suited for this use.

Forms Processing

Insurance forms, entrance exams, tax returns, invoices, and checks are documents that many businesses process on a daily basis. Some businesses receive thousands, possibly even millions, of these documents every day. Forms processing is an automated way to process these documents. Most forms processing solutions use OCR to gather machine-printed data, Intelligent Character Recognition (ICR) to gather handwritten data, and Optical Mark Recognition (OMR) to detect filled-in check boxes or bubbles. Structured forms processing typically uses zonal OCR and ICR, such as SmartZone v2, to collect data from form fields. Semi-structured and unstructured forms processing use either zonal or full-page OCR, depending on the implementation.

Redaction of Sensitive Data

Redacting sensitive data from images is another important use of OCR. With the continued concern about privacy, the requirement to redact social security numbers, birth dates and other sensitive data from images is becoming more common. Businesses and government organizations frequently publish customer submitted document images on web sites. The organizations that collect these documents must remove or redact the sensitive data that exist in these documents prior to publishing them. Recent privacy legislation makes this a requirement for many types of document images. In our example below, we will develop a simple search and redaction program to demonstrate the combined power of an accurate OCR engine with an approximate regular expression engine.

Limitations of Technology

Many OCR engines today approach or exceed 99% accuracy. In many use cases, this is sufficient for the problem being addressed. However, some applications require higher accuracy. There are a number of ways to improve the recognition accuracy of an OCR engine. Starting with clean images is one. Using the best available technology when converting color or grayscale images to black and white (binary) can also improve recognition accuracy. Starting with higher resolution images, 300 DPI or higher, helps the recognition process. Using multiple OCR engines and comparing their results can reduce recognition errors as well.

Unfortunately, not all of these options may be possible. The images may have originated outside of the control of the organization, resulting in poorly acquired images with tears, or speckles, or dark, low-resolution images. Some image cleanup can help, but may not be enough to get the OCR engine to 100% accuracy, despite claims to the contrary.

Overcoming the Limitations

When the solution involves searching for text patterns and 100% OCR accuracy cannot be guaranteed, another technology is required to help improve search results. Approximate regular expression search engines help improve search results.
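To see why small recognition errors matter for pattern search, consider a rough calculation. The sketch below assumes that the 99% figure is a per-character accuracy and that errors on different characters are independent; both are simplifications of ours, and the helper name `CleanReadProbability` is hypothetical. It is written as a C# top-level program.

```csharp
using System;

// Probability that every character of a field is read correctly, assuming
// independent errors at a fixed per-character accuracy (a simplification).
static double CleanReadProbability(double charAccuracy, int length)
{
    return Math.Pow(charAccuracy, length);
}

double nineDigits = CleanReadProbability(0.99, 9);     // e.g. the digits of a social security number
double sixteenDigits = CleanReadProbability(0.99, 16); // e.g. the digits of a credit card number

Console.WriteLine($"9-digit field read without error:  {nineDigits:P1}");
Console.WriteLine($"16-digit field read without error: {sixteenDigits:P1}");
```

Under these assumptions, roughly 91% of nine-digit fields and 85% of sixteen-digit fields come through with no errors, so an exact pattern search would miss a noticeable share of the sensitive values it is meant to find.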

What is an approximate regular expression search engine?

Regular expressions allow users to define patterns used to search for particular text in strings. If you have ever typed “dir *.c” at a command line, you have used a wildcard pattern, which is a simpler relative of regular expressions. The best way to understand regular expressions is through an example. You can use a pattern of “\d\d\d” to search for any three digits in a row. This regular expression applied against the string “abc 123” will match the “123”. The regular expression engine will return an index into the input string to indicate where it found the match. In this example, the index is 4 (zero-based).
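The three-digit example can be reproduced with the standard .NET regular expression class. This uses `System.Text.RegularExpressions.Regex`, not the OCR Xpress engine, and is written as a C# top-level program.

```csharp
using System;
using System.Text.RegularExpressions;

// Search the string "abc 123" for three consecutive digits.
Match match = Regex.Match("abc 123", @"\d\d\d");

Console.WriteLine(match.Success); // True
Console.WriteLine(match.Value);   // 123
Console.WriteLine(match.Index);   // 4 (zero-based position of the match)
```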

Approximate regular expressions extend the regular expression functionality by allowing errors in the match in the form of insertions, deletions and substitutions of characters in the string and still match the pattern.

If the OCR engine misreads the string and returns “abc 1Z3”, where the 2 is replaced with the letter Z, approximate regular expressions can still match the “\d\d\d” pattern when substitutions are allowed. Substituting a ‘2’, or any other digit, for the ‘Z’ allows the pattern to match “1Z3”. If the OCR engine inserts text into the string, for example “abc 1i23”, then with insertions allowed the pattern still matches “1i23”. And if the OCR engine drops a character, producing “abc 12”, then with deletions allowed the pattern matches “12”.
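The idea can be illustrated with a small edit-distance calculation. The following is a minimal sketch of ours, not the OCR Xpress algorithm: it handles only a fixed sequence of character classes (here three digits), compares it against a whole candidate string, and counts the fewest substitutions, insertions and deletions needed. The function name `CountErrors` is hypothetical.

```csharp
using System;

// Fewest single-character edits needed for `text` to match a fixed
// sequence of character classes (classic edit-distance dynamic programming).
static int CountErrors(Func<char, bool>[] pattern, string text)
{
    int[,] cost = new int[pattern.Length + 1, text.Length + 1];
    for (int i = 0; i <= pattern.Length; i++) cost[i, 0] = i; // text is missing i characters
    for (int j = 0; j <= text.Length; j++) cost[0, j] = j;    // text has j extra characters

    for (int i = 1; i <= pattern.Length; i++)
    {
        for (int j = 1; j <= text.Length; j++)
        {
            // Either the character fits the class (no cost) or it is a substitution.
            int substitution = cost[i - 1, j - 1] + (pattern[i - 1](text[j - 1]) ? 0 : 1);
            int deletion = cost[i - 1, j] + 1;   // a character the pattern expects is missing
            int insertion = cost[i, j - 1] + 1;  // the text contains an extra character
            cost[i, j] = Math.Min(substitution, Math.Min(deletion, insertion));
        }
    }
    return cost[pattern.Length, text.Length];
}

// The pattern \d\d\d expressed as three "is a digit" tests.
Func<char, bool>[] threeDigits = { char.IsDigit, char.IsDigit, char.IsDigit };

Console.WriteLine(CountErrors(threeDigits, "123"));  // 0: exact match
Console.WriteLine(CountErrors(threeDigits, "1Z3"));  // 1: one substitution
Console.WriteLine(CountErrors(threeDigits, "1i23")); // 1: one insertion
Console.WriteLine(CountErrors(threeDigits, "12"));   // 1: one deletion
```

A candidate is accepted when its error count does not exceed the allowed maximum. A real approximate regular expression engine also supports full pattern syntax, searches within longer text, and tracks each error type separately.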

Example Implementation: Searching Images for Information

For this example, we’ll first need to download the OCR Xpress v2 SDK, which also contains ImagXpress v9. Next, we will load an image that contains text as input using ImagXpress. Using the OCR Xpress engine, we will recognize the text on that image. We will then search the recognized text for a regular expression pattern using the approximate regular expression engine built into OCR Xpress v2. Next, we will highlight the text on the screen using NotateXpress v9 (also included in the OCR Xpress SDK). Finally, we will export to a searchable, image-over-text PDF, with the matched text redacted from the image and removed from the searchable text.

Recognition of Text

The first step in creating this application is to load the image and perform recognition on the loaded image. There are a few other maintenance steps shown in the code below. The toolkits make the whole process straightforward.

// Open the selected file with ImagXpress
//
ImageX documentImage = ImageX.FromStream(m_imagXpress,
                                         openFileDialog.OpenFile());

// OcrXpress will not assume ownership of the System.Drawing.Bitmap
// created by ImagXpress. The calling application will need to dispose
// of the Bitmap instead. The using statement will do this efficiently.
//
using (System.Drawing.Bitmap theImage = documentImage.ToBitmap(false))
{
    m_ocrXpressPage = m_ocrXpress.Document.AddPage(theImage);
}

// Process image with OcrXpress, so we have results to search
//
m_ocrXpress.Document.AutoRotate(m_ocrXpressPage);
m_ocrXpress.Document.Deskew(m_ocrXpressPage);
m_ocrXpress.Document.Recognize(m_ocrXpressPage);

// Give image to ImagXpress viewer to display
//
if (m_ocrXpressPage.BitonalBitmap == null)
    m_imageXView.Image = ImageX.FromBitmap(m_imagXpress,
                                           m_ocrXpressPage.Bitmap);
else
    m_imageXView.Image = ImageX.FromBitmap(m_imagXpress,
                                           m_ocrXpressPage.BitonalBitmap);

Set up the Pattern for Search

Once the image has been loaded and recognized, the user can enter a search string, or choose from a predefined search pattern, such as a phone number. The code below shows how to set up the pattern in OCR Xpress and then perform the matching. If the user of the application chooses the approximate matching, we set up the structure to allow a total of two errors, which can be a combination of zero or one substitution, zero to two deletions and up to two insertions.

using (PatternMatcher search = new PatternMatcher(m_ocrXpress))
{
    List<MatchResult> searchResults;
    search.Pattern = txtSearchPattern.Text;
    if (chkMatchApproximate.CheckState == CheckState.Checked)
    {
        search.MaximumInsertions = 2;
        search.MaximumDeletions = 2;
        search.MaximumSubstitutions = 1;
        search.MaximumErrors = 2;
    }
    search.CaseSensitive = chkCaseSensitive.Checked;
    searchResults = search.PerformMatching(m_ocrXpressPage);
}
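To make the error budget concrete, the sketch below enumerates which combinations of insertions, deletions and substitutions fit the limits described above. It reflects our reading of how the per-type limits and the overall limit combine; `WithinBudget` is a hypothetical helper and is not part of OCR Xpress.

```csharp
using System;

// True when a combination of errors respects every per-type limit
// and the overall limit used in the article's settings.
static bool WithinBudget(int insertions, int deletions, int substitutions)
{
    const int MaxInsertions = 2, MaxDeletions = 2, MaxSubstitutions = 1, MaxErrors = 2;
    return insertions <= MaxInsertions
        && deletions <= MaxDeletions
        && substitutions <= MaxSubstitutions
        && insertions + deletions + substitutions <= MaxErrors;
}

int allowed = 0;
for (int ins = 0; ins <= 2; ins++)
    for (int del = 0; del <= 2; del++)
        for (int sub = 0; sub <= 2; sub++)
            if (WithinBudget(ins, del, sub))
            {
                Console.WriteLine($"insertions={ins} deletions={del} substitutions={sub}");
                allowed++;
            }

Console.WriteLine(allowed); // 9 combinations fit the budget
```

For example, one substitution plus one insertion is accepted, while two substitutions or three errors of any kind are rejected.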

Note that in this example image, one occurrence of the word “OCR” was damaged with an ink blot (added with a paint program), causing the “O” to look like a “Q”. Standard regular expression engines would not match this occurrence when searching for “OCR Xpress”, but when we turn on approximate matching, the search does find it, as well as several other occurrences where the space between “OCR” and “Xpress” was eliminated.

Display the Results In the List Box

To display the results we built an array of match results in a System.Collections.ArrayList and tied it to the Windows.Forms.ListBox control for the display. We populated the ArrayList, listBoxItems, with a fragment of the text line that includes the match.

// Get the OCR results from the global page variable
//
PageResult page = m_ocrXpressPage.GetResult();

// Loop through and report each match
//
foreach (MatchResult result in searchResults)
{
    TextBlockResult block =
        page.GetTextBlockResult(result.TextBlockIndex);

    // If the entire search result is contained within a
    // single text line result, then the process of collecting
    // the text of the search result is a bit easier
    //
    if (result.TextLineStartIndex == result.TextLineEndIndex)
    {
        // Get the text line result where the match begins. In this
        // case, it is also the text line result where the match ends.
        //
        TextLineResult line =
            block.GetTextLineResult(result.TextLineStartIndex);

        // Get one word before the match to show context
        //
        int wordsBeforeIndex =
            GetStartIndexOfWordsBefore(line.Text,
                                       result.CharStartIndex, 1);
        string itemString =
            line.Text.Substring(wordsBeforeIndex,
                                result.CharStartIndex - wordsBeforeIndex);
        itemString += "[";
        itemString += line.Text.Substring(
            result.CharStartIndex, result.CharEndIndex - result.CharStartIndex);
        itemString += "]";

        // Get eight words after the match to show context
        //
        int wordsAfterIndex =
            GetEndIndexOfWordsAfter(line.Text,
                                    result.CharEndIndex, 8);
        itemString += line.Text.Substring(result.CharEndIndex,
                                          wordsAfterIndex - result.CharEndIndex);
    }
    else
    {
        //
        // Download the sample to see the code for this case
        //
    }
}
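The listing above calls GetStartIndexOfWordsBefore and GetEndIndexOfWordsAfter without showing them. Their real implementations are in the downloadable sample; the following is only one possible sketch, assuming words are separated by single space characters and that the end index of a match is exclusive, as the Substring arithmetic above suggests.

```csharp
using System;

// Hypothetical implementation: walk left from `index` over `wordCount` words
// and return the index where that context begins.
static int GetStartIndexOfWordsBefore(string text, int index, int wordCount)
{
    int i = index;
    for (int w = 0; w < wordCount; w++)
    {
        while (i > 0 && text[i - 1] == ' ') i--; // skip spaces to the left
        while (i > 0 && text[i - 1] != ' ') i--; // skip back over one word
    }
    return i;
}

// Hypothetical implementation: walk right from `index` over `wordCount` words
// and return the (exclusive) index where that context ends.
static int GetEndIndexOfWordsAfter(string text, int index, int wordCount)
{
    int i = index;
    for (int w = 0; w < wordCount; w++)
    {
        while (i < text.Length && text[i] == ' ') i++; // skip spaces to the right
        while (i < text.Length && text[i] != ' ') i++; // skip forward over one word
    }
    return i;
}

string line = "Trial of OCR Xpress is available now";
string match = "OCR Xpress";
int start = line.IndexOf(match, StringComparison.Ordinal);
int end = start + match.Length;

int before = GetStartIndexOfWordsBefore(line, start, 1);
int after = GetEndIndexOfWordsAfter(line, end, 2);

string item = line.Substring(before, start - before)
            + "[" + line.Substring(start, end - start) + "]"
            + line.Substring(end, after - end);

Console.WriteLine(item); // of [OCR Xpress] is available
```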

Highlight on the Screen

An event from the ListBox calls a function that uses NotateXpress to highlight the text on the image. OCR Xpress provides the coordinates of the characters in the image.

// If the entire search result is contained within a
// single text line result, then the process of highlighting
// the result is a bit easier
//
if (result.TextLineStartIndex == result.TextLineEndIndex)
{
    if (result.CharStartIndex == result.CharEndIndex)
    {
        // A match result that contains no characters can result
        // from the regular expression "^" or "$".
        //
        return;
    }
    rectAnnotation = new RectangleTool();
    rectAnnotation.BackStyle = BackStyle.Translucent;
    rectAnnotation.FillColor = fillColor;
    rectAnnotation.Moveable = rectAnnotation.Sizeable = false;

    // Get the text line result where the match begins. In this
    // case, it is also the text line result where the match ends.
    //
    textLine = textBlock.GetTextLineResult(result.TextLineStartIndex);

    // Ensures that we don't get a first character result that
    // is a space character. Since space characters do not
    // provide area information, this would throw off the
    // highlight bounding area
    //
    int i1, i2;
    for (i1 = result.CharStartIndex; i1 < result.CharEndIndex; i1++)
    {
        firstCharacterResult = textLine.GetCharacter(i1);
        if (firstCharacterResult.Text != " ")
            break;
    }

    // Ensures that we don't get a last character result that
    // is a space character. Since space characters do not
    // provide area information, this would throw off the
    // highlight bounding area
    //
    for (i2 = result.CharEndIndex - 1; i2 >= i1; i2--)
    {
        lastCharacterResult = textLine.GetCharacter(i2);
        if (lastCharacterResult.Text != " ")
            break;
    }

    // Construct the area of the highlight for the search result
    // based on the text line Y and Height values. Then use
    // the first and last characters of the search result to
    // create the X and Width values.
    //
    System.Drawing.Rectangle boundingRectangle =
        new System.Drawing.Rectangle();
    boundingRectangle.Y = textLine.Area.Y;
    boundingRectangle.Height = textLine.Area.Height;
    boundingRectangle.X = firstCharacterResult.Area.X;
    boundingRectangle.Width = (lastCharacterResult.Area.Width +
        lastCharacterResult.Area.X) - firstCharacterResult.Area.X;
    rectAnnotation.BoundingRectangle = boundingRectangle;
    layer.Elements.Add(rectAnnotation);
}
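The geometry at the end of that listing can be isolated into a small function. This is a sketch using only `System.Drawing.Rectangle`; the helper name `HighlightArea` and the sample coordinates are ours, chosen for illustration.

```csharp
using System;
using System.Drawing;

// Vertical extent comes from the text line; horizontal extent runs from the
// left edge of the first character to the right edge of the last character.
static Rectangle HighlightArea(Rectangle lineArea, Rectangle firstChar, Rectangle lastChar)
{
    return new Rectangle(
        firstChar.X,
        lineArea.Y,
        (lastChar.X + lastChar.Width) - firstChar.X,
        lineArea.Height);
}

Rectangle lineArea = new Rectangle(10, 100, 500, 20);
Rectangle firstChar = new Rectangle(50, 102, 8, 16);
Rectangle lastChar = new Rectangle(120, 101, 9, 17);

Rectangle highlight = HighlightArea(lineArea, firstChar, lastChar);
Console.WriteLine(highlight); // {X=50,Y=100,Width=79,Height=20}
```

Using the line's Y and Height rather than those of the individual characters gives every highlight on a line the same vertical extent, regardless of which characters it covers.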

After the image is highlighted, we adjust the scroll position so the highlighted text is on the screen.

// Adjust ImagXpress scroll position, so highlight
// will be visible.
//
int xVis = rectAnnotation.BoundingRectangle.X +
           rectAnnotation.BoundingRectangle.Width / 2;
int yVis = rectAnnotation.BoundingRectangle.Y +
           rectAnnotation.BoundingRectangle.Height / 2;
double xOffset = xVis * m_imageXView.ZoomFactor
                 - m_imageXView.Width / 2;
double yOffset = yVis * m_imageXView.ZoomFactor
                 - m_imageXView.Height / 2;
m_imageXView.ScrollPosition = new Point((int)xOffset,
                                        (int)yOffset);
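The scroll calculation centers the highlight in the viewer: take the center of the rectangle in image coordinates, scale it by the zoom factor, and subtract half the viewport size. A standalone sketch follows; the helper name `CenterScroll` is hypothetical, and clamping negative offsets to zero is our addition (the listing above passes the raw values to the viewer).

```csharp
using System;
using System.Drawing;

// Scroll position that places the center of `target` in the middle of the viewport.
static Point CenterScroll(Rectangle target, double zoom, int viewWidth, int viewHeight)
{
    int xCenter = target.X + target.Width / 2;
    int yCenter = target.Y + target.Height / 2;

    int x = (int)(xCenter * zoom - viewWidth / 2);
    int y = (int)(yCenter * zoom - viewHeight / 2);

    // Assumption: never scroll past the top-left corner of the image.
    return new Point(Math.Max(0, x), Math.Max(0, y));
}

Point scroll = CenterScroll(new Rectangle(600, 900, 80, 20), 0.5, 400, 300);
Console.WriteLine(scroll); // {X=120,Y=305}
```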

Redact and Export

Finally, if the user is happy with the search results, they can redact the text that was matched and export the redacted text to a searchable PDF. We use NotateXpress to brand the redactions into the image, and then replace that image in OCR Xpress just prior to export. The underlying text is also redacted by replacing the offending text with “X”s, while the rest of the text is still searchable.

// Accumulate all MatchResult objects in the result
// list box, so we can use them for redaction
//
List<MatchResult> results = new List<MatchResult>();
foreach (ResultListBoxItem item in listBoxItems)
{
    results.Add((MatchResult)item.ValueObject);
}

// Redact and replace search results
//
using (ImageXView imageXView = new ImageXView(m_imagXpress))
{
    // Make a copy of the displayed image, so the
    // redactions do not affect it
    //
    imageXView.Image = m_imageXView.Image.Copy();

    RedactSearchResultsOnImage(results, imageXView);

    // Set the redacted image back into the primary
    // image of the OcrXpress page, so it will be
    // used during export
    //
    using (System.Drawing.Bitmap redactedImage =
               imageXView.Image.ToBitmap(false))
    {
        m_ocrXpressPage.Bitmap = redactedImage;
    }

    RedactSearchResultsInCurrentPage(results);

    m_ocrXpress.Document.Export(ExportFormat.PdfImageWithText,
                                saveFileDialog.FileName);
}
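The text side of the redaction, replacing each matched character with an “X”, can be illustrated on a plain string. This sketch uses the standard .NET `Regex.Replace` with exact matching rather than the OCR Xpress approximate engine, and `RedactMatches` is a hypothetical helper, not the sample's RedactSearchResultsInCurrentPage.

```csharp
using System;
using System.Text.RegularExpressions;

// Replace every match of `pattern` with the same number of 'X' characters,
// leaving the surrounding text untouched and still searchable.
static string RedactMatches(string text, string pattern)
{
    return Regex.Replace(text, pattern, m => new string('X', m.Length));
}

string redacted = RedactMatches("SSN: 123-45-6789 on file", @"\d{3}-\d{2}-\d{4}");
Console.WriteLine(redacted); // SSN: XXXXXXXXXXX on file
```

Keeping the replacement the same length as the original match means the positions of the remaining characters do not shift.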

Conclusion

Developers can create powerful image search solutions when they combine accurate OCR with approximate regular expression engines. This technology can solve a number of common problems that are pervasive across various industries. The simple OCR and search example that we created here demonstrates how the OCR Xpress and ImagXpress SDKs make such business solutions easy to build.

You can find Pegasus Imaging product downloads and features at www.pegasusimaging.com. Please contact us at sales@jpg.com or support@jpg.com for more information.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Tom Setzer

Pegasus Imaging Corporation

United States

Tom leads several product development teams at Pegasus Imaging. He joined the company in 2006 with a valuable combination of technical expertise and solid project management skills. His analysis of customer requirements and business data assists in the development of product strategies resulting in timely and complete product deliveries. His past experience includes 12 years of developing technology solutions for industry leaders including Motorola, Capital One and General Electric. He brings additional experience working for small software development companies in the network security and electronic entertainment industries. Tom earned a Bachelor of Science in Computer Science from the University of North Carolina at Wilmington and a Master of Business Administration from the Fuqua School of Business at Duke University.

Written By

Greg Freeland

Pegasus Imaging Corporation

United States

Greg began his career at Pegasus Imaging as a software support engineer in 2002, and quickly moved into a software engineering role. After gaining experience working on various imaging components, Greg soon settled in with the team working on recognition technologies, and OCR Xpress in particular. He is responsible for building the component that uses the different OCR engines and merges their results. Greg holds a Bachelor of Science with a major in Management Information Systems and a minor in Computer Science from Florida State University and joined Pegasus after an internship with Eckerd Corporation.
