
Converting Images to Text using Office 2007 OCR, OpenXML and Speech Recognition

19 Feb 2009 | CPOL | 4 min read
This article shows how to integrate the Office 2007 OCR engine into custom applications and how to combine it with OpenXML and Speech Recognition.

Introduction

When developing an application, we sometimes need to convert a scanned document (an image) to text, for example a Word 2007 document. Some scanners ship with applications that perform this conversion automatically, but the output is usually a *.pdf, *.odt or similar file. If you want to convert directly to *.docx (OpenXML) documents, you have to use third-party applications or develop the conversion yourself.

OpenXML became an ISO standard (ISO/IEC 29500), and its adoption is growing, driven by its performance, scalability and security. It is the default format of Microsoft Office 2007 documents (*.pptx, *.docx, *.xlsx). Files can be up to 75 percent smaller than the equivalent binary documents, and the format is based on two major technologies: ZIP and XML.

The speech API (the System.Speech assembly) has been part of the .NET Framework since version 3.0, so it is also available in .NET Framework 3.5, and the underlying speech engine is a default feature of Windows Vista. Developers can use this API to provide a better user experience, easier access to specific information and so on.

Scenario

To make developers' work easier and avoid integration with third-party applications, Microsoft ships an OCR (Optical Character Recognition) API with Office 2007 called MODI (Microsoft Office Document Imaging). Note that the API used in this sample is specific to Office 2007 (Office 2003 has its own OCR API).

In this article, we'll create a Windows application that uses the Office 2007 OCR API to generate OpenXML documents. In addition, we'll use the Speech Recognition API to improve the application's user experience.

Before we start, make sure you have the following installed:

  • Visual Studio 2008
  • .NET Framework 3.5
  • OpenXML SDK 1.0
  • Office 2007

You also need the Microsoft Office Document Imaging 12.0 Type Library. The Office 2007 setup doesn't install this component by default, so you may have to add it afterwards. To do this:

  • Run the Office 2007 installation setup
  • Click the Add or Remove Features button
  • Make sure that the Microsoft Office Document Imaging component is set to be installed

Using the MODI

To use the Office 2007 OCR API, add a reference to the Microsoft Office Document Imaging 12.0 Type Library. To do this:

  • In Solution Explorer, right-click the project and select Add Reference
  • On the COM tab, select Microsoft Office Document Imaging 12.0 Type Library

Declare a MODI document field in the form class:

C#
/// <summary>
/// Document Imaging Library
/// </summary>
MODI.Document md; 

In the Form class constructor, instantiate the MODI object:

C#
public Form1()
{
    InitializeComponent();
    speaker.Rate = -2;
    speaker.Volume = 100;
    ListFiles = new List<string>();
    md = new MODI.Document();
}

After that, you just have to implement the conversion method. Let's see how to do this:

C#
private void OCRImplementation()
{
    Cursor = Cursors.WaitCursor;
    try
    {
        foreach (string name in checkedListBox1.CheckedItems)
        {
            try
            {
                md.Create(name);
                // English text; detect page orientation and straighten small skew
                md.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);
                // Only the first page (image) of the document is read here
                MODI.Image image = (MODI.Image)md.Images[0];
                MODI.Layout layout = image.Layout;
                StringBuilder text = new StringBuilder();
                for (int i = 0; i < layout.Words.Count; i++)
                {
                    MODI.Word word = (MODI.Word)layout.Words[i];
                    if (text.Length > 0)
                    {
                        text.Append(' ');
                    }
                    text.Append(word.Text);
                }
                md.Close(false);
                CreateDocument(text.ToString());
            }
            catch (Exception ex)
            {
                MessageBox.Show(ex.Message);
            }
        }
    }
    finally
    {
        // Restore the cursor once, after all checked files were processed
        Cursor = Cursors.Default;
    }
}

The OCRImplementation method converts image files (*.tif, *.jpg, *.gif, *.bmp; in this case we're using a TIFF file). The Create method of the md object receives the path of the file to be converted. The OCR method receives three parameters: the first is the language of the document, the second specifies whether the OCR engine attempts to determine the orientation of the page, and the third specifies whether the OCR engine attempts to fix small angles of misalignment from the vertical.

To retrieve the text, use the Image object and its Layout property. The Layout object exposes the recognized text: its Words collection has a Count property that allows iteration over the recognized words. Each word is retrieved through the indexer and appended to the result, with a blank space between consecutive words.

The method Close of the md object takes a boolean argument indicating whether to save changes to the image file.
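
The word-joining step can also be isolated from MODI so it is easy to test on its own. The following is only a sketch: the OcrText class and JoinWords method are hypothetical names that are not part of the article's sample, and the input is assumed to be the Text values already read from the Words collection.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OcrText
{
    // Hypothetical helper: joins recognized words with single spaces,
    // skipping null or empty entries.
    public static string JoinWords(IEnumerable<string> words)
    {
        if (words == null) throw new ArgumentNullException("words");
        // ToArray() keeps this compatible with .NET 3.5, where
        // string.Join only accepts a string array.
        return string.Join(" ", words.Where(w => !string.IsNullOrEmpty(w)).ToArray());
    }
}
```

With such a helper, the loop would only collect word.Text values into a List<string> and pass the list to JoinWords.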

Using OpenXML SDK

In Solution Explorer, add a reference to the DocumentFormat.OpenXml library. This library lets us save the converted text as a Word document. A string constant holds the markup of the main document part (in this case WordprocessingML), with a #REPLACE# placeholder where the text goes.

C#
private const string PART_TEMPLATE =
    "<?xml version='1.0' encoding='UTF-8' standalone='yes'?>" +
    "<w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>" +
    "<w:body><w:p><w:r><w:t>#REPLACE#</w:t></w:r></w:p></w:body></w:document>";

The method CreateDocument is responsible for inserting the text inside the document structure.

C#
private void CreateDocument(string text)
{
    // 'using' closes the package even if writing the part fails
    using (WordprocessingDocument wordDoc =
        WordprocessingDocument.Create(txt_SavePath.Text,
            WordprocessingDocumentType.Document))
    {
        MainDocumentPart docPart = wordDoc.AddMainDocumentPart();

        // Escape XML special characters (&, <, >, quotes) so that
        // OCR output cannot break the document markup
        string partML = PART_TEMPLATE.Replace("#REPLACE#",
            System.Security.SecurityElement.Escape(text));

        using (Stream partStream = docPart.GetStream())
        {
            byte[] buffer = new UTF8Encoding().GetBytes(partML);
            partStream.Write(buffer, 0, buffer.Length);
        }
    }
}
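
As an alternative to the string template, the same single-paragraph markup can be built with LINQ to XML (available in .NET Framework 3.5), which escapes characters such as & and < automatically. This is a minimal sketch rather than the article's method: the WordMarkup class and BuildDocumentXml method are hypothetical names introduced for illustration.

```csharp
using System;
using System.Xml.Linq;

public static class WordMarkup
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: builds the main document part with a single
    // paragraph containing the given text.
    public static string BuildDocumentXml(string text)
    {
        XDocument doc = new XDocument(
            new XDeclaration("1.0", "UTF-8", "yes"),
            new XElement(W + "document",
                // Declare the "w" prefix so elements serialize as <w:...>
                new XAttribute(XNamespace.Xmlns + "w", W.NamespaceName),
                new XElement(W + "body",
                    new XElement(W + "p",
                        new XElement(W + "r",
                            new XElement(W + "t", text ?? string.Empty))))));

        // XDocument.ToString() omits the declaration, so prepend it explicitly.
        return doc.Declaration + doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

The returned string could be written to the part stream in place of the result of PART_TEMPLATE.Replace.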

Speech Recognition

C#
/// <summary>
/// Speech synthesizer
/// </summary>
SpeechSynthesizer speaker = new SpeechSynthesizer();

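Add a reference to System.Speech on the .NET tab of the Add Reference dialog; the SpeechSynthesizer class lives in the System.Speech.Synthesis namespace. After that, you just have to adjust the Volume and Rate properties (as done in the form constructor) and call the Speak method to speak a string.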

C#
speaker.Speak("Searching");
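
SpeechSynthesizer accepts Rate values from -10 to 10 and Volume values from 0 to 100. If these settings come from user input, they can be clamped first. The sketch below assumes a hypothetical SpeechSettings helper that is not part of the article's sample.

```csharp
using System;

public static class SpeechSettings
{
    // Valid range of SpeechSynthesizer.Rate is -10..10
    public static int ClampRate(int rate)
    {
        return Math.Max(-10, Math.Min(10, rate));
    }

    // Valid range of SpeechSynthesizer.Volume is 0..100
    public static int ClampVolume(int volume)
    {
        return Math.Max(0, Math.Min(100, volume));
    }
}
```

For example, speaker.Rate = SpeechSettings.ClampRate(userRate); keeps an out-of-range value from being assigned. Note also that Speak blocks the calling thread until the text has been spoken; SpeakAsync is the non-blocking counterpart.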

Conclusion

Combining these powerful APIs is an interesting idea: the OCR code is very short compared with what third-party APIs typically require. It is a tool that can be explored in many ways, and combining it with OpenXML and Speech Recognition can improve your applications.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer
Brazil
Developer at Microsoft Innovation Center | Brazil, MCTS, MOS, OpenXML Enthusiast.
