Sometimes, while developing an application, we have a scanned document (an image) that we want to convert to text (a Word 2007 document). Some scanners ship with applications that perform this kind of conversion automatically, but in most cases the generated document is a *.pdf, an *.odt, or a similar format. If you want to convert directly to *.docx (OpenXML) documents, you have to use third-party applications or develop the conversion from scratch.
OpenXML became an ISO standard (ISO/IEC 29500), and its adoption keeps growing, driven by its performance, scalability, and security. It is the default format of Microsoft Office 2007 documents (*.docx, *.xlsx, *.pptx). Files can be up to 75 percent smaller than comparable binary documents, and the format is based on two major technologies: ZIP and XML.
The Speech API (System.Speech) has been available since .NET Framework 3.0 and is included with .NET Framework 3.5; the speech engine itself is a default feature of Windows Vista. Developers can use this API to provide a better user experience, easier access to specific information, and so on.
To make developers' work easier and to avoid integration with third-party applications, Microsoft ships an OCR (Optical Character Recognition) API with Office 2007 called MODI (Microsoft Office Document Imaging). Note that the API used in this sample is specific to Office 2007 (Office 2003 has its own OCR API).
In this article, we'll create a Windows application that uses the Office 2007 OCR API to generate OpenXML documents. In addition, we'll use the Speech API to improve the application's user experience.
Before we start, make sure the following are installed:
- Visual Studio 2008
- .NET Framework 3.5
- OpenXML SDK 1.0
- Office 2007
You also need the Microsoft Office Document Imaging 12.0 Type Library. The Office 2007 setup doesn't install this component by default, so you have to add it afterwards. To do this:
- Run the Office 2007 installation setup
- Click on the button Add or Remove Features
- Make sure that the component is installed
Using the MODI
To use the Office 2007 OCR API, you have to add a reference to Microsoft Office Document Imaging 12.0 Type Library. To do this:
- In Solution Explorer, select Add Reference
- On the COM tab, select Microsoft Office Document Imaging 12.0 Type Library
Declare a MODI.Document object as a field of the Form class and instantiate it in the constructor:
speaker.Rate = -2;
speaker.Volume = 100;
ListFiles = new List<string>();
md = new MODI.Document();
After that, you just have to implement the conversion method. Let's see how to do this:
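private void OCRImplementation()
{
    Cursor = Cursors.WaitCursor;
    try
    {
        foreach (string Name in checkedListBox1.CheckedItems)
        {
            md.Create(Name); // load the image file to be converted
            md.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);
            string strText = String.Empty;
            // The recognized words are in the layout of the first page
            MODI.Image image = (MODI.Image)md.Images[0];
            MODI.Layout layout = image.Layout;
            for (int i = 0; i < layout.Words.Count; i++)
            {
                MODI.Word word = (MODI.Word)layout.Words[i];
                if (strText.Length > 0)
                    strText += " ";
                strText += word.Text;
            }
            md.Close(false); // false: do not save changes to the image file
            CreateDocument(strText); // write the text to a Word document
        }
    }
    catch (Exception ex) { MessageBox.Show(ex.Message); } // report the failure
    Cursor = Cursors.Default;
}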
The OCRImplementation method converts image files (*.tif, *.jpg, *.gif, *.bmp; in this case we're using a TIFF file). The Create method of the md object receives the path of the file to be converted. The OCR method receives three parameters: the first is the language of the document, the second specifies whether the OCR engine attempts to determine the orientation of the page, and the third specifies whether the OCR engine attempts to fix small angles of misalignment from the vertical.
To retrieve the text, you go through the Image and Layout objects. The Layout object gives access to the recognized text: its Words property has a Count property that allows iteration through the list of words. The words are retrieved one at a time using the indexer, so we add a blank space between them as we build the text.
The Close method of the md object takes a boolean argument indicating whether to save changes to the image file.
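The word-joining loop can also be written with string.Join, which avoids repeated string concatenation. This is a minimal sketch rather than the article's code: the WordJoiner class and JoinWords method are hypothetical names, and the method takes plain strings (standing in for the Text values read from layout.Words) so that it compiles without MODI installed.

```csharp
using System;
using System.Collections.Generic;

public static class WordJoiner
{
    // Hypothetical helper: joins recognized words with single spaces,
    // skipping null or empty entries so no double spaces appear.
    public static string JoinWords(IEnumerable<string> words)
    {
        List<string> kept = new List<string>();
        foreach (string w in words)
        {
            if (!string.IsNullOrEmpty(w))
                kept.Add(w);
        }
        return string.Join(" ", kept.ToArray());
    }
}
```

Inside the OCR loop, you would collect each word.Text into a list and call WordJoiner.JoinWords on it once the iteration is finished.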
Using OpenXML SDK
In Solution Explorer, add a reference to the DocumentFormat.OpenXml library. This library allows the converted text to become a Word document. A string constant holds the markup that defines the structure of the document:
private const string PART_TEMPLATE =
    "<?xml version='1.0' encoding='UTF-8' standalone='yes'?>" +
    "<w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>" +
    "<w:body><w:p><w:r><w:t>#REPLACE#</w:t></w:r></w:p></w:body>" +
    "</w:document>";
The CreateDocument method is responsible for inserting the text into the document structure.
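private void CreateDocument(string Text)
{
    // docPath is a placeholder for the output .docx path, which is not shown here
    using (WordprocessingDocument wordDoc =
        WordprocessingDocument.Create(docPath, WordprocessingDocumentType.Document))
    {
        MainDocumentPart docPart = wordDoc.AddMainDocumentPart();
        // Fill the template with the recognized text
        string partML = PART_TEMPLATE.Replace("#REPLACE#", Text);
        using (Stream partStream = docPart.GetStream())
        {
            UTF8Encoding encoder = new UTF8Encoding();
            byte[] buffer = encoder.GetBytes(partML);
            partStream.Write(buffer, 0, buffer.Length);
        }
    }
}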
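One thing to watch for in CreateDocument: the OCR text is pasted into the XML template as-is, so a recognized character such as & or < would produce an invalid document part. A possible safeguard, sketched here under the assumption that plain entity escaping is enough for your input, is to escape the text before the Replace call. The MarkupBuilder class and its methods are hypothetical names, and the body of the template is a minimal assumed one; SecurityElement.Escape is a standard .NET method that escapes <, >, &, and both quote characters.

```csharp
using System;
using System.Security;
using System.Text;

public static class MarkupBuilder
{
    // Assumed minimal template: #REPLACE# marks where the text goes
    private const string PART_TEMPLATE =
        "<?xml version='1.0' encoding='UTF-8' standalone='yes'?>" +
        "<w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>" +
        "<w:body><w:p><w:r><w:t>#REPLACE#</w:t></w:r></w:p></w:body>" +
        "</w:document>";

    // Hypothetical helper: escapes the text, then fills the template
    public static string BuildPart(string text)
    {
        string safe = SecurityElement.Escape(text ?? String.Empty);
        return PART_TEMPLATE.Replace("#REPLACE#", safe);
    }

    // Encodes the markup as UTF-8 bytes, ready to write to the part stream
    public static byte[] ToBytes(string partML)
    {
        return new UTF8Encoding().GetBytes(partML);
    }
}
```

With this in place, CreateDocument could write MarkupBuilder.ToBytes(MarkupBuilder.BuildPart(Text)) to the part stream instead of building partML directly.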
SpeechSynthesizer speaker = new SpeechSynthesizer();
Add a reference to System.Speech on the .NET tab. After that, you just have to adjust the Rate and Volume properties and call the Speak method with the text to be spoken.
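SpeechSynthesizer accepts only a limited range for these properties: Rate from -10 to 10 and Volume from 0 to 100. If the values come from user input or a settings file, they can be clamped first. The sketch below is an illustration only: SpeechSettings and its methods are hypothetical names, and the speaker calls appear as comments so that the snippet compiles without a reference to System.Speech.

```csharp
using System;

public static class SpeechSettings
{
    // Keeps the speaking rate inside the range SpeechSynthesizer.Rate accepts
    public static int ClampRate(int rate)
    {
        return Math.Max(-10, Math.Min(10, rate));
    }

    // Keeps the volume inside the range SpeechSynthesizer.Volume accepts
    public static int ClampVolume(int volume)
    {
        return Math.Max(0, Math.Min(100, volume));
    }
}

// Possible usage with the speaker object from the article:
// speaker.Rate = SpeechSettings.ClampRate(-2);
// speaker.Volume = SpeechSettings.ClampVolume(100);
// speaker.Speak("Conversion finished");
```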
Combining these APIs is an interesting idea: the OCR code is very short compared with what third-party APIs require. MODI can be explored in many ways, and combined with the benefits of OpenXML and the Speech API it can improve your applications.