How To: Use Office 2007 OCR Using C#

24 Aug 2009
Reading text from any image using Microsoft Office 2007 OCR.

Introduction

The sample application checks a specified directory for images (JPG files in the sample code) and reads the text from each one. It automatically saves the text from each image to a text file with the same name as the image, and it catches the exceptions raised by problem images so that the remaining files are still processed.

If you have Office 2007 installed, the OCR component that ships with it is available for you to use, so the only dependency added to your code is Office itself. Requiring Office (2007 or 2003) on every machine that runs your code may or may not suit your situation. If your client can guarantee that those machines have Office (2007 or 2003) installed, this solution is a good fit.

What is OCR?

OCR (Optical Character Recognition) is the recognition of printed or written text characters by a computer. This involves photoscanning of the text character-by-character, analysis of the scanned-in image, and then translation of the character image into character codes, such as ASCII, commonly used in data processing.

Put another way, OCR translates images of text, such as scanned documents, into actual text characters. Also known as text recognition, OCR makes it possible to edit and reuse text that is normally locked inside scanned images. OCR uses a form of artificial intelligence known as pattern recognition to identify the individual text characters on a page, including punctuation marks, spaces, and ends of lines.

What is document imaging?

Document imaging is the process of scanning paper documents and converting them to digital images that are then stored on CD, DVD, or other storage media. With Microsoft Office Document Imaging, you can scan paper documents and save the resulting digital images to your computer’s hard disk, a network server, a CD, or a DVD in either of these formats:

  • Tagged Image File Format (TIFF): A high-resolution, tag-based graphics format. TIFF is used for the universal interchange of digital graphics.
  • Microsoft Document Imaging Format (MDI): A high-resolution, tag-based graphics format, based on the Tagged Image File Format (TIFF) used for digital graphics.

Microsoft Office Document Imaging also lets you perform Optical Character Recognition (OCR), either as part of scanning a document or while you work with a TIFF or MDI file. After performing OCR, you can copy the recognized text from a scanned image or a fax into a Microsoft Word document or other Office program files.

Weakness

To run an application that uses OCR, the Office OCR component must be installed on the machine. Without it, your application will not work.

Strength

The component comes with Office, so you can use it in your code at no additional cost. It is easy to use, and Microsoft provides many code samples that show how to use it.

Namespaces

using System;
using System.Collections;
using System.IO;
using System.Drawing.Imaging;

Using the code

The name of the COM object that you need to add as a reference is Microsoft Office Document Imaging 12.0 Type Library. By default, Office 2007 doesn't install it. You'll need to make sure that it's added by using the Office 2007 installation program. Just run the installer, click on the Continue button with the "Add or Remove Features" selection made, and ensure that the imaging component is installed.
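Because the imaging component is not installed by default, an application may want to check for it before attempting any OCR. The following is a minimal sketch, not part of the original sample. The class and method names are hypothetical, and it assumes the component registers the COM ProgID "MODI.Document", which you should confirm on your own machines.

```csharp
using System;

public static class ModiCheck
{
    // Returns true when a COM class is registered under the given ProgID.
    public static bool IsComClassRegistered(string progId)
    {
        // Type.GetTypeFromProgID returns null when the ProgID
        // is not present in the registry.
        return Type.GetTypeFromProgID(progId) != null;
    }

    // Assumption: the Office imaging component registers "MODI.Document".
    public static bool IsModiInstalled()
    {
        return IsComClassRegistered("MODI.Document");
    }
}
```

An application could call ModiCheck.IsModiInstalled() at startup and show a message asking the user to add the imaging component through the Office installer when it returns false.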

If you do not specify the LangID argument when calling the OCR method, the OCR engine defaults to the user's regional settings; it does not retain the previously used setting. In a mixed-language environment, it is good practice to specify the LangID argument explicitly in every call to the OCR method.
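As an illustration of passing the language explicitly, here is a small sketch. It assumes the Microsoft Office Document Imaging 12.0 Type Library reference used throughout this article has been added to the project. The class name ExplicitLanguageOcr and the method ReadText are hypothetical names chosen for this example.

```csharp
using System;

public static class ExplicitLanguageOcr
{
    // Hypothetical helper: runs OCR on one image with an explicit language
    // and returns the recognized text of the first page.
    public static string ReadText(string imagePath, MODI.MiLANGUAGES language)
    {
        MODI.Document md = new MODI.Document();
        md.Create(imagePath);

        // Pass the language on every call instead of relying on
        // the regional-settings default.
        md.OCR(language, true, true);

        MODI.Image image = (MODI.Image)md.Images[0];
        string text = image.Layout.Text;

        // Close the document without saving changes to the image file.
        md.Close(false);
        return text;
    }
}
```

A caller would then state the language at each call site, for example ExplicitLanguageOcr.ReadText(path, MODI.MiLANGUAGES.miLANG_ENGLISH), so the result does not depend on the regional settings of the machine.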

So, create a Windows Application using C#. In Visual Studio Solution Explorer, right-click References >> Add Reference >> select the COM tab >> then select Microsoft Office Document Imaging 12.0 Type Library.

/// <summary>
/// Checks a directory for JPG images, reads the text from each image
/// using the Office OCR engine, and saves it to a text file with the
/// same name as the image. Problems with individual images are caught
/// so that the remaining files are still processed.
/// </summary>
/// <param name="directoryPath">Path of the directory to check for images</param>
public void CheckFileType(string directoryPath)
{
    foreach (string filePath in Directory.GetFiles(directoryPath))
    {
        // Check for the JPG file format, ignoring case (.jpg, .JPG, .Jpg)
        string fileExtension = Path.GetExtension(filePath);
        if (!string.Equals(fileExtension, ".jpg", StringComparison.OrdinalIgnoreCase))
        {
            continue;
        }

        try
        {
            // OCR operations ...
            MODI.Document md = new MODI.Document();
            md.Create(filePath);
            md.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);
            MODI.Image image = (MODI.Image)md.Images[0];

            // Create a text file with the same name as the image file.
            // Path.ChangeExtension replaces only the final extension.
            // FileMode.CreateNew throws if the text file already exists.
            string textPath = Path.ChangeExtension(filePath, ".txt");
            using (FileStream createFile = new FileStream(textPath, FileMode.CreateNew))
            using (StreamWriter writeFile = new StreamWriter(createFile))
            {
                // Save the image text in the text file
                writeFile.Write(image.Layout.Text);
            }

            // Close the document without saving changes to the image
            md.Close(false);
        }
        catch (Exception exc)
        {
            // Uncomment the code below to see the expected errors
            //MessageBox.Show(exc.Message,
            //"OCR Exception",
            //MessageBoxButtons.OK, MessageBoxIcon.Information);
        }
    }
}
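The file-selection and naming steps do not depend on Office, so they can be pulled out and checked on their own. The sketch below is one way to do that. OcrFileHelper and its members are hypothetical names for this example, not part of the original sample.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class OcrFileHelper
{
    // Returns true when the path has a .jpg extension, ignoring case.
    public static bool IsJpeg(string path)
    {
        return string.Equals(Path.GetExtension(path), ".jpg",
                             StringComparison.OrdinalIgnoreCase);
    }

    // Maps an image path to a text file path with the same name.
    public static string GetTextFilePath(string imagePath)
    {
        return Path.ChangeExtension(imagePath, ".txt");
    }

    // Lists the JPG files found directly inside a directory.
    public static List<string> FindJpegs(string directoryPath)
    {
        List<string> result = new List<string>();
        foreach (string path in Directory.GetFiles(directoryPath))
        {
            if (IsJpeg(path))
            {
                result.Add(path);
            }
        }
        return result;
    }
}
```

Path.ChangeExtension touches only the final extension, so a name such as scan.jpg.backup.jpg keeps its earlier ".jpg", whereas a plain string replacement of ".jpg" would remove every occurrence in the path.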

Points of Interest

I have built a larger sample application for Office OCR, and I'll release it soon.

Remark

Many people use OCR in Internet spiders to extract data from images.

My Blog

http://waleedelkot.blogspot.com/

History

  • Released on 24-08-2009.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Waleed Elkot
Software Developer (Senior), Equinox Web
Egypt
I have 5 years of experience working as a Software Developer. I have a wide range of programming experience and am skilled in the use of Visual Studio.NET 2008, Windows Applications, Web Applications, Web Services, Windows Services, WPF, HTML, JavaScript, Ajax, ASP.NET, DevExpress Controls, and Office Application Programmability in Visual Studio.NET 2008. I create web and Windows applications using C#.NET and am experienced in using all Microsoft Office applications.

Last Updated 24 Aug 2009
Article Copyright 2009 by Waleed Elkot