How To: Use Office 2007 OCR Using C#

24 Aug 2009 | CPOL
Reading text from any image using Microsoft Office 2007 OCR

Introduction

The sample application checks a specified directory for JPG images and reads the text from each image it finds. It automatically saves the text from each image in a text file with the same name as the image. An image that raises an exception during OCR is skipped, so one bad image does not stop the run.

If you have Office 2007 installed, its OCR component is available to your code once the imaging feature has been installed. The only dependency this adds to your code is Office itself. Requiring Office (2007 or 2003) on every machine may or may not suit your situation, but if your client can guarantee that the machines your code runs on have Office (2007 or 2003) installed, this solution is a good fit.

What is OCR?

OCR (Optical Character Recognition) is the recognition of printed or written text characters by a computer. This involves photoscanning of the text character-by-character, analysis of the scanned-in image, and then translation of the character image into character codes, such as ASCII, commonly used in data processing.

Put another way, optical character recognition (OCR) translates images of text, such as scanned documents, into actual text characters. Also known as text recognition, OCR makes it possible to edit and reuse the text that is normally locked inside scanned images. OCR uses a form of artificial intelligence known as pattern recognition to identify individual text characters on a page, including punctuation marks, spaces, and ends of lines.

What is Document Imaging?

Document imaging is the process of scanning paper documents and converting them to digital images that are then stored on a hard disk, CD, DVD, or other storage media. With Microsoft Office Document Imaging, you can scan paper documents, convert them to digital images, and save them to your computer's hard disk, a network server, a CD, or a DVD in one of two formats:

  • Tagged Image File Format (TIFF): A high-resolution, tag-based graphics format. TIFF is used for the universal interchange of digital graphics.
  • Microsoft Document Imaging Format (MDI): A high-resolution, tag-based graphics format, based on the Tagged Image File Format (TIFF) used for digital graphics.

Microsoft Office Document Imaging also lets you perform Optical Character Recognition (OCR), either as part of scanning a document or while you work with a TIFF or MDI file. After OCR, you can copy the recognized text from a scanned image or a fax into a Microsoft Word document or another Office program file.
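A TIFF or MDI file can hold more than one page. The following is a minimal sketch, not a definitive implementation, of running OCR over such a file and collecting the text of every page. It assumes the project references the Microsoft Office Document Imaging type library (described under Using the Code), which exposes the MODI namespace. The class name MultiPageOcr and the method name ReadAllPages are hypothetical.

```csharp
using System.Text;

public static class MultiPageOcr
{
    // Hypothetical helper: returns the recognized text of all pages in one file.
    // Assumes the MODI type library is referenced and installed.
    public static string ReadAllPages(string imagePath)
    {
        MODI.Document md = new MODI.Document();
        md.Create(imagePath);
        md.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);

        StringBuilder text = new StringBuilder();

        // Each page of the file is one entry in the Images collection.
        for (int i = 0; i < md.Images.Count; i++)
        {
            MODI.Image image = (MODI.Image)md.Images[i];
            text.AppendLine(image.Layout.Text);
        }

        // Close without saving changes to the image file.
        md.Close(false);
        return text.ToString();
    }
}
```

A single-page image such as a JPG simply yields one entry in the collection.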

Weakness

An application that uses this OCR engine runs only on machines where the Office OCR component is installed. Without that component, the application will not work.
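One way to soften this weakness is to check for the component before attempting any OCR, so the application can report a clear message instead of failing later. The sketch below is only an illustration under stated assumptions: it assumes the component registers the COM ProgID "MODI.Document" (the name commonly used with CreateObject), and the class and method names are hypothetical.

```csharp
using System;

public static class OcrAvailability
{
    // Assumption: ProgID under which the MODI document class is registered.
    public const string ModiProgId = "MODI.Document";

    // Returns true when a COM class is registered under the given ProgID.
    public static bool IsProgIdRegistered(string progId)
    {
        try
        {
            // Returns null, rather than throwing, when the ProgID is unknown.
            return Type.GetTypeFromProgID(progId) != null;
        }
        catch (PlatformNotSupportedException)
        {
            // COM is only available on Windows.
            return false;
        }
    }
}
```

A caller could test `OcrAvailability.IsProgIdRegistered(OcrAvailability.ModiProgId)` at startup and disable the OCR feature when it returns false.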

Strength

The component comes with Office, so you can use it in your code at no extra cost. It is also easy to use, because Microsoft provides many code samples that show how to work with it.

Namespaces

C#
using System;
using System.Collections;
using System.IO;
using System.Drawing.Imaging;

Using the Code

The name of the COM object that you need to add as a reference is Microsoft Office Document Imaging 12.0 Type Library. By default, Office 2007 doesn't install it, so you need to add it with the Office 2007 installation program. Run the installer, select "Add or Remove Features", click the Continue button, and make sure that the imaging component is set to be installed.

The first time the OCR method is called, the OCR engine defaults to the user's regional settings for the LangID argument; after that, it retains the previous setting unless you specify the language explicitly. In a mixed-language environment, it is good practice to specify the LangID argument explicitly in every call to the OCR method.
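Passing the language explicitly could look like the sketch below. It is a minimal illustration rather than a definitive implementation, and it assumes the project references the Microsoft Office Document Imaging type library, which exposes the MODI namespace. The class name ExplicitLanguageOcr, the method name ReadText, and the file paths in the usage note are hypothetical.

```csharp
public static class ExplicitLanguageOcr
{
    // Hypothetical helper: runs OCR on one image with an explicitly chosen
    // language and returns the text of the first page.
    public static string ReadText(string imagePath, MODI.MiLANGUAGES language)
    {
        MODI.Document md = new MODI.Document();
        md.Create(imagePath);

        // Pass the language on every call instead of relying on the default.
        md.OCR(language, true, true);

        MODI.Image image = (MODI.Image)md.Images[0];
        string text = image.Layout.Text;

        // Close without saving changes to the image file.
        md.Close(false);
        return text;
    }
}
```

For example, `ExplicitLanguageOcr.ReadText(@"C:\scans\letter.jpg", MODI.MiLANGUAGES.miLANG_ENGLISH)` and `ExplicitLanguageOcr.ReadText(@"C:\scans\lettre.jpg", MODI.MiLANGUAGES.miLANG_FRENCH)` each state their language, so neither call depends on what the other one used.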

Next, create a Windows Application using C#. In Visual Studio Solution Explorer, right-click References >> choose Add Reference >> select the COM tab >> then select Microsoft Office Document Imaging 12.0 Type Library.

C#
/// <summary>
/// Checks a directory for JPG images, reads the text from each image,
/// and saves that text in a text file named after the image.
/// Images that cause an error are skipped.
/// </summary>
/// <param name="directoryPath">Path of the directory to check for images</param>
public void CheckFileType(string directoryPath)
{
    foreach (string filePath in Directory.GetFiles(directoryPath))
    {
        // Check for the JPG file format, ignoring letter case (.jpg, .JPG, .Jpg)
        string fileExtension = Path.GetExtension(filePath);
        if (!string.Equals(fileExtension, ".jpg", StringComparison.OrdinalIgnoreCase))
        {
            continue;
        }

        // Same path as the image, with only the final extension changed to .txt
        string textFilePath = Path.ChangeExtension(filePath, ".txt");

        try
        {
            // OCR operations ...
            MODI.Document md = new MODI.Document();
            md.Create(filePath);
            md.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);
            MODI.Image image = (MODI.Image)md.Images[0];
            string text = image.Layout.Text;

            // Close the document without saving changes to the image
            md.Close(false);

            // Save the image text in the text file (replaces an existing file)
            File.WriteAllText(textFilePath, text);
        }
        catch (Exception exc)
        {
            // Uncomment the code below to see the expected errors
            //MessageBox.Show(exc.Message,
            //"OCR Exception",
            //MessageBoxButtons.OK, MessageBoxIcon.Information);
        }
    }
}
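The file-name handling deserves some care. A path such as C:\my.jpg.files\scan.jpg contains the extension text twice, so deriving the text file name by removing every occurrence of ".jpg" would also alter the folder name, and a comparison against only ".jpg" and ".JPG" would miss ".Jpg". The following sketch, which uses only System.IO and hypothetical class and method names, shows one way to make both steps robust.

```csharp
using System;
using System.IO;

public static class OcrPaths
{
    // True for the .jpg extension in any letter case (.jpg, .JPG, .Jpg).
    public static bool IsJpeg(string filePath)
    {
        return string.Equals(Path.GetExtension(filePath), ".jpg",
                             StringComparison.OrdinalIgnoreCase);
    }

    // Path of the text file that sits next to the image.
    // Only the final extension changes; folder names stay untouched.
    public static string TextPathFor(string imagePath)
    {
        return Path.ChangeExtension(imagePath, ".txt");
    }
}
```

With these helpers, `OcrPaths.TextPathFor("scan.jpg")` gives `scan.txt`, and a folder name that happens to contain ".jpg" is left as it is.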

Points of Interest

I have built a larger sample application for Office OCR and will release it soon.

Remark

Many people use OCR in Internet spiders (web crawlers) to collect data.

History

  • 24-08-2009: Released

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior) Equinox Web
Egypt
I have 5 years of experience working as a software developer. I have a wide range of programming experience and am skilled in Visual Studio .NET 2008, Windows applications, web applications, web services, Windows services, WPF, HTML, JavaScript, Ajax, ASP.NET, DevExpress controls, and Office application programmability in Visual Studio .NET 2008. I build web and Windows applications using C#.NET and am experienced in using all Microsoft Office applications.
