License: The Code Project Open License (CPOL)
SharePoint OCR image files indexing
By gstolarov
C++, WinXP, Win2003, ASP.NET, Win32, Dev
This article describes how to set up indexing of image files (including TIFF, PDF, JPEG, BMP...) using OCR technology. The indexing described below relies on Microsoft IFilter technology, so it is not specific to SharePoint: it can be used with any product built on Microsoft indexing, such as Microsoft Search, Desktop Search, and SQL Server search, and, through plug-ins, with Google Desktop Search. I, however, use it with Microsoft Windows SharePoint Services 2003. For the other products, the registration may need to be slightly different.
One of the projects I was working on required storage of old documents scanned into PDF files. Then, there was a separate team of people responsible for providing tags for a search engine so those image documents could be found. The whole process was clumsy, labor intensive, and error prone. That was what started me on my exploration path.
The first search I fired off was for Open Source OCR products. Pretty quickly, I narrowed it down to Tesseract (http://code.google.com/p/tesseract-ocr/). Tesseract is an orphaned brainchild of HP, which worked on it from 1985 to 1995. It was then released as Open Source, and now, if I understand it correctly, Google is working on it. With credentials like that, it's no wonder that Tesseract earns some of the highest marks for OCR accuracy. After downloading it and struggling just a bit, I got Tesseract to work. The struggle was that the home page claims that its base input format is TIFF. Maybe my TIFFs were bad, but I was able to get it to work only with BMP files.
So now that I have an OCR engine that can convert BMP files into text, how do I get text out of image-only PDF files? One more search, and I settled on ImageMagick (http://www.imagemagick.org/). This is another wonderful Open Source utility that can convert practically any image format into another. It worked out of the box for converting TIFF files into bitmaps, but to convert PDF files, it requires Ghostscript (http://mirror.cs.wisc.edu/pub/mirrors/ghost/GPL/gs864/gs864w32.exe).
With that utility installed, I was cooking - I could convert any file (in particular, PDF and TIFF) into a bitmap, and then extract the text out of the bitmap. The only remaining consideration was to treat PDF files that already contain text differently - after all, OCR is very computation intensive, and somewhat error prone even with perfect image quality and resolution. So another quick search, and I had PDFTOTEXT (ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl4-win32.zip) - thank God for Open Source! With it, I can pull text out of a PDF in the blink of an eye. It returns nothing for pure image PDFs, but I already have a solution for those!
It took another 15 minutes to set up a batch script to automate the process.
Once you unzip the attached project, check out the bin\OCR.BAT file. It writes the extracted text to a file in the same directory as your source file, with the same name plus the '.txt' extension.
For example:
ocr.bat c:\temp\xyz.pdf
will generate the c:\temp\xyz.pdf.txt file.
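The contents of OCR.BAT are not reproduced in this article, so the following is only a sketch of the flow described above, written in C++ to match the rest of the project: try pdftotext first, and fall back to ImageMagick plus Tesseract when little or no text comes back. The helper names (quote, pdfToTextCommand, toBitmapCommand, ocrCommand, needsOcr), the -density 300 setting, and the 16-character threshold are all assumptions made for illustration, not values taken from the batch file.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: wrap a path in double quotes so that spaces survive
// the command interpreter. Embedded quotes are not handled.
std::string quote(const std::string& path) {
    return "\"" + path + "\"";
}

// Step 1: ask xpdf's pdftotext for the text already embedded in the PDF.
// Command form: pdftotext <PDF-file> <text-file>
std::string pdfToTextCommand(const std::string& pdf, const std::string& txt) {
    return "pdftotext " + quote(pdf) + " " + quote(txt);
}

// Step 2: rasterize the source into a bitmap with ImageMagick's convert.
// -density 300 is an assumed resolution. A multi-page PDF produces several
// output files, which this sketch does not handle. On Windows the full path
// to ImageMagick's convert.exe may be needed, because the system ships its
// own convert.exe.
std::string toBitmapCommand(const std::string& source, const std::string& bmp) {
    return "convert -density 300 " + quote(source) + " " + quote(bmp);
}

// Step 3: OCR the bitmap. Tesseract appends ".txt" to the output base, so an
// output base of c:\temp\xyz.pdf yields c:\temp\xyz.pdf.txt.
std::string ocrCommand(const std::string& bmp, const std::string& outputBase) {
    return "tesseract " + quote(bmp) + " " + quote(outputBase);
}

// Decide whether the pdftotext result is too thin to be real text, in which
// case the file is probably a scanned image and needs OCR. The threshold of
// 16 alphanumeric characters is an arbitrary assumption.
bool needsOcr(const std::string& extractedText, std::size_t minChars = 16) {
    const auto visible = static_cast<std::size_t>(std::count_if(
        extractedText.begin(), extractedText.end(),
        [](unsigned char c) { return std::isalnum(c) != 0; }));
    return visible < minChars;
}
```

A caller would pass each command string to std::system (or CreateProcess), read back the pdftotext output, and run steps 2 and 3 only when needsOcr returns true. Keeping the command construction separate from the execution makes the decision logic easy to check without the tools installed.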
So now I have a simple batch process to extract text out of any image and/or PDF file. To make it usable in SharePoint (or any other product that uses Microsoft Indexing technology), I need to create an IFilter component. This is a plug-in that Microsoft Indexing uses to extract text from specialized file formats (e.g., Office, PDF, ...).
At this point, there was a right way and a quick way. And I have to admit my guilt here - I chose the quick way. The thing is that all the components I use have C/C++ APIs, and to do it right, I should have pulled everything together into a single component. Instead, I decided to just run the batch process I had set up earlier. This is somewhat slower, but at least I don't have to worry about memory leaks and page faults from code I'm not familiar with.
So I downloaded the Microsoft Platform SDK, got SmpFilt to work, changed GUIDs, got it to run my OCR.BAT - and here you have it - my own OCR plug-in to Microsoft Indexing.
I'm skipping over some of the pain and sweat of debugging an IFilter, converting between multi-byte and single-byte strings and back, and all the fun that made Microsoft COM development so "loved" around the world. But the purpose of this article is not to teach how to do COM in C++ or how to develop an IFilter.
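For readers who want a feel for the two chores mentioned above, here is a portable sketch, not the code in OcrFilt.dll. The indexer pulls text through IFilter::GetText(ULONG* pcwcBuffer, WCHAR* awcBuffer) one buffer at a time, and the text has to be wide characters, while the OCR output file is single-byte. The class OcrTextSource, the enum ChunkStatus, and the function widen are hypothetical names; the enum values stand in for S_OK, FILTER_S_LAST_TEXT, and FILTER_E_NO_MORE_TEXT. The widening step assumes plain ASCII or Latin-1 input; a real filter would more likely call MultiByteToWideChar with the proper code page.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-ins for the status codes a real GetText returns:
// MoreText ~ S_OK, LastText ~ FILTER_S_LAST_TEXT,
// NoMoreText ~ FILTER_E_NO_MORE_TEXT.
enum class ChunkStatus { MoreText, LastText, NoMoreText };

// Widen single-byte text to wide characters one byte at a time.
// Assumption: the input is 7-bit ASCII or Latin-1.
std::wstring widen(const std::string& bytes) {
    std::wstring out;
    out.reserve(bytes.size());
    for (unsigned char c : bytes) {
        out.push_back(static_cast<wchar_t>(c));
    }
    return out;
}

// Hands out the OCR text in caller-sized pieces, the way the indexer
// expects a filter to behave.
class OcrTextSource {
public:
    explicit OcrTextSource(std::wstring text) : text_(std::move(text)) {}

    // On entry *count is the buffer capacity in wide characters;
    // on exit it is the number of characters actually written.
    ChunkStatus getText(std::size_t* count, wchar_t* buffer) {
        if (pos_ >= text_.size()) {
            *count = 0;
            return ChunkStatus::NoMoreText;
        }
        const std::size_t n = std::min(*count, text_.size() - pos_);
        std::copy_n(text_.data() + pos_, n, buffer);
        pos_ += n;
        *count = n;
        return pos_ >= text_.size() ? ChunkStatus::LastText
                                    : ChunkStatus::MoreText;
    }

private:
    std::wstring text_;
    std::size_t pos_ = 0;
};
```

A caller would construct OcrTextSource from widen(contents of the .txt file) and call getText in a loop with a fixed-size buffer until it reports LastText or NoMoreText. Keeping the read position inside the object matters because the indexer, not the filter, decides how large each request is.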
Once you have your filter done and registered, the Platform SDK contains two utilities to test IFilter: filtdump.exe and filtreg.exe - you can play with them to make sure your filter is registered and works correctly.
The Microsoft IFilter template will do the appropriate registration for the Indexing Service, but SharePoint requires additional entries. In the download, there is a bin\wss_reg.reg file that will register the SharePoint related entries. I would encourage you, however, to create a backup of the HKLM\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0 key before you import the wss_reg file - just in case, you know.
By the way, since I don't have an installer, the DLL (OcrFilt.dll) also needs to be registered manually.
regsvr32 OcrFilt.dll
Even though currently I'm using it only with SharePoint, there are other very interesting applications for this solution:
Even though all the components are Open Source, you might want to verify that your company's legal department has no problem with each component's licensing requirements.
There is no better way to show your support than to donate money. The second best thing is to vote for the article you like :-)
Last Updated: 2 Dec 2009. Editor: Smitha Vijayan
Copyright 2009 by gstolarov. Everything else Copyright © CodeProject, 1999-2010.