Using IFilter in C#

19 Mar 2006, Ms-PL
Using the IFilter interface to extract text from various document types.

What's in an IFilter?

The IFilter interface was designed by Microsoft for use in its Indexing Service. Its main purpose is to extract text from files so the Indexing Service can index them and later search them. Some versions of Windows come with IFilter implementations for Office files, and there are free and commercial filters available for other file types (the Adobe PDF filter is a popular one). The IFilter interface is used mainly for non-text files like Office documents, PDF documents etc., but it is also used for text files like HTML and XML, to extract only the important parts of the file. Although the IFilter interface can be used for general purpose text extraction from documents, it is generally used in search engines. Windows Desktop Search uses filters to index files. For more information on IFilter, see the Links section.

So what else is new?

There are already quite a few articles and pieces of information on how to use the IFilter interface in .NET (see the Links section), so why write another article, you ask? Well, there are some problems with the implementations offered in those articles (details below), which led me to take a different approach to using and loading filters. I'm currently using this implementation in a new product I'm developing (more details will be revealed here), and since it's working great, I decided to share it with you (yes, You!).

Issues with the current implementations

These are the issues I and others have found with the current implementations. I'll discuss each in detail below:

  1. Extracting text from very large files.
  2. COM threading issues.
  3. Adobe PDF filter crashing the application when it's closed.

Extracting text from very large files

All of the sample code I found on using IFilters in C# provided a method that extracts the entire text of a document and returns that as a string. Usually, it's something like this:

public static string GetTextFromFile(string path)

Now, this might be OK for some uses, but for a general purpose indexer, I find that it isn't the most scalable way to extract text from documents. Some documents may be very large (30 MB PDFs or Word documents are not uncommon), and extracting the entire text at once can have negative effects on the garbage collector, since objects of 85,000 bytes or more are stored in the .NET "Large Object Heap" (see the Links section for more information).

COM threading issues

Since filters are essentially COM objects, they carry with them all the COM threading model issues that we all love to hate. See the Links section for some of the reported problems. To make a long story short, some filters are marked as STA (Adobe PDF filter), some as MTA (Microsoft XML filter), and some as Both (Microsoft Office filter). This means MTA filters will not load into C# threads that are marked with [STAThread], and STA filters will not load into [MTAThread] threads. Some people recommend manually changing the registry to mark "problematic" filters as Both, but this isn't something you want to do during the installation of a product, nor can you reliably do it, because you don't know which filters are installed on the customer's machine. We basically need a way to load an IFilter and use it no matter what its threading model or our threading model is.

Adobe PDF filter crashing the application when it's closed

There are quite a few reports about problems with the Adobe PDF filter v6. See this and this for some examples. I researched this issue for some time, and I believe I found what the problem is. It seems Adobe forgot (or not...) to export the DllCanUnloadNow function from their PDFFILT.dll. Since a filter is implemented as a COM object, it should export this function to let COM know when it can unload the library. It seems that this causes problems for C# applications because the .dll is never unloaded, and when it finally is, it's probably a bit late.

In a previous version of my application, I managed to work around this issue by specifically unloading the PDFFILT.dll library. In the current implementation, this workaround is not needed.

How my implementation solves these issues

Implement a FilterReader

I decided to go the hard way and implement a TextReader derived class called FilterReader. This solves issue #1 because we don't have to extract the entire text at once. Instead, you can simply use the reader to get a buffer at a time. If you still want to get the entire text as a string, use the ReadToEnd() method.
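To make the idea concrete, here is a minimal sketch of a TextReader that pulls text lazily, one chunk at a time. It is not the article's FilterReader: IChunkSource, ArrayChunkSource, and ChunkedTextReader are hypothetical names introduced for illustration, and the chunk source stands in for whatever the real filter hands back.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for a filter: returns the next piece of text,
// or null once the document is exhausted.
public interface IChunkSource
{
    string NextChunk();
}

// Trivial in-memory source, used here only to exercise the reader.
public sealed class ArrayChunkSource : IChunkSource
{
    private readonly string[] _chunks;
    private int _next;

    public ArrayChunkSource(params string[] chunks) { _chunks = chunks; }

    public string NextChunk()
    {
        return _next < _chunks.Length ? _chunks[_next++] : null;
    }
}

// A TextReader that holds only one chunk in memory at a time.
public sealed class ChunkedTextReader : TextReader
{
    private readonly IChunkSource _source;
    private string _current = string.Empty; // chunk currently being consumed
    private int _pos;                        // read position inside _current
    private bool _done;                      // true once the source returned null

    public ChunkedTextReader(IChunkSource source)
    {
        if (source == null) throw new ArgumentNullException("source");
        _source = source;
    }

    // Makes sure _current has unread characters; returns false at end of input.
    private bool Fill()
    {
        while (!_done && _pos >= _current.Length)
        {
            string next = _source.NextChunk();
            if (next == null) { _done = true; break; }
            _current = next; // empty chunks are skipped by the loop
            _pos = 0;
        }
        return _pos < _current.Length;
    }

    public override int Peek()
    {
        return Fill() ? _current[_pos] : -1;
    }

    public override int Read()
    {
        return Fill() ? _current[_pos++] : -1;
    }

    public override int Read(char[] buffer, int index, int count)
    {
        if (buffer == null) throw new ArgumentNullException("buffer");
        if (index < 0 || count < 0 || buffer.Length - index < count)
            throw new ArgumentException("Invalid index or count.");

        int copied = 0;
        while (copied < count && Fill())
        {
            int n = Math.Min(count - copied, _current.Length - _pos);
            _current.CopyTo(_pos, buffer, index + copied, n);
            _pos += n;
            copied += n;
        }
        return copied;
    }
}
```

Because ReadToEnd() in the TextReader base class is built on Read(char[], int, int), callers that still want the whole text as one string get it without extra code, while an indexer can stay with small buffers.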

Bypassing COM

In order to get an IFilter instance, you would normally call the LoadIFilter API and pass it a file name. LoadIFilter eventually calls CoCreateInstance() to actually instantiate the filter, and thus abides by COM rules. To avoid the threading issues, I decided to bypass COM and instantiate the filter COM class myself. This has the following implications and assumptions:

  1. I needed to find the correct COM class that implements the filter for a specific file type.
  2. I needed to dynamically load the COM DLL that implements that COM class and call the DllGetClassObject function that is exported from that .dll.
  3. I didn't want to re-implement all of the COM infrastructure, so in order to solve the issue of unloading COM DLLs only when they're not needed, I decided to keep the DLLs loaded during the entire application lifetime and only unload them when the application dies. Note that this essentially solves issue #3 since we manually unload the PDFFILT.dll.
  4. An IFilter should not be used by multiple threads since it is no longer protected by COM.
  5. I assumed that STA filters will behave correctly when called from an MTA thread when COM is not involved. So far, I haven't encountered any problem with this approach. If you find a filter that behaves badly when used this way, please let me know.
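Point 3 above (keep every loaded DLL alive and release it only at shutdown) could be sketched roughly as follows. This is an illustration under assumptions, not the article's own handle list: LoadedDllList is a hypothetical name, and the release callback is injectable only so the bookkeeping can be exercised without real library handles.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Hypothetical sketch: remembers module handles and frees them in one place.
public sealed class LoadedDllList : IDisposable
{
    [DllImport("kernel32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool FreeLibrary(IntPtr hModule);

    private readonly object _sync = new object();
    private readonly List<IntPtr> _handles = new List<IntPtr>();
    private readonly Action<IntPtr> _release;

    // Default: release through the Win32 FreeLibrary API.
    public LoadedDllList() : this(h => FreeLibrary(h)) { }

    // Overload that takes a custom release action (an assumption for testability).
    public LoadedDllList(Action<IntPtr> release)
    {
        if (release == null) throw new ArgumentNullException("release");
        _release = release;
    }

    public int Count
    {
        get { lock (_sync) { return _handles.Count; } }
    }

    // Each successful LoadLibrary call bumps the module's reference count,
    // so every handle is recorded, duplicates included, and freed once.
    public void AddDllHandle(IntPtr handle)
    {
        if (handle == IntPtr.Zero) return; // nothing was loaded
        lock (_sync) { _handles.Add(handle); }
    }

    // Frees the handles in reverse load order and forgets them.
    public void Dispose()
    {
        lock (_sync)
        {
            for (int i = _handles.Count - 1; i >= 0; i--)
                _release(_handles[i]);
            _handles.Clear();
        }
    }
}
```

One way to tie this to the application lifetime would be to call Dispose() from an AppDomain.CurrentDomain.ProcessExit handler, after all filters have been released.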

To conclude:

How to use the code

Using the code is very simple: instantiate a FilterReader by passing it the file you want to extract text from, and use it like any TextReader derived class:

TextReader reader=new FilterReader(fileName);
using (reader)
{
  textBox1.Text=reader.ReadToEnd();
}
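For large documents, the same reader can be drained in fixed-size blocks instead of calling ReadToEnd(). The helper below is a sketch that works with any TextReader; BufferedTextConsumer and ReadInBlocks are hypothetical names, not part of the article's code.

```csharp
using System;
using System.IO;

public static class BufferedTextConsumer
{
    // Reads the reader to the end in blocks of at most blockSize characters.
    // onBlock receives the shared buffer and the number of valid characters in it.
    // Returns the total number of characters read.
    public static long ReadInBlocks(TextReader reader, int blockSize,
                                    Action<char[], int> onBlock)
    {
        if (reader == null) throw new ArgumentNullException("reader");
        if (onBlock == null) throw new ArgumentNullException("onBlock");
        if (blockSize <= 0) throw new ArgumentOutOfRangeException("blockSize");

        char[] buffer = new char[blockSize]; // reused for every block
        long total = 0;
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            onBlock(buffer, read);
            total += read;
        }
        return total;
    }
}
```

With the article's reader this would be called as `BufferedTextConsumer.ReadInBlocks(new FilterReader(fileName), 4096, (buf, n) => { /* index buf[0..n) */ });` inside a using block, so no string of the whole document is ever allocated.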

The details

Finding the correct COM class

Since I've decided not to use LoadIFilter, I needed to find a way to locate the correct DLL and class ID of the object implementing the filter for the file whose text we're interested in. This was a simple task, thanks to the excellent RegMon utility from SysInternals. I simply called LoadIFilter and traced which registry keys were read during that operation. I then used the same logic in my own implementation. The details can be found in the FilterLoader class. When a class\DLL pair is found for a certain file extension, this information is cached to avoid traversing the registry again.
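A rough sketch of such a lookup with caching is shown below. It assumes the documented persistent-handler chain under HKEY_CLASSES_ROOT (extension, then PersistentHandler, then PersistentAddinsRegistered for the IFilter IID, then InprocServer32); the article's FilterLoader may read additional keys. FilterClassInfo and FilterRegistryLookup are hypothetical names, and the registry reader is passed in as a delegate so the logic can be tried without touching a real registry.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Win32;

// Hypothetical result type: the filter's class ID and the DLL that implements it.
public sealed class FilterClassInfo
{
    public readonly string ClassId;
    public readonly string DllPath;

    public FilterClassInfo(string classId, string dllPath)
    {
        ClassId = classId;
        DllPath = dllPath;
    }
}

public sealed class FilterRegistryLookup
{
    private const string IFilterIid = "{89BCB740-6119-101A-BCB7-00DD010655AF}";

    // (subKey under HKEY_CLASSES_ROOT, value name or null for default) -> value or null
    private readonly Func<string, string, string> _read;
    private readonly Dictionary<string, FilterClassInfo> _cache =
        new Dictionary<string, FilterClassInfo>(StringComparer.OrdinalIgnoreCase);

    public FilterRegistryLookup(Func<string, string, string> readValue)
    {
        if (readValue == null) throw new ArgumentNullException("readValue");
        _read = readValue;
    }

    // Reader backed by the real registry (Windows only).
    public static string ReadClassesRoot(string subKey, string valueName)
    {
        using (RegistryKey key = Registry.ClassesRoot.OpenSubKey(subKey))
        {
            return key == null ? null : key.GetValue(valueName) as string;
        }
    }

    // Returns the cached class/DLL pair for an extension such as ".pdf", or null.
    public FilterClassInfo Find(string extension)
    {
        lock (_cache)
        {
            FilterClassInfo info;
            if (!_cache.TryGetValue(extension, out info))
            {
                info = Resolve(extension);
                _cache[extension] = info; // misses are cached as null too
            }
            return info;
        }
    }

    private FilterClassInfo Resolve(string extension)
    {
        string handler = _read(extension + @"\PersistentHandler", null);
        if (handler == null) return null;

        string classId = _read(
            @"CLSID\" + handler + @"\PersistentAddinsRegistered\" + IFilterIid, null);
        if (classId == null) return null;

        string dll = _read(@"CLSID\" + classId + @"\InprocServer32", null);
        return dll == null ? null : new FilterClassInfo(classId, dll);
    }
}
```

On Windows, `new FilterRegistryLookup(FilterRegistryLookup.ReadClassesRoot).Find(".pdf")` would walk the real keys once and answer later calls from the cache.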

While researching how LoadIFilter works, I came across a utility called IFilter Explorer that shows which filters are installed on your computer. From that tool, I also learned that some indexing engines use methods not implemented in LoadIFilter to find filters. One of these methods uses the content type registered for that extension. My version of LoadIFilter also handles loading filters for files that have no filter registered for them but do have a filter registered for their content type.

Loading the DLL and instantiating the filter implementation

OK, so we have the name of the DLL and the ID of the class implementing our filter. How do we create an instance of that class? Most of the work is handled by the ComHelper class. The steps needed to accomplish that are:

  • Load the DLL using the LoadLibrary Win32 API.
  • Call the GetProcAddress Win32 API to get a pointer to the DllGetClassObject function.
  • Use Marshal.GetDelegateForFunctionPointer() to convert that function pointer to a delegate. Note: this method was introduced in .NET 2.0. For an equivalent method in .NET 1.1, see the Links section.
  • Call the DllGetClassObject function to get an IClassFactory object.

private static IClassFactory GetClassFactoryFromDll(string dllName, 
               string filterPersistClass)
{
  //Load the dll
  IntPtr dllHandle=Win32NativeMethods.LoadLibrary(dllName);
  if (dllHandle==IntPtr.Zero)
    return null;

  //Keep a reference to the dll until the process\AppDomain dies
  _dllList.AddDllHandle(dllHandle);

  //Get a pointer to the DllGetClassObject function
  IntPtr dllGetClassObjectPtr=Win32NativeMethods.GetProcAddress(dllHandle, 
    "DllGetClassObject");
  if (dllGetClassObjectPtr==IntPtr.Zero)
    return null;

  //Convert the function pointer to a .net delegate
  DllGetClassObject dllGetClassObject=
    (DllGetClassObject)Marshal.GetDelegateForFunctionPointer(
    dllGetClassObjectPtr, typeof(DllGetClassObject));

  //Call DllGetClassObject to retrieve a class factory
  //for our filter class
  Guid filterPersistGUID=new Guid(filterPersistClass);
  //IClassFactory interface ID (IID)
  Guid IClassFactoryGUID=new 
    Guid("00000001-0000-0000-C000-000000000046");
  Object unk;
  if (dllGetClassObject(ref filterPersistGUID, 
          ref IClassFactoryGUID, out unk)!=0)
    return null;

  //Yippie! cast the returned object to IClassFactory
  return (unk as IClassFactory);
}
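The method above relies on a few declarations that are not shown. A plausible set is sketched here, with names chosen to match the calls above; the exact attributes in the downloadable source may differ, so treat these as assumptions.

```csharp
using System;
using System.Runtime.InteropServices;

internal static class Win32NativeMethods
{
    // Resolves to LoadLibraryW because of CharSet.Unicode.
    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    public static extern IntPtr LoadLibrary(string lpFileName);

    // GetProcAddress has no wide-character variant: export names are ANSI.
    [DllImport("kernel32.dll", CharSet = CharSet.Ansi, ExactSpelling = true,
        BestFitMapping = false, ThrowOnUnmappableChar = true, SetLastError = true)]
    public static extern IntPtr GetProcAddress(IntPtr hModule, string lpProcName);
}

// Managed signature of the DllGetClassObject export. The HRESULT is returned
// as an int so the caller can compare it with 0 (S_OK).
[UnmanagedFunctionPointer(CallingConvention.StdCall)]
internal delegate int DllGetClassObject(
    ref Guid classId,
    ref Guid interfaceId,
    [MarshalAs(UnmanagedType.Interface)] out object ppv);

// Standard COM class factory interface. Without MarshalAs, object parameters
// would be marshaled as VARIANT and bool as VARIANT_BOOL.
[ComImport]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
[Guid("00000001-0000-0000-C000-000000000046")]
internal interface IClassFactory
{
    void CreateInstance(
        [MarshalAs(UnmanagedType.Interface)] object pUnkOuter,
        ref Guid riid,
        [MarshalAs(UnmanagedType.Interface)] out object ppvObject);

    void LockServer([MarshalAs(UnmanagedType.Bool)] bool fLock);
}
```

Since CreateInstance is declared as returning void, a failing HRESULT surfaces as an exception rather than a return code; callers that prefer return codes could add [PreserveSig] and return int instead.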

Once we have an IClassFactory object, we can use it to create instances of the class implementing our filter:

private static IFilter LoadFilterFromDll(string dllName, 
                       string filterPersistClass)
{
  //Get a classFactory for our classID
  IClassFactory classFactory=ComHelper.GetClassFactory(dllName, 
    filterPersistClass);
  if (classFactory==null)
    return null;

  //And create an IFilter instance using that class factory
  Guid IFilterGUID=new Guid("89BCB740-6119-101A-BCB7-00DD010655AF");
  Object obj;
  classFactory.CreateInstance(null, ref IFilterGUID, out obj);
  return (obj as IFilter);
}

We finally have an IFilter instance that can be passed to our FilterReader (after doing the standard filter initialization code):

IPersistFile persistFile=(filter as IPersistFile);
if (persistFile!=null)
{
  persistFile.Load(fileName, 0);
  IFILTER_FLAGS flags;
  IFILTER_INIT iflags =
    IFILTER_INIT.CANON_HYPHENS |
    IFILTER_INIT.CANON_PARAGRAPHS |
    IFILTER_INIT.CANON_SPACES |
    IFILTER_INIT.APPLY_INDEX_ATTRIBUTES |
    IFILTER_INIT.HARD_LINE_BREAKS |
    IFILTER_INIT.FILTER_OWNED_VALUE_OK;

  if (filter.Init(iflags, 0, IntPtr.Zero, out flags)==IFilterReturnCode.S_OK)
    return filter;
}
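The IFILTER_INIT flags used above map onto the native enumeration from filter.h. The sketch below lists the values as I recall them from that header; check them against the SDK before relying on them. FilterInitDefaults is a hypothetical helper that simply names the combination used in the snippet above.

```csharp
using System;

// Managed mirror of the native IFILTER_INIT flags (values to be verified
// against filter.h in the Windows SDK).
[Flags]
public enum IFILTER_INIT
{
    NONE = 0,
    CANON_PARAGRAPHS = 1,
    HARD_LINE_BREAKS = 2,
    CANON_HYPHENS = 4,
    CANON_SPACES = 8,
    APPLY_INDEX_ATTRIBUTES = 16,
    APPLY_OTHER_ATTRIBUTES = 32,
    INDEXING_ONLY = 64,
    SEARCH_LINKS = 128,
    APPLY_CRAWL_ATTRIBUTES = 256,
    FILTER_OWNED_VALUE_OK = 512
}

public static class FilterInitDefaults
{
    // The same combination the initialization snippet passes to Init().
    public const IFILTER_INIT TextExtraction =
        IFILTER_INIT.CANON_HYPHENS |
        IFILTER_INIT.CANON_PARAGRAPHS |
        IFILTER_INIT.CANON_SPACES |
        IFILTER_INIT.APPLY_INDEX_ATTRIBUTES |
        IFILTER_INIT.HARD_LINE_BREAKS |
        IFILTER_INIT.FILTER_OWNED_VALUE_OK;
}
```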

Note that because we didn't use any COM calls during that process, we get a "raw" interface pointer to the filter class and COM does not create any proxy\stubs to protect that interface.

Conclusion

I've been using this approach for several months now without any problems. Here's a summary of the benefits and implications of this approach:

Benefits

  • No COM threading issues that cause certain filters not to function correctly.
  • No need to mark your thread as [STAThread] when using filters (this is a problem especially with web applications).
  • The Adobe PDF filter does not crash at the end of the application.
  • Better scalability when dealing with large files.
  • Better filter search logic than LoadIFilter (using content type).

Implications

  • Bypassing COM may damage your health :) - Actually, I very much enjoy bypassing COM, but FDA regulations force me to have that warning here.
  • Once filter DLLs are loaded into your application, they will stay loaded. If this is a problem for you, don't use this approach.
  • No COM protection for multi-threaded access to filters (Yeah, so?).
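If a reader really must be shared between threads, one option is to serialize access yourself. The sketch below uses the framework's TextReader.Synchronized wrapper; SharedReaderDemo is a hypothetical name, and a StringReader would stand in for the article's reader when trying it out. Note that this only serializes individual calls and gives no thread affinity, so whether a particular filter tolerates calls from different threads remains an assumption.

```csharp
using System;
using System.IO;
using System.Threading;

public static class SharedReaderDemo
{
    // Drains one reader from several threads through a synchronized wrapper
    // and returns the total number of characters read by all of them.
    public static int DrainConcurrently(TextReader reader, int threadCount)
    {
        if (reader == null) throw new ArgumentNullException("reader");
        if (threadCount <= 0) throw new ArgumentOutOfRangeException("threadCount");

        // Every call on 'shared' holds a lock for its duration.
        TextReader shared = TextReader.Synchronized(reader);
        int total = 0;

        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < workers.Length; i++)
        {
            workers[i] = new Thread(() =>
            {
                char[] buffer = new char[64]; // private buffer per thread
                int read;
                while ((read = shared.Read(buffer, 0, buffer.Length)) > 0)
                    Interlocked.Add(ref total, read);
            });
            workers[i].Start();
        }

        foreach (Thread worker in workers)
            worker.Join();

        return total;
    }
}
```

In most indexers it is simpler to give each worker thread its own reader and avoid sharing altogether.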

Links and References

License

This article, along with any associated source code and files, is licensed under The Microsoft Public License (Ms-PL)

About the Author

Eyal Post
Israel

Comments and Discussions

 
General: Re: .pdf files ifilter load fails for AcroRdIF.dll in IClassFactory, Win32NativeMethods.LoadLibrary (Member 231782329-Jul-08 5:24)
Answer: Re: .pdf files ifilter load fails for AcroRdIF.dll in IClassFactory, Win32NativeMethods.LoadLibrary (Claudio Nicora, 24-Mar-10 1:33)
General: Trying to Understand TextReader Implementation (Marc Mueller, 11-Apr-08 4:37)
General: Re: Trying to Understand TextReader Implementation (Eyal Post, 12-Apr-08 7:09)
General: is iFilter can extract text generated from the iTextSharpiFilter (my4color, 6-Apr-08 23:42)
General: Re: is iFilter can extract text generated from the iTextSharpiFilter (Eyal Post, 9-Apr-08 6:28)
General: Re: is iFilter can extract text generated from the iTextSharpiFilter [modified] (my4color, 16-Apr-08 3:30)
General: PDF IFilter fails when document has been edited on a Vista machine (Kirk evans, 4-Mar-08 4:14)
I am programming in C# with VS Express 2005.

I have a program that extracts text from PDF files using the EPocalipse IFilter DLL.

Everything has worked fine for quite some time, but now I have a user that has a Vista machine. If said user opens a PDF, edits it, and "checks it in" (basically sends it to my server machine for indexing), my program fails. I'm not prepared (sorry) to give you the exact error message, but the strange thing is that my program traps the error successfully and reports the error message, but then any further calls to the program result in the "corrupted memory" error message.

From reading other posts while Googling this problem, I think this has something to do with 64-bit vs. 32-bit IFilters, but I am not sure about that.

I am stuck as to what to do about it. The machine that runs the Indexing Service (the one that extracts the text) is not a Vista machine.

Any help will be greatly appreciated.
Kirk
General: Re: PDF IFilter fails when document has been edited on a Vista machine (hendi621-May-09 8:46)
Question: cannot get a handle to adobe dll. (rajdawg, 14-Feb-08 6:31)
General: Re: cannot get a handle to adobe dll. (dujin.mail, 23-Apr-08 23:23)
General: Re: cannot get a handle to adobe dll. (rajdawg, 24-Apr-08 4:05)
General: Re: cannot get a handle to adobe dll. (Asier Garcia, 18-Sep-08 6:24)
Answer: Re: cannot get a handle to adobe dll. (Claudio Nicora, 24-Mar-10 1:35)
General: Does anyone know how to use it in vc++ (plxxlp, 8-Jan-08 20:37)
Question: I get some errors with the IFilter Tester. [modified] (sooule, 23-Dec-07 3:42)
General: Re: I get some errors with the IFilter Tester. (anon, 25-Dec-07 7:39)
General: Re: I get some errors with the IFilter Tester. (sooule, 26-Dec-07 1:22)
Question: Did anyone get this to work under asp.net 2.0 medium trust? (Peter Donker, 19-Dec-07 10:05)
General: Licensing this component (arethuza, 2-Dec-07 0:08)
General: Re: Licensing this component (casc_18, 1-Aug-08 11:09)
General: Re: Licensing this component (Eyal Post, 5-Aug-08 5:06)
General: Just what I was looking for (Erik_D, 16-Nov-07 1:58)
Question: Acrobat 8 IFilter crashes on closing (Randall De Weerd, 13-Nov-07 12:14)
General: Filter sample (kembo, 6-Nov-07 13:14)

Article
Posted 11 Mar 2006
