
What's in an IFilter?
The IFilter interface was designed by Microsoft for use in its Indexing Service. Its main purpose is to extract text from files so the Indexing Service can index them and later search them. Some versions of Windows come with IFilter implementations for Office files, and there are free and commercial filters available for other file types (the Adobe PDF filter is a popular one). The IFilter interface is used mainly for non-text files like Office documents, PDF documents etc., but it is also used for text files like HTML and XML, to extract only the important parts of the file. Although the IFilter interface can be used for general purpose text extraction from documents, it is generally used in search engines. Windows Desktop Search uses filters to index files. For more information on IFilter, see the Links section.
So what else is new?
There are already quite a few articles and pieces of information on how to use the IFilter interface in .NET (see the Links section), so why write another article, you ask? Well, there are some problems with the implementations offered in those articles (details below), which led me to take a different approach to using and loading filters. I'm currently using this implementation in a new product I'm developing (more details will be revealed here), and since it's working great, I decided to share it with you (yes, You!).
Issues with the current implementations
These are the issues that I and others have found with the current implementations. I'll discuss each in detail below:
- Extracting text from very large files.
- COM threading issues.
- Adobe PDF filter crashing the application when it's closed.
Extracting text from very large files
All of the sample code I found on using IFilters in C# provided a method that extracts the entire text of a document and returns that as a string. Usually, it's something like this:
public static string GetTextFromFile(string path)
Now, this might be OK for some uses, but for a general purpose indexer, I find that it isn't the most scalable way to extract text from documents. Some documents may be very large (30 MB PDFs or Word documents are not uncommon), and extracting the entire text at once can have negative effects on the garbage collector, since strings that large (roughly 85,000 bytes or more) are allocated on the .NET "Large Object Heap" (see the Links section for more information).
COM threading issues
Since filters are essentially COM objects, they carry with them all the COM threading model issues that we all love to hate. See the Links section for some of the reported problems. To make a long story short, some filters are marked as STA (Adobe PDF filter), some as MTA (Microsoft XML filter), and some as Both (Microsoft Office filter). This means MTA filters will not load into C# threads that are marked with the [STAThread] attribute, and STA filters will not load into [MTAThread] threads. Some people recommend manually changing the registry to mark "problematic" filters as Both, but this isn't something you want to do during the installation of a product, nor can you do it reliably, because you don't know which filters are installed on the customer's machine. We basically need a way to load an IFilter and use it no matter what its threading model or our threading model is.
Adobe PDF filter crashing the application when it's closed
There are quite a few reports about problems with the Adobe PDF filter v6. See this and this for some examples. I researched this issue for some time, and I believe I found what the problem is. It seems Adobe forgot (or not..) to export the DllCanUnloadNow function from their PDFFILT.dll. Since a filter is implemented as a COM object, it should export this function to let COM know when it can unload the library. It seems that this causes problems for C# applications because the .dll is never unloaded while the application runs, and when it finally is unloaded, it's probably too late.
In a previous version of my application, I managed to work around this issue by specifically unloading the PDFFILT.dll library. In the current implementation, this workaround is not needed.
How does my implementation solve these issues?
Implement a FilterReader
I decided to go the hard way and implement a TextReader derived class called FilterReader. This solves issue #1 because we don't have to extract the entire text at once. Instead, you can simply use the reader to get a buffer at a time. If you still want to get the entire text as a string, use the ReadToEnd() method.
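As a rough illustration of reading a buffer at a time, the sketch below pulls text from any TextReader in fixed-size chunks and hands each chunk to a callback. The names ChunkedReadSketch, ProcessInChunks and ChunkHandler are made up for this example and are not part of the article's code; a StringReader can stand in for FilterReader when trying it out.

```csharp
using System;
using System.IO;

// Hypothetical callback type: receives the shared buffer and the number of valid chars in it.
public delegate void ChunkHandler(char[] buffer, int count);

public static class ChunkedReadSketch
{
    // Reads the reader to the end in chunks of at most bufferSize characters.
    // Returns the total number of characters read.
    public static long ProcessInChunks(TextReader reader, int bufferSize, ChunkHandler onChunk)
    {
        if (reader == null) throw new ArgumentNullException("reader");
        if (onChunk == null) throw new ArgumentNullException("onChunk");
        if (bufferSize <= 0) throw new ArgumentOutOfRangeException("bufferSize");

        char[] buffer = new char[bufferSize];
        long total = 0;
        int read;

        // TextReader.Read returns 0 once the end of the text is reached.
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            onChunk(buffer, read);   // only the first 'read' chars are valid
            total += read;
        }
        return total;
    }
}
```

With a FilterReader, the same loop would let an indexer tokenize each chunk as it arrives, so no single string ever has to hold the whole document.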
Bypassing COM
In order to get an IFilter instance, you would normally call the LoadIFilter API and pass it a file name. LoadIFilter eventually calls CoCreateInstance() to actually instantiate the filter, and thus abides by COM rules. To avoid the threading issues, I decided to bypass COM and instantiate the filter COM class myself. This has the following implications and assumptions:
- I needed to find the correct COM class that implements the filter for a specific file type.
- I needed to dynamically load the COM DLL that implements that COM class and call the DllGetClassObject function that is exported from that .dll.
- I didn't want to re-implement all of the COM infrastructure, so in order to solve the issue of unloading COM DLLs only when they're not needed, I decided to keep the DLLs loaded during the entire application lifetime and only unload them when the application dies. Note that this essentially solves issue #3 since we manually unload the PDFFILT.dll.
- An IFilter should not be used by multiple threads since it is no longer protected by COM.
- I assumed that STA filters will behave correctly when called from an MTA thread when COM is not involved. Until now, I didn't encounter any problem with this approach. If you find a filter that behaves badly when used this way, please let me know.
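Since COM no longer serializes calls, any sharing of a reader between threads has to be guarded by the caller. The simplest rule is one reader per thread. If a reader really must be shared, one option is the framework's TextReader.Synchronized wrapper, shown in the sketch below with a StringReader standing in for a filter-backed reader. SharedReaderSketch and CountLinesConcurrently are hypothetical names, and whether a particular filter tolerates being called from different threads even when the calls are serialized is an assumption that would need testing.

```csharp
using System.IO;
using System.Threading;

public static class SharedReaderSketch
{
    // Lets several worker threads pull lines from one shared reader
    // and returns how many lines were read in total.
    public static int CountLinesConcurrently(TextReader reader, int workerCount)
    {
        // TextReader.Synchronized returns a wrapper that serializes every call.
        TextReader safe = TextReader.Synchronized(reader);
        int total = 0;

        Thread[] workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = new Thread(delegate()
            {
                // Each ReadLine call is atomic with respect to the other workers.
                while (safe.ReadLine() != null)
                    Interlocked.Increment(ref total);
            });
            workers[i].Start();
        }

        foreach (Thread worker in workers)
            worker.Join();

        return total;
    }
}
```

The wrapper only makes individual calls atomic; a sequence such as Peek followed by Read can still interleave with another thread's calls.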
To conclude:
How to use the code
Using the code is very simple: instantiate a FilterReader by passing it the file you want to extract text from, and use it like any TextReader derived class:
using (TextReader reader = new FilterReader(fileName))
{
    textBox1.Text = reader.ReadToEnd();
}
The details
Finding the correct COM class
Since I've decided not to use LoadIFilter, I needed to find a way to locate the correct DLL and class ID of the object implementing the filter for the file whose text we're interested in. This was a simple task, thanks to the excellent RegMon utility from SysInternals. I simply called LoadIFilter and traced which registry keys were read during that operation. I then used the same logic in my own implementation. The details can be found in the FilterLoader class. When a class\DLL pair is found for a certain file extension, this information is cached to avoid traversing the registry again.
During the research I made on how LoadIFilter works, I came across a utility called IFilter Explorer that shows which filters are installed on your computer. From that tool, I also learned that some indexing engines use methods not implemented in LoadIFilter to find filters. One of these methods uses the content type registered for that extension. My version of LoadIFilter also handles loading filters for files that have no filter registered for them but do have a filter registered for their content type.
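The lookup order and caching described above can be sketched roughly as follows. This is not the FilterLoader class itself: FilterEntry, FilterCacheSketch and the three delegates are hypothetical, and the delegates stand in for the registry reads, which are deliberately left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical result of a lookup: the DLL and class ID implementing a filter.
public sealed class FilterEntry
{
    public readonly string DllName;
    public readonly string ClassId;
    public FilterEntry(string dllName, string classId)
    {
        DllName = dllName;
        ClassId = classId;
    }
}

// Stand-ins for the registry lookups (assumptions, not real APIs).
public delegate FilterEntry FilterLookup(string key);
public delegate string ContentTypeLookup(string extension);

public sealed class FilterCacheSketch
{
    private readonly Dictionary<string, FilterEntry> _cache =
        new Dictionary<string, FilterEntry>(StringComparer.OrdinalIgnoreCase);
    private readonly object _sync = new object();
    private readonly FilterLookup _byExtension;
    private readonly ContentTypeLookup _contentTypeOf;
    private readonly FilterLookup _byContentType;

    public FilterCacheSketch(FilterLookup byExtension,
        ContentTypeLookup contentTypeOf, FilterLookup byContentType)
    {
        _byExtension = byExtension;
        _contentTypeOf = contentTypeOf;
        _byContentType = byContentType;
    }

    // Returns the filter entry for an extension, or null if none is registered.
    public FilterEntry Find(string extension)
    {
        lock (_sync)
        {
            FilterEntry entry;
            if (_cache.TryGetValue(extension, out entry))
                return entry;                      // cached hit or cached miss

            entry = _byExtension(extension);       // 1. filter registered for the extension
            if (entry == null)
            {
                string contentType = _contentTypeOf(extension);
                if (contentType != null)
                    entry = _byContentType(contentType);   // 2. fall back to the content type
            }

            _cache[extension] = entry;             // misses are cached as null too
            return entry;
        }
    }
}
```

Caching misses as well as hits means an extension with no filter costs one lookup per process rather than one per file.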
Loading the DLL and instantiating the filter implementation
OK, so we have the name of the DLL and the ID of the class implementing our filter. How do we create an instance of that class? Most of the work is handled by the ComHelper class. The steps needed to accomplish that are:
- Load the DLL using the LoadLibrary Win32 API.
- Call the GetProcAddress Win32 API to get a pointer to the DllGetClassObject function.
- Use Marshal.GetDelegateForFunctionPointer() to convert that function pointer to a delegate. Note: this is only available from .NET 2.0 onwards. For an equivalent method in .NET 1.1, see the Links section.
- Call the DllGetClassObject function to get an IClassFactory object.
private static IClassFactory GetClassFactoryFromDll(string dllName,
    string filterPersistClass)
{
    // Load the filter DLL directly instead of going through COM
    IntPtr dllHandle = Win32NativeMethods.LoadLibrary(dllName);
    if (dllHandle == IntPtr.Zero)
        return null;

    // Remember the handle; the DLL stays loaded until the application exits
    _dllList.AddDllHandle(dllHandle);

    // Locate the DllGetClassObject function exported by the DLL
    IntPtr dllGetClassObjectPtr = Win32NativeMethods.GetProcAddress(dllHandle,
        "DllGetClassObject");
    if (dllGetClassObjectPtr == IntPtr.Zero)
        return null;

    // Convert the native function pointer to a callable delegate
    DllGetClassObject dllGetClassObject =
        (DllGetClassObject)Marshal.GetDelegateForFunctionPointer(
            dllGetClassObjectPtr, typeof(DllGetClassObject));

    // Ask the DLL for the class factory of the filter class
    Guid filterPersistGUID = new Guid(filterPersistClass);
    Guid IClassFactoryGUID = new Guid("00000001-0000-0000-C000-000000000046");
    Object unk;
    if (dllGetClassObject(ref filterPersistGUID,
        ref IClassFactoryGUID, out unk) != 0)
        return null;

    return (unk as IClassFactory);
}
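The listing above relies on a Win32NativeMethods class and a DllGetClassObject delegate that are not shown. One plausible way to declare them is sketched below. The names carry a Sketch suffix because they are assumptions about what such declarations might look like, not the ones from the accompanying source.

```csharp
using System;
using System.Runtime.InteropServices;

// Assumed P/Invoke declarations for the kernel32 functions used above.
public static class Win32NativeMethodsSketch
{
    // Resolves to LoadLibraryW because of CharSet.Unicode.
    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    public static extern IntPtr LoadLibrary(string lpFileName);

    // GetProcAddress only exists in an ANSI version, hence ExactSpelling.
    [DllImport("kernel32.dll", CharSet = CharSet.Ansi, ExactSpelling = true,
        SetLastError = true)]
    public static extern IntPtr GetProcAddress(IntPtr hModule, string lpProcName);

    // Used to release the DLL handles when the application shuts down.
    [DllImport("kernel32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    public static extern bool FreeLibrary(IntPtr hModule);
}

// Managed signature for the native export
//   HRESULT DllGetClassObject(REFCLSID rclsid, REFIID riid, LPVOID* ppv);
// A return value of 0 (S_OK) means ppv holds the requested interface.
[UnmanagedFunctionPointer(CallingConvention.StdCall)]
public delegate int DllGetClassObjectSketch(
    ref Guid rclsid,
    ref Guid riid,
    [MarshalAs(UnmanagedType.Interface)] out object ppv);
```

Returning the raw HRESULT as an int, rather than letting the runtime throw, matches the way the listing compares the result against 0.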
Once we have an IClassFactory object, we can use it to create instances of the class implementing our filter:
private static IFilter LoadFilterFromDll(string dllName,
    string filterPersistClass)
{
    // Get the class factory for the filter class from the DLL
    IClassFactory classFactory = ComHelper.GetClassFactory(dllName,
        filterPersistClass);
    if (classFactory == null)
        return null;

    // Create the filter object and ask for its IFilter interface
    Guid IFilterGUID = new Guid("89BCB740-6119-101A-BCB7-00DD010655AF");
    Object obj;
    classFactory.CreateInstance(null, ref IFilterGUID, out obj);
    return (obj as IFilter);
}
We finally have an IFilter instance that can be passed to our FilterReader (after doing the standard filter initialization code):
IPersistFile persistFile = (filter as IPersistFile);
if (persistFile != null)
{
    // Point the filter at the file, then initialize it
    persistFile.Load(fileName, 0);
    IFILTER_FLAGS flags;
    IFILTER_INIT iflags =
        IFILTER_INIT.CANON_HYPHENS |
        IFILTER_INIT.CANON_PARAGRAPHS |
        IFILTER_INIT.CANON_SPACES |
        IFILTER_INIT.APPLY_INDEX_ATTRIBUTES |
        IFILTER_INIT.HARD_LINE_BREAKS |
        IFILTER_INIT.FILTER_OWNED_VALUE_OK;
    if (filter.Init(iflags, 0, IntPtr.Zero, out flags) == IFilterReturnCode.S_OK)
        return filter;
}
Note that because we didn't use any COM calls during that process, we get a "raw" interface pointer to the filter class and COM does not create any proxy\stubs to protect that interface.
Conclusion
I've been using this approach for several months now without any problems. Here's a summary of the benefits and implications of this approach:
Benefits
- No COM threading issues that cause certain filters not to function correctly.
- No need to mark your thread as [STAThread] when using filters (this is a problem especially with web applications).
- The Adobe PDF filter does not crash at the end of the application.
- Better scalability when dealing with large files.
- Better filter search logic than LoadIFilter (using content type).
Implications
- Bypassing COM may damage your health :) - Actually, I very much enjoy bypassing COM, but FDA regulations force me to have that warning here.
- Once filter DLLs are loaded into your application, they will stay loaded. If this is a problem for you, don't use this approach.
- No COM protection for multi-threaded access to filters (Yeah, so?).
Links and References
 |
This code doesn't crash or anything, but the text isn't extracted from the PDF file.
Has anyone gotten this to work on Win2003 64-bit? It's c# .net code so it shouldn't need a recompile. I tried compiling it to target x86 but that doesn't help either (gets a bad imageformat exception when I try choosing a file)...
Update:
- works w/ the Foxit PDF IFilter
- does *NOT* work with the Adobe PDF IFilter or the TET PDF IFilter
Bug in this test app or in those filters? :-P
Update:
- doesn't work w/ TET PDF filter because the eval version limits you to a 1MB or <10 page PDF file and my test PDF was bigger
Still doesn't work w/ Adobe's PDF IFilter.
modified on Thursday, October 29, 2009 12:04 PM
 |
Hi there.
I have discovered a problem and I hope you can help me somehow.
In LoadAndInitIFilter(..) the call to persistFile.Load(fileName, 0); fails if the filename is something like: "~$w Microsoft Office Word Document.docx"
Of course I have installed the right filters, the problem seems to be just the filename...
To be honest, in that moment the file is also open by Winword (in fact it is its temp file!)... but why should that call fail? It is not exclusively opened...
Thanks for your help, G.
 |
When I try to open up a Microsoft Word document using the sample app (IFilterTester) I get the following stack trace:
System.Runtime.InteropServices.COMException (0x8004170C): Exception from HRESULT: 0x8004170C
at System.Runtime.InteropServices.ComTypes.IPersistFile.Load(String pszFileName, Int32 dwMode)
at EPocalipse.IFilter.FilterLoader.LoadAndInitIFilter(String fileName, String extension) in C:\Documents and Settings\Owner\Desktop\Source\EPocalipse.IFilter\FilterLoader.cs:line 85
at EPocalipse.IFilter.FilterReader..ctor(String fileName) in C:\Documents and Settings\Owner\Desktop\Source\EPocalipse.IFilter\FilterReader.cs:line 109
at IFilterTester.Form1.btnBrowse_Click(Object sender, EventArgs e) in C:\Documents and Settings\Owner\Desktop\Source\Tester\Form1.cs:line 26
at System.Windows.Forms.Control.OnClick(EventArgs e)
I think 0x80041700 is from Filterr.h -> FILTER_E_END_OF_CHUNKS
Has anyone seen this?
Thanks, Drew
 |
This bug appears to be with the Microsoft Word iFilter and not with the iFilter C# project:
C:\Program Files\Microsoft SDKs\Windows Search 3x SDK\Indexing\Filtdump>FiltDump tmp60.doc
FILE: tmp60.doc
Error 0x8004170c loading IFilter
 |
Where?
The filter has already been setup with IFILTER_INIT.HARD_LINE_BREAKS... it means that "soft returns" like in windows are replaced by hard returns. If any hard return is met while parsing, it gets doubled.
 |
Which is the code for "module not found". This happens with Adobe v8 (and perhaps 7) because the IFilter implementation in Acr0RDIF.dll relies on other DLLs in the install location, and these are not found. DLLs are searched for along the PATH, which may not include the Adobe location. This can be worked around with something like this:
string path = Environment.GetEnvironmentVariable("PATH");
path = path + ";" + @"C:\Program Files\Adobe\Reader 8.0\Reader";
Environment.SetEnvironmentVariable("PATH", path);
 |
Hi
Thanks for providing wonderful code.
I'm getting the following error when I close your tester executable after successfully reading the selected PDF file.
.NET-BroadcastEventWindow.2.0.0.0.378734a.0: IFilterTester.exe - Application Error
The instruction at "0x0700609c" referenced memory at "0x00000014". The memory could not be "read".
Thanks Muni
 |
I want to use IPersistStream instead of IPersistFile, because I want to extract text from a stream, not a file. Can you give me some suggestions? I have tried, but failed.
Thanks
 |
I've got a problem when using a PDF IFilter with ASP.NET 2.0. Can somebody help me? My page is quite simple: a file uploader, a button and a textbox. I'm using practically the same source code as the example, with the small difference that in the LoadIFilter(string ext) method I set a default path for the .dll and its class ID instead of using the sample's methods to look them up. I had success using DLLs to extract text from Office 2003 and Office 2007 files, but when using any kind of .dll to get text from a .pdf file (Foxit and various versions of Acrobat), the browser displays the page "impossible to show the page" (or something like that... my browser is in Italian...) without hitting any of the breakpoints I set. The "funny" thing is that the same edited libraries work perfectly in a Windows Forms client. Thanks a lot in advance!!
 |
If I try to get the text from a PowerPoint document, IFilter works, but it returns all the text as one block.
Is there a way to get the text slide by slide?
Is there a way to get the text of just the third slide of a PowerPoint document?
 |
First of all: Very good work.
Some small problems I found:
First, if you load a file with a wrong extension, you get an unhandled exception:
System.Runtime.InteropServices.COMException (0x80030050): "" already exists. (Exception from HRESULT: 0x80030050 (STG_E_FILEALREADYEXISTS))
at System.Runtime.InteropServices.ComTypes.IPersistFile.Load(String pszFileName, Int32 dwMode)
at EPocalipse.IFilter.FilterLoader.LoadAndInitIFilter(String fileName, String extension)
at EPocalipse.IFilter.FilterReader..ctor(String fileName)
at IFilterTester.Form1.btnBrowse_Click(Object sender, EventArgs e)
Second: If you load an unsupported file type (a file type with no IFilter installed), you also get an unhandled exception.
Third: RTF files return their full content including the formatting information. I think it would be better to return only the content text. My question: Does someone have a solution for extracting the plain text from RTF files? Microsoft suggests loading RTF files into a System.Windows.Forms.RichTextBox and then reading the Text property from the control. But if I only want to convert RTF to text and not show the text to the user, this solution is quite crazy.
 |
I just wanted to let people know how to fix this.
Once you've installed the Adobe IFilter (ifilter60.exe) you may run into this message:
"An unhandled exception of type 'System.ArgumentException' occurred in EPocalipse.IFilter.dll Additional information: no filter defined for C:\Users\Ben\Desktop\DCS Blackshark\dcs-bs_flight_manual_eng.pdf"
If you are using a 64 bit machine change the Target Platform to x86 (32 bit) and it'll work.
Hope that helps somebody.
Cheers, Ben
 |
Hello,
This is a very nice article, bravo! I was wondering why you say that an MTA filter cannot be loaded in an STA thread, and vice versa.
There are indeed situations where using in-proc COM objects from an apartment other than the one the component declares can be buggy, but I am not sure that's the case for filters, since the IFilter interface is very simple and has no sinks\callbacks.
If you create your thread in a different apartment than the component's, COM does all the work for you, making sure the COM threading rules for accessing COM objects are not violated.
Please check the table at the end of the article: http://support.microsoft.com/kb/150777
I also noticed that your custom LoadIFilter implementation sometimes fails. I tried it on a machine with Acrobat Reader 9 and it failed to find the filter DLL path. However, I wrote a small application calling the original LoadIFilter function and it succeeded, so I think LoadIFilter does other things than the steps you follow (I know these steps are also described in MSDN).
 |
Is there a way to put text from doc file into a database? I have texts that pass over 200000 chars!
Thanks
 |
Line breaks are not pulled from DOCX files, so all lines are concatenated. This absolutely ruins regex searches that use the beginning-of-line and end-of-line markers. Any suggestions (other than opening the DOCX as a zip and parsing its XML manually)? I've managed to implement ReadLine on just about every other sort of file I've tried, since Chr(13) or vbCrLf, etc., come through in the stream.
 |
Hi
I agree that this is a problem. Especially when you have bulleted lists etc without punctuation, not only are lines concatenated, but the last word of a bullet point will be concatenated with the first word of the next bullet point, causing the text to make no sense.
For the application I am developing, I would really like to rely 100% on the IFilter concept, and not having a lot of different strategies for different document types. And starting to treat docx files as zip files and extracting the content from XML etc. will quickly start polluting the design. But of course, if there is no way to make the IFilter implementation behave for Office 2007, then I guess I will have no choice.
But for now, I hope I will find a solution, or that someone who knows of a way to do it, will respond to this thread
/Brian
 |
My guess is it is the MS IFilter causing the problem.
What kind of system are you running on? The MS IFilter package does not want to work on Vista for me with docx files.
 |
I am running on XP Pro, and everything works just fine except for the fact that newline control characters seem to be removed instead of being replaced by, e.g., a blank character.
 |
I'm also using XP Pro and it works fine. I haven't tried it on my Vista laptop. Anyway, the issue, it turns out, is that the line breaks don't exist as characters in the DOCX file. The DOCX file is actually just a ZIP archive with an XML tree hierarchy within. Either a new IFilter must be written, or exceptions must be made to deal with DOCX files by unzipping them and parsing the XML appropriately.
God, what a pain. :(
 |
It works great, but I failed to extract text from an embedded object in a doc file.
Does anyone know how to do that? Do I need to implement something like IPropertyStorage?
Thanks
 |
I added a new constructor to FilterReader that lets you specify the file extension explicitly:
public FilterReader(string fileName, string extension)
{
    _filter = FilterLoader.LoadAndInitIFilter(fileName, extension);
    if (_filter == null)
        throw new ArgumentException("no filter defined for " + fileName + " with extension " + extension);
}
We store files locally as ".bin", with the original filename being kept in a database.
|