Implementing a TextReader to extract various files contents using IFilter

9 Feb 2011 (3 min read)
A solution that can extract the contents of various file types using an IFilter implementation. Special thanks to Eyal Post and his article 'Using IFilter in C#'.

Introduction

The IFilter interface is an important component of the Microsoft Indexing Service. This project aims to provide a solution that performs well and has a low memory footprint. It does this by exposing the extracted text through a TextReader built on a managed wrapper around the IFilter COM interface.

This article is aimed at developers who have experience using COM from a managed environment. It builds a solution that extracts the contents of various file types through an IFilter implementation.

The solution in the source code package contains a test project, which includes a unit test and a load test. These were created using a higher edition of Visual Studio. If you cannot open the test project, you can remove the IFilterTest project from the solution.

Background

The IFilter component is an in-process COM server that extracts the text and values for a specific file type. The appropriate IFilter component for a file type is called by the Filtering component.

A customized IFilter component can be developed for almost any selected file type. The standard IFilter components supplied with Indexing Service include the following:

File Name       Description
mimefilt.dll    Filters Multipurpose Internet Mail Extension (MIME) files.
nlhtml.dll      Filters HTML 3.0 or earlier files.
offfilt.dll     Filters Microsoft Office files: Microsoft Word, Microsoft Excel, and Microsoft PowerPoint®.
query.dll       Filters plain text files (default filter) and binary files (null filter).

The IFilter interface scans documents for text and properties (also called attributes). It extracts chunks of text from these documents, filtering out embedded formatting and retaining information about the position of the text. It also extracts chunks of values, which are properties of an entire document or of well-defined parts of a document. IFilter provides the foundation for building higher-level applications such as document indexers and application-independent viewers.
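
To make the chunk model more concrete, here is a minimal sketch of how a consumer might stitch text chunks back together using the break type reported with each chunk. The BreakType values mirror the native CHUNK_BREAKTYPE enumeration; the TextChunk and ChunkJoiner types, and the choice of separators, are assumptions made for this illustration and are not part of the article's source package.

```csharp
using System.Collections.Generic;
using System.Text;

// Mirrors the native CHUNK_BREAKTYPE values. The break type describes the
// break between the previous chunk and the current one.
public enum BreakType
{
    NoBreak = 0,        // CHUNK_NO_BREAK
    EndOfWord = 1,      // CHUNK_EOW
    EndOfSentence = 2,  // CHUNK_EOS
    EndOfParagraph = 3, // CHUNK_EOP
    EndOfChapter = 4    // CHUNK_EOC
}

// Hypothetical managed view of one text chunk (illustration only).
public sealed class TextChunk
{
    public TextChunk(BreakType breakBefore, string text)
    {
        BreakBefore = breakBefore;
        Text = text;
    }

    public BreakType BreakBefore { get; private set; }
    public string Text { get; private set; }
}

public static class ChunkJoiner
{
    // Concatenates chunk text, inserting a separator that reflects the break
    // before each chunk. The separators chosen here are an assumption.
    public static string Join(IEnumerable<TextChunk> chunks)
    {
        var sb = new StringBuilder();
        foreach (var chunk in chunks)
        {
            if (sb.Length > 0)
            {
                switch (chunk.BreakBefore)
                {
                    case BreakType.NoBreak:
                        break; // continuation of the previous chunk
                    case BreakType.EndOfWord:
                    case BreakType.EndOfSentence:
                        sb.Append(' ');
                        break;
                    default:
                        sb.Append('\n'); // paragraph or chapter break
                        break;
                }
            }
            sb.Append(chunk.Text);
        }
        return sb.ToString();
    }
}
```

Keeping the break information separate from the text lets the consumer decide how much of the document's layout to preserve.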

[Figure: Picture1.png]

Main Classes Diagram

[Figure: main classes diagram]

IFilter Interface and Mixed Class

The IFilter code:

C#
[ComImport, Guid(Constants.IFilterGUID), 
    InterfaceType(ComInterfaceType.InterfaceIsIUnknown),
    SuppressUnmanagedCodeSecurity, ComVisible(true), AutomationProxy(false)]
public interface IFilter
{
    /// <summary>
    /// The IFilter::Init method initializes a filtering session.
    /// </summary>
    [PreserveSig]
    [MethodImpl(MethodImplOptions.InternalCall,
                MethodCodeType = MethodCodeType.Runtime)]
    IFilterReturnCodes Init(
        // [in] Flag settings from the IFILTER_INIT enumeration for
        // controlling text standardization, property output, embedding
        // scope, and IFilter access patterns.
        [MarshalAs(UnmanagedType.U4)]IFILTER_INIT grfFlags,
        // [in] The size of the attributes array. When nonzero, cAttributes
        // takes precedence over attributes specified in grfFlags. If no
        // attribute flags are specified and cAttributes is zero, the
        // default is given by the PSGUID_STORAGE storage property set,
        // which contains the date and time of the last write to the file,
        // size, and so on; and by the PID_STG_CONTENTS 'contents'
        // property, which maps to the main contents of the file.
        // For more information about properties and property sets, see
        // Property Sets.
        uint cAttributes,
        // [in] Array of pointers to FULLPROPSPEC structures for the
        // requested properties. When cAttributes is nonzero, only the
        // properties in aAttributes are returned.
        FULLPROPSPEC[] aAttributes,
        // [out] Information about additional properties available to the
        // caller; from the IFILTER_FLAGS enumeration.
        out IFILTER_FLAGS pdwFlags);
    /// <summary>
    /// The IFilter::GetChunk method positions the filter at the beginning
    /// of the next chunk, 
    /// or at the first chunk if this is the first call to the GetChunk
    /// method, and returns a description of the current chunk. 
    /// </summary>
    [PreserveSig]
    [MethodImpl(MethodImplOptions.InternalCall,
     MethodCodeType = MethodCodeType.Runtime)]
    IFilterReturnCodes GetChunk(out STAT_CHUNK pStat);
    /// <summary>
    /// The IFilter::GetText method retrieves text (text-type properties)
    /// from the current chunk, 
    /// which must have a CHUNKSTATE enumeration value of CHUNK_TEXT.
    /// </summary>
    [PreserveSig]
    [MethodImpl(MethodImplOptions.InternalCall,
     MethodCodeType = MethodCodeType.Runtime)]
    IFilterReturnCodes GetText(
        // [in/out] On entry, the size of awcBuffer array in wide/Unicode
        // characters. On exit, the number of Unicode characters written to
        // awcBuffer. 
        // Note that this value is not the number of bytes in the buffer. 
        ref uint pcwcBuffer,
        // Text retrieved from the current chunk. Do not terminate the
        // buffer with a character.
        [Out]IntPtr awcBuffer);
    /// <summary>
    /// The IFilter::GetValue method retrieves a value (public
    /// value-type property) from a chunk, 
    /// which must have a CHUNKSTATE enumeration value of CHUNK_VALUE.
    /// </summary>
    [PreserveSig]
    [MethodImpl(MethodImplOptions.InternalCall,
     MethodCodeType = MethodCodeType.Runtime)]
    IFilterReturnCodes GetValue(
        // Allocate the PROPVARIANT structure with CoTaskMemAlloc. Some
        // PROPVARIANT 
        // structures contain pointers, which can be freed by calling the
        // PropVariantClear function. 
        // It is up to the caller of the GetValue method to call the
        //   PropVariantClear method.
        // ref IntPtr ppPropValue
        // [MarshalAs(UnmanagedType.Struct)]
        out PROPVARIANT PropVal);
    /// <summary>
    /// The IFilter::BindRegion method retrieves an interface representing
    /// the specified portion of the object. 
    /// Currently reserved for future use.
    /// </summary>
    [PreserveSig]
    [MethodImpl(MethodImplOptions.InternalCall,
     MethodCodeType = MethodCodeType.Runtime)]
    IFilterReturnCodes BindRegion(ref FILTERREGION origPos,ref Guid riid,
        ref object ppunk);
}
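
The IFilterReturnCodes type used in the signatures above is not listed in the article. As a rough sketch of what such a type might look like, the enum below uses HRESULT values from the Windows SDK header filterr.h. The names FilterHResult and FilterHResults are made up for this illustration, only a subset of codes is shown, and the enum in the source package may differ.

```csharp
// Sketch of an HRESULT-backed enum (illustrative type name, partial list).
public enum FilterHResult : uint
{
    S_OK = 0x00000000,
    E_INVALIDARG = 0x80070057,
    FILTER_E_END_OF_CHUNKS = 0x80041700, // GetChunk: no more chunks
    FILTER_E_NO_MORE_TEXT = 0x80041701,  // GetText: chunk text exhausted
    FILTER_E_NO_TEXT = 0x80041705,       // GetText: chunk is not a text chunk
    FILTER_S_LAST_TEXT = 0x00041709,     // GetText: this was the last text
    FILTER_E_UNKNOWNFORMAT = 0x8004170C  // file format not recognized
}

public static class FilterHResults
{
    // An HRESULT denotes failure when its severity (top) bit is set.
    public static bool IsFailure(FilterHResult code)
    {
        return ((uint)code & 0x80000000u) != 0;
    }

    // True when GetText signals that the current chunk has no further text.
    public static bool EndsChunkText(FilterHResult code)
    {
        return code == FilterHResult.FILTER_S_LAST_TEXT
            || code == FilterHResult.FILTER_E_NO_MORE_TEXT;
    }
}
```

Note that FILTER_S_LAST_TEXT is a success code while FILTER_E_NO_MORE_TEXT is a failure code, so a plain "failed or not" check is not enough to drive the reading loop.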

The mixed class:

C#
public class MixedIFilterClass : IFilterClass, IDisposable
{
    public override string TmpFilePath
    {
        get;
        set;
    }
    public override Object InternalObj
    { 
        get;
        set;
    }
    //private MixedIFilterClass()
    //{
    //    InternalPtr = Marshal.GetComInterfaceForObject(this, typeof(IFilter));
    //}
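    ~MixedIFilterClass()
    {
        Dispose(false);
    }
    protected virtual void Dispose(bool disposing)
    {
        // Release the underlying COM object, if one is attached.
        if (null != InternalObj)
        {
            Marshal.ReleaseComObject(InternalObj);
            InternalObj = null;
        }
        // Best-effort removal of the temporary file, if any.
        if (null != TmpFilePath)
        {
            try
            {
                File.Delete(TmpFilePath);
                TmpFilePath = null;
            }
            catch
            {
                // Deliberately ignored: cleanup must not throw,
                // especially when running on the finalizer thread.
            }
        }
    }
    public void Dispose()
    {
        Dispose(true);
        // The finalizer has nothing left to do after an explicit Dispose.
        GC.SuppressFinalize(this);
    }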
}

How it Works

The extraction process consists of two steps:

  1. Get the current chunk
  2. Call GetText() on the chunk

Step 1: Get the current chunk

Call GetChunk to move to the next chunk. Once there are no more chunks, terminate the reading process.

C#
var returnCode = _filter.GetChunk(out chunk);

Step 2: Call GetText() on the chunk

Depending on the state returned by the GetChunk method, call the GetText method on the text chunk. When GetText signals the end of the current chunk, repeat step 1.

C#
while (true)
{
    if (remaining <= _topSize)
        return;
    bool useBuffer = !forceDirectlyWrite && remaining < BufferSize;
    var size = BufferSize;
    if (useBuffer)
        size -= _topSize;
    else
    {
        if (remaining < BufferSize)
            size = (uint)remaining;
    }
    if (size < ResBufSize)
        size = ResBufSize;
    var handle = GCHandle.Alloc(useBuffer ? _buffer : array,
        GCHandleType.Pinned);
    var ptr = Marshal.UnsafeAddrOfPinnedArrayElement(
        useBuffer ? _buffer : array, useBuffer ? (int)_topSize : offset);
    IFilterReturnCodes returnCode;
    try
    {
#if DEBUG
        Trace.Write(size);
#endif
        returnCode = _filter.GetText(ref size, ptr);
#if DEBUG
        Trace.WriteLine("->"+size);
#endif
    }
    finally 
    {
        handle.Free();
    }
    if(returnCode != IFilterReturnCodes.FILTER_E_NO_TEXT)
    {
        if (useBuffer)
            _topSize += size;
        else
        {
            offset += (int)size;
            remaining -= (int)size;
        }
        if(_topSize > BufferSize)
        {
            _resTopSize = _topSize - BufferSize;
            _topSize = BufferSize;
        }
    }
    if (returnCode == IFilterReturnCodes.FILTER_S_LAST_TEXT || 
        returnCode == IFilterReturnCodes.FILTER_E_NO_MORE_TEXT ||
        (returnCode == IFilterReturnCodes.FILTER_E_NO_TEXT && size != 0) ||
        (null == FileName && IgnoreError && returnCode == 
        IFilterReturnCodes.E_INVALIDARG))
    {
        _endOfCurrChunk = true;
        if (remaining <= _topSize)
            return;
        break;
    }
    if(returnCode != IFilterReturnCodes.S_OK)
    {
        throw new Exception(
            "An error occurred while getting text from the current filter",
            new Exception(returnCode.ToString()));
    }
}
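
Stripped of buffering and pinning, the two steps above amount to a nested loop. The sketch below shows that control flow against a small in-memory stand-in, so it can run without COM. The StepResult enum, the IChunkSource interface, and the FakeChunkSource class are hypothetical names introduced for this illustration; they simplify the real IFilter calls and are not the article's FilterReader implementation.

```csharp
using System.Collections.Generic;
using System.Text;

// Simplified outcome of a call (stands in for the IFilter return codes).
public enum StepResult { Ok, LastText, NoMoreText, EndOfChunks }

// Hypothetical managed stand-in for the two IFilter calls used by the loop.
public interface IChunkSource
{
    StepResult GetChunk(out bool isTextChunk);
    StepResult GetText(char[] buffer, out int written);
}

public static class ChunkLoop
{
    public static string ReadAll(IChunkSource source, int bufferSize)
    {
        var result = new StringBuilder();
        var buffer = new char[bufferSize];
        bool isText;
        // Step 1: advance chunk by chunk until the source runs out.
        while (source.GetChunk(out isText) != StepResult.EndOfChunks)
        {
            if (!isText)
                continue; // value chunks carry no text
            // Step 2: drain the text of the current chunk.
            while (true)
            {
                int written;
                var code = source.GetText(buffer, out written);
                if (code == StepResult.NoMoreText)
                    break;
                result.Append(buffer, 0, written);
                if (code == StepResult.LastText)
                    break;
            }
        }
        return result.ToString();
    }
}

// In-memory fake: every string is served as one text chunk.
public sealed class FakeChunkSource : IChunkSource
{
    private readonly Queue<string> _chunks;
    private string _current = "";
    private int _pos;

    public FakeChunkSource(IEnumerable<string> chunks)
    {
        _chunks = new Queue<string>(chunks);
    }

    public StepResult GetChunk(out bool isTextChunk)
    {
        isTextChunk = true;
        if (_chunks.Count == 0)
            return StepResult.EndOfChunks;
        _current = _chunks.Dequeue();
        _pos = 0;
        return StepResult.Ok;
    }

    public StepResult GetText(char[] buffer, out int written)
    {
        written = System.Math.Min(buffer.Length, _current.Length - _pos);
        if (written == 0)
            return StepResult.NoMoreText;
        _current.CopyTo(_pos, buffer, 0, written);
        _pos += written;
        return _pos == _current.Length ? StepResult.LastText : StepResult.Ok;
    }
}
```

The article's loop does more than this: it writes directly into the caller's array when possible, keeps leftover characters in an internal buffer, and treats some error codes as the end of a chunk.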

Using the Code

The following code uses just a filename:

C#
var fileName = "";
using (var reader = new FilterReader(fileName))
{
    reader.Init();
    //
    // write your code here;
    //
}

These overloads take a file name together with either an extension or an integer argument:

C#
using (var reader = new FilterReader(fileName, ".docx"))
{
    reader.Init();
    //
    // write your code here;
    //
}
using (var reader = new FilterReader(fileName, 0x1000))
{
    reader.Init();
    //
    // write your code here;
    //
}

The code below shows how to pass a byte array into a FilterReader:

C#
byte[] bytes = null;
using (var reader = new FilterReader(bytes, ".docx"))
{
    reader.Init();
    //
    // write your code here;
    //
}
using (var reader = new FilterReader(bytes, ".docx", 0x1000))
{
    reader.Init();
    //
    // write your code here;
    //
}
using (var reader = new FilterReader(bytes))
{
    reader.Init();
    //
    // write your code here;
    //
}
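
What goes in place of "write your code here" depends on the application. As one possibility, and assuming FilterReader derives from TextReader as the article's title suggests, the reader can be consumed in fixed-size blocks so that the whole document text is never held in memory at once. The ReaderConsumer class below is a hypothetical helper written for this illustration; it relies only on TextReader.Read(char[], int, int).

```csharp
using System.IO;

public static class ReaderConsumer
{
    // Counts non-whitespace characters by reading fixed-size blocks.
    // Works with any TextReader; nothing here is specific to FilterReader.
    public static long CountNonWhitespace(TextReader reader, int blockSize)
    {
        var block = new char[blockSize];
        long count = 0;
        int read;
        // Read returns 0 once the end of the text has been reached.
        while ((read = reader.Read(block, 0, block.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (!char.IsWhiteSpace(block[i]))
                    count++;
            }
        }
        return count;
    }
}

// Possible usage inside one of the using blocks above (assumption:
// FilterReader is a TextReader):
//     reader.Init();
//     long n = ReaderConsumer.CountNonWhitespace(reader, 4096);
```

The same loop shape works for feeding an indexer or writing the text to another stream.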

History

  • 2/9/2010 - Updated download files.
  • 3/03/2009 - v2.0: Replaced Copyright and added OLE FMTIDs, Windows Search Schema, and OLE property interfaces for further needs.
  • 2/05/2009 - v2.0: Reconstructed some phrases (thanks Sean Kenney for reviewing this article).
  • 1/07/2009 - v2.0: Added comments for Adobe's PDF filter and small changes.
  • 12/09/2008 - v2.0 [Stable release].
  • 11/24/2008 - v1.0 [Initial release].

License

This article, along with any associated source code and files, is licensed under the Eclipse Public License 1.0.


Written By
Technical Lead, HP
China

Comments and Discussions

 
Question: PowerShell Interface (Mark Farzan, 31-Oct-23 12:22)
Question: Exception when trying to use this to read the provided sample PDF (Brian C Hart, 3-Sep-17 10:42)
Bug: Exception from HRESULT : 0x8004170C (Member 1096235812-Oct-14 5:11)
General: Re: Exception from HRESULT : 0x8004170C (gerfasan2323-Apr-15 6:51)
Suggestion: Support for PDF file indexing (Daniel Kornev, 4-May-14 14:03)
General: Re: Support for PDF file indexing (vladimir-v77, 21-Sep-14 23:39)
Question: URGENT: Unable to extract text from docx, xlsx files (Shiva Wahi, 16-Oct-13 20:23)
Answer: Re: URGENT: Unable to extract text from docx, xlsx files (vladimir-v77, 3-Oct-14 2:19)
Question: Unable to retrieve content from .pdf, .docx, .xlsx... (RicardoRomão, 1-Jul-13 7:44)
Answer: Re: Unable to retrieve content from .pdf, .docx, .xlsx... (santoshthankachan, 28-Aug-13 19:54)
Question: Loading from a Stream? (Matt Johnson, 14-Jan-13 3:25)
Bug: Peek() still doesn't work (Member 874096323-May-12 4:13)
Bug: Frequent use of FilterReader() breaks it (Snæbjørn, 23-Apr-12 1:59)
Question: Unable to extract content from rtf file (Rich Text Format) (ajayk.gupta, 20-Feb-12 22:21)
Answer: Re: Unable to extract content from rtf file (Rich Text Format) (Snæbjørn, 23-Apr-12 4:16)
Answer: Re: Unable to extract content from rtf file (Rich Text Format) (vladimir-v77, 12-Dec-14 4:41)
General: My vote of 5 (Manoj Kumar Choubey, 9-Feb-12 22:15)
General: My vote of 5 (Jonathan Manley, 29-Jun-11 9:19)
Question: Reading document properties by using the IFilter (Ramakrishnan Seerangasamy, 1-Jun-11 0:07)
General: My vote of 5 (alessandro72, 16-Feb-11 23:59)
General: a bug while text extracting from "docx" (Emre Özgür İnce, 11-Feb-11 2:40)
General: Re: a bug while text extracting from "docx" (alex_zero, 11-Feb-11 18:58)
General: Re: a bug while text extracting from "docx" (Emre Özgür İnce, 13-Feb-11 20:36)
General: Re: a bug while text extracting from "docx" (alex_zero, 14-Feb-11 17:23)
Hi Emre, I can't see any significantly wrong behavior in the result.
See the extracted text below:



LABORATUVAR SERVİSLERİ GENEL İÇERİK DOKÜMANI



Belge Bilgileri
Belge başlığı: LABORATUVAR SERVİSLERİ GENEL İÇERİK DOKÜMANI Belge dosya adı: LABORATUVAR SERVİSLERİ GENEL İÇERİK DOKÜMANI.docx Revizyon numarası: 1.0 Düzenleyen: Tuğba Çağlayan Düzenleme Tarihi: 02.02.2010 Durum: Tamamlandı


Belge Onayları

İsmail Komaç Genel Müdür Yardımcısı İmza Tarih
Şule Tabak Yazılım Müdürü İmza Tarih
Ömür Şimşek Kalite Güvence Mühendisi İmza Tarih Revizyon Geçmişi
Revizyon Tarih Yazar Değişikliğin Tanımı 1.0 02.02.2010 Tuğba Çağlayan İlk Sürüm

İçindekiler
HASTA TETKİK LİSTELE 3
KURUM TETKİK PANEL LİSTESİ EKLE 3
KURUM TETKİK PANEL LİSTESİ GETİR 3
TETKİK DURUM GÜNCELLE 3
TETKİK GETİR 3
TETKİK KAYDET 3
TETKİK LİSTELE 3
TETKİK SONUÇ EKLE 3



HASTA TETKİK LİSTELE

İSTEK
DOKUMAN BILGISI
HASTA BILGISI
DOGUM TARIHI RESMI
DOGUM TARIHI BEYAN
CINSIYET RESMI
CINSIYET BEYAN
ISTEM DURUMU

CEVAP
DOKUMAN BILGISI
LABORATUVAR TETKIK LISTE BILGISI
TETKIK LISTE BILGISI
TETKIK ID
ANA ID
KOK ID
PROTOKOL NO
BARKOD
HASTA BILGISI
DOGUM TARIHI RESMI
DOGUM TARIHI BEYAN
CINSIYET RESMI
CINSIYET BEYAN
KURUM KAYIT BILGISI
TETKIK BILGISI
TETKIK KAYIT LISTE
ISTEM DURUMU
HEKIM BILGISI

KURUM TETKİK PANEL LİSTESİ EKLE

İSTEK
DOKUMAN BILGISI
KURUM TETKİK LİSTESİ
ÜNİTE BİLGİSİ
PANEL BİLGİSİ
PANELLER
TETKİ BİLGİSİ
TETKİK
AKTIF

CEVAP
DOKUMAN BILGISI






KURUM TETKİK PANEL LİSTESİ GETİR

İSTEK
DOKUMAN BILGISI

CEVAP
DOKUMAN BILGISI
KURUM TETKİK LİSTESİ
ÜNİTE BİLGİSİ
PANEL BİLGİSİ
PANELLER
TETKİ BİLGİSİ
TETKİK
AKTIF



TETKİK DURUM GÜNCELLE


İSTEK
DOKUMAN BILGISI
TETKIK ID
ANA ID
KOK ID
PROTOKOL NO
BARKOD

CEVAP
DOKUMAN BILGISI

TETKİK GETİR

İSTEK
DOKUMAN BILGISI
LABORATUVAR TETKIK ISTEK BILGISI
TETKIK ID
ANA ID
KOK ID
PROTOKOL NO
BARKOD

CEVAP
DOKUMAN BILGISI
TETKIK KAYIT BILGISI
TETKIK ID
ANA ID
KOK ID
PROTOKOL NO
BARKOD
HASTA BILGISI
DOGUM TARIHI RESMI
DOGUM TARIHI BEYAN
CINSIYET RESMI
CINSIYET BEYAN
TETKIK BILGISI
TETKIK KAYIT
TETKIK KODU
ALINDIGI TARIH
ALINDIGI SAAT
ISTEM ACIKLAMA
SONUC TARIH
SONUC SAAT
SONUC ACIKLAMA
SONUC DOSYA BILGISI
TANI BILGISI
TANI KOD
SONUC DOSYA BILGISI
DOSYA KAYIT
DOSYA
DOSYA UZANTISI


TETKİK KAYDET

İSTEK
DOKUMAN BILGISI
TETKIK KAYIT BİLGİSİ
TETKIK ID
ANA ID
KOK ID
PROTOKOL NO
BARKOD
HASTA BİLGİSİ
DOĞUMTARİHİRESMİ
DOĞUMTARİHİBEYAN
CİNSİYETRESMİ
CİNSİYETBEYAN
TETKİK BİLGİSİ
TETKİK KAYIT
TETKİK KODU
ALINDIGI TARIH
ALINDIGI SAAT
ISTEM ACIKLAMA
SONUC TARIH
SONUC SAAT
SONUC ACIKLAMA
SONUC DOSYA BILGISI
TANI BİLGİSİ
TANI KOD
ISTEM DURUMU
SONUC DOSYA BİLGİSİ
DOSYA KAYIT
DOSYA
DOSYA UZANTISI
SONUC ACIKLAMA

CEVAP
DOKUMAN BILGISI



TETKİK LİSTELE

İSTEK
DOKUMAN BILGISI
KURUM KAYIT BILGISI


CEVAP
DOKUMAN BILGISI
LABORATUVAR TETKIK LISTE BILGISI
TETKIK LISTE BILGISI
TETKIK ID
ANA ID
KOK ID
PROTOKOL NO
BARKOD
HASTA BILGISI
DOGUM TARIHI RESMI
DOGUM TARIHI BEYAN
CINSIYET RESMI
CINSIYET BEYAN
TETKIK BILGISI
TETKIK KAYIT LISTE



TETKİK SONUÇ EKLE

İSTEK
DOKUMAN BILGISI
LABORATUVAR TETKIK SONUC BİLGİSİ
TETKİK ID
ANA ID
KOK ID
PROTOKOL NO
BARKOD
SONUC DOSYA BİLGİSİ
DOSYA KAYIT
DOSYA
DOSYA UZANTISI
TETKİK KAYIT SONUC BİLGİSİ
TETKİK KAYIT SONUÇ
TETKİK KODU
SONUC TARIH
SONUC SAAT
SONUC ACIKLAMA
SONUC DOSYA BILGISI




CEVAP
DOKUMAN BILGISI
EK


DOKUMAN BILGISI










AHBSv40MimariTasarimRaporu (2).docx <user information> ( Software Productivity Centre, 1997




AHBS LABORATUVAR SERVİSLERİ GENEL İÇERİK DOKÜMANI Doküman No: Tarih: 15.01.2010 Revizyon No: 2.0 Hazırlayan: Tuğba Çağlayan

Sentim Bilişim Teknolojileri A.Ş. ® 2008 Sayfa 2 / 2

Doküman No: xxxxxxx r1.0 Doküman Revizyon Tarihi:25.04.2008


DOKUMAN
KOD
KOD SISTEM KOD
KOD SISTEM AD
KOD
AD
ID
ANA ID
KOK ID
PROTOKOL NO
BARKOD
TARIH
KULLANICI
KURUM BILGISI
KURUM KAYIT BILGISI
LOKASYON
TELEFON
General: Re: a bug while text extracting from "docx" (Emre Özgür İnce, 14-Feb-11 23:29)
