Desktop Search Application: Part 1

Dan Letecky, 6 Jul 2006
Building an application that searches your Office documents in tenths of a second.

Sample Image - screenshot1.gif

Introduction

Let's take the following exercise: Build a C# application that instantly searches your Documents folder. We need the application to:

  • Search the contents of Office documents (Word, Excel, PowerPoint), i.e., DOC, XLS, and PPT files.
  • Search the contents of HTML documents.
  • Search the contents of text documents.
  • Search quickly (i.e., give results in less than a second).
  • Allow documents to be opened directly from the results list.

Update: This Desktop Search application is now part of the Seekafile Server 1.0 - Open-Source Indexing Server. The server indexes automatically in the background, and you can search the index through a Windows Forms client application. See also the Seekafile Server roadmap.

Task 1: Full-Text Indexing

For the indexing and searching, we will use an excellent search engine called DotLucene. It's a C# port of Java Lucene, maintained by George Aroush. It has many great features:

  • Very good performance.
  • Ranked search results.
  • Search query highlighting in results.
  • Searches structured and unstructured data.
  • Metadata searching (query by date, search custom fields...).
  • Index size approximately 30% of the indexed text.
  • Can also store the full text of indexed documents.
  • Pure managed .NET in a single assembly.
  • Very friendly licensing (Apache Software License 2.0).
  • Localizable (support for Brazilian, Czech, Chinese, Dutch, English, French, German, Japanese, Korean, and Russian included in the DotLucene National Language Support Pack).
  • Extensible (source code included).

For more details on creating the index and searching, see my previous article: DotLucene: Full-Text Search for Your Intranet or Website using 37 Lines of Code.
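As a rough illustration of the indexing and searching steps, the following sketch builds a one-document index and then queries it. It assumes the DotLucene 1.4-era API (the static Field.Keyword and Field.UnStored factories and the static QueryParser.Parse); later Lucene.Net releases changed these calls. The class name, the method names, and the "path" and "text" field names are made up for this example and are not taken from the application's source.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

// Hypothetical helper; assumes a reference to the DotLucene 1.4-era assembly.
public static class IndexSketch
{
    // Creates a new index at indexPath that contains a single document.
    public static void BuildIndex(string indexPath, string path, string text)
    {
        // 'true' creates the index from scratch, replacing any existing one.
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);
        try
        {
            Document doc = new Document();
            doc.Add(Field.Keyword("path", path));   // stored as-is, not tokenized
            doc.Add(Field.UnStored("text", text));  // tokenized and indexed, not stored
            writer.AddDocument(doc);
            writer.Optimize();
        }
        finally
        {
            writer.Close();
        }
    }

    // Returns the stored "path" of every hit, best match first.
    public static List<string> Search(string indexPath, string queryText)
    {
        List<string> paths = new List<string>();
        IndexSearcher searcher = new IndexSearcher(indexPath);
        try
        {
            Query query = QueryParser.Parse(queryText, "text", new StandardAnalyzer());
            Hits hits = searcher.Search(query);
            for (int i = 0; i < hits.Length(); i++)
                paths.Add(hits.Doc(i).Get("path"));
        }
        finally
        {
            searcher.Close();
        }
        return paths;
    }
}
```

Storing only the path and leaving the text unstored keeps the index small. The trade-off is that a text sample cannot be shown from the index alone.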

Task 2: Parsing Office Documents

As DotLucene can index only plain text, we need to parse the Office documents and extract the text from them. Reading their binary structure isn't an easy job. However, on Windows 2000 and later, we can use the IFilter interface, which is part of the Windows Indexing Service. The Indexing Service is installed by default on all Windows 2000+ systems (no Office installation is required).

The IFilter API is also being used by the Windows Desktop Search (MSN Search Toolbar) and Lookout, so you don't have to be afraid that we will use something obscure to parse the documents. It can also parse other file types if you install the appropriate filter.

Working with the IFilter interface requires a lot of COM interop, which is a bit tricky. After tweaking the samples available on the web, I finally had code that worked correctly in most cases:

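// Requires: using System; using System.Text; using System.Runtime.InteropServices;
// IFilter, STAT_CHUNK, IFILTER_INIT, CHUNKSTATE, Constants and loadIFilter
// are the COM interop declarations that accompany this article's source.
public static string Parse(string filename)
{
  IFilter filter = null;
  try {
    StringBuilder plainTextResult = new StringBuilder();
    filter = loadIFilter(filename);
    STAT_CHUNK ps = new STAT_CHUNK();
    IFILTER_INIT mFlags = 0;
    uint i = 0;
    filter.Init(mFlags, 0, null, ref i);
    // GetChunk returns 0 (S_OK) for as long as another chunk is available.
    int resultChunk = filter.GetChunk(out ps);
    while (resultChunk == 0)
    {
      // Only text chunks are extracted; other chunk types are skipped.
      if (ps.flags == CHUNKSTATE.CHUNK_TEXT)
      {
        int resultText = 0;
        // Keep reading until GetText reports a code other than success
        // or FILTER_S_LAST_TEXT.
        while (resultText == Constants.FILTER_S_LAST_TEXT || resultText == 0)
        {
          // On input: buffer capacity; on output: number of characters written.
          uint sizeBuffer = 60000;
          StringBuilder sbBuffer = new StringBuilder((int)sizeBuffer);
          resultText = filter.GetText(ref sizeBuffer, sbBuffer);
          if (sizeBuffer > 0 && sbBuffer.Length > 0)
          {
            // Defensive: never read past the marshaled buffer contents.
            int length = Math.Min((int)sizeBuffer, sbBuffer.Length);
            plainTextResult.Append(sbBuffer.ToString(0, length));
          }
        }
      }
      resultChunk = filter.GetChunk(out ps);
    }
    return plainTextResult.ToString();
  }
  finally
  {
    // Release the COM object even if parsing fails.
    if (filter != null)
      Marshal.ReleaseComObject(filter);
  }
}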

Assembling the Application

In short:

  • The index can only be built from scratch.
  • The results do not include a text sample from the found document.
  • We are skipping the files that can't be parsed successfully.
  • We are loading the associated explorer icon for each document in the results.
  • You can choose the folder to be indexed; by default, it's your Documents folder.
  • The indexed file types are hard-coded (txt, htm/html, doc, xls, ppt).
  • The index is stored in your profile in the Local Settings/Application Data/DesktopSearch folder.
  • Remember that it is possible to search while the indexing is in progress.
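The file selection and skip-on-failure behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the class, the method, and the parse delegate are hypothetical names, and the delegate stands in for a text extractor such as the Parse method shown earlier.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper illustrating the folder scan; not the article's source.
public static class FolderScan
{
    // The hard-coded file types named in the list above.
    static readonly HashSet<string> IndexedExtensions = new HashSet<string>(
        new[] { ".txt", ".htm", ".html", ".doc", ".xls", ".ppt" },
        StringComparer.OrdinalIgnoreCase);

    // Returns path -> extracted text for every file that parsed successfully.
    public static Dictionary<string, string> ExtractAll(
        string root, Func<string, string> parse)
    {
        var result = new Dictionary<string, string>();
        foreach (string file in Directory.GetFiles(root, "*", SearchOption.AllDirectories))
        {
            if (!IndexedExtensions.Contains(Path.GetExtension(file)))
                continue; // not one of the indexed file types

            try
            {
                result[file] = parse(file);
            }
            catch (Exception)
            {
                // Files that can't be parsed are skipped rather than aborting the run.
            }
        }
        return result;
    }
}
```

The default locations could be resolved with Environment.GetFolderPath: SpecialFolder.Personal for the Documents folder, and SpecialFolder.LocalApplicationData combined with "DesktopSearch" for the index. Note that Directory.GetFiles throws if a subfolder is inaccessible, so a more robust scan would walk the directories one level at a time.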

Performance

Some statistics (Athlon XP 2000+, 1GB RAM, Seagate SATA drive 7200 RPM):

  • Documents indexed: 1185
  • Total size of indexed documents: 120,690,622 bytes
  • Rebuilding the index took: 5 minutes 55 seconds
  • Index size: 4,339,950 bytes
  • Search time (including the rendering on display): from 0.0937 seconds (25 found items) to 0.3125 seconds (213 found items)
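To put these figures in proportion, the short calculation below derives the index-to-source ratio and the indexing throughput from the numbers listed above. The class and member names are illustrative only.

```csharp
using System;
using System.Globalization;

// Derived figures from the statistics above; names are hypothetical.
public static class IndexStats
{
    public const long DocumentCount = 1185;
    public const long SourceBytes = 120690622;
    public const long IndexBytes = 4339950;
    public const double RebuildSeconds = 5 * 60 + 55; // 5 min 55 s = 355 s

    public static double IndexToSourcePercent()
    {
        return 100.0 * IndexBytes / SourceBytes;
    }

    public static double DocumentsPerSecond()
    {
        return DocumentCount / RebuildSeconds;
    }

    public static double KibibytesPerSecond()
    {
        return SourceBytes / RebuildSeconds / 1024.0;
    }

    public static void PrintSummary()
    {
        CultureInfo inv = CultureInfo.InvariantCulture;
        // Index size relative to the size of the source files: about 3.6 %
        Console.WriteLine(IndexToSourcePercent().ToString("F1", inv) + " % index/source");
        // Indexing rate: about 3.34 documents per second
        Console.WriteLine(DocumentsPerSecond().ToString("F2", inv) + " documents/s");
        // Indexing throughput: about 332 KiB per second
        Console.WriteLine(KibibytesPerSecond().ToString("F0", inv) + " KiB/s");
    }
}
```

The 3.6% ratio is measured against file size on disk, not against the extracted text, so it is not directly comparable with the "approximately 30% of the indexed text" figure quoted for DotLucene; Office files presumably contain a good deal of non-text data.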

To Be Continued...

In the next part of the article, we will extend this simple application with the following features:

  • The indexing will be handled by a separate application, which will update the index continuously in the background.
  • The results will contain a sample of the document with highlighted query words.
  • We will index and search the file name and the last modified date.

Resources and Acknowledgements

Search Engine:

Seekafile Server - Open-Source Indexing Server

Office Documents Parsing:

Appearance:

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

About the Author

Dan Letecky
Czech Republic
My open-source event calendar/scheduling AJAX controls:

DayPilot for JavaScript/HTML5/jQuery
DayPilot for ASP.NET
DayPilot for MVC
DayPilot for Java

Comments and Discussions

 
Question: Part 2?
james_carter, 18-Jul-06 2:45

Hi there,
have you any plans to write Part 2 of this excellent article?
I am particularly interested in the possibility of a preview with the search terms highlighted.

Obviously I accept that you are busy, but if you could give me a couple of pointers on how to achieve this myself, I would be very grateful.

Cheers and thanks for your time,

james
Last Updated 6 Jul 2006
Article Copyright 2005 by Dan Letecky