
Seekafile Server 1.0 - Flexible open-source search server

Dan Letecky, 8 Mar 2006
A Windows Service that indexes DOC, PDF, XLS, PPT, RTF, HTML, TXT, XML, and other file formats. Desktop and ASP.NET search samples included.

Sample ASP.NET client

Sample Windows Forms client

Introduction

It has never been easier to create applications with search capabilities: the open-source DotLucene [dotlucene.net] library lets you build powerful, fast full-text search applications, and it is easy to use. Let's demonstrate this by exploring Seekafile Server [seekafile.org], a flexible indexing server with capabilities similar to those of Windows Indexing Service [microsoft.com].

This article is a follow-up to Desktop Search Application: Part 1. In that article, I discussed indexing and searching Office documents using DotLucene [dotlucene.net]. This time we will build a more serious application that can either be used directly or serve as study material for the practical use of DotLucene.

In this article, you will learn:

  • How to perform indexing in the background.
  • How to update documents in a DotLucene index.
  • How to create queries for DotLucene programmatically.
  • How to use IFilter to parse Office documents, Adobe PDF and other file types correctly (it includes the updated parsing code from Desktop Search Application: Part 1).

Features

Seekafile Server [seekafile.org] is a Windows service that indexes documents in the specified directories and watches them for changes.

  • Background indexing
    • The indexer runs as a Windows service.
    • You specify the directories to be watched for changes in the configuration file.
    • The indexer works in the background (it doesn't slow down other operations).
    • It recognizes any change within a second.
  • Powered by DotLucene [dotlucene.net]
    • Super-fast searching.
    • The index is stored in DotLucene/Lucene 1.3+ compatible format.
    • The index can be accessed directly from other applications (you can search even when the indexing is in progress).
    • Access the index from any custom application (ASP.NET, Windows Forms application, Java application).
  • Built-in support for common file formats:
    • Microsoft PowerPoint (PPT)
    • Microsoft Word (DOC)
    • Microsoft Excel (XLS)
    • HTML (HTM/HTML)
    • Text files (TXT)
    • Rich Text Format (RTF)
  • Supports custom plug-ins written in C# or VB.NET.
  • Supports IFilter for searching other extensions:
    • Adobe Acrobat (PDF)
    • Microsoft Visio (VSD)
    • XML
    • and others
  • Runs on Windows 2000/XP/2003

How it works

Architecture

This is an overview of the architecture:

The architecture is index-centric. It uses the index to communicate with the client search applications. The index is flexible enough to allow this:

  • It is possible to search the index while the Seekafile Server is modifying it.
  • There can be multiple clients accessing the index simultaneously.
  • The changes are visible immediately to all the clients.
  • The only information clients need to know is the index location and the available DotLucene document fields [dotlucene.net].
  • The index is compatible with the Java version - you can access it from a Java client as well.

Watching changes

This is an overview of the indexing process:

  1. When the service starts, it checks whether an index already exists at the specified location; if not, it creates a new one:
    if (!IndexReader.IndexExists(cfg.IndexPath))
    {
        Log.Echo("Creating a new index");
        IndexWriter writer = new IndexWriter(cfg.IndexPath, 
                              new StandardAnalyzer(), true);
        writer.Close();
    }
  2. It goes through all the indexed directories and adds all the files to the IndexerQueue (to ensure that everything is indexed properly):
    foreach (string folder in cfg.Items)
    {
        IndexerQueue.Add(folder);
        startWatcher(folder);
    }
  3. It starts the FileSystemWatcher to watch all file changes in the indexed directories:
    private void startWatcher(string directory)
    {
        watcher = new FileSystemWatcher();
        watcher.Path = directory;
    
        watcher.NotifyFilter = NotifyFilters.LastWrite | 
                                 NotifyFilters.FileName | 
                                 NotifyFilters.DirectoryName;
        watcher.IncludeSubdirectories = true;
        
        watcher.Filter = "";
        
        watcher.Changed += new FileSystemEventHandler(OnChanged);
        watcher.Created += new FileSystemEventHandler(OnChanged);
        watcher.Deleted += new FileSystemEventHandler(OnChanged);
        watcher.Renamed += new RenamedEventHandler(OnRenamed);
    
        // start watching
        watcher.EnableRaisingEvents = true;
    }
  4. If there is a change event, it adds the file to the IndexerQueue:
    private void OnChanged(object source, FileSystemEventArgs e)
    {
        // skip directory changes if it's not a name change
        if (Directory.Exists(e.FullPath) && 
                e.ChangeType == WatcherChangeTypes.Changed)
            return;
    
        IndexerQueue.Add(e.FullPath);
            
    }
    
    private void OnRenamed(object source, RenamedEventArgs e)
    {
        IndexerQueue.Add(e.OldFullPath);
        IndexerQueue.Add(e.FullPath);
    }
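
The event handling above can be restated as two small pure functions, which makes the rules easier to test without touching the file system. This is only a sketch: WatcherEvents, PathsForChange and PathsForRename are hypothetical names that are not part of Seekafile Server, and the isDirectory flag stands in for the Directory.Exists(e.FullPath) call used in OnChanged.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not part of Seekafile Server): decides which paths
// a file system event should put on the indexing queue.
public static class WatcherEvents
{
    // Mirrors OnChanged: a plain "Changed" event raised for a directory is
    // ignored; every other event queues the affected path.
    public static List<string> PathsForChange(
        WatcherChangeTypes changeType, string fullPath, bool isDirectory)
    {
        var paths = new List<string>();
        if (isDirectory && changeType == WatcherChangeTypes.Changed)
            return paths;

        paths.Add(fullPath);
        return paths;
    }

    // Mirrors OnRenamed: both the old and the new path are queued, old first.
    public static List<string> PathsForRename(string oldFullPath, string fullPath)
    {
        return new List<string> { oldFullPath, fullPath };
    }
}
```

Queuing the old path on a rename matters because the queue later finds that this path no longer exists and removes it from the index, while the new path is indexed as a fresh document.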

IndexerQueue

The IndexerQueue works this way:

  1. It works in a separate thread. There is only a single thread processing a single queue at any moment:
    public static void Start()
    {
        if (instanceDirectory == null)
            throw new ApplicationException("You must " + 
               "initialize the queue first by calling Init().");
    
        lock (runningLock)
        {
            if (!isRunning)
            {
                indexerThread = new Thread(new ThreadStart(Run));
                indexerThread.Name = "Indexer";
                indexerThread.Start();
            }
        }
    }
  2. It processes the items from the queue. It waits if there is nothing in the queue:
    while (!shouldStop)
    {
        if (nextPath != null)
        {
            // process nextPath
            // ...
    
            // remove it from the list
            lock (items.SyncRoot) 
            {
                items.Remove(nextPath);
            }
        }
        // nothing to do, let the processor do something else
        else
        {
            Thread.Sleep(100);
        }
        // try to take a next item
        nextPath = next();
    }
  3. If the path is a directory, it goes through it and adds all its content to the queue:
    private static void parseDirectory(DirectoryInfo di)
    {
        foreach (FileInfo f in di.GetFiles())
        {
            Add(f.FullName, false);
        }
    
        foreach (DirectoryInfo d in di.GetDirectories())
        {
            parseDirectory(d);
        }
    }
  4. If the path no longer exists, the corresponding document is deleted from the index (deleteDocuments), along with the documents of all files beneath it if there are any (deleteDirectory):
    private static void deleteDocuments(string fullName)
    {
        IndexReader r = IndexReader.Open(instanceDirectory);
        int deletedCount = r.Delete(new Term("fullname", fullName));
        r.Close();
    }
    
    private static void deleteDirectory(string fullName)
    {
        IndexReader r = IndexReader.Open(instanceDirectory);
        int deletedCount = r.Delete(new Term("parent", fullName));
        r.Close();
    }
  5. If the path is already in the index, it checks whether there is any change in file length, creation time, or last write time. To check whether the document is in the index, we create a query programmatically using BooleanQuery and TermQuery classes:
    private static bool isInIndex(FileInfo fi)
    {
        IndexSearcher searcher = new IndexSearcher(instanceDirectory);
    
        BooleanQuery bq = new BooleanQuery();
        bq.Add(new TermQuery(new Term("fullname", 
                             fi.FullName)), true, false);
        bq.Add(new TermQuery(new Term("length", 
                             fi.Length.ToString())), true, false);
        bq.Add(new TermQuery(new Term("created", 
               DateField.DateToString(fi.CreationTime))), true, false);
        bq.Add(new TermQuery(new Term("modified", 
              DateField.DateToString(fi.LastWriteTime))), true, false);
    
        Hits hits = searcher.Search(bq);
        int count = hits.Length();
        searcher.Close();
    
        return count == 1;
    }
  6. If there are changes, it updates the document in the index. Updating requires deleting the old document and adding a new one:
    // updates are expensive - proceed only if the 
    // file is not up-to-date
    if (isInIndex(fi))
        return;
    
    // delete all existing documents with this name
    deleteDocuments(fi.FullName);
    
    // add the document again
    addDocument(fi);
  7. When adding a document, we record the following metadata:
    • name: file name, e.g. document.doc,
    • fullname: path, e.g. c:\storage\marketing\document.doc,
    • parent: all parent directories, inserted as multiple fields, e.g. c:\; c:\storage; c:\storage\marketing,
    • created: creation time,
    • modified: last write time,
    • length: file length in bytes,
    • extension: file extension, e.g. .doc.
    Document doc = new Document();
    doc.Add(new Field("name", fi.Name, true, true, true));
    doc.Add(new Field("fullname", fi.FullName, true, 
                                            true, false));
        
    DirectoryInfo di = fi.Directory;
    while (di != null)
    {
        doc.Add(new Field("parent", di.FullName, true, 
                                            true, false));
        di = di.Parent;
    }
        
    doc.Add(Field.Keyword("created", 
                DateField.DateToString(fi.CreationTime)));
    doc.Add(Field.Keyword("modified", 
                DateField.DateToString(fi.LastWriteTime)));
    doc.Add(Field.Keyword("length", fi.Length.ToString()));
    doc.Add(Field.Keyword("extension", fi.Extension));
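
To summarize the steps above, here is a minimal sketch of the decision the queue makes for each path, plus the walk that produces the values of the parent field. It deliberately leaves DotLucene out so that it runs on its own. QueueAction, QueueSketch, Decide and ParentDirectories are hypothetical names, and the three boolean inputs are assumed to be computed by the caller (for example by Directory.Exists, File.Exists and isInIndex).

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical names for the four outcomes described in steps 3 to 6.
public enum QueueAction { ScanDirectory, DeleteFromIndex, Skip, Reindex }

public static class QueueSketch
{
    // exists:              the path is still present on disk
    // isDirectory:         the path is a directory (only meaningful if it exists)
    // sameSnapshotInIndex: the index already holds a document with the same
    //                      fullname, length, created and modified values
    public static QueueAction Decide(bool exists, bool isDirectory,
                                     bool sameSnapshotInIndex)
    {
        if (!exists) return QueueAction.DeleteFromIndex;      // step 4
        if (isDirectory) return QueueAction.ScanDirectory;    // step 3
        if (sameSnapshotInIndex) return QueueAction.Skip;     // step 5
        return QueueAction.Reindex;                           // step 6
    }

    // Returns every ancestor directory of a file, nearest first. Each entry
    // would become one "parent" field of the document (step 7).
    public static List<string> ParentDirectories(string fullName)
    {
        var parents = new List<string>();
        string dir = Path.GetDirectoryName(fullName);
        while (!string.IsNullOrEmpty(dir))
        {
            parents.Add(dir);
            dir = Path.GetDirectoryName(dir);
        }
        return parents;
    }
}
```

Storing every ancestor as a separate parent value is what lets deleteDirectory remove a whole subtree with a single term: any document below the deleted directory carries that directory's path in one of its parent fields.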

Parsing the files

DotLucene can index only plain text. Therefore, we need to extract the plain text from rich file formats such as Microsoft Word DOC, RTF, or Adobe PDF. The parsing can be done either by a .NET plug-in found in the plugins subdirectory of the Seekafile Server or through the IFilter interface (which is available in all Windows 2000/XP/2003 installations).

Plug-ins

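Generally, there are two ways of extending the parsing system:

  • Write a custom .NET plug-in (in C# or VB.NET) and place it in the plugins subdirectory.
  • Install an IFilter for the file type.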

There is also a sample plug-in included in the Seekafile Server download [seekafile.org].
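
To give a feeling for how parsing could be routed between plug-ins and a fallback parser, here is a minimal sketch. It is not Seekafile Server's actual plug-in contract: ITextExtractor, PlainTextExtractor and ExtractorRegistry are hypothetical names, and the rule that a registered plug-in takes precedence over the fallback (which would be the IFilter-based parser) is an assumption made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical contract; the real Seekafile Server plug-in interface may differ.
public interface ITextExtractor
{
    string ExtractText(string path);
}

// The simplest possible extractor: the file already is plain text.
public sealed class PlainTextExtractor : ITextExtractor
{
    public string ExtractText(string path)
    {
        return File.ReadAllText(path);
    }
}

// Maps file extensions to extractors and falls back to a default one.
public sealed class ExtractorRegistry
{
    private readonly Dictionary<string, ITextExtractor> byExtension =
        new Dictionary<string, ITextExtractor>(StringComparer.OrdinalIgnoreCase);
    private readonly ITextExtractor fallback;

    // fallback stands in for an IFilter-based extractor (assumption).
    public ExtractorRegistry(ITextExtractor fallback)
    {
        this.fallback = fallback;
    }

    // extension includes the leading dot, e.g. ".txt".
    public void Register(string extension, ITextExtractor extractor)
    {
        byExtension[extension] = extractor;
    }

    // Returns the extractor registered for the file's extension,
    // or the fallback if none is registered.
    public ITextExtractor Resolve(string path)
    {
        ITextExtractor extractor;
        string extension = Path.GetExtension(path);
        return byExtension.TryGetValue(extension, out extractor)
            ? extractor
            : fallback;
    }
}
```

Whatever extractor is chosen, the text it returns is what gets added to the DotLucene document alongside the metadata fields, since the index itself only ever sees plain text.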

Sample ASP.NET client search application

This ASP.NET application accesses the index directly to search it. It searches the file content only (file and directory names are ignored). It shows a relevant snippet from the document.

Read more about building an ASP.NET client search application [seekafile.org].

Download [seekafile.org] this sample as a part of the Seekafile Server from seekafile.org.

Sample Windows Forms client search application

This Windows Forms application accesses the index directly to search it. It searches the file content only (file and directory names are ignored).

Read more about building a Windows Forms client search application [seekafile.org].

Download [seekafile.org] this sample as a part of the Seekafile Server from seekafile.org.

Features planned for next versions

  • Exclude filters.
  • Multiple indexes per service.
  • Windows Forms client search application.
  • Simple GUI management.
  • Convenient installer.
  • Indexing status and notification support.
  • Multi-user desktop search.

Acknowledgements

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

About the Author

Dan Letecky

Czech Republic
My open-source event calendar/scheduling AJAX controls:
 
DayPilot for JavaScript/HTML5/jQuery
DayPilot for ASP.NET
DayPilot for MVC
DayPilot for Java

Last Updated 8 Mar 2006
Article Copyright 2006 by Dan Letecky