Seekafile Server 1.0 - Flexible open-source search server
By Dan Letecky
A Windows Service that indexes DOC, PDF, XLS, PPT, RTF, HTML, TXT, XML, and other file formats. Desktop and ASP.NET search samples included.
C#, Windows, .NET, Visual Studio, Dev

It has never been easier to create applications with search capabilities: open-source DotLucene [dotlucene.net] lets you build powerful and very fast full-text search applications, and it is easy to use. Let's demonstrate this by exploring Seekafile Server [seekafile.org], a flexible indexing server with capabilities similar to those of Windows Indexing Service [microsoft.com].
This article is a follow-up to Desktop Search Application: Part 1. In that article, I discussed indexing and searching Office documents using DotLucene [dotlucene.net]. This time we will build a more serious application that can be used either directly or as study material for practical usage of DotLucene.
In this article, you will learn:
Seekafile Server [seekafile.org] is a Windows service that indexes documents in the specified directories and watches them for changes.
This is an overview of the architecture:

The architecture is index-centric. It uses the index to communicate with the client search applications. The index is flexible enough to allow this:
This is an overview of the indexing process:
if (!IndexReader.IndexExists(cfg.IndexPath))
{
    Log.Echo("Creating a new index");
    IndexWriter writer = new IndexWriter(cfg.IndexPath,
        new StandardAnalyzer(), true);
    writer.Close();
}
Add every configured directory to the IndexerQueue (to ensure that everything is indexed properly):
foreach (string folder in cfg.Items)
{
    IndexerQueue.Add(folder);
    startWatcher(folder);
}
Start a FileSystemWatcher to watch all file changes in the indexed directories:
private void startWatcher(string directory)
{
    watcher = new FileSystemWatcher();
    watcher.Path = directory;
    watcher.NotifyFilter = NotifyFilters.LastWrite |
        NotifyFilters.FileName |
        NotifyFilters.DirectoryName;
    watcher.IncludeSubdirectories = true;
    watcher.Filter = "";
    watcher.Changed += new FileSystemEventHandler(OnChanged);
    watcher.Created += new FileSystemEventHandler(OnChanged);
    watcher.Deleted += new FileSystemEventHandler(OnChanged);
    watcher.Renamed += new RenamedEventHandler(OnRenamed);
    // start watching
    watcher.EnableRaisingEvents = true;
}
Add every changed path to the IndexerQueue:
private void OnChanged(object source, FileSystemEventArgs e)
{
    // skip directory changes if it's not a name change
    if (Directory.Exists(e.FullPath) &&
        e.ChangeType == WatcherChangeTypes.Changed)
        return;
    IndexerQueue.Add(e.FullPath);
}
private void OnRenamed(object source, RenamedEventArgs e)
{
    // queue both paths: the old one is removed from the index,
    // the new one is added
    IndexerQueue.Add(e.OldFullPath);
    IndexerQueue.Add(e.FullPath);
}
The IndexerQueue works this way:
public static void Start()
{
    if (instanceDirectory == null)
        throw new ApplicationException("You must " +
            "initialize the queue first by calling Init().");
    lock (runningLock)
    {
        if (!isRunning)
        {
            indexerThread = new Thread(new ThreadStart(Run));
            indexerThread.Name = "Indexer";
            indexerThread.Start();
        }
    }
}
while (!shouldStop)
{
    if (nextPath != null)
    {
        // process nextPath
        // ...
        // remove it from the list
        lock (items.SyncRoot)
        {
            items.Remove(nextPath);
        }
    }
    // nothing to do, let the processor do something else
    else
    {
        Thread.Sleep(100);
    }
    // try to take the next item
    nextPath = next();
}
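The start-up and polling loop above can be condensed into a small, self-contained sketch. This is not the Seekafile source: the `PathQueue` class, its members, and the `process` callback are hypothetical names chosen for illustration, and the duplicate check in `Add` is an assumption about how a queue of this kind would avoid indexing the same path twice.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in for IndexerQueue; all names here are assumptions.
public sealed class PathQueue
{
    private readonly object sync = new object();
    private readonly List<string> items = new List<string>();
    private readonly Action<string> process;
    private volatile bool shouldStop;
    private Thread worker;

    public PathQueue(Action<string> process)
    {
        this.process = process;
    }

    // Adds a path unless it is already waiting; returns true if it was added.
    public bool Add(string path)
    {
        lock (sync)
        {
            if (items.Contains(path)) return false;
            items.Add(path);
            return true;
        }
    }

    public int Count
    {
        get { lock (sync) { return items.Count; } }
    }

    // Starts the single worker thread; later calls do nothing.
    public void Start()
    {
        lock (sync)
        {
            if (worker != null) return;
            worker = new Thread(Run);
            worker.Name = "Indexer";
            worker.IsBackground = true;
            worker.Start();
        }
    }

    // Asks the worker to finish its current item and waits for it to exit.
    public void Stop()
    {
        shouldStop = true;
        Thread t;
        lock (sync) { t = worker; }
        if (t != null) t.Join();
    }

    private string Next()
    {
        lock (sync) { return items.Count > 0 ? items[0] : null; }
    }

    private void Run()
    {
        while (!shouldStop)
        {
            string path = Next();
            if (path == null)
            {
                Thread.Sleep(100);   // nothing to do, yield the processor
                continue;
            }
            process(path);           // the item stays queued while it is processed
            lock (sync) { items.Remove(path); }
        }
    }
}
```

As in the loop above, an item is removed only after it has been processed. In this sketch that means a path added again while it is being processed is dropped as a duplicate; whether the real queue behaves the same way is not stated in the article.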
private static void parseDirectory(DirectoryInfo di)
{
    foreach (FileInfo f in di.GetFiles())
    {
        Add(f.FullName, false);
    }
    foreach (DirectoryInfo d in di.GetDirectories())
    {
        parseDirectory(d);
    }
}
Delete the item from the index (deleteDocuments), including all subfiles if there are any (deleteDirectory):
private static void deleteDocuments(string fullName)
{
    IndexReader r = IndexReader.Open(instanceDirectory);
    int deletedCount = r.Delete(new Term("fullname", fullName));
    r.Close();
}
private static void deleteDirectory(string fullName)
{
    IndexReader r = IndexReader.Open(instanceDirectory);
    int deletedCount = r.Delete(new Term("parent", fullName));
    r.Close();
}
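deleteDirectory can remove a whole subtree with a single term because each document stores one "parent" value per ancestor directory (see the while loop in the document-building code further down). The following sketch reproduces just that loop with the standard System.IO classes so the stored values can be inspected; the `ParentFields` class and its `For` method are hypothetical helpers, not part of Seekafile Server or DotLucene.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper; mirrors the "parent" loop used when a document is built.
public static class ParentFields
{
    // Returns the full names of all ancestor directories, nearest first.
    // The file does not have to exist.
    public static List<string> For(string filePath)
    {
        List<string> parents = new List<string>();
        DirectoryInfo di = new FileInfo(filePath).Directory;
        while (di != null)
        {
            parents.Add(di.FullName);
            di = di.Parent;   // null once the root has been reached
        }
        return parents;
    }
}
```

On Windows, for a hypothetical file C:\docs\reports\q1.doc, the loop yields C:\docs\reports, C:\docs, and C:\. Deleting by the term ("parent", "C:\docs") therefore matches this file and every other file anywhere below C:\docs.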
Check whether an up-to-date copy of the file is already in the index, using the BooleanQuery and TermQuery classes:
private static bool isInIndex(FileInfo fi)
{
    IndexSearcher searcher = new IndexSearcher(instanceDirectory);
    // every clause is required: name, size and both timestamps must match
    BooleanQuery bq = new BooleanQuery();
    bq.Add(new TermQuery(new Term("fullname",
        fi.FullName)), true, false);
    bq.Add(new TermQuery(new Term("length",
        fi.Length.ToString())), true, false);
    bq.Add(new TermQuery(new Term("created",
        DateField.DateToString(fi.CreationTime))), true, false);
    bq.Add(new TermQuery(new Term("modified",
        DateField.DateToString(fi.LastWriteTime))), true, false);
    Hits hits = searcher.Search(bq);
    int count = hits.Length();
    searcher.Close();
    return count == 1;
}
// updates are expensive - proceed only if the
// file is not up-to-date
if (isInIndex(fi))
    return;
// delete all existing documents with this name
deleteDocuments(fi.FullName);
// add the document again
addDocument(fi);
Document doc = new Document();
doc.Add(new Field("name", fi.Name, true, true, true));
doc.Add(new Field("fullname", fi.FullName, true,
    true, false));
// store every ancestor directory so that a whole
// subtree can be found (and deleted) by one term
DirectoryInfo di = fi.Directory;
while (di != null)
{
    doc.Add(new Field("parent", di.FullName, true,
        true, false));
    di = di.Parent;
}
doc.Add(Field.Keyword("created",
    DateField.DateToString(fi.CreationTime)));
doc.Add(Field.Keyword("modified",
    DateField.DateToString(fi.LastWriteTime)));
doc.Add(Field.Keyword("length", fi.Length.ToString()));
doc.Add(Field.Keyword("extension", fi.Extension));
DotLucene is able to index only plain text. Therefore, we need to extract the plain text from rich file formats like Microsoft Word DOC, RTF, or Adobe PDF. The parsing can be done either by a .NET plug-in found in the plugins subdirectory of the Seekafile Server or through the IFilter interface (which is available in all Windows 2000/XP/2003 installations).
Read more about IFilter:
Generally, there are two ways of extending the parsing system:
Read more about custom plug-ins:
There is also a sample plug-in included in the Seekafile Server download [seekafile.org].
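To make the idea of pluggable parsers concrete, here is a minimal sketch of choosing a text extractor by file extension, with a fallback that could stand in for IFilter. It does not reproduce Seekafile's actual plug-in contract, which this article does not show: `ITextExtractor`, `PlainTextExtractor`, and `ExtractorRegistry` are hypothetical names invented for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical plug-in contract; the real Seekafile interface may differ.
public interface ITextExtractor
{
    // Returns the plain text of the file so that it can be indexed.
    string Extract(string path);
}

// Simplest possible extractor: the file already is plain text.
public sealed class PlainTextExtractor : ITextExtractor
{
    public string Extract(string path)
    {
        return File.ReadAllText(path);
    }
}

// Maps file extensions (including the leading dot) to extractors.
public sealed class ExtractorRegistry
{
    private readonly Dictionary<string, ITextExtractor> byExtension =
        new Dictionary<string, ITextExtractor>(StringComparer.OrdinalIgnoreCase);
    private readonly ITextExtractor fallback;

    // fallback is used for extensions without a registered plug-in
    // (for example, an IFilter-based extractor); it may be null.
    public ExtractorRegistry(ITextExtractor fallback)
    {
        this.fallback = fallback;
    }

    public void Register(string extension, ITextExtractor extractor)
    {
        byExtension[extension] = extractor;
    }

    public ITextExtractor Resolve(string path)
    {
        ITextExtractor found;
        return byExtension.TryGetValue(Path.GetExtension(path), out found)
            ? found
            : fallback;
    }
}
```

A caller would register one extractor per extension, for example `registry.Register(".txt", new PlainTextExtractor())`, and pass the text returned by `Resolve(path).Extract(path)` to the indexer. Extensions are compared case-insensitively, so report.TXT and report.txt resolve to the same extractor.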

This ASP.NET application accesses the index directly to search it. It searches the file content only (file and directory names are ignored). It shows a relevant snippet from the document.
Read more about building an ASP.NET client search application [seekafile.org].
Download [seekafile.org] this sample as a part of the Seekafile Server from seekafile.org.
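The snippet shown next to each search result can be approximated with plain string handling. The sketch below is not the method used by the ASP.NET sample, which this article does not show; `Snippets.Around` is a hypothetical helper that cuts a window of text around the first occurrence of a search term.

```csharp
using System;

// Hypothetical helper; the sample application may build snippets differently.
public static class Snippets
{
    // Returns up to `radius` characters on each side of the first
    // case-insensitive match of `term`. If the term does not occur,
    // the beginning of the text is returned instead.
    public static string Around(string text, string term, int radius)
    {
        int hit = text.IndexOf(term, StringComparison.OrdinalIgnoreCase);
        if (hit < 0)
            return text.Substring(0, Math.Min(text.Length, 2 * radius));

        int start = Math.Max(0, hit - radius);
        int end = Math.Min(text.Length, hit + term.Length + radius);

        // mark the places where the text was cut
        string prefix = start > 0 ? "..." : "";
        string suffix = end < text.Length ? "..." : "";
        return prefix + text.Substring(start, end - start) + suffix;
    }
}
```

For example, `Snippets.Around("The quick brown fox jumps", "brown", 3)` returns `...ck brown fo...`. A production version would more likely cut at word boundaries and highlight the matched term.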

This Windows Forms application accesses the index directly to search it. It searches the file content only (file and directory names are ignored).
Read more about building a Windows Forms client search application [seekafile.org].
Download [seekafile.org] this sample as a part of the Seekafile Server from seekafile.org.
Last Updated: 8 Mar 2006. Editor: Smitha Vijayan.
Copyright 2006 by Dan Letecky.