DotLucene: excellent full-text search engine

Can a full-text search be written in 37 lines of code? Well, I am going to cheat a bit and use DotLucene for the dirty work. DotLucene is a .NET port of the Jakarta Lucene search engine, maintained by George Aroush et al. Here is a quick list of its features:
Warning

Don't take the line count too seriously. I will show you that the core functionality doesn't take more than 37 lines of code, but turning it into a real application will take more work.

Demo project

We will build a simple demo project that shows how to:
But DotLucene has more potential. In a real-world application, you would probably want to:
Why Not Use Microsoft Indexing Server?

If you are happy with Indexing Server, no problem. However, DotLucene has many advantages:
Line 1: Creating the Index

The following line of code creates a new index stored on disk:

    IndexWriter writer =
        new IndexWriter(directory, new StandardAnalyzer(), true);
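In this example, we create the index from scratch. This is not necessary; you can also open an existing index and add documents to it. You can update an existing document by deleting it and adding a new version.

Lines 2 - 12: Adding documents

For each HTML document, we will add two fields into the index: "text", which holds the plain text extracted from the HTML file and is indexed but not stored, and "path", which holds the file path and is stored in the index as a single keyword.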
    public void AddHtmlDocument(string path)
    {
        Document doc = new Document();
        string rawText;
        // read the file and reduce it to plain text
        using (StreamReader sr =
            new StreamReader(path, System.Text.Encoding.Default))
        {
            rawText = parseHtml(sr.ReadToEnd());
        }
        // "text" is searchable but not stored; "path" is stored as-is
        doc.Add(Field.UnStored("text", rawText));
        doc.Add(Field.Keyword("path", path));
        writer.AddDocument(doc);
    }
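
The parseHtml helper called above is not shown in this article. As a rough illustration only, a stand-in could strip tags with regular expressions. The class name HtmlText, the method name ParseHtml, and the use of System.Net.WebUtility (which assumes .NET 4.0 or later) are assumptions for this sketch, not the demo project's actual code. A regex-based approach will also mishandle unusual markup, such as a ">" inside an attribute value, so a real application may prefer a proper HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the article's parseHtml helper.
public static class HtmlText
{
    // Matches <script>...</script> and <style>...</style> blocks with their content.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any remaining tag.
    private static readonly Regex Tag =
        new Regex(@"<[^>]+>", RegexOptions.Singleline);

    // Matches runs of whitespace so they can be collapsed.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string ParseHtml(string html)
    {
        // Drop script and style blocks first so their content is not indexed.
        string text = ScriptOrStyle.Replace(html, " ");
        // Replace the remaining tags with spaces to keep words separated.
        text = Tag.Replace(text, " ");
        // Turn entities such as &amp; back into characters.
        text = WebUtility.HtmlDecode(text);
        // Collapse whitespace runs into single spaces.
        return Whitespace.Replace(text, " ").Trim();
    }
}
```

With this sketch, the call in AddHtmlDocument would read rawText = HtmlText.ParseHtml(sr.ReadToEnd());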
Lines 13 - 14: Optimizing and Saving the Index

After adding the documents, you need to close the index writer. Optimizing the index first will improve search performance.

    writer.Optimize();
    writer.Close();
Line 15: Opening the Index for Searching

Before doing any search, you need to open the index:

    IndexSearcher searcher = new IndexSearcher(directory);
Lines 16 - 27: Searching

Now we can parse the query string q, with "text" as the default field to search:

    Query query =
        QueryParser.Parse(q, "text", new StandardAnalyzer());
    Hits hits = searcher.Search(query);
The variable hits holds the matching documents. We will go through them and store the results in a DataTable:

    DataTable dt = new DataTable();
    dt.Columns.Add("path", typeof(string));
    dt.Columns.Add("sample", typeof(string));
    for (int i = 0; i < hits.Length(); i++)
    {
        // get the document from the index
        Document doc = hits.Doc(i);
        // get the document filename;
        // we can't get the text from the index
        // because we didn't store it there
        DataRow row = dt.NewRow();
        row["path"] = doc.Get("path");
        dt.Rows.Add(row);
    }
Lines 28 - 37: Query Highlighting

Let's create a highlighter that wraps the matched query terms in <B> tags:

    QueryHighlightExtractor highlighter =
        new QueryHighlightExtractor(query, new StandardAnalyzer(),
            "<B>", "</B>");
While fetching the results, we will load the relevant part of the original text. The file location is read from the "path" field that was stored in the index:

    for (int i = 0; i < hits.Length(); i++)
    {
        // ...
        string plainText;
        using (StreamReader sr =
            new StreamReader(doc.Get("path"),
                System.Text.Encoding.Default))
        {
            plainText = parseHtml(sr.ReadToEnd());
        }
        row["sample"] =
            highlighter.GetBestFragments(plainText, 80, 2, "...");
        // ...
    }
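
To make the idea of highlighting concrete, here is a toy sketch that wraps whole-word matches of the query terms in opening and closing tags. It is not how QueryHighlightExtractor works internally: it does not analyze the query or pick the best fragments, and the ToyHighlighter class and its Highlight method are hypothetical names made up for this illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical illustration only; not part of DotLucene.
public static class ToyHighlighter
{
    // Wraps every whole-word, case-insensitive match of any term
    // in the given opening and closing tags.
    public static string Highlight(
        string text, string[] terms, string open, string close)
    {
        if (terms.Length == 0)
        {
            return text;
        }
        // Escape each term so it is matched literally, then join with "|".
        string pattern =
            @"\b(" + string.Join("|", terms.Select(Regex.Escape)) + @")\b";
        return Regex.Replace(
            text, pattern, m => open + m.Value + close,
            RegexOptions.IgnoreCase);
    }
}
```

For example, ToyHighlighter.Highlight("Lucene is a search engine", new[] { "search" }, "<B>", "</B>") returns "Lucene is a <B>search</B> engine".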
Resources