DotLucene: Full-Text Search for Your Intranet or Website using 37 Lines of Code

By Dan Letecky

An introduction to DotLucene, open source full-text search engine.
C#, Windows, .NET 1.1, .NET, ASP.NET, Visual Studio, VS.NET2003, Dev

Posted: 1 Feb 2005
Updated: 30 Mar 2005
DotLucene: excellent full-text search engine

Can a full-text search be coded in 37 lines? Well, I am going to cheat a bit and use DotLucene for the dirty work. DotLucene is a .NET port of the Jakarta Lucene search engine, maintained by George Aroush et al. Here is a quick list of its features:

  • It can be used in ASP.NET, Win Forms or console applications.
  • Very good performance.
  • Ranked search results.
  • Search query highlighting in results.
  • Searches structured and unstructured data.
  • Metadata searching (query by date, search custom fields...).
  • Index size approximately 30% of the indexed text.
  • Can also store full indexed documents.
  • Pure managed .NET in a single assembly (244 KB).
  • Very friendly licensing (Apache Software License 2.0).
  • Localizable (support for Brazilian, Czech, Chinese, Dutch, English, French, Japanese, Korean and Russian included).
  • Extensible (source code included).

Warning

Don't take the line count too seriously. I will show you that the core functionality doesn't take more than 37 lines of code, but to make it a real application you will need to spend some more time on it...

Demo project

We will build a simple demo project that shows how to:

  • index HTML files found in a specified directory (including subdirectories).
  • search the index using an ASP.NET application.
  • highlight the query words in the search results.

But DotLucene has more potential. In a real-world application, you would probably want to:

  • Add the new documents to the index when they appear in the directory. You don't need to rebuild the whole index.
  • Include other file types. DotLucene can index any file type which you are able to convert to plain text.

Why Not Use Microsoft Indexing Server?

If you are happy with the Indexing Server, no problem. However, DotLucene has many advantages:

  • DotLucene is a single assembly of 100% managed code. It has no external dependencies.
  • It can be used on shared hosting. If you prepare the index in advance, you need no permission to write to the disk.
  • You can use it to index any type of data (e-mails, XML, HTML files, etc.) from any source (database, web, etc.). That's because you supply plain text to the indexer; loading and parsing the source is up to you.
  • It allows you to specify the attributes ("fields") that should be included in the index. You can search using these fields (e.g. by author, date, keywords).
  • It is open source.
  • It is easily extensible.

Line 1: Creating the Index

The following line of code creates a new index stored on disk. directory is a path to the directory where the index will be stored.

IndexWriter writer = 
   new IndexWriter(directory, new StandardAnalyzer(), true);

In this example, we create the index from scratch. This is not necessary: you can also open an existing index and add documents to it. You can also update an existing document by deleting it and adding a new version.
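The delete-and-re-add update could be sketched roughly as follows. This is a hedged illustration, not code from the demo project: the IndexUpdater class and UpdateDocument method are hypothetical names, it assumes the path keyword field (introduced in the next section) identifies each document uniquely, and it assumes a DotLucene 1.4-style API in which IndexReader exposes Delete(Term). Later Lucene.Net versions name this method differently, so check the version you are using.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

// Hypothetical helper, not part of DotLucene or the demo project.
public class IndexUpdater
{
    // Replaces the indexed version of one file.
    // Assumption: the "path" keyword field identifies the document uniquely.
    public static void UpdateDocument(string directory, string path, string plainText)
    {
        // 1. delete the old version through an IndexReader
        //    (assumes the 1.4-style Delete(Term) method)
        IndexReader reader = IndexReader.Open(directory);
        reader.Delete(new Term("path", path));
        reader.Close();

        // 2. open the existing index for appending (create = false)
        //    and add the new version of the document
        IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.Add(Field.UnStored("text", plainText));
        doc.Add(Field.Keyword("path", path));
        writer.AddDocument(doc);
        writer.Close();
    }
}
```

The only difference from creating a fresh index is the last constructor argument: passing false instead of true keeps the documents that are already there.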

Lines 2 - 12: Adding documents

For each HTML document, we will add two fields into the index:

  • A text field that contains the text of the HTML file (with tags stripped). The text itself won't be stored in the index.
  • A path field that contains the file path. It will be indexed and stored in full in the index.

public void AddHtmlDocument(string path)
{
    Document doc = new Document();

    // read the file and strip the HTML tags
    // (parseHtml is a helper method not listed in this article)
    string rawText;
    using (StreamReader sr = 
       new StreamReader(path, System.Text.Encoding.Default))
    {
        rawText = parseHtml(sr.ReadToEnd());
    }
    
    // "text" is indexed but not stored; "path" is stored as-is
    doc.Add(Field.UnStored("text", rawText));
    doc.Add(Field.Keyword("path", path));
    writer.AddDocument(doc);
}
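
The parseHtml helper is called above but never listed. One possible, deliberately naive version is sketched below. It is an assumption about what such a helper might do, not the demo project's actual code. It uses regular expressions rather than a real HTML parser, so malformed markup or a stray "<" in the text can confuse it. The HtmlText class name is hypothetical, and WebUtility.HtmlDecode assumes .NET 4 or later (on .NET 1.1, HttpUtility.HtmlDecode from System.Web would play the same role).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the article's parseHtml helper.
public static class HtmlText
{
    // <script>...</script> and <style>...</style> blocks, contents included
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // HTML comments and any remaining tag
    private static readonly Regex CommentOrTag = new Regex(
        @"<!--.*?-->|<[^>]+>", RegexOptions.Singleline);

    // runs of whitespace left behind by the removed markup
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string ParseHtml(string html)
    {
        // drop script/style content first so it is not indexed as text
        string text = ScriptOrStyle.Replace(html, " ");
        // replace tags with spaces so adjacent words do not run together
        text = CommentOrTag.Replace(text, " ");
        // turn entities such as &amp; back into characters
        text = WebUtility.HtmlDecode(text);
        // collapse whitespace and trim the ends
        return Whitespace.Replace(text, " ").Trim();
    }
}
```

Replacing tags with a space rather than an empty string matters for indexing: otherwise "one</p><p>two" would be indexed as the single word "onetwo".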

Lines 13 - 14: Optimizing and Saving the Index

After adding the documents, optimize the index and close the writer. Optimization improves search performance; closing the writer saves the index to disk.

writer.Optimize();
writer.Close();
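
To index a whole directory tree, AddHtmlDocument has to be called once per HTML file between creating the writer and the Optimize/Close calls. The file discovery part could look roughly like the sketch below. HtmlFileFinder, FindHtmlFiles and IndexDirectory are hypothetical names, the ".htm"/".html" extension filter is an assumption about what counts as an HTML file, and SearchOption.AllDirectories assumes .NET 2.0 or later.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of DotLucene or the demo project.
public static class HtmlFileFinder
{
    // Returns every *.htm / *.html file under root, subdirectories included.
    public static List<string> FindHtmlFiles(string root)
    {
        List<string> result = new List<string>();
        foreach (string file in
                 Directory.GetFiles(root, "*.*", SearchOption.AllDirectories))
        {
            // filter by extension ourselves; case-insensitive on purpose
            string ext = Path.GetExtension(file);
            if (string.Equals(ext, ".html", StringComparison.OrdinalIgnoreCase) ||
                string.Equals(ext, ".htm", StringComparison.OrdinalIgnoreCase))
            {
                result.Add(file);
            }
        }
        // sort for a stable, predictable indexing order
        result.Sort(StringComparer.Ordinal);
        return result;
    }

    // Calls addDocument once per HTML file and returns the number of files.
    public static int IndexDirectory(string root, Action<string> addDocument)
    {
        List<string> files = FindHtmlFiles(root);
        foreach (string file in files)
        {
            addDocument(file);
        }
        return files.Count;
    }
}
```

A call such as HtmlFileFinder.IndexDirectory(docsRoot, AddHtmlDocument), where docsRoot is a hypothetical path to the documents, would then sit between the IndexWriter constructor and writer.Optimize(). Passing the indexing method as a delegate keeps the file walking independent of DotLucene.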

Line 15: Opening the Index for Searching

Before doing any search, you need to open the index. directory is the path to the directory where the index was stored.

IndexSearcher searcher = new IndexSearcher(directory);

Lines 16 - 27: Searching

Now we can parse the query. q is the query string, and text is the default field to search.

Query query = 
   QueryParser.Parse(q, "text", new StandardAnalyzer()); 
Hits hits = searcher.Search(query);

The hits variable is a collection of result documents. We will go through it and store the results in a DataTable.

DataTable dt = new DataTable();
dt.Columns.Add("path", typeof(string));
dt.Columns.Add("sample", typeof(string));

for (int i = 0; i < hits.Length(); i++) 
{
    // get the document from index
    Document doc = hits.Doc(i);

    // get the document filename
    // we can't get the text from the index 
    // because we didn't store it there
    DataRow row = dt.NewRow();
    row["path"] = doc.Get("path");

    dt.Rows.Add(row);
}

Lines 28 - 37: Query Highlighting

Let's create a highlighter. We will use bold font for highlighting (<B>phrase</B>).

QueryHighlightExtractor highlighter = 
  new QueryHighlightExtractor(query, new StandardAnalyzer(), 
                         "<B>", "</B>");

While fetching the results, we will load the original file again and extract the relevant part of its text.

for (int i = 0; i < hits.Length(); i++) 
{
    // ...

    // reload the original file; its location is in the stored "path" field
    string plainText;
    using (StreamReader sr = 
      new StreamReader(doc.Get("path"), 
                  System.Text.Encoding.Default))
    {
        plainText = parseHtml(sr.ReadToEnd());
    }
    row["sample"] = 
       highlighter.GetBestFragments(plainText, 80, 2, "...");

    // ...
}

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

About the Author

Dan Letecky


My open-source ASP.NET 2.0 controls:

DayPilot - Outlook-like calendar/scheduling control
DayPilot MonthPicker - Light-weight month picker
MenuPilot - Hover context menu

Location: Czech Republic

Editor: Rinish Biju
Copyright 2005 by Dan Letecky