
DotLucene: Full-Text Search for Your Intranet or Website using 37 Lines of Code

6 Nov 2012 | Apache | 3 min read
An introduction to Lucene.Net, the open source full-text search engine.


Update 

November 6, 2012: The project is now working with Lucene.Net 3.0 and .NET Framework 4.0. Includes Visual Studio 2010 solution.  

Lucene.Net: Excellent Full-Text Search Engine

Can a full-text search be coded in 37 lines? Well, I am going to cheat a bit and use Lucene.Net for the dirty work. Lucene.Net is a .NET port of the Jakarta Lucene search engine. Here is a quick list of its features:

  • It can be used in ASP.NET, Win Forms or console applications.
  • Very good performance.
  • Ranked search results.
  • Search query highlighting in results.
  • Searches structured and unstructured data.
  • Metadata searching (query by date, search custom fields...).
  • Index size approximately 30% of the indexed text.
  • Can also store full indexed documents.
  • Pure managed .NET.
  • Very friendly licensing (Apache Software License 2.0).
  • Localizable (support for Brazilian, Czech, Chinese, Dutch, English, French, Japanese, Korean and Russian included).
  • Extensible (source code included).

Warning 

Don't take the line count too seriously. I will show you that the core functionality doesn't take more than 37 lines of code, but to make it a real application you will need to spend some more time on it...

Demo Project 

We will build a simple demo project that shows how to:

  • index HTML files found in a specified directory (including subdirectories).
  • search the index using an ASP.NET application.
  • highlight the query words in the search results.
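
The first step, finding the HTML files in a directory and its subdirectories, can be sketched as follows. This is only an illustration: the HtmlFileFinder class and the FindHtmlFiles method are hypothetical names that do not appear in the demo project, and the sketch assumes files are recognized by a .htm or .html extension.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of the demo project.
public static class HtmlFileFinder
{
    // Lazily returns every .htm / .html file under root, including subdirectories.
    public static IEnumerable<string> FindHtmlFiles(string root)
    {
        return Directory
            .EnumerateFiles(root, "*", SearchOption.AllDirectories)
            .Where(path =>
            {
                // Compare the extension explicitly instead of relying on a
                // "*.htm" wildcard, whose matching rules vary by platform.
                string ext = Path.GetExtension(path);
                return ext.Equals(".htm", StringComparison.OrdinalIgnoreCase)
                    || ext.Equals(".html", StringComparison.OrdinalIgnoreCase);
            });
    }
}
```

Each returned path could then be passed to an indexing method such as the AddHtmlDocument method shown later in this article.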

But Lucene.Net has more potential. In a real-world application, you would probably want to:

  • Add new documents to the index as they appear in the directory, without rebuilding the whole index.
  • Include other file types. Lucene.Net can index any file type which you are able to convert to plain text.

Why Not Use Microsoft Indexing Server?

If you are happy with the Indexing Server, no problem. However, Lucene.Net has many advantages:

  • Lucene.Net is a single assembly of 100% managed code. It has no external dependencies. 
  • You can use it to index any type of data (e-mails, XML, HTML files, etc.) from any source (database, web, etc.), because the indexer only needs plain text. Loading and parsing the source is up to you.
  • It lets you specify the attributes ("fields") that should be included in the index. You can search using these fields (e.g. by author, date, keywords).
  • It is open source.
  • It is easily extensible.

Line 1: Creating the Index

The following line of code creates a new index stored on disk. Here, directory is the path to the directory where the index will be stored.

C#
IndexWriter writer = new IndexWriter(FSDirectory.Open(directory), new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.LIMITED); 

In this example, we create the index from scratch. This is not necessary: you can also open an existing index and add documents to it. You can also update an existing document by deleting it and adding a new version.
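
As a rough sketch of that incremental approach, assuming the Lucene.Net 3.0 API and the same "text" and "path" fields this demo uses for its documents: IncrementalIndex, OpenOrCreate and AddOrReplace are hypothetical names, not part of the demo project. UpdateDocument deletes every document matching the given term and then adds the new one, which works here because the path is indexed as a single, non-analyzed term.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

// Hypothetical helper, not part of the demo project.
public static class IncrementalIndex
{
    // Opens the index at indexPath, creating it only if none exists yet.
    public static IndexWriter OpenOrCreate(string indexPath)
    {
        Directory dir = FSDirectory.Open(indexPath);
        bool create = !IndexReader.IndexExists(dir);
        return new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
            create, IndexWriter.MaxFieldLength.LIMITED);
    }

    // Replaces the document stored for this path, or adds it if it is new.
    public static void AddOrReplace(IndexWriter writer, string path, string text)
    {
        var doc = new Document();
        doc.Add(new Field("text", text, Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new Field("path", path, Field.Store.YES, Field.Index.NOT_ANALYZED));

        // Delete-then-add in one call, keyed on the exact path term.
        writer.UpdateDocument(new Term("path", path), doc);
    }
}
```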

Lines 2 - 12: Adding documents 

For each HTML document, we will add two fields into the index: 

  • text field that contains the text of the HTML file (with tags stripped). It is analyzed for searching and also stored in the index, so that the search page can read it back to build highlighted excerpts.
  • path field that contains the file path. It will be indexed and stored in full in the index.
C#
public void AddHtmlDocument(string path)
{
    Document doc = new Document();

    string rawText;
    using (StreamReader sr = 
       new StreamReader(path, System.Text.Encoding.Default))
    {
        rawText = parseHtml(sr.ReadToEnd());
    }
    
    doc.Add(new Field("text", rawText, Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field("path", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.AddDocument(doc);
}  
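
The parseHtml helper called above is not listed in this article. A minimal sketch of what such a helper might do is shown below, assuming that removing script and style blocks, stripping the remaining tags and decoding entities is good enough for indexing. The HtmlText class is a hypothetical stand-in, and regular expressions are not a full HTML parser, so malformed markup may need a dedicated library.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the article's parseHtml helper.
public static class HtmlText
{
    // Matches <script> and <style> elements together with their content.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any remaining tag (or simple comment).
    private static readonly Regex Tag = new Regex(@"<[^>]+>");

    // Matches runs of whitespace so they can be collapsed.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string ParseHtml(string html)
    {
        string text = ScriptOrStyle.Replace(html, " ");
        text = Tag.Replace(text, " ");
        text = WebUtility.HtmlDecode(text);   // &amp; becomes &, and so on
        return Whitespace.Replace(text, " ").Trim();
    }
}
```

Replacing tags with a space rather than an empty string keeps words from adjacent elements from running together in the indexed text.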

Lines 13 - 14: Optimizing and Saving the Index

After adding the documents, you need to close the index writer so that all changes are written to disk. Calling Optimize() first merges the index segments, which improves search performance.

C#
writer.Optimize();
writer.Close(); 

Line 15: Opening the Index for Searching

Before doing any search, you need to open the index. Here, indexDirectory is the path to the directory where the index was stored.

C#
IndexSearcher searcher = new IndexSearcher(FSDirectory.Open(indexDirectory));  

Lines 16 - 27: Searching 

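Now we can parse the query (text is the default field to search). The analyzer passed to the parser should be the same kind that was used to build the index (StandardAnalyzer in this demo).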

C#
var parser = new QueryParser(Version.LUCENE_30, "text", analyzer);
Query query = parser.Parse(this.Query); 
TopDocs hits = searcher.Search(query, 200);  

The hits variable holds the top-ranked result documents (at most 200 here). We will go through them and store the results in a DataTable.

C#
DataTable dt = new DataTable();
dt.Columns.Add("path", typeof(string));
dt.Columns.Add("sample", typeof(string));

// iterate over the returned hits only: TotalHits can exceed the 200 requested
for (int i = 0; i < hits.ScoreDocs.Length; i++) 
{
    // get the document from index
    Document doc = searcher.Doc(hits.ScoreDocs[i].Doc);

    // read the stored file path back from the index
    // (the highlighted sample is filled in later, see Query Highlighting)
    DataRow row = dt.NewRow();
    row["path"] = doc.Get("path");

    dt.Rows.Add(row);
}  

Lines 28 - 37: Query Highlighting 

Let's create a highlighter. We will use bold font for highlighting, wrapping each matched phrase in a <span> with font-weight:bold.

C#
IFormatter formatter = new SimpleHTMLFormatter("<span style=\"font-weight:bold;\">", "</span>");
SimpleFragmenter fragmenter = new SimpleFragmenter(80);
QueryScorer scorer = new QueryScorer(query);
Highlighter highlighter = new Highlighter(formatter, scorer);
highlighter.TextFragmenter = fragmenter; 

While fetching the results, we read the stored text back from the index and let the highlighter pick the best-matching fragments.

C#
for (int i = 0; i < hits.ScoreDocs.Length; i++) 
{
    // ...
    string text = doc.Get("text");
    TokenStream stream = analyzer.TokenStream("", new StringReader(text));
    row["sample"] = highlighter.GetBestFragments(stream, text, 2, "...");
    // ...
}  

License

This article, along with any associated source code and files, is licensed under The Apache License, Version 2.0


Written By
Czech Republic
My open-source event calendar/scheduling web UI components:

DayPilot for JavaScript, Angular, React and Vue

Comments and Discussions

Re: Index and search for german text
dirkb, 22-Jul-05 5:53
Thanks for the answer, but I still have a lot of trouble using Lucene...

I have used the source of Lucene.Net v1.4.002 and Highlighter 1.4.

If you use the GermanAnalyzer without any parameters in the constructor, the default list of typical German stopwords is used. This list contains duplicates, and the program throws an exception. You have to fix the list as shown here in Lucene.Net.Analysis.DE.GermanAnalyzer:

    public class GermanAnalyzer : Analyzer
    {
        /// <summary> List of typical german stopwords.</summary>
        private System.String[] GERMAN_STOP_WORDS = new System.String[]
            {
                "einer", "eine", "eines", "einem", "einen", "der", "die", 
                "das", "dass", "daß", "du", "er", "sie", "es", "was", "wer", 
                "wie", "wir", "und", "oder", "ohne", "mit", "am", "im", "in", 
                "aus", "auf", "ist", "sein", "war", "wird", "ihr", "ihre", 
                "ihres", "als", "für", "von", "dich", "dir", "mich", 
                "mir", "mein", "kein", "durch", "wegen"
            };


I tried to create an index with the GermanAnalyzer like this:

IndexWriter writer = new IndexWriter(indexFile, new GermanAnalyzer(), true);

but after I tried to add the doc to the index with

// Create a new Index-Document
Document doc = new Document();

doc.Add(Field.Text("Kunde", strFieldValue));

// Save Index-Document
writer.AddDocument(doc);


I got a strange exception from private void Strip(System.Text.StringBuilder buffer) of the GermanStemmer class. The text I tried to add was "TIGER TEST FIRMA". The exception was:

Exception: System.ArgumentOutOfRangeException
Message: Index and length must refer to a location within the string.
Parameter name: length
Source: mscorlib
   at System.String.Substring(Int32 startIndex, Int32 length)
   at System.Text.StringBuilder.ToString(Int32 startIndex, Int32 length)
   at Lucene.Net.Analysis.DE.GermanStemmer.RemoveParticleDenotion(StringBuilder buffer) in F:\DataCenter\Software\LCC\Source\Lucene.Net\Analysis\DE\GermanStemmer.cs:line 148
   at Lucene.Net.Analysis.DE.GermanStemmer.Stem(String term) in F:\DataCenter\Software\LCC\Source\Lucene.Net\Analysis\DE\GermanStemmer.cs:line 59
   at Lucene.Net.Analysis.DE.GermanStemFilter.Next() in F:\DataCenter\Software\LCC\Source\Lucene.Net\Analysis\DE\GermanStemFilter.cs:line 67
   at Lucene.Net.Index.DocumentWriter.InvertDocument(Document doc) in F:\DataCenter\Software\LCC\Source\Lucene.Net\Index\DocumentWriter.cs:line 148
   at Lucene.Net.Index.DocumentWriter.AddDocument(String segment, Document doc) in F:\DataCenter\Software\LCC\Source\Lucene.Net\Index\DocumentWriter.cs:line 83
   at Lucene.Net.Index.IndexWriter.AddDocument(Document doc, Analyzer analyzer) in F:\DataCenter\Software\LCC\Source\Lucene.Net\Index\IndexWriter.cs:line 388
   at Lucene.Net.Index.IndexWriter.AddDocument(Document doc) in F:\DataCenter\Software\LCC\Source\Lucene.Net\Index\IndexWriter.cs:line 376
   at LCC.KuDaBa.LuceneIndexer.LuceneIndexer.StartIndexing() in f:\datacenter\software\lcc\source\lcc.kudaba.luceneindexer\luceneindexer.cs:line 339


When I followed it with the debugger, I saw that the error occurred in line 93 of Strip, but I really have no idea what causes the trouble here. Both parts of the if condition can be evaluated at runtime, but the exception comes directly after this line...

...
else if ( (buffer.Length + substCount > 4) && buffer.ToString(buffer.Length - 2, buffer.Length).Equals("em") ) {
...
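
One plausible cause, offered as an assumption rather than a diagnosis confirmed in this thread: the second argument of .NET's StringBuilder.ToString(startIndex, length) is a character count, whereas Java's substring(start, end) takes an end index. A line ported directly from Java would therefore pass buffer.Length where 2 is expected, which matches the "Parameter name: length" in the trace above. The sketch below reproduces the pattern in isolation; SuffixCheck and its methods are hypothetical names, not Lucene.Net code.

```csharp
using System;
using System.Text;

// Hypothetical illustration, not Lucene.Net code.
public static class SuffixCheck
{
    // Direct port of the Java idiom: passes an END INDEX where .NET expects
    // a LENGTH, so this throws ArgumentOutOfRangeException for typical input.
    public static bool EndsWithEmPorted(StringBuilder buffer)
    {
        return buffer.ToString(buffer.Length - 2, buffer.Length).Equals("em");
    }

    // Passes the number of characters to read (2) as the second argument.
    public static bool EndsWithEmFixed(StringBuilder buffer)
    {
        return buffer.Length >= 2
            && buffer.ToString(buffer.Length - 2, 2).Equals("em");
    }
}
```

With a buffer holding "tiger", the ported version asks for 5 characters starting at index 3 and fails, while the fixed version reads only the last two characters.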



So I gave up on the GermanAnalyzer and used the StandardAnalyzer instead. I started to read a database table with about 600000 rows. After processing about 5000 rows without problems, I got the following error:

Exception: System.IO.IOException
Message: Cannot rename segments.new to segments
Source: Lucene.Net
   at Lucene.Net.Store.FSDirectory.RenameFile(String from, String to) in F:\DataCenter\Software\LCC\Source\Lucene.Net\Store\FSDirectory.cs:line 438
   at Lucene.Net.Index.SegmentInfos.Write(Directory directory) in F:\DataCenter\Software\LCC\Source\Lucene.Net\Index\SegmentInfos.cs:line 105
   at Lucene.Net.Index.AnonymousClassWith2.DoBody() in F:\DataCenter\Software\LCC\Source\Lucene.Net\Index\IndexWriter.cs:line 122
   at Lucene.Net.Store.With.Run() in F:\DataCenter\Software\LCC\Source\Lucene.Net\Store\Lock.cs:line 126
   at Lucene.Net.Index.IndexWriter.MergeSegments(Int32 minSegment) in F:\DataCenter\Software\LCC\Source\Lucene.Net\Index\IndexWriter.cs:line 605
   at Lucene.Net.Index.IndexWriter.MaybeMergeSegments() in F:\DataCenter\Software\LCC\Source\Lucene.Net\Index\IndexWriter.cs:line 559
   at Lucene.Net.Index.IndexWriter.AddDocument(Document doc, Analyzer analyzer) in F:\DataCenter\Software\LCC\Source\Lucene.Net\Index\IndexWriter.cs:line 392
   at Lucene.Net.Index.IndexWriter.AddDocument(Document doc) in F:\DataCenter\Software\LCC\Source\Lucene.Net\Index\IndexWriter.cs:line 376
   at LCC.KuDaBa.LuceneIndexer.LuceneIndexer.StartIndexing() in f:\datacenter\software\lcc\source\lcc.kudaba.luceneindexer\luceneindexer.cs:line 339


In the folder I see a file named segments and a file named segments.new.

So, can someone please give me some assistance with these problems?

Thanks,
Dirk