DotLucene: Full-Text Search for Your Intranet or Website using 37 Lines of Code

Dan Letecky, 6 Nov 2012
An introduction to Lucene.Net, the open source full-text search engine.

Update 

November 6, 2012: The project is now working with Lucene.Net 3.0 and .NET Framework 4.0. Includes Visual Studio 2010 solution.  

Lucene.Net: Excellent Full-Text Search Engine

Can full-text search be coded in 37 lines? Well, I am going to cheat a bit and use Lucene.Net for the dirty work. Lucene.Net is a .NET port of the Jakarta Lucene search engine. Here is a quick list of its features:

  • It can be used in ASP.NET, Windows Forms or console applications.
  • Very good performance.
  • Ranked search results.
  • Search query highlighting in results.
  • Searches structured and unstructured data.
  • Metadata searching (query by date, search custom fields...).
  • Index size is approximately 30% of the indexed text.
  • Can also store full indexed documents.
  • Pure managed .NET.
  • Very friendly licensing (Apache Software License 2.0).
  • Localizable (support for Brazilian, Czech, Chinese, Dutch, English, French, Japanese, Korean and Russian included).
  • Extensible (source code included).

Warning 

Don't take the line count too seriously. I will show you that the core functionality doesn't take more than 37 lines of code, but to make it a real application you will need to spend some more time on it...

Demo Project 

We will build a simple demo project that shows how to:

  • index HTML files found in a specified directory (including subdirectories).
  • search the index using an ASP.NET application.
  • highlight the query words in the search results.

But Lucene.Net has more potential. In a real-world application, you would probably want to:

  • Add the new documents to the index when they appear in the directory. You don't need to rebuild the whole index.
  • Include other file types. Lucene.Net can index any file type which you are able to convert to plain text.

Why Not Use Microsoft Indexing Server?

If you are happy with the Indexing Server, no problem. However, Lucene.Net has many advantages:

  • Lucene.Net is a single assembly of 100% managed code. It has no external dependencies. 
  • You can use it to index any type of data (e-mails, XML, HTML files, etc.) from any source (database, web, etc.). That's because the indexer only needs plain text. Loading and parsing the source is up to you.
  • It allows you to specify the attributes ("fields") that should be included in the index. You can search using these fields (e.g. by author, date, keywords).
  • It is open source.
  • It is easily extensible.

Line 1: Creating the Index

The following line of code creates a new index stored on disk. directory is a path to the directory where the index will be stored. 

IndexWriter writer = new IndexWriter(FSDirectory.Open(directory), new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.LIMITED); 

In this example, we create the index from scratch (the third constructor argument, true). This is not necessary: you can also open an existing index and add documents to it. You can also update an existing document by deleting it and adding a new version.
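
As a rough sketch of that update path, the snippet below opens an existing index and replaces one document. It assumes the text and path fields described in the next section and relies on IndexWriter.UpdateDocument, which deletes the documents matching a term and adds the new version in one call. The class name, method name and parameters are hypothetical and not part of the demo project.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class IndexUpdater
{
    // Replaces the indexed version of one file, or adds it if it is not indexed yet.
    // indexDirectory, path and plainText are hypothetical parameter names.
    public static void UpdateHtmlDocument(string indexDirectory, string path, string plainText)
    {
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

        // Passing false as the "create" argument opens the existing index
        // instead of overwriting it.
        var writer = new IndexWriter(FSDirectory.Open(indexDirectory), analyzer,
            false, IndexWriter.MaxFieldLength.LIMITED);
        try
        {
            var doc = new Document();
            doc.Add(new Field("text", plainText, Field.Store.YES, Field.Index.ANALYZED));
            doc.Add(new Field("path", path, Field.Store.YES, Field.Index.NOT_ANALYZED));

            // Deletes every document whose "path" term equals path, then adds doc.
            // The exact match relies on "path" being indexed as NOT_ANALYZED.
            writer.UpdateDocument(new Term("path", path), doc);
        }
        finally
        {
            writer.Close();
        }
    }
}
```

Because the path field is indexed without analysis, the term matches the whole path exactly. An analyzed field would be split into tokens and could not serve as a reliable key.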

Lines 2 - 12: Adding documents 

For each HTML document, we will add two fields into the index: 

  • text field that contains the text of the HTML file (with tags stripped). It is analyzed for searching and also stored in the index, so it can be read back later for highlighting.
  • path field that contains the file path. It will be indexed and stored in full in the index.
public void AddHtmlDocument(string path)
{
    Document doc = new Document();

    string rawText;
    using (StreamReader sr = 
       new StreamReader(path, System.Text.Encoding.Default))
    {
        rawText = parseHtml(sr.ReadToEnd());
    }
    
    doc.Add(new Field("text", rawText, Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field("path", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.AddDocument(doc);
}  
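
The parseHtml helper is not shown above. The following is only a minimal stand-in based on regular expressions, not the implementation used in the demo project; the class and method names are hypothetical. It is a sketch that works for reasonably well-formed pages. A real HTML parser would be more robust.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Converts HTML markup to plain text suitable for indexing.
    public static string ParseHtml(string html)
    {
        // Drop script and style blocks together with their content.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Replace each remaining tag with a space so adjacent words stay separate.
        text = Regex.Replace(text, @"<[^>]+>", " ");

        // Decode character entities such as &amp;.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace and trim the ends.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

For example, HtmlText.ParseHtml("<p>Hello <b>world</b></p>") returns "Hello world".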

Lines 13 - 14: Optimizing and Saving the Index

After adding the documents, you need to close the index writer. Calling Optimize() first merges the index segments, which improves search performance.

writer.Optimize();
writer.Close(); 
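
Putting lines 1 to 14 together, a complete indexing pass over a directory tree might look like the sketch below. It assumes .NET Framework 4.0 (for Directory.EnumerateFiles) and takes the tag-stripping step as a delegate so that the sketch stays self-contained. SiteIndexer, BuildIndex and the parameter names are hypothetical.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class SiteIndexer
{
    // Indexes every *.html file under contentRoot (including subdirectories)
    // and returns the number of files added. htmlToText is the caller's
    // tag-stripping function (hypothetical parameter).
    public static int BuildIndex(string indexDirectory, string contentRoot,
        Func<string, string> htmlToText)
    {
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

        // Passing true as the "create" argument builds a new index from scratch.
        var writer = new IndexWriter(FSDirectory.Open(indexDirectory), analyzer,
            true, IndexWriter.MaxFieldLength.LIMITED);
        int count = 0;
        try
        {
            // System.IO.Directory is fully qualified because Lucene.Net.Store
            // also defines a type named Directory.
            foreach (string path in System.IO.Directory.EnumerateFiles(
                contentRoot, "*.html", SearchOption.AllDirectories))
            {
                string rawText = htmlToText(
                    File.ReadAllText(path, System.Text.Encoding.Default));

                var doc = new Document();
                doc.Add(new Field("text", rawText, Field.Store.YES, Field.Index.ANALYZED));
                doc.Add(new Field("path", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
                writer.AddDocument(doc);
                count++;
            }

            writer.Optimize();
        }
        finally
        {
            // Close the writer even if a file fails to load.
            writer.Close();
        }
        return count;
    }
}
```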

Line 15: Opening the Index for Searching

Before doing any search, you need to open the index. indexDirectory is the path to the directory where the index was stored. 

IndexSearcher searcher = new IndexSearcher(FSDirectory.Open(indexDirectory));  

Lines 16 - 27: Searching 

Now we can parse the query (text is the default field to search in). The analyzer should be the same type as the one used for indexing.

var parser = new QueryParser(Version.LUCENE_30, "text", analyzer);
Query query = parser.Parse(this.Query); 
TopDocs hits = searcher.Search(query, 200);  

The hits variable holds the top-ranked matches (at most 200 here, the limit passed to Search). We will go through them and store the results in a DataTable.

DataTable dt = new DataTable();
dt.Columns.Add("path", typeof(string));
dt.Columns.Add("sample", typeof(string));

// ScoreDocs holds at most 200 entries; TotalHits can be larger
for (int i = 0; i < hits.ScoreDocs.Length; i++) 
{
    // get the document from index
    Document doc = searcher.Doc(hits.ScoreDocs[i].Doc);

    // get the document path; the stored "text" field
    // is used later to build the highlighted sample
    DataRow row = dt.NewRow();
    row["path"] = doc.Get("path");

    dt.Rows.Add(row);
}  
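
For reference, the opening and searching steps can be combined into one self-contained method. This is only a sketch under a few assumptions: the index was built with StandardAnalyzer as shown earlier, and the searcher can be disposed with a using block, as in Lucene.Net 3.0.3. SiteSearch and FindPaths are hypothetical names.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class SiteSearch
{
    // Returns the stored "path" of the best matches, most relevant first.
    public static IList<string> FindPaths(string indexDirectory, string queryText,
        int maxResults)
    {
        // Use the same analyzer type that was used when building the index.
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "text", analyzer);

        // Parse throws ParseException if queryText is not a valid query.
        Query query = parser.Parse(queryText);

        var paths = new List<string>();
        using (var searcher = new IndexSearcher(FSDirectory.Open(indexDirectory)))
        {
            TopDocs hits = searcher.Search(query, maxResults);

            // Iterate over ScoreDocs (at most maxResults entries),
            // not up to TotalHits, which counts all matches.
            foreach (ScoreDoc hit in hits.ScoreDocs)
            {
                Document doc = searcher.Doc(hit.Doc);
                paths.Add(doc.Get("path"));
            }
        }
        return paths;
    }
}
```

The query text follows the Lucene query syntax, so a call such as SiteSearch.FindPaths(indexDirectory, "lucene AND \"full text\"", 200) combines a single term with a phrase.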

Lines 28 - 37: Query Highlighting 

Let's create a highlighter. We will use a bold font for highlighting (each matched term is wrapped in a <span> with font-weight:bold). 

IFormatter formatter = new SimpleHTMLFormatter("<span style=\"font-weight:bold;\">", "</span>");
SimpleFragmenter fragmenter = new SimpleFragmenter(80);
QueryScorer scorer = new QueryScorer(query);
Highlighter highlighter = new Highlighter(formatter, scorer);
highlighter.TextFragmenter = fragmenter; 

During the result fetching, we will load the relevant part of the original text.

for (int i = 0; i < hits.ScoreDocs.Length; i++) 
{
    // ...
    // read the stored text once and reuse it for both calls
    string text = doc.Get("text");
    TokenStream stream = analyzer.TokenStream("", new StringReader(text));
    row["sample"] = highlighter.GetBestFragments(stream, text, 2, "...");
    // ...
}  

License

This article, along with any associated source code and files, is licensed under The Apache License, Version 2.0

About the Author

Dan Letecky
Czech Republic
My open-source event calendar/scheduling AJAX controls:

DayPilot for JavaScript/HTML5/jQuery
DayPilot for ASP.NET
DayPilot for MVC
DayPilot for Java

Article Copyright 2005 by Dan Letecky