Introducing Lucene.Net

AndrewSmith

Sep 29, 2008

CPOL

A plunge into creating a fast, full-text index with advanced searching capabilities.

What is Lucene.Net?

Lucene.Net is a high-performance Information Retrieval (IR) library, also known as a search engine library. It contains powerful APIs for creating full-text indexes and building advanced, precise search features into your programs. Some people confuse Lucene.Net with a ready-to-use application such as a web search/crawler or a file search application, but it is not such an application; it is a framework library that lets you implement these difficult technologies yourself. Lucene.Net makes no assumptions about what you index and search, which gives you a lot more power than other full-text indexing/searching implementations: you can index anything that can be represented as text. There are also ways to get Lucene.Net to index HTML, Office documents, PDF files, and much more.

Lucene.Net is an API-per-API port of the original Lucene project, which is written in Java; even the unit tests were ported to guarantee the quality. Also, a Lucene.Net index is fully compatible with a Lucene index, and both libraries can be used on the same index together with no problems. A number of products have used Lucene and Lucene.Net to build their searches; some well-known websites include Wikipedia, CNET, Monster.com, Mayo Clinic, FedEx, and many more. But it’s not just websites that have used Lucene; there is also a product built on Lucene.Net called Lookout, a search tool for Microsoft Outlook that made Outlook’s integrated search look painfully slow and inaccurate.

Lucene.Net is currently undergoing incubation at the Apache Software Foundation. Its source code is held in a Subversion repository and can be found here. If you need help downloading the source, you can use the free TortoiseSVN or RapidSVN clients. The Lucene.Net project always welcomes new contributors. And remember, there are many ways to contribute to an open source project other than writing code.

Creating a search solution

There are roughly two main parts to a search solution: indexing the content you wish to search, and actually searching that content. It is pretty much as simple as that. We will create an index first, and then perform a search against it.
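
To make the two phases concrete before bringing in the library, here is a toy sketch in plain C#. It is only an illustration of the idea of building an index once and then looking words up in it; it is not Lucene.Net's API or its index format, and the `ToySearch` class and its method names are made up for this example.

```csharp
using System;
using System.Collections.Generic;

// Toy illustration only: NOT Lucene.Net's API or index format.
public static class ToySearch
{
    // Phase 1, indexing: map each lowercased word to the ids
    // (list positions) of the documents that contain it.
    public static Dictionary<string, SortedSet<int>> BuildIndex(IList<string> documents)
    {
        var index = new Dictionary<string, SortedSet<int>>();
        for (int docId = 0; docId < documents.Count; docId++)
        {
            string[] words = documents[docId].ToLowerInvariant()
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string word in words)
            {
                SortedSet<int> postings;
                if (!index.TryGetValue(word, out postings))
                {
                    postings = new SortedSet<int>();
                    index[word] = postings;
                }
                postings.Add(docId);
            }
        }
        return index;
    }

    // Phase 2, searching: look the word up in the index
    // instead of scanning the text of every document.
    public static IEnumerable<int> Search(Dictionary<string, SortedSet<int>> index, string word)
    {
        SortedSet<int> postings;
        if (index.TryGetValue(word.ToLowerInvariant(), out postings))
        {
            return postings;
        }
        return new int[0];
    }
}
```

For the three documents "The quick brown fox", "The lazy dog", and "A fox and a dog", searching for "fox" returns the ids 0 and 2. Lucene.Net does a great deal more than this (analysis, scoring, on-disk storage), but the split into an indexing step and a searching step is the same.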

What you need to create an index

Let’s see an example of what it takes to create an index and to populate it.

//state the file location of the index
string indexFileLocation = @"C:\Index"; 
Lucene.Net.Store.Directory dir =
    Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, true);

//create an analyzer to process the text
Lucene.Net.Analysis.Analyzer analyzer = new
Lucene.Net.Analysis.Standard.StandardAnalyzer(); 

//create the index writer with the directory and analyzer defined.
Lucene.Net.Index.IndexWriter indexWriter = new
Lucene.Net.Index.IndexWriter(dir, analyzer, 
           /*true to create a new index*/ true); 

//create a document, add in a single field
Lucene.Net.Documents.Document doc = new
Lucene.Net.Documents.Document();

Lucene.Net.Documents.Field fldContent = 
  new Lucene.Net.Documents.Field("content", 
  "The quick brown fox jumps over the lazy dog",
  Lucene.Net.Documents.Field.Store.YES, 
  Lucene.Net.Documents.Field.Index.TOKENIZED, 
  Lucene.Net.Documents.Field.TermVector.YES);

doc.Add(fldContent);

//write the document to the index
indexWriter.AddDocument(doc);

//optimize and close the writer
indexWriter.Optimize(); 
indexWriter.Close();

Alright, not bad. Let’s take a look at what we just did. There are five main classes in use here: Directory, Analyzer, IndexWriter, Document, and Field. We create a Directory that lets Lucene know where we want to store the index. The Analyzer is used to analyze the text. We have an IndexWriter that uses the Directory and Analyzer to create and write out the index. Then, we create a new Document object, and create a Field that has its field name set to “content” and its value set to “The quick brown fox jumps over the lazy dog”. We add the Field to the Document, and now we can index the newly created Document with the IndexWriter. Then, we have a funny looking call to Optimize (more on this later), and call Close to close the writer when we are done. We have successfully created a full-text index that’s ready to be searched. First, let’s elaborate a little bit on some of the classes that we just used.

Lucene.Net.Store.Directory – The Directory is a base class that is used to provide an abstract view of a directory. There are two implementations packaged with Lucene.Net. FSDirectory works with a file directory to store the index. RAMDirectory is an in memory directory that you can use to store the index. You can inherit from the Directory class to implement your own custom directory object to store the index.
Lucene.Net.Analysis.Analyzer – The Analyzer is a base class that is responsible for breaking the text down into single words or terms, and removing any noise words, or what Lucene.Net calls stop words; stop words include “and”, “a”, “the”, etc. For now, we will just use the StandardAnalyzer class as it’s a very good first choice. You can pass in a list of your own stop words to the constructor of the StandardAnalyzer as a string array. Using the default constructor will use the default list of stop words. You can inherit from the Analyzer to implement a custom way to handle the documents that are to be indexed.
Lucene.Net.Index.IndexWriter – The IndexWriter takes on the responsibility of coordinating the Analyzer and handing the results to the Directory for storage. During the creation of the index, the writer will create some files in the Directory. When we add documents to the index writer, it will use the Analyzer to break down each of the fields and find a place to store the indexed document in the Directory. After a session of indexing documents, it is encouraged that you optimize the index, which compacts the index files so that searching them takes fewer resources. Also note that it is not recommended that you call Optimize for every Document you add to the index; call it just once after an indexing session, if you can. At the end of the IndexWriter’s constructor, we specify true to create a new index. To add more documents to an existing index, you would specify false here, to avoid overwriting the index.
Lucene.Net.Documents.Document – The Document class is what gets indexed by the IndexWriter. You can think of a Document as an entity that you want to retrieve; a Document could represent an email, or a web page, or a recipe, or even a CodeProject article.
Lucene.Net.Documents.Field – The Document contains a list of Fields that are used to describe the document. Every field has a name and a value. Each field’s value contains the text that you want to make searchable. The other parts of the field's constructor contain instructions for how to handle an individual field. The Field.Store instructions tell the IndexWriter whether you want to store the field’s value inside the index, so that the value can later be retrieved and acted upon. Typical uses are showing the data to the user in the search results, or storing an identifier value such as the primary key of the object that this field's document represents.
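
To tie the first three classes together, here is a small sketch that uses the in-memory RAMDirectory, a StandardAnalyzer with a custom stop word list, and the create flag of the IndexWriter to append to an existing index. It assumes a Lucene.Net 2.x assembly is referenced (the API generation used throughout this article); the `AppendExample` class, its method names, and the sample stop words are made up for illustration.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Sketch only; assumes the Lucene.Net 2.x API used in this article.
public static class AppendExample
{
    // Opens a writer, adds one document with a single "content" field,
    // then optimizes once and closes the writer.
    static void AddText(Directory dir, Analyzer analyzer, bool create, string text)
    {
        IndexWriter writer = new IndexWriter(dir, analyzer, create);
        Document doc = new Document();
        doc.Add(new Field("content", text, Field.Store.YES, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
        writer.Optimize();
        writer.Close();
    }

    public static int Run()
    {
        // In-memory index: nothing is written to disk.
        Directory dir = new RAMDirectory();

        // Custom stop words replace the default list (sample values).
        Analyzer analyzer = new StandardAnalyzer(new string[] { "the", "over" });

        // First session: create == true starts a brand new index.
        AddText(dir, analyzer, true, "The quick brown fox jumps over the lazy dog");

        // Second session: create == false appends to the existing index.
        AddText(dir, analyzer, false, "The lazy dog sleeps all day");

        // Count the documents to confirm that nothing was overwritten.
        IndexReader reader = IndexReader.Open(dir);
        int count = reader.NumDocs();
        reader.Close();
        return count;
    }
}
```

Calling `AppendExample.Run()` should report two documents. Passing true in the second session instead would leave only the second document in the index.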

Other instructions are the Field.Index values, which tell the IndexWriter how to index (if at all) the field. Possible values include Field.Index.TOKENIZED, meaning that we want to break down the string with the IndexWriter’s supplied Analyzer and make it searchable. Another option is Field.Index.UN_TOKENIZED, which will still index the field, but as a whole; it is not broken down by the Analyzer. The difference between storing a value and indexing a value is that the purpose of storing is to be able to retrieve the value back from the index, while the purpose of indexing is to make the field’s value searchable. It is totally acceptable to store a value but not have it indexed: you would probably want to store an identifier value but not index it. The reverse is possible too: you may want to index the content of an email without displaying the content within the search results. The last set of instructions defines how to handle TermVectors. Storing the TermVectors in an index supports an advanced feature of Lucene that goes beyond exactly matching a search query term: with term vectors, you will be able to retrieve related documents, as in documents that are about the same subject.
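
As a sketch of how these options combine, the following builds a Document for an email. It assumes the Lucene.Net 2.x API, where the whole-value option is spelled Field.Index.UN_TOKENIZED; the field names "id", "sender", and "body" and the `FieldOptionsExample` class are hypothetical and exist only for this illustration.

```csharp
using Lucene.Net.Documents;

// Sketch only; assumes the Lucene.Net 2.x API used in this article.
public static class FieldOptionsExample
{
    public static Document BuildEmailDocument(string id, string sender, string body)
    {
        Document doc = new Document();

        // Stored but not indexed: can be read back from a hit,
        // but cannot be searched on.
        doc.Add(new Field("id", id, Field.Store.YES, Field.Index.NO));

        // Stored and indexed as one whole term: searchable only
        // by the exact value, because the Analyzer is not applied.
        doc.Add(new Field("sender", sender, Field.Store.YES, Field.Index.UN_TOKENIZED));

        // Indexed but not stored: searchable word by word, but the
        // original text cannot be retrieved from the index.
        doc.Add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));

        return doc;
    }
}
```

With a layout like this, a search would run against "body", and the stored "id" from each hit would be used to load the full email from wherever it actually lives.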

After a good indexing of some documents, I’m sure that we are ready for the fun part.

What you need to search an index

Let’s take a look at an example of what you need to perform a simple search.

//state the file location of the index
string indexFileLocation = @"C:\Index";
//pass false to open the existing index; true would erase its contents
Lucene.Net.Store.Directory dir =
    Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, false);

//create an index searcher that will perform the search
Lucene.Net.Search.IndexSearcher searcher = new
Lucene.Net.Search.IndexSearcher(dir);

//build a query object
Lucene.Net.Index.Term searchTerm = 
  new Lucene.Net.Index.Term("content", "fox");
Lucene.Net.Search.Query query = new Lucene.Net.Search.TermQuery(searchTerm);

//execute the query
Lucene.Net.Search.Hits hits = searcher.Search(query);

//iterate over the results.
for (int i = 0; i < hits.Length(); i++)
{
    Lucene.Net.Documents.Document doc = hits.Doc(i);
    string contentValue = doc.Get("content");

    Console.WriteLine(contentValue);
}

//close the searcher when we are done with the results
searcher.Close();

With this small bit of code, we defined where our index is stored, again through the use of a Directory class. But now, we have this IndexSearcher object, which does all the heavy lifting of the actual search. To use the IndexSearcher, we have to pass it a Query object. You call the Search method on the IndexSearcher object, passing in the Query object, and it returns a Hits object. Finally, by iterating through the Hits, we are able to pull out the Documents that match the query. After we have a document, we can pull out a field's value that was previously stored with the document when it was indexed. Let's look at the classes a little more closely!

Lucene.Net.Search.IndexSearcher – The IndexSearcher object again does all the heavy lifting of doing the actual search. When a search is to be performed, it will use the Directory object passed into the IndexSearcher’s constructor to open the index as a read-only file. There are more methods on the IndexSearcher object that provide some other ways to query an index.
Lucene.Net.Index.Term – A Term is the most basic construct for searching. A Term consists of two parts, the name of a field you wish to search, and the value of the field.
Lucene.Net.Search.Query – The Query is an abstract base class that works with the IndexSearcher to provide the results. In the example above, we used a TermQuery object that makes a query of a single Term. There are many other ways to create a query. Some implementations of the Query class, besides the TermQuery, include BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, RangeQuery, FilteredQuery, and SpanQuery. With all these choices in how to query, we need a way to let the user make a powerful query from a single textbox. This is where another class comes in, and that is the QueryParser. More on this soon.
Lucene.Net.Search.Hits – This represents a list of documents that were returned in the search. A Hits object can be iterated over, and is responsible for getting the documents from the search. It’s good to note that the Hits object doesn’t load all the documents initially; it only loads a portion of them, because loading everything up front would lead to performance issues. For the same reason, it is not recommended to iterate over all the search results of a larger index. After you have a Hits object, you can call the Doc(int index) method, which will return the document associated with a single hit.
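
One way to follow the advice about not iterating over every result is to cap the loop at the number of results you actually intend to show. The sketch below assumes the Lucene.Net 2.x Hits API and a stored field named "content", as in the indexing example; the `HitsPaging` class, its method name, and the output format are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Documents;
using Lucene.Net.Search;

// Sketch only; assumes the Lucene.Net 2.x API used in this article.
public static class HitsPaging
{
    // Returns one line per hit, for at most maxResults hits, so that
    // only the documents that will be displayed are loaded.
    public static List<string> TopContents(Hits hits, int maxResults)
    {
        List<string> lines = new List<string>();
        int count = Math.Min(maxResults, hits.Length());
        for (int i = 0; i < count; i++)
        {
            Document doc = hits.Doc(i);
            // Score(i) is the relevance score Lucene assigned to this hit.
            lines.Add(hits.Score(i).ToString("F3") + "  " + doc.Get("content"));
        }
        return lines;
    }
}
```

A results page would call something like `HitsPaging.TopContents(hits, 10)` and show the total from `hits.Length()` next to it, rather than loading every matching document.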

Like I mentioned earlier, there are many implementations of the Query class, and each of them has its place. Mostly, you wouldn’t create a query object yourself, but let a powerful parser build a complex query for you from some simple syntax, much like how you search Google. This is where I introduce you to the QueryParser. A QueryParser instance has a method called Parse(string query). Here is a small example of using the QueryParser:

//create an analyzer to process the text
Lucene.Net.Analysis.Analyzer analyzer = new
Lucene.Net.Analysis.Standard.StandardAnalyzer();

//create the query parser, with the default search field set to "content"
Lucene.Net.QueryParsers.QueryParser queryParser = new
    Lucene.Net.QueryParsers.QueryParser("content", analyzer);

//parse the query string into a Query object
Lucene.Net.Search.Query query = queryParser.Parse("fox");
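
To show what the parser saves you from writing, here is a sketch that builds the same kind of two-term query in both ways: once from a user-style string, and once by hand with a BooleanQuery. It assumes the Lucene.Net 2.x API; the `QueryBuildingExample` class, its method names, and the sample search words are made up for illustration.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

// Sketch only; assumes the Lucene.Net 2.x API used in this article.
public static class QueryBuildingExample
{
    // Lets the parser turn text such as "quick AND fox" into a Query.
    // Parse throws a ParseException when the input is malformed.
    public static Query ParseUserInput(string userInput)
    {
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        return parser.Parse(userInput);
    }

    // Builds a query by hand that requires both terms to be present.
    public static Query BuildByHand()
    {
        BooleanQuery query = new BooleanQuery();
        query.Add(new TermQuery(new Term("content", "quick")), BooleanClause.Occur.MUST);
        query.Add(new TermQuery(new Term("content", "fox")), BooleanClause.Occur.MUST);
        return query;
    }
}
```

Printing `ParseUserInput("quick AND fox").ToString()` next to `BuildByHand().ToString()` is a quick way to check that the parser produced the query you expected. Since the text comes from the user, a real application would wrap the call to Parse in a try/catch for ParseException.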

And, if you think all this stuff is neat, we have barely even scratched the surface. But, this will be all of the article for now. If you want to find out some more, let me know, and I’ll work on another article about Lucene.Net.