What is Lucene.Net?
Lucene.Net is a high-performance Information Retrieval (IR) library, also known as a search engine library. It contains powerful APIs for creating full-text indexes and implementing advanced, precise search technologies in your programs. Some people may confuse Lucene.Net with a ready-to-use application like a web search/crawler or a file search application, but Lucene.Net is not such an application; it's a framework library that lets you implement these difficult technologies yourself. Lucene.Net does not restrict what you can index and search, which gives you a lot more power compared to other full-text indexing/searching implementations; you can index anything that can be represented as text. There are also ways to get Lucene.Net to index HTML, Office documents, PDF files, and much more.
Lucene.Net is an API-per-API port of the original Lucene project, which is written in Java; even the unit tests were ported to guarantee the quality. Also, the Lucene.Net index is fully compatible with the Lucene index, and both libraries can be used on the same index together with no problems. A number of products have used Lucene and Lucene.Net to build their searches; some well-known websites include Wikipedia, CNET, Monster.com, Mayo Clinic, FedEx, and many more. But it’s not just websites that have used Lucene; there is also a product built on Lucene.Net, called Lookout, a search tool for Microsoft Outlook that made Outlook’s integrated search look painfully slow and inaccurate.
Lucene.Net is currently undergoing incubation at the Apache Software Foundation. Its source code is held in a Subversion repository. If you need help downloading the source, you can use the free TortoiseSVN or RapidSVN. The Lucene.Net project always welcomes new contributors. And remember, there are many ways to contribute to an open source project other than writing code.
Creating a search solution
There are roughly two main parts to a search solution: indexing the content you wish to search, and actually searching the content. It is pretty much as simple as that. After we have an index, we will perform a search.
What you need to create an index
Let’s see an example of what it takes to create an index and to populate it.
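string indexFileLocation = @"C:\Index";
Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, true);
Lucene.Net.Analysis.Analyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer();
// The final argument is true to create a new index.
Lucene.Net.Index.IndexWriter indexWriter = new Lucene.Net.Index.IndexWriter(dir, analyzer, true);
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
Lucene.Net.Documents.Field fldContent = new Lucene.Net.Documents.Field("content",
    "The quick brown fox jumps over the lazy dog",
    Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.TOKENIZED,
    Lucene.Net.Documents.Field.TermVector.YES);
doc.Add(fldContent);
indexWriter.AddDocument(doc);
indexWriter.Optimize();
indexWriter.Close();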
Alright, not bad; let’s take a look at what we just did. There are five main classes in use here, and they are Directory, Analyzer, IndexWriter, Document, and Field. We create a Directory that lets Lucene know where we want to store the index. The Analyzer is used to analyze the text. We have an IndexWriter that uses the Analyzer to create and write out the index. Then, we create a new Document object, and create a Field that has its field name set to “content” and its value set to “The quick brown fox jumps over the lazy dog”. We add the Field to the Document, and now we can index the newly created Document with the IndexWriter. Then, we have a funny-looking call to Optimize (more on this later), and we call Close to close the writer when we are done. We have successfully created a full-text index that’s ready to be searched. First, let’s elaborate a little bit on some of the classes that we just used.
Lucene.Net.Store.Directory – The Directory is a base class that is used to provide an abstract view of a directory. There are two implementations packaged with Lucene.Net. FSDirectory works with a file directory to store the index. RAMDirectory is an in-memory directory that you can use to store the index. You can inherit from the Directory class to implement your own custom directory object to store the index.
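To make the in-memory option concrete, here is a minimal sketch that builds a one-document index in a RAMDirectory instead of on disk. It assumes the same Lucene.Net 2.x-era API used elsewhere in this article; the names RamIndexSketch and BuildInMemoryIndex are made up for illustration.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public static class RamIndexSketch
{
    // Builds a one-document index entirely in memory and returns its directory.
    // Directory is fully qualified to avoid a clash with System.IO.Directory.
    public static Lucene.Net.Store.Directory BuildInMemoryIndex(string text)
    {
        Lucene.Net.Store.Directory dir = new Lucene.Net.Store.RAMDirectory();

        // true: create a new index in the (empty) directory.
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        Document doc = new Document();
        doc.Add(new Field("content", text, Field.Store.YES, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
        writer.Close();

        return dir;
    }
}
```

An in-memory index like this disappears when the process ends, so it is mostly useful for tests and small, short-lived indexes.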
Lucene.Net.Analysis.Analyzer – The Analyzer is a base class that is responsible for breaking the text down into single words or terms, and removing any noise words, or what Lucene.Net calls stop words; stop words include “and”, “a”, “the”, etc. For now, we will just use the StandardAnalyzer class, as it’s a very good first choice. You can pass in a list of your own stop words to the constructor of the StandardAnalyzer as a string array. Using the default constructor will use the default list of stop words. You can inherit from the Analyzer to implement a custom way to handle the documents that are to be indexed.
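To give a feel for what "breaking text into terms and removing stop words" means, here is a toy sketch in plain C#. It is not Lucene.Net's StandardAnalyzer and does far less than the real thing; the ToyAnalyzer class, its separator list, and its three-word stop list are all hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class ToyAnalyzer
{
    // Hypothetical stop-word list, for illustration only.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "a", "and", "the" };

    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', ',', '.', ';', ':', '!', '?' };

    // Splits the text into words, drops stop words, and lowercases the rest.
    public static List<string> Analyze(string text)
    {
        var terms = new List<string>();
        foreach (string word in text.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (!StopWords.Contains(word))
            {
                terms.Add(word.ToLowerInvariant());
            }
        }
        return terms;
    }
}
```

Calling ToyAnalyzer.Analyze on the sentence from the indexing example yields the terms quick, brown, fox, jumps, over, lazy, and dog; both occurrences of "the" are dropped.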
Lucene.Net.Index.IndexWriter – The IndexWriter takes on the responsibility of coordinating the Analyzer and handing the results to the Directory for storage. During the creation of the index, the writer will create some files in the Directory. When we add documents to the index writer, it will use the Analyzer to break down each of the fields and find a place to store the indexed document in the Directory. After a session of indexing documents, you are encouraged to optimize the index, which compacts it into a less resource-intensive form. Note that calling Optimize for every Document you add to the index is not recommended; call it just once after an indexing session, if you can. As the last argument of the IndexWriter’s constructor, we specify true to create a new index. To add more documents to an existing index, you would specify false here, to avoid overwriting the index.
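The advice above (open with false to append, optimize once per session) might look like the following sketch. It assumes the Lucene.Net 2.x-era API used in this article and an index that already exists in the given directory; AppendSketch and AppendDocuments are hypothetical names.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public static class AppendSketch
{
    // Adds one document per text to an EXISTING index, then optimizes once.
    public static void AppendDocuments(Lucene.Net.Store.Directory dir, IEnumerable<string> texts)
    {
        // false: open the existing index instead of overwriting it.
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
        try
        {
            foreach (string text in texts)
            {
                Document doc = new Document();
                doc.Add(new Field("content", text, Field.Store.YES, Field.Index.TOKENIZED));
                writer.AddDocument(doc);
            }

            // Optimize once, at the end of the session, not once per document.
            writer.Optimize();
        }
        finally
        {
            writer.Close();
        }
    }
}
```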
Lucene.Net.Documents.Document – The Document class is what gets indexed by the IndexWriter. You can think of a Document as an entity that you want to retrieve; a Document could represent an email, or a web page, or a recipe, or even a CodeProject article.
Lucene.Net.Documents.Field – The Document contains a list of Fields that are used to describe the document. Every field has a name and a value. Each field’s value contains the text that you want to make searchable. The other parts of the field's constructor contain instructions for how to handle an individual field. The Field.Store instructions tell the IndexWriter that you want to store the field’s value inside the index, so that the value can later be retrieved and acted upon, for example to show the data to the user in the search results, or to store an identifier value like the primary key of the object that this field's document represents.
Other instructions are the Field.Index values, which tell the IndexWriter how to index (if at all) the field. Possible values include Field.Index.TOKENIZED, meaning that we want to break down the string with the Analyzer and make it searchable. Another option is Field.Index.UN_TOKENIZED, which will still index the field, but as a whole; it is not broken down by the Analyzer. The difference between storing a value and indexing a value is one of purpose: you store a value so that you can retrieve it back from the index, and you index a value to make the field’s value searchable. It is totally acceptable to store a value but not have it indexed; for example, you would probably want to store an identifier value but not index it. It’s equally possible to index the content of an email without storing it, because you don’t really want to display the content within the search results. The other set of instructions defines how to handle TermVectors. Storing TermVectors in an index supports an advanced feature of Lucene that goes beyond exactly matching a search query term: with term vectors, you can retrieve related documents, that is, documents that are about the same subject.
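The store-versus-index distinction can be sketched with a document that represents an email. This assumes the Lucene.Net 2.x-era Field constructor that takes a name, a value, a Field.Store, and a Field.Index; the field names ("id", "subject", "body") and the EmailDocumentSketch class are hypothetical.

```csharp
using Lucene.Net.Documents;

public static class EmailDocumentSketch
{
    // Builds a Document whose fields each use a different store/index combination.
    public static Document Build(string id, string subject, string body)
    {
        Document doc = new Document();

        // Stored but not indexed: can be read back from a hit, cannot be searched.
        doc.Add(new Field("id", id, Field.Store.YES, Field.Index.NO));

        // Stored and tokenized: searchable, and can be shown in the results.
        doc.Add(new Field("subject", subject, Field.Store.YES, Field.Index.TOKENIZED));

        // Tokenized but not stored: searchable, but not retrievable from the index.
        doc.Add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));

        return doc;
    }
}
```

With a layout like this, a search can match on the subject or body, and the stored id lets you load the full email from your own data store when the user opens a result.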
After a good indexing of some documents, I’m sure that we are ready for the fun part.
What you need to search an index
Let’s take a look at an example of what you need to perform a simple search.
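string indexFileLocation = @"C:\Index";
// Open the existing index directory (false: do not create or overwrite it).
Lucene.Net.Store.Directory dir =
    Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, false);
Lucene.Net.Search.IndexSearcher searcher = new
    Lucene.Net.Search.IndexSearcher(dir);
// Search the "content" field for the single term "fox".
Lucene.Net.Index.Term searchTerm =
    new Lucene.Net.Index.Term("content", "fox");
Lucene.Net.Search.Query query = new Lucene.Net.Search.TermQuery(searchTerm);
Lucene.Net.Search.Hits hits = searcher.Search(query);
// Walk the hits and read back the stored "content" value of each document.
for (int i = 0; i < hits.Length(); i++)
{
    Lucene.Net.Documents.Document doc = hits.Doc(i);
    string contentValue = doc.Get("content");
}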
With this small bit of code, we defined where our index is stored, again through the use of a Directory class. But now, we have this IndexSearcher object, which does all the heavy lifting of the actual search. To use the IndexSearcher, we have to pass it a Query object. You call the Search method on the IndexSearcher object, passing in the Query object, and it returns a Hits object. Finally, by iterating through the Hits, we are able to pull out the Documents that match that query. After we have our document, we can pull out a field's value that was previously stored with the document when it was indexed. Let's look into the classes a little more closely!
Lucene.Net.Search.IndexSearcher – The IndexSearcher object, again, does all the heavy lifting of doing the actual search. When a search is to be performed, it will use the Directory object passed into the IndexSearcher’s constructor to open the index as a read-only file. There are more methods on the IndexSearcher object that provide some other ways to query an index.
Lucene.Net.Index.Term – A Term is the most basic construct for searching. A Term consists of two parts: the name of a field you wish to search, and the value of the field.
Lucene.Net.Search.Query – The Query is an abstract base class that works with the IndexSearcher to provide the results. In the example above, we used a TermQuery object that makes a query of a single Term. There are many other ways to create a query. Other implementations of the Query class, besides the TermQuery, include the SpanQuery. With all these choices in how to query, we need a way to let the user make a powerful query from a single textbox. This is where another class comes in, and that is the QueryParser. More on this soon.
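As one illustration of building a query by hand from more than a single Term, Lucene also provides a BooleanQuery that combines other queries. The sketch below assumes the Lucene.Net 2.x-era BooleanQuery and BooleanClause.Occur API and the "content" field from the earlier examples; QuerySketch and RequireButExclude are hypothetical names.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class QuerySketch
{
    // Builds a query for documents whose "content" field contains
    // the required term but does not contain the excluded term.
    public static Query RequireButExclude(string required, string excluded)
    {
        BooleanQuery query = new BooleanQuery();
        query.Add(new TermQuery(new Term("content", required)), BooleanClause.Occur.MUST);
        query.Add(new TermQuery(new Term("content", excluded)), BooleanClause.Occur.MUST_NOT);
        return query;
    }
}
```

The resulting Query can be passed to the IndexSearcher's Search method just like the TermQuery above. Note that a TermQuery is not run through an Analyzer, so the terms should already be in the form the Analyzer produced at indexing time (lowercase, for the StandardAnalyzer).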
Lucene.Net.Search.Hits – This represents a list of documents that were returned in the search. A Hits object can be iterated over, and is responsible for getting the documents from the search. For larger indexes, it is not recommended to iterate over all the search results. It’s also good to note that the Hits object doesn’t load all the documents initially; it loads only a portion of them, since loading everything up front would lead to performance issues. After you have a Hits object, you can call the Doc(int index) method, which will return the document associated with a single hit.
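One way to follow the advice about not iterating over every result is to read only the first few hits. This is a minimal sketch assuming the Lucene.Net 2.x-era Hits API (Length and Doc) used in this article; HitsSketch, TopValues, and the max parameter are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Search;

public static class HitsSketch
{
    // Returns the stored value of `field` for at most `max` of the top hits.
    public static List<string> TopValues(Hits hits, string field, int max)
    {
        var values = new List<string>();
        int count = Math.Min(max, hits.Length());
        for (int i = 0; i < count; i++)
        {
            // Doc(i) loads the document for this hit; Get reads a stored field.
            values.Add(hits.Doc(i).Get(field));
        }
        return values;
    }
}
```

A search page would typically call this with a small max, such as the number of results shown per page.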
As I mentioned earlier, there are many implementations of the Query class, and each of them has its place in queries. Mostly, you wouldn’t create a query object yourself, but let a powerful parser build a complex query for you from some simple syntax, much like how you search Google. This is where I introduce you to the QueryParser. A QueryParser instance has a method called Parse(string query). Here is a small example of using the QueryParser:
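Lucene.Net.Analysis.Analyzer analyzer = new
    Lucene.Net.Analysis.Standard.StandardAnalyzer();
// "content" is the field searched when the query text names no field.
Lucene.Net.QueryParsers.QueryParser queryParser = new
    Lucene.Net.QueryParsers.QueryParser("content", analyzer);
query = queryParser.Parse("fox");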
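Putting the pieces together, the following sketch indexes a few strings in memory, parses a user's query text with the QueryParser, and counts the matching documents. It assumes the Lucene.Net 2.x-era API used throughout this article; RoundTripSketch and CountMatches are hypothetical names.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

public static class RoundTripSketch
{
    // Indexes each text as one document, then returns how many documents
    // match the user's query text in the "content" field.
    public static int CountMatches(string[] texts, string userQuery)
    {
        Lucene.Net.Store.Directory dir = new Lucene.Net.Store.RAMDirectory();
        Analyzer analyzer = new StandardAnalyzer();

        // Indexing: one Document per text, optimized once at the end.
        IndexWriter writer = new IndexWriter(dir, analyzer, true);
        foreach (string text in texts)
        {
            Document doc = new Document();
            doc.Add(new Field("content", text, Field.Store.YES, Field.Index.TOKENIZED));
            writer.AddDocument(doc);
        }
        writer.Optimize();
        writer.Close();

        // Searching: parse with the same analyzer that was used for indexing.
        QueryParser parser = new QueryParser("content", analyzer);
        Query query = parser.Parse(userQuery);
        IndexSearcher searcher = new IndexSearcher(dir);
        try
        {
            return searcher.Search(query).Length();
        }
        finally
        {
            searcher.Close();
        }
    }
}
```

Using the same Analyzer for indexing and for parsing keeps the query terms in the same form as the indexed terms, which is why a query like "Fox" can still match text that was lowercased at indexing time.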
And, if you think all this stuff is neat, we have barely even scratched the surface. But, this will be all of the article for now. If you want to find out some more, let me know, and I’ll work on another article about Lucene.Net.
I'm a proud father and a software developer. I'm fascinated by a few particular .Net projects such as Lucene.Net, NHibernate, Quartz.Net, and others. I love learning and studying code to learn how other people solve software problems.