Lucene.Net - Text Analysis

How to work with Lucene.Net's analysis.

 

What is Lucene.Net?

Lucene.Net is a high performance Information Retrieval (IR) library, also known as a search engine library. Lucene.Net contains powerful APIs for creating full text indexes and implementing advanced and precise search technologies in your programs. Some people may confuse Lucene.Net with a ready-to-use application like a web search engine/crawler or a file search application, but Lucene.Net is not such an application; it's a framework library. Lucene.Net provides a framework for implementing these difficult technologies yourself. Lucene.Net makes no discriminations about what you can index and search, which gives you a lot more power compared to other full text indexing/searching implementations; you can index anything that can be represented as text. There are also ways to get Lucene.Net to index HTML, Office documents, PDF files, and much more.

Lucene.Net is an API-per-API port of the original Lucene project, which is written in Java; even the unit tests were ported to guarantee the quality. Also, a Lucene.Net index is fully compatible with a Lucene index, and both libraries can be used on the same index together with no problems. A number of products have used Lucene and Lucene.Net to build their searches; some well known websites include Wikipedia, CNET, Monster.com, Mayo Clinic, FedEx, and many more. But it's not just websites that have used Lucene; there is also a product built on Lucene.Net called Lookout, a search tool for Microsoft Outlook that made Outlook's integrated search look painfully slow and inaccurate by comparison.

Lucene.Net is currently undergoing incubation at the Apache Software Foundation. Its source code is held in a Subversion repository. If you need help downloading the source, you can use the free TortoiseSVN or RapidSVN. The Lucene.Net project always welcomes new contributors. And remember, there are many ways to contribute to an open source project other than writing code.

Welcome to the second article

I would just like to welcome you to my second article on CodeProject. For those of you who stumbled upon this one first, check out my first article, Introducing Lucene.Net. The goal of this article is to introduce you to Lucene.Net Analyzers: what they are, how they work, and how they affect your searching.

What are Analyzers?

An Analyzer has a single job, and that is to be an advanced word breaker: an object that will read a stream of text and break it apart into objects called Tokens. The Token class will generally hold the results of the analysis as individual words. This is a very brief summary of what an Analyzer can do and how it affects your full text index. A good Analyzer will not only break the words apart, but will also perform transformations of the text to make it more suitable for indexing. One simple transformation an Analyzer can do is to lowercase everything it comes across; that way, your index will be case insensitive.

In the Lucene framework, there are two major spots where an Analyzer is used: indexing and searching. For the indexing portion, the direct results of the Analyzer are what gets indexed. So, for example, with the Analyzer just described that converts everything to lowercase, if we come across the word "CAT", the Analyzer will output "cat", and in the full text index, a Term of "cat" will be associated with the Document. For a fuller example, if we use an Analyzer that breaks the words apart on spaces and then converts them all to lowercase, the results should look something like this:

NOTE: The brackets show the different Tokens returned from the Analyzer.

Source Text

The Cat in the Hat.

Analysis Output

[the]  [cat]  [in]  [the]  [hat.]

Now, when you are searching, most of the time you will be using the QueryParser class to construct the Query object that you will use to search the full text index. Part of using the QueryParser class is that you have to supply an instance of an Analyzer to the QueryParser. The QueryParser will use the Analyzer to normalize the Term or Terms that you will actually be querying for.

Now, there is a relationship between the Analyzer that works with the indexing process and the Analyzer that works with building the Query object. Most of the time, the exact same kind of Analyzer will be used to do both jobs. This is because the search process will only match Terms that are exactly the same as what is in the index, and this includes case sensitivity. According to the index, the Term "cat" is a completely different Term than the word "Cat". This includes punctuation as well. So, in our analysis sample above with the output of [the]  [cat]  [in]  [the]  [hat.], if we were to directly search for the word "hat", this document would not be a match, because what is indexed is the Term [hat.] (with a period).

The point here is that the Terms being indexed must be the same as the Terms that you are searching with. This can be achieved by using a consistent method of analysis in both indexing and searching.
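
To tie this together, here is a minimal sketch (assuming the Lucene.Net 2.x API this article is based on; the field name "contents" and the class name are just placeholders) of handing the QueryParser the same kind of Analyzer that was used at indexing time, so the query text is normalized the same way as the indexed Terms:

C#
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

class ConsistentAnalysisExample
{
    static Query BuildQuery()
    {
        // The same kind of Analyzer that was used at indexing time...
        Analyzer analyzer = new StandardAnalyzer();

        // ...normalizes the query text too: "The Hat." becomes the single
        // Term [hat] (lowercased, period stripped, stop word "the" removed),
        // which matches what the index actually holds.
        QueryParser parser = new QueryParser("contents", analyzer);
        return parser.Parse("The Hat.");
    }
}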

 

The Analyzer Viewer Application 

Attached to this article is the Analyzer Viewer application that I made; both the source and a ready-to-run binary are included. The sample is more like a little utility to see how the basic Analyzers included with Lucene.Net will view text. The application allows you to directly input some text, and it will show you all the results of the text analysis: how it split the text up into Tokens and what transformations it applied.

Some interesting things to look at include email addresses, numbers with letters, numbers alone, acronyms, alternating cases, and anything else you want to play with to see how the analysis process goes.

 

[Screenshot: the Analyzer Viewer application displaying analysis results (standardview.jpg)]

 

The Built-in Analyzers 

Lucene.Net has several different built-in Analyzers. Each has its own uses; here's a list of the built-in Analyzers.

KeywordAnalyzer  

"Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names.

WhitespaceAnalyzer

An Analyzer that uses WhitespaceTokenizer.

StopAnalyzer

Filters LetterTokenizer with LowerCaseFilter and StopFilter.

SimpleAnalyzer 

An Analyzer that filters LetterTokenizer with LowerCaseFilter. 

StandardAnalyzer

Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.  

PerFieldAnalyzerWrapper 

This Analyzer is used to facilitate scenarios where different fields require different analysis techniques. When you create a PerFieldAnalyzerWrapper object, you must specify a default Analyzer, then use AddAnalyzer(string fieldName, Analyzer analyzer) to add a non-default Analyzer on a field name basis, as in the sketch below.
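
For instance, here is a minimal sketch (the field name "sku" is just an example) that analyzes most fields with the StandardAnalyzer but keeps a product code field as a single untouched Token:

C#
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

class PerFieldExample
{
    static Analyzer BuildAnalyzer()
    {
        // StandardAnalyzer is the default for any field not registered below.
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer());

        // The "sku" field is tokenized as one untouched Token, so a
        // product code like "AB-123/X" can be matched exactly.
        wrapper.AddAnalyzer("sku", new KeywordAnalyzer());
        return wrapper;
    }
}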


 

How Does This Work?

An Analyzer is an object that inherits from an abstract base class called Analyzer, which lives in the Lucene.Net.Analysis namespace. To implement a custom Analyzer, you only need to implement one method, called TokenStream, which takes two parameters: a string with the name of the field being analyzed, and a TextReader object that contains the text to be read. The signature is as follows:

C#
public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
{
    // TODO: Return a TokenStream object
}

An Analyzer builds a TokenStream, which is simply a stream of Token objects. So, when an Analyzer gets used, it returns a TokenStream object that you can use to iterate over the Tokens produced by the analysis. Here, we are introducing two new classes to become familiar with: TokenStream and Token. The summaries from Lucene's documentation are:

TokenStream - A TokenStream enumerates the sequence of tokens, either from fields of a document or from query text.

Token -  A Token is an occurrence of a term from the text of a field. It consists of a term's text, the start and end offset of the term in the text of the field, and a type string. The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC (KeyWord In Context) display, etc. The type is an interned string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example an end of sentence marker token might be implemented with type "eos". The default token type is "word".  

Basically, a TokenStream is a way to provide access to a stream of Tokens, and a Token simply holds some information: the text representing a Term, the Term's field, and the offsets of where the Token is located in the original text.
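
To make that concrete, here is a minimal sketch (against the Lucene.Net 2.x API, where TokenStream.Next() hands back one Token at a time; the field name "contents" is just a placeholder) that prints each Token's text, offsets, and type:

C#
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

class TokenDump
{
    static void Main()
    {
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream stream = analyzer.TokenStream("contents", new StringReader("The Cat in the Hat."));

        // Next() returns null once the stream runs out of Tokens.
        Token token;
        while ((token = stream.Next()) != null)
        {
            Console.WriteLine("[{0}] offsets {1}-{2}, type {3}",
                token.TermText(), token.StartOffset(), token.EndOffset(), token.Type());
        }
        stream.Close();
    }
}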

How Does an Analyzer Create a TokenStream?

The TokenStream is a public abstract base class. There are two other abstract classes that inherit from the TokenStream class and take it a little bit further: the Tokenizer and the TokenFilter classes. As I mentioned, both of these classes are implementations of a TokenStream, and both of them are also abstract.

The Tokenizer and TokenFilter classes have different jobs in the analysis of the text; these two classes were created to separate the responsibilities of what an Analyzer does. The Tokenizer acts as the word breaker: its main job is to process the TextReader and break the text it reads into Token objects. The TokenFilter class, on the other hand, expects an inner TokenStream as a parameter of its constructor; it is essentially a TokenStream wrapped around another TokenStream. When you ask a TokenFilter for the next Token, it calls the inner TokenStream to get the next Token, evaluates it, and performs different operations with the information. For example, a TokenFilter can convert all the text to lowercase, or it can decide that a Token is not important and discard it so it never gets indexed or seen. The TokenFilter is about filtering the content of a TokenStream.

When an Analyzer creates a TokenStream, it is usually just creating a single Tokenizer object to break the words and then using one or more TokenFilters to filter the results of the Tokenizer.
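
Here is a sketch of that pattern (the Analyzer class itself is hypothetical, built from the stock Lucene.Net 2.x Tokenizer and TokenFilters): one Tokenizer at the bottom, with two TokenFilters wrapped around it:

C#
using System.IO;
using Lucene.Net.Analysis;

public class WhitespaceLowerCaseStopAnalyzer : Analyzer
{
    private static readonly string[] StopWords = { "a", "an", "and", "the" };

    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // One Tokenizer breaks the text apart on whitespace...
        TokenStream result = new WhitespaceTokenizer(reader);

        // ...and each TokenFilter wraps the stream beneath it.
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, StopWords);
        return result;
    }
}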

Implementations of a Tokenizer.

As I mentioned earlier, the Tokenizer class is an abstract class that inherits from TokenStream. Lucene.Net provides a few implementations of a Tokenizer that it uses in some of the Analyzers. Here are a few of them with a small description of each.

KeywordTokenizer - This Tokenizer will read the entire stream of text and return the whole thing as a single Token.

CharTokenizer - This is an abstract base Tokenizer that is extended by two other Tokenizers, and it is a good starting point for creating your own Tokenizer (see the sketch after this list). It has a single method that you must implement, IsTokenChar(char c), which returns a Boolean indicating whether the character belongs to a Token. Once you hit a character that isn't a Token character, the method returns false, and a Token is created from the run of Token characters up to that point.

WhitespaceTokenizer - This Tokenizer inherits from the CharTokenizer, and it breaks the words apart according to whitespace.

LetterTokenizer - This Tokenizer inherits from the CharTokenizer, and it breaks the words apart according to letters; the moment it hits a number, a symbol, or whitespace, it ignores it and creates Terms from just the letters.

LowerCaseTokenizer - The LowerCaseTokenizer inherits from the LetterTokenizer and just performs the extra step of converting the results to lowercase.

StandardTokenizer - This Tokenizer is a pretty good choice for most European languages. It is a grammar-based Tokenizer that will recognize email addresses, host names, and acronyms.
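
As promised above, here is a minimal sketch of a custom Tokenizer built on CharTokenizer (the class is hypothetical, and the exact access modifiers of IsTokenChar can vary between Lucene.Net versions):

C#
using System.IO;
using Lucene.Net.Analysis;

public class LetterOrDigitTokenizer : CharTokenizer
{
    public LetterOrDigitTokenizer(TextReader input) : base(input) { }

    // A Token is any unbroken run of letters or digits, so
    // "X-14 rocket" breaks into [X] [14] [rocket].
    protected internal override bool IsTokenChar(char c)
    {
        return char.IsLetterOrDigit(c);
    }
}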

 

Implementations of a TokenFilter.

As I mentioned earlier, the TokenFilter class is an abstract class that inherits from TokenStream. Lucene.Net provides a few implementations of a TokenFilter that it uses in some of the Analyzers. Here are a few of them with a small description of each.

LowerCaseFilter - The LowerCaseFilter will take the incoming Tokens and convert all the letters to lowercase. Useful for case insensitive indexing and searching.

StopFilter - The StopFilter will filter out 'stop words'. Stop words are defined as common words that should be ignored, such as 'a', 'and', 'the', 'but', 'so', 'also', etc. A constructor of the StopFilter requires you to pass in a string array that contains the list of stop words you want to define.

LengthFilter - The LengthFilter is useful if you want to filter out words that are too long or too short.

StandardFilter - The StandardFilter is used to normalize the results from the StandardTokenizer.

ISOLatin1AccentFilter  - the ISOLatin1AccentFilter is a filter that replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) by their unaccented equivalent. The case will not be altered.  For instance, 'à' will be replaced by 'a'.  

PorterStemFilter - The PorterStemFilter transforms the token stream as per the Porter stemming algorithm. Note: the input to the stemming filter must already be in lowercase, so you will need to use a LowerCaseFilter or LowerCaseTokenizer farther down the Tokenizer chain for this to work properly. To use this filter with other Analyzers, you'll want to write an Analyzer class that sets up the TokenStream chain as you want it; a sketch using the LowerCaseTokenizer follows.
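
A minimal sketch of such an Analyzer (the class name is hypothetical; the types are the stock Lucene.Net 2.x ones):

C#
using System.IO;
using Lucene.Net.Analysis;

public class PorterStemAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // The LowerCaseTokenizer guarantees the lowercase input
        // that the PorterStemFilter requires.
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}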

Hierarchy Graph

Here is a simple hierarchy graph of some of the classes, starting with the TokenStream class.

* TokenStream

    * Tokenizer
        * KeywordTokenizer
        * CharTokenizer
            * WhitespaceTokenizer
            * LetterTokenizer
                * LowerCaseTokenizer
        * StandardTokenizer

    * TokenFilter
        * LowerCaseFilter
        * StopFilter
        * StandardFilter
        * PorterStemFilter
        * LengthFilter
        * ISOLatin1AccentFilter

Points of Interest 

I think this is a very interesting topic, because it greatly affects how your users will search and what kind of results they will get back. I wanted to create a custom Analyzer for this article, but I felt that this one was more about how Analyzers work. And, it gives me an excuse to write another article :)

I hope this was interesting to you, and I hope I was as clear as can be. Again, if you're not understanding what I'm trying to communicate, just point it out to me. Thanks!

History 

1/2/2009 - Initial article released
 

License

This article, along with any associated source code and files, is licensed under The Apache License, Version 2.0

