What is Lucene.Net?
Lucene.Net is a high performance Information Retrieval (IR) library, also known as a search engine library. Lucene.Net contains powerful APIs for creating full text indexes and implementing advanced and precise search technologies in your programs. Some people may confuse Lucene.Net with a ready-to-use application like a web search/crawler or a file search application, but Lucene.Net is not such an application; it's a framework library. Lucene.Net provides a framework for implementing these difficult technologies yourself. Lucene.Net places no restrictions on what you can index and search, which gives you a lot more power compared to other full text indexing/searching implementations; you can index anything that can be represented as text. There are also ways to get Lucene.Net to index HTML, Office documents, PDF files, and much more.
Lucene.Net is an API-per-API port of the original Lucene project, which is written in Java; even the unit tests were ported to guarantee the quality. Also, a Lucene.Net index is fully compatible with a Lucene index, and both libraries can be used on the same index together with no problems. A number of products have used Lucene and Lucene.Net to build their searches; some well known websites include Wikipedia, CNET, Monster.com, Mayo Clinic, FedEx, and many more. But it's not just web sites that have used Lucene; there is also a product built on Lucene.Net, called Lookout, a search tool for Microsoft Outlook that made Outlook's integrated search look painfully slow and inaccurate by comparison.
Lucene.Net is currently undergoing incubation at the Apache Software Foundation. Its source code is held in a Subversion repository and can be found here. If you need help downloading the source, you can use the free TortoiseSVN or RapidSVN clients. The Lucene.Net project always welcomes new contributors. And remember, there are many ways to contribute to an open source project other than writing code.
Welcome to the second article
Hey, I would just like to welcome you to my second article on CodeProject. For those of you who seem to have stumbled upon this one first, check out my first article, Introducing Lucene.Net. The goal of this article is to introduce you to Lucene.Net Analyzers: more about how they work and how they affect your searching.
What are Analyzers?
An Analyzer has a single job, and that is to be an advanced word breaker: an object that will read a stream of text and break it apart into objects called Tokens. The Token class will generally hold the results of the analysis as individual words. This is a very brief summary of what an Analyzer can do and how it affects your full text index. A good Analyzer will not only break the words apart, but will also perform transformations on the text to make it more suitable for indexing. One simple transformation an Analyzer can do is to lowercase everything it comes across; that way your index will be case insensitive.
In the Lucene framework there are two major spots where an Analyzer is used, and that is when indexing and when searching. For the indexing portion, the direct results of the Analyzer are what get indexed. So for example, with an Analyzer that converts everything to lowercase, if we come across the word "CAT", the analyzer will output "cat", and in the full text index a Term of "cat" will be associated with the Document. For an even bigger example, if we use an Analyzer that breaks the words apart on spaces and then converts them all to lowercase, the results should look something like this.

NOTE: The brackets show the different Tokens returned from the Analyzer.

The Cat in the Hat.

[the] [cat] [in] [the] [hat.]
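As a rough sketch of what this looks like in code, here is a hypothetical analyzer that splits on white space and then lowercases, matching the example above. This assumes the Lucene.Net 2.x API, where `TokenStream.Next()` returns `null` once the stream is exhausted; exact signatures may differ in other versions.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;

// Hypothetical analyzer for illustration: break on white space,
// then lowercase each Token.
public class LowercaseWhitespaceAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}

public class Program
{
    public static void Main()
    {
        Analyzer analyzer = new LowercaseWhitespaceAnalyzer();
        TokenStream stream =
            analyzer.TokenStream("contents", new StringReader("The Cat in the Hat."));

        // In Lucene.Net 2.x, Next() returns null at the end of the stream.
        for (Token t = stream.Next(); t != null; t = stream.Next())
        {
            Console.Write("[" + t.TermText() + "] ");
        }
        // Should print: [the] [cat] [in] [the] [hat.]
    }
}
```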
Now, when you are searching, most of the time you will be using the QueryParser class to construct the Query object that you will use to search the full text index. Part of using the QueryParser class is that you have to supply an instance of an Analyzer; the QueryParser will use the Analyzer to normalize the Term or Terms that you will actually be querying for.
Now, there is a relationship between the Analyzer that works with the indexing process and the Analyzer that works with building the Query object. Most of the time the exact same kind of Analyzer will be used to do both jobs. This is because during the search process, only terms that are exactly the same as what is in the index will match, and this includes case sensitivity: according to the index, the Term "cat" is considered completely different from the term "Cat". This includes punctuation as well. So, with the analysis sample above that output [the] [cat] [in] [the] [hat.], if we were to directly search for the word "hat", this document would not be a match, because what is indexed is the Term [hat.] (with a period).
The point here is that the Terms being indexed must be the same as the Terms that you are searching with. This can be achieved by using a consistent method of analysis in both indexing and searching.
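As a sketch of what consistent analysis looks like in practice, the snippet below indexes one document and searches it using the same kind of Analyzer in both places. It assumes the Lucene.Net 2.x API (IndexWriter, Hits, and the two-argument QueryParser constructor); the field name "contents" is just made up for the example.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

public class ConsistentAnalysisDemo
{
    public static void Main()
    {
        // One analyzer used for BOTH indexing and searching.
        Analyzer analyzer = new StandardAnalyzer();
        Directory dir = new RAMDirectory();

        // Index a single document.
        IndexWriter writer = new IndexWriter(dir, analyzer, true);
        Document doc = new Document();
        doc.Add(new Field("contents", "The Cat in the Hat.",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
        writer.Close();

        // Search with the same analyzer, so "Hat" is normalized to "hat"
        // exactly as it was at index time.
        QueryParser parser = new QueryParser("contents", analyzer);
        Query query = parser.Parse("Hat");

        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.Search(query);
        Console.WriteLine("Matches: " + hits.Length());
        searcher.Close();
    }
}
```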
The Analyzer Viewer Application
Attached to this article is the Analyzer Viewer application that I made; both the source and a ready-to-run binary are included. The sample is a little utility that lets you see how the basic Analyzers included with Lucene.Net will view text. The application allows you to directly input some text, and it will show you the results of the text analysis: how the text was split up into tokens and what transformations were applied.
Some interesting things to look at include typing in email addresses, numbers with letters, numbers alone, acronyms, alternating cases, and anything else you want to play with to see how the indexing process goes.
The Built-in Analyzers
Lucene.Net has several different built-in analyzers. Each has its own uses; here's a list of the built-in analyzers.
"Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names.
An Analyzer that uses WhitespaceTokenizer.
Filters LetterTokenizer with LowerCaseFilter and StopFilter.
An Analyzer that filters LetterTokenizer with LowerCaseFilter.
StopFilter, using a list of English stop words.
This analyzer is used to facilitate scenarios where different fields require different analysis techniques. When you create an PerFieldAnalyzerWrapper object you must specify an analyzer to use for default, then use
AddAnalyzer(string fieldName, Analyzer analyzer) to add a non-default analyzer on a field name basis.
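A short sketch of wiring up a PerFieldAnalyzerWrapper, assuming the Lucene.Net 2.x API; the "partNumber" field name is made up purely for illustration.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

public class AnalyzerSetup
{
    public static Analyzer BuildAnalyzer()
    {
        // Most fields get the default StandardAnalyzer, but the
        // hypothetical "partNumber" field must stay one single token.
        PerFieldAnalyzerWrapper wrapper =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        wrapper.AddAnalyzer("partNumber", new KeywordAnalyzer());
        return wrapper;
    }
}
```

The returned wrapper is then passed anywhere an Analyzer is expected, e.g. to an IndexWriter or a QueryParser.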
How Does This Work?
An Analyzer is an object that inherits from an abstract base class, also called Analyzer, which lives in the Lucene.Net.Analysis namespace. To implement a custom Analyzer you only need to implement one method, called TokenStream, which takes two parameters: a string with the field name that will be passed to you, and a TextReader object that will contain the text to be read. The signature is as follows:

public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
The method returns a TokenStream, which is simply a stream of Token objects. So when an analyzer gets used, it will return a TokenStream object that you can use to iterate over the tokens returned from the analysis. Here we are introducing two new classes to become familiar with: the first is TokenStream and the second is Token. The summaries from Lucene's documentation are:
TokenStream - A TokenStream enumerates the sequence of tokens, either from fields of a document or from query text.

Token - A Token is an occurrence of a term from the text of a field. It consists of a term's text, the start and end offset of the term in the text of the field, and a type string. The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC (KeyWord In Context) display, etc. The type is an interned string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example, an end of sentence marker token might be implemented with type "eos". The default token type is "word".
In short, a TokenStream is a way to provide access to a stream of Tokens, and a Token simply holds some information, such as the text representing a Term, the Term's field, and the offset of where the Token is located in the original text.
How does an Analyzer produce a TokenStream? TokenStream is a public abstract base class. There are two other abstract classes that inherit from the TokenStream class and take the TokenStream a little bit further: the Tokenizer and TokenFilter classes. As I mentioned, both of these classes are implementations of a TokenStream, and both are also abstract. The Tokenizer and TokenFilter classes have different jobs in the analysis of the text; these two classes were created to separate the responsibilities of what an Analyzer does.
A Tokenizer acts as the word breaker; its main job is to process the TextReader and break the text into Token objects. On the other side is the TokenFilter class. When you create an implementation of a TokenFilter, it expects an inner TokenStream as a parameter of its constructor; in effect, one TokenStream is wrapped around another TokenStream. When you ask a TokenFilter for the next token, it will call its input TokenStream to get the next Token, then evaluate that Token and perform different operations with the information. Some examples of what a TokenFilter can do: convert all the text to lowercase, or decide that a Token is not important and discard it so it never gets indexed or seen. The TokenFilter is more about filtering the content of a TokenStream. Most Analyzers usually just create a single Tokenizer object to break the words and then use one or more TokenFilters to filter the results of that Tokenizer.
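To make the Tokenizer/TokenFilter split concrete, here is a sketch of a custom Analyzer that chains a LetterTokenizer into a LowerCaseFilter and then a StopFilter. It assumes the Lucene.Net 2.x API, and the stop-word list here is just an example.

```csharp
using System.IO;
using Lucene.Net.Analysis;

public class ChainedAnalyzer : Analyzer
{
    // Example stop-word list; a real one would be longer.
    private static readonly string[] StopWords = { "a", "and", "the" };

    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Tokenizer first: break the raw text into Tokens...
        TokenStream stream = new LetterTokenizer(reader);
        // ...then TokenFilters: each filter wraps the stream before it.
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(stream, StopWords);
        return stream;
    }
}
```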
Implementations of a Tokenizer
As I mentioned earlier, the Tokenizer class is an abstract subclass of TokenStream. Lucene.Net provides a few implementations of a Tokenizer that it uses in some of the Analyzers. Here are a couple of them with a small description of each.
KeywordTokenizer - This Tokenizer will read the entire stream of text and return the whole thing as a single Token.
CharTokenizer - This is an abstract base Tokenizer that is implemented by two other Tokenizers; it is a good starting point for creating your own Tokenizer. It has a single method that you must implement, IsTokenChar(char c), which returns a Boolean indicating whether the character belongs to a Token. Once you hit a character that isn't a Token character, the method returns false, and the Tokenizer then creates the Token based on where the characters were split.
WhitespaceTokenizer - This Tokenizer inherits from CharTokenizer and breaks the words according to white space.
LetterTokenizer - This Tokenizer inherits from CharTokenizer and breaks the words according to letters; the moment it hits a number, a symbol, or white space, it ignores those characters and only creates a Term from the letters.
LowerCaseTokenizer - The LowerCaseTokenizer inherits from LetterTokenizer and just performs the extra step of converting the results to lowercase.
StandardTokenizer - This Tokenizer is a pretty good choice for most European languages. It is a grammar-based Tokenizer that will recognize email addresses, host names, and acronyms.
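As a sketch of extending CharTokenizer, here is a hypothetical tokenizer that keeps runs of letters and digits together; the class name is made up, and the exact override modifiers on IsTokenChar may differ slightly between Lucene.Net versions.

```csharp
using System.IO;
using Lucene.Net.Analysis;

// Hypothetical example: tokens are runs of letters or digits, so
// "abc123 x!y" would come out as [abc123] [x] [y].
public class AlphanumericTokenizer : CharTokenizer
{
    public AlphanumericTokenizer(TextReader input) : base(input) { }

    protected internal override bool IsTokenChar(char c)
    {
        // Keep the character inside the current Token if it is
        // a letter or a digit; anything else ends the Token.
        return char.IsLetterOrDigit(c);
    }
}
```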
Implementations of a TokenFilter
As I mentioned earlier, the TokenFilter class is an abstract subclass of TokenStream. Lucene.Net provides a few implementations of a TokenFilter that it uses in some of the Analyzers. Here are a couple of them with a small description of each.
LowerCaseFilter - The LowerCaseFilter will take the incoming Tokens and convert all the letters to lower case. Useful for case insensitive indexing and searching.
StopFilter - The StopFilter will filter out 'stop words'. Stop words are common words that should be ignored, such as 'a', 'and', 'the', 'but', 'so', 'also', etc. A constructor of the StopFilter requires you to pass in a string array that contains the list of stop words you want to define.
LengthFilter - The LengthFilter is useful if you want to prevent words that are too long or too short from being returned.
StandardFilter - The StandardFilter is used to normalize the results from the StandardTokenizer.
ISOLatin1AccentFilter - The ISOLatin1AccentFilter is a filter that replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) with their unaccented equivalents. The case will not be altered. For instance, 'à' will be replaced by 'a'.
PorterStemFilter - The PorterStemFilter transforms the token stream as per the Porter stemming algorithm. Note: the input to the stemming filter must already be in lower case, so you will need a LowerCaseFilter or LowerCaseTokenizer farther down the Tokenizer chain in order for this to work properly. To use this filter with other analyzers, you'll want to write an Analyzer class that sets up the TokenStream chain as you want it.
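For example, to use the PorterStemFilter on top of a LowerCaseTokenizer, an Analyzer could be sketched like this (mirroring the example in Lucene's own documentation, assuming the Lucene.Net 2.x API):

```csharp
using System.IO;
using Lucene.Net.Analysis;

public class PorterStemAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // LowerCaseTokenizer guarantees lowercase input,
        // which the Porter stemmer requires.
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}
```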
Here is a cheap hierarchy graph of some of these classes, starting with TokenStream:

TokenStream
├── Tokenizer
│   ├── KeywordTokenizer
│   ├── CharTokenizer
│   │   ├── WhitespaceTokenizer
│   │   └── LetterTokenizer
│   │       └── LowerCaseTokenizer
│   └── StandardTokenizer
└── TokenFilter
    ├── LowerCaseFilter
    ├── StopFilter
    ├── LengthFilter
    ├── StandardFilter
    ├── ISOLatin1AccentFilter
    └── PorterStemFilter
Points of Interest
I think this is a very interesting topic, because it greatly affects how your users will search and what kind of results they will get back. I wanted to create a custom analyzer for this article, but I felt this one was more about how analyzers work. And it gives me an excuse to write another article.
I hope this was interesting to you and that I was as clear as can be. Again, if you're not understanding what I'm trying to communicate, just point it out to me. Thanks!
1/2/2009 - Initial article released