Lucene.Net – Custom Synonym Analyzer






How to use Lucene.Net search to work with synonyms
What is Lucene.Net?
Lucene.Net is a high-performance Information Retrieval (IR) library, also known as a search engine library. It provides powerful APIs for creating full-text indexes and building advanced, precise search features into your programs. Some people confuse Lucene.Net with a ready-to-use application such as a web crawler or a file search tool, but it is not an application; it is a framework library that lets you implement these difficult technologies yourself. Lucene.Net places no restrictions on what you can index and search, which gives you much more power than other full-text indexing and searching implementations: you can index anything that can be represented as text. There are also ways to get Lucene.Net to index HTML, Office documents, PDF files, and much more.
Lucene.Net is an API-for-API port of the original Lucene project, which is written in Java. Even the unit tests were ported to guarantee quality. The Lucene.Net index format is also fully compatible with the Lucene index format, so both libraries can be used on the same index without problems. A number of products have used Lucene and Lucene.Net to build their search features; well-known websites include Wikipedia, CNET, Monster.com, Mayo Clinic, FedEx, and many more. It is not just websites, though: Lookout, a search tool for Microsoft Outlook built on Lucene.Net, made Outlook's integrated search look painfully slow and inaccurate.
Lucene.Net is currently undergoing incubation at the Apache Software Foundation. Its source code is held in a Subversion repository and can be found here. If you need help downloading the source, you can use the free TortoiseSVN or RapidSVN clients. The Lucene.Net project always welcomes new contributors, and remember that there are many ways to contribute to an open source project other than writing code.
How Do I Get Lucene.Net to Work with Synonyms?
The goal is to search for a word and retrieve results that contain not only that word but also other words with the same meaning. This lets you search by meaning rather than by exact keyword.
We can easily get Lucene.Net to work with synonyms by creating a custom Analyzer class. The Analyzer will inject the synonyms into the full-text index. For details on the internals of an Analyzer, please see my previous article, Lucene.Net – Text Analysis.
Creating the Analyzer
The first thing we want to do is abstract the work of getting the synonyms, so we will create a simple interface for it.
public interface ISynonymEngine
{
    IEnumerable<string> GetSynonyms(string word);
}
Great, now let’s work on an implementation of the synonym engine.
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Xml;

public class XmlSynonymEngine : ISynonymEngine
{
    //this will contain a list of lists of words that go together
    private List<ReadOnlyCollection<string>> SynonymGroups =
        new List<ReadOnlyCollection<string>>();

    public XmlSynonymEngine(string xmlSynonymFilePath)
    {
        // create an XML document object, and load it from the specified file.
        XmlDocument Doc = new XmlDocument();
        Doc.Load(xmlSynonymFilePath);

        // get all the <group> nodes
        var groupNodes = Doc.SelectNodes("/synonyms/group");

        //enumerate groups
        foreach (XmlNode g in groupNodes)
        {
            //get all the <syn> elements from the group node.
            XmlNodeList synNodes = g.SelectNodes("child::syn");

            //create a list that will hold the items for this group
            List<string> synonymGroupList = new List<string>();

            //enumerate the <syn> elements and add each word to the list
            foreach (XmlNode synNode in synNodes)
            {
                synonymGroupList.Add(synNode.InnerText.Trim());
            }

            //add this synonym group to the list of synonym groups.
            SynonymGroups.Add(new ReadOnlyCollection<string>(synonymGroupList));
        }
    }

    #region ISynonymEngine Members
    public IEnumerable<string> GetSynonyms(string word)
    {
        //enumerate all the synonym groups
        foreach (var synonymGroup in SynonymGroups)
        {
            //if the word is a part of the group return
            //the group as the results.
            if (synonymGroup.Contains(word))
            {
                //the group is a read-only collection, so callers cannot modify it
                return synonymGroup;
            }
        }

        //no group contains the word
        return null;
    }
    #endregion
}
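Because everything else only depends on ISynonymEngine, the XML file is just one possible source of synonyms. As a minimal sketch, here is a hypothetical in-memory engine (InMemorySynonymEngine is my own name for illustration and is not part of the attached code) that can stand in for the XML version in unit tests. It assumes the same contract as XmlSynonymEngine: the first group containing the word is returned, and null means no synonyms were found.

```csharp
using System;
using System.Collections.Generic;

// Same interface as in the article, repeated so the sketch compiles on its own.
public interface ISynonymEngine
{
    IEnumerable<string> GetSynonyms(string word);
}

// Hypothetical in-memory engine; an illustration, not part of the article's download.
public class InMemorySynonymEngine : ISynonymEngine
{
    // Maps every known word to the (read-only) group it belongs to.
    private readonly Dictionary<string, IList<string>> groupsByWord =
        new Dictionary<string, IList<string>>();

    public InMemorySynonymEngine(params string[][] groups)
    {
        foreach (string[] group in groups)
        {
            // Copy the array so later changes by the caller do not leak in.
            IList<string> readOnlyGroup = Array.AsReadOnly((string[])group.Clone());
            foreach (string word in group)
            {
                // A word listed in several groups keeps its first group,
                // mirroring the first-match lookup of XmlSynonymEngine.
                if (!groupsByWord.ContainsKey(word))
                    groupsByWord[word] = readOnlyGroup;
            }
        }
    }

    public IEnumerable<string> GetSynonyms(string word)
    {
        IList<string> group;
        // Unknown words yield null, just as XmlSynonymEngine does.
        return groupsByWord.TryGetValue(word, out group) ? group : null;
    }
}
```

An engine built with new InMemorySynonymEngine(new[] { "fast", "quick", "rapid" }) returns the whole group for "quick" and null for any word it does not know. Like the XML engine, the lookup is case sensitive.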
Now let's look at a sample document that our XmlSynonymEngine will read:
<?xml version="1.0" encoding="utf-8" ?>
<synonyms>
  <group>
    <syn>fast</syn>
    <syn>quick</syn>
    <syn>rapid</syn>
  </group>
  <group>
    <syn>slow</syn>
    <syn>decrease</syn>
  </group>
  <group>
    <syn>google</syn>
    <syn>search</syn>
  </group>
  <group>
    <syn>check</syn>
    <syn>lookup</syn>
    <syn>look</syn>
  </group>
</synonyms>
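To make the lookup behaviour concrete, the following sketch parses a shortened copy of the document above from a string and looks up a word. SynonymXmlSketch, LoadGroups, Lookup and SampleXml are hypothetical names used only for this illustration; the XPath expressions are the same ones XmlSynonymEngine uses, but the XML is loaded from a string instead of a file path so the example is self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper for illustration; the article's engine reads from a file instead.
public static class SynonymXmlSketch
{
    // Shortened copy of the sample document (two of the four groups).
    public const string SampleXml = @"<synonyms>
  <group><syn>fast</syn><syn>quick</syn><syn>rapid</syn></group>
  <group><syn>check</syn><syn>lookup</syn><syn>look</syn></group>
</synonyms>";

    // Parses the <synonyms>/<group>/<syn> structure into a list of word groups.
    public static List<List<string>> LoadGroups(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var groups = new List<List<string>>();
        foreach (XmlNode groupNode in doc.SelectNodes("/synonyms/group"))
        {
            var words = new List<string>();
            foreach (XmlNode synNode in groupNode.SelectNodes("syn"))
            {
                words.Add(synNode.InnerText.Trim());
            }
            groups.Add(words);
        }
        return groups;
    }

    // Returns the first group that contains the word, or null when none does.
    public static List<string> Lookup(List<List<string>> groups, string word)
    {
        foreach (List<string> group in groups)
        {
            if (group.Contains(word))
                return group;
        }
        return null;
    }
}
```

With this sample, looking up "quick" returns the group fast, quick, rapid, while "Quick" returns null because the comparison is case sensitive. That is worth keeping in mind when writing the XML file: in the analyzer shown later, the LowerCaseFilter runs before the synonym filter, so the entries in the file should be lower case.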
When creating an analyzer that adds a new capability to Lucene, it is best to put your logic in a Tokenizer or TokenFilter class rather than in the Analyzer class itself. Injecting synonyms is a job for a TokenFilter, so I will create a SynonymFilter class that derives from TokenFilter. This implementation only needs to override one method of the TokenFilter base class: the Next() method, which returns a token. Here is the implementation of the SynonymFilter class:
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis;

public class SynonymFilter : TokenFilter
{
    private Queue<Token> synonymTokenQueue
        = new Queue<Token>();

    public ISynonymEngine SynonymEngine { get; private set; }

    public SynonymFilter(TokenStream input, ISynonymEngine synonymEngine)
        : base(input)
    {
        if (synonymEngine == null)
            throw new ArgumentNullException("synonymEngine");

        SynonymEngine = synonymEngine;
    }

    public override Token Next()
    {
        // if our synonymTokens queue contains any tokens, return the next one.
        if (synonymTokenQueue.Count > 0)
        {
            return synonymTokenQueue.Dequeue();
        }

        //get the next token from the input stream
        Token t = input.Next();

        //if the token is null, then it is the end of stream, so return null
        if (t == null)
            return null;

        //retrieve the synonyms
        IEnumerable<string> synonyms = SynonymEngine.GetSynonyms(t.TermText());

        //if we don't have any synonyms just return the token
        if (synonyms == null)
        {
            return t;
        }

        //if we do have synonyms, add them to the synonym queue,
        // and then return the original token
        foreach (string syn in synonyms)
        {
            //make sure we don't add the same word
            if (!t.TermText().Equals(syn))
            {
                //create the synonymToken
                Token synToken = new Token(syn, t.StartOffset(),
                    t.EndOffset(), "<SYNONYM>");

                // set the position increment to zero
                // this tells lucene the synonym is
                // in the exact same location as the originating word
                synToken.SetPositionIncrement(0);

                //add the synToken to the synonyms queue
                synonymTokenQueue.Enqueue(synToken);
            }
        }

        //after adding the synonyms to the queue, return the original token
        return t;
    }
}
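The effect of the zero position increment is easier to see without Lucene in the way. The sketch below is a simplified, Lucene-free model of the filter: SketchToken, SynonymInjectionSketch, Inject and Describe are hypothetical names, and the whole token list is built in one pass instead of one token per Next() call. Under those assumptions it should produce the same order the queue-based filter emits, with each original token advancing the position by one and its synonyms stacked on the same position.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for Lucene's Token, reduced to the two fields that matter here.
public class SketchToken
{
    public string Text;
    public int PositionIncrement;

    public SketchToken(string text, int positionIncrement)
    {
        Text = text;
        PositionIncrement = positionIncrement;
    }
}

public static class SynonymInjectionSketch
{
    // Emits each original term with increment 1, followed by its synonyms
    // with increment 0, skipping the synonym that equals the term itself.
    public static List<SketchToken> Inject(
        IEnumerable<string> terms, Func<string, IEnumerable<string>> getSynonyms)
    {
        var output = new List<SketchToken>();
        foreach (string term in terms)
        {
            output.Add(new SketchToken(term, 1));

            IEnumerable<string> synonyms = getSynonyms(term);
            if (synonyms == null)
                continue; // no synonyms: the term passes through unchanged

            foreach (string syn in synonyms)
            {
                if (!term.Equals(syn))
                    output.Add(new SketchToken(syn, 0));
            }
        }
        return output;
    }

    // Renders tokens as "position:text", accumulating the increments.
    // Positions start at -1 so the first token lands on position 0.
    public static string Describe(IEnumerable<SketchToken> tokens)
    {
        var parts = new List<string>();
        int position = -1;
        foreach (SketchToken token in tokens)
        {
            position += token.PositionIncrement;
            parts.Add(position + ":" + token.Text);
        }
        return string.Join(" ", parts.ToArray());
    }
}
```

For the terms quick and fox, with a lookup that maps quick to the group fast, quick, rapid, Describe returns "0:quick 0:fast 0:rapid 1:fox". The synonyms share position 0 with the original word, which is what lets a phrase search that uses either "quick" or "fast" match the same spot in the document.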
And finally, the SynonymAnalyzer:
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

public class SynonymAnalyzer : Analyzer
{
    public ISynonymEngine SynonymEngine { get; private set; }

    public SynonymAnalyzer(ISynonymEngine engine)
    {
        SynonymEngine = engine;
    }

    public override TokenStream TokenStream
        (string fieldName, System.IO.TextReader reader)
    {
        //create the tokenizer
        TokenStream result = new StandardTokenizer(reader);

        //add in filters
        // first normalize the StandardTokenizer
        result = new StandardFilter(result);

        // makes sure everything is lower case
        result = new LowerCaseFilter(result);

        // use the default list of Stop Words, provided by the StopAnalyzer class.
        result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS);

        // injects the synonyms.
        result = new SynonymFilter(result, SynonymEngine);

        //return the built token stream.
        return result;
    }
}
Now let's see the results:
Figure: Analyzer Viewer showing the tokens produced by the StandardAnalyzer
Figure: Analyzer Viewer showing the tokens produced by the SynonymAnalyzer
Points of Interest
The SynonymAnalyzer is really great for indexing, but I think it might clutter a query if you use it with a QueryParser to construct the query, because every synonym of every search term would be added to the query as well. One way around this is to modify the SynonymFilter and SynonymAnalyzer to take a bool switch that turns the synonym injection on and off. That way, you could turn the synonym injection off while using the analyzer with a QueryParser.
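A variation on that idea, sketched here as an assumption rather than as the approach taken in the attached code, is to leave the filter and analyzer untouched and put the switch in the synonym engine instead. Since SynonymFilter passes a token through unchanged when GetSynonyms returns null, a wrapper engine that returns null while disabled has the same effect. SwitchableSynonymEngine and FixedGroupEngine are hypothetical names introduced for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Same interface as in the article, repeated so the sketch compiles on its own.
public interface ISynonymEngine
{
    IEnumerable<string> GetSynonyms(string word);
}

// Hypothetical wrapper: forwards to the inner engine only while Enabled is true.
public class SwitchableSynonymEngine : ISynonymEngine
{
    private readonly ISynonymEngine inner;

    public bool Enabled { get; set; }

    public SwitchableSynonymEngine(ISynonymEngine inner, bool enabled)
    {
        if (inner == null)
            throw new ArgumentNullException("inner");

        this.inner = inner;
        Enabled = enabled;
    }

    public IEnumerable<string> GetSynonyms(string word)
    {
        // null means "no synonyms", so the filter leaves the token stream as it is.
        return Enabled ? inner.GetSynonyms(word) : null;
    }
}

// Tiny stub engine, used only to exercise the wrapper in this sketch.
public class FixedGroupEngine : ISynonymEngine
{
    private readonly List<string> group;

    public FixedGroupEngine(params string[] words)
    {
        group = new List<string>(words);
    }

    public IEnumerable<string> GetSynonyms(string word)
    {
        return group.Contains(word) ? group : null;
    }
}
```

With this in place, you could keep one SynonymAnalyzer built on the real engine for indexing and build a second one for the QueryParser with new SynonymAnalyzer(new SwitchableSynonymEngine(engine, false)). Using two separate analyzer instances avoids flipping a shared flag while another thread is indexing.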
The attached code includes the Analyzer Viewer application from my last article, updated to include our brand new synonym analyzer.
History
- 1/2/2009 - Initial release