Lucene.Net – Custom Synonym Analyzer

AndrewSmith

Jan 3, 2009

Apache

How to use Lucene.Net search to work with synonyms

What is Lucene.Net?

Lucene.Net is a high-performance Information Retrieval (IR) library, also known as a search engine library. It contains powerful APIs for creating full-text indexes and for building advanced, precise search features into your programs. Some people confuse Lucene.Net with a ready-to-use application such as a web crawler or a file search tool, but it is not an application; it is a framework library that lets you implement these difficult technologies yourself. Lucene.Net places no restrictions on what you can index and search, which gives you far more power than other full-text indexing and searching implementations: you can index anything that can be represented as text. There are also ways to get Lucene.Net to index HTML, Office documents, PDF files, and much more.

Lucene.Net is an API-for-API port of the original Lucene project, which is written in Java. Even the unit tests were ported to guarantee quality. The Lucene.Net index is also fully compatible with the Lucene index, so both libraries can be used on the same index without problems. A number of products have used Lucene and Lucene.Net to build their search features; well-known websites include Wikipedia, CNET, Monster.com, Mayo Clinic, FedEx, and many more. It is not just websites that have used Lucene, either: Lookout, a search tool for Microsoft Outlook built on Lucene.Net, made Outlook’s integrated search look painfully slow and inaccurate.

Lucene.Net is currently undergoing incubation at the Apache Software Foundation. Its source code is held in a Subversion repository and can be found here. If you need help downloading the source, you can use a free client such as TortoiseSVN or RapidSVN. The Lucene.Net project always welcomes new contributors, and remember that there are many ways to contribute to an open source project other than writing code.

How Do I Get Lucene.Net to Work with Synonyms?

The goal is to search for a word and retrieve results that contain words with the same meaning as the one you searched for. This lets you search by meaning rather than by exact keyword.

We can easily get Lucene.Net to work with synonyms by creating a custom Analyzer class. The Analyzer will be able to inject the synonyms into the full text index. For some details on the internals of an Analyzer, please see my previous article Lucene.Net – Text Analysis.

Creating the Analyzer

The first thing we want to do is abstract the work of looking up synonyms, so we will create a simple interface for it.

    public interface ISynonymEngine
    {
        IEnumerable<string> GetSynonyms(string word);
    }

Great, now let’s work on an implementation of the synonym engine.

    using System.Collections.Generic;
    using System.Collections.ObjectModel;
    using System.Xml;

    public class XmlSynonymEngine : ISynonymEngine
    {
        //this will contain a list, of lists of words that go together
        private List<ReadOnlyCollection<string>> SynonymGroups =
            new List<ReadOnlyCollection<string>>();

        public XmlSynonymEngine(string xmlSynonymFilePath)
        {
            // create an XML document object, and load it from the specified file.
            XmlDocument Doc = new XmlDocument();
            Doc.Load(xmlSynonymFilePath);

            // get all the <group> nodes
            var groupNodes = Doc.SelectNodes("/synonyms/group");

            //enumerate groups
            foreach (XmlNode g in groupNodes)
            {
                //get all the <syn> elements from the group node.
                XmlNodeList synNodes = g.SelectNodes("child::syn");

                //create a list that will hold the items for this group
                List<string> synonymGroupList = new List<string>();

                //enumerate only the <syn> elements (not every child node
                //of the group) and add their text to the list
                foreach (XmlNode synNode in synNodes)
                {
                    synonymGroupList.Add(synNode.InnerText.Trim());
                }

                //add this synonym group to the list of synonym groups.
                SynonymGroups.Add(new ReadOnlyCollection<string>(synonymGroupList));
            }

            // clear the XML document
            Doc = null;
        }

        #region ISynonymEngine Members

        public IEnumerable<string> GetSynonyms(string word)
        {
            //enumerate all the synonym groups
            foreach (var synonymGroup in SynonymGroups)
            {
                //if the word is a part of the group, return
                //the group as the results.
                if (synonymGroup.Contains(word))
                {
                    //the group is a ReadOnlyCollection, so callers cannot modify it
                    return synonymGroup;
                }
            }

            return null;
        }

        #endregion
    }

Now let's look at a sample document that our XmlSynonymEngine will read:

<?xml version="1.0" encoding="utf-8" ?>
<synonyms>
  <group>
    <syn>fast</syn>
    <syn>quick</syn>
    <syn>rapid</syn>
  </group>

  <group>
    <syn>slow</syn>
    <syn>decrease</syn>
  </group>

  <group>
    <syn>google</syn>
    <syn>search</syn>
  </group>

  <group>
    <syn>check</syn>
    <syn>lookup</syn>
    <syn>look</syn>
  </group>
  
</synonyms>
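
To make the lookup behaviour concrete, here is a minimal, self-contained sketch that parses the same `<synonyms>/<group>/<syn>` layout from a string and looks words up through a dictionary instead of scanning every group. This is not the article's XmlSynonymEngine: the names SynonymLookupDemo and BuildIndex are hypothetical, and the choice that the first group wins when a word appears in more than one group is an assumption made to mirror the first-match behaviour of GetSynonyms.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SynonymLookupDemo
{
    // Hypothetical helper: maps every word to the full group it belongs to.
    public static Dictionary<string, List<string>> BuildIndex(string xml)
    {
        var index = new Dictionary<string, List<string>>();
        foreach (XElement groupElement in XDocument.Parse(xml).Root.Elements("group"))
        {
            List<string> words = groupElement.Elements("syn")
                                             .Select(s => s.Value.Trim())
                                             .ToList();
            foreach (string word in words)
            {
                // Assumption: the first group containing a word wins.
                if (!index.ContainsKey(word))
                    index[word] = words;
            }
        }
        return index;
    }

    public static void Main()
    {
        const string xml = @"<synonyms>
  <group><syn>fast</syn><syn>quick</syn><syn>rapid</syn></group>
  <group><syn>slow</syn><syn>decrease</syn></group>
</synonyms>";

        Dictionary<string, List<string>> index = BuildIndex(xml);

        List<string> words;
        if (index.TryGetValue("quick", out words))
            Console.WriteLine(string.Join(", ", words)); // fast, quick, rapid

        Console.WriteLine(index.ContainsKey("lucene"));  // False
    }
}
```

Like the engine above, the lookup is case-sensitive, so the words in the file should be lower case to match the lower-cased tokens the analyzer produces.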

When creating an analyzer that adds a new capability to Lucene, it’s best to put the logic in a Tokenizer or TokenFilter class rather than in the Analyzer class itself. Injecting synonyms is a TokenFilter job, so I will create a SynonymFilter class that derives from TokenFilter. This implementation only needs to override one method of the TokenFilter base class: the Next() method, which returns the next token. Here is the implementation of the SynonymFilter class:

    using System;
    using System.Collections.Generic;
    using Lucene.Net.Analysis;

    public class SynonymFilter : TokenFilter
    {
        private Queue<Token> synonymTokenQueue
            = new Queue<Token>();

        public ISynonymEngine SynonymEngine { get; private set; }

        public SynonymFilter(TokenStream input, ISynonymEngine synonymEngine)
            : base(input)
        {
            if (synonymEngine == null)
                throw new ArgumentNullException("synonymEngine");

            SynonymEngine = synonymEngine;
        }

        public override Token Next()
        {
            // if our synonymTokens queue contains any tokens, return the next one.
            if (synonymTokenQueue.Count > 0)
            {
                return synonymTokenQueue.Dequeue();
            }

            //get the next token from the input stream
            Token t = input.Next();

            //if the token is null, then it is the end of stream, so return null
            if (t == null)
                return null;

            //retrieve the synonyms
            IEnumerable<string> synonyms = SynonymEngine.GetSynonyms(t.TermText());
            
            //if we don't have any synonyms just return the token
            if (synonyms == null)
            {
                return t;
            }

            //if we do have synonyms, add them to the synonymQueue, 
            // and then return the original token
            foreach (string syn in synonyms)
            {
                //make sure we don't add the same word 
                if ( ! t.TermText().Equals(syn))
                {
                    //create the synonymToken
                    Token synToken = new Token(syn, t.StartOffset(), 
                              t.EndOffset(), "<SYNONYM>");
                    
                    // set the position increment to zero
                    // this tells lucene the synonym is 
                    // in the exact same location as the originating word
                    synToken.SetPositionIncrement(0);

                    //add the synToken to the synonyms queue
                    synonymTokenQueue.Enqueue(synToken);
                }
            }

            //after adding the syn to the queue, return the original token
            return t;
        }
    }
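
To see what a position increment of zero means in practice, here is a small sketch that does not use Lucene.Net at all. SimpleToken and InjectSynonyms are hypothetical stand-ins for Token and SynonymFilter, and the increment of 1 for ordinary tokens is an assumption about the default; the point is only to show how the injected synonyms end up at the same position as the original word.

```csharp
using System;
using System.Collections.Generic;

public static class PositionIncrementDemo
{
    // Hypothetical stand-in for a Lucene token: a term plus its position increment.
    public sealed class SimpleToken
    {
        public string Term;
        public int PositionIncrement;
    }

    // Emits each original term with increment 1, followed by its synonyms
    // with increment 0, which is the same ordering SynonymFilter produces.
    public static IEnumerable<SimpleToken> InjectSynonyms(
        IEnumerable<string> terms, IDictionary<string, string[]> synonyms)
    {
        foreach (string term in terms)
        {
            yield return new SimpleToken { Term = term, PositionIncrement = 1 };

            string[] words;
            if (!synonyms.TryGetValue(term, out words))
                continue;

            foreach (string syn in words)
            {
                // Skip the word itself so it is not emitted twice.
                if (syn != term)
                    yield return new SimpleToken { Term = syn, PositionIncrement = 0 };
            }
        }
    }

    public static void Main()
    {
        var synonyms = new Dictionary<string, string[]>
        {
            { "quick", new[] { "fast", "quick", "rapid" } }
        };

        int position = -1;
        foreach (SimpleToken t in InjectSynonyms(new[] { "quick", "fox" }, synonyms))
        {
            // The absolute position is the running sum of the increments.
            position += t.PositionIncrement;
            Console.WriteLine(position + ": " + t.Term);
        }
        // Prints:
        // 0: quick
        // 0: fast
        // 0: rapid
        // 1: fox
    }
}
```

Because "fast" and "rapid" share position 0 with "quick", a phrase search such as "rapid fox" can match text that originally said "quick fox".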

And finally the SynonymAnalyzer:

    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;

    public class SynonymAnalyzer : Analyzer
    {
        public ISynonymEngine SynonymEngine { get; private set; }

        public SynonymAnalyzer(ISynonymEngine engine)
        {
            SynonymEngine = engine;
        }

        public override TokenStream TokenStream(
            string fieldName, System.IO.TextReader reader)
        {
            //create the tokenizer
            TokenStream result = new StandardTokenizer(reader);

            //add in filters
            // first normalize the StandardTokenizer
            result = new StandardFilter(result); 

            // makes sure everything is lower case
            result = new LowerCaseFilter(result);

            // use the default list of Stop Words, provided by the StopAnalyzer class.
            result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS); 

            // injects the synonyms. 
            result = new SynonymFilter(result, SynonymEngine); 

            //return the built token stream.
            return result;
        }
    }
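
If you want to check the analyzer without a viewer application, one option is to dump its token stream to the console. The sketch below assumes the same generation of the Lucene.Net API used in this article (Token-returning Next(), TermText()) and that the article's ISynonymEngine and SynonymAnalyzer are in scope. SingleGroupSynonymEngine and TokenDump are hypothetical names, introduced only so that no XML file is needed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;

// Hypothetical in-memory engine with a single synonym group.
public class SingleGroupSynonymEngine : ISynonymEngine
{
    private readonly List<string> words =
        new List<string> { "fast", "quick", "rapid" };

    public IEnumerable<string> GetSynonyms(string word)
    {
        // Same contract as XmlSynonymEngine: null means "no synonyms".
        return words.Contains(word) ? words : null;
    }
}

public static class TokenDump
{
    public static void Main()
    {
        Analyzer analyzer = new SynonymAnalyzer(new SingleGroupSynonymEngine());
        TokenStream stream = analyzer.TokenStream(
            "content", new StringReader("The quick brown fox"));

        // Next() returns null at the end of the stream, as in SynonymFilter.
        for (Token t = stream.Next(); t != null; t = stream.Next())
        {
            Console.WriteLine("{0} (increment {1})",
                t.TermText(), t.GetPositionIncrement());
        }
    }
}
```

The injected synonyms should appear directly after the word they belong to, each with a position increment of zero.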

Now let's see the results:

Figure: Analyzer Viewer, looking at the tokens using the StandardAnalyzer

Figure: Analyzer Viewer, looking at the tokens using the SynonymAnalyzer

Points of Interest

The SynonymAnalyzer works well for indexing, but I think it may clutter a query if you also pass it to a QueryParser, because every query term that has synonyms would be expanded as well. One way around this is to give the SynonymFilter and SynonymAnalyzer a bool switch that turns synonym injection on and off. That way you can turn injection off while the analyzer is being used with a QueryParser.
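
A minimal sketch of that switch might look like the following. It is one possible shape rather than a tested implementation: SwitchableSynonymAnalyzer and InjectSynonyms are hypothetical names, and instead of adding a flag to SynonymFilter it simply leaves the filter out of the chain when injection is off. It reuses the article's ISynonymEngine and SynonymFilter and the same Lucene.Net calls as SynonymAnalyzer.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

// Hypothetical variant of SynonymAnalyzer with an on/off switch.
public class SwitchableSynonymAnalyzer : Analyzer
{
    public ISynonymEngine SynonymEngine { get; private set; }

    // true while indexing; false when the analyzer is used to parse queries.
    public bool InjectSynonyms { get; set; }

    public SwitchableSynonymAnalyzer(ISynonymEngine engine, bool injectSynonyms)
    {
        if (engine == null)
            throw new ArgumentNullException("engine");

        SynonymEngine = engine;
        InjectSynonyms = injectSynonyms;
    }

    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Same chain as SynonymAnalyzer: tokenize, normalize,
        // lower-case, and remove stop words.
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS);

        // Only add the synonym filter when injection is switched on.
        if (InjectSynonyms)
            result = new SynonymFilter(result, SynonymEngine);

        return result;
    }
}
```

With this shape you would index with an instance created with injectSynonyms set to true and hand the QueryParser an instance created with false, so both sides tokenize text identically apart from the synonyms.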

The attached code includes the Analyzer Viewer application from my last article, updated to include our brand new synonym analyzer.

History

1/2/2009 - Initial release