Lucene.Net – Custom Synonym Analyzer

10 Sep 2013 | Apache | 4 min read
How to get Lucene.Net search to work with synonyms

What is Lucene.Net?

Lucene.Net is a high-performance Information Retrieval (IR) library, also known as a search engine library. It contains powerful APIs for creating full-text indexes and for building advanced, precise search features into your programs. Some people confuse Lucene.Net with a ready-to-use application such as a web crawler or a file search tool, but it is not an application; it is a framework library that lets you implement these difficult technologies yourself. Lucene.Net places no restrictions on what you can index and search, which gives you a lot more power than other full-text indexing and searching implementations: you can index anything that can be represented as text. There are also ways to get Lucene.Net to index HTML, Office documents, PDF files, and much more.

Lucene.Net is an API-for-API port of the original Lucene project, which is written in Java. Even the unit tests were ported to guarantee quality. A Lucene.Net index is also fully compatible with a Lucene index, so both libraries can work on the same index without problems. A number of products have used Lucene and Lucene.Net to build their search; well-known websites include Wikipedia, CNET, Monster.com, Mayo Clinic, FedEx, and many more. It is not just websites, either: Lookout, a search tool for Microsoft Outlook built on Lucene.Net, made Outlook's integrated search look painfully slow and inaccurate.

Lucene.Net is currently undergoing incubation at the Apache Software Foundation. Its source code is held in a Subversion repository. If you need help downloading the source, you can use a free client such as TortoiseSVN or RapidSVN. The Lucene.Net project always welcomes new contributors, and there are many ways to contribute to an open source project other than writing code.

How Do I Get Lucene.Net to Work with Synonyms?

The goal here is to search for a word and also retrieve results containing words that have the same meaning as the word you searched for. This lets you search by meaning rather than by exact keyword.

We can easily get Lucene.Net to work with synonyms by creating a custom Analyzer class. The Analyzer will be able to inject the synonyms into the full text index. For some details on the internals of an Analyzer, please see my previous article Lucene.Net – Text Analysis.

Creating the Analyzer 

The first thing we want to do is abstract away the work of looking up synonyms, so we will create a simple interface for it.

C#
public interface ISynonymEngine
{
    IEnumerable<string> GetSynonyms(string word);
}

Great, now let’s work on an implementation of the synonym engine.

C#
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Xml;

public class XmlSynonymEngine : ISynonymEngine
{
    // a list of groups; each group is a list of words that go together
    private readonly List<ReadOnlyCollection<string>> SynonymGroups =
        new List<ReadOnlyCollection<string>>();

    public XmlSynonymEngine(string xmlSynonymFilePath)
    {
        // create an XML document object, and load it from the specified file
        XmlDocument doc = new XmlDocument();
        doc.Load(xmlSynonymFilePath);

        // get all the <group> nodes
        XmlNodeList groupNodes = doc.SelectNodes("/synonyms/group");

        // enumerate the groups
        foreach (XmlNode g in groupNodes)
        {
            // get all the <syn> elements of this group
            XmlNodeList synNodes = g.SelectNodes("child::syn");

            // create a list that will hold the words of this group
            List<string> synonymGroupList = new List<string>();

            // iterate synNodes (not g), so that comments or other
            // child nodes inside a <group> are not treated as synonyms
            foreach (XmlNode synNode in synNodes)
            {
                synonymGroupList.Add(synNode.InnerText.Trim());
            }

            // add this synonym group to the list of synonym groups
            SynonymGroups.Add(new ReadOnlyCollection<string>(synonymGroupList));
        }
    }

    #region ISynonymEngine Members

    public IEnumerable<string> GetSynonyms(string word)
    {
        // enumerate all the synonym groups
        foreach (var synonymGroup in SynonymGroups)
        {
            // if the word is part of the group, return the whole group
            if (synonymGroup.Contains(word))
            {
                // the group is a read-only collection,
                // so callers cannot modify the engine's data
                return synonymGroup;
            }
        }

        // no group contains the word
        return null;
    }

    #endregion
}

Now let's look at a sample document that our XmlSynonymEngine will read:

XML
<?xml version="1.0" encoding="utf-8" ?>
<synonyms>
  <group>
    <syn>fast</syn>
    <syn>quick</syn>
    <syn>rapid</syn>
  </group>

  <group>
    <syn>slow</syn>
    <syn>decrease</syn>
  </group>

  <group>
    <syn>google</syn>
    <syn>search</syn>
  </group>

  <group>
    <syn>check</syn>
    <syn>lookup</syn>
    <syn>look</syn>
  </group>
  
</synonyms>
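
GetSynonyms in XmlSynonymEngine scans every group on each call. For a large synonym file, one possible alternative is to build a word-to-group dictionary once, at load time. The following is only a minimal sketch of that idea and is not part of the article's download: the SynonymIndex class and its Build method are hypothetical names, and it parses the XML with LINQ to XML from a string instead of loading a file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the article's download.
public static class SynonymIndex
{
    // Maps every word to the full synonym group it belongs to.
    public static Dictionary<string, IReadOnlyList<string>> Build(string xml)
    {
        var index = new Dictionary<string, IReadOnlyList<string>>();

        // Walk each <group> element under the <synonyms> root.
        foreach (XElement group in XDocument.Parse(xml).Root.Elements("group"))
        {
            List<string> words = group.Elements("syn")
                                      .Select(s => s.Value.Trim())
                                      .Where(w => w.Length > 0)
                                      .ToList();

            foreach (string word in words)
            {
                // If a word appears in several groups, the first group wins,
                // which matches the first-match behaviour of a linear scan.
                if (!index.ContainsKey(word))
                {
                    index[word] = words;
                }
            }
        }

        return index;
    }
}
```

A lookup then becomes index.TryGetValue(word, out group). The dictionary compares keys case-sensitively, as the original Contains call does; since the SynonymAnalyzer shown later lower-cases tokens before they reach the synonym filter, the words in the XML file should be lower case too.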

When you add a new capability to Lucene through an analyzer, it is usually better to put the logic in a Tokenizer or a TokenFilter than in the Analyzer class itself. Injecting synonyms is a job for a TokenFilter, so I will create a SynonymFilter class that derives from TokenFilter. This implementation only needs to override one method of the TokenFilter base class: Next(), which returns the next token. Here is the implementation of the SynonymFilter class:

C#
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis;

public class SynonymFilter : TokenFilter
{
    // synonym tokens waiting to be returned by later calls to Next()
    private Queue<Token> synonymTokenQueue = new Queue<Token>();

    public ISynonymEngine SynonymEngine { get; private set; }

    public SynonymFilter(TokenStream input, ISynonymEngine synonymEngine)
        : base(input)
    {
        if (synonymEngine == null)
            throw new ArgumentNullException("synonymEngine");

        SynonymEngine = synonymEngine;
    }

    public override Token Next()
    {
        // if the synonym queue contains any tokens, return the next one
        if (synonymTokenQueue.Count > 0)
        {
            return synonymTokenQueue.Dequeue();
        }

        // get the next token from the input stream
        Token t = input.Next();

        // a null token means end of stream, so return null
        if (t == null)
            return null;

        // retrieve the synonyms
        IEnumerable<string> synonyms = SynonymEngine.GetSynonyms(t.TermText());

        // if we don't have any synonyms, just return the token
        if (synonyms == null)
        {
            return t;
        }

        // if we do have synonyms, add them to the synonym queue,
        // and then return the original token
        foreach (string syn in synonyms)
        {
            // make sure we don't add the original word again
            if (!t.TermText().Equals(syn))
            {
                // create the synonym token with the offsets of the original
                Token synToken = new Token(syn, t.StartOffset(),
                                           t.EndOffset(), "<SYNONYM>");

                // set the position increment to zero;
                // this tells Lucene the synonym is in the
                // exact same location as the originating word
                synToken.SetPositionIncrement(0);

                // add the synonym token to the queue
                synonymTokenQueue.Enqueue(synToken);
            }
        }

        // after queuing the synonyms, return the original token
        return t;
    }
}
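
To make the effect of a zero position increment concrete, here is a small Lucene-free sketch. SketchToken and PositionSketch are hypothetical stand-ins, not Lucene.Net types. The sketch assumes the usual Lucene convention that the position counter starts at -1 and each token advances it by its own increment, so a normal token (increment 1) moves forward while a synonym (increment 0) stays on the position of the word it was derived from.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a Lucene token: only the text and the position increment.
public sealed class SketchToken
{
    public string Text { get; private set; }
    public int PositionIncrement { get; private set; }

    public SketchToken(string text, int positionIncrement)
    {
        Text = text;
        PositionIncrement = positionIncrement;
    }
}

public static class PositionSketch
{
    // Emits every word with increment 1, followed by its synonyms with increment 0.
    public static List<SketchToken> Inject(
        IEnumerable<string> words,
        Func<string, IEnumerable<string>> getSynonyms)
    {
        var output = new List<SketchToken>();
        foreach (string word in words)
        {
            output.Add(new SketchToken(word, 1));

            IEnumerable<string> synonyms = getSynonyms(word);
            if (synonyms == null) continue;

            foreach (string syn in synonyms)
            {
                // Skip the original word, as SynonymFilter does.
                if (syn != word) output.Add(new SketchToken(syn, 0));
            }
        }
        return output;
    }

    // Turns increments into absolute positions by keeping a running sum.
    public static int[] AbsolutePositions(IList<SketchToken> tokens)
    {
        var positions = new int[tokens.Count];
        int current = -1; // the first token, with increment 1, lands on position 0
        for (int i = 0; i < tokens.Count; i++)
        {
            current += tokens[i].PositionIncrement;
            positions[i] = current;
        }
        return positions;
    }
}
```

For the words "fast" and "car", with "fast" belonging to the group fast/quick/rapid, the tokens come out as fast, quick, rapid, car at positions 0, 0, 0, 1. Because "quick" shares position 0 with "fast" and "car" sits at position 1, a phrase search for "quick car" can match text that actually said "fast car".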

And finally the SynonymAnalyzer:

C#
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

public class SynonymAnalyzer : Analyzer
{
    public ISynonymEngine SynonymEngine { get; private set; }

    public SynonymAnalyzer(ISynonymEngine engine)
    {
        SynonymEngine = engine;
    }

    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        // create the tokenizer
        TokenStream result = new StandardTokenizer(reader);

        // add in the filters
        // first normalize the output of the StandardTokenizer
        result = new StandardFilter(result);

        // make sure everything is lower case
        result = new LowerCaseFilter(result);

        // use the default list of stop words, provided by the StopAnalyzer class
        result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS);

        // inject the synonyms
        result = new SynonymFilter(result, SynonymEngine);

        // return the built token stream
        return result;
    }
}

Now let's see the results:

Analyzer Viewer, looking at the tokens using the StandardAnalyzer

[Screenshot: lucene_custom_analyzer/standardview.jpg]

Analyzer Viewer, looking at the tokens using the SynonymAnalyzer

[Screenshot: lucene_custom_analyzer/synviewjpg.jpg]

Points of Interest

The SynonymAnalyzer is really great for indexing, but I think it might clutter a query if you also pass it to a QueryParser, because every query term would be expanded with its synonyms as well. One way around this is to modify the SynonymFilter and the SynonymAnalyzer to take a bool switch that turns the synonym injection on and off. That way, you can turn the injection off while using the analyzer with a QueryParser.
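
The switch itself is simple. The sketch below shows the idea without depending on Lucene.Net: SwitchableSynonymExpander and SingleGroupEngine are hypothetical classes that work on plain strings instead of tokens, and ISynonymEngine is repeated only so that the sketch compiles on its own. In the real SynonymFilter, the same flag would be checked in Next() before the synonyms are queued.

```csharp
using System;
using System.Collections.Generic;

// Same shape as the article's interface, repeated so the sketch is self-contained.
public interface ISynonymEngine
{
    IEnumerable<string> GetSynonyms(string word);
}

// Hypothetical, Lucene-free stand-in for a SynonymFilter with an on/off switch.
public sealed class SwitchableSynonymExpander
{
    private readonly ISynonymEngine engine;

    // true while indexing, false while analyzing text for a QueryParser
    public bool InjectSynonyms { get; set; }

    public SwitchableSynonymExpander(ISynonymEngine engine, bool injectSynonyms)
    {
        if (engine == null) throw new ArgumentNullException("engine");
        this.engine = engine;
        InjectSynonyms = injectSynonyms;
    }

    public List<string> Expand(IEnumerable<string> terms)
    {
        var result = new List<string>();
        foreach (string term in terms)
        {
            result.Add(term);

            // With the switch off, terms pass through untouched.
            if (!InjectSynonyms) continue;

            IEnumerable<string> synonyms = engine.GetSynonyms(term);
            if (synonyms == null) continue;

            foreach (string syn in synonyms)
            {
                if (syn != term) result.Add(syn);
            }
        }
        return result;
    }
}

// Tiny in-memory engine with one synonym group, used only to exercise the sketch.
public sealed class SingleGroupEngine : ISynonymEngine
{
    private readonly List<string> group;

    public SingleGroupEngine(params string[] words)
    {
        group = new List<string>(words);
    }

    public IEnumerable<string> GetSynonyms(string word)
    {
        return group.Contains(word) ? group : null;
    }
}
```

With an engine that groups fast, quick and rapid, an expander created with injectSynonyms set to true turns "fast", "car" into four terms, while one created with false returns the two original terms unchanged. Using two analyzer instances, one per setting, avoids flipping a shared flag between indexing and querying.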

The attached code includes the Analyzer Viewer application from my last article, updated to include our brand new synonym analyzer.

History

  • 1/2/2009 - Initial release

License

This article, along with any associated source code and files, is licensed under The Apache License, Version 2.0


Written By
Software Developer
United States
I'm a proud father and a software developer. I'm fascinated by a few particular .Net projects such as Lucene.Net, NHibernate, Quartz.Net, and others. I love learning and studying code to learn how other people solve software problems.
