Decompounding, Solr and German Language

10 Oct 2018 (CPOL)
This article explains how to customize decompounding in Solr for languages such as German to get precise results.

Introduction

Decompounding in Solr often doesn’t work as expected out of the box, especially for German. This article goes step by step from the default configuration to a custom one that gets close to the results we want.

Basic Setting

So, let’s start with the basic configuration of decompounding in Solr.

Lucene provides “DictionaryCompoundWordTokenFilter”. This filter decompounds a compound word into tokens, based on a dictionary that we have to provide. It also offers configuration parameters such as minWordSize, minSubwordSize, and maxSubwordSize to make it more precise and to tune it to our application data.

The setup is quite simple: we enable the filter in the analyzer chain of the field type. For example, enable it for your field type in schema.xml:
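
To make the idea concrete, here is a minimal, stdlib-only sketch of dictionary-based decompounding. It is not the Lucene implementation; the class name, method name, and the tiny dictionary are made up for illustration, and the three size parameters are assumed to behave like the filter parameters named above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustration only: a simplified stand-in for a dictionary-based decompounder. */
final class DictionaryScanSketch {

    /**
     * Returns every dictionary word found inside the given word, trying each start
     * position and each length from minSubwordSize to maxSubwordSize.
     * Words shorter than minWordSize are not decomposed at all.
     */
    static List<String> subwords(String word, Set<String> dictionary,
                                 int minWordSize, int minSubwordSize, int maxSubwordSize) {
        List<String> found = new ArrayList<>();
        String lower = word.toLowerCase(Locale.ROOT);
        if (lower.length() < minWordSize) {
            return found;
        }
        for (int start = 0; start + minSubwordSize <= lower.length(); start++) {
            for (int len = minSubwordSize; len <= maxSubwordSize; len++) {
                if (start + len > lower.length()) {
                    break; // candidate would run past the end of the word
                }
                String candidate = lower.substring(start, start + len);
                if (dictionary.contains(candidate)) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical two-word dictionary, standing in for german.txt.
        Set<String> dictionary = new HashSet<>(Arrays.asList("rot", "wein"));
        System.out.println(subwords("Rotwein", dictionary, 5, 2, 15)); // [rot, wein]
        System.out.println(subwords("Milch", dictionary, 5, 2, 15));   // []
    }
}
```

Note that the scan reports every dictionary hit it finds, so the quality of the result depends directly on what the dictionary contains.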

XML
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="german.txt"/>

german.txt is the dictionary file and should be present in the config folder.

To check that decompounding is working, reload the core from the Solr admin UI and go to Analysis. Type any compound word, select the field type for which the filter is enabled in schema.xml, and analyze. You should see the input token decompounded into smaller tokens, with the input token itself also preserved.

To get more details about this filter, please see the official Solr wiki.

Problems with Basic Setting

The basic problem with the above setting is that the filter Lucene provides doesn’t work as we expect. If the dictionary is very generic, it decompounds a word into too many tokens: a word like “Rotwein” is broken into “rot”, “wein”, and “ein”, where “ein” is something we don’t want. To solve this problem, I have written a custom filter that breaks compound words only into the best subtokens. So in the case of Rotwein, the tokens would be “rot” and “wein”.
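
The custom filter itself is not shown in this article. As a rough illustration of what “best subtokens” could mean, the following stdlib-only sketch picks the split that covers the whole word with the fewest dictionary words. The criterion, the class name, and the method name are assumptions made for this example, not the actual filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustration only: one possible "best split" criterion (fewest parts, full cover). */
final class BestSplitSketch {

    /**
     * Splits the word into the fewest dictionary words that cover it completely.
     * Returns the lowercased word itself when no such cover exists.
     */
    static List<String> bestSplit(String word, Set<String> dictionary) {
        String lower = word.toLowerCase(Locale.ROOT);
        int n = lower.length();
        int[] parts = new int[n + 1]; // parts[i]: fewest words covering the first i characters
        int[] prev = new int[n + 1];  // prev[i]: start index of the last word in that cover
        Arrays.fill(parts, Integer.MAX_VALUE);
        parts[0] = 0;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (parts[start] != Integer.MAX_VALUE
                        && parts[start] + 1 < parts[end]
                        && dictionary.contains(lower.substring(start, end))) {
                    parts[end] = parts[start] + 1;
                    prev[end] = start;
                }
            }
        }
        List<String> result = new ArrayList<>();
        if (parts[n] == Integer.MAX_VALUE) {
            result.add(lower); // not a compound of known words: keep it whole
            return result;
        }
        for (int end = n; end > 0; end = prev[end]) {
            result.add(0, lower.substring(prev[end], end)); // walk the cover back to front
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("rot", "wein", "ein"));
        System.out.println(bestSplit("Rotwein", dictionary)); // [rot, wein]
        System.out.println(bestSplit("Milch", dictionary));   // [milch]
    }
}
```

Because “ein” cannot be part of any split that covers all of “rotwein”, it never appears in the result. A word that is itself in the dictionary stays whole under this criterion, which may or may not be what a real filter should do.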

So far, so good. But let’s see what happens when we index with this setting and query for the word rotwein. Consider that we have the same analyzer chain for both query and indexing.

Let’s say we have 4 documents with the name field as follows:

  1. rotwein => rotwein, rot, wein
  2. rot wein => rot, wein
  3. rot => rot
  4. wein => wein

The results we want when we search for rotwein are docs 1 and 2. But this cannot be achieved with what we have done so far.

If we use q.op as OR

The query becomes rotwein OR rot OR wein, which returns docs 3 and 4 in addition to 1 and 2.

If we use q.op as AND

The query becomes rotwein AND rot AND wein, which returns only doc 1.

To achieve the expected results, we need to change the query so that it contains only rot AND wein, which means we need to remove the original token after the decompounding filter.
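
The three cases can be checked with a small stdlib-only simulation. It models each document as the set of tokens listed above and each query as a list of terms; this is a toy model of boolean matching, not how Solr evaluates queries, and all names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustration only: toy AND/OR matching over the four example documents. */
final class QueryOperatorSketch {

    /** Returns the 1-based ids of the documents that match the query terms. */
    static List<Integer> search(List<Set<String>> docs, List<String> queryTerms, boolean useAnd) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            Set<String> tokens = docs.get(i);
            boolean match = useAnd
                    ? tokens.containsAll(queryTerms)                  // every term required
                    : queryTerms.stream().anyMatch(tokens::contains); // any term suffices
            if (match) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    /** The indexed tokens of docs 1 to 4 from the list above. */
    static List<Set<String>> exampleDocs() {
        List<Set<String>> docs = new ArrayList<>();
        docs.add(new HashSet<>(Arrays.asList("rotwein", "rot", "wein")));
        docs.add(new HashSet<>(Arrays.asList("rot", "wein")));
        docs.add(new HashSet<>(Arrays.asList("rot")));
        docs.add(new HashSet<>(Arrays.asList("wein")));
        return docs;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = exampleDocs();
        List<String> withOriginal = Arrays.asList("rotwein", "rot", "wein");
        List<String> withoutOriginal = Arrays.asList("rot", "wein");
        System.out.println(search(docs, withOriginal, false));   // [1, 2, 3, 4]
        System.out.println(search(docs, withOriginal, true));    // [1]
        System.out.println(search(docs, withoutOriginal, true)); // [1, 2]
    }
}
```

Only the last combination, AND without the original token, returns exactly docs 1 and 2.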

Custom Filter to Remove Original Token

This filter removes the original token and keeps only the decompounded tokens. When the original token is not a compound, the filter must not remove it.

For example:

  • Rotwein => rot, wein
  • Milch => milch

This is just a filter class; make sure to write a factory for it.

Java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.FlagsAttribute;

// Drops tokens that the decompounding filter flagged as the original compound.
public final class RemoveOriginalFilter extends TokenFilter {
    // Must match the flag value set in the modified CompoundWordTokenFilterBase.
    private static final int FLAG = 1;

    private final FlagsAttribute flagsAtt = addAttribute(FlagsAttribute.class);

    public RemoveOriginalFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Skip flagged originals; pass every other token through unchanged.
        while (input.incrementToken()) {
            if (flagsAtt.getFlags() != FLAG) {
                return true;
            }
        }
        return false;
    }
}

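Make sure the above flag is set in CompoundWordTokenFilterBase. This requires a modified copy of that class which declares a FlagsAttribute (flagsAtt) and the same FLAG constant, and sets the flag in incrementToken():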

Java
@Override
public final boolean incrementToken() throws IOException {
    if (!tokens.isEmpty()) {
        assert current != null;
        CompoundToken token = tokens.removeFirst();
        restoreState(current); // keep all other attributes untouched

        termAtt.setEmpty().append(token.txt);
        offsetAtt.setOffset(token.startOffset, token.endOffset);
        posIncAtt.setPositionIncrement(1);
        return true;
    }

    current = null; // not really needed, but for safety
    if (input.incrementToken()) {
        // Only words longer than minWordSize get processed
        if (termAtt.length() >= this.minWordSize) {
            decompose();
            // only capture the state if we really need it for producing new tokens
            if (!tokens.isEmpty()) {
                // capture first, so the subword tokens restored from this state stay unflagged
                current = captureState();
                // mark the original token so RemoveOriginalFilter can drop it
                flagsAtt.setFlags(FLAG);
            }
        }
        // return original token:
        return true;
    } else {
        return false;
    }
}
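
The contract between the two pieces, flag the original only when subwords exist and then drop whatever is flagged, can be sketched without Lucene. The Token class, the method names, and the hand-supplied subword lists below are stand-ins invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustration only: flag-and-remove logic on a plain list of tokens. */
final class RemoveOriginalSketch {
    static final int COMPOUND_FLAG = 1;

    /** Minimal stand-in for a token: its text plus an integer flag. */
    static final class Token {
        final String text;
        final int flags;

        Token(String text, int flags) {
            this.text = text;
            this.flags = flags;
        }
    }

    /** Emits the original first (flagged only if subwords exist), then the unflagged subwords. */
    static List<Token> decompound(String word, List<String> subwords) {
        List<Token> out = new ArrayList<>();
        out.add(new Token(word, subwords.isEmpty() ? 0 : COMPOUND_FLAG));
        for (String subword : subwords) {
            out.add(new Token(subword, 0));
        }
        return out;
    }

    /** Keeps every token that does not carry the compound flag. */
    static List<String> removeOriginal(List<Token> tokens) {
        List<String> kept = new ArrayList<>();
        for (Token token : tokens) {
            if (token.flags != COMPOUND_FLAG) {
                kept.add(token.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> noSubwords = Collections.emptyList();
        System.out.println(removeOriginal(decompound("rotwein", Arrays.asList("rot", "wein")))); // [rot, wein]
        System.out.println(removeOriginal(decompound("milch", noSubwords)));                     // [milch]
    }
}
```

A word without subwords is never flagged, which is why “milch” survives the removal step.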

Final Settings

After the custom filter is in place, enable it in the query analyzer chain of the field type, after the decompounding filter.

XML
<filter class="de.custom.lucene.RemoveOriginalFilterFactory" />

Change solrconfig.xml so that the default parser is edismax and the default q.op is “AND”:

XML
<requestHandler name="/select" class="solr.SearchHandler">
    <!-- default values for query parameters can be specified, these
         will be overridden by parameters in the request
      -->
    <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="q.op">AND</str>
    </lst>
</requestHandler>

Non Decompounded Fields

With the above settings, queries are matched on the decompounded parts, so a compound written with whitespace, "rot wein", is matched perfectly. The same is not guaranteed for the compound written without a space, "rotwein", as a whole. To enable this to be matched as well, add one more field with a new field type and do not include decompounding there.

Conclusion

As we have seen, decompounding doesn't work perfectly out of the box, but with a little customization we can achieve good results.

 

 

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



Comments and Discussions

 
Question: German dictionary and filter
Member 13918395, 19-Jul-18 5:09
Where can I find a good dictionary and the filter you mentioned in your great article?

> To solve this problem, I have written a custom filter which breaks the compound words only to the best sub tokens. So in case of rotwein, tokens would be “rot” and “wein”.

I was using the dictionary from https://github.com/uschindler/german-decompounder with no filter. It contains for example the word `tun`. Decompounding the word `Erkältung` encountered the word `tun`, which results in a poor search. How to filter out `tun` or `ein` in your case?