Click here to Skip to main content
Click here to Skip to main content
Go to top

Simple Fuzzy String Similarity in Java

, 17 Jan 2011
Rate this:
Please Sign up or sign in to vote.
Move beyond the String.equals() method of Java and introduce fuzzy String matching.

Introduction

Java provides an equals function in the java.lang.String class, it works fine and there is no need to mess with it. But what if you need to know if a string is "close" to another but not equals? Java does not have a built in "closeness" measurement. Close has always counted in the game of horseshoes, and it could be argued that close will count with hand grenades or nuclear warfare, but out of the box programming languages do not come with a closeness measurement as it's either equal or not equal (OK, leave out equalsIgnoreCase because upper/lower case is not the measurement discussed here); there is no way to say "it's 75% the same" with stock methods. This absence is noteworthy, especially since Java comes with toys like Regular Expressions (java.util.regex.Pattern, etc.). If you need a fuzzy string match, you're up the creek.

Thankfully, there has been a ton of development in this area and the approaches are well documented. Bridging the gap from a formula to code can be interesting. Applying a simple concept from fuzzy logic where 1.0 is true and 0.0 is false, it's possible to translate that concept to 1.0 meaning the strings are exactly equal (equality expressed as the same bigram set, explained below) and 0.0 meaning the strings have nothing in common.

One very simple way to accomplish a measurement of string closeness is to use Dice's coefficient. The formula is:

Dice's Coefficient

One additional idea is the concept of breaking a string down to bigrams. A bigram is two adjacent characters from a string. Taking the string "fubar", the set of bigrams would be {"fu", "ub", "ba", "ar"}. Likewise, "foobar" would break down into {"fo", "oo", "ob", "ba", "ar"}. With either "fubar" or "foobar", you are still up the creek without a paddle, but are they almost the same? Now that the bigrams have been created, the formula can be filled in.

A quick review of the symbols in the formula may be handy. Remember that means the cardinality of the set X. If X = {"fu", "ub", "ba", "ar"}, then |Dead | X| = 4. The symbol between is the intersection of the sets on the left and right, resulting in a set of common members in each original set. In the classic Venn diagram of two circles, the union represents the common area in both circles.

The union of the bigram set from "fubar" and "foobar" is: {"ba", "ar"}. Only two elements exist in this set, which gives 4/9 as the measurement of similarity between "fubar" and "foobar". Where did the 9 come from? There are 4 elements in the bigram set of "fubar" and 5 elements in the bigram set of "foobar".

A simple (remember, simple, not efficient or concerned about Unicode) bigram generator could be represented with a method such as:

public List<char[]> bigram(String input)
{
    ArrayList<char[]> bigram = new ArrayList<char[]>();
    for (int i = 0; i < input.length() - 1; i++)
    {
        char[] chars = new char[2];
        chars[0] = input.charAt(i);
        chars[1] = input.charAt(i+1);
        bigram.add(chars);
    }
    return bigram;
}

Given that there is a method to convert a String to a bigram set, a simple method to produce the measurement of similarity could be:

public double dice(List<char[]> bigram1, List<char[]> bigram2)
{
    List<char[]> copy = new ArrayList<char[]>(bigram2);
    int matches = 0;
    for (int I = bigram1.size(); --i >= 0;)
    {
        char[] bigram = bigram1.get(i);
        for (int j = copy.size(); --j >= 0;)
        {
            char[] toMatch = copy.get(j);
            if (bigram[0] == toMatch[0] && bigram[1] == toMatch[1])
            {
                copy.remove(j);
                matches += 2;
                break;
            }
        }
    }
    return (double) matches / (bigram1.size() + bigram2.size());
}

The above code is a very basic and straightforward implementation; the only tricks are to make a copy of one of the input bigrams to not modify what is passed in so that as matches are made, the bigram is removed from any further matching, and the increment of the local variable match by two each time a match is made instead of multiplying the local variable match by 2 at the end. The second part (increment of match) was just done in the example to see if anyone was paying attention.

Those methods above, bigram and dice, provide a suggestion to implement Dice's coefficient in Java to create a simple measurement of a fuzzy string similarity.

Points of Interest

Note that this not not a replacement for the stock equals() method on the String class.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

George Stragand
Software Developer (Senior)
United States United States
George Stragand is a software developer and manager who has had a 20+ year track record of delivering commercial software on time. In 2012, George published the Pocket Guide to Hiring Geeks, following up the 2010 publication of Agile Development & Business Goals which details his approach to agile development. George maintains his own website where he posts about the topics he is passionate about: software development, cars, guns, guitars and yes even banjos!

Comments and Discussions

 
QuestionIdea PinmemberMember 793111222-Oct-13 9:17 
GeneralMy vote of 5 PinmemberZhao Jiang4-Jun-13 4:50 
GeneralMy vote of 5 Pinmemberalxxl24-Jan-11 19:35 
GeneralLifecell Wrinkle Cream PinmemberJosepheane Brooks18-Jan-11 2:04 
GeneralMy vote of 5 Pinmemberdivetramp17-Jan-11 23:57 
GeneralRe: My vote of 5 PinmemberGeorge Stragand19-Jan-11 1:58 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Mobile
Web04 | 2.8.140916.1 | Last Updated 17 Jan 2011
Article Copyright 2011 by George Stragand
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid