11,921,354 members (61,694 online)
alternative version

25.3K views
6 bookmarked
Posted

# Simple Fuzzy String Similarity in Java

, 17 Jan 2011 CPOL
 Rate this:
Move beyond the String.equals() method of Java and introduce fuzzy String matching.

## Introduction

Java provides an equals function in the `java.lang.String` class, it works fine and there is no need to mess with it. But what if you need to know if a string is "close" to another but not equals? Java does not have a built in "closeness" measurement. Close has always counted in the game of horseshoes, and it could be argued that close will count with hand grenades or nuclear warfare, but out of the box programming languages do not come with a closeness measurement as it's either equal or not equal (OK, leave out `equalsIgnoreCase` because upper/lower case is not the measurement discussed here); there is no way to say "it's 75% the same" with stock methods. This absence is noteworthy, especially since Java comes with toys like Regular Expressions (`java.util.regex.Pattern`, etc.). If you need a fuzzy string match, you're up the creek.

Thankfully, there has been a ton of development in this area and the approaches are well documented. Bridging the gap from a formula to code can be interesting. Applying a simple concept from fuzzy logic where 1.0 is true and 0.0 is false, it's possible to translate that concept to 1.0 meaning the strings are exactly equal (equality expressed as the same bigram set, explained below) and 0.0 meaning the strings have nothing in common.

One very simple way to accomplish a measurement of string closeness is to use Dice's coefficient. The formula is:

One additional idea is the concept of breaking a string down to bigrams. A bigram is two adjacent characters from a string. Taking the string "fubar", the set of bigrams would be {"fu", "ub", "ba", "ar"}. Likewise, "foobar" would break down into {"fo", "oo", "ob", "ba", "ar"}. With either "fubar" or "foobar", you are still up the creek without a paddle, but are they almost the same? Now that the bigrams have been created, the formula can be filled in.

A quick review of the symbols in the formula may be handy. Remember that means the cardinality of the set X. If X = {"fu", "ub", "ba", "ar"}, then | = 4. The symbol between is the intersection of the sets on the left and right, resulting in a set of common members in each original set. In the classic Venn diagram of two circles, the union represents the common area in both circles.

The union of the bigram set from "fubar" and "foobar" is: {"ba", "ar"}. Only two elements exist in this set, which gives 4/9 as the measurement of similarity between "fubar" and "foobar". Where did the 9 come from? There are 4 elements in the bigram set of "fubar" and 5 elements in the bigram set of "foobar".

A simple (remember, simple, not efficient or concerned about Unicode) bigram generator could be represented with a method such as:

```public List<char[]> bigram(String input)
{
ArrayList<char[]> bigram = new ArrayList<char[]>();
for (int i = 0; i < input.length() - 1; i++)
{
char[] chars = new char[2];
chars[0] = input.charAt(i);
chars[1] = input.charAt(i+1);
}
return bigram;
}```

Given that there is a method to convert a `String` to a bigram set, a simple method to produce the measurement of similarity could be:

```public double dice(List<char[]> bigram1, List<char[]> bigram2)
{
List<char[]> copy = new ArrayList<char[]>(bigram2);
int matches = 0;
for (int I = bigram1.size(); --i >= 0;)
{
char[] bigram = bigram1.get(i);
for (int j = copy.size(); --j >= 0;)
{
char[] toMatch = copy.get(j);
if (bigram[0] == toMatch[0] && bigram[1] == toMatch[1])
{
copy.remove(j);
matches += 2;
break;
}
}
}
return (double) matches / (bigram1.size() + bigram2.size());
}```

The above code is a very basic and straightforward implementation; the only tricks are to make a copy of one of the input bigrams to not modify what is passed in so that as matches are made, the bigram is removed from any further matching, and the increment of the local variable match by two each time a match is made instead of multiplying the local variable match by 2 at the end. The second part (increment of match) was just done in the example to see if anyone was paying attention.

Those methods above, bigram and dice, provide a suggestion to implement Dice's coefficient in Java to create a simple measurement of a fuzzy string similarity.

## Points of Interest

Note that this not not a replacement for the stock `equals()` method on the `String` class.

## About the Author

 Software Developer (Senior) United States
George Stragand is a software developer and manager who has had a 20+ year track record of delivering commercial software on time. In 2012, George published the Pocket Guide to Hiring Geeks, following up the 2010 publication of Agile Development & Business Goals which details his approach to agile development. George maintains his own website where he posts about the topics he is passionate about: software development, cars, guns, guitars and yes even banjos!

## Comments and Discussions

 First Prev Next
 Idea Member 793111222-Oct-13 10:17 Member 7931112 22-Oct-13 10:17
 My vote of 5 Zhao Jiang4-Jun-13 5:50 Zhao Jiang 4-Jun-13 5:50
 My vote of 5 alxxl24-Jan-11 20:35 alxxl 24-Jan-11 20:35
 Lifecell Wrinkle Cream Josepheane Brooks18-Jan-11 3:04 Josepheane Brooks 18-Jan-11 3:04
 My vote of 5 divetramp18-Jan-11 0:57 divetramp 18-Jan-11 0:57
 Re: My vote of 5 George Stragand19-Jan-11 2:58 George Stragand 19-Jan-11 2:58
 Last Visit: 31-Dec-99 19:00     Last Update: 24-Nov-15 19:57 Refresh 1

General    News    Suggestion    Question    Bug    Answer    Joke    Praise    Rant    Admin

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.