Click here to Skip to main content
Click here to Skip to main content

Tagged as

Fuzzy String Matching with Edit Distance

, 28 Feb 2011 CPOL
Rate this:
Please Sign up or sign in to vote.
Move beyond the String.equals() method of Java and introduce fuzzy String matching

Introduction

Previously, I’ve posted about using the Dice Coefficient as a method to perform a fuzzy match on a string, producing a percentage as a result to measure the amount of similarity between two strings. The Dice Coefficient is a quick way to produce a measurement of similarity, but it does have some drawbacks. With small strings, the amount of similarity is small compared to larger strings. For example "cat" and "hat" come out with a rating of 50% using the Dice Coefficient. With longer strings, the "error" is smoothed. One benefit of the Dice Coefficient is that small runs of characters within the string (and by "small run", I really mean two characters given how the Dice Coefficient functions on bigrams) weigh into the overall rating of similarity. As good as the Dice Coefficient is, simple and fairly quick, it's not the tool for every job.

When dealing with small strings (and small is a relative measurement), the approach created by Vladimir Levenshtein in 1965 has a large benefit over the Dice Coefficient approach. Amazing to realize that the 1965 wasn't a typo; this isn’t a new idea or a new problem, but how many software developers in the 21st Century have heard of Levenshtein's algorithm? I'd guess only a small percentage! Yet this is a classic computer science problem, and one that has applications outside of spell checking.

Levenshtein's Edit Distance as a Fuzzy String Match

What Levenshtein's algorithm does is count the number of changes that must occur in one string to transform it into another string. Take the example of two strings, "George" and "Geordie"; how many characters would you have to change to transform "George" into "Geordie"? The answer is 2, here’s how:

  1. George ? Georde (substitute the 'g' with 'd')
  2. Georde ? Geordie (insert 'i' between the 'd' and 'e')

Pretty easy stuff for a human, getting a machine to calculate the answer is another problem. Tracking and measuring the inserting is the really interesting part of the problem and this is what Levenshtein's algorithm does very nicely.

The actual algorithm itself is very simple, requiring a matrix to represent the values as the calculation progresses through both strings. A simple version of the loops results in an O(n2) implementation, but since only very small strings (under 10 characters typically) existed in the problem space I was working within, this was not a huge performance drain.

One modification of the results of the algorithm that I made to produce the goal I had was to produce a percentage instead of a number. This is really only a minor tweak to the results to transform the number of edits required into a percentage. To transform the results, I subtracted from 1.0 the number of edits required divided by the length of the longest string. Taking the resulting double value and multiplying by 100 yields a percentage; this step was only done to produce a result that would be meaningful to a user. Using the example of transforming "George" into "Geordie", the result would be: 1.0 – (2 / 7) or 71 (rounded down, remember the goal I had was to produce a fuzzy result and that 71 is close enough). Another example would be transforming "Fred" into "George", almost nothing is the same and 5 edits are required so the result would be 0, 1.0 – (5/6) or 16% as very little is the same and what is isn't in the same position.

Before getting into the loop to calculate the edit distance, check to see if either string is empty. If one is, return the length of the other string. This points out one interesting note about Levelshtein's algorithm, in that the result is always bounded by the length of the longest of the two input parameter strings.

If neither string is empty, then construct the matrix for the calculation. The matrix should be the length of the first input string plus one by the length of the second input string plus one (in the "George" vs. "Geordie" example above, the matrix would measure 7 by 8). Note the "plus one", as these rows are the "seed" values to kick everything off and allow a very simple comparison later. The first column and row will not be modified later as the loops start at the column and row indexed by one, not zero.

ed1.jpg

Figure 1: Initial Matrix

While looping through the strings, the indexes of both strings are the coordinates of the matrix. Note that the string indexes and the matrix will be off by one as the matrix coordinates start at one and, in the Java code below, the strings start at zero. The actual calculation is interesting as it looks at three cells, the cell to the left, the cell above and the cell to the upper left. The point of the three comparisons is to take the minimum value from any of the three cells, this can be represented simply in Java with a single line of code:

Math.min(Math.min(a, b), c); 

Once the matrix is initialized, the remainder of the values may be filled in. Once the entire matrix is filled in, the answer is in the lower right hand cell.

For each comparison, either 0 or 1 is the result. If the two characters are the same, then the result is 0; if they are different, the result is 1. This is used in generating one of the inputs to the selection of the minimum value of the three candidates that will result in the value inserted into the next cell of the matrix. The result of comparing the 'G' to the 'G' would then be 0. The cell above has a value of 1,the cell to the left has a value of 1, and the value of the cell one up and one to the left plus the result of the comparison is 0. Feeding 1,1,0 into the three way minimum results in 0, which is then placed into the cell resulting in a matrix with these values:

ed2.jpg

Figure 2: Matrix after the first iteration

The result of the first pass would then be:

ed3.jpg

Figure 3: Matrix after the first pass

Letting the loops run to completion results in a matrix that looks like:

ed4.jpg

Figure 4: The completed matrix

And the result is 2, as predicted originally.

The Java Code

public class Levenshtein
{
    public Levenshtein()
    {
        super();
    }

    public double compare(final String s1, final String s2)
    {
        double retval = 0.0;
        final int n = s1.length();
        final int m = s2.length();
        if (0 == n)
        {
            retval = m;
        }
        else if (0 == m)
        {
            retval = n;
        }
        else
        {
            retval = 1.0 - (compare(s1, n, s2, m) / (Math.max(n, m)));
        }
        return retval;
    }

    private double compare(final String s1, final int n, 
                           final String s2, final int m)
    {
        int matrix[][] = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++)
        {
            matrix[i][0] = i;
        }
        for (int i = 0; i <= m; i++)
        {
            matrix[0][i] = i;
        }

        for (int i = 1; i <= n; i++)
        {
            int s1i = s1.codePointAt(i - 1);
            for (int j = 1; j <= m; j++)
            {
                int s2j = s2.codePointAt(j - 1);
                final int cost = s1i == s2j ? 0 : 1;
                matrix[i][j] = min3(matrix[i - 1][j] + 1, 
                                    matrix[i][j - 1] + 1, 
                                    matrix[i - 1][j - 1] + cost);
            }
        }
        return matrix[n][m];
    }

    private int min3(final int a, final int b, final int c)
    {
        return Math.min(Math.min(a, b), c);
    }
}

History

  • 27 Feb 2010: Original version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

George Stragand
Software Developer (Senior)
United States United States
George Stragand is a software developer and manager who has had a 20+ year track record of delivering commercial software on time. In 2012, George published the Pocket Guide to Hiring Geeks, following up the 2010 publication of Agile Development & Business Goals which details his approach to agile development. George maintains his own website where he posts about the topics he is passionate about: software development, cars, guns, guitars and yes even banjos!

Comments and Discussions

 
GeneralThank you.. PinmemberMember 1109946521-Sep-14 20:48 
QuestionIs it me PinmemberKillaspeed18-Oct-13 3:51 
AnswerRe: Is it me PinmemberGeorge Stragand18-Oct-13 12:05 
GeneralMy vote of 5 PinmemberZhao Jiang4-Jun-13 4:52 
GeneralRe: My vote of 5 PinmemberFranckePeixoto29-May-14 4:10 
GeneralMy vote of 5 PinmemberNagy Vilmos31-Mar-11 3:53 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Mobile
Web02 | 2.8.141022.2 | Last Updated 28 Feb 2011
Article Copyright 2011 by George Stragand
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid