Click here to Skip to main content
14,034,071 members
Click here to Skip to main content
Add your own
alternative version

Tagged as


16 bookmarked
Posted 28 Feb 2011
Licenced CPOL

Fuzzy String Matching with Edit Distance

, 28 Feb 2011
Rate this:
Please Sign up or sign in to vote.
Move beyond the String.equals() method of Java and introduce fuzzy String matching


Previously, I’ve posted about using the Dice Coefficient as a method to perform a fuzzy match on a string, producing a percentage as a result to measure the amount of similarity between two strings. The Dice Coefficient is a quick way to produce a measurement of similarity, but it does have some drawbacks. With small strings, the amount of similarity is small compared to larger strings. For example "cat" and "hat" come out with a rating of 50% using the Dice Coefficient. With longer strings, the "error" is smoothed. One benefit of the Dice Coefficient is that small runs of characters within the string (and by "small run", I really mean two characters given how the Dice Coefficient functions on bigrams) weigh into the overall rating of similarity. As good as the Dice Coefficient is, simple and fairly quick, it's not the tool for every job.

When dealing with small strings (and small is a relative measurement), the approach created by Vladimir Levenshtein in 1965 has a large benefit over the Dice Coefficient approach. Amazing to realize that the 1965 wasn't a typo; this isn’t a new idea or a new problem, but how many software developers in the 21st Century have heard of Levenshtein's algorithm? I'd guess only a small percentage! Yet this is a classic computer science problem, and one that has applications outside of spell checking.

Levenshtein's Edit Distance as a Fuzzy String Match

What Levenshtein's algorithm does is count the number of changes that must occur in one string to transform it into another string. Take the example of two strings, "George" and "Geordie"; how many characters would you have to change to transform "George" into "Geordie"? The answer is 2, here’s how:

  1. George ? Georde (substitute the 'g' with 'd')
  2. Georde ? Geordie (insert 'i' between the 'd' and 'e')

Pretty easy stuff for a human, getting a machine to calculate the answer is another problem. Tracking and measuring the inserting is the really interesting part of the problem and this is what Levenshtein's algorithm does very nicely.

The actual algorithm itself is very simple, requiring a matrix to represent the values as the calculation progresses through both strings. A simple version of the loops results in an O(n2) implementation, but since only very small strings (under 10 characters typically) existed in the problem space I was working within, this was not a huge performance drain.

One modification of the results of the algorithm that I made to produce the goal I had was to produce a percentage instead of a number. This is really only a minor tweak to the results to transform the number of edits required into a percentage. To transform the results, I subtracted from 1.0 the number of edits required divided by the length of the longest string. Taking the resulting double value and multiplying by 100 yields a percentage; this step was only done to produce a result that would be meaningful to a user. Using the example of transforming "George" into "Geordie", the result would be: 1.0 – (2 / 7) or 71 (rounded down, remember the goal I had was to produce a fuzzy result and that 71 is close enough). Another example would be transforming "Fred" into "George", almost nothing is the same and 5 edits are required so the result would be 0, 1.0 – (5/6) or 16% as very little is the same and what is isn't in the same position.

Before getting into the loop to calculate the edit distance, check to see if either string is empty. If one is, return the length of the other string. This points out one interesting note about Levelshtein's algorithm, in that the result is always bounded by the length of the longest of the two input parameter strings.

If neither string is empty, then construct the matrix for the calculation. The matrix should be the length of the first input string plus one by the length of the second input string plus one (in the "George" vs. "Geordie" example above, the matrix would measure 7 by 8). Note the "plus one", as these rows are the "seed" values to kick everything off and allow a very simple comparison later. The first column and row will not be modified later as the loops start at the column and row indexed by one, not zero.


Figure 1: Initial Matrix

While looping through the strings, the indexes of both strings are the coordinates of the matrix. Note that the string indexes and the matrix will be off by one as the matrix coordinates start at one and, in the Java code below, the strings start at zero. The actual calculation is interesting as it looks at three cells, the cell to the left, the cell above and the cell to the upper left. The point of the three comparisons is to take the minimum value from any of the three cells, this can be represented simply in Java with a single line of code:

Math.min(Math.min(a, b), c); 

Once the matrix is initialized, the remainder of the values may be filled in. Once the entire matrix is filled in, the answer is in the lower right hand cell.

For each comparison, either 0 or 1 is the result. If the two characters are the same, then the result is 0; if they are different, the result is 1. This is used in generating one of the inputs to the selection of the minimum value of the three candidates that will result in the value inserted into the next cell of the matrix. The result of comparing the 'G' to the 'G' would then be 0. The cell above has a value of 1,the cell to the left has a value of 1, and the value of the cell one up and one to the left plus the result of the comparison is 0. Feeding 1,1,0 into the three way minimum results in 0, which is then placed into the cell resulting in a matrix with these values:


Figure 2: Matrix after the first iteration

The result of the first pass would then be:


Figure 3: Matrix after the first pass

Letting the loops run to completion results in a matrix that looks like:


Figure 4: The completed matrix

And the result is 2, as predicted originally.

The Java Code

public class Levenshtein
    public Levenshtein()

    public double compare(final String s1, final String s2)
        double retval = 0.0;
        final int n = s1.length();
        final int m = s2.length();
        if (0 == n)
            retval = m;
        else if (0 == m)
            retval = n;
            retval = 1.0 - (compare(s1, n, s2, m) / (Math.max(n, m)));
        return retval;

    private double compare(final String s1, final int n, 
                           final String s2, final int m)
        int matrix[][] = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++)
            matrix[i][0] = i;
        for (int i = 0; i <= m; i++)
            matrix[0][i] = i;

        for (int i = 1; i <= n; i++)
            int s1i = s1.codePointAt(i - 1);
            for (int j = 1; j <= m; j++)
                int s2j = s2.codePointAt(j - 1);
                final int cost = s1i == s2j ? 0 : 1;
                matrix[i][j] = min3(matrix[i - 1][j] + 1, 
                                    matrix[i][j - 1] + 1, 
                                    matrix[i - 1][j - 1] + cost);
        return matrix[n][m];

    private int min3(final int a, final int b, final int c)
        return Math.min(Math.min(a, b), c);


  • 27 Feb 2010: Original version


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

George Stragand
Software Developer (Senior)
United States United States
George Stragand is a software developer and manager who has had a 20+ year track record of delivering commercial software on time. In 2012, George published the Pocket Guide to Hiring Geeks, following up the 2010 publication of Agile Development & Business Goals which details his approach to agile development. George maintains his own website where he posts about the topics he is passionate about: software development, cars, guns, guitars and yes even banjos!

You may also be interested in...

Comments and Discussions

GeneralC++ Port Pin
vlzvl66611-Jan-17 14:25
membervlzvl66611-Jan-17 14:25 
GeneralThank you.. Pin
Member 1109946521-Sep-14 20:48
memberMember 1109946521-Sep-14 20:48 
QuestionIs it me Pin
Killaspeed18-Oct-13 3:51
memberKillaspeed18-Oct-13 3:51 
AnswerRe: Is it me Pin
George Stragand18-Oct-13 12:05
memberGeorge Stragand18-Oct-13 12:05 
GeneralMy vote of 5 Pin
Zhao Jiang4-Jun-13 4:52
memberZhao Jiang4-Jun-13 4:52 
GeneralRe: My vote of 5 Pin
FranckePeixoto29-May-14 4:10
memberFranckePeixoto29-May-14 4:10 
GeneralMy vote of 5 Pin
Nagy Vilmos31-Mar-11 3:53
professionalNagy Vilmos31-Mar-11 3:53 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Permalink | Advertise | Privacy | Cookies | Terms of Use | Mobile
Web02 | 2.8.190423.1 | Last Updated 28 Feb 2011
Article Copyright 2011 by George Stragand
Everything else Copyright © CodeProject, 1999-2019
Layout: fixed | fluid