Previously, I’ve posted about using the Dice Coefficient as a method to perform a fuzzy match on a
string, producing a percentage as a result to measure the amount of similarity between two
strings. The Dice Coefficient is a quick way to produce a measurement of similarity, but it does have some drawbacks. With small
strings, the amount of similarity is small compared to larger
strings. For example "
cat" and "
hat" come out with a rating of 50% using the Dice Coefficient. With longer
strings, the "error" is smoothed. One benefit of the Dice Coefficient is that small runs of characters within the
string (and by "small run", I really mean two characters given how the Dice Coefficient functions on bigrams) weigh into the overall rating of similarity. As good as the Dice Coefficient is, simple and fairly quick, it's not the tool for every job.
When dealing with small
strings (and small is a relative measurement), the approach created by Vladimir Levenshtein in 1965 has a large benefit over the Dice Coefficient approach. Amazing to realize that the 1965 wasn't a typo; this isn’t a new idea or a new problem, but how many software developers in the 21st Century have heard of Levenshtein's algorithm? I'd guess only a small percentage! Yet this is a classic computer science problem, and one that has applications outside of spell checking.
Levenshtein's Edit Distance as a Fuzzy String Match
What Levenshtein's algorithm does is count the number of changes that must occur in one
string to transform it into another
string. Take the example of two
George" and "
Geordie"; how many characters would you have to change to transform "
George" into "
Geordie"? The answer is 2, here’s how:
Georde (substitute the 'g' with 'd')
Geordie (insert 'i' between the 'd' and 'e')
Pretty easy stuff for a human, getting a machine to calculate the answer is another problem. Tracking and measuring the inserting is the really interesting part of the problem and this is what Levenshtein's algorithm does very nicely.
The actual algorithm itself is very simple, requiring a matrix to represent the values as the calculation progresses through both
strings. A simple version of the loops results in an O(n2) implementation, but since only very small
strings (under 10 characters typically) existed in the problem space I was working within, this was not a huge performance drain.
One modification of the results of the algorithm that I made to produce the goal I had was to produce a percentage instead of a number. This is really only a minor tweak to the results to transform the number of edits required into a percentage. To transform the results, I subtracted from 1.0 the number of edits required divided by the length of the longest
string. Taking the resulting double value and multiplying by 100 yields a percentage; this step was only done to produce a result that would be meaningful to a user. Using the example of transforming "
George" into "
Geordie", the result would be: 1.0 – (2 / 7) or 71 (rounded down, remember the goal I had was to produce a fuzzy result and that 71 is close enough). Another example would be transforming "
Fred" into "
George", almost nothing is the same and 5 edits are required so the result would be 0, 1.0 – (5/6) or 16% as very little is the same and what is isn't in the same position.
Before getting into the loop to calculate the edit distance, check to see if either
string is empty. If one is, return the length of the other
string. This points out one interesting note about Levelshtein's algorithm, in that the result is always bounded by the length of the longest of the two input parameter
string is empty, then construct the matrix for the calculation. The matrix should be the length of the first input
string plus one by the length of the second input
string plus one (in the "
George" vs. "
Geordie" example above, the matrix would measure 7 by 8). Note the "plus one", as these rows are the "seed" values to kick everything off and allow a very simple comparison later. The first column and row will not be modified later as the loops start at the column and row indexed by one, not zero.
Figure 1: Initial Matrix
While looping through the
strings, the indexes of both
strings are the coordinates of the matrix. Note that the
string indexes and the matrix will be off by one as the matrix coordinates start at one and, in the Java code below, the
strings start at zero. The actual calculation is interesting as it looks at three cells, the cell to the left, the cell above and the cell to the upper left. The point of the three comparisons is to take the minimum value from any of the three cells, this can be represented simply in Java with a single line of code:
Math.min(Math.min(a, b), c);
Once the matrix is initialized, the remainder of the values may be filled in. Once the entire matrix is filled in, the answer is in the lower right hand cell.
For each comparison, either
1 is the result. If the two characters are the same, then the result is
0; if they are different, the result is
1. This is used in generating one of the inputs to the selection of the minimum value of the three candidates that will result in the value inserted into the next cell of the matrix. The result of comparing the '
G' to the '
G' would then be
0. The cell above has a value of
1,the cell to the left has a value of
1, and the value of the cell one up and one to the left plus the result of the comparison is
1,1,0 into the three way minimum results in
0, which is then placed into the cell resulting in a matrix with these values:
Figure 2: Matrix after the first iteration
The result of the first pass would then be:
Figure 3: Matrix after the first pass
Letting the loops run to completion results in a matrix that looks like:
Figure 4: The completed matrix
And the result is
2, as predicted originally.
The Java Code
public class Levenshtein
public double compare(final String s1, final String s2)
double retval = 0.0;
final int n = s1.length();
final int m = s2.length();
if (0 == n)
retval = m;
else if (0 == m)
retval = n;
retval = 1.0 - (compare(s1, n, s2, m) / (Math.max(n, m)));
private double compare(final String s1, final int n,
final String s2, final int m)
int matrix = new int[n + 1][m + 1];
for (int i = 0; i <= n; i++)
matrix[i] = i;
for (int i = 0; i <= m; i++)
matrix[i] = i;
for (int i = 1; i <= n; i++)
int s1i = s1.codePointAt(i - 1);
for (int j = 1; j <= m; j++)
int s2j = s2.codePointAt(j - 1);
final int cost = s1i == s2j ? 0 : 1;
matrix[i][j] = min3(matrix[i - 1][j] + 1,
matrix[i][j - 1] + 1,
matrix[i - 1][j - 1] + cost);
private int min3(final int a, final int b, final int c)
return Math.min(Math.min(a, b), c);
- 27 Feb 2010: Original version