Add your own alternative version
Stats
135.3K views 5.1K downloads 75 bookmarked
Posted
22 Mar 2006

Comments and Discussions


OK, so I did some work since yesterday, and here is what I came up with.
The changes I did :
1) Before I start filling up the S vectors, I remove all the identical first letters.
For example, for the 2 words CHEMISE and CHEIMSE, the "CHE" part is the same, so I disregard this. And the 2 new words that will be processed (and inserted in the 2 vectors) are "MISE" and "IMSE".
Should save a few milliseconds, but when you have millions of rows to compare, they add up...
2) As I wrote previously, I wanted to stop if the current Levenshtein distance was greater than a number.
For example, I don't want all the words with a Levenshtein distance greater than 3.
The way I do is is that I look at the v1 column (once all the slots have numbers), and take the MIN.
This is the current Levenshtein distance. If this current distance is greater than my limit, I exit the function with a value of 99, as I don't need to compute the other columns (as the overall Levenshtein distance can only increase, and never decrease (as you only add up 1s)
3) Optional. If you have to work with accentuated characters, and you want that the Levenshtein distance don't count the difference between, say, "Helene" and "Hélène", then you have to use a function that replaces all the accentuated characters with their "normal" letters.
Function Levenshtein(ByVal String1 As String, ByVal String2 As String, ByVal LMax As Integer) As Integer
Dim cost As Integer
Dim CurrentCost As Integer
Dim MinLen As Integer = Math.Min(String1.Length, String2.Length)
If (String1.Length > 1000 Or String2.Length > 1000) Then
MsgBox("La taille limite des mots est de 1000 lettres", MsgBoxStyle.Exclamation, "Texte trop long")
Return 99
End If
If (Math.Abs(String1.Length  String2.Length) > LMax) Then
Return 99
End If
String1 = String1.ToUpper
String2 = String2.ToUpper
String1 = No_accents(String1)
String2 = No_accents(String2)
For i = 0 To MinLen  1
If (String1(i) = String2(i)) Then
If i = MinLen  1 Then
String1 = String1.Substring(i + 1)
String2 = String2.Substring(i + 1)
Exit For
Else
Continue For
End If
End If
String1 = String1.Substring(i)
String2 = String2.Substring(i)
Exit For
Next
Dim String1Len As Integer = String1.Length
Dim String2Len As Integer = String2.Length
If (String1Len = 0 Or String2Len = 0) Then Return Math.Max(String1.Length, String2.Length)
Dim v0 As Integer() = New Integer(String1Len + 1) {}
Dim v1 As Integer() = New Integer(String1Len + 1) {}
For i = 1 To String1Len
v0(i) = i
Next
For i = 1 To String2Len
v1(0) = i
For j = 1 To String1Len
If (String1(j  1) = String2(i  1)) Then
cost = 0
Else
cost = 1
End If
v1(j) = Math.Min(v0(j) + 1, Math.Min(v1(j  1) + 1, v0(j  1) + cost))
If j = 1 Then
CurrentCost = v1(1)
Else
CurrentCost = Math.Min(v1(j), CurrentCost)
End If
Next
If CurrentCost > LMax Then Return 99
Dim vTmp = v0
v0 = v1
v1 = vTmp
Next
Return (v0(String1Len))
End Function
Function No_accents(ByVal Chaine As String) As String
Dim a As String = "ÀÁÂÃÄÅÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝàáâãäåèéêëìíîïðñòóôõöùúûüýÿ"
Dim b As String = "AAAAAAEEEEIIIINOOOOOUUUUYaaaaaaeeeeiiiionooooouuuuyy"
For i As Integer = 0 To a.Count  1
Chaine = Chaine.Replace(a.Substring(i, 1), b.Substring(i, 1))
Next i
Return Chaine.ToUpper
End Function
Hope this will be usefull.
modified 2 days ago.





Hi all,
I tried to port this code to VB.Net, but I can't make it give me the expected results..
Here is the code :
<pre> Function Levenshtein(ByVal sRow As String, ByVal sCol As String, ByVal LMax As Integer) As Integer
Dim RowLen As Integer = sRow.Length
Dim ColLen As Integer = sCol.Length
Dim MaxLen = Math.Max(RowLen, ColLen)
Dim cost As Integer
If MaxLen > 1000 Then
MsgBox("La taille limite des mots est de 1000 lettres", MsgBoxStyle.Exclamation, "Texte trop long")
Return 99
End If
If (RowLen = 0 Or ColLen = 0) Then Return MaxLen
Dim v0 As Integer() = New Integer(RowLen + 1) {}
Dim v1 As Integer() = New Integer(RowLen + 1) {}
For i = 1 To RowLen
v0(i) = i
Next
For j = 1 To ColLen
v1(0) = j
For i = 1 To RowLen
If (sRow(i  1) = sCol(j  1)) Then
cost = 0
Else
cost = 1
End If
v1(i) = Math.Min(v0(i) + 1, Math.Min(v1(i  1) + 1, v0(i  1) + cost))
Next
Dim vTmp = v0
v0 = v1
v1 = vTmp
Next
Return ((100 * v0(RowLen)) / MaxLen)
End Function
When I try to launch the function with "CHEMISE" and "CHEMISIE", it gives me a result of 12...
I most probably made a mistake in the "translation" from C to VB, but I can't find it.
Also, I'd like to change the code to stop if the Levenshtein distance is greater than a value (this is to avoid spending too much time computing different strings, while I only want them to be, for example, only 2 changes apart.
Any idea on how to do this ?
Thanks a lot






You are welcome

Sten Hjelmqvist





Thank you, Sten. I really appreciate this. It saved me a lot of time. I cleaned up the code to suit my style. I used var when possible, renamed the local variables, reduced scope, etc, etc.
public static float LevenshteinDistance(string a, string b)
{
var rowLen = a.Length;
var colLen = b.Length;
var maxLen = Math.Max(rowLen, colLen);
if (rowLen == 0  colLen == 0)
{
return maxLen;
}
var v0 = new int[rowLen + 1];
var v1 = new int[rowLen + 1];
for (var i = 1; i <= rowLen; i++)
{
v0[i] = i;
}
for (var j = 1; j <= colLen; j++)
{
v1[0] = j;
for (var i = 1; i <= rowLen; i++)
{
var cost = (a[i  1] == b[j  1]) ? 0 : 1;
v1[i] = Math.Min(v0[i] + 1, Math.Min(v1[i  1] + 1, v0[i  1] + cost));
}
var vTmp = v0;
v0 = v1;
v1 = vTmp;
}
return 100.0f * v0[rowLen] / maxLen;
}





OK, I found out why...
In the original code, the result was :
return 100.0f * v0[rowLen] / maxLen;
I just changed it to
return v0[rowLen];
I also add a ToUpper to the 2 strings, so that
Levenshtein ("CHAT", "chat") would give 0 (which is what I expect, but you may disagree).
Now, trying to find a solution to the limit of the function....





This is stupid fast. I've been tasked with finding a first and last name in roughly 11,000 text files from OCRed PDFs. 21.2 million word comparisons in ~15 seconds with great results.
Thanks for sharing, this is awesome work!





I'm glad that you find it useful but I'm not sure how.
That is how do you use this algorithm to find names?
When I compare the two strings:
"Fast, memory efficient Levenshtein algorithm" : length=45
and
"Levenshtein" : length=12
I get 33 , it makes sense as 45  12 = 33.
And 45  33 = 12, that is the length of "Levenshtein".
Is that how you use the algorithm?

Sten Hjelmqvist





Some what like that. I'm not comparing a name to a string of words, rather a name to each individual word, hence the 10M compares. If the result between the two are with in a given range I flag that document for review. Since the files are OCRed you might get Sc0t7 and not Scott. There are other processing steps such as soundex that are also used, but this has helped increase an accuracy hit rate to, in some cases, 6070%.





such as,
akitten To sitting：
kitten （a→ ）
sitten （k→s）
sittin （e→i）
sitting（ →g）
I hope to get the detail change list, not only the Distance int value
Thanks





In my testing (differences between email addresses) I have found that a single character difference (something that posgresql's levenshtein algorthim returns 1), this code returns 3 for both iLD and LD. Therefore it doesn't matter how fast of efficient your code is, the end result is incorrect.





As mentioned below by "Member 11645800" there are different versions of the Levenshtein algorithm.
Are you sure the one I describe here and the one implemented in posgresql are the same?

Sten Hjelmqvist





Tom,
Can you supply an example of a string (not necessarily a valid email address) that exhibits the length difference you describe?
Jim





I have seen an algorithm which uses the original method of Levenshtein algo. After generating the matrix and calculating the minimum edit distance it back traces the matrix to create an array that contains the transformations needed.
But in your implementation we dont have a matrix to do that. Can you tell me how to get the changes?






Sorry, but I don't understand what you mean.
Could you please elaborate?

Sten Hjelmqvist





Tranposition is not part of the Levenshtein distance, but instead a different algorithm based on it called DamerauLevenshtein distance.
http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
Quote: The Damerau–Levenshtein distance differs from the classical Levenshtein distance by including transpositions among its allowable operations





The result is 6 comparing two strings "ROBERTMELANSON" and "ROBERTOMELANSON". I think it should be 1 (correct me if I'm wrong).





When I run the program comparing "ROBERTMELANSON" with "ROBERTOMELANSON" I get the result 1 not 6.
I don't understand how you get the result = 6.

Sten Hjelmqvist







I'm glad you find it useful.

Sten Hjelmqvist





Thanks.Your improvement is very interesting and useful





As you don't use previous column values again after calculating a new column vector can't you avoid swapping the column vectors: and just overwrite the 'previous column' vector with the new vector?
The new vector is then repopulated with new values, next iteration (if any)
If so, you'd save a couple of array (re)assignments.






Looking at the lines:
if (Math.Max(sRow.Length, sCol.Length) > Math.Pow(2, 31))
throw (new Exception("\nMaximum string length in Levenshtein.iLD is " + Math.Pow(2, 31) +
".\nYours is " + Math.Max(sRow.Length, sCol.Length) + "."));
it looks to me like this is ported code. Let's examine it:
Math.Pow(2, 31) is the same as int.MaxValue.
Math.Max(sRow.Length, sCol.Length) will return a value <= int.MaxValue as Length from a string is of type int.
Thus, the whole if condition can be removed as the left side will never be greater than the right side.
A pretty easy spot





You are correct.
The test is only relevant if you use strings with a length short enough to be stored in a ushort instead of int.
The test would be like this:
if (Math.Max(sRow.Length, sCol.Length) > Math.Pow(2, 16))
throw (new Exception("\nMaximum string length in Levenshtein.iLD is " + Math.Pow(2, 16) +
".\nYours is " + Math.Max(sRow.Length, sCol.Length) + "."));

Sten Hjelmqvist





Very useful indeed; thank you! I am using this as part of the political science research website http://www.ccesd.ac.uk.





Hi,
the code is fantastic but i want to know the Operations performed on the strings apart from the distance. Do you have any idea how we can display the operations performed on it ?
and how about the "String edit in both the forward and backward direction" ??
thanks,
piyush





Thanks  the code is great, it works exactly as expected!





I'm looking to integrate this kind of fuzzy string compare in to my ASP.NET application. Is this code production ready?
Also, have you seen this? Does it use the same algorithm?





My article is about how to speed up the Levenshtein algorithm,
it's not a product. Sorry!
The product you link to is probably using a similar algorithm.

Sten Hjelmqvist






Your code assumes that nul is a terminator in C#/.Net. It isn't. I declare your port of ANSIC to C#.Net a FAIL
http://musingmarc.blogspot.com






A word of warning to users of the Yeti c# port: it is fast but has a problem: try matching the following strings in YetiLevenshtein:
"ABCxxx" and "ABC1xx"  this returns 0 but should return 1 because only the 1 and the x are different.. pass the strings in reverse order and the result is 1.
This behavior happens every time you compare strings in the following format
[Prefix][character x][Suffix starting with character x]
[Prefix][character y][Suffix starting with character x]
The problem is in the memchrRPLC(..) method but also caused by the prefix/suffix handling of YetiLevenshtein.
in the above example
"ABCxxx" is reduced to "xxx" and then "x"
"ABC1xx"; is reduced to "1xx" and then "1".
this is sent to memchrRPLC but memchr compares
not this: ABC[x]xx, ABC[1]xx => x != 1
but this: ABC[x]xx, ABC1[x]x. => x == x
My fix for memchrRPLC:
static unsafe int memchrRPLC(char* buffer, char c, int count)
{
char* p = buffer;
char* e = buffer + count;
while (p < e)
{
if (*(p++) == c)
return 1;
}
return 0;
}
@Cetin I got the latest revision from sourceforge so this should be fixed there...





unfortunately, i'm disapointed about the article body...
why don't you explain more ?





I agree  source code is great and all, but why don't you tell us a little about how your new version of the algorithm works.





The idea behind this editdistance algorithm is bottomup dynamic programming, that is systematically traversing the matrix from small to big. It means the solution of new and bigger problem is computed based on previous smaller problem. Looking back at the original fomular :
"d[i][j] = Minimum (d[i1][j]+1, d[i][j1]+1, d[i1][j1] + cost);"
What are essential evidences to compute the solution at [i][j] ? you need to maintain only two separate "one dimensional" arrays. The first one stores all results of row at "i1": d[i  1][], the second one stores all results of current row at "i": d[i][]. This is why the author said that we only need 2*StrLen elements. These arrays contain all parameters satisfy to above formular: "d[i1][j]+1, d[i][j1]+1, d[i1][j1] + cost". After each loop, we just need to replace array of "i1" by array of "i" and go next row to continue the computation
Thanh Dao





v2.0 wrote: unfortunately, i'm disapointed about the article body...
why don't you explain more ?
Ok, I have added a more comprehensive explaination of the algorithm.

Sten Hjelmqvist






General News Suggestion Question Bug Answer Joke Praise Rant Admin Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

