Add your own alternative version
Stats
138K views 5.1K downloads 77 bookmarked
Posted
22 Mar 2006

Comments and Discussions


OK, so I did some work since yesterday, and here is what I came up with.
The changes I did :
1) Before I start filling up the S vectors, I remove all the identical first letters.
For example, for the 2 words CHEMISE and CHEIMSE, the "CHE" part is the same, so I disregard this. And the 2 new words that will be processed (and inserted in the 2 vectors) are "MISE" and "IMSE".
Should save a few milliseconds, but when you have millions of rows to compare, they add up...
2) As I wrote previously, I wanted to stop if the current Levenshtein distance was greater than a number.
For example, I don't want all the words with a Levenshtein distance greater than 3.
The way I do is is that I look at the v1 column (once all the slots have numbers), and take the MIN.
This is the current Levenshtein distance. If this current distance is greater than my limit, I exit the function with a value of 99, as I don't need to compute the other columns (as the overall Levenshtein distance can only increase, and never decrease (as you only add up 1s)
3) Optional. If you have to work with accentuated characters, and you want that the Levenshtein distance don't count the difference between, say, "Helene" and "Hélène", then you have to use a function that replaces all the accentuated characters with their "normal" letters.
Function Levenshtein(ByVal String1 As String, ByVal String2 As String, ByVal LMax As Integer) As Integer
Dim cost As Integer
Dim CurrentCost As Integer
Dim MinLen As Integer = Math.Min(String1.Length, String2.Length)
If (String1.Length > 1000 Or String2.Length > 1000) Then
MsgBox("La taille limite des mots est de 1000 lettres", MsgBoxStyle.Exclamation, "Texte trop long")
Return 99
End If
If (Math.Abs(String1.Length  String2.Length) > LMax) Then
Return 99
End If
String1 = String1.ToUpper
String2 = String2.ToUpper
String1 = No_accents(String1)
String2 = No_accents(String2)
For i = 0 To MinLen  1
If (String1(i) = String2(i)) Then
If i = MinLen  1 Then
String1 = String1.Substring(i + 1)
String2 = String2.Substring(i + 1)
Exit For
Else
Continue For
End If
End If
String1 = String1.Substring(i)
String2 = String2.Substring(i)
Exit For
Next
Dim String1Len As Integer = String1.Length
Dim String2Len As Integer = String2.Length
If (String1Len = 0 Or String2Len = 0) Then Return Math.Max(String1.Length, String2.Length)
Dim v0 As Integer() = New Integer(String1Len + 1) {}
Dim v1 As Integer() = New Integer(String1Len + 1) {}
For i = 1 To String1Len
v0(i) = i
Next
For i = 1 To String2Len
v1(0) = i
For j = 1 To String1Len
If (String1(j  1) = String2(i  1)) Then
cost = 0
Else
cost = 1
End If
v1(j) = Math.Min(v0(j) + 1, Math.Min(v1(j  1) + 1, v0(j  1) + cost))
If j = 1 Then
CurrentCost = v1(1)
Else
CurrentCost = Math.Min(v1(j), CurrentCost)
End If
Next
If CurrentCost > LMax Then Return 99
Dim vTmp = v0
v0 = v1
v1 = vTmp
Next
Return (v0(String1Len))
End Function
Function No_accents(ByVal Chaine As String) As String
Dim a As String = "ÀÁÂÃÄÅÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝàáâãäåèéêëìíîïðñòóôõöùúûüýÿ"
Dim b As String = "AAAAAAEEEEIIIINOOOOOUUUUYaaaaaaeeeeiiiionooooouuuuyy"
For i As Integer = 0 To a.Count  1
Chaine = Chaine.Replace(a.Substring(i, 1), b.Substring(i, 1))
Next i
Return Chaine.ToUpper
End Function
Hope this will be usefull.
modified 20Oct17 14:37pm.





Hi all,
I tried to port this code to VB.Net, but I can't make it give me the expected results..
Here is the code :
<pre> Function Levenshtein(ByVal sRow As String, ByVal sCol As String, ByVal LMax As Integer) As Integer
Dim RowLen As Integer = sRow.Length
Dim ColLen As Integer = sCol.Length
Dim MaxLen = Math.Max(RowLen, ColLen)
Dim cost As Integer
If MaxLen > 1000 Then
MsgBox("La taille limite des mots est de 1000 lettres", MsgBoxStyle.Exclamation, "Texte trop long")
Return 99
End If
If (RowLen = 0 Or ColLen = 0) Then Return MaxLen
Dim v0 As Integer() = New Integer(RowLen + 1) {}
Dim v1 As Integer() = New Integer(RowLen + 1) {}
For i = 1 To RowLen
v0(i) = i
Next
For j = 1 To ColLen
v1(0) = j
For i = 1 To RowLen
If (sRow(i  1) = sCol(j  1)) Then
cost = 0
Else
cost = 1
End If
v1(i) = Math.Min(v0(i) + 1, Math.Min(v1(i  1) + 1, v0(i  1) + cost))
Next
Dim vTmp = v0
v0 = v1
v1 = vTmp
Next
Return ((100 * v0(RowLen)) / MaxLen)
End Function
When I try to launch the function with "CHEMISE" and "CHEMISIE", it gives me a result of 12...
I most probably made a mistake in the "translation" from C to VB, but I can't find it.
Also, I'd like to change the code to stop if the Levenshtein distance is greater than a value (this is to avoid spending too much time computing different strings, while I only want them to be, for example, only 2 changes apart.
Any idea on how to do this ?
Thanks a lot






You are welcome

Sten Hjelmqvist





Thank you, Sten. I really appreciate this. It saved me a lot of time. I cleaned up the code to suit my style. I used var when possible, renamed the local variables, reduced scope, etc, etc.
public static float LevenshteinDistance(string a, string b)
{
var rowLen = a.Length;
var colLen = b.Length;
var maxLen = Math.Max(rowLen, colLen);
if (rowLen == 0  colLen == 0)
{
return maxLen;
}
var v0 = new int[rowLen + 1];
var v1 = new int[rowLen + 1];
for (var i = 1; i <= rowLen; i++)
{
v0[i] = i;
}
for (var j = 1; j <= colLen; j++)
{
v1[0] = j;
for (var i = 1; i <= rowLen; i++)
{
var cost = (a[i  1] == b[j  1]) ? 0 : 1;
v1[i] = Math.Min(v0[i] + 1, Math.Min(v1[i  1] + 1, v0[i  1] + cost));
}
var vTmp = v0;
v0 = v1;
v1 = vTmp;
}
return 100.0f * v0[rowLen] / maxLen;
}





OK, I found out why...
In the original code, the result was :
return 100.0f * v0[rowLen] / maxLen;
I just changed it to
return v0[rowLen];
I also add a ToUpper to the 2 strings, so that
Levenshtein ("CHAT", "chat") would give 0 (which is what I expect, but you may disagree).
Now, trying to find a solution to the limit of the function....





This is stupid fast. I've been tasked with finding a first and last name in roughly 11,000 text files from OCRed PDFs. 21.2 million word comparisons in ~15 seconds with great results.
Thanks for sharing, this is awesome work!





I'm glad that you find it useful but I'm not sure how.
That is how do you use this algorithm to find names?
When I compare the two strings:
"Fast, memory efficient Levenshtein algorithm" : length=45
and
"Levenshtein" : length=12
I get 33 , it makes sense as 45  12 = 33.
And 45  33 = 12, that is the length of "Levenshtein".
Is that how you use the algorithm?

Sten Hjelmqvist





Some what like that. I'm not comparing a name to a string of words, rather a name to each individual word, hence the 10M compares. If the result between the two are with in a given range I flag that document for review. Since the files are OCRed you might get Sc0t7 and not Scott. There are other processing steps such as soundex that are also used, but this has helped increase an accuracy hit rate to, in some cases, 6070%.





such as,
akitten To sitting：
kitten （a→ ）
sitten （k→s）
sittin （e→i）
sitting（ →g）
I hope to get the detail change list, not only the Distance int value
Thanks





In my testing (differences between email addresses) I have found that a single character difference (something that posgresql's levenshtein algorthim returns 1), this code returns 3 for both iLD and LD. Therefore it doesn't matter how fast of efficient your code is, the end result is incorrect.





As mentioned below by "Member 11645800" there are different versions of the Levenshtein algorithm.
Are you sure the one I describe here and the one implemented in posgresql are the same?

Sten Hjelmqvist





Tom,
Can you supply an example of a string (not necessarily a valid email address) that exhibits the length difference you describe?
Jim





I have seen an algorithm which uses the original method of Levenshtein algo. After generating the matrix and calculating the minimum edit distance it back traces the matrix to create an array that contains the transformations needed.
But in your implementation we dont have a matrix to do that. Can you tell me how to get the changes?






Sorry, but I don't understand what you mean.
Could you please elaborate?

Sten Hjelmqvist





Tranposition is not part of the Levenshtein distance, but instead a different algorithm based on it called DamerauLevenshtein distance.
http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
Quote: The Damerau–Levenshtein distance differs from the classical Levenshtein distance by including transpositions among its allowable operations





The result is 6 comparing two strings "ROBERTMELANSON" and "ROBERTOMELANSON". I think it should be 1 (correct me if I'm wrong).





When I run the program comparing "ROBERTMELANSON" with "ROBERTOMELANSON" I get the result 1 not 6.
I don't understand how you get the result = 6.

Sten Hjelmqvist







I'm glad you find it useful.

Sten Hjelmqvist





Thanks.Your improvement is very interesting and useful





As you don't use previous column values again after calculating a new column vector can't you avoid swapping the column vectors: and just overwrite the 'previous column' vector with the new vector?
The new vector is then repopulated with new values, next iteration (if any)
If so, you'd save a couple of array (re)assignments.







General News Suggestion Question Bug Answer Joke Praise Rant Admin Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

