Introduction
Levenshtein's distance measures the minimum number of character edits required to change a word into another. In this tip, we'll see a simple implementation of the Levenshtein algorithm in Visual Basic. It will be useful in several situations, when managing - for example - large amount of text, and we are in need of fast and massive modifications in our data, to spot and correct typos, or to match string
s in terms of similarity, and not simply in terms of equal/not equal. Levenshtein's algorithm allows to compute a score regarding string
similarity. We'll see how in a moment.
Standard Levenshtein Algorithm
Here follows the standard Levenshtein implementation in VB.NET, according to the algorithm as shown at Wikipedia.
Public Function Levenshtein(ByVal s As String, ByVal t As String) As Integer
Dim n As Integer = s.Length
Dim m As Integer = t.Length
Dim d(n + 1, m + 1) As Integer
If n = 0 Then
Return m
End If
If m = 0 Then
Return n
End If
Dim i As Integer
Dim j As Integer
For i = 0 To n
d(i, 0) = i
Next
For j = 0 To m
d(0, j) = j
Next
For i = 1 To n
For j = 1 To m
Dim cost As Integer
If t(j - 1) = s(i - 1) Then
cost = 0
Else
cost = 1
End If
d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
Next
Next
Return d(n, m)
End Function
Examples
Using the above function(s) is trivial: it's sufficient to call it by passing two string
s, plus the optional boolean flag for case-sensitivity check. Some examples could be:
MsgBox(Levenshtein("Inwards", "inwards").ToString)
MsgBox(Levenshtein("Inwards", "inwards").ToString)
MsgBox(Levenshtein("towards", "towards").ToString)
MsgBox(Levenshtein("dinner", "breakfast").ToString)
MsgBox(Levenshtein("breakfast", "braekfast").ToString)
MsgBox(Levenshtein("efficient", "sufficient").ToString)
MsgBox(Levenshtein("grandma", "anathema").ToString)
MsgBox(Levenshtein("aunt", "ant").ToString)
The results (1, 0, 0, 9, 2, 2, 5, 1) are the Levenshtein's distances between given string
s, i.e., a score regarding string
s similarity. The lower the score, the nearer are the two entities. A value of zero means, obviously, a total convergence of the two. We could use a function like this (with predetermined conditions to be satisfied, like "the score must not exceed 3", for example) to correct typos (as in the "breakfast - braekfast" example), or to search for differences in hypothetical data (like "new York - New York"), and so on.
Bibliography
History
- 2015-01-06 Added standard algorithm, revised text, revised code
- 2015-01-03 First release for CodeProject
Working in IT since 2003 as Software Developer for Essetre Srl, a company in Northern Italy.
I was awarded in 2014, 2015 and 2016 with Microsoft MVP, for Visual Studio and Development Technologies expertise. My technology interests and main skills are in .NET Framework, Visual Basic, Visual C# and SQL Server, but i'm proficient in PHP and MySQL also.