Click here to Skip to main content
15,886,422 members
Articles / Programming Languages / C#

Fast, memory efficient Levenshtein algorithm

Rate me:
Please Sign up or sign in to vote.
4.76/5 (34 votes)
26 Mar 2012CPOL5 min read 219.8K   6.1K   81   41
A version of the Levenshtein algorithm that uses 2*Min(StrLen1,StrLen2) bytes instead of StrLen1*StrLen2 bytes.

Introduction

The Levenshtein distance is the difference between two strings. I use it in a web crawler application to compare the new and old versions of a web page. If it has changed enough, I update it in my database.

Description

The original algorithm creates a matrix, where the size is StrLen1*StrLen2. If both strings are 1000 chars long, the resulting matrix is 1M elements; if the strings are 10,000 chars, the matrix will be 100M elements. If the elements are integers, it will be 4*100M == 400MB. Ouch!

This version of the algorithm uses only 2*StrLen elements, so the latter example would give 2*10,000*4 = 80 KB. The result is that, not only does it use less memory but it's also faster because the memory allocation takes less time. When both strings are about 1K in length, the new version is more than twice as fast.

Example

The original version would create a matrix[6+1,5+1], my version creates two vectors[6+1] (the yellow elements). In both versions, the order of the strings is irrelevant, that is, it could be matrix[5+1,6+1] and two vectors[5+1].

The new algorithm

Steps

Step Description
1 Set n to be the length of s. ("GUMBO")
Set m to be the length of t. ("GAMBOL")
If n = 0, return m and exit.
If m = 0, return n and exit.
Construct two vectors, v0[m+1] and v1[m+1], containing 0..m elements.
2 Initialize v0 to 0..m.
3 Examine each character of s (i from 1 to n).
4 Examine each character of t (j from 1 to m).
5 If s[i] equals t[j], the cost is 0.
If s[i] is not equal to t[j], the cost is 1.
6 Set cell v1[j] equal to the minimum of:
a. The cell immediately above plus 1: v1[j-1] + 1.
b. The cell immediately to the left plus 1: v0[j] + 1.
c. The cell diagonally above and to the left plus the cost: v0[j-1] + cost.
7 After the iteration steps (3, 4, 5, 6) are complete, the distance is found in the cell v1[m].

This section shows how the Levenshtein distance is computed when the source string is "GUMBO" and the target string is "GAMBOL":

Steps 1 and 2 

  v0 v1        
    G U M B O
  0 1 2 3 4 5
G 1          
A 2          
M 3          
B 4          
O 5          
L 6          

Steps 3 to 6, when i = 1

  v0 v1        
    G U M B O
  0 1 2 3 4 5
G 1 0        
A 2 1        
M 3 2        
B 4 3        
O 5 4        
L 6 5        

Steps 3 to 6, when i = 2

SWAP(v0,v1): If you look in the code you will see that I don't swap the content of the vectors but I refer to them.

Set v1[0] to the column number, e.g. 2.

    v0 v1      
    G U M B O
  0 1 2 3 4 5
G 1 0 1      
A 2 1 1      
M 3 2 2      
B 4 3 3      
O 5 4 4      
L 6 5 5      

Steps 3 to 6, when i = 3

SWAP(v0,v1).

Set v1[0] to the column number, e.g. 3.

      v0 v1    
    G U M B O
  0 1 2 3 4 5
G 1 0 1 2    
A 2 1 1 2    
M 3 2 2 1    
B 4 3 3 2    
O 5 4 4 3    
L 6 5 5 4    

Steps 3 to 6, when i = 4

SWAP(v0,v1).

Set v1[0] to the column number, e.g. 4.

        v0 v1  
    G U M B O
  0 1 2 3 4 5
G 1 0 1 2 3  
A 2 1 1 2 3  
M 3 2 2 1 2  
B 4 3 3 2 1  
O 5 4 4 3 2  
L 6 5 5 4 3  

Steps 3 to 6, when i = 5

SWAP(v0,v1).

Set v1[0] to the column number, e.g. 5.

          v0 v1
    G U M B O
  0 1 2 3 4 5
G 1 0 1 2 3 4
A 2 1 1 2 3 4
M 3 2 2 1 2 3
B 4 3 3 2 1 2
O 5 4 4 3 2 1
L 6 5 5 4 3 2

Step 7

The distance is in the lower right hand corner of the matrix, v1[m] == 2. This corresponds to our intuitive realization that "GUMBO" can be transformed into "GAMBOL" by substituting "A" for "U" and adding "L" (one substitution and one insertion = two changes).

Improvements

If you are sure that your strings will never be longer than 2^16 chars, you could use ushort instead of int, if the strings are less than 2^8 chars, you could use byte. I guess, the algorithm would be even faster if we use unmanaged code, but I have not tried it.

References

History

  • 2006-03-22
    • Version 1.0.
  • 2006-03-24
    • Detailed description of the algorithm. The code has been rewritten so that it now follows the description. :-)

 

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior) SEB bank
Sweden Sweden
I work as a developer in the 'Risk control' department at SEB bank in Stockholm,Sweden and I have been designing software since the early 80's.

Comments and Discussions

 
GeneralMy vote of 5 Pin
Alper Sünter27-Nov-20 16:53
Alper Sünter27-Nov-20 16:53 
SuggestionMy changes to this algorithm Pin
jjcasalo20-Oct-17 2:46
jjcasalo20-Oct-17 2:46 
OK, so I did some work since yesterday, and here is what I came up with.

The changes I did :

1) Before I start filling up the S vectors, I remove all the identical first letters.
For example, for the 2 words CHEMISE and CHEIMSE, the "CHE" part is the same, so I disregard this. And the 2 new words that will be processed (and inserted in the 2 vectors) are "MISE" and "IMSE".
Should save a few milliseconds, but when you have millions of rows to compare, they add up...

2) As I wrote previously, I wanted to stop if the current Levenshtein distance was greater than a number.
For example, I don't want all the words with a Levenshtein distance greater than 3.
The way I do is is that I look at the v1 column (once all the slots have numbers), and take the MIN.
This is the current Levenshtein distance. If this current distance is greater than my limit, I exit the function with a value of 99, as I don't need to compute the other columns (as the overall Levenshtein distance can only increase, and never decrease (as you only add up 1s)

3) Optional. If you have to work with accentuated characters, and you want that the Levenshtein distance don't count the difference between, say, "Helene" and "Hélène", then you have to use a function that replaces all the accentuated characters with their "normal" letters.


VB
Function Levenshtein(ByVal String1 As String, ByVal String2 As String, ByVal LMax As Integer) As Integer

        Dim cost As Integer                  ' cost
        Dim CurrentCost As Integer          ' Current Levenshtein distance, for each Column.
        Dim MinLen As Integer = Math.Min(String1.Length, String2.Length)

        If (String1.Length > 1000 Or String2.Length > 1000) Then
            MsgBox("La taille limite des mots est de 1000 lettres", MsgBoxStyle.Exclamation, "Texte trop long")
            Return 99
        End If

        ' If difference of length between 2 strings is greater than LMax, then return 99 
        If (Math.Abs(String1.Length - String2.Length) > LMax) Then
            Return 99
        End If

        String1 = String1.ToUpper
        String2 = String2.ToUpper

        ' OPTIONAL for accentuated words
        String1 = No_accents(String1)
        String2 = No_accents(String2)

        ' Step 0 : going fast through the first N identical letters.
        For i = 0 To MinLen - 1
            If (String1(i) = String2(i)) Then
                If i = MinLen - 1 Then
                    String1 = String1.Substring(i + 1)
                    String2 = String2.Substring(i + 1)
                    Exit For
                Else
                    Continue For
                End If
            End If
            String1 = String1.Substring(i)
            String2 = String2.Substring(i)
            Exit For
        Next
        Dim String1Len As Integer = String1.Length  ' length of String1
        Dim String2Len As Integer = String2.Length  ' length of String2

        ' Step 1
        If (String1Len = 0 Or String2Len = 0) Then Return Math.Max(String1.Length, String2.Length)

        ' Create the two vectors
        Dim v0 As Integer() = New Integer(String1Len + 1) {}
        Dim v1 As Integer() = New Integer(String1Len + 1) {}
        'Dim vTmp As Integer()

        ' Step 2
        ' Initialize the first vector
        For i = 1 To String1Len
            v0(i) = i
        Next

        ' Step 3
        ' For each column
        For i = 1 To String2Len
            ' Set the 0'th element to the column number
            v1(0) = i

            ' Reset CurrentCost to 0
            ' Step 4
            For j = 1 To String1Len

                ' Step 5
                If (String1(j - 1) = String2(i - 1)) Then
                    cost = 0
                Else
                    cost = 1
                End If

                ' Step 6
                ' Find minimum
                v1(j) = Math.Min(v0(j) + 1, Math.Min(v1(j - 1) + 1, v0(j - 1) + cost))

                ' We get the current Levenshtein distance, which is the minimum value of the column
                If j = 1 Then
                    CurrentCost = v1(1)
                Else
                    CurrentCost = Math.Min(v1(j), CurrentCost)
                End If
            Next
            ' The current Levenshtein distance is greater than the set limit.
            ' No need to continue, as Levenshtein distance can only grow up
            If CurrentCost > LMax Then Return 99

            ' Swap the vectors
            Dim vTmp = v0
            v0 = v1
            v1 = vTmp
        Next

        ' Step 7
        ' The vectors where swaped one last time at the end of the last loop,
        ' that is why the result is now in v0 rather than in v1
        Return (v0(String1Len))
  
    End Function

    Function No_accents(ByVal Chaine As String) As String
        Dim a As String = "ÀÁÂÃÄÅÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝàáâãäåèéêëìíîïðñòóôõöùúûüýÿ"
        Dim b As String = "AAAAAAEEEEIIIINOOOOOUUUUYaaaaaaeeeeiiiionooooouuuuyy"
        For i As Integer = 0 To a.Count - 1
            Chaine = Chaine.Replace(a.Substring(i, 1), b.Substring(i, 1))
        Next i
        Return Chaine.ToUpper
    End Function



Hope this will be usefull.

modified 20-Oct-17 14:37pm.

QuestionWhat do I do wrong ? (VB.net code) Pin
jjcasalo19-Oct-17 10:05
jjcasalo19-Oct-17 10:05 
QuestionHelpful Pin
SusannePorte21-Nov-16 21:24
SusannePorte21-Nov-16 21:24 
AnswerRe: Helpful Pin
Sten Hjelmqvist22-Nov-16 5:37
Sten Hjelmqvist22-Nov-16 5:37 
QuestionCleaned up the code Pin
John Henckel12-Sep-16 8:45
John Henckel12-Sep-16 8:45 
AnswerRe: Cleaned up the code Pin
jjcasalo19-Oct-17 10:06
jjcasalo19-Oct-17 10:06 
GeneralAwesome work! Pin
Member 1079435925-Sep-15 13:25
Member 1079435925-Sep-15 13:25 
GeneralRe: Awesome work! Pin
Sten Hjelmqvist28-Sep-15 20:49
Sten Hjelmqvist28-Sep-15 20:49 
GeneralRe: Awesome work! Pin
Member 107943598-Oct-15 13:33
Member 107943598-Oct-15 13:33 
QuestionThank you very much! How can I get the detail change list? Pin
Member 1191077915-Aug-15 16:51
Member 1191077915-Aug-15 16:51 
BugResults are not accurate Pin
Tom Medhurst18-Jun-15 0:35
Tom Medhurst18-Jun-15 0:35 
GeneralRe: Results are not accurate Pin
Sten Hjelmqvist5-Jul-15 20:53
Sten Hjelmqvist5-Jul-15 20:53 
GeneralRe: Results are not accurate Pin
jfos28-Sep-15 5:39
jfos28-Sep-15 5:39 
QuestionWhat about the list of changes? Pin
UzairSyed10-Jun-15 3:41
UzairSyed10-Jun-15 3:41 
QuestionWhat about the transpose operation? Pin
Member 1164580029-Apr-15 19:45
Member 1164580029-Apr-15 19:45 
AnswerRe: What about the transpose operation? Pin
Sten Hjelmqvist1-May-15 1:14
Sten Hjelmqvist1-May-15 1:14 
AnswerRe: What about the transpose operation? Pin
james.manning14-May-15 10:44
james.manning14-May-15 10:44 
QuestionBug Pin
xidongzhang25-Nov-14 6:57
xidongzhang25-Nov-14 6:57 
AnswerRe: Bug Pin
Sten Hjelmqvist8-Dec-14 22:38
Sten Hjelmqvist8-Dec-14 22:38 
GeneralMy vote of 5 Pin
Amir Mehrabi-Jorshari13-Nov-13 10:06
Amir Mehrabi-Jorshari13-Nov-13 10:06 
QuestionPerfect for my needs Pin
Greg Lindberg5-Nov-13 3:32
Greg Lindberg5-Nov-13 3:32 
AnswerRe: Perfect for my needs Pin
Sten Hjelmqvist5-Nov-13 4:01
Sten Hjelmqvist5-Nov-13 4:01 
GeneralMy vote of 5 Pin
meys_online26-Aug-12 13:38
meys_online26-Aug-12 13:38 
QuestionThanks - very interesting. Pin
SimonDrift20-Jan-12 15:14
SimonDrift20-Jan-12 15:14 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.