12,545,820 members (61,945 online)
alternative version

117.8K views
74 bookmarked
Posted

# Fast, memory efficient Levenshtein algorithm

, 26 Mar 2012 CPOL
 Rate this:
A version of the Levenshtein algorithm that uses 2*Min(StrLen1,StrLen2) bytes instead of StrLen1*StrLen2 bytes.

## Introduction

The Levenshtein distance is the difference between two strings. I use it in a web crawler application to compare the new and old versions of a web page. If it has changed enough, I update it in my database.

## Description

The original algorithm creates a matrix, where the size is `StrLen1*StrLen2`. If both strings are 1000 chars long, the resulting matrix is 1M elements; if the strings are 10,000 chars, the matrix will be 100M elements. If the elements are integers, it will be 4*100M == 400MB. Ouch!

This version of the algorithm uses only 2*`StrLen` elements, so the latter example would give 2*10,000*4 = 80 KB. The result is that, not only does it use less memory but it's also faster because the memory allocation takes less time. When both strings are about 1K in length, the new version is more than twice as fast.

## Example

The original version would create a matrix[6+1,5+1], my version creates two vectors[6+1] (the yellow elements). In both versions, the order of the strings is irrelevant, that is, it could be matrix[5+1,6+1] and two vectors[5+1].

## The new algorithm

### Steps

Step Description
1 Set n to be the length of s. ("GUMBO")
Set m to be the length of t. ("GAMBOL")
If n = 0, return m and exit.
If m = 0, return n and exit.
Construct two vectors, v0[m+1] and v1[m+1], containing 0..m elements.
2 Initialize v0 to 0..m.
3 Examine each character of s (i from 1 to n).
4 Examine each character of t (j from 1 to m).
5 If s[i] equals t[j], the cost is 0.
If s[i] is not equal to t[j], the cost is 1.
6 Set cell v1[j] equal to the minimum of:
a. The cell immediately above plus 1: v1[j-1] + 1.
b. The cell immediately to the left plus 1: v0[j] + 1.
c. The cell diagonally above and to the left plus the cost: v0[j-1] + cost.
7 After the iteration steps (3, 4, 5, 6) are complete, the distance is found in the cell v1[m].

This section shows how the Levenshtein distance is computed when the source string is "GUMBO" and the target string is "GAMBOL":

#### Steps 1 and 2

 v0 v1 G U M B O 0 1 2 3 4 5 G 1 A 2 M 3 B 4 O 5 L 6

#### Steps 3 to 6, when i = 1

 v0 v1 G U M B O 0 1 2 3 4 5 G 1 0 A 2 1 M 3 2 B 4 3 O 5 4 L 6 5

#### Steps 3 to 6, when i = 2

SWAP(v0,v1): If you look in the code you will see that I don't swap the content of the vectors but I refer to them.

Set v1[0] to the column number, e.g. 2.

 v0 v1 G U M B O 0 1 2 3 4 5 G 1 0 1 A 2 1 1 M 3 2 2 B 4 3 3 O 5 4 4 L 6 5 5

#### Steps 3 to 6, when i = 3

SWAP(v0,v1).

Set v1[0] to the column number, e.g. 3.

 v0 v1 G U M B O 0 1 2 3 4 5 G 1 0 1 2 A 2 1 1 2 M 3 2 2 1 B 4 3 3 2 O 5 4 4 3 L 6 5 5 4

#### Steps 3 to 6, when i = 4

SWAP(v0,v1).

Set v1[0] to the column number, e.g. 4.

 v0 v1 G U M B O 0 1 2 3 4 5 G 1 0 1 2 3 A 2 1 1 2 3 M 3 2 2 1 2 B 4 3 3 2 1 O 5 4 4 3 2 L 6 5 5 4 3

#### Steps 3 to 6, when i = 5

SWAP(v0,v1).

Set v1[0] to the column number, e.g. 5.

 v0 v1 G U M B O 0 1 2 3 4 5 G 1 0 1 2 3 4 A 2 1 1 2 3 4 M 3 2 2 1 2 3 B 4 3 3 2 1 2 O 5 4 4 3 2 1 L 6 5 5 4 3 2

#### Step 7

The distance is in the lower right hand corner of the matrix, v1[m] == 2. This corresponds to our intuitive realization that "GUMBO" can be transformed into "GAMBOL" by substituting "A" for "U" and adding "L" (one substitution and one insertion = two changes).

## Improvements

If you are sure that your strings will never be longer than 2^16 chars, you could use `ushort` instead of `int`, if the strings are less than 2^8 chars, you could use `byte`. I guess, the algorithm would be even faster if we use unmanaged code, but I have not tried it.

## History

• 2006-03-22
• Version 1.0.
• 2006-03-24
• Detailed description of the algorithm. The code has been rewritten so that it now follows the description. :-)

## Share

 Software Developer (Senior) SQL Service AB Sweden
I'm a computer consultant at 'SQL Service AB' , in Stockholm,Sweden and I have been designing software since the early 80's.

## You may also be interested in...

 Pro Pro

 First Prev Next
 Cleaned up the code John Henckel12-Sep-16 8:45 John Henckel 12-Sep-16 8:45
 Awesome work! Member 1079435925-Sep-15 13:25 Member 10794359 25-Sep-15 13:25
 Re: Awesome work! Sten Hjelmqvist28-Sep-15 20:49 Sten Hjelmqvist 28-Sep-15 20:49
 Re: Awesome work! Member 107943598-Oct-15 13:33 Member 10794359 8-Oct-15 13:33
 Thank you very much! How can I get the detail change list? Member 1191077915-Aug-15 16:51 Member 11910779 15-Aug-15 16:51
 Results are not accurate Tom Medhurst18-Jun-15 0:35 Tom Medhurst 18-Jun-15 0:35
 Re: Results are not accurate Sten Hjelmqvist5-Jul-15 20:53 Sten Hjelmqvist 5-Jul-15 20:53
 Re: Results are not accurate jfos28-Sep-15 5:39 jfos 28-Sep-15 5:39
 What about the list of changes? UzairSyed10-Jun-15 3:41 UzairSyed 10-Jun-15 3:41
 What about the transpose operation? Member 1164580029-Apr-15 19:45 Member 11645800 29-Apr-15 19:45
 Re: What about the transpose operation? Sten Hjelmqvist1-May-15 1:14 Sten Hjelmqvist 1-May-15 1:14
 Re: What about the transpose operation? james.manning14-May-15 10:44 james.manning 14-May-15 10:44
 Bug xidongzhang25-Nov-14 6:57 xidongzhang 25-Nov-14 6:57
 Re: Bug Sten Hjelmqvist8-Dec-14 22:38 Sten Hjelmqvist 8-Dec-14 22:38
 My vote of 5 Amir Mehrabi-Jorshari13-Nov-13 10:06 Amir Mehrabi-Jorshari 13-Nov-13 10:06
 Perfect for my needs Greg Lindberg5-Nov-13 3:32 Greg Lindberg 5-Nov-13 3:32
 Re: Perfect for my needs Sten Hjelmqvist5-Nov-13 4:01 Sten Hjelmqvist 5-Nov-13 4:01
 My vote of 5 meys_online26-Aug-12 13:38 meys_online 26-Aug-12 13:38
 Thanks - very interesting. Simon Hill20-Jan-12 15:14 Simon Hill 20-Jan-12 15:14
 Re: Thanks - very interesting. Sten Hjelmqvist25-Mar-12 23:40 Sten Hjelmqvist 25-Mar-12 23:40
 Did you port this code from somewhere? :) Mikael Svenson8-Sep-11 23:05 Mikael Svenson 8-Sep-11 23:05
 Re: Did you port this code from somewhere? :) Sten Hjelmqvist9-Sep-11 0:39 Sten Hjelmqvist 9-Sep-11 0:39
 Great stuff ed_ward_graham28-Feb-11 5:41 ed_ward_graham 28-Feb-11 5:41
 Operation Performed on strings? Piyush Patell4-Oct-10 17:11 Piyush Patell 4-Oct-10 17:11
 Very useful. OrganicHuman24-Aug-09 10:22 OrganicHuman 24-Aug-09 10:22
 Commercial quality? Blake Johnston21-Nov-07 18:37 Blake Johnston 21-Nov-07 18:37
 Re: Commercial quality? Sten Hjelmqvist21-Nov-07 22:23 Sten Hjelmqvist 21-Nov-07 22:23
 Improvements Available [modified] cetinsert6-Nov-07 4:49 cetinsert 6-Nov-07 4:49
 Re: Improvements Available Marc Brooks28-Jul-08 12:49 Marc Brooks 28-Jul-08 12:49
 Re: Improvements Available cetinsert7-Aug-08 0:15 cetinsert 7-Aug-08 0:15
 Re: Improvements Available [modified] hbrowser4-Jul-12 4:33 hbrowser 4-Jul-12 4:33
 attractive title v2.022-Mar-06 5:09 v2.0 22-Mar-06 5:09
 Re: attractive title aprenot22-Mar-06 7:27 aprenot 22-Mar-06 7:27
 Re: attractive title Thanh Dao22-Mar-06 22:16 Thanh Dao 22-Mar-06 22:16
 Re: attractive title Sten Hjelmqvist26-Mar-06 23:18 Sten Hjelmqvist 26-Mar-06 23:18
 Last Visit: 31-Dec-99 18:00     Last Update: 20-Oct-16 6:20 Refresh 1