Soundex Implementation in C# and VB.NET






Mar 4, 2006
A simple Soundex implementation in C# and VB.NET that recognizes phonetically similar words using the basic Soundex algorithm.
Introduction
While adding an English dictionary to a company website, I ran into a problem: I misspelled a word while testing the application. Since this is likely to be a common user error, I decided to read up on basic phonetic matching. SQL Server implements a SOUNDEX function, but Microsoft Access (the format in which the dictionary is stored) does not.
So the task seemed simple: find an algorithm on the internet that could be used to populate a Soundex field in the database for use in phonetic comparisons.
Unfortunately, most of the sample code I found on the internet was terribly outdated. Written for VBScript or for Visual Basic 6 and earlier, it made heavy use of string functions such as MID and LEFT. These functions return a new string on every call, which makes them inefficient compared with accessing characters directly through a character array.
Since I was going to be processing well over 100,000 articles, I decided to write my own Soundex functions based on the standard algorithm, using a tighter, more efficient loop. The resulting code is included below.
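To make the difference concrete, here is a minimal C# sketch that performs the same task twice: once by extracting one-character substrings (the .NET equivalent of calling MID in a loop) and once by indexing into a character array. The class and method names are hypothetical and exist only for this illustration, and no timings are claimed; the point is simply that the first version creates a new string for every character while the second does not.

```csharp
using System;

public static class CharAccessDemo
{
    // Hypothetical helper: counts vowels by pulling out one-character
    // substrings, the way Mid(word, i, 1) is typically used in older code.
    public static int CountVowelsBySubstring(string word)
    {
        int count = 0;
        for (int i = 0; i < word.Length; i++)
        {
            // Substring allocates a new one-character string on every pass
            string letter = word.Substring(i, 1);
            if ("AEIOU".Contains(letter)) count++;
        }
        return count;
    }

    // Hypothetical helper: counts vowels by indexing into a character
    // array, so no string is created per character.
    public static int CountVowelsByIndex(string word)
    {
        char[] chars = word.ToCharArray();
        int count = 0;
        for (int i = 0; i < chars.Length; i++)
        {
            if ("AEIOU".IndexOf(chars[i]) >= 0) count++;
        }
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountVowelsBySubstring("SOUNDEX")); // prints 3
        Console.WriteLine(CountVowelsByIndex("SOUNDEX"));     // prints 3
    }
}
```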
VISUAL BASIC CODE SAMPLE
Public Class Soundex

    Public Shared Function Compute(ByVal Word As String) As String
        Return Compute(Word, 4)
    End Function

    Public Shared Function Compute(ByVal Word As String, ByVal Length As Integer) As String
        ' Value to return
        Dim Value As String = ""
        ' Size of the word to process
        Dim Size As Integer = Word.Length

        ' Make sure the word is at least two characters in length
        If (Size > 1) Then
            ' Convert the word to all uppercase
            Word = Word.ToUpper()
            ' Convert the word to a character array for faster processing
            Dim Chars() As Char = Word.ToCharArray()
            ' Buffer to build up with character codes
            Dim Buffer As New System.Text.StringBuilder
            ' The current and previous character codes
            Dim PrevCode As Integer = 0
            Dim CurrCode As Integer = 0

            ' Append the first character to the buffer
            Buffer.Append(Chars(0))

            ' Prepare variables for loop
            Dim i As Integer
            Dim LoopLimit As Integer = Size - 1

            ' Loop through all the characters and convert them to the proper character code
            For i = 1 To LoopLimit
                ' Char literals ("A"c) keep this valid with Option Strict On
                Select Case Chars(i)
                    Case "A"c, "E"c, "I"c, "O"c, "U"c, "H"c, "W"c, "Y"c
                        CurrCode = 0
                    Case "B"c, "F"c, "P"c, "V"c
                        CurrCode = 1
                    Case "C"c, "G"c, "J"c, "K"c, "Q"c, "S"c, "X"c, "Z"c
                        CurrCode = 2
                    Case "D"c, "T"c
                        CurrCode = 3
                    Case "L"c
                        CurrCode = 4
                    Case "M"c, "N"c
                        CurrCode = 5
                    Case "R"c
                        CurrCode = 6
                End Select

                ' Check to see if the current code is the same as the last one
                If (CurrCode <> PrevCode) Then
                    ' A code of 0 (a vowel, H, W, or Y) is never appended
                    If (CurrCode <> 0) Then
                        Buffer.Append(CurrCode)
                    End If
                End If

                ' Set the new previous character code
                ' (this assignment was missing, so PrevCode never changed from 0)
                PrevCode = CurrCode

                ' If the buffer size meets the length limit, then exit the loop
                If (Buffer.Length = Length) Then
                    Exit For
                End If
            Next

            ' Pad the buffer if required
            Size = Buffer.Length
            If (Size < Length) Then
                Buffer.Append("0"c, (Length - Size))
            End If

            ' Set the return value
            Value = Buffer.ToString()
        End If

        ' Return the computed soundex
        Return Value
    End Function

End Class
C SHARP CODE SAMPLE
using System.Text;

public static class Soundex
{
    public static string Compute(string word)
    {
        return Compute(word, 4);
    }

    public static string Compute(string word, int length)
    {
        // Value to return
        string value = "";
        // Size of the word to process
        int size = word.Length;

        // Make sure the word is at least two characters in length
        if (size > 1)
        {
            // Convert the word to all uppercase
            word = word.ToUpper();
            // Convert the word to a character array for faster processing
            char[] chars = word.ToCharArray();
            // Buffer to build up with character codes
            StringBuilder buffer = new StringBuilder();
            // The current and previous character codes
            int prevCode = 0;
            int currCode = 0;

            // Append the first character to the buffer
            buffer.Append(chars[0]);

            // Loop through all the characters and convert them to the proper character code
            for (int i = 1; i < size; i++)
            {
                switch (chars[i])
                {
                    case 'A': case 'E': case 'I': case 'O':
                    case 'U': case 'H': case 'W': case 'Y':
                        currCode = 0;
                        break;
                    case 'B': case 'F': case 'P': case 'V':
                        currCode = 1;
                        break;
                    case 'C': case 'G': case 'J': case 'K':
                    case 'Q': case 'S': case 'X': case 'Z':
                        currCode = 2;
                        break;
                    case 'D': case 'T':
                        currCode = 3;
                        break;
                    case 'L':
                        currCode = 4;
                        break;
                    case 'M': case 'N':
                        currCode = 5;
                        break;
                    case 'R':
                        currCode = 6;
                        break;
                }

                // Check to see if the current code is the same as the last one
                if (currCode != prevCode)
                {
                    // A code of 0 (a vowel, H, W, or Y) is never appended
                    if (currCode != 0)
                        buffer.Append(currCode);
                }

                // Set the new previous character code
                prevCode = currCode;

                // If the buffer size meets the length limit, then exit the loop
                if (buffer.Length == length)
                    break;
            }

            // Pad the buffer, if required
            size = buffer.Length;
            if (size < length)
                buffer.Append('0', (length - size));

            // Set the value to return
            value = buffer.ToString();
        }

        // Return the value
        return value;
    }
}
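As a usage illustration, the following self-contained sketch shows how a misspelled word and its correct spelling end up with the same code, which is what makes a stored Soundex field useful for phonetic lookups. To keep the example runnable on its own, it uses a compact table-driven variant of the same letter-to-code mapping rather than the switch statement above; the class name SoundexSketch and the Codes lookup string are assumptions made for this example and are not part of the original code.

```csharp
using System;
using System.Text;

public static class SoundexSketch
{
    // Hypothetical lookup table: the code for each letter A..Z,
    // using the same mapping as the switch statement in the article.
    //                            ABCDEFGHIJKLMNOPQRSTUVWXYZ
    private const string Codes = "01230120022455012623010202";

    public static string Compute(string word, int length = 4)
    {
        // As in the article, words shorter than two characters yield ""
        if (string.IsNullOrEmpty(word) || word.Length < 2) return "";

        string upper = word.ToUpperInvariant();
        var buffer = new StringBuilder();
        buffer.Append(upper[0]);

        char prevCode = '0';
        for (int i = 1; i < upper.Length && buffer.Length < length; i++)
        {
            char c = upper[i];
            // Characters outside A..Z reuse the previous code and are skipped
            char currCode = (c >= 'A' && c <= 'Z') ? Codes[c - 'A'] : prevCode;

            // Append only when the code changes and is not '0'
            if (currCode != prevCode && currCode != '0') buffer.Append(currCode);
            prevCode = currCode;
        }

        // Pad with zeros up to the requested length
        return buffer.ToString().PadRight(length, '0');
    }

    public static void Main()
    {
        Console.WriteLine(Compute("misspelling")); // prints M214
        Console.WriteLine(Compute("mispelling"));  // prints M214
        Console.WriteLine(Compute("Robert"));      // prints R163
        Console.WriteLine(Compute("Rupert"));      // prints R163
    }
}
```

In a dictionary lookup, the same idea applies to stored data: compute the code once per entry, store it in its own field, and compare the code of the user's input against that field instead of comparing the raw spelling.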