Click here to Skip to main content
15,867,594 members
Articles / Programming Languages / C#
Article

A SoundEx implementation in .NET

Rate me:
Please Sign up or sign in to vote.
4.94/5 (15 votes)
21 Jan 2002BSD4 min read 139K   1.5K   69   9
Demonstrates an OO implementation of 4 SoundEx variants in .NET

Introduction

Every computer scientist has heard of SoundEx. It is perhaps the most infamous text processing/searching algorithm around. SoundEx promises a great deal - that of matching words with similar sounding words, but actually delivers, at best, a large number of inaccurate matches. Even though SoundEx was patented, variations have arisen, whether through poor understanding of the algorithm or through attempts to improve its accuracy. Thus, this article presents four popular implementations of SoundEx written in C# and .Net to allow you to perform your own benchmarking on your own data sets.

SoundEx was originally invented to find alternative spellings for surnames. The problem came from mis-spellings on US census returns and therefore the SoundEx algorithms are biased towards English speaking names. There are further variants for some western European languages, but in general the algorithm has never been considered a panacea for all fuzzy text matching.

Most SoundEx algorithms are a variant on the following

  • Retain the first letter of the name
  • Drop all vowels, and h, y in all but the first letter
  • Replace b,f,p,v with the number 1
  • Replace c,g,j,k,q,s,x,z with the number 2
  • Replace d and t with the number 3
  • Replace l with 4, m and n with 5 and r with 6
  • Finally convert to a 4 character code like {letter}{number}{number}{number} by truncation
Some algorithms have grouping and adjacent letter rules in addition.

Implementation Details

Several implementations of the same algorithm implies a strategy pattern, so a common base class named ISoundEx was created. The four implementations are:

  • Miracode - The modified SoundEx from 1910
  • Simplified - The original SoundEx from late 1800s
  • KnuthEd2 - The SoundEx algorithm from The Art of Computer Programming
  • SQLServer - SQL Server's variant on SoundEx
Each implements GenerateSoundEx() and ValidateAlgorithm(). A virtual member of ISoundEx is EncodeChar() which provides the basic character to number mapping specified by SoundEx.
protected virtual string EncodeChar(char c) {
	// C# will re-order this list and produce a look-up list from it
	// C# will do all the work we would otherwise do by building arrays of values
	switch(Char.ToLower(c)) {
		case 'b':
		case 'f':
		case 'p':
		case 'v':
			return "1";
		case 'c':
		case 'g':
		case 'j':
		case 'k':
		case 'q':
		case 's':
		case 'x':
		case 'z':
			return "2";
		case 'd':
		case 't':
			return "3";
		case 'l':
			return "4";
		case 'm':
		case 'n':
			return "5";
		case 'r':
			return "6";
		default:
			return string.Empty;
	}
}

An aside into MSIL

How C# compiles switch statements such as EncodeChar() could be an article in itself. To demonstrate the optimizations it applies, look at the CIL disassembly below.

.method family hidebysig newslot virtual
        instance string  EncodeChar(char c) cil managed
{
  // Code size       176 (0xb0)
  .maxstack  2
  .locals ([0] string CS$00000003$00000000,
           [1] char CS$00000002$00000001)
  IL_0000:  ldarg.1
  IL_0001:  call       char [mscorlib]System.Char::ToLower(char)
  IL_0006:  stloc.1
  IL_0007:  ldloc.1
  IL_0008:  ldc.i4.s   98
  IL_000a:  sub
  IL_000b:  switch     (
                        IL_0076,
                        IL_007e,
                        IL_0086,
                        IL_00a6,
                        IL_0076,
                        IL_007e,
                        IL_00a6,
                        IL_00a6,
                        IL_007e,
                        IL_007e,
                        IL_008e,
                        IL_0096,
                        IL_0096,
                        IL_00a6,
                        IL_0076,
                        IL_007e,
                        IL_009e,
                        IL_007e,
                        IL_0086,
                        IL_00a6,
                        IL_0076,
                        IL_00a6,
                        IL_007e,
                        IL_00a6,
                        IL_007e)
  IL_0074:  br.s       IL_00a6
  IL_0076:  ldstr      "1"
  IL_007b:  stloc.0
  IL_007c:  br.s       IL_00ae
  IL_007e:  ldstr      "2"
  IL_0083:  stloc.0
  IL_0084:  br.s       IL_00ae
  IL_0086:  ldstr      "3"
  IL_008b:  stloc.0
  IL_008c:  br.s       IL_00ae
  IL_008e:  ldstr      "4"
  IL_0093:  stloc.0
  IL_0094:  br.s       IL_00ae
  IL_0096:  ldstr      "5"
  IL_009b:  stloc.0
  IL_009c:  br.s       IL_00ae
  IL_009e:  ldstr      "6"
  IL_00a3:  stloc.0
  IL_00a4:  br.s       IL_00ae
  IL_00a6:  ldsfld     string [mscorlib]System.String::Empty
  IL_00ab:  stloc.0
  IL_00ac:  br.s       IL_00ae
  IL_00ae:  ldloc.0
  IL_00af:  ret
} // end of method ISoundEx::EncodeChar

Basically, the C# compiler has re-ordered the characters in the switch statement so they are in ascending order according to the Unicode code points. It adds extra 'case' entries for missing letters as a sort of padding. The CIL switch statement is a lookup table. The value at the top of the stack when switch is executed is used as a zero-based index into the list of labels to jump to. This gives us a nicely optimized lookup table without us trying too hard!

Back to SoundEx

ISoundEx.ValidateAlgorithm() provides a test of various names against the known SoundEx code for that name.

Image 1

Dictionary Lookup

The code also provides a means to load a dictionary of words or surnames and perform a fast and efficient lookup of similar words. CustomDictionary adds words to internal lookup tables with the following code:

public void AddWord(string s) {
	if(_oAllWords.ContainsKey(s)==false) {
		string zSoundEx=_oSoundExAlgorithm.GenerateSoundEx(s);

		// Add the word to the list of all words and associate
		// a SoundEx value with it
		_oAllWords.Add(s, zSoundEx);

		// Create a new SoundEx group
		if(_oAllSoundEx.ContainsKey(zSoundEx)==false) {
			_oAllSoundEx.Add(zSoundEx,new StringCollection());
		}
		// Now add the string into the collection of all strings
		// with the same SoundEx
		((StringCollection)_oAllSoundEx[zSoundEx]).Add(s);
	}
}

_oAllWords is a StringDictionary containing a unique list of all the words in the dictionary with it's SoundEx code. _oAllSoundEx is a HashTable containing a unique list of all the SoundEx values. Each entry in the HashTable contains a StringCollection of words with that SoundEx. Thus we have a two-way link between the SoundEx code and the words which have that SoundEx.

Two text files are provided. The first is the /usr/dict/words database from all Unix systems. As this is a list of commonly used English language words and not surnames, it shows how inappropriate SoundEx is for these words. The second file is OldPend Surnames 2002-01.txt which is a list of over 15,000 unique surnames from the Old Pendleton Database. This is a database of individuals whose families have roots in the Old Pendleton District of South Carolina and can be obtained from oldpendleton.homestead.com. This provides better SoundEx matches, but still shows how inaccurate the algorithm can be.

Searching for surnames matching 'Watkins'

Conclusion

As can be seen from the above screenshot, the results vary. Names which in no-way can be considered related are returned while, conversely, names which should be related fail to be returned. Knuth's examples were 'Rogers' and 'Rodgers', 'Sinclair' and 'St. Clair' which fail.

Other similar sounding match algorithms have been invented in the 120 years following the introduction of SoundEx. Metaphone is one of these, which should be implementable easily within the code structure presented here.

Finally, if you intend on using SoundEx within a commercial application be aware of its pitfalls.

License

This article, along with any associated source code and files, is licensed under The BSD License


Written By
Web Developer
United Kingdom United Kingdom
Richard Birkby is a software engineer from London, UK, specializing in .Net. Richard has coded for many different sized companies from small venture-capital funded start-ups, to multi-national corporations (ie Microsoft). When he's not programming, he enjoys driving his sports car or eating curry (although never at the same time!).

Richard helps run CurryPages.com and has several other covert ventures in development. Stay tuned!

Comments and Discussions

 
QuestionNow on GitHub Pin
Richard Birkby26-Aug-12 6:59
Richard Birkby26-Aug-12 6:59 
GeneralBroken Link Pin
Palladino16-Mar-09 6:33
Palladino16-Mar-09 6:33 
AnswerRe: Broken Link Pin
Richard Birkby16-Mar-09 22:06
Richard Birkby16-Mar-09 22:06 
GeneralRe: Broken Link Pin
Palladino17-Mar-09 14:09
Palladino17-Mar-09 14:09 
QuestionDouble Metaphone? Pin
ferdna3-Jan-09 15:24
ferdna3-Jan-09 15:24 
GeneralSoundex Implementation Performance Pin
jsdavid21-Nov-07 18:26
jsdavid21-Nov-07 18:26 
GeneralSoundex Pin
0300621219-Jul-05 1:00
0300621219-Jul-05 1:00 
GeneralSome Corrections Pin
Alex Zhyk11-Jun-03 3:01
Alex Zhyk11-Jun-03 3:01 
GeneralRe: Some Corrections Pin
Richard Birkby28-Jul-03 20:58
Richard Birkby28-Jul-03 20:58 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.