Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles
(untagged)

A SoundEx implementation in .NET

0.00/5 (No votes)
21 Jan 2002 2  
Demonstrates an OO implementation of 4 SoundEx variants in .NET

Introduction

Every computer scientist has heard of SoundEx. It is perhaps the most infamous text processing/searching algorithm around. SoundEx promises a great deal - that of matching words with similar sounding words, but actually delivers, at best, a large number of inaccurate matches. Even though SoundEx was patented, variations have arisen, whether through poor understanding of the algorithm or through attempts to improve its accuracy. Thus, this article presents four popular implementations of SoundEx written in C# and .Net to allow you to perform your own benchmarking on your own data sets.

SoundEx was originally invented to find alternative spellings for surnames. The problem came from mis-spellings on US census returns and therefore the SoundEx algorithms are biased towards English speaking names. There are further variants for some western European languages, but in general the algorithm has never been considered a panacea for all fuzzy text matching.

Most SoundEx algorithms are a variant on the following

  • Retain the first letter of the name
  • Drop all vowels, and h, y in all but the first letter
  • Replace b,f,p,v with the number 1
  • Replace c,g,j,k,q,s,x,z with the number 2
  • Replace d and t with the number 3
  • Replace l with 4, m and n with 5 and r with 6
  • Finally convert to a 4 character code like {letter}{number}{number}{number} by truncation
Some algorithms have grouping and adjacent letter rules in addition.

Implementation Details

Several implementations of the same algorithm implies a strategy pattern, so a common base class named ISoundEx was created. The four implementations are:

  • Miracode - The modified SoundEx from 1910
  • Simplified - The original SoundEx from late 1800s
  • KnuthEd2 - The SoundEx algorithm from The Art of Computer Programming
  • SQLServer - SQL Server's variant on SoundEx
Each implements GenerateSoundEx() and ValidateAlgorithm(). A virtual member of ISoundEx is EncodeChar() which provides the basic character to number mapping specified by SoundEx.
protected virtual string EncodeChar(char c) {
	// C# will re-order this list and produce a look-up list from it

	// C# will do all the work we would otherwise do by building arrays of values

	switch(Char.ToLower(c)) {
		case 'b':
		case 'f':
		case 'p':
		case 'v':
			return "1";
		case 'c':
		case 'g':
		case 'j':
		case 'k':
		case 'q':
		case 's':
		case 'x':
		case 'z':
			return "2";
		case 'd':
		case 't':
			return "3";
		case 'l':
			return "4";
		case 'm':
		case 'n':
			return "5";
		case 'r':
			return "6";
		default:
			return string.Empty;
	}
}

An aside into MSIL

How C# compiles switch statements such as EncodeChar() could be an article in itself. To demonstrate the optimizations it applies, look at the CIL disassembly below.

.method family hidebysig newslot virtual
        instance string  EncodeChar(char c) cil managed
{
  // Code size       176 (0xb0)

  .maxstack  2
  .locals ([0] string CS$00000003$00000000,
           [1] char CS$00000002$00000001)
  IL_0000:  ldarg.1
  IL_0001:  call       char [mscorlib]System.Char::ToLower(char)
  IL_0006:  stloc.1
  IL_0007:  ldloc.1
  IL_0008:  ldc.i4.s   98
  IL_000a:  sub
  IL_000b:  switch     (
                        IL_0076,
                        IL_007e,
                        IL_0086,
                        IL_00a6,
                        IL_0076,
                        IL_007e,
                        IL_00a6,
                        IL_00a6,
                        IL_007e,
                        IL_007e,
                        IL_008e,
                        IL_0096,
                        IL_0096,
                        IL_00a6,
                        IL_0076,
                        IL_007e,
                        IL_009e,
                        IL_007e,
                        IL_0086,
                        IL_00a6,
                        IL_0076,
                        IL_00a6,
                        IL_007e,
                        IL_00a6,
                        IL_007e)
  IL_0074:  br.s       IL_00a6
  IL_0076:  ldstr      "1"
  IL_007b:  stloc.0
  IL_007c:  br.s       IL_00ae
  IL_007e:  ldstr      "2"
  IL_0083:  stloc.0
  IL_0084:  br.s       IL_00ae
  IL_0086:  ldstr      "3"
  IL_008b:  stloc.0
  IL_008c:  br.s       IL_00ae
  IL_008e:  ldstr      "4"
  IL_0093:  stloc.0
  IL_0094:  br.s       IL_00ae
  IL_0096:  ldstr      "5"
  IL_009b:  stloc.0
  IL_009c:  br.s       IL_00ae
  IL_009e:  ldstr      "6"
  IL_00a3:  stloc.0
  IL_00a4:  br.s       IL_00ae
  IL_00a6:  ldsfld     string [mscorlib]System.String::Empty
  IL_00ab:  stloc.0
  IL_00ac:  br.s       IL_00ae
  IL_00ae:  ldloc.0
  IL_00af:  ret
} // end of method ISoundEx::EncodeChar


Basically, the C# compiler has re-ordered the characters in the switch statement so they are in ascending order according to the Unicode code points. It adds extra 'case' entries for missing letters as a sort of padding. The CIL switch statement is a lookup table. The value at the top of the stack when switch is executed is used as a zero-based index into the list of labels to jump to. This gives us a nicely optimized lookup table without us trying too hard!

Back to SoundEx

ISoundEx.ValidateAlgorithm() provides a test of various names against the known SoundEx code for that name.

Dictionary Lookup

The code also provides a means to load a dictionary of words or surnames and perform a fast and efficient lookup of similar words. CustomDictionary adds words to internal lookup tables with the following code:

public void AddWord(string s) {
	if(_oAllWords.ContainsKey(s)==false) {
		string zSoundEx=_oSoundExAlgorithm.GenerateSoundEx(s);

		// Add the word to the list of all words and associate

		// a SoundEx value with it

		_oAllWords.Add(s, zSoundEx);

		// Create a new SoundEx group

		if(_oAllSoundEx.ContainsKey(zSoundEx)==false) {
			_oAllSoundEx.Add(zSoundEx,new StringCollection());
		}
		// Now add the string into the collection of all strings

		// with the same SoundEx

		((StringCollection)_oAllSoundEx[zSoundEx]).Add(s);
	}
}

_oAllWords is a StringDictionary containing a unique list of all the words in the dictionary with it's SoundEx code. _oAllSoundEx is a HashTable containing a unique list of all the SoundEx values. Each entry in the HashTable contains a StringCollection of words with that SoundEx. Thus we have a two-way link between the SoundEx code and the words which have that SoundEx.

Two text files are provided. The first is the /usr/dict/words database from all Unix systems. As this is a list of commonly used English language words and not surnames, it shows how inappropriate SoundEx is for these words. The second file is OldPend Surnames 2002-01.txt which is a list of over 15,000 unique surnames from the Old Pendleton Database. This is a database of individuals whose families have roots in the Old Pendleton District of South Carolina and can be obtained from oldpendleton.homestead.com. This provides better SoundEx matches, but still shows how inaccurate the algorithm can be.

Searching for surnames matching 'Watkins'

Conclusion

As can be seen from the above screenshot, the results vary. Names which in no-way can be considered related are returned while, conversely, names which should be related fail to be returned. Knuth's examples were 'Rogers' and 'Rodgers', 'Sinclair' and 'St. Clair' which fail.

Other similar sounding match algorithms have been invented in the 120 years following the introduction of SoundEx. Metaphone is one of these, which should be implementable easily within the code structure presented here.

Finally, if you intend on using SoundEx within a commercial application be aware of its pitfalls.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here