Click here to Skip to main content
Click here to Skip to main content

Implement Phonetic ("Sounds-like") Name Searches with Double Metaphone Part V: .NET Implementation

By , 19 Mar 2007
 

Abstract

Simple information searches -- name lookups, word searches, etc. -- are often implemented in terms of an exact match criterion. However, given both the diversity of homophonic (pronounced the same) words and names, as well as the propensity for humans to misspell surnames, this simplistic criterion often yields less than desirable results, in the form of reduced result sets, missing records that differ by a misplaced letter or different national spelling.

This article series discusses Lawrence Phillips' Double Metaphone phonetic matching algorithm, and provides several useful implementations, which can be employed in a variety of solutions to create more useful, effective searches of proper names in databases and other collections.

Introduction

This article series discusses the practical use of the Double Metaphone algorithm to phonetically search name data, using the author's implementations written for C++, COM (Visual Basic, etc.), scripting clients (VBScript, JScript, ASP), SQL, and .NET (C#, VB.NET, and any other .NET language). For a discussion of the Double Metaphone algorithm itself, and Phillips' original code, see Phillips' article in the June 2000 CUJ, available here.

Part I introduces Double Metaphone and describes the author's C++ implementation and its use. Part II discusses the use of the author's COM implementation from within Visual Basic. Part III demonstrates use of the COM implementation from ASP and with VBScript. Part IV shows how to perform phonetic matching within SQL Server using the author's extended stored procedure. Part V demonstrates the author's .NET implementation. Finally, Part VI closes with a survey of phonetic matching alternatives, and pointers to other resources.

Background

Part I of this article series discussed the Double Metaphone algorithm, its origin and use, and the author's C++ implementation. While this section summarizes the key information from that article, readers are encouraged to review the entire article, even if the reader has no C++ experience.

The Double Metaphone algorithm, developed by Lawrence Phillips and published in the June 2000 issue of C/C++ Users Journal, is part of a class of algorithms known as "phonetic matching" or "phonetic encoding" algorithms. These algorithms attempt to detect phonetic ("sounds-like") relationships between words. For example, a phonetic matching algorithm should detect a strong phonetic relationship between "Nelson" and "Nilsen", and no phonetic relationship between "Adam" and "Nelson."

Double Metaphone works by producing one or possibly two phonetic keys, given a word. These keys represent the "sound" of the word. A typical Double Metaphone key is four characters long, as this tends to produce the ideal balance between specificity and generality of results.

The first, or primary, Double Metaphone key represents the American pronunciation of the source word. All words have a primary Double Metaphone key.

The second, or alternate, Double Metaphone key represents an alternate, national pronunciation. For example, many Polish surnames are "Americanized", yielding two possible pronunciations, the original Polish, and the American. For this reason, Double Metaphone computes alternate keys for some words. Note that the vast majority (very roughly, 90%) of words will not yield an alternate key, but when an alternate is computed, it can be pivotal in matching the word.

To compare two words for phonetic similarity, one computes their respective Double Metaphone keys, and then compares each combination:

  • Word 1 Primary - Word 2 Primary
  • Word 1 Primary - Word 2 Alternate
  • Word 1 Alternate - Word 2 Primary
  • Word 1 Alternate - Word 2 Alternate

Obviously if the keys in any of these comparisons are not produced for the given words, the comparisons involving those keys are not performed.

Depending upon which of the above comparisons matches, a match strength is computed. If the first comparison matches, the two words have a strong phonetic similarity. If the second or third comparison matches, the two words have a medium phonetic similarity. If the fourth comparison matches, the two words have a minimal phonetic similarity. Depending upon the particular application requirements, one or more match levels may be excluded from match results.

.NET implementation

The .NET implementation of Double Metaphone is very similar in design and use to the C++ implementation presented in Part I. To use the .NET implementation, simply add the Metaphone.NET.dll assembly to your project's references in Visual Studio. NET, import the nullpointer.Metaphone namespace into the source files, and instantiate the DoubleMetaphone or ShortDoubleMetaphone classes, for string and unsigned short Metaphone keys, respectively.

For example, to compute the Metaphone keys for the name "Nelson", code similar to that listed below may be used (C# code listed; the .NET implementation is callable from VB.NET, J#, and all other .NET languages):

using nullpointer.Metaphone;

DoubleMetaphone mphone = new DoubleMetaphone("Nelson");
System.Console.WriteLine(String.Format("{0} {1}",
                             mphone.PrimaryKey,
                            mphone.AlternateKey));

Note that the Metaphone keys are obtained via the PrimaryKey and AlternateKey properties.

As with the C++ implementation, an existing instance of a DoubleMetaphone or ShortDoubleMetaphone class can be used to compute the Metaphone keys for a new word, by calling the computeKeys method:

using nullpointer.Metaphone;

DoubleMetaphone mphone = new DoubleMetaphone();
mphone.computeKeys("Nelson");
System.Console.WriteLine(String.Format("{0} {1}",
                             mphone.PrimaryKey,
                             mphone.AlternateKey));

As with all of the implementations presented in this article series, a sample application—CS Word Lookup--written in C# is presented to demonstrate the use of the .NET implementation. CS Word Lookup uses a Hashtable collection class to map Metaphone phonetic keys to an ArrayList class, containing the words which produce the said Metaphone keys.

Performance notes

While the .NET CLR performs reasonably well, it must be stated that the C++ implementation of Double Metaphone will likely perform significantly faster than the .NET version, due primarily to the fact that the C++ version judiciously avoids memory allocation and buffer copies, while the .NET implementation is unable to avoid such constructs. The ambitious reader is encouraged to optimize the .NET implementation, perhaps through the use of the unsafe keyword, to perform direct memory access, at the expense of CLR compliance.

Conclusion

This brief article introduced the author's .NET implementation of Double Metaphone, including code snippets and a brief discussion of performance issues. Continue to Part VI for a review of alternative phonetic matching techniques, and a list of phonetic matching resources, including links to other Double Metaphone implementations.

History

  • 7-22-03 Initial publication
  • 7-31-03 Added hyperlinks between articles in the series

Article Series

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

About the Author

Adam Nelson
Web Developer
United States United States
Member
My name is Adam Nelson. I've been a professional programmer since 1996, working on everything from database development, early first-generation web applications, modern n-tier distributed apps, high-performance wireless security tools, to my last job as a Senior Consultant at BearingPoint posted in Baghdad, Iraq training Iraqi developers in the wonders of C# and ASP.NET. I am currently an Engineering Director at Dell.
 
I have a wide range of skills and interests, including cryptography, image processing, computational linguistics, military history, 3D graphics, database optimization, and mathematics, to name a few.

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
QuestionXPMetaphone in 64 bit?memberDunc_NZ10 Apr '11 - 10:32 
Hi,
 
I don't suppose you have the SQL XPMetaphone compiled in a 64bit version? I've tried to compile it myself using Visual Studio 2010 but it just brings up a bunch of errors.
 
Thannks in advance,
 
Dunc
GeneralStrongly signed dllmemberChris Copac2 Mar '11 - 10:59 
Is it acceptable to compile the source into a strongly signed dll for inclusion with another commercially available strongly signed dll? If so, are there any pitfalls we should be aware of to ensure we comply with licensing concerns?
GeneralRe: Strongly signed dllmemberAdam Nelson15 Mar '11 - 14:15 
Yes, commercial use is acceptable. Lawrence Phillips' original code was as far as I can tell public domain, and mine is too.
GeneralNice implementation of a killer algorithmmemberDimitri Troncquo22 Feb '11 - 22:55 
Very nice. You got to love this algorithm. Especially living in a country where Germanic languages, French and English are commonplace.
 
I used the lib as is for now, but i might look into pounding out an unsafe implementation when i have some spare time. It'll probably shave off a few millis for large lookups.
 
You wouldn't have a metric lying around to compare performance on the C# and the C++ implementaions would you?
GeneralRe: Nice implementation of a killer algorithmmemberAdam Nelson23 Feb '11 - 4:02 
Thanks, I'm glad you found it useful.
 
I don't have the metrics I did back when I wrote the article, and given the improvements in .NET performance it's probably different now.
Generalhelp implementing this in asp.netmembervelascojames23 Aug '10 - 22:59 
where can i found the .dll? please help me.. i hope this thread is still alive
Generaldll requiredmemberPrathapavidyadaran8 Sep '09 - 8:43 
Hi Nelson,
 
Please send the me the Dll for both com and .net... I dont have vc installed in my pc.
 
Regards.
Prathap
GeneralSuper implementation thanksmemberPaul Sinnema12 Mar '09 - 6:25 
Hi Adam,
 
Found you contribution today and implemented it in our Project. Works very nice. We've tried German, French and Italian names and all of them were found. Thanks a lot.
 
Regards,
Paul Sinnema.
GeneralRe: Super implementation thanksmemberAdam Nelson12 Mar '09 - 10:28 
Glad you were able to make use of it!
QuestionLicensing?memberCasey Gum10 Nov '08 - 10:40 
Hi There,
 
I was just wondering what the licensing/permissions were to use your code in a commercial product.
 
Thanks,
 
Casey
AnswerRe: Licensing?memberAdam Nelson10 Nov '08 - 11:11 
You may consider the code to be licensed under the BSD license, which permits commercial use provided you do not represent the code as being your own, usually with a credit in the manual or about box.
QuestionMetaphone.NET.dllmembersanjutvj30 May '07 - 0:52 
I did n't get the dll of Metaphone.NET. how to get that dll file?
QuestionError on x64 bit compilememberterry091716 Apr '07 - 5:30 
Getting the following error when i try to compile the code in x64 bit (VS.Net 2005) so i can use it on SQL x64 bit Server.
 
#############################################################################################
 
Error 7 fatal error LNK1181: cannot open input file 'opends60.lib' XPMetaphone
 
#############################################################################################
 
Please let me know if i need to anything to get rid of this error and compile without errors
 
thanks in advance
- T
 
- T Smile | :)

AnswerRe: Error on x64 bit compilemembercp197017 Apr '07 - 4:31 
I am also getting the same error when trying to compile on 64bit to use in SQL server.
 
Can you help me how I can compile without error?
 
Thank you,
CP
Question.NET 2.0 & Inconsistent resultsmemberMike Renno5 Dec '06 - 3:49 
Adam, Thanks for your work on these implementations - we’ve been using the extended stored procedure successfully for the past 2½ years. We’ve recently upgraded to SQL Server 2005 and will soon be changing to 64-bit hardware, which requires us to make some changes since 32-bit dlls aren’t supported on the new hardware. We would like to change this over to a CLR implementation, since Microsoft has deprecated extended stored procedures for SQL Server 2005. I’d like to request your help with a couple of issues:
 
1. Converting the DoubleMetaphone and ShortDoubleMetaphone classes to .NET 2.0, with interfaces suitable for use with the new CREATE ASSEMBLY statement (requires a static method), and accessible via a SQL scalar user-defined function (requires a single output parameter that matches a native SQL data type). We can handle this conversion ourselves, but I was hoping you might take an interest since the days of xp_metaphone.dll appear to be numbered.
 
2. The .NET implementation you published doesn’t return the same primary and alternate keys as the COM implementation for some names. (We found 1389 differences out of 159,289 names we have indexed.) I took a quick step through in debug and couldn’t see where the problem is, but based on spot checks it appears that the .NET implementation is the one with problems. Here are some examples; I’ll be happy to send you the entire list of differences if you’d like.
 
AGNEW, ALLOIS: No alternate key from .NET
ALLECIA, ARCHILLA: Different alternate keys
AUTHIER: This case might represent a gap in the algorithm, since neither the COM nor the .NET implementations return the keys I expected. The anglicized pronunciation is au-thir´ (key 0R), while the French pronunciation is o-tya´ (key T).
BAUMB, BAUX: Different primary keys
BEAUBIER, ROZIER: Alternate, primary keys out of sync
 
Thanks again,
Mike
AnswerRe: .NET 2.0 & Inconsistent resultsmemberAdam Nelson5 Dec '06 - 5:49 
Mike:
Thanks for your comments, and I'm glad you've found XP Metaphone useful.
 
Re-packaging the metaphone impl into a static class with a scalar function shouldn't be too hard. It shouldn't take but a few minutes.
 
This is the first I've heard of output disparities between the COM and .NET impl. Thanks for brining it to my attention, and with test data no less. I'll investigate further to see about fixing the problem. I might not get to it until the weekend.
 
Thanks again for your comments.
 

Adam

GeneralFound a couple of bugsmemberMike Renno9 Jan '07 - 10:16 
DoubleMetaphone.cs, line 139: Need 5 spaces of padding to handle the "CAESAR" case at line 219 for an input of "C". This same bug exists in the C++ version, but only raises an exception in C#.
 
DoubleMetaphone.cs, line 144: Need to set m_length = word.Length here, or else move the assignment statement ahead of the padding concatenation in line 139.
 
With these changes, the C# version returns the same values as xp_metaphone for my 158K test inputs, with the exception of "WJ" - I didn't take the time to track that one down.
 
I'd still be interested in your thoughts on a CLR implementation for use with SQL Server 2005. Thanks again for your work on this!
GeneralRe: Found a couple of bugsmemberAdam Nelson1 Feb '07 - 5:29 
Mike:
Thanks for looking into this, and my apologies for the delayed response.
 
I've put together a test rig that runs a list of names through Philips' original Double Metaphone impl, my C++ impl, and my C# impl. I didn't see the exception you reported for the 'CAESAR' case, but I do see several names producing different results under C# vs C++. I'm looking into this now.
 
Regarding SQL Server, it seems a static class with the [SqlFunction] attribute wrapping the existing DoubleMetaphone class would do the trick.

 

Adam

GeneralRe: Found a couple of bugsmemberMike Renno6 Feb '07 - 3:47 
Adam,
Thanks for your response. We ran into a few glitches with the CLR implementation for SQL Server:
 
1. SQL Server apparently doesn't allow namespaces in CLR classes, so we had to remove this from your original source.
2. Only a single dll file can be registered via the CREATE ASSEMBLY statement, so we had to combine the source files in order to use ShortDoubleMetaphone.
3. A SQL scalar UDF can only return a single parameter, so there wasn't a clean mapping to replace xp_metaphone with its separate output parameters for the primary and alternate metaphone keys. We opted to combine the values into a single BINARY(4) output parameter and then parse this back into two SMALLINTs after the UDF call, but this seems like a kludge. This is also where we ran into the glitch for the "WF" input parameter - we got x0000 back instead of the expected xFFFF for the alternate key.
 
What we have is working, but I would still be interested in your thoughts regarding a well-thought-out approach for SQL 2005.
 
Sorry if my 16:16 9 Jan '07 posting was unclear - the exception occurred at line 139 for an input of "C" rather than "CAESAR". The change to use 5 spaces of padding has corrected this.
 
Thanks again for your good work on this.
Mike Renno

GeneralRe: Found a couple of bugsmemberAdam Nelson1 Feb '07 - 5:43 
Mike:
I've implemented the fixes you proposed, and my test rig now confirms the C# impl produces identical results for all 21k test names, including 'CAESAR' and 'WJ'. I'm going to update the article with the new code, but that's done via email and may take some time; in the meanwhile, I could send you the code if you like.

 

Adam

QuestionChecking for null alternate keys with unsigned shortmembermill40237 Aug '06 - 11:25 
First, I just want to thank the author for this code. It works great. I'm looking at using the extended stored procedure as well as the .net assembly.
So my question is how do we know if a particular word has no alternate key when we are using the unsigned short version of the keys?
AnswerRe: Checking for null alternate keys with unsigned shortmemberAdam Nelson25 Sep '06 - 4:11 
I'm glad you found my code useful.
 
An alternate key of '0' should not be considered valid, so as long as you don't compare two 0 keys for equality, you should be fine. Is there some other reason you need to detect null keys?
 

Adam

GeneralRe: Checking for null alternate keys with unsigned shortmembermill402325 Sep '06 - 5:37 
Thanks. When I asked the question, I was trying to figure out what value represented the lack of an alternate key for a word. In the SQL xp implementation, you get a null value, but with ShortDoubleMetaphone, you get 65535. For several reasons, we wanted to compute the metaphone keys in our .Net app and compare them against a table of keys in SQL.
The only tricky part was figuring out how to translate between SQL server's smallint values and .Net's UInt16 values. So what I ended up doing is converting the results of the SQL XP to the equivalent UInt16 values and storing those in the key tables. Here is what worked well for us:
--This is the value used to represent a null or invalid metaphone key
DECLARE @maxKeyValue int SET @maxKeyValue = 65535
EXEC master..xp_metaphone @WorkWord, @primaryMetaphoneTemp output, @alternateMetaphoneTemp output
if @alternateMetaphoneTemp is null
set @alternateMetaphone = @maxKeyValue
else
if @alternateMetaphoneTemp < 0 --convert this smallint value to the equivalent unsigned int value
set @alternateMetaphone = @alternateMetaphoneTemp + @maxKeyValue + 1
else
set @alternateMetaphone = @alternateMetaphoneTemp
GeneralRe: Checking for null alternate keys with unsigned shortmemberAdam Nelson25 Sep '06 - 5:52 
Cool, that's a good solution.
QuestionPossible bug?memberwiseleyb4 Dec '05 - 5:44 
I think I might have found a small bug in your otherwise excellent code (thanks for doing this!).
 
When running CSWordLookup with this dictionary file the function nullpoint.Metaphone.DoubleMetaphone.areStringsAt(start,length,strings) failed with a index out of range error. I added a simple check to fix it. Modified function:
 

private bool areStringsAt(int start, int length, params String[] strings)
{
if (start < 0 || m_word.Length < length)
{
//Sometimes, as a result of expressions like "current - 2" for start,
//start ends up negative. Since no string can be present at a negative offset, this is always false
return false;
}
 
String target = m_word.Substring(start, length);

for (int idx = 0; idx < strings.Length; idx++) {
if (strings[idx] == target) {
return true;
}
}
 
return false;
}

 
-ben
http://mudabone.com

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web04 | 2.6.130523.1 | Last Updated 19 Mar 2007
Article Copyright 2003 by Adam Nelson
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid