Abstract
Simple information searches, such as name lookups and word searches, are often implemented with an exact-match criterion. However, given the diversity of homophonic (pronounced the same) words and names, and the propensity of humans to misspell surnames, this simplistic criterion often yields poor results: reduced result sets that miss records differing only by a misplaced letter or a different national spelling.
This article series discusses Lawrence Philips' Double Metaphone phonetic matching algorithm and provides several implementations that can be employed in a variety of solutions to create more useful, effective searches of proper names in databases and other collections.
Introduction
This article series discusses the practical use of the Double Metaphone algorithm to phonetically search name data, using the author's implementations written for C++, COM (Visual Basic, etc.), scripting clients (VBScript, JScript, ASP), SQL, and .NET (C#, VB.NET, and any other .NET language). For a discussion of the Double Metaphone algorithm itself, and Philips' original code, see Philips' article in the June 2000 issue of C/C++ Users Journal (CUJ).
Part I introduces Double Metaphone and describes the author's C++ implementation and its use. Part II discusses the use of the author's COM implementation from within Visual Basic. Part III demonstrates use of the COM implementation from ASP and with VBScript. Part IV shows how to perform phonetic matching within SQL Server using the author's extended stored procedure. Part V demonstrates the author's .NET implementation. Finally, Part VI closes with a survey of phonetic matching alternatives, and pointers to other resources.
Background
Part I of this article series discussed the Double Metaphone algorithm, its origin and use, and the author's C++ implementation. While this section summarizes the key information from that article, readers are encouraged to review the entire article, even if the reader has no C++ experience.
The Double Metaphone algorithm, developed by Lawrence Philips and published in the June 2000 issue of C/C++ Users Journal, is part of a class of algorithms known as "phonetic matching" or "phonetic encoding" algorithms. These algorithms attempt to detect phonetic ("sounds-like") relationships between words. For example, a phonetic matching algorithm should detect a strong phonetic relationship between "Nelson" and "Nilsen", and no phonetic relationship between "Adam" and "Nelson".
Double Metaphone works by producing one or possibly two phonetic keys, given a word. These keys represent the "sound" of the word. A typical Double Metaphone key is four characters long, as this tends to produce the ideal balance between specificity and generality of results.
The first, or primary, Double Metaphone key represents the American pronunciation of the source word. All words have a primary Double Metaphone key.
The second, or alternate, Double Metaphone key represents an alternate, national pronunciation. For example, many Polish surnames are "Americanized", yielding two possible pronunciations, the original Polish, and the American. For this reason, Double Metaphone computes alternate keys for some words. Note that the vast majority (very roughly, 90%) of words will not yield an alternate key, but when an alternate is computed, it can be pivotal in matching the word.
To compare two words for phonetic similarity, one computes their respective Double Metaphone keys, and then compares each combination:
- Word 1 Primary - Word 2 Primary
- Word 1 Primary - Word 2 Alternate
- Word 1 Alternate - Word 2 Primary
- Word 1 Alternate - Word 2 Alternate
If a word does not produce one of these keys (typically the alternate key), the comparisons involving that key are simply not performed.
Depending upon which of the above comparisons matches, a match strength is computed. If the first comparison matches, the two words have a strong phonetic similarity. If the second or third comparison matches, the two words have a medium phonetic similarity. If the fourth comparison matches, the two words have a minimal phonetic similarity. Depending upon the particular application requirements, one or more match levels may be excluded from match results.
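The comparison rules above can be sketched in C# as follows. This is only an illustration: the MatchStrength enum and PhoneticComparer class are hypothetical names that are not part of the author's library, and the sketch assumes that a missing alternate key is represented by null or an empty string.

```csharp
using System;

// Match levels described in the text, ordered from weakest to strongest.
// (Hypothetical type, introduced only for this sketch.)
public enum MatchStrength { None = 0, Minimal = 1, Medium = 2, Strong = 3 }

public static class PhoneticComparer
{
    // Compares the keys of two words and returns the strongest match found.
    // Primary keys are assumed to be present for both words. A null or empty
    // alternate key is assumed to mean "no alternate key", so comparisons
    // involving it are skipped.
    public static MatchStrength Compare(
        string primary1, string alternate1,
        string primary2, string alternate2)
    {
        bool hasAlt1 = !string.IsNullOrEmpty(alternate1);
        bool hasAlt2 = !string.IsNullOrEmpty(alternate2);

        // Primary vs. primary: strong similarity
        if (primary1 == primary2) return MatchStrength.Strong;

        // Primary vs. alternate, in either direction: medium similarity
        if (hasAlt2 && primary1 == alternate2) return MatchStrength.Medium;
        if (hasAlt1 && alternate1 == primary2) return MatchStrength.Medium;

        // Alternate vs. alternate: minimal similarity
        if (hasAlt1 && hasAlt2 && alternate1 == alternate2) return MatchStrength.Minimal;

        return MatchStrength.None;
    }
}
```

Because the levels are ordered, an application that wants only strong and medium matches can discard any result below MatchStrength.Medium.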
.NET implementation
The .NET implementation of Double Metaphone is very similar in design and use to the C++ implementation presented in Part I. To use the .NET implementation, simply add the Metaphone.NET.dll assembly to your project's references in Visual Studio .NET, import the nullpointer.Metaphone namespace into the source files, and instantiate the DoubleMetaphone or ShortDoubleMetaphone class, for string and unsigned short Metaphone keys, respectively.
For example, to compute the Metaphone keys for the name "Nelson", code similar to that listed below may be used (C# code listed; the .NET implementation is callable from VB.NET, J#, and all other .NET languages):
using System;
using nullpointer.Metaphone;

// Compute both keys for "Nelson" when the object is constructed
DoubleMetaphone mphone = new DoubleMetaphone("Nelson");
Console.WriteLine("{0} {1}",
    mphone.PrimaryKey,
    mphone.AlternateKey);
Note that the Metaphone keys are obtained via the PrimaryKey and AlternateKey properties.
As with the C++ implementation, an existing instance of a DoubleMetaphone or ShortDoubleMetaphone class can be used to compute the Metaphone keys for a new word, by calling the computeKeys method:
using System;
using nullpointer.Metaphone;

// Reuse an existing instance: computeKeys computes the keys for a new word
DoubleMetaphone mphone = new DoubleMetaphone();
mphone.computeKeys("Nelson");
Console.WriteLine("{0} {1}",
    mphone.PrimaryKey,
    mphone.AlternateKey);
As with all of the implementations presented in this article series, a sample application written in C#, CS Word Lookup, is presented to demonstrate the use of the .NET implementation. CS Word Lookup uses a Hashtable collection to map each Metaphone phonetic key to an ArrayList containing the words that produce that key.
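The lookup structure used by CS Word Lookup might look roughly like the sketch below. It is not the sample's actual source: it uses the generic Dictionary and List classes rather than Hashtable and ArrayList, and the PhoneticIndex class and its keysOf delegate are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;

// Maps each phonetic key to the list of words that produce it.
// (Hypothetical class, not part of the author's library.)
public class PhoneticIndex
{
    private readonly Dictionary<string, List<string>> m_index =
        new Dictionary<string, List<string>>();

    // Returns the phonetic key(s) for a word; supplied by the caller.
    private readonly Func<string, IEnumerable<string>> m_keysOf;

    public PhoneticIndex(Func<string, IEnumerable<string>> keysOf)
    {
        if (keysOf == null) throw new ArgumentNullException("keysOf");
        m_keysOf = keysOf;
    }

    // Files the word under every non-empty key it produces.
    public void Add(string word)
    {
        foreach (string key in m_keysOf(word))
        {
            if (string.IsNullOrEmpty(key)) continue; // no alternate key

            List<string> words;
            if (!m_index.TryGetValue(key, out words))
            {
                words = new List<string>();
                m_index[key] = words;
            }
            if (!words.Contains(word)) words.Add(word);
        }
    }

    // Returns every indexed word that shares at least one key with the query.
    public List<string> Lookup(string query)
    {
        List<string> results = new List<string>();
        foreach (string key in m_keysOf(query))
        {
            if (string.IsNullOrEmpty(key)) continue;

            List<string> words;
            if (!m_index.TryGetValue(key, out words)) continue;

            foreach (string word in words)
            {
                if (!results.Contains(word)) results.Add(word);
            }
        }
        return results;
    }
}
```

With the author's library, keysOf could presumably wrap DoubleMetaphone and return its PrimaryKey and AlternateKey for each word; that a missing alternate key comes back as null or empty is an assumption of this sketch.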
Performance notes
While the .NET CLR performs reasonably well, the C++ implementation of Double Metaphone will likely perform significantly faster than the .NET version, primarily because the C++ version judiciously avoids memory allocation and buffer copies, while the .NET implementation cannot avoid them. The ambitious reader is encouraged to optimize the .NET implementation, perhaps by using the unsafe keyword to perform direct memory access, at the expense of verifiably type-safe code.
Conclusion
This brief article introduced the author's .NET implementation of Double Metaphone, including code snippets and a brief discussion of performance issues. Continue to Part VI for a review of alternative phonetic matching techniques, and a list of phonetic matching resources, including links to other Double Metaphone implementations.
History
- 7-22-03 Initial publication
- 7-31-03 Added hyperlinks between articles in the series
 |
Hi Nelson,
Please send me the DLLs for both COM and .NET. I don't have VC installed on my PC.
Regards. Prathap
 |
Hi Adam,
Found your contribution today and implemented it in our project. It works very nicely. We've tried German, French and Italian names and all of them were found. Thanks a lot.
Regards, Paul Sinnema.
 |
Hi There,
I was just wondering what the licensing/permissions were to use your code in a commercial product.
Thanks,
Casey
 |
You may consider the code to be licensed under the BSD license, which permits commercial use provided you do not represent the code as being your own, usually with a credit in the manual or about box.
 |
I'm getting the following error when I try to compile the code for x64 (VS.NET 2005) so that I can use it on 64-bit SQL Server:
Error 7 fatal error LNK1181: cannot open input file 'opends60.lib' XPMetaphone
Please let me know if I need to do anything to get rid of this error and compile cleanly.
Thanks in advance,
T
 |
I am also getting the same error when trying to compile for 64-bit to use in SQL Server.
Can you help me compile without errors?
Thank you, CP
 |
Adam, Thanks for your work on these implementations - we’ve been using the extended stored procedure successfully for the past 2½ years. We’ve recently upgraded to SQL Server 2005 and will soon be changing to 64-bit hardware, which requires us to make some changes since 32-bit dlls aren’t supported on the new hardware. We would like to change this over to a CLR implementation, since Microsoft has deprecated extended stored procedures for SQL Server 2005. I’d like to request your help with a couple of issues:
1. Converting the DoubleMetaphone and ShortDoubleMetaphone classes to .NET 2.0, with interfaces suitable for use with the new CREATE ASSEMBLY statement (requires a static method), and accessible via a SQL scalar user-defined function (requires a single output parameter that matches a native SQL data type). We can handle this conversion ourselves, but I was hoping you might take an interest since the days of xp_metaphone.dll appear to be numbered.
2. The .NET implementation you published doesn’t return the same primary and alternate keys as the COM implementation for some names. (We found 1389 differences out of 159,289 names we have indexed.) I took a quick step through in debug and couldn’t see where the problem is, but based on spot checks it appears that the .NET implementation is the one with problems. Here are some examples; I’ll be happy to send you the entire list of differences if you’d like.
- AGNEW, ALLOIS: No alternate key from .NET
- ALLECIA, ARCHILLA: Different alternate keys
- AUTHIER: This case might represent a gap in the algorithm, since neither the COM nor the .NET implementations return the keys I expected. The anglicized pronunciation is au-thir´ (key 0R), while the French pronunciation is o-tya´ (key T).
- BAUMB, BAUX: Different primary keys
- BEAUBIER, ROZIER: Alternate, primary keys out of sync
Thanks again, Mike
 |
Mike: Thanks for your comments, and I'm glad you've found XP Metaphone useful.
Re-packaging the metaphone impl into a static class with a scalar function shouldn't be too hard. It shouldn't take but a few minutes.
This is the first I've heard of output disparities between the COM and .NET impl. Thanks for bringing it to my attention, and with test data no less. I'll investigate further to see about fixing the problem. I might not get to it until the weekend.
Thanks again for your comments.
Adam
 |
DoubleMetaphone.cs, line 139: Need 5 spaces of padding to handle the "CAESAR" case at line 219 for an input of "C". This same bug exists in the C++ version, but only raises an exception in C#.
DoubleMetaphone.cs, line 144: Need to set m_length = word.Length here, or else move the assignment statement ahead of the padding concatenation in line 139.
With these changes, the C# version returns the same values as xp_metaphone for my 158K test inputs, with the exception of "WJ" - I didn't take the time to track that one down.
I'd still be interested in your thoughts on a CLR implementation for use with SQL Server 2005. Thanks again for your work on this!
 |
Mike: Thanks for looking into this, and my apologies for the delayed response.
I've put together a test rig that runs a list of names through Philips' original Double Metaphone impl, my C++ impl, and my C# impl. I didn't see the exception you reported for the 'CAESAR' case, but I do see several names producing different results under C# vs C++. I'm looking into this now.
Regarding SQL Server, it seems a static class with the [SqlFunction] attribute wrapping the existing DoubleMetaphone class would do the trick.
Adam
 |
Adam, Thanks for your response. We ran into a few glitches with the CLR implementation for SQL Server:
1. SQL Server apparently doesn't allow namespaces in CLR classes, so we had to remove this from your original source.
2. Only a single dll file can be registered via the CREATE ASSEMBLY statement, so we had to combine the source files in order to use ShortDoubleMetaphone.
3. A SQL scalar UDF can only return a single parameter, so there wasn't a clean mapping to replace xp_metaphone with its separate output parameters for the primary and alternate metaphone keys. We opted to combine the values into a single BINARY(4) output parameter and then parse this back into two SMALLINTs after the UDF call, but this seems like a kludge. This is also where we ran into the glitch for the "WF" input parameter - we got x0000 back instead of the expected xFFFF for the alternate key.
What we have is working, but I would still be interested in your thoughts regarding a well-thought-out approach for SQL 2005.
Sorry if my 16:16 9 Jan '07 posting was unclear - the exception occurred at line 139 for an input of "C" rather than "CAESAR". The change to use 5 spaces of padding has corrected this.
Thanks again for your good work on this. Mike Renno
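The BINARY(4) packing described in this post could be sketched in C# roughly as follows. This is not Mike's actual code: the KeyPacker class is a hypothetical name, and the byte order (primary key first, high byte first within each key) is an assumption made for the illustration.

```csharp
using System;

// Packs two 16-bit Metaphone keys into a 4-byte value and back.
// (Hypothetical helper; layout is an assumption of this sketch.)
public static class KeyPacker
{
    // Layout: [primary high, primary low, alternate high, alternate low]
    public static byte[] Pack(ushort primary, ushort alternate)
    {
        return new byte[]
        {
            (byte)(primary >> 8), (byte)(primary & 0xFF),
            (byte)(alternate >> 8), (byte)(alternate & 0xFF)
        };
    }

    // Splits a 4-byte value produced by Pack back into its two keys.
    public static void Unpack(byte[] packed, out ushort primary, out ushort alternate)
    {
        if (packed == null || packed.Length != 4)
            throw new ArgumentException("Expected exactly 4 bytes.", "packed");

        primary = (ushort)((packed[0] << 8) | packed[1]);
        alternate = (ushort)((packed[2] << 8) | packed[3]);
    }
}
```

A round trip through Pack and Unpack preserves both keys, including an alternate key of 0xFFFF.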
 |
Mike: I've implemented the fixes you proposed, and my test rig now confirms the C# impl produces identical results for all 21k test names, including 'CAESAR' and 'WJ'. I'm going to update the article with the new code, but that's done via email and may take some time; in the meanwhile, I could send you the code if you like.
Adam
 |
First, I just want to thank the author for this code. It works great. I'm looking at using the extended stored procedure as well as the .net assembly. So my question is how do we know if a particular word has no alternate key when we are using the unsigned short version of the keys?
 |
I'm glad you found my code useful.
An alternate key of '0' should not be considered valid, so as long as you don't compare two 0 keys for equality, you should be fine. Is there some other reason you need to detect null keys?
Adam
 |
Thanks. When I asked the question, I was trying to figure out what value represented the lack of an alternate key for a word. In the SQL xp implementation, you get a null value, but with ShortDoubleMetaphone, you get 65535. For several reasons, we wanted to compute the metaphone keys in our .Net app and compare them against a table of keys in SQL. The only tricky part was figuring out how to translate between SQL server's smallint values and .Net's UInt16 values. So what I ended up doing is converting the results of the SQL XP to the equivalent UInt16 values and storing those in the key tables. Here is what worked well for us:
--This is the value used to represent a null or invalid metaphone key
DECLARE @maxKeyValue int
SET @maxKeyValue = 65535
EXEC master..xp_metaphone @WorkWord, @primaryMetaphoneTemp output, @alternateMetaphoneTemp output
if @alternateMetaphoneTemp is null
    set @alternateMetaphone = @maxKeyValue
else if @alternateMetaphoneTemp < 0
    --convert this smallint value to the equivalent unsigned int value
    set @alternateMetaphone = @alternateMetaphoneTemp + @maxKeyValue + 1
else
    set @alternateMetaphone = @alternateMetaphoneTemp
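The same smallint-to-UInt16 translation described in this post can also be done on the .NET side. The sketch below is an illustration only: the KeyConversion class and its member names are hypothetical, and the 65535 "no key" value is taken from the post rather than from the library's documentation.

```csharp
// Converts between SQL Server smallint values and the UInt16 keys
// described in the post. (Hypothetical helper class.)
public static class KeyConversion
{
    // Value the post reports for "no alternate key" with ShortDoubleMetaphone
    public const ushort NoKey = 65535;

    // NULL maps to NoKey; negative smallints wrap to the equivalent
    // unsigned value (for example, -32768 becomes 32768).
    public static ushort FromSqlSmallInt(short? value)
    {
        if (!value.HasValue) return NoKey;
        return unchecked((ushort)value.Value);
    }

    // Reinterprets a UInt16 key as the signed smallint with the same bits.
    public static short ToSqlSmallInt(ushort key)
    {
        return unchecked((short)key);
    }
}
```

The unchecked casts reinterpret the same 16 bits, which matches adding 65536 to negative values in the T-SQL. Note that a smallint of -1 and a NULL both map to 65535 under this scheme, just as they do in the posted SQL.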
 |
I think I might have found a small bug in your otherwise excellent code (thanks for doing this!).
When running CSWordLookup with this dictionary file, the function nullpointer.Metaphone.DoubleMetaphone.areStringsAt(start, length, strings) failed with an index out of range error. I added a simple check to fix it. Modified function:
private bool areStringsAt(int start, int length, params String[] strings)
{
    // Substring throws unless the whole span lies inside m_word
    if (start < 0 || start + length > m_word.Length)
    {
        return false;
    }

    String target = m_word.Substring(start, length);
    for (int idx = 0; idx < strings.Length; idx++)
    {
        if (strings[idx] == target)
        {
            return true;
        }
    }

    return false;
}
-ben http://mudabone.com
 |
Good catch Ben. That code is probably in need of some refactoring anyway, if start can be negative. Thanks for the fix.
Adam
 |
Hey Adam,
I have read your work "Implement Phonetic ("Sounds-like") Name Searches with Double Metaphone". It is very interesting. Recently I found a paper (Phonetic String Matching: Lessons from Information Retrieval - Justin Zobel, Philip Dart) about approximate string matching. I'm planning to experiment with the Editex algorithm. Do you know where I can find more information about it?
Thank you for your time
Elvio Fernandez
 |
I was almost ready to use the metaphone method when I stumbled across your articles on double metaphone. You did a VERY good job of explaining it and offering examples. The only thing I wish for was the source code in VB, but not a big deal. The only major thing I'll need to add is looking up on multiple words.
Thanks!
 |
tequilacollins wrote: I was almost ready to use the metaphone method when I stumbled across your articles on double metaphone. You did a VERY good job of explaining it and offering examples.
Thanks, I'm glad you think so.
tequilacollins wrote: The only thing I wish for was the source code in VB, but not a big deal
For what it's worth, you can use the COM component from VB6, and the C# component from VB.NET.
tequilacollins wrote: The only major thing I'll need to add is looking up on multiple words.
Just make sure that you compute the Metaphone keys on each word individually; the algorithm is not designed to compute a key for multiple words at once.
Good luck.
Adam
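The advice above, computing keys for each word individually, could be sketched in C# like this. The MultiWordKeys class and its keysOf delegate are hypothetical names, and splitting only on spaces and tabs is an assumption; how to treat hyphens or punctuation is left to the application.

```csharp
using System;
using System.Collections.Generic;

// Splits a phrase into words and computes phonetic keys per word.
// (Hypothetical helper, not part of the author's library.)
public static class MultiWordKeys
{
    // keysOf returns the phonetic key(s) for a single word.
    public static List<KeyValuePair<string, string[]>> KeysPerWord(
        string phrase, Func<string, string[]> keysOf)
    {
        List<KeyValuePair<string, string[]>> result =
            new List<KeyValuePair<string, string[]>>();

        // Tokenize first; the algorithm is applied to one word at a time
        char[] separators = new char[] { ' ', '\t' };
        string[] tokens = phrase.Split(separators, StringSplitOptions.RemoveEmptyEntries);

        foreach (string token in tokens)
        {
            result.Add(new KeyValuePair<string, string[]>(token, keysOf(token)));
        }
        return result;
    }
}
```

Each entry pairs a word with its own keys, so a caller can then compare words pairwise and combine the per-word match strengths however the application requires.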
 |
For what it's worth, you can use the COM component from VB6, and the C# component from VB.NET.
I'm writing the app in ASP. It would have been nice to recreate the DLL with the additional function of multiple words, but I can just create a wrapper instead.
Just make sure that you compute the Metaphone keys on each word individually; the algorithm is not designed to compute a key for multiple words at once.
Yeah, already figured that part. I'll have to tokenize the words first.
Then I still have to figure out a scoring system. If I get one word with a strong hit and the other as a weak one, what do I call it?
I'll let you know how it turns out.
 |
Does your version of the .NET implementation produce exactly the same results as the original Philips version? I need to know this because we are currently using the Philips version and want to ensure that the two versions are compatible for comparisons.
Thanks for your response.
Gary Fischbach