Simple information searches--name lookups, word searches, etc--are often implemented in terms of an exact match criterion. However, given both the diversity of homophonic (pronounced the same) words and names, as well as the propensity for humans to misspell surnames, this simplistic criterion often yields less than desirable results, in the form of reduced result sets, missing records that differ by a misplaced letter or different national spelling.
This article series discusses Lawrence Phillips' Double Metaphone phonetic matching algorithm, and provides several useful implementations, which can be employed in a variety of solutions to create more useful, effective searches of proper names in databases and other collections.
This article series discusses the practical use of the Double Metaphone algorithm to phonetically search name data, using the author's implementations written for C++, COM (Visual Basic, etc), scripting clients (VBScript, JScript, ASP), SQL, and .NET (C#, VB.NET, and any other .NET language). For a discussion of the Double Metaphone algorithm itself, and Phillips' original code, see Phillips' article in the June 2000 CUJ, available here.
Part I introduces Double Metaphone and describes the author's C++ implementation and its use. Part II discusses the use of the author's COM implementation from within Visual Basic. Part III demonstrates, use of the COM implementation from ASP and with VBScript. Part IV shows how to perform phonetic matching within SQL Server, using the author's extended stored procedure. Part V demonstrates the author's .NET implementation. Finally, Part VI closes with a survey of phonetic matching alternatives, and pointers to other resources.
Part I of this article series discussed the Double Metaphone algorithm, its origin and use, and the author's C++ implementation. While this section summarizes the key information from that article, readers are encouraged to review the entire article, even if the reader has no C++ experience.
The Double Metaphone algorithm, developed by Lawrence Phillips and published in the June 2000 issue of C/C++ Users Journal, is part of a class of algorithms known as "phonetic matching" or "phonetic encoding" algorithms. These algorithms attempt to detect phonetic ("sounds-like") relationships between words. For example, a phonetic matching algorithm should detect a strong phonetic relationship between "Nelson" and "Nilsen", and no phonetic relationship between "Adam" and "Nelson."
Double Metaphone works by producing one or possibly two phonetic keys, given a word. These keys represent the "sound" of the word. A typical Double Metaphone key is four characters long, as this tends to produce the ideal balance between specificity and generality of results.
The first, or primary, Double Metaphone key represents the American pronunciation of the source word. All words have a primary Double Metaphone key.
The second, or alternate, Double Metaphone key represents an alternate, national pronunciation. For example, many Polish surnames are "Americanized", yielding two possible pronunciations, the original Polish, and the American. For this reason, Double Metaphone computes alternate keys for some words. Note that the vast majority (very roughly, 90%) of words will not yield an alternate key, but when an alternate is computed, it can be pivotal in matching the word.
To compare two words for phonetic similarity, one computes their respective Double Metaphone keys, and then compares each combination:
- Word 1 Primary - Word 2 Primary
- Word 1 Primary - Word 2 Alternate
- Word 1 Alternate - Word 2 Primary
- Word 1 Alternate - Word 2 Alternate
Obviously if the keys in any of these comparisons are not produced for the given words, the comparisons involving those keys are not performed.
Depending upon which of the above comparisons match, a match strength is computed. If the first comparison matches, the two words have a strong phonetic similarity. If the second or third comparison matches, the two words have a medium phonetic similarity. If the fourth comparison matches, the two words have a minimal phonetic similarity. Depending upon the particular application requirements, one or more match levels may be excluded from match results.
The author has created a COM component, which wraps the C++ Double Metaphone implementation discussed in Part I. This COM component can be called from Visual Basic, Microsoft Access, Visual FoxPro, Delphi, and any other COM-compatible language. While this article focuses on Visual Basic, the concepts are applicable to any language.
This article will demonstrate the use of the Double Metaphone COM component using the VB Word Lookup sample application, a Visual Basic version of the Word Lookup C++ sample from Part I. VB Word Lookup reads a list of 21,000 names from a text file, then builds a map matching the Double Metaphone phonetic keys, with all words which yield those phonetic keys. The Visual Basic version will use the
Dictionary class to map phonetic keys to arrays of words matching that phonetic key. Lookups are performed by querying the
Dictionary class with the phonetic key of the search word; all strings in the array associated with the phonetic key of the search word, are phonetic matches of the search word.
The first step in using the Double Metaphone implementation is to add a reference to the Double Metaphone component to your project, using the Project | References menu:
Note that one must register the component before opening the References dialog. To register the component, open a command prompt, change to the directory containing the MetaphoneCOM.dll and enter the following command:
One need only register the component once on every machine with which one will be using Double Metaphone.
Once Double Metaphone is added to the project's references, it can be used like any other component. For example, examine this snippet from the
btnFind_Click handler from VB Word Lookup:
Dim mphone As New MetaphoneCOM.DoubleMetaphoneString
Dim primaryKey As String
Dim alternateKey As String
mphone.ComputeMetaphoneKeys searchWord, primaryKey, alternateKey
Upon examination of the VB Word Lookup code, it becomes obvious that use of Double Metaphone is remarkably simple. In fact, the most complex element of VB Word Lookup is not Double Metaphone, but the use of the
Dictionary class to map phonetic keys to variant arrays of words matching those phonetic keys. This is conceptually is identical to the C++ version, which uses similar (though more type-safe and vastly more efficient) containers in a similar configuration.
Note also that unlike Word Lookup, VB Word Lookup does not compute a match score based on which keys match which. This is left as an exercise for the reader.
As discussed in Part I of this series, an optimization exists for Double Metaphone, whereby the four-character phonetic keys are represented as unsigned shorts (
Integer in Visual Basic). This optimization is also exposed as a COM component, in the form of the
MetaphoneCom.DoubleMetaphoneShort class. Its usage is identical to
DoubleMetaphoneString, except that instead of
Strings as the keys, it produces
Integers. Try modifying the VB Word Lookup to use the unsigned short optimization and note the resulting change in performance of the lookup.
Note this: optimization is especially important for database applications, which benefit greatly from the reduced key size (an
Integer is two bytes, while a Metaphone key is usually five bytes including the terminating
NULL) and more importantly, the increased comparison speed. Modern CPUs contain instructions which can compare two
Integers in one clock cycle, while comparing two
Strings takes multiple clock cycles, meaning searches on
Integers will perform significantly faster than searches on
Double Metaphone and relational databases
Unlike C++, Visual Basic has extensive intrinsic support for relational databases. Therefore, the author has spared the discussion of Double Metaphone in database applications until this article, in which it will be easier to demonstrate concepts using VB than in C++. That said, the concepts in this section are specific to relational databases, not a particular database, and certainly not a particular language.
Up to this point, all examples of Double Metaphone applications have been rather unrealistic. While compelling as a demonstration, the use of static, in-memory containers, mapping a pre-defined list of names and their phonetic keys is somewhat lacking in one key aspect: few modern applications are based on small, static data sets. More likely, they are based on database tables containing structured information, dynamically updated as frequently as near-real-time. Adapting the model from the Word Lookup samples to this scenario would be difficult and time-consuming, and the results would be abysmal.
Instead, some means of storing phonetic keys and performing phonetic searches, which makes use of the inherent strengths of a relational database must be devised. Specifically, relational database systems excel at rapidly searching large amounts of structured data given specific search criteria, usually represented using Structured Query Language (SQL).
As many readers have no doubt guessed, the obvious solution is to compute and store Metaphone keys for the data to be searched, within the database itself. To perform phonetic searches against this data, simply compute the phonetic keys for the search terms, then build a SQL query which retrieves that data, with phonetic keys which match the search keys. While in implementation this can become quite complex, especially when searching multi-word data, the concept itself is quite simple.
To demonstrate, a variation of the VB Word Lookup sample called VB Word DB Lookup will be used. This version operates against a Microsoft Access 2000 database, containing a table,
Words, with the words from the word list used in previous samples, plus the unsigned short representation of each word's Double Metaphone keys, already computed and stored in separate columns of the table. The columns containing the keys have been indexed for maximal performance.
Note that Access was chosen primarily for convenience; most developers have a copy installed, and its file-based databases are easily exchanged without additional configuration. Experienced developers should have no trouble adapting this sample to SQL Server, DB2, Oracle, etc.
The source to VB Word DB Lookup, and the Access database against which it operates, are included in the source archive and will not be dwelt upon here. The key point to make is the SQL, which VB Word DB Lookup produces, to retrieve the phonetic match results. To explore the SQL, consider a sample search, which is produced by entering "Patricia" (which has an alternate Metaphone key) in the search box:
(key1 = -25654)
or (key2 = -25654)
or (key1 = -25651)
or (key2 = -25651)
This SQL is performing the same search logic as the Word Lookup and VB Word Lookup applications, albeit with a single line of SQL code, rather than tens of lines of C++ or VB code. In this case, -25654 is the unsigned short form of the primary Metaphone key of the search term, "Patricia", and -25651 is the alternate Metaphone key. The
WHERE clause is performing all four of the possible match tests for comparison of two words for phonetic similarity, first comparing the primary keys of the words in the table to the primary search key, then the alternate keys to the primary search key, then the primary keys to the alternate search key, and finally the alternate keys to the alternate search key.
While clearly the SQL version is much simpler, the use of a relational database for phonetic matching has significantly more important implications. First, the database engine implements sophisticated indexing and optimization techniques, particularly for indexed columns such as those containing the keys. Therefore, searches will tend to be significantly faster.
Second, many modern database engines support data sets measured in terabytes or larger. Thus by using a relational database, applications can perform phonetic matching on very large datasets with remarkable speed and surprisingly little programming effort.
Third, and perhaps most significant, relational databases are dynamic. As anyone who has written a database application knows, databases are much more powerful when they can be modified as well as queried. Database engines allow hundreds or thousands of concurrent connections to query and modify data, using locks to prevent data corruption. Clients written in a variety of languages can access and update databases, from the same machine or across the world. The database is already an information hub; adding phonetic matching at the database level, clearly makes the most sense.
This article has demonstrated the use of the Double Metaphone COM implementation from within Visual Basic, and has introduced the use of Double Metaphone-based phonetic matching with relational databases. It should be clear from this article that use of Double Metaphone for phonetic matching in general and for phonetic matching with relational databases in particular, presents significant opportunities for the development of remarkably powerful, easy-to-use applications with minimal computational and programmer effort.
For more Double Metaphone implementation samples, including additional relational database applications, continue reading the remaining articles in the series. Part III demonstrates using the COM implementation from ASP and VBScript, including phonetic searching against a database. Part IV introduces the author's SQL Server extended stored procedure, which enables computation of Double Metaphone keys from within SQL. Part IV also discusses optimizations of relational database phonetic matching solutions, and addresses some real-world database-related implementation issues. Part V explores the author's .NET Double Metaphone implementation, and includes a sample which performs phonetic searches against a relational database. Finally, Part VI concludes the discussion of Double Metaphone with an examination of alternate phonetic matching techniques, and pointers to other resources and Double Metaphone implementations.
- 7-22-03 Initial publication.
- 7-31-03 Added hyperlinks between articles in the series.