Abstract
Simple information searches -- name lookups, word searches, etc. -- are often implemented in terms of an exact match criterion. However, given both the diversity of homophonic (pronounced the same) words and names, as well as the propensity for humans to misspell surnames, this simplistic criterion often yields less than desirable results, in the form of reduced result sets, missing records that differ by a misplaced letter or different national spelling.
This article series discusses Lawrence Phillips' Double Metaphone phonetic matching algorithm, and provides several useful implementations which can be employed in a variety of solutions to create more useful, effective searches of proper names in databases and other collections.
Introduction
This article series discusses the practical use of the Double Metaphone algorithm to phonetically search name data, using the author's implementations written for C++, COM (Visual Basic, etc.), scripting clients (VBScript, Jscript, ASP), SQL, and .NET (C#, VB.NET, and any other .NET language). For a discussion of the Double Metaphone algorithm itself, and Phillips' original code, see Phillips' article in the June 2000 CUJ, available here.
Part I introduces Double Metaphone and describes the author's C++ implementation and its use. Part II discusses the use of the author's COM implementation from within Visual Basic. Part III demonstrates use of the COM implementation from ASP and with VBScript. Part IV shows how to perform phonetic matching within SQL Server using the author's extended stored procedure. Part V demonstrates the author's .NET implementation. Finally, Part VI closes with a survey of phonetic matching alternatives, and pointers to other resources.
Double Metaphone limitations
While this article series has focused entirely on the Double Metaphone algorithm as a means to implement phonetic matching in one's applications, Double Metaphone bears some weaknesses that may render it unsuitable for a particular application, including:
- Though it works as a general-purpose phonetic search algorithm, Double Metaphone was designed for, and works best with, searching lists of proper names rather than large fields of generic text.
- Double Metaphone provides minimal ranking ability, apart from the three match levels described elsewhere in the series. This limits the ability to tune search results.
- Being a phonetic matching (vs. fuzzy matching like q-grams and edit distances) algorithm, Double Metaphone may fail to match misspelled words when the misspelling substantively alters the phonetic structure of a word.
Even bearing these limitations in mind, Double Metaphone is free, efficient, easy to use, and adaptable to a number of scenarios. Ultimately, only the designer of a particular system can decide if Double Metaphone is appropriate to his/her particular problem space.
Alternatives to Double Metaphone
Numerous other algorithms and techniques have been developed, each for different purposes, and each with varying efficacy. This section will explore some of the better-known techniques, and provide resources for more information on each method.
Soundex
Soundex was one of the first, if not the first, formalized phonetic matching algorithm. Soundex was developed for use by the US Census in the late 19th century. Not surprisingly, this algorithm is remarkably simple, and woefully inadequate in most cases.
Nonetheless, one encounters Soundex in surprising places, even in modern software solutions. For example, Microsoft SQL Server offers a SOUNDEX
function which, given a word, computes Soundex keys.
For more information on Soundex, a simple Internet search on "soundex" will likely yield fruitful results. Once again, the reader is encouraged to consider more advanced alternatives for any production system, as Soundex is both primitive and limited.
Metaphone
Double Metaphone is only the latest incarnation of the Metaphone algorithm, originally published by Lawrence Phillips in 1990. While arguably inferior to Double Metaphone, Metaphone does incorporate similar heuristics, and has the added advantage (and disadvantage) of producing only one phonetic key for a given word.
Phonix
Phonix is an improved version of Soundex, developed by T.N. Gadd and published in Association for Information Management's journal, Program [Gadd, T.N. `Fisching fore werds': phonetic retrieval of written text in information systems, 22(3) 1988, p. 222] and [Gadd, T.N. PHONIX: the algorithm, 24(4) 1990, p. 363]. While the articles are not available online, Phonix has been incorporated into a number of WAIS implementations, including freeWAIS, which is open-source and therefore freely available in source form.
The author has never experimented with Phonix, and therefore cannot write authoritatively regarding its performance; however being Soundex-based, it bears much of the same baggage which limits Soundex's performance. The paper by Zobel and Dart referenced at the end of this article performs some objective comparison of Phonix with several other algorithms, producing results which confirm this author's suspicions.
q-Gram based algorithms
A q-gram (sometimes called n-gram, primarily to confuse readers) in this context refers to a sequence of letters, q letters long, from a given word. For example, for q = 2, the word Nelson
has the following q-grams:
NE EL LS SO ON
By comparison, Neilsen
breaks down into these q-grams (q = 2):
NE EI IL LS SE EN
Clearly, Nelson
and Neilsen
share the NE
and LS
q-grams in common.
Various techniques have been developed which compare two words based on their q-grams. A simple example would be counting the number of q-grams two words have in common, with a higher count yielding a stronger match.
Technically, q-gram algorithms aren't strictly phonetic matching, in that they do not operate based on comparison of the phonetic characteristics of words. Instead, q-grams can be thought to compute the "distance", or amount of difference, between two words. Since phonetically similar words often have similar spellings, this technique can provide favorable results, yet it also successfully matches misspelled or otherwise mutated words, even if they are rendered phonetically disparate.
Edit distance based algorithms
Edit distance computes the "distance" between two words by counting the number of insert, replace, and delete operations required to permute one word into another. In general the fewer operations required, the closer the match. Some implementations assign varying scores to the insert, replace, and delete operations. Another common variation varies which operations are considered when computing the distance; for example, the replace operation may not be considered, thereby defining the edit distance solely in terms of insert and delete options.
One of the more popular algorithms in the edit distance class is Levenshtein distance (note the spelling; some references use the spelling 'Levenstein', which is technically incorrect).
As with q-gram algorithms, edit distance is not strictly phonetic, but often matches phonetically similar words due to similarities in spelling.
Proprietary algorithms
Several organizations offer data scrubbing, de-duplication, data normalizing, and merge-purge software, which implement some form of approximate text matching, albeit with varying degrees of success. The exact algorithms used are often proprietary, and seldom documented. The applicability of these systems to a particular problem must be carefully analyzed before adopting any one solution.
Other sources
The above list of alternative matching algorithms is far from complete, and provides only enough information to begin a search for suitable algorithms -- not to end one. This section lists other sources of information on phonetic matching, and approximate text matching, as well as links to additional Double Metaphone implementations.
Additional Double Metaphone implementations
Additional Phonetic Matching/Approximate text matching resources
- "Phonetic String Matching: Lessons from Information Retrieval" by Zobel and Dart. - A very readable paper comparing all of the matching techniques discussed in this article (except Metaphone and Double Metaphone, unfortunately), and a few more. Includes tables containing quantitative test results comparing various.
- "A Guided Tour to Approximate String Matching" by Gonzalo Navarro. - An excellent survey of approximate string matching, which is different from, but related to, phonetic matching.
- "Approximate Text Searching" by Gonzalo Navarro. - A very exhaustive discussion of approximate text matching issues. May be too technical for some readers. Note that the English version follows the Spanish text.
- "Searching Proper Names in Databases" by Pfeifer, Poersch, and Fuhr. - A very accessible discussion of several techniques specifically designed for searching databases of proper (last) names. Includes results of efficacy tests by the authors.
- "Learning String Edit Distance " by Ristad, and Yianilos. - Paper describing edit distance algorithms, and presenting an interesting stochastic model for learning an edit distance function from a set of examples. This latter subject is likely to be of limited interest to those seeking approximate string matching techniques, however the former topic is highly relevant.
- Richard Birkby's CodeProject article presenting four Soundex variations in C#. - Includes a brief discussion of Soundex performance.
- A very readable essay by Michael Gillel and describing the Levenshtein Distance algorithm, along with source code in VB, C++, and Java. Be sure to take note of the Resources section, which includes links to additional sites and implementations.
- "Finding String Distances", Dr Dobb's Journal, April 1992, by Ray Valdes. Interesting article discussing Levenshtein distance, and its applications not only to string matching but also to molecular biology, handwriting recognition, etc. Article available on DDJ archive CD. Source code available here.
- "A Comparison of String Distance Metrics for Name-Matching Tasks" by Cohen, Ravikumar, and Feinberg. - A nice, technical survey of techniques for measuring the "distance" between strings, for the purposes of matching proper names. Great source of additional information on edit distance techniques, including Levenshtein distance.
- SourceForge home for SecondString, a Java approximate string matching library by the authors of "A Comparison of String Distance Metrics for Name-Matching Tasks" referenced above. Includes, among other things, an implementation of Levenshtein distance.
- A powerfull full-text search engine implemented entirely in Java. - Includes Levenshtein distance code, along with all of the typical full text indexing features.
- AGREP - AGREP is a tool, similar to egrep, fgrep, and grep, which searches for a given pattern in files. AGREP uses an edit distance algorithm to perform approximate matching, making it of interest for experimenting with the results one can expect from such algorithms.
Conclusion
This article concludes the article series on phonetic matching of name data with Double Metaphone. In this article, some of the other major techniques for phonetic matching are presented, as well as links to additional resources for the reader interested in further research on the subject. By this point, the reader should understand that no one solution exists to the problem of matching similar but not identical text, and that candidate solutions should be selected based on the reader's specific criteria.
That said, hopefully this article series will lead the reader to strongly consider Double Metaphone, due to its ease of use, respectable performance, and readily available implementations in a variety of languages.
History
- 7-22-03 Initial publication
- 7-29-03 Added reference to Richard Birkby's Soundex article
- 7-31-03 Added hyperlinks between articles in the series
- 8-03-03 Added additional resources pertaining to the Levenshtein Distance algorithm
Article Series