Click here to Skip to main content
Click here to Skip to main content

Implement Phonetic ("Sounds-like") Name Searches with Double Metaphone Part VI: Other Methods & Additional Resources

By , 19 Mar 2007
Rate this:
Please Sign up or sign in to vote.

Abstract

Simple information searches -- name lookups, word searches, etc. -- are often implemented in terms of an exact match criterion. However, given both the diversity of homophonic (pronounced the same) words and names, as well as the propensity for humans to misspell surnames, this simplistic criterion often yields less than desirable results, in the form of reduced result sets, missing records that differ by a misplaced letter or different national spelling.

This article series discusses Lawrence Phillips' Double Metaphone phonetic matching algorithm, and provides several useful implementations which can be employed in a variety of solutions to create more useful, effective searches of proper names in databases and other collections.

Introduction

This article series discusses the practical use of the Double Metaphone algorithm to phonetically search name data, using the author's implementations written for C++, COM (Visual Basic, etc.), scripting clients (VBScript, Jscript, ASP), SQL, and .NET (C#, VB.NET, and any other .NET language). For a discussion of the Double Metaphone algorithm itself, and Phillips' original code, see Phillips' article in the June 2000 CUJ, available here.

Part I introduces Double Metaphone and describes the author's C++ implementation and its use. Part II discusses the use of the author's COM implementation from within Visual Basic. Part III demonstrates use of the COM implementation from ASP and with VBScript. Part IV shows how to perform phonetic matching within SQL Server using the author's extended stored procedure. Part V demonstrates the author's .NET implementation. Finally, Part VI closes with a survey of phonetic matching alternatives, and pointers to other resources.

Double Metaphone limitations

While this article series has focused entirely on the Double Metaphone algorithm as a means to implement phonetic matching in one's applications, Double Metaphone bears some weaknesses that may render it unsuitable for a particular application, including:

  • Though it works as a general-purpose phonetic search algorithm, Double Metaphone was designed for, and works best with, searching lists of proper names rather than large fields of generic text.
  • Double Metaphone provides minimal ranking ability, apart from the three match levels described elsewhere in the series. This limits the ability to tune search results.
  • Being a phonetic matching (vs. fuzzy matching like q-grams and edit distances) algorithm, Double Metaphone may fail to match misspelled words when the misspelling substantively alters the phonetic structure of a word.

Even bearing these limitations in mind, Double Metaphone is free, efficient, easy to use, and adaptable to a number of scenarios. Ultimately, only the designer of a particular system can decide if Double Metaphone is appropriate to his/her particular problem space.

Alternatives to Double Metaphone

Numerous other algorithms and techniques have been developed, each for different purposes, and each with varying efficacy. This section will explore some of the better-known techniques, and provide resources for more information on each method.

Soundex

Soundex was one of the first, if not the first, formalized phonetic matching algorithm. Soundex was developed for use by the US Census in the late 19th century. Not surprisingly, this algorithm is remarkably simple, and woefully inadequate in most cases.

Nonetheless, one encounters Soundex in surprising places, even in modern software solutions. For example, Microsoft SQL Server offers a SOUNDEX function which, given a word, computes Soundex keys.

For more information on Soundex, a simple Internet search on "soundex" will likely yield fruitful results. Once again, the reader is encouraged to consider more advanced alternatives for any production system, as Soundex is both primitive and limited.

Metaphone

Double Metaphone is only the latest incarnation of the Metaphone algorithm, originally published by Lawrence Phillips in 1990. While arguably inferior to Double Metaphone, Metaphone does incorporate similar heuristics, and has the added advantage (and disadvantage) of producing only one phonetic key for a given word.

Phonix

Phonix is an improved version of Soundex, developed by T.N. Gadd and published in Association for Information Management's journal, Program [Gadd, T.N. `Fisching fore werds': phonetic retrieval of written text in information systems, 22(3) 1988, p. 222] and [Gadd, T.N. PHONIX: the algorithm, 24(4) 1990, p. 363]. While the articles are not available online, Phonix has been incorporated into a number of WAIS implementations, including freeWAIS, which is open-source and therefore freely available in source form.

The author has never experimented with Phonix, and therefore cannot write authoritatively regarding its performance; however being Soundex-based, it bears much of the same baggage which limits Soundex's performance. The paper by Zobel and Dart referenced at the end of this article performs some objective comparison of Phonix with several other algorithms, producing results which confirm this author's suspicions.

q-Gram based algorithms

A q-gram (sometimes called n-gram, primarily to confuse readers) in this context refers to a sequence of letters, q letters long, from a given word. For example, for q = 2, the word Nelson has the following q-grams:

NE EL LS SO ON

By comparison, Neilsen breaks down into these q-grams (q = 2):

NE EI IL LS SE EN

Clearly, Nelson and Neilsen share the NE and LS q-grams in common.

Various techniques have been developed which compare two words based on their q-grams. A simple example would be counting the number of q-grams two words have in common, with a higher count yielding a stronger match.

Technically, q-gram algorithms aren't strictly phonetic matching, in that they do not operate based on comparison of the phonetic characteristics of words. Instead, q-grams can be thought to compute the "distance", or amount of difference, between two words. Since phonetically similar words often have similar spellings, this technique can provide favorable results, yet it also successfully matches misspelled or otherwise mutated words, even if they are rendered phonetically disparate.

Edit distance based algorithms

Edit distance computes the "distance" between two words by counting the number of insert, replace, and delete operations required to permute one word into another. In general the fewer operations required, the closer the match. Some implementations assign varying scores to the insert, replace, and delete operations. Another common variation varies which operations are considered when computing the distance; for example, the replace operation may not be considered, thereby defining the edit distance solely in terms of insert and delete options.

One of the more popular algorithms in the edit distance class is Levenshtein distance (note the spelling; some references use the spelling 'Levenstein', which is technically incorrect).

As with q-gram algorithms, edit distance is not strictly phonetic, but often matches phonetically similar words due to similarities in spelling.

Proprietary algorithms

Several organizations offer data scrubbing, de-duplication, data normalizing, and merge-purge software, which implement some form of approximate text matching, albeit with varying degrees of success. The exact algorithms used are often proprietary, and seldom documented. The applicability of these systems to a particular problem must be carefully analyzed before adopting any one solution.

Other sources

The above list of alternative matching algorithms is far from complete, and provides only enough information to begin a search for suitable algorithms -- not to end one. This section lists other sources of information on phonetic matching, and approximate text matching, as well as links to additional Double Metaphone implementations.

Additional Double Metaphone implementations

Additional Phonetic Matching/Approximate text matching resources

Conclusion

This article concludes the article series on phonetic matching of name data with Double Metaphone. In this article, some of the other major techniques for phonetic matching are presented, as well as links to additional resources for the reader interested in further research on the subject. By this point, the reader should understand that no one solution exists to the problem of matching similar but not identical text, and that candidate solutions should be selected based on the reader's specific criteria.

That said, hopefully this article series will lead the reader to strongly consider Double Metaphone, due to its ease of use, respectable performance, and readily available implementations in a variety of languages.

History

  • 7-22-03 Initial publication
  • 7-29-03 Added reference to Richard Birkby's Soundex article
  • 7-31-03 Added hyperlinks between articles in the series
  • 8-03-03 Added additional resources pertaining to the Levenshtein Distance algorithm

Article Series

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

About the Author

Adam Nelson
Web Developer
United States United States
My name is Adam Nelson. I've been a professional programmer since 1996, working on everything from database development, early first-generation web applications, modern n-tier distributed apps, high-performance wireless security tools, to my last job as a Senior Consultant at BearingPoint posted in Baghdad, Iraq training Iraqi developers in the wonders of C# and ASP.NET. I am currently an Engineering Director at Dell.
 
I have a wide range of skills and interests, including cryptography, image processing, computational linguistics, military history, 3D graphics, database optimization, and mathematics, to name a few.

Comments and Discussions

 
GeneralSOUNDEX PinmemberRichardGrimmer3-Mar-06 1:54 
GeneralRe: SOUNDEX PinmemberAdam Nelson5-Mar-06 20:51 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Mobile
Web04 | 2.8.140415.2 | Last Updated 19 Mar 2007
Article Copyright 2003 by Adam Nelson
Everything else Copyright © CodeProject, 1999-2014
Terms of Use
Layout: fixed | fluid