|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Announcements
Services
Chapters
Feature Zones
|
AbstractSimple information searches -- name lookups, word searches, etc. -- are often implemented in terms of an exact match criterion. However, given both the diversity of homophonic (pronounced the same) words and names, as well as the propensity for humans to misspell surnames, this simplistic criterion often yields less than desirable results, in the form of reduced result sets, missing records that differ by a misplaced letter or different national spelling. This article series discusses Lawrence Phillips' Double Metaphone phonetic matching algorithm, and provides several useful implementations which can be employed in a variety of solutions, to create more useful, effective searches of proper names in databases and other collections. IntroductionWhen building solutions which perform searches of textual data given a search criteria, it is often desirable to account for the way in which the search terms are pronounced, compared to the pronunciation of the data being searched, producing as a result those records, which "sound like" the search terms. This type of comparison is known as "phonetic matching", and the algorithms by which it is performed are called "phonetic matching algorithms" or sometimes "phonetic encoding algorithms." A significant amount of effort has been expended in search of better phonetic matching algorithms, with mixed success. One of the oldest phonetic matching algorithms, and likely the most familiar, is the Soundex algorithm, which was patented in the early 20th century, as an algorithm used on mechanical punch-card based machines. Soundex produces a key, based on a simplistic computation, which is supposed to represent the pronunciation of the original word -- or at least the first part of the word. If one has ever implemented Soundex in a solution with more than a few records, one is aware of its limitations. It fails to match similar-sounding words, and proceeds to match inappropriate words. While Soundex is simple and fast, it is sadly lacking as a general purpose phonetic matching algorithm. In 1990, Lawrence Phillips published the Metaphone algorithm in Computer Language magazine. Phillips' algorithm took into account several English pronunciation rules, yielding a more reliable phonetic key, representing the sound of a word, in particular a proper name. While an improvement over Soundex, Metaphone still failed to match some obviously homophonic words, such as "Bryan" and "Brian." Finally, in the June 2000 issue of C/C++ User's Journal, Phillips published the Double Metaphone phonetic matching algorithm. Double Metaphone implemented additional heuristics to correct previous shortcomings of Metaphone, and produced a second, "alternate" key, which represented the name's native, as opposed to American, pronunciation. With these enhancements, Phillips had created a phonetic matching algorithm reliable enough for main stream use. This article series discusses the practical use of the Double Metaphone algorithm to phonetically search name data, using the author's implementations written for C++, COM (Visual Basic, etc.), scripting clients (VBScript, JScript, ASP), SQL, and .NET (C#, VB.NET, and any other .NET language). For a discussion of the Double Metaphone algorithm itself, and Phillips' original code, see the link above. Part I introduces Double Metaphone and describes the author's C++ implementation and its use. Part II discusses the use of the author's COM implementation from within Visual Basic. Part III demonstrates use of the COM implementation from ASP and with VBScript. Part IV shows how to perform phonetic matching within SQL Server using the author's extended stored procedure. Part V demonstrates the author's .NET implementation. Finally, Part VI closes with a survey of phonetic matching alternatives, and pointers to other resources. If one is only interested in how to implement phonetic matching for a particular language, one may still benefit from the background information in this part of the series. BackgroundPhillips' original Double Metaphone implementation took the form of a C++ class, The Metaphone keys are four letters long (three for short words). As a consequence, the Metaphone keys represent the pronunciation of part of the word, specifically the first part. Empirical tests by Phillips and the author have confirmed that four letters seems to be the "magic length", balancing tolerance with relevance of matches. To compare two words for a phonetic match, one compares the primary and alternate keys of the first word with the primary and alternate keys of the second. The two words are said to match phonetically if any one (or more) of the four combinations match. Relative match strength can be computed with the following table:
Depending upon the implementation requirements, results may be filtered to include only certain match levels. The improved implementationThe implementation presented in this article takes the form of a C++ template, taking a single template parameter, which specifies the key length to produce. As stated above, four letters seems ideal, but the template allows the programmer some experimentation. This implementation has several advantages over Phillips' original code: First, it is not based on MFC; therefore the MFC runtimes are not required to use the template. In fact, it is not based on any Windows-specific elements; therefore it should be usable with nothing more than a modern compiler with template support. Second, it is not derived from a string class. In the author's opinion, the entity being represented by the Double Metaphone implementation class should be the Metaphone keys. This leads to the next advantage… Third, it implements the C++ Fourth, due to minor optimizations and the fully inline nature of templates, this implementation computes keys approximately 2x faster than Phillips' original implementation, based on simplistic tests. To use this implementation, simply include the header file: #include "DoubleMetaphone.h" and instantiate a class based on the template by specifying the key length, then pass the word to be keyed to the constructor or the DoubleMetaphone<4> mphone("Nelson"); //OR: DoubleMetaphone<4> mphone; mphone.computeKeys("Nelson"); To obtain the primary or alternate keys from an instance, call DoubleMetaphone<4> mphone("Nelson"); cout << mphone.getPrimaryKey() << endl; cout << mphone.getAlternateKey() << endl; Note that if no alternate key is present for the word associated with the instance, As a convenience to users of the class, the To compare two words for phonetic similarity using the four-way comparison described in the previous section, simply instantiate two DoubleMetaphone<4> mphone1("Nelson"); DoubleMetaphone<4> mphone2("Neilsen"); if (mphone1 == mphone2) { cout << "Nelson = Neilsen" << endl; } else { cout << "Nelson != Neilsen" << endl; } For convenience, the header file includes a type definition of the class DoubleMetaphone4 mphone1("Nelson"); DoubleMetaphone4 mphone2("Neilsen"); if (mphone1 == mphone2) { cout << "Nelson = Neilsen" << endl; } else { cout << "Nelson != Neilsen" << endl; } Practical use of the implementationThe previous section provided a very brief discussion of the implementation's interface, and how to compute Metaphone keys for a word, while this section demonstrates the construction of a phonetic search application, Word Lookup, which retrieves all like-sounding names, given a search name.
While this solution is hardly representative of a production-grade system, it is demonstrative of the key principles of phonetic matching w/ Double Metaphone, and provides a test environment within which, various parameters may be tested and tuned to a particular environment. For example, try varying the length of Metaphone keys produced, and note what effect this has on the match results. Or, try using a list of words instead of proper names; does the phonetic matching perform as well for general words as it does for proper names? Each application of phonetic matching is different; therefore it is reasonable to expect that some tuning will be required for maximal performance in a given application. In practice, when implementing a solution which calls for phonetic matching, one does not find oneself comparing two words for similarity as in the above code snippets. Rather, a collection of records, consisting perhaps of last names or some other candidate data, must be searched in its entirety, producing one or more matching records. For this article, the aforementioned collection will be an STL The STL Before searching can take place, the container must be populated. To do this, the words are read from the file (one per line), each word is passed to the typedef std::multimap<string, string> WordMapType; WordMapType wordMap; DoubleMetaphone<METAPHONE_KEY_LENGTH> mphone; char line[100]; while (!file.eof()) { //Read a word from the file file.getline(line, 100); //Compute the Metaphone keys for the word mphone.computeKeys(line); //Add a string object containing the word to the map, //with the primary and alternate Metaphone keys as map keys string word = line; string key = mphone.getPrimaryKey(); wordMap.insert(WordMapType::value_type(key, word)); if (mphone.getAlternateKey() != NULL) { key = mphone.getAlternateKey(); wordMap.insert(WordMapType::value_type(key,word)); } } file.close(); In the above snippet, One must keep in mind that the typedef std::multimap<string, string> WordMapType; typedef std::set<string> WordListType; void phoneticMatch(WordMapType& words, const string& searchWord, WordListType& matchingWords) { //Compute Metaphone keys for search word DoubleMetaphone4 searchKey(searchWord.c_str()); //Search map with primary Metaphone key string search2 = searchKey.getPrimaryKey(); cout << "Searching for [" <<searchWord.c_str() << "]" << endl; for (WordMapType::iterator iter = words.lower_bound(search2); iter != words.upper_bound(search2); iter++) { matchingWords.insert((*iter).second); } if (searchKey.getAlternateKey() != NULL) { //Alternate key computed for search word, so //search map with alt key as well string search3 = searchKey.getAlternateKey(); for (iter = words.lower_bound(search3); iter != words.upper_bound(search3); iter++) { matchingWords.insert((*iter).second); } } } The above snippet contains the implementation of Clearly, the search process is remarkably simple, considering the sophisticated results being produced. First, a The final component of Word Lookup which will be examined here is the int computeMatchLevel(const string& searchWord, const string& candidateWord) { DoubleMetaphone4 searchWordMphone(searchWord.c_str()); DoubleMetaphone4 candidateWordMphone(candidateWord.c_str()); if (strcmp(searchWordMphone.getPrimaryKey(), candidateWordMphone.getPrimaryKey()) == 0) { //Primary-Primary match, that's level 1 (strongest) return 1; } if (searchWordMphone.getAlternateKey() != NULL) { if (strcmp(searchWordMphone.getAlternateKey(), candidateWordMphone.getPrimaryKey()) == 0) { //Alternate-Primary match, that's level 2 (normal) return 2; } } if (candidateWordMphone.getAlternateKey() != NULL) { if (strcmp(searchWordMphone.getPrimaryKey(), candidateWordMphone.getAlternateKey()) == 0) { //Primary-Alternate match, that's level 2 (normal) return 2; } } if (searchWordMphone.getAlternateKey() != NULL && candidateWordMphone.getAlternateKey() != NULL) { if (strcmp(searchWordMphone.getAlternateKey(), candidateWordMphone.getAlternateKey()) == 0) { //Alternate-Alternate match, that's level 3 (minimal) return 3; } } return 0; } When experimenting with the Word Lookup application, one often encounters matches of dubious validity; however these matches almost always have a match score of 3, indicating that the Double Metaphone algorithm considers them only distant matches. This is encouraging to the practitioner, as it means the width of the net cast in the search for phonetic matches, can be tuned by limiting results to a certain match level, with reasonable assurance that valid matches will not be rejected. The Unsigned Short optimizationWhile the implementation techniques in the previous sections are quite effective and reasonably efficient, the astute reader may be troubled by the significant number of string comparisons implicitly performed by the Fortunately, an alternative exists. In his 2000 CUJ article, Phillips proposed representing four-character Metaphone keys as four, four-bit nibbles, together forming a 16-bit unsigned short. Since a four-bit nibble can represent values from hex 0 to hex f (0-15 decimal), and the Metaphone algorithm produces keys from only 12 characters, one could assign each character a numerical equivalent, thereby representing a key as a "string" of nibbles, packed into 16-bit integer. While Phillips did not implement this optimization in his The use of
While not discussed here, the source archive includes a version of the Word Lookup sample which uses ConclusionThis article has explored the phonetic matching problem, introduced the Double Metaphone algorithm as a candidate solution, and demonstrated the use of the This article does not address these questions, primarily because the objective was simply to introduce phonetic matching, Double Metaphone, and the implementation in C++. The reader is encouraged to continue reading this article series, for further exploration of Double Metaphone applications, as well as examples of phonetic matching using relational databases, and other programming languages. Part II introduces a COM component implementation of Double Metaphone, which is callable from Visual Basic, Visual FoxPro, Delphi, and any other COM-compatible language. Part II also introduces phonetic matching against a relational database, and includes a sample Visual Basic application to that end. Part III demonstrates using the COM implementation from ASP and VBScript, including phonetic searching against a database. Part IV introduces the author's SQL Server extended stored procedure, which enables computation of Double Metaphone keys from within SQL. Part IV also discusses optimizations of relational database phonetic matching solutions. Part V explores the author's .NET Double Metaphone implementation, and includes a sample which performs phonetic searches against a relational database. Finally, Part VI concludes the discussion of Double Metaphone with an examination of alternate phonetic matching techniques, and pointers to other resources and Double Metaphone implementations. History
Article Series
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||