13,248,927 members (44,651 online)
alternative version

#### Stats

45.1K views
51 bookmarked
Posted 31 Aug 2007

# A spelling corrector based on Bayes Theorem (PHP, C#)

, 10 Sep 2007
 Rate this:
This article introduces SpellingDice a spelling corrector based on Bayes Thorem and Dr. Peter Norvig's essay

## Intro

No matter you are a native English speaker or not, you may make some spelling mistakes every now and then. Isn't it neat, if the computer can provide some spelling suggestion for your unsure input? The most famous one should be Google Suggest. Every time you input some mistakes, there will be "Did you mean? XXX XXX XXX" on the next page. We know the whole system is based on Bayesian network, but how? On the other hand, if our own website or small size program could include this great function, that would be a valuable asset to attract people.

Peter Norvig, the Director of Google Research revealed the secrets of Google Suggest. I believe it's not the whole algorithm, but it's good enough for us to add spices ourselves and utilize in our own small-scale program.

I believe his essay is much better than my following descriptions. You can find his paper at

Here [English]

Here[Chinese]

Here[Japanese]

Here[Russian]

Suppose someone input "Majar", the program should be able to identify what the user actually meant was "major".

Ok, then how does it work?

We let P(s|i) denotes the probability of one suggested word s by providing an input word i.

So we have P(s|i)=P(s,i)/P(i)

P(i) is the probability of one random input string. As the user can input anything, so the probability is neglectable. We let it be 1. So we have:

P(s|i)=P(s,i)/1=P(s,i)

At the same time,

P(s,i)=P(i|s) * P(s) so we have :

P(s|i)=P(s,i)= P(i|s) * P(s)

Where P(s) is the probability of one word's probability. Question: How can we define this number?

Answer: find any book and read it through then make statics of each word's occurrence. Or download the API which includes the 5000 most commonly used English word. For instance, the word "the" must have a lot higher probability than "tae" [therapeutic arterial embolization].

Ok, what about P(i|s)? literately, P(i|s) is the probability of one input string by providing an recommend string. But it does not make sense. So it becomes the probability of one suggested string by calculating the editing distance.

Oh, edit distance. You don't know what? Read http://en.wikipedia.org/wiki/Edit_distance

The most famous one is Levenshtein Distance which is a build-in function in PHP. Another example is Hamming Distance in Information Theory. For instance D(01010, 10010)=2 . now you get what I mean by edit distance, right?

Ok, let's continue. They less edit distance you have, the higher probability you have.

For instance if someone inputs "hallo", "hello" has much higher P(i|s) than "helen". What if the word is correct? In other words, if one in the dictionary matches the input. Then the program should not return any suggesition. To save time complexity, ths program should break here.

Understand what I said? If not, again, please refer to his original article.

Ok. Let's take a closer look at "hallo". The program may find "Hallow" "Mallo"(if indeed, there is such word) and "Hello". They may have the same P(s) as we won't see too many people use tons of "hello" in their formal writings. And they have the same edit distance, so should they be placed in the order of alphabets? No! Absolutely not. Use your head to think! How many people would mis-type the first letter? Not so many~ So get rid of "mallo". Cool! Then you will find people tend to mis-type more vowels than consonants. The reason is because it's very obvious for human eyes to discover the typos of consonants. So they may correct them before submit to computer. Thus, "hello" must have much better P(i|s) "hallow". Is that clear?

The above are all in Dr. Peter Norvig's essay. But I don't think that's totally enough for determining P(i|s). So I observe how people type and find out some of us like me have giant fingers. So instead of type "hello" , we type "hrllo" or "hekko". If you take a closer look at your keyboard you will realize why. Because these 'e' and 'r' are near so are 'k' and 'l'. So here we have another standard, the P(i|s) will be increased if a letter is replaced by its neighbors on the keyboard. So when type "hrllo", "hello" must have higher probability than "hillo" (if indeed this is such word"). Ok, I hope you are not lost here. This is the idea of SpellingDice. If you don't want to know about the Algorithm, you can skip the following and go directly to download the APIs for PHP and C# at HERE.

## Examples

Try out the online SpellingDice Corrector PHP

## Algorithm

Array<string> ReturnSuggestion(string inputWord)

{

Array<string> result=null;

if( inputWord[0].ToLowerCase()<='z' && inputWord[0].ToLowerCase()<='a') // if continues only if it's a English letter

{

Array<string> Candidates = ReadFromXML(inputWord[0].ToLowerCase()); // To speed up the process, only read the dictionary that start with the first letter. For instance, if input is "hallo" only read 'h.xml'.

Maps<string, double> RefinedCandidates;

foreach(string candidate in Candidates)

{

if(LevenshteinDistance(candidate, inputWord)<=2) //only check those words who have an edit distance of 2 comparing to the input. Otherwise the candidate list will be long and it does not help that much.

{

RefinedCandidates[candidate]=CalculateConditionProbability(candidate, inputWord)*probability(candidate);

}

}

Sort(RefinedCandidates);

result=ConvertMapIntoArray(RefinedCandidates);

}

return result;

}

## APIs

PHP

SpellDice::ReturnSuggestion(\$inputString);

// no matter the input is a sentence or a word, it will return a suggested sentence or 5 suggested words. if no suggestion, null will be returned. NOTE: When the input is a word, the return value will be a string array, if it is a sentence, a sentence string will be returned.

SpellDice::SentenceProcess(\$inputString);

//only process a sentence, return a suggested sentence

SpellDice::WordProcess(\$inputString)

//only process a word, a array of at most 5 suggested words will be returned.

C#

namespace SpellCheck

public class SpellingDice

{

public SpellingDice() //intilize method

public List<string> <string> ReturnSuggestion(string input); ///no matter the input is a sentence or a word, it will return a suggested sentence or 5 suggested words. if no suggestion, null will be returned. NOTE: When the input is a word, the return value will be a string array, if it is a sentence, an array of only one element will be returned.

public List<string> WordProcess(string word) //only process a word, a array of at most 5 suggested words will be returned.

}

## A Snippet of Codes

PHP
```function GuessCandidates(\$compareString)
{
global \$vocabularyArray;
\$letter=\$compareString[0];
if(count(\$vocabularyArray[\$letter])==0)
{
\$parser = xml_parser_create();

xml_parser_set_option(\$parser,XML_OPTION_SKIP_WHITE,1); <o:p />

xml_parser_set_option(\$parser,XML_OPTION_CASE_FOLDING,0);
<o:p> </o:p>
\$data = implode("",file('dictionary/'.\$letter.'.xml'));
xml_parse_into_struct(\$parser,\$data,&\$d_ar,&\$i_ar)
or print_error();
\$counter=0;
for(\$i=0; \$i<count(\$i_ar['WordEntry']);
\$i++) {
if(\$d_ar[\$i_ar['WordEntry'][\$i]]['type']=='open') {
for(\$j=\$i_ar['WordEntry'][\$i]; \$j<\$i_ar['WordEntry'][\$i+1];
\$j++) {
if(\$d_ar[\$j]['tag'] == 'word')
{
\$word = \$d_ar[\$j]['value'];
}elseif(\$d_ar[\$j]['tag'] == 'probability')
{
\$probability = \$d_ar[\$j]['value'];
}
}
\$editDistance=SpellDice::StringDistance(\$compareString,\$word);
if(\$compareString==\$word)
return \$word;
else
if(\$editDistance<=20)
\$CandidateArray[\$compareString][\$word]=\$probability*(20/\$editDistance);
}
}
xml_parser_free(\$parser);
}
arsort(\$CandidateArray[\$compareString]);
if(count(\$CandidateArray[\$compareString])>5)
\$CandidateArray[\$compareString]=array_slice(\$CandidateArray[\$compareString],
0, 5);
\$result=array();
foreach (\$CandidateArray[\$compareString]
as \$key=> \$value)
{
array_push(\$result,\$key);
}
return \$result;     <o:p />

}
```

C#

```public List<string> WordProcess(string
word)

{
List<string> result = new List<string>();
SortedDictionary<double, string>
listOfCandidates = new SortedDictionary<double, string>();
if (word[0] <= 'z' &&
word[0] >= 'a')
{
if (candidatesList[word[0]].Count
== 0)
foreach (Dictionary
DictionaryItem in candidatesList[word[0]])
{
if
(DictionaryItem.word == word)
{
return result;
}
else
{
int LD = Levenshtein(word, DictionaryItem.word);
if (LD <= 2)
{
double editDistance = StringDistance(word, DictionaryItem.word,
LD);

listOfCandidates[1 / (DictionaryItem.probability * (1 / editDistance))] = DictionaryItem.word;

}                }   } <o:p />
foreach (KeyValuePair<double,string> kvp in listOfCandidates)
{
}
// result.Reverse();
if (listOfCandidates.Count > 5)
result.RemoveRange(5,  listOfCandidates.Count - 5);
}
else
return result;
} ```

## Conclusion

The performance of this algorithm is actually very good as we have some heuristic to ensure the least loops.

It needs O(n) to read the dictionary

Then O(n) to loop the dictionary

O(m) to check every word( m) is the length of words)

O(nlogn) to sort

So in totally we are looking at a O(n)+O(nlogn)+o(n*m) algorithm.

The PHP version is slightly faster than C# as it has some neat build-in functions. C# don't have a auto-sorted data structure to handle the array (SortedDictionary only sort the keys not value, and it's from low to high which is something we want in opposite way).

The program also largely depends on your dictionary. My dictionary included has 5000 English words. But it's not enough. You may want to expand the dictionary yourself. I made them 26 XML files. You can do it more yourself. This program could deal with regular spelling corrections, but I cannot guarantee it provides the same effects as Google Suggest does as they have much larger data stored and other advantages.

A list of licenses authors might use can be found here

## Share

 Web Developer Horizon Ideas United States
My name is Jia Chen and I want you tell you about my childhood dream: being a problem-solver. My mom told me it was silly because it wasn't really a profession. Through the last decade, I have been a software engineer, a product manager, a repetitive student, a management consult and an entrepreneur. They appear far from my childhood dream. But I still think I am living it. Because the essence of it is to find problems and solve problems. Some times I may not solve new problems, but I always want to solve old problems in a new way.

## You may also be interested in...

 First Prev Next
 A spelling corrector based on Bayes Theorem (PHP, C#) I can`t Find API ? kaival0075-Jun-15 20:06 kaival007 5-Jun-15 20:06
 thank quocvietit7-Apr-08 7:50 quocvietit 7-Apr-08 7:50
 Very interesting Santiago Corredoira19-Sep-07 1:08 Santiago Corredoira 19-Sep-07 1:08
 Wrong link I believe kishkin9-Sep-07 21:13 kishkin 9-Sep-07 21:13
 Re: Wrong link I believe kishkin9-Sep-07 21:14 kishkin 9-Sep-07 21:14
 Re: Wrong link I believe Jia.C10-Sep-07 5:44 Jia.C 10-Sep-07 5:44
 Eat your own dog food nsoonhui3-Sep-07 16:47 nsoonhui 3-Sep-07 16:47
 Re: Eat your own dog food Jia.C3-Sep-07 17:34 Jia.C 3-Sep-07 17:34
 Re: Eat your own dog food ednrgc5-Sep-07 5:46 ednrgc 5-Sep-07 5:46
 Re: Eat your own dog food Jia.C5-Sep-07 6:43 Jia.C 5-Sep-07 6:43
 Re: Eat your own dog food nsoonhui5-Sep-07 15:51 nsoonhui 5-Sep-07 15:51
 Re: Eat your own dog food Jia.C5-Sep-07 20:14 Jia.C 5-Sep-07 20:14
 Re: Eat your own dog food nsoonhui5-Sep-07 20:50 nsoonhui 5-Sep-07 20:50
 Re: Eat your own dog food Jia.C5-Sep-07 20:52 Jia.C 5-Sep-07 20:52
 Re: Eat your own dog food ednrgc6-Sep-07 2:57 ednrgc 6-Sep-07 2:57
 Re: Eat your own dog food torial22-Jun-09 7:58 torial 22-Jun-09 7:58
 Re: Eat your own dog food T1TAN10-Sep-07 22:53 T1TAN 10-Sep-07 22:53
 Great Article merlin9813-Sep-07 4:44 merlin981 3-Sep-07 4:44
 ASpell and PSpell Vasudevan Deepak Kumar1-Sep-07 3:34 Vasudevan Deepak Kumar 1-Sep-07 3:34
 Re: ASpell and PSpell Jia.C1-Sep-07 10:19 Jia.C 1-Sep-07 10:19
 Last Visit: 31-Dec-99 19:00     Last Update: 18-Nov-17 15:23 Refresh 1

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Web01 | 2.8.171114.1 | Last Updated 10 Sep 2007