|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
|
Announcements
Want a new Job?
Chapters
Services
Feature Zones
|
IntroductionWe all make the odd tryping error, so spull checkers are an essential part of any application that involves the user editing large amounts of text. (OK, I promise, no more corny mis-spellings!) This article describes the spell checking engine that I developed for the editing-intensive application I am currently working on. The spell checking engine described is simply that - it checks words, and indicates whether or not they are spelt correctly. It is designed to sit behind a front end which allows the user to control spell checking, and as such, the design and operation of the user interface is beyond the scope of this article. Dictionary StructureWords are stored in a tree, with each node representing one character. Each node can have two children - a Next node and an Alternate node. The diagram below shows a dictionary containing three words, "horse", "horses" and "hose". Notice how we reuse nodes where the words start with the same common substring.
Storing the words in this manner is actually quite efficient in terms of storage space. When compared with an ASCII text file containing one word per line, the corresponding dictionary file is often considerably smaller, particularly when there are a large number of words. Adding and Removing WordsTo add a word, we traverse the tree to find as much of the word as possible, and then the remaining letters of the word are inserted into the tree. The best way to show this is by example, so say we wish to insert the word "hosts" into the dictionary above. We would perform the following steps :
Removing words is considerably easier. We traverse the tree to find the word, and then set the terminating character to something other than the null terminator. Although this does not reduce the size of the dictionary, it does mean that the word will no longer be found. Checking WordsTo check a word, we traverse the tree in much the same way as we did when we inserted a word. However, if we come to the point where there is no node with a matching character, the word is not in the dictionary and we end the search. Getting SuggestionsWhen getting suggestions for a word, we consider four types of error that the user could have made. We assume only one error has been made in a word - if we accounted for more errors, our suggestion list could grow to something which is too large to be of use to the user. The four errors that we account for are:
Because traversal of the tree is so fast, we individually check for each of the combinations if characters that make up each of the errors. So, when finding suggestions for the extra character error, we will try and find the words "orpse", "hrpse", "hopse", "horse", "horpe" and "horps". Any matches that we find are added to the suggestions list. Similarly, we try all the combinations of transposed characters to find suggestions. The missing and incorrect character errors use pattern matching to find suggestions. For the missing character errors, we insert a wildcard character at each character position, and see what words we can find that match the search string. When looking for incorrect character errors, we replace each character in turn with the wildcard character. So, if we were getting suggestions for the word "hrse", we would get the suggestions "horse" (missing character 'o') and "hose" (incorrect character 'r'). CaseWords are inserted into the dictionary in lower case, unless they contain upper case characters. So, place names will be inserted in "correct" case. When we try to find words, we always start by trying to get an exact match. If we do not, we then see what the best match we can get is - either matching apart from a capitalised first letter, matching but with mixed case, or no match at all. If the match indicates a capitalised first letter, it is up to the application to determine whether or not this is valid. For example, this may be treated as valid for the first letter of a sentence, but invalid elsewhere. When getting suggestions for a mis-spelt word, we get words regardless of case. However, there is also an option to only return words where they differ from the search word only by case. This may be used when the search indicated a valid word with incorrect case. Source CodeAll of the code is contained in the class CDictionary. The class CNode represents a node in the tree. CDictionary offers methods to find words, obtain suggestions, load and save dictionaries, and populate the dictionary from an ASCII text file (with one word per line). Demo ApplicationThe demo application demonstrates the functions offered by the dictionary class. One neat "bonus" feature is pattern matching. Entering a word to find, with a question mark for wildcard characters allows all matching words to be found, a handy tool for any crossword fans! ConclusionInitial use of the dictionary has found that it has been suitably fast both when loading the dictionary of approximately 220,000 words, and when searching for words and suggestions. There are no future enhancements planned at the moment - time will tell what changes are needed!
|
||||||||||||||||||||||