Introduction

This package lets you spell check a word and suggest correctly spelled words the user might have meant when a misspelled word is encountered (spell guessing). It also supports a user's personal auxiliary dictionary. An American English lexicon is provided, and instructions for creating a lexical database in other languages are given. A guide is included if you wish to port this spelling checker code to another operating system. The code has been around for many years and has proven itself to be quite fast and stable. I call this spelling checker EDX.

EDXSPELL.DLL

The edxspell.dll contains the following four routines:
edx$dic_lookup_word

You pass a word to

    int edx$dic_lookup_word(char *spellwordptr, char *errbuf, int errbuflen, char *Dic_File_Name, char *Aux1_File_Name);

char *spellwordptr
The word you want to check the spelling of (pointer to ASCIIZ string). This string should contain just the word. Any leading or trailing spaces will not be trimmed off for you. It's up to you to trim off any leading and trailing spaces. The case of the word (uppercase or lowercase) is not important.

char *errbuf
int errbuflen
You provide an error buffer where error messages can be written to. Provide a pointer to a buffer, and provide the length of the buffer. One instance where you will get an error message is if the EDX lexical database file (i.e., the "dictionary") can't be found. The error message returned in

    "Error opening C:\Program Files\Multi-Edit 2006\EDXDIC.DIC
    Error is: 2: The system cannot find the file specified."

It is up to you to display the error message to the user.

char *Dic_File_Name
ASCIIZ string containing the filename of the lexical database file. This is usually the full path/filename, e.g.,

If you wish, you may rename EDXDIC.DIC to something else (perhaps EDX_DICTIONARY.DAT). Just specify the new name here.

On the first call to

char *Aux1_File_Name
ASCIIZ string containing the filename of the user's personal auxiliary dictionary file. You may use a null string here if the user does not have a personal auxiliary dictionary file. This is usually the full path/filename, e.g.,

A user's personal auxiliary dictionary file is a plain text file with one word per line. The user may edit this file with an ordinary text editor to add or remove words. You may also use the

On the first call to

Return values:

Add the following three defines to your code:

    #define EDX__WORDFOUND 1
    #define EDX__WORDNOTFOUND 2
    #define EDX__ERROR 4

These are the three possible return values. Note there are two underscore characters after EDX.

If the return value is

Discussion:

On the first call to

Notes:
edx$spell_guess

Every call to

    int edx$spell_guess(char *guessword, char *errbuf, int errbuflen);

char *guessword
Pointer to a buffer which you supply to receive the guess-word. The buffer should be large enough to hold a 31 character word (don't forget the trailing

char *errbuf
int errbuflen
You provide an error buffer where error messages can be written to. Provide a pointer to a buffer, and provide the length of the buffer. (I suggest an error buffer size of around 400 characters.) The only instance I know of where you will get an error message is if the EDX lexical database file (i.e., the "dictionary") is located on a remote computer and the network connection to that remote computer is lost. In this case, the error message returned in

    "EDXspell.dll encountered
    error EXCEPTION_IN_PAGE_ERROR. This error can
    occur if the EDX dictionary file is on a
    remote computer and the network connection
    to that remote computer is lost."

It is up to you to display the error message to the user. (You don't have to display it. You could just treat this error as if

Return values:

EDX__WORDFOUND - guessword is filled with another guess word.
EDX__WORDNOTFOUND - all out of guesses.
EDX__ERROR - EXCEPTION_IN_PAGE_ERROR (see above).
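
The three return values above suggest a simple loop: keep asking for guesses while the status is EDX__WORDFOUND, then look at the final status to tell "out of guesses" from an error. The following is only a sketch under stated assumptions. The names collect_guesses, guess_fn, demo_guess, GUESS_BUF_LEN and ERR_BUF_LEN are hypothetical and not part of edxspell.dll, and the guess routine is passed in as a function pointer so the sketch compiles without the DLL.

```c
#include <string.h>

#define EDX__WORDFOUND    1
#define EDX__WORDNOTFOUND 2
#define EDX__ERROR        4

#define GUESS_BUF_LEN 32   /* 31-character word plus one terminating byte */
#define ERR_BUF_LEN   400  /* error buffer size suggested in the text */

/* Hypothetical type with the same parameter list as edx$spell_guess. */
typedef int (*guess_fn)(char *guessword, char *errbuf, int errbuflen);

/* Copies up to max_guesses guess words into out[][].
   Returns the number collected, or -1 if the guess routine
   returned EDX__ERROR (the message is then in errbuf). */
int collect_guesses(guess_fn guess, char out[][GUESS_BUF_LEN],
                    int max_guesses, char *errbuf, int errbuflen)
{
    char word[GUESS_BUF_LEN];
    int count = 0;
    int status = EDX__WORDNOTFOUND;

    while (count < max_guesses &&
           (status = guess(word, errbuf, errbuflen)) == EDX__WORDFOUND) {
        word[GUESS_BUF_LEN - 1] = '\0';   /* defensive termination */
        strcpy(out[count], word);
        count++;
    }
    return (status == EDX__ERROR) ? -1 : count;
}

/* Stand-in for the DLL routine, for demonstration only: hands out a
   fixed list of words, then reports that it is out of guesses. */
int demo_guess(char *guessword, char *errbuf, int errbuflen)
{
    static const char *words[] = { "spell", "spill" };
    static int next = 0;

    (void)errbuf;
    (void)errbuflen;
    if (next >= 2)
        return EDX__WORDNOTFOUND;
    strcpy(guessword, words[next++]);
    return EDX__WORDFOUND;
}
```

In a real program you would pass the DLL's guess routine instead of demo_guess. Capping the number of guesses is a design choice of this sketch, not something the DLL requires.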
Here is an outline of how

The code takes care not to guess the same word twice.

edx$add_persdic

Adds a word to the user's personal auxiliary dictionary. The user's personal auxiliary dictionary is a plain text file with one word per line. The user may also edit this file with an ordinary text editor.

char *newword
The word you want to add to the user's personal auxiliary dictionary (pointer to ASCIIZ string). Leading and trailing spaces are trimmed and the word is lowercased before it is added to the file. The resulting word can be no longer than 31 characters.

char *errbuf
int errbuflen
You provide an error buffer where error messages can be written to. Provide a pointer to a buffer, and provide the length of the buffer. It is up to you to display the error message to the user. (I suggest an error buffer size of around 400 characters.)

Return values:
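
The handling described for newword (trim spaces, lowercase, at most 31 characters) can be mirrored in your own code, for example to tell the user in advance that a word is too long. This is a minimal sketch and not the DLL's own code. The helper name normalize_word and the constant EDX_MAX_WORD_LEN are hypothetical, rejecting an empty result is an assumption of this sketch, and tolower in the default C locale only lowercases ASCII letters, so extended ANSI characters pass through unchanged.

```c
#include <ctype.h>
#include <string.h>

#define EDX_MAX_WORD_LEN 31  /* maximum word length stated in the text */

/* Hypothetical helper: writes the trimmed, lowercased form of src into
   dst, which must hold at least EDX_MAX_WORD_LEN + 1 bytes.
   Returns 1 on success, or 0 if the trimmed word is empty or longer
   than EDX_MAX_WORD_LEN (dst is left untouched in that case). */
int normalize_word(const char *src, char *dst)
{
    const char *start = src;
    const char *end;
    size_t len, i;

    while (*start == ' ')                   /* skip leading spaces */
        start++;
    end = start + strlen(start);
    while (end > start && end[-1] == ' ')   /* drop trailing spaces */
        end--;

    len = (size_t)(end - start);
    if (len == 0 || len > EDX_MAX_WORD_LEN)
        return 0;

    for (i = 0; i < len; i++)
        dst[i] = (char)tolower((unsigned char)start[i]);
    dst[len] = '\0';
    return 1;
}
```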
edx$dll_version

Returns a long string containing information. The string may look something like:

    EDX Spelling Checker file edxspell.dll version 7.2 November 26, 2006.
    EDX dictionary file is version 5 (Extended ANSI character compatible)
    There are no extended ANSI characters in the dictionary.
    Extended ANSI Guessing is: OFF.
    User's personal auxiliary dictionary file is: EDXMYAUX1DIC.TXT

char *buf
int buflen
You provide a buffer where the version message string can be written to. Provide a pointer to a buffer, and provide the length of the buffer. (I suggest a buffer size of around 550 characters.)

Running the demo

To try the demo:
Using the code

If you are spell checking a buffer, then the general outline of the code you would write is:
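
One part of spell checking a buffer that is left to the caller is pulling out each word, with no surrounding spaces, before it is passed to the lookup routine. A minimal sketch of that step follows. The helper names next_word and is_word_char are hypothetical, and the rule that a word is a run of ASCII letters and apostrophes is an assumption of this sketch, not a rule defined by EDX.

```c
#include <ctype.h>
#include <stddef.h>

#define EDX_MAX_WORD_LEN 31  /* maximum word length stated in the text */

/* Assumption: words consist of ASCII letters and apostrophes. */
static int is_word_char(char c)
{
    return isalpha((unsigned char)c) || c == '\'';
}

/* Hypothetical helper: copies the next word at or after *pos into word
   (at least EDX_MAX_WORD_LEN + 1 bytes) and advances *pos past it.
   Returns 1 if a word was found, 0 at the end of the buffer.
   Runs longer than EDX_MAX_WORD_LEN are truncated in this sketch;
   a real caller might prefer to skip them. */
int next_word(const char *buf, size_t *pos, char *word)
{
    size_t i = *pos;
    size_t n = 0;

    while (buf[i] != '\0' && !is_word_char(buf[i]))
        i++;                               /* skip separators */
    if (buf[i] == '\0') {
        *pos = i;
        return 0;
    }
    while (is_word_char(buf[i])) {         /* stops at '\0' too */
        if (n < EDX_MAX_WORD_LEN)
            word[n++] = buf[i];
        i++;
    }
    word[n] = '\0';
    *pos = i;
    return 1;
}
```

Calling next_word in a loop with the same pos variable walks the buffer one word at a time, and each word it returns can be handed to the lookup routine as is.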
Here is a pseudo code example which spell checks

    // edxaux1 (name assumed here) is the Aux1 file name required by the
    // five-argument signature; it may be a null string.
    status = edx$dic_lookup_word(testword, errbuf, errbuflen, edxdic, edxaux1);
    switch (status) {
    case EDX__WORDFOUND:
        // Good. Word is correctly spelled.
        break;
    case EDX__WORDNOTFOUND:
        // Word misspelled. Let's spell guess.
        while (EDX__WORDFOUND == (guess_status = edx$spell_guess(ResultBuf, errbuf, errbuflen))) {
            // ResultBuf contains a guess word.
            <Display guess word to user.>
        }
        // When we drop out here, guess_status is either EDX__WORDNOTFOUND
        // (no more guesses) or EDX__ERROR (which is very unlikely).
        if (guess_status == EDX__WORDNOTFOUND) {
            // No more guesses.
        } else if (guess_status == EDX__ERROR) {
            <Bad. Display error message and stop spell checking.>
        }
        break;
    case EDX__ERROR:
        <Bad. Display error message and stop spell checking.>
        break;
    }

For a simple working example, see file "CallEdxSpell.cpp" in the "CallEdxSpell Source" folder of the source download.

Considerations when adding a spelling checker to your program

Parsing off the next word

It's up to you to provide words to

User's personal auxiliary dictionary

The code now supports an optional user's personal auxiliary dictionary (also called the "User's Aux1 dictionary" or the "User's Aux1 Lexical Database"). This is a plain text file with one word per line. The contents of the file are loaded on the first call to

Words in the user's personal auxiliary dictionary are checked when spell checking a word, and when spell guessing.

Keep track of spelling corrections

A further enhancement would be to keep track of spelling corrections as they are made. If the user misspells a word and selects a correctly spelled word from your list of guess words, you can save that correction. If you encounter the same misspelled word again, offer to make the same change. (The EDX spelling checker does not do this for you.)

Performance speed

This spelling checker has proven itself to be quite fast. The secret to the speed of the EDX Spelling Checker is to keep page faults down to a minimum.
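
To make the page-fault point concrete, the sketch below counts how many distinct memory pages a series of reads would touch. When a file is mapped into memory and not yet resident, each distinct page touched is a potential page fault, so reads that cluster together touch fewer pages. This is only an illustration. The function name pages_touched and the constant DEMO_PAGE_SIZE are hypothetical, the 4096-byte page size is an assumption, and the sketch does not reflect the actual layout of the EDX lexical database.

```c
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u  /* assumed page size for this illustration */

/* Counts how many distinct pages a sequence of byte offsets falls on.
   Uses a simple quadratic scan, which is fine for short sequences. */
size_t pages_touched(const size_t *offsets, size_t count)
{
    size_t distinct = 0;
    size_t i, j;

    for (i = 0; i < count; i++) {
        size_t page = offsets[i] / DEMO_PAGE_SIZE;
        int seen = 0;

        for (j = 0; j < i; j++) {
            if (offsets[j] / DEMO_PAGE_SIZE == page) {
                seen = 1;      /* an earlier read already hit this page */
                break;
            }
        }
        if (!seen)
            distinct++;
    }
    return distinct;
}
```

Four reads at offsets 0, 100, 2000 and 4000 all land on one page, while four reads at offsets 0, 5000, 10000 and 20000 land on four different pages.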
The design of the EDX lexical database file reflects this goal by keeping memory reads near each other. For more information about the layout of the lexical database file and how to optimize it, see the file "Lexical Database File Layout.txt" in the source download. For more information about what page faults are and why a good understanding of them is so crucial to program execution speed, see the file "PAGE_FAULTS_AND_ARRAY_ADDRESSING.TXT" in the "Documentation" folder of the source download.

Loading the lexical database file

The other secret to speed is to map the lexical database file ("dictionary") into virtual memory instead of reading it in. The dictionary could be loaded by first allocating enough memory to hold the file, and then reading the entire file into the allocated memory. This would be quite slow due to the large size of the database. Also, a user's page file quota limits the total amount of memory a user may allocate, and the database file needs a considerable amount of memory. Instead of this, we use system service calls to the Microsoft Windows Operating System supplied functions

Now, when the program attempts to read some of the dictionary that's in that memory range, a page fault will occur if that page is not already in memory, and that page is automatically read into memory. And since we're not using the system paging file for this, the user's page file quota is not affected. It also helps if you defragment the file EDXDIC.DIC, since it's being used as a paging file.

English Lexical Database Provided

An American English lexical database is provided with over 90,000 words. Every effort has been made to ensure all the words are correctly spelled.

Other Languages

You may also create a lexical database in another language, if you wish. All you have to do is supply a file containing all the words in whatever language you want.
The only limitations are that the maximum word length is 31 characters, and that the file must be sorted by byte value. (This is the usual sort order, where we pay attention to the unsigned value of each byte, rather than what character the byte represents.)

Below are links to a few places where you can get lexicons. Lexicons (files that contain a list of words) for other languages may be found at SourceForge. (See SCOWL - Spelling Checker Oriented Word Lists.) Be forewarned, the lexicons at the above web site contain a lot of words which aren't to be found in any dictionary! (They contain a lot of misspelled words, or words which should actually be hyphenated or two separate words.) A lot of work has gone into ensuring the words in my lexicon, EDX_DICTIONARY.TXT, are correctly spelled. Another site that has lexicons in various languages is: WinEdt Dictionaries.

For more info on creating the EDX lexical database file and optimizing your EDX_COMMONWORDS.TXT file, see "0Readme EDXBuildDictionary.txt" in the "Build EDX Dictionary Source" folder of the source download.

Update: The code has been updated to handle all ANSI characters 128 - 255. So it can now handle characters such as:

š œ ž ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ

See files "EDX Using Extended ANSI Characters.txt" and "EDX lowercasing extended letters.htm" in the Documentation folder for more information about this.

Other operating systems

See the file "Porting EDXspell to other operating systems.txt" in the "Documentation" folder of the source download if you wish to port this code to an operating system other than Microsoft Windows. (The code was originally written for the VMS operating system, and later modified to work with Microsoft Windows.)

History
Glossary