
SmartLexicon

By Giannakakis Kostas, 30 Aug 2006

Figure 1: Main Screen

Application features

SmartLexicon is a multi-lingual dictionary engine. I created this program because I wanted an effective tool for browsing the dictionary databases that are freely available on the Web. After some consideration, I concluded that the program should have the following features:

  • The application should load dictionaries in various formats. Emphasis should be placed on dictionaries in simple text format and on DICT dictionaries. DICT is a client/server protocol for accessing dictionary definitions from a set of natural language dictionary databases. These databases are available under the GPL license.
  • The engine must be able to operate off-line, without the need of an internet connection. This means that the user should be able to download the dictionary database once, load it into the engine, and then be able to use it locally.
  • The results must be sorted in lexicographic order. Therefore, the application must be aware of the language used and load the appropriate locale.
  • Searching using regular expressions would be a plus.
  • The application must be able to perform searches in Latin-based languages without having to change the keyboard's input language. To accomplish this, a quick search mode must be supported in which diacritics (marks added to a letter to indicate a special pronunciation) are ignored. For example, à, ä, and å must be handled as a, ñ as n, ß as ss, etc.
  • The user must be able to load as many databases as he/she wants. There must be an easy way to switch from one database to another.
  • Most unknown words in foreign languages come up while browsing the Web, so it should be possible to launch the engine directly from the browser. Therefore, add-ons for both Internet Explorer and Firefox must be provided.

These are the main features of SmartLexicon. If you want to give it a try, download the installer. A Help file that describes the use of the program is included. You must also download one or more databases. These are some locations for free dictionary databases:

Please note that SmartLexicon has only been tested on Windows XP. It is a Unicode application, so it won't run on Windows 98 systems. Unfortunately, I haven't tested it with right-to-left languages.

If you are also interested in how the application was coded, keep reading.

LexEngine Design

SmartLexicon is an MFC application. The solution consists of three projects:

  • SmartLexicon, which is responsible for the creation of the UI.
  • LexEngine, which is the heart of the program as it implements the dictionary engine.
  • DictEngine, which implements a DLL for reading files in DICT's dz format. It is based on GPL'ed code, please see the Acknowledgements section for details.

In the present article, I will only describe LexEngine. I'll start from the class that represents a dictionary entry:

// Not a full listing!
class WordEntry
{
    CString word;           // Word
    CString meaning;        // Meaning
    CString category;       // Category (verb, adj, adv)
    CString pronouncation;  // Pronunciation
    CString forms;          // Grammatical forms (plural, gender)
    CStringArray examples;         // Usage examples
    CStringArray examplesMeaning;  // Meanings of the usage examples
    CStringArray synonyms;         // Synonyms
};

Most of the dictionary formats support only a few of the defined properties. However, I've tried to define a class as general as possible, since it will be shared by all the dictionary formats. I've also defined the copy constructor and the assignment operator for this class, so that it would be easier to work with WordEntry objects.

I've decided to represent a dictionary format by two abstract classes, CLexIndexFileBase and CLexSourceFileBase. CLexIndexFileBase defines an interface for creating, loading, and accessing an index file.

// Not a full listing!
class CLexIndexFileBase
{
public:
    CLexIndexFileBase();
    virtual ~CLexIndexFileBase(void);

    static CLexIndexFileBase* CreateNew();

    virtual int LoadNew(CLexSourceFileBase *aSourceFile) = 0;
    virtual int Load(CLexSourceFileBase *aSourceFile, CString aName) = 0;
    virtual int FindWords(const CString& wordStart,
                          vector<int> *wordAppearancesVector,
                          BOOL completeWord = FALSE,
                          BOOL exactMatch = FALSE) = 0;

    void GetName(CString& aName) {aName = name;};
    void SetLang(CLang* l) { lang = l; };
    CLang* GetLang() { return(lang); };

protected:
    CString name;

private:
    CLang* lang;
};

An object of CLexIndexFileBase is created through the CreateNew method. This special method is used instead of the constructor so that a management object can easily create instances of different classes (more on this when CLexManagement is discussed).

A new index file is created by the LoadNew method. This only needs to be done once; afterwards, the index file is loaded into memory with the Load method. In most cases, LoadNew creates an actual file that stores the indexing information. However, this is not required: the indexing information could just as well be embedded in the source file, in which case LoadNew and Load do nothing but return a status code.

The index file is accessed with the FindWords method. The word to be searched is passed as an argument. The method returns the number of results and fills wordAppearancesVector with the indexing codes of the entries that match the search criteria. The results can be limited with the completeWord and exactMatch flags. When completeWord is FALSE, all the words that start with wordStart are returned; otherwise, every result must have exactly the same length as wordStart. When exactMatch is FALSE, the search ignores both case and diacritics.
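The FindWords contract can be illustrated with a minimal sketch. It is not the engine's implementation: it assumes an in-memory vector instead of an index file, uses std::wstring in place of CString, and the names IndexEntry and Fold are hypothetical. Fold only lowercases, so the diacritics part of the non-exact search is left out here.

```cpp
#include <cassert>
#include <cwctype>
#include <string>
#include <vector>

// Hypothetical index record: a headword plus the integer code that the
// source file understands.
struct IndexEntry {
    std::wstring word;
    int code;
};

// Lowercases a string. Stands in for the case/diacritics folding that the
// engine delegates to the language object (diacritics are NOT handled here).
static std::wstring Fold(std::wstring s) {
    for (wchar_t& c : s) c = static_cast<wchar_t>(std::towlower(c));
    return s;
}

// Linear-scan version of the FindWords contract: fills `out` with the codes
// of the matching entries and returns how many were found.
int FindWords(const std::vector<IndexEntry>& index,
              const std::wstring& wordStart,
              std::vector<int>* out,
              bool completeWord = false,
              bool exactMatch = false) {
    assert(out != nullptr);
    out->clear();
    const std::wstring needle = exactMatch ? wordStart : Fold(wordStart);
    for (const IndexEntry& e : index) {
        const std::wstring key = exactMatch ? e.word : Fold(e.word);
        // completeWord: lengths must agree; otherwise a prefix match is enough.
        if (completeWord && key.size() != needle.size()) continue;
        if (key.compare(0, needle.size(), needle) == 0) out->push_back(e.code);
    }
    return static_cast<int>(out->size());
}
```

A real index would presumably keep its keys sorted so that a prefix search can use a binary search instead of a full scan; the linear loop is used only to keep the flag semantics easy to follow.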

CLexSourceFileBase defines an interface for retrieving word entries from a dictionary database. An integer is used as a key. The caller uses the integers returned by the CLexIndexFileBase::FindWords method.

// Not a full listing
class CLexSourceFileBase
{
public:
    CLexSourceFileBase(void) {};
    virtual ~CLexSourceFileBase(void) {};

    static CLexSourceFileBase* CreateNew();

    virtual BOOL Open(const CString& aName) = 0;
    virtual BOOL Analyse() = 0;
    virtual BOOL IsOpen() = 0;
    virtual void GetWordEntryAt(int index, vector<WordEntry> &v) = 0;
    virtual CString GetHeadWord(int index) = 0;
    virtual CString GetHeadWord(int index, CString& wordToSearch) = 0;
    virtual CString GetHeadWordLabel(int index, CString& wordToSearch) = 0;

    virtual int GetLineCount() {return(lineCount);};
    virtual int GetSize() {return(size);};
    virtual void GetName(CString& aName) { aName = name;};
    void SetLang(CLang* l) { lang = l; };
    CLang* GetLang() { return(lang); };

protected:
    CString name;
    int type;
private:
    CLang* lang;
};

The CLexSourceFileBase::GetHeadWord(int index) method returns the label of the keyword. This method is used to fill the listview. The CLexSourceFileBase::GetWordEntryAt method returns one or more WordEntry objects that correspond to the given index.

For every dictionary format, two classes that implement the above interfaces are created. The idea is that the user of the engine needs to know nothing about the intricacies of the different formats and handles them all in the same way, through the interface methods.
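As a rough illustration of this separation, the sketch below pairs two heavily simplified interfaces with a toy in-memory "format". The class names IndexFile, SourceFile, MemoryIndex, and MemorySource and the function Lookup are assumptions made for this example and do not appear in the SmartLexicon sources; std::wstring replaces CString so that the code compiles without MFC.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for CLexIndexFileBase and CLexSourceFileBase.
class IndexFile {
public:
    virtual ~IndexFile() {}
    virtual int FindWords(const std::wstring& wordStart, std::vector<int>* codes) = 0;
};

class SourceFile {
public:
    virtual ~SourceFile() {}
    virtual std::wstring GetHeadWord(int index) = 0;
};

// Toy format: headwords live in a vector and the code is the position.
class MemorySource : public SourceFile {
public:
    explicit MemorySource(std::vector<std::wstring> words) : words_(std::move(words)) {}
    std::wstring GetHeadWord(int index) override {
        assert(index >= 0 && index < static_cast<int>(words_.size()));
        return words_[index];
    }
    const std::vector<std::wstring>& Words() const { return words_; }
private:
    std::vector<std::wstring> words_;
};

// Matching index: a prefix scan over the words held by the source.
class MemoryIndex : public IndexFile {
public:
    explicit MemoryIndex(const MemorySource& src) : src_(src) {}
    int FindWords(const std::wstring& wordStart, std::vector<int>* codes) override {
        codes->clear();
        const std::vector<std::wstring>& w = src_.Words();
        for (int i = 0; i < static_cast<int>(w.size()); ++i)
            if (w[i].compare(0, wordStart.size(), wordStart) == 0) codes->push_back(i);
        return static_cast<int>(codes->size());
    }
private:
    const MemorySource& src_;
};

// Format-agnostic caller: it sees only the two interfaces.
std::vector<std::wstring> Lookup(IndexFile& index, SourceFile& source,
                                 const std::wstring& prefix) {
    std::vector<int> codes;
    index.FindWords(prefix, &codes);
    std::vector<std::wstring> labels;
    for (int code : codes) labels.push_back(source.GetHeadWord(code));
    return labels;
}
```

Lookup never mentions the concrete classes, so supporting another format in this sketch would only mean adding another pair of derived classes.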

You may have noticed in the previous code listings that the CLexSourceFileBase and CLexIndexFileBase classes hold a pointer to a CLang object. This is a language object; the application creates one for each language it supports. The declaration of the class is shown below:

class CLang : public CObject
{
public:
    CLang(void);
    ~CLang(void);

    virtual BOOL IsAlpha(TCHAR c);
    virtual int Compare(CString& str1, CString& str2);
    virtual int CompareNoCase(CString& str1, CString& str2);
    virtual int CompareNoStrict(CString& str1, CString& str2);
    virtual void MakeSimple(CString& str);
    virtual void ToUpper(CString& str);
    virtual void ToLower(CString& str);
    void GetName(CString &aName) { aName = name; };
    void GetLocale(CString &aLocale) { aLocale = locale; };

protected:
    CString name;
    CString locale;
};

Most of the defined methods are actually wrappers for the standard CString or run-time functions. The CLang class would not be needed at all, were it not for the MakeSimple and CompareNoStrict methods. MakeSimple takes a string and removes the diacritic marks (for example, über is transformed to uber). CompareNoStrict is equivalent to transforming the two input strings with MakeSimple and then comparing them. This functionality isn't provided by the locale run-time functions, and it is necessary for the implementation of the quick search feature. CLang is the base class; it corresponds to the English language and the C locale. The other languages inherit from CLang and override the virtual methods. If quick search functionality is not desired for a language, then only the constructor needs to be defined:

class CNewLang : public CLang
{
public:
    // The constructor must carry the name of the derived class
    CNewLang(void)
    {
        name = _T("name of the language");
        locale = _T("Locale to be used");
    }
};
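To make the quick search folding more concrete, here is a minimal sketch of what MakeSimple and CompareNoStrict could look like for a language that does want the feature. It is written against std::wstring rather than CString, the FoldTable helper is hypothetical, and the table covers only the characters mentioned in this article; a real language class would need to cover its whole alphabet.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical folding table: each accented character maps to its plain
// replacement, which may be longer than one character (ß becomes "ss").
static const std::map<wchar_t, std::wstring>& FoldTable() {
    static const std::map<wchar_t, std::wstring> table = {
        {L'\u00E0', L"a"},  // à
        {L'\u00E4', L"a"},  // ä
        {L'\u00E5', L"a"},  // å
        {L'\u00F1', L"n"},  // ñ
        {L'\u00FC', L"u"},  // ü
        {L'\u00DF', L"ss"}, // ß
    };
    return table;
}

// Replaces every character found in the table, in place.
void MakeSimple(std::wstring& str) {
    std::wstring result;
    result.reserve(str.size());
    for (wchar_t c : str) {
        std::map<wchar_t, std::wstring>::const_iterator it = FoldTable().find(c);
        if (it != FoldTable().end()) result += it->second;
        else result += c;
    }
    // Every character maps to at least one character, so folding never shrinks.
    assert(result.size() >= str.size());
    str.swap(result);
}

// Folds both strings, then compares them; returns <0, 0 or >0.
int CompareNoStrict(std::wstring a, std::wstring b) {
    MakeSimple(a);
    MakeSimple(b);
    return a.compare(b);
}
```

With this table, MakeSimple turns über into uber, and CompareNoStrict treats Straße and Strasse as equal. Case is left untouched in this sketch.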

To be able to use the above interfaces and classes effectively and seamlessly, a management system has to be designed. The following UML class diagram illustrates the relationship between the elements of this system:

Figure 2: Class Diagram

CLexiconObject wraps a CLexIndexFileBase and a CLexSourceFileBase object together. It represents an entry to the dictionary database. It stores path information for the two associated files (source and index file) and the name assigned to the dictionary by the user.

CLexManagement is responsible for the creation and destruction of CLexiconObject objects. CLexManagement is aware of all the available dictionary formats and languages. A unique integer value is assigned to every format, where a format is implemented by a pair of classes derived from CLexIndexFileBase and CLexSourceFileBase. CreateNewObject takes an integer as its argument and creates the appropriate instance. To be able to do so, CLexManagement keeps a vector of pointers to the CreateNew methods of the non-abstract classes.

CLexIndexFileBase* CLexManagement::CreateNewIndexFile(int type)
{
    // Reject negative and out-of-range type codes
    if (type >= 0 && type < (int) newIndexFileFnVector.size())
        return( (newIndexFileFnVector.at(type))());
    return(NULL);
}

A language, implemented by a CLang derived class, is identified by a string. A new dictionary format or a new language can be added to the engine simply by calling a macro:

// Language Registration
void CLexManagement::LangRegistration()
{
    registeredLangArray.Add(&langDefault);
    // Register languages here
    REGISTER_LANGUAGE(CLang);
    REGISTER_LANGUAGE(CLangFrench);
    REGISTER_LANGUAGE(CLangGerman);
    REGISTER_LANGUAGE(CLangGreek);
    REGISTER_LANGUAGE(CLangItalian);
    REGISTER_LANGUAGE(CLangSpanish);
}

// Type Registration
void CLexManagement::TypeRegistration()
{
    // Register types here
    REGISTER_TYPE(_T("Text"), CLexIndexFileText, CLexSourceFileText);
    REGISTER_TYPE(_T("Ding"), CLexIndexFileDing, CLexSourceFileDing);
    REGISTER_TYPE(_T("Dict"), CLexIndexFileDict, CLexSourceFileDict);
}

This is the only place in the code where the concrete classes derived from CLexIndexFileBase, CLexSourceFileBase, and CLang are referenced.
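The REGISTER_TYPE and REGISTER_LANGUAGE macros themselves are not listed in this article. The sketch below shows one way such a registry could work, using a plain Register function instead of a macro. The class names IndexFileBase, IndexFileText, IndexFileDict, and Management are simplified, hypothetical stand-ins, and only the index file half of a format is modelled.

```cpp
#include <cassert>
#include <string>
#include <vector>

class IndexFileBase {
public:
    virtual ~IndexFileBase() {}
    virtual std::wstring FormatName() const = 0;
};

class IndexFileText : public IndexFileBase {
public:
    static IndexFileBase* CreateNew() { return new IndexFileText; }
    std::wstring FormatName() const override { return L"Text"; }
};

class IndexFileDict : public IndexFileBase {
public:
    static IndexFileBase* CreateNew() { return new IndexFileDict; }
    std::wstring FormatName() const override { return L"Dict"; }
};

// Pointer to a static CreateNew factory method.
typedef IndexFileBase* (*NewIndexFileFn)();

class Management {
public:
    Management() {
        // The position in the vector doubles as the integer type code.
        Register(&IndexFileText::CreateNew);  // type 0
        Register(&IndexFileDict::CreateNew);  // type 1
    }

    // Returns a new instance for the given type code, or nullptr if the
    // code is unknown. The caller owns the returned object.
    IndexFileBase* CreateNewIndexFile(int type) const {
        if (type < 0 || type >= static_cast<int>(fns_.size())) return nullptr;
        return fns_[type]();
    }

private:
    void Register(NewIndexFileFn fn) {
        assert(fn != nullptr);
        fns_.push_back(fn);
    }
    std::vector<NewIndexFileFn> fns_;
};
```

Because the concrete classes appear only inside the Management constructor, the rest of such a program can work with IndexFileBase pointers and integer type codes alone.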

CLexDataBaseEntry holds the properties of a single dictionary. CLexDataBase implements a mechanism for the manipulation of the installed dictionaries. It provides methods for loading and deleting dictionaries, for changing their order, and for modifying the auto-load property. The auto-load property controls whether a dictionary is loaded automatically at startup or must be loaded manually. Since the user may have installed several dictionaries, loading all of them unconditionally at startup would make the application slow to launch. With the auto-load property, the user can choose to load rarely used dictionaries only when needed. CLexDataBaseStore is used for permanent storage of the dictionary properties. In the current implementation, data is stored in the registry, but an INI or an XML file could be used just as well.

Finally, just a few words about how the sorting in lexicographic order is performed. As mentioned before, the CLexIndexFileBase::FindWords method returns a vector with the indices of the words that match the search criteria. Then, the CLexSourceFileBase::GetHeadWord(int index) method returns the label of the keyword. The word list must be lexicographically sorted based on this label. This is easily achieved by creating a multimap, with a wstring object (the label) as the key type and an integer (the index) as the mapped type.

typedef multimap<wstring, int,
         less_locale<wstring> > MultimapKeyWord;

Instead of the default less binary predicate, a new predicate based on wcsicoll is used. wcsicoll performs a locale-specific, case-insensitive comparison.

template <class T>
class less_locale
{
public:
    bool operator() (const T& s1, const T& s2) const
    {
        return (wcsicoll(s1.c_str(), s2.c_str()) < 0);
    }
};
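The way the multimap produces the sorted word list can be sketched as follows. This is an illustration under stated assumptions, not the code used by the application: wcsicoll is specific to the Microsoft run-time, so the predicate here (LessLocaleNoCase, a hypothetical name) lowercases both strings and then calls the standard wcscoll, and the labels are passed in as a vector whose positions play the role of the index codes.

```cpp
#include <cassert>
#include <cwchar>
#include <cwctype>
#include <map>
#include <string>
#include <vector>

// Portable stand-in for the wcsicoll-based predicate: lowercases both
// strings, then compares them with wcscoll under the current C locale.
struct LessLocaleNoCase {
    static std::wstring Lower(std::wstring s) {
        for (wchar_t& c : s) c = static_cast<wchar_t>(std::towlower(c));
        return s;
    }
    bool operator()(const std::wstring& a, const std::wstring& b) const {
        return std::wcscoll(Lower(a).c_str(), Lower(b).c_str()) < 0;
    }
};

typedef std::multimap<std::wstring, int, LessLocaleNoCase> MultimapKeyWord;

// Returns the index codes ordered by their labels. labels[i] plays the role
// of GetHeadWord(i) for the index code i.
std::vector<int> SortByLabel(const std::vector<std::wstring>& labels) {
    MultimapKeyWord sorted;
    for (int i = 0; i < static_cast<int>(labels.size()); ++i)
        sorted.insert(MultimapKeyWord::value_type(labels[i], i));

    // Walking the multimap visits the labels in the predicate's order.
    std::vector<int> order;
    for (MultimapKeyWord::const_iterator it = sorted.begin(); it != sorted.end(); ++it)
        order.push_back(it->second);
    assert(order.size() == labels.size());
    return order;
}
```

A multimap rather than a map is needed because several entries may share the same label, for example words that differ only in case under this predicate.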

Web-browser integration

The browser integration was easy for Internet Explorer. Everything I needed was right here in CodeProject. One article showed me how to extend Internet Explorer's context menu. Another one described how a COM interface could be added to an existing MFC application. That was all. I added an IDispatch interface, and easily controlled the application through VBScript code.

For Mozilla/Firefox, it is a little bit trickier. The only way to do it is to create a Mozilla extension. Thankfully, there are many online resources available on the subject. You can also have a look at the source code of the many available extensions; this second option worked better for me. The Autocopy and downloadwith extensions were the most helpful. I created an extension that adds an entry to the browser's pop-up menu and calls a helper application with the selected text as an argument. In the current version, the helper application's path is hard-coded to SmartLexicon's default installation path (C:\Program Files\SmartLexicon). A more experienced Mozilla developer could add a Properties dialog for the extension and ask the user to configure the path. The helper application is a Win32 program with no GUI that invokes SmartLexicon through its IDispatch interface.

Mobile versions

If you own a Symbian 7.0 phone, you can use S60Dict to browse DICT dictionaries offline. If you own a simple J2ME MIDP 2.0 phone, you can find on my home page a program that converts any dictionary in plain text format to a J2ME dictionary.

Acknowledgements

SmartLexicon uses the following software components:

  • Regex library from boost. This provides the Regular Expressions functionality.
  • Zlib library for reading *.gz files. Dict compressed *.dz files are a variation of a *.gz file with a custom header. Sources from dictd were used to read *.dz files.
  • CTextFileDocument class by PEK for handling UTF-8 files.
  • Registry class by Robert Pittenger.
  • Colour picker by Chris Maunder.

Internet Explorer integration is based on ideas by:

  • roel_v2 (Automating of MFC Applications).
  • Bee Master (Extending IE's context menu).

I have also used ideas/code snippets from numerous CodeProject articles. Thanks to all CodeProject contributors for the help they've provided. CodeProject has really been the most valuable resource of information for me.

History

  • 23 August 2006 - Added mobile version, updated Firefox extension.
  • 7 November 2005 - Updated user manual, auto-resizing of list view, Settings dialog.
  • 23 September 2005 - Initial release.

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3).

About the Author

Giannakakis Kostas
Software Developer (Senior)
Greece

Article Copyright 2005 by Giannakakis Kostas