Click here to Skip to main content
Click here to Skip to main content

A Lite-Memory Dictionary

By , 10 Mar 2007
 
Screenshot - cdictionarylite1.jpg

Introduction

Have you come across a situation where you need to store a dictionary with a huge number of strings and realized that keeping it in memory is a nightmare? If so, this might be one of the best solutions for your problem. In this article I'll describe an implementation where everything in the dictionary is kept in the disk. The advantages of using this in your code are:

  1. Memory foot print is very low (only few bytes) regardless of the size of the dictionary.
  2. Disk accesses are kept transparent to the user (the developer who uses the code) so that he can access it as it is in memory.
  3. Although kept in the disk, accessing it is quite fast due to the BTree based implementation.
  4. No separate mechanism is needed to write the dictionary to and read it from the disk.
  5. Unit of the string is kept as string rather than character so that the application developer has control on the depth of the trie (which is used to implement the dictionary).
  6. The relative frequency of a word in the dictionary can be stored as well.

Background

I am doing a research in natural language processing for Sinhala which is my native language. For this, I needed to keep several big dictionaries (phrase, word and sound dictionaries) in memory at the same time. However, I realized that the dictionaries eat a lot of memory (hundreds of megabytes) causing the program performance to degrade vastly. I was forced to go for an implementation where the dictionary can be kept in and accessed directly from the disk.

Why did the dictionary eat a lot of memory

Say, we are to store a dictionary with 100,000 words. If the average length of a word is 6 characters and we store the relative frequency (likelihood of appearance) of the word as an integer. We'll need 10 bytes for a word and only 1MB for the whole dictionary!! It's quite a small amount and why do we need to go for a disk approach? Well, this will be the case only if we store the words sequentially. This will require a linear search to find a word which is totally impractical (in terms of performance) for a big dictionary. Therefore we definitely need a structured way to store the dictionary. Trie, being a tree based data structure, is the most popular answer for this. In a trie, the search for a word will require only as few steps as the word's depth in the tree.

Now the question is how much memory do we need to store the trie. Will it be significantly higher than that needed for the sequential approach? Let's see. There are leaf nodes in the trie equal in number to the words in the dictionary i. e. 100,000. In addition, there are internal (non-leaf) nodes. In my work, I found this number is nearly two times the number of leaf nodes when words are stored. So, the total node count will be approximately 300,000. How much memory for a node? A node will require : (i) ID : 4 bytes (ii) A flag to indicate whether it is a leaf node : 1 byte (iii) map of transitions from the node if the node is internal - a transition is an association of a string (note that the unit is string) and an integer. Let's say the average number of transitions per node is 5 (5 bytes * 5 = 25 bytes). The map data structure itself will take some memory. If we assume it to be 10 bytes the memory taken by the map is 35 bytes. So, the memory taken by a node will be 40 bytes and the whole dictionary will need 40 * 300,000 = 12MB. This is much more than that we predicted for sequential approach. Things get worse when the strings stored are longer (e.g. phrases). The node count will be much higher and the memory requirement will grow into even hundreds of MBs.

How it is stored in the disk

You do not need to understand or even read this section to use the code. This section is for people who are interested in the inside story. I used the B-Tree approach when storing the dictionary in the disk which is very common in database implementations. You better read a good tutorial (there are many) on B-Trees before reading this section if you are not familiar with them.

In this implementation there are three structures that constitute the disk file.

  1. Node - stands for a node and contains node attributes like Node ID, End flag, Count, etc.
  2. Transition Page - this is a set of transitions where a transition consists of a string and a destination node id.
  3. Transition milestone Page (TMP) - This is the structure inherited from B-Tree architecture. We sort the set of transitions of a node and split them into pages. We call the first transition of a transition page, a transition milestone. Transition milestone page is the set of transition milestones of a node. A given node may have one or more transition milestone pages.

Our disk file consists of sequentially arranged instances of these structures. Lets peep into the node structure.

Node

cdictionarylite2

Note that a node contains its first TMP within itself. Then it has the offset to its next TMP if any; otherwise it is set to invalid offset. Transition Page structure looks exactly like the part enclosed by the red borders whereas in a TMP there is an additional integer field at the beginning which indicates the offset to the next TMP if any.

The disk file has a header which contains some attributes and statistics of the dictionary.

cdictionarylite3

First two bytes of the header is the two characters "DL" which stands for "Dictionary Lite". If you open the dictionary file in a text editor you'll see these two characters in the beginning. What immediately follows the header is the root node of the trie.

Using the code

Using the code is very simple. First include "Dictionary.h" in your project. You can create a CDictionaryLite object and manipulate it with the functions described in the following guidelines.

Creating the dictionary

The filename of the dictionary should be given in the constructor. You may use any extension with the filename. There are two constructors.
CDictionaryLite(CString sFilename);
CDictionaryLite(CString sFilename, bool bKeepFile,
   int iStrLen = DICTIONARY_LITE_DEFAULT_STR_LEN,
   int iNoOfStrInNode = DICTIONARY_LITE_DEFAULT_NO_OF_STR_IN_NODE,
   int iNoOfStrInTransition = DICTIONARY_LITE_DEFAULT_NO_OF_STR_IN_TRANSITION);

Use the second constructor when creating the dictionary for the first time. The second argument indicates whether or not to keep the dictionary file after being destroyed. If you create a large number of dictionaries throughout the program and do not need them after the program termination, keep this argument false. There are other dictionary related arguments (optional) that you can specify according to the requirement. For an example, if you are going to use character as the unit of string, you may set iStrLen = 2 (remember to keep one byte for the null terminator). iNoOfStrInNode is the number of strings in a node or in a TMP. iNoOfStrInTransition is the number of strings in a transition page. If not specified these parameters will be assigned with default values. However, it's always a good practice to enter these values to minimize the space and time complexities of your dictionary.

The first constructor should be used when creating a dictionary object using an already existing dictionary file. Only the filename has to be given because the dictionary file contains all other parameters in its header.

Adding words (strings) to the dictionary

After creating the dictionary object you can add words to it via the function
void AddString(CStrList& oStrList, int iCount = 1);

Note that you enter a string list rather than a single string as the word. This is because the unit of the string used in the dictionary is also a string. This is done to make it more generic. CStrList is a class derived from std::list<CString> so all you have to do is push unit strings to it. If this is not clear, refer to the example below. Optionally the count of the string can be given. If the word is already in the dictionary, its count will be incremented by this number.

Retrieving words from the dictionary

  1. Existence of a word in the dictionary can be verified by the function
    int GetStringCount(CStrList& oStrList);

    If the word is in the dictionary the output will be its count, zero otherwise.

  2. All the words in the dictionary can be sequentially retrieved using the two functions
    bool GetFirstWord(CStrList& oStrList, int& iCount);
    bool GetNextWord(CStrList& oStrList, int& iCount);
    

Function names are self-explanatory. Words in the dictionary are traversed in a depth-first-left-recursive fashion. i.e. if A and are two sibling nodes (with same parent node) with A is in left to B in the tree, B is not visited until A and all its descendents are visited.

After using the dictionary object just delete it. It's that simple! The dictionary file will remain in the disk (unless you entered bKeepFile parameter as false in the constructor) and you can use it again.

The following code snippet illustrates these basic operations:

// create dictionary
CDictionaryLite* pDicLite = new CDictionaryLite
            ("MyDic.ldic", true, 3, 5, 6);
// add three words to the dictionary


CStrList oStr1, oStr2, oStr3;
oStr1.push_back("a");
oStr1.push_back("pp");
oStr1.push_back("le");
// "apple"

oStr2.push_back("m");
oStr2.push_back("a");
oStr2.push_back("n");
// "man"

oStr3.push_back("ca");
oStr3.push_back("t");
// "cat"

pDicLite->AddString(oStr1); // count = 1
pDicLite->AddString(oStr2, 4); // count = 4
pDicLite->AddString(oStr3, 2); // count = 2

//check whether "cat" is in the dictionary
bool bIsInDictionary = (pDicLite->GetStringCount(oStr3) > 0);
// result : true
oStr3.push_back("s");
//check whether "cats" is in the dictionary
bIsInDictionary = (pDicLite->GetStringCount(oStr3) > 0); 
// result : false

//retrieve first two words in the dictionary
CStrList oWord;
int iCount;
pDicLite->GetFirstWord(oWord, iCount); 
// oWord = "apple" iCount = 1
pDicLite->GetNextWord(oWord, iCount); 
// oWord = "cat" iCount = 2

// destroy the dictionary object
delete pDicLite;

Points of Interest

Memory required for a dictionary object is in bytes - not even in kilobytes! This does not depend on the actual dictionary size. It takes less than 1ms to access one word (to find the count of a word) in my 2.8 GHz machine. This can be definitely made even more faster by adding in-memory cache. Can somebody try this?

The space that the dictionary acquires from disk can be reduced by implementing variable size pages because in this implementation the page size (number of strings in a page) is fixed and not necessary all the entries in a page (transition page or TMP) are used. In this case the page size can be stored somewhere in the page itself. Appreciate if someone can try this as well.

History

10-March-2007: Initial Release

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

About the Author

Dileepa Jayathilaka
Web Developer
Sri Lanka Sri Lanka
Member
I am a Sri Lankan Software Engineer with 3 years experience in software development. I have a C background and have experience in MFC. Iam very much interested in AI especially Natural Language Processing. I have had done some research in Sinhala which is my native language.
In my spare time I use to read novels and politics.

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
QuestionHow to insert multiple meanings of the word in the dictionarymemberDiaa Fayed6 Oct '08 - 15:26 
How to insert multiple meanings of the word in the dictionary
Generalsome questions...memberMillion23 Aug '08 - 6:25 
Hi,
 
Thank you for your article.
I think this is a good article to build the dictionary.
But I have some question.
 
1. Is the dictionary constructed by this method trie ?
2. And balanced tree ?
 
For your method, I think:
 
In worst case, the number of comparing of key is N/ M where N is the number of word in the dictionary and M is the numner of word in TMP.
...
 
Is that right ?
 
2008.8.24
GeneralA reliability of the code [modified]memberkarmerer23 Jan '08 - 1:37 
What is a reliability of the code? It was posted nearly year ago- were any bugs found? were they fixed? If so, do current source codes reflect this changes? Do you use this feature constantly?
 
Thanks, it seem to be a good idea though, since I work with large dictionaries too and went at last to the similar realization.
 
modified on Wednesday, January 23, 2008 4:22:50 PM

Generaldoes your code handle node splitmemberRichard98_9815 Mar '07 - 14:32 
Thanks for your nice job.
I did not find the code to handle node split. I am wondering if the code does that?
thanks
Richard
 
Richard98

GeneralRe: does your code handle node splitmemberDileepa Jayathilaka17 Mar '07 - 0:47 
Yes it does. When there's an overflow in a transitions page it is split to two equal pages and a new entry (or possibly two entries) is added to the corresponding TMP. If you want to see this in the code refer the method CDictionaryLite::AddTransitionToCurrNode.
However, there's no split in TMPs. This is a place where this implementation defers from traditional BTree.
Thanks for your question.
 
cheers
Dileepa.
GeneralRe: does your code handle node splitmemberRichard98_9818 Apr '07 - 10:39 
Hi Dileepa:
thanks for your explaination. In traditional b-tree, the node has a key and a number of pair of key and offset. in your node, there is no key, there are onlya number of pair of key and offset.
 
I am wondering what is the difference between your b tree and traditional b-tree.
thanks
Rich
 
Richard98

GeneralRe: does your code handle node splitmemberRichard98_9818 Apr '07 - 12:19 
Hi Dileepa,
 
in traditional b-tree, is there nextOffset in a node?
thanks
Rich
 
Richard98

GeneralRe: does your code handle node splitmemberRichard98_9818 Apr '07 - 12:55 
hi Dileepa:
 
does offset1...N poits to a new node or a new transition?
in AddTransitionToCurrNode,
line 366,
if((*ite1).first > str)
{
bLimitReached = true;
break;
}
 
why bLimitReached = true when ((*ite1).first > str), i thougt the bLimitReached is used for node split if number of strings in node reach the max.
 
Thank you very much for further explaination
 
Rich
 
Richard98

GeneralRe: does your code handle node splitmemberDileepa Jayathilaka19 Apr '07 - 2:51 
Replies to your three questions.
(1)In this implementation, a node's key is the transition (string) that led to the particular node. For example, let's say there is a transition from node 2 to node 5 by the string "aa". then the key of node 5 is "aa".
(2)Having next offset inside a node is an implementation level decision. In any BTree implementation a link to next node should be kept though it is not necessarily the offset.
What actually contrasts this implementation from traditional BTree is that in this implementation there's no upper limit to the no of TMPs in a node.The dictionary parameters (transition page size & TMP size) should be set such that the no of TMPs of a node does not become very large. Otherwise there will be a degradation in performance because TMPs of a node are traversed sequentially.
(3)The flag "bLimitReached" is used to locate the Transition page to which the new transition shoub inserted. First string of a transition page, which is the smallest string according to alphebetical comparison, is kept in TMP. Therefore, if the new transition string is greater than the N th entry in TMP and is less than the N+1 th entry, then the new string should be inserted to the transition page which is pointed to by the N th entry of the TMP.
 
I think I have reasonably answered to your questions. If there's any doubt do not hesitate to bug me.
 
cheers
Dileepa.
GeneralRe: does your code handle node splitmemberRichard98_9820 Apr '07 - 9:20 
thanks for your explaination.
in your definition, a transition is the same definition of key, is it correct?
are offset1....N pointed to child node offset in harddrive?
in GetDestNodeFromCurrNode,at line 230
ReadTransitionMilestoneBlock(iNextOffset) is not reading a node,
it is reading a group of pair of key and offset, in your definition,
is a group of pair of key and offset called transition milestone page?
 
what is transition page? if there is only one transition milestone page,
is the b-tree same as traditional b-tree?
 
Thank you very much
 
Rich
 
Richard98

GeneralVery Interesting!memberVirtual Coder12 Mar '07 - 5:16 
What is the license of your code?
GeneralRe: Very Interesting!memberDileepa Jayathilaka13 Mar '07 - 3:04 
You can use it in your application freely. No conditions.
GeneralHunspellmemberPriyank Bolia10 Mar '07 - 21:37 
Hunspell is used by Mozilla and Open office and is extremely fast and can read write standard dictionaries used by other softwares. Also dictionary is more than just matching the words, like word and words won't have two entries in normal case but I guess for your software it need two entries.
 

GeneralRe: HunspellmemberThatsAlok12 Mar '07 - 18:53 
Priyank Bolia wrote:
Hunspell is used by Mozilla and Open office and is extremely fast and can read write standard dictionaries used by other softwares.

 is it Opensource dude !

 

"Opinions are neither right nor wrong. I cannot change your opinion. I can, however, change what influences your opinion." - David Crow


cheers,
Alok Gupta
Global Interface Table: An Easy Way to Marshal an Interface Pointer[new]
VC Forum Q&A :- I/ IV
Support CRY- Child Relief and You

GeneralRe: HunspellmemberPriyank Bolia12 Mar '07 - 19:26 
Yes, its open source and extremely fast and use standard dictionaries available for Mozilla or Open Office and just doesn't matches char by char, but works on phonetic and the science of words.
 

GeneralRe: HunspellmemberThatsAlok12 Mar '07 - 19:38 
Priyank Bolia wrote:
Yes, its open source and extremely fast and use standard dictionaries available for Mozilla or Open Office and just doesn't matches char by char, but works on phonetic and the science of words.

good ... provide me the link.. have you used that???

 

"Opinions are neither right nor wrong. I cannot change your opinion. I can, however, change what influences your opinion." - David Crow


cheers,
Alok Gupta
Global Interface Table: An Easy Way to Marshal an Interface Pointer[new]
VC Forum Q&A :- I/ IV
Support CRY- Child Relief and You

GeneralRe: HunspellmemberPriyank Bolia13 Mar '07 - 1:19 
http://hunspell.sourceforge.net/
 
I had used in wxQuickRun the code can be viewed from sourcforge.
 

GeneralRe: HunspellmemberThatsAlok13 Mar '07 - 1:39 
Priyank Bolia wrote:
http://hunspell.sourceforge.net/

thanks!

 

"Opinions are neither right nor wrong. I cannot change your opinion. I can, however, change what influences your opinion." - David Crow


cheers,
Alok Gupta
Global Interface Table: An Easy Way to Marshal an Interface Pointer[new]
VC Forum Q&A :- I/ IV
Support CRY- Child Relief and You

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web03 | 2.6.130523.1 | Last Updated 10 Mar 2007
Article Copyright 2007 by Dileepa Jayathilaka
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid