Have you come across a situation where you need to store a dictionary with a huge number of strings and realized that keeping it in memory is a nightmare? If so, this might be one of the best solutions for your problem. In this article I'll describe an implementation where everything in the dictionary is kept in the disk. The advantages of using this in your code are:
- Memory foot print is very low (only few bytes) regardless of the size of the dictionary.
- Disk accesses are kept transparent to the user (the developer who uses the code) so that he can access it as it is in memory.
- Although kept in the disk, accessing it is quite fast due to the BTree based implementation.
- No separate mechanism is needed to write the dictionary to and read it from the disk.
- Unit of the string is kept as string rather than character so that the application developer has control on the depth of the trie (which is used to implement the dictionary).
- The relative frequency of a word in the dictionary can be stored as well.
I am doing a research in natural language processing for Sinhala which is my native language. For this, I needed to keep several big dictionaries (phrase, word and sound dictionaries) in memory at the same time. However, I realized that the dictionaries eat a lot of memory (hundreds of megabytes) causing the program performance to degrade vastly. I was forced to go for an implementation where the dictionary can be kept in and accessed directly from the disk.
Why did the dictionary eat a lot of memory
Say, we are to store a dictionary with 100,000 words. If the average length of a word is 6 characters and we store the relative frequency (likelihood of appearance) of the word as an integer. We'll need 10 bytes for a word and only 1MB for the whole dictionary!! It's quite a small amount and why do we need to go for a disk approach? Well, this will be the case only if we store the words sequentially. This will require a linear search to find a word which is totally impractical (in terms of performance) for a big dictionary. Therefore we definitely need a structured way to store the dictionary. Trie, being a tree based data structure, is the most popular answer for this. In a trie, the search for a word will require only as few steps as the word's depth in the tree.
Now the question is how much memory do we need to store the trie. Will it be significantly higher than that needed for the sequential approach? Let's see. There are leaf nodes in the trie equal in number to the words in the dictionary i. e. 100,000. In addition, there are internal (non-leaf) nodes. In my work, I found this number is nearly two times the number of leaf nodes when words are stored. So, the total node count will be approximately 300,000. How much memory for a node? A node will require : (i) ID : 4 bytes (ii) A flag to indicate whether it is a leaf node : 1 byte (iii) map of transitions from the node if the node is internal - a transition is an association of a string (note that the unit is string) and an integer. Let's say the average number of transitions per node is 5 (5 bytes * 5 = 25 bytes). The map data structure itself will take some memory. If we assume it to be 10 bytes the memory taken by the map is 35 bytes. So, the memory taken by a node will be 40 bytes and the whole dictionary will need 40 * 300,000 = 12MB. This is much more than that we predicted for sequential approach. Things get worse when the strings stored are longer (e.g. phrases). The node count will be much higher and the memory requirement will grow into even hundreds of MBs.
How it is stored in the disk
You do not need to understand or even read this section to use the code. This section is for people who are interested in the inside story. I used the B-Tree approach when storing the dictionary in the disk which is very common in database implementations. You better read a good tutorial (there are many) on B-Trees before reading this section if you are not familiar with them.
In this implementation there are three structures that constitute the disk file.
- Node - stands for a node and contains node attributes like Node ID, End flag, Count, etc.
- Transition Page - this is a set of transitions where a transition consists of a string and a destination node id.
- Transition milestone Page (TMP) - This is the structure inherited from B-Tree architecture. We sort the set of transitions of a node and split them into pages. We call the first transition of a transition page, a transition milestone. Transition milestone page is the set of transition milestones of a node. A given node may have one or more transition milestone pages.
Our disk file consists of sequentially arranged instances of these structures. Lets peep into the node structure.
Note that a node contains its first TMP within itself. Then it has the offset to its next TMP if any; otherwise it is set to invalid offset. Transition Page structure looks exactly like the part enclosed by the red borders whereas in a TMP there is an additional integer field at the beginning which indicates the offset to the next TMP if any.
The disk file has a header which contains some attributes and statistics of the dictionary.
First two bytes of the header is the two characters "DL" which stands for "Dictionary Lite". If you open the dictionary file in a text editor you'll see these two characters in the beginning. What immediately follows the header is the root node of the trie.
Using the code
Using the code is very simple. First include "Dictionary.h" in your project. You can create a
CDictionaryLite object and manipulate it with the functions described in the following guidelines.
Creating the dictionary
The filename of the dictionary should be given in the constructor. You may use any extension with the filename. There are two constructors.
CDictionaryLite(CString sFilename, bool bKeepFile,
int iStrLen = DICTIONARY_LITE_DEFAULT_STR_LEN,
int iNoOfStrInNode = DICTIONARY_LITE_DEFAULT_NO_OF_STR_IN_NODE,
int iNoOfStrInTransition = DICTIONARY_LITE_DEFAULT_NO_OF_STR_IN_TRANSITION);
Use the second constructor when creating the dictionary for the first time. The second argument indicates whether or not to keep the dictionary file after being destroyed. If you create a large number of dictionaries throughout the program and do not need them after the program termination, keep this argument false. There are other dictionary related arguments (optional) that you can specify according to the requirement. For an example, if you are going to use character as the unit of string, you may set
iStrLen = 2 (remember to keep one byte for the null terminator).
iNoOfStrInNode is the number of strings in a node or in a TMP.
iNoOfStrInTransition is the number of strings in a transition page. If not specified these parameters will be assigned with default values. However, it's always a good practice to enter these values to minimize the space and time complexities of your dictionary.
The first constructor should be used when creating a dictionary object using an already existing dictionary file. Only the filename has to be given because the dictionary file contains all other parameters in its header.
Adding words (strings) to the dictionary
After creating the dictionary object you can add words to it via the function
void AddString(CStrList& oStrList, int iCount = 1);
Note that you enter a string list rather than a single string as the word. This is because the unit of the string used in the dictionary is also a string. This is done to make it more generic.
CStrList is a class derived from
std::list<CString> so all you have to do is push unit strings to it. If this is not clear, refer to the example below. Optionally the count of the string can be given. If the word is already in the dictionary, its count will be incremented by this number.
Retrieving words from the dictionary
- Existence of a word in the dictionary can be verified by the function
int GetStringCount(CStrList& oStrList);
If the word is in the dictionary the output will be its count, zero otherwise.
- All the words in the dictionary can be sequentially retrieved using the two functions
bool GetFirstWord(CStrList& oStrList, int& iCount);
bool GetNextWord(CStrList& oStrList, int& iCount);
Function names are self-explanatory. Words in the dictionary are traversed in a depth-first-left-recursive fashion. i.e. if A and are two sibling nodes (with same parent node) with A is in left to B in the tree, B is not visited until A and all its descendents are visited.
After using the dictionary object just delete it. It's that simple! The dictionary file will remain in the disk (unless you entered
bKeepFile parameter as false in the constructor) and you can use it again.
The following code snippet illustrates these basic operations:
CDictionaryLite* pDicLite = new CDictionaryLite
("MyDic.ldic", true, 3, 5, 6);
CStrList oStr1, oStr2, oStr3;
bool bIsInDictionary = (pDicLite->GetStringCount(oStr3) > 0);
bIsInDictionary = (pDicLite->GetStringCount(oStr3) > 0);
Points of Interest
Memory required for a dictionary object is in bytes - not even in kilobytes! This does not depend on the actual dictionary size. It takes less than 1ms to access one word (to find the count of a word) in my 2.8 GHz machine. This can be definitely made even more faster by adding in-memory cache. Can somebody try this?
The space that the dictionary acquires from disk can be reduced by implementing variable size pages because in this implementation the page size (number of strings in a page) is fixed and not necessary all the entries in a page (transition page or TMP) are used. In this case the page size can be stored somewhere in the page itself. Appreciate if someone can try this as well.
10-March-2007: Initial Release