Class for quick string lookup






3.25/5 (3 votes)
Sep 11, 2001
2 min read

76551

600
Class that represents strings in a tier to enable faster lookups.
Reasoning
String lookups can be a time consuming operation. This is because it often involves repeated calls to strcmp
. This is because we represent strings in standard null-terminated form. But by changing the representation of strings, we can increase lookup time.
Data representation
The data representation in my class is like the Trie structure. But I forgot the name so I just called it CTier
because it resembles a tier.
Basically, it stores words character by character. The head of the structure is a linked list to the starting letters of each word stored. Each letter has a pointer to another linked list which stores the second characters that make up the word. Therefore, to find a word, you first scan for the letter through a linked list, then continue on to the next linked list and scan for the next letter until you have consumed all letters in the search string.
The linked lists are in no order (they do not store letters from a-z). Also there is no distinction between upper and lower case letters. They are in the order in which words are added to the structure. But on average, it would require traversing through half the linked list (13 characters) to find the character you want. The search's time depends on how large the search string is (on average = 13 * strlen(word)
). I don't have any usage time statistics but I do have some benchmarks available.
I had an application that would load an entire dictionary into a file and then search for strings up to 13 characters long. At first, the data structure I used was an array of an array of strings. For instance, the array was of length 26, all words staring with 'a' went into index 0, 'b' into 1 etc. Then all the words were stored sequentially so a binary search was possible. Finding a word consisted of finding the index based on first letter, then doing a binary search (by using strcmp
). The entire operation would take roughly 40 seconds on my 1.4 ghz machine. But I changed my data structure over to this, and it cut the time down to about 10 seconds. So I was pleased!
Code
// this is case sensitive typedef struct _tagTIERDATA { char ch; bool bCompletesWord; _tagTIERDATA *pNextTierDown; _tagTIERDATA *pNext; } TIERDATA,*LPTIERDATA;
Here is some helper code to search through our data and find the letters:
LPTIERDATA CTier::FindLetter(char ch,LPTIERDATA pList) { for (LPTIERDATA pIter = pList;pIter;pIter = pIter->pNext) { if (ch == pIter->ch) return pIter; } return NULL; } LPTIERDATA CTier::FindOrCreateLetter(char ch,LPTIERDATA pList) { LPTIERDATA pLast; for (LPTIERDATA pIter = pList;pIter;pIter = pIter->pNext) { if (ch == pIter->ch) return pIter; pLast = pIter; // save the last one we were looking at } // we fell through here, that means we don't have one, create it pIter = new TIERDATA; if (!pIter) return NULL; pIter->ch = ch; pIter->bCompletesWord = false; pIter->pNext = NULL; pIter->pNextTierDown = NULL; pLast->pNext = pIter; // and link it up return pIter; }
Here is the main code to add a word to our structure:
bool CTier::AddWord(const char *pWord) { LPTIERDATA pIter; char ch; LPTIERDATA pLast; if (pWord == NULL) return false; if (m_root == NULL) // base case, add first entry { m_root = new TIERDATA; if (!m_root) return false; m_root->bCompletesWord = false; m_root->ch = *pWord; m_root->pNext = NULL; m_root->pNextTierDown = NULL; } // ok now we search for the first letter pIter = m_root; for (ch = *pWord;ch;ch = *(++pWord)) // go through letter by letter { // find it in our list, creating a new one if necessary pIter = FindOrCreateLetter(ch,pIter); if (!pIter) return false; // now go down to next level if (pIter->pNextTierDown == NULL && pWord[1] != '\0') // create a new downward level if we have to { pLast = new TIERDATA; if (!pLast) return false; pLast->ch = pWord[1]; pLast->bCompletesWord = false; pLast->pNext = NULL; pLast->pNextTierDown = NULL; pIter->pNextTierDown = pLast; } pLast = pIter; pIter = pIter->pNextTierDown; } // we have dropped out now, set flag for word completion pLast->bCompletesWord = true; return true; }
And finally, the code that searches for a string in our structure:
bool CTier::ContainsWord(const char *pWord) const { if (m_root == NULL) return false; LPTIERDATA pIter = m_root; char ch; LPTIERDATA pLast; for (ch = *pWord;ch;ch = *(++pWord)) // go through letter by letter { // find letter in list pIter = FindLetter(ch,pIter); if (!pIter) return false; pLast = pIter; // now go down to the next tier of letters if (!(pIter=pIter->pNextTierDown)) return false; } return (pLast->bCompletesWord); }