Reasoning
String lookups can be a time consuming operation. This is because it often involves repeated calls to strcmp
. This is because we represent strings in standard null-terminated form. But by changing the representation of strings, we can increase lookup time.
Data representation
The data representation in my class is like the Trie structure. But I forgot the name so I just called it CTier
because it resembles a tier.
Basically, it stores words character by character. The head of the structure is a linked list to the starting letters of each word stored. Each letter has a pointer to another linked list which stores the second characters that make up the word. Therefore, to find a word, you first scan for the letter through a linked list, then continue on to the next linked list and scan for the next letter until you have consumed all letters in the search string.
The linked lists are in no order (they do not store letters from a-z). Also there is no distinction between upper and lower case letters. They are in the order in which words are added to the structure. But on average, it would require traversing through half the linked list (13 characters) to find the character you want. The search's time depends on how large the search string is (on average = 13 * strlen(word)
). I don't have any usage time statistics but I do have some benchmarks available.
I had an application that would load an entire dictionary into a file and then search for strings up to 13 characters long. At first, the data structure I used was an array of an array of strings. For instance, the array was of length 26, all words staring with 'a' went into index 0, 'b' into 1 etc. Then all the words were stored sequentially so a binary search was possible. Finding a word consisted of finding the index based on first letter, then doing a binary search (by using strcmp
). The entire operation would take roughly 40 seconds on my 1.4 ghz machine. But I changed my data structure over to this, and it cut the time down to about 10 seconds. So I was pleased!
Code
typedef struct _tagTIERDATA
{
char ch;
bool bCompletesWord;
_tagTIERDATA *pNextTierDown;
_tagTIERDATA *pNext;
} TIERDATA,*LPTIERDATA;
Here is some helper code to search through our data and find the letters:
LPTIERDATA CTier::FindLetter(char ch,LPTIERDATA pList)
{
for (LPTIERDATA pIter = pList;pIter;pIter = pIter->pNext)
{
if (ch == pIter->ch)
return pIter;
}
return NULL;
}
LPTIERDATA CTier::FindOrCreateLetter(char ch,LPTIERDATA pList)
{
LPTIERDATA pLast;
for (LPTIERDATA pIter = pList;pIter;pIter = pIter->pNext)
{
if (ch == pIter->ch)
return pIter;
pLast = pIter;
}
pIter = new TIERDATA;
if (!pIter)
return NULL;
pIter->ch = ch;
pIter->bCompletesWord = false;
pIter->pNext = NULL;
pIter->pNextTierDown = NULL;
pLast->pNext = pIter;
return pIter;
}
Here is the main code to add a word to our structure:
bool CTier::AddWord(const char *pWord)
{
LPTIERDATA pIter;
char ch;
LPTIERDATA pLast;
if (pWord == NULL)
return false;
if (m_root == NULL)
{
m_root = new TIERDATA;
if (!m_root)
return false;
m_root->bCompletesWord = false;
m_root->ch = *pWord;
m_root->pNext = NULL;
m_root->pNextTierDown = NULL;
}
pIter = m_root;
for (ch = *pWord;ch;ch = *(++pWord))
{
pIter = FindOrCreateLetter(ch,pIter);
if (!pIter)
return false;
if (pIter->pNextTierDown == NULL && pWord[1] != '\0')
{
pLast = new TIERDATA;
if (!pLast)
return false;
pLast->ch = pWord[1];
pLast->bCompletesWord = false;
pLast->pNext = NULL;
pLast->pNextTierDown = NULL;
pIter->pNextTierDown = pLast;
}
pLast = pIter;
pIter = pIter->pNextTierDown;
}
pLast->bCompletesWord = true;
return true;
}
And finally, the code that searches for a string in our structure:
bool CTier::ContainsWord(const char *pWord) const
{
if (m_root == NULL)
return false;
LPTIERDATA pIter = m_root;
char ch;
LPTIERDATA pLast;
for (ch = *pWord;ch;ch = *(++pWord))
{
pIter = FindLetter(ch,pIter);
if (!pIter)
return false;
pLast = pIter;
if (!(pIter=pIter->pNextTierDown))
return false;
}
return (pLast->bCompletesWord);
}