|
Introduction
A while back, I was looking for a data structure that could store strings quickly and retrieve them very quickly. I stumbled upon the Patricia (PAT) trie ("Practical Algorithm To Retrieve Information Coded in Alphanumeric"). Unfortunately, I could not find a good enough explanation of the trie. Every explanation I read, glossed over too many details. Every code piece (there are not many) that I could find had very nice implementations, but did not store text only numbers. So, I wrote my own class that stores text.
If you want to store a good amount of unique text strings (duplicates can be handled, just not by the class itself), retrieve them super fast, and associate data or functions with them, then read on!
This class could be used for (as examples, but not limited to):
- command lists (like menus or OSI(7) protocols)
- a dictionary or thesaurus
- associative arrays
- catalogue ports on a TCP server
- store/parse XML
Not bad, huh?
Patricia Trie
I would like to give a brief overview of what a trie is and then speak about PAT tries more specifically. It is assumed you understand how a binary tree works.
A trie is a binary tree that uses key space decomposition (as opposed to object space decomposition). The idea is that instead of branching left or right based on what data value happens to be at a particular node in the tree (that is not a mistake, tries are trees), we branch based on predefined key values. So the root node of a tree could send the first thirteen letters of the alphabet to the left child while the others go to the right. This does not guarantee a balanced tree, but it helps to keep the tree balanced and ensures that the order in which data is input will not affect the performance of the data structure (i.e., insert a sorted list into a binary tree and end up with a list).
In a PAT trie, all of the internal nodes (a node with at least one child) hold only enough information to make branching decisions (go left or right), but no actual data. The leaves of the trie (a node with no children) hold the data: a key and a link to one class (that holds whatever we need it to).
My PAT trie uses integers (on a 32-bit machine) for keys. If you want to use a long for the key, then you must change the type of the key (keyType). If you change the type of the key or you compile this class on a machine that encodes integers (int) with more than 32 bits, then you must set the constant MAX_KEY_BITS to the number of bits that your data type is wide (default is 32). We will see later how to convert text to integers.
With PAT tries, at every level of the trie, we evaluate one bit of the key, and based on that bit, move left or right. To find any key in our trie, we only have to check at most the number of bits that our key is wide (32 bits).

The image above shows a PAT trie that stores a few keys (binary representation shown). The black circles are internal nodes. The yellow circles are leaf nodes.
Text To Number
Now, we turn our attention to making this a useful PAT trie. We need a way to turn a string of text into our key. All of the three methods described below are based on the fact that Unicode or ASCII characters are really just numbers. The character 'a' is also 97.
The first method involves converting all of our characters into 8-bit numbers (a byte) and concatenating them. There are only 26 letters in the alphabet; we can represent them with a byte. This is not a very good solution, because we could only discriminate against four letters (4 letters * 8 bits = 32 bits). Also, most of the branching would be on the same characters (eight bits to store an ASCII character, eight identical internal node checks).
The second method allows us to store more letters in our key. We could add our characters ('a' + 'e' + ... = 4532). In this way, with the largest ASCII character being no more than 255 (2^8-1) or in Unicode 65535 (2^16-1), we could have our key encode many numbers and we would not be wasting space on comparing eight internal nodes for one character. This method has one fatal flaw: encoding "was" or "saw" we will get the same number (3 + 5 = 5 + 3).
The last method uses a Huffman code. Huffman codes are primarily used for compression. They encode letters in a new format using the least amount of bits possible by assigning the least number of bits to the most frequent character. So, if 'e' is the most common letter in a document, then we will represent 'e' with a zero (or just one bit). We would represent the next most frequent letter with a one and the third most frequent with a two. In this way, the Huffman code saves the most space of the three methods, by replacing the most frequent letter with the smallest representation.
Please note that with any of these three methods, there is no guarantee that the generated key will be unique. If all of the letters are identical until the point where our key is full (we have represented as many characters as possible), then we will have generated the same key for two distinct strings ("a baby ate the fox" and "the baby ate the fox" will translate to the same key). To ease this issue, use a key with greater than 32 bits. With a 32 bit key, we can represent at most 32 characters (provided those characters can be represented by one bit ~ 32 e's or t's).
Huffman Is Our Man
The Huffman code used in my PAT trie is an array of twenty-six numbers. The first cell holds the Huffman code for the letter 'a', the next cell for 'b', and so on. The code stored in each cell is the reduced bit representation of the corresponding letter. For instance, the fifth cell is for 'e' and holds the number zero. 'e' is the most frequent letter in the English alphabet and so should map to the smallest number. To build our key:
- for each letter right to left (is there more variability at the end of words?)
- look up the Huffman code for the letter (we will refer to as "Huffing")
- shift the bits on the key to the left for the number of bits that the Huffman code is long (make space to add the code to the key)
- add (arithmetic) the new Huffman code to the key
To better explain the last three steps, here is an example:
- 'l' huffs to 8 or 1000 (in binary)
- the key 0000 0000 0000 0000 0101 1101 0110 0011, becomes 0000 0000 0000 0101 1101 0110 0011 0000
- we can add 8 to the key to get 0000 0000 0000 0101 1101 0110 0011 1000
How was the decision made that the letter 'e' is the most frequent and 'z' is the twenty-third most frequent? Initially, I wrote a program to mill through text and produce a list of the most frequent letters in the English alphabet. Fortunately, it was brought to my attention (by Don Clugston - on Code Project) that there is an official list of the most frequent letters. I should have thought of that... Irregardless, the code has been updated and kudos/thanks to Don! The change will improve the speed of the code. Note that if you use this class to store command lists or some type of structured document where the same words are being used over and over, it pays to rearrange the frequency to match the most frequently used letters. If you were storing C programming code, then if and while would crop up quite a bit. The most frequent letter should (possibly) be i.
Huffing Again and Again
What if you Huff duplicates? My PAT trie is set up so that you associate an entire class with a key. It will only store a key once, but the class associated with it could remember information about duplicates. For instance, here are three scenarios:
- You could keep a variable to count how many times you found a particular word. Instead of inserting, just search for the key and increment the variable in the associated class.
- You could build a linked-list in the associated class to store the position of the text string in a text document. Then, you would know that the word was stored at positions 45, 64, 134, and 856. Note: this would be down right horrible if you wanted to piece the document back together.
- You could build a linked-list that, instead of storing the position of the string, pointed to the next string in the document (or just stored the next word in text). That would help if you wanted to rebuild the document a little more efficiently later on. You could follow the bread crumbs of each string.
Code Overview
There are three classes that together create my Patricia Trie:
PatriciaTrieC ~ the Patricia trie.
PatriciaNodeC ~ an internal or leaf node in the trie.
ASSOCIATED_CLASS (PayloadC) ~ the associated class, originally PayloadC, but it can be changed by changing the constant ASSOCIATED_CLASS.
The Huffman Code [CODE]
The Huffman code array is a static member of the PatriciaTrieC class. It is initialized once for the class and available for all objects of PatriciaTrieC. If you want to change the values, you will have to go in and tweak the array itself.
char PatriciaTrieC::huffman_code [] =
{
2,
18,
11,
10,
0,
The code that Huffs text to keys will ignore all characters that are not A through Z (case-insensitive). So, the strings "HAPPY" and "H A P P Y" would Huff to the same key. Also note that the character strings should be terminated by the end of string character ('\0' or zero).
do
{
if (txt [txt_count] == END_OF_STRING)
break;
if (txt [txt_count] >= 'a' && txt [txt_count] <= 'z')
txt [txt_count] -= LOWER_TO_UPPER_FACTOR;
if (txt [txt_count] < 'A' || txt [txt_count] > 'Z')
{
txt_count += 1;
continue;
}
.
.
.
Using The Code [CODE]
You can insert, search, and remove a key (or one data element) out of the trie. There is a method BuildKey that you can use to convert text into keys if you wanted to store or find out the exact key signature for a string of text. You can also call Print to print the entire trie (if DEBUG is set) or Clear to wipe the entire trie.
The class associated with every leaf node (where data is stored) is PayloadC. If you want your own class, just change the constant ASSOCIATED_CLASS in the pat.h file to the name of your class.
The included project has a nice little program that loads words from a text file and inserts, searches and then removes them all. Below is a simple example of those three operations (insert, search, remove): #include "pat.h"
#include <iostream.h>
int main ()
{
PayloadC* p_dyn = new PayloadC ();
PayloadC p_stc;
PayloadC* catch_payload = NULL;
PatriciaTrieC pat;
if (pat.Insert (p_dyn, "HAPPY"))
cout << "inserted HAPPY" << endl;
else
cout << "did not insert HAPPY" << endl;
catch_payload = pat.Search ("HAPPY");
if (pat.Insert (&p_stc, "SAD"))
cout << "inserted SAD" << endl;
else
cout << "did not insert SAD" << endl;
catch_payload = pat.Search ("SAD");
catch_payload = pat.Search ("WHO");
if (!catch_payload)
cout << "WHO does not exist" << endl;
catch_payload = pat.Remove ("HAPPY");
catch_payload = pat.Remove ("SAD");
catch_payload = pat.Remove ("WHO");
return 0;
}
Points of Interest
The word frequency list I used came from here.
Worst-Case Running Times
Insert |
O(bc+bc*bc) |
where bc is the maximum number of bits for the key (32 bit comparisons to search for the spot to insert, plus for every break you have a full 32 bit comparison for duplication - this could be removed, but then duplicates would automatically create a path 32 bits deep, which is fine, but the extra work is better in the average case) |
Search |
O(bc) |
where bc is the maximum number of bits for the key (using an int on a 32 bit machine, 32 bit comparisons) |
Delete |
O(bc) |
where bc is the maximum number of bits for the key |
BuildKey |
O(n) |
where n is the length of the string |
The PAT trie with a 32 bit key can store (2^32-1) / 2 = 2,147,483,647 words and find any of them in 32 bit comparisons.
History
- version 0.0 [November 9, 2004]
- version 0.1 [December 3, 2004]
- The Huffman encoding code is better optimized based on this.
- No GPL. If you want it, use it. If I goofed and did not remove something, remove it and use it. I am sure I got everything though.
- version 0.2 [December 18, 2004]
- Academic Free License, Version 1.2 applied to source.
| You must Sign In to use this message board. |
|
| | Msgs 1 to 25 of 36 (Total in Forum: 36) (Refresh) | FirstPrevNext |
|
|
 |
|
|
This is a newbie question, okay?
If I want to store stuff in the payload, and have Search allow it to be retrieved, what do I have to do make that work?
I tried putting a char buffer in the PayloadC class but it doesn't seem to be storing it. More likely it's storing it and I don't know how to get it out.
class PayloadC { private:
public: char foo[128]; PayloadC () {} ~PayloadC () {} };
This is part of the code. The trie has been made part of a COM library. p ends up pointing to foo, but the data stored there doesn't gel with what is supposed to have been stored there.
STDMETHODIMP Patricia::Search(BSTR sData, VARIANT_BOOL *bResult) { USES_CONVERSION; LPSTR saData = W2A( sData ); PayloadC* p; *bResult = ( ( p = pat.Search(saData) ) != NULL ) ? VARIANT_TRUE : VARIANT_FALSE; // cout << p->foo; return S_OK; }
This is part of the code that calls the COM object:
For i = 0 To UBound(a) - 1 tag = a(i) Debug.Print "'" & tag & "'" If Not b.Insert(tag) Then fc = fc + 1 Else If b.Search(tag) Then xc = xc + 1 End If sc = sc + 1 End If Next
As well as a Search, it'd be nice to have a Recall as well.
Kind regards, Bruce.
Bruce M. Axtens Software Engineer Strapper Technologies http://www.protiumblue.com
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
Hi Bruce,
When you call insert you have to pass two parameters.
Parameter 1 - the payload Parameter 2 - the key the payload should be associated with
Payload* payload = new Payload ();
pat_trie.Insert (payload, "WORD");
Payload* get = pat_trie.Search ("WORD");
then get and payload will point to the same object.
If you want to add more information to payload go right ahead. The trie does not use the payload for comparison, ever unfortunately. The insert and search methods also accept an unsigned integer (which is what a string is converted to). So if you want to write some hash function for all of the data you will put in payload then you can use that as the key. If the data you will place in payload is character data, then just use BuildKey - which is provided.
class Payload { char* new_data; }
Payload* payload = new Payload (); payload.new_data = {'h','a','p','p','y'}; //I think, but I have not written in C++ for a while.
pat_trie.Insert (payload, BuildKey (payload.new_data));
Payload* get = pat_trie.Search (BuildKey (payload.new_data));
get and payload will equal each other again. Naturally for the search you will have a string from somewhere else.
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
Michael
After feeling like it was all my fault that I couldn't get things working appropriately, I discovered a bug in your code.
In pat.cpp you have a bit of code as follows:
do { huff_code >>= 1; //shift bits off until we get to zero shift_bits += 1; //number of bits we need to shift before the add bit_count -= 1; //track how many bits have been shifted so far } //quit when we have counted how many bits are used to represent number while (huff_code > 0);
The shift_bits variable is initialised to zero but only at the top of the routine. What happens then is that if the first character has 4 bits, shift_bits is set to 4, and if the next character has 4 bits, then shift_bits is set to 8 and so on for the length of the input string. This leads to some rather weird keys.
So I've added in an appropriate change:
shift_bits = 0; do { huff_code >>= 1; //shift bits off until we get to zero shift_bits += 1; //number of bits we need to shift before the add bit_count -= 1; //track how many bits have been shifted so far } //quit when we have counted how many bits are used to represent number while (huff_code > 0);
Okay, that's the bug. Then there's the problem: some strings give the same key. For instance, TEARE and ANTEE give the same key value of 172. Any ideas?
Kind regards, Bruce.
Bruce M. Axtens Software Engineer Strapper Technologies
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
 |
|
|
Michael
Shall give that a shot.
Thanks very much for the rapid responses. Given the years since your article was written I was pleasantly surprised to get a response at all.
Thanks, Bruce.
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
 |
|
|
In the VBScript code below, I've implemented the huffman conversion you have in BuildKey and I pass to it some names. The output is 'text ( binary_representation ) decimal_representation'. Note that the line 'huff = aHuff(n) * i' is multiplying the huffman value by the position of the character in the text. (I looked for other places to put your multiplication suggestion and that seemed to be the only sensible place.)
The fact is that, for the examples given, it matters not whether the the value is '* i' or not as both texts evaluate to the same numeric value.
Results with multiplier:
ISNER ( 100 1110 1111 0 11110 ) 80862 LONER ( 1001 110 1111 0 11110 ) 80862 ANDRLE ( 10 1010 11110 11000 101101 0 ) 5631066 ANDRY ( 10 1010 11110 11000 1011010 ) 5631066
Results without multiplier:
ISNER ( 100 111 101 0 110 ) 5078 LONER ( 1001 11 101 0 110 ) 5078 ANDRLE ( 10 101 1010 110 1001 0 ) 88786 ANDRY ( 10 101 1010 110 10010 ) 88786
The positional multiplier idea was good, but breaks down with the number of unique strings that I'm throwing at it (>20000). One could, I suppose, take a copy of the input string, reverse it and append it to the input string before passing it in ...
Thanks for your patience.
Kind regards, Bruce. VBScript code:
option explicit dim sHuff dim aHuff sHuff = "2,19,11,10,0,14,16,8,4,23,21,9,13,5,3,15,24,6,7,1,12,20,17,22,18,25" aHuff = split(sHuff,",")
translate "ISNER" translate "LONER" translate "ANDRLE" translate "ANDRY"
sub translate( sFrom ) dim i dim c dim n dim huff dim t t = "" for i = 1 to len( sFrom ) c = mid( sFrom, i, 1 ) n = asc(c) - asc("A") huff = aHuff(n) * i 'wscript.echo c,n,aHuff(n),i,huff,hex(huff) t = t & getbinary(huff) & " " next wscript.echo sFrom, "( "&t&")", getinteger(t) end sub
'binary functions adapted from http://www.romanpress.com/Articles/Bitwise_R/Bitwise.htm Function GetInteger(sBinary) ' Returns the integer that corresponds to a binary string of length 8 Dim iRet dim i dim s ' Remove any spaces s = Replace(sBinary, " ", "") iRet = 0 For i = 0 To Len(s) - 1 iRet = iRet + 2 ^ i * CInt(Mid(s, Len(s) - i, 1)) Next GetInteger = iRet End Function
Function GetBinary(iInput) ' Returns the 8-bit binary representation ' of an integer iInput where 0 <= iInput <= 255 Dim s dim i If iInput < 0 Or iInput > 255 Then GetBinary = "" Exit Function End If s = "" while iInput > 0 s = CStr(iInput Mod 2) & s iInput = iInput \ 2 wend if s = "" then s = "0" GetBinary = s End Function
Bruce M. Axtens Software Engineer Strapper Technologies
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
Hey guys help me plz However I read above all document of B tree,B+tree I don't fully understand B+tree Now I have to implement this problem
Description In this assignment, you will implement the B+-tree data structure based on the algorithm described in the chapter 6, “Index Structures for Files” of the reference book, “Fundamentals of Database Systems,” by Elmasri and Navathe, third edition, 2000. Your B+-tree is built on a fixed-length data file such as a student table. You need to design your own data file for students e.g. Students(number, name, gpa, grade, major) and generate about 20 records automatically or manually. All fields are fixed length: 8 for number (20022509), 3 for name (PIS), 4 for gpa (4.00), 1 for grade (2), 2 for major (EE, CS). Each field is separated or delimited by a white space. Initially, your B+-tree is constructed based on this data file. After the initial loading, new records can be inserted or existing records can be searched through a command-line interface or a graphical user interface (GUI). According to user commands, both your B+-tree and the original data file should be updated and rebuilt. A selection command will retrieve an entire record associated with the key that you query in. The selection has two options; one for searching with the index and the other for searching without the index. You need to evaluate the response time of each option to compare the performance of an index search with a sequential search.
Requirements You need to (1) implement several operations on B+-tree. Insertion Selection with/without an index Also (2) compare the performance of retrieving records with or without index. Hint: test your B+-tree many many times with tens of thousands of records and calculate the average response time in both cases. You need to (3) provide at least command-line interface but GUI encouraged.
What should I do?
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
i need to use special characters in the PAT apart from the alphabets like @ , _ . etc., how it can be done using huffmann PAT?
|
| Sign In·View Thread·PermaLink | 2.00/5 (1 vote) |
|
|
|
 |
|
|
hi i am a student of MCS i need ur pesudo code plz help me and send me on siya_be@yahoo.com .my final project is related to huffman coding . and i improve this coding i need ur help. so plz send me pesudo code so i understand ur code. plz send me.
|
| Sign In·View Thread·PermaLink | 1.00/5 (1 vote) |
|
|
|
 |
|
|
when one or more spaces are included. For example, it gave an incorrect text count number when I include the words, "Happy Birthday" as a key. It gave the count as 10 (beginning the count at zero and including the space).
However, when I used the word, "H A P P Y", it gave the correct count as 8 (beginning the count at zero and including the space).
Why did it incorrectly count the first sample?
Also, when I used the word, "H A P P Y", it did not insert the key. It showed "--". Yet, later on when it went searching, it showed that it found, "H A P P Y".
How could it find something it didn't insert???
William
Fortes in fide et opere!
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
Sorry about one part of my comment above, in which I said it did not insert the key, "H A P P Y".
It DID insert the key earlier when I used it as, "HAP PY", and therefore did not reinsert it a second time when I entered it in a different format. (My mistake on that part of my work with the sample.)
However, because "Happy Birthday" is made up of 2 words and had obtained an incorrect text count when I used it, I wanted to see if by simulating "HAP PY" as 2 words, if it would also produce an incorrect text count. It didn't!!!
Following some experimentation (one of which included extending the key size), I noticed some of the text count problems were solved, but NOT ALWAYS!!
The question then is, "If extending the key size doesn't always take care of the text count problem, why extend the size in the first place?" IOW, stick with the original key size and have ALL the data, nonetheless, entered and saved.
I say that, because if my word is, "IWishYouAHappyBirthdayAndMayYouSeeManyMore", we know that the text count will be off. But if the data can be inserted and later found (when searched for), it'd seem to me that its major purpose will have been accomplished. Right?
William
Fortes in fide et opere!
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
first, thank you again, for the great article. I also have a question. Do you know any source of information about have a trie on a disk and accessing it FROM the disk? the trie is a memory-optimized datastructure, and saving the nodes on the disk may lead to a lot of diskaccess for retrieve only one value I googled for a while but can't find anything interesting about leaving the trie-data in a file. B-Trees are not suitable for my problem (I think...), because I must have the ability to find misspelled words in the trie.
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
Well I think this data structure has to stay in memory Jumping around in a binary file would not be very helpful or efficient. I do think you could use the trie in memory as a full-index to your data. So every node of the trie stores the name of a text file (one text file for every node). Even better would be to have a binary file and set every node in the trie to point to a position in the binary file where that (very large) data structure exists. In this way you have the trie in memory, but the big data does not get pulled up unless it has to.
You are right, there is nothing on this stuff. Thanks for the compliment.
|
| Sign In·View Thread·PermaLink | 1.00/5 (1 vote) |
|
|
|
 |
|
|
In your sample, you have both the letters "U" and "F" set to the number '13'.
How should that be handled? Wouldn't changing it to a different number result in a massive displacement of other letters with regards to the numbers they already have assigned to them?
I noticed nothing different happened when I inserted 3 new words (at different places) in the input file: "unfit", "function", and "unfiltered". I was just curious to see what would happen since both "U" and "F" have the same value.
Thanks.
William
Fortes in fide et opere!
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
Thank you for seeing that. My apologies. It has been fixed in the source code and should be updated soon.
If you wanted to fix it on your own, just count out which letter should be in which position.
e t ... z 0 1 ... 25
The fixed source has the numbering under the letters, so nothing went wrong.
Once again, my apologies.
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
I noticed in the 'BuildKey' function when you're copying text into your buffer, you are using a 'char' type as the array index. Is there any significant reason for using a 'char' instead of 'int' or 'short' for the index?
Technically, I wouldn't use a 'char' type as the index for an array, but I'm new to the Huffman variation and I'm watching everything.
Also, I noticed inside the same function you have it commented where if the text is "big caps" you want to make them small, but then in the same loop you are using a macro in which it seems you are converting from a lower to an upper case. Doesn't this sound contradictory?
In any event, I entered a word all in upper case letters and then ran the program. I didn't see any conversion. The word was printed in all upper case letters just as how I had entered them.
Next, I entered a word followed by a question mark. The same routine states it is handling only alphabetic characters (skipping over non-alphabetic ones), but then it went ahead and inserted the word with the question mark as a key, and later found the same word with its non-alphabetic character, and displayed it. In the end it also showed where it removed the word (with its non-alphabetic character) from the trie.
Lastly, I entered the same word (with the question mark) only this time I put the question mark first. It didn't insert the word, but then it went to the trie and found it. How could it find something that wasn't inserted?
Aren't some of these actions contrary?

William
Fortes in fide et opere!
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
I noticed in the 'BuildKey' function when you're copying text into your buffer, you are using a 'char' type as the array index. Is there any significant reason for using a 'char' instead of 'int' or 'short' for the index?
** The char type is just smaller than an int or something. The array is set at 26, nothing more to it. **
Technically, I wouldn't use a 'char' type as the index for an array, but I'm new to the Huffman variation and I'm watching everything.
Also, I noticed inside the same function you have it commented where if the text is "big caps" you want to make them small, but then in the same loop you are using a macro in which it seems you are converting from a lower to an upper case. Doesn't this sound contradictory?
** The documentation is backward, if it is small it makes it big and then converts it into a key index for the huffman array and then looks up the corresponding huffman code. **
In any event, I entered a word all in upper case letters and then ran the program. I didn't see any conversion. The word was printed in all upper case letters just as how I had entered them.
** you have me stumped here, the huffman code stuff is one way, it takes your string and produces a number, so if there is upper-case stuff being printed somewhere it did not come out of the trie. **
Next, I entered a word followed by a question mark. The same routine states it is handling only alphabetic characters (skipping over non-alphabetic ones), but then it went ahead and inserted the word with the question mark as a key, and later found the same word with its non-alphabetic character, and displayed it. In the end it also showed where it removed the word (with its non-alphabetic character) from the trie.
** Alright, here is the deal. If you insert HELP? and HELP they will huff to the same number, or essentially are the same string. If you huff HELP? it will produce the key for HELP, and when you search for HELP? it will huff HELP and search for the key. So you are and were always using the HELP string and the ? was ignored. You saw the ? come up because it was stored in a node associated with the trie node, but it did not influence the key that was generated. So you can store anything, but HELP? and ?HELP and HELP will all insert to the same spot because the question marks are ignored, so you would only be able to insert the first of those three strings. **
Lastly, I entered the same word (with the question mark) only this time I put the question mark first. It didn't insert the word, but then it went to the trie and found it. How could it find something that wasn't inserted?
** Right if you already had inserted HELP, then ?HELP would not insert because it huffs to HELP. It already existed, because the question mark was ignored. **
Aren't some of these actions contrary?
** Not contrary, maybe confusing though. **
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
There is one activity I noticed which keeps getting repeated over and over. It is the "BuilKey" function. It gets done when you're first inserting the key (which I can understand). Then it gets repeated when you're searching for the entry you've inserted, and again when you're about to remove the same entry.
Considering these are duplicate activities which are being done, wouldn't saving the key and then using it later to access the entry (to confirm that you found the entry and then to remove it) be faster than to have to keep building it over and over?
Granted we are dealing with a couple dozen words in your sample, but what if we were working with (let's say) 200 different words that were being used (let's say) 1000 times. That's 200,000 times we'd need to build the key for 'inserting'; 200,000 times for 'accessing' it (etc.). It would just be a very large number of times we'd be performing the same activity to accomplish the same thing!!!
Wouldn't just saving the key (once it's been built) and then later use it (for whatever we want to do) be a better way to go?
William
Fortes in fide et opere!
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
Yeah that would make sense, I agree. Here is the trick, where will you save and how will you save the keys?
It is the PAT trie that is storing them. If you have another data structure saving the keys, then you must search that to search the PAT trie, so now you are limited by the alternate data structures. Or that argument is not even worth saying, either way.
The point is that it is the PAT trie that is storing these things. If you want to store all of your keys in a table or something be my guest, but then you have to search the table.
Maybe a hash table, that is an idea... But then you could still, just use a hash table... which is not as flexible as a PAT trie.
The best solution is to find a better (faster) way to build keys, then it would add less overhead.
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
My thoughts were more aimed at "searching" the trie (once the keys have been built) rather than keep rebuilding them every time.
Using my example, I pointed out that we were using just 200 words. (That's 200 different words.) However, we were using some of them maybe 30 times, others 20 times, while still others maybe 5 times, etc. It's the amount of times the various words were being used that amounted to the 1000 figure. Multiplying that 1000 figure for the 200 words was the reason behind looking for a better way when you consider we would end up with over half a million times just building keys.
Since we're dealing with just 200 words (for this example) storing 200 words along with their PAT keys shouldn't put a lot of stress on the system doing that, and for those words that are frequently used, we can put them up front in the table to cut down on the search for them inside the table.
Do you think this is a more reasonable solution to the repetitive key building activity?
Thanks for your continuous valuable response.
< | | | | | |