Please comment your votes!
This article presents a ternary search tree dictionary (a symbol table) that is generally as fast as the
Hashtable but also features new search possibilities. This new symbol table can be used in replacement of the other .Net dictionaries when the keys are strings. The main class behaves almost like any
IDictionary table, to the difference that it accepts string key only.
This data structure supports also interesting search methods such as partial match and near-neighbor match. For instance, the near-neighbor search can be used to provide similar words while implementing an online thesaurus.
The ternary search tree dictionary implementation is based on the article of Jon L. Bentley and Robert Sedgewick, Fast Algorithms for Sorting and Searching Strings, cfr. .
The class comes with a full set of NUnit test, benchmarks with
Hashtable and some utility to visualize the tree.
What is a ternary search tree ?
The ternary search tree (TST) is a dictionary data structure specially crafted for the situation where the key are string. It can find, remove and add keys quickly. It can also easily search the tree for partial matches -- auto matches autograph, automatic, and so on. In addition, you can implement near-match functions, which gives you the ability to suggest alternatives for misspelled words -- e.g., impliment matches implement.
A TST stores key-value pairs, where keys are strings and values are objects. In addition, TSTs use memory efficiently to store large quantities of data. Best of all, all of this magic is performed lightning fast (see benchmark). The tremendous flexibility of TSTs provides ample opportunity for creative programming.
The picture in the article front shows how the ternary tree is layout for 12 two letters words: as at be by he in is it of on or to.
How to use it ?
TstDictionary implements the dictionary. It can be used as any other
IDictionary based class, the only restriction being that key must be strings.
TstDictionary tst = new TstDictionary();
Object value = ...;
Iterate over entries
foreach(DictionaryEntry de in tst)
Object value = tst["hello"].Value;
Partial Match search
Partial match search with wild char character. Searching the dictionary for the pattern *o*o*o matches the single word rococo, while the pattern *a*a*a matches many words, including banana, casaba, and pajama:
foreach(DictionaryEntry de in tst.PartialMatch("*a*a"))
This methods finds all words in the dictionary that are within a given Hamming distance of a query word. For instance, a search for all words within distance two of soda finds code, coma and 117 other words.
foreach(DictionaryEntry de in tst.NearNeighbors("a",1))
Why not implementing IDictionary ?
As mentionned in the introduction, the
TstDictionary does not implement the
IDictionary interface, although it implements all the methods needed. The rationale for this is to avoid confusing the user by letting him use non-string keys.
TstDictionary class provides a
Synchronized method to obtain a synchronized (thread-safe) wrapper of the dictionary. This wrapper behaves similarly to other .NET synchronized wrappers.
The Tst project is composed of the following projects:
Tst, a class library that contains the TST implementation,
Tst.Utility, a set of helper classes to visualize the TST modifications,
Tst.Tests, NUnit tests
Tst.Benchmark, benchmarks with
Tst.Console, demo application.
- The enumerator should iterate the keys in alphabetical order, which it does not... yet.
- 11-01-2004, initial release.
Jonathan de Halleux is Civil Engineer in Applied Mathematics. He finished his PhD in 2004 in the rainy country of Belgium. After 2 years in the Common Language Runtime (i.e. .net), he is now working at Microsoft Research on Pex (http://research.microsoft.com/pex).