
Abstract
If you want to detect keywords or special tags in a data stream, you typically use a parser generator like yacc or ANTLR. Otherwise, you can use the power of Regular Expressions. In some cases, a handcrafted parser will fulfill the requirement. To simplify this task, I wrote the template tag_tree (as part of a larger library using boost) to create a kind of state engine in memory during runtime. It has a low memory footprint, and is more efficient than any tokenizer-string compare approach could be.
A handwritten parser/scanner needs a lot of states, and looks similar to the following piece of code:
boost::tribool parser::consume(char input)
{
switch (state_)
{
case state_1:
if (input == 'P')
{
state_ = state_2;
return boost::indeterminate;
}
else
{
return false;
}
case state_2:
if (input == 'O')
{
return true;
}
... <snip> ...
default:
return false;
}
}
This is the implementation of a state engine, and is very error prone if you have to manage a lot of states. It works fine and fast for you, but there is a lot of work to modify the keywords. The following code snippet shows how you can solve this problem with the tag_tree template:
class parser
{
public:
...
typedef tag_tree < std::string, unsigned > TagTree_;
typedef typename TagTree_::MyPtr_ TagTreePtr_;
TagTree_ kt_;
};
parser::parser()
: prev_( NULL )
{
kt_.insertKeyWord( "POST", 10 );
kt_.insertKeyWord( "HOST", 20 );
kt_.insertKeyWord( "XONN", 30 );
}
unsigned parser::run()
{
TagTreePtr_ prev = &kt_;
for (;;)
{
TagTreePtr_ p = prev->findNode( next() );
if (p == NULL)
{
break;
}
prev = p;
}
return (prev->isLeaf())
? prev->getValue()
: -1; }
next() returns the next character from the data stream. While findNode() is matching, the method continues to loop. After breaking the loop, isLeaf() tests if a valid keyword was found and the method returns an identifier. It is quite possible to return the prev pointer itself. prev points in every case to a valid address, and contains all the information you need.
How does it work
The main idea is that each keyword builds a unique trace of linked nodes where each node represents a specific parser state. To build this trace, a tree of arrays will be used. The following figure shows a tree with the keywords PORT and POST:

And here comes some comfort to you. If you prefer speed over memory, choose the maximum size for the arrays (256 for char, for example). Then choose the simplest hash algorithm that is possible - the ASCII code as the array index. To save memory, the array could be shrunken. But then we have to manage a collision list.
In the download section is a project named TagTree which contains a more sophisticated sample of usage for the tag_tree template.
Reference
tag_tree_base
Defines some typedefs and access methods for all node values. tag_tree_base is not restricted to strings. You can use any string-like container which defines a const_iterator and a value_type type.
tag_tree_base::tag_tree_base( char_type c );
Parameters:
c: The character this node is representing.
value_type & tag_tree_base::getValue();
Return value:
Value of this node.
tag_tree::tag_tree( char_type c );
tag_tree is designed for usage in a streaming parser. tag_tree can help you build compact state engines in memory, since each node represents a unique parser state.
Parameters:
c: The character this node is representing.
Setting all pointers to NULL is done by the containers.
tag_tree::MyPtr_ tag_tree::findNode( char_type c );
Parameters:
Return value
A pointer to a node containing the specified character c or NULL.
bool tag_tree::empty ( ) const;
Return value
true if there are no outgoing node pointers.
The semantic is similar to the STL - container empty() method. If the node array nodeMap_ contains no pointers, then nodeList_ should not either. And it should be a leaf. Otherwise, something is going wrong.
size_t tag_tree::leaf_count( bool branch = false ) const;
Return value:
The count of keywords starting from this point.
void tag_tree::dump( ostream_type_ & os,
size_t depth = 0L,
size_t branch = 0L
) const;
Parameters:
os: an open output stream, std::cout, for example.
Prints a plain list of all nodes and their attributes. Useful for debugging purposes.
tag_tree::lookUp( char_const_iterator from, char_const_iterator to ) const
Parameters:
from: An input iterator addressing the position of the first element in the keyword.
to: An input iterator addressing the position that is one past the final element in the keyword.
Return value:
The value of the keyword. If the keyword does not exist, a null value will be returned.
Looks up for the specified keyword.
History
- 29 Oct., 2007 - Initial release.