Click here to Skip to main content
15,881,757 members
Articles / Programming Languages / C++

How to build an in-memory state engine at runtime

Rate me:
Please Sign up or sign in to vote.
3.67/5 (9 votes)
28 Oct 2007CPOL3 min read 32.5K   174   20   5
A C++ template for an efficient in-memory state engine.

tag tree

Abstract

If you want to detect keywords or special tags in a data stream, you typically use a parser generator like yacc or ANTLR. Otherwise, you can use the power of Regular Expressions. In some cases, a handcrafted parser will fulfill the requirement. To simplify this task, I wrote the template tag_tree (as part of a larger library using boost) to create a kind of state engine in memory during runtime. It has a low memory footprint, and is more efficient than any tokenizer-string compare approach could be.

Introduction

A handwritten parser/scanner needs a lot of states, and looks similar to the following piece of code:

C++
boost::tribool parser::consume(char input)
{
  switch (state_)
  {
  case state_1:
    if (input == 'P')
    {
      state_ = state_2;
      return boost::indeterminate;
    }
    else
    {
      return false;
    }
  case state_2:
    if (input == 'O')
    {
      return true;
    }
    ... <snip> ...
  default:
    return false;
  }
}

This is the implementation of a state engine, and is very error prone if you have to manage a lot of states. It works fine and fast for you, but there is a lot of work to modify the keywords. The following code snippet shows how you can solve this problem with the tag_tree template:

C++
class parser
{
public:
    ...
    typedef tag_tree < std::string, unsigned >     TagTree_;
    typedef typename TagTree_::MyPtr_    TagTreePtr_;
    
    TagTree_    kt_;
    
};

parser::parser()
    : prev_( NULL )
{
    kt_.insertKeyWord( "POST", 10 );
    kt_.insertKeyWord( "HOST", 20 );
    kt_.insertKeyWord( "XONN", 30 );
}

unsigned parser::run()
{
    TagTreePtr_    prev = &kt_;
    for (;;)
    {
        TagTreePtr_ p = prev->findNode( next() );
        if (p == NULL)
        {
            break;
        }
        prev = p;
    }
    
    return (prev->isLeaf())
        ? prev->getValue()
        : -1;    //! nothing found
}

next() returns the next character from the data stream. While findNode() is matching, the method continues to loop. After breaking the loop, isLeaf() tests if a valid keyword was found and the method returns an identifier. It is quite possible to return the prev pointer itself. prev points in every case to a valid address, and contains all the information you need.

How does it work

The main idea is that each keyword builds a unique trace of linked nodes where each node represents a specific parser state. To build this trace, a tree of arrays will be used. The following figure shows a tree with the keywords PORT and POST:

tag tree

And here comes some comfort to you. If you prefer speed over memory, choose the maximum size for the arrays (256 for char, for example). Then choose the simplest hash algorithm that is possible - the ASCII code as the array index. To save memory, the array could be shrunken. But then we have to manage a collision list.

In the download section is a project named TagTree which contains a more sophisticated sample of usage for the tag_tree template.

Reference

tag_tree_base

Defines some typedefs and access methods for all node values. tag_tree_base is not restricted to strings. You can use any string-like container which defines a const_iterator and a value_type type.

tag_tree_base constructor and destructor

tag_tree_base::tag_tree_base( char_type c );
Parameters:
  • c: The character this node is representing.

getValue

value_type & tag_tree_base::getValue();
Return value:

Value of this node.

tag_tree constructor and destructor

tag_tree::tag_tree( char_type c );

tag_tree is designed for usage in a streaming parser. tag_tree can help you build compact state engines in memory, since each node represents a unique parser state.

Parameters:
  • c: The character this node is representing.

Setting all pointers to NULL is done by the containers.

findNode

tag_tree::MyPtr_ tag_tree::findNode( char_type c );
Parameters:
  • c: character to find.
Return value

A pointer to a node containing the specified character c or NULL.

empty

bool tag_tree::empty     (          )      const;
Return value

true if there are no outgoing node pointers.

The semantic is similar to the STL - container empty() method. If the node array nodeMap_ contains no pointers, then nodeList_ should not either. And it should be a leaf. Otherwise, something is going wrong.

leaf_count

size_t tag_tree::leaf_count( bool branch = false )     const;
Return value:

The count of keywords starting from this point.

dump

void tag_tree::dump(     ostream_type_ &      os,
        size_t      depth = 0L,
        size_t      branch = 0L
    )             const;
Parameters:
  • os: an open output stream, std::cout, for example.

Prints a plain list of all nodes and their attributes. Useful for debugging purposes.

lookUp

tag_tree::lookUp(    char_const_iterator from, char_const_iterator to ) const
Parameters:
  • from: An input iterator addressing the position of the first element in the keyword.
  • to: An input iterator addressing the position that is one past the final element in the keyword.
Return value:

The value of the keyword. If the keyword does not exist, a null value will be returned.

Looks up for the specified keyword.

History

  • 29 Oct., 2007 - Initial release.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
osy
Software Developer (Senior)
Switzerland Switzerland
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
QuestionA trie? Pin
Robert Surtees30-Oct-07 12:53
Robert Surtees30-Oct-07 12:53 
AnswerRe: A trie? Pin
osy1-Nov-07 4:03
osy1-Nov-07 4:03 
GeneralI don't understand it so ... Pin
The Wizard of Doze28-Oct-07 12:21
The Wizard of Doze28-Oct-07 12:21 
GeneralRe: I don't understand it so ... Pin
Paul A. Howes28-Oct-07 15:55
Paul A. Howes28-Oct-07 15:55 
If you don't like Boost, that is your problem. Why don't you create a Boost-free version for your own use instead of asking the author to completely rewrite code based on your questionable desire?

--
Paul

GeneralRe: I don't understand it so ... Pin
osy30-Oct-07 9:03
osy30-Oct-07 9:03 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.