Click here to Skip to main content
15,860,943 members
Articles / Programming Languages / C++
Article

Opening a door towards Spirit: a parser framework

Rate me:
Please Sign up or sign in to vote.
4.73/5 (44 votes)
23 Mar 20037 min read 198.9K   942   69   29
A quick introduction to Spirit, a parser generator framework based

While reading the article on lexx and yacc (see [5]) that was published a few days ago in CP, I came accross a thread that spoke about Spirit  (see here). I must say I was happy to have read this thread...

Introduction

Spirit (see [1]) is a parser generator framework. It is part of Boost (see [2]) since the release 1.30. (Boost is somewhat a collection of libraries that extend the STL, a must for any C++ developper).

In this article, I will try to give a short idea of what Spirit is capable of. If I succeed you will never approach the problem of string parsing the same way again. Moreover, I strongly recommend the readers interrested with Spirit to jump to the documentation page (see [1]) as it is complete, highly pedagogical and really (2x) well done.

At last, I do not pretend to be a Spirit specialist, neither a grammar theorician. In fact, while playing with parsing in my previous article (see [4]), I was looking for new solutions and I came accross Spirit which made me think that all I had written before could be trashed away.

A mini but very impressive example

Before getting into any details, let's give you a good reason to continue to read this article: Consider that you want to parse a string str containing common separated values and store it into a vector v (this example is takened from the Spirit doc., see [6]).

C++
std::string str;  // the string to parse
std::vector<double> v; // the vector to fill

A good old C approach would use sscanf, a standard C++ would use istrtream's. The Spirit solution is:

// defining the parser rule
rule<> r = real_p[append(v)] >> *(',' >> real_p[append(v)]);

// parsing the string with the rule r 
parse(str.c_str(), r, space_p);

Pretty impressing. Personnaly, I was really amazed by that snippet. Let's analyse the rule r:

  • real_p: matches a floating point number. This is the rule for the first real,
  • [append(v)]: append is a functor that keeps a reference of v. When the match on real_p is found, append(double) is called and the real is added to the vector using vector::push_back,
  • >>: this operator makes a sequence of rules,
  • *( ... ): the pattern contained inside the brackets can be repeated 0 or more times. This is the main loop that will repeatedly read the reals,
  • ',' >> real_p: matches a comma and a real,
  • [append(v)]: again, real are pushed on the vector

Great, simply great. In two lines of code, you can get rid of sscanf and jump into C++ meta-programming! As mentioned before, Spirit doc is really well-done and you can directly skip the end of this article to go and read it... but don't forget to come back.  

Parsers and Spirit

First, let's do our homework on parsers. To do that, we can begin analysing the sentence in the very beginning of the Spirit documentation:

Spirit is an object-oriented recursive-descent parser generator framework implemented using template meta-programming techniques. Expression templates allow us to approximate the syntax of Extended Backus-Normal Form (EBNF) completely in C++.

Sounds pretty scary... what does this sentence mean exactly ?

  • recursive-descent parser:
    Spirit derived parsers are LL(unlimited lookahead).
    Thow (cfr. Home Simpson), what exactly is a LL parser? Good question for a newbie like me. After googling a bit, I found a short description (see [3]), enough to understand the different families of parsers:
    • LL,LR,SLR and LALR are parsing methods. If a grammar can be parsed by that method, it is also called LL, LR, SLR or LALR (k).
    • LL(k) means a top-down parser can be created for the parser with a max lookahead of k symbols. Spirit is of LL(unlimited) type.
    • LR(k) means a bottom-up parser can be created for the parser with a max lookahead of k symbols.
    • LALR(1) grammars are a subset of LR(1) grammars, requiring smaller parsing table. (speed up gain vs. complication of error reporting and recovery).
    • SLR(1) parsing is a hack of LR(0) parsing: you attempt to construct a LR(0) parser for it, and if there are only minor glitches, you slap on some extra control.
    • Because these parsing processes are rather complex, it's a lot of work to check if a grammar is LR(k),SLR(1) or LALR(1)
  • parser generator framework: Spirit is a library to create parser in the code. This is a big difference with other tools like lexx and yacc: with Spirit the grammar description is inlined in the code to the contrary of lexx and yacc, where a grammar description file is proccesed and a C file is generated.
  • meta-programming techniques: Gee, highly advanced programming techniques are used. There is, for sure, a lot to learn, just by using Spirit,
  • Expression templates allow us to approximate the syntax of Extended Backus-Normal Form (EBNF) completely in C++:
    EBNF is a grammar description file. It would deserve an article on its own and I do not plan to go into details about it. EBNF files are described in a CodeProject article (see [5]).
    I will not focus on that feature, however it is straightforward to transform an EBNF file to a Spirit syntax. Usually, all programming languages have a EBNF that you can download on the web, hence it is straightforward to transform them to Spirit.

Spirit is that and more: it is extremely well documented, full of working examples and already contains parsers for C, C++, pascal, xml, etc...

If you didn't switch yet to the Spirit home page yet (see [7]), it's time to go for an exercice: Yet Another Command Parser

Yet Another Command Line Parser

Internet is full of command line parsers, they are usually coded in C and "hand made". We can use the power of Spirit to build a robust command line parser that will store the pair of key-value into a <FONT color=#000000>map< string, string ></FONT>

Command line description

Let's look at a classic command line:

some_command -key1=value1 -key2="this is the value 2"

The characteristics of the above are:

  • some_command: the application name,
  • -key1=value1: is a pair key - value, the key is preceded by - and separated from the value by =, the key and value contain only alpha-numeric characters
  • -key2="this is the value 2": is also a key-value pair, but here the value is encapsulated in " and contains any escaped character

Key - Value parser

Let's begin to build the rules used in a key-value parser:

  • equal is a rule that matches the character =
    equal = ch_p('=');
  • <FONT color=#000000><FONT color=#000000><FONT color=#000000><FONT color=#000000>key_tag</FONT></FONT></FONT> </FONT>is a rule that matches the character -
    key_tag = ch_p('-');
  • key matches the key, +alnum_p describes a word of 1 or more alphanumeric characters, assign_string is a functor that assign the match to self.str_key (self will be defined later)
    key =  (+alnum_p)
        [ 
            assign_string(self.str_key) 
        ];
  • value matches a value, confix_p matches string enclosed in "...", c_escape_ch_p handles escape characters like \", etc..., the operator | is similar to regular expression and means "or":
    value = ( 
        confix_p( '"', (+ c_escape_ch_p )[assign_string(self.str_val)] , '"' ) 
        | (+ alnum_p )[assign_string(self.str_val)] );
  • key_value matches key and values in string like: -key=value or -key="other value"
    key_value = key_tag >> key >> equal >> value;

Now that we have those rules, we can put them together in a grammar. The grammar takes two references to string that shall hold the result (key, value) of the parse.

struct keyvalue_grammar : public grammar<keyvalue_grammar>
{
    // Constructor, takes two reference to string
    // str_key and str_value will hold the parse result
    keyvalue_grammar(std::string& str_key_, std::string& str_val_)
        : str_key(str_key_), str_val(str_val_){};
  
   template <typename ScannerT>
   struct definition
   {    
        definition(keyvalue_grammar const& self)  
        { 
            equal = ch_p('=');

            key_tag = ch_p('-') | ch_p('/');

            key = key_tag >> 
                (+alnum_p)[ assign_string(self.str_key) ];
            value = ( 
                    confix_p( 
                        '"',
                        (+ c_escape_ch_p )[assign_string(self.str_val)] ,
                        '"'
                        )
                | (+ alnum_p )[assign_string(self.str_val)] );
                
            key_value = key >> equal >> value;
        }

         rule<ScannerT>  key_tag, key, equal, value, key_value;
         rule<ScannerT> const& start() const { return key_value; };
    };

    std::string& str_key;
    std::string& str_val;
};

Key value adder functor

We need a functor to add the pairs key-value to the map. The functor needs to provide the void operator()(const char* first, const char* last) const method. Moreover there is one more trick to handle:

When the grammar key_value_grammar is called two string are modified, these strings are subsequently passed to the functor.

template<typename keyvalue_container> 
class add_keyvalue_pair
{
public:
    // key_ and val_ should point to the string modified in keyvalue_grammar
    // kvc_ is the map of key - values
    add_keyvalue_pair( keyvalue_container& kvc_, std::string& key_, std::string& val_)
        : kvc( kvc_ ), key(key_), val(val_)
    {
    }

    // the method called by the parser    
    template <typename IteratorT>
    void operator()(IteratorT first, IteratorT last) const
    {
        kvc.insert( keyvalue_container::value_type(key, val) );
    }
private:
    std::string& key;
    std::string& val;
    keyvalue_container& kvc;
};

The parser

We are almost done. The rule to parse the command line are:

  • Parses the name of the application
    command = (+alnum_p)[assign_string(self.str_command)];
  • Parses the whole line. As one can see, the key_value rule is enclosed in *(...) so that it can happen 0 or more times.
    line = command 
        >> *( 
               key_value
               [ 
                  add_keyvalue_pair<keyvalue_container>( 
                      self.kvc, 
                      key, 
                      value ) 
               ] 
             );

The full definition of the grammar is here:

template<typename keyvalue_container> 
struct cmdline_grammar : public grammar< cmdline_grammar >
{
    // Constructor
    // kvc_ is a key-value container
    // str_command will hold the application name
    cmdline_grammar( keyvalue_container& kvc_, std::string& str_command_)
        : kvc(kvc_), str_command(str_command_)
    {};

    template <typename ScannerT>
    struct definition
    {
        definition( cmdline_grammar<keyvalue_container> const& self )
            : key_value( key, value )
        { 
            command = (+alnum_p)[assign_string(self.str_command)];
            line = command 
                >> *( 
                      key_value
                      [ 
                         add_keyvalue_pair<keyvalue_container>( 
                            self.kvc, 
                            key, value ) 
                      ] 
                     );
        }            

        rule<ScannerT> command, line;
        keyvalue_grammar key_value;
        std::string key;
        std::string value;
        rule<ScannerT> const& start() const { return line; }
    };

    keyvalue_container& kvc;
    std::string& str_command;
};

As emphasized in the code above, key_value and add_keyvalue_pair work on the same strings. So that in the parse process, first key and value are set inside the keyvalue_grammar, and after that, they are added in the add_keyvalue_pair call.

Encapsulating all and making the parse call

You can encapsulate the grammar into another class that hides the details. For example, the following method parses a string and returns true if succeded. kvc is a map<string, string>, command is a string:

bool parse(const std::string& str)
{
    kvc.clear();

    cmdline_grammar<keyvalue_container> cmdline_parser(kvc, command);
    parse_info<> info = boost::spirit::parse(
               str.c_str(), 
               cmdline_parser, 
               boost::spirit::space_p );

    return info.full;
};

The full source of the example is available in the downloads. Otherwize, Spirit contains numerous other examples.

Conclusions

As promised, parsing string should not longer be done by old fashion, error-prone succession of sscanf calls, but by effective and robust "custom" parsers (moreover it's fun).

History

26-03-2003Corrected a lot of typos, added some references.
24-03-2003Initial release.

Reference

[1]Spirit page at boost
[2]Boost home page
[3]Grammar (short) description
[4]Multiple language syntax highlighting, part1: JScript
[5]Introduction to lexx and yacc
[6]Spirit: Predifined actions
[7]Spirit Home Page (Sourceforge)

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here


Written By
Engineer
United States United States
Jonathan de Halleux is Civil Engineer in Applied Mathematics. He finished his PhD in 2004 in the rainy country of Belgium. After 2 years in the Common Language Runtime (i.e. .net), he is now working at Microsoft Research on Pex (http://research.microsoft.com/pex).

Comments and Discussions

 
GeneralRe: Spirit does it Pin
Jonathan de Halleux25-Mar-03 1:00
Jonathan de Halleux25-Mar-03 1:00 
GeneralRe: Spirit does it Pin
pyrokins26-Aug-09 20:20
pyrokins26-Aug-09 20:20 
GeneralRe: Spirit does it Pin
Paul Selormey26-Aug-09 22:16
Paul Selormey26-Aug-09 22:16 
GeneralSpirit home page Pin
Hartmut Kaiser24-Mar-03 21:14
Hartmut Kaiser24-Mar-03 21:14 
GeneralUpdated Pin
Jonathan de Halleux25-Mar-03 0:58
Jonathan de Halleux25-Mar-03 0:58 
GeneralWell done! Pin
Rob Manderson24-Mar-03 15:55
protectorRob Manderson24-Mar-03 15:55 
GeneralWell done and talk about quick off the mark Pin
Neville Franks24-Mar-03 14:54
Neville Franks24-Mar-03 14:54 
GeneralRe: Well done and talk about quick off the mark Pin
Jonathan de Halleux25-Mar-03 1:38
Jonathan de Halleux25-Mar-03 1:38 
Very good documentation!

Btw: Nice editor. From what I saw on your web site, you must have a serious background on parsers!

Jonathan de Halleux.

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.