![]() |
General Programming »
Algorithms & Recipes »
Parsers
Intermediate
Opening a door towards Spirit: a parser frameworkBy Jonathan de HalleuxA quick introduction to Spirit, a parser generator framework based |
VC6, VC7, Windows, Visual Studio, Dev
|
|
Advanced Search Add to IE Search |
|
|
|
||||||||||||||||

While reading the article on lexx and yacc (see [5]) that was published a few days ago in CP, I came accross a thread that spoke about Spirit (see here). I must say I was happy to have read this thread...
Spirit (see [1]) is a parser generator framework. It is part of Boost (see [2]) since the release 1.30. (Boost is somewhat a collection of libraries that extend the STL, a must for any C++ developper).
In this article, I will try to give a short idea of what Spirit is capable of. If I succeed you will never approach the problem of string parsing the same way again. Moreover, I strongly recommend the readers interrested with Spirit to jump to the documentation page (see [1]) as it is complete, highly pedagogical and really (2x) well done.
At last, I do not pretend to be a Spirit specialist, neither a grammar theorician. In fact, while playing with parsing in my previous article (see [4]), I was looking for new solutions and I came accross Spirit which made me think that all I had written before could be trashed away.
Before getting into any details, let's give you a good reason to continue to read this article: Consider that you want to parse a string str containing common separated values and store it into a vector v (this example is takened from the Spirit doc., see [6]).
std::string str; // the string to parse
std::vector<double> v; // the vector to fill
A good old C approach would use sscanf, a standard C++ would use istrtream's. The Spirit solution is:
// defining the parser rule rule<> r = real_p[append(v)] >> *(',' >> real_p[append(v)]); // parsing the string with the rule r parse(str.c_str(), r, space_p);
Pretty impressing. Personnaly, I was really amazed by that snippet. Let's analyse the rule r:
real_p: matches a floating point number. This is the rule for the first real,
[append(v)]: append is a functor that keeps a reference of v. When the match on real_p is found, append(double) is called and the real is added to the vector using vector::push_back,
>>: this operator makes a sequence of rules,
*( ... ): the pattern contained inside the brackets can be repeated 0 or more times. This is the main loop that will repeatedly read the reals,
',' >> real_p: matches a comma and a real,
[append(v)]: again, real are pushed on the vectorGreat, simply great. In two lines of code, you can get rid of sscanf and jump into C++ meta-programming! As mentioned before, Spirit doc is really well-done and you can directly skip the end of this article to go and read it... but don't forget to come back.
First, let's do our homework on parsers. To do that, we can begin analysing the sentence in the very beginning of the Spirit documentation:
Spirit is an object-oriented recursive-descent parser generator framework implemented using template meta-programming techniques. Expression templates allow us to approximate the syntax of Extended Backus-Normal Form (EBNF) completely in C++.
Sounds pretty scary... what does this sentence mean exactly ?
recursive-descent parser:LL(unlimited lookahead). LL parser? Good question for a newbie like me. After googling a bit, I found a short description (see [3]), enough to understand the different families of parsers:
LL,LR,SLR and LALR are parsing methods. If a grammar can be parsed by that method, it is also called LL, LR, SLR or LALR (k).
LL(k) means a top-down parser can be created for the parser with a max lookahead of k symbols. Spirit is of LL(unlimited) type.
LR(k) means a bottom-up parser can be created for the parser with a max lookahead of k symbols.
LALR(1) grammars are a subset of LR(1) grammars, requiring smaller parsing table. (speed up gain vs. complication of error reporting and recovery).
SLR(1) parsing is a hack of LR(0) parsing: you attempt to construct a LR(0) parser for it, and if there are only minor glitches, you slap on some extra control.
LR(k),SLR(1) or LALR(1)parser generator framework: Spirit is a library to create parser in the code. This is a big difference with other tools like lexx and yacc: with Spirit the grammar description is inlined in the code to the contrary of lexx and yacc, where a grammar description file is proccesed and a C file is generated.
meta-programming techniques: Gee, highly advanced programming techniques are used. There is, for sure, a lot to learn, just by using Spirit,
Expression templates allow us to approximate the syntax of Extended Backus-Normal Form (EBNF) completely in C++:
Spirit is that and more: it is extremely well documented, full of working examples and already contains parsers for C, C++, pascal, xml, etc...
If you didn't switch yet to the Spirit home page yet (see [7]), it's time to go for an exercice: Yet Another Command Parser
Internet is full of command line parsers, they are usually coded in C and "hand made". We can use the power of Spirit to build a robust command line parser that will store the pair of key-value into a map< string, string >
Let's look at a classic command line:
some_command -key1=value1 -key2="this is the value 2"
The characteristics of the above are:
some_command: the application name,
-key1=value1: is a pair key - value, the key is preceded by - and separated from the value by =, the key and value contain only alpha-numeric characters
-key2="this is the value 2": is also a key-value pair, but here the value is encapsulated in " and contains any escaped characterLet's begin to build the rules used in a key-value parser:
equal is a rule that matches the character = equal = ch_p('=');
key_tag is a rule that matches the character - key_tag = ch_p('-');
key matches the key, +alnum_p describes a word of 1 or more alphanumeric characters, assign_string is a functor that assign the match to self.str_key (self will be defined later) key = (+alnum_p)
[
assign_string(self.str_key)
];
value matches a value, confix_p matches string enclosed in "...", c_escape_ch_p handles escape characters like \", etc..., the operator | is similar to regular expression and means "or": value = ( confix_p( '"', (+ c_escape_ch_p )[assign_string(self.str_val)] , '"' ) | (+ alnum_p )[assign_string(self.str_val)] );
key_value matches key and values in string like: -key=value or -key="other value" key_value = key_tag >> key >> equal >> value;
Now that we have those rules, we can put them together in a grammar. The grammar takes two references to string that shall hold the result (key, value) of the parse.
struct keyvalue_grammar : public grammar<keyvalue_grammar> { // Constructor, takes two reference to string // str_key and str_value will hold the parse result keyvalue_grammar(std::string& str_key_, std::string& str_val_) : str_key(str_key_), str_val(str_val_){}; template <typename ScannerT> struct definition { definition(keyvalue_grammar const& self) { equal = ch_p('='); key_tag = ch_p('-') | ch_p('/'); key = key_tag >> (+alnum_p)[ assign_string(self.str_key) ]; value = ( confix_p( '"', (+ c_escape_ch_p )[assign_string(self.str_val)] , '"' ) | (+ alnum_p )[assign_string(self.str_val)] ); key_value = key >> equal >> value; } rule<ScannerT> key_tag, key, equal, value, key_value; rule<ScannerT> const& start() const { return key_value; }; }; std::string& str_key; std::string& str_val; };
We need a functor to add the pairs key-value to the map. The functor needs to provide the void operator()(const char* first, const char* last) const method. Moreover there is one more trick to handle:
When the grammar key_value_grammar is called two string are modified, these strings are subsequently passed to the functor.
template<typename keyvalue_container> class add_keyvalue_pair { public: // key_ and val_ should point to the string modified in keyvalue_grammar // kvc_ is the map of key - values add_keyvalue_pair( keyvalue_container& kvc_, std::string& key_, std::string& val_) : kvc( kvc_ ), key(key_), val(val_) { } // the method called by the parser template <typename IteratorT> void operator()(IteratorT first, IteratorT last) const { kvc.insert( keyvalue_container::value_type(key, val) ); } private: std::string& key; std::string& val; keyvalue_container& kvc; };
We are almost done. The rule to parse the command line are:
command = (+alnum_p)[assign_string(self.str_command)];
key_value rule is enclosed in *(...) so that it can happen 0 or more times.line = command
>> *(
key_value
[
add_keyvalue_pair<keyvalue_container>(
self.kvc,
key,
value )
]
);
The full definition of the grammar is here:
template<typename keyvalue_container> struct cmdline_grammar : public grammar< cmdline_grammar > { // Constructor // kvc_ is a key-value container // str_command will hold the application name cmdline_grammar( keyvalue_container& kvc_, std::string& str_command_) : kvc(kvc_), str_command(str_command_) {}; template <typename ScannerT> struct definition { definition( cmdline_grammar<keyvalue_container> const& self ) : key_value( key, value ) { command = (+alnum_p)[assign_string(self.str_command)]; line = command >> *( key_value [ add_keyvalue_pair<keyvalue_container>( self.kvc, key, value ) ] ); } rule<ScannerT> command, line; keyvalue_grammar key_value; std::string key; std::string value; rule<ScannerT> const& start() const { return line; } }; keyvalue_container& kvc; std::string& str_command; };
As emphasized in the code above, key_value and add_keyvalue_pair work on the same strings. So that in the parse process, first key and value are set inside the keyvalue_grammar, and after that, they are added in the add_keyvalue_pair call.
You can encapsulate the grammar into another class that hides the details. For example, the following method parses a string and returns true if succeded. kvc is a map<string, string>, command is a string:
bool parse(const std::string& str) { kvc.clear(); cmdline_grammar<keyvalue_container> cmdline_parser(kvc, command); parse_info<> info = boost::spirit::parse( str.c_str(), cmdline_parser, boost::spirit::space_p ); return info.full; };
The full source of the example is available in the downloads. Otherwize, Spirit contains numerous other examples.
As promised, parsing string should not longer be done by old fashion, error-prone succession of sscanf calls, but by effective and robust "custom" parsers (moreover it's fun).
| 26-03-2003 | Corrected a lot of typos, added some references. |
| 24-03-2003 | Initial release. |
General
News
Question
Answer
Joke
Rant
Admin
|
PermaLink |
Privacy |
Terms of Use
Last Updated: 23 Mar 2003 Editor: Rob Manderson |
Copyright 2003 by Jonathan de Halleux Everything else Copyright © CodeProject, 1999-2009 Web16 | Advertise on the Code Project |