
While reading the article on lexx and yacc (see [5]) that was published a few days ago in CP, I came accross a thread that spoke about Spirit (see here). I must say I was happy to have read this thread...
Introduction
Spirit (see [1]) is a parser generator framework. It is part of Boost (see [2]) since the release 1.30. (Boost is somewhat a collection of libraries that extend the STL, a must for any C++ developper).
In this article, I will try to give a short idea of what Spirit is capable of. If I succeed you will never approach the problem of string parsing the same way again. Moreover, I strongly recommend the readers interrested with Spirit to jump to the documentation page (see [1]) as it is complete, highly pedagogical and really (2x) well done.
At last, I do not pretend to be a Spirit specialist, neither a grammar theorician. In fact, while playing with parsing in my previous article (see [4]), I was looking for new solutions and I came accross Spirit which made me think that all I had written before could be trashed away.
A mini but very impressive example
Before getting into any details, let's give you a good reason to continue to read this article: Consider that you want to parse a string str
containing common separated values and store it into a vector v
(this example is takened from the Spirit doc., see [6]).
std::string str;
std::vector<double> v;
A good old C approach would use sscanf
, a standard C++ would use istrtream
's. The Spirit solution is:
rule<> r = real_p[append(v)] >> *(',' >> real_p[append(v)]);
parse(str.c_str(), r, space_p);
Pretty impressing. Personnaly, I was really amazed by that snippet. Let's analyse the rule r
:
real_p
: matches a floating point number. This is the rule for the first real,
[append(v)]
: append
is a functor that keeps a reference of v
. When the match on real_p
is found, append(double)
is called and the real is added to the vector using vector::push_back
,
>>
: this operator makes a sequence of rules,
*( ... )
: the pattern contained inside the brackets can be repeated 0 or more times. This is the main loop that will repeatedly read the reals,
',' >> real_p
: matches a comma and a real,
[append(v)]
: again, real are pushed on the vector
Great, simply great. In two lines of code, you can get rid of sscanf
and jump into C++ meta-programming! As mentioned before, Spirit doc is really well-done and you can directly skip the end of this article to go and read it... but don't forget to come back.
Parsers and Spirit
First, let's do our homework on parsers. To do that, we can begin analysing the sentence in the very beginning of the Spirit documentation:
Spirit is an object-oriented recursive-descent parser generator framework implemented using template meta-programming techniques. Expression templates allow us to approximate the syntax of Extended Backus-Normal Form (EBNF) completely in C++.
Sounds pretty scary... what does this sentence mean exactly ?
Spirit is that and more: it is extremely well documented, full of working examples and already contains parsers for C, C++, pascal, xml, etc...
If you didn't switch yet to the Spirit home page yet (see [7]), it's time to go for an exercice: Yet Another Command Parser
Yet Another Command Line Parser
Internet is full of command line parsers, they are usually coded in C and "hand made". We can use the power of Spirit to build a robust command line parser that will store the pair of key-value into a map< string, string >
Command line description
Let's look at a classic command line:
some_command -key1=value1 -key2="this is the value 2"
The characteristics of the above are:
some_command
: the application name,
-key1=value1
: is a pair key - value, the key is preceded by -
and separated from the value by =
, the key and value contain only alpha-numeric characters
-key2="this is the value 2"
: is also a key-value pair, but here the value is encapsulated in "
and contains any escaped character
Key - Value parser
Let's begin to build the rules used in a key-value parser:
equal
is a rule that matches the character =
equal = ch_p('=');
key_tag
is a rule that matches the character -
key_tag = ch_p('-');
key
matches the key, +alnum_p
describes a word of 1 or more alphanumeric characters, assign_string
is a functor that assign the match to self.str_key
(self
will be defined later) key = (+alnum_p)
[
assign_string(self.str_key)
];
value
matches a value, confix_p
matches string enclosed in "...", c_escape_ch_p
handles escape characters like \", etc..., the operator |
is similar to regular expression and means "or": value = (
confix_p( '"', (+ c_escape_ch_p )[assign_string(self.str_val)] , '"' )
| (+ alnum_p )[assign_string(self.str_val)] );
key_value
matches key and values in string like: -key=value or -key="other value" key_value = key_tag >> key >> equal >> value;
Now that we have those rules, we can put them together in a grammar
. The grammar takes two references to string
that shall hold the result (key, value)
of the parse.
struct keyvalue_grammar : public grammar<keyvalue_grammar>
{
keyvalue_grammar(std::string& str_key_, std::string& str_val_)
: str_key(str_key_), str_val(str_val_){};
template <typename ScannerT>
struct definition
{
definition(keyvalue_grammar const& self)
{
equal = ch_p('=');
key_tag = ch_p('-') | ch_p('/');
key = key_tag >>
(+alnum_p)[ assign_string(self.str_key) ];
value = (
confix_p(
'"',
(+ c_escape_ch_p )[assign_string(self.str_val)] ,
'"'
)
| (+ alnum_p )[assign_string(self.str_val)] );
key_value = key >> equal >> value;
}
rule<ScannerT> key_tag, key, equal, value, key_value;
rule<ScannerT> const& start() const { return key_value; };
};
std::string& str_key;
std::string& str_val;
};
Key value adder functor
We need a functor to add the pairs key-value to the map. The functor needs to provide the void operator()(const char* first, const char* last) const
method. Moreover there is one more trick to handle:
When the grammar key_value_grammar
is called two string are modified, these strings are subsequently passed to the functor.
template<typename keyvalue_container>
class add_keyvalue_pair
{
public:
add_keyvalue_pair( keyvalue_container& kvc_, std::string& key_, std::string& val_)
: kvc( kvc_ ), key(key_), val(val_)
{
}
template <typename IteratorT>
void operator()(IteratorT first, IteratorT last) const
{
kvc.insert( keyvalue_container::value_type(key, val) );
}
private:
std::string& key;
std::string& val;
keyvalue_container& kvc;
};
The parser
We are almost done. The rule to parse the command line are:
- Parses the name of the application
command = (+alnum_p)[assign_string(self.str_command)];
- Parses the whole line. As one can see, the
key_value
rule is enclosed in *(...)
so that it can happen 0 or more times.line = command
>> *(
key_value
[
add_keyvalue_pair<keyvalue_container>(
self.kvc,
key,
value )
]
);
The full definition of the grammar is here:
template<typename keyvalue_container>
struct cmdline_grammar : public grammar< cmdline_grammar >
{
cmdline_grammar( keyvalue_container& kvc_, std::string& str_command_)
: kvc(kvc_), str_command(str_command_)
{};
template <typename ScannerT>
struct definition
{
definition( cmdline_grammar<keyvalue_container> const& self )
: key_value( key, value )
{
command = (+alnum_p)[assign_string(self.str_command)];
line = command
>> *(
key_value
[
add_keyvalue_pair<keyvalue_container>(
self.kvc,
key, value )
]
);
}
rule<ScannerT> command, line;
keyvalue_grammar key_value;
std::string key;
std::string value;
rule<ScannerT> const& start() const { return line; }
};
keyvalue_container& kvc;
std::string& str_command;
};
As emphasized in the code above, key_value
and add_keyvalue_pair
work on the same strings. So that in the parse process, first key and value are set inside the keyvalue_grammar
, and after that, they are added in the add_keyvalue_pair
call.
Encapsulating all and making the parse call
You can encapsulate the grammar into another class that hides the details. For example, the following method parses a string and returns true if succeded. kvc
is a map<string, string>
, command
is a string
:
bool parse(const std::string& str)
{
kvc.clear();
cmdline_grammar<keyvalue_container> cmdline_parser(kvc, command);
parse_info<> info = boost::spirit::parse(
str.c_str(),
cmdline_parser,
boost::spirit::space_p );
return info.full;
};
The full source of the example is available in the downloads. Otherwize, Spirit contains numerous other examples.
Conclusions
As promised, parsing string should not longer be done by old fashion, error-prone succession of sscanf
calls, but by effective and robust "custom" parsers (moreover it's fun).
History
26-03-2003 |
Corrected a lot of typos, added some references. |
24-03-2003 |
Initial release. |
Reference