|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Announcements
Services
Chapters
Feature Zones
|
IntroductionParsers, Interpreters and Compilers are becoming more and more into the reach of the normal programmer. In the very beginning, such tools were hand crafted. Then came the time of the parser and compiler generators (e.g., YACC (popular UNIX parser generator) or ANTLR), and currently we experience the integration of grammars and parsers directly into a host language. This is the first part of a series of articles which explains the most basic concepts and practices to cope with parsing and compiling. It uses toy programming languages and their interpreters to expose the concepts. For each toy language, sample programs are presented. These samples are handwritten, but in a style which shows how the task could be automated. This introduction into language processing addresses newbies as well as more experienced programmers. The former get enough explanatory text, the latter C++ code samples which show how language processing can be done with minimal baggage. The whole article series strives for minimality both in theory and in implementation. Steps for building an interpreterLet's take a command line calculator for arithmetic formulas as an introductory example. The following excerpt shows a sample session, where lines starting with '>>' indicate program output. Pi=3.1415926535
>> Pi= 3.1415926535
Area=Pi*r^2
>> Area= 0
r=3
>> Area= 28.2743338815
>> r= 3
To implement this, the calculator program must accomplish the following tasks:
Parsing and GrammarsParsing is the process of associating a hierarchical structure with a source text. A programmer, who indents her source text appropriately (e.g., she indents the body of loops), does a simple form of parsing. Parsing is the first task in language processing and one of the most important ones, because it guides all the later steps of a translation system. The formal framework for parsing is set by a grammar which describes the source language. A grammar consists of a set of rules, which taken together associates a unique tree structure with any correct input text. Grammars are divided into different classes according to their expressiveness. On the lowest level are grammars for regular expressions. These grammars cannot describe recursively nested constructs as they are needed to express, for example, the fact that a statement has other statements as its components. The grammar class commonly used to describe programming languages is called "context free". Each rule of a context free grammar has the form
The following sample context free grammar rule describes a string of digits enclosed in an arbitrary number of parentheses: EnclosedDigits: [0-9]+ | '(' EnclosedDigits ')' ;
This grammar rule associates the input string EnclosedDigits
'(' EnclosedDigits ')'
'(' EnclosedDigits ')'
'123'
In the coming samples, we will use the following linear form as a tree notation: EnclosedDigits< '(' EnclosedDigits< '(' EnclosedDigits<123> ')' > ')' >
Interestingly, the rule Expression: Expression '+' Expression | '(' Expression ')' | [0-9]+ ;
The following table shows one possible substitution history:
The resulting generated language string is
The LR - grammar Expressiveness of Context Free GrammarsMany import properties of existing programming languages can either not be expressed with a context free grammar or are tedious to describe. Consider a language with procedures where the procedure name must both occur at the beginning and at the end: Proc: 'proc' Name Parameters Body Name ';' ;
There is no way to describe that the Name at the beginning must be the same as the Name at the end. Take as another example a length limitation, e.g., the rule that an identifier cannot exceed 32 characters. While it is possible to describe this with a context free grammar rule, it either requires a lot of writing or the addition of a new quantifier (e.g.: [a-zA-Z]{1,32}). It turns out in many cases that it is preferable to check for the violation of a restriction after successful parsing rather than trying to express the restriction within the grammar. Generally speaking, a grammar for a programming language accepts always a superset of the legal constructs. The real strength of context free grammars is the proper description of hierarchical relations as shown by the following examples of grammars for simple arithmetic expressions. This is a poorly chosen grammar for expressions: expr0: expr0 ('+'|'-') expr0 | expr0 ('*'|'/') expr0 | [0-9]+ ;
This grammar does not express the fact that the precedence of '+' is not the same as that of '*'. The following is a much better grammar for expressions: expr1: mul_expr | expr1 ('+'|'-') mul_expr ;
mul_expr: [0-9]+ | mul_expr ('*'|'/') [0-9]+ ;
The source text expr0<
expr0<'2'>
'*'
expr0< expr0<'3'> '+' expr0<'5'> >
>
whereas expr1<
expr1<mul_expr<'2'> '*' '3' >
'+'
mul_expr<'5'>
>
Only the second tree reflects the precedence of the '+' and '*' operators. By the way, the grammar rule for The Peril of AmbiguityA grammar is ambiguous if a source text can be parsed into two different parse trees depending on the order of substitutions. The above sample expr0<expr0<'2'> '*' expr0< expr0<'3'> '+' expr0<'5'> > >expr0<expr0<expr0<'2'> '*' expr0<'3'> > '+' expr0<'5'> >
depending on the substitution order. Ambiguity defeats the main purpose of grammars, namely the association of a source text with a unique hierarchical structure. Whether ambiguity of a grammar really results in different hierarchical structures not only depends on the grammar but also on the parsing strategy. Recursive Descent ParsingRecursive descent parsing is the most simple and intuitive parsing method. It is an LL parsing method and therefore reads the input from left to right (the first L in LL) and traverses the grammar from left to right (the second L in LL). The name recursive descent is due to the fact that this parsing method uses a function for every grammar rule. The function for the start rule must parse the complete source. It does this by providing a function body, which implements the right hand side of the start rule in the following way:
Note that the implementation of a recursive descent parser sometimes requires backtracking. Backtracking means that you go back to a previous state of the computation. If, for example, the first part of the chosen alternative is successfully matched against the input text and then a mismatch occurs, we must set back the source position to the position before we entered the alternative and choose another alternative. Let's show this with a sample implementation. The rules: EnclosedDigits: Digits | '(' EnclosedDigits ')' ;
Digits: [0-9] Digits? ;
could be implemented by the following parsing procedures which merely test whether the input matches the grammar rules: typedef const unsigned char* Iterator; bool MatchesDigits(Iterator& it) //[0-9]Digits? { return (*it>='0' && *it<='9'?(++it,true):false)//[0-9] && (MatchesDigits(it),true); //Digits? } bool MatchesEnclosedDigits(Iterator& it) //Digits | '(' EnclosedDigits ')' { Iterator itSave= it; return MatchesDigits(it) // Digits || // | ( (itSave=it, (*it=='('?(++it,true):false) // '(' && MatchesEnclosedDigits(it) // EnclosedDigits && (*it==')'?(++it,true):false) // ')' ) ||(it=itSave,false) ); } The start grammar function Digits: Digits? [0-9] ;
because this would result in infinite recursion. Many grammars coming with language specifications are written for LR parsers (the well known YACC tool requires an LR grammar). But it is quite easy to rewrite an LR grammar into an LL grammar. Parsing Expression Grammars (PEGs)The idea behind the so called parsing expression grammar (PEG) formalism [1] is a "procedural interpretation" of context free grammars. A PEG grammar can be "executed" using the grammar and a source string as input. The result of the execution is either a match of the input string, in which case the parsing position is at the end of the recognized part, or otherwise the result is "doesn't match", in which case the input position remains unchanged. This statement is not only true for the grammar as a whole, but for each grammar rule. The PEG formalism is nothing else than a mapping of common practices in recursive descent parsing to grammars. It consists of the following grammar elements:
The following table shows the effect of PEG grammar constructs on sample input strings. The
PEG versus regular expressionsSince PEGs cover context free grammars, they are more powerful than regular expressions. The most important difference between Parsing Expression Grammars and regular expressions is the fact that the former supports recursive definitions, the latter not. But simple non recursive situations often require shorter regular expression patterns as can be shown for an email address recognizer. Regular expression as email address recognizer:
[A-Za-z0-9._%-]+@[A-Za-z0-9._%-]+\.[A-Za-z]{2,4}Incorrect PEG rule as email address recognizer:
EMail: [A-Za-z0-9._%-]+'@'[A-Za-z0-9._%-]+'.'[A-Za-z]{2,4} ;Correct PEG rules as email address recognizer:
EMailChar: [A-Za-z0-9._%-];
EMailSuffix: [A-Za-z]{2,4} !EMailChar ;
EMail: EMailChar+
'@'
([A-Za-z0-9_%-] |'.'!EMailSuffix)+
'.'
EmailSuffix ;
The correct PEG email address recognizer is more complicated than the regular expression email address recognizer and uses implicit backtracking (in Procedural and template implementations of PEG elementsAn ideal C/C++ PEG implementation is both close to a PEG grammar and runtime efficient. This can be achieved both with normal "C" code as well as with C++ templates. C++ templates prove to be more flexible, whereas normal C-code has the advantage of making less trouble in case of programmer errors, because error messages for templates are still difficult to decode. The rule typedef const unsigned char* Iterator; inline bool Char(Iterator& p,int c) { return *p==c?(++p,true) : false;} bool S(Iterator& it){return Char(it,'C');} and the following template implementation: template<int CHARVALUE> struct Char{ template<typename It> static bool Matches(It& p) { return *p==CHARVALUE?(++p,true):false;} }; struct S{ typedef Char<'C'> rule; template<typename It> static bool Matches(It& p) { return rule::Matches(p); } }; The following table sketches the procedural and template implementations of all the PEG constructs:
Note, that the constructs bool MatchesEMail(openend_iterator& it) { using namespace peg_templ; C c; typedef In<'A','Z','a','z'> Alpha; typedef In<'A','Z','a','z','0','9'> Alnum; typedef Or<Alnum,OneOfChars<'.','_','%','-'> > EMailChar; typedef And<For<2,4,Alpha >,Not<EMailChar > > EMailSuffix; typedef And< PlusRepeat<EMailChar>, Char<'@'>, PlusRepeat<Or< Or<Alnum,OneOfChars<'_','%','-'> >, And<Char<'.'>,Not<EMailSuffix> > > >, Char<'.'>, EMailSuffix > EMail; return EMail::Matches(it,&c); } Note that Input IteratorsThe parser reads the source for parsing. Storage medium, encoding, and end of input indicators can be different for various sources. In most cases, the storage medium is a file or a character string. The character encoding is either ASCII, UTF-8 or one of the many other Unicode encodings. The end of input can be detected either by the presence of a special terminator or by comparison against an end of input pointer. Checking each input position against the end of input is in most cases not necessary. Most grammars expect only a subset of the possible input characters. If we know that the input is terminated by a special character, say '\0', we can use a start rule like: If we use a template parameter
The following table shows two sample models for input iterators:
Why performance mattersThe parser of a C++ compiler spends less then 20% of the compilation time with parsing. But there are many applications for parsing without sophisticated post-processing where speed really matters. Examples are auto-completion support for development environments, preprocessing of C/C++ source files, extraction of items from large XML files. There are two main sources of inefficiencies in parsing:
In recursive descent parsing, backtracking can be avoided by rewriting the grammar in a way that avoids alternatives with the same beginning. This technique is known as left factorization and will be shown in one of the next articles of this series. In LR parsing (LALR parsing to be precise) as used by YACC (the UNIX parser generator), backtracking is avoided by the internal use of a stack for symbols which wait for further processing. It is well known that hand written recursive descent parsers using optimized grammars and tools like YACC are the best performers. But for the casual user, handwritten parsers as well as compiler generation tools are not ideal, the former because they require too much work, the latter, because they generate in most cases impenetrable and hard to maintain code. With Parsing Expression Grammars, the picture could change, because PEGs combine the easy to understand recursive descent technique with the possibility of an efficient implementation which is very close to the original grammar. I therefore compared the runtime efficiency of three PEG implementations and of a hand-optimized parser. I took a grammar for arithmetical expressions as a test case, because such a grammar can be written in a way which requires no backtracking. All the tested parsers had to parse the expression
The results show that the templated and procedural PEG implementations come very close to the hand-optimized parser. They also show that using an iterator which always checks for the end of input condition has its price. Points of interestThe term Parsing Expression Grammars was coined by Bryan Ford [1]. He also pointed out that this kind of recursive descent parsing with backtracking is common practice. He showed that only very few rules are needed to define a powerful framework for parsing. Instead of relying on grammar transformations which eliminate alternatives having the same beginning (so called left factorization), he invented a parser, which memorizes already taken paths and achieves linear parsing time for any PEG grammar (the packrat parser). An important advantage of PEGs comes from the fact that a separate scanner is not necessary. When parsing with a scanner, one divides the parsing task in a low level part called scanning, and in a high level part, proper parsing. A scanner typically skips white space and comments and reads items like numbers, operators and identifiers, and returns an integer value to indicate what it has found. A scanner reduces the necessity to backtrack in some cases. On the other hand, it introduces a new level of abstraction which was never necessary from a theoretical point of view, and it can cause problems because of the lack of context information. A typical example is the closing angle bracket of template instantiations (the History
References
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||