Crafting an interpreter Part 1 - Parsing and Grammars

Martin.Holzherr

Rate me:

4.88/5 (77 votes)

29 Apr 2005CPOL20 min read

222.4K

2.7K

200

First of a series of articles on building a language interpreter, describing basics about parsing and grammars.

Download sample - 98.6 Kb

Introduction

Parsers, Interpreters and Compilers are becoming more and more into the reach of the normal programmer. In the very beginning, such tools were hand crafted. Then came the time of the parser and compiler generators (e.g., YACC (popular UNIX parser generator) or ANTLR), and currently we experience the integration of grammars and parsers directly into a host language. boost::spirit, for example, is integrated into C++ as a library which comes with a domain specific sublanguage (using operator overloading and expression templates). Perl 6 goes even a step further and makes grammars part of the language. These tools help you very much but they still require quite a lot of knowledge about the underlying theoretical framework.

This is the first part of a series of articles which explains the most basic concepts and practices to cope with parsing and compiling. It uses toy programming languages and their interpreters to expose the concepts. For each toy language, sample programs are presented. These samples are handwritten, but in a style which shows how the task could be automated. This introduction into language processing addresses newbies as well as more experienced programmers. The former get enough explanatory text, the latter C++ code samples which show how language processing can be done with minimal baggage. The whole article series strives for minimality both in theory and in implementation.

Steps for building an interpreter

Let's take a command line calculator for arithmetic formulas as an introductory example. The following excerpt shows a sample session, where lines starting with '>>' indicate program output.

Pi=3.1415926535
>> Pi= 3.1415926535

Area=Pi*r^2
>> Area= 0

r=3
>> Area= 28.2743338815
>> r= 3

To implement this, the calculator program must accomplish the following tasks:

Task	Sketch of Solution
It must recognize the structure and the components of the input line which consists of arithmetical expressions, assignments and formula names. Errors must be reported in a user friendly, informative way.	This is done by a process called parsing. The effort done for parsing pays off in the quality of the error messages for syntax errors and in the error recovery. A poorly designed parser will produce follow errors after the first error or will not be able to continue parsing at all after the first error.
It must evaluate the arithmetic formula and store the result in a map which associates the formula name with the value of the expression. Furthermore, it must store the formula expression itself in order to recalculate the formula when a component changes its value by another evaluation.	Evaluation of an expression or statement to a result value can only be done after successful parsing. Evaluation may either be executed directly during parsing or it may be deferred until the parsing process has constructed an appropriate intermediate "database" and finally has generated efficient "code" for the execution by a real or virtual processor.
It must cope with runtime errors like divide by zero or taking the root of a negative number.	Handling of runtime errors is done by the runtime system. The amount of runtime error handling is a tradeoff between efficiency and convenience and must be adapted to the expectations of the target user group.

Parsing and Grammars

Parsing is the process of associating a hierarchical structure with a source text. A programmer, who indents her source text appropriately (e.g., she indents the body of loops), does a simple form of parsing. Parsing is the first task in language processing and one of the most important ones, because it guides all the later steps of a translation system. The formal framework for parsing is set by a grammar which describes the source language. A grammar consists of a set of rules, which taken together associates a unique tree structure with any correct input text. Grammars are divided into different classes according to their expressiveness. On the lowest level are grammars for regular expressions. These grammars cannot describe recursively nested constructs as they are needed to express, for example, the fact that a statement has other statements as its components. The grammar class commonly used to describe programming languages is called "context free". Each rule of a context free grammar has the form R: a b c | d ; where R is the name of the rule (also called a Nonterminal), : separates the rule name from the right hand side of the rule and where a b c d and so on are either terminals or names of further rules (Nonterminals), and | is used to separate alternatives. This is further explained in the following table:

Notion	Samples	Meaning
Terminal	'for' [a-zA-Z][0-9]+;	Describes a string of the input directly. It is common practice to use a subset of the notation which is used in regular expressions for describing terminals. The only difference is the notation for strings. In context free grammars, we write e.g., 'for', where we would write just `for` in a regular expression.
Quantifiers	? * +	Same meaning as in regular expressions. These three quantifiers are sufficient.
Alternative	\|	Same meaning as in regular expressions.
Grouping	( ... )	Same meaning as in regular expressions.
Nonterminal Rule	EnclosedDigits: [0-9]+ \| '(' EnclosedDigits ')' ;	EnclosedDigits in the sample is a Nonterminal. In this sample, it is both the name of a rule and used on the right hand side of the rule. For each Nonterminal used somewhere in a grammar rule, there must be exactly one corresponding rule.

The following sample context free grammar rule describes a string of digits enclosed in an arbitrary number of parentheses:

EnclosedDigits: [0-9]+ | '(' EnclosedDigits ')' ;

This grammar rule associates the input string ((123))) with the following parse tree:

EnclosedDigits
               '(' EnclosedDigits                             ')'
                                 '(' EnclosedDigits       ')'
                                                    '123'

In the coming samples, we will use the following linear form as a tree notation:

EnclosedDigits< '(' EnclosedDigits< '(' EnclosedDigits<123> ')' > ')' >

Interestingly, the rule EnclosedDigits: matches the complete input string ((123)) from the first to the last character, and it also matches inner parts like (123) and 123. This is possible because of the recursive nature of context free grammars. When a grammar is used to recognize an input string, we say that we use the grammar for parsing. A grammar can also be used to generate strings of the language by a process called substitution. Substitution takes place on the right hand side of the so called start rule. Substitution means replacing a Nonterminal by one of the alternatives on the right side of the rule for the replaced Nonterminal. By repeated substitutions on the right hand side of the start rule, we can generate all input strings which belong to the language described by the grammar. This shall be shown with the following sample start rule:

Expression: Expression '+' Expression | '(' Expression ')' | [0-9]+ ;

The following table shows one possible substitution history:

Before Substitution	Step Description	After Substitution	Parse Tree
Expression	Substitute by first alternative of Expression rule	Expression '+' Expression	Expression< Expression '+' Expression >
Expression '+' Expression	Substitute marked Nonterminal by second alternative of Expression rule	Expression '+' '(' Expression ')'	Expression< Expression '+' Expression< '(' Expression ')' > >
Expression '+' '(' Expression ')'	Substitute marked Nonterminal by '1001' (using [0-9]+ alternative)	'1001' '+' '(' Expression ')'	Expression< Expression<'1001'> '+' Expression< '(' Expression ')' > >
'1001' '+' '(' Expression ')'	Replace marked Nonterminal by '37' (using [0-9]+ alternative)	'1001' '+' '(' '37' ')'	Expression< Expression<'1001'> '+' Expression< '(' Expression<'37'> ')' > >

The resulting generated language string is 1001+(37). Note that the order of substitutions - whether you first substitute a Nonterminal occurring on the very right side or on the left side - is not specified by a context free grammar. A parser therefore needs not only a context free grammar to do its job, it needs also a parsing strategy, which is nothing else than a built-in rule telling which of the possible substitutions is carried out first. Common parsing strategies are LL and LR. Both parser types read the input from left to right, which is the reason for the first L in LL and LR. An LL parser always substitutes the leftmost Nonterminal first, whereas an LR parser substitutes the rightmost Nonterminal first. Typically, LL and LR parsers do not accept the same grammar rules. An LL parser is not happy with left recursive rules whereas an LR parser is not happy with right recursive rules, this is exemplified by the following sample grammar rule:

Parsing Strategy	Grammar	Comment
LL	Digits: [0-9] Digits? ;	Match first a terminal, then recurse.
LR	Digits: Digits? [0-9] ;	Recurse, then match terminal.

The LR - grammar Digits: Digits? [0-9] is left recursive, because its right hand side has the rule name as its first symbol. Fortunately, it is not difficult to rewrite an LR grammar into an LL grammar and vice versa, as we will see later.

Expressiveness of Context Free Grammars

Many import properties of existing programming languages can either not be expressed with a context free grammar or are tedious to describe. Consider a language with procedures where the procedure name must both occur at the beginning and at the end:

Proc: 'proc' Name  Parameters Body Name ';' ;

There is no way to describe that the Name at the beginning must be the same as the Name at the end. Take as another example a length limitation, e.g., the rule that an identifier cannot exceed 32 characters. While it is possible to describe this with a context free grammar rule, it either requires a lot of writing or the addition of a new quantifier (e.g.: [a-zA-Z]{1,32}). It turns out in many cases that it is preferable to check for the violation of a restriction after successful parsing rather than trying to express the restriction within the grammar. Generally speaking, a grammar for a programming language accepts always a superset of the legal constructs. The real strength of context free grammars is the proper description of hierarchical relations as shown by the following examples of grammars for simple arithmetic expressions. This is a poorly chosen grammar for expressions:

expr0: expr0 ('+'|'-') expr0 | expr0 ('*'|'/') expr0  | [0-9]+ ;

This grammar does not express the fact that the precedence of '+' is not the same as that of '*'. The following is a much better grammar for expressions:

expr1:    mul_expr | expr1 ('+'|'-') mul_expr ;
mul_expr: [0-9]+ | mul_expr ('*'|'/') [0-9]+ ;

The source text 2*3+5 could be parsed by expr0 into the following tree:

expr0<
   expr0<'2'>
   '*'
   expr0< expr0<'3'> '+' expr0<'5'> >
>

whereas expr1 would result in this tree:

expr1<
   expr1<mul_expr<'2'> '*'  '3' >
   '+'
   mul_expr<'5'>
>

Only the second tree reflects the precedence of the '+' and '*' operators. By the way, the grammar rule for expr1 is suitable for an LR parser, but not for an LL parser.

The Peril of Ambiguity

A grammar is ambiguous if a source text can be parsed into two different parse trees depending on the order of substitutions. The above sample expr0: expr0 ('+'|'-') expr0 | expr0 ('*'|'/') expr0 | [0-9]+ ; is ambiguous because the source text 2*3+5 can be parsed into both of the following trees:

expr0<expr0<'2'> '*' expr0< expr0<'3'> '+' expr0<'5'> > >

expr0<expr0<expr0<'2'> '*' expr0<'3'> > '+' expr0<'5'> >

depending on the substitution order. Ambiguity defeats the main purpose of grammars, namely the association of a source text with a unique hierarchical structure. Whether ambiguity of a grammar really results in different hierarchical structures not only depends on the grammar but also on the parsing strategy.

Recursive Descent Parsing

Recursive descent parsing is the most simple and intuitive parsing method. It is an LL parsing method and therefore reads the input from left to right (the first L in LL) and traverses the grammar from left to right (the second L in LL). The name recursive descent is due to the fact that this parsing method uses a function for every grammar rule. The function for the start rule must parse the complete source. It does this by providing a function body, which implements the right hand side of the start rule in the following way:

in case of alternatives, it tries all the promising alternatives by executing the corresponding code for the alternative until one of them succeeds (matches the input text)
in case of a sequence of grammar symbols (tokens or Nonterminals), it executes the corresponding code for each symbol in the sequence until one of them fails or otherwise the code for the whole sequence succeeds.

Note that the implementation of a recursive descent parser sometimes requires backtracking. Backtracking means that you go back to a previous state of the computation. If, for example, the first part of the chosen alternative is successfully matched against the input text and then a mismatch occurs, we must set back the source position to the position before we entered the alternative and choose another alternative. Let's show this with a sample implementation. The rules:

EnclosedDigits: Digits | '(' EnclosedDigits ')' ; 
Digits: [0-9] Digits? ;

could be implemented by the following parsing procedures which merely test whether the input matches the grammar rules:

typedef const unsigned char* Iterator;
bool MatchesDigits(Iterator& it)         //[0-9]Digits?
{
   return (*it>='0' && *it<='9'?(++it,true):false)//[0-9]
      &&  (MatchesDigits(it),true);               //Digits?
}
bool MatchesEnclosedDigits(Iterator& it) //Digits | '(' EnclosedDigits ')'
{
   Iterator itSave= it;
   return
      MatchesDigits(it)                          // Digits
   ||                                            // |
       (   (itSave=it,
              (*it=='('?(++it,true):false)       //  '('
            && MatchesEnclosedDigits(it)         //   EnclosedDigits
            && (*it==')'?(++it,true):false)      //  ')'
           )
         ||(it=itSave,false)
       );                     
}

The start grammar function MatchesEnclosedDigits takes an iterator (e.g., character pointer) as input and either returns true, in which case it sets the iterator to the position just behind the recognized text, or it returns false, in which case it leaves the iterator unchanged. The close relationship between grammar rules and corresponding parsing functions make it obvious that a recursive descent parser cannot work with grammar rules like:

Digits: Digits? [0-9]  ;

because this would result in infinite recursion. Many grammars coming with language specifications are written for LR parsers (the well known YACC tool requires an LR grammar). But it is quite easy to rewrite an LR grammar into an LL grammar.

Parsing Expression Grammars (PEGs)

The idea behind the so called parsing expression grammar (PEG) formalism [1] is a "procedural interpretation" of context free grammars. A PEG grammar can be "executed" using the grammar and a source string as input. The result of the execution is either a match of the input string, in which case the parsing position is at the end of the recognized part, or otherwise the result is "doesn't match", in which case the input position remains unchanged. This statement is not only true for the grammar as a whole, but for each grammar rule. The PEG formalism is nothing else than a mapping of common practices in recursive descent parsing to grammars. It consists of the following grammar elements:

PEG construct	Notation	Operational Meaning
Sequence	e₁ e₂	Match input against e₁ moving the input position behind the input part described by e₁, then match from new input position against e₂. If the match fails for either e₁ or e₂, the input position is restored to the position before entering the Sequence and the result is fail, otherwise the input position is moved behind the recognized Sequence and the result is success.
Prioritized choice between Alternatives	e₁ / e₂	Match input against e₁ moving the input position behind the input part described by e₁. If the match fails, restore the input position and match against e₂. If the match fails again, restore to the position before entering the Alternative and the result is fail. Otherwise, if one of the alternatives succeeds, the input position is moved behind the source part which is recognized by the chosen alternative and the result is success.
Greedy Option	e?	Match input against e, moving the input position behind the recognized part in case of success. If the match fails, restore the input position, but the result is in any case success.
Greedy zero or more occurrences	e*	Interpret it like an unlimited number of e? e? e? e? ...
Greedy one or more occurrences	e+	Interpret it like e e? e? e? ...
Peek predicate	&e	Match the input against e without changing the input position (the input position remains the same whether the match succeeds or fails). The result is success if it matches, otherwise it is fail.
Not predicate	!e	Match the input against e without changing the input position (the input position remains the same whether the match succeeds or fails). The result is fail if it matches, otherwise it is success.
Advance predicate	.	Increase the input position by one. The result is success.

The following table shows the effect of PEG grammar constructs on sample input strings. The | character indicates the input position:

PEG construct	Input	Match Position	Match Result
S: 'for' ;	\|for	for\|	true
S: 'for' ;	\|former	for\|mer	true
S: 'for' ;	\|afor	\|afor	false
S: 'for' 'all' ;	\|forall men	forall\| men	true
S: 'former' / 'for' ;	\|for	for\|	true
S: 'former' / 'for' ;	\|former	former\|	true
S: 'for' / 'former' ;	\|for	for\|	true
S: 'for' / 'former' ;	\|former	for\|mer	true
S: 'for'?'mer' ;	\|former	former\|	true
S: 'for'?'mer' ;	\|mer	mer\|	true
S: 'for'?'former' ;	\|former	\|former	false
S: [0-9]* ;	\|1903.535	1903\|.535	true
S: [a-z .]+'.*'? ;	\|ifi.go.*	ifi.go.\|*	true
S: 'for' &'(' ;	\|for(	for\|(	true
S: 'for' &'(' ;	\|for[	\|for[	false
S: 'for' !'(' ;	\|for[	for\|[	true
S: 'for' !'(' ;	\|for(	\|for(	false

PEG versus regular expressions

Since PEGs cover context free grammars, they are more powerful than regular expressions. The most important difference between Parsing Expression Grammars and regular expressions is the fact that the former supports recursive definitions, the latter not. But simple non recursive situations often require shorter regular expression patterns as can be shown for an email address recognizer.

Regular expression as email address recognizer:
          [A-Za-z0-9._%-]+@[A-Za-z0-9._%-]+\.[A-Za-z]{2,4}

Incorrect PEG rule as email address recognizer:
   EMail: [A-Za-z0-9._%-]+'@'[A-Za-z0-9._%-]+'.'[A-Za-z]{2,4} ;

Correct PEG rules as email address  recognizer:
   EMailChar:   [A-Za-z0-9._%-];
   EMailSuffix:  [A-Za-z]{2,4} !EMailChar ;
   EMail:   EMailChar+
            '@'
            ([A-Za-z0-9_%-] |'.'!EMailSuffix)+
            '.'
            EmailSuffix ;

The correct PEG email address recognizer is more complicated than the regular expression email address recognizer and uses implicit backtracking (in !EMailSuffix). When we parse the legal email address marc.bloom@blo.blo.uk with the PEG rule EMail: [A-Za-z0-9._%-]+'@'[A-Za-z0-9._%-]+'.'[A-Za-z]{2,4} ;, we will fail, because blo.blo.uk will be recognized by the greedy rule component [A-Za-z0-9._%-]+, so that there will be no input character left for '.'[A-Za-z]{2,4}. The regular expression pattern does not have this problem, because the input string blo.blo.uk will be wisely shared between the two parts of the regular expression [A-Za-z0-9._%-]+\.[A-Za-z]{2,4}. Fortunately, it turns out that such intricate backtracking (as used in the correct PEG rule) is seldom necessary for parsing.

Procedural and template implementations of PEG elements

An ideal C/C++ PEG implementation is both close to a PEG grammar and runtime efficient. This can be achieved both with normal "C" code as well as with C++ templates. C++ templates prove to be more flexible, whereas normal C-code has the advantage of making less trouble in case of programmer errors, because error messages for templates are still difficult to decode. The rule S: 'C' ; has the following procedural implementation:

typedef const unsigned char* Iterator;

inline bool Char(Iterator& p,int c) { return *p==c?(++p,true) : false;}

bool S(Iterator& it){return Char(it,'C');}

and the following template implementation:

template<int CHARVALUE>
    struct Char{
        template<typename It>
        static bool Matches(It& p) { return *p==CHARVALUE?(++p,true):false;}
    };
struct S{
   typedef Char<'C'> rule;
   template<typename It>
   static bool Matches(It& p)      { return rule::Matches(p); }
};

The following table sketches the procedural and template implementations of all the PEG constructs:

PEG notation	procedural implementation	template implementation
[a-zA-Z0-9]	In ( p, 'a', 'z', 'A', 'Z', '0', '9' )	In<'a', 'z', 'A', 'Z', '0', '9'>::Matches(p)
[:;/]	OneOfChars ( p, ':', ';', '/' )	OneOfChars<':', ';', '/'>::Matches(p)
'for'	String ( p,'f','o','r' )	String<'f', 'o', 'r'>::Matches(p)
e₁ e₂	( pSave=p, e1(p) && e1(p) )\|\| ( p=pSave, false )	And<e1,e2>::Matches(p)
e₁ / e₂	( pSave=p, e1(p)) \|\| ( p=pSave, e2(p) ) \|\| ( p=pSave, false )	Or<e1,e2>::Matches(p)
e?	e ( p ) ; return true;	Option<e>::Matches(p)
e*	while ( e(p) ) {} return true;	OptRepeat<e>::Matches(p)
e+	if ( !e(p) ) {return false;} while ( e(p) ) {} return true;	PlusRepeat<e>::Matches(p)
e{LOW,HIGH}	pSave=p; int i; for ( i=0; i<=HIGH; i++ ) { if ( !e(p) ) { break; } } return i>=LOW ? true: ( p=pSave,false );	For<LOW,HIGH>::Matches(p)
&e	( pSave=p, e(p) ) ?( p=pSave,true ): ( p=pSave, false )	Peek<e>::Matches(p)
!e	( pSave=p, e(p) ) ? ( p=pSave,false ) : ( p=pSave,true )	Not<e>::Matches(p)
.	++p	Advance<1>::Matches(p)

Note, that the constructs . and e{LOW,HIGH} are not part of the originally proposed PEG elements, but are added by me. Applied to the email PEG recognizer rules, we get the following template implementation:

bool MatchesEMail(openend_iterator& it)
{
  using namespace peg_templ;
  C c;

  typedef In<'A','Z','a','z'>                     Alpha;
  typedef In<'A','Z','a','z','0','9'>             Alnum;
  typedef Or<Alnum,OneOfChars<'.','_','%','-'> >  EMailChar;
  typedef And<For<2,4,Alpha >,Not<EMailChar > >   EMailSuffix;
  typedef And<
            PlusRepeat<EMailChar>,
            Char<'@'>,
            PlusRepeat<Or< Or<Alnum,OneOfChars<'_','%','-'> >,
                           And<Char<'.'>,Not<EMailSuffix> >
                       >
            >,
            Char<'.'>,
            EMailSuffix
          > EMail;

  return EMail::Matches(it,&c);
}

Note that EMail::Matches(it,&c) has two parameters (contrary to the table above). The first is an iterator, the second a pointer to the parsing context. This parsing context is necessary, if one wants to carry out some action during parsing.

Input Iterators

The parser reads the source for parsing. Storage medium, encoding, and end of input indicators can be different for various sources. In most cases, the storage medium is a file or a character string. The character encoding is either ASCII, UTF-8 or one of the many other Unicode encodings. The end of input can be detected either by the presence of a special terminator or by comparison against an end of input pointer.

Checking each input position against the end of input is in most cases not necessary. Most grammars expect only a subset of the possible input characters. If we know that the input is terminated by a special character, say '\0', we can use a start rule like: S: TopRule '\0';.

If we use a template parameter Iterator, we can cope with all the mentioned possibilities by providing a sensible iterator implementation. Our PEG implementation uses only the following iterator operations:

PEG iterator operation	Meaning
`int iterator::operator*()const`	Returns the current character value. In case of an iterator for an input range, it returns a negative value when at the end of range.
`iterator& iterator::operator++()`	Increments the input position. In case of an iterator for an input range, it does not increment the input position when at the end of the input range.

The following table shows two sample models for input iterators:

container/iterator characterization

iterator implementation

character buffer with special termination character (e.g.: '\0').

typedef const unsigned 
      char* peg_openend_iterator;

character buffer with end of input pointer.

struct peg_begend_iterator{
   peg_begend_iterator(const 
        char* p=0,const char* pEnd=0)
      :m_p((const unsigned char*)p),
       m_pEnd((const unsigned 
                char*)pEnd){}
   int  operator*()const{
      return m_p==m_pEnd?
         (INT_MIN+1):*m_p;
   }
   peg_begend_iterator& 
          operator++(){
      if(m_p<m_pEnd)++m_p;
      return *this;
   }
   unsigned const char* m_p;
   unsigned const char* m_pEnd;
};

Why performance matters

The parser of a C++ compiler spends less then 20% of the compilation time with parsing. But there are many applications for parsing without sophisticated post-processing where speed really matters. Examples are auto-completion support for development environments, preprocessing of C/C++ source files, extraction of items from large XML files.

There are two main sources of inefficiencies in parsing:

backtracking
runtime costs of unnecessary abstractions

In recursive descent parsing, backtracking can be avoided by rewriting the grammar in a way that avoids alternatives with the same beginning. This technique is known as left factorization and will be shown in one of the next articles of this series. In LR parsing (LALR parsing to be precise) as used by YACC (the UNIX parser generator), backtracking is avoided by the internal use of a stack for symbols which wait for further processing. It is well known that hand written recursive descent parsers using optimized grammars and tools like YACC are the best performers. But for the casual user, handwritten parsers as well as compiler generation tools are not ideal, the former because they require too much work, the latter, because they generate in most cases impenetrable and hard to maintain code. With Parsing Expression Grammars, the picture could change, because PEGs combine the easy to understand recursive descent technique with the possibility of an efficient implementation which is very close to the original grammar. I therefore compared the runtime efficiency of three PEG implementations and of a hand-optimized parser. I took a grammar for arithmetical expressions as a test case, because such a grammar can be written in a way which requires no backtracking. All the tested parsers had to parse the expression 132*( firstOccurance + x2*( 1001/N55 )+19 ), which was repeated until it filled a buffer of 1000000 characters. The following table gives the results.

Implementation technique	CPU usage VC++ 7.0	CPU usage GCC 3.3.1
procedural PEG implementation	0.17 seconds	0.03 seconds
templated PEG implementation	0.211 seconds	0.041 seconds
begEnd templated PEG implementation	0.32 seconds	0.09 seconds
hand optimized PEG implementation	0.06 seconds	0.03 seconds

The results show that the templated and procedural PEG implementations come very close to the hand-optimized parser. They also show that using an iterator which always checks for the end of input condition has its price.

Points of interest

The term Parsing Expression Grammars was coined by Bryan Ford [1]. He also pointed out that this kind of recursive descent parsing with backtracking is common practice. He showed that only very few rules are needed to define a powerful framework for parsing. Instead of relying on grammar transformations which eliminate alternatives having the same beginning (so called left factorization), he invented a parser, which memorizes already taken paths and achieves linear parsing time for any PEG grammar (the packrat parser).

An important advantage of PEGs comes from the fact that a separate scanner is not necessary. When parsing with a scanner, one divides the parsing task in a low level part called scanning, and in a high level part, proper parsing. A scanner typically skips white space and comments and reads items like numbers, operators and identifiers, and returns an integer value to indicate what it has found. A scanner reduces the necessity to backtrack in some cases. On the other hand, it introduces a new level of abstraction which was never necessary from a theoretical point of view, and it can cause problems because of the lack of context information. A typical example is the closing angle bracket of template instantiations (the > symbol). You get a compiler error if you close two template instantiations at once using >> (one must use > > instead), because the scanner recognizes >> as a shift operator, a problem which does not arise if you do not use a scanner. Interestingly, Christoper Diggins [2], who presented his YARD parser on this site, does not mention the work of Bryan Ford, but uses exactly the PEG ideas including scanner-less parsing. This is certainly due to the fact that these ideas lie in the air.

History

2005 April: initial version.
2005 April: corrections to chapter PEG versus regular expressions (thanks to Gary Feldman).

References

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Martin.Holzherr

Switzerland

This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.