## Introduction

Parsers, Interpreters and Compilers are becoming more and more into the reach of the normal programmer. In the very beginning, such tools were hand crafted. Then came the time of the parser and compiler generators (e.g., YACC (popular UNIX parser generator) or ANTLR), and currently we experience the integration of grammars and parsers directly into a host language. `boost::spirit`

, for example, is integrated into C++ as a library which comes with a domain specific sublanguage (using operator overloading and expression templates). Perl 6 goes even a step further and makes grammars part of the language. These tools help you very much but they still require quite a lot of knowledge about the underlying theoretical framework.

This is the first part of a series of articles which explains the most basic concepts and practices to cope with parsing and compiling. It uses toy programming languages and their interpreters to expose the concepts. For each toy language, sample programs are presented. These samples are handwritten, but in a style which shows how the task could be automated. This introduction into language processing addresses newbies as well as more experienced programmers. The former get enough explanatory text, the latter C++ code samples which show how language processing can be done with minimal baggage. The whole article series strives for minimality both in theory and in implementation.

## Steps for building an interpreter

Let's take a command line calculator for arithmetic formulas as an introductory example. The following excerpt shows a sample session, where lines starting with '>>' indicate program output.

Pi=3.1415926535
>> Pi= 3.1415926535
Area=Pi*r^2
>> Area= 0
r=3
>> Area= 28.2743338815
>> r= 3

To implement this, the calculator program must accomplish the following tasks:

Task | Sketch of Solution |
---|

It must **recognize the structure and the components** of the input line which consists of arithmetical expressions, assignments and formula names. Errors must be reported in a user friendly, informative way. | This is done by a process called **parsing**. The effort done for parsing pays off in the quality of the error messages for syntax errors and in the error recovery. A poorly designed parser will produce follow errors after the first error or will not be able to continue parsing at all after the first error. |

It must **evaluate** the arithmetic formula and store the result in a map which associates the formula name with the value of the expression. Furthermore, it must store the formula expression itself in order to recalculate the formula when a component changes its value by another evaluation. | **Evaluation** of an expression or statement to a result value can only be done after successful parsing. Evaluation may either be executed directly during parsing or it may be deferred until the parsing process has constructed an appropriate intermediate "database" and finally has generated efficient "code" for the execution by a real or virtual processor. |

It must cope with **runtime errors** like divide by zero or taking the root of a negative number. | Handling of runtime errors is done by the **runtime system**. The amount of runtime error handling is a tradeoff between efficiency and convenience and must be adapted to the expectations of the target user group. |

## Parsing and Grammars

Parsing is the process of associating a hierarchical structure with a source text. A programmer, who indents her source text appropriately (e.g., she indents the body of loops), does a simple form of parsing. Parsing is the first task in language processing and one of the most important ones, because it guides all the later steps of a translation system. The formal framework for parsing is set by a grammar which describes the source language. A grammar consists of a set of rules, which taken together associates a unique tree structure with any correct input text. Grammars are divided into different classes according to their expressiveness. On the lowest level are grammars for regular expressions. These grammars cannot describe recursively nested constructs as they are needed to express, for example, the fact that a statement has other statements as its components. The grammar class commonly used to describe programming languages is called "context free". Each rule of a context free grammar has the form `R: a b c | d ;`

where `R`

is the name of the rule (also called a Nonterminal), `:`

separates the rule name from the right hand side of the rule and where `a b c d`

and so on are either terminals or names of further rules (Nonterminals), and `|`

is used to separate alternatives. This is further explained in the following table:

**Notion** | **Samples** | **Meaning** |

Terminal | 'for' [a-zA-Z][0-9]+; | Describes a string of the input directly. It is common practice to use a subset of the notation which is used in regular expressions for describing terminals. The only difference is the notation for strings. In context free grammars, we write e.g., 'for', where we would write just `for` in a regular expression. |

Quantifiers | ? * + | Same meaning as in regular expressions. These three quantifiers are sufficient. |

Alternative | | | Same meaning as in regular expressions. |

Grouping | ( ... ) | Same meaning as in regular expressions. |

Nonterminal Rule | EnclosedDigits: [0-9]+ | '(' EnclosedDigits ')' ; | EnclosedDigits in the sample is a Nonterminal. In this sample, it is both the name of a rule and used on the right hand side of the rule. For each Nonterminal used somewhere in a grammar rule, there must be exactly one corresponding rule. |

The following sample context free grammar rule describes a string of digits enclosed in an arbitrary number of parentheses:

EnclosedDigits: [0-9]+ | '(' EnclosedDigits ')' ;

This grammar rule associates the input string `((123)))`

with the following parse tree:

EnclosedDigits
'(' EnclosedDigits ')'
'(' EnclosedDigits ')'
'123'

In the coming samples, we will use the following linear form as a tree notation:

EnclosedDigits< '(' EnclosedDigits< '(' EnclosedDigits<123> ')' > ')' >

Interestingly, the rule `EnclosedDigits:`

matches the complete input string `((123))`

from the first to the last character, and it also matches inner parts like `(123)`

and `123`

. This is possible because of the recursive nature of context free grammars. When a grammar is used to recognize an input string, we say that we use the grammar for parsing. A grammar can also be used to generate strings of the language by a process called substitution. Substitution takes place on the right hand side of the so called *start rule*. Substitution means replacing a Nonterminal by one of the alternatives on the right side of the rule for the replaced Nonterminal. By repeated substitutions on the right hand side of the start rule, we can generate all input strings which belong to the language described by the grammar. This shall be shown with the following sample *start rule*:

Expression: Expression '+' Expression | '(' Expression ')' | [0-9]+ ;

The following table shows one possible substitution history:

**Before Substitution** | **Step Description** | **After Substitution** | **Parse Tree** |

Expression | Substitute by first alternative of Expression rule | Expression
'+'
Expression | Expression<
Expression
'+'
Expression
> |

Expression
'+'
Expression | Substitute marked Nonterminal by second alternative of Expression rule | Expression
'+'
'(' Expression ')' | Expression<
Expression
'+'
Expression<
'(' Expression ')'
>
> |

Expression
'+'
'(' Expression ')' | Substitute marked Nonterminal by '1001' (using [0-9]+ alternative) | '1001'
'+'
'(' Expression ')'
| Expression<
Expression<'1001'>
'+'
Expression<
'(' Expression ')'
>
> |

'1001'
'+'
'(' Expression ')' | Replace marked Nonterminal by '37' (using [0-9]+ alternative) | '1001'
'+'
'(' '37' ')'
| Expression<
Expression<'1001'>
'+'
Expression<
'('
Expression<'37'>
')'
>
> |

The resulting generated language string is `1001+(37)`

. Note that the order of substitutions - whether you first substitute a Nonterminal occurring on the very right side or on the left side - is not specified by a context free grammar. A parser therefore needs not only a context free grammar to do its job, it needs also a parsing strategy, which is nothing else than a built-in rule telling which of the possible substitutions is carried out first. Common parsing strategies are LL and LR. Both parser types read the input from left to right, which is the reason for the first L in LL and LR. An LL parser always substitutes the leftmost Nonterminal first, whereas an LR parser substitutes the rightmost Nonterminal first. Typically, LL and LR parsers do not accept the same grammar rules. An LL parser is not happy with left recursive rules whereas an LR parser is not happy with right recursive rules, this is exemplified by the following sample grammar rule:

Parsing Strategy | Grammar | Comment |
---|

LL | Digits: [0-9] Digits? ; | Match first a terminal, then recurse. |

LR | Digits: Digits? [0-9] ; | Recurse, then match terminal. |

The LR - grammar `Digits: Digits? [0-9]`

is left recursive, because its right hand side has the rule name as its first symbol. Fortunately, it is not difficult to rewrite an LR grammar into an LL grammar and vice versa, as we will see later.

## Expressiveness of Context Free Grammars

Many import properties of existing programming languages can either not be expressed with a context free grammar or are tedious to describe. Consider a language with procedures where the procedure name must both occur at the beginning and at the end:

Proc: 'proc' Name Parameters Body Name ';' ;

There is no way to describe that the *Name* at the beginning must be the same as the *Name* at the end. Take as another example a length limitation, e.g., the rule that an identifier cannot exceed 32 characters. While it is possible to describe this with a context free grammar rule, it either requires a lot of writing or the addition of a new quantifier (e.g.: [a-zA-Z]{1,32}). It turns out in many cases that it is preferable to check for the violation of a restriction after successful parsing rather than trying to express the restriction within the grammar. Generally speaking, a grammar for a programming language accepts always a superset of the legal constructs. The real strength of context free grammars is the proper description of hierarchical relations as shown by the following examples of grammars for simple arithmetic expressions. This is a poorly chosen grammar for expressions:

expr0: expr0 ('+'|'-') expr0 | expr0 ('*'|'/') expr0 | [0-9]+ ;

This grammar does not express the fact that the precedence of '+' is not the same as that of '*'. The following is a much better grammar for expressions:

expr1: mul_expr | expr1 ('+'|'-') mul_expr ;
mul_expr: [0-9]+ | mul_expr ('*'|'/') [0-9]+ ;

The source text `2*3+5`

could be parsed by `expr0`

into the following tree:

expr0<
expr0<'2'>
'*'
expr0< expr0<'3'> '+' expr0<'5'> >
>

whereas `expr1`

would result in this tree:

expr1<
expr1<mul_expr<'2'> '*' '3' >
'+'
mul_expr<'5'>
>

Only the second tree reflects the precedence of the '+' and '*' operators. By the way, the grammar rule for `expr1`

is suitable for an LR parser, but not for an LL parser.

## The Peril of Ambiguity

A grammar is ambiguous if a source text can be parsed into two different parse trees depending on the order of substitutions. The above sample `expr0: expr0 ('+'|'-') expr0 | expr0 ('*'|'/') expr0 | [0-9]+ ;`

is ambiguous because the source text `2*3+5`

can be parsed into both of the following trees:

expr0<expr0<'2'> '*' expr0< expr0<'3'> '+' expr0<'5'> > >

expr0<expr0<expr0<'2'> '*' expr0<'3'> > '+' expr0<'5'> >

depending on the substitution order. Ambiguity defeats the main purpose of grammars, namely the association of a source text with a unique hierarchical structure. Whether ambiguity of a grammar really results in different hierarchical structures not only depends on the grammar but also on the parsing strategy.

## Recursive Descent Parsing

Recursive descent parsing is the most simple and intuitive parsing method. It is an LL parsing method and therefore reads the input from left to right (the first L in LL) and traverses the grammar from left to right (the second L in LL). The name recursive descent is due to the fact that this parsing method uses a function for every grammar rule. The function for the start rule must parse the complete source. It does this by providing a function body, which implements the right hand side of the start rule in the following way:

- in case of alternatives, it tries all the promising alternatives by executing the corresponding code for the alternative until one of them succeeds (matches the input text)
- in case of a sequence of grammar symbols (tokens or Nonterminals), it executes the corresponding code for each symbol in the sequence until one of them fails or otherwise the code for the whole sequence succeeds.

Note that the implementation of a recursive descent parser sometimes requires *backtracking*. Backtracking means that you go back to a previous state of the computation. If, for example, the first part of the chosen alternative is successfully matched against the input text and then a mismatch occurs, we must set back the source position to the position before we entered the alternative and choose another alternative. Let's show this with a sample implementation. The rules:

EnclosedDigits: Digits | '(' EnclosedDigits ')' ;
Digits: [0-9] Digits? ;

could be implemented by the following parsing procedures which merely test whether the input matches the grammar rules:

typedef const unsigned char* Iterator;
bool MatchesDigits(Iterator& it) {
return (*it>='0' && *it<='9'?(++it,true):false) && (MatchesDigits(it),true); }
bool MatchesEnclosedDigits(Iterator& it) {
Iterator itSave= it;
return
MatchesDigits(it) || ( (itSave=it,
(*it=='('?(++it,true):false) && MatchesEnclosedDigits(it) && (*it==')'?(++it,true):false) )
||(it=itSave,false)
);
}

The *start grammar function* `MatchesEnclosedDigits`

takes an iterator (e.g., character pointer) as input and either returns `true`

, in which case it sets the iterator to the position just behind the recognized text, or it returns `false`

, in which case it leaves the iterator unchanged. The close relationship between grammar rules and corresponding parsing functions make it obvious that a recursive descent parser cannot work with grammar rules like:

Digits: Digits? [0-9] ;

because this would result in infinite recursion. Many grammars coming with language specifications are written for LR parsers (the well known YACC tool requires an LR grammar). But it is quite easy to rewrite an LR grammar into an LL grammar.

## Parsing Expression Grammars (PEGs)

The idea behind the so called parsing expression grammar (PEG) formalism [1] is a "procedural interpretation" of context free grammars. A PEG grammar can be "executed" using the grammar and a source string as input. The result of the execution is either a match of the input string, in which case the parsing position is at the end of the recognized part, or otherwise the result is "doesn't match", in which case the input position remains unchanged. This statement is not only true for the grammar as a whole, but for each grammar rule. The PEG formalism is nothing else than a mapping of common practices in recursive descent parsing to grammars. It consists of the following grammar elements:

**PEG construct** | **Notation** | **Operational Meaning** |

**Sequence** | e_{1} e_{2} | Match input against e_{1} moving the input position behind the input part described by e_{1}, then match from new input position against e_{2}. If the match fails for either e_{1} or e_{2}, the input position is restored to the position before entering the Sequence and the result is *fail*, otherwise the input position is moved behind the recognized Sequence and the result is *success*. |

Prioritized choice between **Alternative**s | e_{1} / e_{2} | Match input against e_{1} moving the input position behind the input part described by e_{1}. If the match fails, restore the input position and match against e_{2}. If the match fails again, restore to the position before entering the Alternative and the result is *fail*. Otherwise, if one of the alternatives succeeds, the input position is moved behind the source part which is recognized by the chosen alternative and the result is *success*. |

Greedy **Option** | e? | Match input against *e*, moving the input position behind the recognized part in case of success. If the match fails, restore the input position, but the result is in any case *success*. |

Greedy **zero or more **occurrences | e* | Interpret it like an unlimited number of *e? e? e? e?* ... |

Greedy **one or more** occurrences | e+ | Interpret it like *e e? e? e?* ... |

**Peek** predicate | &e | Match the input against *e* without changing the input position (the input position remains the same whether the match succeeds or fails). The result is *success* if it matches, otherwise it is *fail*. |

**Not** predicate | !e | Match the input against *e* without changing the input position (the input position remains the same whether the match succeeds or fails). The result is *fail* if it matches, otherwise it is *success*. |

**Advance** predicate | **.** | Increase the input position by one. The result is *success.* |

The following table shows the effect of PEG grammar constructs on sample input strings. The `|`

character indicates the input position:

**PEG construct** | **Input** | **Match Position** | **Match Result** |

S: 'for' ; | |for | for| | true |

S: 'for' ; | |former | for|mer | true |

S: 'for' ; | |afor | |afor | false |

S: 'for' 'all' ; | |forall men | forall| men | true |

S: 'former' / 'for' ; | |for | for| | true |

S: 'former' / 'for' ; | |former | former| | true |

S: 'for' / 'former' ; | |for | for| | true |

S: 'for' / 'former' ; | |former | for|mer | true |

S: 'for'?'mer' ; | |former | former| | true |

S: 'for'?'mer' ; | |mer | mer| | true |

S: 'for'?'former' ; | |former | |former | false |

S: [0-9]* ; | |1903.535 | 1903|.535 | true |

S: [a-z **.**]+'.*'? ; | |ifi.go.* | ifi.go.|* | true |

S: 'for' &'(' ; | |for( | for|( | true |

S: 'for' &'(' ; | |for[ | |for[ | false |

S: 'for' !'(' ; | |for[ | for|[ | true |

S: 'for' !'(' ; | |for( | |for( | false |

### PEG versus regular expressions

Since PEGs cover context free grammars, they are more powerful than regular expressions. The most important difference between Parsing Expression Grammars and regular expressions is the fact that the former supports recursive definitions, the latter not. But simple non recursive situations often require shorter regular expression patterns as can be shown for an email address recognizer.

Regular expression as email address recognizer:
[A-Za-z0-9._%-]+@[A-Za-z0-9._%-]+\.[A-Za-z]{2,4}

Incorrect PEG rule as email address recognizer:
EMail: [A-Za-z0-9._%-]+'@'[A-Za-z0-9._%-]+'.'[A-Za-z]{2,4} ;

Correct PEG rules as email address recognizer:
EMailChar: [A-Za-z0-9._%-];
EMailSuffix: [A-Za-z]{2,4} !EMailChar ;
EMail: EMailChar+
'@'
([A-Za-z0-9_%-] |'.'!EMailSuffix)+
'.'
EmailSuffix ;

The correct PEG email address recognizer is more complicated than the regular expression email address recognizer and uses implicit backtracking (in `!EMailSuffix`

). When we parse the legal email address *marc.bloom@blo.blo.uk* with the PEG rule `EMail: [A-Za-z0-9._%-]+'@'[A-Za-z0-9._%-]+'.'[A-Za-z]{2,4} ;`

, we will fail, because *blo.blo.uk* will be recognized by the greedy rule component `[A-Za-z0-9._%-]+`

, so that there will be no input character left for `'.'[A-Za-z]{2,4}`

. The regular expression pattern does not have this problem, because the input string *blo.blo.uk* will be wisely shared between the two parts of the regular expression `[A-Za-z0-9._%-]+\.[A-Za-z]{2,4}`

. Fortunately, it turns out that such intricate backtracking (as used in the correct PEG rule) is seldom necessary for parsing.

### Procedural and template implementations of PEG elements

An ideal C/C++ PEG implementation is both close to a PEG grammar and runtime efficient. This can be achieved both with normal "C" code as well as with C++ templates. C++ templates prove to be more flexible, whereas normal C-code has the advantage of making less trouble in case of programmer errors, because error messages for templates are still difficult to decode. The rule `S: 'C' ;`

has the following procedural implementation:

typedef const unsigned char* Iterator;
inline bool Char(Iterator& p,int c) { return *p==c?(++p,true) : false;}
bool S(Iterator& it){return Char(it,'C');}

and the following template implementation:

template<int CHARVALUE>
struct Char{
template<typename It>
static bool Matches(It& p) { return *p==CHARVALUE?(++p,true):false;}
};
struct S{
typedef Char<'C'> rule;
template<typename It>
static bool Matches(It& p) { return rule::Matches(p); }
};

The following table sketches the procedural and template implementations of all the PEG constructs:

**PEG notation** | **procedural implementation** | **template implementation** |

[a-zA-Z0-9] | In ( p, 'a', 'z',
'A', 'Z', '0', '9' ) | In<'a', 'z', 'A', 'Z',
'0', '9'>::Matches(p) |

[:;/] | OneOfChars ( p, ':', ';', '/' ) | OneOfChars<':', ';',
'/'>::Matches(p) |

'for' | String ( p,'f','o','r' ) | String<'f', 'o',
'r'>::Matches(p) |

e_{1} e_{2} | ( pSave=p, e1(p) && e1(p) )||
( p=pSave, false ) | And<e1,e2>::Matches(p) |

e_{1} / e_{2} | ( pSave=p, e1(p))
|| ( p=pSave, e2(p) )
|| ( p=pSave, false ) | Or<e1,e2>::Matches(p) |

e? | e ( p ) ; return true; | Option<e>::Matches(p) |

e* | while ( e(p) ) {} return true; | OptRepeat<e>::Matches(p) |

e+ | if ( !e(p) ) {return false;}
while ( e(p) ) {} return true; | PlusRepeat<e>::Matches(p) |

e{LOW,HIGH} | pSave=p; int i;
for ( i=0; i<=HIGH; i++ ) {
if ( !e(p) ) { break; }
}
return i>=LOW ? true:
( p=pSave,false ); | For<LOW,HIGH>::Matches(p) |

&e | ( pSave=p, e(p) )
?( p=pSave,true ):
( p=pSave, false ) | Peek<e>::Matches(p) |

!e | ( pSave=p, e(p) )
? ( p=pSave,false ) :
( p=pSave,true ) | Not<e>::Matches(p) |

**.** | ++p | Advance<1>::Matches(p) |

Note, that the constructs `.`

and `e{LOW,HIGH}`

are not part of the originally proposed PEG elements, but are added by me. Applied to the email PEG recognizer rules, we get the following template implementation:

bool MatchesEMail(openend_iterator& it)
{
using namespace peg_templ;
C c;
typedef In<'A','Z','a','z'> Alpha;
typedef In<'A','Z','a','z','0','9'> Alnum;
typedef Or<Alnum,OneOfChars<'.','_','%','-'> > EMailChar;
typedef And<For<2,4,Alpha >,Not<EMailChar > > EMailSuffix;
typedef And<
PlusRepeat<EMailChar>,
Char<'@'>,
PlusRepeat<Or< Or<Alnum,OneOfChars<'_','%','-'> >,
And<Char<'.'>,Not<EMailSuffix> >
>
>,
Char<'.'>,
EMailSuffix
> EMail;
return EMail::Matches(it,&c);
}

Note that `EMail::Matches(it,&c)`

has two parameters (contrary to the table above). The first is an iterator, the second a pointer to the parsing context. This parsing context is necessary, if one wants to carry out some action during parsing.

## Input Iterators

The parser reads the source for parsing. Storage medium, encoding, and end of input indicators can be different for various sources. In most cases, the storage medium is a file or a character string. The character encoding is either ASCII, UTF-8 or one of the many other Unicode encodings. The end of input can be detected either by the presence of a special terminator or by comparison against an end of input pointer.

Checking each input position against the end of input is in most cases not necessary. Most grammars expect only a subset of the possible input characters. If we know that the input is terminated by a special character, say '\0', we can use a start rule like: `S: TopRule '\0';`

.

If we use a template parameter `Iterator`

, we can cope with all the mentioned possibilities by providing a sensible iterator implementation. Our PEG implementation uses only the following iterator operations:

**PEG iterator operation** | **Meaning** |

`int iterator::operator*()const` | Returns the current character value. In case of an iterator for an input range, it returns a negative value when at the end of range. |

`iterator& iterator::operator++()` | Increments the input position. In case of an iterator for an input range, it does not increment the input position when at the end of the input range. |

The following table shows two sample models for input iterators:

**container/iterator characterization** | **iterator implementation** |

character buffer with special termination character (e.g.: '\0'). | typedef const unsigned
char* peg_openend_iterator; |

character buffer with end of input pointer. | struct peg_begend_iterator{
peg_begend_iterator(const
char* p=0,const char* pEnd=0)
:m_p((const unsigned char*)p),
m_pEnd((const unsigned
char*)pEnd){}
int operator*()const{
return m_p==m_pEnd?
(INT_MIN+1):*m_p;
}
peg_begend_iterator&
operator++(){
if(m_p<m_pEnd)++m_p;
return *this;
}
unsigned const char* m_p;
unsigned const char* m_pEnd;
}; |

## Why performance matters

The parser of a C++ compiler spends less then 20% of the compilation time with parsing. But there are many applications for parsing without sophisticated post-processing where speed really matters. Examples are auto-completion support for development environments, preprocessing of C/C++ source files, extraction of items from large XML files.

There are two main sources of inefficiencies in parsing:

- backtracking
- runtime costs of unnecessary abstractions

In recursive descent parsing, backtracking can be avoided by rewriting the grammar in a way that avoids alternatives with the same beginning. This technique is known as left factorization and will be shown in one of the next articles of this series. In LR parsing (LALR parsing to be precise) as used by YACC (the UNIX parser generator), backtracking is avoided by the internal use of a stack for symbols which wait for further processing. It is well known that hand written recursive descent parsers using optimized grammars and tools like YACC are the best performers. But for the casual user, handwritten parsers as well as compiler generation tools are not ideal, the former because they require too much work, the latter, because they generate in most cases impenetrable and hard to maintain code. With Parsing Expression Grammars, the picture could change, because PEGs combine the easy to understand recursive descent technique with the possibility of an efficient implementation which is very close to the original grammar. I therefore compared the runtime efficiency of three PEG implementations and of a hand-optimized parser. I took a grammar for arithmetical expressions as a test case, because such a grammar can be written in a way which requires no backtracking. All the tested parsers had to parse the expression `132*( firstOccurance + x2*( 1001/N55 )+19 )`

, which was repeated until it filled a buffer of 1000000 characters. The following table gives the results.

**Implementation technique** | **CPU usage VC++ 7.0** | **CPU usage GCC 3.3.1** |

procedural PEG implementation | 0.17 seconds | 0.03 seconds |

templated PEG implementation | 0.211 seconds | 0.041 seconds |

begEnd templated PEG implementation | 0.32 seconds | 0.09 seconds |

hand optimized PEG implementation | 0.06 seconds | 0.03 seconds |

The results show that the templated and procedural PEG implementations come very close to the hand-optimized parser. They also show that using an iterator which always checks for the end of input condition has its price.

## Points of interest

The term Parsing Expression Grammars was coined by Bryan Ford [1]. He also pointed out that this kind of recursive descent parsing with backtracking is common practice. He showed that only very few rules are needed to define a powerful framework for parsing. Instead of relying on grammar transformations which eliminate alternatives having the same beginning (so called left factorization), he invented a parser, which memorizes already taken paths and achieves linear parsing time for any PEG grammar (the packrat parser).

An important advantage of PEGs comes from the fact that a separate scanner is not necessary. When parsing with a scanner, one divides the parsing task in a low level part called scanning, and in a high level part, proper parsing. A scanner typically skips white space and comments and reads items like numbers, operators and identifiers, and returns an integer value to indicate what it has found. A scanner reduces the necessity to backtrack in some cases. On the other hand, it introduces a new level of abstraction which was never necessary from a theoretical point of view, and it can cause problems because of the lack of context information. A typical example is the closing angle bracket of template instantiations (the `>`

symbol). You get a compiler error if you close two template instantiations at once using `>>`

(one must use `> >`

instead), because the scanner recognizes `>>`

as a shift operator, a problem which does not arise if you do not use a scanner. Interestingly, Christoper Diggins [2], who presented his YARD parser on this site, does not mention the work of Bryan Ford, but uses exactly the PEG ideas including scanner-less parsing. This is certainly due to the fact that these ideas lie in the air.

## History

- 2005 April: initial version.
- 2005 April: corrections to chapter
*PEG versus regular expressions* (thanks to Gary Feldman).

## References

- Parsing Expression Grammars, Bryan Ford, MIT, January 2004 [^]
- A Regular Expression Tokenizer using the YARD Parser, Christopher Diggins, December 2004 [^]