An introduction to lex and yacc part 1

Rob Manderson

4.90/5 (52 votes)

Mar 22, 2003

CPOL

17 min read

290462

A guide to using lex and yacc to create parsers in your projects

An introduction to lex and yacc.

I'm working on an editor to work with dialog templates. Not, I hasten to add, a dialog editor as such - but an editor that can handle tables related to dialogs but not supported by Visual Studio.

As part of the project I wanted to be able to load and display dialog templates as dialogs.

There's an easy way and a hard way to do this. The easy way is to require that the resource script has already been compiled into a .res file and then (somehow) get at the compiled templates. Since I'm (presumably) writing the dialog templates and they are part of a project I'm working on it seems reasonable to assume that I have access to .res files. This would let me get a handle to a dialog template and pass it to CreateDialogIndirect and would also allow my code to inspect each controls characteristics. But this doesn't allow for the fallibility of human memory. If there's one thing I'm sure of it is that I'll change a dialog template somewhere and forget to compile it.

Thus we come to the hard way. I decided write a parser that can read a Visual Studio 6 resource source file (.rc file) and hand me back a set of compiled dialog templates.

So how to go about it? A quick look at the source code in a resource template file was enough to tell me that life was too short to try and do it the obvious way. A resource file contains many blocks of text. Some are comments, some are toolbar resources, some are menus, some are dialogs, and there are many other kinds of blocks. Each kind of block has it's own format. Writing a parser that knows about all these kinds of blocks using normal procedural programming seemed like a lot of work. So I turned to yacc and lex to help.

Yacc and Lex 101

Yacc and Lex originated in the UNIX world. Yacc stands for 'yet another compiler compiler' and Lex is short for 'lexical analyser'. Wow that tells you a lot doesn't it! My introduction to yacc was sometime in 1986 when I was reading a HPUX 9.x manual and found the chapter on yacc. It was pretty much the standard introduction to yacc - it assumed you knew all about the subject before you started. See this http://dinosaur.compilertools.net/ for a fair example of the level of description and explanation.

Lex is a tool that reads a stream of input from somewhere and breaks it up into its component pieces. Each component piece is a token. A token might be a keyword or a number or a string or punctuation. In this context a token is something that cannot be broken down into smaller pieces. Either it's a keyword or it's something else. If it's a string it's a string and we are not concerned with the machine representation of a string. Likewise with numbers and punctuation.

Yacc is a tool that receives a stream of tokens from somewhere and imposes structure upon them. The programmer defines the structure and specifies actions to perform when a particular structure is recognised. It should come as no surprise that yacc frequently gets it's tokens from lex. (Parenthetically I want to point out that yacc and lex are UNIX programs - there are GNU equivalents in flex and bison - a rather bad pun on yacc... for the purpose of this article you can interchange flex and lex or yacc and bison).

For yacc the programmer writes a grammar and processes it through yacc. The output is a c file which is then compiled and linked into your program. Likewise with lex.

Yacc and lex 102

The overall structure of yacc and lex files are similar. The first section contains general c code declarations and yacc/lex directives and is delimited from the second section by a %% line.

The second section contains either the yacc grammar in the case of a yacc file or regular expressions in the case of a lex file. The second section is delimited from the third section by a %% line.

The third section contains raw c code which is copied verbatim to the output file.

A yacc file looks like this

%{
C declarations
%}
yacc declarations
%%
Grammar rules
%%
Additional C code

And a lex file looks like this

%{
C declarations
%}
lex declarations
%%
Regular expressions
%%
Additional C code

Yacc and Lex 103

Let's look at a simple yacc grammar. We're trying to write a parser to interpret a dialog template in a resource script file. Know thy input! So let's look at a dialog template first.

IDD_WHISPERWINDOW DIALOG 0, 0, 261, 146 STYLE DS_MODALFRAME | 
WS_MINIMIZEBOX | WS_POPUP | WS_VISIBLE | WS_CAPTION | WS_SYSMENU
CAPTION "Whispers"
FONT 8, "MS Sans Serif"
BEGIN
    CONTROL"",IDC_EDIT,"RichEdit20A",WS_BORDER | WS_TABSTOP 
        | 0xc4,7,109,247,12
    CONTROL"",IDC_RICHEDIT,"RichEdit20A",WS_BORDER | 
        WS_VSCROLL | WS_TABSTOP | 0x844,7,7,247,100
END

We can break this down into the following parts:

A dialog template consists of an ID followed by the word DIALOG followed by some screen coordinates followed by a STYLE followed by a CAPTION followed by a FONT followed by a block that defines some controls. Phew what a mouthful!

Expressed in yacc it might look like this:

%start dialog

%token L_ID
%token L_STRING
%token L_NUMBER
%token L_DIALOG
%token L_STYLE
%token L_CAPTION
%token L_FONT
%token L_BEGIN
%token L_END
%token L_STYLES
%token L_CONTROL
%token L_NUM

%token L_WS_TABSTOP
%token L_WS_BORDER

more 
styles

%%
dialog   : L_ID L_DIALOG L_NUM ',' L_NUM ',' 
           L_NUM ',' L_NUM L_STYLEstylescaption L_FONT L_NUM ',' L_STRING 
           L_BEGIN controls L_END
         ;

styles   :  style
         |  style '|' styles
         ;

style    :  L_WS_TABSTOP
         |  L_WS_BORDER
         |  L_WS_VSCROLL
         |  /* more styles */
         ;

caption  :  /*empty */
         |  L_CAPTION L_STRING
         ;

controls :  control
         |  control controls
		 ;

control  :  L_CONTROL L_STRING ',' L_ID ',' L_STRING ',' styles 
            ',' L_NUM ',' L_NUM ',' L_NUM ',' L_NUM 
         ;
%%

The first part of this yacc program is a %start dialog declaration. This tells the parser that it should start assuming that it's looking for input that matches the dialog rule defined a little later in the file.

Then follow the definitions of a series of tokens. Each %token is a token, a lowest level component of input. Notice that we do NOT define what an L_ID looks like. Nor do we define what a L_STRING looks like. We're simply stating that the yacc parser recognises a token indicating that at this point in the input stream we've seen something that is an L_ID or is a L_STRING. But hold onto that thought...

In addition to explicitly named tokens the yacc parser also treats single characters as tokens. You can see this in the first line of the dialog grammar.

dialog:L_ID L_DIALOG L_NUM ',' L_NUM ',' L_NUM ',' L_NUM

The ',' is a token representing a comma in the input stream.

Next comes the dialog : line. This says, in yacc parlance, what I said above: A dialog template consists of a dialog ID (an L_ID) followed by the word 'DIALOG' followed by some screen coordinates followed by a STYLE followed by a CAPTION followed by a FONT followed by a block that defines some controls.

The stuff to the left of : is a non-terminal symbol. This simply means that it's not a token and that it can be broken down into smaller bits (tokens). The stuff to the right of: can be either tokens (no further reduction possible) or non-terminal symbols which in turn consist of finer structure. The definition of the left hand side (the non-terminal symbol) continues until we hit a semicolon ';'.

Convention dictates that tokens (terminal symbols) are in UPPER CASE and non-terminals are in lower case. Thus, from the example above

dialog:L_ID L_DIALOG L_NUM ',' L_NUM ',' L_NUM ',' L_NUM

dialog is a non-terminal symbol defined as an L_ID (token) followed by L_DIALOG (token) followed by a L_NUM (token) followed by a comma ',' etc etc...

When the yacc parser sees this stream of tokens it knows it's on to something - namely, a dialog template.

Yacc and Lex 104

Now that we've defined a simple grammar let's look at where the stream of tokens comes from. At the most basic level we need a way to recognise a stream of characters as matching a particular pattern and signal somehow to the parser that we've just seen that particular pattern.

A pattern can be a specific keyword such as DIALOG or it can literally be a pattern such as 'any string of digits'.

A lex based lexical analyser capable of recognising the set of tokens used in the example yacc grammar above might look like this...

%%
"DIALOG"               { return L_DIALOG; }
"STYLE"                { return L_STYLE; }
"CAPTION"              { return L_CAPTION; }
"FONT"                 { return L_FONT; }
"BEGIN"                { return L_BEGIN; }
"END"                  { return L_END; }
"CONTROL"              { return L_CONTROL; }
"WS_TABSTOP"           { return L_WS_TABSTOP; }
"WS_BORDER"            { return L_WS_BORDER; } 

more styles

[a-zA-Z_][a-zA-Z0-9_]* { return L_ID; }
-?[0-9]+               { return L_NUM; }
"\".*\""               { return L_STRING; }
%%

Each line in the lex script is a regular expression. The regular expression may be a specific keyword or it may be a pattern which lex recognises. When lex recognises a pattern it executes the action associated with the pattern (the stuff between the braces '{ }' and moves on to further input.

Note that this is incomplete - I haven't yet specified a way for the lexical analyser (tokeniser) to return the data associated with the token to the parser. For simple keywords such as DIALOG there's no need to return data - the data is implicit in the fact that we recognised the DIALOG keyword. But for a string you need to pass back both the fact that you saw a string and the string itself. We'll come to that shortly.

Yacc and Lex 105

Now it's time to look at how the lexical analyser returns data values as well as token types. To achieve this both yacc and lex work together. Yacc defines the data types attaching to each token and defines a union containing all the possible datatypes. For example.

%union {
    int  numVal;
    char *stringval;
}

This declaration occurs in the yacc script in the first section of the script (the bit before the first %%). Following the union declaration you list each token that needs to return a data type. Not all tokens need to return data - a keyword token usually doesn't because the fact that you've seen a keyword is data enough. An example of how tokens are declared with data types is

%type <numVal>    L_NUM
%type <stringVal> L_STRING  L_ID

This states that the data that accompanies an L_NUM token is numeric whilst the data for L_STRING and L_ID is string data. It also tells yacc that it should use the numVal member as an int whenever it sees a token of type L_NUM and that it should use the stringVal member as a char * whenever it sees tokens of type L_STRING or L_ID. numVal member of what? You don't care - because you'll be referring to the data using yacc specific syntax and letting yacc take care of exactly where the data is.

In the lex script each fragment of code is responsible for extracting the relevant data and assigning it to the appropriate data member in the union. Yacc and lex become very chummy at this point because yacc defines the data structure and lex performs assignments to the data structure.

The code fragments associated with the regular expressions in the lex script that recognise numbers and id's change to look like this.

[a-zA-Z_][a-zA-Z0-9_]* {
                           yylval.stringVal = strdup(yytext);
                           return L_ID; 
                       }
-?[0-9]+               {
                           yylval.numVal = atoi(yytext);
                           return L_NUM;
                       }

yytext is the text that matched the rule and yylval is a parser provided instance of the datatype defined in the union statement.

If integers are sufficient for your parser it's possible and legal to leave out the union statement. By default integers are used for all token types.

Yacc and Lex 106

Let's take a closer look at the yacc grammar. Assume the tokens are defined (all in upper case following convention). The part of the grammar we're interested in now is this...

%%
dialog    :    L_ID L_DIALOG L_NUM ',' L_NUM ',' L_NUM 
               ',' L_NUM L_STYLE styles caption L_FONT L_NUM ','
               L_STRING L_BEGIN controls 
               L_END
          ;

styles    :    style
          |    style '|' styles
          ;
		  
style     :    L_WS_TABSTOP
          |    L_WS_BORDER
          |    L_WS_VSCROLL
          |    /* more styles */
          ;

caption   :    /*empty */
          |    L_CAPTION L_STRING
          ;

controls  :    control
          |    control controls
          ;

control   :    L_CONTROL L_STRING ',' L_ID ',' L_STRING ',' styles 
               ',L_NUM ',' L_NUM ',' L_NUM ',' L_NUM 
          ;

%%

We've established that the dialog : line defines the outline of a dialog template. Every token between the : and the ; must be present in order for the parser to recognise a dialog template.

We can also say that a non-terminal symbol (the left hand side of the ':') consists of this or that. We do this by expressing the first alternative immediately after the colon and then putting a '|' symbol and the second alternative. A non terminal symbols consists of this or this or this...

Look at the definition for styles.

styles    :    style
          |    style '|' styles
          ;

What's going on here is we're saying that styles consists of a style or a style followed by a '|' pipe character followed by styles again. This is a right-recursive rule. This is how we can parse something like

DS_MODALFRAME | WS_MINIMIZEBOX | WS_POPUP | WS_VISIBLE | WS_CAPTION | 
WS_SYSMENU

and expect to get away with it. The parser gets a token representing the first style, DS_MODALFRAME, and sees that it's followed by a pipe '|'. The rules indicate that if it sees a style followed by a '|' then it will follow the second branch. That branch states that it should then look for another style, at which point it sees WS_MINIMIZEBOX. And so on until it sees WS_SYSMENU after which it will see some other symbol and therefore follows the first branch, which consists of style alone.

Now let's look at the definition of caption.

caption    :    /*empty */
           |    L_CAPTION L_STRING
           ;

This is another powerful yacc construct. The first branch is empty. In other words a caption is optional. But if it's present it must consist of the L_CAPTION token followed by the L_STRING token.

Yacc and Lex 107

Obviously the parser and lexer I've shown above are incomplete. Not only do they not allow for every variation in dialog templates, they do nothing except recognise dialog templates that follow the limited rules defined. I don't plan in this text to present a complete dialog template parser. You can find that in part two . But I do plan to show you how to have the parser do something useful with what it parses.

So now we get to add some code to act on the rules we've matched. I'm going to show it with pseudocode because the details of how to actually create a dialog template in memory are horrendous. Let's change the dialog part of our yacc grammar to this (again I'm leaving out the %token definitions to save space - assume that if it's in uppercase it's a token).

dialog    :    L_ID L_DIALOG
               {
                   gDlgTemplate = new DLGTEMPLATE;
                   gDlgID = $1;
               }
               L_NUM ',' L_NUM ',' L_NUM ',' L_NUM
               L_STYLE styles caption L_FONT L_NUM ','L_STRING
               L_BEGIN controls L_END
          ;

We've introduced a few new things here. Assume we have a global variable called gDlgTemplate as a pointer to a DLGTEMPLATE structure. As soon as we recognise we have a dialog template we create a new instance of a DLGTEMPLATE structure. We then assign the value associated with the L_ID to a global integer called gDlgID.

Not so incidentally, the $1 refers to the first token on the right hand side. Tokens are counted from left to right and include action code blocks contained within matching braces, so L_ID is $1, L_DIALOG is $2, the action code is $3, and so on...

But hang on a moment! Didn't we say before that an L_ID is a string? Yes it is. However you already know that an L_ID is a symbol defined in resource.h and used throughout your code as a handy placeholder for a number. How then can we parse a dialog template that might have either a number or a symbol as the ID? We could do it like this...

dialog    :    L_ID L_DIALOG
               {
                   gDlgTemplate = new DLGTEMPLATE;
                   gDlgID = FindSymbol($1);
               }
               L_NUM ',' L_NUM ',' L_NUM ',' L_NUM
               L_STYLE styles caption L_FONT L_NUM ',' L_STRING
               L_BEGIN controls L_END
		  |    L_NUM L_DIALOG
               {
                   gDlgTemplate = new DLGTEMPLATE;
                   gDlgID = $1;
               }
               L_NUM ',' L_NUM ',' L_NUM ',' L_NUM 
               L_STYLE styles caption L_FONT L_NUM ',' L_STRING
               L_BEGIN controls L_END
          ;

where we repeat the grammar, once for the case where the dialog template contains an ID, and again for the case where the dialog template contains a number. In the first case we take the data associated with the L_ID token (the actual identifier text) and look it up in a symbol table defined somewhere. In the second case we've already got the number so we just do the assignment.

This repeated grammar fragment will work but it's not the best way to do it. In either case we want the first token to evaluate to a number. Whether it's a direct number embedded in the dialog resource template or it's a string referring to a symbol defined somewhere else is irrelevant - what we want is a number. We can substitute a sub grammar to do this.

dialog    :    nameID L_DIALOG
               {
                   gDlgTemplate = new DLGTEMPLATE;
                   gDlgID = $1;
               }
               L_NUM ',' L_NUM ',' L_NUM ',' L_NUM
               L_STYLE styles caption L_FONT L_NUM ',' L_STRING
               L_BEGIN controls L_END
           ;

nameID     :   L_NUM
               {
                   $$ = $1;
               }
           |   L_ID
               {
                   $$ = FindSymbol($1);
               }
           ;

We've added a new non-terminal symbol, nameID that is defined as either an L_ID or an L_NUM.

If a nameID is a number (first path) we assign it's value ($1) to $$. If it's an L_ID (second path) we look its value up somewhere and assign that value to $$. In either case $$ is the value of nameID. And in either case $1 is the value that the lexical analyser (tokeniser) returned with the associated token. That's to say, if the lexical analyser saw a number it assigned that number as the value and returned L_NUM as the token. If, on the other hand, the lexical analyser saw an L_ID, it assigned the text of the L_ID as the value and returned L_ID as the token type.

How does the parser know it should return a number from the nameID rule? Up in the %token section of the grammar we tell the parser that the nameID rule has a numeric value by means of a %type declaration, in exactly the same way we told the parser that an L_NUM is a number.

%type <numVal> L_NUM  nameID

In other words, the %type keyword can assign data types to both tokens (terminal symbols) and non-terminal symbols. Not all tokens or non-terminal symbols need data types, only those that actually have data associated with them.

Yacc and lex 108

Inevitably your parser will encounter errors. We've looked at the structure of a dialog resource and know that the first element is a numeric ID. Suppose the dialog template you're reading is missing a comma following the firstL_NUM

IDD_WHISPERWINDOW DIALOG 0 0, 261, 146

The grammar has parsed the IDD_WHISPERWINDOW, the DIALOG and the 0 and then gets another L_NUM token. Unfortunately this doesn't match the expected input. The parser expected to see a comma at this point. So the parser checks for any other rule that could match what we've seen so far and fails to find one. Error!

dialog    :    L_ID L_DIALOG L_NUM ',' L_NUM ',' L_NUM ',' L_NUM

We can get around this situation by redefining the start of the grammar. Instead of assuming that we're looking for just a stream of input that matches the dialog rule let's look for either a valid dialog template or allow for graceful handling of an input error. Now our start condition looks like this.

start     :    dialog
          |    error L_END { yyerrok(); }
          ;

error is a predefined token with special meaning. Whenever the parser encounters input that it can't handle it moves into the error state and looks for a rule that contains the error token. The parser also attaches some other significance to the error state. Specifically, it looks for the token following the error symbol and discards all other input until it sees that token. Remember the structure of a dialog template. A bunch of stuff including some control declarations terminated by the END keyword. In other words, the parser can run into invalid input anywhere within the dialog template and it will then keep reading (and discarding) tokens from the lexical analyser until it sees a token matching the one that terminates the error condition. When it sees that token it matches the rule and executes the action associated with the rule. In this case it simply calls the yyerrok() function and continues. All that yyerrok does is reset the parser to 'normal' input mode and let it continue trying to parse input instead of simply discarding tokens. In a real application you'd probably want the error rule to do something more useful like inform the user that an error was found.

Yacc and lex 109

Well that was the easy error handling - errors that arise from unexpected or invalid input. A much more difficult class of errors arise from ambiguous grammars.

Yacc implements what's called an LALR(1) grammar. That means that it looks ahead to see what's coming. The (1) means it uses a single look ahead token.

Thus, if your grammar is simple enough that the parser can always decide what to do based on what it's already seen plus a single token ahead then your grammar can always be parsed unambiguously. Unfortunately lots of useful grammars are ambiguous which leads to a problem for the parser - if, based on what it's already seen plus the one look ahead token, it has two or more equally valid branches to follow the parser has to have some other way to decide. Yacc resolves this by applying the rule that came first in the grammar file.

When yacc encounters such ambiguities in the grammar it outputs an error message warning you of the fact. Usually you need to carefully examine the error message and the offending line and try to work out manually what the parser will actually do given the ambiguity. Most yacc implementations can also produce a verbose output file that shows in great detail what the parser will do at any given point. However, the interpretation of such files is beyond the scope of this tutorial.

References:

http://www.monmouth.com/~wstreett/lex-yacc/lex-yacc.html I've found this to be the most generally useful reference to lex/yacc (actually flex/bison).

http://www.eng.usyd.edu.au/tutorial/course1/cplusplus15.html#l188 for an introduction to flex++ and bison++ which are lexical analysers and parser generators that can create c++ classes.

Tools:

There are many freeware or otherwise implementations of flex and bison available for download on the web. For the most part I've found them to be a nightmare to use. Even when you can find compiled binaries targeted at win32 platforms the generated code is, to my eyes, almost unreadable. Worse, the usual implementation uses a skeleton file into which yacc and lex insert the code fragments and generated tables. Those skeleton files come from the UNIX world and usually contain #ifdef thismachine #else thatmachine code. Frequently it can take longer to get a successful compile of yacc/lex output files than it did to write the grammar in the first place.

Eventually I gave up fighting the free downloads and spent the $US60 to license a copy of Pargen from http://www.bumblebeesoftware.com/ Um nope - I won't get any richer if you buy a license.