An introduction to lex and yacc part 2

Rob Manderson

4.88/5 (23 votes)

Mar 26, 2003

CPOL

13 min read

172266

1946

Using lex and yacc to create parsers for your projects

Download source - 92 Kb

Introduction

In part 1 of this series I discussed the basics of using lex to create a lexical analyser (token recogniser) and yacc to create a parser. In this, the second part, I'll look at using these tools to create a parser capable of reading Visual Studio 6 resource files. Part 2 (this part) will concentrate on the lexical analyser.

However, considering how closely intertwined the two tools are, it will be necessary to switch back and forth between the parser and the lexical analyser.

The project download files contain both the lexical analyser and the parser source code.

Lex 201

As discussed in the first article, the lexical analyser is responsible for reading a stream of input from somewhere and breaking it up into a stream of tokens. Each token represents a lowest level building block representing such things as a string, a number or keyword.

The lexical analyser achieves this by matching the input data to a set of regular expressions (rules). When it finds a match to a specific rule it executes some action code written by the programmer and then returns a token to the parser indicating which rule was matched. It may also return some data associated with the rule. For example, the regular expression

/* Decimal numbers */
-?[0-9]+    {
        yylval.ival = atoi(yytext);
        return NUMBER;
    }

matches a sequence of 1 or more decimal digits optionally preceded by a minus sign. (For a discussion of the format and meaning of regular expressions you can see http://www.monmouth.com/~wstreett/lex-yacc/flex_1.html and search for "Patterns"). When it sees the match it calls atoi passing it the string yytext which just happens to be the string of text which matched the rule. The return value from atoi in turn is assigned to the ival member of the yylval union defined by the parser. The action then returns the NUMBER token to the parser and terminates.

As with most regular expression matchers, lex is 'greedy'. It will try and match the longest possible string of input to a rule. That's what you would expect of course. Given an input of digits, 123 you want it to match 123, returning just one NUMBER token and not a 1 followed by a 2 followed by a 3, returning three NUMBER tokens, one per digit. But sometimes this present a problem. Suppose you try and code a rule to recognised strings like this

/* Strings */
\".*"\"    {
        yylval.pcs = new CString(yytext);
        return STRING;
    }

You're saying that when the lexer sees a double quote it should then consume all input (the .*) until it sees a closing double quote. And it will do just that. Unfortunately, it will also happily consume double quote characters until it reaches the very last one in the input. The 'greedy' algorithm ensures this.

At this point you might be tempted to try and write a more complex regular expression that takes this into account. Perhaps something like

/* Strings */
\"[^"]*\"    {
        yylval.pcs = new CString(yytext);
        return STRING;
    }

This rule will stop at the first double quote following the opening double quote. But what it won't do is let you embed an escaped double quote. At some point you have to decide how complex you want your regular expressions to be or whether it might be simpler (and more maintainable) to write some c/c++ code to help the lexer out. Don't forget that you might be looking at the regular expressions six months from now. I chose the latter approach.

So my string matcher looks like this

/* Strings */
\"    {
    /* Seen start of a string */
    int     ch;
    BOOL bExit = FALSE;

    yylval.pcs = new CString;
    
    while (!bExit && (ch = yyinput()))
    {
        switch (ch)
        {
        case '\\':
            switch (ch = yyinput())
            {
            case 'r':
                ch = '\r';
                break;

            case 'n':
                ch = '\n';
                break;

            case 't':
                ch = '\t';
                break;
            }
            
            break;
            
        case '"':
            bExit = TRUE;
            continue;
        }

        *yylval.pcs += ch;
    }

    return STRING;
}

This combination of regular expressions and procedural code neatly solves both the 'greed' problem and allows for the translation of escape sequences into the characters they represent. The yyinput() function simply gets the next character from the input stream feeding lex.

Lex 202

Visual Studio 6 Resource Scripts use an awful lot of keywords and things that look like keywords. There are keywords such as DIALOG and MENU and then there are window styles such as WS_TABSTOP or WS_BORDER that also look like keywords.

My first draft of the lext script was written to parse a very simple DIALOG template with only one or two window styles. It made sense (and was quick and dirty) to simply code a regular expression per window style and return a token to the parser for each window style.

After I had the basics of the parser working I started copying and pasting other DIALOG templates from other projects into my test file. As each new template was added I'd note the errors the parser encountered and add yet another rule to lex to cover the new window style. It began to look like this

WS_TABSTOP          { return L_WS_TABSTOP; }
WS_BORDER           { return L_WS_BORDER; }
.
.
.
TCS_OWNERDRAWFIXED  { return L_TCS_OWNERDRAWFIXED; }
TCS_RAGGEDRIGHT     { return L_TCS_RAGGEDRIGHT; }
TCS_RIGHT           { return L_TCS_RIGHT; }
TCS_RIGHTJUSTIFY    { return L_TCS_RIGHTJUSTIFY
.
.
.
//    more styles...

Have you actually counted how many window styles are defined in the various Platform SDK Headers? I make it about 225 window styles and a further 40 or so extended styles. That makes for about 265 regular expressions just to cope with window and extended window styles.

In addition to having the styles defined here I also had to define a token for each style over in the parser source file and add another parser rule to deal with the token. It all added up to significant cutting and pasting and was rapidly starting to look like a maintainers nightmare (to say nothing of RSI from all that mousing).

At about this point I realised that a windows style looks exactly like an identifier. I already had a rule in place in the lexer to recognise identifiers (things like IDOK and IDCANCEL)

//    Identifiers
[[:alpha:]_][[:alnum:]_]* {
        yylval.pcs = new CString(yytext);
        return IDENTIFIER;
    }

so it wasn't that much of a step to define two static tables of window styles, one for regular and one for extended styles and then to dynamically build up maps relating the name of the window style to the value of the style. Once that was done I changed the identifier rule to this

//    Identifiers
[[:alpha:]_][[:alnum:]_]* {
        DWORD data;

        //    Check if it's actually a predefined style or extended
        //    style.                            
        if (m_styleMap.Lookup(yytext, data))
        {
            yylval.dval = data;
            return STYLE;
        }
        else if (m_extStyleMap.Lookup(yytext, data))
        {
            yylval.dval = data;
            return EXTENDEDSTYLE;
        }

        //    Nope, none of the above, so return the string and
        //    let the parser sort it out...
        yylval.pcs = new CString(yytext);
        return IDENTIFIER;
    }

This bought me a few things. Firstly, the parser itself was simplified. Instead of 265 tokens it had 2 tokens, one for STYLE and one for EXTENDEDSTYLE where the exact style was passed as the data associated with the token. I could add new window styles as and when they were defined (or I became aware of their existence) simply by adding them to the static table, without needing any changes whatsoever to the parser source file.

Incidentally I use dynamically created maps (created when the lexer is created) for performance reasons. A linear search of a 265 entry table may not take very long in these days of super Pentiums but the time does add up when you're parsing a typical resource script and, in any case, there's no excuse for inefficient code.

This change also cut the time lex required to process the source file and create the output c++ file from 20 or so seconds to less than 1 second. My time is also important.

Thirdly, this change cut the size of the lex generated file from well over 12000 lines to around 2000 lines. This makes for a significant change in both compile time and in the size of the compiled executable.

Finally, the lexical analyser itself, at runtime, is significantly faster.

Lex 203

The entire purpose of this parser is to parse dialog templates in Visual Studio 6 resource scripts. I wanted my parser to be able to handle the .rc files created and maintained by AppStudio without any other processing. This meant that it had to be able to handle any data constructs edited by AppStudio. On the other hand, my project isn't particularly interested in MENU or TOOLBAR or STRINGTABLE or any of the dozens of other things to be found in a .rc file.

In order to avoid having to write parser rules that could handle everything in a resource script I used a feature of lex which allows us to create sub-lexers within a lex script.

When a lexer begins running it's in what is called the INITIAL state. However, at any time, it can be switched into another state which causes a different set of rules to be followed. For example.

%exclusive comment

%%    //    End of first section of the lext script

//    Lex rules start...
"/*"    BEGIN comment;

<comment>"*/"     BEGIN INITIAL;
<comment>.        |
<comment>\n       { ; }

This simple set of rules allows the lexer to switch into the comment state when it sees a /* sequence. Once it's in the comment state it matches all further input against rules prefixed with <comment>. Start states can be exclusive or not exclusive. If they're exclusive you use the %exclusive keyword, otherwise use the %start keyword in the first part of the lex script. When you're using exclusive start states the input will only be matched against the rules defined for that state. Non exclusive states match input against rules without a <state> prefix and also those rules with the corresponding start state.

In the case of the preceding code fragment if the lexer is in the INITIAL state and it sees a /* sequence it matches that rule and executes the action. This action changes the lexer state to comment and continues to read input. However, now the lexer only considers rules commencing with <comment> (because we declared the comment state as exclusive). If the input is anything other than */ the input is discarded. When, however, the lexer input contains */ it matches that rule and the lexer switches back to the INITIAL state. Obvious! But note that none of the actions within the comment state return anything to the parser. And since they return no tokens to the parser the parser sees nothing between the start of the comment and the end. Comments effectively vanish from the input stream as far as the parser is concerned.

If we can make comments vanish from the input what else can vanish? How about code in #ifdef blocks where the symbol wasn't defined? This is a little harder to do because the parser is the rightful place for code deciding whether or not an #ifdef condition was satisfied. But not that much harder to do. Consider this fragment from the parser.

ifdef    :    L_IFDEF IDENTIFIER
         {
            if (!m_symbols.IsDefined(*$2))
            {
                LEXER->m_ifNesting++;
                LEXER->yybegin(ifnest);
            }
                
            delete $2;
         }
     ;

Here the parser sees the token for #ifdef and an IDENTIFIER. It checks if the IDENTIFIER is in it's symbol table and if not it moves the lexer into the ifnest state. It also increments the lexers m_ifNesting counter. Now look at the rules for the ifnest state in the lexer.

//    Rules when inside #ifdef nesting.  Basically ignore everything until
//    we've closed the outer #ifdef with an #endif.  This does require us to
//    count #ifdefs inside the block even when the condition wasn't met.
//    Note that we're only in this state if the #if/#ifdef/#ifndef condition
//    wasn't
//    met
<ifnest>#ifdef    |
<ifnest>#if       |
<ifnest>#ifndef   { m_ifNesting++; }
<ifnest>#endif    {
                        m_ifNesting--;
            
                        if (m_ifNesting == 0)
                            BEGIN INITIAL;
                  }
<ifnest>.         |
<ifnest>\n        { ; }

Firstly, the ifnest state is declared as exclusive. Secondly, the ifnest state needs to count how many #if(def) statements it ignores. This is so it can handle such things as

#if !defined(AFX_RESOURCE_DLL) || defined(AFX_TARG_ENU)
#ifdef _WIN32
LANGUAGE 9, 1
#pragma code_page(1252)
#endif //_WIN32
#include "res\Prober.rc2"  // non-Microsoft Visual C++ edited resources
#include "afxres.rc"       // Standard components
#endif

where if the very first #if fails it will stay in the ifnest state until it reaches the last #endif statement (the one after the #include "afxres.rc" statement).

The net effect of these constructs is that once the parser determines that it's not interested in an input stream it hands responsibility for ignoring all further input until a specified state is reached over to the lexer. Once the responsibility has been handed over the parser sees no more tokens until the end state is reached.

I used the same techniques to allow the parser to ignore, for example, MENU statements in the resource script. The parser relies on the knowledge that a MENU statement consists of an IDENTIFIER followed by the MENU keyword then a bunch of stuff follwed by the BEGIN keyword then more stuff and all terminated with an END keyword. Within the MENU statement there may be nested BEGIN/END blocks. By counting BEGIN and END keyword pairs we can determine when the MENU statement has finished.

Many other constructs within the resource file follow a similar pattern and are ignored in the same way.

Lex 204

Resource script files use #include statements to include the contents of another file at the point where the #include statement occurs. Implementing this functionality takes a little sleight of hand (and some knowledge of the exact method the lexer uses to get it's input). It's pretty easy to establish that the lexer calls a function somewhere to get it's input. Normally that input comes from a file, so you need to store that file handle somewhere, substitute a new one and return to the lexer. Then, when the lexer hits end of file on the newly substituted file handle you close that file and restore the original file handle. That's the theory. Exactly how you do this depends on the lexer you're using. I used Pargen (see Notes below) as my lexer so I did it this way. First the parser needs to recognise that there's a #include statement to process.

include    :    L_INCLUDE STRING
                {
                    DoInclude(*$2);
                    delete $2;
                }
           |    L_INCLUDE BRACESTRING
                {
                    DoInclude(*$2);
                    delete $2;
                }
           ;

BRACESTRING is a token that recognises filenames of the form <filename.ext>. DoInclude() does

void YYPARSERNAME::DoInclude(LPCTSTR szFileName)
{
    ASSERTSTRING(szFileName);

#ifdef NDEBUG
    try
#endif
    {
        FILE *f = fopen(szFileName, "rb");
        
        if (f != (FILE *) NULL)
        {
            if (yydebug)
                TRACE("#including '%s'\n", szFileName);

            sInclude *include = new sInclude;
            include->file = LEXER->yyin;
            include->line = LEXER->yylineno;
            LEXER->m_csCurrentFile = include->m_csFileName = szFileName;
            m_includeStack.AddHead(include);
            LEXER->yyin = f;
            LEXER->yylineno = 0;
            LEXER->yyreset();
        }
    }
#ifdef NDEBUG
    catch(...)
    {
        CString csMessage;
        
        csMessage.Format(_T("Error processing %s\nin #include"), szFileName);
        AfxMessageBox(csMessage, MB_OK);
    }
#endif
}

If this code succeeds in opening the file it allocates a new sInclude structure and stores away the existing input file handle which is stored in the yyin member of the lexer. It also stores away the current line number and the filename. Then it substitues the new file handle for the existing one in the lexer, resets the linenumber to zero (you want error reporting to tell you the correct linenumber for the current file) and resets the lexer. We also add the sInclude structure to a stack. We then returns to normal parsing.

Notice I said if this code succeeds in opening the file. There's no path logic included here. For the purposes of my project I'm not interested in symbols or dialog templates defined in the Microsoft supplied directories. Thus, if an attempt to #include a file fails I don't consider that to be an error - I simply ignore the situation and continue parsing at the following line.

In the lexer we continue reading input as normal. Eventually the lexer reaches end of file and the yywrap() function is called.

int YYLEXERNAME::yywrap()
{
    //  We get here when EOF on the current input file is reached.  This
    //  code checks the parser include stack and pops the previous file
    //  (the one that included the file that's just hit EOF) off.  It then
    //  resets the line number back to the new current file and continues
    //  reading input from the previous file.
    sInclude *include;

    if (!PARSER->m_includeStack.IsEmpty())
    {
        fclose(yyin);
        include = (sInclude *) PARSER->m_includeStack.RemoveHead();

        if (PARSER->yydebug)
            TRACE("Popping parser include stack '%s'\n",
                       include->m_csFileName);

        yyin = include->file;
        yylineno = include->line;
        m_csCurrentFile = include->m_csFileName;
        delete include;
        
        //    Keep on lexing...
        return 0;
    }

    //    No more input from anywhere,  tell the lexer so...
    return 1;
}

This works well for our simple parser. It may not work properly for more complex parsers. The problem is that the lexer may need to read forward a long way to determine which rule to match, and it may also need to backtrack a long way if it fails to match a particular rule. If your grammar is such that the lexer needs to do this across a file boundary this technique won't work. Fortunately a resource file is simple enough that it's unlikely such a situation will arise.

Notes on the sample files

This project was developed using Parser Generator from http://www.bumblebeesoftware.com which has versions of lex and yacc (actually called alex and ayacc) that can generate c++ classes. I don't expect that classical lex and yacc or flex and bison or flex++ or bison++ will be able to compile the source files. However, Bumble-bee Software provide a 30 day evaluation of Parser Generator (I used most of the 30 days developing and tweaking this project). I liked it enough to consider the US$60 for a license a bargain.

I also rebuilt the libraries Bumble-Bee provide to use classic c style FILE pointers instead of c++ iostreams. You don't care that I did this except that the code to implement #include uses FILE pointers. If you attempt to rebuild my project files using Parser Generator you need either to change my code to use c++ iostreams or rebuild their libraries to use FILE pointers.

The files include the first draft of the project for which I wrote the parser. It currently has only the most basic functionality. To play with the code open a new project and select a .rc file. It should parse the script and present a list of dialogs. Double clicking on a dialog name should launch the dialog.

As has been pointed out in the first comment below - the project won't compile. Well it will actually, but I didn't think to point out that in addition to using files passed through the Parser Generator alex and ayacc it also links with boilerplate libraries and header files provided by Parser Generator. I'm not licensed to distribute those files so they are not included. You can get your own copy by downloading and installing the evaluation version of Parser Generator from the link above.

Acknowledgements

This project contains code written by Herbert Menke (h.menke@gmx.de) but modified by me. The code in question allows resizable controls on dialogs. (Aha, so that's what this project is about).