Compiler Patterns

Traces the evolution of a high-speed EXPRESSION EVALUATOR to demonstrate the various PATTERNS you will need to "roll your own" recursive descent compiler.

Summary

Depending on the complexity of the grammar, a hand-built recursive descent compiler may actually be the best option. It can be quicker to implement than learning a compiler-compiler system, plus it avoids having to integrate the compiler-compiler's runtime into your source code base. But it also involves risks that must be carefully weighed.

Introduction

Many years ago, smart people recognized that building compilers was tough. BUT, they also recognized that there were many repeated patterns within the code for compilers. And where there is the combination of smart people AND patterns, software magically evolves. Thus were born "compiler-compilers" to assist in converting grammar specifications into compilers without the tedium of writing the same code over and over for those patterns. Famous among these compiler-compilers is YACC -- Yet Another Compiler Compiler.

This article demonstrates a few of the patterns of compilers, and how to go about implementing a simple compiler WITHOUT a compiler-compiler. To help make the ideas concrete, we work through the task of building an expression evaluator.

This is NOT a tutorial on compiler theory or on language design. There are several textbooks on that subject. Plus there's always that repository of all useful knowledge, Wikipedia, if you want that. Rather, this article is a pragmatic look at the patterns that go into building a simple recursive descent compiler. If you want, you can download the full source code for the expression evaluator.

But first, some important preliminaries: Before reading this article thinking that you're going to implement a compiler, follow this flowchart:

Do you need to implement a fairly simple parser/compiler?
- no
  { save a link to this article, and come back when you need it.
    If you're anything like me, you won't remember enough of the
    details to make it worth reading anyway.  Heck, I wrote it,
    and I don't even remember what's in here!
  }
- yes
  { Can you afford to hire someone who really understands
    compilers?
    - yes
      { just do it!  Save yourself a lot of pain.
      }
    - no
      { Do you have enough time to learn to use a good
        compiler-compiler?
        - yes
          { this is probably the smart move.  Good compiler-
            compilers reduce the risk of missing things, or of
            accidentally coming up with a compiler that accepts
            constructs you hadn't planned on, which are
            probably meaningless anyway.
          }
        - no
          { I feel sorry for you.  Be sure you implement an
            exhaustive test suite.
          }
      }
  }

Several years back, I needed to implement a very small compiler for reading and processing industry-standard stacking sequence codes for lamina layup in composite structural materials (like fiberglass or carbon-fiber composites) into strength and stiffness results for the final product. Because the grammar was quite small, I opted to not dig out my compiler-compiler (a commercial product called "LALR" written by Paul B. Mann, one of the geniuses of our day) and decided instead to implement a simple top-down recursive descent parser. It took me a few hours of focused thought to get it working correctly.

One of the advantages of NOT using a compiler-compiler is that you don't have to drag in all the (possibly licensed) parsing engine and lexing engine code. Another advantage, as I alluded to earlier, is that it MIGHT actually be quicker than learning (or re-learning) the compiler-compiler's system. All compiler-compilers that I'm aware of have their own syntax for inputs, their own requirements for where to put output files, how to build the resulting generated code, etc. Once you learn all that, it's pretty quick to make changes and re-generate. But there IS DEFINITELY a learning curve.

Even if you do use a compiler-compiler, you still have to implement the action code associated with each individual grammar production. That means you STILL have to REALLY UNDERSTAND what you're trying to accomplish. The compiler-compiler really only helps you implement the parser. The actions you take as the parser dissects the input are still up to you. You still have to build the backend data structures, the logic, the compiler's output, etc.

The risk of rolling your own compiler (instead of using a compiler-compiler) is that you are building the code for each production from scratch, and you can miss little things that will cause your parser to misbehave, or to behave differently on similar productions on which it SHOULD behave pretty much the same.

Well, having said all that, let's throw caution to the wind and get started.

Getting Started: The Project Requirements

In my expression evaluator project, I needed to read in a mathematical expression from an appropriate .ini file, and then to evaluate the result of that expression at run-time, with parameter values that could only be known at run-time. The expression was to be evaluated tens of thousands of times so it needed to be compiled only once into an intermediate-form that would be very fast to evaluate. I also wanted it to be able to deal with differences in the formula for different points in the domain; just like a real mathematical function would:

f(x) = {0 when x<0, x when 0 <= x <= 2, 4-x when x>2}
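
(Looking ahead: in the expression language this article ends up building, that piecewise function can be written with the C-style ?: conditional operator -- developed in Task 2 below -- as "p1 < 0 ? 0 : (p1 <= 2 ? p1 : 4-p1)", where the replaceable parameter p1 stands in for x.)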

Thus, my goal was to implement an expression evaluator that:

  1. Accepts replaceable parameters
  2. Accepts logical constructs
  3. Executes branching based on logical results
  4. Produces a pre-compiled version for speedy execution

The simple approach for my application would have been to drag in a scripting language that could handle all of these with ease. If you have the environment available to you, using Perl or VBScript would probably take care of all the details for you (actually, I don't know whether they pre-compile an expression, or whether they recompile the expression each time it's evaluated). Unfortunately, in my environment, I had no pre-made scripting language available, so I needed to implement the expression evaluation code myself.

The three primary tasks here are:

  1. Come up with the intermediate-form. This is quite independent of writing the compiler. The compiler will simply put the human-written expression INTO that intermediate-form. Part of this task is also to implement the evaluator function that turns the compiled expression plus replaceable parameters into a final single number.
  2. Come up with an appropriate grammar. This part is critical, and determines the precedence and associativity of the operators. If this part is wrong, nothing else is going to work as expected. I'll share with you some of the joys I had in getting certain aspects of the grammar incorrect, and having to work through them.
  3. Implement a compiler that both recognizes and parses the grammar correctly, AND takes appropriate construction actions (constructing that intermediate-form from part 1) along the way.

--Task 1: Design the Intermediate Form --

The intermediate-form I settled on is very similar to RPN (Reverse Polish notation: see http://www.hpmuseum.org/rpn.htm), a stack based representation of an expression formally called "postfix" form. Humans write expressions in "infix" form where the operators are in-between the operands. As such, we frequently have to add parentheses to indicate precedence. Postfix expressions don't need parentheses because they just perform all operations in left to right order. Here are a couple of examples.

infix form  :  7 + 2
postfix form:  7 2 +

infix form  :  5*sqrt( 4 + 3*4 )
postfix form:  3 4 * 4 + sqrt 5 *

To evaluate a postfix expression, a pointer into the postfix expression represents the "top of the stack". As the expression is traversed from left to right, when an operator is encountered, the appropriate number of operands to its left is processed and the result replaces the operator.

Stepping through that last example expression, stopping when operators are recognized and executed, the expression evolves as follows:

3 4 *  4  + sqrt 5 *
^
3 4 12 4  + sqrt 5 *
    ^
3 4 12 4 16 sqrt 5 *
          ^
3 4 12 4 16  4   5 *
             ^
3 4 12 4 16  4   5 20
                   ^

When finished, the final answer is on the top of the stack (last position of the postfix expression).

The problem with using this exact process is that I needed to be able to repeat the process many times. You can see that the postfix expression is consumed along the way, making it unavailable for re-use. Also, so far there is no provision for replaceable parameters. And lastly, the expression evaluator must recognize that '+', '*', and "sqrt" are operators, not operands. That likely involves decision making at run time, which, for speed's sake, I wanted to avoid or to make as simple as possible. All of these obstacles could probably be overcome using a strict postfix idiom, but I chose a slightly different approach.

After stewing about it for a while, I settled on a hybrid approach that involved four arrays: one array of constants that is populated at compile time, one array of instructions that is also populated at compile time, one array of parameters that is passed to the evaluator routine for use at evaluation time, and a scratch-pad array -- the stack, used at evaluation time. The structure looks something like this:

C++
typedef struct ExpressionStruct {

   double *constants;
   int     constantsLen;
   int    *instructions;
   int     instructionsLen;

} Expression_T;

For the following input, "6 * sqrt( 5 + 3*4 )", the generated arrays are:

constants:      3       4            5                  6
instructions: PUSH_C  PUSH_C  MUL  PUSH_C  ADD  SQRT  PUSH_C  MUL

Where PUSH_C means "get the next constant from the constants array and push it onto the local stack". There is also a PUSH_P "Push Parameter" instruction. See the algorithm below.
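
As an illustration of the parameter instruction (using the negative-number encoding you'll see in the evaluator below), an input like "p1 + 2" would compile to something like:

constants:      2
instructions: PUSH_P(1)  PUSH_C  ADD

where PUSH_P(1) is stored in the instructions array as the value -1.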

The evaluation routine runs through the entire instructions array, and depending on the instruction either copies the next constant to the scratch-pad stack, copies one of the parameters to the same stack, or executes the indicated instruction (like SQRT or ADD or MUL) using values near the top of the stack. It looks like this:

C++
double evaluate( Expression_T* ep, double* parameters )
{
   int constantIndex = 0; //next constant to push to stack.
   int ndx;               //our current position on the
                          //instruction stack.
   double stack[50];      //50 Oreo "Double Stack" cookies.
                          //I'm drooling already...
   int stktop = -1;       //last item ON the evaluation stack.
                          //Nothing on the stack yet.

 //for each instruction in the array,
   for( ndx=0; ndx < ep->instructionsLen; ndx++ )
   {
    //get the instruction.
      int instruction = ep->instructions[ndx];

    //Zero is the instruction number for "PUSH_C", which
    //means "push the next constant to the stack."
      if( 0 == instruction )
         stack[++stktop] = ep->constants[constantIndex++] ;

    //Negative instruction numbers means "push the
    //indicated parameter to the stack."  Convert the
    //negative number to a positive, zero-based index.
      else if( instruction < 0 )
         stack[++stktop] = parameters[-instruction-1];

    //Positive instruction numbers are the regular
    //mathematical operators.
      else switch( instruction )
      {
         case ADD:  //binary operators decrement the stack by one.
                      stktop-- ;
                    //do the addition
                      stack[stktop] += stack[stktop+1];
                      break;

         case MUL:  //another binary operator.
                      stktop-- ;
                    //do the multiplication
                      stack[stktop] *= stack[stktop+1];
                      break;

         case SQRT: //unary operators leave the stack
                    //height unchanged.
                    //do the square root operation.
                      stack[stktop] = sqrt( stack[stktop] );
                      break;

         default:   //print an error message and abort.
                      error("Unrecognized operator");
      }
   }
   return stack[0]; //final answer will be here now.
}

The actual code is quite a bit longer owing to a much longer switch statement (more operators). Plus it includes lots of debugging and tracing code. It can be downloaded from links elsewhere in this post.

Note that in the example code in this article, there are some hard limits embedded within the code; like the "double stack[50]" above. This is especially bad form unless you plan to test the stack height at every PUSH, and throw a runtime error if something goes wrong. If you use the code where you expose the functions to the outside world, like if you're getting the infix expression from a web-form or some such thing, using hard limits exposes you to nasty runtime things like buffer overruns. For example, the expression...

"1*(1*(1*(1*(1*(...)))))" 

...will eventually overflow the 50 element stack if continued long enough. The actual code uses dynamic memory for the stack as well as for the instructions and constants arrays. While I won't discuss it in the article, the actual stack usage can be computed at compile time, and the correct amount of stack pre-allocated so that run-time stack-overflow tests won't even be necessary. Again, you can see how this is done in the actual code within the addInstruction() function.
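
For the curious, here is a minimal sketch of that compile-time bookkeeping. This is my own illustration, not the download's code (which folds this into addInstruction() itself): every opcode has a known net effect on the stack, so tracking a running depth and its high-water mark while instructions are emitted tells you exactly how much stack evaluation will need.

C++
//Hedged sketch of compile-time stack sizing.  Net stack effects:
//+1 for PUSH_C and parameter pushes, -1 for binary operators like
//ADD or MUL, 0 for unary operators like SQRT.
static int depth    = 0;  //simulated stack height during compilation.
static int maxDepth = 0;  //high-water mark == stack size needed at eval time.

static void noteStackEffect( int netEffect ) //call once per instruction added.
{
   depth += netEffect;
   if( depth > maxDepth )
      maxDepth = depth;
}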

That finishes the task of designing the intermediate form that the compiler will generate.

--Task 2: Design the Expression Grammar --

We're going to be recognizing human-readable infix expressions. Later in task 3, we'll implement the production actions associated with each grammar rule, converting them to the quasi-postfix format discussed above. For now, we just need to write the BNF that describes the grammar.

The first thing to note is that we want the compiling process to end when it hits the end-of-string marker (the null terminator). Before then, it better have recognized an "expression". Thus, we can say:

goal:
  -> expression '\0'

where "expression" is one of the non-terminals we're going to implement.

I'm going to attempt to express the expression grammar from the "bottom-up". That is, thinking about the lowest, most primitive forms, and then building upon them.

It seemed to me that the most primitive operands within the grammar would be literal numbers, and parameters. That is, everything that ends up on either side of an operator must ultimately come down to something that can be reduced to just a number. To indicate the most basic of operands, I use the non-terminal name t0. As we progress to more complicated forms, we'll march through t1, t2, t3, etc.

t0:
  -> number
  -> parameter

It also occurs to me that among the most basic forms, we should include anything within parentheses. Thus, we should add:

-> '(' expression ')'

where we've yet to define what an expression is, but it will end up being just what you think it is: a series of operands and operators that reduce to express a single number.

We should also count function calls as most basic. Thus, we should add:

-> funcname '(' list ')'

where a list will be a comma separated list of expressions.

And one last thing: if we want to allow named constants, like "pi", we should do it here. I only implemented "pi", but more could be added if desired.

Thus, the complete description of the t0 non-terminal is:

t0:
  -> number
  -> parameter
  -> '(' expression ')'
  -> funcname '(' list ')'
  -> named_constant         //I'm actually only allowing pi for simplicity.  

I'll go ahead and note here that on my first pass, I thought the branching construct should go here. Thus, I had another production rule like this:

-> expression  '?'  expression ':' expression

where you'll recognize the use of the C language "conditional". More about this later. Here I just want to point out that this was a mistake, and caused the parser to act in very odd ways, looking for and processing tokens in a way that I had not intended at all. After looking at the C language specification to see how THEY implemented it, I realized it needed to be at the LOWEST precedence, not the highest. Thus, I moved it out of the definition for t0.

Having decided what our "primitives" are, we begin progressively adding complexity to our terms by following the old PEMDAS pattern: Parentheses, Exponentiation, Multiplication, Division, Addition, Subtraction (or as my kids learned it, "Parachute Expert My Dear Aunt Sally"). We've already taken care of parentheses, so we can begin progressively creating non-terminals for each precedence level of operator, paying close attention to the operator's associativity.

Actually, the PEMDAS pattern is not completely correct as we should apply unary negation ahead of exponentiation. I want to be able to say -2^3 and have it mean (-2)^3, not -(2^3). In this example, you get the same answer, but what about in a formula that uses a parameter instead of a literal: for example, -p1 ^ 3.2 where you know ahead of time that the parameter value p1 is negative, and its sign must be changed BEFORE attempting to exponentiate it with a fractional exponent. Obviously the binding should be (-p1)^3.2.

t1:           //unary prefix operators.
  -> t0
  -> '-' t0
  -> '+' t0

The purist in me says we ought to be able to negate (or posate?) ANY single term, and thus we should rewrite it with a recursive definition as follows:

t1:
  -> t0
  -> '-' t1
  -> '+' t1  

making this a right recursive rule (because the non-terminal symbol "t1" shows up on the right end of (at least) one production rule). But the pragmatist in me says, "how many times should we really allow...

x + - - - - + + + - - 5 

...anyway?" Thus, I'm overruling the purist, and sticking to my decision to ONLY allow a single negation (or posation, er uh, positivation, or maybe positively_plusillyation) operator, and leaving it as originally written.

NOW that unary operators are handled, let's do exponentiation:

t2:
  -> t1
  -> t1 '^' t2  

Note the right recursion. This guarantees that the ^ operator will bind from right to left. That is, in a sequence of operations like 4 ^ 3 ^ 2, the compiler will recognize that "4" must remain as a t1, and "3 ^ 2" must be reduced to a t2. In reducing "3 ^ 2" to a t2, it must leave "3" as a t1, but reduce "2" to a t2. Thus, the implied binding will be from right to left, effecting 4^(3^(2)).

Next is multiplication and division. They are at the same level of precedence, and thus all go within the same non-terminal, and are evaluated in first-come first-served (left to right) order. We're going to add another operator to the classic PEMDAS, and implement modulo as well.

t3:
  -> t2
  -> t3 '*' t2
  -> t3 '/' t2
  -> t3 '%' t2  

Note the left recursion. Again, this is what forces left associativity of the operators. We'll find that most of our operators are actually left associative.

And finally addition and subtraction: Again, we want them to associate left to right, so we need left recursion.

t4:
  -> t3
  -> t4 '+' t3
  -> t4 '-' t3

That finishes the simple calculator part of formula evaluation. However, as stated earlier, I want the ability to do tests and branching to different formulas based on those tests. Thus, the first thing we need to implement is logical comparisons: less than, greater than, equals, etc. It turns out that the precedence of logical equals needs to be lower than that of the comparison operators, so we need to do the less-thans and greater-thans first.

t5:
  -> t4
  -> t5 "<"  t4
  -> t5 "<=" t4
  -> t5 ">"  t4
  -> t5 ">=" t4

t6:
  -> t5
  -> "not" t5

By placing "not" at one precedence level lower than the comparison operators,

not x < 7

evaluates as not(x < 7), instead of (not x) < 7. I believe this makes more intuitive sense than having the "not" bind with x first.

Note that by putting it at this level of precedence, "not x" cannot be used in contexts that require a t5, a t4, a t3, a t2, a t1, or a t0. That is, you can't say "6 * not p2" to effect a 0 or 6 toggle based on p2. You CAN however effect exactly this by simply enclosing the "not p2" in parens. "6 * (not p2)" DOES effect a 0 or 6 toggle. Because of the parenthesis, "(not p2)" is a primitive (i.e. a t0) and can be used in any context, whereas "not p2" is a t6 and is limited in where it can be used.

Note also that we're patterning this after the unary negation (t1). Thus we only test t5s on the right hand side. As with unary negation, it is NOT recursively defined.

t7:
  -> t6
  -> t7 "eq" t6
  -> t7 "not_eq" t6  

One might wonder why we don't give the equality tests higher precedence than "not", so that...

not y eq x

...evaluates as not(y eq x) instead of how it's actually going to evaluate: (not y) eq x.

The answer is that there is a separate operator for not-equal, and to compute not equal, we should just use that operator. Again, if "not y eq x" bound as not(y eq x), how would we effect (not y) eq x? We would be forced to use parens.

I'll note here that I was well into the testing aspect of the project when it finally dawned on me (via a test) that the "eq" token created a slight conflict. Consider this construct:

6.5eq7.0

Obviously this is false, but it illustrates the problem. Because I'm using strtod() to pick up numbers, the strtod() function scans up through 6.5e thinking it's getting a scientific notation number, like 6.5e3 for 6500. Then it sees the 'q' and aborts. Yes, I could have fixed this by writing my own number scanning function. But this seemed like overkill. Instead, I just added the C operators "==" and "!=" to the t7 rule so someone can write "6.5==7.0" and get the result they want (of course they could also have just added spaces: "6.5 eq 7.0" works just fine). Besides, when the "6.5eq" fails it throws an error and does not quietly march forward and produce erroneous results. I deemed that "good enough"!

t7:
  -> t6
  -> t7 "eq" t6
  -> t7 "==" t6
  -> t7 "not_eq" t6
  -> t7 "!=" t6

For logical completeness, here are the ands and ors.

t8:
  -> t7
  -> t8 "and" t7

t9:
  -> t8
  -> t9 "or" t8 

Both are left recursive meaning they associate left to right. Since "and" is higher precedence than "or",

x and y or z and w

evaluates as (x and y) or (z and w)

We're only lacking two things now: "expression"s, and "list"s for use with function parameters. First, the left associative list:

list:
  -> expression
  -> list ',' expression

Note that, unless we allow null expressions, this way of defining a list is not going to permit empty lists. In our grammar, that is not a problem since we’re not going to be calling functions with empty parameter lists anyway. But in a more general grammar where we wanted to allow empty parameter lists, we would need to consider implementing empty expressions (and consider how to deal with the complexity that will produce) or add another production to permit empty lists:

list:
   -> expression
   -> list ',' expression
   -> ''  

But now we’ve introduced some interesting issues. First, input like the following is now legal:

C++
somefunc( , , 2) 

Unless you know what you want to do with such inputs -- that is, what kind of structures you want the compiler to build in response to them -- they are meaningless anyway and you don't want to allow them.

Next, you’ve just introduced a conflict into the grammar. Consider how such a grammar should parse a simple function call: sin( 2 ). Realize that before the “2”, there is an empty string (of zero length). It’s invisible to us, but not to the parser. Should the parser first accept the zero-length string, reducing it to a “list”? Or should it first attempt to recognize and accept an expression? It is clear to US that the right answer is to look for an expression first, then if one is not found, to accept nothingness as a valid list. But to implement this, you’ll have to be careful of the order in which you do things.

Turns out, if all you want is to implement empty parameter lists (and not to accept constructs like somefunc( , , 2)), this is probably the wrong place to do it. You should probably modify the t0 non-terminal by adding the following rule:

-> funcname '(' ')' 

We'll talk more about this when we implement the list non-terminal.

This finishes everything except the "choose which expression" operator. Since only the expression non-terminal is left to define, this is the place to incorporate the branching operator. I opted to use the C language ?: conditional operator for branching.

expression:
  -> t9
  -> expression '?' expression ':' expression

That finishes the grammar. Now on to task number 3: actually implementing it as a compiler.

--Task 3: Building the Compiler --

Parsers (input language recognizers) are broadly placed into one of two categories: Top-down, or bottom-up. In a bottom-up parser, a list of "next legal tokens" is generated for each "state" the parser finds itself in. If the next token read from the input doesn't match one of the items in the list, there's an error in the input. If it does match, the parser executes whatever action (that you design as part of making the "parser" into a "compiler") that is associated with that state and that token, and then moves to the next parser state associated with that state and that token. Bottom-up parsers may execute faster than top-down parsers, but they are difficult to build by hand.

Top-down parsers on the other hand can be built in a "recursive-descent" fashion almost directly from the grammar specification. Consider the t2 non-terminal,

t2:
  -> t1
  -> t1 '^' t2

we can implement this very simply with the following function. In this function, the parameter "ifp" is the infix expression pointer address. We need the address of the pointer so we can modify where the expression pointer is pointing to as we "consume" tokens from the input. As you'll see, all the non-terminal functions return a value that can be used to express success or failure: success when one of the production rules is recognized, and failure when none of the rules are recognized.

C++
int t2( Expression_T* ep, const char** ifp )
{
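  //NOTE: "not" (and the "and"/"or" seen later) are the standard
  //alternative spellings of !, &&, and ||.  They are built into C++;
  //in C they come from <iso646.h>.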
   if( not t1(ep, ifp) ) //if not getting a t1,
   {
    //since both productions begin with t1, if we don't get a t1 first,
    //we can't go forward.
      return 0;  //return failure indicator.
   }
  //OK, we got a t1.  We at least got the first production.  Now we need to
  //see if we're actually getting the second production instead of the first
  //by checking if there's a " ^ t2 " following the t1 we just accepted.

  //Note that the t1 function moved what ifp is pointing at forward to just
  //after the t1 that was just accepted.

  //Before each terminal (actual input token) we can have arbitrary white
  //space. Therefore, we always ltrim before looking for an actual input token.
    ltrim( ifp );       //skip over white space.

    if( **ifp == '^' )  //we got the '^' symbol,
    {
       (*ifp)++ ;       //move the input past the '^' symbol.
       if( t2(ep, ifp) )//look for a trailing t2 non-terminal.  RECURSIVE CALL.
       {//yes, we got the second production: "t1 '^' t2", instead of the first.
          addInstruction( ep, EXPON ); //push an enumeration value indicating
                                       //we saw an exponentiation action.
          return 1;   //indicate success.
       }
       else
       {
        //We got a '^' but not a t2.  There are no rules in t2
        //that begin with t1 '^' , so we've failed to see a t2.
          return 0;
       }
    }
  //at this point we got a t1 only which is good enough to produce a t2.
  //No compiler action needed; just indicate success.
    return 1;
}

Pattern: Recursive descent parsers can be written almost directly from the grammar definition. You check for the existence of non-terminals in the input by simply calling the function that implements that non-terminal, then testing the return to see if it succeeded. You check for literals in the input by first bypassing white-space, then doing a direct '==' character comparison, or a strcmp().  As we match input literals (tokens), we "consume" them by advancing the input pointer past them.
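
For reference, the ltrim() helper used throughout might look something like this (a sketch; the download's version may differ in detail):

C++
#include <ctype.h>

//advance the input pointer past any leading white space, in place.
static void ltrim( const char** ifp )
{
   while( isspace( (unsigned char)**ifp ) )
      (*ifp)++;
}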

Pattern: Anytime you have two rules that begin the same, make sure you don't prematurely accept the shorter rule (like t2 -> t1 in the above example) without checking if a longer rule is actually present (like t2 -> t1 '^' t2 in the above example).  If a rule is seen, execute a construction action if required (like the “addInstruction()” action above).  Indicate whether any of the rules were seen by returning a zero (for failure) or non-zero (for success).
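
To see both patterns applied once more, here is a hedged sketch of the t1 (unary prefix) non-terminal. This is my illustration, assuming a NEG opcode exists in the instruction set; the download's version may differ:

C++
// t1:
//   -> t0
//   -> '-' t0
//   -> '+' t0
int t1( Expression_T* ep, const char** ifp )
{
   ltrim( ifp );                 //white space may precede the sign.
   if( '-' == **ifp )            //unary minus,
   {
      (*ifp)++;                  //consume the '-'.
      if( not t0( ep, ifp ) )    //a t0 MUST follow the sign,
         return 0;               //or this is not a t1.
      addInstruction( ep, NEG ); //emit a negate instruction (assumed opcode).
      return 1;
   }
   if( '+' == **ifp )            //unary plus: consume it; no action needed
      (*ifp)++;                  //because +x == x.
   return t0( ep, ifp );         //either way, a t0 must be present.
}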

Saving State

Note that in each case where we returned zero (failed to see a valid production), we just glibly returned, without restoring the input pointer (ifp) to where it had been before our failed t1 or t2 attempt, even though either of those actions could have advanced the pointer to who-knows where, and made it impossible to continue with an alternate action. For example, consider the following grammar:

tx:
  -> first_production
  -> next_production
  -> last_production

first_production:
  -> "hello world" ';'

next_production:
  -> "hello world" '?'

We implement first_production like this:

C++
int first_production( Expression_T* ep, const char** ifp )
{
   ltrim( ifp );        //skip leading white.
   if( 0 == strncmp( *ifp, "hello world", 11 ) //got "hello world"
   and ( *ifp += 11,    //advance past "hello world" in side-effect context
          ltrim( ifp ), //advance past white space in side-effect context
         ';' == **ifp ) ) //GOT IT!
   {
      (*ifp)++ ;  //advance past the ';'
      return 1;   //indicate success.
   }
   else
      return 0;  //indicate failure.
}

Now suppose the actual input looks like this:

"hello world?"

Within first_production, the input is advanced past "hello world" and is pointing at the '?' when the production fails and returns. tx now attempts next_production, but since the input is pointing at the '?' instead of at "hello world?" it will also fail, even though it should succeed.

Clearly, if first_production fails, before trying next_production, we must restore the pointer to where it was at the beginning of the function. The same thing if next_production fails; before trying last_production, we must restore the input pointer (the state).

And while we're talking about it, note that state is actually much more than just the input position pointer. In our compiler, state also includes the instruction array end pointer, and the constants array end pointer. In a more complicated compiler, it might include other things as well.
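
Concretely, a State_T might look something like this (a hedged sketch; the structure and function names are my own, and once the input pointer is folded into Expression_T below, the ifp argument disappears):

C++
typedef struct StateStruct {
   const char* infix;      //where we were in the input.
   int  instructionsLen;   //how many instructions had been emitted.
   int  constantsLen;      //how many constants had been stored.
} State_T;

void saveState( State_T* s, Expression_T* ep, const char** ifp )
{
   s->infix           = *ifp;
   s->instructionsLen = ep->instructionsLen;
   s->constantsLen    = ep->constantsLen;
}

//restoring truncates the arrays back and rewinds the input.
void restoreState( State_T* s, Expression_T* ep, const char** ifp )
{
   *ifp                = s->infix;
   ep->instructionsLen = s->instructionsLen; //discard emitted instructions.
   ep->constantsLen    = s->constantsLen;    //discard stored constants.
}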

The question is, WHERE should we restore state? That is, which function should be responsible for restoring state: the calling function, or the called function?

The way t2 is implemented, we are clearly ASSUMING that if a function in the call chain ahead of t2 needs the state restored, it will take care of it, because t2 certainly didn't do it!

This gets down to a design decision: should the caller, or the called function, do state restoration? Well, consider the first few non-terminals and production rules of our grammar:

t1: -> t0
    ...

t2: -> t1
    ...

t3: -> t2
    ...

t4: -> t3
    ...

If we rely on the called function to restore state, we're going to have code that looks like this:

C++
int t1( Expression_T* ep, const char** ifp )
{
   State_T s;
   saveState( &s, ep, ifp );
   if( not t0( ep, ifp ) )
   {
      restoreState( &s, ep, ifp );
      return 0;
   }
   else ...
}

int t2( Expression_T* ep, const char** ifp )
{
   State_T s;
   saveState( &s, ep, ifp );
   if( not t1( ep, ifp ) )
   {
      restoreState( &s, ep, ifp );
      return 0;
   }
   else ...
}

etc.

Each non-terminal function must save a local copy of the state before it does anything else. Thus, saving and restoring state becomes a rather time and stack consuming process as each non-terminal function drills into the next lower non-terminal function.

Now suppose we let the calling function take care of restoring state. Then the example functions above become:

C++
int t1( Expression_T* ep, const char** ifp )
{
   if( not t0( ep, ifp ) )
   {
      return 0;
   }
   else ...
}

int t2( Expression_T* ep, const char** ifp )
{
   if( not t1( ep, ifp ) )
   {
      return 0;
   }
   else ...
}

Much simpler, much faster, and uses less stack space for local variables.

Pattern: Let the calling functions restore their own state if they need to. Any time there are alternate rules (that is, if failing the first rule doesn't doom the whole non-terminal), that non-terminal will need to save state before drilling down with other non-terminal calls.

As we will see, most of our grammar rules will be like the t2 rule implemented above, where, if the first production fails, there's no point in moving forward; that non-terminal has already failed. Since there was no alternative, there was no point in saving the state within that non-terminal.

Before moving forward with more code, let's update our Expression_T type to take care of that annoying double pointer input "**ifp".

C++
typedef struct ExpressionStruct {

   double *constants;
   int     constantsLen;
   int    *instructions;
   int     instructionsLen;
   const char* infix;       //replaces **ifp that we were using.

} Expression_T;

This will allow us to pass just one argument to the non-terminal functions, and make manipulating the input pointer easier. When we initiate the compiling process, we'll set the infix member to the actual expression the user wants compiled.
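
With that in place, the top-level entry point (the "goal" production from Task 2) might look something like this. It's a hedged sketch; the function name and the exact error handling are my own, not necessarily the download's:

C++
// goal:
//   -> expression '\0'
int compileExpression( Expression_T* ep, const char* infixText )
{
   ep->infix = infixText;      //point at the user's expression.
   if( not expression( ep ) )  //the whole input must reduce to an expression,
      return 0;
   ltrim( &(ep->infix) );      //allow trailing white space,
   if( '\0' != *(ep->infix) )  //...but anything else left over is an error.
   {
      error( ep->infix );      //show where the leftover junk begins.
      return 0;
   }
   return 1;                   //success: the arrays are now populated.
}

Typical usage is then a one-time compile followed by many cheap evaluations:

C++
Expression_T e;                     //assume members initialized appropriately.
double params[] = { 2.5 };          //p1 == 2.5
double result;

if( compileExpression( &e, "6 * sqrt(5 + 3*p1)" ) )
   result = evaluate( &e, params ); //repeat as often as needed.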

Left Recursion Pattern

Now let's demonstrate more patterns by writing the functions for different non-terminals. The t2 rule demonstrated right recursion. Now let's demonstrate left recursion by implementing t4.

t4:
  -> t3              //rule #1
  -> t4 '+' t3       //rule #2
  -> t4 '-' t3       //rule #3

Before we get started, note that the ONLY way to get a t4 (from a non-t4) is by seeing a t3. If we don't begin by seeing a t3, there is no way to move forward. Thus, while there ARE three separate rules, implying we should save state, the first one is mandatory, so there's really no point in saving state before it.

C++
int t4( Expression_T* ep )
{
   InstructionEnum_T ins; //which instruction (ADD or SUBT(ract)) we're seeing.

   if( not t3( ep ) )
   {
    //notice that if we naively attempt to move on and try the next rule
    //  if( t4( ep ) )
    //   //great, got a t4 so move forward...
    //we have just created an infinite loop.
    //If we don't start this non-terminal by FIRST getting a t3,
    //we simply don't have any basis to move forward.  PERIOD.

      return 0;
   }

 //OK, got here, so we must have gotten a t3, which is good enough from
 //rule #1 to reduce to a t4.  Thus, since we now have a t4, we can now
 //look at the subsequent rules to see if we're actually getting one of
 //them. That is, look at the next token to see if it is a '+', or a '-'.

 //Note that if we DO get a successful production, it also reduces to a
 //t4, and we should restart the process of again looking for one of the
 //rules that begins with t4 followed by one of the '+' or '-' tokens.

   while(
 //We keep looping while seeing one of the last two production rules.

 //Remember, we always allow arbitrary white space before tokens.
           ltrim( &(ep->infix) ), //trim white in side-effect context.

  //use side-effect context to store info about the token we're recognizing.
          (ins=ADD ,   '+' == *(ep->infix))
      or  (ins=SUBT,   '-' == *(ep->infix))  )
   {
    //YES, we got one of the addition or subtraction operator tokens,
      ep->infix += 1;       //skip past the token.
      if( t3( ep ) )        //try the rest of the rule.
      {
       //YES, we got another "t4 OP t3" reduction.
         addInstruction( ep, ins ); //push whichever instruction we got.
      }
      else
      {
       //Now we're in a quandary.  We must have seen a t4 or we wouldn't be here.
       //So we should return success, right?  But then we saw a '+' or '-'
       //NOT followed by a t3.  We don't have any rules for which THAT should
       //be considered success.  If we return failure, we're indicating that
       //the WHOLE rule failed, from the very first t4 -> t3, which is not
       //true; that rule succeeded.  Oh my, oh my!  What should we do?
       //Let's take a break and discuss it...
      }
   }
   return 1; //at least the original t4 -> t3 rule succeeded.
}

Pattern: Implement grammar left recursion via a "while" loop. Don't carelessly create an infinite loop by recursively calling a non-terminal that will bring you right back to the same position in the code!

More State Saving

So, what should we do about that 'else' clause after seeing "t4 OP ", but then NOT seeing a successful t3 non-terminal? I know of two possible approaches.

The Purist Approach

In the purist approach, we should attempt to isolate the t4 non-terminal logic from every other non-terminal, and make it stand alone. In this way, if any other parts of the grammar should ever change, the t4 is rock-solid, and will not need to change. Let's think about how to do this. Regardless of whether we want to be a purist, it's worth examining anyway because there may be other situations where this is the only choice we have.

Let's consider input that looks like this:

"4 + 5 - 6 * 7"

The first '4' is a t0 which ultimately reduces to a t3, which generates success for rule #1. We then move into the while loop looking for a '+' or a '-' sign. We get it at the '+' sign. We then call t3 to see what's on the right hand side of the '+' sign. The '5' reduces to a t3 and returns success, which signals we can again reduce via rule #2 to t4. We then continue the while loop and see the '-' sign. We call t3 again and now the entire "6 * 7" reduces to a t3. We move forward still within the while loop looking for the next '+' or '-' sign. There isn't one so the while loop exits and we return 1 (the last line of the function).

Now consider this input:

"4 + 5 - &x"

This isn't legal input for our expression grammar, BUT as stated earlier, perhaps the grammar will be extended in the future and we want our rule to stand alone and still work correctly. We need to somehow indicate that the "4 + 5" portion reduced successfully, but return failure for the "- &x" portion of the input. Suppose that we add a rule to the t1 non-terminal:

t1:           //unary prefix operators.
  -> t0
  -> '-' t0
  -> '+' t0
  -> '&' t0   //new rule

In this case, since t1 is higher precedence than t4, the "&x" input will successfully reduce to a t3, and everything will work fine: NO ERROR.

But suppose instead that the t6 non-terminal gets a new rule:

t6:
  -> t5
  -> "not" t5
  -> t6 '-' '&' t0   //new rule  

Remember our input:

"4 + 5 - &x"

Within the t4 non-terminal, the "4 + 5" reduces to a t4, then the minus sign is consumed and the t3 fails with the "&x" input. If we could somehow indicate that the "4+5" successfully reduced to a t4, then it would in-turn reduce to a t5 (via t5 rule #1), which would reduce to a t6 (via t6 rule #1), and we'd be ready to move forward with the rule "t6 -> t6 '-' '&' t0" PROVIDED the minus sign that was consumed in the t4 was replaced.

That is, within the t4 non-terminal, within the while loop, we should keep track of the state at the LAST SUCCESS, and if we do encounter a failure, restore that state before returning SUCCESS.

Let's rewrite t4 with this new information (most of the comments removed to highlight our new-found policy):

C++
int t4( Expression_T* ep )
{
   InstructionEnum_T ins;
   State_T s;

   if( not t3( ep ) )
   {
      return 0;
   }
   while(
           ltrim( &(ep->infix) ),
          (ins=ADD ,   '+' == *(ep->infix))
      or  (ins=SUBT,   '-' == *(ep->infix))  )
   {
      saveState( &s, ep );  //do this BEFORE consuming any tokens.
      ep->infix += 1;       //skip past the '+' or '-' (consume it).
      if( t3( ep ) )        //try the rest of the rule.
      {//success.
         addInstruction( ep, ins );
      }
      else  //t3 failed.
      {//put things back like they were immediately after the last success.
         restoreState( &s, ep );
         break;                  //return success.
      }
   }
   return 1;
}

This represents a deviation from our earlier policy of not restoring state prior to returning to a calling function. But in this case, since we are inside of a loop, there's just no way for the calling function to know how far along we got before we quit seeing successful reductions.

Actually, since the while loop represents recursion (left recursion), once we get into the loop, THIS FUNCTION IS the calling function (as well as the called function). It is implicitly calling itself. Thus, when we restore state, t4 essentially IS the calling function restoring the context. So this actually is not a deviation from earlier policy.

Pattern: Within a while loop that represents left recursion, you should save state after each successful production, and if you advance state while looking for a pattern, but the pattern fails, restore state before returning the success indicator. Remember, if you saw ANY successful rules, you return "SUCCESS", not "FAILURE".

The Pragmatic Approach

I started the last section by saying there were TWO WAYS to deal with the else clause, and the way we just discussed was the "purist" approach. The other approach is to simply look around the grammar and see where else the '+' and '-' signs appear. Turns out they only appear in this t4 rule, and in the higher-precedence t1 rule. Unlike our fictitious example where we purposely installed a '-' sign in a lower-precedence rule, in the real grammar, once you get a "t4 '-' " there simply are NO OTHER LEGAL productions. If you DON'T get a t3, you have an error in the input. So for our real grammar, we can simplify t4 even more by just raising an error if we get a '+' or '-' but then DON'T get a t3.

C++
int t4( Expression_T* ep )
{
   InstructionEnum_T ins;
   const char* oldfix;    //instead of the State_T variable "s"

   if( not t3( ep ) )
   {
      return 0;
   }
   while(
           ltrim( &(ep->infix) ),
          (ins=ADD ,   '+' == *(ep->infix))
      or  (ins=SUBT,   '-' == *(ep->infix))  )
   {
      ep->infix += 1;       //skip past the input token.
      oldfix = ep->infix;   //to get a decent error message if needed.
      if( t3( ep ) )
         addInstruction( ep, ins );
      else                  //t3 failed.
         error( oldfix );   //print the bad input, and abort the compilation.
   }
   return 1;
}

We still sort of saved state via the oldfix variable, just not as much state -- it's only a single pointer instead of stack-tops, etc. AND we saved it AFTER advancing past the '+' or '-' token. We needed this in order to get a decent error message if the t3 does fail. If we don't care about telling the user what went wrong, we could even skip this part. BUT, not giving some clue to the user about what went wrong is BAD FORM. Your users will hate you... They will curse you as they pull their hair out hunting for what went wrong. They will lie in bed at night and dream up bad things to do to you -- ways to make you PAY!

SAVE YOURSELF! Just give them a decent error message for pity's sake.

As noted earlier, this version of t4 is no longer independent of the rest of the grammar. If other parts of the grammar change, t4 needs to be reexamined to see whether it is still correct to raise an error within this context.

Pattern: If you KNOW within a production that you've encountered an error in the input, you can just abort right there (with a nice error message of course), and not worry about returning up the call chain.
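
The error() routine itself can be as simple as the following sketch (the download's version presumably does more, given all its tracing and debugging support):

C++
#include <stdio.h>
#include <stdlib.h>

//show the user where things went wrong, then abort the compilation.
static void error( const char* where )
{
   fprintf( stderr, "expression error near: \"%s\"\n", where );
   exit( EXIT_FAILURE );
}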

Greedy Input

Now let's implement the t5 non-terminal. As you'll see, it is pretty much exactly like t4, EXCEPT that there is a subtlety in how to match "<=" instead of "<", and ">=" instead of ">". Here we must be sure to use a greedy algorithm when attempting to consume the comparison token. If we don't, we can incorrectly match the shorter token.

t5:
  -> t4
  -> t5 "<"  t4
  -> t5 "<=" t4
  -> t5 ">"  t4
  -> t5 ">=" t4
C++
int t5( Expression_T* ep )
{
   InstructionEnum_T ins;
   const char* oldfix;

 //this next variable is new to this non-terminal.  It wasn't in the
 //t3 or t4 nonterminals.  It is the length of the matched token.
   int  toklen;

   if( not t4( ep ) )
   {
      return 0;
   }

 //Note that in the following test section, if we search for "<" first,
 //it will match both "<" and "<=" and will take precedence over "<=".
 //Thus, we must use a "greedy algorithm" and search for the longest
 //matching string FIRST.  That is, we must search for "<=" BEFORE
 //searching for "<", and for ">=" before ">".

   while(
           ltrim( &(ep->infix) ), //do this in side-effect context.

  //Now we ALSO store TOKEN LENGTH in side effect context.
          (ins=LT_EQ, toklen=2, 0 == strncmp( ep->infix, "<=", 2 ))  //this first,
      or  (ins=LT   , toklen=1, 0 == strncmp( ep->infix, "<",  1 ))  //then this.
      or  (ins=GT_EQ, toklen=2, 0 == strncmp( ep->infix, ">=", 2 ))  //this first,
      or  (ins=GT   , toklen=1, 0 == strncmp( ep->infix, ">",  1 ))) //then this.
   {
      ep->infix += toklen;  //skip past the token.
      oldfix = ep->infix;
      if( t4( ep ) )        //try the rest of the rule.
         addInstruction( ep, ins ); //push whichever instruction it was.
      else
      {
    //there are no other uses of these comparator tokens anywhere in the
    //grammar.  If they are not being used correctly here, it's
    //definitely an error.
         error( oldfix );
      }
   }
   return 1; //at least the original t5 -> t4 rule succeeded.
}

Primitives

The code for the t0 (primitives) non-terminal illustrates the requirement to store state between rules, and it illustrates a greedy algorithm in searching for function names.

But before we write it, we need a list of function names that we're going to allow. We'll binary-search this list when looking for a match within the input, so it must be in alphabetic order.

C++
const char* funcnames[] =
{ "abs", "acos", "asin", "atan", "atan2", "ceil", "cos", "cosh", "erf"
, "erfc", "exp", "fact", "floor", "gamma", "ln", "lngamma", "log10"
, "mod", "pow", "sin", "sinh", "sqrt", "tan", "tanh"
};
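
//The download pairs this table with a parallel array, opStkUse (used in
//t0() below), giving each function's expected argument count -- presumably
//2 for "atan2", "mod", and "pow", and 1 for the rest.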

int t0( Expression_T* ep )
{
 // implements primary terms.
 // t0:
 //   -> number
 //   -> parameter
 //   -> "pi"
 //   -> '(' expression ')'
 //   -> funcname '(' list ')'

   double    val;
   char*     end;
   const char* oldfix;
   int       numargs;  //to confirm we got the right number of function args.
   State_T   s;

   FindRet_T fr;

   ltrim( &(ep->infix) );
   oldfix = ep->infix ;

 // -> number
   if( val = strtod( ep->infix, &end ), (ep->infix != end) )  // got a number
   {
      addConstant( ep, val );         //put this number on the constants array.
      addInstruction( ep, PUSH_C );    //push next constant to stack.
      ep->infix = end;
   }

 // -> parameter
   else if(('p' == *(ep->infix) or 'P' == *(ep->infix)) // -> parameter
       and isdigit( *(ep->infix + 1) )                  // 2'nd char is a digit: pd
       and (    0 == *(ep->infix + 2)                   //string ends after pd
             or not isdigit( *(ep->infix + 2) )         //pdx; where x is NOT a digit
             or 0 == *(ep->infix + 3)                   //string ends after pdd
             or not isdigit( *(ep->infix + 3) )) )      //pddx; where x is NOT a digit,
   {
      int pnum = *(ep->infix + 1) - '0';  //convert first digit to a number.

      ep->infix += 2;  //skip past "pd" where d is the first digit.

      if( isdigit( *(ep->infix) ) )  //got another digit,
      {
         pnum *= 10;
         pnum += *(ep->infix) - '0' ;//add second digit.
         ep->infix += 1;             //skip past second digit.
      }
      if( 0 == pnum )     //parameter numbers begin with 1, not 0.
         error( oldfix );
      addInstruction( ep, (InstructionEnum_T)(-pnum) ); //negative means
				//"push the numbered parameter to stack".
   }

 // -> "pi"
   else if( 0 == strncmp(ep->infix, "pi", 2) )
   {
      addConstant( ep, 3.1415926535897932384626433832795 );
      addInstruction( ep, PUSH_C );    //push next constant to stack.
      ep->infix += 2; //skip past "pi"
   }

 // -> '(' expression ')'
 //Up to this point in the function, as each of the previous productions
 //failed, we had not called any other non-terminal functions, and had not
 //advanced the input pointer.  Therefore we had not needed to save state.
 //In this rule, however, we're going to drill into the expression non-
 //terminal, and if that fails, attempt to match the funcname rule.
 //Therefore, we must save state prior to adjusting anything so that it
 //will be available for us to restore before attempting the funcname rule.

   else if(  saveState( &s, ep )  , //save state in side-effect context
           ( ( '(' == *(ep->infix)) //got a leading '('
        and (ep->infix++ , expression( ep )) //skipped past '(' then got an expression,
        and (ltrim( &(ep->infix) ), ')' == *(ep->infix)) )) //got trailing ')'
   {
      ep->infix++; //advance past that last ')'.  No other action req'd.
   }

 // -> funcname '(' list ')'
 //Here we must be careful to do another greedy comparison.  If we attempt
 //to search the funcnames array with ep->infix pointing to "cosh(p3)...",
 //a binary or linear search will match the "cos" function name UNLESS we
 //tell it to search for a string of length 4, or we insist on the longest
 //possible match.
 //Since the standard library bsearch function does not allow us to pass
 //an extra argument with the comparison function (i.e. length 4 for the
 //cosh funcname), we have to come up with a different approach.  We can
 //either copy out the longest series of letters to a local variable, and
 //then search the array with that, or use a specialized comparison function.
 //I chose the second option and implemented funcAryNameCompare.

   else if( restoreState( &s, ep ),   //restore state from previous failed attempt,
            fr = binarySearch(oldfix,funcnames,
		FUNCNAMES_LENGTH,sizeof(char*),funcAryNameCompare) ,
            (    not fr.notfound //we DID find it in the list,
             and ( ep->infix += strlen(funcnames[fr.index]), //skip past func name
                   ltrim( &(ep->infix) ),                    //skip past white space
                   '(' == *(ep->infix) )                     //got '('
             and (ep->infix++ , (numargs=list( ep )))        //skip past '('
           //list() is unusual in that it returns the number of elements in the list,
             and (ltrim( &(ep->infix) ), ')' == *(ep->infix)) ) //got ')'
          )
   {
      if( numargs-1 != opStkUse[fr.index] )
         error( oldfix ); //wrong number of args for this func.
      ep->infix++; //advance past that last ')'.
      addInstruction( ep, (InstructionEnum_T)(fr.index) ); //add a function call.
   }
   else
   {
      return False;  //didn't get any of the mandatory reductions.
   }
   return True;
}
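
Here is a hedged sketch of what that specialized comparison function might look like (the download's funcAryNameCompare may differ in detail). The key idea: measure the identifier at the input position first, so "cosh(" is compared as the four-character name "cosh" and cannot stop early at "cos".

C++
#include <ctype.h>
#include <string.h>

static int funcAryNameCompare( const void* key, const void* tableEntry )
{
   const char* input = (const char*)key;
   const char* name  = *(const char* const*)tableEntry;

 //measure the run of identifier characters at the input position.
   size_t len = 0;
   while( isalnum( (unsigned char)input[len] ) ) //atan2, log10 contain digits.
      len++;

   int cmp = strncmp( input, name, len );
   if( 0 != cmp )
      return cmp;
 //prefixes matched: equal ONLY if the table name also ends here.
   return (0 == name[len]) ? 0 : -1;
}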

Pattern: When accepting one of the language tokens, always use greedy input. Look for and accept the longest legal token.

Note that the list() non-terminal function returned the number of items in the list (see the line containing “numargs=list( ep )”), NOT a Boolean value as every other non-terminal does. There’s no reason why you can’t return whatever information you need from any non-terminal function -- including a structure, if that makes sense. Here, we needed to verify that the number of parameters being passed to a function was the correct number. To do this, we encoded success and failure within the return value: zero for failure, and num_items+1 otherwise. That way, we can accept empty parameter lists but still get a “success” indicator. Remember, we talked earlier about other changes we would have to make to actually accept empty parameter lists. Here I just wanted to point out how we might want to deal with a "list" non-terminal if it did accept zero parameters.

Pattern: The return value from a non-terminal does not have to be Boolean – it can be whatever makes sense. You just have to be able to determine from the return value whether the non-terminal succeeded, or failed.

More Interesting Subtleties

The "list" non-terminal brought up some interesting points regarding empty lists that we should probably look at. Although not used within the expression evaluator being built for this project, it's worth looking at how "empty" rules might be dealt with.

Let's look at two different scenarios: The first is where we just want to accept empty parameter lists, and return the value "1", indicating success, but "1-1" (or zero) items actually in the list. Here's the code:

C++
static int list( Expression_T* ep )
{// implements
 // list:
 //   -> expression
 //   -> list ',' expression

//returns zero on failure; else the number of expressions in the
//list PLUS ONE, so that success with an empty list is still non-zero.
  const char* oldfix;
  int         ret = 2;  //assume the first expression succeeds: 1 item + 1.

  if( not expression(ep) )  //didn't get the first production,
  {//this next section is how we accept empty parameter lists.
   //this is NOT actually part of the grammar described by
   //the BNF above.  It's a short-circuit way of adding another
   //formal rule to the t0 non-terminal.  Since the t0()
   //function already knows how to deal with zero items in the
   //list, it's a no-brainer to add it here.
     if( ltrim( &(ep->infix) ),
         ')' == *(ep->infix) )
     {
      //DON'T consume the ')' token! The funcname rule in t0 needs it.
        return 1;  //indicate success, but with empty list.
     }
   //OK, apparently it wasn't an empty list.. some other weirdness.
     return 0;             //can't go anywhere without getting prod1.
  }
 //else, got production 1.

  while( ltrim( &(ep->infix) ),
         ',' == *(ep->infix) )
  {//looks like we may have (another?) production 2.  Try it.
     ret++ ;
     ep->infix += 1;            //advance past the ','.
     oldfix = ep->infix;
     if( not expression(ep) ) //no, apparently not production 2.
        error( oldfix );
   //no instruction to add.
  }
  return ret;
}

And here's another shot at it, but this time we actually accommodate the empty string production, which allows inputs like somefunc( , , 1.5, ). You'll have to decide what eval() action should happen for a null parameter.

C++
//alternate implementation that allows null strings to be list elements.
static int list( Expression_T* ep )
{// implements
 // list:
 //   -> expression           //rule #1
 //   -> ''                   //rule #2
 //   -> list ',' expression  //rule #3
 //   -> list ',' ''          //rule #4

//returns the number of expressions in the list.
  int         ret = 1;
  State_T s;

  saveState(&s, ep);
  if( not expression(ep) )  //didn't get the first rule,
  {//no worries. We'll just assume an empty string (rule #2) and press forward.
     restoreState(&s, ep);
     addInstruction( ep, NULL_LIST_ELEMENT ); //not actually implemented.
  }
 //check for rule #3
  while( ltrim( &(ep->infix) ),
         ',' == *(ep->infix) )
  {//looks like we may have (another?) rule #3.  Try it.
     ret++ ;
     ep->infix += 1;            //advance past the ','.
     saveState(&s, ep);
     if( not expression(ep) ) //apparently not rule #3.
     {//assume rule #4.
        restoreState(&s, ep);
        addInstruction( ep, NULL_LIST_ELEMENT ); //not actually implemented.
     }
  }
  return ret;
}

The Branching Mechanism

The last thing I want to cover in depth in this article is the branching mechanism used to implement the ?: conditional operator. It is contained within the "expression" non-terminal.

The branching construct implemented here is effectively an if/else construct (as opposed to a simple "if" construct). As such, it needs two jumps: one conditional jump to the "do this if test is FALSE" section, and one unconditional jump to the "after the whole thing" section. The flow is like this:

  if( test is false )         //these two lines form the
     goto FALSE_SECTION;      //conditional jump section.
//otherwise, just proceed.

  <code for true section>
  ...
  goto AFTER_SECTION;         //unconditional jump.

 FALSE_SECTION
  <code for false section>
  ...

 AFTER_SECTION
  <code that follows the z?x:y construct>
  ...

The way we'll execute these jumps is to store the state of the instructions array and constants array as entries in the constants array, along with some PUSH_C instructions to push those constants onto the stack; then we'll jump to those positions by restoring the state of the two arrays from the stack. We'll do this at two different points along the parse of the ?: construct, once for each jump.

To help explain how this is going to work, here are the COND_JMP and UNCOND_JMP instructions as they will be executed within the evaluate function. You can see that they pull their operands off the stack, just like every other operation does.

C++
double evaluate( Expression_T* ep, double* parameters )
{
   double stack[50];
   int stktop = -1;       //last item ON stack.
   int cndx = 0;          //next constant to push to stack.
   int indx;
   ...
   for( indx=0; indx < ep->instructionsLen; indx++ )
   {
   ...
      switch( instruction )
      {
         ...
       //unconditional jump.  Last operation executed after the "Then" part.
         case UNCOND_JMP: // Binary operator with no resultant
                 assert( stktop >= 1 );
                 stktop -= 2;  //totally remove this construct from the stack.
                 cndx = (int)stack[stktop+1];    //restore the constants array position.
                 indx = (int)stack[stktop+2]-1;  //execute jump. The -1 is there because
                                                 //the very next action is an
                                                 //increment of this indx loop counter!
                 break;

       //conditional jump. First operation executed as part of the ?: operator.
         case COND_JMP: // Trinary operator with no resultant
                 assert( stktop >= 2 );
                 stktop -= 3;  //totally remove this construct, INCLUDING the test.
                 if( not stack[stktop+1] )         //if the test evaluated false
                 {//execute jump to the "else" part; the ':' part.
                    cndx = (int)stack[stktop+2];   //restore the constants array position.
                    indx = (int)stack[stktop+3]-1; //execute jump. The -1 is there because 
                                                   //the very next action is an 
                                                   //increment of this indx loop counter!
                 }
               //else, test was true: stay and do the "then" part; the '?' part.
                 break;

         ...
      }
   }
   return stack[0]; //final answer will be here now.
}

To do all this, however, we're going to need a little more detail within the Expression_T structure.

C++
typedef struct ExpressionStruct {

   double *constants;
   int     constantsLen;
   int     cndx;            //current top of constants "stack"
   int    *instructions;
   int     instructionsLen;
   int     indx;            //current top of the instructions "stack"
   const char* infix;

} Expression_T;

It should be obvious that as we call addConstant() and addInstruction() we are incrementing the cndx and indx members respectively.
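
For concreteness, here's a minimal sketch of what those two helpers might look like, assuming the arrays were preallocated with enough capacity. The real helpers in expression.c may differ (array growth, bounds checks, debug tracing).

C++
//Minimal sketches only; NOT the exact helpers from expression.c.
static void addConstant( Expression_T* ep, double c )
{
   ep->constants[ep->cndx++] = c;      //store, then bump the "top"
   ep->constantsLen = ep->cndx;        //keep the used-length in sync
}

static void addInstruction( Expression_T* ep, int op )
{
   ep->instructions[ep->indx++] = op;  //store, then bump the "top"
   ep->instructionsLen = ep->indx;     //evaluate() loops to this limit
}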

C++
// expression:
//    -> t9
//    -> expression '?' expression ':' expression

int expression( Expression_T* ep )
{
   const char* oldfix;

   if( not t9( ep ) )
      return 0;

 //now do the left recursive part.
   while( ep->infix = ltrim( ep->infix ), //always trim prior to a token
          '?' == *(ep->infix) )
   {
      int condJmpConstant;
      int uncondJmpConstant;

      condJmpConstant = ep->cndx;   //remember where we are on the constants array.
                                    //We'll come back to this position later and put
                                    //in the "if false" jump targets.  Right now,
                                    //we don't know what those targets are.
    //we'll come back later and put in the correct target values instead of zeros.
      addConstant( ep, 0 ); //where we'll put the target constants index.
      addConstant( ep, 0 ); //where we'll put the target instructions index.
      addInstruction( ep, PUSH_C );    //push constants index to stack.
      addInstruction( ep, PUSH_C );    //push instructions index to stack.
      addInstruction( ep, COND_JMP );  //the conditional jump instruction.
                                       //Uses values on the stack to restore the
                                       //indexes of the constants array and the
                                       //instructions array.

      ep->infix++ ;    //move past '?'
      oldfix = ep->infix;
    //Now compile the "Then_exp" part of the if(test)Then_exp else Else_exp
      if( not expression( ep ) )
      {
         error( oldfix ); // '?' is only used here, so must be an error.
      }
    //We just successfully compiled the "Then" part of the construct.

    //The last instruction of the "Then" part is an unconditional jump past the "Else" part
      uncondJmpConstant = ep->cndx; //remember where to put the unconditional jump targets
    //we'll come back later and put in the correct target values instead of zeros.
      addConstant( ep, 0 ); //where we'll put the target constants index.
      addConstant( ep, 0 ); //where we'll put the target instructions index.
      addInstruction( ep, PUSH_C );    //push constants index to stack.
      addInstruction( ep, PUSH_C );    //push instructions index to stack.
      addInstruction( ep, UNCOND_JMP ); //the UNconditional jump instruction.

    //we're now completely past the "Then" phrase.  Now we know what the
    //"if false" jump targets are (where we are right now).  Go back and
    //fill them in.
      ep->constants[condJmpConstant]   = ep->cndx;
      ep->constants[condJmpConstant+1] = ep->indx;

    //continue parsing. Move past the "   :" part of the production rule.
      ep->infix = ltrim( ep->infix );
      if( ':' != *(ep->infix) ) //didn't get the ':' expected,
      {
         error( oldfix ); // we already saw the '?', so a missing ':' must be an error.
      }
      ep->infix++ ;    //move past the ':'
      oldfix = ep->infix;

    //We're ready now to compile the "Else" part.
      if( not expression( ep ) )
      {
         error( oldfix );
      }
    //We just successfully compiled the "Else" part of the construct.
    //That is, we're at the very end of the z?x:y construct.

    //fill in where to jump to after the "Then" part; the unconditional jump.
      ep->constants[uncondJmpConstant]   = ep->cndx;
      ep->constants[uncondJmpConstant+1] = ep->indx;
   }
   return 1;
}
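
To see how the pieces line up, here is the layout this code produces for the input 1 ? 2 : 3, worked by hand from the code above (both arrays zero-indexed):

  constants:    [0]=1  [1]=6  [2]=8  [3]=2  [4]=7  [5]=9  [6]=3

  instructions: [0] PUSH_C      push 1, the test value
                [1] PUSH_C      push 6, the false-branch constants target
                [2] PUSH_C      push 8, the false-branch instructions target
                [3] COND_JMP    true: fall through.  false: cndx=6, indx=8
                [4] PUSH_C      push 2, the "then" value
                [5] PUSH_C      push 7, the after-construct constants target
                [6] PUSH_C      push 9, the after-construct instructions target
                [7] UNCOND_JMP  cndx=7, indx=9 (one past the end)
                [8] PUSH_C      push 3, the "else" value

Either path leaves exactly one value on the stack and both indexes pointing just past the construct, which is what lets a ?: nest cleanly inside a larger expression.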

Pattern: You may have to write a good bit of "compiler" code within your "parser". I can't think of any way we could have fully accepted the ?: construct in one step and THEN gone back to fill in the details. We had to mix parsing and code generation along the way.

The main() Routine

Any time you implement a compiler, and especially if you build it by hand as we have done with the expression compiler, you should build a really good test suite. While I'm not going to discuss building it in depth, I will say that it's a good idea to start by testing the simplest constructs first, and then slowly build upon success. The "main()" routine in the expression source contains the tests I implemented.
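
The general shape is simple. To be clear, the sketch below is NOT the main() from expression.c: the check() helper and the compile() entry point are hypothetical stand-ins for whatever your compiler's API actually looks like.

C++
//Illustrative only: check() and compile() are assumed names,
//not the API from expression.c.
#include <math.h>
#include <stdio.h>

static int failures = 0;

static void check( const char* infix, double expected )
{
   Expression_T* ep  = compile( infix );     //hypothetical entry point
   double        got = evaluate( ep, NULL ); //no runtime parameters
   if( fabs( got - expected ) > 1e-12 )
   {
      printf( "FAIL: %s => %g (expected %g)\n", infix, got, expected );
      failures++ ;
   }
}

static void runTests( void )
{//simplest constructs first, then slowly build upon success:
   check( "2",         2.0  );
   check( "1+1",       2.0  );
   check( "2*(3+4)",   14.0 );
   check( "1 ? 2 : 3", 2.0  );
}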

By the time I have software up and running, I usually have it just chock-full of debugging statements. I find the most effective debugging to be a combination of an interactive debugger (like Visual C++) AND a logfile generation mechanism. The logfile lets me quickly jump to the position in the execution where I'm trying to figure out what's going on. The expression.c source code is no different. The actual code (as opposed to the example code I've included in the article) is full of assertions and debugging code.

It's also a good idea when writing a test routine NOT to fill up the output screen on success, but rather to emit messages only on actual errors. On my 3GHz machine, with full debugging output turned on, it takes about 7 seconds to output all the data from the test cases. With debugging turned off, and ONLY the internal tests running (no tracing), it completes the instant I hit the return key. That is a much better option in case someone wants to put a script together to run test suites on many different libraries.

One thing that occurred to me during testing was that there is no sure-fire way of determining HOW the expression system will compile any particular infix expression handed to it at run time. Here is the scenario: I put an expression into my .ini file, read it in, compile it, and HOPE that the expression compiler interprets it the way I intended! Yes, I could write a separate program to read in the expression, compute it, and compare against known outputs, but that seems like a lot of work for the USER of these routines. Therefore, I included a section within the main() program that can be used interactively for just this purpose. Additionally, for anyone uncertain that the compiler is parsing an expression correctly, I included the "-trace" option to show the top few elements of the stack during evaluation, and how each operation is performed. One can compare this against a hand-generated RPN version of the same expression.

I have found the test version of the program to be so useful for everyday work, that I renamed it "eval.exe" and put it in my path so I can call it from the command line. Now I use it regularly for quick little computations that I used to use the Windows "calc" program for, or would break open a spreadsheet for.

Summary (x2)

With careful consideration, and a simple enough grammar, the patterns identified in this article can be used to successfully roll your own recursive descent parser/compiler. Considering the time it takes to learn a compiler-compiler system, it MAY be a more effective use of your time to roll your own compiler than to learn a new system. I'd say that the complexity of the expression grammar implemented in this article is on the ragged edge of needing to break out a compiler-compiler.

Regardless of whether you roll your own, or use a compiler-compiler, you must CLEARLY UNDERSTAND what the compiler needs to do upon recognizing each production rule. The non-terminal function that implemented the branching construct (the "expression()" non-terminal above) should have been a good example of that principle. A compiler-compiler cannot help you with this. The compiler-compiler just helps you implement the grammar correctly, and get the production actions in the right place.

*Final Thoughts

*From the introduction paragraph: "...software magically evolves" (wink) when smart people are in the presence of patterns. Of course, WE know there's nothing magical about it: it's a lot of thought and hard work. But biological machines (living organisms), which are orders of magnitude more complex than any software or hardware ever seen, apparently do evolve without the benefit of any "smartness" (or intelligence) whatsoever. If we are to believe evolutionists, they just spring into existence by pure random chance.

I once heard a very smart programmer say that trusting things to chance was not a good way to write software. As a mechanical engineer myself, I can affirm that this principle also applies in machine design; randomly stringing mechanical components together just doesn't cut it. Apparently, however, according to evolutionists, random chance works GREAT in biological design. Pure random chance composed not just any old biological machine, but machines so complex that they can reproduce, and one of them is so sophisticated and intelligent that it even designs computers and writes software! Just unbelievable...

Hmmmm. A compiler-compiler "writes" software, kind of mimicking the actions of a biological machine (it mimics the actions of those so-called "humans"). I'm thinking a really great compiler-compiler can be had by just randomly scrambling lots of instructions together. And hey, isn't that how YACC originally came about? (Though it evolved in the days of slower computers; today we have better random number generators and faster computers, so this new compiler-compiler will be much smarter / more evolved.)

I'M ON IT!

;-)

Have fun building YOUR compiler!

David Qualls

History

  • 17th November, 2011: Initial version
  • 5th December, 2011: Article updated

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

