Click here to Skip to main content
Click here to Skip to main content

A simple hand-coded script parser

, 30 Nov 2004
Rate this:
Please Sign up or sign in to vote.
A simple script parser and engine to demonstrate coding a parser, and recursive descent statement evaluation.

Introduction

This is the third parser I have written, and it sucks somewhat less than the first two. The goal of this one was an easily expandable parser that would be easy to understand by other programmers, and would be easy to embed into other projects. I also wanted it to be simple enough for an intermediate programmer to understand how the code works. I’ve been partially successful.

It still isn’t that easy to add more operators or built in functions, but the code is quite a bit more transparent than my previous efforts. Why would I want to do this when there are very good parsers out there, and tools for generating C code for parsers given a description of a language? Those tools are often a bit much for someone just getting started with writing a parser to work with. This code is small enough to see how a simple parser works, yet has enough features to demand a better design than a very simple parser. It is suitable for students to study how a parser can be put together.

Features

  • User defined functions
  • Supports recursion and nested calls of user defined functions
  • Global and local variables
  • Rich set of operators
  • Must declare all variables
  • Generates simple byte code.
  • Small, easy to insert into existing projects.

Deficiencies

  • No headers – cannot call a function defined later in the script. A function can call itself.
  • Cannot call a function in complex statements. Function values can be assigned to variables, but must be the only term of the statement, e.g.: y=foo{}; is OK, y=x + foo{}; is not OK.
  • Uses a simple recursive descent method to evaluate statements. This makes it easy to see how this part of the parser works, but it generates inefficient byte code. Having the parser generate reverse Polish statements would have made the engine more efficient and flexible.
  • Arrays limited to one dimension.
  • Functions limited to 20 parameters.
  • Functions cannot have array parameters

Notes on the Parser and Engine

This is a one pass parser. It reads the whole script text into memory and parses it from beginning to end in one pass. There is one extra pass to process the directives before the main parser pass. Except for comments, white-space is ignored.

The parser reads in one ‘token’ at a time. A token is one valid piece of text, for example, "int", "+", "==", a variable name, or a literal value. The parser is hard-coded to check syntax. For example, if it encounters a literal string that is an integer, it expects an assignment to follow it. This logic is hard-coded, not driven by a description of the language.

Parsing Variables

When variables are declared, their type, name, and location are saved. When variables are later encountered in statements, the parser generates code that uses the variables type and location to get or set its value.

Global variables start at the beginning of the byte stream and simply get added as more global variables are declared. In the byte code, a global variable will be a byte indicating the type (int, float, etc…) followed by an integer that is the variable's offset from the start of the byte code.

Local variables are quite different. The code supports recursion, so local variables' location must change depending on how many function calls are nested. When a function is called, a pointer to the current local variable stack is adjusted. Local variables are always located relative to the current position of this pointer, not to a fixed location, like global variables. When a function call is parsed, the current local variable pointer is increased by the size of the function’s local variables, and on exit, this same amount is removed from the local variable pointer. Local variables are then retrieved relative to this moving pointer.

If / when

This is a single pass parser. When an if is parsed, it is replaced by code that evaluates the logic statement, then jumps past the end of the if block if the logical statement is false. The parser does not know where to jump to yet, so we leave the goto’s address blank, and save where this blank address is in a temporary stack that only exists while parsing. Then, when we reach an end statement, we can fill in the missing goto address, and remove the goto from the temporary stack.

while, break, and else statements are handled in a similar fashion. Using a stack of addresses allows the parser to support nested flow control.

Start Up Code

The parser is hard-coded to run the user function "main" first. A few bytes of code are placed in the start of the byte stream to initialize the data pointers and call the main function.

What I Would Have Done Differently

Despite this being the third parser I have written over the last decade, I made a number of poor design decisions that caused me to have to re-write large portions of the parser. My first attempt at handling variables worked fine for globals, but did not handle local variables at all. Re-Write. The method I used to evaluate statements worked fine as long as there were no function calls. When I added functions calls, I had to do another re-write, and the parser is still limited to having function calls only by themselves in statements.

If I have time to re-do this, I would create a much more firm design document – especially for the byte code that the parser emits. I started out with only a list of features and a vague design in my head. This is not ideal programming practice! I would have worked out a much better and general method for parsing and evaluating complex statements if I were to start over. The method I used, while easy to implement initially without worrying about function calls, has a number of limits and workarounds to handle function calls. I would also carefully study how existing compilers handle variables and function calls. A lot of wisdom exists in how C and other simple languages handle these tasks.

Evaluating Statements

The parser does not do much re-ordering of statements. It does change how assignments are done, but the statement on the right of an assignment is pretty much just replaced with byte code of the same form. At run time, statements are evaluated using a simple recursive descent method. This is an easy method to hand code, but generates less than optimal byte code. A more ambitious parser might generate reverse-Polish statements, or otherwise re-order the statement so that it can be run faster with a simpler statement evaluator.

Let’s look at a simple statement:

4 + 3 * 5

To be evaluated correctly, the 3*5 must be done first. Recursive descent does this by scanning the statement for higher precedent operators, evaluating them, and replacing the evaluated terms with the result. This scanning and replacing is repeated until a single value is left.

1<SUP>st</SUP> pass 
4+15
2<SUP>nd</SUP> pass
19 – all done

A second example
4 + (3*2+4) – 2
1<SUP>st</SUP> pass
4+(6+4)-2
2<SUP>nd</SUP>
4+(10)-2
3<SUP>rd</SUP>
4+10-2
4<SUP>th</SUP>
14-2
5<SUP>th</SUP>
12 – all done

The byte code engine actually copies over the result of a pass overtop of the original statement (padding with null operators where required) until only a literal value is left. A copy of the statement is evaluated as the statement actually gets destroyed when parsed. This destructive evaluation is another reason why reordering statements is a better idea. A copy is required as the script does support looping.

In other parsers, I converted the statement to a linked list of statements which made the code easier to read, but even slower. The linked list was then evaluated, and evaluated nodes were replaced and deleted as required.

In the function evaluate_int_exp(), you will see this implemented by scanning for higher-precedent statements, then calling evaluate_in_exp() on a sub-part of the statement recursively until the whole statement is reduced to a single literal value.

Data Types

  • Integers, 1, 2, 3
  • Character ‘a’, ‘b’, ‘c’
  • Float, 1.2, 1.0, 9.9
  • Arrays
  • Integer
  • Character "abcded", "\n", "\t"
  • Float
  • Single dimension only
  • pointers? – no, but can assign string to char arrays

Declare like C.

  • int
  • float
  • char
  • int [ ]
  • float [ ]
  • char [ ]

Operators

Operator Precedence Integer Function Float Function Character String Function
[ ] 0 Array element Array element Array element
= 0 Assignment Assignment Assignment
( ) 1 Change precedence Change precedence NA
! 2 Logical NOT Logical NOT NA
- 2 Negate Negate NA
~ 2 Bit-wise NOT NA NA
* 3 Multiple Multiple NA
/ 3 Divide Divide NA
% 3 Modulo DIV Modulo DIV NA
+ 4 Add Add Concatenate
- 4 Subtract Subtract NA
^ 5 Bit-wise exclusive OR NA NA
& 5 Bit-wise AND NA NA
| 5 Bit-wise OR NA NA
== 6 Test for equality Test for equality Test for equality
<= 6 Test for less than equal Test for less than equal Less than or equal
< 6 Less than Less than Less than
>= 6 Greater than or equal Greater than or equal Greater than or equal
> 6 Greater than Greater than Greater than
&& 6 Logical AND Logical AND NA
|| 6 Logical OR Logical OR NA
// NA Comment Comment Comment
atoi 2 NA NA Convert to int
atof 2 NA NA Convert to float
itoa 2 Convert to str NA NA
ftoa 2 NA Convert to str NA

Flow control

  • while
  • if
  • if else
  • break
  • continue
  • end

Built-In Functions

print

print followed by comma delimited list of statements. E.g.:

print "hi world ", x, " that was x\n";

User Function Definition

func type name {parms};
end

Return type must be int or float. Arrays cannot be parameters. Maximum of 20 parameters. E.g.:

func int fubar {int x, int y, float b};
  body;
end;

Calling a function:

fubar {1, 2, 3, aa};

or

x=fubar {1, 2, 3, aa};

Cannot put a function in a complex statement:

X=fubar(1,2,3, aa} + 99;  // NOT allowed

Scope

  • global – all variables declared outside of functions.
  • local – variables declared within a function.
  • Parameters are treated like local variables.

Directives

Directives must go in comments. They are used to set the size reserved for variables and the call stack.

  • // $globalsize = size in bytes of static global vars (not a stack)
  • // $localsize = size in bytes of local var stack
  • // $parmsize = size in bytes of parameter stack
  • // $stacksize = size in bytes of code stack

Default is 64Kb for each.

Statements

Statements can contain literals, vars, and operators. Statements are always ended by a semi-colon ‘;’.

Statements with functions can have no other terms, i.e., many terms, or a single function call. E.g.:

i=1+j*5;

OR

i=f1{};

Array index may be a statement.

Sample Script

// test file

int x;
int i,j,k;
int ii[100],ij, ik;
float f, g;
float ff[10];
char s [ 100 ] , b, #9; #9; c;
char s2[100];
// ---------------------------------------------------------
func int test{};
  int y;
  print "at top of test\n";
  s[0]='a';
  s[1]=0;
  print "s = ", s, " s[0] = ", s[0], "\n";
  s="I am the cat " + "Hi World\n";
  s2=s;
  print "s = ", s, " s2 = ", s2, " newline\n";
  print "Hi world\n";
  print "Hi world tab\t cr\r world\n";
  x=43;
  y=5;
  ii[34]=x + y;
  ii[x+y]=5;
  print "ii[34] = ", ii[34], "\n";
  print ">x +10 = < ", x + 10, " y="y, "ii[2]=", ii[2], " ii[34] =", ii[34], "\n";
  print "(4-2)*5 = ", (4-2) *5, "\n";
  b='a';
  print "b = ", b, "\n";
  x=5;
  y=10;
  print "x=", x, " y=",y,"\n";
  if 1;
    print "before if break\n";
    break;
    print "after if break\n";
  end
  while x > 0;
    x=x-1;
    print "x=", x, "\n";
    // continue;
    if x > 2;
      print " x > 2\n";
    else
      print " x is not > 2\n";
    end
    print "before the break\n";
    break;
    print "past the break\n";
  end
  print "at the end, past the while\n";
end

// ----------------------------------------------------
func int foo1{};
  set 5;
end
// ----------------------------------------------------
func int foo2{int x, int y};
  set x*y;
end
// ----------------------------------------------------
func int main{};
  // locals
  int y, z;
  float m;
  print "top of main\n";
  print "call test next\n";
  test{};
  m=(1.0 + 0.5) * 20.0;
  print "m=",m,"\n";
  x=foo1{};
  y=10;
  z= (10 + y)*x;
  print "z = ", z, "\n";
  print "x=", x, " y=",y,"\n";
  while x > 0;
    x=x-1;
    if x > 2;
      print " x > 2\n";
    else
      print " x is not > 2\n";
    end
    print "x=", x, "\n";
  end
  print "at the end, past the if/while\n";
  x=5;
  y=6;
  x=foo2{x,y};
  print "x of foo2=", x, "\n";
end

More Information

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

About the Author

arussell
Software Developer (Senior)
Canada Canada
Professional Programmer living in Beautiful Vancouver, BC, Canada.

Comments and Discussions

 
QuestionI use the tool to parse in.txt into in.txt.byte, but i cannot run the bytecode file anyway, what's wrong? Pinmemberjoshua01378-Feb-10 21:36 
QuestionComparing strings PinmemberJostein Kjellesvik24-Oct-08 0:42 
Questionatoi syntax PinmemberJostein Kjellesvik21-Oct-08 23:02 
GeneralVisual BNF parser generator PinmemberAdamSlosarski13-Mar-07 18:14 
GeneralRun-Time Check Failure #3 PinmemberFanzhe Cui3-Apr-06 5:59 
GeneralRe: Run-Time Check Failure #3 Pinmemberarussell3-Apr-06 16:53 
GeneralRe: Run-Time Check Failure #3 PinmemberFanzhe Cui4-Apr-06 3:57 
GeneralRe: Run-Time Check Failure #3 PinmemberPaul Singh18-Dec-06 6:03 
GeneralParse error PinsussAnonymous20-Mar-05 21:09 
GeneralRe: Parse error PinsussAnonymous21-Mar-05 1:05 
QuestionRe: Parse error PinmemberWinstonio2-Jul-07 6:21 
Generaluninitialized variable causes error in release build PinmemberLars Wadefalk8-Dec-04 5:25 
GeneralRe: uninitialized variable causes error in release build Pinmemberarussell8-Dec-04 15:15 
GeneralRe: uninitialized variable causes error in release build PinmemberLars Wadefalk8-Dec-04 20:42 
GeneralRe: uninitialized variable causes error in release build Pinmemberarussell9-Dec-04 15:00 
GeneralRe: uninitialized variable causes error in release build Pinmembernotalreadyinuse25-Sep-07 10:00 
GeneralThis is the third parser I have written, and it sucks somewhat less than the first two PinsussAnonymous1-Dec-04 8:55 
GeneralRe: This is the third parser I have written, and it sucks somewhat less than the first two Pinmemberazhu_ram26-Mar-07 23:40 
GeneralRe: This is the third parser I have written, and it sucks somewhat less than the first two PinmemberChaz Zeromus21-May-07 14:32 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Mobile
Web03 | 2.8.140721.1 | Last Updated 1 Dec 2004
Article Copyright 2004 by arussell
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid