Click here to Skip to main content
15,881,248 members
Articles / Programming Languages / C++
Article

A simple hand-coded script parser

Rate me:
Please Sign up or sign in to vote.
4.41/5 (11 votes)
30 Nov 20049 min read 112.7K   2.7K   53   22
A simple script parser and engine to demonstrate coding a parser, and recursive descent statement evaluation.

Introduction

This is the third parser I have written, and it sucks somewhat less than the first two. The goal of this one was an easily expandable parser that would be easy to understand by other programmers, and would be easy to embed into other projects. I also wanted it to be simple enough for an intermediate programmer to understand how the code works. I’ve been partially successful.

It still isn’t that easy to add more operators or built in functions, but the code is quite a bit more transparent than my previous efforts. Why would I want to do this when there are very good parsers out there, and tools for generating C code for parsers given a description of a language? Those tools are often a bit much for someone just getting started with writing a parser to work with. This code is small enough to see how a simple parser works, yet has enough features to demand a better design than a very simple parser. It is suitable for students to study how a parser can be put together.

Features

  • User defined functions
  • Supports recursion and nested calls of user defined functions
  • Global and local variables
  • Rich set of operators
  • Must declare all variables
  • Generates simple byte code.
  • Small, easy to insert into existing projects.

Deficiencies

  • No headers – cannot call a function defined later in the script. A function can call itself.
  • Cannot call a function in complex statements. Function values can be assigned to variables, but must be the only term of the statement, e.g.: y=foo{}; is OK, y=x + foo{}; is not OK.
  • Uses a simple recursive descent method to evaluate statements. This makes it easy to see how this part of the parser works, but it generates inefficient byte code. Having the parser generate reverse Polish statements would have made the engine more efficient and flexible.
  • Arrays limited to one dimension.
  • Functions limited to 20 parameters.
  • Functions cannot have array parameters

Notes on the Parser and Engine

This is a one pass parser. It reads the whole script text into memory and parses it from beginning to end in one pass. There is one extra pass to process the directives before the main parser pass. Except for comments, white-space is ignored.

The parser reads in one ‘token’ at a time. A token is one valid piece of text, for example, "int", "+", "==", a variable name, or a literal value. The parser is hard-coded to check syntax. For example, if it encounters a literal string that is an integer, it expects an assignment to follow it. This logic is hard-coded, not driven by a description of the language.

Parsing Variables

When variables are declared, their type, name, and location are saved. When variables are later encountered in statements, the parser generates code that uses the variables type and location to get or set its value.

Global variables start at the beginning of the byte stream and simply get added as more global variables are declared. In the byte code, a global variable will be a byte indicating the type (int, float, etc…) followed by an integer that is the variable's offset from the start of the byte code.

Local variables are quite different. The code supports recursion, so local variables' location must change depending on how many function calls are nested. When a function is called, a pointer to the current local variable stack is adjusted. Local variables are always located relative to the current position of this pointer, not to a fixed location, like global variables. When a function call is parsed, the current local variable pointer is increased by the size of the function’s local variables, and on exit, this same amount is removed from the local variable pointer. Local variables are then retrieved relative to this moving pointer.

If / when

This is a single pass parser. When an if is parsed, it is replaced by code that evaluates the logic statement, then jumps past the end of the if block if the logical statement is false. The parser does not know where to jump to yet, so we leave the goto’s address blank, and save where this blank address is in a temporary stack that only exists while parsing. Then, when we reach an end statement, we can fill in the missing goto address, and remove the goto from the temporary stack.

while, break, and else statements are handled in a similar fashion. Using a stack of addresses allows the parser to support nested flow control.

Start Up Code

The parser is hard-coded to run the user function "main" first. A few bytes of code are placed in the start of the byte stream to initialize the data pointers and call the main function.

What I Would Have Done Differently

Despite this being the third parser I have written over the last decade, I made a number of poor design decisions that caused me to have to re-write large portions of the parser. My first attempt at handling variables worked fine for globals, but did not handle local variables at all. Re-Write. The method I used to evaluate statements worked fine as long as there were no function calls. When I added functions calls, I had to do another re-write, and the parser is still limited to having function calls only by themselves in statements.

If I have time to re-do this, I would create a much more firm design document – especially for the byte code that the parser emits. I started out with only a list of features and a vague design in my head. This is not ideal programming practice! I would have worked out a much better and general method for parsing and evaluating complex statements if I were to start over. The method I used, while easy to implement initially without worrying about function calls, has a number of limits and workarounds to handle function calls. I would also carefully study how existing compilers handle variables and function calls. A lot of wisdom exists in how C and other simple languages handle these tasks.

Evaluating Statements

The parser does not do much re-ordering of statements. It does change how assignments are done, but the statement on the right of an assignment is pretty much just replaced with byte code of the same form. At run time, statements are evaluated using a simple recursive descent method. This is an easy method to hand code, but generates less than optimal byte code. A more ambitious parser might generate reverse-Polish statements, or otherwise re-order the statement so that it can be run faster with a simpler statement evaluator.

Let’s look at a simple statement:

4 + 3 * 5

To be evaluated correctly, the 3*5 must be done first. Recursive descent does this by scanning the statement for higher precedent operators, evaluating them, and replacing the evaluated terms with the result. This scanning and replacing is repeated until a single value is left.

1<SUP>st</SUP> pass 
4+15
2<SUP>nd</SUP> pass
19 – all done

A second example
4 + (3*2+4) – 2
1<SUP>st</SUP> pass
4+(6+4)-2
2<SUP>nd</SUP>
4+(10)-2
3<SUP>rd</SUP>
4+10-2
4<SUP>th</SUP>
14-2
5<SUP>th</SUP>
12 – all done

The byte code engine actually copies over the result of a pass overtop of the original statement (padding with null operators where required) until only a literal value is left. A copy of the statement is evaluated as the statement actually gets destroyed when parsed. This destructive evaluation is another reason why reordering statements is a better idea. A copy is required as the script does support looping.

In other parsers, I converted the statement to a linked list of statements which made the code easier to read, but even slower. The linked list was then evaluated, and evaluated nodes were replaced and deleted as required.

In the function evaluate_int_exp(), you will see this implemented by scanning for higher-precedent statements, then calling evaluate_in_exp() on a sub-part of the statement recursively until the whole statement is reduced to a single literal value.

Data Types

  • Integers, 1, 2, 3
  • Character ‘a’, ‘b’, ‘c’
  • Float, 1.2, 1.0, 9.9
  • Arrays
  • Integer
  • Character "abcded", "\n", "\t"
  • Float
  • Single dimension only
  • pointers? – no, but can assign string to char arrays

Declare like C.

  • int
  • float
  • char
  • int [ ]
  • float [ ]
  • char [ ]

Operators

OperatorPrecedenceInteger FunctionFloat FunctionCharacter String Function
[ ]0Array elementArray elementArray element
=0AssignmentAssignmentAssignment
( )1Change precedenceChange precedenceNA
!2Logical NOTLogical NOTNA
-2NegateNegateNA
~2Bit-wise NOTNANA
*3MultipleMultipleNA
/3DivideDivideNA
%3Modulo DIVModulo DIVNA
+4AddAddConcatenate
-4SubtractSubtractNA
^5Bit-wise exclusive ORNANA
&5Bit-wise ANDNANA
|5Bit-wise ORNANA
==6Test for equalityTest for equalityTest for equality
<=6Test for less than equalTest for less than equalLess than or equal
<6Less thanLess thanLess than
>=6Greater than or equalGreater than or equalGreater than or equal
>6Greater thanGreater thanGreater than
&&6Logical ANDLogical ANDNA
||6Logical ORLogical ORNA
//NACommentCommentComment
atoi2NANAConvert to int
atof2NANAConvert to float
itoa2Convert to strNANA
ftoa2NAConvert to strNA

Flow control

  • while
  • if
  • if else
  • break
  • continue
  • end

Built-In Functions

print

print followed by comma delimited list of statements. E.g.:

print "hi world ", x, " that was x\n";

User Function Definition

func type name {parms};
end

Return type must be int or float. Arrays cannot be parameters. Maximum of 20 parameters. E.g.:

func int fubar {int x, int y, float b};
  body;
end;

Calling a function:

fubar {1, 2, 3, aa};

or

x=fubar {1, 2, 3, aa};

Cannot put a function in a complex statement:

X=fubar(1,2,3, aa} + 99;  // NOT allowed

Scope

  • global – all variables declared outside of functions.
  • local – variables declared within a function.
  • Parameters are treated like local variables.

Directives

Directives must go in comments. They are used to set the size reserved for variables and the call stack.

  • // $globalsize = size in bytes of static global vars (not a stack)
  • // $localsize = size in bytes of local var stack
  • // $parmsize = size in bytes of parameter stack
  • // $stacksize = size in bytes of code stack

Default is 64Kb for each.

Statements

Statements can contain literals, vars, and operators. Statements are always ended by a semi-colon ‘;’.

Statements with functions can have no other terms, i.e., many terms, or a single function call. E.g.:

i=1+j*5;

OR

i=f1{};

Array index may be a statement.

Sample Script

// test file

int x;
int i,j,k;
int ii[100],ij, ik;
float f, g;
float ff[10];
char s [ 100 ] , b, #9; #9; c;
char s2[100];
// ---------------------------------------------------------
func int test{};
  int y;
  print "at top of test\n";
  s[0]='a';
  s[1]=0;
  print "s = ", s, " s[0] = ", s[0], "\n";
  s="I am the cat " + "Hi World\n";
  s2=s;
  print "s = ", s, " s2 = ", s2, " newline\n";
  print "Hi world\n";
  print "Hi world tab\t cr\r world\n";
  x=43;
  y=5;
  ii[34]=x + y;
  ii[x+y]=5;
  print "ii[34] = ", ii[34], "\n";
  print ">x +10 = < ", x + 10, " y="y, "ii[2]=", ii[2], " ii[34] =", ii[34], "\n";
  print "(4-2)*5 = ", (4-2) *5, "\n";
  b='a';
  print "b = ", b, "\n";
  x=5;
  y=10;
  print "x=", x, " y=",y,"\n";
  if 1;
    print "before if break\n";
    break;
    print "after if break\n";
  end
  while x > 0;
    x=x-1;
    print "x=", x, "\n";
    // continue;
    if x > 2;
      print " x > 2\n";
    else
      print " x is not > 2\n";
    end
    print "before the break\n";
    break;
    print "past the break\n";
  end
  print "at the end, past the while\n";
end

// ----------------------------------------------------
func int foo1{};
  set 5;
end
// ----------------------------------------------------
func int foo2{int x, int y};
  set x*y;
end
// ----------------------------------------------------
func int main{};
  // locals
  int y, z;
  float m;
  print "top of main\n";
  print "call test next\n";
  test{};
  m=(1.0 + 0.5) * 20.0;
  print "m=",m,"\n";
  x=foo1{};
  y=10;
  z= (10 + y)*x;
  print "z = ", z, "\n";
  print "x=", x, " y=",y,"\n";
  while x > 0;
    x=x-1;
    if x > 2;
      print " x > 2\n";
    else
      print " x is not > 2\n";
    end
    print "x=", x, "\n";
  end
  print "at the end, past the if/while\n";
  x=5;
  y=6;
  x=foo2{x,y};
  print "x of foo2=", x, "\n";
end

More Information

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here


Written By
Software Developer (Senior)
Canada Canada
Professional Programmer living in Beautiful Vancouver, BC, Canada.

Comments and Discussions

 
QuestionProject Purpose Pin
dseverns527-Jan-19 5:25
professionaldseverns527-Jan-19 5:25 
QuestionGreat project. It works very well. Thanks Pin
muntak6-Oct-18 2:47
muntak6-Oct-18 2:47 
QuestionI use the tool to parse in.txt into in.txt.byte, but i cannot run the bytecode file anyway, what's wrong? Pin
joshua01378-Feb-10 21:36
joshua01378-Feb-10 21:36 
I use the tool to parse the file "in.txt" into file "in.txt.byte", but i cannot run the bytecode file anyway, what's wrong?
QuestionComparing strings Pin
Jostein Kjellesvik24-Oct-08 0:42
Jostein Kjellesvik24-Oct-08 0:42 
Questionatoi syntax Pin
Jostein Kjellesvik21-Oct-08 23:02
Jostein Kjellesvik21-Oct-08 23:02 
GeneralVisual BNF parser generator Pin
AdamSlosarski13-Mar-07 18:14
AdamSlosarski13-Mar-07 18:14 
GeneralRun-Time Check Failure #3 Pin
Fanzhe Cui3-Apr-06 5:59
Fanzhe Cui3-Apr-06 5:59 
GeneralRe: Run-Time Check Failure #3 Pin
arussell3-Apr-06 16:53
professionalarussell3-Apr-06 16:53 
GeneralRe: Run-Time Check Failure #3 Pin
Fanzhe Cui4-Apr-06 3:57
Fanzhe Cui4-Apr-06 3:57 
GeneralRe: Run-Time Check Failure #3 Pin
Paul Singh18-Dec-06 6:03
Paul Singh18-Dec-06 6:03 
GeneralParse error Pin
Anonymous20-Mar-05 21:09
Anonymous20-Mar-05 21:09 
GeneralRe: Parse error Pin
Anonymous21-Mar-05 1:05
Anonymous21-Mar-05 1:05 
QuestionRe: Parse error Pin
Winstonio2-Jul-07 6:21
Winstonio2-Jul-07 6:21 
AnswerRe: Parse error Pin
Member 139774062-Jan-19 19:59
Member 139774062-Jan-19 19:59 
Generaluninitialized variable causes error in release build Pin
Lars Wadefalk8-Dec-04 5:25
Lars Wadefalk8-Dec-04 5:25 
GeneralRe: uninitialized variable causes error in release build Pin
arussell8-Dec-04 15:15
professionalarussell8-Dec-04 15:15 
GeneralRe: uninitialized variable causes error in release build Pin
Lars Wadefalk8-Dec-04 20:42
Lars Wadefalk8-Dec-04 20:42 
GeneralRe: uninitialized variable causes error in release build Pin
arussell9-Dec-04 15:00
professionalarussell9-Dec-04 15:00 
GeneralRe: uninitialized variable causes error in release build Pin
notalreadyinuse25-Sep-07 10:00
notalreadyinuse25-Sep-07 10:00 
GeneralThis is the third parser I have written, and it sucks somewhat less than the first two Pin
Anonymous1-Dec-04 8:55
Anonymous1-Dec-04 8:55 
GeneralRe: This is the third parser I have written, and it sucks somewhat less than the first two Pin
azhu_ram26-Mar-07 23:40
azhu_ram26-Mar-07 23:40 
GeneralRe: This is the third parser I have written, and it sucks somewhat less than the first two Pin
Chaz Zeromus21-May-07 14:32
Chaz Zeromus21-May-07 14:32 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.