
Introduction
To be a great programmer you should learn how compilers work and try writing one: it will make many algorithms easier for you, no matter how difficult they seem at first.
Writing a compiler is also interesting because it teaches you what a compiler really does.
In this article I show the front end of a compiler: how to generate a syntax tree, represented as an array, and how to generate three-address code from it.
Background
A compiler is a program that reads a program written in a source language and translates it into an equivalent program in a target language.
A compiler consists of several phases, each of which passes its output to the next phase.

The design of a compiler is often partitioned into a Front End, which deals only with language-specific issues, and a Back End, which deals only with machine-specific issues.
In this project I discuss only the Front End.
I implemented a predictive parser that generates a syntax tree and intermediate code (three-address code) for an assignment statement.
An assignment statement has the form identifier = expression, for example x = a + 5 or a = -x / y, and so on.
1 – Lexical Analysis phase ( scanner )
The lexical analyzer groups characters into lexical units, or tokens (identifier, number, etc.). The input to the lexical phase is a character stream and the output is a stream of tokens; white space (blanks, tabs, new lines) is scanned out.
The scanner can be implemented as a finite state machine.
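The full scanner is listed below; as a side illustration of the finite-state-machine view, here is a minimal, self-contained sketch (not part of the project's code) that classifies a single lexeme by moving through states as each character is read:

enum State { Start, InNumber, InIdentifier, Other }

static string Classify(string lexeme)
{
    State state = State.Start;
    foreach (char c in lexeme)
    {
        switch (state)
        {
            case State.Start:
                state = char.IsDigit(c) ? State.InNumber
                      : char.IsLetter(c) ? State.InIdentifier
                      : State.Other;
                break;
            case State.InNumber:                       // digits keep us in InNumber
                if (!char.IsDigit(c)) state = State.Other;
                break;
            case State.InIdentifier:                   // letters or digits keep us in InIdentifier
                if (!char.IsLetterOrDigit(c)) state = State.Other;
                break;
        }
    }
    return state == State.InNumber ? "NUM"
         : state == State.InIdentifier ? "ID"
         : "OTHER";
}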

This lexical analyzer is implemented as a method that the parser calls to obtain the next token. When the lexical analyzer detects an identifier in the input, it enters the identifier into a data structure called the symbol table. The symbol table contains a record for each identifier, with fields for the attributes of that identifier. The lexical analyzer first looks the identifier up in the symbol table and inserts it only if it is not already there.
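The code below stores identifiers in an array of SymbT records; the SymbT type itself is not listed in the article, so here is a minimal sketch, assuming only the two fields the scanner uses (declared as a struct so the array slots need no separate allocation):

public struct SymbT
{
    public string lexeme;   // the identifier's text, e.g. "x"
    public int code;        // its token code (ID)
}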

Implementation of Lexical Analyzer ( scanner )
TokenUnit nextToken()
This is the lexical analyzer method; it returns the next token to the caller (the parser).
public class TokenUnit
{
    public int attribute;   // symbol-table index for an identifier, value for a number
    public int code;        // token code: ID, NUM, or the operator character itself
}
string Lexbuf;                          // buffer that collects the current lexeme
string inputStream;                     // the source text being scanned
TokenUnit token = new TokenUnit();      // current look-ahead token
SymbT[] SymbolTable = new SymbT[20];    // small fixed-size symbol table
private TokenUnit nextToken()
{
    Lexbuf = null;
    TokenUnit token = new TokenUnit();
    while (index < inputStream.Length)
    {
        t = (int)inputStream[index];
        if (t == ' ' || t == '\t' || t == '\n' || t == '\r')
        {
            // skip white space
            index++;
            continue;
        }
        else if (isdigit(t))
        {
            // collect a number
            while (isdigit(t))
            {
                Lexbuf += inputStream[index];
                index++;
                if (index == inputStream.Length) break;
                t = (int)inputStream[index];
            }
            token.attribute = Int32.Parse(Lexbuf);
            token.code = NUM;
            return token;
        }
        else if (isletter(t))
        {
            // collect an identifier: a letter followed by letters or digits
            while (isletter(t) || isdigit(t))
            {
                Lexbuf += inputStream[index];
                index++;
                if (index >= inputStream.Length) break;
                t = (int)inputStream[index];
            }
            // enter the identifier into the symbol table only if it is not already there
            int p = LookUp(Lexbuf);
            if (p == 0) Pst = Insert(Lexbuf, ID);
            else Pst = p;
            token.attribute = Pst;
            token.code = SymbolTable[Pst].code;
            return token;
        }
        else if (t == '=' || t == '+' || t == '-' || t == '*' || t == '/' || t == '(' || t == ')')
        {
            // operators and parentheses: the token code is the character itself
            index++;
            token.attribute = 0;
            token.code = t;
            return token;
        }
        else
        {
            // any other character is a lexical error
            Lexerror = true;
            error();
            token.attribute = NONE;
            token.code = t;
            return token;
        }
    }
    return token;   // end of input: the default token code marks that no tokens are left
}
private int LookUp(string lexe)
{
    // linear search of the symbol table; entry 0 is never used,
    // so returning 0 means "not found"
    for (int i = 0; i < 20; i++)
    {
        if (lexe == SymbolTable[i].lexeme)
            return i;
    }
    return 0;
}
private int Insert(string lexe, int tcode)
{
    // Pst is pre-incremented, so the first identifier is stored at index 1
    Pst++;
    SymbolTable[Pst].lexeme = lexe;
    SymbolTable[Pst].code = tcode;
    return Pst;
}
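The scanner above also uses a few declarations that are not shown in the article; a sketch of plausible definitions (the exact token-code values and the error() body are assumptions; any distinct integers above the ASCII range will do for the codes):

const int ID = 256, NUM = 257, NONE = 258;    // token codes for identifiers, numbers, and errors

int index = 0;            // current position in inputStream
int t;                    // current input character, as an int
int Pst = 0;              // index of the last symbol-table entry used
bool Lexerror = false;    // set when a lexical error is found
bool Synerror = false;    // set when a syntax error is found

private bool isdigit(int c)  { return c >= '0' && c <= '9'; }
private bool isletter(int c) { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }

private void error()
{
    Synerror = true;      // assumed: record the error so emit() stops writing output
    System.Console.WriteLine("error in the input");
}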
2 – Syntax Analysis phase ( Parser )
The parser recognizes a sentence in terms of a grammar to check whether it is grammatically well formed. It groups tokens into syntactic units. The output of the parser is a parse-tree representation of the program; the nodes of the parse tree are constructed using the grammar, which is a set of rules governing the structure of, and interdependencies among, the tokens.
Context-free grammars are used to define the program structure recognized by a parser; a context-free grammar is a formalization of recursive rules that can be used to guide syntax analysis.
The grammar is specified by productions. The grammar for the assignment statement must be free of defects such as ambiguity and left recursion, and must be left-factored; for example, the left-recursive rule E --> E + T | T is rewritten below as E --> T E' with E' --> + T E' | ε.
Grammar for assignment statement :
S --> id = E
E --> T E'
E' --> + T E' | - T E' | ε
T --> F T'
T' --> * F T' | / F T' | ε
F --> id | num | - F | ( E )
Components of a context-free grammar :
Productions : rules of the form "left side --> right side"
Non-terminals : { S , E , E', T, T', F }
Terminals : { id , = , + , - , * , / , num , ( , ) } ; ε denotes the empty string
Start symbol : one designated non-terminal ; here S is the start symbol
Derivation : a sequence of replacements, according to the grammar, that produces the given input
For example, a = b * - c derives as follows :
S --> id = E
E --> T E'
T --> F T'
F --> id
T' --> * F T'
F --> - F
F --> id
T' --> ε
E' --> ε
Parse tree :
A parse tree shows how the start symbol of the grammar derives a sentence of the input.

Syntax tree:
A syntax tree is a condensed representation of the parse tree in which the operators appear as nodes and the operands of an operator are the children of its node. Each node is represented as a record with a field for its operator (the label) and additional fields for pointers to its children. Nodes are allocated from an array of records, and the array index (the position of the node) serves as the pointer to the node. All the nodes in the syntax tree can be visited by following these pointers, starting from the root, which sits at the last position.
A node consists of a label, a left pointer, and a right pointer, as for a binary operator; some nodes, such as the unary minus, have only one pointer. When a leaf is constructed, its label is set to represent the token of that leaf, such as an identifier or a number.
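The parser below builds these nodes through a method called mKnode, which is not listed in the article; a minimal sketch of what the node record, the node array, and mKnode might look like (the type and field names are assumptions; in the two-argument overload the second argument is an attribute such as a symbol-table index or a value for a leaf, and the child pointer for unary minus):

public class TreeNode
{
    public int label;    // operator token code, or ID / NUM for a leaf
    public int left;     // index of the left child (or the leaf's attribute), -1 if unused
    public int right;    // index of the right child, -1 if unused
}

TreeNode[] Tree = new TreeNode[100];    // the syntax tree as an array of records
int nextNode = 0;                       // position of the next free node

// binary node: an operator with two children
private int mKnode(int label, int left, int right)
{
    Tree[nextNode] = new TreeNode { label = label, left = left, right = right };
    return nextNode++;                  // the array index is the pointer to the node
}

// leaf node (id or num), or unary minus with a single child
private int mKnode(int label, int leftOrAttribute)
{
    Tree[nextNode] = new TreeNode { label = label, left = leftOrAttribute, right = -1 };
    return nextNode++;
}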

To construct the syntax tree we add semantic actions to the grammar so that the tree can be represented as an array of records. A grammar whose productions carry such semantic actions, executed at the end of each production, is called a syntax-directed definition; the semantic actions are used to write the required output. The parser traverses the parse tree depth-first and constructs the syntax tree during parsing, for successive statements, according to the syntax-directed definition.



Note that at the beginning two constant nodes must be created: because E' can derive ε, a node holding the value zero stands in for the missing addition/subtraction operand, and because T' can derive ε, a node holding the value one stands in for the missing multiplication/division operand.
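A sketch of how these two constant nodes could be created once, before parsing starts, using the article's mKnode, newTemp and emit (the field names NodeZeroPtr and NodeOnePtr follow the parser code below); this is also where the temp0=0 and temp1=1 lines of the three-address code come from:

NodeZeroPtr = mKnode(NUM, 0);         // node 0 holds the value 0, used when E' --> ε
emit(newTemp(), "=", "0", "", "");    // temp0=0
NodeOnePtr = mKnode(NUM, 1);          // node 1 holds the value 1, used when T' --> ε
emit(newTemp(), "=", "1", "", "");    // temp1=1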
The syntax tree represented as an array of records :

The syntax tree is built bottom-up :

3 – The Semantic Analysis phase
The semantic analysis phase analyzes the parse tree for context-sensitive information, often called the static semantics; type checking is a very important part of semantic analysis. The output of the semantic analysis phase is an annotated parse tree, that is, a parse tree augmented with the attributes computed by the semantic actions.
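This project does not implement type checking, since the grammar has a single value type, but as an illustration only, a check for a binary operator could look like the hypothetical helper below (the type codes and the idea of attaching a type to each node are assumptions, not part of the article's code):

// hypothetical: not part of the article's parser
const int TYPE_INT = 1, TYPE_ERROR = 0;

private int CheckBinaryOp(int leftType, int rightType)
{
    // the operands of '+', '-', '*', '/' must have the same type
    if (leftType != rightType)
    {
        error();
        return TYPE_ERROR;
    }
    return leftType;    // the result has the operands' common type
}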
4 – Intermediate Code phase
Three-address code is similar to assembly code; it is a linear representation of the syntax tree.
The three-address code for the last example ( a = b * - c ) is :
temp0=0
temp1=1
temp2=b
temp3=c
temp4=-temp3
temp5=temp1*temp4
temp6=temp2*temp5
temp7=temp6+temp0
Now, for each non-terminal of the grammar we create a method, and for each terminal we use the match() method. match() checks the current token: if the look-ahead token matches the expected one it reads the next input token, and it calls the error() method otherwise.
Note that tokens are represented by integer numbers called token codes.
Remember that this parser generates both the syntax tree and the three-address code.
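Like the scanner, the parser relies on a few fields that are not listed in the article; a sketch of plausible declarations (the names follow the code below, while the initial value of count and the StreamWriter target are assumptions):

int lookahead;                  // token code of the current look-ahead token
int count = -1;                 // newTemp() pre-increments, so the first temporary is temp0
int NodeZeroPtr, NodeOnePtr;    // indices of the constant nodes holding 0 and 1

// "place" attributes: the name of the temporary that holds each non-terminal's value
string Splace, Eplace, EdashPlace, Tplace, TdashPlace, Fplace, F1Place;

System.IO.StreamWriter sw;      // where emit() writes the three-address code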
private int stmt()
{
    // S --> id = E
    int sPtr, ePtr, idPtr, idPlace;
    lookahead = token.code;
    if (lookahead == ID)
    {
        idPlace = token.attribute;                 // symbol-table index of the target id
        match(ID);
        match('=');
        ePtr = expr();
        idPtr = mKnode(ID, idPlace);               // leaf node for the identifier
        sPtr = mKnode('=', idPtr, ePtr);           // root node of the assignment
        Splace = newTemp();
        emit(Splace, "=", SymbolTable[idPlace].lexeme, "", "");   // temporary mirroring the id leaf
        emit(SymbolTable[idPlace].lexeme, "=", Eplace, "", "");   // the assignment itself
        return sPtr;
    }
    else error();
    return -1;
}
private int expr()
{
    // E --> T E'
    int ePtr, edPtr, tPtr;
    lookahead = token.code;
    if (lookahead == ID || lookahead == NUM || lookahead == '(' || lookahead == '-')
    {
        tPtr = term();
        edPtr = edash();
        ePtr = mKnode('+', tPtr, edPtr);           // E = T + E'
        Eplace = newTemp();
        Tplace = "temp" + tPtr;                    // the node index doubles as the temp number
        EdashPlace = "temp" + edPtr;
        emit(Eplace, "=", Tplace, "+", EdashPlace);
        return ePtr;
    }
    else error();
    return -1;
}
private int edash()
{
    // E' --> + T E' | - T E' | ε
    int edPtr, ed1Ptr, tPtr;
    lookahead = token.code;
    string Edash1place;
    if (lookahead == '+')
    {
        match('+');
        tPtr = term();
        ed1Ptr = edash();
        edPtr = mKnode('+', ed1Ptr, tPtr);
        Edash1place = "temp" + ed1Ptr;
        Tplace = "temp" + tPtr;
        EdashPlace = newTemp();
        emit(EdashPlace, "=", Edash1place, "+", Tplace);
    }
    else if (lookahead == '-')
    {
        match('-');
        tPtr = term();
        ed1Ptr = edash();
        edPtr = mKnode('-', ed1Ptr, tPtr);
        Edash1place = "temp" + ed1Ptr;
        Tplace = "temp" + tPtr;
        EdashPlace = newTemp();
        emit(EdashPlace, "=", Edash1place, "-", Tplace);
    }
    else
    {
        // E' --> ε : use the constant zero node, so E = T + 0
        edPtr = NodeZeroPtr;
        Edash1place = "temp0";
    }
    return edPtr;
}
private int term()
{
    // T --> F T'
    int tPtr, tdPtr, fPtr;
    lookahead = token.code;
    if (lookahead == ID || lookahead == NUM || lookahead == '(' || lookahead == '-')
    {
        fPtr = factor();
        tdPtr = tdash();
        tPtr = mKnode('*', fPtr, tdPtr);           // T = F * T'
        Fplace = "temp" + fPtr;
        TdashPlace = "temp" + tdPtr;
        Tplace = newTemp();
        emit(Tplace, "=", Fplace, "*", TdashPlace);
        return tPtr;
    }
    else error();
    return -1;
}
private int tdash()
{
    // T' --> * F T' | / F T' | ε
    int tdPtr, td1Ptr, fPtr;
    lookahead = token.code;
    string Tdash1Place;
    if (lookahead == '*')
    {
        match('*');
        fPtr = factor();
        td1Ptr = tdash();
        tdPtr = mKnode('*', td1Ptr, fPtr);
        Tdash1Place = "temp" + td1Ptr;
        Fplace = "temp" + fPtr;
        TdashPlace = newTemp();
        emit(TdashPlace, "=", Tdash1Place, "*", Fplace);
        return tdPtr;
    }
    else if (lookahead == '/')
    {
        match('/');
        fPtr = factor();
        td1Ptr = tdash();
        tdPtr = mKnode('/', td1Ptr, fPtr);
        Tdash1Place = "temp" + td1Ptr;
        Fplace = "temp" + fPtr;
        TdashPlace = newTemp();
        emit(TdashPlace, "=", Tdash1Place, "/", Fplace);
        return tdPtr;
    }
    else
    {
        // T' --> ε : use the constant one node, so T = F * 1
        tdPtr = NodeOnePtr;
        TdashPlace = "temp1";
    }
    return tdPtr;
}
private int factor()
{
    // F --> id | num | - F | ( E )
    int fPtr = -1, f1Ptr = -1;
    int ePtr, idPlace, numValue;
    lookahead = token.code;
    if (lookahead == ID)
    {
        idPlace = token.attribute;
        match(ID);
        fPtr = mKnode(ID, idPlace);                // leaf node for the identifier
        Fplace = newTemp();
        emit(Fplace, "=", SymbolTable[idPlace].lexeme, "", "");
    }
    else if (lookahead == NUM)
    {
        numValue = token.attribute;
        match(NUM);
        fPtr = mKnode(NUM, numValue);              // leaf node for the number
        Fplace = newTemp();
        emit(Fplace, "=", numValue.ToString(), "", "");
    }
    else if (lookahead == '(')
    {
        match('(');
        ePtr = expr();
        match(')');
        fPtr = ePtr;                               // a parenthesized expression is just the expression itself
        Fplace = Eplace;
    }
    else if (lookahead == '-')
    {
        match('-');
        f1Ptr = factor();
        fPtr = mKnode('-', f1Ptr);                 // unary minus node with one child
        F1Place = "temp" + f1Ptr;
        Fplace = newTemp();
        emit(Fplace, "=", "-", F1Place, "");
    }
    else error();
    return fPtr;
}
private void match(int t)
{
    // if the look-ahead token matches, read the next token; otherwise report an error
    if (token.code == t)
        token = nextToken();
    else error();
}
public string newTemp()
{
    // generate a new temporary name: temp0, temp1, temp2, ...
    count++;
    return "temp" + count;
}
public void emit(string str1, string str2, string str3, string str4, string str5)
{
    // write one line of three-address code, unless an error has been detected
    if (Lexerror || Synerror) return;
    sw.WriteLine(str1 + str2 + str3 + str4 + str5);
}
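Finally, a hypothetical Run() method showing how the pieces might be wired together inside the same class for a single statement (the method itself, the output file name, and the initialization order are assumptions, not part of the article's code):

public void Run(string source)
{
    inputStream = source;
    sw = new System.IO.StreamWriter("three_address_code.txt");

    // create the constant zero and one nodes (temp0 and temp1), as described above
    NodeZeroPtr = mKnode(NUM, 0);
    emit(newTemp(), "=", "0", "", "");
    NodeOnePtr = mKnode(NUM, 1);
    emit(newTemp(), "=", "1", "", "");

    token = nextToken();    // prime the look-ahead token
    int root = stmt();      // parse, build the syntax tree, emit the three-address code

    sw.Close();
    System.Console.WriteLine("syntax tree root at node " + root);
}

For example, Run("a = b * -c") would build the syntax tree for that statement and write its three-address code to the output file.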
I am a software developer in Saudi Arabia. I'm interested in learning new technologies, including .NET, Natural Language Processing, System Programming, Network Security, and website development. My favorite programming language is C#.
In my free time I enjoy reading books in English and watching the news.