Building a Programming Language: Part II (Adding Conditions, Loop and Blocks to BrainLess)

BrainlessLabs.com

4.75/5 (14 votes)

Feb 4, 2010

LGPL3

11 min read

60563

1670

In this article, we will discuss implementing conditional statements, loops and blocks.

Clone at github

1. Introduction

I will divide this second part into Part II - A, and Part II - B. In this article, we will talk more about the implementation of the BrainLess language compiler/interpreter. We will do some modification to the BrainLess Virtual Machine. We will also talk more about the lexer and parser part of the compiler. Basically, we have added some special slots to the BrainLess VM. These we will use while doing comparisons, function calls, etc.

In Part II - B, we will implement functions in BrainLess.

2. Background

To know how BrainLess was created, you can refer to my first article.

3. Using the Code

3.1 Files

Here with this article, I am attaching the second version of the BrainLess compiler, interpreter.

A brief description of the code is given below. This will help you dive into the code and make your own changes.

A. Helper.cpp, and Helper.h

These files contain and will contain the helper functions that will be helpful in creating BrainLess. Currently, they contain the class FileRead.

This class handles reading from the input script file.

B. constNDefs.h

This file contains constants and definitions that will be used throughout the code.

C. MainDriver.cpp

This file contains the int main() for the program.

D. Lexer.cpp, and Lexer.H

These files contain the lexer.

class TokenAttrib

Tokens can have attributes, like a group of characters like "123" is a single number with the
value attribute of 123. This value is held by this class.

class Lex

This is the lexer class. This class will get characters one by one from the source file and then
categorize it according to any of the enum enmTokens groups. Ex: "-*" are two characters but they form a single token. This token will be recognized by this class, and returned.

enum enmTokens

This is the enum holding the list of all the tokens. There are ways to organize this token. One can be used by Lexer and another enum may be created to be addressed by the Parser. But we will use only one for both Lexer and Parser. We will talk more about this later in this article when we introduce some more terms.

E. Interpreter.cpp, and Interpreter.H

This is the interpreter or the compiler for the BrainLess language. The class responsible for doing this is class Interpreter.

The compilation/interpretation starts from the entry point function.

void Interpreter::interpreteScript()

The next important function is void Interpreter::Sentence().

Basically interpreteScript() is a loop pumping for the next token and then calling the Sentence()which in calls the appropriate function based on the token, simply speaking this will call the real interpreter for the statement in BrainLess.

F. TapeMachine.h

This file contains the class that simulates the BrainLess virtual machine, or you can say this is the virtual machine.

template< typename storageType, typename IO, ulong iTapeSize>
class TapeMachine
{
...
};

Here it is designed as a template to support the plug and play architecture. Virtually you can think that the virtual machine has 2 parts.

Type of storage, which tells what kind of memory each slot in the VM has. If you want to change it, just changing the storageType will be sufficient. This is easy to do.

IO is the IO interface for the VM. Currently it supports only console. But if need be, a different IO class can be easily interfaced.

G. VMInterface.h

The previously discussed IO class is an instance of the class IOInterface described in this file. Right now, it supports console application, but it can be extended to support UI based IO interface.

H. TapeMachine.cpp

This file is removed. All of them are added to the TapeMachine.h file. Now this is a template class.

3.2 Compiling

To compile the code, just open the BrainLess.sln file in Visual Studio 2008 express edition and compile it. This is not dependent on any other library. Compiling on any other platform than windows is also easy. Just keep all the files in one folder and write a small make-file to compile them and link it. A modern C++ compiler is needed.

3.3 Example

An example file, m.bl is supplied with the code. Execute it.

4. Parser, and Lexer

Let us talk more about the lexer and the parser here in details. We will introduce some more concepts out here.

4.1 Recap

First let us recap some of the concepts we introduced in part 1 of this article.

Symbol

This is an atomic entity, represented by a character, or sometimes by a reserved or keyword.

Ex: + , ; EOF. The FileReader class reads all the atomic entities for us.

Alphabet

"A" set of symbols/tokens. But it should be non empty, and finite.

Example: Set of English characters for English language, for C++.

+ - / * [a-z,A-Z,0-9] { } ( )"space"

For BrainLess, we have the following characters set:

A = {a...z, A...Z, *, +, -, [, ], (, ), 0...9, !, >, <}

Here we do not have "space", EOF, '\0', '\t', '\b' in this set "A". We do not have "{" and "}" also.

Phrase/word/string

This is defined over an alphabet set "A". We take symbols from the alphabet set, and group them using certain rules to get a string.

Example: In C, the identifier name cannot start with a number. This is a rule.

The Lex class recognizes these, and returns them to the Interpreter class.

For us in BrainLess "-*", "-- " etc. are words for our language.

Language

The set of all words make up a language.

Basically Interpreter::Sentence()recognizes all that is in the set language for us. For the current BrainLess compiler, if it is not included in Sentence() then it is not in the language.

Grammar

To get the meaning from a certain sequence of words, we need to have some rules. These rules are given by the grammar. This is a set of rules, that all need to follow to make something meaningful. We will call this set of rules as “G” for grammar.

Example: For BrainLess, the symbols eIF eLeftParen eEquals eRightParen eLeftSqrParen <Any other valid sentence/ sequence of words> eRightSqrParen eFI.

Each of these tokens are valid. But if we have something like:

eIF eFI eLeftParen eEquals eRightParen eLeftSqrParen <Any other valid sentence/ sequence of words> eRightSqrParen

This is not a valid sentence. This has no meaning.

4.2 Phrase Structure, and Lexical Structure

4.2.1 Lexical Structure

To build up words for a language, we follow certain rules. These rules structures are called the lexical structure for the language.

Example: Read and store a character is given by ">>". Numbers are given by any sequence of characters from the set {0-9} placed adjacent for BrainLess. Each of these are words for BrainLess. But describing these in English has few demerits, like it is verbose and second it will be confusing. So for clarity, we can use some formal notations.

Let us see how we match the word ">>". It is basically a string pattern at its core, so we need some pattern matching to identify this. A tool for pattern matching is Regular Expression. A regular expression specifies the form that a string may take by using the symbols from the alphabet "A". But to impose structure, it uses some other symbols also. These symbols are called as Meta Symbols. Regular expressions are of practical interest in programming language translation because they can be used to specify the structure of the tokens

Some of these are:

Concatenation - Symbols or strings may be concatenated by writing them next to one another, or by using the meta-symbol · (dot) if further clarity is required.
Alternation - choice between two symbols "a" and "b" is indicated by separating them by the meta-symbol | (bar).
Repetition - A symbol "a" followed by the meta-symbol * (star) indicates that a sequence of zero or more occurrences of "a" is allowable. A symbol "a" followed by the meta-symbol + indicates one or more occurrence. In other words a+ = a.a*.
Grouping - A group of symbols may be surrounded by the meta-symbols ( and ) (parentheses).

For us, the class Lex recognizes the regular expressions for us. So the word ">>" can be defined with the regular expression as:

>.>

4.2.2 Phrase Structure

The phrase structure of a language is the way in which the individual tokens are combined to get a meaningful statement for the language. This is basically handled by the grammar for the language. This is recognized by the class Interpreter.

5. Changes to BrainLess VM

The BrainLess VM is changes. This is given in the below figure:

Here a new register is introduced called m_iSpecialRegStart. This register will hold the index 60. The index 60 will store the value that we get by comparing the current head value and the current head + 1 value.

// The token "\<>"
void Interpreter::compareWithNext()
{
    if(m_tapeVirtualMachine.getCurrentHeadValue() < 
		m_tapeVirtualMachine.getNthPosValFrmHead(1))
    {/*If m_iTapeStore[m_iHeadPos] < If m_iTapeStore[m_iHeadPos  + 1]*/
        m_tapeVirtualMachine.setFlagRegister(HEX_1);
    }
    else if(m_tapeVirtualMachine.getCurrentHeadValue() > 
		m_tapeVirtualMachine.getNthPosValFrmHead(1))
    {/*If m_iTapeStore[m_iHeadPos] > If m_iTapeStore[m_iHeadPos  + 1]*/
        m_tapeVirtualMachine.setFlagRegister(HEX_2);
    }
    else
    {/*If m_iTapeStore[m_iHeadPos] == If m_iTapeStore[m_iHeadPos  + 1]*/
       m_tapeVirtualMachine.setFlagRegister(HEX_3);
    }
}

6. Adding Blocks to BrainLess

6.1 What is a Block?

Computation is basically a process of controlling an abstract entity. This abstract entity is the program. We cannot touch the program or feel it. But what the heck!! It does our work, so who cares. Computation is basically instructing this abstract entity to do some work. This entity can get some work done or can manipulate another abstract quantity called data to fill us with some instruction. Now there are some actions or some instructions we perform now and then, again and again. This will be highly inconvenient to write the instructions again and again. If there is some error in the instructions, then we need to change it in every place. To make these things convenient, we will create another special abstract entity. This will be called a Block. A Block will be defined as follow:

#B(<Block Name>:<Block Body>)

When the interpreter sees the #B, it will generate the token symbol eBlock. On seeing this, the interpreter goes to the void Interpreter::Block().

The action that the interpreter takes on seeing this eBlock token depends on the next bunch of tokens. If so we get the next token, if the next token is ":", then we know that this is a function definition. So what ever follows till the ")" is encountered is code that needs to be executed.

On the other hand, if the token sequence is like eBlock eLeftParen eCharVal eColon eRightParen, then we go to the desired function and execute the list of statements.

Collapse

So here is the algorithm:

Token loop Start:

If: token==eBlock

Go to Block method.

Token loop End

Block Method Start

if: token sequence "eBlock eLeftParen eCharVal eColon"

Store the "Name:Token index" in a map.

Eat up all character till we find a matching ")"

Else if: we get the token sequence "eBlock eLeftParen eCharVal eRightParen"

Save the index to where to return after the block execution.

Search for the name in the map, if found then get the index of the code index.

Execute the code block. (This code block can even contain more #B calls)

Return to the position from where the block was called.

Block Method End

A block can contain more blocks, IF...FI Clauses, DO...DONE Loops.

The source code contains appropriate comments.

7. BrainLess Snail Programs

A small program to calculate the sum of N numbers in BrainLess Snail.

The N is taken as user input.

![5]
>>
\>
DO
*
++
<
\>
DONE
![3]

\=
DO
*
--
\>
<
+*
DONE

![5]

8. Conclusion

Till now, we have discussed a little theory and implemented a good deal of the programing language. But this is not the end by any means. Next series will focus more on the implementation of functions. But then, after that again I will switch on to the theory part, but theory alone is very boring for me, so we will also keep on implementing few tools and gadgets along side. Please provide your valuable comments. Let me know in case of any doubts.

There no value of space in BrainLess. This means #B is same as # B. And also IFFI is a valid construct. Here the lexer operates in this way. It goes for the first match.

History

BrainLess Snail V 0.5 - Fixed bug with the DO Loop structure
BrainLess V 0.3 - Adding Conditional statements, and Loop. Added Blocks,
BrainLess V 0.4 - This has a logger also, the logger will record the time take for each Interpreter function.

Added Blocks : #B(Block Name : Block Body) - Declare a block
#B(Block Name) call a block
Added the operators "\<", "\=", "\>", "\<>"
Added IF(condition) Statements FI statement
Added DO Statement DONE
BrainLess VM - Added special slots for holding special values like the comparison result.

I have moved the code to Google code. Further updates will be available there.

http://code.google.com/p/brainless-labs/

We will improve the compilation.

From next time on, we will move into using more mature C++ libraries like Boost library, etc. Whenever I will be using it, I will give enough details. We will use these because we have to use industry strength mature helper libraries to make our task easier and get more knowledge into the real compiling essence instead of reinventing all the wheels. Rather we will only reinvent the compilation/interpretation wheel.