Introduction
In this article, a translator is written which takes a program written in BASIC and converts it to JavaScript. The purpose of this project is not just for the nostalgia of being able to write programs in BASIC again, but also to demonstrate how one can go about writing a compiler in C#.
Being able to define a language and write a compiler for it will not only give you insight into how the programming languages you use work, but it's useful in a number of real-world scenarios, such as writing a custom rules engine for a business application, allowing things such as levels in a game to be describable in a simple language, writing IDE add-ins, or writing a domain-specific language.
Demonstration
There are two demonstrations available for this project. In the first one, you can play a little game I wrote, and in the second, you can write your own BASIC program, have it converted to JavaScript, and then run it in your browser.
Background
Many programmers who grew up in the 80s and 90s learnt programming in a language called BASIC, whether it was using a Commodore 64/VIC-20/etc. plugged into their TV, or programming in GW-BASIC on their 286. For my part, I was given a Dick Smith VZ200 at the age of 8, which came with a stunning 8KB of RAM and loaded programs from cassette tapes. The games it came with very quickly/instantly became horribly boring, but the ability to program the machine made it infinitely interesting. The first BASIC program I and most people learnt generally looked something like this:
10 print "Hello world"
20 goto 10
This would generate something like the following output (and would never stop):
Side note: "Hello world" would often be replaced by something a little more whimsical, such as "You suck", especially when the 10-year-old programmer was faced with a bank of Commodore 64s and Amigas at the local electronics shop.
The main points to note from such a simple basic program is that each line is numbered, flow is controlled using harmful "goto
" statements, and there is a print
statement for printing to the console. There are no classes, and no type declarations. A variable is appended with the '$' character, and there are other statements such as input
to get user input, and if-then-else
statements, which cannot span more than one line. Comments are created by starting the line with "REM
" (short for "remark", not a nod to a certain musical group, nor a phenomenon which we all regularly experience several times a night). Using this information, we can write the following simple program:
10 print "Hello"
20 rem Get the user to enter their name, and hold the result in the variable a$
30 input "Please enter your name" a$
40 rem Quit if the user entered "quit"
50 if a$ = "quit" then goto 80
60 print "Hello " a$
70 goto 30
80 print "Goodbye"
What this program does should be pretty obvious, but feel free to see it in action here. Using the constructs here, and a few other statements such as for
and while
loops, and the gosub
/return
statements, reasonably sophisticated text-based games could be made. The turning point for me, though, was when I found out about the LOCATE
statement, which allows you to place the cursor anywhere on the screen (as opposed to the next line, when using the print
statement), and inkey$
, which allows you to get the last key the user pressed:
10 locate 10,4
20 print "Down here"
30 locate 2, 10
40 print "Up here"
Run it
10 a$ = inkey$
20 if a$ <> "" then print "You pressed " a$
30 goto 10
Run it
With this knowledge, I was able to draw a spaceship (>=-) at a certain position, and when the 'A' button was pressed, remove the ship from its current location by printing three spaces over the top of where it was, LOCATE
ing the cursor one row above the last position, and then redrawing the ship. Same-deal-different-direction for the 'Z' key. Animation was created!
I drew a second ship, and using for
loops, allowed them to fire bullets (full-stops: '.') at each other. I wrote this game, calling it "Space War", when I was nine, and spent hours on it. Unfortunately, the cassette player on my computer could only load programs; not save them. So, at the end of the day, I had to eventually turn off my computer, and it was lost forever. I later remade more elaborate versions using QBASIC when I had a PC, and fairly recently - for reasons I'm sure only a psychologist versed in computer science can answer - I wanted to recreate it by writing it in old-fashioned BASIC and converting the code to JavaScript. When I read about the Irony .NET Compiler Construction Kit, it was like seeing a vision: fate could not be fought, and the JSBasic compiler project was born.
The implementation of this project can be nicely split into two halves: compiling a BASIC program, and generating JavaScript.
Compiling a BASIC Program
In order to read BASIC source code and generate script from that, the source code must be parsed to pick up line numbers, keywords, variable names, strings, loops, if
statements, and everything else in the language. Once parsed, an abstract syntax tree is constructed, which is to say the source code is converted into a tree, with the root node representing the whole program. You can think of each line as a child of the root node, where each line has more child nodes (for example, an if
statement might be a child node of a line, and the condition of the if
statement and the then
part of the if
statement are then children of the if
statement node).
Once the abstract syntax tree is built, the tree can be traversed and code can be generated.
There are a lot of tools available for this, but to me, it has always seemed a little bit too much work to get the tools together in a .NET project and generate a parser and do whatever else is needed. And then, one day I read about Irony on The Code Project, which allows you to write the grammar of the language in C#, and then it takes care of parsing and building the tree. I immediately saw the benefit in this, and realised that this is the sort of project which could enable the implementation of the Interpreter pattern to become much more widespread, what with its barrier-removing-goodness.
The key to language implementation is to understand the aforementioned "grammar" of the language. This requires you to describe how the language you are writing the compiler for works, and requires quite a bit of recursive thinking. In terms of BASIC though, we can say the following:
- A PROGRAM is made up of LINES
- Each LINE starts with a number, followed by a STATEMENT
- A STATEMENT can be any of the following:
- a PRINT statement, which is followed by something to print
- an INPUT statement, which is followed by something to print and then a VARIABLE
- an IF statement, which looks like this: IF something THEN a statement ELSE a statement (where the ELSE part is optional)
- a GOTO statement: go to a line number
- an ASSIGNMENT statement: variable = something
- An EXPRESSION. An expression is what each of the "something"s are in the examples above, for example, PRINT EXPRESSION. An EXPRESSION can be things such as:
- A string or number
- A binary comparison, e.g., something < somethingElse
- A variable, such as PRINT a$
Using the descriptions above, the following program...
10 print "Hello world"
20 goto 10
...can be represented as a tree as follows:
Terminology-time: The orange nodes above are "Terminals" as they come at the end of each branch in the tree; the blue nodes are Non-Terminals.
This is somewhat of a simplification, as a line can actually be a statement list, where each statement is separated by a colon, and goto and print have their own non-terminals (GotoStmt
and PrintStmt
), but you get the idea.
The problem that Irony solves is how to convert a source code into a tree, such as that above. The first step is to describe the grammar formally. A common way to describe a grammar is using the Backus-Naur form (BNF), which is not a required step for using Irony, but it is helpful to do this first. The above descriptions in BNF would be as follows:
<program> ::= <lines>
<lines> ::= <lines> <line> | <line>
<line> ::= <number> <statement>
<statement> ::= "print" <expression>
| "input" <expression> <variable>
| "if" <expression> "then" <statement>
| "if" <expression> "then" <statement> "else" <statement>
| "goto" <number>
| <variable> "=" <expression>
<expression> ::= <string>
| <number>
| <variable>
| <expression> <binary operator> <expression>
<binary operator> ::= "=" | "<>" | "<" | ">" | "<=" | ">="
This next section explains the background to the <lines> node above. This is a recursive definition which essentially states that a program can have one or more lines. This section is not necessary when using Irony however, as a much more natural construct exists which allows you to bypass this step. However, the following explanation has been left to give further background to those who are interested.
When converting from English to BNF, the biggest difference is in the way a line like "A PROGRAM is made up of LINES" is written. In the first line, we say that a <Program> is made up of <Lines> . <Lines> is then followed by a <Lines> , or a <Line> : this is where the recursive thinking is required. If you think of a program with one <Line> , then you can see that one <Line> can be considered as a <Lines> node. If we now add a second line to this imaginary program, then we can say that the first <Line> matches (and gets represented as) a <Lines> node, and then that <Lines> node with the second <Line> matches the "<lines> <line> " rule, and that can get reduced to a <Lines> node.
This, in fact, means that when defined this way, the <Program> node will have one child node, <Lines> , which in turn will have only two child nodes for any program over one line: the first child will be a <Lines> node, the second child will be a <Line> node. The above diagram was therefore a simplification; the "Hello World" program would actually look like the following:
This may seem a roundabout way to represent the idea that a node can have one-or-more of a certain type of child nodes, but that's the way it's traditionally been done. The good news is that you do not need to use this type of construct when using Irony; instead, you can use the much more intuitive concept that a program has one or more line nodes. Therefore, the first tree shown above is in fact the correct representation when using Irony.
|
Once you have defined the grammar of your language, you need to write this as a C# class. Using Irony.NET, you create a new class which extends the abstract Grammar
class, and in the constructor, build the equivalent of the above Backus-Nauf Form description of the language. You need to declare each part of the language, such as:
NonTerminal PROGRAM = new NonTerminal("PROGRAM");
NonTerminal LINE = new NonTerminal("LINE");
And also declare the terminals:
Terminal number = new NumberTerminal("NUMBER");
Terminal stringLiteral = new StringLiteral("STRING", "\"", ScanFlags.None);
Terminal variable = new VariableIdentifierTerminal();
In this case, NumberTerminal
and StringLiteral
are classes supplied with Irony to identify numbers and quoted strings "like this" in the source code, and VariableIdentifierTerminal
is a class written by myself which extends Terminal
and is used to match strings of alphanumeric characters ending with the '$' character (i.e., BASIC variables).
Once we have those formalities out of the way, we can get down to business. The following is a subset and simplification of what ended up in the final grammar of JSBasic:
PROGRAM.Rule = MakePlusRule(PROGRAM, null, LINE);
LINE.Rule = number + STATEMENT + NewLine;
STATEMENT.Rule = EXPR | ASSIGN_STMT | PRINT_STMT | INPUT_STMT | IF_STMT |
| IF_ELSE_STMT | BRANCH_STMT;
ASSIGN_STMT.Rule = variable + "=" + EXPR;
PRINT_STMT.Rule = "print" + EXPR_LIST;
INPUT_STMT.Rule = "input" + EXPR_LIST + variable;
IF_STMT.Rule = "if" + CONDITION + "then" + STATEMENT;
IF_ELSE_STMT.Rule = "if" + CONDITION + "then" +
STATEMENT + "else" + STATEMENT;
EXPR.Rule = number | variable | EXPR + BINARY_OP + EXPR | stringLiteral;
Remember, this is all compilable C#, with operator overloading provided by Irony to allow a pretty close translation from BNF to C#. It's really a nice way to define a language's grammar.
MakePlusRule and MakeStarRule
Note the first line in the code above, which is how you specify that a NonTerminal (in this case, PROGRAM
) has one or more LINE
nodes, which is one of my favourite features of Irony, as it simplifies the generated tree. There is also a MakeStarRule
for zero or more nodes. When compiled, the Program
node in the abstract syntax tree will have a collection of Line
nodes.
Important note: when using MakePlusRule
or MakeStarRule
, you cannot have anything else in the rule.
Case Insensitivity
Making your language case-insensitive couldn't be much easier. In the constructor of your grammar, you just set a property, and then all keywords such as "if
" will work nO maTTer WHAT the CaSiNg.
this.CaseSensitive = false;
Line-breaks
BASIC lines are ended with line breaks. Languages, like JavaScript, which are ended with characters such as the semi-colon are generally easier to handle; simply add ";" to the end of the rule. However, if the language you are implementing has lines which end with line-breaks, then the following steps are required:
1: End the rule with "NewLine
", which is defined in the base Grammar
class:
LINE.Rule = number + STATEMENT + NewLine;
2: Because Irony ignores whitespace (including line breaks) when scanning the source code, you need a way to resurrect the line breaks when the Abstract Syntax Tree is being created. Fortunately, this can be done with the following line, which can go anywhere in your grammar's constructor:
this.TokenFilters.Add(new CodeOutlineFilter(false));
The CodeOutlineFilter
can also help if you need to know the indentation of the source code, and I believe it can (or will in a future release) allow you to handle languages like VB which use characters to join lines together (e.g., by using an underscore). See the parsers included in Irony for examples.
Comments
Irony provides a terminal which matches comments. To match BASIC comments (i.e., statements starting with "REM
" and ending with a line break), the following declaration was used:
Terminal comment = new CommentTerminal("Comment", "REM", "\n");
And, this was then used in the statement rule:
COMMENT_STMT.Rule = comment;
This was required by JSBasic as I wanted to re-print the BASIC comments as JavaScript comments in the generated JavaScript. Normally, you just want to ignore comments, as defining the fact that, for example, in C# the /* */ comments can appear anywhere would really bloat your grammar definition if you didn't just ignore it! So, if you want to ignore comments, define the comment terminal, and rather than putting it in one of your rules, just add it to the non-terminals list:
base.NonGrammarTerminals.Add(comment);
Compiling Source Code
There are a few little details I've skipped above (see the source code for the full story), but once the Grammar is defined, you can get a string
containing the source code and compile it into an abstract syntax tree like so:
string sourceCode = "10 print \"hello world\"";
BasicGrammar basicGrammer = new BasicGrammar();
LanguageCompiler compiler = new LanguageCompiler(basicGrammer);
AstNode rootNode = compiler.Parse(sourceCode);
OK, so now we've got a tree in memory. Next step: traversing the tree and generating the code.
Generating JavaScript from an Abstract Syntax Tree
There is more than one way to do this, but a rather elegant way to generate code is to create a class extending the Irony-defined AstNode
. This means there are classes such as ProgramNode
, LineNode
, StatementNode
, PrintStmtNode
, etc. Earlier, we saw that the <Program>
node will have a collection of <Line>
node objects, which in C# means ProgramNode
has the following property:
public IEnumerable<LineNode> Lines { get; }
As another example, the IfElseStmtNode
would have three properties:
public ExpressionNode Condition { get; private set};
public StatementNode ThenExpression { get; private set};
public StatementNode ElseExpression { get; private set};
In general, these properties should be set in the constructor of the class. This gets called by Irony, and so you need to copy-and-paste basically the same thing into each node class, with a few different name-changes depending on the number of child nodes:
public class IfElseStmtNode : AstNode
public IfElseStmtNode(AstNodeArgs args)
: base(args)
{
Condition = (ExpressionNode)args.ChildNodes[1];
ThenExpression = (StatementNode)args.ChildNodes[3];
ElseExpression = (StatementNode)args.ChildNodes[5];
}
}
The base constructor for AstNode
has a property ChildNodes
which it sets to args.ChildNodes
, which is useful for traversing the tree of nodes later.
Note that when you are in a node created using MakePlusRule
or MakeStarRule
, you cannot set the property in the constructor, as at that point args.ChildNodes
is empty (it gets populated during the creation of the tree). Therefore, it is recommended to cast the nodes in the property. For the ProgramNode
class, its Lines
property looks like this:
public IEnumerable<LineNode> Lines {
get {
return base.ChildNodes.Cast<LineNode>();
}
}
When Irony is creating the abstract syntax tree, it needs to know that, for example, <LineNode>
nodes should be instantiated as instances of LineNode
classes. This is achieved by using a different constructor for the LineNode
non-terminal above:
NonTerminal LINE = new NonTerminal("LINE", typeof(LineNode));
This is done for all the other types, and now when the code is compiled, the abstract syntax tree will consist of the AstNode
classes that you have defined. Generating script is now relatively easy: you just need to get each class to write itself as the target code. To achieve this, I created the following interface which all my AstNode
classes implemented:
public interface IJSBasicNode
{
void GenerateJavaScript(JSContext context, TextWriter textWriter);
}
The JSContext
was just to recreate the code indentation, to make the JavaScript look pretty, but isn't really necessary. Now, each node can print the bits it needs to, and call GenerateJavaScript
on all its child nodes. Here's what the IfStmtNode
's implementation of this method could look like:
public override void GenerateJavaScript(JSContext context,
TextWriter textWriter)
{
textWriter.Write("if (");
Condition.GenerateJavaScript(context, textWriter);
textWriter.WriteLine(") {");
ThenStatement.GenerateJavaScript(context, textWriter);
textWriter.WriteLine("} else {");
ElseStatement.GenerateJavaScript(context, textWriter);
textWriter.WriteLine("}");
}
The Condition
, ThenStatement
and ElseStatement
nodes can each be complicated, nested, recursive nodes, but because each node just writes the bits that are particular to itself, each class' GenerateJavaScript
method is pretty straight-forward. Once every node has its GenerateJavaScript
method defined, everything just automagically works. Just compile some source code, create a text writer, and ask the ProgramNode
to generate the JavaScript.
I must admit that I've left quite a few details out here, but hopefully, it's given you a general understanding of how to write a compiler and generate code using Irony. Unfortunately, converting BASIC to JavaScript wasn't so straight-forward for every node type, as the next section describes.
Jamming a Square Peg into a Round Hole
A JavaScript program written to run in a browser and a BASIC program written for a console are two completely different beasts. This section describes the problems which needed to be overcome to complete this project.
The Console
In order to handle print
, input
, and locate
statements in the same way as an old computer console, I wrote JSConsole. See that article for more information, but in a nutshell, it allows you to treat an HTML element such as a DIV
in the same way as a console, for example:
console.cls();
var userName = console.input('What is your name?');
console.println('Hello ' + userName);
Replicating the "goto" Statement
JavaScript does not have a "goto
" branch statement, which made things very difficult. The idea is somewhat similar to calling functions, and so my first thought was to wrap each line into its own function, and at the end of each function, the next function to call is called. We will use the following program again as an example:
10 print "Hello world"
20 goto 10
This would look something like this, then:
function line10() {
console.println("Hello world");
line20();
}
function line20() {
line10();
}
If you run this though, the function stack will very quickly overflow, and the program will crash (e.g., after "Hello world" has printed 1000 times, 1000 line10
and 1000 line20
function calls will be on the stack, and the program will quickly die). To solve this problem, instead of having each line call the next function directly, it returns the next-function-to-call to a base function (or null
to end). This now looks something like this:
function line10() {
console.println("Hello world");
return line20;
}
function line20() {
return line10;
}
function run() {
var currentLine = line10;
while (currentLine != null) {
currentLine = currentLine();
}
}
This will now successfully run without blowing the call-stack, because the stack will always consist of either only run
, run
and line10
, or run
and line20
. This seems like a good solution, until you run it and the browser stops responding. Welcome to the next problem...
Simulating Thread.Sleep in JavaScript
The problem is that the little program above will run using 100% of the CPU, and will not release any CPU time to the browser. The browser will become stuck (or the browser will think the JavaScript has got into an infinite loop, and let you stop it). Ideally, we need the run()
function to look something like the following:
function run() {
var currentLine = line10;
while (currentLine != null) {
currentLine = currentLine();
Thread.Sleep(10);
}
}
Of course, there is no Thread.Sleep
in JavaScript. The closest is the setTimeout
function, which lets you execute code after a certain number of milliseconds. The problem with setTimeout
is that it essentially acts as an asynchronous call, and so you cannot return anything from it, and we need to know the next function to call.
The trick is to move everything outside of the while
loop to become global variables, and turn the while
loop into a function, with the condition becoming an if
statement, like so:
var curLine;
function run() {
curLine = line10;
mainLoop();
}
function mainLoop() {
if (curLine == null) {
return;
}
curLine = curLine();
setTimeout('mainLoop()', 10);
}
This will now run the deceptively simple two-line basic program forever, while keeping the browser responsive. BASIC also supports gosub
and return
statements, which are essentially function calls:
10 gosub 40
20 print "Hello from line 20"
30 end
40 print "Hello from line 40"
50 return
Are you ready for the output?
Hello from line 40
Hello from line 20
This was a lot more simple to implement than the goto
statement, as it just gets converted to simple function calls. Note that there is a limitation to this technique of returning function pointers to the main loop: you cannot have goto
statements within a sub-routine.
The main-loop pattern is very useful for any situation where you need to execute a loop in JavaScript which will last more than a few seconds (such as animations). At a more general level, whenever you have a JavaScript program such as:
function animatePage() {
while (someCondition) {
}
}
You can convert it to this:
function initialse() {
mainLoop();
}
function mainLoop() {
if (!condition) {
cleanUp();
return;
}
setTimeout(mainLoop, 10);
}
function cleanUp() {
}
The mainLoop()
code can be found in JSBasic.js, and it actually has a little more code for error handling, and a couple of global functions for string
s, and the implementation of inkey$
, which simply listens for key-press events on the window and saves the last-pressed key.
Conclusion
My goal in this project was both to create a BASIC compiler and evaluate the usefulness of Irony. Using Irony was a real pleasure, and I wouldn't hesitate to drop the Irony DLL into a project and define a little grammar for a domain-specific language, should the need ever arise.
I hope that this project may be useful to others as an example of how to create a compiler in .NET. Thanks for reading!
History and Acknowledgements
- 9 April, 2008: Initial article released
- 22 April, 2008: Updated to include newer version of Irony, better handling of new-lines, line-sensitivity, and the MakePlusRule
- 23 April, 2009: Updated solution
- 20 January, 2013: Put code on GitHub and changed hosting to AppHarbor
A big thank you to the creator of Irony, Roman Ivantsov, who made major improvements to my initial implementation of JSBasic and answered my many queries. His help has improved the usefulness of this article a great deal. Have a read of the CodeProject Irony article and get the latest release of Irony from CodePlex.
Daniel has a Bachelor of Science with First Class Honours from the University of Auckland, and has designed and developed software in companies large and small.