A Simple Brainf**k To COBOL Compiler In Managed COBOL






4.50/5 (4 votes)
A 101 on writing a compiler - a Managed COBOL compiler which compiles Brainf**k (a very simple language) to Managed COBOL
Introduction
Brainf**k (replace ** with u and c and you get the actual name - but I don't want to cause offence) is a super minimal programming language. Managed COBOL is a very large programming language. I just love the idea of translating BF into Managed COBOL. BF (as I shall call it from here on) contains only 8 symbols and is derived from P" which contains only 6. Both languages are Turin Complete and neither is of any utility in actual programming. P" was created as a exercise; the font of all knowledge (ahem) Wikipedia has this to say about it 'P'' was the first "GOTO-less" imperative structured programming language to be proven1,2 Turing-complete.'My idea here is:
- Make a translator from BF to Managed COBOL in Managed COBOL.
- The result should be able to be used to create classes where the BF code is a method body.
- Demonstrate these being compile and called from COBOL with Visual COBOL 2010.
- Demonstrate some simple abstract symbol tree optimizations.
- Demonstrate BF being called from C# and (using a developmental compiler) Java [Later post].
- Go the whole-hog and translate BF to COBOL and then call from F# [Later post].
For more on Managed COBOL please see my blog.
What is BF and how can we translate it to COBOL?
BF has a notion of a continuous area of bytes which hold integers. There is a pointer into those bytes referred to as the data-pointer. There is also the program which is a sequence of symbols. All symbols are legal, but anything other than the 8 BF symbols are ignored. The program notionally moves from the start to the end of the symbols but does have a looping construct. P" and BF are the same to this point, however, BF extends the model into having some form of IO. This is normally associated with keyboard and display, but I will use the IO extensions of BF to provide the mechanism for passing in and returning data when BF is invoked from other languages.
Here as the BF symbols:
> move the data pointer right (higher value)
< move the data pointer left (lower value)
+ increment by 1 the byte at the current data pointer
- decrement by 1 the byte at the current data pointer
[ begin loop
] close loop
. move the value at the data pointer to the output
, move the next input value to the value at the data pointer
Notes: Loops continue to execute the symbols contained repeatedly until the data pointed to by the data pointer is 0 at the start of an iteration. Bytes can take the decimal values 0-255. Incrementing a value 255 will result in 0; decrementing a value 0 will result in 255. Loops can be nested.
Arguments And Returns
To allow my BF routine to be invoked from other languages, I have the idea of using the IO symbols. A table of bytes will be passed to BF. And BF will 'return' a new table of bytes (for the non COBOL heads, a table is pretty much the same thing as an array in C#).
The passed in table is the arguments to the routine and the passed back table is the return. BF has an input pointer which starts at the start on the arguments table; each time BF reads an bytes from the arguments the pointer is incremented. Incrementing the input pointer beyond the end of the table causes the pointer to reset to 1 (the start).
The returned table is initialized as empty then each time there is an output from BF the output bytes value is added to the end of the output table. (The implementation does not actually append to a table, it appends to a list and converts the result to a table, but the logical effect is the same.)
Compilation
The compilation process involves simply iterating over to input bytes and each one being converted into an abstract symbol tree (AST) token or ignored. The AST can then be translated into COBOL and finally wrapped in some boilerplate COBOL to give it the appropriate environment. Optimization can happen to the AST.
The only part of BF which requires any serious thought (and only a few seconds at that) is the [] looping construct. Fortunately, modern COBOL dialects have the in-line, nested perform syntax. The [] BF syntax directly translated to...
until data(data-pointer) = 0
...
end-perform
... assuming that the data area is simply a table of bytes and there is a local-storage item data-pointer which points into that table.
OK - Let's write a compiler!
I have chosen not to implement the normal file read and write stuff which a compiler does. This compiler will accept a string and return a string. The input string being BF and the output being the source for a Managed COBOL method body. This is the most straight forward implementation which can then be added to other programs to compile files or other source of source.
The compiler is massive overkill. BF is such a simple language that a very simple compiler will do. My aim with this compiler it to make it have a very clean and structured design. I have not actually implemented a full AST but rather just a list (a tree with only one branch). This is possible because there is no calling structure and no variables. It is really a testament to the efficiency of languages with only conditional loops.
Also, I have represented each of the BF symbols as an object (I call them tokens) which 'knows' the symbol and the required COBOL to enact that symbol. These objects all implement a common interface (BFToken). By moving the symbol definition and COBOL representation to a polymorphic type, it is possible to move the responsibility for understanding the lex from the main compiler loop and into these token objects. Although BF has only 8 symbols, the approach would allow additions of new symbols without changing the lexer or parser at all. This was a design feature I used in VSPL to some success.
Here is the code defining the token class interface and an instance of a token class:
interface-id BFToken as "BF.BFToken". method-id get property BFSymbol. procedure division returning ret as string. end method. method-id get property COBOLCode. procedure division returning ret as string. end method. end interface. class-id BFIncr as "BF.BFIncr" implements type BFToken. 01 source-code. 03 filler pic x(29) value "if data-block(data-ptr) = 255". 03 filler pic x(02) value x"0D0A". 03 filler pic x(34) value " move 0 to data-block(data-ptr)". 03 filler pic x(02) value x"0D0A". 03 filler pic x(04) value "else". 03 filler pic x(02) value x"0D0A". 03 filler pic x(34) value " add 1 to data-block(data-ptr)". 03 filler pic x(02) value x"0D0A". 03 filler pic x(06) value "end-if". method-id get property BFSymbol. procedure division returning ret as string. move "+" to ret. end method. method-id get property COBOLCode. procedure division returning ret as string. move source-code to ret end method. end class.
The AST is just a list of already created token classes. Here is the code which does the lexing/parsing and creates the AST:
table of type BFToken ( new BFDecr new BFIncr new BFLeft new BFRight new BFInput new BFOutput new BFLoopStart new BFLoopEnd ). 01 separator string value x"0D0A". method-id compile public. 01 current-char character. 01 current-token type BFToken. *> Use a list as a dicationary is overkill when we only *> have 8 memebers and this is a trivial implementation 01 ast type List[type BFToken] value new List[type BFToken]. 01 code-builder type StringBuilder value new StringBuilder. procedure division using by value source-code as string returning generated-code as string. *> These loops make up the parser and lexer *> all in one perform varying current-char through source-code::ToCharArray *> respocibility for what characters make up the language *> is devolved to the token objects perform varying current-token through tokens if current-token::BFSymbol::Chars(0) equals current-char invoke ast::Add(current-token) end-if end-perform end-perform
Once this loop has constructed the AST it can be interrogated directly to get the COBOL source to inject into the boiler plate.
By going to the effort of laying out the COBOL to be generated in the value clauses of COBOL group items we cause the generated COBOL code to be human readable and have a clean representation in the compiler source as well.
Here is an example of the generated code:
01 data-block binary-short occurs 32768. 01 data-ptr binary-long value 1. 01 input-ptr binary-long value 1. 01 output-list type List[binary-short] value new List[binary-short]. 01 data-block-size binary-long. 01 input-block-size binary-long. procedure division using by value input-block as binary-short occurs any returning output-block as binary-short occurs any. set data-block-size to data-block::Length set input-block-size to input-block::Length if data-block(data-ptr) = 255 move 0 to data-block(data-ptr) else add 1 to data-block(data-ptr) end-if if data-block(data-ptr) = 255 move 0 to data-block(data-ptr) else add 1 to data-block(data-ptr) end-if if data-block(data-ptr) = 255 move 0 to data-block(data-ptr) else add 1 to data-block(data-ptr) end-if if data-block(data-ptr) = 255 move 0 to data-block(data-ptr) else add 1 to data-block(data-ptr) end-if
Visual COBOL 2010 - this makes playing around like this really straight forward!
Below is the compiler in Visual COBOL 2010 and then the source code for the compiler. I have included the entire compiler in one source file.
Here is my working compiler with the null optimizer:
$set sourceformat(variable) $set ilusing(BF) class-id BFTester as "BF.BFTester". method-id main static. 01 compiler type BF.BFCompiler value new BF.BFCompiler. procedure division. display compiler::compile("++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.") display "All done" end method. end class. class-id BFCompiler as "BF.BFCompiler". 01 optimizer type BFOptimizer value new NullBFOptimizer. 01 tokens type BFToken occurs any value table of type BFToken ( new BFDecr new BFIncr new BFLeft new BFRight new BFInput new BFOutput new BFLoopStart new BFLoopEnd ). 01 separator string value x"0D0A". method-id compile public. 01 current-char character. 01 current-token type BFToken. *> Use a list as a dicationary is overkill when we only *> have 8 memebers and this is a trivial implementation 01 ast type List[type BFToken] value new List[type BFToken]. 01 code-builder type StringBuilder value new StringBuilder. procedure division using by value source-code as string returning generated-code as string. *> These loops make up the parser and lexer *> all in one perform varying current-char through source-code::ToCharArray *> respocibility for what characters make up the language *> is devolved to the token objects perform varying current-token through tokens if current-token::BFSymbol::Chars(0) equals current-char invoke ast::Add(current-token) end-if end-perform end-perform *> Now we have the ast we optimize it invoke optimizer::Optimize(ast) *> Now we generate *> First put in boiler plate invoke code-builder::Append("01 data-block binary-short occurs 32768.") invoke code-builder::Append(separator) invoke code-builder::Append("01 data-ptr binary-long value 1.") invoke code-builder::Append(separator) invoke code-builder::Append("01 input-ptr binary-long value 1.") invoke code-builder::Append(separator) invoke code-builder::Append("01 output-list type List[binary-short] value new List[binary-short].") invoke code-builder::Append(separator) invoke code-builder::Append("01 data-size binary-long.") invoke code-builder::Append(separator) invoke code-builder::Append("01 input-size binary-long.") invoke code-builder::Append(separator) invoke code-builder::Append("procedure division") invoke code-builder::Append(separator) invoke code-builder::Append(" using by value input-block as binary-short occurs any") invoke code-builder::Append(separator) invoke code-builder::Append(" returning output-block as binary-short occurs any.") invoke code-builder::Append(separator) invoke code-builder::Append("set data-block-size data-block::Length") invoke code-builder::Append(separator) invoke code-builder::Append("set input-block-size input-block::Length") invoke code-builder::Append(separator) perform varying current-token through ast invoke code-builder::Append(current-token::COBOLCode) invoke code-builder::Append(separator) end-perform *> Final boiler plate invoke code-builder::Append("set output-block to output-list::ToArray") invoke code-builder::Append(separator) set generated-code to code-builder::ToString goback. end method. end class. interface-id BFToken as "BF.BFToken". method-id get property BFSymbol. procedure division returning ret as string. end method. method-id get property COBOLCode. procedure division returning ret as string. end method. end interface. class-id BFIncr as "BF.BFIncr" implements type BFToken. 01 source-code. 03 filler pic x(29) value "if data-block(data-ptr) = 255". 03 filler pic x(02) value x"0D0A". 03 filler pic x(34) value " move 0 to data-block(data-ptr)". 03 filler pic x(02) value x"0D0A". 03 filler pic x(04) value "else". 03 filler pic x(02) value x"0D0A". 03 filler pic x(34) value " add 1 to data-block(data-ptr)". 03 filler pic x(02) value x"0D0A". 03 filler pic x(06) value "end-if". method-id get property BFSymbol. procedure division returning ret as string. move "+" to ret. end method. method-id get property COBOLCode. procedure division returning ret as string. move source-code to ret end method. end class. class-id BFDecr as "BF.BFDecr" implements type BFToken. 01 source-code. 03 filler pic x(27) value "if data-block(data-ptr) = 0". 03 filler pic x(02) value x"0D0A". 03 filler pic x(36) value " move 255 to data-block(data-ptr)". 03 filler pic x(02) value x"0D0A". 03 filler pic x(04) value "else". 03 filler pic x(02) value x"0D0A". 03 filler pic x(40) value " subtract 1 from data-block(data-ptr)". 03 filler pic x(02) value x"0D0A". 03 filler pic x(06) value "end-if". method-id get property BFSymbol. procedure division returning ret as string. move "-" to ret. end method. method-id get property COBOLCode. procedure division returning ret as string. move source-code to ret end method. end class. class-id BFDecr as "BF.BFRight" implements type BFToken. 01 source-code. 03 filler pic x(30) value "if data-ptr = data-block-size". 03 filler pic x(02) value x"0D0A". 03 filler pic x(22) value " move 1 to data-ptr". 03 filler pic x(02) value x"0D0A". 03 filler pic x(04) value "else". 03 filler pic x(02) value x"0D0A". 03 filler pic x(21) value " add 1 to data-ptr". 03 filler pic x(02) value x"0D0A". 03 filler pic x(06) value "end-if". method-id get property BFSymbol. procedure division returning ret as string. move ">" to ret. end method. method-id get property COBOLCode. procedure division returning ret as string. move source-code to ret end method. end class. class-id BFDecr as "BF.BFLeft" implements type BFToken. 01 source-code. 03 filler pic x(15) value "if data-ptr = 1" 03 filler pic x(02) value x"0D0A". 03 filler pic x(37) value " move data-block-size to data-ptr". 03 filler pic x(02) value x"0D0A". 03 filler pic x(04) value "else". 03 filler pic x(02) value x"0D0A". 03 filler pic x(28) value " subtract 1 from data-ptr". 03 filler pic x(02) value x"0D0A". 03 filler pic x(06) value "end-if". method-id get property BFSymbol. procedure division returning ret as string. move "<" to ret. end method. method-id get property COBOLCode. procedure division returning ret as string. move source-code to ret end method. end class. class-id BFDecr as "BF.BFInput" implements type BFToken. 01 source-code. 03 filler pic x(51) value "move input-block(input-ptr) to data-block(data-ptr)" 03 filler pic x(02) value x"0D0A". 03 filler pic x(32) value "if input-ptr = input-block-size". 03 filler pic x(02) value x"0D0A". 03 filler pic x(23) value " move 1 to input-ptr". 03 filler pic x(02) value x"0D0A". 03 filler pic x(04) value "else". 03 filler pic x(02) value x"0D0A". 03 filler pic x(22) value " add 1 to input-ptr". 03 filler pic x(02) value x"0D0A". 03 filler pic x(06) value "end-if". method-id get property BFSymbol. procedure division returning ret as string. move "," to ret. end method. method-id get property COBOLCode. procedure division returning ret as string. move source-code to ret end method. end class. class-id BFDecr as "BF.BFOutput" implements type BFToken. method-id get property BFSymbol. procedure division returning ret as string. move "." to ret. end method. method-id get property COBOLCode. procedure division returning ret as string. move "invoke output-list::Add(data-block(data-ptr))" to ret end method. end class. class-id BFDecr as "BF.BFLoopStart" implements type BFToken. 01 source-code pic x(55) value "perform with test before until data-block(data-ptr) = 0". method-id get property BFSymbol. procedure division returning ret as string. move "[" to ret. end method. method-id get property COBOLCode. procedure division returning ret as string. move source-code to ret end method. end class. class-id BFDecr as "BF.BFLoopEnd" implements type BFToken. 01 source-code pic x(11) value "end-perform". method-id get property BFSymbol. procedure division returning ret as string. move "]" to ret. end method. method-id get property COBOLCode. procedure division returning ret as string. move source-code to ret end method. end class. interface-id BFOptimizer as "BF.BFOptimizer". method-id Optimize. procedure division using by value ast as type System.Collections.Generic.List[type BFToken]. end method. end interface. class-id NullBFOptimizer as "BF.NullBFOptimizer" implements type BFOptimizer. method-id Optimize. procedure division using by value ast as type System.Collections.Generic.List[type BFToken]. *> Do nothing - does not optimize end method. end class.
I ran this compiler which was compiling a simple 'Hello World' program. I directed the output to a text file and then cut and paste the COBOL into a Visual Studio 2010 Managed COBOL project. When I ran the project, it produced the expected array (which in ASCII says Hello World! followed by a new line).
Simple Optimization
In the above implementation I have used a null optimiser; it does not do anything. The optimiser has the AST passed to it and is expected to perform all optimisations in place. I have been doing this project for fun (lunch is for lightweights etc.) and so I have not implemented much in the way of an optimiser. One thing which is really simple but insanely verbose about BF is its need of long sections of identical symbols. My optimiser takes chains of incr tokens and converts them into a single token which performs the same operation but in one step. To do this I have added a new token which is not added to the lexer but which is usable by the generator. This token replaces the inefficient incr chains in the AST.
The optimiser works by iterating over the AST and adding all non incr tokens directly to a new AST. When it encounters incr tokens, it counts over them and when the next non incr token is found it creates a single token to replace the chain of incr tokens and adds this new token to the new AST. Once the entire AST has been iterated over, it replaced the contents of the AST with the new AST.
class-id BFIncrMany as "BF.BFIncrMany" implements type BFToken. 01 source-code string. method-id new. 01 source-builder type StringBuilder value new StringBuilder. procedure division using by value how-many as binary-long. invoke source-builder::AppendFormat("add {0} to data-block(data-ptr)" how-many) invoke source-builder::Append(x"0D0A") invoke source-builder::AppendFormat("compute data-block(data-ptr) = data-block(data-ptr) b-and 255") set source-code to source-builder::ToString end method. method-id get property BFSymbol. procedure division returning ret as string. move "Not Implemented" to ret. end method. method-id get property COBOLCode. procedure division returning ret as string. move source-code to ret end method. end class. class-id IncrBFOptimizer as "BF.IncrBFOptimizer" implements type BFOptimizer. method-id Optimize. 01 in-incr condition-value value false. 01 how-many binary-long. 01 current-token type BFToken. 01 swapsies type List[type BFToken] value new List[Type BFToken]. procedure division using by value ast as type System.Collections.Generic.List[type BFToken]. perform varying current-token through ast if in-incr if current-token::BFSymbol = "+" *> This simple logic means very long chains of *> + could overflow - but this is demo only add 1 to how-many else set in-incr to false invoke swapsies::Add(new BFIncrMany(how-many)) end-if else if current-token::BFSymbol = "+" add 1 to how-many set in-incr to true else invoke swapsies::Add(current-token) end-if end-if end-perform invoke ast::Clear perform varying current-token through swapsies invoke ast::Add(current-token) end-perform end method. end class.
Now we can see how the large chains of increment token generated code have been replaced by simple additions:
01 data-block binary-short occurs 32768. 01 data-ptr binary-long value 1. 01 input-ptr binary-long value 1. 01 output-list type List[binary-short] value new List[binary-short]. 01 data-size binary-long. 01 input-size binary-long. procedure division using by value input-block as binary-short occurs any returning output-block as binary-short occurs any. set data-block-size data-block::Length set input-block-size input-block::Length add 10 to data-block(data-ptr) compute data-block(data-ptr) = data-block(data-ptr) b-and 255 if data-ptr = data-block-size move 1 to data-ptr else add 1 to data-ptr end-if add 17 to data-block(data-ptr) compute data-block(data-ptr) = data-block(data-ptr) b-and 255 add 27 to data-block(data-ptr) compute data-block(data-ptr) = data-block(data-ptr) b-and 255 add 30 to data-block(data-ptr) compute data-block(data-ptr) = data-block(data-ptr) b-and 255 add 31 to data-block(data-ptr) compute data-block(data-ptr) = data-block(data-ptr) b-and 255 if data-ptr = 1 move data-block-size to data-ptr else subtract 1 from data-ptr end-if if data-ptr = 1 move data-block-size to data-ptr else subtr...