Cat - A Statically Typed Programming Language Interpreter in C#

Christopher Diggins

5.00/5 (12 votes)

Nov 4, 2006

MIT

14 min read

72960

536

This article contains the public domain implementation of an interpreter for a statically typed stack-based programming language in C# called Cat. The accompanying article is a high-level description of how the various modules work, a brief description of the language, and links to related work.

Download source code and executables - 108 Kb

Sample Image - cat.jpg

Introduction

This article presents an interpreter for the statically typed functional stack-based programming language Cat (http://www.cat-language.com/). Cat can perhaps most succinctly be described as the functional cousin of Forth with a type system. Cat is inspired by a number of languages, the most notable being Joy (another functional cousin of Forth, but without a type system). A list of related languages is available at the end of the article.

The version of the Cat interpreter included with this article is version 0.9.6. The most recent version of the interpreter and source code can be accessed publicly at http://www.cat-language.com/download.html.

This article and source code should be of interest to people who are interested in implementing parsers, compilers, or interpreters. This article is not a tutorial and hence is not recommended for beginners, but instead is a reasonably well-documented example of a working interpreter which does type inference and type checking which you can hack and modify for your own purposes, either as a stand-alone interpreter or scripting add-on.

License

The Cat executables and source code are all public domain. This means it can be used and modified without restriction, requirement, or attribution. You can release any modified version of this code using any license you want, or for any purpose. Feel free to use the source code in any commercial products, and make yourself look great in front of your boss. One caveat though, there is no warranty, express or implied. This means you use the code at your own risk, just like anything you get for free.

Background

Cat was designed to be a simple scripting language with a strong type system and also a portable intermediate language. I found that existing intermediate languages are not easily read and written by humans, and usually lacked a type system. I also found that many scripting languages tended to have burdensome syntax and lacked a static type system. The advantages of Cat I believe are its simplicity, expressiveness, extendability, and portability.

The Cat language syntax and semantics are inspired by stack based programming languages (e.g. Forth, Postcript, Joy), intermediate languages (e.g. CIL/MSIL, Java byte code) and statically typed functional languages (e.g. Haskell, OCaml).

Cat is a bit of an odd duck for a scripting language because it is statically typed, and does type inference. This is not a unique idea, several languages do it (e.g. Haskell, OCaml, Scala, F#) however you will not find a C# version of their source code posted on CodeProject any time soon.

There are several reasons why Cat might be interesting to you:

It is public domain, and therefore friendly to commercial applications.
The language definition is very small and can be implemented and ported with relative ease.
The source code is written in C#.
A dynamic version of Cat can be implemented very easily.

Static Typing

A statically typed language with type inference behaves much like a dynamic language, in that type annotations are not neccessary. As a result a Cat program can be very compact. For example consider the following Cat program which will add one to a value:

define inc { 1 + }

In a weakly typed language, you would probably be able to call such a function passing it a string, and have it fail during execution.

This is, of course, assuming that the control path is in fact executed. A challenging problem with dynamic typing is that type errors can lay dormant until after the software is released publicly.

What is special about type inference is that the fact that this function requires an integer is deduced at compile time, and a call using a non-integer can be detected. Type inference mitigates the overhead of writing type annotations for your functions but without sacrificing the safety of having your compiler identify type errors. Safe and convenient.

Cat Basics

The Cat programming environment consists of a set of definitions of functions (user defined and primitive) and two evaluation stacks. Most operations manipulate the "primary" stack, and a handful use the "auxiliary" stack. When I refer to simply "the stack", I am referring to the primary stack.

Cat functions pop their arguments from the stack and push their results onto the stack. Literal numbers (e.g. 42, 0, 1999, etc.) and literal strings (e.g. "all your base are belong to us") are also functions which push single values onto the stack.

Binary arithmetic operators (e.g. +, *, <=, etc.) take their operands from the top of the stack, and push their results back onto the stack. in Cat the top value is treated as the right hand operand, and the below value is treated as the left hand operand. As a result of this Cat syntax looks like reverse polish notation (RPN). Here is an example of Cat in action:

>> 33 3 *
main stack: 99
>> 1 2 + 5 *
main stack: 99 15
>> >
main stack: true
>> pop
main stack: _empty_

All phrases (sequence of terms terminated by a new line) in the language are either definitions of new functions, or are instructions to be executed by the interpreter.

Cat functions are defined as follows:

>> define what_is_the_answer { 6 7 * }
main stack: _empty_
>> what_is_the_answer
main stack: 42

For more information about the Cat language there is a detailed Cat tutorial at http://code.google.com/p/cat-language/wiki/Tutorial. There is an even more in-depth and technical description of the language at http://www.cat-language.com/manual.html.

A formal definition of the Cat syntax can be found at http://www.cat-language.com/manual.html#syntax. In brief the Cat interpreter accepts commands directly in the form of either: text words, groups of symbols, numbers, strings, or references.

I should point out that a nifty feature of Cat is the ability to define new symbols. So if I was a fan of the old fashioned ? operator from my Atari Basic, I could write:

>> define ? { write_line }

The Source Files

Nothing exciting here, just a list of the files in the project along with a brief description.

CatBase.cs	Defines a base class for some of the types which contain useful utility functions
CatStack.cs	Defines a custom stack class used for the implementation of the primary and auxiliary evaluation stacks
CatType.cs	Defines the base class and simple primitive types for Cat
Config.cs	Contains a set of global flags for controlling the interpreter behaviour
ConstraintSolver.cs	Used by the type inference engine to compute the type by resolving the constraint rules
Executor.cs	Manages the primary and auxiliary stack, and contains the implementation of the Primitive functions
Functions.cs	Provides definitions for classes which represent Cat functions
FxnType.cs	Defines the object which represents function types
Interpreter.cs	Contains the interpreter fetch and execute loop
Main.cs	Provides an entry point for the application and calls the initialization procedures
Parser.cs	Defines the abstract syntax tree (AST) node type, and creates an AST representation of types and programs from a string
Primitives.cs	Registers the predefined primitive and library definitions
Scope.cs	Defines the Scope class which processes and stores defininitions
TypeInferer.cs	Infers a function's type from its definition
TypeStack.cs	Used to store sequences of types such as a function type's consumption and production
TypeVars.cs	Defines type variables and type variable declarations
UnitTest.cs	Contains numerous tests of the standard library and type inference engine

The Interpreter

Here is a broad overview of how the interpreter works:

The interpreter fetches commands from the user.
Instructions are sent to the global scope object (accessed through Scope.Global()) for processing.
The scope object looks to see if the instructions are in fact a definition. The clue being if it starts with the word "define" followed by white-space.
If the instruction given to the interpreter forms a definition then:
- A definition object is created
- The type is inferred
- If a type annotation was present in the definition then it is compared to the inferred type.
or

If the instructions are indeed instructions directly intended for compiler, then they are parsed and executed in sequence.

The Parser

The parser's job is to convert the source code from a text format to a more easily managed format, usually an abstract syntax tree (AST). Traditionally parsing is a second step, after a lexical analysis phase which separates a text stream into tokens. The lexical analysis (lexxing) phase is not actually neccessary, and is skipped entirely by the Cat language.

The AST generated by the Cat parser is represented using the AstNode class.

The Cat parser is a straightforward implementation of what is known as a simple recursive descent parser. The Cat parser creates a new tree node to represent each of the different elements of the language, as expressed in the grammar production. There are several different kinds of language elements for which nodes are created, which are differentiated using the NodeType enum.

public enum NodeType
{
    Definition,
    StackState,
    SimpleType,
    FxnType,
    Root,
    Quotation,
    Word,
    Number,
    String,
    ParanGroup,
    BraceGroup,
    LineComment,
    Reference,
    VarDecl,
    Undefined
};

The language parser uses various cues and will look ahead a single character if necessary to decide what language element is being parsed. This obviates the need to back up during failure in the Cat parser. In more complex languages, a parser would often need to be able to abandon a parsing rule, delete the current node, and try the next one.

One big difference about the Cat language which is worth pointing out is that sequences of symbols are labelled as NodeType.Word and are used to define new functions.

The Type Inferer

[Note: this section is only a gross oversimplification of the type inference algorithm, and is very technical. You can easily skip this section, and still understand and use the project]

The type inference algorithm for Cat is very complex and the current implementation is somewhat brittle. This is partly due to the fact that this is a brand new experimental type system and under active development. The following algorithm description is only an approximation of the steps taken. I am trying to convey the logic of what happens, rather than the nitty-gritty reality.

The type inference algorithm is initialized with a function with a type signature of:

(_main_:any*)(_aux_:any*) -> (_main_)(_aux_)

This is called the return type. The action of each operation that makes up the function definition is then abstractly simulated on a stack containing types. That is to say that only the consumption, and production of types is simulated, not the actual operation itself.

Type checking and type inference algorithms usually work by simulating the action of functions at the type level. For example, to simulate an integer addition function you would pop two integers from the type stack and push one integer onto the type stack.

In the case of the type inference algorithm, the type stack is initialized with a multi-type variable. When a type is popped from the type stack and a multi-type variable is found, it conveys a constraint: the type variable must match that value. This constraint is used to more precisely describe the type variables, through a deductive process.

In the Cat type inference algorithm, the simulated type stacks are taken from a FunctionType object called ret. By applying the following algorithm, starting with the first instruction in the definiton, the return type can be correctly inferred:

The instruction's type is converted to canonical form. That is to say, a default main and auxiliary variable is added to the consumption and production.
For each type in the primary consumption of the current instruction:
- A type is popped from the production of ret if it matches.
- If the types do not match, then one of them must be a type variable or a type error has occurred.
- If one or both of the types are variables then a constraint rule is generated. This rule defined a constraint on the variable, saying what types it must match to.
The previous algorithm is then repeated for the auxiliary consumption types.
Now we have numerous constraints which describe the various type variables. These need to be resolved.
The constraint resolution algorithm assures that each type variable either stays as a variable reference or becomes associated with a single concrete variable or a set of concrete variables if it is a multi-type variable.
All resolved variables in ret and the type of the current function are then mapped to new types according to the results of the constraint resolution process.
The primary and auxiliary production of the current instruction's type (now that it is resolved) are added to to the production of the return type.
Repeat this process for the next instruction if there is another, otherwise quit.

Now that all of the instructions are simulated and all of the constraints are resolved, the FxnType ret now contains the inferred type.

The Executor

The Executor manages the primary and auxiliary stack, as well as provides the core implementation of the primitive functions. The primitive functions are static void functions with no parameters. They are referred to using the PrimitiveDelegation delegation type.

The executor provides a public function for executing raw strings, but the actual work of processing the string is passed to a Scope object. This design decision may appear a bit strange, but it lays the groundwork for possibly supporting namespaces in a future version of Cat. A namespace requires individual scope objects, which can be potentially nested.

Functions

All functions are registered with the global scope object Scope.Global() in the AtomicOps class using the RegAtomicOp method. The RegAtomicOp method requires a PrimitiveDelegate, a string for the function name, a string for the type of the function, and a string for the description. The description is used by the help and desc_of primitive functions.

Alternatively you can define a standard library function using existing functions. For this purpose you could use the AtomicOps.RegLibFxn method. The RegLibFxn method is similar in structure to RegAtomicOps, but requires the definition of the function as a string and the type can be inferred. If you provide a type, it will be compared against the type inferred from the instructions as a failsafe mechanism to assure type safety.

Unit Tests

The unit tests would probably be more accurately called integration tests, since they only test high-level functionality like the type inference mechanism, and the primitive functions. The tests are an excellent way to learn the language by observing what it is expected to do in numerous circumstances, and comparing it against your own expectations. The tests can be turned on and off, and have their behaviour modified, through various flags found in the Config class.

Known Issues

There are some tests of the type system which fail under the current implementation. These are located in the function RunKnownIssuesTests which is not executed by default. The flag for controlling this is in the Config class.

Some known issues with the current Cat release are:

Type errors are not very helpful to the user.
The type system cannot express recursively defined types.
Instructions passed to the interpreter are not statically type-checked.

Untyped Cat (Kitten)

Cat doesn't have to be used as a statically typed language. Type annotations can be ignored and the type inference feature can be turned off. To do this, simply set the flag Config.gbStaticTyping in Config.cs to false, and voila, you have a Kitten interpreter. Kitten is the name of the untyped variant of Cat.

Every Cat program will be a valid Kitten program, but the Kitten interpreter will be much more permissive. For example, you would be able to write the following code in Kitten, which would fail in Cat:

define f { [42] ["all your base"] if }

The typing requirements of Cat, which are not enforced in Kitten, are that the two functions passed to the if function must have the same signature. In the example, one pushes an integer on the stack and the other pushes a string.

It is worth noting that Kitten resembles the Joy language even more closely than Cat.

Related Work

Here are some languages which inspired the Cat language, or are very closely related.

Stack based languages:

Intermediate Languages:

Functional Languages:

Scripting Languages

Conclusion

I'd be very happy to hear about spin-off projects that result from this work, or ideas for applications of the Cat language which you are interested in seeing implemented. I hope this code serves you well, and that you find the language interesting and helpful.