The Future of Software Development: CodeDOMs (Part 1)

KenBeckett

Nov 4, 2012

CDDL

Modeling the semantics of programming languages.

Introduction

This article is about creating code models, or “CodeDOMs”, that model the semantics of programming languages. This is Part 1 of a series on codeDOMs which will include lots of source code, but this first article is a necessary background discussion of what is being attempted and why it’s needed.

Most Computer Languages Today are Lacking Something Important

Software developers must not only create applications which meet requirements and are as robust and bug-free as possible, but they must also do their best to build software which is as easy as possible to maintain and extend – not only by themselves, but by others. This often involves the use of object-oriented techniques to create an easy-to-understand object model which can also be easily extended. Such techniques are not always used in these days of HTML and scripting languages, but it’s probably safe to say that the majority of senior developers would agree that object-oriented analysis, design, and programming techniques are a “best practice” for large, complex applications.

Therefore, it’s somewhat ironic that the tools used to build such applications rarely expose an object model and usually are not extensible. This problem starts with the most important tools of all: the language compilers. Developers will often find the language they are using to be somewhat lacking in capability or features for their specific needs, but will have no recourse but to work within the confines of the language, usually waiting many years for new features to arrive. Experienced developers working on complex systems will often end up effectively implementing a “Domain-Specific Language” (DSL) as a natural way of simplifying core logic in the system. This “language” might consist only of a set of helpful methods or types (the developer might not even realize that they’ve effectively created a DSL), or it might be a true scripting or compiled language (whether homegrown or 3rd party). However, they will not be able to extend the primary language to accommodate the DSL, limiting the power of this technique (consider the addition of “inline SQL” with the LINQ feature of C# as one example of what could be accomplished by allowing for such extensions).

More importantly, the closed nature of language compilers puts a huge burden on 3rd party developers of related tools, such as code analysis tools, code difference comparers, etc. Vendors must implement their own parsing and reference resolving, which requires huge amounts of time and is exceedingly complex. This duplication of effort between the compiler, editor, and other tools results in poor quality due to inconsistencies, bugs, and performance and memory usage issues. It also creates a large barrier to market for such tools, limiting choices for developers. No matter how good a developer is, good tools will make them better. A good static analysis tool will find issues in anyone’s code. But, the lack of tool quality can drive developers away from using tools that would otherwise improve their code.

In summary, the qualities that software developers strive to provide in software that they create do not effectively exist in the core language tools that they use to write their code: they do not have easily extensible object models (at least, not publicly accessible ones). And, this causes serious difficulty for all higher level tools used in the development process. In other words, “The cobbler’s children have no shoes”. The majority of developers probably don’t even realize this, or understand just how much of a limitation this is to them in their daily work – it’s just the way it’s always been. Languages are closed.

Why are Computer Languages Generally “Closed”?

Computer languages are almost always defined in terms of text, and they are almost always closed to extension by users. I would argue that this closed nature is due more to tradition than good reason. The tradition when creating a computer language is to create a “grammar” that defines its text representation. This grammar is then fed into a code generator that creates fearsomely convoluted code that “parses” the language and builds an Abstract Syntax Tree (AST) which can then be analyzed for correctness and used to generate executable code (or pseudo instructions which are converted to real opcodes later). Any change to the language requires a change to the grammar, re-generation of the parsing code, and re-building of the entire compiler – everything about this architecture is a barrier to changes by anyone who isn’t a member of the compiler team. And, even if you’re a member of that team (perhaps you’ve created your own DSL), you will probably dread having to debug a problem and step into that ugly generated parsing logic.

Computer languages are text-based. That’s how they started, and it’s never changed (ignoring a few rare exceptions that have never really caught on). Developers are taught in school that this is just how it’s done. Most of them have probably never really thought very hard about why it’s done that way, much less about possible alternatives and the world of benefits that they might open up. Most of them are also probably scared away from ever creating their own languages after that one compiler theory class that they had (grammars, compiler-compilers, lexers, ASTs, semantic analysis, EBNF, LL(k), LALR – ack!). It’s the traditional text nature of computer languages combined with the traditional methods of implementing them that results in their closed nature. Once upon a time, perhaps this all made plenty of sense, but at least since the advent of graphical IDEs, it has become somewhat archaic in my opinion.

It’s time to think outside the (text) box. It’s time to swallow the red pill.

Object Models for Code: CodeDOMs

What is a “computer language”, exactly? Why not define computer languages in terms of objects rather than text? After all, text-based languages are generally converted to objects by modern compilers in order to be processed. Text is actually a terrible format for a computer to digest – it’s used for the benefit of humans, not machines. Using text is an ancient tradition with computer languages, because it allows developers to easily write code using any text editor. However, for decades now most developers have used IDEs with what are essentially graphical editors (with colors, fonts, pop-up menus, tooltips, IntelliSense, collapsible sections, etc). They get the feel that they’re working with text, and their code is stored in text files, but the IDE is very far from being a simple text editor – it’s using a hidden object model to represent the code internally.

So, why not make the huge leap of designing languages directly as an object model that represents the semantics of the language and forget about text and all of the limitations that come with it? Well, maybe we can’t completely forget about text, but can we at least make the text representation second-class to the object model instead of the other way around? This isn’t actually a completely new idea – Smalltalk provided an object model for code, and various visual programming environments have effectively done it. The fact that such attempts have not succeeded wildly doesn’t mean that the general idea isn’t a good one – just that various drawbacks existed which prevented widespread adoption. There’s no question that the potential benefits are huge and numerous, but a successful design will require almost zero drawbacks. Drag-and-drop programming may have its uses, but preventing developers from typing away madly, writing temporary pseudo-code, or anything else that they do today, would make most of them very unhappy. When it comes to editing code, the user must have at least the option of an experience very similar to what they have with text languages. Also, although storing programs as objects in a database would seem to make a great deal of sense (after all, that’s what is generally done with any other complex data), it also makes sense to retain the option of storing as text for backwards compatibility.

The logical conclusion of this line of thinking is: We should start designing computer languages primarily as an object model that represents the semantics of the language, but also with an alternative text representation and with easy conversion between the two. This might not sound all that different from what basically exists today, but the focus on making an object model the first-class implementation will have a huge impact: all tools will be consistent and much easier to create instead of everybody constantly replicating effort by creating their own incompatible object models. Also, the language will be completely open and extensible like any good object-oriented design. The number and quality of language tools will both increase dramatically. Sometimes, a relatively minor change in viewpoint can make a tremendous difference in outcome.

Let’s call such object models for code “CodeDOMs”. The “DOM” stands for “Document Object Model”. It’s not perfect, but it’s concise and there is a history of prior use for this term (although perhaps not with exactly the same definition, which will be addressed later). Also, what language should be used to create the object model? In most cases, the same language being modeled – that might sound a bit strange at first, but it actually makes a lot of sense (it’s known as “bootstrapping”). The codeDOM could also be provided in other languages if for some reason they might be used to manipulate the primary language.
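As a small illustration of an object model with easy conversion to and from text, Python’s standard `ast` module can stand in for a codeDOM. This is only an analogy: `ast` models syntax rather than the richer semantics this series is after, and the variable names in the snippet are made up for the example. Note that `ast.unparse` requires Python 3.9 or later.

```python
import ast

# Text -> objects: parse one statement into a tree of node objects.
tree = ast.parse("total = price * qty")

# Manipulate the objects directly instead of editing text:
# rename every use of the variable `qty` to `quantity`.
for node in ast.walk(tree):
    if isinstance(node, ast.Name) and node.id == "qty":
        node.id = "quantity"

# Objects -> text: regenerate source code from the modified tree.
new_source = ast.unparse(tree)
print(new_source)
```

The rename touches only the objects that actually represent the variable, which is the kind of operation that is awkward and error-prone when done with text search-and-replace.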

How Do You Examine and Edit Code Without Using Text?

The codeDOM objects would be displayed very much like text is displayed in an IDE, meaning using what is really a graphical display, but one that still uses plenty of text using colored fonts. After all, it will still make sense for code objects such as types and methods to have names, statements to use keywords, and of course to have comments throughout the code. It should always be possible to get a virtually identical display as you would with text, but it will also be possible to get a more graphical display if desired (or use a different “skin”), since it will be objects that are rendered instead of plain text. For example, background colors and enclosing lines might be used to represent objects that are children of others, comments that are associated with specific code objects, etc. Sub-expressions that “wrap” onto more than one line might be vertically centered within the parent expression, meaning that all text might not line up exactly into specific rows (or columns) on the screen. Most IDEs already use a proprietary object model internally for display that is mapped to the text – the idea here is to provide a public object model for the language instead. It’s also a suggestion that IDEs move towards a more graphical display, dropping the idea of the represented code appearing almost exactly as the lines and columns of a text file.

Actually, once you start to think of code being displayed truly graphically instead of as text, many new things become possible. Any statement could have its body optionally collapsed. Many syntax characters lose their importance, and things such as braces, semi-colons, or even statement and method parentheses might be optionally hidden. Comments could be optionally displayed in a proportional font or hidden. Documentation comments could be displayed in a WYSIWYG format instead of as XML. Real mathematical symbols could be used to display some operators in place of the ASCII characters that traditionally fill in for them. You could even customize existing UI controls for code objects, or create your own – such as a mapping table that looks like a spreadsheet dropped into your program that maps one column of values directly onto another (implemented with a hidden ‘switch’ statement, or even with custom code generation). You might choose to see a tree-like representation of code objects, such as to make the evaluation order of a complex expression more obvious. The possibilities are basically unlimited – and best of all – each user could customize their own view of the code as they desire (no more concerns or arguments about formatting). You could even provide the option to map keywords and library names to the (human) language of the programmer, instead of forcing everybody to deal with English.
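To make the idea of per-user display a little more concrete, here is a minimal sketch in Python of one expression object rendered with two different “skins”. The class names (`Name`, `BinaryOp`), the skin tables, and the `render` function are all hypothetical, invented for this illustration. The point is only that the stored model never changes, while the displayed symbols do.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Name:
    ident: str


@dataclass
class BinaryOp:
    op: str  # canonical operator stored in the model, e.g. "!="
    left: "Expr"
    right: "Expr"


Expr = Union[Name, BinaryOp]

# Two hypothetical skins: same model, different display symbols.
ASCII_SKIN = {"!=": "!=", "<=": "<=", ">=": ">="}
MATH_SKIN = {"!=": "≠", "<=": "≤", ">=": "≥"}


def render(node: Expr, skin: dict) -> str:
    """Render a code object as display text using the given operator skin."""
    if isinstance(node, Name):
        return node.ident
    symbol = skin.get(node.op, node.op)  # fall back to the canonical operator
    return f"{render(node.left, skin)} {symbol} {render(node.right, skin)}"


expr = BinaryOp("!=", Name("count"), Name("limit"))
print(render(expr, ASCII_SKIN))
print(render(expr, MATH_SKIN))
```

A real editor would render to graphical elements rather than strings, but the separation between the model and its presentation would be the same.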

As far as editing, the GUI would probably have more graphical editing options than a standard IDE, such as the use of drop-down selections, drag-and-drop, etc. The use of a more graphical display could make it quicker and easier to select proper code fragments than when using a text-based display. However, doing editing “right” would mean allowing the user to just type away normally, parsing the code on the fly into code objects, while also easily allowing for code fragments or pseudo-code that isn’t quite valid yet. This is an area where previous attempts at this sort of thing have often fallen short, but there is theoretically no reason that such text-like editing couldn’t still be supported for a tree of objects.
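One way to picture text-like editing over a tree of objects is to parse whatever the user types and, when it doesn’t parse yet, keep the raw text in a placeholder object instead of rejecting it. The sketch below uses Python’s standard `ast` module as a stand-in parser. The `Unrecognized` class and the `to_code_object` function are hypothetical names for this example, not part of any existing editor.

```python
import ast
from dataclasses import dataclass


@dataclass
class Unrecognized:
    """Placeholder code object holding text that does not parse (yet)."""
    text: str


def to_code_object(text: str):
    """Parse typed text into a code object, keeping invalid text as-is."""
    try:
        return ast.parse(text).body[0]
    except (SyntaxError, IndexError):
        # Invalid or empty input: keep the raw text so nothing is lost.
        return Unrecognized(text)


typed_lines = ["x = compute(1, 2)", "if x > then ???", "print(x)"]
objects = [to_code_object(line) for line in typed_lines]
kinds = [type(obj).__name__ for obj in objects]
print(kinds)
```

The half-finished line survives as an object in the tree, so the user can keep typing and the editor can retry the parse later.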

The average user of a language with a standardized codeDOM wouldn’t necessarily need to learn or use the codeDOM. They could learn the language much as they do with text-based languages today. They would benefit from the codeDOM through the increased number and quality of language tools that it would bring about, but they wouldn’t need to use it directly themselves. Most likely, though, the day would come that they’d find themselves using the codeDOM to create a tool, extend the language, add code analysis rules, or generate code. A codeDOM also provides “reflection” and “expression tree” support (used in modern, managed languages such as C#).

How Much Does a CodeDOM Really Buy Us?

I’ve already talked about quite a few benefits, but here’s a recap plus some additional ideas:

Better consistency between tools, less memory usage, and better performance.
A much bigger selection of tools, with much better overall quality.
A much higher level of customization for all tools, starting with the language itself.
Much better support for DSLs, and tight integration with the primary language.
A more graphical display and manipulation of code in addition to text-like editing.
Highly customizable display of code by each individual user – making it easier to read and understand code, and finally putting an end to formatting style disagreements.
Better and more easily customized code analysis with better performance.
Much better version control based upon actual code object changes instead of text.
The ability to store code in a database instead of text files, increasing performance by eliminating the need for parsing, and providing better management of large codebases.
Search, analyze, and refactor code using powerful SQL queries.
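As a rough sketch of the last two points, code objects could be stored as rows and searched with ordinary SQL. The example below uses Python’s standard `sqlite3` module with an in-memory database. The `code_objects` table, its columns, and the sample class and method names are assumptions made up for this illustration, and a real schema would need far more detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE code_objects ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER REFERENCES code_objects(id),"
    " kind TEXT NOT NULL,"
    " name TEXT)"
)

# A tiny hypothetical codebase: two classes with their methods.
rows = [
    (1, None, "class", "Invoice"),
    (2, 1, "method", "GetTotal"),
    (3, 1, "method", "Save"),
    (4, None, "class", "Customer"),
    (5, 4, "method", "Save"),
]
conn.executemany("INSERT INTO code_objects VALUES (?, ?, ?, ?)", rows)

# Find every class that declares a method named "Save".
query = """
    SELECT c.name
    FROM code_objects AS m
    JOIN code_objects AS c ON c.id = m.parent_id
    WHERE m.kind = 'method' AND m.name = 'Save' AND c.kind = 'class'
    ORDER BY c.name
"""
classes_with_save = [name for (name,) in conn.execute(query)]
print(classes_with_save)
```

The query works on declared methods, so it cannot be fooled by the word “Save” appearing in a comment or a string, which a plain text search could be.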

I certainly hope that most readers are starting to buy into the whole codeDOM idea by now, and that many of you are more than a little excited about the possibilities. Honestly, I’ve been thinking about them myself since the 90’s, and frankly I’m quite disappointed in the entire industry that it’s taking so long to implement something with such obvious huge benefits! As software developers, we not only need this sort of thing, we needed it decades ago! On that note, I’ll risk being a bit grandiose and present a “manifesto” of sorts (without actually using that overused word).

The Software Developer’s Ultimatum

As software developers, we are often asked to perform monumental tasks in ridiculously short periods of time, making few mistakes along the way, and resulting in something that is easy to understand and extend by those who come along after us. Sadly, we often fall far short of meeting these expectations.

We accept part of the blame, acknowledging that we are only human. However, let it be known that the tools which are provided for us to accomplish our work are woefully inadequate for the job. To create better software, we need better tools. Support for object-oriented techniques, managed code, and agile methodologies have been big advances, but they’re not enough. Our core tools are lacking in many respects, and until we are provided with better ones, the quality of our work will suffer accordingly.

Specifically, a modern computer language platform should include:

1) A publicly accessible and extensible set of classes that model the semantics of the language, implemented in the language itself, and in addition to the text representation with easy conversion between the two. We need to escape the restrictions of text, and the need for language tools to constantly re-parse it.
2) A publicly accessible and extensible graphical editor that allows for direct manipulation of code objects while also providing most commonly used text-oriented editing features. We need better code editing and refactoring tools, and we need to be able to customize them ourselves.
3) Integrated and easily customizable support for code analysis that can be improved over time to be far better than what is currently available. We need better code analysis without performance issues, and we need to be able to customize it ourselves.
4) Integrated and automatic version control that determines differences at the code object level instead of at the text level, and provides excellent branching and merging capabilities. We need to know exactly what has changed, not a rough guess, and we need simple & error-free merging.
5) Optional storage of code objects directly in a database instead of text files, allowing for faster, more powerful, and parallel searching and analysis operations on very large codebases. We need the proper tools for analyzing mountains of code, not scanning thousands of text files.
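To illustrate what differences “at the code object level” might look like, here is a minimal sketch in Python. It assumes each code object carries a stable identity (`uid`) that survives renames. That identity, along with the `CodeObject` class and the `diff` function, is a hypothetical design choice for this example and not a description of any existing version control system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeObject:
    uid: int   # hypothetical stable identity that survives renames
    kind: str
    name: str


def diff(old, new):
    """Compare two versions of a code model by object identity, not by text."""
    old_by_id = {obj.uid: obj for obj in old}
    new_by_id = {obj.uid: obj for obj in new}
    changes = []
    for uid, obj in old_by_id.items():
        if uid not in new_by_id:
            changes.append(f"removed {obj.kind} {obj.name}")
        elif new_by_id[uid].name != obj.name:
            changes.append(f"renamed {obj.kind} {obj.name} -> {new_by_id[uid].name}")
    for uid, obj in new_by_id.items():
        if uid not in old_by_id:
            changes.append(f"added {obj.kind} {obj.name}")
    return changes


before = [CodeObject(1, "class", "Invoice"),
          CodeObject(2, "method", "GetTotal"),
          CodeObject(3, "method", "Print")]
after = [CodeObject(1, "class", "Invoice"),
         CodeObject(2, "method", "CalculateTotal"),
         CodeObject(4, "method", "Export")]

changes = diff(before, after)
for change in changes:
    print(change)
```

A text-based diff would report the rename as one deleted line and one added line. The object-based comparison can say what actually happened.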

When provided with such a development platform, our productivity and the quality of our code will increase significantly. We will finally be free to concentrate mainly on the problems we are trying to solve instead of working through a fog of limitations with our tools.

As of Fall 2012 – despite the creation of many new managed and scripting languages in the last 20 years – there really aren’t any widely-used computer languages available to us that provide any of these features, much less all of them.

Design Goals for a CodeDOM

I’m going to lay out some primary design goals for a codeDOM as I’ve been defining it. My idea of a codeDOM is a set of classes that can be used to create a tree of objects that represent the semantics (meaning) of code in a particular language (a new language or an existing one). These “code objects” will represent the code in a form that is most easily manipulated by other code, making it as easy as possible for a programmer to write code that analyzes and modifies the code, thus greatly facilitating the creation of language tools. I think the most important design goals for such a codeDOM are:

A) Clearly named classes that model the actual semantics of the language. All statements, operators, comments, etc. should have their own classes, they should be in a logical hierarchy with a single common base class, and preferably should be implemented in the language being modeled (although implementations for other languages could also exist).
B) Child objects should have a reference to their parent. It should be possible to associate comments with a particular code object in addition to having standalone comments.
C) Conservative use of memory. Large codebases will have millions of objects, so no extra fields or objects for non-essential syntax, tokens, or formatting that really apply only for text.
D) Easy modifications of the code object tree. Rename an object simply by changing its Name property. Assigning an object to a new parent should automatically update its parent reference.
E) Support for formatting code objects as text (for export, debugging, etc), but implemented in an unobtrusive way that defaults to standard formatting if not explicitly specified.
F) Support for parsing existing text code into code objects in addition to creating them manually.
G) Support for resolving of symbolic references into direct references to other code objects.

It’s impossible to overstate the importance of design goal A. The names of the classes, their hierarchy, and their members basically define the language in programmatic form, and they should match the text representation as closely as possible.
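Several of these goals can be shown together in a very small sketch. The following Python is not the codeDOM presented later in this series (which targets C#). The class names, the `add` and `resolve` helpers, and the loosely C#-like output format are all assumptions made for illustration.

```python
class CodeObject:
    """Single common base class for all code objects."""

    def __init__(self, name=None):
        self.name = name
        self.parent = None    # every child keeps a reference to its parent
        self.children = []
        self.comment = None   # a comment associated with this specific object

    def add(self, child):
        """Attach `child` here, detaching it from any previous parent."""
        if child.parent is not None:
            child.parent.children.remove(child)
        child.parent = self   # the parent reference is updated automatically
        self.children.append(child)
        return child

    def as_text(self, indent=0):
        raise NotImplementedError


class ClassDecl(CodeObject):
    def as_text(self, indent=0):
        pad = "    " * indent
        body = "".join(c.as_text(indent + 1) for c in self.children)
        return f"{pad}class {self.name}\n{pad}{{\n{body}{pad}}}\n"


class MethodDecl(CodeObject):
    def as_text(self, indent=0):
        pad = "    " * indent
        note = f"{pad}// {self.comment}\n" if self.comment else ""
        return f"{note}{pad}void {self.name}() {{ }}\n"


def resolve(symbol, scope):
    """Resolve a symbolic name to a direct object reference within `scope`."""
    for child in scope.children:
        if child.name == symbol:
            return child
    return None


invoice = ClassDecl("Invoice")
total = invoice.add(MethodDecl("GetTotal"))
total.comment = "Sums all line items."
target = resolve("GetTotal", invoice)  # a direct reference to `total`

total.name = "CalculateTotal"          # rename by changing one property
archive = ClassDecl("Archive")
archive.add(total)                     # move the method to a new parent
print(archive.as_text())
```

Because `target` is a direct reference rather than a name, it still points at the same method after the rename and the move. That is the practical payoff of resolving symbolic references.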

Doesn’t Something Like This Already Exist?

As I’ve already mentioned, this isn’t exactly a revolutionary idea – at least, not the basic idea of an object model that represents code. But, the specific viewpoint that we should start designing languages as an open object model first with a text representation more of an afterthought is perhaps somewhat radical. And, it seems to me that few of the design goals laid out above have really been met by any existing tool to date. However, in .NET, there are code modeling classes in the System.CodeDom and System.Linq.Expressions namespaces, and then there’s the Roslyn project. I’ll address these further in my next article.

Enough Talk, Already – How About Some Code?

I’ve been writing code for over 30 years, and I’m tired of waiting for some big company to finally create a development environment that gets to the promising, gleaming future where code is stored as objects instead of text. Forget about flying cars – I want codeDOMs!

The good news is that I’ve actually been working on exactly that for a long time now, and I’m prepared to hand over a lot of my sources to the public domain in the hopes of increasing interest in codeDOMs, and also just to share useful code. In this series of articles, I will share a codeDOM for C#, displaying a codeDOM with WPF, an object-oriented parsing technique, codeDOM classes for reading/writing VS Solution and Project files, examples of using Reflection and Mono Cecil to load metadata from assemblies, how to resolve symbolic references in a codeDOM, calculating metrics from and searching a codeDOM, an analysis of what tools such as Roslyn give us, and more.

In my next article, I will jump into creating a codeDOM based upon C#, with source included. I’ll try to keep it simple, clean, and well organized, but we’ll have about 45,000 lines of code spread across about 300 types for a start, and several times that much by the end of this series. The series continues in Part 2.