OpenC++ - A C++ Metacompiler and Introspection Library

Software Developer's Journal

Apr 24, 2006

CPOL

This article presents an overview of source code parsing techniques using OpenC++, including a sample implementation of automated serialisation.

This article is written by Grzegorz Jakacki, and was originally published in the August 2005 issue of the Software Developer's Journal. You can find more articles at the SDJ website.

Download source - 3.34 Kb

Introduction

Anybody who has ever tried to implement serialisation in C++ and has gone through the ordeal of writing serialising functions for classes (and maintaining them later!) knows that this is mostly a repetitive task that would be best performed by a machine. The task seems simple: for every class, generate a function which, roughly speaking, outputs all members to a stream. For instance, take the following class:

class Car
{
   Engine engine_;
   Person* maintainer_;
   ....
};

The serialisation function could look like this:

void Car::Serialize(ostream& os)
{
   os << "Car{";
   os << " engine_="; engine_.Serialize(os);
   os << " maintainer_=&"; 
   if (maintainer_) maintainer_->Serialize(os); else os << "NULL";
   os << " }";
}

In fact, the reality is more complex (in this example, we ignored issues like pointer cycles or base subobjects), but this does not change the situation that much; the task remains clear and tedious: implement a member function for every class, based on some simple scheme.

Surprisingly, this problem does not have any straightforward solution. The content of the function Serialize() can easily be built from the list of Car's members and their types; however, it is not easy at all to get a grip on such a list. First of all, it is impossible inside the C++ language. Some languages (most notably Lisp, but also Java, Smalltalk, Python, and others) expose the program code in the form of a data structure, which can be inspected (and even modified) by the program itself. This feature, known as reflection, is totally missing from C++. Without some external support, a C++ program has little chance of, for instance, obtaining a list of its own classes.

Any automated solution must, therefore, either rely on intrusive techniques (e.g., the Boost Serialization Library) or take advantage of an external code generator. Taking a program's source code, it should be fairly easy to extract the list of its classes, and then, for each class, extract the list of member variables with their types and inject the correct definitions of Serialize(). It should be, but it is not. All of these tasks are far too complex for a script that one could write in one afternoon. To do the job correctly, the script would need not only to perform a lexical analysis of the source code, but also to venture into identifier binding and type elaboration, at least far enough to resolve typedefs. This is not easy at all. If you are not convinced yet, think about nested classes, namespaces, etc.

It is astonishing how closed C++ compilers remain today. Over the past 10 years, the mainstream web browsers, which in a large part are compilers of mark-up languages, opened up their data structures and made them available via standardised introspection APIs. Today, an HTML file loaded into a browser can be inspected and modified by JavaScript snippets or dynamically linked plug-ins. Unfortunately, in the domain of C++, a similarly widespread opening of compilers or standardisation of introspection APIs has not taken place.

An Open Compiler

In the early 90s, PhD candidate Shigeru Chiba, working in a team led by Gregor Kiczales at Xerox PARC, defined a reflection system for C++, and implemented it in a C++ front-end compiler called OpenC++. The motivating goal of his work was to enable easy implementation of language extensions, custom code optimisations, and non-intrusive augmentation of the reused code (some of these ideas crystallised later into what is known today as aspect-oriented programming).

OpenC++ parses and analyses a C++ translation unit, creating its abstract representation in the form of objects representing entities that occur in C++ programs: expressions, environments, types, classes, and class members. These objects exist in an OpenC++ compiler (at meta-level), and should not be confused with objects that will exist in the compiled program during execution (base-level). To avoid confusion, the objects and classes of the OpenC++ introspection API are called metaobjects and metaclasses respectively.

The Introspection API makes metaobjects available to the C++ plug-ins that can be dynamically or statically added to the compiler and executed during compilation. Plug-ins can inspect and modify the metaobjects. After plug-ins have terminated, OpenC++ outputs the source code reflecting the current state of metaobjects, and ships it to the underlying back-end compiler.

This is a very sharp tool. First of all, you can ignore the ability to modify metaobjects, and merely use OpenC++ for source code analysis, which is a lot by itself. Secondly, the source-to-source translation on the abstraction level of types and expressions enables a very straightforward way to augment the compiler, be it in code optimisation, additional type checking, or code synthesis. For our serialisation example, we will take a look at the last of these, code synthesis.

As a bonus feature, the OpenC++ parser accepts several constructs that are not part of C++, but may be handy for implementing language extensions like parallelism, lambda notation, etc. Due to the limited size of this article, we will not dive into these waters.

In Chiba's system, particular emphasis is put on metaobjects representing classes. When OpenC++ encounters a class in compiled source code, it uses an Abstract Factory pattern to create a metaobject representing this class. The metaobject must implement the interface of an API metaclass Class, but the implementation may be defined by a client. The interface of Class not only provides introspection facilities (e.g., querying a class's name, bases, or members), but also defines a group of member functions that specify if and how different aspects of the class should be translated when shipped to the back-end compiler. The standard implementations provided by the metaclass Class "do nothing" (i.e., perform identity translation), but clients deriving a subclass are free to redefine them (Template Method pattern).

There are two ways to tell OpenC++ which metaclass it should use when creating class metaobjects. One way is to use an API to instruct the abstract factory mechanism to use another metaclass in place of the default (Class::SetDefaultMetaclass()). Another, more fine-grained way is to use the OpenC++ syntax extension to specify metaclasses for individual classes in the source code, on a case-by-case basis. The syntax extension dedicated to this purpose is:

metaclass MetaclassName ClassName;

For instance:

metaclass Serializable ShoppingCart;

specifies that the actual type of the metaobject representing the class ShoppingCart and its subclasses should be a metaclass Serializable. Effectively, this means that the translation of the aspects of ShoppingCart will be administered by the member functions of Serializable.
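
The combination of a factory that picks the metaclass and overridable translation hooks can be illustrated outside OpenC++. The following is only a sketch of the two patterns, not the OpenC++ API: MetaBase, Annotating, MetaFactory, Bind() and TranslateWith() are hypothetical names invented for this illustration, and the "translation" simply operates on a string.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the API metaclass Class (NOT the OpenC++ type).
class MetaBase {
public:
    explicit MetaBase(std::string name) : name_(std::move(name)) {}
    virtual ~MetaBase() = default;
    const std::string& Name() const { return name_; }
    // Overridable hook: the default performs the identity translation.
    virtual std::string TranslateClass(const std::string& source) const {
        return source;
    }
private:
    std::string name_;
};

// A client-defined metaclass that redefines the hook.
class Annotating : public MetaBase {
public:
    using MetaBase::MetaBase;
    std::string TranslateClass(const std::string& source) const override {
        return "/* translated by Annotating */ " + source;
    }
};

// Hypothetical factory: decides which metaclass represents a given class.
class MetaFactory {
public:
    using Maker = std::function<std::unique_ptr<MetaBase>(const std::string&)>;

    // Plays the role of "metaclass MetaclassName ClassName;".
    void Bind(const std::string& className, Maker maker) {
        makers_[className] = std::move(maker);
    }

    // Classes without a binding get the default (identity) metaclass.
    std::unique_ptr<MetaBase> Create(const std::string& className) const {
        assert(!className.empty());
        auto it = makers_.find(className);
        if (it != makers_.end()) return it->second(className);
        return std::make_unique<MetaBase>(className);
    }
private:
    std::map<std::string, Maker> makers_;
};

// Creates the metaobject for className and lets it translate the source.
std::string TranslateWith(const MetaFactory& factory,
                          const std::string& className,
                          const std::string& source) {
    return factory.Create(className)->TranslateClass(source);
}
```

In this sketch, a class bound to Annotating comes back with the marker comment prepended, while any other class passes through unchanged, which mirrors the "do nothing" default described above.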

Back to Serialisation

Before we proceed to implementing serialisation with OpenC++, a few words on the serialisation proper. We assume that serialised objects may have pointers, references, and value data members, and that we need to serialise all objects reachable through them. However, not to bog ourselves down with details, we shall also assume that:

classes of serialised objects do not employ multiple inheritance,
an object and its (unique) base subobject (if any) share the same address,
serialised classes do not contain array data members,
if a serialised class has a pointer data member, it has to point to a serialisable class (and ditto for references).

We shall implement the two-step serialisation scheme. In the first step (enumeration), the serialisation algorithm traverses serialised objects via pointers and references, collecting their addresses into a set of pointers. By inspecting this set, the algorithm can determine if a given object has already been encountered, thus avoiding infinite recursion. In the second step (output), the algorithm traverses the objects again, serialising them, and removing their addresses from the set of pointers. Again, by inspecting the set, the algorithm can tell if a given object has been serialised.

Listing 1 shows two example classes defined so that different kinds of data members are exposed (fundamental type, reference, pointer, user-defined type). Listing 2 shows the serialisation functions that need to be added for each serialisable class.

Listing 1. Example classes and their data members

class Human
{
   ....
   int id_;
};
class PregnantMother : public Human
{
   ....
   Human& parent_;
   Human* listener_;
   Human  baby_;
};

Listing 2. Serialisation code for classes defined in Listing 1

typedef std::set<const void*> Visited;    // (a)
class Human
{
   ....
   virtual void Enumerate(Visited& visited) const// (n)
   {                            // (n)
      if (visited.find(this) != visited.end()) return;// (c)
      visited.insert(this);                          // (g)
   }                                                 // (n)

   virtual void Output(                          // (n)
      std::ostream& os, Visited& visited) const  // (n)
   {                                             // (n)
      os << "Human(" << (void*)this << ")";      // (b)
      if (visited.find(this) == visited.end()) return;// (d)
      visited.erase(this);                       // (h)
      os << "{";                                 // (e)
      os << " id_=" << id_;                      // (i)
      os << "}";                                 // (m)
   }                                             // (n)
};

class PregnantMother : public Human
{
   ....
   virtual void Enumerate(Visited& visited) const  // (n)
   {                                               // (n)
      if (visited.find(this) != visited.end()) return;// (c)
      Human::Enumerate(visited);                   // (f)
      visited.insert(this);                        // (g)
      parent_.Enumerate(visited);                  // (j)
      if (listener_) listener_->Enumerate(visited);// (j)
      baby_.Enumerate(visited);                    // (k)
   }                                               // (n)
   virtual void Output(                            // (n)
      std::ostream& os, Visited& visited) const    // (n)
   {                                               // (n)
      os << "PregnantMother(" << (void*)this << ")";// (b)
      if (visited.find(this) == visited.end()) return;// (d)
      os << "{";                                   // (e)
      os << " Human = "; Human::Output(os, visited);// (f)
      visited.erase(this);                          // (h)
      os << " parent_ = (&)"; parent_.Output(os, visited);// (j)
      os << " listener_ = (*)";                     // (j)
      if (listener_) listener_->Output(os, visited);// (j)
      else os << "0";                               // (j)
      os << " baby_ = "; baby_.Output(os, visited); // (k)
      os << "}";                                    // (m)
   }                                                // (n)
};

After p->Enumerate(visited) terminates, the set visited should contain the addresses of all objects reachable from p. If p->Output(os, visited) is subsequently called, all the reachable objects will be serialised to os, and the visited set will be emptied. The calling protocol necessary to serialise p to std::cout is then:

Visited v;
p->Enumerate(v);
p->Output(std::cout, v);

For the client's convenience, we can wrap it into a handy template shown in Listing 3, so that the calling protocol is simplified to:

Serialize(std::cout, p);

Listing 3. A wrapper function template enabling a convenient calling protocol

template <class T>
inline void Serialize(std::ostream& os, const T* obj)
{
   assert(obj);
   Visited visited;
   obj->Enumerate(visited);
   obj->Output(os, visited);
   assert(visited.size() == 0);
}
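
To see how the two-step protocol copes with a pointer cycle, consider the following self-contained sketch. Node and SerializeCycle() are hypothetical, written by hand in the style of Listing 2; the node's id is printed in place of its address only so that the output is reproducible.

```cpp
#include <cassert>
#include <ostream>
#include <set>
#include <sstream>
#include <string>

typedef std::set<const void*> Visited;

// Hypothetical class, hand-written in the style of Listing 2.
class Node {
public:
    explicit Node(int id) : id_(id), next_(nullptr) {}
    void SetNext(Node* next) { next_ = next; }

    // Step 1: collect the addresses of all reachable objects.
    void Enumerate(Visited& visited) const {
        if (visited.find(this) != visited.end()) return; // seen: breaks cycles
        visited.insert(this);
        if (next_) next_->Enumerate(visited);
    }

    // Step 2: write each object once, removing it from the set.
    void Output(std::ostream& os, Visited& visited) const {
        os << "Node#" << id_;                             // label only
        if (visited.find(this) == visited.end()) return;  // already written
        visited.erase(this);
        os << "{ next_=";
        if (next_) next_->Output(os, visited); else os << "0";
        os << " }";
    }

private:
    int id_;
    Node* next_;
};

// Serialises the two-node cycle a -> b -> a and returns the text.
std::string SerializeCycle() {
    Node a(1), b(2);
    a.SetNext(&b);
    b.SetNext(&a);

    Visited visited;
    a.Enumerate(visited);
    assert(visited.size() == 2);   // both nodes found, recursion terminated

    std::ostringstream os;
    a.Output(os, visited);
    assert(visited.empty());       // every enumerated object was written
    return os.str();
}
```

The second visit to the first node prints only its label, because its address has already been removed from the set; this is what keeps Output() from recursing forever.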

It must also be noted that due to the fact that we are introducing names into scopes which we do not own, care must be taken so that our identifiers (Output, Enumerate, Visited) do not clash with existing ones. We are ignoring this issue here for the sake of clarity of presentation, but obviously in production code, it should not be neglected.

The (meta)algorithm for function generation creates two functions, Enumerate() and Output(). Enumerate() works as follows:

returns if this is already in the set of visited objects.
calls Enumerate for the base class, if any.
inserts this into the set of visited objects (if still not there).
calls Enumerate for its members, except the members of fundamental types.

Output() works as follows:

outputs the class's name and object's address.
returns if the object's address has already been removed from the set.
calls Output() for the base class.
removes this from the set of visited objects (if still there).
serialises data members using operator<< (fundamental types), or by calling Output() (other allowed types).
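
The two procedures above can be sketched as a plain string generator. This is a standalone illustration that does not use OpenC++: MemberKind, MemberDesc and GenerateBodies() are hypothetical, and the member list is supplied by hand rather than obtained from the compiler.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical description of a data member (not an OpenC++ type).
enum class MemberKind { Fundamental, Pointer, Reference, Value };

struct MemberDesc {
    std::string name;
    MemberKind kind;
};

// Returns {body of Enumerate(), body of Output()} as source text.
std::pair<std::string, std::string> GenerateBodies(
    const std::string& className,
    const std::string& baseName,              // empty if there is no base
    const std::vector<MemberDesc>& members)
{
    assert(!className.empty());
    std::ostringstream enumerate, output;

    // Class name and address, then the "already handled?" checks.
    output << "os << \"" << className << "(\" << (void*)this << \")\";\n";
    enumerate << "if (visited.find(this) != visited.end()) return;\n";
    output << "if (visited.find(this) == visited.end()) return;\n";
    output << "os << \"{\";\n";

    // Base class first, then bookkeeping for this object.
    if (!baseName.empty()) {
        enumerate << baseName << "::Enumerate(visited);\n";
        output << "os << \" " << baseName << " = \"; "
               << baseName << "::Output(os, visited);\n";
    }
    enumerate << "visited.insert(this);\n";
    output << "visited.erase(this);\n";

    // One snippet per data member, chosen by the member's kind.
    for (const MemberDesc& m : members) {
        const std::string& n = m.name;
        switch (m.kind) {
        case MemberKind::Fundamental:
            output << "os << \" " << n << " = \" << " << n << ";\n";
            break;
        case MemberKind::Pointer:
            enumerate << "if (" << n << ") " << n << "->Enumerate(visited);\n";
            output << "os << \" " << n << " = (*)\"; if (" << n << ") " << n
                   << "->Output(os, visited); else os << \"0\";\n";
            break;
        case MemberKind::Reference:
        case MemberKind::Value:
            enumerate << n << ".Enumerate(visited);\n";
            output << "os << \" " << n << " = \"; " << n
                   << ".Output(os, visited);\n";
            break;
        }
    }
    output << "os << \"}\";\n";
    return {enumerate.str(), output.str()};
}
```

The hard part, which this sketch sidesteps, is obtaining a reliable member list with resolved types; that is the information the compiler's introspection API is needed for.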

The implementation of these (meta)algorithms in OpenC++ is straightforward. We start by defining our own metaclass Serializable. We shall derive it from the OpenC++ metaclass Class and redefine Class::TranslateClass(). This member function will be called by the OpenC++ compiler for all class metaobjects corresponding to the compiled classes, which are declared Serializable in the source code.

class Serializable : public Class
{
   void TranslateClass(Environment* env) { .... }
};

The implementation of TranslateClass() is shown in Listing 4. We will not explain it step by step, but rather highlight the key details that facilitate an understanding of the whole function.

Listing 4. Implementation of Serializable::TranslateClass()

void Serializable::TranslateClass(Environment* env)
{
   VerifySingularInheritance(env, this);
   EmitInitialDeclarations(env);    // (a)
   // create buffers to store the generated code
   stringstream enumerate;
   stringstream output;
   // build function bodies
   EmitClassNameAndAddress(output, Name()->ToString()); // (b)
   enumerate << "if (visited.find(this) != visited.end()) return;\n";// (c)
   output    << "if (visited.find(this) == visited.end()) return;\n";// (d)
   output    << "os << \"{\";\n";      // (e)

   EmitCodeForBaseClasses(output, this);// (f)
   enumerate << "visited.insert(this);\n";// (g)
   output    << "visited.erase(this);\n";// (h)
   // emit code for each data member
   for(int n = 0; true; ++n) {
      Member m;
      bool hasMember = NthMember(n, m);
      if (! hasMember) break;
      if (m.IsFunction()) continue;
      TypeInfo t;
      m.Signature(t);
      if (t.IsBuiltInType()) {
         EmitCodeForBuiltin(output, m.Name()->ToString());  // (i)
      }
      else
      {
         Class* c;
         if (t.IsPointerType() || t.IsReferenceType()) {
            EmitCodeForCompoundType(
               enumerate, output, env, t, m.Name());  // (j)
         }
         else if (t.IsClass(c)) {
            EmitCodeForValue(
               enumerate, output, m.Name()->ToString());  // (k)
         }
         else {
            ReportError(env, m.Name(), "cannot handle this type");
         }
      }
   }
   output << "os << \"}\";\n";  // (m)
    
   // create function definitions and make them into
   // new members of the translated class
   AppendMemberFunctions(
      this, enumerate.str().c_str(), output.str().c_str());  // (n)
}

Our implementation of TranslateClass() is built around a loop, which iterates over the members of the translated class. In the loop, we find out about the type of each member, and output the appropriate source code to two buffers. The buffers are eventually inserted into the translated source code as bodies of new member functions (Enumerate() and Output()). Additionally, at the very beginning, we check the assumption about the multiplicity of inheritance (only zero or one base allowed), and extend the translated program with definitions of types used by Enumerate() and Output() (e.g., Visited, std::set, std::iostream) and the declaration of the template Serialize().

In Listings 2 and 4, we marked the source code with letters to show the correspondence between metacode and the code generated with it.

Listing 5. Code generation for base classes and pointer members

static void EmitCodeForBaseClasses(ostream& output, Class* c)
{
   if (Class* base = c->NthBaseClass(0)) {
      Ptree* name = base->Name();
      output << Ptree::Make(
         "os << \" %p = \"; %p::Output(os, visited);\n", name, name);
   }
}

static void EmitCodeForPointer(
   ostream& enumerate, ostream& output, const char* name)
{
   enumerate << Ptree::Make(
      "if (%s) %s->Enumerate(visited);\n", name, name)->ToString();
    
   output << Ptree::Make(
      "os << \" %s = (*) \";"
      "if (%s) %s->Output(os,visited); else os << \"NULL\";\n", 
      name, name, name) ->ToString();
}

Listing 5 shows the functions responsible for code generation for pointer members and base classes. The full source code is available for download above.

The interaction of the serialisation plug-in with the OpenC++ compiler boils down to the following API calls:

Class::NthMember() enables random access to members.
Class::Name(), Member::Name() obtain the name of the class or the member.
Class::Definition() returns a code chunk representing the class's definition.
Class::NthBaseClass() enables random access to the bases.
Class::AppendMember() injects a new member in the class's scope.
Class::InsertBeforeToplevel() injects new declarations in the top-level scope, immediately before the scope in which the class occurs.
Ptree::Make() creates a code chunk.
Ptree::ToString() creates a C-string out of a code chunk.
Environment::GetLineNumber() obtains a file name and the line number associated with a code chunk.
TypeInfo::IsFunction(), TypeInfo::IsPointerType(), TypeInfo::IsReferenceType(), TypeInfo::IsBuiltInType(), TypeInfo::IsClass() enable querying the properties of a type.
TypeInfo::Dereference() extracts the component type from a pointer or a reference.
TypeInfo::ClassMetaobject() obtains a class's metaobjects from a type representing a class (so that the client can, for example, examine its members).

Listing 6. Error reporting

static void ReportError(Environment* env, Ptree* where, const string& what)
{
   if (where) { 
      int line;
      assert(env);
      Ptree* filename = env->GetLineNumber(where, line);
      std::cerr << filename << ":" << line << ":";
   }
   else std::cerr << "ERROR:";
   std::cerr << what << endl;
   std::exit(1);
}

Listing 6 demonstrates how a plug-in can report a compilation error. Note that, unlike in template metaprogramming, the metaprogram has very fine-grained control over what is being reported and how.

The CD accompanying this issue contains an example program, example.cc, that can be used to verify that serialisation support works as expected. To be able to appreciate our new toy, we need to compile the metaclass Serializable and link it, dynamically or statically, with the main module of the OpenC++ compiler. In this presentation, we will use static linking. First, we prepare a compiler with the Serializable metaclass linked in:

$ occ2 Serializable.mc -o custom-cc

We can verify which metaclasses are available in the compiler, by issuing the command:

$ ./custom-cc -l
occ: loaded metaclasses: Serializable

Now, we may try to compile the example with the customised compiler:

$ ./custom-cc -P example.cc

(Option -P forces an additional preprocessor pass, which is necessary here to resolve the #include directives injected by our metacode.) The obtained executable builds the data structure and serialises it.

OpenC++ as a Library

Throughout the 10 years of its existence, OpenC++ has proved useful not only as a metacompiler, but also as a reusable parser and type elaborator. If you are satisfied with the ability to get a grip on metaobjects as they stand after the translation unit is compiled, you can simply supply a plug-in that does its job in YourMetaclass::TranslateClass().

This may not be enough though. The plug-in model does not give much control over the execution flow within the OpenC++ compiler. Projects that demand more flexibility must resort to diving into the OpenC++ source code and interfacing to it directly, as certain successful projects do.

This is not as scary as it seems. OpenC++ is rooted in research on software architecture, and it shows in its design. Clearly visible and natural applications of design patterns (truly natural, since OpenC++ was designed before the widespread adoption of patterns) make code reading a pleasant task. The clarity of the design surely helped the project survive more than a decade, without picking up too much clutter or collapsing under its own weight.

The OpenC++ architecture is built on an abstraction of a translation unit called Program. Program is, more or less, an addressable sequence of characters. A streaming lexer and a recursive-descent parser build a parse tree linked into the components of a Program. This solution enables de-parsing of unchanged parts of the syntax tree with minimum effort, and guarantees that de-parsed text will be identical with the original. This, in turn, greatly simplifies keeping track of the correspondence between the line numbers in the original source code and the line numbers in the code shipped to the back-end compiler.

Tree algorithms, such as typing, binding, or translation are implemented using the Visitor pattern. The implementation of the syntax tree is based on a variant of the Composite pattern, but it is slightly unorthodox. The (multiary, by its nature) syntax tree is encoded in a binary tree built upon two kinds of nodes, as shown in Figure 1.

NonLeaf, a binary node.
Leaf, a nullary node with a link to a program chunk.

Every syntax tree in OpenC++ can be manipulated in terms of this simple interface. Higher-level constructs of a tree (block, function definition) are encoded in predefined patterns of Leafs and NonLeafs. This is useful, since, for instance, a visitor is able to traverse the children of constructs unknown to it, and the addition of new syntax constructs does not necessarily invalidate all existing visitors. On the other hand, finding out which higher-level construct is encoded by a given Leaf/NonLeaf subtree becomes tricky (Google for tree parsing) unless great care is invested into the disambiguation of the encoding. OpenC++ solves this problem by tagging each NonLeaf node that is a root of construct encoding (a similar qualification of Leaf nodes is also implemented). The tag is hidden in the dynamic type of NonLeaf; simply the node object is given a type more specific than NonLeaf. See Figure 2 for a rendition of int main() {} as an OpenC++ syntax tree.
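
The encoding can be illustrated with a miniature model. The sketch below is not the OpenC++ class hierarchy: Node, Leaf, NonLeaf, FunctionDefinition, List(), CollectLeaves() and IsFunctionDefinition() are hypothetical names, and the tree shape used for a function definition is an assumption made for the example. It shows a multiary list encoded as a chain of binary nodes, a generic traversal that needs no knowledge of higher-level constructs, and a tag carried by the dynamic type of the root node.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature of the two node kinds.
struct Node {
    virtual ~Node() = default;
    virtual bool IsLeaf() const = 0;
};

struct Leaf : Node {
    explicit Leaf(std::string t) : text(std::move(t)) {}
    bool IsLeaf() const override { return true; }
    std::string text;               // stands in for the link to a program chunk
};

struct NonLeaf : Node {
    NonLeaf(std::shared_ptr<Node> f, std::shared_ptr<Node> r)
        : first(std::move(f)), rest(std::move(r)) { assert(first != nullptr); }
    bool IsLeaf() const override { return false; }
    std::shared_ptr<Node> first;    // one child
    std::shared_ptr<Node> rest;     // remaining children; null ends the list
};

// The tag is nothing more than a more specific dynamic type.
struct FunctionDefinition : NonLeaf {
    using NonLeaf::NonLeaf;
};

// Encodes n children as a right-nested chain of binary nodes.
std::shared_ptr<Node> List(const std::vector<std::shared_ptr<Node>>& children) {
    std::shared_ptr<Node> rest;
    for (auto it = children.rbegin(); it != children.rend(); ++it)
        rest = std::make_shared<NonLeaf>(*it, rest);
    return rest;
}

// Generic traversal: works on any construct, known or unknown.
void CollectLeaves(const Node* n, std::vector<std::string>& out) {
    if (!n) return;
    if (n->IsLeaf()) {
        out.push_back(static_cast<const Leaf*>(n)->text);
        return;
    }
    const NonLeaf* nl = static_cast<const NonLeaf*>(n);
    CollectLeaves(nl->first.get(), out);
    CollectLeaves(nl->rest.get(), out);
}

// Construct recognition reduces to a check of the dynamic type.
bool IsFunctionDefinition(const Node* n) {
    return dynamic_cast<const FunctionDefinition*>(n) != nullptr;
}
```

A visitor written only against Leaf and NonLeaf keeps working when a new tagged subtype is introduced, while code that cares about a specific construct can test for it directly instead of pattern-matching the shape of the subtree.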

Figure 1. Hierarchy of Ptree metaclasses.

Figure 2. Syntax tree representing int main() {}

An important factor influencing the reusability of OpenC++ code is its use of the Boehm-Demers-Weiser (BDW) garbage collector for metaobjects. On one hand, this is a blessing, since the client is freed from constant worries about object ownership, but on the other hand, it proves to be an obstacle for porting. Unpleasant surprises also await those who put garbage-collected objects in GC-unaware STL containers: the lack of GC-aware allocators leads to objects being reclaimed prematurely. Nevertheless, the BDW collector has been making constant progress over the last few years in terms of stability, on many platforms, and the experiences of OpenC++ users show that in many cases (for example, batch compilation), garbage collection can be ignored altogether.

However powerful OpenC++ might be, there are several areas where it could still be improved. One of them is template support. OpenC++ was created just before templates entered the mainstream. Template support was added as an afterthought, and the project never had the resources to work on it properly (some limited support is available, but it leaves much to be desired). Another issue is that OpenC++ is not particularly skilled with overloading. Clearly, some work invested in these areas would increase the usefulness of the product.

Last, but not least, we should mention the legal aspects of building on OpenC++ code. OpenC++ is free software under a very liberal, MIT-style license (without copyleft). Help yourself.

Today and Tomorrow

OpenC++ design may be timeless, but even the Sphinx may need a facelift from time to time. Since 2002, OpenC++ has been under community-based development at SourceForge. In 2004, the project undertook an effort to separate the metacompiler framework from the introspection library, in order to facilitate reusing the latter. This idea spawned the OpenC++Core library subproject, focused on the API part and leaving out the metaprogramming framework. It is still in the alpha stage, but alive and kicking.

In the long term, OpenC++ is looking at the ambitious task of C++ refactoring. Refactoring differs from source-to-source translation in that it requires the output code to be human-readable. The presence of a preprocessor greatly complicates this task for C or C++. This is not impossible though, and would surely be useful.

Acknowledgements

I would like to thank Prof. Shigeru Chiba and the decision makers at Xerox PARC, for making OpenC++ available to everybody. I also want to thank Prof. Chiba for his input into this article. My special thanks go to the volunteers and the past and present developers of OpenC++, who keep the project moving ahead.

Bibliography

Shigeru Chiba, A Metaobject Protocol for C++, in Proceedings of OOPSLA'95.
Marshall Cline, C++ FAQ Lite, Serialization and Unserialization
VFiasco
Synopsis

Volunteering

If you are interested in OO, C++, and compilers, and if you feel you would like to work on something interesting and useful, consider volunteering for OpenC++. Please refer to opencxx.sourceforge.net/volunteer, or contact the author.