ESS: Extremely Simple Serialization for C++

Jerry Evans

Apr 20, 2009

BSD

An article on persistent C++ objects. Includes several console mode test apps and an MFC GUI demo.

Download source with project files for VS2003 - 31.6 KB
Download source with project files for VS2008 - 31.6 KB
Download source and release binary - 313 KB (contains a GUI demo project for VS2003)

Introduction

In this article, I will describe the implementation of a light-weight mechanism for persisting C++ objects to XML or binary formats. Articles of this nature do not naturally generate a great deal of visual content, hence the many interspersed code fragments. I hope you find it interesting.

The sample code contains VS2003 and VS2008 projects which build a console mode Unit Test application. /W4 is used throughout.

In addition, there is an ultra-simple Contacts MFC app which displays content in a hybrid grid-tree control. The purpose of the GUI app is to show:

How to extend ESS marshalling for your own classes
How to use simple versioning
How ESS copes with real-life code - protected constructors, virtual void functions, etc.
How dynamic object creation works - and how ESS handles errors

When to use

I've found a multitude of applications for this technique - including persisting of program options, saving state for undo/redo operations, automatically enabling XML file formats for application data, client-server communications (i.e., packets on the wire), and storage of non-relational data as XML in SQL databases.

Key features

All ISO C++ compliant, portable, code
Does not require persistent classes to share a common base class
Does not require RTTI-enabled compilation
Respects existing access control for constructors and destructors
Will serialize pointers to classes/structs that can be serialized
Correctly restores contents of containers of pointers to polymorphic objects
Emphasizes compile-time checking to minimize runtime errors
Macros used only for brevity and forwards to debuggable code
Implementation is completely in-line - only need to #include ESS header files
Very simple to add new storage formats - JSON, for example

Constraints

Requires serializable classes to have a void (i.e., no-argument) constructor
The current implementation assumes serialization happens in one thread - little is required to add thread safety
Explicitly does not support serialization of 'C' pointer types, especially void* and friends
UNICODE string storage not yet implemented for XML
No theoretical impediments to using ESS with multiple inheritance, but completely untested

Conventions

In order to prevent endless repetition, let us assume any class C0 is a base class in an arbitrary hierarchy, wherein C1 is derived from C0, and C2 is, in turn, derived from C1. A root class is the 'least-derived' class in a hierarchy. RTTI is run-time type information, and is the abbreviation I'll apply when discussing how to keep a record of class names and derivation information at runtime.
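The naming convention can be written out as a few lines of C++. This is only an illustrative sketch (the empty class bodies and the C++11 static_assert checks are assumptions, not ESS code):

```cpp
#include <type_traits>

// C0 is the root ('least-derived') class of the hierarchy
class C0 { public: virtual ~C0() {} };
// C1 is derived from C0
class C1 : public C0 {};
// C2 is, in turn, derived from C1
class C2 : public C1 {};

// compile-time confirmation of the relationships
static_assert(std::is_base_of<C0, C1>::value, "C1 derives from C0");
static_assert(std::is_base_of<C1, C2>::value, "C2 derives from C1");
static_assert(std::is_base_of<C0, C2>::value, "C0 is the root of C2");
```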

A note on macros and templates

In a nutshell, both are generally very useful if applied with taste and discretion. I greatly prefer code that can be correctly debugged, which sets an easy test that all macros must pass: if you step into a macro in a debugger, do you see code or text?

The ESS_REGISTER, ESS_RTTI, and ESS_STREAM macros all use the stringizing operator (#) to generate strings from class and instance names. This is, I believe, a good thing as it reduces scope for error. ESS_RTTI also declares the templated factory class responsible for creating new instances on the heap as a friend, so it has access to protected/private constructors and destructors. This makes it much simpler to apply ESS to existing code. I would count these as definite benefits.

Where executable code is contained in a macro, it is always forwarded to a templated inline function so you can debug things properly. For example:

// the trivial
#define ESS_ROOT(rootname) typedef ess::root<rootname> ess_root;

// forwarded to a template function
#define ESS_STREAM(stream_adapter,class_member)        \
    ess::stream(stream_adapter,class_member,#class_member)

// and the slightly hairier mix ...
#define ESS_RTTI(classname,rootname)\
friend ess::CFactory<classname,rootname>; \
virtual const char* get_name()\
{ return ess::get_name_impl<classname>(#classname); }\
static ess::class_registry<classname>* get_registry()\
{ return ess::get_registry_root<classname,rootname>(#rootname); }

ESS_RTTI is as complex as the ESS macros get.
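The forwarding idea can be tried in isolation. The following is a minimal sketch, not ESS code: the demo namespace, the DEMO_STREAM macro and the point struct are hypothetical names invented for the illustration. The macro contributes nothing but the stringized member name; everything a debugger steps into lives in the template function.

```cpp
#include <sstream>
#include <string>

namespace demo {

// debuggable worker: appends "name=value;" to the output stream
template <typename T>
inline void stream(std::ostream& os, const T& value, const char* name)
{
    os << name << '=' << value << ';';
}

} // namespace demo

// the macro only supplies the member name as a string and forwards
#define DEMO_STREAM(os, member) demo::stream(os, member, #member)

struct point
{
    int x;
    int y;
    // builds a textual description using the member names themselves
    std::string describe() const
    {
        std::ostringstream os;
        DEMO_STREAM(os, x);
        DEMO_STREAM(os, y);
        return os.str();
    }
};
```

Renaming a member automatically renames the stored label, which is the scope-for-error reduction mentioned above.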

One more philosophical declaration: templates are wonderful, but template meta-programming is not. Why? Template meta-programming fails the debugger test.

A minimal ESS example

Let us start with a simple example. C0 is the root class we wish to serialize; what follows is an ultra-simple inline implementation.

// primary header file
#include "ess_stream.h"
// use this header for XML storage
#include "ess_xml.h"
class C0
{
    // so we can differentiate
    short m_id;
    // vector of pointers to C0
    std::vector<C0*> m_children;
    // here is the serialization function -
    // it is symmetric working for both reading and writing
    virtual void serialize(ess::archive_adapter& adapter)
    {
        ESS_STREAM(adapter,m_id);
        ESS_STREAM(adapter,m_children);
    }
public:
    // specify the inheritance root
    ESS_ROOT(C0)
    // set up RTTI
    ESS_RTTI(C0,C0)
};

class C1: public C0
{
    // for illustration - real class
    // would probably have more code
public:
    ESS_RTTI(C1,C0)
};

and here is the code to perform serialization in both directions:

int version = 1;
std::string xml_root = "root";
// name of the first stored instance, used again when loading
std::string instance_name = "c0";
// always use try/catch blocks as any problem is reported by throwing
try
{
    // register the classes
    ess::registry_manager registry;
    // macro'ised variety for brevity and no spelling errors
    registry << ESS_REGISTER(C0,C0) << ESS_REGISTER(C1,C0);

    // where data is stored ...
    ess::xml_medium storage;
    {
        // instance to serialize
        C0 c0;
        C1 c1;
        // this version hides an XML parser...
        ess::xml_storing_adapter adapter(storage,xml_root,version);
        // store root C0
        ess::stream(adapter,c0,"c0");
        // store derived C1
        ess::stream(adapter,c1,"c1");
    }
    // deserialize to p0
    {
        //
        C0* p0 = 0;
        // restore from XML storage
        Chordia::xml_source xmls(storage.c_str(),storage.size());
        // and an adaptor
        ess::xml_loading_adapter adapter(xmls,xml_root,version);
        // stream into C0 pointer ...
        ess::stream(adapter,p0,instance_name);
        // p0 is now ready to use ...
        // we own this
        delete p0;
    }
}
catch(...)
{
    // report or handle the failure here
}

The XML generated in example() is this:

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<root version="1">
<class derived_type="C0" name="c0">
    <signed_short name="m_id" value="1"/>
    <vector name="m_children" count="0">
    </vector>
</class>
<class derived_type="C1" name="c1">
    <signed_short name="m_id" value="1"/>
    <vector name="m_children" count="0">
    </vector>
</class>
</root>

In detail

RTTI
Registration
Adaptors
Error handling
Unit Tests

Let us get down to the details. Persisting basic C++ classes is not too difficult - MFC has had a mechanism to do just this since the beginning of time. We start out in a similar way; the basis of the system is serialization by decomposition - classes are reduced to atomic elements, which then are written and read at runtime. Reading and writing uses a symmetric serialize() function which reduces programming requirements and possible errors. The really tricky bits come from the following:

No common base class
Correctly reconstructing pointers to polymorphic class instances
Keeping it compiler friendly
Keeping it programmer friendly

RTTI and friends

Hypothesis 1: If we want to restore polymorphic types correctly, then:

we have to be able to distinguish between derived types in some way, and
we have to be able to do this at runtime.

Cutting to the chase, the simplest way to do this is to outfit each ESS compliant class with a virtual get_name() function. We can then do this:

// identifying instances at runtime
std::vector<C0*> vec;
vec.push_back(new C0);    // base class
vec.push_back(new C1);    // derived class
std::string n0 = vec[0]->get_name(); // gives us "C0"
std::string n1 = vec[1]->get_name(); // gives us "C1"

Hypothesis 1 implies that we need to be able to create (in a type-safe way) arbitrary instances of different types given only a string. We also have an added wrinkle which we show here by introducing an unrelated hierarchy prefixed with 'D':

// Example 1.1
// not C++ ?
C0* pc0 = hey_presto("C0");
C1* pc1 = hey_presto("C1");
D0* pd0 = hey_presto("D0");
D1* pd1 = hey_presto("D1");

Well, we can achieve something close to example 1.1 by having a static function in each base class C0 and D0, such that:

C0* pc0 = C0::hey_presto("C0");
D0* pd0 = D0::hey_presto("D0");

Enter templates. We can escape the burden of having to qualify the type name by having a templated version of hey_presto():

template <typename T>
inline
T*
hey_presto(const std::string& classname)
{
    // find the classname in something
    return new T_Or_Derivative_Of_T;
}

// i.e.
C0* pc0 = hey_presto<C0>("C0");
D0* pd0 = hey_presto<D0>("D0");

In reality, the solution is somewhat more complicated. In order to be type safe, flexible, and efficient, we apply the generic solution of indirection. We'll submit to the temptation to add one more macro - it has been unveiled already, but let's take a closer look:

// simplify for the sake of example by removing the
// ess:: namespace qualifier
static class_registry<classname>* get_registry()
{
    return get_registry_root<classname,rootname>(#rootname);
}

// thus the macro invocation ESS_RTTI(C0,C0) becomes:
class C0
{
    // static function returns a templated type
    static class_registry<C0>* get_registry()
    {
        return get_registry_root<C0,C0>("C0");
    }
};

If we continue to follow the calling trail, we get the following in quick succession:

//-----------------------------------------------------------------------------
// Snippet 1:
// Simplified get_registry_root
template <typename Derived,typename Root>
inline
class_registry<Derived>* get_registry_root(const char* rootname)
{
    return
        reinterpret_cast<class_registry<Derived>*>
            (get_registry_impl<Root>(rootname));
}

//-----------------------------------------------------------------------------
// Snippet 2:
// templated inline function that is called by the ESS_ROOT macro implementation
template <typename Root>
inline
class_registry<Root>* get_registry_impl(const char* rootname)
{
    // when this function is called the registry for the
    // hierarchy based on Root is created and will last
    // for the duration of the program run.
    static ess::class_registry<Root> s_registry(rootname);
    return &s_registry;
}

//-----------------------------------------------------------------------------
// Snippet 3:
// Finally we reach the ground floor! Details elided here.
template <typename Root>
class class_registry
{
    public:
    // register a factory capable of creating a Root thing
    bool Register(const char* classname,IFactory<Root>* pFactory) {}
    // point of creation for Root derived instances
    Root* Create(const std::string& classname) {}
};

In other words, the code in the snippets above equips each root class with a static, templated, class_registry instance. As you can probably guess from the member function names, C0::get_registry()->Create("C0") will indeed return us a new C0 instance. We will deal with the Register() member function in more detail in the next section - suffice it to say that we are getting very close to the hey_presto() function we wanted before.
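To make the elided details concrete, here is a minimal sketch of how such a registry could be filled in. It reuses the names class_registry, IFactory, CFactory and get_registry_impl from the snippets, but the bodies, the std::map storage, the std::runtime_error on an unknown name, and the demo_registry() helper are all assumptions made for illustration, not the actual ESS implementation.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// abstract factory able to create something derived from Root
template <typename Root>
struct IFactory
{
    virtual ~IFactory() {}
    virtual Root* Create() const = 0;
};

// concrete factory creating a Derived, returned as a Root*
template <typename Derived, typename Root>
struct CFactory : IFactory<Root>
{
    Root* Create() const { return new Derived; }
};

template <typename Root>
class class_registry
{
    std::map<std::string, IFactory<Root>*> m_factories;
public:
    // returns false if the name is already present
    bool Register(const char* classname, IFactory<Root>* pFactory)
    {
        return m_factories.insert(
            std::make_pair(std::string(classname), pFactory)).second;
    }
    // creates an instance by name, or throws for an unknown name
    Root* Create(const std::string& classname) const
    {
        typename std::map<std::string, IFactory<Root>*>::const_iterator it =
            m_factories.find(classname);
        if (it == m_factories.end())
            throw std::runtime_error("unregistered class: " + classname);
        return it->second->Create();
    }
};

// one registry per Root, created on first use
template <typename Root>
inline class_registry<Root>* get_registry_impl()
{
    static class_registry<Root> s_registry;
    return &s_registry;
}

struct C0 { virtual ~C0() {} virtual const char* get_name() { return "C0"; } };
struct C1 : C0 { virtual const char* get_name() { return "C1"; } };

// hypothetical helper: registers both classes and returns the shared
// registry; the factories are function-local statics, so nothing leaks
inline class_registry<C0>* demo_registry()
{
    static CFactory<C0, C0> f0;
    static CFactory<C1, C0> f1;
    class_registry<C0>* r = get_registry_impl<C0>();
    r->Register("C0", &f0);
    r->Register("C1", &f1);
    return r;
}
```

With this in place, demo_registry()->Create("C1") hands back a C0* whose get_name() reports "C1".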

As ever with C++, the devil is in the detail. get_registry_impl() in snippet 2 above returns a pointer to a static class instance. That means:

there is only ever going to be one instance of class_registry
class_registry will only be created when get_registry_impl() is called
class_registry is accessible to all classes derived from Root

This, in turn, implies that the following is possible:

// this gives us a C0
C0* p0 = C0::get_registry()->Create("C0");
// this gives us a derived C1 but only accessible via root type
C0* p1 = C0::get_registry()->Create("C1");

Although this is pretty much sufficient for the task at hand, it does make (eventually) for awkward code. To really polish things off nicely, we want to be able to do this:

// yep - as expected
C0* p0 = C0::get_registry()->Create("C0");
// hey presto!
C1* p1 = C1::get_registry()->Create("C1");

Although this looks simple enough, recall that C++ static functions are not virtual, so a derived class cannot override a static function inherited from its base. It might even seem that you cannot have static functions with the same name in two different though related classes. Or, can you? Let's look again:

// yep - as expected
class_registry<C0>* rc0 = C0::get_registry();
C0* p0 = rc0->Create("C0");
// hey presto!
class_registry<C1>* rc1 = C1::get_registry();
C1* p1 = rc1->Create("C1");

The code here obscures the fact that although the static functions have the same name, each class declares its own version, which hides the one inherited from its base and returns a different class_registry type. This actually is a bit of a hey-presto moment because we can now write a single templated inline function which creates any arbitrary instance from a string:

template<typename Type>
inline
Type*
instance_from_name(const std::string& classname)
{
    // since get registry is a static with a different signature
    // at each level of inheritance we can overload the function name
    ess::class_registry<Type>* p = Type::get_registry();
    // creates correct derived type or throws ...
    return p->Create(classname);
}

Now, we have a single function which works in both cases - note that the template argument type is different in (3).

// 1. get root from root
C0* p0 = instance_from_name<C0>("C0");
// 2. get derived via root - fine for reloading containers
C0* p1a = instance_from_name<C0>("C1");
// 3. but now we can get derived from derived too so
// we can access member functions of C1 directly
C1* p1b = instance_from_name<C1>("C1");
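The same-name static trick can be exercised on its own. The sketch below is an assumption-laden stand-in rather than ESS code: it swaps the reinterpret_cast of snippet 1 for a small typed view (the hypothetical typed_registry), stores plain function pointers instead of factory objects, and uses hypothetical helpers creators(), create_as_root() and register_demo_classes().

```cpp
#include <map>
#include <stdexcept>
#include <string>

// shared per-root table of creator functions, created on first use
template <typename Root>
inline std::map<std::string, Root* (*)()>& creators()
{
    static std::map<std::string, Root* (*)()> s_creators;
    return s_creators;
}

// creator used for registration: builds a Derived, returns it as Root*
template <typename Derived, typename Root>
inline Root* create_as_root() { return new Derived; }

// typed view over the shared table; Create() returns Type*
template <typename Type, typename Root>
struct typed_registry
{
    Type* Create(const std::string& classname) const
    {
        typename std::map<std::string, Root* (*)()>::const_iterator it =
            creators<Root>().find(classname);
        if (it == creators<Root>().end())
            throw std::runtime_error("unregistered class: " + classname);
        // unchecked downcast: the caller must name Type or a class derived from it
        return static_cast<Type*>(it->second());
    }
};

struct C0
{
    virtual ~C0() {}
    virtual const char* get_name() { return "C0"; }
    static typed_registry<C0, C0> get_registry() { return typed_registry<C0, C0>(); }
};

struct C1 : C0
{
    virtual const char* get_name() { return "C1"; }
    int only_in_c1() const { return 1; }
    // same name as C0::get_registry(); hides it and returns a different type
    static typed_registry<C1, C0> get_registry() { return typed_registry<C1, C0>(); }
};

// one function template serves every level of the hierarchy
template <typename Type>
inline Type* instance_from_name(const std::string& classname)
{
    return Type::get_registry().Create(classname);
}

inline void register_demo_classes()
{
    creators<C0>()["C0"] = &create_as_root<C0, C0>;
    creators<C0>()["C1"] = &create_as_root<C1, C0>;
}
```

After register_demo_classes(), instance_from_name<C1>("C1") yields a C1* on which only_in_c1() can be called directly. The downcast is unchecked, so this sketch shares the weakness of naming a class that is not Type or one of its descendants.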

Now, to try and wrap up this somewhat involved section, we will follow the compiler when we actually stream stuff back from storage. Here is the relevant inline function in ess_stream.h; it is a templated function with a signature that matches pointers to types:

template<typename Type>
inline
void
stream(archive_adapter& adapter,Type*& pointer,const std::string& name)
{
    std::string derived_type = get_class_name(adapter);
    // simplified
    pointer = instance_from_name<Type>(derived_type);
    // deserialize the instance
    pointer->serialize(adapter);
}

// example usage
C0* p0 = 0;
ess::stream(...,p0,...);
C1* p1 = 0;
ess::stream(...,p1,...);

Now, the remarkable thing about this code is that it all returns the same thing: the static registry instance that was declared way back in the root, C0. The templating means that there are multiple ways in which the compiler can establish a type-safe way in which to access the registry, in turn enabling the creation of arbitrary type instances. Note, however, that this convenience comes at a cost. It is now theoretically possible to instantiate partially finished classes! For example, asking for a C1 while naming "C0" creates a C0 which is then treated as a C1. Without using compiler generated RTTI, it is impossible (I believe) to guard against this error at compile time. Indeed, it is very hard to guard against it at run time. Any thoughts on this would be welcomed.

// pathological
C1* p1 = instance_from_name<C1>("C0");

Registration redux

The purpose of registration is to ensure that each class registry is called into existence before any construction is attempted. A desirable side effect of the implementation is that it is quite hard to get this wrong - after all, any code that serializes a class will end up accessing the registry. However, imagine you have opened your brand new XML enabled persistent object application and selected File > Open - the runtime will start to throw as it attempts to instantiate classes that have not yet been mapped into the system. I also believe that explicit registration is a good idea as it makes it easy to discover where the persistent process starts and thus simplifies debugging or problem diagnosis. Registration itself is trivial, and need only be done once.

// use the long-hand
ess::registry_manager registry;
    registry
        << ess::class_registrar<C0,C0>("C0")
        << ess::class_registrar<C1,C0>("C1")
        << ess::class_registrar<C2,C0>("C2");
// macro short-hand
ess::registry_manager registry;
    registry
        << ESS_REGISTER(C0,C0)
        << ESS_REGISTER(C1,C0)
        << ESS_REGISTER(C2,C0);

Note too that the registry object does not have to be kept around. Registration actually does three things:

forces the static class_registry instance into existence
creates a templated factory class to create the type in question
inserts the factory class instance into the registry under the classname key

Multiple registration is not an error, unless you try and register a class name with a different factory - this, to me at least, implies some possible programming error. The system will throw - which is another reason to keep registration in one place in the code. Although I have not specifically tried it (my coding convictions are set against it), this system should work for classes that are exported from dynamic link libraries.
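The registration rules just described (repeats are harmless, a conflicting factory throws, the manager object itself is disposable) can be sketched as follows. This is a simplified stand-in, not ESS code: it is fixed to a single root C0, uses a plain function pointer in place of a factory object, and class_table(), create() and DEMO_REGISTER are hypothetical names. The exception type std::logic_error is likewise an assumption.

```cpp
#include <map>
#include <stdexcept>
#include <string>

struct C0 { virtual ~C0() {} };
struct C1 : C0 {};

// simplified stand-in for a factory object
typedef C0* (*creator)();

template <typename Derived>
inline C0* create() { return new Derived; }

// the static registry, forced into existence on first call
inline std::map<std::string, creator>& class_table()
{
    static std::map<std::string, creator> s_table;
    return s_table;
}

// pairs a class name with the creator for that class
struct class_registrar
{
    std::string name;
    creator make;
    class_registrar(const std::string& n, creator m) : name(n), make(m) {}
};

struct registry_manager
{
    // chaining insert: repeats are accepted, conflicts throw
    registry_manager& operator<<(const class_registrar& r)
    {
        std::map<std::string, creator>::iterator it = class_table().find(r.name);
        if (it == class_table().end())
            class_table()[r.name] = r.make;
        else if (it->second != r.make)
            throw std::logic_error("conflicting factory for " + r.name);
        return *this;
    }
};

// stringizing keeps the name and the type in step
#define DEMO_REGISTER(classname) class_registrar(#classname, &create<classname>)
```

Because the table is a function-local static, a registry_manager can be created, used and destroyed in one scope while the registrations remain available for the rest of the run.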

There is another variant of registry_manager which I have found useful when working with the Diagram Editor posted on CodeProject some years back. I have got a heavily modified version that uses ESS for undo/redo and saving as XML.

// typed registry with CDiagramEntity as root
ess::typed_registry_manager<CDiagramEntity> registry;
// register the 3 classes of interest
registry
    << ESS_REGISTER(CEditor,CDiagramEntity)
    << ESS_REGISTER(CListBox,CDiagramEntity)
    << ESS_REGISTER(CStatic,CDiagramEntity);

// typed_registry_manager exposes instance_from_name()

Adaptors

The purpose of the archive_adapter is to make it easy to store in new formats. An adapter class that loads from (say) JSON or some proprietary binary format only needs to implement the overloaded read() functions in the archive_adapter class. The same is true for an adaptor that writes. The source code shows a completely different take on adaptors - have a look at the binary_debug_adapter class in ess_binary.h. What this does is to dump a binary archive to a text file as it is streamed, in real time; useful if you want to understand binary storage in more detail.

XML is the favoured storage format - I felt it desirable that the XML generated contained enough information to allow desk checking and debugging, and the ubiquity of the format means exchange and interoperability is simple. As well as storing the contents of class members, we want to store their names. For each of the intrinsic types, along with the supported container types, we have a number of inline free functions called stream which have a specific type signature, which all do the same thing - take the arg and name parameters and then:

If the archive_adapter is storing, write the argument data and its name to the underlying storage
If the archive_adapter is loading, then read the value of the named item back into the arg parameter

namespace ess
{
// for each intrinsic
inline void
    stream(archive_adapter& adapter,bool& arg,const std::string& name)    {...}
// ... more free functions as above
inline void
    stream(archive_adapter& adapter,GUID& arg,const std::string& name)    {...}
// now we have a generic templating. The following is for references
template<typename Type> inline void
    stream(archive_adapter& adapter,Type& arg,const std::string& name) {...}
// and one for pointer types
template<typename Type> inline void
    stream(archive_adapter& adapter,Type*& arg,const std::string& name)    {...}
// and overloads for std::vector
template<class Type> inline void
    stream(archive_adapter& adapter,std::vector<Type>& arg,const std::string& name)    {...}
// and for std::map
template<typename Key,typename Value> inline void
    stream(archive_adapter& adapter,std::map<Key,Value>& arg,const std::string& name) {...}
}
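The symmetric store/load behaviour of the stream functions can be illustrated with a toy text format. Everything below is a sketch under stated assumptions: the storing() query, the string-based write()/read() interface, the text_storing_adapter and text_loading_adapter classes and the contact struct are hypothetical and much simpler than the overloaded read() functions of the real archive_adapter. Values containing newlines are not handled.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// base adapter: one interface for both directions
class archive_adapter
{
public:
    virtual ~archive_adapter() {}
    virtual bool storing() const = 0;
    virtual void write(const std::string& name, const std::string& value) = 0;
    virtual std::string read(const std::string& name) = 0;
};

// writes "name=value" lines into a string
class text_storing_adapter : public archive_adapter
{
    std::ostringstream m_out;
public:
    bool storing() const { return true; }
    void write(const std::string& name, const std::string& value)
    { m_out << name << '=' << value << '\n'; }
    std::string read(const std::string&)
    { throw std::logic_error("storing adapter cannot read"); }
    std::string str() const { return m_out.str(); }
};

// reads the lines back, in order, checking each name
class text_loading_adapter : public archive_adapter
{
    std::istringstream m_in;
public:
    explicit text_loading_adapter(const std::string& data) : m_in(data) {}
    bool storing() const { return false; }
    void write(const std::string&, const std::string&)
    { throw std::logic_error("loading adapter cannot write"); }
    std::string read(const std::string& name)
    {
        std::string line;
        if (!std::getline(m_in, line) ||
            line.compare(0, name.size() + 1, name + "=") != 0)
            throw std::runtime_error("expected item named " + name);
        return line.substr(name.size() + 1);
    }
};

// symmetric stream functions, one overload per supported type
inline void stream(archive_adapter& adapter, int& arg, const std::string& name)
{
    if (adapter.storing()) { std::ostringstream os; os << arg; adapter.write(name, os.str()); }
    else { std::istringstream is(adapter.read(name)); is >> arg; }
}

inline void stream(archive_adapter& adapter, std::string& arg, const std::string& name)
{
    if (adapter.storing()) adapter.write(name, arg);
    else arg = adapter.read(name);
}

struct contact
{
    int id;
    std::string name;
    // one function serves both reading and writing
    void serialize(archive_adapter& adapter)
    {
        stream(adapter, id, "id");
        stream(adapter, name, "name");
    }
};
```

Supporting a new format in this sketch means writing another pair of adapter classes; contact::serialize() does not change.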

Error detection

Although ESS makes it easy to add runtime persistence to C++ classes, we do not want to sacrifice any of the compile time checking the language affords. Indeed, wherever possible, we want to warn the programmer if the shotgun pointed at the foot is about to fire. Consider the following:

ESS_ROOT(C0)
ESS_RTTI(C0,C0)
ESS_RTTI(C1,C0)
ESS_RTTI(C2,C1) // <- wrong...

The last line is wrong, though easily done: C1 is not the root class. How can we detect this at compile time? With some difficulty! There is something called a compile_time_checker in the ess_rtti header file. It exists solely to ensure that a class that has been declared as an ESS_ROOT is always used as the root in the ESS_RTTI macros. In other words, it will fail to compile if the sort of error shown above is made.

// details elided
template <typename Derived,typename Root>
struct compile_time_checker;
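The body of the checker is not shown here, so the following is only a guess at how such a check could be written with a C++11 compiler (the ESS code itself targets older compilers and may work quite differently). The root_tag template stands in for ess::root, and names_declared_root is a hypothetical helper trait.

```cpp
#include <type_traits>

// stand-in for the type named by the ESS_ROOT typedef
template <typename T> struct root_tag {};

struct C0 { typedef root_tag<C0> ess_root; };   // what ESS_ROOT(C0) would add
struct C1 : C0 {};
struct C2 : C1 {};

// true only if Derived inherits from Root and Root is the declared root
template <typename Derived, typename Root>
struct names_declared_root
    : std::integral_constant<bool,
          std::is_base_of<Root, Derived>::value &&
          std::is_same<typename Derived::ess_root, root_tag<Root> >::value>
{
};

// instantiating the checker triggers the diagnostic
template <typename Derived, typename Root>
struct compile_time_checker
{
    static_assert(names_declared_root<Derived, Root>::value,
                  "second argument must be the class declared with ESS_ROOT");
};

// correct usage compiles
template struct compile_time_checker<C2, C0>;
// the mistake shown above would be rejected by the compiler:
// template struct compile_time_checker<C2, C1>;
```

The typedef is inherited by every derived class, so the comparison always runs against the one root that was actually declared.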

The following categories of error do not require support code - as they end up being syntax errors or (much more likely) irresolvable, unrelated type errors.

Attempting to serialize an unsupported type
Adding persistence to a type with non-void constructors
Attempting to serialize a class or structure which does not implement get_name()
Attempting to serialize a class or structure which does not implement get_registry()
Attempting to serialize a class or structure which does not implement serialize() somewhere in the hierarchy

Non-existent derivations, for example:

class CD3 : public CD2 { ESS_RTTI(CD3,CDX) };

The upshot is that the foreseeable runtime errors are:

Attempting to serialize an unregistered type - i.e., loading a class which is unknown to the compiler. This will fail as the class_registry instance will throw an exception.
Attempting to de-serialize an instance whose layout has changed in some way. The way in which this mode fails is important - if the layout is 'out-of-order', then the runtime should detect this and throw an exception.

The Unit Tests

These are pretty straightforward and are all contained in the ess_main.cpp source file. The idea is to assemble a test case which will verify (or not) the key implementation expectations. Code obviously has to meet a basic set of requirements in order to compile - and as many errors as can be checked at compile time are checked. However, there is a set of conditions that can only be tested at runtime. The most basic tests are:

Will a persistent class store itself?
Will the data generated by persisting a class suffice to create a new instance?
If the new instance is itself serialized, will the resulting storage be equal to the initial storage (i.e., from the first test)?
Can the runtime support detecting programming errors such as incorrect derivation?
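The first three tests amount to a round-trip check: store, reload into a fresh instance, store again, and compare the two stored forms. The sketch below shows the shape of such a test on a toy class. It does not use ESS at all: the settings struct, its store()/load() members and round_trip_is_stable() are hypothetical names, and the space-separated text format is an assumption.

```cpp
#include <sstream>
#include <string>

struct settings
{
    int width;
    int height;
    // store as "width height"
    std::string store() const
    {
        std::ostringstream os;
        os << width << ' ' << height;
        return os.str();
    }
    // restore from the format written by store(); false on malformed input
    bool load(const std::string& data)
    {
        std::istringstream is(data);
        return !(is >> width >> height).fail();
    }
};

// store, reload into a fresh instance, store again, compare
inline bool round_trip_is_stable(const settings& original)
{
    const std::string first = original.store();
    settings copy = {0, 0};
    if (!copy.load(first))
        return false;
    return copy.store() == first;
}
```

Comparing stored forms rather than members means the test needs no operator== on the persistent class.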

Fitting ESS: The 42 line guide

Here are the steps required to make any class ESS compliant. All code is contained within the ess namespace, hence the ess:: qualifier everywhere.

// main ESS include file - will pull in ess_rtti.h
#include "ess_stream.h"
// For XML storage.
#include "ess_xml.h"
// or for binary storage
#include "ess_binary.h"

// the base class of any persistent hierarchy uses the ESS_ROOT macro
class persistent_base
{
    // example persistent member
    some_type class_member;
    public:
    // use ESS_ROOT in the 'least-derived' class
    ESS_ROOT(persistent_base)
    ESS_RTTI(persistent_base,persistent_base)
    // serialization function - virtual
    virtual void serialize(ess::archive_adapter& adapter)
    {
        //
        ESS_STREAM(adapter,class_member);
    }
};

// any subsequent descendents use the ESS_RTTI macro
class persistent_derived : public persistent_base
{
    some_type class_member;
    public:
    // note the ESS_RTTI arguments - name of this class
    // then the name of the *root* class
    ESS_RTTI(persistent_derived,persistent_base)
    // ensure serialize in the base class is called ...
    virtual void serialize(ess::archive_adapter& adapter)
    {
        // stream members of this class
        ESS_STREAM(adapter,class_member);
        // stream members of the base class
        persistent_base::serialize(adapter);
    }
};

That is it. All implementation is in-line, and with the exception of CoCreateGuid(), runtime support only uses constructs in the std:: namespace, namely std::string, std::map, and std::vector. If you want more detail on how to extend ESS to marshal your own types, then see the ess_class.h file in the ESS_GUI project. It shows how to persist COleDateTime.

The source code

Both the VS2003 and VS2008 archives contain two folders:

./codeproject/ess_code/ess_0X
./include/...

Ensure that the paths are created when unzipping as this will mean the projects should build out of the box, without any need to set new #include paths and the like. Please let me know if anyone finds this not to be the case. The MFC GUI project should unzip in exactly the same manner.

Items for the future

I have intentionally avoided the following issues in the current implementation.

Support for less frequently used containers, std::list and std::stack for example. My code rarely uses these classes - support is easy to add.
Smart Pointers and friends - these only recently arrived in the TR update for VS2008, and are not found in the standard C++ libraries yet. I was unwilling to roll my own.
Endian issues in the binary storage system - right now, it is all Intel ordering. It would be nice to use network ordering for binary storage.

Other points arising

There is a degree of annoying anti-symmetry in the read and write versions of the XML storage. I'd like to smooth that off.
Efficiency. The upper levels of the XML reader/writer could probably do with some streamlining.

Conclusion

That pretty much wraps it up. I hope I have conclusively demonstrated that type-safe, standards-compliant, and portable persistent C++ code can be created with the minimum of programming effort. A comparison with C# code is interesting. Forsaking the less efficient but automated persistence afforded by Reflection, manually specifying members for serialization using the XML tag has a similar level of in-code overhead to ESS. Any constructive discussion is most welcome, and I would love to get feedback on improvements and beautification.

I have not had the opportunity to try compiling with Visual C++ 6.0 as it is no longer in use here. I suspect it lacks the templating machinery required for ESS to work. I would love to be proved wrong on this. Also, despite best efforts, I could not persuade the Cygwin bundled GCC to find the right standard header files. My patience with command line tools that don't play nice is limited these days, so the test was binned.

Credits and References

The XML parser is a stripped down version of the parser contained in DLib. This library contains some interesting things, including one variety of serialization - thanks to the team.
The BOOST library offers extensive, heavy-weight, C++ serialisation support.
Templates (1) Modern C++ Design, Andrei Alexandrescu, Addison-Wesley 2002: Amazon.
Templates (2) C++ Templates, D. Vandevoorde, N. M. Josuttis, Addison-Wesley 2003: Amazon UK.
C++ object databases were all the rage in the early 90's, and they all had to solve the marshalling problem. See Gigabase, POET/Versant, and Objectivity to scope contemporary FOSS and commercial offerings.
Thanks to Johan for the UML editor.
Thanks to Michal Mecinski for the original tree/grid (www.mimec.org).
I have ruthlessly extended this control to handle arbitrary column counts, multiple-selection, cell addressing, cell colouring, item data, and checkbox support. Any bugs are my own. See view.cpp and ColumnTreeWnd.h for details.
The canonical URL for ESS updates will be NovaDSP.com.

Footnote: I am in the job market. Please get in touch if you have any interesting opportunities. Thanks.

History

Version 1.01 - 14 March 2009
Version 1.02 - 17 March 2009