XML Serialization of Complex .NET Objects

Antoniu-Gabriel Rozsa

Oct 20, 2008

CPOL

Yet another custom XML serializer, with a slightly different approach.

Introduction

This article deals with the topic of serializing objects to XML in a C# 2.0 context. This subject is widely discussed on CodeProject; several articles present various solutions while debating ease of use, advantages, disadvantages, and speed. The purpose of my CustomXmlSerializer solution is to present an alternative approach to XML serialization, one that has not been presented yet; at least, I didn't find an article dealing with this approach.

My article is based on Marcus Deecke's article about deep XML serialization; I strongly recommend reading this article (which, by the way, is very good) before reading mine. I was inspired by Marcus's ideas, and I think his explanations are also pretty good.

Background

The need for serialization arose in one of the projects I am working on. I was assigned the task of implementing a feature that would allow users to save their data to files and load it back. That's a pretty common feature, one would assume, and I would generally agree if the application wasn't already saving data into a database. All data that needed to be saved into individual files was internally stored in objects. The objects were created with a database-based context in mind, which made my task a little more difficult.

Naturally, I decided to implement a solution that is able to serialize / deserialize these objects. I started by searching the net for a solution, having in mind the following particularities about my objects:

1. All of the data contained in the objects must be saved; so clearly, I need deep serialization of my objects.
2. The data, once saved, must be loadable with any future versions of the software.
3. My objects are quite complex. They interact with each other in many different ways, and they hold references among themselves that must be correctly serialized and deserialized.
4. The objects contain a lot of properties. Many of these properties have code inside their accessors, code that is necessary for the implementation of the business logic of the application. Saving the values of all these properties would be a waste of CPU time (it would take time to evaluate all the properties and execute the logic behind them) and disk space (all the properties' values would have to be serialized). I want to serialize my objects in a way that would allow the creation of a copy of the original object after deserialization.
5. My objects must be modified as little as possible, as some of them are complex enough already. Forcing the developers who worked on the code of these objects to make changes that would accommodate serialization is not acceptable.
6. The output must be as compact as possible.

Of the above requirements, no. 2 deserves a little more attention. I had to choose a serialization solution able to accommodate virtually any change in the codebase of my project. This implies that the serialized objects may, and probably will, change over time as new versions roll out. Unfortunately, the changes can be pretty profound:

New properties / fields can be added.
Properties / fields can be deleted, or renamed.
Namespaces may change, or code may get moved from one assembly to another.
Types of properties / fields may change.

And still, the new version of the software must be able to successfully deserialize the data.

The Solution

While trying to accommodate all the requirements listed above, I had to eliminate all of .NET's built-in serialization methods, since they all had limitations, and some of them would have required extensive code modifications to accommodate the future versions requirement (no. 2). My attention turned to XML serialization. I chose XML over binary serialization because of XML's universal and descriptive nature. XML data is much easier to manage and debug when radical code changes occur, compared to a custom binary format that in time will surely prove to be incomplete and will require a lot of conversion and adaptation code. The one downside of XML is its verbosity. Fortunately, text files compress very well, so the size of the saved files can be easily reduced by archiving.
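The remark about compression can be tried out with the framework's own GZipStream. The following is only a sketch and is not part of CustomXmlSerializer; XmlCompressor and its two methods are hypothetical names introduced for illustration.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper, not part of CustomXmlSerializer.
public static class XmlCompressor
{
    // Encodes the XML text as UTF-8 and gzips it into a byte array.
    public static byte[] Compress(string xml)
    {
        byte[] raw = Encoding.UTF8.GetBytes(xml);
        using (MemoryStream buffer = new MemoryStream())
        {
            // The GZipStream must be closed before the buffer is read,
            // otherwise the gzip trailer is missing.
            using (GZipStream gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return buffer.ToArray();
        }
    }

    // Reverses Compress(): gunzips the bytes and decodes them as UTF-8 text.
    public static string Decompress(byte[] data)
    {
        using (MemoryStream input = new MemoryStream(data))
        using (GZipStream gzip = new GZipStream(input, CompressionMode.Decompress))
        using (StreamReader reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Since serialized output tends to repeat the same node and attribute names many times, it is the kind of text that usually compresses well; the actual ratio depends on the data.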

So, what I needed was an XML serializer able to serialize / deserialize almost anything I throw at it. One of the viable solutions turned up on CodeProject, and it was the code in Marcus Deecke's article. Inspired by his ideas, I developed CustomXmlSerializer from scratch. I am not a big fan of reinventing the wheel, so I would have used his code as a starting point had it not been for my need to deal with these rather complex objects that called for a slightly different approach. So, I decided to create a brand new serializer.

Approach to Serialization

An object's state is fully determined by the values of its fields. The most efficient way to save or restore an object's state is to act on its fields. Accessing an object's state through any other means (for example, its properties) implies an overhead. CustomXmlSerializer will serialize only the fields of a given object, regardless of their access modifiers.
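To illustrate the idea of reading fields directly, here is a minimal sketch based on Reflection. It is not taken from CustomXmlSerializer's source; FieldWalker, Snapshot and the Account class are hypothetical names, and the sketch only looks at the fields declared on the object's own type.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical helper, not part of CustomXmlSerializer.
public static class FieldWalker
{
    // Returns the instance fields declared directly on 'type',
    // whatever their access modifier.
    public static FieldInfo[] GetDeclaredInstanceFields(Type type)
    {
        const BindingFlags flags = BindingFlags.Instance | BindingFlags.Public |
                                   BindingFlags.NonPublic | BindingFlags.DeclaredOnly;
        return type.GetFields(flags);
    }

    // Captures field name -> current value for one object.
    public static Dictionary<string, object> Snapshot(object obj)
    {
        Dictionary<string, object> state = new Dictionary<string, object>();
        foreach (FieldInfo field in GetDeclaredInstanceFields(obj.GetType()))
        {
            state[field.Name] = field.GetValue(obj);
        }
        return state;
    }
}

// Demo type: two fields plus a property with logic in its accessor.
public class Account
{
    private int balance = 10;
    public string Owner = "demo";

    // Never evaluated by Snapshot(), because only fields are read.
    public int BalanceWithFee { get { return balance - 1; } }
}
```

Calling FieldWalker.Snapshot(new Account()) yields the two fields, balance and Owner, without ever running the BalanceWithFee accessor.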

Since the deserialization code cannot rely on the fact that the structure of the object has remained unchanged since the serialization took place, it is necessary to save this structure into the file as well.

The XML Format

The output of a serialized object might look like this:

<Test1 type="CommonLibrary.Test1" 
       assembly="CustomXmlSerializerTester, Version=1.0.0.0, 
                 Culture=neutral, PublicKeyToken=null" 
       version="1" culture="en-US" hasTypeCache="true">
  <TypeCache>
    <TypeInfo typeid="2" type="System.Int32" 
         assembly="mscorlib, Version=2.0.0.0, Culture=neutral, 
                   PublicKeyToken=b77a5c561934e089" />
    <TypeInfo typeid="3" type="System.String" 
         assembly="mscorlib, Version=2.0.0.0, Culture=neutral, 
                   PublicKeyToken=b77a5c561934e089" />
    <TypeInfo typeid="4" type="System.DateTime" 
         assembly="mscorlib, Version=2.0.0.0, Culture=neutral, 
                   PublicKeyToken=b77a5c561934e089" />
    <TypeInfo typeid="6" type="System.Double" 
         assembly="mscorlib, Version=2.0.0.0, Culture=neutral, 
                   PublicKeyToken=b77a5c561934e089" />
  </TypeCache>
  <tEnum type="CommonLibrary.TestEnum" 
         assembly="CustomXmlSerializerTester, Version=1.0.0.0, 
                   Culture=neutral, PublicKeyToken=null"
         value="2" />
  <A value="1" typeid="2" />
  <B typeid="2" value="2" />
  <Str value="this is a string" typeid="3" />
  <Dt1 value="10/20/2008 3:09:15 PM" typeid="4" />
  <Dt2 typeid="4" value="10/30/2008 12:00:00 AM" />
  <Arr type="System.Int32[]" 
        assembly="mscorlib, Version=2.0.0.0, Culture=neutral, 
                  PublicKeyToken=b77a5c561934e089">
    <Arr typeid="2" value="1" />
    <Arr typeid="2" value="2" />
    <Arr typeid="2" value="3" />
  </Arr>
  <privDbl value="5.6" typeid="6" />
  <base1 type="CommonLibrary.Base1" 
        assembly="CustomXmlSerializerTester, Version=1.0.0.0, 
                  Culture=neutral, PublicKeyToken=null" id="2">
    <baseInt typeid="2" value="99" />
    <baseStr typeid="3" value="base1's basestring" />
    <protInt typeid="2" value="1" />
    <privDbl typeid="6" value="2.3" />
  </base1>
  <base2 id="2" />
  <baseInt typeid="2" value="33" />
  <baseStr typeid="3" value="this is a basestring" />
  <protInt typeid="2" value="15" />
  <base.privDbl typeid="6" value="3.4" />
</Test1>

Fields are serialized into nodes. To reduce file size, the names of fields are used as the names of the nodes (I am taking advantage of the fact that almost all identifier names that are accepted by the C# compiler are also valid XML node names).

During serialization, two main kinds of objects are distinguished:

simple types: primitive types (int, double, string, bool, etc.), enums, and DateTime.
complex types: all other types that will have their fields serialized.

In the case of simple types, the value can be found inside the value attribute. Complex types have several fields themselves which get serialized as child nodes.

Serializing Type Information

Type information consists of two tokens: type name and assembly. Since type information can get pretty verbose, CustomXmlSerializer uses a type dictionary so that types can be referred to by an ID.
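A type dictionary of this kind could be sketched as follows. This is an illustration rather than the article's implementation: the TypeCache class and its members are hypothetical, the numbering scheme (starting at 1) is an assumption, and only the element and attribute names (TypeCache, TypeInfo, typeid, type, assembly) are taken from the sample output above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Xml;

// Hypothetical sketch of a type dictionary, not the article's code.
public sealed class TypeCache
{
    private readonly Dictionary<Type, int> ids = new Dictionary<Type, int>();
    private int nextId = 1; // assumption: ids are handed out sequentially

    // Returns the id for 'type', registering it on first use.
    public int GetOrAdd(Type type)
    {
        int id;
        if (!ids.TryGetValue(type, out id))
        {
            id = nextId++;
            ids.Add(type, id);
        }
        return id;
    }

    // Writes every registered type once, with its two tokens
    // (type name and assembly name) and the id that refers to it.
    public XmlElement ToXml(XmlDocument doc)
    {
        XmlElement cache = doc.CreateElement("TypeCache");
        foreach (KeyValuePair<Type, int> entry in ids)
        {
            XmlElement info = doc.CreateElement("TypeInfo");
            info.SetAttribute("typeid", entry.Value.ToString(CultureInfo.InvariantCulture));
            info.SetAttribute("type", entry.Key.FullName);
            info.SetAttribute("assembly", entry.Key.Assembly.FullName);
            cache.AppendChild(info);
        }
        return cache;
    }
}
```

With such a cache, a node only needs a short typeid attribute instead of repeating the full type and assembly names.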

Serializing Object References

All objects of non-primitive types that get serialized by CustomXmlSerializer are added to an object dictionary. If there is more than one reference to one of the entries, the respective object is serialized only once, and the other fields may refer to it by an ID. This approach ensures the correct serialization of an object graph.
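The object dictionary can be pictured as a map from object identity to an ID. The sketch below is an assumption about how such a map might look, not the article's code; ObjectRegistry and TryRegister are hypothetical names. It compares by reference, so two distinct objects that happen to be equal by value still get separate IDs.

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Hypothetical sketch of an object dictionary, not the article's code.
public sealed class ObjectRegistry
{
    // Compares objects by identity, ignoring any Equals() override.
    private sealed class ByReference : IEqualityComparer<object>
    {
        bool IEqualityComparer<object>.Equals(object x, object y)
        {
            return ReferenceEquals(x, y);
        }

        int IEqualityComparer<object>.GetHashCode(object obj)
        {
            return RuntimeHelpers.GetHashCode(obj);
        }
    }

    private readonly Dictionary<object, int> ids =
        new Dictionary<object, int>(new ByReference());
    private int nextId = 1; // assumption: ids are handed out sequentially

    // Returns true when 'obj' is seen for the first time (serialize it in full);
    // returns false when it was seen before (emit only a reference to 'id').
    public bool TryRegister(object obj, out int id)
    {
        if (ids.TryGetValue(obj, out id))
        {
            return false;
        }
        id = nextId++;
        ids.Add(obj, id);
        return true;
    }
}
```

A serializer built on this would write the full node when TryRegister() returns true, and a short node carrying only the id otherwise, much like the base2 node in the sample output.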

Serialization of Base Classes

Serializing complex types implies the serialization of the full class hierarchy behind a certain type. This means that all ancestors of an object (referred to by the base reference) must be serialized. Fields of the base class (or classes) are serialized in the same manner as the ones of the type itself: they are all flatly listed in the XML as if they belonged to the type to be serialized.

class Base1
{   
   protected int protInt = 1;
   private double privDbl = 2.3;
}

class Test1 : Base1
{
   public int myInt = 7;
   public double privDbl = 3; // note: this field hides Base1.privDbl (fields cannot be overridden)
}

An instance of Test1 contains, in fact, an instance of Base1, so CustomXmlSerializer must serialize all fields combined. The output contains four nodes. Since Test1 declares its own privDbl field, which hides the one in Base1, the value of Base1.privDbl must be serialized in a node that is distinguishable from Test1.privDbl. This special node is named "base.privDbl".
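The flat listing with the "base." prefix can be sketched by walking the type hierarchy from the derived type towards its ancestors. This is an illustration under assumptions, not the article's code: HierarchyWalker and the two demo classes are hypothetical, and repeating the prefix for deeper hierarchies is my own guess, since the article only shows a single level.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical sketch, not the article's code.
public static class HierarchyWalker
{
    // Lists one node name per field, most derived type first.
    public static List<string> NodeNames(Type type)
    {
        const BindingFlags flags = BindingFlags.Instance | BindingFlags.Public |
                                   BindingFlags.NonPublic | BindingFlags.DeclaredOnly;
        List<string> names = new List<string>();
        for (Type t = type; t != null && t != typeof(object); t = t.BaseType)
        {
            foreach (FieldInfo field in t.GetFields(flags))
            {
                string name = field.Name;
                // Assumption: prefix with "base." until the name is unique.
                while (names.Contains(name))
                {
                    name = "base." + name;
                }
                names.Add(name);
            }
        }
        return names;
    }
}

// Demo classes mirroring Base1 / Test1 from the text.
class DemoBase
{
    protected int protInt = 1;
    private double privDbl = 2.3;
}

class DemoDerived : DemoBase
{
    public int myInt = 7;
    public double privDbl = 3; // hides DemoBase.privDbl
}
```

For DemoDerived, the result holds four names: myInt, privDbl, protInt and base.privDbl. The order of fields within one type is not guaranteed by Reflection.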

Versioning

As the software evolves, the objects that are to be serialized evolve too. Compatibility with older versions is mandatory. A newer version of the software is required to be able to deserialize older versions of its objects. Older versions, however, are not required to be able to deal with the newer versions of their objects (it is assumed that a user can always upgrade to the newest version of the software which is guaranteed to open any saved data).

In order to prevent older versions from attempting to deserialize data not meant for them, the root node of the XML contains the version attribute. The deserializer will check the version value from the file against the maximum supported by the current code, and will refuse deserialization in case the saved version is greater than the supported version.
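The version check described above might look like the following sketch. VersionGuard and its methods are hypothetical names, and the choice of NotSupportedException is an assumption; the article does not say which exception the real deserializer throws. Only the version attribute on the root node comes from the text.

```csharp
using System;
using System.Globalization;
using System.Xml;

// Hypothetical sketch of the version check, not the article's code.
public static class VersionGuard
{
    // Reads the version attribute from the root node of the XML.
    // A missing attribute yields an empty string, so int.Parse throws.
    public static int ReadVersion(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        string raw = doc.DocumentElement.GetAttribute("version");
        return int.Parse(raw, CultureInfo.InvariantCulture);
    }

    // Refuses data saved by a newer version than the current code supports.
    public static void EnsureSupported(string xml, int maxSupportedVersion)
    {
        int saved = ReadVersion(xml);
        if (saved > maxSupportedVersion)
        {
            // Assumption: the exception type is chosen for illustration only.
            throw new NotSupportedException(
                "Saved version " + saved + " is newer than supported version " +
                maxSupportedVersion + ".");
        }
    }
}
```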

Usage

Serialization is done using the CustomXmlSerializer class.

Test1 t = new Test1();
// init t

// serialize t with version 1
XmlDocument doc = CustomXmlSerializer.Serialize(t, 1, "Test1");
// save XML document to disk or do anything else with it
doc.Save(@"c:\out.xml");

The root node may be named arbitrarily. Serialize() will use the given parameter as the node's name.

Simple deserialization using CustomXmlDeserializer (no code changes occurred between serialization and deserialization):

// load XML document and parse it
XmlDocument doc = new XmlDocument();
doc.Load(...);
// deserialize a Test1 instance having a version number of at most 1
Test1 t = (Test1)CustomXmlDeserializer.Deserialize(doc.OuterXml, 1);

If significant code changes occurred since serialization and Test1 was affected, a different overload of Deserialize() can be used to handle translations that might be necessary. Suppose part of the code (usually a class) referenced by one of Test1's fields was moved from assembly Asm1 to Asm2. Files serialized before the move will contain references to Asm1, but using this reference will fail since the code that the deserializer is looking for is now in Asm2. Therefore, a translator is needed that will enable CustomXmlDeserializer to instantiate the sought type from its new home, Asm2. The translator is an instance of a class that must implement the CustomXmlDeserializer.ITypeConverter interface.

// load XML
string xml = "...";
// deserialize a Test1 instance having a version number of at most 2
// since Test1's code has been changed, Test1TypeConverter
// is used to resolve changed type names
Test1 t2 = (Test1)CustomXmlDeserializer.Deserialize(xml, 2, 
                  new Test1TypeConverter());
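The article does not list the members of CustomXmlDeserializer.ITypeConverter, so the sketch below defines its own stand-in interface to show the idea of redirecting Asm1 to Asm2. IDemoTypeConverter, its ProcessType signature and Asm1ToAsm2Converter are all hypothetical; the real interface in the download may look different.

```csharp
using System;

// Stand-in for CustomXmlDeserializer.ITypeConverter; the real signature is
// not shown in the article, so this one is an assumption.
public interface IDemoTypeConverter
{
    // Receives the names found in the file and may replace them in place.
    void ProcessType(ref string assemblyFullName, ref string typeFullName);
}

// Redirects every type saved from assembly Asm1 to its new home, Asm2.
public sealed class Asm1ToAsm2Converter : IDemoTypeConverter
{
    public void ProcessType(ref string assemblyFullName, ref string typeFullName)
    {
        // Full assembly names start with the simple name followed by a comma.
        if (assemblyFullName.StartsWith("Asm1,", StringComparison.Ordinal))
        {
            assemblyFullName = "Asm2" + assemblyFullName.Substring("Asm1".Length);
        }
        // The type name is left untouched: only the assembly changed.
    }
}
```

The same pattern would cover namespace or type renames by rewriting typeFullName as well.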

Controlling Serialization

CustomXmlSerializer is designed to be able to serialize objects without requiring developers to modify their classes. However, controlling the serialization process is sometimes useful. The output of CustomXmlSerializer.Serialize() is controlled through attributes.

System.Xml.Serialization.XmlIgnoreAttribute - sticking this attribute in front of a field will make both the serializer and the deserializer ignore the existence of that field. I decided to use .NET's built-in attribute rather than create my own because of its suggestive name, and to keep compatibility with .NET's built-in serialization methods.
XmlIgnoreBaseTypeAttribute - this attribute is used on classes (or structs) that serve as base classes to derived types. Using it keeps the serializer from exploring the type's fields. This can be useful in some cases if serializing the base class of a type is not necessary because there are other ways to accurately restore the state of an object.
XmlSerializeAsCustomTypeAttribute - applied to classes (or structs), this attribute forces the serializer to treat the type as a complex type. Complex types always have their fields serialized. Using the attribute might prove useful if CustomXmlSerializer's normal judgment on how the type should be serialized is inappropriate or inefficient. For example, all types that implement the IEnumerable interface are serialized as an array. This behavior might not be desirable for a custom class, and can be avoided by applying this attribute.
CustomXmlSerializationOptionsAttribute - applied to classes (or structs), this attribute defines how instances of the decorated type should be serialized. Two options can be specified, both of which default to true:

create a type dictionary (cache) - it might be useful to turn off the generation of a type cache for debugging purposes, if the serialized output is to be inspected by humans, since following typeids is uncomfortable.
use graph serialization - turning off this option will make the algorithm explicitly serialize every object, regardless of whether it has already been serialized earlier. With the option off, the deserializer won't reproduce the exact object graph that was originally serialized.

The attribute is only considered for the "root" object's type. So, the attribute is ignored if the type is serialized as a field value. I thought that it makes more sense to apply serialization options to the entire file, not to particular objects or subparts of the original type.
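As an illustration of attribute-driven control, the sketch below decorates a field with .NET's XmlIgnoreAttribute and filters it out through Reflection. It is not the article's code; Session and IgnoreFilter are hypothetical names, and the custom attributes of CustomXmlSerializer are left out because their constructors are not shown in the text.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Xml.Serialization;

// Demo type: one ordinary field and one field excluded from serialization.
public class Session
{
    public string User = "demo";

    [XmlIgnore]
    public DateTime LastTouched = DateTime.MinValue;
}

// Hypothetical helper, not part of CustomXmlSerializer.
public static class IgnoreFilter
{
    // Returns the names of the declared instance fields that are
    // not decorated with XmlIgnoreAttribute.
    public static List<string> SerializableFieldNames(Type type)
    {
        const BindingFlags flags = BindingFlags.Instance | BindingFlags.Public |
                                   BindingFlags.NonPublic | BindingFlags.DeclaredOnly;
        List<string> names = new List<string>();
        foreach (FieldInfo field in type.GetFields(flags))
        {
            if (!Attribute.IsDefined(field, typeof(XmlIgnoreAttribute)))
            {
                names.Add(field.Name);
            }
        }
        return names;
    }
}
```

For Session, only User passes the filter; LastTouched is skipped on both the way out and the way back in.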

Internals of the Serialization Process

Type information and the object's fields are all accessed using Reflection.

An object is serialized based on its type.

primitive types, strings and DateTime objects are serialized directly into the value attribute of their respective nodes.
enums are serialized as longs into the value attribute of their respective nodes. This treatment conserves space in the output file, and allows the developer to freely rename the enum's members in future versions without having to worry about deserialization.
instances of types that implement the IXmlSerializable interface are serialized into a value subnode. Types implementing this interface are considered to be able to fully serialize / deserialize themselves. For example, the DataSet type is serialized through this interface because in this case, a field by field serialization would be overkill.
types implementing the IEnumerable interface are serialized as a collection, in a foreach loop. This treatment covers most of the framework's Array and List derivatives, both the generic versions and the pre-generic ones.
class and struct types that do not fall in any of the above categories are serialized as complex types: their fields are enumerated and serialized as child nodes.
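The order of these checks matters, which a small classification sketch makes visible. This is an assumption about how the dispatch could be written, not the article's code; NodeKind and Classifier are hypothetical names. Note that string implements IEnumerable, so it has to be recognized as a simple type before the collection test runs.

```csharp
using System;
using System.Collections;
using System.Xml.Serialization;

// Hypothetical categories matching the list above.
public enum NodeKind { Simple, Enum, XmlSerializable, Collection, Complex }

// Hypothetical sketch of the dispatch, not the article's code.
public static class Classifier
{
    public static NodeKind Classify(Type type)
    {
        // Primitive types, strings and DateTime go into the value attribute.
        if (type.IsPrimitive || type == typeof(string) || type == typeof(DateTime))
        {
            return NodeKind.Simple;
        }
        // Enums are written as their numeric value.
        if (type.IsEnum)
        {
            return NodeKind.Enum;
        }
        // Types that can serialize themselves take precedence over collections.
        if (typeof(IXmlSerializable).IsAssignableFrom(type))
        {
            return NodeKind.XmlSerializable;
        }
        // Arrays, lists and dictionaries are enumerated.
        if (typeof(IEnumerable).IsAssignableFrom(type))
        {
            return NodeKind.Collection;
        }
        // Everything else is serialized field by field.
        return NodeKind.Complex;
    }
}
```

One limitation of this sketch: decimal is not a primitive type in the Reflection sense, so it would be classified as Complex here.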

Deserialization is basically symmetrical to serialization. There are differences in the process, however: for example, collections are easily serialized (as they implement IEnumerable), but on deserialization, the exact type of the collection (List, Dictionary, Array, etc.) is important because these types all behave differently.

First, the type of the object to be deserialized is inferred either from the type dictionary (using the typeid attribute of the node) or from the type and assembly attributes of the node. Type information is then translated by a "type converter", which handles code movement among assemblies, and namespace or type name changes that might have occurred in code since the serialization. Basically, the ProcessType() function returns the new information corresponding to a (type name, assembly name) pair. Given the full name of the assembly that should contain the definition of the type the deserializer is looking for, the assembly is loaded into the application domain using the Assembly.Load() method.

Once the type of the object is deduced, the object's value is interpreted.

Primitive types, strings, and DateTimes are created using the Convert class' static methods.
enums are created using Enum.ToObject().
Arrays are deserialized as follows: the number of elements is determined by counting the number of child nodes. An instance of the array class is created using the constructor that takes the number of elements (an int) as a parameter. In a foreach loop, each entry of the array is assigned the deserialized value of its respective node.
Types that do not fall in the above categories are instantiated using Activator.CreateInstance(), which is why a type must have a parameterless constructor to be deserializable. The access modifier of the parameterless constructor is not important.
Types that implement the IXmlSerializable interface are allowed to deserialize themselves from the contents of the value node.
Types that implement the IList interface are deserialized by using their Add() method to fill them with content.
Types that implement the IDictionary interface are deserialized from their child nodes. Since these types implement IEnumerable, they are serialized as a collection by enumerating their entries. These entries are either KeyValuePair entries (in the case of the generic Dictionary), or DictionaryEntry entries (in the case of the Hashtable). The deserializer processes these entries, and adds the values to the object using the Add() method.
The two dictionary entry types, KeyValuePair and DictionaryEntry, receive special treatment during deserialization. Instead of being deserialized into an object of their respective type, an instance of Dictionary<string, object> is created and filled with two entries: key and value. This special treatment means that if you serialize a dictionary entry as the "root" object, on deserialization, you will get a Dictionary<string, object>, and not what you have serialized. Of course, you can still retrieve your data from the returned object, it just requires an additional line of code. Even though this behavior is not expected, I think that we will very seldom need to serialize dictionary entries outside a dictionary; therefore, this little oddity is acceptable.
Types that are not part of any of the categories listed above are deserialized as complex types, field-by-field.

All values are deserialized using the culture that they were serialized with. The culture information is stored in the culture attribute of the root node of the XML file.
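The framework calls named in the list above can be combined into a small sketch. It is not the article's code: ValueFactory, its methods and the Hidden class are hypothetical, and the format provider parameter stands in for the culture read from the root node.

```csharp
using System;
using System.Globalization;

// Hypothetical sketch of value creation, not the article's code.
public static class ValueFactory
{
    // Primitive types, strings and DateTimes: converted with the saved culture.
    public static object ParseSimple(string text, Type type, IFormatProvider culture)
    {
        return Convert.ChangeType(text, type, culture);
    }

    // Enums: the stored number is turned back into an enum value.
    public static object ParseEnum(string text, Type enumType)
    {
        long raw = long.Parse(text, CultureInfo.InvariantCulture);
        return Enum.ToObject(enumType, raw);
    }

    // Arrays: sized by the number of child values, then filled one by one.
    public static Array BuildArray(Type elementType, string[] items, IFormatProvider culture)
    {
        Array result = Array.CreateInstance(elementType, items.Length);
        for (int i = 0; i < items.Length; i++)
        {
            result.SetValue(ParseSimple(items[i], elementType, culture), i);
        }
        return result;
    }

    // Other types: created through the parameterless constructor,
    // even when that constructor is not public.
    public static object CreateEmpty(Type type)
    {
        return Activator.CreateInstance(type, true);
    }
}

// Demo type with a private parameterless constructor.
public class Hidden
{
    public int Value = 5;

    private Hidden() { }
}
```

Passing the format provider explicitly is what keeps a value such as "5,6", written under a culture that uses a decimal comma, from being misread on a machine with different regional settings.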

Points of Interest

When developing CustomXmlSerializer, I have concentrated on developing a serialization component that would fit my (actually, the project's) needs. Thus, I cannot (and do not want to) say that it is perfect or fit to serialize any type of object. It is merely another solution for the problem of XML serialization. It tries to overcome some of the shortcomings of other similar components available for download.

Some cases are still left unsolved. The disadvantages of using CustomXmlSerializer are:

Naming of XML tags. The naming scheme cannot be customized, and using the names of the internal fields of a class as node names might reveal information about the code, which could be a potential security risk.
Binary data is not correctly serialized. In my opinion, binary data should be serialized in a binary file, which is way more efficient. Still, I admit that serialization of binary data to XML is sometimes useful, and it's on my TODO list for the future. Currently, if serialization of binary data is desired, custom code must be used (the class containing binary data should implement IXmlSerializable).
Assembly loading. It is assumed that all code necessary for deserialization of a type is available for the deserializer to load. The XML file contains the full names of the assemblies necessary to initialize a type. A common scenario is that the version of an assembly changes, which normally renders the deserialization of types contained in that assembly impossible. The solution to this is to use the type converter to process the affected assembly names and return the new full name of the assembly to CustomXmlDeserializer. This solution is not very complex, but still requires the developer to write code every time an assembly's version is changed.
Speed. Since serialization is done in XML, and Reflection is used extensively, speed is clearly not one of the component's features. I haven't compared performance to other similar solutions, mainly because I wasn't required to optimize the component for speed. Consequently, I wouldn't recommend using the serializer in real-time scenarios without first checking if its performance is satisfactory.
Control of the serialization process is limited to attributes. Currently, serialization can be controlled only through attributes (see the Controlling Serialization section). Using attributes is usually quite neat, but decorating code is only possible if we have access to the source code. This may not always be the case. At other times, modifying the code of a class is possible, but not desirable. A way around this would be to create a class that holds code-serialization-option associations. These associations could be deduced from attribute decorations, or created dynamically by the developer responsible for serializing the objects. Writing such a class is another item on my TODO list.
Deserializing KeyValuePair and DictionaryEntry instances can be a little awkward (see the Internals of the Serialization Process section).
Handle changes of types for fields. As I mentioned earlier in the article, handling code changes is a requirement. Code movement among assemblies and namespaces and possible type name modifications are handled by the type converter. However, there are other possible scenarios of code change. For example, the type of a field might change.

// Test2 at the time of serialization
class Test2
{
   public int Val;
   public string Str;
}

// Test2 at the time of deserialization
class Test2
{
   public string Val;
   // The new type of the field is incompatible with the old one,
   // so deserialization will fail.
   // If the new type of Val was compatible with the old one,
   // for example if it was double, deserialization would have been OK.

   public string Str;
}

Currently, there is no mechanism implemented that could make the new Test2 deserialize correctly. If we want to avoid the exception, the new string Val field must be renamed or decorated with the XmlIgnore attribute. Alternatively, the string type could be replaced by a type assignable from the original int type.

Conclusion

CustomXmlSerializer is able to deal with many kinds of business objects, serializing data in a relatively efficient way. It is a viable alternative to other serialization solutions, and should fit common code scenarios. It can certainly be improved, but I think it is more flexible than other similar serialization components that can be found out there.

I tried to enumerate all of its disadvantages, hoping that anything that doesn't fit one of the listed problems will serialize flawlessly. Of course, I might be wrong, and I'm looking forward to feedback about testing the component. I admit I did not have enough time to create a thorough test kit (proof of that is the brevity of the test code in the downloadable solution).