Introduction

This is the second of two articles on optimizing serialization, especially for use in remoting.

In this article we will be looking at a complete and real-world example of how to use the Fast Serialization code introduced in Part 1 to serialize DataSets and DataTables including Typed variants.

First, we'll look at some test results to see whether incorporating this code is really worthwhile.
Then I'll describe how the code is used at a basic level.
Next, I'll show how to incorporate it into the .Net remoting infrastructure.
Finally, I'll go through how the code works and how you might use the same or similar techniques on non-DataSet classes.

1. Test Results

I have produced some test results using various sizes of source data to give an idea of the magnitude of reduction in size and time taken you might expect to achieve.

The time taken includes both Serialization and Deserialization and is to millisecond accuracy.
Time is averaged over 10 runs plus an initial non-timed JIT pass.
A full garbage collect (non-timed) is performed between passes to ensure that only the serialization/deserialization routines are timed.
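The measurement procedure described above could be sketched roughly as follows. This is only an illustration of the approach (one untimed warm-up pass, a full untimed collection between passes, and an average over the timed passes); the TimingSketch class and its AverageSeconds helper are hypothetical names and are not part of the download, and the workload in Main is a placeholder for a real serialize/deserialize round trip.

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Hypothetical helper: returns the average time in seconds taken by
    // 'action' over 'runs' timed passes.
    public static double AverageSeconds(Action action, int runs)
    {
        action(); // initial untimed pass so that JIT compilation is excluded

        Stopwatch stopwatch = new Stopwatch();
        for (int i = 0; i < runs; i++)
        {
            // Full garbage collection between passes, outside the timed region.
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();

            stopwatch.Start();
            action();
            stopwatch.Stop(); // elapsed time accumulates across passes
        }
        return stopwatch.Elapsed.TotalSeconds / runs;
    }

    public static void Main()
    {
        // Placeholder workload; a real test would serialize and then
        // deserialize a DataSet inside this delegate.
        double seconds = AverageSeconds(delegate
        {
            byte[] buffer = new byte[1024 * 1024];
            Array.Reverse(buffer);
        }, 10);

        Console.WriteLine("Average: {0:F3} seconds", seconds);
    }
}
```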

Northwind - tables only

A DataSet containing all 13 tables from Northwind.

Method                       Size (bytes)    Time Taken (seconds)
Vanilla .NET serialization   1,431,297       0.560
Fast Serialization           383,001         0.031
Difference                   73.2% smaller   18.1x faster

Northwind - both tables and views

A DataSet containing all 13 tables and 16 views from Northwind.

Method                       Size (bytes)    Time Taken (seconds)
Vanilla .NET serialization   5,635,208       2.624
Fast Serialization           688,910         0.126
Difference                   87.8% smaller   20.8x faster

Large Reference Data Table Test - 35,125 rows of 15 columns

A DataSet containing a single DataTable loaded with a large reference database table comprising:

4 x int columns

1 x varchar(15) column

1 x varchar(50) column

2 x varchar(30) columns

1 x varchar(6) column

1 x varchar(500) column

4 x char(1) columns

1 x datetime column

Method                       Size (bytes)    Time Taken (seconds)
Vanilla .NET serialization   16,736,432      7.402
Fast Serialization           2,340,840       0.571
Difference                   86.0% smaller   12.96x faster

Massive Reference Data View Test - 39,071 rows of 91 columns

A DataSet containing a single DataTable loaded with a massive database view comprising:

13 x bit columns

6 x char(1) columns

2 x char(2) columns

1 x char(3) column

12 x datetime columns

19 x int columns

1 x numeric(10,2) column

5 x numeric(10,4) columns

4 x numeric(15,4) columns

6 x numeric(4,2) columns

1 x varchar(10) column

3 x varchar(20) columns

1 x varchar(255) column

1 x varchar(3) column

5 x varchar(5) columns

7 x varchar(50) columns

1 x varchar(500) column

3 x varchar(70) columns

If the XML output from a DataSet containing this data is written to a file, it takes a whopping 112MB and 2,118,900 lines!

No problem for the Fast Serializer but .NET cannot serialize a DataSet of this magnitude and crashes with an OutOfMemoryException.

Method                       Size (bytes)    Time Taken (seconds)
Vanilla .NET serialization   <failed>        <failed>
Fast Serialization           6,513,586       2.960

From these results, it is clear that Fast Serialization was faster and produced a smaller size in every test. For smaller sets of data, the vanilla .Net serialization may be adequate, but it doesn't scale as well as Fast Serialization: the time and size differences become more apparent as more data is serialized, until the vanilla .Net serializer just cannot cope and throws an exception.

It is important to note that stream compression (via Custom Sinks) can reduce the final data size to less than half of even the smallest size shown here. However, you may find that the smaller sizes generated much more quickly by Fast Serialization are acceptable anyway, and that the overhead of compression is not worthwhile except perhaps for transmission over known slow connections.
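To judge whether compression is worth its overhead for your own data, the serialized byte array can be compressed in isolation and the sizes compared. The following is a minimal sketch using the framework's GZipStream; it is not the custom sink code mentioned above, and the CompressionSketch class name and the sample data are assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class CompressionSketch
{
    // Compresses a byte array (such as one produced by a serializer) with GZip.
    public static byte[] Compress(byte[] data)
    {
        using (MemoryStream output = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(data, 0, data.Length);
            } // disposing the GZipStream flushes the final compressed block
            return output.ToArray(); // ToArray still works on a closed MemoryStream
        }
    }

    // Reverses Compress.
    public static byte[] Decompress(byte[] data)
    {
        using (MemoryStream input = new MemoryStream(data))
        using (GZipStream gzip = new GZipStream(input, CompressionMode.Decompress))
        using (MemoryStream output = new MemoryStream())
        {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = gzip.Read(buffer, 0, buffer.Length)) > 0)
            {
                output.Write(buffer, 0, read);
            }
            return output.ToArray();
        }
    }

    public static void Main()
    {
        // Stand-in for serialized data: highly repetitive, so it compresses well.
        byte[] original = new byte[100000];
        for (int i = 0; i < original.Length; i++) original[i] = (byte) (i % 16);

        byte[] compressed = Compress(original);
        byte[] restored = Decompress(compressed);

        Console.WriteLine("Original: {0} bytes, compressed: {1} bytes, restored: {2} bytes",
                          original.Length, compressed.Length, restored.Length);
    }
}
```

Timing Compress and Decompress alongside the serialization itself gives a direct view of the trade-off between size and speed for a particular connection.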

2. Using the Code

The code download includes a class called AdoNetHelper. This has a number of static methods to provide serialization and deserialization services for supported ADO.Net objects. There are also a couple of non-serialization specific helper methods and this is the reason I created a single helper class rather than separate classes - I like to keep generic helper code in one place.

In a nutshell, for serialization you pass in the ADO.Net object you want to serialize and get a byte array back. For deserialization, you pass in the byte array (and either an empty ADO.Net object or a Type which will be instantiated) and get back your ADO.Net object fully populated.

Supported ADO.Net objects are DataSet, DataTable, Typed DataSet and Typed DataTable. There is also support for a 'simple' DataTable where 'simple' is defined as a Typed DataTable which is known to contain only unmodified rows and no errors. This allows us to just serialize the raw data itself without any infrastructure overhead such as RowState and Columns etc.

DataSet

  public static byte[] SerializeDataSet(DataSet dataSet)
  public static DataSet DeserializeDataSet(byte[] serializedData)

A plain DataSet is quite easy to serialize - one method to serialize and one to deserialize.

All of the infrastructure will be stored including Tables, Columns, Rows, Constraints, Extended Properties, Xml namespaces and so on.

Typed DataSet

  public static byte[] SerializeTypedDataSet(DataSet dataSet)
  public static DataSet DeserializeTypedDataSet(Type dataSetType,
                                                byte[] serializedData)
  public static DataSet DeserializeTypedDataSet(DataSet dataSet,
                                                byte[] serializedData)

A Typed DataSet is one generated using the Microsoft MSDataSetGenerator from an .XSD schema. It already contains the infrastructure so serialization only requires storage of the actual data for the pre-defined tables.

Rather than use the generated DataSet class directly, I have found it useful to derive a new class from it and write all code against that instead. That way I can add methods, interfaces etc. without having to worry about them being overwritten should the schema change, and it provides an easy way to hook in our serialization code for remoting purposes.

As an example, if I have a set of data called XXX, I name the schema file XXXDataSetSchema.xsd which auto-generates XXXDataSetSchema.cs; I then create XXXDataSet.cs which contains a class called XXXDataSet and inherits from XXXDataSetSchema. Only XXXDataSet is used in the application.

For deserialization you can either pass in the Type of the Typed DataSet (in which case one will be created) or a pre-instantiated one - the data will then be restored into the pre-created tables.

DataTable

  public static byte[] SerializeDataTable(DataTable dataTable)
  public static DataTable DeserializeDataTable(byte[] serializedData)

Very similar usage to the plain DataSet. The DataTable can be standalone or part of an existing DataSet but where part of a DataSet, only Unique Constraints will be serialized; any Foreign Key constraints are ignored.

Typed DataTable

  public static byte[] SerializeTypedDataTable(DataTable dataTable)
  public static DataTable DeserializeTypedDataTable(DataTable dataTable,
                                                byte[] serializedData)
  public static DataTable DeserializeTypedDataTable(Type dataTableType,
                                                byte[] serializedData)

It isn't really feasible to subclass a generated Typed DataTable in the same way as a Typed DataSet. So, whilst these helper methods can easily generate the serialized byte array, getting remoting to use these methods is a little more tricky but not impossible - see next section.

Simple DataTable

  public static byte[] SerializeSimpleTable(DataTable dataTable)
  public static DataTable DeserializeSimpleDataTable(DataTable dataTable,
                                                byte[] serializedData)
  public static DataTable DeserializeSimpleDataTable(Type dataTableType,
                                                byte[] serializedData)

Usage is identical to Typed DataTable but the intent is that these tables are known to have no errors (at both the row and column level) and that all of the RowStates are Unchanged, as would be the case if the table has just been populated by a database query.

In fact, the routine will also run with Added and Modified rows (but not Deleted rows - that will throw an exception) but on deserialization, the RowState for all rows will be Unchanged.
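Before choosing the 'simple' methods, it may be worth checking that a table actually meets these conditions. The sketch below shows one possible pre-check using only standard DataTable members; the SimpleTableSketch class and its IsSimpleCandidate method are hypothetical and are not part of AdoNetHelper.

```csharp
using System;
using System.Data;

public static class SimpleTableSketch
{
    // Hypothetical pre-check: returns false if the table has row/column errors
    // or contains Deleted rows (which the text above says cause an exception).
    public static bool IsSimpleCandidate(DataTable table)
    {
        if (table == null) throw new ArgumentNullException("table");
        if (table.HasErrors) return false;

        foreach (DataRow row in table.Rows)
        {
            if (row.RowState == DataRowState.Deleted) return false;
        }
        return true;
    }

    public static void Main()
    {
        DataTable table = new DataTable("Sample");
        table.Columns.Add("Id", typeof(int));
        table.Rows.Add(1);
        table.Rows.Add(2);

        // Rows are Added at this point; AcceptChanges marks them Unchanged,
        // which is the state a freshly queried table would normally be in.
        table.AcceptChanges();
        Console.WriteLine(IsSimpleCandidate(table));

        // Deleting an Unchanged row leaves it in the table with RowState Deleted.
        table.Rows[0].Delete();
        Console.WriteLine(IsSimpleCandidate(table));
    }
}
```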

This set of methods can also be used for other DataTables not generated from an XSD file. For example, LLBLGenPro has the concept of a TypedView and a TypedList which both derive from DataTable and are used for read-only data but use their own internal schema to generate the infrastructure of Columns etc. As long as the DataTable infrastructure has already been configured, these deserialization methods will repopulate all of the data.

All of these helper methods, after appropriate parameter validation, use internal, nested classes to perform the actual work and return the result. As they stand, it is possible to use them as-is within remoting by passing around the generated byte array instead of the real object in your service interfaces. However that doesn't make code easy to read or understand so let's look at alternative methods of using this code.

3. Usage in Remoting

We now have a way of taking an ADO.Net serializable object and putting it into a byte array and vice versa. Now we need a way of getting .Net Remoting to use our serialization code rather than letting the DataSet do its XML serialization thing.

Unfortunately, there is no way to make this completely transparent since we can't modify the DataSet/DataTable source code. But we do have a choice of methods to make it work depending on how far you want to/are able to change your source code.

What we must do is implement the ISerializable interface somehow so that the BinaryFormatter created by the Remoting runtime will give us control to serialize ourself. There are two ways of doing this, one is to implement ISerializable directly on the object to be serialized and the other is to use a surrogate which is an external class and so requires no modification to the object itself.

In both cases there will be two methods to implement: a GetObjectData method to serialize and a constructor to deserialize. Actually, in the case of a surrogate object this will be a SetObjectData method but the principle is the same - you get an object of the correct type but one that is completely uninitialized and it is your responsibility to do whatever is required to configure the object.

Let's look at methods for plain DataSets first:

FastSerializableDataSet

The simplest way of all is to create a new class derived from DataSet which does nothing else except implement ISerializable so that it can get control at serialization time.

Here is a sample class (included in the download):

[Serializable]
public class FastSerializableDataSet: DataSet, ISerializable
{
    #region Constructors
    public FastSerializableDataSet(): base() {}
    public FastSerializableDataSet(string dataSetName): base(dataSetName) {}
    #endregion Constructors

    #region ISerializable Members
    protected FastSerializableDataSet(SerializationInfo info,
                                      StreamingContext context)
    {
        AdoNetHelper.DeserializeDataSet(this,
                               (byte[]) info.GetValue("_", typeof(byte[])));
    }

    public void GetObjectData(SerializationInfo info,
                              StreamingContext context)
    {
        info.AddValue("_", AdoNetHelper.SerializeDataSet(this));
    }
    #endregion
}

Pretty simple really as it boils down to two lines - one for serialization and one for deserialization.

If you can change your project so that all references to DataSet are replaced with FastSerializableDataSet instead then you should be good to go!

However, life is rarely as simple as that and it may not be acceptable to change all references to DataSet, so is there a less intrusive way?

WrappedDataSet

If you are not able to change all DataSet references in your code, you may be able to change just those that occur in your remoting interface.

Below is a class that 'wraps' a plain DataSet and provides overloaded implicit operators so that you can assign/pass a DataSet wherever you have a WrappedDataSet reference. WrappedDataSet implements ISerializable and its implementation just serializes the wrapped DataSet instead. By using WrappedDataSet in your interface declarations rather than DataSet, you can use Fast Serialization without changing other references in your application.

Here is the class:

[Serializable]
public class WrappedDataSet: ISerializable
{
#region Casting Operators
    public static implicit operator DataSet (WrappedDataSet wrappedDataSet)
    {
        return wrappedDataSet.DataSet;
    }

    public static implicit operator WrappedDataSet (DataSet dataSet)
    {
        return new WrappedDataSet(dataSet);
    }
#endregion Casting Operators


#region Constructors
    public WrappedDataSet(DataSet dataSet)
    {
        if (dataSet == null) throw new ArgumentNullException("dataSet");
        this.dataSet = dataSet;
    }
#endregion Constructors


#region Properties
    private DataSet dataSet;

    public DataSet DataSet
    {
        get { return dataSet; }
    }
#endregion Properties


#region ISerializable Members
    protected WrappedDataSet(SerializationInfo info, StreamingContext context)
    {
        dataSet = AdoNetHelper.DeserializeDataSet((byte[]) info.GetValue("_",
                                                  typeof(byte[])));
    }

    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        info.AddValue("_", AdoNetHelper.SerializeDataSet(dataSet));
    }
#endregion
}
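To illustrate how the implicit operators keep the rest of the application unchanged, here is a small sketch of a remoting interface declared in terms of a wrapper. So that it compiles on its own, it uses a reduced stand-in class containing only the conversions (the serialization members are omitted); WrappedDataSetStandIn, ICustomerService and CustomerService are hypothetical names invented for this example.

```csharp
using System;
using System.Data;

// Reduced stand-in for the wrapper: conversions only, no ISerializable members.
public class WrappedDataSetStandIn
{
    private readonly DataSet dataSet;

    public WrappedDataSetStandIn(DataSet dataSet)
    {
        if (dataSet == null) throw new ArgumentNullException("dataSet");
        this.dataSet = dataSet;
    }

    public static implicit operator DataSet(WrappedDataSetStandIn wrapped)
    {
        return wrapped.dataSet;
    }

    public static implicit operator WrappedDataSetStandIn(DataSet dataSet)
    {
        return new WrappedDataSetStandIn(dataSet);
    }
}

// Hypothetical remoting interface: only the signatures mention the wrapper.
public interface ICustomerService
{
    WrappedDataSetStandIn GetCustomers();
    void SaveCustomers(WrappedDataSetStandIn customers);
}

public class CustomerService : MarshalByRefObject, ICustomerService
{
    private DataSet store = new DataSet("Customers");

    public WrappedDataSetStandIn GetCustomers()
    {
        return store; // implicit DataSet -> wrapper conversion
    }

    public void SaveCustomers(WrappedDataSetStandIn customers)
    {
        store = customers; // implicit wrapper -> DataSet conversion
    }
}

public static class WrapperUsageSketch
{
    public static void Main()
    {
        ICustomerService service = new CustomerService();

        // Calling code keeps working with plain DataSet references.
        DataSet customers = service.GetCustomers();
        service.SaveCustomers(customers);
        Console.WriteLine(customers.DataSetName);
    }
}
```

Note that, as with the constructor in the class above, converting a null DataSet would throw an ArgumentNullException, so a null return value would need handling separately.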

TypedDataSets

A Typed DataSet already has the constructor required for deserialization but it is tied in to the plain DataSet deserialization which we want to bypass so that isn't of use to us.

There is also a private InitClass method which creates the tables/relationships (which in turn create the columns/constraints) infrastructure etc. Now we do need to use this method since any replacement deserialization constructor we write will be deserializing a completely uninitialized object.

If you follow my advice above about deriving from the generated DataSet code rather than using it directly, then all we need to do is implement ISerializable on the derived class:

#region ISerializable Members
protected DerivedFromGeneratedDataSet(SerializationInfo info,
                                      StreamingContext context)
{
   AdoNetHelper.DeserializeTypedDataSet(this,
                                  (byte[]) info.GetValue("_", typeof(byte[])));
}

void ISerializable.GetObjectData(SerializationInfo info,
                                 StreamingContext context)
{
   info.AddValue("_", AdoNetHelper.SerializeTypedDataSet(this));
}
#endregion

On deserialization, this will call the base parameterless constructor which will run InitClass setting up the infrastructure for us. We just then pass ourself and the serialized data to the helper method and the job is done.

Surrogates

If neither of the above is suitable then the only option left is to take control at a lower level using a surrogate object to perform the serialization/deserialization.

In principle this should be relatively simple: A surrogate object is a class, implementing ISerializationSurrogate, which 'knows' how to perform serialization for a given Type and it is 'incorporated' into the remoting/serialization process to actually do the work in place of the normal reflection-based way. This is achieved by a class which implements ISurrogateSelector.

If you want to do this directly with a BinaryFormatter you create yourself then the code is fairly simple because one of the constructors accepts an ISurrogateSelector object. This object decides which Surrogate, if any, can deal with a given Type and provides an instance of that Surrogate to the Binary Formatter when requested.
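As an illustration of the direct approach, the sketch below attaches a surrogate to a BinaryFormatter through the framework's SurrogateSelector class. It assumes the .NET Framework, where BinaryFormatter is available and enabled; the Temperature and TemperatureSurrogate types are hypothetical and exist only to keep the example self-contained.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Formatters.Binary;

// Deliberately not marked [Serializable]: the surrogate does all of the work.
public class Temperature
{
    public double Celsius;
}

public class TemperatureSurrogate : ISerializationSurrogate
{
    public void GetObjectData(object obj, SerializationInfo info,
                              StreamingContext context)
    {
        info.AddValue("c", ((Temperature) obj).Celsius);
    }

    public object SetObjectData(object obj, SerializationInfo info,
                                StreamingContext context,
                                ISurrogateSelector selector)
    {
        // 'obj' is an uninitialized instance supplied by the formatter.
        Temperature temperature = (Temperature) obj;
        temperature.Celsius = info.GetDouble("c");
        return temperature;
    }
}

public static class SurrogateSketch
{
    public static BinaryFormatter CreateFormatter()
    {
        StreamingContext context = new StreamingContext(StreamingContextStates.All);

        // SurrogateSelector is the framework's stock ISurrogateSelector.
        SurrogateSelector selector = new SurrogateSelector();
        selector.AddSurrogate(typeof(Temperature), context, new TemperatureSurrogate());

        // This constructor overload accepts the selector directly.
        return new BinaryFormatter(selector, context);
    }

    public static void Main()
    {
        BinaryFormatter formatter = CreateFormatter();
        using (MemoryStream stream = new MemoryStream())
        {
            Temperature original = new Temperature();
            original.Celsius = 21.5;

            formatter.Serialize(stream, original);
            stream.Position = 0;
            Temperature copy = (Temperature) formatter.Deserialize(stream);

            Console.WriteLine(copy.Celsius == original.Celsius);
        }
    }
}
```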

However, the problem is that Microsoft has only allowed certain parts of Remoting to be public and not others. Some digging in Reflector shows that a new BinaryFormatter is created, a new RemotingSurrogateSelector is then attached to it but neither of these objects is accessible or configurable externally, unfortunately.

All is not lost, however: we can get around this with the use of custom sinks. There are a number of good articles on CodeProject (.NET Remoting Customization Made Easy: Custom Sinks, for example) which describe how to insert custom sinks both before and after the formatter sink, but what we need to do is replace the formatter sink itself.

In the download, I have included some example classes to do this:

CustomBinaryClientFormatterSinkProvider
CustomBinaryClientFormatterSink
CustomBinaryServerFormatterSinkProvider
CustomBinaryServerFormatterSink

They were written by examining the Microsoft code using Reflector and removing all non-Http code (I only use HTTP channels) and non-essential parts such as TypeFilterLevel (always Full) etc. Calls to non-accessible internal code have been emulated by duplicating the inaccessible code directly in the class or substituting equivalent classes (e.g. MemoryStream rather than the inaccessible ChunkedMemoryStream). Wherever a BinaryFormatter is created, I just ensure that our Surrogate Selector becomes part of the chain.

To use in your applications, you can either use App.config or write manual code:
(remember that port would likely be '0' on the client side)

App.Config file configuration

 <system.runtime.remoting>
  <application>
   <channels>
    <channel ref="http" port="999">
     <clientProviders>
      <formatter
     type="SimmoTech.Utils.Remoting.CustomBinaryClientFormatterSinkProvider,
SimmoTech.Utils"/>
     </clientProviders>
     <serverProviders>
      <formatter
      type="SimmoTech.Utils.Remoting.CustomBinaryServerFormatterSinkProvider,
SimmoTech.Utils"/>
     </serverProviders>
    </channel>
   </channels>
  </application>
 </system.runtime.remoting>

Then use this line at or near the start of your application:

RemotingConfiguration.Configure(
                AppDomain.CurrentDomain.SetupInformation.ConfigurationFile);

Code configuration

   CustomBinaryServerFormatterSinkProvider serverProvider
                          = new CustomBinaryServerFormatterSinkProvider();
   CustomBinaryClientFormatterSinkProvider clientProvider
                          = new CustomBinaryClientFormatterSinkProvider();

   IDictionary properties = new Hashtable();
   properties["port"] = 999;

   HttpChannel channel = new HttpChannel(properties, clientProvider,
                                         serverProvider);
   ChannelServices.RegisterChannel(channel);

The code for the ISerializationSurrogate implementation requires no state so we can combine the functionality for both ISurrogateSelector and ISerializationSurrogate into one class called AdoNetFastSerializerSurrogate.

The main code for the ISurrogateSelector implementation looks like this:

public ISerializationSurrogate GetSurrogate(Type type,
                  StreamingContext context, out ISurrogateSelector selector)
{
   if (typeof(DataSet).IsAssignableFrom(type) ||
      typeof(DataTable).IsAssignableFrom(type))
   {
      selector = this;
      return this;
   } else
   {
      selector = null;
      return null;
   }
}

which just says if the type is DataSet or DataTable (or derived from either of those) then indicate this as the ISurrogateSelector and return this as the ISerializationSurrogate.

The code for serialization looks like this:

public void GetObjectData(object obj, SerializationInfo info,
                          StreamingContext context)
{
   byte[] data;
   // Plain DataSets (and Typed DataSets modified at runtime) keep their schema.
   if (obj.GetType() == typeof(DataSet) || obj is IModifiedTypedDataSet)
      data = AdoNetHelper.SerializeDataSet(obj as DataSet);
   else if (obj.GetType() == typeof(DataTable))
      data = AdoNetHelper.SerializeDataTable(obj as DataTable);
   // Anything else derived from DataSet/DataTable is treated as Typed.
   else if (obj is DataSet)
      data = AdoNetHelper.SerializeTypedDataSet(obj as DataSet);
   else if (obj is DataTable)
      data = AdoNetHelper.SerializeTypedDataTable(obj as DataTable);
   else
      throw new InvalidOperationException("Not a supported Ado.Net object");

   info.AddValue("_", data);
}

The type is checked and the correct helper method on AdoNetHelper is called to create the byte[] which is then stored into the SerializationInfo block passed in. (Note the IModifiedTypedDataSet check on the first comparison. This was added to allow Typed DataSets which have additional tables and/or columns at runtime (I happen to use one!) to be supported. By treating them as ordinary DataSets, they will have their schema saved including the additional tables/columns - you would not need this in most circumstances)

The code for deserialization looks like this:

public object SetObjectData(object obj, SerializationInfo info,
                            StreamingContext context,
                            ISurrogateSelector selector)
{
   // Replace the uninitialized object with a properly constructed one.
   obj = createNewInstance(obj);
   byte[] data = (byte[]) info.GetValue("_", typeof(byte[]));

   if (obj.GetType() == typeof(DataSet) || obj is IModifiedTypedDataSet)
      return AdoNetHelper.DeserializeDataSet(obj as DataSet, data);
   else if (obj.GetType() == typeof(DataTable))
      return AdoNetHelper.DeserializeDataTable(obj as DataTable, data);
   else if (obj is DataSet)
      return AdoNetHelper.DeserializeTypedDataSet(obj as DataSet, data);
   else if (obj is DataTable)
      return AdoNetHelper.DeserializeTypedDataTable(obj as DataTable, data);
   else
      throw new InvalidOperationException("Not a supported Ado.Net object");
}

Essentially it is the reverse of serialization with one twist: the object that is passed in is completely uninitialized - not even a default constructor has been called on it. The createNewInstance method just creates a new instance of an object of the same type as that passed in - that way we know we have an initialized object to pass to the AdoNetHelper deserialization methods. For DataSet and DataTable objects, we create a new instance directly and for any others, such as Typed DataSets, we use Activator.CreateInstance passing in the type - hence the need for a parameterless constructor for supported objects.
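The createNewInstance method itself is not reproduced in this article, but based on the description above it might look something like the following sketch. This is a hypothetical reconstruction, not the code from the download; the InstanceSketch class and the OrdersDataSet type (standing in for a Typed DataSet) are invented for the example.

```csharp
using System;
using System.Data;

// Stand-in for a Typed DataSet: its parameterless constructor builds the schema.
public class OrdersDataSet : DataSet
{
    public OrdersDataSet()
    {
        Tables.Add("Orders");
    }
}

public static class InstanceSketch
{
    // Hypothetical reconstruction of the helper described above: only the Type
    // of the incoming (uninitialized) object is used, never its contents.
    public static object CreateNewInstance(object uninitialized)
    {
        Type type = uninitialized.GetType();

        if (type == typeof(DataSet)) return new DataSet();
        if (type == typeof(DataTable)) return new DataTable();

        // Requires a public parameterless constructor on the type.
        return Activator.CreateInstance(type);
    }

    public static void Main()
    {
        // During real deserialization the formatter would supply an object on
        // which no constructor has run; a normal instance is used here purely
        // to demonstrate the type dispatch.
        object incoming = new OrdersDataSet();

        OrdersDataSet ready = (OrdersDataSet) CreateNewInstance(incoming);
        Console.WriteLine(ready.Tables.Count);
    }
}
```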

This all sounds complicated but really is a one-time setup. After that you can get Fast Serialization without modifying any of your application code.

4. How it works

The serialization/deserialization code uses BitVector32 flags extensively so for those unfamiliar with this struct, here is a quick primer - feel free to skip past it.

BitVector32

A BitVector32 struct is a wrapper around an Int32 so allows up to 32 bits of information to be stored in one of two ways (but not at the same time): by Section or by Mask.

Sections are created using the CreateSection static method and define small integers. So if you needed to store a number where values range between 0 and 59 (a Minute or Second for example), a section of 6 bits would be allocated (since 6 bits can hold values between 0 and 63). Further sections are 'linked' by calling the CreateSection static overload and passing in the previous BitVector32.Section instance - this ensures that the section bits do not overlap. Booleans are not supported directly in Section mode but you can create a section of 1 bit and manually get/set a '1' value to achieve the same result. See the SerializationWriter.WriteOptimized(DateTime) method for an example of how to use Sections.
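For readers who have not used Section mode, here is a small self-contained sketch that packs a time of day plus a one-bit flag into a single Int32. It is not the SerializationWriter.WriteOptimized(DateTime) code; the SectionSketch class and its section names are assumptions made for illustration. Note that CreateSection takes the maximum value to be stored rather than a bit count.

```csharp
using System;
using System.Collections.Specialized;

public static class SectionSketch
{
    // Each section is linked to the previous one so that the bits never overlap.
    private static readonly BitVector32.Section Hour = BitVector32.CreateSection(23);
    private static readonly BitVector32.Section Minute = BitVector32.CreateSection(59, Hour);
    private static readonly BitVector32.Section Second = BitVector32.CreateSection(59, Minute);
    // A section with a maximum value of 1 acts as a boolean.
    private static readonly BitVector32.Section IsUtc = BitVector32.CreateSection(1, Second);

    public static int Pack(int hour, int minute, int second, bool isUtc)
    {
        BitVector32 bits = new BitVector32(0);
        bits[Hour] = hour;
        bits[Minute] = minute;
        bits[Second] = second;
        bits[IsUtc] = isUtc ? 1 : 0;
        return bits.Data;
    }

    public static void Unpack(int data, out int hour, out int minute,
                              out int second, out bool isUtc)
    {
        BitVector32 bits = new BitVector32(data);
        hour = bits[Hour];
        minute = bits[Minute];
        second = bits[Second];
        isUtc = bits[IsUtc] == 1;
    }

    public static void Main()
    {
        int packed = Pack(13, 45, 30, true);

        int hour, minute, second;
        bool isUtc;
        Unpack(packed, out hour, out minute, out second, out isUtc);

        Console.WriteLine("{0:00}:{1:00}:{2:00} utc={3}", hour, minute, second, isUtc);
    }
}
```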

Mask mode is used to store up to 32 boolean values. Creating a mask involves calling BitVector32.CreateMask static method which returns an Int32 with a single bit set. Masks are 'linked' in the same way as Sections by passing the previous Mask/Int32 into the BitVector32.CreateMask static overloaded method.

Typically, you will define a set of masks/bit flags (I use the terms interchangeably in this article) as static readonly ints, since their value will not change at runtime, and then use them in your methods by passing them to the BitVector32 indexer to get/set boolean values.

Here is a sample:

private static readonly int TypeAFirstBitFlag = BitVector32.CreateMask();
private static readonly int TypeASecondBitFlag
                              = BitVector32.CreateMask(TypeAFirstBitFlag);
private static readonly int TypeAThirdBitFlag
                              = BitVector32.CreateMask(TypeASecondBitFlag);

private static readonly int TypeBFirstBitFlag = BitVector32.CreateMask();
private static readonly int TypeBSecondBitFlag
                              = BitVector32.CreateMask(TypeBFirstBitFlag);

public void MyMethod() {
  BitVector32 myFlags = new BitVector32();

  myFlags[TypeAFirstBitFlag] = myBoolValue1;
  myFlags[TypeASecondBitFlag] = true;
  myFlags[TypeAThirdBitFlag] = false;
}

Note the linking of flags/masks after the first creation and note that you create different sets of flags for different object types.

Using Bit Flags

Here are some tips when defining sets of bit flags:-

Make it clear which object type a flag is holding information for. e.g. All of the Dataset flags begin with "DataSet". I had an obscure problem during development before I did this where a DataTable was mistakenly using a flag for a DataSet - at the time they happened to be in the same position and so had the same value but when I 'moved' the DataSet flag (see later for why), it broke the tests and wasn't immediately clear why.
Ensure that the int mask passed to the CreateMask method is correct as it is used to create the next int mask value. If you pass the same mask as a parameter twice, you will get two masks that have the same value - there will be no compiler or runtime errors.
Use readable and descriptive names - Use 'Is' and 'Has' frequently such as in "DataSetIsCaseSensitive" or "TableHasRows" but only when it reads correctly.
Where 28 or fewer flags are defined in a set (probably always!), use the WriteOptimized(BitVector32) method in SerializationWriter to store the flags in the fewest number of bytes - typically 1 or 2.
Order the flags so that flags that are least likely to be set are placed later in the list (thus generating higher int mask values) - even consider inverting the logic in some cases. This doesn't matter where there are 7 or fewer flags, since they will never take more than a single byte, but may help where there are 8 or more. The number of bytes taken up in the serialization stream depends on the highest bit actually set and not the number of defined flags. So if you have 12 defined flags but the highest 5 are rarely set then they will take up 1 byte typically and only rarely take two bytes.
If you need to store just a single bit of information but against many items, say in a large list, consider using a BitArray instead - store the BitArray into the stream before the collection contents and you can then access each bit in turn, via an indexer, in your main loop. For two or more bits of information, you are probably better sticking with a BitVector32.
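The relationship between the highest bit set and the number of bytes written can be sketched as follows. This assumes that the flags are stored with 7 data bits per byte (the same scheme BinaryWriter uses for its 7-bit encoded integers), which is consistent with the figures of 7 and 28 flags quoted above but is an assumption about the internals of WriteOptimized; the FlagSizeSketch class and EncodedLength method are hypothetical.

```csharp
using System;
using System.Collections.Specialized;

public static class FlagSizeSketch
{
    // Assumption: 7 data bits per byte, so the length depends only on the
    // position of the highest bit that is actually set.
    public static int EncodedLength(int flagsData)
    {
        uint value = (uint) flagsData;
        int bytes = 1;
        while (value >= 0x80)
        {
            value >>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void Main()
    {
        // Build a linked chain of 12 masks; each call receives the previous mask.
        int[] masks = new int[12];
        masks[0] = BitVector32.CreateMask();
        for (int i = 1; i < masks.Length; i++)
        {
            masks[i] = BitVector32.CreateMask(masks[i - 1]);
        }

        // Only flags among the first 7 are set: fits in a single byte.
        BitVector32 common = new BitVector32(0);
        common[masks[0]] = true;
        common[masks[6]] = true;
        Console.WriteLine(EncodedLength(common.Data));

        // The rarely set 12th flag pushes the same data to two bytes.
        BitVector32 rare = common;
        rare[masks[11]] = true;
        Console.WriteLine(EncodedLength(rare.Data));
    }
}
```

Under this assumption, four bytes hold at most 28 flag bits, which would explain the limit mentioned for WriteOptimized(BitVector32).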

A bit flag can be used in several ways:

To directly store a boolean data value from the target object
To store whether a data collection has any items or not (this is easier and quicker than storing an Int32 count value of 0)
To store whether a data value is present or not
To store whether a data value is different from the default value
To store whether a data value is different from a common value

The last three items in the list sound similar but there are subtle differences. The first of the three is usually a simple comparison to null, "myValue != null" (or "myValue != null && myValue.Count != 0" if the value in question is a lazy-instantiated collection); the second of the three requires some inside knowledge (again Reflector is your friend) and the third is a judgement call which requires knowledge of the most likely value.
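A minimal sketch of the 'present or not' and 'different from the default' uses might look like this. It uses a plain BinaryWriter/BinaryReader as a stand-in for the SerializationWriter/SerializationReader classes, and the PresenceFlagSketch class, its flag names and the default value of 50 are all assumptions invented for the example.

```csharp
using System;
using System.Collections.Specialized;
using System.IO;

public static class PresenceFlagSketch
{
    private static readonly int HasName = BitVector32.CreateMask();
    private static readonly int HasNonDefaultCapacity = BitVector32.CreateMask(HasName);

    private const int DefaultCapacity = 50; // hypothetical default value

    public static byte[] Write(string name, int capacity)
    {
        BitVector32 flags = new BitVector32(0);
        flags[HasName] = !string.IsNullOrEmpty(name);
        flags[HasNonDefaultCapacity] = capacity != DefaultCapacity;

        using (MemoryStream stream = new MemoryStream())
        using (BinaryWriter writer = new BinaryWriter(stream))
        {
            writer.Write((byte) flags.Data);           // two flags fit in one byte
            if (flags[HasName]) writer.Write(name);    // only written when present
            if (flags[HasNonDefaultCapacity]) writer.Write(capacity);
            writer.Flush();
            return stream.ToArray();
        }
    }

    public static void Read(byte[] data, out string name, out int capacity)
    {
        using (BinaryReader reader = new BinaryReader(new MemoryStream(data)))
        {
            // Values must be read in exactly the order they were written.
            BitVector32 flags = new BitVector32(reader.ReadByte());
            name = flags[HasName] ? reader.ReadString() : string.Empty;
            capacity = flags[HasNonDefaultCapacity] ? reader.ReadInt32() : DefaultCapacity;
        }
    }

    public static void Main()
    {
        Console.WriteLine(Write(null, 50).Length);     // flags byte only
        Console.WriteLine(Write("ab", 7).Length);      // flags + string + int

        string name;
        int capacity;
        Read(Write("ab", 7), out name, out capacity);
        Console.WriteLine("{0} {1}", name, capacity);
    }
}
```

One consequence of this particular sketch is that a null name and an empty name both come back as an empty string; if that distinction matters, a further flag would be needed.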

A flag doesn't have to directly relate to a data value. For example, in the UniqueConstraints flag set, I have a flag called "UniqueConstraintHasMultipleColumns" which is just internal information as to whether a column count needs to be stored or whether we can assume a single column.

Another example is in the DataRelation flag set where I have "RelationIsPrimaryKeyOnParentTable". This would be true in the vast majority of cases so I can avoid saving any column information for that side of the relationship.

Another good example is for a DataColumn which has AutoIncrement set to True. By default, AutoIncrementSeed and AutoIncrementStep are set to 0 and 1 respectively. However, I don't believe I am alone in setting these values both to -1 to ensure that generated values don't clash with real database values. By creating two flags, ColumnHasAutoIncrementNegativeStep and ColumnHasAutoIncrementUnusedDefaults, we can effectively have two 'default' values which would apply in the majority of cases and so save having to store two long values.

It is also possible to use one bit flag as a condition for multiple values. ColumnHasAutoIncrementUnusedDefaults is an example - if false then both AutoIncrementSeed and AutoIncrementStep are written as a pair.

Analyzing Your Object(s)

I can't honestly say that I sat down, analysed the requirements and wrote the code in one go - there were a number of revisions and 'optimization opportunities'. What I will try and do here is give a bit of general guidance used in this and other similar projects which might help if you Fast Serialize your own classes:

Identify all of the classes involved and how they relate to each other - hierarchies etc., this usually gives good starting guidance but remember that what you put into/get out of the byte[] and in which order is entirely down to you.
Exclude internal classes - you can't easily access those anyway and it's the data they contain that's important. As long as you can get that data directly or indirectly, you will be able to recreate your object.
Exclude transient data such as indexes, lookup Hashtables, DataViews etc. These are not required to be able to recreate the data - remember you want to serialize the absolute minimum amount of data required to be able to recreate the object - any internal data can be left to the object to recreate later.
Let the deserialization side influence how to write the serialization side. There may be order dependencies during deserialization that are not immediately apparent if you start writing the serialization code first. An obvious example is ForeignKeyConstraints which require both involved tables to be deserialized before the constraint can be created. By removing the association of ForeignKeyConstraints from DataTables altogether, we can ensure that all tables have been recreated before processing them whereas if we tried to process them as a constraint as part of a DataTable, we would get problems depending on the order that the tables were added to the DataSet.
Separate routines into private methods so they are reusable, an example being the code for ExtendedProperties which is reused in several places. Also you may have a need to serialize different 'root' objects - in this project, the code for serializing a DataSet and a DataTable will use shared code.
Use unit tests to help you refactor safely. I must admit I'm still relatively new to unit testing but have found them absolutely invaluable for this project.
The Serialization and Deserialization code could be refactored as two separate classes but remember that the flags need to be accessible from both. For this particular project, I didn't anticipate needing customization or inheritance so I chose to use internal classes and static helper methods - feel free to separate them if you prefer.
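As an illustration of separating ForeignKeyConstraints from their tables, the constraints can be gathered across the whole DataSet so that they are written (and later recreated) only after every table exists. The sketch below is a hypothetical reconstruction of that idea using standard ADO.Net members, not the getForeignKeyConstraints code from the download; the ForeignKeySketch class and the sample table names are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class ForeignKeySketch
{
    // Collects every ForeignKeyConstraint in the DataSet, regardless of the
    // order in which the tables were added.
    public static List<ForeignKeyConstraint> GetForeignKeyConstraints(DataSet dataSet)
    {
        List<ForeignKeyConstraint> result = new List<ForeignKeyConstraint>();
        foreach (DataTable table in dataSet.Tables)
        {
            foreach (Constraint constraint in table.Constraints)
            {
                ForeignKeyConstraint foreignKey = constraint as ForeignKeyConstraint;
                if (foreignKey != null) result.Add(foreignKey);
            }
        }
        return result;
    }

    public static void Main()
    {
        DataSet dataSet = new DataSet("Sample");

        // The child table is added first, so a per-table approach would meet
        // the foreign key before its parent table had been processed.
        DataTable child = dataSet.Tables.Add("Child");
        child.Columns.Add("ParentId", typeof(int));

        DataTable parent = dataSet.Tables.Add("Parent");
        parent.Columns.Add("Id", typeof(int));

        child.Constraints.Add(new ForeignKeyConstraint(
            "FK_Child_Parent", parent.Columns["Id"], child.Columns["ParentId"]));

        foreach (ForeignKeyConstraint foreignKey in GetForeignKeyConstraints(dataSet))
        {
            Console.WriteLine("{0}: {1} -> {2}", foreignKey.ConstraintName,
                              foreignKey.Table.TableName,
                              foreignKey.RelatedTable.TableName);
        }
    }
}
```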

Here is the rough overview I started coding from:

DataSet
    (flags/own data)
    DataTables
    ForeignKeyConstraints
    DataRelations
    ExtendedProperties
DataTable
    (flags/own data)
    DataColumns
    ExtendedProperties
    DataRows
    UniqueConstraints
DataColumn/DataRow/UniqueConstraints/ForeignKeyConstraints/DataRelation
    (flags/own data)
    ExtendedProperties (except DataRow)
ExtendedProperties
    (2 x object[] for keys and values)

Below, I have put the major class types as headings and described a little about how they are coded and optimized.

DataSet

Here is the code for serializing a DataSet:-

   public byte[] Serialize(DataSet dataSet)
   {
    this.dataSet = dataSet;
    writer = new SerializationWriter();

    BitVector32 flags = GetDataSetFlags(dataSet);
    writer.WriteOptimized(flags);

    if (flags[DataSetHasName]) writer.Write(dataSet.DataSetName);
    writer.WriteOptimized(dataSet.Locale.LCID);
    if (flags[DataSetHasNamespace]) writer.Write(dataSet.Namespace);
    if (flags[DataSetHasPrefix]) writer.Write(dataSet.Prefix);
    if (flags[DataSetHasTables]) serializeTables();
    if (flags[DataSetHasForeignKeyConstraints])
        serializeForeignKeyConstraints(getForeignKeyConstraints(dataSet));
    if (flags[DataSetHasRelationships]) serializeRelationships();
    if (flags[DataSetHasExtendedProperties])
        serializeExtendedProperties(dataSet.ExtendedProperties);

    return getSerializedBytes();
   }

I won't be reproducing the code for all of the methods but this is the top-level code and this pattern tends to repeat for DataTable, DataRow etc.

Create a BitVector32 and populate it with the relevant flags for the object at hand.
Write information into the SerializationWriter instance; use the flags to make this conditional wherever possible.
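The two steps above can be sketched in isolation. This is only an illustration under stated assumptions: the Person type, the PersonSerializer class and its two flags are hypothetical, and a plain BinaryWriter stands in for the SerializationWriter from Part 1.

```csharp
using System.Collections.Specialized;
using System.IO;

// Hypothetical type used only for illustration; not part of the article's code.
public class Person
{
    public string Name = string.Empty;
    public string Nickname = string.Empty;
}

public static class PersonSerializer
{
    // Each mask identifies one bit; CreateMask(previous) returns the next bit along.
    static readonly int HasName = BitVector32.CreateMask();
    static readonly int HasNickname = BitVector32.CreateMask(HasName);

    public static byte[] Serialize(Person person)
    {
        // Step 1: populate the flags for the object at hand.
        BitVector32 flags = new BitVector32(0);
        flags[HasName] = person.Name.Length != 0;
        flags[HasNickname] = person.Nickname.Length != 0;

        // Step 2: write the flags, then only the values the flags say are present.
        MemoryStream stream = new MemoryStream();
        BinaryWriter writer = new BinaryWriter(stream);
        writer.Write((byte) flags.Data);   // two flags fit in a single byte
        if (flags[HasName]) writer.Write(person.Name);
        if (flags[HasNickname]) writer.Write(person.Nickname);
        writer.Flush();
        return stream.ToArray();
    }

    public static Person Deserialize(byte[] data)
    {
        // Read in exactly the order used when writing: flags first, then values.
        BinaryReader reader = new BinaryReader(new MemoryStream(data));
        BitVector32 flags = new BitVector32(reader.ReadByte());
        Person person = new Person();
        if (flags[HasName]) person.Name = reader.ReadString();
        if (flags[HasNickname]) person.Nickname = reader.ReadString();
        return person;
    }
}
```

In this sketch, an object whose fields all hold their default values costs just the single flags byte.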

Deserialization is generally the reverse of this process. Values must be read from the SerializationReader in exactly the same order in which they were written. However, you don't necessarily have to apply the values to your object in that order. The sample below reads the flags first but applies the DataSetAreConstraintsEnabled value, obtained from a bit flag, last to prevent exceptions whilst row data is being deserialized.

   public DataSet DeserializeDataSet(DataSet dataSet, byte[] serializedData)
   {
    this.dataSet = dataSet;
    reader = new SerializationReader(serializedData);

    dataSet.EnforceConstraints = false;

    BitVector32 flags = reader.ReadOptimizedBitVector32();

    if (flags[DataSetHasName]) dataSet.DataSetName = reader.ReadString();

    dataSet.Locale = new CultureInfo(reader.ReadOptimizedInt32());
    dataSet.CaseSensitive = flags[DataSetIsCaseSensitive];

    if (flags[DataSetHasNamespace]) dataSet.Namespace = reader.ReadString();

    if (flags[DataSetHasPrefix]) dataSet.Prefix = reader.ReadString();

    if (flags[DataSetHasTables]) deserializeTables();
    if (flags[DataSetHasForeignKeyConstraints])
        deserializeForeignKeyConstraints();

    if (flags[DataSetHasRelationships]) deserializeRelationships();
    if (flags[DataSetHasExtendedProperties])
        deserializeExtendedProperties(dataSet.ExtendedProperties);

    dataSet.EnforceConstraints = flags[DataSetAreConstraintsEnabled];

    throwIfRemainingBytes();
    return dataSet;
   }

Some objects have an implied order. For example, DataTables are serialized first, followed by ForeignKeyConstraints and then Relationships. These could have been written in any order since all the information is present at serialization time. However, deserialization requires that all of the tables are deserialized before ForeignKeyConstraints, and deserialization of Relationships requires that all of the ForeignKeyConstraints are in place. It is usually helpful, therefore, to consider how the deserialization will work before writing the serialization code.

DataTable

If we have reached the code to serialize a DataTable then we know there is at least one to save. So we start by saving the count followed by the details for each DataTable in turn.

Some fields on a DataTable are not publicly accessible but we must have their values in order to be able to deserialize correctly. CaseSensitive and caseSensitiveAmbient are examples. The former is a property on the DataTable so there is no problem getting that value and setting it on deserialization. However that on its own is not enough. Reflector shows that the getter uses a private field called "caseSensitiveAmbient" which, when set, returns the CaseSensitive property of the containing DataSet. This allows all DataTables to use the same shared value unless True or False is specifically set on a DataTable. If we failed to restore the caseSensitiveAmbient value then all our deserialized DataTables would use the DataSet value even if a value was specifically set on the DataTable.

DataTables are also containers for DataColumns, DataRows and UniqueConstraints.

DataColumn

Serialization of these follows much the same pattern of retrieving flags and conditionally serializing values according to the flags.

We have already mentioned the ColumnHasAutoIncrementUnusedDefaults and ColumnHasAutoIncrementNegativeStep flags in an earlier example. Where the former is not set, we read the required values from the stream.

DataRow

Serialization begins with writing the number of rows first. Remember that this method won't be called at all if there are no rows (from the DataTable at least) but this method is also used by the helper methods that write the row data only.

A DataRow has a RowState property which has 5 possible values: Added, Deleted, Detached, Modified and Unchanged. For our purposes, however, Detached is not an option so we only need to cater for 4 values. This could have been achieved by designating one of the possible values, say Unchanged, as the default and then storing the RowState only when it isn't Unchanged. However, a DataRow may also have a RowError assigned and/or errors associated with it at the column level. These indicate a need for a flag set anyway, so I used two additional flags, RowHasOldData and RowHasNewData, to store this information. By using these flags, it is easy to work out which DataRowVersion needs to be queried to get the values as an object array:

For Added and Unchanged rows, we need to get the DataRowVersion.Current version values;
For Deleted rows, we need to get the DataRowVersion.Original version values;
For Modified rows, we need to get both Current and Original version values and we use an overload on SerializationWriter which accepts two object arrays (must be of the same length). This optimization writes the first object array as normal but for the second object array, it compares each value with its equivalent in the first and where they are the same, just stores a SerializedType.DuplicateValueType TypeCode. Since most modified rows only change a small subset of the entire list of values, this is a great space saver.
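The duplicate-value idea for Modified rows can be sketched on its own. This is a simplified illustration, not the SerializationWriter overload itself: the DuplicateValueSketch class and its sentinel object are hypothetical stand-ins for the SerializedType.DuplicateValueType marker, and the arrays are compacted in memory rather than written to a stream.

```csharp
using System;

public static class DuplicateValueSketch
{
    // Hypothetical sentinel standing in for the DuplicateValueType marker.
    public static readonly object Duplicate = new object();

    // Replaces every value in 'second' that equals its counterpart in 'first'
    // with the sentinel, so only genuinely changed values need storing in full.
    public static object[] Compact(object[] first, object[] second)
    {
        if (first.Length != second.Length)
            throw new ArgumentException("Arrays must be of the same length.");

        object[] result = new object[second.Length];
        for (int i = 0; i < second.Length; i++)
        {
            // object.Equals copes with nulls and compares boxed values by value.
            result[i] = object.Equals(first[i], second[i]) ? Duplicate : second[i];
        }
        return result;
    }

    // Reverses Compact: each sentinel is replaced by the value from 'first'.
    public static object[] Expand(object[] first, object[] compacted)
    {
        object[] result = new object[compacted.Length];
        for (int i = 0; i < compacted.Length; i++)
        {
            result[i] = ReferenceEquals(compacted[i], Duplicate) ? first[i] : compacted[i];
        }
        return result;
    }
}
```

For a row where only one column out of many was edited, all but one entry of the second array collapse to the marker.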

We could have used the ItemArray property to retrieve the values for most of these row versions but ItemArray throws an exception for Deleted rows. Getting the values of a Deleted row requires the indexer overload that allows the required DataRowVersion to be passed in. Rather than use one method for Deleted rows and another for all other states, I created a helper method which takes a DataRow and a DataRowVersion as parameters and returns an object array (actually ItemArray does the same thing internally anyway - did you know that the data is actually stored within the DataColumn not the DataRow? I didn't - Reflector rocks!). Since this is a general-purpose method, I made it public and static and placed it in the AdoNetHelper class.
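A helper of this kind might look like the following minimal sketch. The class and method names here (RowValueHelper.GetValues) are hypothetical and the real method in AdoNetHelper may differ in detail; the only API relied upon is the DataRow indexer that takes a column ordinal and a DataRowVersion.

```csharp
using System.Data;

public static class RowValueHelper
{
    // Returns the values of every column for the requested version of the row.
    // Unlike ItemArray, this also works for Deleted rows when
    // DataRowVersion.Original is requested.
    public static object[] GetValues(DataRow row, DataRowVersion version)
    {
        int columnCount = row.Table.Columns.Count;
        object[] values = new object[columnCount];
        for (int i = 0; i < columnCount; i++)
        {
            values[i] = row[i, version];
        }
        return values;
    }
}
```

A caller would pass DataRowVersion.Current for Added and Unchanged rows, DataRowVersion.Original for Deleted rows, and call it once with each version for Modified rows.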

We also have to deal with Expression columns to ensure that their values are set to null prior to serialization. To do this, I create an int[] containing the ordinal number of any DataColumn that has an Expression. This is done before the loop that serializes the values since it need only be calculated once. A private helper method then gets all of the values for a DataRow and the required DataRowVersion and, where the int[] is not empty, sets to null those values which correspond to calculated columns.

Here is the code for deserialization:

   private void deserializeRows(DataTable dataTable)
   {
    ArrayList readOnlyColumns = null;
    int rowCount = reader.ReadOptimizedInt32();

    dataTable.BeginLoadData();
    for(int i = 0; i < rowCount; i++)
    {
     BitVector32 flags = reader.ReadOptimizedBitVector32();
     DataRow row;

     if (!flags[RowHasOldData])
      row = dataTable.LoadDataRow(reader.ReadOptimizedObjectArray(),
                                 !flags[RowHasNewData]);
     else if (!flags[RowHasNewData])
     {
      row = dataTable.LoadDataRow(reader.ReadOptimizedObjectArray(), true);
      row.Delete();
     }
     else
     {
      /* LoadDataRow doesn't care about ReadOnly columns but ItemArray does
         Since only deserialization of Modified rows uses ItemArray we do this
         only if a modified row is detected and just once */
      if (readOnlyColumns == null)
      {
       readOnlyColumns = new ArrayList();
       foreach(DataColumn column in dataTable.Columns)
       {
        if (column.ReadOnly && column.Expression.Length == 0)
        {
         readOnlyColumns.Add(column);
         column.ReadOnly = false;
        }
       }
      }

      object[] currentValues;
      object[] originalValues;
      reader.ReadOptimizedObjectArrayPair(out currentValues,
                                          out originalValues);
      row = dataTable.LoadDataRow(originalValues, true);
      row.ItemArray = currentValues;
     }

     if (flags[RowHasRowError]) row.RowError = reader.ReadString();
     if (flags[RowHasColumnErrors])
     {
      int columnsInErrorCount = reader.ReadOptimizedInt32();
      for(int j = 0; j < columnsInErrorCount; j++)
      {
       row.SetColumnError(reader.ReadOptimizedInt32(), reader.ReadString());
      }
     }

    }

    // Must restore ReadOnly columns if any were found when deserializing a
    // Modified row
    if (readOnlyColumns != null && readOnlyColumns.Count != 0)
    {
     foreach(DataColumn column in readOnlyColumns)
     {
      column.ReadOnly = true;
     }
    }

    dataTable.EndLoadData();

   }

The indexers on DataRow that include a DataRowVersion parameter are read-only and so we need an alternative way of putting the object array back in. LoadDataRow is the quickest way, especially if we wrap the process with BeginLoadData first, which tells the DataTable to expect a lot of data to be added and to suspend indexing/constraint checking until EndLoadData is called.

For Unchanged and Added rows, we just supply the deserialized object array and a bool (obtained from our flag set) to specify whether to AcceptChanges immediately or not (true for Unchanged and false for Added). Deleted rows are similar but we accept and then immediately call Delete() to get the desired effect.

Modified rows are treated slightly differently because we have two sets of object arrays but LoadDataRow can only take the first one - after that we need to apply the second set on the existing data to 'modify it'.

The ItemArray property will allow us to do this but there is a gotcha - an exception will be thrown for ReadOnly columns. To get around this we need to remove the ReadOnly status for all columns (except those with an Expression) whilst we are deserializing the data.

To do this optimally, we do it if, and only if, a Modified row is found, and we do it just once, making a note of the columns we changed - when all data is deserialized we make those columns ReadOnly again.

UniqueConstraint

A UniqueConstraint always has a name - it's never null or empty. If you don't specify a particular name, it will default to "Constraintxx" where xx is the next number within the DataSet.

We can use this to our advantage - the method that gets the flags associated with a UniqueConstraint uses a regular expression to find a default name. If one is found, we only need to serialize the xx number; otherwise we store the full assigned constraint name.
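The default-name check could be sketched as below. The class, method names and the exact pattern are assumptions for illustration; the article's own regular expression is not shown in this section. The pattern deliberately rejects leading zeros so that a custom name such as "Constraint01" is stored in full rather than being rebuilt incorrectly as "Constraint1".

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class ConstraintNameSketch
{
    // "Constraint" followed by a number with no leading zero and nothing after it.
    static readonly Regex DefaultName = new Regex(@"^Constraint([1-9][0-9]*)\z");

    // Returns true and the numeric suffix when 'name' looks like a default name.
    public static bool TryGetDefaultNumber(string name, out int number)
    {
        Match match = DefaultName.Match(name);
        if (match.Success && int.TryParse(match.Groups[1].Value, out number))
        {
            return true;
        }
        number = 0;
        return false;
    }

    // Rebuilds the default name from the stored number on deserialization.
    public static string ToDefaultName(int number)
    {
        return "Constraint" + number.ToString(CultureInfo.InvariantCulture);
    }
}
```

When TryGetDefaultNumber returns true, only the number needs to be written along with a flag; otherwise the full name is written.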

A UniqueConstraint ultimately needs to store DataColumns on deserialization. There are a number of ways to achieve this - we could store the column names and, since they would already have been stored previously during serialization of the DataColumns, they would only take up the size of a string token. However, we can do better than that - since we know that the DataTable, and its DataColumns, would already have been restored at the point of deserializing the UniqueConstraints, we only need to store the ordinal number(s) of the DataColumn(s) and then look them up from the DataColumnCollection of the DataTable - just a single byte for each column (unless you have a DataColumn with an ordinal in excess of 127).

Further, by using a bit flag indicating whether the constraint is comprised of more than one DataColumn, we don't need to store the column count in the general case of a single-column constraint.

ForeignKeyConstraint

A ForeignKeyConstraint needs to store two sets of DataTable/DataColumn(s) combinations. We know that the number of columns involved will always be the same for both and we can also use the same ForeignKeyConstraintHasMultipleColumns flag optimization as we used for the UniqueConstraint to save having to store a column count where it is known to be one.

Although a ForeignKeyConstraint is located within the ConstraintCollection of the child DataTable, we are serializing these outside of the DataTable because we want to get rid of order dependencies during deserialization and also we don't want to serialize ForeignKeyConstraints where a DataTable is just being serialized by itself.

Therefore we need to identify the table(s) involved which we can do by using the ordinal of the DataTable within the DataSet Tables collection - just a single byte. Following this we do the same for the DataColumn information - an optional column count followed by a list of one or more DataColumn ordinals.

Typically, the parent-side will be the PrimaryKey of the DataTable. We can check for this and use a bit flag so that, for the parent DataTable, we don't even need to store the DataColumn ordinals - just the single byte to identify the table.
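The primary-key check might be sketched as follows. The class and method names are hypothetical, and requiring the columns to appear in the same order as the primary key is an assumption of this sketch rather than something the article states.

```csharp
using System.Data;

public static class ForeignKeySketch
{
    // True when the parent side of the constraint is exactly the parent
    // table's PrimaryKey, column for column and in the same order.
    public static bool ParentIsPrimaryKey(ForeignKeyConstraint constraint)
    {
        DataColumn[] primaryKey = constraint.RelatedTable.PrimaryKey;
        DataColumn[] parentColumns = constraint.RelatedColumns;

        // PrimaryKey is an empty array when the table has no primary key.
        if (primaryKey.Length == 0 || primaryKey.Length != parentColumns.Length)
        {
            return false;
        }
        for (int i = 0; i < primaryKey.Length; i++)
        {
            if (primaryKey[i] != parentColumns[i]) return false;
        }
        return true;
    }
}
```

When this returns true, a single flag bit can replace the list of parent column ordinals, and deserialization can simply read the PrimaryKey of the parent DataTable.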

A ForeignKeyConstraint also has rules: AcceptRejectRule, UpdateRule, and DeleteRule.

For the latter two, the options are None, Cascade, SetNull or SetDefault. Since the default is Cascade, we can invert the meaning of the bit flags and name them ForeignKeyConstraintHasNonCascadeUpdateRule and ForeignKeyConstraintHasNonCascadeDeleteRule. By doing this, it is more likely that these bits will not be set and, since the number of possible flags for a ForeignKeyConstraint is just at the point where it will be stored as either one or two bytes, it tips the balance in favour of a single byte. (True, this is only going to save us a handful of bytes for a typical DataSet, but this type of optimization is as easy to do this way as any other, is only done once, and the principle is sound and scalable - if this were done for a type stored many times, the savings would become more noticeable.)

DataRelation

A DataRelation uses the same optimization techniques for DataTable/DataColumn storage as that of a ForeignKeyConstraint.

ExtendedProperties

This is an instance of PropertyCollection and, according to the documentation, is available on DataSet, DataTable and DataColumn objects. It is actually available on Constraints and Relationships too. Vanilla DataSet serialization only serializes string values in XML but we can do better and serialize other objects too.

A PropertyCollection is derived from Hashtable (in fact it doesn't add anything else at all) so all we need to do is create an object array of the keys and an object array of the values and store those. Since it is a common requirement for all of the types mentioned above we have a separate method which takes any PropertyCollection as a parameter and serializes its contents. Remember that the method won't even be called if the PropertyCollection is empty.
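Splitting a Hashtable into two parallel arrays and rebuilding it could be sketched like this. The class and method names are hypothetical. The sketch relies on the documented Hashtable behaviour that Keys and Values enumerate in the same (otherwise unspecified) order, and it takes a plain Hashtable so that any PropertyCollection can be passed in.

```csharp
using System.Collections;

public static class HashtableSketch
{
    // Copies the keys and values into two arrays of equal length;
    // keys[i] corresponds to values[i].
    public static void ToArrays(Hashtable table, out object[] keys, out object[] values)
    {
        keys = new object[table.Count];
        values = new object[table.Count];
        table.Keys.CopyTo(keys, 0);
        table.Values.CopyTo(values, 0);
    }

    // Puts the deserialized pairs back into an existing collection,
    // e.g. the ExtendedProperties of a newly created DataTable.
    public static void Fill(Hashtable table, object[] keys, object[] values)
    {
        for (int i = 0; i < keys.Length; i++)
        {
            table[keys[i]] = values[i];
        }
    }
}
```

The two arrays can then be handed to whatever object-array writer is in use, which is why the overview lists ExtendedProperties as "2 x object[]".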

Final Words and a Caveat

One caveat with Fast Serialization I should point out is that it was originally designed for remoting purposes. Since remoting would involve the use of identical code on both serialization and deserialization sides at any given time, it is impervious to any modifications made to the code. This isn't necessarily the case, however, if you persist the serialized data to a file or database - any changes to flags or storage order would likely make the data unreadable.

Vanilla serialization also has this problem to some extent but it is possible, albeit messy, to check for the string keys in SerializationInfo to determine whether a value is present or not and act accordingly.

So you have the choice of using the code exclusively for remoting; assuming that the code is bug free and will never change; or adding a version number to the SerializationInfo block and being prepared to create version-specific code should the need ever arise.
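As one possible shape for the versioning option, the sketch below prefixes the serialized bytes with a single version byte. This is a variation on the idea, not the article's code: it puts the version inside the byte array rather than in the SerializationInfo block, and the VersionedPayload class and its members are hypothetical.

```csharp
using System;

public static class VersionedPayload
{
    // Increment whenever flags or storage order change.
    public const byte CurrentVersion = 1;

    // Returns a new array holding the version byte followed by the payload.
    public static byte[] Wrap(byte[] payload)
    {
        byte[] result = new byte[payload.Length + 1];
        result[0] = CurrentVersion;
        Buffer.BlockCopy(payload, 0, result, 1, payload.Length);
        return result;
    }

    // Checks the version byte and returns the payload that follows it.
    public static byte[] Unwrap(byte[] data)
    {
        if (data.Length == 0)
            throw new ArgumentException("No version byte present.", "data");
        if (data[0] != CurrentVersion)
            throw new NotSupportedException("Unsupported format version " + data[0]);

        byte[] payload = new byte[data.Length - 1];
        Buffer.BlockCopy(data, 1, payload, 0, payload.Length);
        return payload;
    }
}
```

Version-specific deserialization code would replace the NotSupportedException branch if older persisted data ever needed to be read.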

Hopefully you will have found the code in this and the previous article useful for your serialization/remoting needs.

Please feel free to comment, suggest changes or report bugs here on Code Project.

Changes from v1 to v2

Added support for .NET 2.0 using conditional compilation. Either add a "NET20" to the conditional compilation symbols in your project properties (on the Build tab under General), or search for "#if NET20" and manually remove the unwanted code and conditional constructs.
Updated FastSerializer code for .NET 2.0 support (the source code is also included in the download from this article, but see the History section in the first article for full details of all changes). Brief highlights are:
- Added support for .NET 2.0 Dates.
- Added support for typed arrays.
- Added support for Nullable generic types.
- Added support for List<T> and Dictionary<K,V> generic types.
- Added support for optional data compression.
Updated AdoNetHelper code for .NET 2.0 support.
- DataTable culture and case-sensitivity handling changed in .NET 2.0.
Two minor bug fixes - thanks to Ales Helbelka for spotting these.

History

2006-10-31 v1.0 released onto CodeProject.
2006-11-27 v2.0 released onto CodeProject.
- FastSerializer.cs
  - Added support for .NET 2.0 and optional real-time compression (see the first article for full details of all changes).
- AdoNetHelper.cs
  - Added .NET 2.0-conditional code for DataTable culture and case-sensitivity.
  - Use WriteOptimized(string) for consistency.
  - FIX: Added new bit flag and code for DataColumn where MaxLength = int.MaxValue - thanks to Ales Helbelka.
  - FIX: Added check that ParentKeyConstraint is not null in GetRelationFlags - thanks to Ales Helbelka.
- AdoNetHelperTests.cs
  - Added .NET 2.0-conditional code for DataTable culture and case-sensitivity.
  - Updated expected serialized size.
  - Added test for DataColumn.MaxLength = int.MaxValue.
  - Added test for DataRelation with no primary key.
- FastSerializableDataSet.cs
  - Added .NET 2.0-conditional code for GetObjectData() method as it is now virtual.