|
Message Closed
modified 28-Sep-18 18:39pm.
|
|
|
|
|
Holy cow... are you for real? Please take your chest beating elsewhere. You appear to be a typical IT manager: an overbearing alpha male who views management as a means of control instead of getting something accomplished. I hope I never cross paths with you. Types like you give movies like Office Space meaning. Good luck with your bloated resume, dude.
|
|
|
|
|
If you need small data streams and high speed, Microsoft's solution for binary serialization is useless, because of the use of reflection, and the assembly versioning overhead/madness. There are plenty of uses for efficient binary serialization, besides moving typical database application data. Any kind of data that requires millions of records or objects for example: mapping, graphics, data analysis, etc.
check out VG.net: www.vgdotnet.com
An animated vector graphics system integrated in VS.net
|
|
|
|
|
Hi Damon
Well, what a response! I can't think of another piece of similar sized text that has
a) pissed me off...then made me chuckle
b) made me want to punch the author...then ask him for a job
c) made me want to just ignore it...then left me flummoxed as to where to start in a response
You want a debate, well you got one!
Let's start here and put some unwarranted accusations to bed:
DamonCarr wrote: You also state that you can 'deserialize' 2.1MB in .82 seconds... That is 26,224 KB/Sec! I wish I had the network connection you had!!!! Although I am sure you are ignoring this vital fact and are just reporting 'in-process' deserialization, no? Misleading..... But I suppose if this is apples to apples with the previous example, it's fair, but still misleading to the amazing talent pool on this site.
Network connection speed doesn't come into it. It is not a "vital fact" as it is constant whether you use the code or not - further, any saving in the amount of data being transmitted will also reduce transmission time.
The article said "...(to a MemoryStream for maximum speed).." so yes, I am reporting 'in-process' deserialization - no one has been misled at all - hopefully you will be good enough to retract that particular accusation.
Next, let's look at your 'alternatives':
DamonCarr wrote: Instead I would use standard compression technologies (as you describe) and/or 'chunk' the data, perhaps in an queued (MSMQ/MQ Series) fashion, or using remoting perhaps on multiple threads.
All of these options fail as a direct alternative to my code for one simple reason - they are transmission techniques and not serialization techniques.
You can't 'chunk' a dataset or zip/compress a list of entities; they need to be serialized first.
If you were designing a new system and had control over, and full knowledge of, all the classes involved, then you might be able to write some code that could stream portions of a list of similar items over multiple threads, for example - but surely that would be more fragile than this code?
I did have the idea of using two network streams, one for the serialized data and one for the tokens (strings/tokenized objects as they are added), so that serialization and deserialization could occur in parallel on their respective machines. The transient memory footprint would be smaller too, since a circular list could be used. However, this would be a replacement for remoting rather than an optimization and, since the current code is so fast, it would fill up a fixed size buffer before the network connection was made, so there would probably be little or no gain (though it sounds cool).
onwards...the size of the data I was using in an example:
Yes, 34,423 could be considered a lot but for the purpose at the time (the purpose you admit you don't know and didn't bother to ask about before putting pixels on screen) it was, I believe, perfectly reasonable. It served perfectly as a test: the data was readily available, was real-world data, and, probably most importantly, was large enough to give accurate timing figures for each optimization 'trick' I was trying to determine whether it was worth including or not.
One of our existing apps does happen to transmit data of this size across the world, so should we rearchitect the whole application from the ground up, or should we just add a handful of lines to the remoting sinks as the 'minimum necessary change' to achieve the required results (and speed up all other data transmissions as well, for free)?
Your other considerations:
DamonCarr wrote: 1) Cost of Incremental Development and Maintenance. Remember the majority of costs for a system are in the maintenance AFTER release, not development. Most systems die a slow and painful death due to entropy and code like you advocate here which nobody can figure out (when the rookies come in to maintain the thing).
I don't think it is that hard to figure out - method names like WriteObjectArray or ReadString are pretty clear in themselves and a simple reference to this article should be enough to explain to 'rookies' what is happening and why.
Remember that only one method and a constructor are typically involved (for a given class); only key classes might need optimizing; and the same code might suffice for many variations on a class. DataSets and DataTables may be used in many places but the code for their serialization hasn't, fingers crossed, needed to be changed.
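To make "only one method and a constructor" concrete, here is a minimal, self-contained sketch of the pattern. SimpleWriter/SimpleReader are cut-down stand-ins for the article's SerializationWriter/SerializationReader (the real classes add string tokenization and per-type size optimizations); the Voyage class, its fields, and the member names on the stand-ins are hypothetical.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;

// Cut-down stand-ins for the article's SerializationWriter/SerializationReader.
public class SimpleWriter
{
    private readonly MemoryStream stream = new MemoryStream();
    private readonly BinaryWriter writer;
    public SimpleWriter() { writer = new BinaryWriter(stream); }
    public void WriteString(string value) { writer.Write(value); }
    public void WriteInt32(int value) { writer.Write(value); }
    public byte[] ToArray() { writer.Flush(); return stream.ToArray(); }
}

public class SimpleReader
{
    private readonly BinaryReader reader;
    public SimpleReader(byte[] data) { reader = new BinaryReader(new MemoryStream(data)); }
    public string ReadString() { return reader.ReadString(); }
    public int ReadInt32() { return reader.ReadInt32(); }
}

[Serializable]
public class Voyage : ISerializable   // hypothetical entity class
{
    public string PortName;
    public int DayCount;

    public Voyage(string portName, int dayCount)
    {
        PortName = portName;
        DayCount = dayCount;
    }

    // The one method: pack all owned data into a single byte[] slot.
    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        SimpleWriter w = new SimpleWriter();
        w.WriteString(PortName);
        w.WriteInt32(DayCount);
        info.AddValue("data", w.ToArray());
    }

    // The one constructor: read back in exactly the same order.
    protected Voyage(SerializationInfo info, StreamingContext context)
    {
        SimpleReader r = new SimpleReader((byte[])info.GetValue("data", typeof(byte[])));
        PortName = r.ReadString();
        DayCount = r.ReadInt32();
    }
}
```

A BinaryFormatter roundtrip then moves a single "data" slot instead of one named slot per field, which is the whole trick.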
I see you are against using ADO.Net objects and I have some sympathy for that point of view, but LLBLGenPro entities and collections, for example, benefit in the same way - serialization code is written once and will work for all generated entities (and can be extended by the developer if required).
DamonCarr wrote: 2) This would be considered by many to be an inappropriate use of Large Data Transfer across remote processes (not my words or ideas... I am just the messenger - see Martin Fowler's Patterns of Enterprise Architecture as well as many others).
Sometimes, we don't have control over what is being sent via remoting.
The code will work equally well for small amounts of data if you are in the lucky position of being able to ensure that large amounts of data are never sent.
DamonCarr wrote: This article appears to me at least, to be a very smart guy flexing his skills, but perhaps unknowingly spreading bad design concepts to the readers. The fact is, you ALMOST NEVER would want to do this (please, let's debate this.. I would love to hear your opinions)...
"spreading bad design concepts to the readers" rankles a little bit - the code does exactly what it says on the tin - it is a utility class (not a design your whole application must follow) that may be used to help optimize serialization under certain circumstances.
I took a basic idea, acknowledgements are in the article, and just developed it to the nth degree. It was getting almost silly at one point - I drove 30 miles to work at 3am once because I had to know whether Hashtables for string tokens would be fast enough to make a difference (definitely yes, as it turned out). Crazy? Well yes but heck, I enjoyed doing it, the code has provided some major benefits in existing apps and will be reused in future apps. I even learned how to use NUnit as a bonus! Writing an article for CodeProject was probably the hardest bit as I am not a writer and find it incredibly difficult and frustrating - however I'm glad I did.
The readers of this article, as you mention, are more than capable of determining whether the code would be useful to them or not. "Design concepts", bad or otherwise, don't come into it because it is not a design concept, just a utility.
ADO.Net Types:
You mention several times ADO.Net types and the perils of using them and how you hope I am not using them, yet the second part of the article is devoted to serializing/deserializing DataSets and DataTables! Maybe you didn't read this part?
Thanks for the article links - I was aware of them as it happens but the problems detailed therein are not directly relevant since FastSerializer avoids them.
Your Conclusion:
DamonCarr wrote: Rather then create a new Binary Serialization technique which Microsoft spent countless amounts of money and time on, why not consider that the entire architecture you are proposing is flawed (or not.. Like I said I would love to hear more)? Can you really do a better job the Microsoft here? There is a reason it is so big - You shouldn't send so much data unless you have some weird situation
As the article mentioned, Microsoft are hamstrung in some respects since they have no choice but to work to the lowest common denominator and use reflection to retrieve class data. They have acknowledged this limitation by providing the ISerializable interface to allow savvy developers the opportunity to take over the process where they have better insight into the particular serialization requirements of a class. All FastSerialization is doing is optimizing this somewhat.
Are you aware of the LosFormatter class? (I wasn't until recently, as I am not a web developer.) It uses a lot of similar techniques - tokens for specific types etc. - so it would seem that Microsoft are not averse to supplementing standard serialization and optimizing it in certain circumstances.
Also, use Reflector and have a look at the deserialization ctor on DateTime - imagine running that code for every DateTime deserialized - certainly not optimized but, as I said, Microsoft have to work to the lowest common denominator so yes, I really can do a better job than Microsoft here.
Again, it is unlikely you will have control over, or maybe even know, how large a lump of data is created by serializing an object graph built from a single object. Regardless, whatever size it is, both it and the time taken to create it can be reduced with a little effort if the circumstances deem it desirable.
My conclusions (No disrespect intended but since we're being brutally honest...):
You have had a lot to say but very little seems to be of useful substance.
You have made unwarranted assumptions.
You have made unwarranted accusations.
For a self-proclaimed architect/designer guru you have shown a certain lack of design methodology in not researching and understanding the user's requirements/motivation first.
You have shown a lack of understanding between a utility class and a design concept.
You have confused me regarding Agile methodology. For example, you are a "process leader" but your "members" are "basically peers"; "...my lead developers..." - surely contradictory - you are either all at the same level or there is a hierarchy/pecking order.
You have shown some questionable management/inter-personal skills in that you both praise and bollock in public under the guise of "feedback" and "another perspective" yet meaning "No disrespect" and "..was very careful to not ruffle feathers". This is an interesting way of wanting to "start a lively debate" and get some "interesting responses" even when "But these are not my ideas...".
Now having said all that, if you are still up to furthering this debate in a constructive, fair and honest manner, I am happy to spend the time writing up the history of how this code came to be and show why it is a good solution for that history and why I personally believe the code is about as good as it can get (but would be more than happy to be proved wrong).
Cheers
Simon
|
|
|
|
|
Message Closed
modified 28-Sep-18 18:43pm.
|
|
|
|
|
Hi Damon
Looking over your comments and sample code, I can't be certain but I think there may be a misunderstanding here of how Fast Serialization works within a remoting scenario... we'll see..
It doesn't change the remoting technology at all. The MemoryStream used internally (not to be confused with the one I happened to use for testing - that was only there so that serialization memory usage/timing could be measured without being affected by the network) is not transmitted at all, it is merely used to construct a single byte[] representing the owned data for a given object. That single byte[] is then stored within the SerializationInfo block provided by the serialization infrastructure exactly in keeping with the ISerializable contract.
Therefore any talk about chunking *this* MemoryStream is irrelevant. It will disappear anyway as soon as the SerializationWriter instance goes out of scope. What is important is the byte[] it helped to create, which was passed back to the remoting infrastructure via SerializationInfo - it is this that is stored in another MemoryStream, one which is created by the remoting sink and will hold a binary representation of *all* the objects contained in the object graph, regardless of whether they were serialized using the default standard reflection, ISerializable, FastSerialization, Surrogates or whatever.
This sink-owned MemoryStream is also not chunked in any way and will get larger and larger until the whole object graph is built. Only at this point will any custom sinks have any access to this stream and therefore be able to do anything with it such as apply compression or whatever (and as mentioned you may get an OutOfMemoryException first). The point is that Fast Serialization does not have anything to do with this process whatsoever - all it will have done (and done quickly) is reduce the overall size of this sink-owned MemoryStream's buffer in a completely *transparent* way which is a *good thing*.
Therefore I still say that it is not a design concept because we are not changing the remoting design at all - simply storing the data representation for a given object in a different (and better) way.
Now about the code samples you gave which use chunking:
Firstly, that isn't Remoting in the .Net sense (or at least as I understand it) - it is an alternative to Remoting which is stream-based rather than message-based as Remoting is.
Secondly, it describes the transmission process rather than the serialization process - the PDF being received must have been serialized to bytes by the web server first, OR it is being serialized in chunks on request, which is good because it means the whole PDF (or its serialized byte stream) does not necessarily have to be in memory in its entirety - it could be being read chunk by chunk from a filestore or database.
However, .NET remoting does not work that way. The *whole* remoting message/object graph is serialized into *memory* first and *then* it is transmitted and reassembled and only when *all* of the data is received is the object graph deserialized.
In addition, wouldn't you need to write specific methods for each object type being transmitted? This may be fine for certain often-used types such as PDF content, but Remoting should be transparent for any type (or at least those with a [Serializable] attribute).
Your #1 point about me improving performance in a "...very limited situation".
Well *every single entity transmitted* isn't a very limited situation - it is at or very near 100%.
"But the custom plumbing required is extensive..."
By "plumbing", I'm guessing you're referring to custom sinks etc. Well no, nothing like this is actually *required* - the only code you *need* to write is in the GetObjectData() method and the deserialization constructor to meet the ISerializable interface.
If you are not able to do that because you don't have control over the class source then you *can* use some custom plumbing to create a surrogate - even then, the code is almost trivial and is still part of the standard .Net remoting paradigm.
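The surrogate route really is almost trivial. Below is a self-contained sketch of the idea: ThirdPartyPoint stands in for a class whose source you don't control (such as DataSet), and the surrogate packs its fields into a single byte[] slot the same way the FastSerializer approach does, using a plain BinaryWriter as a simplified stand-in for the article's SerializationWriter. All class and member names here are hypothetical.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;

// A class we cannot modify (stands in for DataSet etc.).
[Serializable]
public class ThirdPartyPoint
{
    public int X;
    public int Y;
}

// The surrogate supplies GetObjectData/SetObjectData on the class's behalf.
public class PointSurrogate : ISerializationSurrogate
{
    public void GetObjectData(object obj, SerializationInfo info, StreamingContext context)
    {
        var p = (ThirdPartyPoint)obj;
        // Pack both fields into one byte[] slot, FastSerializer-style.
        var ms = new MemoryStream();
        var w = new BinaryWriter(ms);
        w.Write(p.X);
        w.Write(p.Y);
        w.Flush();
        info.AddValue("data", ms.ToArray());
    }

    public object SetObjectData(object obj, SerializationInfo info,
                                StreamingContext context, ISurrogateSelector selector)
    {
        var p = (ThirdPartyPoint)obj;
        var r = new BinaryReader(new MemoryStream((byte[])info.GetValue("data", typeof(byte[]))));
        p.X = r.ReadInt32();
        p.Y = r.ReadInt32();
        return p;
    }
}
```

Registering it is a couple of lines on the formatter (which is exactly what the remoting sink does internally): create a SurrogateSelector, call AddSurrogate for the type, and assign the selector to the BinaryFormatter - still entirely within the standard .Net serialization paradigm.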
The code I wrote for LLBLGenPro has a simple switch to turn Fast Serialization optimization on or off - no change to sinks, no change to remoting configuration. Turn it off and you will get standard .NET serialization.
DataSets and DataTables:
No, they are not read-only, and they are used within an n-tier/domain-driven/OO architecture (not written by me, though). The DataSet is just a transmission device and, for that, is as good as any other really.
When the DataTable is received via remoting it is wrapped in an Entity collection - each contained DataRow is used as the backing store for an Entity instance - all access is done through the Entity Collection and Entity properties (autogenerated from a schema). When changes are made to any/all of the collection contents, just the changed entities/DataRows are sent back to the server (in a DataSet which may contain one or more EntityCollections/DataTables) as a UnitOfWork to be executed on the server. So we are using DataTables/DataRows both as backing fields for Entity instances and for data transmission.
Optimistic/pessimistic is configurable on a per-entity-type basis.
LLBLGenPro/NHibernate:
LLBLGenPro has Entity and EntityCollection classes (plus one or two other relevant classes). Since it comes with the source code, it is possible to add the Fast Serialization code directly to these classes. Once this is done - and no, this wasn't trivial, but it was perfectly doable - Fast Serialization is available for *any* entity/entitycollection, both now and in the future, just by setting a single static boolean flag. The investment is made once but reaped many times. In fact you don't even have to do this yourself anymore as the next release of LLBLGenPro (v2.1) will have this code built into the runtime libraries.
Our existing app, the one that used DataSets, uses a version of the code from pt 2 of the article. It uses a surrogate since we can't change the DataSet code directly but again, both surrogate and serialization code were written once and work for all the tables we have now and any we add in the future. Nothing else in the app needed to be changed and we were already using custom sinks for compression anyway.
NHibernate was in fact the reason I started writing all this anyway ('the purpose'!). We were going to compare our existing system vs NHibernate vs LLBLGenPro (which was my preferred option). I wrote some code to compute the time/memory used by each of the three systems, from single entities to massive chunks of data (including calculations for the total time to transmit over the local network, to the USA and to Singapore, all with and without compression).
Now I knew nothing about NHibernate anyway and found the documentation to be sparse and confusing especially since some of it refers to features only available in its java roots.
As I understood it at the time (disclaimer: you may correct me here!), you have an XML schema file and create entities with properties to match. From the samples I have seen, a private backing field is normally used to store the data which may be an elemental data type or a collection of other entities.
Basically the results using standard serialization showed that NHibernate was no faster than our current system and got slightly slower as the amount of data increased. However LLBLGenPro, which otherwise had all the features we wanted (and no xml file to write!), took up way too much space/time for serialization.
I started a dialogue with the developer, Frans Bouma, on his support forums about how it could be improved back in July last year. ISerializable was already implemented pretty much along the Microsoft recommendations so there wasn't an awful lot to improve there. As I mention in this article, I took ideas from other people, expanded on them and came up with this code. The next version of the test results showed exactly what I wanted (of course!): that LLBLGenPro was much faster and smaller than NHibernate for smaller amounts of data (to which I couldn't see an easy way of applying Fast Serialization because it only used backing fields), and that the gap increased as the data got larger. (To be fair, NHibernate is still better with regards to entity memory footprint but, as you say, memory is cheap!)
The serialization code was refined to cope with all LLBLGenPro-specific scenarios including coping with circular references and so on and now will be incorporated into the next release.
"We all sometimes get legacy crap, but we must refactor it or suffer.."
I agree, but I think what we did, both for our existing DataSet apps and the future LLBLGenPro app, *was* refactoring. We identified bottlenecks and found just a handful of places to make modifications which resulted in massive improvements, did not affect existing code, and are quite manageable - I don't expect the code to change much or often.
DataSets will probably change in the future as they did between v1.1 and v2.0, and I just reflectored and made the necessary changes (and posted them here so no one else needed to do it) - even that was just the names of a few private fields as I recall.
LLBLGenPro - well that also may be changed between (major?) versions, but as Solutions Design now own the code, they will do it anyway.
Microsoft:
I still believe that Microsoft are hamstrung over serialization since they have to write code that works for *any* object (with a Serializable attribute) written both now and in the future. I believe reflection is used but it doesn't really matter - they cannot know, other than maybe via the NonSerialized attribute, which private data fields should be included/excluded or whatever. This isn't a criticism - it works great given the constraints they have to work under.
The DateTime point I made:
Yes, in-memory execution is cheap and so the point could be considered irrelevant as the difference is not easily measured. However, that small difference gets multiplied for each use. I suppose that is the difference between adequate and optimized - probably not worth the effort by itself, but if you get that optimization for free as a bonus because you are really optimizing something else, then all the better.
A further point on this subject. Consider that in v1.1, DateTime did not implement ISerializable, and so the private long ticks field was read via reflection and would therefore always take 8 bytes regardless of its content. In v2.0, the ISerializable.GetObjectData looks like this:
info.AddValue("ticks", this.InternalTicks);
info.AddValue("dateData", this.dateData);
It's now storing two values rather than one, and incredibly InternalTicks just returns (((long) this.dateData) & 0x3fffffffffffffff);
so it now takes 16 bytes rather than the original 8 and the new, additional data could have been generated from the old anyway! So yes, I did better than Microsoft for DateTime.
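The redundancy is easy to demonstrate. The sketch below packs ticks and kind into a single ulong the way v2.0's dateData field does (low 62 bits for ticks, top 2 bits for the DateTimeKind - a layout inferred from the mask above) and then recovers both by masking:

```csharp
using System;

const ulong TicksMask = 0x3FFFFFFFFFFFFFFF;

DateTime original = new DateTime(2007, 6, 1, 12, 0, 0, DateTimeKind.Utc);

// Reconstruct the packed field: low 62 bits = ticks, top 2 bits = kind.
ulong dateData = (ulong)original.Ticks | ((ulong)original.Kind << 62);

// Everything the old 8-byte representation held is recoverable by masking,
// which is exactly what InternalTicks does.
long recoveredTicks = (long)(dateData & TicksMask);
DateTimeKind recoveredKind = (DateTimeKind)(dateData >> 62);
```

Since recoveredTicks equals original.Ticks, serializing a "ticks" slot alongside "dateData" stores nothing that wasn't already there.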
Other points about Writing a New Serializer or adding options on the Binary serializer and using sizeof(object) in unsafe code:
I don't think these are relevant given my first response - there is nothing wrong with the existing serializers - we are just giving them less data to store.
sizeof(object) only gives the size of a single object and not the *object graph* that would be involved in serialization.
The bottom line to me is:
1) If your application is providing adequate remoting performance then it ain't broke so don't fix it.
2) If you need to do something then you would likely be looking at implementing ISerializable on *some* classes. If performance is now adequate then the job is done.
3) Still need more? Since you have modified your code to implement ISerializable anyway, have a look at using Fast Serializer to store your class data instead of storing each in a named slot. You can use an if/then/else to make it conditional if required and you won't have to alter the design of your application.
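The if/then/else in step 3 can be sketched in a self-contained way. Below, a BinaryWriter over a MemoryStream stands in for the article's SerializationWriter, and the Order class, its fields, and the FastSerializationEnabled flag name are all hypothetical; the point is that one switch chooses between the single-byte[] slot and the standard named slots, with the constructor detecting which form it received:

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;

[Serializable]
public class Order : ISerializable   // hypothetical class
{
    public static bool FastSerializationEnabled = true;  // the single switch

    public string Customer;
    public int Quantity;

    public Order(string customer, int quantity) { Customer = customer; Quantity = quantity; }

    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        if (FastSerializationEnabled)
        {
            // Fast path: everything packed into one byte[] slot.
            var ms = new MemoryStream();
            var w = new BinaryWriter(ms);
            w.Write(Customer);
            w.Write(Quantity);
            w.Flush();
            info.AddValue("data", ms.ToArray());
        }
        else
        {
            // Standard path: one named slot per field.
            info.AddValue("customer", Customer);
            info.AddValue("quantity", Quantity);
        }
    }

    protected Order(SerializationInfo info, StreamingContext context)
    {
        // Detect which representation was stored rather than trusting the flag,
        // so payloads serialized before the switch was flipped still deserialize.
        byte[] data = null;
        foreach (SerializationEntry entry in info)
            if (entry.Name == "data") data = (byte[])entry.Value;

        if (data != null)
        {
            var r = new BinaryReader(new MemoryStream(data));
            Customer = r.ReadString();
            Quantity = r.ReadInt32();
        }
        else
        {
            Customer = info.GetString("customer");
            Quantity = info.GetInt32("quantity");
        }
    }
}
```

Either way the application design is untouched; only the contents of the SerializationInfo block change.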
Now then, about that NYC job.....
Cheers
Simon
|
|
|
|
|
Message Closed
modified 28-Sep-18 18:38pm.
|
|
|
|
|
Hi Damon
The remoting MemoryStream is created by, owned by and processed by the remoting infrastructure - FastSerializer doesn't even get to see it, let alone manipulate it or start raising events from it! If you want to do anything with it at all then you will have to write custom sinks and basically take over the whole remoting process. Now who's making things complicated!! (Actually, I did trace through using Reflector and got as far as creating a custom binary formatter, but there was no way of controlling creation of that MemoryStream - its creation is *very* deep.)
So unless you can prove differently, I am confident you can forget chunking memory streams using .NET remoting - you would have to write an equivalent infrastructure.
Your sample code demonstrates how to serialize an object, but where exactly does remoting come into this?
Again, I don't know NHibernate, so it's not clear to me whether the GetBookmark method is on a client machine or a server machine (to keep terminology simple). Is BookmarkDAO allowed on the client machine? I doubt it from what you've said previously (though I'm not certain), since a client shouldn't need to access the database, but that would mean that your CommandInvoker object would have to be remoted back to the client machine. Alternatively, if this really is client-side code and .BookmarkContent returns the contents of a (decompressed?) MemoryStream, then you would have to repeat this for every Type that could be remoted?
"If your objects are large, don't serialize them..."
Forget 50Megabyte objects - I don't have them, you don't have them, they only exist in this thread.
The point should be "large object graphs", not "large objects". Given an object, you do not know how large its object graph will be. Since you don't know how large the object graph is, you cannot know how much space it will take up when serialized. Even if you have the source code to a particular class, you still can't know (unless it has no object references) how much space it will take to serialize it.
I may have a Ship class (I work for BP Shipping) and I request a single ship entity and it may serialize to a couple of hundred bytes. I may ask for the same ship plus all of its Voyages as aggregates for the last 12 months. I may be getting say 27 objects now - the original Ship entity plus 25 Voyage entities contained in one VoyageCollection class. I then ask for the same ship plus its voyages plus every Port Call - we could be talking several hundred objects now, all linked together with a single Ship object at the root but lots of collections/back references etc.
I now ask for all ships owned by BP over the last 5 years plus all their voyages, the voyage's port calls, all the port calls' operations, the port names, the daily noon-day readings, cargo pick ups and drops offs plus details of all invoices made on those voyages with order-level details and full contact details for any agent/inspector involved in any of the transactions.
OK, we're getting silly now but this *is* possible - and how many methods/stubs/proxies did I have to write to do this? None. They are all LLBLGenPro entities/collections and so share the same FastSerialization code, so all I needed to do was turn on Fast Serialization. (Without the switch, .NET crashes with an OutOfMemoryException.)
"..Move to a MemoryStream, compress it, serialize it in chunks, when the receiver has it all uncompress it and deserialize it. SIMPLE!"
But to put it into a MemoryStream you use a BinaryFormatter which is what Remoting does already!!! We are going around in circles!!!
You omit to say exactly how the serialized chunks get from one machine to another. Are you suggesting writing your own transmission framework along the lines of HttpRequest/HttpResponse? Or are you planning to use .Net remoting to move your compressed MemoryStream in which case you have just duplicated the remoting process for no reason???
"Well no, nothing like this is actually *required* - the only code you *need* to write is in the GetObject() method and the deserialization constructor to meet the ISerializable interface.
Of course I know this, but you would not gain the purpose of your article."
Yes you would! That is exactly the point of the article!
It is also the logical place to put this code - it has code directly pertinent to the class at hand; it has full access to all private data; it is the only way (without surrogates) that .Net serialization will let you initialize a deserialized object.
With regard to the DataSet/DataTable stuff:
I was describing an inhouse ORM we have which sounds similar to NHibernate but uses DataTables/DataRows as the backing store. It does just about everything you mention except one thing: entity inheritance. LLBLGenPro supports all forms of this and that is one of the reasons I recommended it. (Incidentally, NHibernate was suggested later; I was only comparing to make sure NHibernate did not have significant advantages over what was already more or less decided.)
We don't access the DataSet/DataTable via app code - only the Entity and EntityCollection are used. We have a service interface (in a separate assembly, naturally) that contains the DataTable-to-EntityCollection code - not that there is much: the EntityCollection just accepts a DataTable in the constructor and creates enough Entity instances to wrap each DataRow. I omitted to tell you that prior to FastSerialization being incorporated we had other similar code in a surrogate which basically did the optimization/bug workaround from the articles you mentioned previously - this was enough to give performance comparable to NHibernate.
This gave adequate performance but only one line was need to switch to the new code. Why have cotton when you can have silk at the same price?
I should leave the nHibernate/LLBLGenPro/DataSet wrapping ORM religious debate alone but.... what the heck...
You mention writing a 30 line XML file which works across all major databases with no code changes or code gen.
- no code changes or code gen but then you mention an option to create stub classes for DomainObjects. I get the bit about no Data Access Code but you are still coding by hand??
- Hmm. Suppose the XML doesn't quite match the database schema - what happens then?
- You want a cool UI tool to create your relationships? LLBLGenPro has one - it even lets you define virtual relationships in the GUI if you feel so inclined. No XML, no accidental database/schema mismatch. Just press the button and you will get two projects created ready for compilation - one containing data access code and one containing your entities, so you can keep data access off the client machine altogether. Want some POCO classes? Just add a template and they will be created for you.
(Incidentally, I think Frans Bouma is maybe the only man on the planet who can write as much as your good self.)
- Data Access Code Gen is Evil - I agree on this one - LLBLGenPro creates a parameterized SQL query specific to the required operation not code gen'ed at all.
Okey dokey, your turn......
Cheers
Simon
|
|
|
|
|
Simon,
As you said, this is just a utility class (a very useful+well written one!) that can be used in many scenarios.
Thank you very much for publishing it and please let me know if you have any updates to the code
Kemal
|
|
|
|
|
I read all the comments following Damon Carr's post. The conversations between people from different backgrounds are very interesting.
After all, I always think that old technology is not necessarily bad technology, and new technology is not necessarily a more powerful silver bullet. I do agree that the "standard compression" suggested by Damon might be more appropriate, depending of course on more unspoken "factors".
It is always risky to put a lot of programming effort on top of an architecture not intended for the use case. This could be considered hacking the architecture, comparable to over-clocking the CPU.
I don't deny that sometimes we need to hack to make things work (for a deadline, or for lack of knowledge of alternative architectures), but we need to keep in mind that we are hacking.
For articles published on CodeProject, I wish the authors who publish their work, in particular programming work, would talk about more "factors", such as use cases, limitations, and how to choose or balance between alternative solutions.
Zijian
|
|
|
|
|
Hi Zijian
The issue with using "standard compression" for remoting (I'm assuming you mean by creating custom sinks) is that it can only be applied *after* the object graph has been serialized.
It will therefore help to reduce the amount of information that is *transmitted* across the wire, but it won't help with memory usage/fragmentation prior to transmission and it won't help with the speed of the serialization process - in fact it will add an overhead both timewise and memorywise. It also won't help if the serialization process throws an OutOfMemoryException.
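The ordering is worth spelling out in code. In the sketch below (GZipStream standing in for whatever compression a custom sink might use, and the 2MB buffer standing in for a serialized object graph), the fully serialized buffer must already exist in memory before compression can even begin - so peak memory and serialization time are untouched; only the bytes on the wire shrink:

```csharp
using System.IO;
using System.IO.Compression;

// Stand-in for the sink-owned buffer holding the WHOLE serialized object graph;
// by the time a custom sink sees it, all serialization cost has been paid.
byte[] serialized = new byte[2 * 1024 * 1024];

byte[] compressed;
using (var output = new MemoryStream())
{
    using (var gzip = new GZipStream(output, CompressionMode.Compress))
        gzip.Write(serialized, 0, serialized.Length);
    compressed = output.ToArray();
}
// 'compressed' is smaller on the wire, but 'serialized' had to exist in full
// first - compression reduces transmission cost, not serialization cost.
```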
I don't believe it is a hack at all. Microsoft supplies the ISerializable interface specifically to allow the developer to store one or more data items each with a string tag.
All this code is doing is storing a single data item with a string tag (a byte[]) which contains a binary representation of all the data items combined.
Therefore it is still well within the contract defined by ISerializable and therefore works transparently with respect to the remoting process. I don't agree that it is a new technology from the remoting point of view - that stays *exactly* the same - the only difference is the data supplied to that technology.
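Roughly, the pattern looks like this (a sketch only - class and field names are illustrative, not the article's actual code): all fields are hand-packed into one byte[] and stored under a single SerializationInfo tag, staying within the normal ISerializable contract so remoting works unchanged.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;

[Serializable]
public class FastOrder : ISerializable
{
    private int id;
    private string name;

    public FastOrder(int id, string name)
    {
        this.id = id;
        this.name = name;
    }

    // Serialization: pack everything into a single byte[] under one tag.
    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        using (MemoryStream ms = new MemoryStream())
        using (BinaryWriter w = new BinaryWriter(ms))
        {
            w.Write(id);
            w.Write(name);
            info.AddValue("_", ms.ToArray()); // one string tag, one blob
        }
    }

    // Deserialization constructor: unpack in exactly the same order.
    protected FastOrder(SerializationInfo info, StreamingContext context)
    {
        byte[] data = (byte[])info.GetValue("_", typeof(byte[]));
        using (BinaryReader r = new BinaryReader(new MemoryStream(data)))
        {
            id = r.ReadInt32();
            name = r.ReadString();
        }
    }
}
```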
Cheers
Simon
|
|
|
|
|
Hi Simmo
Well, read all the comments, and I just thought I'd drop you a note to say Thank You for the excellent and (to me) relevant code. I work in the Financial Industry, and am specifically involved a lot with Trading Systems and Internet Trading Systems for Stock Exchanges and Futures Exchanges. Every byte shaved off the data, not to mention improvements in serialization speed, is extremely important. With your serialization technique I was able to shave 40% off the size of our quote messages (90% of messages sent). While this might not appear to make a major difference when the packets are only +- 180 bytes in size, to us, living in a country where bandwidth is still an extremely expensive resource, this is major. This provides me with extremely happy clients as they do not need to upgrade expensive bandwidth.
So, my 2 cents for the current thread is that, indeed, for a lot of scenarios this is extremely relevant code (and no, we are not using DataTables). If you take a look at the FIX protocol, even they are using 7-bit encoding for their new FAST protocol.
Drop me a mail and I will send you a nice bottle of wine as a thank you.
Cheers
Andre
|
|
|
|
|
Hi Andre
Thanks for your comments and thanks for the offer of wine, very much appreciated but there really is no need - just happy that someone is making use of the code.
Cheers
Simon
|
|
|
|
|
To jump in here...
1. It's difficult to criticize a design without knowing the history that led up to the design decision.
2. The real point of this article and my article that the author referenced is that Microsoft gives you the illusion that you just slap a BinaryFormatter onto a DataSet or DataTable and you're done. If you want to criticize anybody for design issues, I'd criticize Microsoft for creating a framework that makes it so easy to do serialization the wrong way and for the wrong reasons.
1) Cost of Incremental Development and Maintenance. Remember the majority of costs for a system are in the maintenance AFTER release, not development. Most systems die a slow and painful death due to entropy and code like you advocate here which nobody can figure out (when the rookies come in to maintain the thing).
This is a most dangerous statement. It certainly reflects reality. I was essentially booted off a project because my code needed to be "dumbed down" for exactly the reasons you state--the rookies couldn't figure it out. But it's a dangerous statement because it implies that complicated, difficult to maintain code is not acceptable. Sometimes complicated code, or more precisely, the complicated issues that require the complicated code, are unavoidable.
In fact, what really surprised me was that most people seem to view layers of abstraction, declarative programming, and other techniques, which I personally view as promoting maintainability, as making the code *more* difficult to maintain! It makes no sense to me.
DamonCarr wrote: I am an Agile process leader
Then you of all people should appreciate that the real issue is not the complexity of the code but the quality of the infrastructure supporting the code--the documentation, change logs, ATP's, unit tests, and so forth. Given a quality infrastructure, yes, a rookie should be able to maintain code to any level of complexity.
Yet in my experience, that's the area ignored in the cost of development. If the development cost accurately reflected the cost of documentation, testing, test plans, procedures, tools, etc., then the maintenance cost wouldn't be burdened with the costs of an incomplete development budget.
Marc
Thyme In The Country | Interacx
People are just notoriously impossible. --DavidCrow
There's NO excuse for not commenting your code. -- John Simmons / outlaw programmer
People who say that they will refactor their code later to make it "good" don't understand refactoring, nor the art and craft of programming. -- Josh Smith
|
|
|
|
|
Marc,
I love your points. Almost all I agree with.
To jump in here...
1. It's difficult to criticize a design without knowing the history that led up to the design decision.
Agreed. That is why I repeatedly asked for more information on what led to the design. Nice point, but I tried my best not to do what you have described.
2. The real point of this article and my article that the author referenced is that Microsoft gives you the illusion that you just slap a BinaryFormatter onto a DataSet or DataTable and you're done. If you want to criticize anybody for design issues, I'd criticize Microsoft for creating a framework that makes it so easy to do serialization wrong way and for the wrong reasons.
Well... I would say it is a poor design (again, depending on (1) above) to serialize those types... But that is just me... Microsoft HAD to make this work. They are large types because they are amazingly powerful.
Most people doing read-only should get a DataReader and move info into Domain Classes stored in a Generic collection or other serializable collections (Like HashSet from IESI (SP?)- another plug for NHibernate) (with the domain types defined by Interface or AbstractBase).
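As a rough sketch of what I mean (table, class and column names are purely illustrative), stream rows from a DataReader into small, serializable domain objects held in a generic List<T>, instead of shipping a whole DataSet around:

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// A lightweight, serializable domain class - cheap to serialize compared
// to a DataSet carrying schema and change-tracking baggage.
[System.Serializable]
public class Quote
{
    public string Symbol;
    public decimal Price;
}

public static class QuoteRepository
{
    public static List<Quote> LoadQuotes(string connectionString)
    {
        List<Quote> quotes = new List<Quote>();
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(
                   "SELECT Symbol, Price FROM Quotes", conn))
        {
            conn.Open();
            // Forward-only reader: no in-memory DataTable is ever built.
            using (IDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Quote q = new Quote();
                    q.Symbol = reader.GetString(0);
                    q.Price = reader.GetDecimal(1);
                    quotes.Add(q);
                }
            }
        }
        return quotes;
    }
}
```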
1) Cost of Incremental Development and Maintenance. Remember the majority of costs for a system are in the maintenance AFTER release, not development. Most systems die a slow and painful death due to entropy and code like you advocate here which nobody can figure out (when the rookies come in to maintain the thing).
This is a most dangerous statement.
Like Darwin's? That was called dangerous. It still doesn't make it false or irrelevant (not that I mean to imply you are saying that).
It certainly reflects reality. I was essentially booted off a project because my code needed to be "dumbed down" for exactly the reasons you state--the rookies couldn't figure it out.
I have almost left this industry for this reason. I fundamentally do not believe in 'entry level' developers except for the most irrelevant projects. Give me 4 superstars and I will blow away 50 mediocre people any day (depending on (1) above - ha ha).
But it's a dangerous statement because it implies that complicated, difficult to maintain code is not acceptable. Sometimes complicated code, or more precisely, the complicated issues that require the complicated code, are unavoidable.
Kent Beck says he is often criticized for this. He says: NO! Don't make the code simpler than it needs to be; make it AS SIMPLE AS IT MUST BE TO MEET CURRENT AND FUTURE CHANGES! (OK, I added that last bit, but it is the same idea.)
If someone does not know Design Patterns and other techniques (plug-in architectures using reflection, etc.) then they should:
1) Learn - Buy 'Head First Design Patterns' and read it for god's sake! Buy 'Refactoring' and 'Refactoring to Patterns' (Fowler/Kerievsky)! I have a full reading list on Amazon for .NET people. Many others do as well. The best developers are up to 28x better than the worst. Where are you, reader? I only hire those top people; I make a massive return and they get paid hundreds of thousands a year. You want to make $40,000 a year? Want to go home, have a beer and watch American Idol? See 2. I will profit from your lack of interest.
2) Find a new profession
3) Work for the Government (not NASA please or any other area where I could die)
In fact, what really surprised me was that most people seem to view layers of abstraction, declarative programming, and other techniques, which I personally view as promoting maintenance, they view them as making the code more difficult to maintain!
See above. Abstract or go home. SERIOUSLY! We need a zero tolerance policy for this crap! Why will it never happen? The people hiring them are even stupider.
It makes no sense to me.
Nor me... That is why I come in and charge insane amounts to usually be told 'we cannot change like that. It is too hard'.
Fine by me. I still get paid and you still lose another $10,000,000 a year in wasted dev projects. Oh, and I know to short your stock.
DamonCarr wrote:
I am an Agile process leader
Then you of all people should appreciate that the real issue is not the complexity of the code but the quality of the infrastructure supporting the code
OK, here we diverge a little. It's not the complexity, but the NECESSARY complexity. No more. In other terms, as simple as possible but no simpler (as I believe I said before). This article presented (see (1) for a caveat) what some might see as a license to violate this.
THIS IS HOW THIS ALL STARTED! Now it is a full-on discussion and I love it.
--the documentation, change logs, ATP's, unit tests, and so forth.
See my article on this in 'Agile Development'. I am with you here (although it depends on what you mean by documentation)..... For me I need:
1) Test Driven Development
2) Continuous Integration
3) Nightly full-Automated System Regression Testing (like AutomatedQA)
4) Daily Stand-Up
5) Short (1-2 week) iterations
and here is where I start to break away:
6) An obsession with Design Patterns as the iterations evolve. We call this 'Pattern Hunting'
7) We reverse engineer the code into UML Diagrams at the start of an Iteration, look for patterns, play with ideas, and start TDD, eventually throwing away all UML diagrams. They are just a base of reference and an 'AH HA! Here is a place where a Command pattern would help!'.
TDD is the 'architect' of a system for lack of a better term, not large up-front UML (and I know you never said that).
8) MASSIVE customer involvement
9) 'Iteration Planning' and 'Index Card' style thinking
10) LOTS of whiteboards everywhere
11) I know I am forgetting at least 1 thing......
Given a quality infrastructure, yes, a rookie should be able to maintain code to any level of complexity.
Absolutely, positively not true. Here is my main disagreement with you.
We cannot compromise our profession to (as people would say here) the 'lowest common denominator', which is an idiot in most cases. Do this and all is lost.
Instead? REQUIRE ALL DEVELOPERS TO 'RAISE THE WATER LEVEL'. How? I certainly do not know. I know Brainbench has a Design Patterns certification, which is probably the single most important thing I can think of for a developer to master.
Often my first interview question: Name 3 Design patterns and how you used them to make your software more flexible to change. Fail on that? Interview is over.
90% fail.
Second question:
What are the two main kinds of data types in .NET and how are they managed differently in terms of data structure (reference types on the heap, value types on the stack)? Another 6% fail there. That leaves 4%. I am lucky to hire 1 (if 100 were starting). Bonus points for the large object heap (almost nobody mentions it).
So we have a failure of: 1) OO principles and practices that are over a decade old and 2) platform knowledge. And I have like 18 more questions!
No wonder Microsoft and Google get the best people!
Assume we receive a new requirement which fits perfectly into the existing Decorator we have set up.
The 'rookie' has never even heard of the Decorator. So he will f**k things up. Again, Michael Feathers wrote the absolute definitive book (Working Effectively with Legacy Code) on this (although he is far more diplomatic than I).
Hell, most NEW projects are starting out by writing legacy code!
Yet in my experience, that's the area ignored in the cost of development.
Yes... Managers need to hire FAR better people and FAR fewer people. But how? They need to:
1) Be amazing themselves or
2) Have help from someone amazing in the hiring process
3) Read and re-read the book 'Facts and Fallacies of Software Engineering' by Robert Glass.
If the development cost accurately reflected the cost of documentation, testing, test plans, procedures, tools, etc., then the maintenance cost wouldn't be burdened with the costs of an incomplete development budget.
They may never. Can you imagine going to get money approval for a $10,000,000 project (3 years) and saying 'well it is now $100,000,000 over 15'?
It's like political figures with a constituency, CEOs, etc. Deliver short term and get as much (which is not much) strategy into your legacy. That is why CTOs are the least likely to become CEO and the most likely to be fired.
Thanks,
Damon Carr
|
|
|
|
|
What I've done:
I read the article and the ensuing discussion in full... I did not download the code and play with it, but I did read over it enough to understand what it's doing.
What I still don't understand:
Why was this developed? What is the problem this is solving? Optimization, in general, is done to create performance gains in a situation where performance is lacking, or is the major failing point of an existing stable system.
What is the system? Could someone please describe, in simple terms, the system in which this code is playing a part? Could someone also describe how the standard serialization techniques would be applied, and then compare and contrast it with this solution?
Could someone describe how the standard serialization scheme created a performance problem which was best solved by this optimization?
I apologize if I missed where this was already clearly explained.
The reason I ask, is that this seems like an interesting topic, and the discussion has some obviously experienced and capable people involved, and if I was able to put it in some kind of concrete context, I think I would get a LOT more out of the discussion as a whole.
As a side note, this might also be very useful to me in the near future, as we're currently working on a dataserver architecture for a distributed computing system which will involve extremely large object counts as well as large individual object sizes in some cases, being shuffled around between various systems depending what work needs to be done on the object at the time... A fast and stable serialization and transport system would benefit this project greatly, and the existing systems have already proven somewhat insufficient for our needs.
Thanks,
Troy
|
|
|
|
|
Troy,
My stack pointer is at an invalid memory location for my Win32 process and about to try to jump to pop the stack and execute the pointer for a routine that is supposed to be there..... - GRIN.
I hope on the behalf of the small group you can see that EVERYONE is right from the correct perspective. Nietzsche helped us understand that there is no one universal right and wrong vantage point. It is dependent on the observer and their own mental filters and the specifics of their concerns. Listen to the reports of a crime from 10 suspects and you will see just how important and powerful this is.
We can say things like 'rape is wrong and we need laws to prosecute those who commit it' and who would argue? Not me….
However, let’s take the moral and legal points out for a moment and ask ‘Why do men rape’? Actually, people already have. Some very smart ‘Evolutionary Psychologists’ (a field I am deeply interested in as a way to gain a better understanding of ‘human nature’) wrote a book that just said MAYBE this was an evolutionary trait that evolved as a mating option for those with no other. Does that make it MORALLY right? Hell no.
Does it help us deal with this profound problem? Absolutely it does, as it would lead to a better understanding of rape prevention and treatment for those who perform these vile acts.
But you might know what happened to them. They were attacked viciously (as I was at least once here).
Our society says to scientists:
1) You Can work over here
2) But don’t even THINK about going there (as in ‘The Bell Curve’, which tried to attribute certain traits (intelligence being just one of many) to ethnic groups). They were also attacked viciously as it went against one of the largest false statements ever made: ‘All men are created equal’.
Yeah? Then what about the twin studies? What about children from highly intelligent parents who are adopted? Yet statistically they are FAR more likely to carry on that intelligence. I could go on and on….
JUST FOR RAISING THE INTELLECTUAL ARGUMENT they (and hundreds, even thousands of scientists who choose to study 'unfavorable' subject matter) are prosecuted, jailed, exiled, even killed for their work. Am I likening myself in some grandiose way to a scientist? Well, I am a scientist. So are all of you. I just don’t have a PhD (grin). Seriously, we all must not allow our own baggage to keep us from an objective view of new ideas.
These scientists were destroyed in the media and, as far as I know, lost their careers or had them significantly damaged (anyone? I am not sure…). Did they deserve it for just doing what they are supposed to do? Hell no… NOTHING is beyond deep examination at any time, especially that which you hold most sacred and truthful.
My point? They were just scientists investigating a hypothesis and were doing exactly what they should; screw the 'taboo' nature of the subject. Religion teaches us not to lie, yet it practically discourages scientific discovery of the truth. So religion and science are aligned in one way, yet science often proves religion to be incorrect in 'faith'-based beliefs.
My god the Pope even now says Darwin had it right, however the instant we became conscious, it was a divine act. Hmm.. Interesting that of the millions of species that evolve in exactly the same way we did, we are the only ones.. But I have nothing against religion (OK I do but it is not relevant here).
How did we now get to religion? Just stay with me…
Religion demands we not lie and seek truth. You may not immediately see my larger point here. Both institutions (Science and Religion) are correct and form a kind of balance. People need faith based ‘supernatural’ beliefs as humans. We now know that. However science is destroying these ‘supernatural’ religious beliefs one by one (for all religions).
Science proved the ‘Shroud of Turin’ to be a fake. Yet it is still proudly displayed and worshipped because WE NEED FAITH.
We even have the big bang figured out to something like everything after 10 to the -100000 seconds. It is predicted we will be able to create ‘carbon-based life’ in the same form as the first life on Earth in labs within 10-25 years (or less), and the ‘singularity’, where computers exceed human intelligence, is likely in the next 20-50 years.
Both have a place in society, just as all of the opinions here are valid.
So I may appear to be asking others to agree with me in my posts (and in a way I would be lying to myself and my ego if I said it wasn't a nice thought), but what I am REALLY doing instead is ASKING THEM to try to see my 'observer's' perspective and to try to view this problem from an opposing (but no less valid) viewpoint.
I think if you were to say 'He was right here and he was wrong here' it would be a negative and possibly destructive way to move forward. There is no way one person can say this as a kind of ‘universal truth’.
All I can even say is 'based on the work of others and based on my experiences, this APPEARS to have been a bad implementation'. And even then ALL I AM ASKING FUTURE AUTHORS TO DO is to first write a disclaimer as such in similar situations:
1) "The techniques provided in this article are not generally recommended as a 'first line of attack'. Instead, this is a solution for when you are not faced with any other alternatives and must get yourself out of a jam you likely didn’t create. To START a design using this work would be a mistake, as optimization should be left to the end of an iteration. Optimize last, never before there is a very good business-driven reason, and be sure it is not caused by a flawed design/architecture first if you have the luxury to revisit it. Also remember solutions in software are not ‘waiting to be found’, as Michelangelo said of the people ‘trapped’ in the huge slabs of marble he simply ‘set free’. There is no one solution; there are hundreds, all with Greek-God-like pros and cons. Your job is to help find the best balance."
How do you do this? The only way in my 16-17 year career, and decades of obsessive reading, is through an iterative process. In other words, you cannot know what you want before you start so don’t even try. Software is what is known as a ’Wicked Problem’.
I could go off here on another tangent. Just please (unless you already know the idea) educate yourself on what is the central principle in software development which has made our industry a kind of joke. We have insane losses and a miserable track record and it is not improving.
http://en.wikipedia.org/wiki/Wicked_problems
All points (from what I have heard, especially the gentleman who likened me to a movie's idiotic middle manager) have merit.
He was just expressing his ‘perspective’ that I was full of it (and perhaps I am).
Hell, I even emailed that person directly to see if I could learn more from them. LEARN! I thought he could teach me something about myself and help me improve how I communicate.
I would not have been angry, only interested in their life experiences that would make them read my writings and think what they concluded. THAT interests me, not any absolute 'right and wrong'.
Others have already benefited from this article, so from a utilitarian perspective, perhaps THAT moral framework calls it a success. In my framework, it is far more complex (and I could argue even the Utilitarian model fails here). I would look at all of the POTENTIALLY negative forces the article would move people towards.
What I've done:
I read the article and the ensuing discussion in full... I did not download the code and play with it, but I did read over it enough to understand what it's doing.
That is excellent. But I would only humbly ask you not to attempt to provide 'judgment' or 'right and wrong' here. It does not exist. We are all right and wrong from our perspectives. By sharing mine, I hope some people could see the article in a different light, one that I am paid to represent and one that I would never change as it is the largest good I can do (far more good than any development role in the big picture on almost all occasions).
Most superstar developers are instinctively against my position here. 10 years ago I would probably have flamed me as well. There is a yin and yang here (sorry, this is sounding more like a Dalai Lama speech than a post, I am starting to realize)…..
I represent the opposing force, where I must consider the 3-10 year picture in which the 'superstar' developer(s) will probably be long gone. I represent the client's interests and am paid many orders of magnitude above what a developer is. Why? Because the ROI I provide is many orders of magnitude higher. Am I some tyrant who kills all creativity and advanced code? HELL NO! Code must be as complicated as it must be, no more! And that in my experience has been pretty damn complex many times.
However, as a coder (just as much as I was at 26 – I am now 36) and as a previous 'young superstar', I can instinctively feel the pull of the other side – OK, now this sounds like Star Wars (grin)….
The other main opposing force is that of the developer who:
1) Optimizes code before any indication it needs optimizing, as it is challenging and shows his/her peers their status
2) From their perspective this is almost always the 'right thing to do', even though it may create waste and no tangible benefit (NOT what I am saying about the article. That would have to be looked at on a case by case basis).
What I still don't understand:
Why was this developed?
This was clear: The writer had a requirement to serialize very large data (and did not have any ability to re-architect the solution). For this, it appears to have been the ‘least worst option’.
I wrote my comments as I am very alarmed at the undeniable dynamic of the ‘guru developer culture' of 'screw the client', and just coding the coolest (most complex) work possible.
What is the problem this is solving?
Overall time to transfer via Single Machine-Cross AppDomain or cross-machine serialization of very large objects is significantly reduced (like storing that dataset in the SQL Server session). A bad design, but we all must live with them.
Optimization, in general, is done to create performance gains in a situation where performance is lacking,
I would agree but add: optimization almost always is (and should almost always be) done ONLY after a performance problem has been shown to be a problem. The anticipation of areas to optimize and the work done BEFORE they occur are almost always waste. Why? We have many studies that show we are usually wrong (not always). This is a foundation of Agile. Don't optimize until you must, or if you are sure it will be a problem, do it last. Make sure to get it working first, then add small levels of complexity and unit tests (you started with one, remember – TDD) as you refactor your way to NECESSARY optimization.
Also, "code without Unit Tests - think NUnit or equivalent - is Legacy code". Why?
You cannot verify the stability of your code base when any change is made (I am of course speaking of non-trivial apps here), and you almost certainly do not have time to pay a QA individual to manually regression-test the entire system every day (and that is only SYSTEM regression; you still need unit regression).
AGAIN: I highly recommend you all read Michael Feather's book "Working Effectively with Legacy Code". It should be called:
"Being an Excellent Developer: Both on Legacy projects and New Ones - or How not to Code a New Project as a Legacy One"
To most of the readers here, I probably do represent 'the man'. However, I can code at the level of 99% of the people here, I would guess (in C# or Java across all domains, especially large distributed object systems like the one this article is trying to help).
or is the major failing point to a existing stable system.
I believe the author said this was an existing problem caused by an unfortunate prior architecture that fundamentally eroded by designing things this way (again, just my 'perspective').
What is the system? Could someone please describe, in simple terms, the system in which this code is playing a part? Could someone also describe how the standard serialization techniques would be applied, and then compare and contrast it with this solution?
Author?
Could someone describe how the standard serialization scheme created a performance problem which was best solved by this optimization?
Easy... It was far too bloated to support the flawed architecture in place (again my perspective) so a kind of 'hack' was required to get around the architectural ignorance that was present I believe before the author became involved. I could be wrong. It would appear however the Author did a damn fine job in the nasty place he found himself in.
I apologize if I missed where this was already clearly explained.
The reason I ask, is that this seems like an interesting topic, and the discussion has some obviously experienced and capable people involved, and if I was able to put it in some kind of concrete context, I think I would get a LOT more out of the discussion as a whole.
I agree. But your 'context' is unlike anyone else's. Your experiences, biases, concerns, etc. make the questions you asked wise ones in my opinion.
What are my thoughts to you? If you have the luxury of architecting this correctly, serializing large single transaction style objects over the wire is a recipe for deep disaster in most cases (and not even my opinion really).
Ask yourself:
1) How are recoveries performed and security enforced?
2) Are these required to be 'guaranteed' in any way if the destination server is down (or overloaded because you are slamming it so hard)?
3) Are there multiple units of work that need to be atomic?
4) Do not rely on sending large data over a network unless you do so in a guaranteed way and usually in a batch style mode, not transactional
5) If you ARE doing ATOMIC work, learn from TCP/IP and other protocols and how they deal with this problem.
6) Design your domain objects with MANY small classes each with very specific and singular responsibilities
7) With the use of Generics, there is little argument among the top .NET gurus (not me) now (as there was before) that sending around DataSets/Types is a legacy concept.
It is just too easy to use Generic Collections, which better represent your domain (where almost all development should be focused anyway) and allow you to easily get around the large DataSet scenario that so many people try to force. It's one of the most common 'short-term' consulting engagements I get asked to fix:
“A web app has moved session state from 'In-Proc' to SQL Server and now, all of a sudden (with a new farm of web servers), the app takes 10 seconds to load a page instead of 1, when only 1 server existed before.”
Why? Every user is storing a 100,000-row DataSet in their session, which must now serialize to SQL Server. When it was in-memory all was OK (a bad design from my perspective, but to the business there was no visible problem). This brings home the point that FUTURE CHANGES and entropy kill systems, not guns (grin).
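A hypothetical fix for that scenario (class and property names are illustrative only): rather than parking a 100,000-row DataSet in SQL Server session state, store only the tiny, serializable state needed to rebuild the view, and re-query one page of rows per request.

```csharp
using System;

// A few dozen bytes of session state instead of megabytes: just enough
// to re-create the grid the user was looking at.
[Serializable]
public class GridState
{
    public string Filter;
    public int PageIndex;
    public int PageSize = 50;
}

// In the page:
//   Session["grid"] = state;   // cheap to serialize to SQL Server
// On each request, fetch only rows starting at (PageIndex * PageSize),
// PageSize at a time, from the database - never the full result set.
```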
As a side note, this might also be very useful to me in the near future, as we're currently working on a dataserver architecture for a distributed computing system which will involve extremely large object counts as well as large individual object sizes in some cases, being shuffled around between various systems depending what work needs to be done on the object at the time...
Well, this demands the highest levels of .NET expertise and architectural expertise IN GENERAL across all the .NET Distributed Object technologies. I would be happy to offer ideas as I have done on many systems like this (most for global Financial Services firms in New York and London). I am sure others here could help as well.
Remember: There is basically never a 'best' solution for a scenario but there are almost always MANY bad solutions.
A fast and stable serialization and transport system would benefit this project greatly, and the existing systems have already proven somewhat insufficient for our needs.
No, you already possess a 'fast and stable serialization and transport system' in .NET.
This amazing system is called Remoting and represents millions and millions of dollars of investment. What you DO NOT seem to have is an architecture that will use this 'fast and stable serialization and transport system' in the best way for your needs.
Thanks,
Damon Carr
|
|
|
|
|
Hi Damon,
Thank you for your lengthy response.
Regarding our current data server design, we've been going with a standard setup of .NET Remoting for transport and Serialization for persistence. We have already done a lot of work on that. My main goal was not to think about changing our existing design to implement the custom serialization done here, but rather to gain some perspective on our design by hearing, in detail, why the original poster chose to implement this design.
I agree that there as many ways to skin a cat as there are cats with skin, but as everyone knows, you generally start with a knife and a living cat that you have to first chase down.
On my way walking to the light-rail a couple days ago, I saw something that reminded me (for some reason) of this discussion. On a smallish residential street near the city, I saw a crow sort of dancing around under a walnut tree. The crow was playing with a fallen walnut. He picked it up in his beak, then flew to the top of a light pole. He waited there a moment, then dropped the nut on the ground. He immediately flew after it, and tapped it around a little while on the ground, before picking it up again, and flying to the top of the pole once more. He dropped the walnut again, and repeated the whole process about two or three times, before finally, the walnut had broken open, and he pecked away at the soft nut meats inside.
As I said, I don't know what the relevance is, but I thought I'd share.
Talk to you soon,
Troy
|
|
|
|
|
Hi Troy
The reasons why I need some optimization are mainly detailed in the article (with some additional detail in these comments) - standard .Net serialization took too much time and memory space and could crash with an out of memory exception under certain circumstances.
Let me start by briefly describing how Serialization works (as I understand it)
Any class that will be involved in serialization/remoting *must* have the [Serializable] attribute applied to it - if the .Net serializer encounters a class anywhere in the object graph that doesn't, it will throw an exception.
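To illustrate the rule above, here is a minimal sketch (class names are hypothetical) where serializing an `Order` fails because the `Customer` it references is not marked:

```csharp
using System;

[Serializable]
public class Order
{
    public int Id;
    public Customer Customer; // every class in the graph must also be [Serializable]
}

// Not marked [Serializable] - attempting to serialize an Order that
// references a Customer will throw a SerializationException.
public class Customer
{
    public string Name;
}
```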
During serialization, .Net will examine all of the private fields within a class and attempt to store them in a binary stream. If a field is a value type then it is stored directly; if it is a reference type then the object is added to the object graph and a reference to it is stored - this is done so that a given object will only be stored once.
The examination of objects is done via reflection so that .Net serialization can cope with, in theory, any type of object without any prior knowledge of it. Now we know that reflection is relatively slow in comparison to direct field access but for most serialization/remoting work the performance is acceptable and you don't need to write any code to make it work.
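As a sketch of this standard, reflection-based mechanism (the class name is hypothetical), an in-process round trip to a MemoryStream looks like this:

```csharp
using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

[Serializable]
public class Point3D
{
    public double X, Y, Z;
}

public static class Demo
{
    public static void Main()
    {
        Point3D original = new Point3D { X = 1, Y = 2, Z = 3 };
        BinaryFormatter formatter = new BinaryFormatter();

        // Serialize to a MemoryStream - purely in-process, no network involved.
        using (MemoryStream stream = new MemoryStream())
        {
            formatter.Serialize(stream, original);
            stream.Position = 0;
            Point3D copy = (Point3D)formatter.Deserialize(stream);
            Console.WriteLine(copy.Z); // 3
        }
    }
}
```

Note that no per-class serialization code was needed - reflection does all the work, at the cost of speed and stream size.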
There are some options to 'help' the .Net serializer during its object examination. You can apply a [NonSerialized] attribute to any field to indicate it should be ignored, for example, but that's about all without writing some code.
The next level of optimization is to implement the ISerializable interface on your class, for which you will need to implement two methods (well, one method and a special constructor). The GetObjectData method allows you to take over the process and store any data you deem necessary to reconstruct your object in the SerializationInfo object passed into the method. It is like a dictionary in that you tag each of your data items with a string. In your deserialization constructor, you do the reverse and extract your data via its string name. One thing you need to be careful of is that the objects you retrieve are not necessarily populated at this point - so don't try to use them - just store them in your fields and they will be populated once serialization is complete.
There is also the optional IDeserializationCallback interface which gives you an opportunity to be 'notified' when deserialization of the entire object graph has been completed, i.e. your objects are now fully populated and usable.
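Putting the two interfaces together, a sketch (class names hypothetical) might look like this:

```csharp
using System;
using System.Runtime.Serialization;

[Serializable]
public class Employee : ISerializable, IDeserializationCallback
{
    private string name;
    private Department department; // may not be populated until the callback

    public Employee(string name, Department department)
    {
        this.name = name;
        this.department = department;
    }

    // Called during serialization: store each item under a string name.
    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        info.AddValue("name", name);
        info.AddValue("dept", department);
    }

    // Special deserialization constructor: extract by the same names.
    protected Employee(SerializationInfo info, StreamingContext context)
    {
        name = info.GetString("name");
        // Don't use 'department' yet - it may not be fully deserialized.
        department = (Department)info.GetValue("dept", typeof(Department));
    }

    // Called once the whole object graph is populated.
    public void OnDeserialization(object sender)
    {
        // 'department' is now safe to use.
    }
}

[Serializable]
public class Department
{
    public string Name;
}
```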
This level allows more control (and possibly some increase in speed since reflection is not used) but involves some manual work on the developer's part. You won't want to do this on all your objects, only those frequently used or that can result in large object graphs.
It still has some drawbacks however. Suppose your object has an object[10] as part of its data. You can either store this as a single object[] or as 10 object items. The latter is actually quicker, but involves writing a loop, and you will have 10 string names all of which take up extra space.
My code is a utility (not a design!) to allow further optimization as a replacement for some or all of the code you would write in the previous ISerializable optimization. It allows you to store, extremely quickly and compactly, all of your class's 'owned' data in a single byte[]. Instead of identifying which data items you want to store and giving them names, you Write each one into a SerializationWriter instance and, once all data has been written, you store the resulting byte[] into the SerializationInfo block using a single name. Most data types have some level of optimization, but where it really shines is when you have data where the type is unknown at compile time, such as an object[], in which case it identifies the type and stores it in its most optimized form. The biggest payback, though, is when you can identify certain 'root' classes that encapsulate potentially many other objects (a DataSet for example) - by having a single SerializationWriter instance store data for the whole object graph, you get the advantages of string tokenization across all your objects and just one single, relatively small, byte[] to store.
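A rough sketch of how that might look inside an ISerializable implementation - note that the exact method names (Write, WriteObject, ToArray, and their Read counterparts) are my reading of the library and should be checked against the download:

```csharp
using System;
using System.Runtime.Serialization;

// Hypothetical class using SerializationWriter/SerializationReader from the
// article to pack all of its data into a single named byte[].
public partial class MyClass
{
    private string name;
    private int count;
    private object[] values;

    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        SerializationWriter writer = new SerializationWriter();
        writer.Write(name);          // string - tokenized/optimized
        writer.Write(count);         // int - stored in minimal bytes
        writer.WriteObject(values);  // object[] - each element typed and optimized
        info.AddValue("data", writer.ToArray()); // one name, one byte[]
    }

    protected MyClass(SerializationInfo info, StreamingContext context)
    {
        SerializationReader reader =
            new SerializationReader((byte[])info.GetValue("data", typeof(byte[])));
        name = reader.ReadString();
        count = reader.ReadInt32();
        values = (object[])reader.ReadObject();
    }
}
```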
So you are still doing some manual coding work for optimization, but typically no more than if you were using the standard .Net way of using ISerializable, and typically only on the few classes that need it.
In my particular case, DataSets and LLBLGenPro entities/collections (2 completely different projects) gave excellent results for size, speed, and (indirectly) memory fragmentation and network usage. The code only needed to be written once for each type but will work for *any* DataSet we now pass across the network regardless of whether it is small or large or even empty and the same for the LLBLGenPro entities/collections we have now or create in the future because they are all ultimately derived from the same class. Write well once, use many times.
Other posters have found value even when they *know* that the objects they are remoting will always be small - see andreboom's comment about saving 40% even on an object that is around 180 bytes. Another poster needed to serialize web state and found FastSerialization to be much faster than Microsoft's LosFormatter which does a similar thing but only for a few, specific types. v2.1 added support for Surrogate helper classes so that you can move the code for FastSerialization into separate classes to help optimize serialization of classes where you don't have control of the source code - hence a WebFastSerializationHelper sample to get you started.
Damon's main point (there are many!) is that your application should have been designed so that the objects going across the network are small and few - anything else indicates a design flaw. I don't necessarily fully subscribe to this point of view. Yes, by all means reduce what is going across the network, but it's not the individual object size that causes problems but the object *graph* size, and that is very rarely predictable. If you let your user enter range criteria, for a set of reports say, then it is very difficult to predict how much data this will actually involve, especially if there are a large number of criteria permutations.
The intention of the article is not to tell you how to design or write your application but provide an option for optimization (and some technique to use) where you deem it necessary or desirable. If you can identify certain classes that will be remoted frequently or in large quantities (as part of an object graph, not just their individual size), it may be worthwhile investing a little time to see if they are optimizable using FastSerialization. If you use DataSets or LLBLGenPro entities then the work has already been done for you.
In your particular case, you seem to have already identified that many and possibly large objects will need to be moved around. You can either speak to Damon who will tell you that your design is wrong or you can see whether the speed is acceptable using as-is .Net remoting and, if not, try ISerializable (.Net style) and then ISerializable (Fast Serialization style). Compression is also an option to look at but bear in mind it is usually applied *after* serialization not *during* and so helps only with the network transmission side.
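To illustrate the compression point - applied *after* serialization, not during - something like .NET 2.0's GZipStream can wrap the already-serialized bytes before they go over the wire:

```csharp
using System.IO;
using System.IO.Compression;

public static class WireHelper
{
    // Compress an already-serialized byte[] to reduce network transmission.
    // This saves bandwidth but adds CPU cost; measure before committing to it.
    public static byte[] Compress(byte[] serialized)
    {
        using (MemoryStream output = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(serialized, 0, serialized.Length);
            }
            return output.ToArray();
        }
    }
}
```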
Cheers
Simon
|
|
|
|
|
@Damon Wow! What a clever man! You're definitely one of those dangerous "code religious" types (database or else). This whole discussion has been taken over by one huge ego massage. Others have found benefit, so why try to convince us that we have not, when we are not blind?
modified on Thursday, February 07, 2008 3:43:19 AM
|
|
|
|
|
Dear Damon Carr:
I wasn't going to add something to a 1½ year old thread, but I see somebody else couldn't resist, so I'll give in to my innate fish-slapping urges and reply, too.
My favorite quote from your original post:
"Can you really do a better job the Microsoft here?"
Well, yes... yes, he can. He has proven he can, and taken the time to analyze the results and post them for us to review. He has given a set of detailed posts explaining the underlying problem, come up with an answer, presented us with the code, and explained in detail what he is doing to address the problem and why.
He has researched shortcomings of his approach and improved his code. He has listened to the input provided by many readers (even you), and incorporated their responses into his solution. He even tracked down a faster compression library (one that I, at least, had never heard of), improving his results even further.
Simon Hewitt is the perfect consultant: smart, insightful, and clever. He has the perfect balance of theory and practicality, and, most importantly, he solves problems (and his solutions are not the horrific mess you make them out to be).
Damon, if I were looking for a company to contract for work, and read this thread, I would dump agilefactor out of consideration and not give them another thought. I would, however, feel very confident in SimmoTech's ability to meet my needs and would be willing to trust them with my most important projects.
Not only does your post come across as quite arrogant, but I see you wearing blinders, on a moral high road that you will follow regardless of the reality in front of you. You say you are willing to acknowledge circumstances, but that sounds like lip-service to me. Nobody truly willing to acknowledge circumstances would even consider writing a reply like yours.
Be honest: if the circumstances required you to pass large quantities of data across the wire, how willing would you be to do it? How much time would you lose, certain there is a 'better' way? At the end of the day, the goal is a working solution, and the key to success is balance.
You probably can't imagine a case where you would have to pass so much data, so answering those questions might be tough for you. I don't have to imagine any specific situation to acknowledge that such a case could easily exist. But as a lead developer for a data-analysis/metrics company, I could rattle off a dozen situations where we have to aggregate a million rows or more by arbitrary parameters and display the results in various graphs and controls across the web. Simon's code is salvation for a problem we face, one that "better design" wouldn't resolve if we spent the rest of the year on it.
You may feel that your post was respectful and encouraged a "lively debate", but I found it condescending, uninformed, and near-sighted, and every reply you posted merely cast you in a worsening light. These posts are now preserved for countless generations to discover when they search for more information about you and your company.
When you are looking for a "lively debate", consider the difference between "I know best and your way is wrong" and statements such as "The problem I see with this approach", or "I've found there's often a better way of handling this much data", or "I'm concerned that a, b, and c".
When I say statements such as these, I mean them, because (a) no matter how much experience I gain, I know that the person across from me has different, but equally valid, experiences; (b) a tone like that of your post will lead to an attack-and-defend debate rather than an open meeting of minds; and (c) that even if I were 100% right, and the other person 100% wrong, talking down to them and implying they have no idea what they're doing is unproductive.
I am extremely unimpressed with your knowledge, maturity, and professionalism, and sincerely hope you take these thoughts into consideration when you work with others.
Best Regards,
James B
modified 23-Jan-13 22:00pm.
|
|
|
|
|
Hi,
You write that the WriteString method is always optimized, so I wonder whether ReadString can deserialize a v1-serialized string object.
Background: I'm using your FastSerializer v1 to store serialized objects in a database (as BLOBs). Is it possible to deserialize these objects using v2 of your FastSerializer, or would there be any problems?
Example:
Greetings
Klaus-Jürgen
|
|
|
|
|
Hi Klaus
It's not really recommended to use different versions for this purpose, since even the tiniest change can result in a failure to deserialize - remoting was the original aim, where the code would always be the same version. (Having said that, I have an app which stores very large DataSets of audit data in a BLOB field - I ran a script to reserialize all of the stored data when I moved to .NET 2.0. That worked for me but might not be appropriate for you.)
Even if the string optimization code was identical between versions, changes to other parts of the code, even reordering the Enum of type codes, might result in the data not being deserializable.
The easiest way to test this for your particular situation is to write a quick app using v2 to read and deserialize your BLOB fields - you will soon get an exception if the data is not exactly in the expected format.
Another alternative to be absolutely safe is to incorporate both versions into your code - just move v1 to a different namespace. You still have the problem of knowing which to run for a given BLOB field though.
Further still, you could change the source code to add some versioning information to the stream.
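As a hypothetical sketch of that versioning idea, you could prefix each BLOB with a format version byte so the correct deserializer can be chosen when it is read back:

```csharp
using System;

public static class VersionedBlob
{
    // Hypothetical scheme: one leading byte identifies the serializer
    // version that produced the rest of the data.
    private const byte CurrentVersion = 2;

    public static byte[] AddVersion(byte[] serialized)
    {
        byte[] result = new byte[serialized.Length + 1];
        result[0] = CurrentVersion;
        Array.Copy(serialized, 0, result, 1, serialized.Length);
        return result;
    }

    public static byte[] StripVersion(byte[] blob, out byte version)
    {
        version = blob[0]; // dispatch to the v1 or v2 reader based on this
        byte[] data = new byte[blob.Length - 1];
        Array.Copy(blob, 1, data, 0, data.Length);
        return data;
    }
}
```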
Persisting serialized data will always have this problem - even Microsoft have this problem.
Cheers
Simon
|
|
|
|
|
Hi,
Great code !
What's the best approach for enum ?
I'm currently doing this:
writer.Write(MyEnumType.ToString());
... read
_myenumtype = (MyEnumType)Enum.Parse(typeof(MyEnumType),reader.ReadString());
Do you have a better option ?
|
|
|
|
|
Depends on the Enum.
If you will know its type at deserialization time then you can cast it to/from an int and store it optimized.
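For example (the writer/reader names follow the article's library; ReadInt32 is assumed to exist as the counterpart of Write(int)):

```csharp
// Writing: cast the enum to its underlying int so it is stored
// in the optimized integer form rather than as a string.
writer.Write((int)myEnumValue);

// Reading: the enum type is known at compile time, so just cast back.
myEnumValue = (MyEnumType)reader.ReadInt32();
```

This avoids both the string storage cost and the Enum.Parse call on the way back in.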
If you don't know the type, then v2.1 has support in WriteObject for Enums anyway and I hope to release it this week.
Cheers
Simon
|
|
|
|
|