Compressing Persisted DataSets

By Adrian_Moore

Using .NET 2.0 DeflateStream and GZipStream to compress persisted DataSets.
VB, XML, .NET 2.0, WinXP, ADO.NET, VS2005, Dev
Posted: 9 Jun 2005
Updated: 19 Jun 2005

Sample Image

Introduction

Persisting a cached DataSet using the WriteXml method can produce very large XML files. .NET 2.0 introduces a new System.IO.Compression namespace that provides stream objects for compressing data. Of course, compression and decompression are not helpful if they make reading or writing the XML file significantly slower. The image at the beginning of this article shows the performance differences of reading and writing XML files with and without compression. A 35 MB XML file was used during the development of this article, but a smaller file is included in the sample code. The remainder of this article highlights important sections of the sample code. You will need Visual Studio 2005 Beta 2 to build the sample or, as a minimum, the .NET 2.0 redistributable installed to run the demo.

Using the code

As a baseline, the time to read and write the raw XML file into and out of the DataSet was measured.

ts1 = New TimeSpan(Now.Ticks)
ds.ReadXml("..\input.xml")
ts2 = New TimeSpan(Now.Ticks)
Console.WriteLine("Took " & ts2.Subtract(ts1).ToString _
                                   & " to read raw XML")

ts1 = New TimeSpan(Now.Ticks)
ds.WriteXml("test.xml")
ts2 = New TimeSpan(Now.Ticks)
Console.WriteLine("Took " & ts2.Subtract(ts1).ToString _
                                  & " to write raw XML")
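The same baseline can also be taken with the Stopwatch class from System.Diagnostics, which is intended for measuring elapsed time. The following is a minimal sketch rather than the sample's actual code: the table layout, the row count and the file name timing-demo.xml are made up for illustration, and the DataSet is built in memory so the sketch does not depend on input.xml.

```vbnet
Imports System
Imports System.Data
Imports System.Diagnostics

Module TimingSketch
    ' Builds a throwaway DataSet; the schema is purely illustrative.
    Function BuildDemoDataSet(ByVal rowCount As Integer) As DataSet
        Dim ds As New DataSet("Demo")
        Dim items As DataTable = ds.Tables.Add("Items")
        items.Columns.Add("Id", GetType(Integer))
        items.Columns.Add("Name", GetType(String))
        For i As Integer = 1 To rowCount
            items.Rows.Add(i, "Item " & i.ToString())
        Next
        Return ds
    End Function

    Sub Main()
        Dim ds As DataSet = BuildDemoDataSet(10000)

        ' Time the write of the raw XML file.
        Dim watch As Stopwatch = Stopwatch.StartNew()
        ds.WriteXml("timing-demo.xml")
        watch.Stop()
        Console.WriteLine("Took " & watch.Elapsed.ToString() & " to write raw XML")

        ' Time the read of the same file into a fresh DataSet.
        Dim loaded As New DataSet()
        watch = Stopwatch.StartNew()
        loaded.ReadXml("timing-demo.xml")
        watch.Stop()
        Console.WriteLine("Took " & watch.Elapsed.ToString() & " to read raw XML")
    End Sub
End Module
```

Stopwatch.Elapsed is a TimeSpan, so the output has the same shape as the Now.Ticks version above.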

DeflateStream Class

According to Microsoft's documentation:

This class represents the Deflate algorithm, an industry-standard algorithm for lossless file compression and decompression. It uses a combination of the LZ77 algorithm and Huffman coding. Data can be produced or consumed, even for an arbitrarily long sequentially presented input data stream, using only an a priori bounded amount of intermediate storage. The format can be implemented readily in a manner not covered by patents. For more information, see the RFC 1951: DEFLATE 1.3 specification.

To write the DataSet's XML to a compressed stream, the only additional steps prior to calling WriteXml are to create a file to store the data and to create the DeflateStream object, passing in the file's stream and setting the compression mode to Compress.

Note that if the DataSet is small, the resulting compressed XML file might actually be bigger due to the overhead of the initial header data.

Looking at the image at the beginning of the article, the time to compress and write the output stream is only about 10-30% slower than writing the raw XML to a file. However, the resulting file is about 9 times smaller.

outfile = New FileStream("test.xmd", FileMode.Create, FileAccess.Write)
DefStream = New DeflateStream(outfile, CompressionMode.Compress, False)

ds.WriteXml(DefStream)

Reading the compressed XML file works much the same way. Prior to calling the DataSet's ReadXml method, the file is opened in Read mode and the DeflateStream object is created, passing in the file's stream and setting the compression mode to Decompress. The time to decompress and read the input stream is only about 10-30% slower than reading the raw XML from a file.

infile = New FileStream("test.xmd", FileMode.Open, FileAccess.Read)
DefStream = New DeflateStream(infile, CompressionMode.Decompress, False)

ds.ReadXml(DefStream)
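To see the write and read halves work together, here is a hedged, self-contained sketch of a full round trip. It is not taken from the sample: it uses a MemoryStream instead of a file, the helper names CompressDataSet and DecompressDataSet are hypothetical, and it writes the schema with XmlWriteMode.WriteSchema (the article's snippets do not) so that column types survive the trip.

```vbnet
Imports System
Imports System.Data
Imports System.IO
Imports System.IO.Compression

Module DeflateRoundTripSketch
    ' Writes the DataSet (with its schema) through a DeflateStream
    ' and returns the compressed bytes.
    Function CompressDataSet(ByVal ds As DataSet) As Byte()
        Dim target As New MemoryStream()
        ' leaveOpen = True keeps the MemoryStream usable after the
        ' DeflateStream is closed; End Using flushes the final block.
        Using deflate As New DeflateStream(target, CompressionMode.Compress, True)
            ds.WriteXml(deflate, XmlWriteMode.WriteSchema)
        End Using
        Return target.ToArray()
    End Function

    ' Reads a DataSet back from bytes produced by CompressDataSet.
    Function DecompressDataSet(ByVal data As Byte()) As DataSet
        Dim restored As New DataSet()
        Using source As New MemoryStream(data)
            Using inflate As New DeflateStream(source, CompressionMode.Decompress)
                restored.ReadXml(inflate)
            End Using
        End Using
        Return restored
    End Function

    Sub Main()
        Dim ds As New DataSet("Demo")
        Dim items As DataTable = ds.Tables.Add("Items")
        items.Columns.Add("Id", GetType(Integer))
        items.Columns.Add("Name", GetType(String))
        For i As Integer = 1 To 1000
            items.Rows.Add(i, "Item " & i.ToString())
        Next

        Dim packed As Byte() = CompressDataSet(ds)
        Dim restored As DataSet = DecompressDataSet(packed)

        Console.WriteLine("Compressed size: " & packed.Length.ToString() & " bytes")
        Console.WriteLine("Rows read back: " & restored.Tables("Items").Rows.Count.ToString())
    End Sub
End Module
```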

GZipStream Class

Again, according to Microsoft's documentation:

This class represents the gzip data format that uses an industry-standard algorithm for lossless file compression and decompression. The format includes a cyclic redundancy check value for detecting data corruption. gzip uses the same algorithm as the DeflateStream class, but can be extended to use other compression formats. The format can be implemented readily in a manner not covered by patents. The format for gzip is available from the RFC 1952: GZIP 4.3 specification.

In the prior code snippets, DeflateStream can be replaced with GZipStream. Interestingly, the GZipStream class is slightly slower than the DeflateStream class. This is probably due to the extra work the gzip format requires on top of the Deflate data, such as the header and the cyclic redundancy check mentioned above.
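The size side of this comparison is easy to check. The gzip format (RFC 1952) wraps a Deflate stream in a header and a trailer holding the CRC-32 and the original length, so gzip output should come out somewhat larger than raw Deflate output for the same input. The sketch below is an illustration under assumptions, not a benchmark: the two payloads are invented, the helper names are hypothetical, and it measures size only, not time. It also shows the earlier point that a tiny input can grow when compressed.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Compression
Imports System.Text

Module FormatOverheadSketch
    ' Returns the number of bytes produced by DeflateStream for the input.
    Function DeflateSize(ByVal data As Byte()) As Long
        Dim target As New MemoryStream()
        Using z As New DeflateStream(target, CompressionMode.Compress, True)
            z.Write(data, 0, data.Length)
        End Using
        Return target.Length
    End Function

    ' Returns the number of bytes produced by GZipStream for the input.
    Function GZipSize(ByVal data As Byte()) As Long
        Dim target As New MemoryStream()
        Using z As New GZipStream(target, CompressionMode.Compress, True)
            z.Write(data, 0, data.Length)
        End Using
        Return target.Length
    End Function

    ' Prints the raw, Deflate and gzip sizes for one payload.
    Sub Report(ByVal label As String, ByVal content As String)
        Dim data As Byte() = Encoding.UTF8.GetBytes(content)
        Console.WriteLine(label & ": raw=" & data.Length.ToString() & _
            ", deflate=" & DeflateSize(data).ToString() & _
            ", gzip=" & GZipSize(data).ToString())
    End Sub

    Sub Main()
        ' A tiny document: the format overhead dominates.
        Report("tiny", "<Demo />")

        ' A large, repetitive document: compression pays off.
        Dim sb As New StringBuilder()
        For i As Integer = 1 To 5000
            sb.Append("<Items><Country>Canada</Country></Items>")
        Next
        Report("repetitive", sb.ToString())
    End Sub
End Module
```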

BinaryFormatter Class (Update)

Based on feedback from the initial article, someone suggested I look at the size and timing of a DataSet stored using the BinaryFormatter class and the DataSet's RemotingFormat property. As DataSets are typically passed between tiers of a distributed application, XML can generate rather large data packets. In an effort to avoid this, Microsoft has provided a way to persist a DataSet in binary form by setting the DataSet's RemotingFormat property to SerializationFormat.Binary.

Dim formatter As New BinaryFormatter

ds.RemotingFormat = SerializationFormat.Binary
formatter.Serialize(outfile, ds)

Based on the image above, it can be seen that the binary file is about 1/3 the size of the original XML file, but roughly 3 times larger than the compressed XML file. However, the time to load the binary DataSet into memory is almost 6 times faster than reading an XML file. Out of interest, I decided to compress the binary file just to see if the binary data could be compressed any further.

Dim formatter As New BinaryFormatter
ds.RemotingFormat = SerializationFormat.Binary

DefStream = New DeflateStream(outfile, CompressionMode.Compress, False)
formatter.Serialize(DefStream, ds)

The result was that compressing the binary data costs additional time for only a marginal decrease in size, so it is probably not worth doing.
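The snippets in this section only show the write side. As a hedged sketch of the complete cycle, the following serializes a DataSet in binary form and loads it back with BinaryFormatter.Deserialize. The helper names SaveBinary and LoadBinary and the demo data are hypothetical. It also assumes the .NET Framework the article targets, since newer .NET releases mark BinaryFormatter as obsolete.

```vbnet
Imports System
Imports System.Data
Imports System.IO
Imports System.Runtime.Serialization.Formatters.Binary

Module BinaryRoundTripSketch
    ' Serializes the DataSet in binary remoting format and returns the bytes.
    Function SaveBinary(ByVal ds As DataSet) As Byte()
        ds.RemotingFormat = SerializationFormat.Binary
        Dim formatter As New BinaryFormatter()
        Using target As New MemoryStream()
            formatter.Serialize(target, ds)
            Return target.ToArray()
        End Using
    End Function

    ' Rebuilds a DataSet from bytes produced by SaveBinary.
    Function LoadBinary(ByVal data As Byte()) As DataSet
        Dim formatter As New BinaryFormatter()
        Using source As New MemoryStream(data)
            Return DirectCast(formatter.Deserialize(source), DataSet)
        End Using
    End Function

    Sub Main()
        Dim ds As New DataSet("Demo")
        Dim items As DataTable = ds.Tables.Add("Items")
        items.Columns.Add("Id", GetType(Integer))
        items.Columns.Add("Name", GetType(String))
        For i As Integer = 1 To 1000
            items.Rows.Add(i, "Item " & i.ToString())
        Next

        Dim stored As Byte() = SaveBinary(ds)
        Dim restored As DataSet = LoadBinary(stored)

        Console.WriteLine("Binary size: " & stored.Length.ToString() & " bytes")
        Console.WriteLine("Rows read back: " & restored.Tables("Items").Rows.Count.ToString())
    End Sub
End Module
```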

Points of Interest

I found out that it's very important to explicitly close both the compression stream and the output file stream after calling the WriteXml method of the DataSet.

outfile = New FileStream("test.xmz", FileMode.Create, FileAccess.Write)
ZipStream = New GZipStream(outfile, CompressionMode.Compress, False)

ds.WriteXml(ZipStream)

' neglecting to close either of the following streams
' results in a corrupted file when trying to read later
ZipStream.Close() ' important to close this first to flush compressed stream
outfile.Close()   ' important to close this second to flush output stream

As the comment indicates, if these streams are not explicitly closed, the file will be corrupt. An "Unexpected End of File" exception will be thrown later when using ReadXml to read the compressed file. This wasn't mentioned in Microsoft's documentation and may be due to a bug in the beta software, although it is also what you would expect if the compression stream buffers data and only writes its final block when it is closed.
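One way to make the close order hard to get wrong is the Using statement added in Visual Basic 2005, which disposes the inner stream before the outer one even if WriteXml throws. The following is a sketch under assumptions, not the sample's code: the helper names, the one-row DataSet and the file name using-demo.xmz are invented for illustration.

```vbnet
Imports System
Imports System.Data
Imports System.IO
Imports System.IO.Compression

Module CloseOrderSketch
    ' Writes the DataSet as gzip-compressed XML.
    Sub SaveCompressed(ByVal ds As DataSet, ByVal fileName As String)
        Using outfile As New FileStream(fileName, FileMode.Create, FileAccess.Write)
            Using zip As New GZipStream(outfile, CompressionMode.Compress, False)
                ds.WriteXml(zip)
            End Using ' GZipStream closes first: final block and trailer are written
        End Using     ' FileStream closes second (harmless if already closed)
    End Sub

    ' Reads a DataSet back from a file written by SaveCompressed.
    Function LoadCompressed(ByVal fileName As String) As DataSet
        Dim ds As New DataSet()
        Using infile As New FileStream(fileName, FileMode.Open, FileAccess.Read)
            Using zip As New GZipStream(infile, CompressionMode.Decompress, False)
                ds.ReadXml(zip)
            End Using
        End Using
        Return ds
    End Function

    Sub Main()
        Dim ds As New DataSet("Demo")
        Dim items As DataTable = ds.Tables.Add("Items")
        items.Columns.Add("Name", GetType(String))
        items.Rows.Add("example")

        SaveCompressed(ds, "using-demo.xmz")
        Dim restored As DataSet = LoadCompressed("using-demo.xmz")
        Console.WriteLine("Rows read back: " & restored.Tables("Items").Rows.Count.ToString())
    End Sub
End Module
```

Because the GZipStream is created with leaveOpen set to False, disposing it also closes the underlying FileStream; the outer End Using then has nothing left to do.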

After running the demo, try renaming the resulting .xmd or .xmz files to .zip. Windows cannot read the renamed file as a valid Zip file, because a raw Deflate or gzip stream is not a Zip archive. While some might consider this a limitation, I think this is a simple way to protect the contents of the XML data from the casual user.

It's worth mentioning that for those looking for more flexibility in compression formats, the SharpZipLib project has been around since the early days of .NET and provides Zip, GZip, Tar and BZip2 archive formats.

Conclusions

The new GZipStream and DeflateStream classes greatly reduce the size of a DataSet persisted to a file with little additional cost.

Storing a DataSet in binary format does reduce the size of the DataSet persisted to a file, but not as much as the compression scheme. However, the binary file is much faster to load afterwards.

I hope this article has been helpful to someone. Don't forget to vote.

History

  • 05-06-2005 - Initial release.
  • 14-06-2005 - Updated with serializing to binary format.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Adrian_Moore


Member
Adrian Moore is the Development Manager for the SCADA Vision system developed by ABB Inc in Calgary, Alberta.

He has been interested in compilers, parsers, real-time database systems and peer-to-peer solutions since the early 90's. In his spare time, he is currently working on a SQL parser for querying .NET DataSets (http://www.queryadataset.com).

Adrian is a Microsoft MVP for Windows Networking.
Occupation: Web Developer
Location: Canada

Comments
Nice article - CodeWizard1951, 10:07 22 Dec '09
Thanks for this article. I wish I had read it two years ago. I appreciate the simple examples that are easily understood.

CodeWiz51

-- Life is not a spectator sport. I came to play.

Good article - djlove, 8:20 9 Jul '08
Thanks for a good article, on a topic that catches a lot of people out, like me!

For what it's worth, with a real-world DataTable containing a lot of duplicated values in its large rows and areas of emptiness, these are the statistics:

Xml = 75.9 MB
Binary (by setting the DataTable.RemotingFormat = Binary) = 9.7 MB (13% of the xml size)
Binary with DeflateStream compression = 1.6 MB (2% of the original size)

In addition the serialization of DataTables and DataSets can take a long time (not surprising given the quantity of xml produced) and I found:

Xml serialise = 4250 ms, deserialise = 5587 ms
Binary serialise = 734 ms, deserialise = 344 ms

So the Binary representation is roughly 6x faster to serialise and 16x faster to deserialise, about an order of magnitude overall. Compression approximately doubled the time taken in the binary case.

Relative to the default behaviour if you are sending large datatables using remoting you can make huge savings in terms of size and speed. Also if you are saving datatables for later use again you will make massive storage savings. Good luck!

Binary Formatter - Chris Lennon, 14:55 28 Dec '06
"The result was that the additional time to compress the binary data gives a marginal decrease in size and so is probably not worth using."

Not according to the tests I have run. Compressing a binary file yields a file 40% of the size of a compressed non-binary (i.e. non-serialized) file.

This is a 60% saving in file size and is thus well worth it in some cases (e.g. if you are going to move datasets over the wire)

I am using in memory datasets / streams rather than XML files so this could account for the difference

See
http://www.eggheadcafe.com/articles/20041128.asp

for a great implementation of this approach

Re: Binary Formatter - Chris Lennon, 9:24 2 Jan '07
OK, that's not really the full story. I was using static data (e.g. a line of text 'this is the data') and adding it to a datatable over and over. When I made my data dynamic (i.e. a GUID), compressing the binary indeed did not make much of a difference compared to compressing the XML.

It's quite interesting - it seems the binary formatter includes an algorithm to basically say 'repeat this data' thus massively compressing the size of a data table that contains multiple rows of repeated data

In some cases data will be repeated (e.g. you have a customer table and 90% of the rows have the same information in the 'Country' column) In these cases compressing the binary may be worthwhile

But this is untested.

BinaryFormatter - Uri N., 12:55 10 Jun '05
Compressing Persisted DataSets can be achieved by using BinaryFormatter serialization which also decreases the XML size.
Have you tested, by any chance, the performance of this approach?

Thanks

Re: BinaryFormatter - Adrian_Moore, 5:10 12 Jun '05
Not according to Microsoft's online documentation:

"DataSet objects are serialized as XML even if you use the binary formatter. This means that the output stream is not compact"

I checked this and sure enough, the binary formatting improves times to read and write the dataset, but the file size does not improve.

Re: BinaryFormatter - Uri N., 14:15 12 Jun '05
Hi,

If you use the DataSet.RemotingFormat = SerializationFormat.Binary; and BinaryFormatter to serialize the DataSet you actually improve the file size
(the file itself no longer maintains its XML format).

You can find the info here:
First Look at ADO.NET 2.0

Re: BinaryFormatter - gxdata, 23:01 8 Feb '06
The disk file won't retain its XML format but I assume you can read it back via an appropriate method, to XML format?

What's the size comparison and the speed, for DeflateStream and GZipStream versus the BinaryFormatter method?

Last Updated: 19 Jun 2005
Editor: Smitha Vijayan
Copyright 2005 by Adrian_Moore