GZipStream is Helpful, But Has Some Missing Features

28 Jul 2009

I recently had to work around a problem in a particularly ugly way (which I won't detail :-) ), so after that painful experience, I opted to create a class to solve my specific issue in a sane and reusable manner! Out of this unexpected need, the class “GZipHelper” was born. This is really just a wrapper around the base .NET System.IO.Compression.GZipStream. It was kind of a sad day, as I really didn’t want to be writing this type of wrapper code. I was hoping the feature would just be natively available in the existing GZipStream class so that I could get on with solving the real business problem at hand.

Firstly, it should be said that the standard GZipStream does what I’m sure the MS engineers intended it to do, which was HTTP-based compression (at least I think that was its expected purpose). However, it is certainly not a fully featured class, and it is not especially easy to use for programmers looking for quick and convenient access to GZip compression.

Specifically, I needed to know how big a given “.gz” file would be once decompressed, without fully reading and decompressing it. It seemed trivial enough – “gzip.exe -l” does what I needed – but no amount of hunting within MSDN helped. So on to the ever handy GZip Wikipedia entry, which detailed enough of the file format and provided the reference to the “GZIP file format specification version 4.3” (RFC 1952).

So armed with this information, we can start to decode the GZip file format to extract the length. In fact, this class checks whether the file is GZip compressed and returns the decompressed length if it is, or the regular file length if it is not.
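The length lookup described above comes down to two fields defined in RFC 1952: the two magic bytes (0x1F 0x8B) at the start of the file, and the four-byte ISIZE value at the very end, which stores the uncompressed size modulo 2^32 with the least significant byte first. The following is only a minimal sketch of that idea, not the actual GZipHelper code; the class and method names are made up for illustration.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of GZipHelper: reads the ISIZE trailer field.
public static class GZipLengthSketch
{
    // Returns the uncompressed length recorded in the GZip trailer, or the
    // plain file length when the file does not start with the GZip magic bytes.
    public static long GetDecompressedLength(string path)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            // A GZip member has at least a 10-byte header and an 8-byte trailer.
            if (fs.Length < 18 || fs.ReadByte() != 0x1F || fs.ReadByte() != 0x8B)
            {
                return fs.Length;
            }

            // ISIZE occupies the last four bytes of the file.
            fs.Seek(-4, SeekOrigin.End);
            byte[] isize = new byte[4];
            int read = 0;
            while (read < isize.Length)
            {
                int n = fs.Read(isize, read, isize.Length - read);
                if (n <= 0)
                {
                    throw new EndOfStreamException("Unexpected end of GZip trailer.");
                }
                read += n;
            }

            // Assemble the little-endian value explicitly so the result does
            // not depend on the endianness of the machine.
            return (uint)(isize[0] | (isize[1] << 8) | (isize[2] << 16) | (isize[3] << 24));
        }
    }
}
```

Two caveats follow from the format itself: because ISIZE is stored modulo 2^32, the value wraps for inputs of 4 GiB or more, and for a file made of several concatenated GZip members it only reflects the last member.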

The following class functions have been implemented (only the signatures are shown here; see the bottom of the article for the link to the full project):

C#
/// <summary>
/// Utility class to help with managing GZip (.gz) files in .Net
/// </summary>
/// <remarks>
/// This is a trivial wrapper class on top of <see cref="GZipStream"/> that does a little magic
/// under the covers by looking at the underlying data format and retrieves the
/// stored data information within the GZip compressed file.
/// </remarks>
public class GZipHelper
{
   /// <summary>
   /// Gets the compressed file details
   /// </summary>
   /// <param name="filename">The filename.</param>
   /// <returns>True if file exists, else false</returns>
   public bool GetFileDetails(string filename);

   /// <summary>
   /// Gets the compressed file information from a file stream
   /// </summary>
   /// <param name="fileStream">The file stream.</param>
   /// <remarks>
   /// Definitions provided by RFC 1952 -GZIP File Format Specification (May 1996).
   /// Coding was performed against ftp://ftp.isi.edu/in-notes/rfc1952.txt
   /// </remarks>
   public void GetFileInformation(FileStream fileStream);

   /// <summary>
   /// Compresses the file
   /// </summary>
   /// <param name="filename">The filename.</param>
   /// <param name="overWriteExisting">if set to <c>true</c> [over write existing].</param>
   /// <returns></returns>
   public void CompressFile(string filename, bool overWriteExisting);

   /// <summary>
   /// Decompresses the file.
   /// </summary>
   /// <param name="filename">The filename.</param>
   /// <param name="overWriteExisting">if set to <c>true</c> [over write existing].</param>
   /// <returns></returns>
   public bool DecompressFile(string filename, bool overWriteExisting);

   /// <summary>
   /// Returns a seekable stream into either a file or compressed file (defaults read-only)
   /// </summary>
   /// <remarks>
   /// Decompresses the stream into a <see cref="MemoryStream"/> if the file is compressed
   /// otherwise just returns back a regular <see cref="FileStream"/> as a <see cref="Stream"/>
   /// </remarks>
   /// <param name="filename">The filename to open.</param>
   /// <returns>Reference to opened stream</returns>
   public Stream GetSeekableStream(string filename);
}

In addition to these methods, the following properties are available:

  • CompressedLength – Size of the compressed file (or regular file size if not compressed)
  • DecompressedLength – Size of the file if it were uncompressed (or regular file size if not compressed)
  • IsTextFile – Indicates if GZip thought the file was text based, potentially leading to better compression
  • CompressionModeValue – Numeric indication of the compression mode used
  • CRC16Present – Indicates a CRC16 is available for the file
  • ExtraFieldsPresent – Additional meta fields are available in the file
  • FileNamePresent – GZip contains the original file name
  • FileCommentPresent – Compressed file has a comment associated with it
  • IsCompressed – Indicates if the file is GZip compressed or not
  • CompressedDate – If stored, this is the date the file was compressed.
  • CRC32 – CRC32 value associated with the file
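Most of these properties map onto the fixed 10-byte header that RFC 1952 defines: two magic bytes, the compression method (CM), a flag byte (FLG), a four-byte modification time (MTIME, Unix seconds, 0 meaning "not available"), then the XFL and OS bytes. The sketch below shows how such a header could be decoded. It is an illustration under those assumptions rather than the GZipHelper source, and the type and member names are hypothetical.

```csharp
using System;

// Hypothetical type for illustration; the field names are not GZipHelper's.
public sealed class GZipHeaderSketch
{
    public bool IsCompressed;        // magic bytes 0x1F 0x8B found
    public int CompressionMethod;    // CM byte; 8 means deflate
    public bool IsTextFile;          // FLG bit 0 (FTEXT)
    public bool HeaderCrcPresent;    // FLG bit 1 (FHCRC, a CRC16 of the header)
    public bool ExtraFieldsPresent;  // FLG bit 2 (FEXTRA)
    public bool FileNamePresent;     // FLG bit 3 (FNAME)
    public bool FileCommentPresent;  // FLG bit 4 (FCOMMENT)
    public DateTime? ModifiedUtc;    // MTIME, null when stored as 0

    // Decodes the first 10 bytes of a file. Anything that does not start
    // with the GZip magic bytes is reported as not compressed.
    public static GZipHeaderSketch Parse(byte[] header)
    {
        GZipHeaderSketch info = new GZipHeaderSketch();
        if (header == null || header.Length < 10 || header[0] != 0x1F || header[1] != 0x8B)
        {
            return info;
        }

        info.IsCompressed = true;
        info.CompressionMethod = header[2];

        byte flags = header[3];
        info.IsTextFile = (flags & 0x01) != 0;
        info.HeaderCrcPresent = (flags & 0x02) != 0;
        info.ExtraFieldsPresent = (flags & 0x04) != 0;
        info.FileNamePresent = (flags & 0x08) != 0;
        info.FileCommentPresent = (flags & 0x10) != 0;

        // MTIME is little-endian seconds since 1970-01-01 00:00:00 UTC.
        uint mtime = (uint)(header[4] | (header[5] << 8) | (header[6] << 16) | (header[7] << 24));
        if (mtime != 0)
        {
            info.ModifiedUtc = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc).AddSeconds(mtime);
        }

        return info;
    }
}
```

The CRC32 of the uncompressed data is not in the header; it sits in the eight-byte trailer, immediately before the four-byte uncompressed size.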

Along with the project, there are MSTest harnesses to test the class (trivial implementations). So the features of the class are:

  • Can trivially determine a true file size (regardless of whether it was compressed via GZip or is uncompressed). This makes your code path much more readable if you are dealing with mixed file types.
  • Provides a seekable stream into the compressed file via a MemoryStream. The key is that you don't need to worry about the compression (unless you are reading in BIG files, since a compressed file is decompressed into memory) as you will get back a Stream for either a regular file or a compressed file – both support seeking. This can be handy if your program assumes it can Seek in the stream and you need to access GZip files!
  • Trivial file decompression, which also honors the CompressedDate: if that date is set, the decompressed file is given that creation date.
  • Trivial file compression. Unfortunately, at the time of writing, I’ve not updated the header to include the date of the compressed file. This may come in a later version (and if so I’ll update the blog :-) – but definitely no promises!).
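To make the seekable-stream feature concrete, here is a rough sketch of the approach the list describes: peek at the magic bytes, hand back the FileStream untouched for a regular file, and otherwise inflate everything into a MemoryStream. This is one assumed way of doing it, not the actual GetSeekableStream body, and the class and method names are hypothetical.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical helper for illustration; not the GZipHelper implementation.
public static class SeekableGZipSketch
{
    // Returns a seekable, read-only stream over the file's uncompressed content.
    public static Stream OpenSeekable(string path)
    {
        FileStream file = File.OpenRead(path);
        try
        {
            // GZip files start with the magic bytes 0x1F 0x8B.
            bool isGZip = file.Length >= 18
                && file.ReadByte() == 0x1F
                && file.ReadByte() == 0x8B;
            file.Seek(0, SeekOrigin.Begin);

            if (!isGZip)
            {
                return file; // a regular FileStream already supports seeking
            }

            // GZipStream cannot seek, so the whole payload is decompressed
            // into memory. This is the reason BIG files are a concern.
            MemoryStream buffer = new MemoryStream();
            using (GZipStream gzip = new GZipStream(file, CompressionMode.Decompress))
            {
                byte[] chunk = new byte[81920];
                int n;
                while ((n = gzip.Read(chunk, 0, chunk.Length)) > 0)
                {
                    buffer.Write(chunk, 0, n);
                }
            }

            buffer.Position = 0;
            return buffer;
        }
        catch
        {
            file.Dispose();
            throw;
        }
    }
}
```

Disposing the GZipStream also closes the underlying file, so in the compressed case the caller only has to dispose the returned MemoryStream.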

Simple example usages are (taken straight from the unit tests!):

C#
// Perform a file compression
GZipHelper actual = new GZipHelper();
actual.CompressFile(_fileName, true);

// Perform a file decompression
GZipHelper actual = new GZipHelper();
string fileName = "CSharpHackerSmallTest.txt.gz";
actual.DecompressFile(fileName, true);

// Get a seekable stream
GZipHelper actual = new GZipHelper();
using (Stream dataStream = actual.GetSeekableStream("CSharpHackerSmallTest.txt.gz"))
{
    // Silly seek - but it just shows it can be done
    dataStream.Seek(0, SeekOrigin.Begin);
    StreamReader sr = new StreamReader(dataStream);
    string contents = sr.ReadToEnd();

    Assert.AreEqual(119, contents.Length);
}

// Gets the natural decompressed file length from a compressed file.
// GetFileDetails takes the file name; GetFileInformation expects an open FileStream.
GZipHelper actual = new GZipHelper();
actual.GetFileDetails("CSharpHackerSmallTest.txt.gz");
Assert.AreEqual(119, actual.DecompressedLength);

Finally, it should be noted that by all accounts the standard implementation of GZipStream in the base .NET libraries (actually the DeflateStream) has a problem when attempting to compress random or already compressed data. There is a Microsoft Connect article [http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=93930] that details the issue.

The GZipStream and DeflateStream classes can _significantly_ increase the size of “compressed” data. That means, they don’t just add a few header bytes as stand-alone compressors do, but they _inflate_ the data by as much as 50%. This is apparently because these classes do not check for incompressible data which is a standard feature of all stand-alone compressors. Both classes work fine when the data actually can be compressed.

Please refer to this thread for more details:
http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=179704&SiteID=1
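If you want to see how your own runtime behaves, a quick measurement is easy to put together. The sketch below compresses one buffer of pseudo-random bytes and one buffer of zeros in memory and reports the sizes. The class and method names are made up for illustration, and no expected numbers are given because the result depends on the .NET version in use.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical measurement helper; not part of GZipHelper.
public static class CompressionRatioSketch
{
    // Compresses the data in memory and returns the size of the GZip output.
    public static long CompressedSize(byte[] data)
    {
        using (MemoryStream output = new MemoryStream())
        {
            // leaveOpen = true so the MemoryStream can be inspected after
            // the GZipStream has been disposed and has flushed its trailer.
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress, true))
            {
                gzip.Write(data, 0, data.Length);
            }
            return output.Length;
        }
    }

    // Prints input and output sizes for incompressible and compressible data.
    public static void Report()
    {
        byte[] random = new byte[1 << 20];
        new Random(12345).NextBytes(random); // effectively incompressible
        byte[] zeros = new byte[1 << 20];    // highly compressible

        Console.WriteLine("random: {0} -> {1} bytes", random.Length, CompressedSize(random));
        Console.WriteLine("zeros:  {0} -> {1} bytes", zeros.Length, CompressedSize(zeros));
    }
}
```

If the "random" output comes out noticeably larger than its input, you are seeing the inflation described in the Connect article.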

The base implementation worked for me and met my specific needs without bringing in any third-party DLLs. For those looking to bring this into proprietary software, that incidentally also has the nice benefit of avoiding any licensing discussions with supervisors! If you want a more robust GZipStream implementation, you can check out http://dotnetzip.codeplex.com/. This apparently offers a drop-in replacement, but this class could still be useful even if you use that drop-in replacement as well.

I hope this helps someone out there. :-)

[Download GZipHelper (Source + Project) Here]

This download link will always have the latest and greatest version.

Gareth

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
United States
I'm Gareth, a guy who loves software! My day job is working for a retail company, where I am involved in a large-scale C# project that processes large amounts of data into upstream data repositories.

My work rule of thumb is that everyone spends much more time working than not, so you better enjoy what you do!

Needless to say - I'm having a blast.

Have fun,

Gareth
