GZipStream is Helpful, But Has Some Missing Features
I recently had to work around a problem in a particularly ugly way (which I won't detail), so after that painful experience I opted to create a class that solves my specific issue in a sane and reusable manner. Out of this unexpected need, the class “GZipHelper” was born. It is really just a wrapper around the base .NET System.IO.Compression.GZipStream. It was kind of a sad day, as I really didn’t want to be writing this type of wrapper code. I was hoping the functionality would have been natively available in the existing GZipStream class, so that I could get on with solving the real business problem at hand.
Firstly, it should be said that the standard GZipStream provides the functionality I’m sure the MS engineers expected of it, which was HTTP-based compression (at least I think that was its intended purpose). However, it is certainly not a fully featured class that makes it easy for programmers to get quick and helpful access to GZip compression.
Specifically, I needed to know how big any given “.GZ” file would be once decompressed, without fully reading and decompressing it. It seemed trivial enough (“gzip.exe -l” does what I needed), but no amount of hunting within MSDN helped. So on to the ever-handy GZip Wikipedia entry, which detailed enough of the file format and provided the reference to the “GZIP file format specification version 4.3”.
So armed with this information, we can start to decode the GZip file format to extract the length. In fact, this class will check the file to see if it is GZip compressed and returns the decompressed length for that or the regular file length if it is not compressed.
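As a rough illustration of the idea (not the actual GZipHelper code), the sketch below checks for the two GZip magic bytes (0x1F, 0x8B) and, if they are present, reads the ISIZE field that RFC 1952 places in the last four bytes of the file. The class and method names are made up for this example. Note that ISIZE is the uncompressed size modulo 2^32 and only describes the last member of the file, so treat the result as a hint for very large or concatenated archives.

```csharp
using System;
using System.IO;

// Hypothetical helper, for illustration only.
public static class GZipLengthSketch
{
    // Returns the ISIZE trailer value if the stream starts with the GZip
    // magic bytes; otherwise returns the plain stream length.
    public static long GetDecompressedLength(Stream stream)
    {
        if (!stream.CanSeek)
            throw new ArgumentException("Stream must be seekable.", "stream");

        // A GZip member has at least a 10-byte header and an 8-byte trailer.
        if (stream.Length < 18)
            return stream.Length;

        stream.Seek(0, SeekOrigin.Begin);
        int id1 = stream.ReadByte();
        int id2 = stream.ReadByte();
        if (id1 != 0x1F || id2 != 0x8B)
            return stream.Length;

        // ISIZE: last four bytes of the file, stored little-endian.
        stream.Seek(-4, SeekOrigin.End);
        byte[] trailer = new byte[4];
        int read = 0;
        while (read < trailer.Length)
        {
            int n = stream.Read(trailer, read, trailer.Length - read);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }

        uint isize = (uint)trailer[0]
                   | ((uint)trailer[1] << 8)
                   | ((uint)trailer[2] << 16)
                   | ((uint)trailer[3] << 24);
        return isize;
    }
}
```

Only six bytes are read regardless of file size, which is the whole point of looking at the format directly.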
The following class functions have been implemented (see the bottom of the article for the link to the full project):
/// <summary>
/// Utility class to help with managing GZip (.gz) files in .Net
/// </summary>
/// <remarks>
/// This is a trivial wrapper class on top of <see cref="GZipStream"/> that does a little magic
/// under the covers by looking at the underlying data format and retrieves the
/// stored data information within the GZip compressed file.
/// </remarks>
public class GZipHelper
{
    // Signatures only; method bodies are omitted in this listing.

    /// <summary>
    /// Gets the compressed file details
    /// </summary>
    /// <param name="filename">The filename.</param>
    /// <returns>True if file exists, else false</returns>
    public bool GetFileDetails(string filename);

    /// <summary>
    /// Gets the compressed file information from a file stream
    /// </summary>
    /// <param name="fileStream">The file stream.</param>
    /// <remarks>
    /// Definitions provided by RFC 1952 - GZIP File Format Specification (May 1996).
    /// Coding was performed against ftp://ftp.isi.edu/in-notes/rfc1952.txt
    /// </remarks>
    public void GetFileInformation(FileStream fileStream);

    /// <summary>
    /// Compresses the file
    /// </summary>
    /// <param name="filename">The filename.</param>
    /// <param name="overWriteExisting">if set to <c>true</c> [over write existing].</param>
    public void CompressFile(string filename, bool overWriteExisting);

    /// <summary>
    /// Decompresses the file.
    /// </summary>
    /// <param name="filename">The filename.</param>
    /// <param name="overWriteExisting">if set to <c>true</c> [over write existing].</param>
    /// <returns></returns>
    public bool DecompressFile(string filename, bool overWriteExisting);

    /// <summary>
    /// Returns a seekable stream into either a file or compressed file (defaults read-only)
    /// </summary>
    /// <remarks>
    /// Decompresses the stream into a <see cref="MemoryStream"/> if the file is compressed
    /// otherwise just returns back a regular <see cref="FileStream"/> as a <see cref="Stream"/>
    /// </remarks>
    /// <param name="filename">The filename to open.</param>
    /// <returns>Reference to opened stream</returns>
    public Stream GetSeekableStream(string filename);
}
In addition to these methods, the following properties are available:
- CompressedLength – Size of the compressed file (or regular file size if not compressed)
- DecompressedLength – Size of the file if it were uncompressed (or regular file size if not compressed)
- IsTextFile – Indicates if GZip thought the file was text based, potentially leading to better compression
- CompressionModeValue – Numeric indication of the compression mode used
- CRC16Present – Indicates a CRC16 is available for the file
- ExtraFieldsPresent – Additional meta fields are available in the file
- FileNamePresent – GZip contains the original file name
- FileCommentPresent – Compressed file has a comment associated with it
- IsCompressed – Indicates if the file is GZip compressed or not
- CompressedDate – If stored, this is the date the file was compressed
- CRC32 – CRC32 value associated with the file
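Most of these properties map onto the fixed 10-byte member header defined in RFC 1952. The sketch below decodes that header; it is not the GZipHelper source, the type and member names are invented for the example, and the pairing of each field with a GZipHelper property (for instance FHCRC with CRC16Present, or MTIME with CompressedDate) is my assumption based on the descriptions above.

```csharp
using System;

// Hypothetical type, for illustration only.
public sealed class GZipHeaderSketch
{
    public bool IsCompressed;        // magic bytes 0x1F 0x8B found
    public byte CompressionMethod;   // CM byte; 8 means deflate
    public bool IsText;              // FLG bit 0 (FTEXT)
    public bool HasCrc16;            // FLG bit 1 (FHCRC)
    public bool HasExtra;            // FLG bit 2 (FEXTRA)
    public bool HasName;             // FLG bit 3 (FNAME)
    public bool HasComment;          // FLG bit 4 (FCOMMENT)
    public DateTime? ModifiedUtc;    // MTIME; null when stored as 0

    // Expects at least the first 10 bytes of the file.
    public static GZipHeaderSketch Parse(byte[] header)
    {
        var info = new GZipHeaderSketch();
        if (header == null || header.Length < 10 ||
            header[0] != 0x1F || header[1] != 0x8B)
        {
            return info; // not a GZip file; everything stays at its default
        }

        info.IsCompressed = true;
        info.CompressionMethod = header[2];

        byte flags = header[3];
        info.IsText = (flags & 0x01) != 0;
        info.HasCrc16 = (flags & 0x02) != 0;
        info.HasExtra = (flags & 0x04) != 0;
        info.HasName = (flags & 0x08) != 0;
        info.HasComment = (flags & 0x10) != 0;

        // MTIME: seconds since the Unix epoch, little-endian; 0 means "not set".
        uint mtime = (uint)header[4]
                   | ((uint)header[5] << 8)
                   | ((uint)header[6] << 16)
                   | ((uint)header[7] << 24);
        if (mtime != 0)
        {
            info.ModifiedUtc =
                new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc).AddSeconds(mtime);
        }
        return info;
    }
}
```

The CRC32 is not part of the header: RFC 1952 stores it in the trailer, in the four bytes immediately before ISIZE.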
Along with the project, there are MSTest harnesses to test the class (trivial implementations). So the features of the class are:
- Can trivially determine the true file size (regardless of whether the file was compressed via GZip or is uncompressed). This makes your code path much more readable if you are dealing with mixed file types.
- Provides a seekable stream into the compressed file via a MemoryStream. The key is that you don't need to worry about the compression (unless you are reading in BIG files), as you will get back a Stream for either a regular file or a compressed file, and both support seeking. This can be handy if your program assumes it can Seek in the stream and you need to access GZip files!
- Trivial file decompression, which also honors the CompressedDate. If that date is set, the decompressed file is given that creation date.
- Trivial file compression. Unfortunately, at the time of writing, I’ve not updated the header to include the date of the compressed file. This may come in a later version (and if so I’ll update the blog, but definitely no promises!).
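To make the seekable-stream point concrete, here is a minimal sketch of the approach described above: sniff the magic bytes, and either hand back the FileStream or inflate the whole file into a MemoryStream. It is not the GZipHelper implementation; the class and method names are hypothetical, and it assumes .NET 4.0 or later for Stream.CopyTo.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical helper, for illustration only.
public static class SeekableGZipSketch
{
    // Returns a seekable, read-only stream over the (decompressed) file contents.
    public static Stream OpenSeekable(string path)
    {
        var file = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
        try
        {
            bool isGZip = file.Length >= 2
                          && file.ReadByte() == 0x1F
                          && file.ReadByte() == 0x8B;
            file.Seek(0, SeekOrigin.Begin);

            // Plain files are already seekable, so return them untouched.
            if (!isGZip)
                return file;

            // GZipStream cannot seek, so inflate everything into memory.
            var memory = new MemoryStream();
            using (var gzip = new GZipStream(file, CompressionMode.Decompress))
            {
                gzip.CopyTo(memory); // disposing gzip also closes the file
            }
            memory.Seek(0, SeekOrigin.Begin);
            return memory;
        }
        catch
        {
            file.Dispose();
            throw;
        }
    }
}
```

The trade-off is the one flagged in the list: the entire decompressed file lives in memory, so this suits small and medium files rather than BIG ones.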
Simple example usages are (taken straight from the unit tests!):
// Perform a file compression
GZipHelper actual = new GZipHelper();
actual.CompressFile(_fileName, true);

// Perform a file decompression
GZipHelper actual = new GZipHelper();
string fileName = "CSharpHackerSmallTest.txt.gz";
actual.DecompressFile(fileName, true);

// Get a seekable stream
GZipHelper actual = new GZipHelper();
using (Stream dataStream = actual.GetSeekableStream("CSharpHackerSmallTest.txt.gz"))
{
    // Silly seek - but it just shows it can be done
    dataStream.Seek(0, SeekOrigin.Begin);
    StreamReader sr = new StreamReader(dataStream);
    string contents = sr.ReadToEnd();
    Assert.AreEqual(119, contents.Length);
}

// Get the natural decompressed file length from a compressed file.
// GetFileDetails takes a file name; GetFileInformation expects a FileStream.
GZipHelper actual = new GZipHelper();
actual.GetFileDetails("CSharpHackerSmallTest.txt.gz");
Assert.AreEqual(119, actual.DecompressedLength);
Finally, it should be noted that, by all accounts, the standard implementation of GZipStream in the base .NET libraries (actually the underlying DeflateStream) has a problem when attempting to compress random or already compressed data. There is a Microsoft Connect article [http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=93930] that details the issue:
The GZipStream and DeflateStream classes can _significantly_ increase the size of “compressed” data. That means, they don’t just add a few header bytes as stand-alone compressors do, but they _inflate_ the data by as much as 50%. This is apparently because these classes do not check for incompressible data which is a standard feature of all stand-alone compressors. Both classes work fine when the data actually can be compressed. Please refer to this thread for more details:
http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=179704&SiteID=1
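If you want to see how your own runtime behaves, a quick check along these lines should do. This is only a sketch with made-up names: it compresses a buffer through GZipStream and reports the output size, so you can compare random (incompressible) input against its original length. The result depends on the .NET version you run it on, so no expected numbers are given here.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, for illustration only.
public static class IncompressibleDataCheck
{
    // Returns the number of bytes GZipStream produces for the given input.
    public static long CompressedSize(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            // leaveOpen = true so the MemoryStream survives the GZipStream
            // being disposed (disposal flushes the final block and trailer).
            using (var gzip = new GZipStream(output, CompressionMode.Compress, true))
            {
                gzip.Write(data, 0, data.Length);
            }
            return output.Length;
        }
    }

    // Prints the compressed/original ratio for 1 MB of pseudo-random bytes.
    public static void Report()
    {
        byte[] random = new byte[1 << 20];
        new Random(12345).NextBytes(random);

        long compressed = CompressedSize(random);
        Console.WriteLine("Original: {0} bytes, compressed: {1} bytes, ratio: {2:F2}",
            random.Length, compressed, (double)compressed / random.Length);
    }
}
```

A ratio noticeably above 1.0 would indicate the behaviour described in the Connect article.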
The base implementation worked for me and met my specific needs without bringing in any third-party DLLs, which incidentally has the nice benefit, for those looking to bring this into proprietary software, of avoiding any licensing discussions with supervisors! If you want a more robust GZipStream implementation, you can check out http://dotnetzip.codeplex.com/. It apparently provides a drop-in replacement, but this class could still be useful even if you use that replacement as well.
I hope this helps someone out there.
This download link will always have the latest and greatest version.
Gareth