Click here to Skip to main content
Click here to Skip to main content

How to compress and decompress a file using Huffman Algorithm

By , 14 Jun 2007
 

Introduction

Compressing data is a very important issue when talking about handling data through digital technology, the purpose is to present the same data with the smallest size possible. However, there isn't a magical way which fits all cases meaning this can be done in many ways. It depends on our requirements and the way we store, transmit and check the data.

In my project I suggest one method which takes advantage of the fact that there are groups of files (I call them family) that are similar to each other in context.

Family of files

In many cases we handle many files that came from the same source; for example two or more files with little differences like forms, templates or just some successive frames in video stream.

Instead of storing all the files independently we can find the shared part, isolate it from all the members of the family then link them back to it. In other words we store one shared copy and for every member and just the difference.

But where is the catch?

Nothing new! But I tried to show how to use some techniques we learnt in class and implement that concept. Here is the flow:

  1. Given file f and family F = (f<sub>1</sub>,....,f<sub>n</sub>) where F lies in our database.
  2. Calculate the entropy of f:H(f):
  3. For each couple (f,f<sub>i</sub>) : H(f,f<sub>i</sub>), calculate the entropy value of f X f<sub>i</sub>:H(f,f<sub>i</sub>)
  4. If H(f) < H(f,f<sub>i</sub>) i = 1 .... n, then compress f and finish.
    Else compress f X f<sub>k</sub> where f X f<sub>k </sub>= min{H(f,f<sub>i</sub>) i = 1 .... n} and link it with f<sub>k</sub>.

(*) Compressing files is performed using Huffman's methods.

Examples and Results

Given the following images (binary .bmp files)

Original file

Family file (the twin)

* *

* *

. .

. .

We got the following sizes (in Bytes):

Original file

67014

Compression of the original file

12290

Compression of the Xor result of the above files

8377

That's means we got just 12.5% file size after compression using the twin file and 18.3% when using only compression. In other cases we got even better results. (see the download at the top of this article for further examples)

Using the code

The code is written in C# under Visual Studio 2005. It contains the following main parts:

class Decoding

The input is an array of bytes of the original file. The operations are:

  • Calculating the entropy
  • Calculating the entropy given another array of bytes (I called it twin but it should be a file from family)
  • Decoding: decode the data using Huffman method

class Encoding

The input is DecodedInfo (data after compression). The operations are:

  • Encoding data by reconstructing the original file
  • Saving the result
  • Loading compressed file

class DecodedInfo

This is the structure of the compressed file. It contains the following fields:

  • decodedData: array of bytes containing the result of compression
  • codeMap: variable code which calculates using the Huffman's tree
  • totalBits: size of the result
  • twinFileName: the name of the file which is used to calculate the xor result (it's the file that give the lowest entropy)
  • originalFileName: just the name of the original file I use it when saving the result

class Hamming

The input is a histogram of the bytes in the original file. Using this information, we can build the tree which allow us to calculate the variable length code of the compressed data. This class is simply an implementation of the algorithm we learnt in class.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

About the Author

Hesham Yassin
Instructor/Trainer
Israel Israel
Member
Hesham Yassin is a C#,Assembler, C and VB programmer. He started programming since 2004. The first language he learn was C. between 2006 to 2007, He studied Electonics and built a wireless control device which activate a media player.

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
GeneralCompressing Big Image FilesmemberSachin Dubey1 Oct '08 - 2:25 
Hi,
I want to save image in database but image having big size are occupying lot of space.........so i have to compress it and then save it....i have to compress it upto 90%........
 
pls help and suggest me.........
 
thnx in advance

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web04 | 2.6.130516.1 | Last Updated 14 Jun 2007
Article Copyright 2007 by Hesham Yassin
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid