
Introduction
Compressing data is a very important issue when talking about handling data through digital technology, the purpose is to present the same data with the smallest size possible. However, there isn't a magical way which fits all cases meaning this can be done in many ways. It depends on our requirements and the way we store, transmit and check the data.
In my project I suggest one method which takes advantage of the fact that there are groups of files (I call them family) that are similar to each other in context.
Family of files
In many cases we handle many files that came from the same source; for example two or more files with little differences like forms, templates or just some successive frames in video stream.
Instead of storing all the files independently we can find the shared part, isolate it from all the members of the family then link them back to it. In other words we store one shared copy and for every member and just the difference.
But where is the catch?
Nothing new! But I tried to show how to use some techniques we learnt in class and implement that concept. Here is the flow:
- Given file
f and family F = (f<sub>1</sub>,....,f<sub>n</sub>) where F lies in our database.
- Calculate the entropy of
f:H(f):
- For each couple
(f,f<sub>i</sub>) : H(f,f<sub>i</sub>), calculate the entropy value of f X f<sub>i</sub>:H(f,f<sub>i</sub>)
- If
H(f) < H(f,f<sub>i</sub>) i = 1 .... n, then compress f and finish.
Else compress f X f<sub>k</sub> where f X f<sub>k </sub>= min{H(f,f<sub>i</sub>) i = 1 .... n} and link it with f<sub>k</sub>.
(*) Compressing files is performed using Huffman's methods.
Examples and Results
Given the following images (binary .bmp files)
|
Original file
|
Family file (the twin)
|
|
* *
* *
|
. .
. .
|
We got the following sizes (in Bytes):
|
Original file
|
67014
|
|
Compression of the original file
|
12290
|
|
Compression of the Xor result of the above files
|
8377
|
That's means we got just 12.5% file size after compression using the twin file and 18.3% when using only compression. In other cases we got even better results. (see the download at the top of this article for further examples)
Using the code
The code is written in C# under Visual Studio 2005. It contains the following main parts:
class Decoding
The input is an array of bytes of the original file. The operations are:
- Calculating the entropy
- Calculating the entropy given another array of bytes (I called it twin but it should be a file from family)
Decoding: decode the data using Huffman method
class Encoding
The input is DecodedInfo (data after compression). The operations are:
Encoding data by reconstructing the original file
- Saving the result
- Loading compressed file
class DecodedInfo
This is the structure of the compressed file. It contains the following fields:
decodedData: array of bytes containing the result of compression
codeMap: variable code which calculates using the Huffman's tree
totalBits: size of the result
twinFileName: the name of the file which is used to calculate the xor result (it's the file that give the lowest entropy)
originalFileName: just the name of the original file I use it when saving the result
class Hamming
The input is a histogram of the bytes in the original file. Using this information, we can build the tree which allow us to calculate the variable length code of the compressed data. This class is simply an implementation of the algorithm we learnt in class.
Hesham Yassin is a C#,Assembler, C and VB programmer. He started programming since 2004. The first language he learn was C. between 2006 to 2007, He studied Electonics and built a wireless control device which activate a media player.