12,302,223 members (24,015 online)
alternative version

46.5K views
37 bookmarked
Posted

Fast and Simple Huffman Compressor

, 2 Apr 2012 CPOL
 Rate this:
Fast and simple huffman compressor

Introduction

Huffman tree compression is almost as simple as RLE compression, but can be equally fast and gives more reasonable compression ration, thus is more effective. This source code, along with the binary, is an example of Huffman tree compression.

Background

Huffman tree compression and tree construction is fairly described on Wiki. Huffman tree coding is based on prefixes. So there cannot be two different leaves with exact same paths, even more, there must be a path, that ends up only in one leaf (no intersection leaves). Huffman tree coding, furthermore, is based on a frequency or probability of appearance of each one particular symbol. Given this idea, we work with a priority queue, that is initially filled with small trees, each with its symbol/char and probability of appearance/`ref.count`. We sort our queue/array descending, so most frequent symbols are topmost, while least common are the last. We then merge trees, starting from least one end, going topmost, while merging two trees into one new. After each merging, sorting is performed, so the queue/array is ordered descending again. We do merging, sorting until there is no more than 1 tree in the queue. Then we can finish with tree construction and go to compression, where each symbol is replaced by its path in bits from tree.

Using the Code

The code contain one `public` class `simple_Huffman` in simple_huffman.cpp.

Main Class

```#include <memory.h>     // memset
#include <stdlib.h>     // qsort
typedef unsigned char BYTE;
class simple_Huffman{
private:
class node{
public:
node *left, *right;
int count;       // ref. count
int code;        // code in bits
int codelength;  // code length
BYTE chr;        // ASCII char, serves in compression and then in writing of frequencies
// constructor, memset is faster than individ.
node(void)  {memset(this, 0, sizeof(node));}
~node() {
if (left) {delete left; left = NULL;}
if (right) {delete right; right = NULL;}
}
};
// Array of trees, here will be only one tree at pos.
// 0 when finished with merging
node *trees[256];
node *leaves[256]; // array of leaves, each contains ASCII char,
// ref. count, later code and codelength too. It is a reference to original trees
node *trees_backup[256];
// as the tree is constructed, this stores
// exact steps (movings) of the last tree
int STEPS[256];
int stepscount;
node nodes[256];

void MakeTree();
void MakeTreeD(); // maketree for decompression (no searching)
int treescount;
BYTE *allocatedoutput;
// initialization of leaves with codes and code lengths
void InitializeLeaves(node *, int, int);
// sorting, we want to sort descending
static int comparefreq(const void *A, const void *B);
void moveTreesRight(node **toTree);
void moveTreesRight2(node **toTree);
void tryToPop();
void moveToTop();
void performInternalCleanup();
void initialize();
int LastError;
public:
simple_Huffman();
int Compress(BYTE *input, int inputlength);
int Decompress(BYTE *input, int inputlength);
int CompressFile(char *InFile, char *OutFile);
int DecompressFile(char *InFile, char *OutFile);
void Finalize(); // call after each stream compress/decompress to deallocate
BYTE *getOutput(); // call after each stream compress/decompress to obtain output
int getLastError(); // get code of last error (-1 input file failed
// to open, -2 output failed, -3 output allocation failed)
};```

Using the Code

```int Compress(BYTE *input, int inputlength);   // compress streams, return value is output size
int Decompress(BYTE *input, int inputlength); // decompress streams, return value is output size
int CompressFile(char *InFile, char *OutFile);  // compress file directly, return value is output size
int DecompressFile(char *InFile, char *OutFile); // decompress file directly return value is output size```

Points of Interest

Huffman compression gives fairly better ratio than RLE, and is more effective in general. But, of course, it does not consume patterns, nor repeats. Giving ratio of saving ranges from 20% - 50% in general.

Huffman tree coding is used worldwide, e.g., in deflate gzip, JPEG images, ....

If you want to learn huffman in instant, feel free to visit best Czech page, intended for this subject. Go here and fill up a textfield with some letters. Then click "Cetnosti" to count symbols. You see the small trees? Nice. Now click "Vytvor strom" to create final tree, or try "Jeden krok" for one per click tree merging.

Storing path information in output tree structure

When storing a tree inside an output file, it would be interesting to determine whether `int` is sufficiently large enough, to contain all possible paths, and if there are any bad cases, where it fails. While writing a Huffman table (output tree structure), basically we have several choices to do so. The choices are discussed below.

In a two-pass Huffman implementation (get probabilities, then perform compression/decompression), we need to inform decoder, about a tree structure, used in compression. That structure can be represented by a table, simply with a list of characters and their probabilities. Basically you have a few choices on how to generate an original tree. You can:

1. Send a tree as a whole (path length, path itself, symbol) or
2. send character counts (count, character) or
3. send sorted sequence of characters, present in the original stream and send a "recipe" on how to reconstruct a tree or
4. send a canonic representation of tree

The details:

1. Sending a tree with a path length, path itself and a symbol requires 4+4+1 = 9 bytes (4 bytes length, 4 bytes path, 1 byte char). When there is 256 possible characters, resulting header would be 256*9 = 2304 bytes.
2. Sending character counts require 4 bytes for counts and 1 byte for char. It sums 4+1 = 5 bytes. A 256 characters would cost 256 * 5 = 1280 bytes.
3. Sorted sequence is present in current implementation (3.0), and we will discuss that more closely here. A  tree-making algorithm goes from end to the begining (of trees array), passing through all character trees (a tree with only 1 character) and merging 2 last trees into 1 new tree. Each time, when merging is executed, all remaining trees are re-sorted, to reflect actual change in trees, and to assume, that it will be 2 with minimal probabilities, which are last. Merging takes 2 trees and creates a new tree, which have 1 left and 1 right node (sub-tree), both are those minimal probability trees taken. Their probabilities is summed and included in newly-created tree. Now we need to assume, that trees are in correct order (most to last probable). To do this, we can simply re-sort whole tree. A key thing here is, that re-sorting is not needed at all. Think just deeper about this: re-sorting checks all trees, to make sure they are sorted. But we know, that it is only 1 tree, that changed actually and needs to be checked. So, we simply check, whether there is a better position for that new tree (from last to first). Whether a better position is found, we call this checking step. We store each step (int value = new position). A step list can be named "recipe" as it tells us, how to reconstruct original tree using exact steps.

Our sorted sequence is just characters sorted along their probabilities (from most to last). This is initial array, just before any step has been performed.

So we need at most 256 bytes for characters and 256 bytes for steps, the sum of 512 bytes.  We saved 768 bytes in comparsion to 2. Another benefit is the fact, that we are worrying about path length storing (4 bytes sufficient? see ) no longer.

For a canonical representation you can visit [1].  Or you can see [2], which seems to be easier to read.

References

[1] http://www.arturocampos.com/ac_canonical_huffman.html#Canonical%20Huffman

History

• 01.04.2012 - Added `int CompressFile` and `int DecompressFile`. Header included in the compressed output is now shorter (max before (bytes): 1280, current max: 512). Compression and decompression is now slight faster. This version is not compatible with any previous version.
• Complete history list can be found in the header of the huffman.cpp C++ source file.
• 30.8.2010 - First source revision, classes split into separate headers, plus made public only Compress and Decompress (plus constructor).
• 29.8.2010 - Comments and class rename.

Share

 Software Developer (Junior) SOMI Systems Slovakia
Got 1st and 2nd degree from computer science in 2011 on university of Matej Bel, city of Banska Bystrica. Currently work for computer-oriented company SOMI Systems in Slovakia.
Working for SOMI Systems a.s., http://somi.sk/

You may also be interested in...

 First PrevNext
 Huffman, Shannon-Fano, Arithmetic Encoding and history.... Andy Bantly3-Apr-12 6:49 Andy Bantly 3-Apr-12 6:49
 Re: Huffman, Shannon-Fano, Arithmetic Encoding and history.... `Randor` 8-Apr-12 4:16 `Randor` 8-Apr-12 4:16
 Re: Huffman, Shannon-Fano, Arithmetic Encoding and history.... Jan Mojzis10-Apr-12 4:04 Jan Mojzis 10-Apr-12 4:04
 Re: Huffman, Shannon-Fano, Arithmetic Encoding and history.... Andy Bantly10-Apr-12 4:11 Andy Bantly 10-Apr-12 4:11
 Re: Huffman, Shannon-Fano, Arithmetic Encoding and history.... Jan Mojzis13-May-12 23:26 Jan Mojzis 13-May-12 23:26
 Amount of memory Ben Hugh31-Mar-12 6:22 Ben Hugh 31-Mar-12 6:22
 Re: Amount of memory Jan Mojzis2-Apr-12 3:36 Jan Mojzis 2-Apr-12 3:36
 Re: Amount of memory Jan Mojzis2-Apr-12 9:51 Jan Mojzis 2-Apr-12 9:51
 risk of "self allocated output" Michael Stammberger17-Oct-10 1:44 Michael Stammberger 17-Oct-10 1:44
 Re: risk of "self allocated output" Jan Mojzis27-Oct-10 10:56 Jan Mojzis 27-Oct-10 10:56
 Re: risk of "self allocated output" Jan Mojzis2-Apr-12 9:45 Jan Mojzis 2-Apr-12 9:45
 My vote of 3 JeanWaljean7-Sep-10 20:15 JeanWaljean 7-Sep-10 20:15
 Re: My vote of 3 Jan Mojzis8-Sep-10 2:20 Jan Mojzis 8-Sep-10 2:20
 Dynamic Huffman supercat931-Aug-10 5:56 supercat9 31-Aug-10 5:56
 Re: Dynamic Huffman [modified] Jan Mojzis2-Sep-10 10:19 Jan Mojzis 2-Sep-10 10:19
 Re: Dynamic Huffman JeanWaljean7-Sep-10 19:29 JeanWaljean 7-Sep-10 19:29
 Re: Dynamic Huffman Jan Mojzis2-Apr-12 9:40 Jan Mojzis 2-Apr-12 9:40
 Why public? Niklas Lindquist30-Aug-10 2:49 Niklas Lindquist 30-Aug-10 2:49
 Re: Why public? Jan Mojzis30-Aug-10 9:08 Jan Mojzis 30-Aug-10 9:08
 Re: Why public? JeanWaljean7-Sep-10 19:55 JeanWaljean 7-Sep-10 19:55
 My vote of 2 John M. Drescher29-Aug-10 18:22 John M. Drescher 29-Aug-10 18:22
 Re: My vote of 2 Jan Mojzis2-Apr-12 9:33 Jan Mojzis 2-Apr-12 9:33
 My vote of 1 Bething229-Aug-10 11:47 Bething2 29-Aug-10 11:47
 Re: My vote of 1 Jan Mojzis29-Aug-10 14:51 Jan Mojzis 29-Aug-10 14:51
 Re: My vote of 1 Yuantu Huang31-Aug-10 0:45 Yuantu Huang 31-Aug-10 0:45
 Last Visit: 31-Dec-99 18:00     Last Update: 30-May-16 7:19 Refresh 12 Next »