Introduction
Huffman tree compression is almost as simple as RLE compression, but can be equally fast and gives more reasonable compression ration, thus is more effective. This source code, along with the binary, is an example of Huffman tree compression.
Background
Huffman tree compression and tree construction is fairly described on Wiki. Huffman tree coding is based on prefixes. So there cannot be two different leaves with exact same paths, even more, there must be a path, that ends up only in one leaf (no intersection leaves). Huffman tree coding, furthermore, is based on a frequency or probability of appearance of each one particular symbol. Given this idea, we work with a priority queue, that is initially filled with small trees, each with its symbol/char and probability of appearance/ref.count. We sort our queue/array descending, so most frequent symbols are topmost, while least common are the last. We then merge trees, starting from least one end, going topmost, while merging two trees into one new. After each merging, sorting is performed, so the queue/array is ordered descending again. We do merging, sorting until there is no more than 1 tree in the queue. Then we can finish with tree construction and go to compression, where each symbol is replaced by its path in bits from tree.
Using the Code
The code contain one public class simple_Huffman in simple_huffman.cpp.
Main Class
class simple_Huffman{
private:
node *trees[256]; node *leaves[256]; node *empty; int treescount;
void InitializeLeaves(node *, int, int); static int comparefreq(const void *A, const void *B); public:
simple_Huffman() {for (register int i = 0; i < 256; i++)
{trees[i] = new node(); leaves[i] = trees[i]; } empty = new node();}
~simple_Huffman() {delete empty; for (register int i = 0;
i < treescount; i++) delete trees[i];}
int Compress(BYTE *input, int inputlength, BYTE *output);
int Decompress(BYTE *input, int inputlength,
BYTE *&output); };
Using the Code
int Compress(BYTE *input, int inputlength, BYTE *output);
int Decompress(BYTE *input, int inputlength, BYTE *&output);
Points of Interest
Huffman compression gives fairly better ratio than RLE, and is more effective in general. But, of course, it does not consume patterns, nor repeats. Giving ratio ranges from 20% - 50% in general.
If you want to learn huffman in instant, feel free to visit best Czech page, intended for this subject. Go here and fill up a textfield with some letters. Then click "Cetnosti" to count symbols. You see the small trees? Nice. Now click "Vytvor strom" to create final tree, or try "Jeden krok" for one per click tree merging.
It would be interesting to determine whether int is sufficiently large enough, to contain all possible paths, and if there are any bad cases, where it fails. It can easily happen to have number 5 million, but testing has not proven any overflows so far.
While writing Huffman table, we have several choices to do so. One is to include the full path to each symbol (this case needs to store path and length of path) or just write symbols with their respective probabilities/ref.count. I've chosen the second one, because it's less space consuming.
Huffman tree coding is used worldwide, e.g., in deflate gzip.
History
- 30.8.2010 - First source revision, classes split into separate headers, plus made public only Compress and Decompress (plus constructor)
- 29.8.2010 - Comments and class rename