Click here to Skip to main content
Licence CPOL
First Posted 29 Aug 2010
Views 12,726
Downloads 544
Bookmarked 18 times

Fast and Simple Huffman Compressor

By Jan Mojzis | 30 Aug 2010
Fast and simple huffman compressor
1 vote, 12.5%
1
1 vote, 12.5%
2
1 vote, 12.5%
3
1 vote, 12.5%
4
4 votes, 50.0%
5
3.75/5 - 8 votes
μ 3.75, σa 2.77 [?]

Introduction

Huffman tree compression is almost as simple as RLE compression, but can be equally fast and gives more reasonable compression ration, thus is more effective. This source code, along with the binary, is an example of Huffman tree compression.

Background

Huffman tree compression and tree construction is fairly described on Wiki. Huffman tree coding is based on prefixes. So there cannot be two different leaves with exact same paths, even more, there must be a path, that ends up only in one leaf (no intersection leaves). Huffman tree coding, furthermore, is based on a frequency or probability of appearance of each one particular symbol. Given this idea, we work with a priority queue, that is initially filled with small trees, each with its symbol/char and probability of appearance/ref.count. We sort our queue/array descending, so most frequent symbols are topmost, while least common are the last. We then merge trees, starting from least one end, going topmost, while merging two trees into one new. After each merging, sorting is performed, so the queue/array is ordered descending again. We do merging, sorting until there is no more than 1 tree in the queue. Then we can finish with tree construction and go to compression, where each symbol is replaced by its path in bits from tree.

Using the Code

The code contain one public class simple_Huffman in simple_huffman.cpp.

Main Class

//base class simple_Huffman, call only Compress and Decompress
class simple_Huffman{
  private:
     node *trees[256];  // Array of trees, here will be only one tree at pos. 
                        // 0 when finished with merging 
     node *leaves[256]; // array of leaves, each contains ASCII char,
                        // ref. count, later code and codelength too. 
                        // It is a reference to original trees.
     node *empty;       // Pseudoleaf, serving in sorting and merging as empty leaf
                        // avoiding testing on NULL
     int treescount;
     void InitializeLeaves(node *, int, int); // initialization of leaves with 
                                              // codes and code lengths
     static int comparefreq(const void *A, const void *B); // sorting, we want to sort 
                                                           // descending
  public:    
    simple_Huffman() {for (register int i = 0; i < 256; i++) 
	{trees[i] = new node(); leaves[i] = trees[i]; } empty = new node();}
    ~simple_Huffman() {delete empty; for (register int i = 0; 
		i < treescount; i++) delete trees[i];}
     int Compress(BYTE *input, int inputlength, BYTE *output);
     int Decompress(BYTE *input, int inputlength, 
		BYTE *&output); // output is self-allocated!!!   
                                  // DO NOT attempt to allocate it!!!
}; 

Using the Code

int Compress(BYTE *input, int inputlength, BYTE *output);
int Decompress(BYTE *input, int inputlength, BYTE *&output);// output is self-allocated!!!
                                                            // DO NOT attempt to 
                                                            // allocate it!!!

Points of Interest

Huffman compression gives fairly better ratio than RLE, and is more effective in general. But, of course, it does not consume patterns, nor repeats. Giving ratio ranges from 20% - 50% in general.

If you want to learn huffman in instant, feel free to visit best Czech page, intended for this subject. Go here and fill up a textfield with some letters. Then click "Cetnosti" to count symbols. You see the small trees? Nice. Now click "Vytvor strom" to create final tree, or try "Jeden krok" for one per click tree merging.

It would be interesting to determine whether int is sufficiently large enough, to contain all possible paths, and if there are any bad cases, where it fails. It can easily happen to have number 5 million, but testing has not proven any overflows so far.

While writing Huffman table, we have several choices to do so. One is to include the full path to each symbol (this case needs to store path and length of path) or just write symbols with their respective probabilities/ref.count. I've chosen the second one, because it's less space consuming.

Huffman tree coding is used worldwide, e.g., in deflate gzip.

History

  • 30.8.2010 - First source revision, classes split into separate headers, plus made public only Compress and Decompress (plus constructor)
  • 29.8.2010 - Comments and class rename

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Jan Mojzis

Software Developer (Junior)
SOMI Systems
Slovakia Slovakia

Member
Got 1st and 2nd degree from computer science in 2011 on university of Matej Bel, city of Banska Bystrica. Currently work for computer-oriented company SOMI Systems in Slovakia.
Working for SOMI Systems a.s., http://somi.sk/

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board. (secure sign-in)
 
Search this forum  
 FAQ
    Noise  Layout  Per page   
  Refresh
Generalrisk of "self allocated output" PinmemberMichael Stammberger2:44 17 Oct '10  
GeneralRe: risk of "self allocated output" PinmemberJan Mojzis11:56 27 Oct '10  
GeneralMy vote of 3 PinmemberJeanWaljean21:15 7 Sep '10  
GeneralRe: My vote of 3 PinmemberJan Mojzis3:20 8 Sep '10  
GeneralDynamic Huffman Pinmembersupercat96:56 31 Aug '10  
GeneralRe: Dynamic Huffman [modified] PinmemberJan Mojzis11:19 2 Sep '10  
GeneralRe: Dynamic Huffman PinmemberJeanWaljean20:29 7 Sep '10  
QuestionWhy public? PinmemberNiklas Lindquist3:49 30 Aug '10  
AnswerRe: Why public? PinmemberJan Mojzis10:08 30 Aug '10  
GeneralRe: Why public? PinmemberJeanWaljean20:55 7 Sep '10  
GeneralMy vote of 2 PinmemberJohn M. Drescher19:22 29 Aug '10  
GeneralMy vote of 1 PinmemberBething212:47 29 Aug '10  
GeneralRe: My vote of 1 PinmemberJan Mojzis15:51 29 Aug '10  
GeneralRe: My vote of 1 PinmemberYuantu Huang1:45 31 Aug '10  
GeneralRe: My vote of 1 PinmemberBlake Miller13:52 9 Sep '10  
GeneralRe: My vote of 1 PinmemberPete Goodsall6:41 21 Sep '10  
GeneralYou are an idiot PinmemberIvanovaIsGod10:35 26 Sep '11  

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Permalink | Advertise | Privacy | Mobile
Web04 | 2.5.120209.1 | Last Updated 30 Aug 2010
Article Copyright 2010 by Jan Mojzis
Everything else Copyright © CodeProject, 1999-2012
Terms of Use
Layout: fixed | fluid