12,511,920 members (45,536 online)
alternative version

203.1K views
32 bookmarked
Posted

# Using PPMD for compression

, 11 Jun 2001
 Rate this:
This article presents a class for using PPM to compress a file.

## Theoretical Background

(from Unbounded length contexts for PPM by John G. Cleary, W. J. Teahan, Ian H. Witten

Prediction by partial matching, or PPM, is a finite­context statistical modeling technique that can be viewed as blending together several fixed­order context models to predict the next character in the input sequence. Prediction probabilities for each context in the model are calculated from frequency counts which are updated adaptively; and the symbol that actually occurs is encoded relative to its predicted distribution using arithmetic coding. The maximum context length is a fixed constant, and it has been found that increasing it beyond about six or so does not generally improve compression.

The basic idea of PPM is to use the last few characters in the input stream to predict the upcoming one. Models that condition their predictions on a few immediately preceding symbols are called finite­context models of order k, where k is the number of preceding symbols used. PPM employs a suite of fixed­order context models with different values of k, from 0 up to some pre­determined maximum, to predict upcoming characters.

For each model, a note is kept of all characters that have followed every length­k subsequence observed so far in the input, and the number of times that each has occurred. Prediction probabilities are calculated from these counts. The probabilities associated with each character that has followed the last k characters in the past are used to predict the upcoming character. Thus from each model, a separate predicted probability distribution is obtained.

These distributions are effectively combined into a single one, and arithmetic coding is used to encode the character that actually occurs, relative to that distribution. The combination is achieved through the use of escape probabilities. Recall that each model has a different value of k. The model with the largest k is, by default, the one used for coding. However, if a novel character is encountered in this context, which means that the context cannot be used for encoding it, an escape symbol is transmitted to signal the decoder to switch to the model with the next smaller value of k. The process continues until a model is reached in which the character is not novel, at which point it is encoded with respect to the distribution predicted by that model. To ensure that the process terminates, a model is assumed to be present below the lowest level, containing all characters in the coding alphabet. This mechanism effectively blends the different order models together in a proportion that depends on the values actually used for escape probabilities.

## PPMD_Coder

`PPMD_Coder` is based on PPMD var E by Dmitri Shkarin. You can download the original code here.

I have created a static library around it and an easy to use wrapper class. The `constructor` expects 4 parameters:

• The name of the Input file
• The name of the Output file. If empty or NULL and you `Compress` a file the extension ".ppm" will be added, if you `Uncompress` a file the name stored in the compressed file will be used.
• The memory in MBytes (between 1 and 256) which can be used by the program. It will be dynamically allocated on the heap so don't use unrealistic values. After compression you can look at the statistic to see how much memory was used. It's a good idea to keep the value as low as possible because for decompression the same amount of memory must be allocated or decompression will fail.
• The Order size (between 2 and 16). A value of 6 is a good starting point but it's worth trying out higher values.
• The memory and order size can also be set later with `OrderSize` and `SubAllocatorSize`. Values which are out of range are adjusted, the return value is the size actually set. They have only impact on the compression.

Now you can decide what to do: call `Compress` or `Uncompress`. Both throw an `exception` when an error occurs. Additionally they return `TRUE` on success or `FALSE`. Here is an example:

```#include "ppmd_coder.h"
...

// Use default values: output name will get the extension "ppm",
// MemorySize is 8 and OrderSize is 6
PPMD_Coder ppmd(szInput);

try
{
// Without exceptions you would check the return value.
ppmd.Compress();

// File is compressed now
// Get some statistic:
DWORD m1, m2;
ppmd.GetMemoryUsage(m1, m2);

// m1 = used memory only MBytes
// m2 = used memory only bytes below 1 MBytes
TRACE(_T("Memory used:       %i.%i MBytes\n"), m1, m2);
TRACE(_T("Compression ratio: %2.2f\n"), ppmd.GetRatio());
}
catch (exception &e)
{
// Something went wrong, display error message.
AfxMessageBox(e.what());
}
```

Easy, isn't it? Decompression works the same way only that you call `ppmd.Uncompress()` instead of `Compress()` and you must used `ppmd.GetRatioUncompressed()` to get the right ratio.

## Final words

Please have a look at the demo application. It's very simple but allows you to test the class. With little effort it's possible to support the compression of several files into one archive, check the orginal code for an example. Some special care must be taken so I advise you to study the code carefully before you do anything.

Compare the compressed file with other programs (BZip, GZip, Rar, Ace, ...) and you will be surprised how well the algorithm works. And it's not too slow! The static library could be changed to support some kind of callback to see how much of the file has been processed yet. Maybe in another version...

A list of licenses authors might use can be found here

## Share

 Web Developer Germany
No Biography provided

## You may also be interested in...

 Pro Pro

 First Prev Next
 Question about PPMd Member 103249511-Jun-15 2:21 Member 10324951 1-Jun-15 2:21
 The actual ppm algorithm Sai Hemanth SV22-Mar-12 19:59 Sai Hemanth SV 22-Mar-12 19:59
 hi, i am trying to implement this ppm algorithm and compare with other algorithms. You have given how compression occurs but prediction was not covered.can you give the method where exactly the prediction is taking palce using ppm. pseudocode also will be fine. Thanks in advance, hemanth
 arithmetic code july ryanhar12-Jul-10 20:11 july ryanhar 12-Jul-10 20:11
 What about Dmitry Shkarin's PPMD_J compression for non-stationary sources / files? romulus4-Mar-08 10:52 romulus 4-Mar-08 10:52
 arithmetic coding Ahmed jamil Kattan13-Nov-07 6:04 Ahmed jamil Kattan 13-Nov-07 6:04
 Piece of Art Ahmed jamil Kattan8-Nov-07 10:06 Ahmed jamil Kattan 8-Nov-07 10:06
 Stepwise Algorithum and flowchart desaleritesh5-Nov-07 6:37 desaleritesh 5-Nov-07 6:37
 Re: Stepwise Algorithum and flowchart Andreas Muegge8-Nov-07 21:08 Andreas Muegge 8-Nov-07 21:08
 why not cryptography vivek_vis16-Mar-07 19:18 vivek_vis 16-Mar-07 19:18
 Re: why not cryptography Andreas Muegge21-Mar-07 3:22 Andreas Muegge 21-Mar-07 3:22
 warning while linking demo project Ozge Colak4-Dec-06 22:26 Ozge Colak 4-Dec-06 22:26
 how to choose the best order and subAllocatorSize? [modified] Ozge Colak15-Jun-06 22:12 Ozge Colak 15-Jun-06 22:12
 Re: how to choose the best order and subAllocatorSize? Andreas Muegge19-Jun-06 23:33 Andreas Muegge 19-Jun-06 23:33
 Re: how to choose the best order and subAllocatorSize? Ozge Colak20-Jun-06 5:25 Ozge Colak 20-Jun-06 5:25
 Clarification Sharmila.manoharan16-Jun-05 20:17 Sharmila.manoharan 16-Jun-05 20:17
 Re: Clarification Andreas Muegge22-Jun-05 20:24 Andreas Muegge 22-Jun-05 20:24
 Re: Clarification Sharmila.manoharan22-Jun-05 20:35 Sharmila.manoharan 22-Jun-05 20:35
 Sometimes it's better than ZIP or RAR, but... andystone28-Apr-05 15:19 andystone 28-Apr-05 15:19
 Re: Sometimes it's better than ZIP or RAR, but... Andreas Muegge28-Apr-05 20:35 Andreas Muegge 28-Apr-05 20:35
 Re: Sometimes it's better than ZIP or RAR, but... andystone28-Apr-05 23:28 andystone 28-Apr-05 23:28
 Buffer Compressing (Suggestion) J. A. Acosta21-Apr-05 10:47 J. A. Acosta 21-Apr-05 10:47
 Re: Buffer Compressing (Suggestion) Andreas Muegge21-Apr-05 21:31 Andreas Muegge 21-Apr-05 21:31
 Re: Buffer Compressing (Suggestion) Ozge Colak12-Jun-06 22:29 Ozge Colak 12-Jun-06 22:29
 Problem linking in VC++.NET 2003 paindivine7518-Feb-04 3:58 paindivine75 18-Feb-04 3:58
 Re: Problem linking in VC++.NET 2003 Andreas Muegge18-Feb-04 4:08 Andreas Muegge 18-Feb-04 4:08
 Re: Problem linking in VC++.NET 2003 zhanqxun20-Dec-10 23:42 zhanqxun 20-Dec-10 23:42
 End of data block Maksymus00716-Feb-04 9:21 Maksymus007 16-Feb-04 9:21
 Re: End of data block Andreas Muegge17-Feb-04 6:10 Andreas Muegge 17-Feb-04 6:10
 Re: End of data block Maksymus00717-Feb-04 7:07 Maksymus007 17-Feb-04 7:07
 dll file kgaurav6-Nov-03 21:29 kgaurav 6-Nov-03 21:29
 Re: dll file Andreas Muegge10-Nov-03 20:39 Andreas Muegge 10-Nov-03 20:39
 It can't be used in multithread programs! sxbyl16-Jun-02 16:30 sxbyl 16-Jun-02 16:30
 Re: It can't be used in multithread programs! Andreas Muegge17-Jul-02 3:52 Andreas Muegge 17-Jul-02 3:52
 Re: It CAN be used in multithread programs! Anonymous18-Jul-02 21:47 Anonymous 18-Jul-02 21:47
 Re: It CAN be used in multithread programs! Andreas Muegge18-Jul-02 22:39 Andreas Muegge 18-Jul-02 22:39
 Re: It CAN be used in multithread programs! Anonymous5-Oct-05 1:09 Anonymous 5-Oct-05 1:09
 Do you sth faster? Marcin30-Jul-01 5:01 Marcin 30-Jul-01 5:01
 Re: Do you sth faster? Andreas30-Jul-01 21:21 Andreas 30-Jul-01 21:21
 How does it compare? Giles Forster12-Jun-01 9:29 Giles Forster 12-Jun-01 9:29
 Re: How does it compare? Andreas12-Jun-01 21:21 Andreas 12-Jun-01 21:21
 Re: How does it compare? Giles Forster13-Jun-01 11:32 Giles Forster 13-Jun-01 11:32
 Not too slow??? Anonymous12-Jun-01 4:49 Anonymous 12-Jun-01 4:49
 Order is NOT compression level! Andreas12-Jun-01 20:33 Andreas 12-Jun-01 20:33
 You're correct Anonymous19-Jun-01 0:28 Anonymous 19-Jun-01 0:28
 Last Visit: 31-Dec-99 18:00     Last Update: 30-Sep-16 22:11 Refresh 1