Click here to Skip to main content
Click here to Skip to main content

Arithmetic Compression With C#

, 10 Dec 2009
Rate this:
Please Sign up or sign in to vote.
This article describes arithmetic compression with an implemetnation in C#.

Introduction

According to Wikipedia:

Arithmetic coding is a method for lossless data compression. Normally, a string of characters such as the words ‘hello there’ is represented using a fixed number of bits per character, as in the ASCII code. Like Huffman coding, arithmetic coding is a form of variable-length entropy encoding that converts a string into another form that represents frequently used characters using more bits, with the goal of using fewer bits in total. As opposed to other entropy encoding techniques that separate the input message into its component symbols and replace each symbol with a code word, arithmetic coding encodes the entire message into a single number, a fraction ne, where (0.0 ≤ n < 1.0).

Some of the properties of arithmetic coding are:

  1. When applied to independent and identically distributed (IID) sources, the compression of each symbol is provably optimal.
  2. It is effective in a wide range of situations and compression ratios. The same arithmetic coding implementation can effectively code all the diverse data created by different processes such as modeling parameters, transform coefficients, signaling etc.
  3. It simplifies the modeling of complex sources, yielding near optimal or significantly improved compression for sources that are not IID.
  4. Its main process is arithmetic, which is supported with ever-increasing efficiency by all general purpose or digital processors.
  5. It is suited for use as a compression black-box by those who are not coding experts or do not want to implement the coding algorithm themselves.

In this article, I present an implementation of an arithmetic coder in C# (coder.cs in the source code). The coding algorithm is adapted from an algorithm originally presented in C by Mark Nelson in his paper available here (http://dogma.net/markn/articles/arith/part1.htm). The coding algorithm has been separated from the statistical model and encapsulated as a C# object.

Description

The ‘coder’ object in the source code attached to this article can work as a black box implementation of an arithmetic encoder/decoder. However, to use it, we need to understand the basics of statistical modeling for arithmetic coding and a few of the basic concepts of arithmetic coding.

In general, using arithmetic coding depends on creating a statistical model of the data. In this example, I will assume that we are trying to encode the words “HELLO WORLD”.

Creating the statistical model of the data proceeds as follows:

  1. Taking the number of independent characters in the words to be encoded (HELLO WORLD), we obtain, in alphabetical order:
    1. D
    2. E
    3. H
    4. L
    5. O
    6. R
    7. W

    The total number of characters to be encoded is 10. For convenience and clarity, I have ignored the space between the words HELLO and WORLD.

  2. We can arrange these characters into a table with their corresponding frequency, as below:
  3. Character Frequency
    D 1
    E 1
    H 1
    L 3
    O 2
    R 1
    W 1
  4. Each of the characters will be assigned a range based on its frequency/ probability of occurrence. This range will be between 0 and 1, as below (note that I have not used any optimizations for the frequency and probability model; for the most optimal compression, this is necessary; however, for the purposes of our example, this should do).
  5. Character Frequency Probability Range
    D 1 1/10 0.0 – 0.1
    E 1 1/10 0.1 – 0.2
    H 1 1/10 0.2 – 0.3
    L 3 3/10 0.3 – 0.6
    O 2 2/10 0.6 – 0.8
    R 1 1/10 0.8 – 0.9
    W 1 1/10 0.9 – 1.0
  6. The algorithm for encoding is as below:
  7. Set low to 0.0
    Set high to 1.0
    While there are still input symbols do
        get an input symbol
        code_range = high - low.
              high = low + code_range *  high_range of the symbol being coded
             low = low + code_range * low_range of the symbol being coded
    End of While
    output low

    Applying this to our input (HELLO WORLD), we obtain:

    encoding H (Hs range is 0.2 - 0.3) Range(or code_range above) = 1 - 0 = 1
    low =  0 + (1 * 0.2) = 0.2 
    high = 0 + (1 * 0.3) = 0.3
    no output
    encoding E (Es range is 0.1 - 0.2) Range = 0.3 - 0.2 = 0.1
    
    low =  0.2 + (0.1 * 0.1) = 0.21 
    high = 0.2 + (0.1 * 0.2) = 0.22
    output 0.2
    encoding L (Ls range is 0.3 - 0.6) Range = 0.22 - 0.21 = 0.01
    
    low =  0.21 + (0.01 * 0.3) = 0.213
    high = 0.21 + (0.01 * 0.6) = 0.216
    output 0.21
    encoding the next L (Ls range is 0.3 - 0.6) Range = 0.216 - 0.213 = 0.003
    
    low =  0.213 + (0.003 * 0.3) = 0.2139
    high = 0.213 + (0.003 * 0.6) = 0.2148
    no output
    encoding O (Os range is 0.6 - 0.8) Range = 0.2148 - 0.2139 = 0.0009
    low =  0.2139 + (0.0009 * 0.6) = 0.21444
    high = 0.2139 + (0.0009 * 0.8) = 0.21462 
    output 0.214
    encoding W (Ws range is 0.9 - 1.0) Range = 0.21462 - 0.21444 = 0.00018
    low =  0.21444 + (0.00018 * 0.9) = 0.214602
    high = 0.21444 + (0.00018 * 1.0) = 0.21462
    output 0.2146
    encoding O (Os range is 0.6 - 0.8) Range = 0.21462 - 0.214602 = 0.000018
    low =  0.214602 + (0.000018 * 0.6) = 0.2146128
    high = 0.214602 + (0.000018 * 0.8) = 0.2146164
    output 0.21461
    encoding R (Rs range is 0.8 - 0.9) Range = 0.2146164 - 0.2146128 = 0.0000036
    low =  0.2146128 + (0.0000036 * 0.8) = 0.21461568
    high = 0.2146128 + (0.0000036 * 0.9) = 0.21461604
    no output
    encoding L (Ls range is 0.3 - 0.6) Range = 0.21461604 - 0.21461568 = 0.00000036
    low =  0.21461568 + (0.00000036 * 0.3) = 0.214615788
    high = 0.21461568 + (0.00000036 * 0.6) = 0.214615896
    output 0.214615
    encoding D (Ds range is 0.0 - 0.1) Range = 0.214615896 - 0.214615788 = 0.000000108
    low =  0.214615788 + (0.000000108 * 0.0) = 0.214615788
    high = 0.214615788 + (0.000000108 * 0.1) = 0.2146157988
    output 0.2146157 and 88 from low
    ...

    Giving the table below:

    Character Frequency Probability Range Low High
    H 1 1/10 0.2 – 0.3 0.2 0.3
    E 1 1/10 0.1 – 0.2 0.21 0.22
    L 3 3/10 0.3 – 0.6 0.213 0.216
    L 3 3/10 0.3 – 0.6 0.2139 0.2148
    O 2 3/10 0.6 – 0.8 0.21444 0.21462
    W 1 1/10 0.9 – 1.0 0.214602 0.214620
    O 2 2/10 0.6 – 0.8 0.2146128 0.2146164
    R 1 1/10 0.8 – 0.9 0.21461568 0.21461604
    L 3 3/10 0.3 – 0.6 0.214615788 0.214615896
    D 1 1/10 0.0 – 0.1 0.214615788 0.214615806

    So the final low value 0.214615788 will encode the string “HELLOWORLD” (without the space, which was omitted for clarity).

  8. The algorithm for decoding the number and retrieving the encoded string/data is as below:
  9. Find the symbol represented by the range that the number is in, output it. Remove the effects of encoding and repeat. In pseudo-code:

    get the number encoding the data
    loop    
    current symbol  =  the symbol/character in which range the number falls
    current range =  current symbols high value – current symbols low value   
    subtract current symbols low value from number     
    divide the number by the current range 
    end loop

    This algorithm would give us the working below, taking the output from the example above:

    The number is 0.214615788
    current symbol =  H (range: 0.2 – 0.3)
    current range =  0.3 – 0.2 = 0.1
    subtract current symbols low value from number   = 0.214615788 – 0.2 = 0.014615788
    divide the number by the current range = 0.014615788/0.1 = 0.14615788
    
    current symbol =  E (range: 0.1 – 0.2)
    current range =  0.2 – 0.1 = 0.1
    subtract current symbols low value from number   =  0.14615788 – 0.1 = 0.04615788
    divide the number by the current range = 0.04615788/0.1 = 0.4615788
    
    current symbol =  L (range: 0.3 – 0.6)
    current range =  0.6 – 0.3 = 0.3
    subtract current symbols low value from number   =  0.4615788 – 0.3 = 0.1615788
    divide the number by the current range = 0.1615788/0.3 = 0.538596
    
    current symbol =  L (range: 0.3 – 0.6)
    current range =  0.6 – 0.3 = 0.3
    subtract current symbols low value from number   =  0.538596 – 0.3 = 0.238596
    divide the number by the current range = 0.238596/0.3 = 0.79532
    
    current symbol =  O (range: 0.6 – 0.8)
    current range =  0.8 – 0.6 = 0.2
    subtract current symbols low value from number   =  0.79532 – 0.6 = 0.19532
    divide the number by the current range = 0.19532/0.2 = 0.9766
    
    current symbol =  W (range: 0.9 – 1.0)
    current range =  0.9 – 1.0 = 0.1
    subtract current symbols low value from number   =  0.9766 – 0.9 = 0.0766
    divide the number by the current range = 0.0766/0.1 = 0.766

    This working yields the table below:

    Character Frequency Probability Range Number
    H 1 1/10 0.2 – 0.3 0.214615788
    E 1 1/10 0.1 – 0.2 0.14615788
    L 3 3/10 0.3 – 0.6 0.4615788
    L 3 3/10 0.3 – 0.6 0.538596
    O 2 3/10 0.6 – 0.8 0.79532
    W 1 1/10 0.9 – 1.0 0.9766
    O 2 2/10 0.6 – 0.8 0.766
    R 1 1/10 0.8 – 0.9 0.83
    L 3 3/10 0.3 – 0.6 0.3
    D 1 1/10 0.0 – 0.1 0.0

Implementation

The above algorithms give us the basis of a working implementation of arithmetic coding.

In a practical implementation of arithmetic coding/decoding, note that:

  1. The range assigned to a symbol includes the lower limit but not the upper limit of that range. In other words, in our example, the symbol ‘H’ owns the range 0.2 to 0.29999...but not 0.3. Mathematically, this can be written as H owns the range (0.2 ≤ n < 0.3).
  2. We can use integer mathematics to implement arithmetic coding. This allows us to simplify the representation of symbols and ranges, and frees us from some of the limitations of floating point calculations. We achieve this simplification by using integers and setting an imaginary decimal point at the beginning of the range. Using this simplification and applying it to our example, our probability tables now looks like this:
  3. Character Frequency Probability Range
    D 1 1/10 0000 – 0999
    E 1 1/10 1000 – 1999
    H 1 1/10 2000 – 2999
    L 3 3/10 3000 – 5999
    O 2 2/10 6000 – 7999
    R 1 1/10 8000 – 8999
    W 1 1/10 9000 – 9999

    Basically, this means that the symbol D owns the range 0.0000 to 0.0999...which is the range (0.0 ≤ n < 0.1) written in mathematical notation.

    Keep in mind that the number 99999... should be thought of as 0.9999..., with an infinity of 9s after the decimal point. In the limit, there is an infinitesimally small difference between this upper limit 1.

  4. Consider the encoding table above:
  5. Character Frequency Probability Range Low High
    H 1 1/10 0.2 – 0.3 0.2 0.3
    E 1 1/10 0.1 – 0.2 0.21 0.22
    L 3 3/10 0.3 – 0.6 0.213 0.216
    L 3 3/10 0.3 – 0.6 0.2139 0.2148
    O 2 3/10 0.6 – 0.8 0.21444 0.21462
    W 1 1/10 0.9 – 1.0 0.214602 0.214620
    O 2 2/10 0.6 – 0.8 0.2146128 0.2146164
    R 1 1/10 0.8 – 0.9 0.21461568 0.21461604
    L 3 3/10 0.3 – 0.6 0.214615788 0.214615896
    D 1 1/10 0.0 – 0.1 0.214615788 0.214615806

    Notice that as the encoding proceeds, the high and low numbers significant digits tend to converge.

    In the second iteration (while encoding E), the digit 2 converges for both the high and low numbers, and will never change again regardless of how many more characters we encode thereafter. This is a property of the encoding algorithm which continually narrows the encoding range.

    As encoding proceeds, we obtain the sequence of low numbers, 0.2, 0.21, 0.213, and the high numbers, 0.3, 0.22, 0.216 (output when encoding L). At this point, the significant digits 2 and 1 have converged for both the high and low numbers.

    Again, at the fifth iteration (while encoding 0), we have the number 0.21444 in the current low number, and 0.21462 in the current high number. At this point, the most significant numbers 2, 1, and 4 have converged.

    In a practical implementation, once the significant digits of the high and low numbers have converged, they can be considered to have no further effect on the calculation. They act to simply narrow the encoding range, retaining the ‘decimal place’ of the encoding, but have no further significant effect on subsequent calculations.

    If we imagine our upper and lower numbers as being in an infinitely large array (or a very large array), in our practical calculation, we may safely ship out any significant digits that have converged, and simply shift the entire array one more ‘place’ to the right.

  6. The above calculations can be modeled using integer mathematics. As described below:
  7. We may set the range between high and low to 00000 and 99999, with an imaginary decimal point in front of these numbers. Furthermore, we will assume that the number of 0s in the low number after the decimal point stretches to infinity, and the number of 9s in the high number also stretches to infinity.

    For the purposes of our calculation, while the range between 00000 (0.00000...) and 99999 (0.99999...) is actually 99999 (0.99999...), we increment it by 1 (0.000...1). This is because the number of 9s after the decimal point is taken to be infinite. Therefore, the difference between the high and low numbers in the limit is actually 1.

    Keep in mind that the imaginary decimal point is in front of each of the high or low range numbers below. The numbers 0.000000... and 0.99999... should be assumed to continue indefinitely. Since our bounds are now 0.00000 and 0.99999, and not 0.0 and 1.0, when calculating the range, we will add 1 to compensate. I.e., the high value should be considered to be 1.

    In the pseudo code below, MSD stands for the Most Significant Digit or the Left Most Digit in the number.

    set low to 00000
    set high to 99999
    
    while there are still input symbols
    
        get an input symbol
        code_range = (high – low) + 1
              high = low + code_range *  high_range of the symbol being coded
             low = low + code_range * low_range of the symbol being coded
    
    while the MSD of high and low match
    if the MSD of the high and low match
                output MSD
                remove MSD from low and shift a new 0 into low
                remove MSD of high and shift a new 9 into high
    end if
    end while the MSD of high and low match
    end while there are still input symbols

    Applying this to our input (HELLO WORLD) we obtain:

    First initialize high and low,
    
    Set low to 00000 (or 0.000...)
    Set high to 99999(or 0.999...)
    
    encoding H (Hs range is 0.2 – 0.3) Range(or code_range above) = 
                   99999 - 00000 = 99990 + 1 = 100000 (or 1.00000...)
    
    low = 00000 + (100000 * 0.2) = 20000 (or 0.200000)
    high = 00000 + (100000 * 0.3) = 30000 – 1 = 29999 (or .299999...)

    Also, at this point, 2 has converged, so shifting out 2 and furthermore shifting in another 9 into the high array gives us the new high and low, as below:

    output 2
    
    Set low to 00000
    Set high to 99999
    
    encoding E (Es range is 0.1 – 0.2) Range = (99999 – 00000) + 1 = 100000
    
    low =  00000 + (100000 * 0.1) = 10000 (or 0.100000...)
    high = 19999 + (100000 * 0.2) = 20000 – 1 = 19999 (or 0.199999....)

    At this point, 1 has converged, so shifting out 1 and furthermore shifting in another 9 into the high array gives us the new high and low, as below:

    output 21
    
    Set low to 00000
    Set high to 99999
    
    
    encoding L (Ls range is 0.3 – 0.6) Range = (99999 – 00000) + 1 = 100000
    
    low =  00000 + (100000 * 0.3) = 30000 (or 0.30000)
    high = 00000 + (100000 * 0.6) = 60000 – 1 = 59999 (since 0.5999 = 0.599999...)
    
    no output
    
    
    Set low to 30000
    Set high to 59999
    
    encoding the next L (Ls range is 0.3 – 0.6) 
       Range= (59999 - 30000) + 1 = 29999 + 1 = 30000
    
    
    low =  30000 + (30000 * 0.3) = 30000 + 9000 =  39000  
    high = 30000 + (30000 * 0.6) = 30000 + 18000 = 48000 – 1 = 47999 
    
    no output
    
    
    Set low to 39000
    Set high to 47999
    
    encoding O (Os range is 0.6 - 0.8) Range = (47999 – 39000) + 1 = 9000
    
    low =  39000 + (9000 * 0.6) = 39000 + 5400 = 44400
    high = 39000 + (9000 * 0.8) = 39000 + 7200 = 46200 – 1 = 46199

    At this point, 4 has converged, so shifting out 4 and furthermore shifting in another digit into both the high and low array gives us the new high and low, as below:

    output 214
    
    Set low to 44000 
    Set high to 61999 
    
    encoding W (Ws range is 0.9 – 1.0) Range = (61999 – 44000) + 1 = 18000
    
    low =  44000 + (18000 * 0.9) = 44000 + 16200 = 60200
    high = 44000 + (18000 * 1.0) = 44000 + 18000 = 62000 – 1 = 61999

    At this point, 6 has converged, so shifting out 6 and furthermore shifting in another digit into both the high and low array gives us the new high and low, as below:

    output 2146
    
    Set low to 02000
    Set high to 19999
    
    encoding O (Os range is 0.6 – 0.8) Range = (19999 - 02000) + 1 = 18000
    
    low =  2000 + (18000 * 0.6) = 2000 + 10800 = 12800
    high = 2000 + (18000 * 0.8) = 2000 + 14400 = 16400 – 1 = 16399

    At this point, 1 has converged, so shifting it out and shifting in another digit into both the high and low array gives us the new high and low, as below:

    output 21461
    
    Set low to 28000
    Set high to 63999
    
    encoding R (Rs range is 0.8 – 0.9) Range = (63999 – 28000) + 1 = 36000
    
    low =  28000 + (36000 * 0.8) = 28000 + 28800 = 56800
    high = 28000 + (36000 * 0.9) = 28000 + 32400 = 60400 – 1 = 60399
    
    no output
    
    Set low to 56800
    Set high to 60399
    
    encoding L (Ls range is 0.3 – 0.6) Range = (60399 – 56800) + 1 = 3600
    
    low =  56800 + (3600 * 0.3) = 56800 + 1080 = 57880
    high = 56800 + (3600 * 0.6) = 56800 + 2160 = 58960 – 1 = 58959

    At this point, 5 has converged, so shifting it out and shifting in another digit into both the high and low array gives us the new high and low, as below:

    output 214615
    
    Set low to 78800
    Set high to 89599
    
    encoding D (Ls range is 0.0 – 0.1) Range = (89599 – 78800) + 1 = 10800
    
    low =  78800 + (1080 * 0.0) = 78800 + 0 = 78800
    high = 78800 + (1080 * 0.1) = 78800 + 108 = 78908

    At this point, 7 and 8 have converged, so shifting out 7 and 8 and further shifting another digit into both the output array gives us the coded string:

    output 21461578
    
    output 8 from low 
    
    the decimal procedure : output 0.2146157 and 88 from low
    
    END ENCODING
    ...

    Giving the table below:

    Character Probability Range Low High Output
    Initialize     00000 99999  
    H 1/10 0.2 – 0.3 20000 29999 2
    E 1/10 0.1 – 0.2 10000 19999 1
    L 3/10 0.3 – 0.6 30000 59999  
    L 3/10 0.3 – 0.6 39000 47999  
    O 3/10 0.6 – 0.8 44400 46199 4
    W 1/10 0.9 – 1.0 60200 61999 6
    O 2/10 0.6 – 0.8 12800 16399 1
    R 1/10 0.8 – 0.9 56800 60399  
    L 3/10 0.3 – 0.6 57880 58959 5
    D 1/10 0.0 – 0.1 78800 78907 78

    And finally, we also shift out the last digit in low: 8.

    This logic is implemented in the method encode_symbol of the coder object/class (coder.cs in the source code).

    Underflow

    It is possible that while encoding symbols, a situation could arise in which the high and low cannot converge. In the event that the encoded word has a string of 0s or 9s in it, the high and low values will slowly converge on a value, but may not see their most significant digits match immediately. For example, high and low may look like this:

    High:     700003      
    Low:      699994

    The calculated range is only a single digit long, which means the encoder does not have enough precision to be accurate.

    In effect, the range between high and low has become so small that any calculation will always return the same values. Moreover, since the most significant digits of both high and low are not equal, the algorithm can't output the digit and shift.

    The way to avoid underflow is to prevent it altogether. This is done by modifying the algorithm slightly. If the two MSDs don't match, but are now on adjacent numbers, a second test is done. If high and low are one apart, we test to see if the second most significant digit in high is a 0, and the second digit in low is a 9. If so, it means that the underflow is threatening.

    When there is potential for underflow, the encoder does a slightly different shift operation. It deletes the second digits from high and low, and shifts the rest of the digits left. The most significant digit, however, stays in place. Also, an underflow counter is set to mark the digit that was discarded. The operation looks like this:

    Before    After               
    ------    -----
    High:          40344     43449
    Low:           39810     38100
    Underflow:     0         1

    This checks if underflow occurs after every iteration/calculation operation.

    When the MSDs finally converge to a single value, it outputs the value, and then a number of "underflow" digits that were previously discarded. The underflow digits will be all 9s or 0s, depending on whether the high and low converged to the higher or lower value. In the C# implementation of this algorithm, the underflow counter keeps track of how many ones or zeros to output.

  8. Decoding
  9. Previously, in the decoding process, we could use the entire input number. This, however, may not be possible in practice since we can't perform an operation like that on a number that could potentially be millions or billions of bytes long. Just as in the encoding process, the decoder can operate using simple finite integer calculations.

    The decoder maintains three integers. The first two, high and low, correspond exactly to the high and low values maintained by the encoder. The third number, code, contains the current bits being read in from the input bits stream.

    Important: The current probability is determined by where the present code value falls along that range. If you divide the value-low by high-low+1, you get the actual probability for the present symbol.

  10. Finally, there must be some mechanism to stop the decoding calculation since, in theory, it is possible for an encoding number to yield more decoded characters than were encoded in it. Two methods for doing this are possible. A special <stop> character may be encoded into the number, or alternatively, a number representing the number of characters to encode can be passed to the decoding function/method.

The C# implementation

The encoding procedure written in C# is included in the downloadable source code for this article. The code for the encoder as well as the decoder were first published (in C) in an article entitled "Arithmetic Coding for Data Compression" in the February 1987 issue of "Communications of the ACM", by Ian H. Witten, Radford Neal, and John Cleary, published again by Mark Nelson as C code, and then ported to C# by myself and is being published here with the author's permission.

I have modified the code slightly so as to further isolate statistical modeling and arithmetic coding. The coder (coder.cs in the source code) class is an object that implements arithmetic coding. This class can then be used by any statistical model of the data. I have included a test project and a few examples of testing the coding and decoding methods of the class.

There are two major differences between the algorithms shown earlier and the code included in this article.

The first difference is in the way probabilities are transmitted. In the algorithms shown above, the probabilities were kept as a pair of floating point numbers on the 0.0 to 1.0 range. Each symbol had its own section of that range. In the C# class included here, a symbol has a slightly different definition. Instead of two floating point numbers, the symbol's range is defined as two integers, which are counts along a scale.

The scale is also included as part of the coder class definition: as the property 'scale' of the class (coder.scale). This scale is used in the methods encode_symbol and decode_symbol to convert the probability integer into its floating point equivalent, or to decode an encoded symbol into its char equivalent. This also means that a user of the class can input the probability of a symbol as integers instead of floating point numbers. See the included test solution for an example.

So, for instance in the "HELLO WORLD" example, the letter H was defined previously as the high/low pair of 0.2 and 0.3. In the code being used here, "H" would be defined as the low and high counts of 2 and 3, with the symbol scale being 10.

The second difference in this algorithm is that all of the comparison and shifting operations are being done in base 2, rather than base 10. The illustrations given previously were done on base 10 numbers to make the algorithms a little more comprehensible. The algorithms work properly in base 10, but masking off digits and shifting in base 10 on most computers is expensive and slow. Instead of comparing the two MSD digits, we now compare the two MSD bits.

There are two things missing that are needed in order to use the encoding and decoding algorithms. The first is a set of bit oriented input and output routines. These are shown in the code listing and are presented here as the bit IO routines.

The coder.symbol structure is responsible for storing the probabilities of each character, and performing two different transformations.

During the encoding process, the coder.symbol structure has to take a character to be encoded and convert it to a probability range. The probability range is defined as a low count, a high count in the structure.

During the decoding process, the coder.symbol structure has to take a count derived from the input bit stream and convert it into a character for output.

The coder object is created with a 'dictionary' or 'alphabet' made up of these coder.symbol structures.

Test code

An example program is shown in the included solution. It implements a compression/expansion program that uses some arbitrary models based on the discussed examples.

The decoding method needs to know the length of the encoded string so as to know when to stop decoding the message.

The test project encodes an arbitrarily defined input string, and writes it out to a MemoryStream object. This is the return value of the Encode method of the class. The MemoryStream object returned will contain an array of bytes with the compressed data in binary.

One can then decode the stream by calling the Expand method of the class. The Expand method of the class takes a memory stream and the length of the encoded message as parameters. To test it, we can encode a string and then pass the binary stream returned back to the Expand method for decoding.

During the encoding process, a routine called convert_int_to_symbol is called. This routine gets a given input character, and converts it to a low count, high count, and scale using the current model. Since our model is a set of fixed probabilities, this just means looking up the probabilities in the input coder.symbol struct. Once those are defined, the encoder can be called.

During the decoding process, there are two functions associated with modeling. In order to determine what character is waiting to be decoded on the input stream, the model needs to be interrogated to determine what the present scale is. In our example, the scale (or range of counts) is fixed by the coder.scale property. When decoding, a modeling function called convert_symbol_to_int is called. It takes the given count, and determines what character matches the count. Finally, the decoder is called again to process that character out of the input stream.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Agola Kisira Odero
Software Developer (Senior)
Virgin Islands (British) Virgin Islands (British)
No Biography provided

Comments and Discussions

 
QuestionDecoding The Compression. PinmemberAyushjalan0529-Apr-14 23:32 
QuestionNot able to use this code to read any file and compress/decompress it PinmemberlalithaManju4-May-13 23:15 
AnswerRe: Not able to use this code to read any file and compress/decompress it PinmemberAgola Kisira Odero5-May-13 19:27 
GeneralMy vote of 3 PinmemberHarvey Green3-Jan-12 18:00 
GeneralThis is just a straight port from C ... PinmemberHarvey Green3-Jan-12 13:33 
GeneralCode Problem PinmemberJudd_Token26-Jan-10 4:55 
GeneralStrong article PinmemberRozis28-Nov-09 9:55 
Impressive - Rozis
GeneralRe: Strong article Pinmembermartyn_mcfly29-Nov-09 2:03 
GeneralNeeds formatting PinmemberRichard MacCutchan28-Nov-09 6:06 
GeneralRe: Needs formatting PinmemberMember 280236728-Nov-09 6:31 
GeneralRe: Needs formatting Pinmember0x3c028-Nov-09 6:44 
GeneralRe: Needs formatting PinmemberRichard MacCutchan28-Nov-09 7:20 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Mobile
Web02 | 2.8.140721.1 | Last Updated 10 Dec 2009
Article Copyright 2009 by Agola Kisira Odero
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid