12,069,536 members (58,795 online)
alternative version

43.2K views
20 bookmarked
Posted

# RunLength Compression

, 21 Jan 2005 CPOL
 Rate this:
Fast Run-Length coding with variable runs sizes.

## Introduction

Run-Length is a lossless compression technique, its performance depends heavily on the input data statistics. It depends on counting sequence of repeated runs and save the run plus its count instead of the sequence of runs. So it depends heavily on the repetition in the input data. I have used it to compress array of bits ('1's and '0's), and each time I use it, I found my self rewrite the code to handle the new data format, and the type of the run (one bit, byte, word, ...). So I preferred to write a general code that has the runs size and count as an input to the compression function. You can skip the Background section if you have good experience about Run-Length algorithm.

## Background

The main idea of the Run-Length is to replace runs of the same data (like 'aaaaaa' or 'ccccc') with a counter saying how many repetitions we have. If we have 'aaaaaa' we just output 'a' and a byte to save the count, so we replaced 6 characters with only 2, and that can be done for all repeated runs in the input data, and the other unrepeated runs will be saved as it is. Compressed data format can be of any thing, just the decompression algorithm should follow this format structure in the decompression.

Example

Input data: aaaaaabcdddddcccccrrrrrrrrrfdde (31 bytes)
Output data: a,5,b,0,c,0,d,4,c,4,r,8,f,0,d,1,e,0 (17 bytes)

### Run-Length algorithm

#### Compression Algorithm can summarized as follows:.

// for input b with length l
index = 0
while( index < l )
{
run = b[index]
length = 0
if run = b[++index]
while run = b[index+length]
index++, length++;
output (run, length);
}

The algorithm scans one character (run) at a time. If the a character is equal to its next character, then it checks for repetition, then outputs the character with the repetition that follow it. If the a character is not equal to its next character, then it outputs it and a repetition of zero.

#### Decompression Algorithm is as follows:

// for input b with length l
index = 0
while( index < l )
{
run = b[index++]
length = b[index++]
output (run, length+1)
}

The decompression is so simple, just outputs the run repeated with its length plus one.

## Code Usage

Code usage is so simple, just the input buffer and its length, and the output buffer and its length. The compress function includes additional parameters:

• nBitsPerSample: to determine number of bits to save run length value, so maximum run length will be up to 2^nBitsPerSample-1.
• nRuns: to determine an optional dictionary of runs, with a default runs of the ASCII characters (0...255).
• nRunCount: to determine number of runs in the previous parameter, with a default value of 256 (256 characters).
• nRunSize: to determine the number of bytes in the each run, with a default value of 1 (1 byte).
bool CompressRunLength(BYTE *pSrc, int nSrcLen, BYTE *&pDes,
int &nDesLen, int nBitsPerSample, void* nRuns,
int nRunCount, int nRunSize);
bool DecompressRunLength(BYTE *pSrc, int nSrcLen,
BYTE *&pDes, int &nDesLen);

You can use the parameters (..., void* nRuns, int nRunCount, int nRunSize) to input any dictionary for runs and you can decide run size in bytes, see the next example to compress a file contains a runs with size 3:

BYTE by[] = { 0x0,0x0,0x0,  0x84,0x82,0x84,
0xce,0xd3,0xd6,  0xff,0xff,0xff };
CompressRunLength(pSrc, nSrcLen, pDes, nDesLen,
12/*nBitsPerSample*/, by, 4/*nRunCount*/,
3/*nRunSize*/);

Note: Input runs must be sorted as I am using binary search to search in runs array buffer.

And if you want to compress only for zeros and ones, you can use that:

BYTE by[] = { 0x0, 0xff };
CompressRunLength(pSrc, nSrcLen, pDes, nDesLen,
12/*nBitsPerSample*/, by, 2/*nRunCount*/, 1/*nRunSize*/);
// (^_^)

## Code Description

### CompressRunLength function

• Scan the input buffer pSrc and check if the run is repeated or not using binary search.
• If run is repeated then,
• set next bit in the destination buffer:
*(pDes+(nDesLen>>3)) |= 1 << (nDesLen++&7)
• add run index to the destination buffer:
*(DWORD*)(pDes+(++nDesLen>>3)) |= nRunIndex << (nDesLen&7)
• loop to get run length
while(...) nRunLength++, ...
• add run length to the destination buffer:
*(DWORD*)(pDes+(nDesLen>>3)) |= nRunLength << (nDesLen&7)
• If run is not repeated, then add next byte to the destination buffer:
*(WORD*)(pDes+(++nDesLen>>3)) |= *(pSrc+nByte++) << (nDesLen&7)
bool CompressRunLength(BYTE *pSrc, int nSrcLen, BYTE *&pDes,
int &nDesLen, int nBitsPerSample,
void* nRuns, int nRunCount, int nRunSize)
{
...
// loop in the source buffer
while(nByte < nSrcLen)
if((... (nRunIndex =
BinarySearch(pSrc+nByte, nRunSize, nRuns, nRunCount))!= -1 &&
memcmp(pSrc+nByte+nRunSize,
(BYTE*)nRuns+nRunIndex*nRunSize, nRunSize) == 0) ||
...)
{    // set bit to 1 to indicate type found
*(pDes+(nDesLen>>3)) |= 1 << (nDesLen++&7);
*(DWORD*)(pDes+(++nDesLen>>3)) |=
nRunIndex << (nDesLen&7);
nDesLen += nBitsPerTypeIndex;
// skip the two repeated runs
nByte += nRunSize*2;
// get run length - 2 (without the two repeated runs)
nRunLength = 0;
while(nRunLength < nMaxRunLength && nByte < nSrcLen &&
((... memcmp(pSrc+nByte,
(BYTE*)nRuns+nRunIndex*nRunSize, nRunSize) == 0) || ...))
nRunLength++, nByte += nRunSize;
// save run length and increment destination length
//by bits per sample
*(DWORD*)(pDes+(nDesLen>>3)) |=
nRunLength << (nDesLen&7);
nDesLen += nBitsPerSample;
}
else    // copy one byte
*(WORD*)(pDes+(++nDesLen>>3)) |=
*(pSrc+nByte++) << (nDesLen&7), nDesLen += 8;
nDesLen = (nDesLen+7)/8;    // bits to bytes
...
}
• (nDesLen>>3): >>3 to divide by 8 to reach the right byte to start with.
• |=: to avoid writing over the previous code.
• (nDesLen&7): &7 to get the reminder of dividing by 8, to get the start bit.
• nRunLength << (nDesLen&7): to shift the value to left to match the starting bit at the destination buffer.

### DecompressRunLength function

• If next bit in the source buffer is on, then it is a run with its length:
if((*(pSrc+(nSrcIndex>>3)) >> (nSrcIndex++&7)) & 1)
nRunIndex = GetDWORD(pSrc, nSrcIndex, nMaxTypeIndex)
nRunLength = GetDWORD(pSrc, nSrcIndex, nMaxRunLength)+2
• loop to out the repeated runs.
for(...)
• If next bit in the source buffer is off, then it is a byte that needs to out:
*(WORD*)(pDes+(nDesIndex>>3)) |= GetWORD(pSrc, nSrcIndex, 0xff)
bool DecompressRunLength(BYTE *pSrc, int nSrcLen,
BYTE *&pDes, int &nDesLen)
{
...
nSrcLen <<= 3; // bytes to bits
while(nSrcIndex < nSrcLen-8)
if((*(pSrc+(nSrcIndex>>3)) >> (nSrcIndex++&7)) & 1)
{
nRunIndex = GetDWORD(pSrc, nSrcIndex, nMaxTypeIndex),
nSrcIndex += nBitsPerTypeIndex;
nRunLength = GetDWORD(pSrc, nSrcIndex, nMaxRunLength)+2,
nSrcIndex += nBitsPerSample;
for(nRun = 0; nRun < nRunLength; nRun++)
for(nByte = 0; nByte < nRunSize; nByte++, nDesIndex += 8)
*(WORD*)(pDes+(nDesIndex>>3)) |=
nRuns ? GetWORD(nRuns+nRunSize*nRunIndex,
nByte<<3, 0xff) : (BYTE)nRunIndex;
}
else    // copy one byte
*(WORD*)(pDes+(nDesIndex>>3)) |=
GetWORD(pSrc, nSrcIndex, 0xff),
nDesIndex += 8, nSrcIndex += 8;
...
}
• (nDesIndex>>3): >>3 to divide by 8 to reach the right byte to start with.
• (nDesIndex&7): &7 to get the reminder of dividing by 8, to get the start bit.
• >>(nDesIndex&7): to shift right to the start bit to take the code at bit 0 (closer the wall).

## Points of Interest

1. At the compression function, I am using a binary search to search for the run that matches the current run in the input data:
int BinarySearch(void* pValue, int nVlaueSize,
void* pArray, int nCount)
{
int nIndex, nResult, nStart = 0, nEnd = nCount-1;
while(nStart <= nEnd)
{
nIndex = (nEnd+nStart)/2;
if((nResult = memcmp((BYTE*)pArray+nIndex*nVlaueSize,
pValue, nVlaueSize)) == 0)
return nIndex;
if(nResult > 0)
nEnd = nIndex-1;
else
nStart = nIndex+1;
}
return -1;
}

So, as I noted before, input runs must be sorted.

2. Also, I am using a very good macros to extract values starting from any bit in the input buffer:

mask is used to reset any extra indeed bits.

3. My code here in this compression is using bits to save values, not bytes, so I am saving any value starting from any bit, so I didn't waste any bits. Or I can say, I use every bit in the compressed data.

## Source code files

• RunLength.cpp
• RunLength.h

## Thanks to...

I owe a lot to my colleagues for helping me in implementing and testing this code. (JAK)

## Share

 Other Egypt

## You may also be interested in...

 View All Threads First Prev Next
 it is the first compression schema that make bigger my files(strings) miketj25-Aug-06 13:50 miketj 25-Aug-06 13:50
 Re: it is the first compression schema that make bigger my files(strings) Hatem Mostafa26-Aug-06 0:44 Hatem Mostafa 26-Aug-06 0:44
 Sir, This type of compression is used to compress inverted indices, which include long bitsets. BitSet means array of bits that include zeros and ones. And it can be used after applying difference compression for sorted arrays. If u want a compression algorthim to comress text use LZW http://www.codeproject.com/cpp/LZW.asp[^] thanks
 Re: it is the first compression schema that make bigger my files(strings) muhahahaha10-Jan-08 1:46 muhahahaha 10-Jan-08 1:46
 Re: it is the first compression schema that make bigger my files(strings) evolved20-Feb-09 8:20 evolved 20-Feb-09 8:20
 Last Visit: 31-Dec-99 19:00     Last Update: 9-Feb-16 4:09 Refresh 1