|
|||||||||||||||||||||
|
|||||||||||||||||||||
|
Announcements
Want a new Job?
Chapters
Services
Feature Zones
|
IntroductionOne day, I was talking with one of my colleagues about compression, and I was surprised when he told me that there is an algorithm that uses a dictionary during compression then leaves it aside and saves the compressed data without the dictionary. I asked him, "How can it construct the original data again?", he replied: "Simply by constructing the dictionary again during the decompression process"!!! It is the LZW compression. I decided that day to read about it and try to implement this so nice idea algorithm, and I spent a long time to make it as fast as I can, but I think I failed as it compresses 1 MB/sec in normal cases. The code seems hard to be understood, but it will be easy if you take a look at the algorithm through the Background section or through any of the References down the page. BackgroundIn 1984, Terry Welch was working on a compression algorithm for high-performance disk controllers. He developed a rather simple algorithm that was based on the LZ78 algorithm and that is now called LZW. LZW is a lossless 'dictionary based' compression algorithm. Dictionary based algorithms scan a file for sequences of data that occur more than once. These sequences are then stored in a dictionary, and within the compressed file, references are put where-ever repetitive data occurred. LZW IdeaLZW compression replaces strings of characters with single codes. It does not do any analysis of the incoming text. Instead, it just adds every new string of characters it sees to a table of strings. Compression occurs when a single code is output instead of a string of characters. So, any code in the dictionary must have a string in the compressed buffer for the first time only, that means the dictionary just keeps a reference to uncompressed strings in the compressed buffer, no need to let the dictionary take a copy of these strings. So, it is not necessary to preserve the dictionary to decode the LZW data stream. This can save quite a bit of space when storing the LZW-encoded data. The code that the LZW algorithm outputs can be of any arbitrary length, but it must have more bits in it than a single character. The first 256 codes (when using eight bit characters) are, by default, assigned to the standard character set. The remaining codes are assigned to strings as the algorithm proceeds. The sample program runs as shown with 12 bit to 31 bit codes. This means, codes 0-255 refer to individual bytes, while codes 256-2^n refer to substrings, where n equal to number of bits per code, as in the following figure.
LZW AlgorithmCompressionThe LZW Compression Algorithm can be summarized as follows: w = NIL;
while ( read a character k )
{
if wk exists in the dictionary
w = wk;
else
{
add wk to the dictionary;
output the code for w;
w = k;
}
}
Original algorithm used a dictionary of 4094 entries; first 256 for ASCII codes, and the rest for substrings. As you see, the algorithm scans one character at a time. If the string is in the dictionary, then it adds the character to the current string and goes for the next character. If the work string is not in the dictionary, it adds the work string to the dictionary and sends the work string without the new character. It then sets the work string to the new character. Compression ExampleInput string is "^WED^WE^WEE^WEB^WET".
w k output index symbol
-----------------------------------------
NIL ^
^ W ^ 256 ^W
W E W 257 WE
E D E 258 ED
D ^ D 259 D^
^ W
^W E 256 260 ^WE
E ^ E 261 E^
^ W
^W E
^WE E 260 262 ^WEE
E ^
E^ W 261 263 E^W
W E
WE B 257 264 WEB
B ^ B 265 B^
^ W
^W E
^WE T 260 266 ^WET
T EOF T
[1] Stepping through the start of the algorithm for this string, you can see that in the first pass through the loop, a check is performed to see if the string "^W" is in the table. Since it isn't, the code for '^' is output, and the string "^W" is added to the table. Since we have 256 characters already defined for codes 0-255, the first string definition can be assigned to code 256. After the third letter, 'E', has been read in, the second string code "WE" is added to the table and the code for letter 'W' is output. This continues until in the second word, the characters '^' and 'W' are read in, matching string number 256. In this case, the code 256 is output, and a three character string is added to the string table. The process continues until the string is exhausted and all of the codes have been output. DecompressionThe LZW Decompression Algorithm is as follows: read a character k;
output k;
w = k;
while ( read a character k )
// k could be a character or a code.
{
entry = dictionary entry for k;
output entry;
add w + entry[0] to dictionary;
w = entry;
}
As we said before, the decompression builds its own dictionary that matches exactly the compressor's, so that only the codes need to be saved in the compression. Decompression ExampleInput string is "^WED<256>E<260><261><257>B<260>T".
w k output index symbol
-------------------------------------------
^ ^
^ W W 256 ^W
W E E 257 WE
E D D 258 ED
D <256> ^W 259 D^
<256> E E 260 ^WE
E <260> ^WE 261 E^
<260> <261> E^ 262 ^WEE
<261> <257> WE 263 E^W
<257> B B 264 WEB
B <260> ^WE 265 B^
<260> T T 266 ^WET
[1] Just like the compression algorithm, it adds a new string to the string table each time it reads in a new code. All it needs to do in addition to that is translate each incoming code into a string and send it to the output. The important thing to note is that the string table ends up looking exactly like the table built up during compression. The output string is identical to the input string from the compression algorithm. Note that the first 256 codes are already defined to translate to single character strings, just like in the compression code. Code UsageCode usage is so simple, just the input buffer and its length, and the output buffer and its length. The compress function includes an additional parameter bool CompressLZW(BYTE* pSrc, int nSrcLen, BYTE* &pDes, int &nDesLen, int nBitsPerSample = 12); bool DecompressLZW(BYTE *pSrc, int nSrcLen, BYTE *&pDes, int &nDesLen); Code DescriptionI have only two functions CompressLZW Function
1:bool CompressLZW(BYTE *pSrc, int nSrcLen, BYTE *&pDes, int &nDesLen, int nBitsPerSample = 12) 2:{ 3: ... 4: CBinaryTree<CBuffer, CBuffer*, int, int> dictionary; 5: // keep first 256 IDs for ascii Samples 6: dictionary.Serial = 256; 7: // scan the input buffer 8: while(nSrcLen-- > 0) 9: { 10: ... 11: pNode = dictionary.Insert(&node, -1, pNode); 12: if(pNode->Count > 1 && nSrcLen) 13: // (repeated Sample) 14: nSample = pNode->ID, node.m_nLength++; 15: else 16: { // write last success Sample 17: *(DWORD*)(pDes+sizeof(DWORD)+1+ (nDesLen>>3)) |= nSample << (nDesLen&7); 18: nDesLen += nBitsPerSample; 19: // initialize node to next Sample 20: node.m_pBuffer += node.m_nLength-1; 21: node.m_nLength = 2; 22: // copy first byte of the node as a new Sample 23: nSample = *node.m_pBuffer; 24: ... 25: } 26: } 27: ... 28:} I included only the main points of the algorithm. I find line 17 is the only line that needs a description: 17: *(DWORD*)(pDes+sizeof(DWORD)+1+(nDesLen>>3)) |= nSample << (nDesLen&7); It copies the sample code to the compressed buffer, but it needs to start copying at the right bit. I know some people always invert the function of the shift operators in their minds, so I will dig a little bit on this line:
DecompressLZW Function
1:bool DecompressLZW(BYTE *pSrc, int nSrcLen, BYTE *&pDes, int &nDesLen) 2:{ 3: // copy bits pre Sample 4: int nBitsPerSample = *(pSrc+sizeof(DWORD)); 5: ... 6: // dictionary array 7: vector<CBUFFER *> dictionary; 8: // scan input buffer 9: while(nDesIndex < nDesLen) 10: { 11: // read sample code 12: nSample = (*(DWORD*)(pSrc+sizeof(DWORD)+1+ (nSrcIndex>>3)))>>(nSrcIndex&7) & (nMaxSamples-1); 13: nSrcIndex += nBitsPerSample; 14: // get code string node from the dictionary 15: if(nSample >= 256) 16: if(nSample-256 < (int)dictionary.size()) 17: { // normal case, valid dictionary Sample 18: pNodeSample = dictionary.at(nSample-256); 19: nSampleLen = pNodeSample->m_nLength+1; 20: // copy dictionary node buffer to decompressed buffer 21: memcpy(pDes+nDesIndex, pNodeSample->m_pBuffer, pNodeSample->m_nLength); 22: nDesIndex += pNodeSample->m_nLength; 23: } 24: else // out of range Sample 25: ... 26: else 27: nDesIndexSave = nDesIndex, *(pDes+nDesIndex++) = (BYTE)nSample, nSampleLen = 2; 28: ... 29: // add current segment to the dictionary 30: dictionary.push_back(new CBuffer(node)); 31: // increment next node pointer to the last char of the added Sample 32: node.m_pBuffer += node.m_nLength-1; 33: node.m_nLength = nSampleLen; 34: } 35: ... 36:} As in the compression, I included only the main points of the algorithm. I find line 12 is the only line that needs a description at this point: 12: nSample = (*(DWORD*)(pSrc+sizeof(DWORD)+1+ (nSrcIndex>>3)))>>(nSrcIndex&7) & (nMaxSamples-1); It copies the sample code from the compressed buffer, but it needs to start copying at the right bit:
Points of Interest
Source Code FilesLZW.cpp, LZW.h and BinaryTree.h. Updates
References
Thanks to...I owe a lot to my colleagues for helping me in implementing and testing this code. (JAK) | ||||||||||||||||||||