Click here to Skip to main content
12,892,709 members (45,931 online)
Click here to Skip to main content
Add your own
alternative version


159 bookmarked
Posted 17 Dec 2001

CRC32: Generating a checksum for a file

, 17 Dec 2001
Rate this:
Please Sign up or sign in to vote.
How to generate a CRC32 based on a file
<!-- Download Links --> <!-- Article image -->

Sample Image - crc32.gif


Recently I wrote a program in which I wanted to generate a CRC for a given file. I did some checking on the web for sample CRC code, but found very few algorithms to help me. So I decided to learn more about CRCs and write my own code. This article describes what a CRC is, how to generate them, what they can be used for, and lastly source code showing how it's done.

What is a CRC

CRC is an acronym for Cyclic Redundancy Checksum or Cyclic Redundancy Check (depending on who you ask). A CRC is a "digital signature" representing data. The most common CRC is CRC32, in which the "digital signature" is a 32-bit number. The "data" that is being CRC'ed can be any data of any length; from a file, to a string, or even a block of memory. As long as the data can be represented as a series of bytes, it can be CRC'ed. There is no single CRC algorithm, there can be as many algorithms as there are programmers. The ideal CRC algorithm has several characteristics about it. First, if you CRC the same data more than once, you must get the same CRC every time. Secondly, if you CRC two different pieces of data you should get two very different CRC values. If you CRC the same data twice, you get the same digital signature. But if you CRC data that differs (even by a single byte) then you should get two very different digital signatures. With a 32-bit CRC there are over 4 billion possible CRC values. To be exact that's 232 or 4,294,967,296. With that many CRC values it's not difficult for every piece of data being CRC'ed to get a unique CRC value. However, it is possible for spurious hits to happen. In other words two completely different pieces of data can have the same CRC. This is rare, but not so rare that it won't happen.

Why use CRCs

Most of the time CRCs are used to compare data as an integrity check. Suppose there are two files that need to be compared to determine if they are identical. The first file is on Machine A and the other file is on Machine B. Each file is a rather large file (say 500 MB), and there is no network connection between the two machines. How do you compare the two files? The answer is CRC. You CRC each of the two files, which gives you two 32-bit numbers. You then compare those 32-bit numbers to see if they are identical. If the CRC values are different, then you can be 100% guaranteed that the files are not the same. If the CRC values are the same, then you can be 99% sure that the files are the same. Remember, because spurious hits can happen you cannot be positive that the two files are identical. The only way to be positive they are the same is to break down and do a comparison one byte at a time. But CRCs offer a quick way to be reasonably certain that two files are identical.

How to generate CRCs

Generating CRCs is a lot like cryptography in that involves a lot of mathematical theories. Since I don't fully understand it myself, I won't go into a lot of those details here. Instead I'll focus on how to program a CRC algorithm. Once you know how the algorithm works you should be able to write a CRC algorithm in any language on any platform. The first part of generating CRCs is the CRC lookup table. In CRC32 this is a table of 256 specific CRC numbers. These numbers are generated by a polynomial (the computation of these numbers and what polynomial to use are part of that math stuff I'm avoiding). The next part is a CRC lookup function. This function takes two things, a single byte of data to be CRC'ed and the current CRC value. It does a lookup in the CRC table according to the byte provided, and then does some math to apply that lookup value to the given CRC value resulting in a new CRC value. The last piece needed is the actual data that is to be CRC'ed. The CRC algorithm reads the first byte of data and calls the CRC lookup function which returns the CRC value for that single byte. It then calls the CRC lookup function with the next byte of data and passes the previous CRC value. After the second call, the CRC value represents the CRC of the first two bytes. You continuously call the CRC lookup function until all the bytes of the data have been processed. The resulting value is the CRC for the whole data.

Code Details

In this sample program I wanted to show that there are many different ways of generating CRCs. There are over 8 different CRC functions, all based on the above steps for generating CRCs. Each function differs slightly in it's intended use or optimization. There are four main CRC functions, each described below. There are also two separate CRC classes, but more on that later. And lastly there are a few helper functions that CRC strings.

C++ Streams: The first function represents the simplest CRC function. The file is opened using the C++ stream classes (ifstream). This function uses nothing but standard C++ calls, so this function should compile and run using any C++ compiler on any OS.

Win32 I/O: This function is more optimized in that it uses the Win32 API for file I/O; CreateFile, and ReadFile. This will speed up the processing, but by using the Win32 API the code is no longer platform independent.

Filemaps: This function uses memory mapped files to process the file. Filemaps can be used to greatly increase the speed with which files are accessed. They allow the contents of a file to be accessed as if it were in memory. No longer does the programmer need to call ReadFile and WriteFile.

Assembly: The final CRC function is one that is optimized using Intel Assembly. By hand writing the assembly code the algorithm can be optimized for speed, although at the sacrifice of being easy to read and understand.

Those are the four main CRC functions. But there are actually two versions of each function. There are two classes, CCrc32Dynamic and CCrc32Static, each of which have the above four functions for a total of eight. The only difference between the static and dynamic classes is the CRC table. With the static class the CRC table and all the functions in the class are static. The trade off is simple. The static class is simpler to use, but the dynamic class uses memory more efficiently because the CRC table (1K in size) is only allocated when needed.

// Using the static class is as easy as one line of code
dwErrorCode = CCrc32Static::FileCrc32Assembly(m_strFilename, dwCrc32);

// Whereas there is more involved when using the dynamic class
CCrc32Dynamic *pobCrc32Dynamic = new CCrc32Dynamic;
dwErrorCode = pobCrc32Dynamic->FileCrc32Assembly(m_strFilename, dwCrc32);
delete pobCrc32Dynamic;

Whenever you calculate a CRC you need to take into account the speed of the algorithm. Generating CRCs for files is both a CPU and a disk intensive task. Here is a table showing the time it took to CRC three different files. The columns are the different file sizes, the rows are the different CRC functions, and the table entries are in seconds. The system these numbers were captured on is a dual Pentium III at 1 GHz with a 10,000 RPM SCSI Ultra160 hard drive.

44 Kb34 Mb5 Gb
C++ Streams0.00130.80125
Win32 I/O0.00090.6085

As expected the C++ streams is the slowest function followed by the Win32 I/O. However, I was very surprised to see the filemaps were not faster than the Win32 I/O, in fact they are slower. After I thought about it some, I realized memory mapped files are designed to provide fast random access to files. But when you CRC you access the file sequentially. Thus filemaps are not faster, and the extra overhead of creating the "views" of the file are why it's slower. Filemaps do have one advantage that none of the other functions have. Memory mapped files are guaranteed to be able to access files up to the maximum file size in NT which is 264 or 18 exabytes. Although the Win32 I/O may handle files of this size, none of the documentation confirms this. [Note: The largest file I have CRC'ed is 40 GB, which all eight functions successfully CRC'ed, but took over 10 minutes each.]

If anyone who reads this article knows a way to improve the speed even more, please post the code or email me. Especially if you know of a speed improvement for the assembly code. I will bet there are further optimizations that can be made to the assembly code. After all I don't know Intel Assembly very well, therefore I'm sure there are optimizations I don't know about.


This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here


About the Author

Brian Friesen
Web Developer
United States United States
No Biography provided

You may also be interested in...

Comments and Discussions

GeneralQuestion about assembly Pin
Rick York17-Dec-01 19:20
memberRick York17-Dec-01 19:20 
GeneralRe: Question about assembly Pin
Tim Smith18-Dec-01 2:01
memberTim Smith18-Dec-01 2:01 
GeneralRe: Question about assembly Pin
Rick York18-Dec-01 5:36
memberRick York18-Dec-01 5:36 
GeneralRe: Question about assembly Pin
Tim Smith18-Dec-01 10:45
memberTim Smith18-Dec-01 10:45 
Well, all CRCs are based off of the same theory of polynomial arithmetic (most of which I don't understand.) The size of the polynomial is where the name comes from. However, early on people started calling very specific CRCs based on the polynomial size. Thus we have CRCs called CRC16 and CRC32. These names don't even come close to describing all the elements of a CRC.

Here is all the different parameters for a CRC:

1. What is the polynomial?
2. Are the input bytes reflected?
3. What is the initial value of the CRC?
4. Is the final value XORed?
5. Is the polynomial reversed?

1. The polynomial specifies the mask of the CRC and the size. A polynomial such as x^16 + x^12 + x^5 + 1 has a mask of 0x1021. The X^16 isn't included in the mask. A 32 bit CRC would have an x^32 element.

2. When each bit is processed in a CRC, you can either start from the high bit and move to the low bit, or the low bit and move to the high. The latter case is know as reflected. Now, the slow way of doing a reflected CRC is to reflect the input values as they are being processed. The fast way is to reflected the whole bloody algorithm including the polynomial mask. Thus making the reflected CRC just as fast as the non-reflected. In my experience, it is more common to find reflected algorithms than non-reflected. If your CRC is being shifted left, that is a non-reflected CRC. If your CRC is being shifted to the right, that is a reflected CRC.

3. The initial value of a CRC is very important. This value is usually 0xFFFF for a 16 bit crc and 0xFFFFFFFF for a 32 bit crc. Some CRCs use a 0 initial value which isn't a very good idea since you can't tell the difference between a stream of 11 bytes of 0 and 12 bytes of zero.

4. Many algorithms XOR the final value with 0xFFFF for a 16 bit crc and 0xFFFFFFFF for a 32 bit crc. This is purely up to the whim of the programmer.

5. Well, some moron had to do it. They used a non-reversed polynomial mask in a reflected algorithm or they used a reversed polynomial mask in a non-reflected algorithm. The good news is that a reversed poly seems to work just as well.

So, to answer your question, which CRC32 are you talking about? I have seen two or three.

Tim Smith
Descartes Systems Sciences, Inc.
GeneralRe: Question about assembly Pin
Tim Smith18-Dec-01 10:48
memberTim Smith18-Dec-01 10:48 
GeneralRe: Question about assembly Pin
Rick York18-Dec-01 12:19
memberRick York18-Dec-01 12:19 
GeneralRe: Question about assembly Pin
Tim Smith18-Dec-01 14:05
memberTim Smith18-Dec-01 14:05 
GeneralRe: Question about assembly Pin
Brian Friesen18-Dec-01 13:30
memberBrian Friesen18-Dec-01 13:30 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Permalink | Advertise | Privacy | Terms of Use | Mobile
Web02 | 2.8.170424.1 | Last Updated 18 Dec 2001
Article Copyright 2001 by Brian Friesen
Everything else Copyright © CodeProject, 1999-2017
Layout: fixed | fluid