Get access to the new Intel® IoT Developer Kit, a complete hardware and software solution that allows developers to create exciting new solutions with the Intel® Galileo and Intel® Edison boards. Visit the Intel® Developer Zone for IoT.

Check out the (non) CRC implementation below. What's wrong with it?

I'm working on a connectivity library for IoT devices. A serious part of every communication protocol is the data integrity check. You send a collection of bytes to another machine and that machine has to know that there were no errors on the way. IP for example already has a good integrity check. The problem is that a TCP socket is basically a stream. This means that you don't really know when a buffer begins and ends. By using an integrity check we can verify that we are looking at full buffers.

The basic concept is to have some correlation between the first and the last byte. For example if first 2 bytes are length of the buffer and the last two bytes are 0x0102 then we can keep scanning the buffer until the first two bytes point to 0x0102. The problem is that every number we may choose can appear inside the buffer as part of the data. For example what if we have 10,000 bytes of 0x0102? We will find 0x0102, assume that it is the length, jump forward and find 0x0102. Magic number is probably one of the worst ways to go. Length is a characteristic of the buffer. We need to have another characteristic of the buffer at the end as well. The preferred method is to use all the data in the buffer to generate an integrity check value. For this there are commonly Checksum, XOR, CRC, and Hush. The question is how much CPU and RAM vs. the quality of error detection.

Checksum was originally a simple addition of all bytes (sum...). The problem is that 0x80+0x10 is the same as 0x10+0x80. This means that replacing one byte might be detected, but there is high probability that one or more bytes might compensate to get the same result. Here is another example: 0x10+0x20+0x30 has the same checksum as 0x60+0x00+0x00. Clearly a different buffer.

Using XOR is very simple and low cost for the CPU. The problem is that 0x80^0x80 is the same as 0x81^0x81. This means that one error in the buffer may be detected but there is good probability that a second error will hide the first. XOR is safe as long as there is only one error in each bit of all bytes (or any odd number of errors for any bit).

CRC - Cyclic Redundancy Check is probably the best way to go. There is a nice algorithm which I will detail below. It is more expensive in CPU and RAM but much more reliable. Every error propagates to continually produce completely different numbers. It is very hard to compensate for one error with another error. If you have 0x12 instead of 0x10, there isn't only one number that has to change to a specific location to compensate. CRC16 means 16 bit number and CRC 32 means 32 bit number. This means that CRC16 is 1 in 64K probability that the buffer is correct. If a byte changed and caused a different CRC16 to be produced, it would take several different bytes to compensate for it because one byte cannot repair a 16 bit value.

Advanced Hush functions usually consume too much RAM and CPU but eventually the output number is what matters. if you have a 16 bit integrity value then even the best algorithm will not save you from a 1 : 64K ratio.

I googled "crc16 algorithm c" and found several results. The first to demonstrate the concept is here: http://www.nongnu.org/avr-libc/user-manual/group__util__crc.html

uint16_t crc16_update(uint16_t crc, uint8_t a)
{
   int i;
   crc ^= a;
   for (i = 0; i < 8; ++i)
   {
      if (crc & 1) crc = (crc >> 1) ^ 0xA001;
      else crc = (crc >> 1);
   }
   return crc;
}

This example is good because it shows you the cycles. For every bit we shift and based on a selected bit value we either xor with a magic number or not. This loop consumes CPU so we can actually prepare the result for every given byte in a lookup table as seen here: http://stackoverflow.com/questions/22432066/how-to-use-table-based-crc-16-code

unsigned int crctable[256] =
{
   0x0000, 0x1189 ........
};

while (len--)
{
   crc = (crc << 8) ^ crctable[((crc >> 8) ^ *c++)];
}

I tried to find something that I could use on an Arduino device without wasting CPU and without using a great portion of the device's memory. Here is what I got (translated from C# without compiling, I'll explain why below)

myCrc ^= *buffer++;
myCrc = (myCrc) ^ (myCrc << 5) ^ (myCrc << 8);

How is it possible that I don't know about such implementation and it does not come up as first hit in google search. What am I missing here?

The reason I used C# was for the tests (and because this is the project that I had open when I wanted to test this solution).

Here is the function:

      private static ushort CRC16(byte[] buffer, out ushort crc, out ushort myCrc)
      {
         myCrc = 0;
         crc = 0;
         int pos = 0;
         while (pos < buffer.Length)
         {
            crc ^= (ushort)(buffer[pos] << 8);
            for (int i = 0; i < 8; ++i)
            {
               if ((crc & 0x8000) != 0) crc = (ushort)((crc << 1) ^ 0x1021);
               else crc = (ushort)(crc << 1);
            }

            myCrc ^= (ushort)(buffer[pos]);
            myCrc = (ushort)((myCrc) ^ (myCrc << 5) ^ (myCrc << 8));
            //myCrc ^= *buffer++;
            //myCrc = (myCrc) ^ (myCrc << 5) ^ (myCrc << 8);

            pos++;
         }
         return (crc);
      }

Here is the tester code:

      private static void BenchmarkCRC16()
      {
         ushort[] crcTable = new ushort[0x10000];
         ushort[] myCrcTable = new ushort[0x10000];
         for (int a = 0; a < 0x17000; a+=13)
         {
            if (0 == (a % 256)) Console.WriteLine("a = 0x" + a.ToString("X"));
            ushort ua = (ushort)a;
            for (int i = 0; i < 0x10000; i++)
            {
               ushort u = (ushort)i;

               ushort crc16 = 0;
               ushort myCrc = 0;
               CRC16(new byte[] { (byte)(u & 0xFF), (byte)((u >> 8) & 0xFF), (byte)(ua & 0xFF), (byte)((ua >> 8) & 0xFF) }, out crc16, out myCrc);
               crcTable[crc16]++;
               myCrcTable[myCrc]++;
            }
         }
         List<int>   crcUtilization = new List<int>();
         List<int> myCrcUtilization = new List<int>();
         for (int i = 0; i < 0x10000; i++)
         {
            bool ok = false;
            for (int t = 0; t < crcUtilization.Count; t++) { if (crcUtilization[t] == crcTable[i]) { ok = true; break; } }
            if (!ok) crcUtilization.Add(crcTable[i]);
            ok = false;
            for (int t = 0; t < myCrcUtilization.Count; t++) { if (myCrcUtilization[t] == myCrcTable[i]) { ok = true; break; } }
            if (!ok) myCrcUtilization.Add(myCrcTable[i]);
         }
// BREAKPOINT HERE and check out crcUtilization and myCrcUtilization (or print)
      }

      private static void CRCTest(byte[] buffer)
      {
         ushort crc16 = 0;
         ushort myCrc = 0;
         CRC16(buffer, out crc16, out myCrc);
      }

The tests show that myCrc and crc16 both have even spread values so every combination of bytes in the original buffer has a unique number within the limit of 64K. For example if you comment out the line "myCrc = (ushort)((myCrc) ^ (myCrc << 5) ^ (myCrc << 8));" in my test function (only XOR) you will see few results covered repeatedly and many values not used at all (zero).

If such implementation is good then it is clearly suitable for AVX vectorization, as opposed to using a for loop or a lookup table.

I am open for any suggestion or thought.