
Using a rolling hash to break up binary files

29 Jul 2014 · CPOL · 8 min read
Using C# and a Rabin-Karp rolling hash, we'll break a file into identifiable chunks

Introduction

If you've ever had to figure out what changed in a large binary chunk of data, you may have run into some of the same obstacles I did.  Recently, I found a surprisingly simple approach to identifying the changes and then storing only those changes.

I had so much fun figuring it out that I thought I'd post an article to explain it.  This article attempts to strip away all but the cool part.  For example, the sample code's body could be given over to a producer/consumer queue...but this detracts from the simplicity, and I'll leave such optimizations to your imagination.

Background

There are cases where a large volume of information needs to be transmitted from one place to another, but the desire to prevent sending redundant data warrants some CPU cycles to check whether that data already exists at the recipient.  Network backup applications, for example, benefit greatly by not backing up data already known to be in the backup repository.

Also, for large, complex web applications, a similar approach has been ported to JavaScript and local storage in Microsoft Research's SILO project.  In that case, redundant page content is the thing to be prevented.  You can read up on SILO at http://research.microsoft.com/apps/pubs/default.aspx?id=131524 or watch a presentation at http://channel9.msdn.com/Events/MIX/MIX11/RES04 that touches on the approach.

Finding Changes in Binary Data - The Problem

I’m presently working on a system that has to back up data on Windows drives.  We store that data in a central repository, and we run the backups frequently.  Backups are often huge.  The first one can be painful, but after that, if we keep some evidence around, subsequent backups are pretty quick.

Some of the evidence comes from the file system.  For example, NTFS keeps a rolling journal of files that have been changed.  If the journal hasn’t rolled off the changes we captured in the last backup, then it has details of all the files that have been changed.  As I’m sure you can imagine, it’s a beautiful thing to know that only 0.05% of the files have changed at backup time.

The NTFS journal solves the single biggest performance hurdle for me, but all is not light and lovely.  I still have this one nagging situation.  Let’s say we have a file that a user changes constantly…maybe a mail file or a database file.  Every backup cycle, there it is again…that 2GB file is dirty!  Every backup cycle, I don't really want to add 2GB to the central store.  Yuck!

So, the problem is, how do we figure out what’s changed and only back that up?

I tried a couple of different approaches.  I won’t embarrass myself by telling you about them.  Instead, I’ll embarrass myself by telling you what I finally came up with.  In all reality, it’s pretty simple and pretty quick, and so it’s quite acceptable.

In a nutshell, I break the file into pieces, and compare those pieces to the pieces that I had last time, transferring only those pieces that are different.  The trick to the whole problem is identifying the pieces.

Now, you can’t just break a file up into arbitrary-sized pieces…say, 64K chunks.  What happens, for example, if the file gets new content inserted way up front?  My old chunks and my new chunks won’t line up.  Similarly, comparing byte-by-byte until I hit a change…and then assuming everything after that is different…fails for the same reason; an early change blows me up.

If I knew the file’s internal structure, I could probably find optimal approaches, but I’d have to code for each new kind of file I might encounter.  I tried getting the longest common substring and developing edits…the way a version control system might.  Turns out, this is computationally expensive and complex.

A Possible Solution

Then, I read about using rolling hashes to find natural breakpoints in a file based on the shape of the data.  The idea is pretty cool…I’ll find patterns in the data that I can use to break the file up!  Then, I’ll have a list of smallish chunks, each of which is comparable and independently storable.  We can vary between long lists of small chunks or short lists of big chunks.

A rolling hash is one where you pick a window size…let’s say 64 bytes…and then hash every 64-byte-long segment in the file.  I don't mean a hash for 0-63, 64-127, 128-191, etc…but for 0-63, 1-64, 2-65, etc.  I know that sounds like something that’s going to burn a hole in your processor, but here’s the surprise…it goes pretty quick with the right hash function.  More on that later.

Skipping the apparent implausibility for a second, what does this hash get us?  Well…if it’s a reasonably good hash function, we’ll get well distributed hash values for each chunk of data we look at.  Assuming we do get random-looking hash values, we can take a regular subset of those hashes and arbitrarily declare that these hashes “qualify” as sentinel hashes.

The smaller the subset of qualifying hashes, the farther apart the qualifying positions will be, and hence the larger the average chunk.  The way I figure out whether a hash qualifies is:

C#
var matched = ( hash | mask ) == hash;

…where hash is the rolling hash, and mask is a bit mask.  The more 1-bits in the mask, the more hashes we’ll exclude and the bigger our chunks are.  You can use any way you want to get 1-bits into your mask…but an easy, controllable way is to declare it like this:

C#
const long mask = ( 1 << 16 ) - 1;

…where 16 is the number of 1-bits I want.  Each 1-bit doubles the size of the average chunk.
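
To put numbers on that, here's a tiny sketch of my own (hypothetical helpers, not part of the article's class): with a well-distributed hash, all of the masked bits are set with probability 1/2^bits, so a boundary fires about once every 2^bits bytes.

C#
internal static class ChunkSizing
{
  //--> bits = the number of 1-bits you want in the mask
  public static long MakeMask( int bits ) { return ( 1L << bits ) - 1; }

  //--> expected average chunk size in bytes, assuming well-distributed hash values
  public static long ExpectedChunkSize( int bits ) { return 1L << bits; }
}

So 13, 16, and 20 one-bits give average chunks of roughly 8KB, 64KB, and 1MB, respectively.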

The hashing algorithm itself was originally described by Richard M. Karp and Michael O. Rabin and can be found here:

http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm

This function’s beauty lies in the fact that it is very efficient to apply to a running window of bytes.  Unlike a lot of hashes, you can cheaply add a contributing byte to one end and remove one from the other…hence the “rolling” nature of the hash.  It uses only integer addition and multiplication, so it’s fairly processor-friendly.  It’s not a cryptographically strong hash, but we’re not using it for cryptography in this context.

Cooked down, for a window of bytes w[0…n] (w[0] the newest byte, w[n] the oldest) and a constant seed p, the hash is computed as:

hash = p^n*w[n] + p^(n-1)*w[n-1] + … + p^0*w[0]

...but we won't ever calculate this all at once for a window.  Assuming we start with an empty window, we know the hash of that window is 0.

So, once you’ve got a hash for your window and you’re ready to slide the window by one byte, you have to throw out the oldest byte and add the new one.  In pseudocode, that looks like this:

hash -= p^n * w[n];  // remove the oldest byte’s contribution
hash *= p;           // bump each remaining byte’s contribution up by one power of p
hash += newByte;     // add the new byte…noting that p^0 * newByte == newByte
// <advance the window>

If you pre-calculate p^n before going into your slide loop, the whole operation is pretty small.  In fact, the cost of the calculation no longer depends on the size of your window.  As we slide along, recognizing matching hashes, all we need to keep is a little state telling us where the last chunk ended, so we can yield a chunk descriptor to the caller.  The chunk descriptor is an instance of StreamBreaker.Segment and contains the offset and length of the chunk.
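
As a sanity check, here's a small stand-alone sketch of my own (not part of the article's class) that runs the incremental update next to a direct evaluation of the polynomial for each window and confirms they agree.  It uses a small window but the same seed and update expression as the class below; maxSeed is seed^(width-1), the factor carried by the oldest byte, and the deliberate long overflow wraps both computations identically.

C#
using System;
using System.Linq;

internal static class RollingHashCheck
{
  const int width = 8;     //--> small window so the example is easy to follow
  const long seed = 2273;  //--> same seed the class below uses

  static void Main( )
  {
    var data = Enumerable.Range( 0, 64 ).Select( i => ( byte ) ( i * 37 ) ).ToArray( );

    //--> maxSeed = seed^(width - 1): the factor carried by the oldest byte in the window
    var maxSeed = 1L;
    for ( int i = 0; i < width - 1; i++ ) maxSeed *= seed;

    var hash = 0L;
    var circle = new byte[ width ];
    var circleIndex = 0;

    for ( int i = 0; i < data.Length; i++ )
    {
      //--> the same update the class uses: drop the oldest byte, shift, add the new byte
      hash = data[ i ] + ( ( hash - ( maxSeed * circle[ circleIndex ] ) ) * seed );
      circle[ circleIndex++ ] = data[ i ];
      if ( circleIndex == width ) circleIndex = 0;

      if ( i >= width - 1 )
      {
        //--> direct evaluation: oldest byte gets seed^(width-1), newest gets seed^0
        var direct = 0L;
        for ( int k = i - width + 1; k <= i; k++ ) direct = ( direct * seed ) + data[ k ];
        Console.WriteLine( "{0}: rolling={1} direct={2} match={3}", i, hash, direct, hash == direct );
      }
    }
  }
}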

We now have to decide which chunks, if any, we want to keep.  We’ll use hashing again...to make a fingerprint for the chunk.  Later on, we'll compare these fingerprints to fingerprints we've already stored.  This fingerprint will be another hash.  There are hashes that are so strong, the likelihood of two different chunks producing the same fingerprint all but disappears.  Unfortunately, our rolling hash is not one of these strong hashes.

It would be nice to collect the strong hashes while getting offsets and lengths, and avoid another pass through the file, so I arranged the code to do this using the same rolling hash window.  It makes it a little more complicated, but I figured that anybody getting the chunk list needs the chunk fingerprints, too.  Every time I cycle the window, I submit it to a strong hashing algorithm.  Every time I recognize a chunk boundary, I finalize any pending strong hash and yield the strong hash along with the length and offset.  The code is not concerned about which hash algorithm you choose...as long as it's based on System.Security.Cryptography.HashAlgorithm.
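
To show where that leads, here's a minimal sketch of consuming the segments as a backup decision.  It's my own illustration, not part of the class below: knownFingerprints stands in for whatever fingerprint index your repository keeps, and the body of the if is where you'd send the chunk.

C#
//--> knownFingerprints is a hypothetical ISet<string> kept by your repository
internal void StoreNewChunks( Stream stream, long length, HashAlgorithm hasher, ISet<string> knownFingerprints )
{
  var streamBreaker = new StreamBreaker( );
  foreach ( var segment in streamBreaker.GetSegments( stream, length, hasher ) )
  {
    var key = BitConverter.ToString( segment.Hash );  //--> the chunk's fingerprint
    if ( knownFingerprints.Add( key ) )
    {
      //--> a fingerprint we haven't stored yet: the chunk at (segment.Offset, segment.Length) must be sent
    }
  }
}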

Here's the whole class:

 

C#
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace FunStuff
{
  /// <summary>Uses a rolling hash (Rabin-Karp) to segment a large file</summary>
  /// <remarks>
  /// For a given window w[0..n] (w[0] the newest byte, w[n] the oldest) and a prime seed p,
  /// the RK hash is computed: p^n*w[n] + p^(n-1)*w[n-1] + ... + p^0*w[0]
  /// 
  /// This hash is such that a contributing underlying value 
  /// can be removed from the big end and a new value added to the small end.
  /// So, a circular queue keeps track of the contributing values.
  /// 
  /// Hash of each chunk is also computed progressively using a 
  /// stronger hash algorithm of the caller's choice.
  /// </remarks>
  public class StreamBreaker
  {
    private const int width = 64;  //--> the # of bytes in the window
    private const long seed = 2273;  //--> our hash seed (a prime)
    private const long mask = ( 1 << 16 ) - 1;  //--> a hash sieve: 16 one-bits gives ~64KB average chunks
    private const int bufferSize = 64 * 1024;
    private static object sync = new object( );

    /// <summary>
    /// Subdivides a stream using a Rabin-Karp rolling hash to find sentinel locations
    /// </summary>
    /// <param name="stream">The stream to read</param>
    /// <param name="length">The length to read</param>
    /// <param name="hasher">A hash algorithm to create a strong hash over the segment</param>
    /// <remarks>
    /// We may be reading a stream out of a BackupRead operation.  
    /// In this case, the stream <b>might not</b> be positioned at 0 
    /// and may be longer than the section we want to read.
    /// So...we keep track of length and don't arbitrarily read 'til we get zero
    /// 
    /// Also, overflows occur when getting maxSeed and when calculating the hash.
    /// These overflows are expected and not significant to the computation.
    /// </remarks>
    public IEnumerable<Segment> GetSegments( Stream stream, long length, HashAlgorithm hasher )
    {
      var maxSeed = 1L;  //--> will be seed^(width-1) after initialization: the factor on the oldest byte in the window
      var buffer = new byte[ bufferSize ];
      var circle = new byte[ width ];  //--> the circular queue: input to the hash functions
      var hash = 0L;  //--> our rolling hash
      var circleIndex = 0;  //--> index into circular queue
      var last = 0L;  //--> last place we started a new segment
      var pos = 0L;  //--> the position we're at in the range of stream we're reading

      //--> initialize maxSeed = seed^(width-1)...
      for ( int i = 0; i < width - 1; i++ ) maxSeed *= seed;

      while ( true )
      {
        //--> Get some bytes to work on (don't let it read past length)
        var bytesRead = stream.Read( buffer, 0, ( int ) Math.Min( bufferSize, length - pos ) );
        for ( int i = 0; i < bytesRead; i++ )
        {
          pos++;
          hash = buffer[ i ] + ( ( hash - ( maxSeed * circle[ circleIndex ] ) ) * seed );
          circle[ circleIndex++ ] = buffer[ i ];
          if ( circleIndex == width ) circleIndex = 0;
          if ( ( ( hash | mask ) == hash ) || ( pos == length ) )  //--> match or EOF
          {
            //--> apply the strong hash to the remainder of the bytes in the circular queue...
            hasher.TransformFinalBlock( circle, 0, circleIndex == 0 ? width : circleIndex );

            //--> return the results to the caller...
            yield return new Segment( last, pos - last, hasher.Hash );
            last = pos;

            //--> reset the hashes...
            hash = 0;
            for ( int j = 0; j < width; j++ ) circle[ j ] = 0;
            circleIndex = 0;
            hasher.Initialize( );
          }
          else
          {
            if ( circleIndex == 0 ) hasher.TransformBlock( circle, 0, width, circle, 0 );
          }
        }
        if ( bytesRead == 0 ) break;
      }
    }

    /// <summary>A description of a chunk</summary>
    public struct Segment
    {
      /// <summary>How far into the stream we found the chunk</summary>
      public readonly long Offset;
      /// <summary>How long the chunk is</summary>
      public readonly long Length;
      /// <summary>Strong hash for the chunk</summary>
      public readonly byte[ ] Hash;

      internal Segment( long offset, long length, byte[ ] hash )
      {
        this.Offset = offset;
        this.Length = length;
        this.Hash = hash;
      }
    }

  }
}

Using the code

For clarity, I reduced the solution to a single method returning an enumeration of Segment objects.  There’s a little more overhead to returning an enumeration than, say, a list…but I expect to benefit from the early discovery of new chunks, and I’m guessing that’s the way you’re thinking about it, too.

To call the method, get yourself an instance of a strong hasher and a stream containing the data you want to analyze, then call it like so:

internal void Test( string fileName )
{
  var list = new List<StreamBreaker.Segment>( );
  using ( var hasher = new SHA1Managed( ) )
  {
    var streamBreaker = new StreamBreaker( );
    using ( var file = new FileStream( fileName, FileMode.Open, FileAccess.Read ) )
    {
      list.AddRange( streamBreaker.GetSegments( file, file.Length, hasher ) );
    }
  }
}
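
If you hang on to that list between runs, figuring out what changed next time is mostly a fingerprint comparison.  Here's a rough sketch of my own (not part of the article's download; it assumes System.Linq and System.Collections.Generic are imported) that reports which chunks of a new run weren't present in an old run:

C#
//--> renders a strong hash as a string key
internal static string ToKey( byte[ ] hash )
{
  return BitConverter.ToString( hash );
}

//--> yields the segments of the new run whose fingerprints the old run didn't contain
internal static IEnumerable<StreamBreaker.Segment> FindNewChunks(
  IEnumerable<StreamBreaker.Segment> oldSegments,
  IEnumerable<StreamBreaker.Segment> newSegments )
{
  var oldKeys = new HashSet<string>( oldSegments.Select( s => ToKey( s.Hash ) ) );
  return newSegments.Where( s => !oldKeys.Contains( ToKey( s.Hash ) ) );
}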

Points of Interest

One thing you'll find is that you'll have some tiny chunks and some really big chunks.  An interesting note about really big chunks...the longer the chunk is, the more compressible it is.  It got long because it probably has a big span of zeroes or other repeating pattern that just happened to not trigger a match.  Compression algorithms love these long runs.
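
If you compress chunks individually before storing them, those long runs pay off.  Here's a minimal sketch of my own using GZipStream (the article's download doesn't do this; it assumes System.IO and System.IO.Compression are imported):

C#
internal static byte[ ] Compress( byte[ ] chunk )
{
  using ( var output = new MemoryStream( ) )
  {
    using ( var gzip = new GZipStream( output, CompressionMode.Compress ) )
    {
      gzip.Write( chunk, 0, chunk.Length );  //--> long repeating runs shrink dramatically here
    }
    return output.ToArray( );  //--> MemoryStream.ToArray still works after the stream is closed
  }
}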

Update

I added a sample application that exercises the whole process.  It starts with some lorem ipsum text, runs the process, makes a change to the text (in two places) and runs the process again...and then reports the differences it found.  What's most interesting about the sample is that, while the change is essentially a 1-character change in two places, you'll see that the change resulted in three chunk changes.  Hmmm... Can you figure out why?

 

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


