Normal incremental backups check the archive attribute of a file, to determine if it has changed, and thus in need of being backed up. The backup program then backs up only those files that have changed. When storing to tape, this means that data from incremental backups tend to be spread out over multiple tapes, making it time consuming to restore. That is why the occasional full backup is necessary.
But what if we were able to back up files incrementally, while using the absolute minimum of space on the backup medium? If we were able to do that, then the relative cost of disk storage (already quite low) may become closer to that of tape storage, making it feasible to store ALL of our backed up data in an easily accessible manner, on hard disk. This article proposes a way to do that, and also touches on some of the .NET code we could use.
In addition, we tackle the task of ensuring that private data remains private, inaccessible even to those with direct access to the backup medium.
As a brief introduction, for those unfamiliar with some of the basic concepts,
- Hash algorithm - an algorithm that creates a unique signature for a series of bytes. There is no way to reverse engineer the hash back into the file itself. A small change in the file will result in a totally different hash. Common examples are MD5 and SHA1.
- Symmetric encryption - encryption of data with a single private key. To decrypt the file, the private key is required.
- IV - Initialization Vector, used in symmetric algorithms as a starting point to the encryption, ensuring that the same data will never be encrypted the same way twice. This improves the security of the encryption. It is not necessary to possess the IV in order to decrypt the file.
The first technique used is cryptographic hash functions. The cryptographic hash of a file's contents is pretty much guaranteed to be a unique signature for that file. And it is very small, typically 128 to 160 bits. We can match these signatures against the server's backup log to determine if files have previously been backed up.
If the backup system knows that file ABC.DLL already exists, there is no need for a second person to back up ABC.DLL again - we can just create a pointer to the original file. In that way, we can allow people to backup their whole hard drives across the network, with minimal use of bandwidth and storage. We would only be storing the actual data, nothing duplicated.
We could create a giant, write-once hard disk, with our data packed as efficiently as possible.
To make a backup of a file, the program first computes a hash of the file contents. This is fairly easy. We just open the file using a
FileStream object, and then pass it to our hash function that returns the hash.
private byte SHA1Hash(FileStream inFile)
SHA1 sha1 = new SHA1CryptoServiceProvider();
byte hash = sha1.ComputeHash(inFile);
That's pretty simple. The only complication is that, after the hash is computed, we reset the file stream back to its start. This is so that further processing can be done on the file.
Encrypting the file
Some other code (not shown) then queries the server to determine if that file exists. If we determine that the file needs to be sent to the server, then we need to encrypt the file. However, if you recall, the file needs to be accessible to all clients that have possession of it - otherwise, we would be storing the same data twice = bad :(
We encrypt the file with a key that is known to all clients that possess it - the MD5 hash of the file.
RC2 rc2 = new RC2CryptoServiceProvider();
rc2.Key = MD5Hash(inFile);
ICryptoTransform encryptTransform = rc2.CreateEncryptor();
CryptoStream encryptStream = new CryptoStream(
In the chunk of code above, we use a function
MD5Hash (not shown because it is basically the same as
SHA1Hash) to generate the key for the file, based on the contents of the file. We tell the RC2 provider to generate a random IV, and then set up a
CryptoStream object to encrypt the data directly to a
NetworkStream that has been set up elsewhere.
The parameters to the
CryptoStream constructor are:
netStream - a
NetworkStream object that has been set up elsewhere. This is where the data will be sent.
encryptTransform - the encryptor that we have created. This is what will do the work of encrypting the data
CryptoStreamMode.Write - a flag to tell the stream that we are writing to
netStream, not reading from it.
I like the idea of layering streams, because it eliminates the need for intermediate storage, either on disk or in memory. The data in the code above flows like this:
FileStream --> CryptoStream --> NetworkStream
Eventually, when we download (restore) and decrypt the file, we can follow the reverse flow:
NetworkStream --> CryptoStream --> FIleStream
It is all very well to say that the clients that possess the file have the key needed to decrypt it, but that falls apart a little when the client no longer possesses the file. After all, this is a backup system, and if they need to download the file, we can be sure they no longer have the ability to compute a hash on its contents.
For this reason, the clients store each backed-up file's fingerprint (and other associated
FileInfo data) in a virtual keyring. This keyring is serialized to a file, and encrypted with the client's password. It is then sent to the server for safekeeping. Thus, as long as a user has access to their password, they can retrieve their keyring from the server, and use it to unlock any files they want to restore.
Points of Interest
The idea of sending a file only once to a file server is not new. Some people at Bell Labs came up with the idea already.
They do not encrypt the data, but they do use the idea of SHA1 signatures for blocks of data (many blocks may make up a file). I give them credit for the idea of using SHA1 instead of MD5 as the fingerprint - they went to the trouble of calculating the odds of a non-unique fingerprint, 10^-20, with an exabyte of data stored in 8K fingerprinted blocks.
Eventually, I plan on breaking larger files into blocks as well. It will work much better for large files, like databases, where only part of the file will change at any particular time.
The file made available for download is a single class file that I use on the backup client. It has all of the code discussed in this article, plus some other code needed to communicate to the server.
In a later article, I plan on discussing the communication with the server in more detail. All about a scaleable TCP socket server. But I'll save that for when I've got it working :)