|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Announcements
Want a new Job?
Chapters
Services
Feature Zones
|
IntroductionNormal incremental backups check the archive attribute of a file, to determine if it has changed, and thus in need of being backed up. The backup program then backs up only those files that have changed. When storing to tape, this means that data from incremental backups tend to be spread out over multiple tapes, making it time consuming to restore. That is why the occasional full backup is necessary. But what if we were able to back up files incrementally, while using the absolute minimum of space on the backup medium? If we were able to do that, then the relative cost of disk storage (already quite low) may become closer to that of tape storage, making it feasible to store ALL of our backed up data in an easily accessible manner, on hard disk. This article proposes a way to do that, and also touches on some of the .NET code we could use. In addition, we tackle the task of ensuring that private data remains private, inaccessible even to those with direct access to the backup medium. BackgroundAs a brief introduction, for those unfamiliar with some of the basic concepts,
File SignaturesThe first technique used is cryptographic hash functions. The cryptographic hash of a file's contents is pretty much guaranteed to be a unique signature for that file. And it is very small, typically 128 to 160 bits. We can match these signatures against the server's backup log to determine if files have previously been backed up. If the backup system knows that file ABC.DLL already exists, there is no need for a second person to back up ABC.DLL again - we can just create a pointer to the original file. In that way, we can allow people to backup their whole hard drives across the network, with minimal use of bandwidth and storage. We would only be storing the actual data, nothing duplicated. We could create a giant, write-once hard disk, with our data packed as efficiently as possible. To make a backup of a file, the program first computes a hash of the file contents. This is fairly easy. We just open the file using a private byte[] SHA1Hash(FileStream inFile)
{
SHA1 sha1 = new SHA1CryptoServiceProvider();
byte[] hash = sha1.ComputeHash(inFile);
//reset back to the start of the file
inFile.Seek(0, SeekOrigin.Begin);
return hash;
}
That's pretty simple. The only complication is that, after the hash is computed, we reset the file stream back to its start. This is so that further processing can be done on the file. Encrypting the fileSome other code (not shown) then queries the server to determine if that file exists. If we determine that the file needs to be sent to the server, then we need to encrypt the file. However, if you recall, the file needs to be accessible to all clients that have possession of it - otherwise, we would be storing the same data twice = bad :( We encrypt the file with a key that is known to all clients that possess it - the MD5 hash of the file. RC2 rc2 = new RC2CryptoServiceProvider();
//use the MD5 hash as the key for the encryption
rc2.Key = MD5Hash(inFile);
rc2.GenerateIV();
ICryptoTransform encryptTransform = rc2.CreateEncryptor();
CryptoStream encryptStream = new CryptoStream(
netStream,
encryptTransform,
CryptoStreamMode.Write);
In the chunk of code above, we use a function The parameters to the
I like the idea of layering streams, because it eliminates the need for intermediate storage, either on disk or in memory. The data in the code above flows like this: FileStream --> CryptoStream --> NetworkStream Eventually, when we download (restore) and decrypt the file, we can follow the reverse flow: NetworkStream --> CryptoStream --> FIleStream The keyringIt is all very well to say that the clients that possess the file have the key needed to decrypt it, but that falls apart a little when the client no longer possesses the file. After all, this is a backup system, and if they need to download the file, we can be sure they no longer have the ability to compute a hash on its contents. For this reason, the clients store each backed-up file's fingerprint (and other associated Points of InterestThe idea of sending a file only once to a file server is not new. Some people at Bell Labs came up with the idea already. They do not encrypt the data, but they do use the idea of SHA1 signatures for blocks of data (many blocks may make up a file). I give them credit for the idea of using SHA1 instead of MD5 as the fingerprint - they went to the trouble of calculating the odds of a non-unique fingerprint, 10^-20, with an exabyte of data stored in 8K fingerprinted blocks. Eventually, I plan on breaking larger files into blocks as well. It will work much better for large files, like databases, where only part of the file will change at any particular time. FootnotesThe file made available for download is a single class file that I use on the backup client. It has all of the code discussed in this article, plus some other code needed to communicate to the server. In a later article, I plan on discussing the communication with the server in more detail. All about a scaleable TCP socket server. But I'll save that for when I've got it working :)
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||