The second installment of this series discussed the cost of maintaining systems with various levels of downtime, ways data can be lost, and what constitutes a backup. In this installment, I discuss how to configure your hardware to minimize the need to restore a backup.
Getting the hardware right
Properly setting up hardware is the first step in keeping your data safe. Decisions you make as you specify hardware and set up databases can determine whether you are a database hero or scapegoat. Your average hardware guy does not know SQL Server, and your average DBA does not know hardware. Being the one to understand enough about both worlds can make a real difference in the outcome.
Think for a moment about your average, run-of-the-mill, low-end server. It might have one or two hard drives and one or two power supplies. When SQL Server is installed with the default settings, all programs, data and logs are placed on the same drive, in the same folder (awkwardly, in my view, in the Program Files folder).
How many single points of failure do you have with the default setup? First, the failure of either a single hard drive or single power supply will bring down the server, at least temporarily. In the case of a hard-drive failure, you will need to get a new drive, reinstall SQL Server and restore backups. This will take some time.
There are a number of things you can do to reduce such possible exposure to catastrophic loss of data. The first thing is to consider purchasing a server that has at least two power supplies. A power supply is a fairly common point of failure, and often getting redundant power supplies adds only about $250 to the cost of the system. This is money well spent.
RAID (redundant array of independent [or inexpensive] drives) hard disks are two or more drives used in combination to provide fault tolerance and/or performance benefits. Doing a Google search for RAID will get you a large number of articles covering RAID in great detail, but below is a quick explanation of the popular RAID levels.
- RAID 0—striped disk array without fault tolerance. Blocks of data are spread over two or more drives, providing better performance but no redundancy.
- RAID 1—mirroring and duplexing. Writes data to multiple disks at the same time. Given proper hardware support, speed is about the same as non-RAID drives. You get only one disk's worth of capacity for each two disks you have in RAID 1 configuration.
- RAID 5—data and parity information spread across multiple drives. RAID 5 uses a distributed parity arrangement, so that parity information on a single drive does not become a bottleneck. There is some additional overhead to read and write parity information.
- RAID 0+1—two RAID 0 stripes with a RAID 1 mirror created over them.
- RAID 10—multiple RAID 1 mirrors are created, and a RAID 0 is created on top of these.
There are a number of other RAID levels, but these are the ones used most often. RAID 4, for example, is like RAID 5, but has a dedicated parity drive. It is unlikely anyone would use RAID 4 when the same hardware can generally do a better job in RAID 5 configuration.
In the case of the RAID levels that provide redundancy, you might ask what exactly happens when one of the drives go bad? This also depends on the hardware involved.
In many cases, RAID drives are "hot swappable," meaning that when the controller senses a failure, you can simply remove the offending drive and replace it while the server continues to operate. After the drive is replaced, the new drive will be updated so that it will contain the data and parity information required. A RAID 5 configuration will fail if another drive fails before the first bad drive is replaced. Some controllers allow a "hot standby" drive that contains no data but will automatically be used when one of the drives fails.
Other possible points of failure are the network card, memory, processor and motherboard. Many servers offer redundant network cards and error-correcting memory to reduce the chances of catastrophic failure from those components. In the event of processor or motherboard failure, you will likely need to do some repairs.
Configuration for SQL Server
How does this information on hardware figure into how you will install and configure SQL Server? When specifying the hardware and OS setup for a SQL Server machine, there are a number of things you can do to improve reliability and recoverability.
First and foremost, you want to keep the log and the data on different drives.
Whenever SQL Server makes a change to the database, in addition to changing the data, information on the changes are written to the log (yes, this is a bit of an exaggeration—some operations are not logged). By placing the log on one drive or RAID array and the data on another, you have the greatest chance of eliminating a catastrophic loss of data. If the log volume fails, you can simply repair the volume with no loss of data. If the data volume fails, given a backup and the log, you should be able to recover up to the point of failure.
I recently had an embarrassing lapse in SQL Server backup procedures that stemmed from not placing data and logs on separate volumes. While I have overseen proper backup and recovery procedures on a number of client sites, my own personal sites (which were really hobby sites, not a part of my business) were less well protected. I was using a very old server that had been repaved a few times, and unknown to me, the different logical volumes I was using for data, backup and logs were on the same physical volume. You can guess the rest: hard drive failure, taking all recent data with it.
Fortunately, this was personal data that was not needed for my business; my business data was properly backed up and was restored within an hour or two. But, just because this data has no commercial value, doesn't mean it isn't important. Recovering it involved some clever use of the Google cache of a page that the data was on. What is that story about the shoeless son of the shoemaker?
While SQL Server can be installed on FAT, FAT32 or NTFS volumes, there is no reason not to install SQL Server on an NTFS volume. NTFS supports compression of a volume or directory, however it is strongly recommended not to compress any volume or directory being used for SQL Server data or log files. Compression seriously degrades SQL Server performance. If a volume is to be dedicated to SQL Server's use for log or data, format the drive with a 64 KB cluster size (a size that matches the SQL Server data extent size).
Rather than just using RAID 1 (or RAID 5 or RAID 10) for both data and log arrays, our understanding of how the data on the log and data volumes will be used can allow a more intelligent choice. Presuming you are interested in the best in read-and-write performance and cannot tolerate any data loss, using RAID 10 or 0+1 for data and RAID 1 for the log offers outstanding performance and reliability, but at a substantial cost. For even more reliability, you can have multiple mirrored copies of the RAID 10 or 0+1 data volume and RAID 1 log volume.
Presuming your needs are a little less extreme, and especially if write performance is not critical, using a RAID 5 volume for the data and a RAID 1 volume for the log can offer much of what the more expensive RAID 10/RAID 1 option does.
Using either RAID 5 or RAID 10 for your data, you most likely will be able to recover from a single disk failure. Disk failure is the most likely hardware-related problem with your server.
Eliminating remaining points of failure
Even with the RAID setup described above, there are still multiple points of failure to consider. Should the motherboard or disk controller fail, your data will likely still be on the hard disks, but inaccessible until the hardware problem is resolved. I say the data is "likely" to be on the hard disks, since it is possible for a disk controller to fail in a spectacular enough way that it takes all the data with it.
I commonly work on web applications, and one of the great things about the protocols used on the web is the ease with which you can eliminate a single point of failure. Depending upon a number of factors, you can create a cluster of web servers, with a load balancer in the front of the cluster to distribute incoming requests. The stateless nature of HTTP requests generally means that each single request can be handled by any machine in the cluster. If a user requests a list of items, clicks on one of the items and is sent to a different web server, the request can be handled cleanly. This ignores things like session state, but many web technologies allow session state to be shared among machines in a cluster.
Clustering SQL Servers is much more difficult. To take advantage of the Microsoft solution, you must be using the Windows 2000/2003 Advanced Server or Datacenter version, as well as SQL Server Enterprise Edition. While I have often used clustering for web servers, I have not used clustering on SQL Server. I imagine that is due in part to the overwhelming cost of such a solution, as well as the difficulty of setting it up.
Note also that if you cluster SQL Server on two machines in the same physical location, you are not protecting against all of the types of failure I discussed in part two of this series. Clustering does not prevent catastrophic site failures, such as extended power outages, floods and hurricanes. There are more mundane site failures as well, such as loss of Internet connectivity.
There are a number of possible solutions for mundane site failures. Presuming normal connectivity between two sites, you can create a server that will contain a very recent copy of the live database. I discussed some ways to replicate data from one SQL Server to another in an article on synchronizing SQL Server databases using Red Gate Software's SQL Toolkit.
Something I have learned about recently is log shipping. Books have been written on the topic of SQL Server replication, but while it sounds like it might be a simple solution to the problem, it is not simple at all. More important, it is not cheap. It involves deploying the most reliable servers, adding a second fall-over site, keeping the two sites synchronized, and managing client access to the correct location. Such a setup and configuration can easily bring your costs into the millions of dollars.
Next: specifics of backing up SQL Server data
Now that you know why to backup, what exactly a backup is, and how to configure your hardware to minimize the need to restore a backup, the next step is to explore the options for SQL Server backup. Where should you backup? How often? And what should you do with those backups after you make them? All that in the next article in this series.
About Doug Reilly
Douglas Reilly was a software developer specializing in ASP.NET and mobile development. He lost his long struggle against cancer in December 2006.