Chapter 6. High Availability

The concept of high availability is quite simple: systems must stay up as much as possible (in some cases, constantly). This requirement applies not only to unexpected events such as system failures but also to planned downtime such as maintenance or upgrades.

Microsoft provides a wide array of features within its operating systems and in SQL Server 2005 that can help you accomplish that goal. The mechanism you choose depends on what level of availability you are after and which parts of the system you want available.

To duplicate data across several servers, you can copy data automatically from one place to another using replication. The several types of replication move data from one server to many, or between servers in both directions, either pulling the data from its source or pushing it to a new location. To send an entire database to another server at short intervals, you can use log shipping to copy the Transaction Log from one database and apply it to another, keeping both in sync. When the copy must be kept current immediately, you can copy the database automatically to another server using database mirroring. You can even duplicate an entire server setup, something Microsoft refers to as "clustering." With a cluster, the system uses multiple computers connected to the same database to maintain constant uptime.

In this chapter, I cover what you need to know about the options you have for implementing a highly available system, and I explain how you manage and monitor it. There is a lot to cover, so I give an overview of the main concepts and show you practical examples along the way.

It is often difficult to find a highly available system to practice on because the production systems are sensitive to change and the components that make up the system are expensive. In the "Take Away" section at the end of this chapter, I show you how to set up an entire cluster on a single system. You can use the same technology to practice or test the other high-availability mechanisms, too.

High availability is keeping your systems operational as long as possible. The requirements may even state that the system needs to be available all the time. There are many reasons to implement a highly available system, but the primary one is protecting your systems against a hardware failure. The effort to implement a highly available system often begins after a crisis at the organization or after a crisis at another firm makes the news. The question is asked, "What can we do to prevent this?"

Another reason for high availability is that the system might be so large that it would take a long period of "downtime" to implement maintenance, which the organization cannot afford. The organization might also want to keep the system available during changes in an application or when performing upgrades.

Although all these goals for high availability are easy to state, they are often full of vague terms such as "systems," "operational," and "all the time." It is important to ensure that the organization understands the costs and benefits of a high-availability system. After you define the terms, there are several ways to reach your high-availability goals.

The place to start is to define what the organization desires to have as "uptime." When asked that question, the organization most often responds that it wants 100 percent uptime, but this is rarely a necessity. Most organizations do, in fact, have a time when a particular system can be unavailable. If that is not true, you probably already have a highly available system in place.

What the organization wants most of the time is the guarantee that the systems will not go down during regular operational hours, which might be another thing entirely. At some firms, this might be 24 hours a day, 7 days a week; but many firms have a window of time where there are no operations, no product shipped, or no employees at work. At the least, you might be able to define which systems have these kinds of windows and how long they are.

To find out the true uptime needs, detail the functions of your systems and examine the uptime requirements of each separately. Shipping might tolerate a longer downtime than a product line, or salespeople might be off during holidays. You can design the uptime for each function within the system and apply the appropriate high-availability solution accordingly.
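The arithmetic behind these per-function requirements can be sketched in a few lines of Python. This is only an illustration: the function names and the hours below are hypothetical figures I made up for the example, not measurements from a real organization.

```python
from dataclasses import dataclass

HOURS_PER_WEEK = 24 * 7  # 168 hours in a week


@dataclass
class BusinessFunction:
    name: str
    required_hours_per_week: float  # hours the function must be available


def weekly_downtime_window(function: BusinessFunction) -> float:
    """Hours per week during which this function may be unavailable."""
    if not 0 <= function.required_hours_per_week <= HOURS_PER_WEEK:
        raise ValueError("required hours must be between 0 and 168")
    return HOURS_PER_WEEK - function.required_hours_per_week


def allowed_downtime_hours_per_year(uptime_percent: float) -> float:
    """Translate an uptime percentage into hours of downtime per 365-day year."""
    hours_per_year = 365 * 24
    return hours_per_year * (100 - uptime_percent) / 100


# Hypothetical requirements: order entry runs around the clock,
# shipping works five 12-hour days.
functions = [
    BusinessFunction("order entry", 168),
    BusinessFunction("shipping", 60),
]
windows = {f.name: weekly_downtime_window(f) for f in functions}
```

With these figures, order entry has no weekly window at all and shipping has 108 hours. The second helper shows why "100 percent" is such an expensive answer: even a 99.9 percent target leaves less than nine hours of downtime for the entire year.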

After you have defined exactly how much of the time the organization needs its systems to be available, you can begin to plan your strategy to meet that goal. Start by looking for the downtime windows that already exist. It might just be a matter of moving your maintenance processes into those windows.

Let's assume for a moment that there is no activity on the system from 1 a.m. until 6 a.m. That is a pretty big window, and you might be able to get all of your maintenance and upgrades done in that time period. If you followed the guidelines in Chapter 1, "Installation and Configuration," for buying a server-class machine with hot-swappable parts and a properly configured RAID storage unit, you may have all the redundancy you need. Perhaps you do not have a five-hour window conveniently available in the middle of the night. You might have five hours throughout the day, in various slices. You can design your maintenance and upgrade time to fit within those windows.
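To see why five hours in slices is not the same as five hours in one block, here is a minimal Python sketch that assigns maintenance tasks to windows. The task names and durations are hypothetical, and the greedy "largest task first" assignment is simply one easy way to do the bookkeeping, not a recommendation for a particular scheduling method.

```python
from datetime import time


def window_minutes(start: time, end: time) -> int:
    """Length of a same-day window in minutes."""
    start_min = start.hour * 60 + start.minute
    end_min = end.hour * 60 + end.minute
    if end_min <= start_min:
        raise ValueError("window must end after it starts on the same day")
    return end_min - start_min


def plan_tasks(windows, tasks):
    """Place each (name, minutes) task in the first window with room.

    Tasks are tried longest first. Returns a dict of task name to
    window index, plus a list of the tasks that fit nowhere.
    """
    remaining = [window_minutes(start, end) for start, end in windows]
    placed, unplaced = {}, []
    for name, minutes in sorted(tasks, key=lambda t: t[1], reverse=True):
        for index, free in enumerate(remaining):
            if minutes <= free:
                remaining[index] -= minutes
                placed[name] = index
                break
        else:
            unplaced.append(name)
    return placed, unplaced


# Hypothetical task durations, in minutes.
tasks = [("full backup", 120), ("index rebuild", 150), ("service pack", 30)]

# One five-hour window from 1 a.m. to 6 a.m.
night = [(time(1, 0), time(6, 0))]
# The same five hours, split into three slices during the day.
slices = [(time(2, 0), time(4, 0)), (time(12, 0), time(13, 0)),
          (time(21, 0), time(23, 0))]

night_plan = plan_tasks(night, tasks)
slice_plan = plan_tasks(slices, tasks)
```

In the single night window all three tasks fit. With the same total time cut into slices, the 150-minute index rebuild has nowhere to go, which tells you that task must be shortened, split, or moved to a system that can be taken offline.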

DBA 101: Emergency Recovery Time

To find out your emergency recovery time, periodically test your backups and hardware replacements, time the process, and communicate that information to the organization. The organization's owners can then create a plan for the downtime a crisis will cause.
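Timing a recovery drill does not need special tooling. The following Python sketch assumes you can wrap each step of the drill in a function; the step labels and the empty placeholder functions are hypothetical stand-ins for your own restore and replacement procedures.

```python
import time


def time_recovery_drill(steps):
    """Run each (label, action) step in order and time it.

    Returns a list of (label, seconds) pairs and the total in seconds.
    """
    timings = []
    for label, action in steps:
        started = time.perf_counter()
        action()
        timings.append((label, time.perf_counter() - started))
    total = sum(seconds for _, seconds in timings)
    return timings, total


def restore_backup():
    """Placeholder: replace with the real restore procedure."""


def verify_database():
    """Placeholder: replace with the real verification procedure."""


timings, total = time_recovery_drill([
    ("restore latest backup", restore_backup),
    ("verify database", verify_database),
])
for label, seconds in timings:
    print(f"{label}: {seconds:.1f} s")
print(f"total recovery time: {total:.1f} s")
```

Keep the results from each drill. The number the organization needs is the total, but the per-step figures show you where the time goes.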

On the other hand, your system might have activity on it at all hours of the day and night, especially if it is used from multiple geographic locations simultaneously. In that case, you do not have enough time to perform maintenance. In high-use situations, a system stop means a stop in production, so you have a real need to protect not only the data your system holds but also the functionality of the system. If your system has these constraints, you need to consider how you will implement some form of high availability.

You have various options for high-availability technologies, each a bit more complex than the last and costing more in time and money to implement. A highly available system also requires more knowledge to operate, more planning, and more attention. You should understand what each of the technologies does for you, as well as what is involved in setting one up and maintaining it, so that you can explain the costs and benefits of the system to the organization.

SQL Server 2005, paired with the Windows operating system, provides several mechanisms for making the system highly available. You can copy data from one location to another automatically using replication or log shipping. These methods are useful when the data is allowed to have a certain level of latency; that is, when the data can be a few minutes or hours old without severely impacting the application. Neither method is connection agnostic: the applications must be able to detect that the original system is unavailable and switch to a different server.
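What "detect and switch" means for the application can be sketched as follows. This is an illustration under stated assumptions, not a production pattern: the server names are hypothetical, and `connect` stands for whatever function your data-access library uses to open a connection, assumed here to raise `ConnectionError` when a server cannot be reached.

```python
def connect_with_fallback(servers, connect):
    """Try each server in order and return the first that answers.

    `connect` is any callable that opens a connection to a server name
    and raises ConnectionError on failure (an assumption of this sketch).
    Returns a (server, connection) pair.
    """
    failures = []
    for server in servers:
        try:
            return server, connect(server)
        except ConnectionError as error:
            failures.append(f"{server}: {error}")
    raise ConnectionError("no server reachable (" + "; ".join(failures) + ")")


def fake_connect(server):
    """Stand-in for a real driver call; pretends the primary is down."""
    if server == "SQLPRIMARY":
        raise ConnectionError("timed out")
    return f"connection to {server}"


server, connection = connect_with_fallback(
    ["SQLPRIMARY", "SQLSTANDBY"], fake_connect
)
```

Here the primary fails and the application ends up connected to the standby. Remember that with replication or log shipping the standby may be minutes or hours behind, so the application also has to tolerate slightly older data after the switch.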

In database mirroring, the entire database is duplicated on another system. Once again, the applications must be able to detect that the original server is unavailable and switch, although an application built with ADO.NET 2.0 can switch automatically when the primary server goes down.
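The automatic switch in ADO.NET 2.0 is driven by the connection string: the SqlClient provider accepts a Failover Partner keyword that names the mirror server alongside the usual Data Source. The short Python helper below only assembles such a string so you can see its shape. The server and database names are hypothetical, and Integrated Security is an assumption about how you authenticate.

```python
def mirrored_connection_string(primary, mirror, database):
    """Build an ADO.NET-style connection string that names a mirror.

    The database name is included explicitly because the client needs
    to know which mirrored database it is connecting to.
    """
    parts = [
        ("Data Source", primary),
        ("Failover Partner", mirror),
        ("Initial Catalog", database),
        ("Integrated Security", "True"),  # assumption: Windows authentication
    ]
    return ";".join(f"{key}={value}" for key, value in parts) + ";"


# Hypothetical server and database names.
connection_string = mirrored_connection_string("SQLPRIMARY", "SQLMIRROR", "Sales")
print(connection_string)
```

An application that connects this way does not need its own retry list for the mirrored database, although it still has to reopen the connection and retry the interrupted work after a failover.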

The next level of availability, called clustering, provides more protection. In addition to protecting the data in real time, two or more servers protect a single identity. Applications connect to the shared identity (called a virtual server), so that if one server is unavailable, another continues to provide the shared identity to the network. Clustering requires some shared components, as I point out later, but provides a maximum degree of safety for the data. Not only that, you can take one system (called a node) in the cluster offline to perform maintenance or apply service packs. When that is complete, you can bring the node back online and repeat the process with the other node. Depending on how robust the code is, the users never have to leave the application while this happens.

These implementations are not without cost, however. At the least, you need another system, and that means more licenses for operating systems, SQL Server 2005, and backup and other utility software. In the case of clustering, you also have to adhere to strict hardware requirements.

Let's examine how you can install and manage the various levels of high availability.