Remarks on Stability

Terminal server administrators rely on their central components being as stable as possible. Here, this applies mainly to configuration and setting up hardware components. Creating backups in line with standard corporate strategy is an additional measure, but although this is touched on in Chapter 6, it is not discussed in detail in this book.

Important

Without adhering to clear-cut installation and backup strategies in combination with recovery processes, the stability of terminal server operation cannot be guaranteed in the long term. Even if hardware and software function properly and are stable, negligence or intentional sabotage might lead to the loss of important system or user files. Furthermore, even error-tolerant hard drive systems can be completely destroyed if certain events occur. For this reason, you must have a backup and a tried-and-tested disaster recovery procedure.

Avoiding System Failures

Tests have shown that an average company loses two to three percent of its sales within the first ten days after a failure in the IT system. If there is no disaster recovery plan and the network and computer infrastructure is not put back into operation within these first ten days, key corporate functions might be unavailable for more than five days. Half of all companies suffer permanent loss if they fail to get their IT up and running within these ten days. Without a disaster recovery plan, many companies end up bankrupt after a system breakdown. This is, naturally, all the more true for companies with businesses based exclusively on a properly functioning IT system.

In addition to the network, the individual servers play an important role for a company’s IT infrastructure. If terminal servers are part of the strategic corporate platform, the administrator’s topmost goal should be to increase their availability. However, the corresponding measures should always be in proportion to the potential damage a server failure could cause. That is why prioritizing individual servers and their services should always come before investing in availability-enhancing technology.

Rough statistics and practical experience show how often individual computer components fail. Securing the components that cause the most problems already significantly reduces the risk of system failure:

  • Hard drive: Responsible for approx. 50 percent of all failures

  • Power supply: Responsible for approx. 25 percent of all failures

  • Fan, ventilation: Responsible for approx. 10 percent of all failures

  • Memory: Responsible for less than 5 percent of all failures

  • Controller: Responsible for less than 5 percent of all failures

  • Other: Responsible for approx. 5 percent of all failures

Based on these statistics, computer availability can be improved by focusing on two weak points (for example, hard drive and power supply) and by placing a secondary focus on one additional weakness (for example, ventilation that affects CPU temperature and other important system components). Hard drives in particular require constant monitoring.
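The arithmetic behind this prioritization can be sketched in a few lines of Python. The percentages are the rough figures from the list above. As an assumption for illustration, "less than 5 percent" is entered as 5, and the function name covered_share is hypothetical. The result is the share of failure causes that the secured components account for, not a guarantee that those failures no longer occur.

```python
# Approximate share of failures per component, taken from the list above.
# Assumption: "less than 5 percent" is entered as its upper bound of 5.
FAILURE_SHARE = {
    "hard drive": 50,
    "power supply": 25,
    "fan, ventilation": 10,
    "memory": 5,
    "controller": 5,
    "other": 5,
}


def covered_share(secured):
    """Return the percentage of failures caused by the secured components."""
    return sum(FAILURE_SHARE[name] for name in secured)


# Two primary weak points, then the secondary focus on ventilation.
primary = covered_share(["hard drive", "power supply"])
with_fans = covered_share(["hard drive", "power supply", "fan, ventilation"])
print(primary, with_fans)
```

With these figures, the two primary weak points account for roughly three quarters of all failures, which is why they come first.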

System Availability

The term high availability usually means using redundant and secured hardware combined with special software. Over the past few years, certain standard terms have become established to describe failure safety.

Table 8.2: Failure Safety Terminology

Availability Factor    Failure Times                  Availability      Realization
99%                    Approx. 3 days per year        Normal            Server system
99.9%                  Approx. 8 hours per year       High              Cluster system
99.999%                Approx. 5 minutes per year     Error Tolerant    Special system
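The failure times in Table 8.2 follow directly from the availability factor. The sketch below shows the conversion, assuming a year of 365 days with round-the-clock operation. The function name is hypothetical. The exact values come out slightly higher than the rounded figures in the table.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, leap years ignored (assumption)


def downtime_hours_per_year(availability_percent):
    """Convert an availability factor in percent into hours of downtime per year."""
    return (100 - availability_percent) / 100 * HOURS_PER_YEAR


for factor in (99, 99.9, 99.999):
    hours = downtime_hours_per_year(factor)
    print(f"{factor}%: {hours / 24:.2f} days = {hours:.2f} hours = {hours * 60:.1f} minutes")
```

An availability factor of 99 percent corresponds to about 87.6 hours (roughly 3.65 days) per year, 99.9 percent to about 8.76 hours, and 99.999 percent to a little over 5 minutes.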

To make terminal server computer hardware as available as possible, the potential for failure of processors, memory, power supply, hard drive system, and network access must be minimized. This is an issue that should not be neglected, especially when it comes to processors and memory in a single-computer solution. However, double or triple power supplies, fans, hard drive controllers, and network cards in combination with hard drive RAID systems already provide relatively high redundancy. Passive backplanes for individual processor cards and monitoring systems extend failure avoidance even more. Blade systems take the concept one step further by providing power supplies and other core system components in a centralized and redundant manner.

Redundant network lines in both the local and the wide area networks improve availability still more. Central, uninterruptible power supply should also be a standard solution for an environment with increased stability and security requirements.
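Why doubling or tripling a component helps can be illustrated with a standard reliability formula. The sketch below rests on two simplifying assumptions that are not taken from this chapter: the redundant units fail independently of each other, and a single working unit is enough to keep the system running. The 99 percent figure for a single power supply is a made-up example value, and the function name is hypothetical.

```python
def combined_availability(unit_availability, units):
    """Availability of a group of redundant units.

    Assumes independent failures and that one working unit suffices,
    so the group is down only when every unit is down at the same time.
    """
    if units < 1:
        raise ValueError("at least one unit is required")
    return 1 - (1 - unit_availability) ** units


# Hypothetical example: power supplies that are each available 99% of the time.
for count in (1, 2, 3):
    print(count, f"{combined_availability(0.99, count):.6f}")
```

In practice, failures are rarely fully independent (a shared power feed or a common cooling problem affects all units at once), so the calculated values should be read as an upper bound.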

Clusters and Load Balancing

Clusters dramatically increase failure safety. When Microsoft Cluster Service is used, a cluster represents a group of servers (nodes) that are linked with each other to work like a single system. An independent operating system instance runs on each node with additional communication functions for system synchronization. If one cluster node fails or is removed from the cluster for maintenance, the resources running on that node are transferred to the remaining nodes so they may assume that node’s workload. Users are not aware of any changes if their server fails within one cluster, except perhaps a slight decrease in computing speed. Regrettably, Microsoft Cluster Service solutions are not readily available for terminal servers. For this reason, alternatives for providing high availability must be developed. Microsoft Cluster Service is, however, highly suitable for file servers that save terminal server user and profile data centrally. This solution surpasses the possibilities that RAID load balancing offers for single servers, but it is, of course, much more expensive, as well.

Another option is load balancing, that is, combining two or more identically configured terminal servers. In contrast to a cluster, load balancing ensures that a user is assigned to the server with the smallest load at the time of logon. If a server that is part of a load-balanced system fails, the sessions of the users logged on to this server are lost. However, the users can immediately log on again to the load-balanced system because the remaining servers are still available. Although downtime and some data loss are not completely avoided this way, they are at least minimized.
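The behavior described above can be modeled in a short Python sketch. It is a conceptual illustration only, not the algorithm of any particular load-balancing product. Load is measured simply as the number of active sessions, and all server names, user names, and function names are hypothetical.

```python
def pick_server(loads):
    """Return the name of the server with the fewest active sessions."""
    if not loads:
        raise RuntimeError("no terminal server available")
    return min(loads, key=loads.get)


def logon(loads, sessions, user):
    """Assign the user to the least-loaded server at the time of logon."""
    server = pick_server(loads)
    loads[server] += 1
    sessions[user] = server
    return server


def fail_server(loads, sessions, failed):
    """Remove a failed server and return the users whose sessions were lost."""
    loads.pop(failed)
    lost = [user for user, server in sessions.items() if server == failed]
    for user in lost:
        del sessions[user]
    return lost


loads = {"ts1": 0, "ts2": 0}   # active sessions per server
sessions = {}                  # user -> server
for user in ("alice", "bob", "carol"):
    logon(loads, sessions, user)

lost = fail_server(loads, sessions, "ts1")   # sessions on ts1 are lost
relogon = logon(loads, sessions, lost[0])    # but the user can log on again
print(lost, relogon)
```

The lost sessions are not recovered. The affected users simply start new sessions on one of the remaining servers, which matches the limitation noted above.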

Terminal servers running Windows Server 2003 can use the integrated network load balancing service. This service allows the creation of a group of servers with a single virtual IP address and contains mechanisms for the dynamic distribution of user logons. Providing that user and profile data are saved on a dedicated file server, users can log on to and continue working on a load-balanced system even if their terminal server fails. Only data that was modified after it was last saved is lost. In Chapter 11, you will read more about configuring terminal servers in a load-balanced environment.

Note

Constructing a fail-safe computer and network system for terminal servers is by no means a trivial task. You need comprehensive knowledge of configuring clusters, SANs, SCSI devices, hard drive RAID systems, and load-balanced systems. Nevertheless, there is a growing trend on the market to make this type of environment more and more powerful and easier to handle.

Stable Server Configuration

A terminal server should always be set up according to its purpose. Under no circumstances should such different roles as Microsoft BackOffice servers and Terminal Services for providing desktops be combined on one physical server.

Terminal server operation is especially stable if the following points are observed as far as possible:

  • Support by well-trained administrators who have sufficient practical experience and, ideally, have a test or reference environment at their disposal. Furthermore, a clear definition of an escalation path is recommended, all the way up to the decision-making bodies of a company, to address any serious problems that might arise.

  • Use of sufficiently powerful and established standard hardware following at least one week of permanent test operation.

  • Use of modern graphics cards and hard drive controllers with up-to-date and mature drivers.

  • Connection of necessary peripheral devices only. Special peripherals that have nothing to do with the terminal server’s function should be strictly avoided. It is recommended that keyboard, mouse, and monitor be set up to look unattractive to casual users to prevent them from being tempted to work interactively with the computer console.

  • Avoiding backgrounds, screen savers, and animations because they use up a huge amount of a terminal server’s resources when many users access these functions simultaneously. The result is slow responses and increased system instability.

  • If at all possible, no additional installation of other server applications on a terminal server. This minimizes the risk of unfavorable mutual interference between individual applications. The application installation sequence might influence system stability and should therefore always be documented.

  • Setting up servers in cool and secure rooms to avoid temperature problems and unauthorized hardware access.

  • Use of an uninterruptible power supply (UPS) to protect the system from power failures and voltage fluctuations.

If it is absolutely necessary to install critical application programs, the server should be rebooted automatically on a regular basis (for example, weekly or monthly). This is especially true for terminal servers with several applications that have known problems relating to releasing main memory when the application is closed. These memory leaks can take away main memory from a terminal server until it becomes unstable.