11.1 Considerations for Fault Tolerance

A computational biologist at Oak Ridge National Laboratory wants to write an parallel application that runs 24/7 on his Beowulf cluster. The application involves analysis of the human genome and is driven by a constant stream of new data arriving from researchers all around the world. The data is not independent because new data helps refine and extend previously calculated sequences. How can he write such a program?

A company wants to write an application to process a constant stream of sales orders coming in from the Web. The program needs to be robust because down time costs not only the lost revenue stream but also wages of workers who are idle. The company has recently purchased a Beowulf cluster to provide a reliable, cost-effective solution. But how does a company write the fault-tolerant parallel program to run on the cluster?

When you are developing algorithms that must be reliable the first consideration is the hardware. The bad news is that your Beowulf cluster will have failures; it will need maintenance. It is not a matter of whether some node in the cluster will fail but when. Experience has shown that the more nodes the cluster has, the more likely one will fail within a given time.

How often a hardware failure occurs varies widely between clusters. It depends on the quality of the components used by the manufacturer. It depends on the room the cluster is set up in. Is it adequately cooled? Is ventilation good? It is possible to have a cool room but have the cluster nodes stacked so close together that the inner nodes get hot and begin to have component failures. It is possible to have the hot air from one node blow into the cool air intake of another node. Hardware failure also depends on the applications being run on the nodes. Some parallel applications do intense sustained calculations that cause the floating point chips to generate much more heat. Other applications read and write intensely to memory, thereby increasing the probability of having a memory fault.

Some clusters have failures every week; others run for months. It is not uncommon for several nodes to fail at about the same time with similar hardware problems. Evaluate your particular cluster under a simulated load for a couple of weeks to get data on expected mean time between failures (MTBF). If the MTBF is many times longer than your average application run time, then it may not make sense to restructure the application to be fault tolerant. In most cases it is more efficient simply to rerun a failed application if it has a short run time.

The second consideration is the fault tolerance of the underlying software environment. If the runtime system is not robust, then the hardware is the least of your problems. The PVM system sits between the operating system and the application and, among other things, monitors the state of the virtual machine. The PVM runtime system is designed to be fault tolerant and to reconfigure itself automatically when a failure is detected. (It doesn't help your fault-tolerant application if the underlying failure detection system crashes during a failure!) The PVM failure detection system is responsible for detecting problems and notifying running applications about the problem. The PVM runtime system keeps track of and automatically reconfigures itself around failed nodes and tasks. It makes no attempt to recover a parallel application automatically.

The third consideration is the application. Not every parallel application can recover from a failure; recovery depends on the design of the application and the nature of the failure. For example, in the manager/worker programs of the preceding chapters, if the node that fails was running a worker, then recovery is possible; but if the node was running the manager, then key data may be lost that can't be recovered.

At the least, any parallel program can be made fault tolerant by restarting it automatically from the beginning if a failure in detected. The most common form of fault tolerance in use today is a variation of this approach, called checkpoint/restart. Instead of starting from the beginning, an application periodically stops calculating and sending messages and writes out its partial results to disk as a checkpoint. When a failure occurs, the runtime system kills the parallel application and automatically restarts it from the last checkpoint. The time lost in this technique is the time from the last checkpoint and the time it takes to write out all the checkpoints during the entire run.

This technique works for MPI, PVM, shared-memory paradigms, and most other programming paradigms. The application developer has to write two routines. One collects and writes out the checkpoint information from all the parallel tasks. The other checks whether the application is restarting, reads in the checkpoint data, and distributes the data to the parallel tasks. While writing these two routines is not trivial, failure recovery without stopping the application can get much more complicated.

On-the-fly recovery of parallel programs is complicated because data in messages may be in flight when the recovery begins. Hence, a race condition arises. If the data does not arrive, then it will need to be resent as part of the recovery. But if the data manages to be received just before the recovery, then there isn't an outstanding receive call, and the data shouldn't be resent.

File I/O is another problem that complicates recovery. File pointers may need to be reset to the last checkpoint to avoid getting a repeated set of output data in the file.

Despite all these issues, a few common methods can be used to improve the fault tolerance of many parallel applications.