9.2 Fault Tolerance

Communicators are a fundamental concept in MPI. Their sizes are fixed at the time they are created, and the efficiency and correctness of collective operations rely on this fact. Users sometimes conclude from the fixed size of communicators that MPI provides no mechanism for writing fault-tolerant programs. Now that we have introduced intercommunicators, however, we are in a position to discuss how this topic might be addressed and how you might write a manager-worker program with MPI in such a way that it would be fault tolerant. By fault tolerant we mean here that if one of the worker processes terminates abnormally, the job is not terminated; instead, you can carry on the computation with fewer workers or perhaps dynamically replace the lost worker.

The key idea is to create a separate (inter)communicator for each worker and use it for communication with that worker, rather than using a single communicator that contains all of the workers. If an implementation returns "invalid communicator" from an MPI_Send or MPI_Recv call, then the manager has lost contact with only that one worker and can still communicate with the others through their still-intact communicators. (For the manager to see such a return code at all, the error handler on each worker communicator must be one that returns, such as MPI_ERRORS_RETURN, rather than the default MPI_ERRORS_ARE_FATAL, which aborts the job on any error.) Since the manager uses separate communicators, rather than separate ranks in a larger communicator, to send messages to and receive messages from the workers, it is convenient to maintain an array of communicators and a parallel array that records which row was last sent to each worker, so that if a worker disappears, the same row can be assigned to a different worker. Figure 9.3 shows these arrays and how they might be used.

This approach recognizes that two-party communication can be made fault tolerant, since one party can detect the failure of the other and take appropriate action. A normal MPI communicator is not a two-party system and cannot be made fault tolerant without changing the semantics of MPI communication. If, however, the communication in an MPI program can be expressed in terms of intercommunicators, which are inherently two-party (the local group and the remote group), then fault tolerance can be achieved.

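     /* highly incomplete */

     MPI_Comm worker_comms[MAX_WORKERS];  /* one intercommunicator per worker */
     int last_row_sent[MAX_WORKERS];      /* row most recently sent to each worker */
     int rc;

     /* errors must be returned, not fatal, for rc to be tested below */
     MPI_Comm_set_errhandler( worker_comms[sender], MPI_ERRORS_RETURN );

     last_row_sent[sender] = numsent;     /* remember the row in case it must be resent */
     rc = MPI_Send( a[numsent], SIZE, MPI_DOUBLE, 0, numsent+1,
                    worker_comms[sender] );
     if ( rc != MPI_SUCCESS ) {
         /* Check that error class is one we can recover from */
         ...
         MPI_Comm_spawn( "worker" , ... );
     }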

Figure 9.3: Fault-tolerant manager.
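
The bookkeeping behind Figure 9.3 can be tried out without an MPI installation. The following is a minimal sketch under stated assumptions, not the code of an actual manager: the array comm_valid stands in for worker_comms, the array crashed exists only to simulate failures, and the names manager_t, try_send, retire_worker, and dispatch_row are hypothetical, introduced here for illustration. The function try_send plays the role of an MPI_Send on a per-worker communicator that returns an error code instead of aborting.

```c
#include <assert.h>

#define MAX_WORKERS 4
#define NO_ROW (-1)

/* Hypothetical manager state (not part of MPI). */
typedef struct {
    int comm_valid[MAX_WORKERS];    /* manager's view; stands in for worker_comms[] */
    int last_row_sent[MAX_WORKERS]; /* row in flight on each worker, or NO_ROW */
    int crashed[MAX_WORKERS];       /* simulation only: hidden from the manager logic */
} manager_t;

/* Start with every worker reachable and idle. */
void manager_init(manager_t *m)
{
    for (int w = 0; w < MAX_WORKERS; w++) {
        m->comm_valid[w] = 1;
        m->last_row_sent[w] = NO_ROW;
        m->crashed[w] = 0;
    }
}

/* Stand-in for MPI_Send on worker w's communicator:
   returns 0 on success, nonzero if the worker is gone. */
int try_send(const manager_t *m, int w)
{
    return m->crashed[w] ? 1 : 0;
}

/* Stop using worker w and return the row it was holding (or NO_ROW),
   so that the caller can assign that row to another worker. */
int retire_worker(manager_t *m, int w)
{
    assert(w >= 0 && w < MAX_WORKERS);
    int row = m->last_row_sent[w];
    m->comm_valid[w] = 0;
    m->last_row_sent[w] = NO_ROW;
    return row;
}

/* Send row to the first reachable, idle worker. A worker whose send
   fails is retired and the next one is tried. Returns the index of
   the worker that accepted the row, or -1 if none could. */
int dispatch_row(manager_t *m, int row)
{
    for (int w = 0; w < MAX_WORKERS; w++) {
        if (!m->comm_valid[w] || m->last_row_sent[w] != NO_ROW)
            continue;                 /* lost or busy worker */
        if (try_send(m, w) != 0) {
            retire_worker(m, w);      /* failure affects only this worker */
            continue;
        }
        m->last_row_sent[w] = row;    /* remember the row in case w fails later */
        return w;
    }
    return -1;
}
```

In a real manager, the failure would be noticed when an MPI_Send or MPI_Recv on that worker's communicator returns an error; the manager would then call something like retire_worker, pass the returned row to dispatch_row, and possibly call MPI_Comm_spawn to replace the lost worker. The other workers' entries are never touched, which is the point of keeping one communicator per worker.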

Note that while the MPI standard, through the use of intercommunicators, makes it possible to write an implementation of MPI that encourages fault-tolerant programming, the MPI standard itself does not require MPI implementations to continue past an error. This is a "quality of implementation" issue and allows the MPI implementor to trade performance for the ability to continue after a fault. As this section makes clear, however, nothing in the MPI standard stands in the way of fault tolerance, and the two primary MPI implementations for Beowulf clusters, MPICH2 and LAM/MPI, both endeavor to support some style of fault tolerance for applications.