Chapter 9: Advanced Topics in MPI Programming

William Gropp and Ewing Lusk

In this chapter we continue our exploration of parallel programming with MPI. We describe capabilities that are more specific to MPI than to part of the message-passing programming model in general. We cover the more advanced features of MPI, such as dynamic process management, parallel I/O, and remote memory access. These features are often described as MPI-2 because they were added to the MPI standard in a second round of specification; however, MPI means both the original (MPI-1) and new (MPI-2) features. We will use the term "MPI-2" to emphasize that a feature was added to MPI in the second round.

9.1 Dynamic Process Management in MPI

A new feature of MPI is the ability of an MPI program to create new MPI processes and communicate with them. (In the original MPI specification, the number of processes was fixed at startup.) MPI calls this capability (together with related capabilities such as connecting two independently started MPI jobs) dynamic process management. Three main issues are introduced by this collection of features:

maintaining simplicity and flexibility;
interacting with the operating system, a parallel process manager, and perhaps a job scheduler; and
avoiding race conditions that could compromise correctness.

The key to avoiding race conditions is to make creation of new processes a collective operation, over both the processes creating the new processes and the new processes being created. Using a collective operation in creating new processes also provides scalability and addresses these other issues.

9.1.1 Intercommunicators

Recall that an MPI communicator consists of a group of processes together with a communication context. Strictly speaking, the communicators we have dealt with so far are intracommunicators. There is another kind of communicator, called an intercommunicator. An intercommunicator binds together a communication context and two groups of processes, called (from the point of view of a particular process) the local group and the remote group. Processes are identified by rank in group, but ranks in an intercommunicator always refer to the processes in the remote group. That is, an MPI_Send using an intercommunicator sends a message to the process with the destination rank in the remote group of the intercommunicator. Collective operations are also defined for intercommunicators; see [50, Chapter 7] for details.

9.1.2 Spawning New MPI Processes

We are now in a position to explain exactly how new MPI processes are created by an already running MPI program. The MPI function that creates these processes is MPI_Comm_spawn. Its key features are the following.

It is collective over the communicator of processes initiating the operation (called the parents) and also collective with the calls to MPI_Init in the processes being created (called the children). That is, MPI_Comm_spawn does not return in the parents until it has been called in all the parents and MPI_Init has been called in all the children.
It returns an intercommunicator in which the local group contains the parents and the remote group contains the children.
The new processes, which must call MPI_Init, have their own MPI_COMM_WORLD, consisting of all the processes created by this one collective call to MPI_Comm_spawn.
The function MPI_Comm_get_parent, called by the children, returns an intercommunicator with the children in the local group and the parents in the remote group.
The collective function MPI_Intercomm_merge may be called by parents and children to create a normal (intra)communicator containing all the processes, both old and new, but for many communication patterns this is not necessary.

9.1.3 Revisiting Matrix-Vector Multiplication

Here we illustrate the use of MPI_Comm_spawn by revisiting the matrix-vector multiply program of Section 8.2. Instead of starting with a fixed number of processes, we compile separate executables for the manager and worker programs, start the manager with

   mpiexec -n 1 manager <number-of-workers>

and then let the manager create the worker processes dynamically. We assume that only the manager has the matrix a and the vector b and broadcasts them to the workers after the workers have been created. The program for the manager is shown in Figure 9.1 and the code for the workers is shown in Figure 9.2.

#include "mpi.h"
#include <stdio.h>
#define SIZE 10000

int main( int argc, char *argv[] )
{
    double a[SIZE][SIZE], b[SIZE], c[SIZE];
    int i, j, row, numworkers;
    MPI_Status status;
    MPI_Comm workercomm;

    MPI_Init( &argc, &argv );
    if ( argc != 2 || !isnumeric( argv[1] ))
        printf( "usage: %s <number of workers>\n", argv[0] );
    else
        numworkers = atoi( argv[1] );

    MPI_Comm_spawn( "worker", MPI_ARGV_NULL, numworkers,
               MPI_INFO_NULL,
               0, MPI_COMM_SELF, &workercomm, MPI_ERRCODES_IGNORE );
    ...
    /* initialize a and b */
    ...
    /* send b to each worker */
    MPI_Bcast( b, SIZE, MPI_DOUBLE, MPI_ROOT, workercomm );
    ...
    /* then normal manager code as before*/
    ...
    MPI_Finalize();
    return 0;
}

Figure 9.1: Dynamic process matrix-vector multiply program, manager part.

#include "mpi.h"

int main( int argc, char *argv[] )
{
    int numprocs, myrank;
    double b[SIZE], c[SIZE];
    int i, row, myrank;
    double dotp;
    MPI_Status status;
    MPI_Comm parentcomm;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &numprocs );
    MPI_Comm_rank( MPI_COMM_WORLD, &myrank );

    MPI_Comm_get_parent( &parentcomm );

    MPI_Bcast( b, SIZE, MPI_DOUBLE, 0, parentcomm );

    ...
    /* same as worker code from original matrix-vector multiply */
    ...

    MPI_Comm_free( &parentcomm );
    MPI_Finalize( );
    return 0;
}

Figure 9.2: Dynamic process matrix-vector multiply program, worker part.

Let us consider in detail the call in the manager that creates the worker processes.

    MPI_Comm_spawn( "worker", MPI_ARGV_NULL, numworkers,
               MPI_INFO_NULL,
               0, MPI_COMM_SELF, &workercomm, MPI_ERRCODES_IGNORE );

It has eight arguments. The first is the name of the executable to be run by the new processes. The second is the null-terminated argument vector to be passed to all of the new processes; here we are passing no arguments at all, so we specify the special value MPI_ARGV_NULL. Next is the number of new processes to create. The fourth argument is an MPI "Info" object, which can be used to specify special environment- and/or implementation-dependent parameters, such as the names of the nodes to start the new processes on. In our case we leave this decision to the MPI implementation or local process manager, and we pass the special value MPI_INFO_NULL. The next argument is the "root" process for this call to MPI_Comm_spawn; it specifies which process in the communicator given in the following argument is supplying the valid arguments for this call. The communicator we are using consists here of just the one manager process, so we pass MPI_COMM_SELF. Next is the address of the new intercommunicator to be filled in, and finally an array of error codes for examining possible problems in starting the new processes. Here we use MPI_ERRCODES_IGNORE to indicate that we will not be looking at these error codes.

Code for the worker processes that are spawned is shown in Figure 9.2. It is essentially the same as the worker subroutine in the preceding chapter but is an MPI program in itself. Note the use of intercommunicator broadcast in order to receive the vector b from the parents. We free the parent intercommunicator with MPI_Comm_free before exiting.

9.1.4 More on Dynamic Process Management

For more complex examples of the use of MPI_Comm_spawn, including how to start processes with different executables or different argument lists, see [50, Chapter 7]. MPI_Comm_spawn is the most basic of the functions provided in MPI for dealing with a dynamic MPI environment. By querying the attribute MPI_UNIVERSE_SIZE, you can find out how many processes can be usefully created. Separately started MPI computations can find each other and connect with MPI_Comm_connect and MPI_Comm_accept. Processes can exploit non-MPI connections to "bootstrap" MPI communication. These features are explained in detail in [50].