9.4 Motivation for Communicators

Communicators in MPI serve two purposes. The most obvious purpose is to describe a collection of processes. This feature allows collective routines, such as MPI_Bcast or MPI_Allreduce, to be used with any collection of processes. This capability is particularly important for hierarchical algorithms. It also facilitates dividing a computation into subtasks, each of which has its own collection of processes. For example, in the manager-worker example in Section 8.2, it may be appropriate to divide each task among a small collection of processes, particularly if doing so allows each subtask's problem description to fit in fast cache memory. MPI communicators are perfect for this; a single routine, MPI_Comm_split, is all that is needed to create the new communicators. Using ranks relative to a communicator for specifying the source and destination of messages also facilitates dividing parallel tasks among smaller but still parallel subtasks, each with its own communicator.
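As a minimal sketch of this use of MPI_Comm_split (the group size of four and the choice of color are illustrative assumptions, not taken from the example above), the following C program splits MPI_COMM_WORLD into subcommunicators of four consecutive ranks each; collective operations on the resulting communicator then involve only that subgroup.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, subrank;
        MPI_Comm subcomm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Processes that pass the same color end up in the same new
           communicator; the key (here, the world rank) orders them. */
        MPI_Comm_split(MPI_COMM_WORLD, rank / 4, rank, &subcomm);

        MPI_Comm_rank(subcomm, &subrank);
        printf("world rank %d has rank %d in its subcommunicator\n",
               rank, subrank);

        /* Collective operations now involve only the subgroup. */
        MPI_Barrier(subcomm);

        MPI_Comm_free(&subcomm);
        MPI_Finalize();
        return 0;
    }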

A more subtle but equally important purpose of the MPI communicator involves the communication context that each communicator contains. This context is essential for writing software libraries that can be safely and robustly combined with other code, both other libraries and user-specific application code, to build complete applications. Used properly, the communication context guarantees that messages are received by the appropriate routines even if other routines are not so careful. Consider the example in Figure 9.8 (taken from [48, Section 6.1.2]). In this example, two routines are provided by separate libraries or software modules. One, SendRight, sends a message to the right neighbor and receives from the left. The other, SendEnd, sends a message from process 0 (the leftmost) to the last process (the rightmost). Both of these routines use MPI_ANY_SOURCE instead of a particular source in the MPI_Recv call. As Figure 9.8 shows, the messages can be confused, causing the program to receive the wrong data.

How can we prevent this situation? Several approaches will not work. One is to avoid the use of MPI_ANY_SOURCE. This fixes the example, but only if both SendRight and SendEnd follow this rule. The approach may be adequate (though fragile) for code written by a single person or team, but it is not adequate for libraries. For example, if SendEnd were written by a commercial vendor and did not use MPI_ANY_SOURCE, but SendRight, written by a different vendor or an inexperienced programmer, did use MPI_ANY_SOURCE, the program would still fail, and it would look as if SendEnd were at fault (because the message from SendEnd was received first).

Figure 9.8: Two possible message-matching patterns when MPI_ANY_SOURCE is used in the MPI_Recv calls (from [48]).
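To make the scenario concrete, here is a hedged C sketch of how the two routines might look; the argument lists, the tag value 0, and the nonperiodic shift are assumptions for illustration only. Because both receives use MPI_ANY_SOURCE with the same tag on the same communicator, the receive posted on the last process inside SendRight can match the message sent by SendEnd, or vice versa, depending on arrival order.

    #include <mpi.h>

    /* Each process sends to its right neighbor and receives from its
       left neighbor (no wraparound). */
    void SendRight(int *buf, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        if (rank + 1 < size)
            MPI_Send(buf, 1, MPI_INT, rank + 1, 0, comm);
        if (rank > 0)
            MPI_Recv(buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, comm,
                     MPI_STATUS_IGNORE);
    }

    /* Process 0 sends to the last process. */
    void SendEnd(int *buf, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        if (rank == 0)
            MPI_Send(buf, 1, MPI_INT, size - 1, 0, comm);
        if (rank == size - 1)
            MPI_Recv(buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, comm,
                     MPI_STATUS_IGNORE);
    }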

Another approach that does not work is to use message tags to separate messages. Again, this can work if one group writes all of the code and is very careful about allocating message tags to the different software modules. However, the use of MPI_ANY_TAG in an MPI receive call can still bypass this approach. Further, as shown in Figure 6.5 in [48], even if MPI_ANY_SOURCE and MPI_ANY_TAG are not used, separate code modules can still receive the wrong message.

The communication context in an MPI communicator provides a solution to these problems. The routine MPI_Comm_dup creates a new communicator from an input communicator that contains the same processes (in the same rank order) but with a new communication context. MPI messages sent in one communication context can be received only in that context. Thus, any software module or library that wants to ensure that all of its messages will be seen only within that library needs only to call MPI_Comm_dup at the beginning to get a new communicator. All well-written libraries that use MPI create a private communicator used only within that library.
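A minimal sketch of this pattern follows; the library name and function names (mylib_*) are hypothetical, but the MPI calls are standard. The library duplicates the communicator it is given once, at initialization, and then performs all of its internal communication on the duplicate, so its messages can never be matched by receives posted elsewhere in the application.

    #include <mpi.h>

    static MPI_Comm mylib_comm = MPI_COMM_NULL;

    /* Called once by the user, for example right after MPI_Init. */
    void mylib_init(MPI_Comm user_comm)
    {
        /* Same processes and rank order as user_comm, but a new
           communication context: messages sent on mylib_comm can be
           received only on mylib_comm. */
        MPI_Comm_dup(user_comm, &mylib_comm);
    }

    /* All internal communication uses the private communicator. */
    void mylib_shift(int *buf)
    {
        int rank, size;
        MPI_Comm_rank(mylib_comm, &rank);
        MPI_Comm_size(mylib_comm, &size);
        MPI_Sendrecv_replace(buf, 1, MPI_INT,
                             (rank + 1) % size, 0,        /* send right */
                             (rank + size - 1) % size, 0, /* recv left  */
                             mylib_comm, MPI_STATUS_IGNORE);
    }

    void mylib_finalize(void)
    {
        MPI_Comm_free(&mylib_comm);
    }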

Enabling the development of libraries was one of the design goals of MPI. In that respect MPI has been very successful. Many libraries and applications now use MPI, and, because of MPI's portability, most of these run on Beowulf clusters. Table 9.1 provides a partial list of libraries that use MPI to provide parallelism. More complete descriptions and lists are available at www.mcs.anl.gov/mpi/libraries and at sal.kachinatech.com/C/3. Chapter 12 discusses software, including MPI libraries and programs, in more detail.

Table 9.1: A sampling of libraries that use MPI. See Chapter 12 for a more thorough list.

Library     Description                                             URL
PETSc       Linear and nonlinear solvers for PDEs                   www.mcs.anl.gov/petsc
Aztec       Parallel iterative solution of sparse linear systems    www.cs.sandia.gov/CRF/aztec1.html
Cactus      Framework for PDE solutions                             www.cactuscode.org
FFTW        Parallel FFT                                            www.fftw.org
PPFPrint    Parallel print                                          www.llnl.gov/CASC/ppf/
HDF         Parallel I/O for Hierarchical Data Format (HDF) files   hdf.ncsa.uiuc.edu/Parallel_HDF
NAG         Numerical library                                       www.nag.co.uk/numeric/fd/FDdescription.asp
ScaLAPACK   Parallel linear algebra                                 www.netlib.org/scalapack
SPRNG       Scalable pseudorandom number generator                  sprng.cs.fsu.edu



