9.7 Remote Memory Access

The message-passing programming model requires that both the sender and the receiver (or all members of a communicator in a collective operation) participate in moving data between two processes. An alternative model where one process controls the communication, called one-sided communication, can offer better performance and in some cases a simpler programming model. MPI-2 provides support for this one-sided approach. The MPI-2 model was inspired by the work on the bulk synchronous programming (BSP) model [54] and the Cray SHMEM library used on the massively parallel Cray T3D and T3E computers [30].

In one-sided communication, one process may put data directly into the memory of another process, without that process using an explicit receive call. For this reason, this is also called remote memory access (RMA).

Using RMA involves four steps:

  1. Describe the memory into which data may be put.

  2. Allow access to the memory.

  3. Begin put operations (e.g., with MPI_Put).

  4. Complete all pending RMA operations.

The first step is to describe the region of memory into which data may be placed by an MPI_Put operation (also accessed by MPI_Get or updated by MPI_Accumulate). This is done with the routine MPI_Win_create:

   MPI_Win win;
   double  ulocal[MAX_NX][NY+2];

   /* Expose the part of ulocal used by this process, including the
      ghost cells, for RMA by the other processes in MPI_COMM_WORLD */
   MPI_Win_create( ulocal, (NY+2)*(i_end-i_start+3)*sizeof(double),
                   sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win );

The input arguments are, in order, the array ulocal; the size in bytes of the memory being exposed; the size of a basic unit of the array (sizeof(double) in this case), called the displacement unit; a "hint" object; and the communicator that specifies which processes may use RMA to access the array. MPI_Win_create is a collective call over the communicator. The output is an MPI window object, win. When a window object is no longer needed, it should be freed with MPI_Win_free.
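
Once the exchanges are complete (for example, at the end of the computation), the window created above can be released; like MPI_Win_create, this is a collective call:

   MPI_Win_free( &win );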

RMA operations take place between two sentinels. One begins a period where access is allowed to a window object, and one ends that period. These periods are called epochs.[2] The easiest routine to use to begin and end epochs is MPI_Win_fence. This routine is collective over the processes that created the window object and both ends the previous epoch and starts a new one. The routine is called a "fence" because all RMA operations before the fence complete before the fence returns, and any RMA operation initiated by another process (in the epoch begun by the matching fence on that process) does not start until the fence returns. This may seem complex, but it is easy to use. In practice, MPI_Win_fence is needed only to separate RMA operations into groups. This model closely follows the BSP and Cray SHMEM models, though with the added ability to work with any subset of processes.
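
A minimal sketch of this structure, using the window win created above, looks like the following; the RMA calls themselves are the subject of the next few paragraphs.

   MPI_Win_fence( 0, win );  /* ends the preceding epoch and starts a new
                                one; local setup must be finished first  */
   /* ... MPI_Put, MPI_Get, or MPI_Accumulate calls go here ... */
   MPI_Win_fence( 0, win );  /* all RMA operations issued in this epoch
                                complete before this fence returns       */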

Three routines are available for initiating the transfer of data in RMA: MPI_Put, MPI_Get, and MPI_Accumulate. All are nonblocking in the same sense that MPI point-to-point communication is nonblocking (Section 9.3.1); they complete at the end of the epoch in which they were started, for example, at the closing MPI_Win_fence. Because these routines specify both the source and the destination of the data, they have more arguments than the point-to-point communication routines do. The arguments are easily understood a few at a time.

  1. The first three arguments describe the origin data, that is, the data on the calling process. These are the usual "buffer, count, datatype" arguments.

  2. The next argument is the rank of the target process. This serves the same function as the destination of an MPI_Send. The rank is relative to the communicator used when creating the MPI window object.

  3. The next three arguments describe the destination buffer. The count and datatype arguments have the same meaning as for an MPI_Recv, but the buffer location is specified as an offset from the beginning of the memory that the target process specified to MPI_Win_create. This offset is in units of the displacement unit given to MPI_Win_create, which is usually the size of the basic datatype.

  4. The last argument is the MPI window object.

Note that these routines return no MPI requests; the closing MPI_Win_fence completes all preceding RMA operations. MPI_Win_fence provides a collective synchronization model for RMA operations in which all processes participate. This is called active target synchronization.
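
As an illustration of these argument groups, a put of n doubles (the names here are placeholders rather than part of the mesh example) has the form

   MPI_Put( origin_buf, n, MPI_DOUBLE,   /* origin buffer, count, datatype   */
            target_rank,                 /* rank in the window communicator  */
            target_disp, n, MPI_DOUBLE,  /* target offset (in displacement
                                            units), count, datatype          */
            win );                       /* window object                    */

MPI_Get has the same argument list with the direction of transfer reversed, and MPI_Accumulate takes one additional argument, a reduction operation such as MPI_SUM, placed just before the window object.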

With these routines, we can create a version of the mesh exchange that uses RMA instead of point-to-point communication. Figure 9.13 shows one possible implementation.

void exchange_nbrs( double u_local[][NY+2], int i_start, int i_end,
                    int left, int right, MPI_Win win )
{
    MPI_Aint left_ghost_disp, right_ghost_disp;
    int      c;

    MPI_Win_fence( 0, win );
    /* Put the left edge into the left neighbor's rightmost ghost
       cells.  See the text about right_ghost_disp; this formula
       assumes the left neighbor holds the same number of rows as
       the calling process */
    right_ghost_disp = 1 + (NY+2) * (i_end-i_start+2);
    MPI_Put( &u_local[1][1], NY, MPI_DOUBLE,
             left, right_ghost_disp, NY, MPI_DOUBLE, win );
    /* Put the right edge into the right neighbor's leftmost ghost
       cells */
    left_ghost_disp = 1;
    c = i_end - i_start + 1;
    MPI_Put( &u_local[c][1], NY, MPI_DOUBLE,
             right, left_ghost_disp, NY, MPI_DOUBLE, win );

    MPI_Win_fence( 0, win );
}

Figure 9.13: Neighbor exchange using MPI remote memory access.

Another form of access requires no MPI calls (not even a fence) at the target process. This is called passive target synchronization. The origin process uses MPI_Win_lock to begin an access epoch and MPI_Win_unlock to end it.[3] Because of the passive nature of this type of RMA, the local memory (passed as the first argument to MPI_Win_create) should be allocated with MPI_Alloc_mem and freed with MPI_Free_mem. For more information on passive target RMA operations, see [50, Chapter 6]. Also note that, as of 2003, not all MPI implementations support passive target RMA operations; check that your implementation fully implements them before using them.
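
A minimal sketch of the origin-process side of passive target RMA, with target_rank, local_buf, and n as placeholder names, might look like the following; the target process makes no MPI calls at all.

   /* Begin a passive-target access epoch on the window at target_rank  */
   MPI_Win_lock( MPI_LOCK_SHARED, target_rank, 0, win );
   /* Read n doubles from the beginning of the target's window memory   */
   MPI_Get( local_buf, n, MPI_DOUBLE,
            target_rank, 0, n, MPI_DOUBLE, win );
   /* End the access epoch; the get is complete when the unlock returns */
   MPI_Win_unlock( target_rank, win );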

A more complete discussion of remote memory access can be found in [50, Chapters 5 and 6]. Note that MPI implementations are just beginning to provide the RMA routines described in this section. Most current RMA implementations emphasize functionality over performance. As implementations mature, however, the performance of RMA operations will also improve.

[2]MPI has two kinds of epochs for RMA: an access epoch and an exposure epoch. For the example used here, the epochs occur together, and we refer to both of them as just epochs.

[3]The names MPI_Win_lock and MPI_Win_unlock are really misnomers; think of them as begin-RMA and end-RMA.



