The discussion of the mesh exchanges for the Jacobi problem in Section 8.3 concentrated on the algorithm and data structures, particularly the ghost-cell exchange. In this section, we return to that example and cover two other important issues: blocking and nonblocking communications and communicating noncontiguous data.
Consider the following simple code (note that this is similar to the simple version of exchange_nbrs in Section 8.3):
if (rank == 0) {
    MPI_Send( sbuf, n, MPI_INT, 1, 0, MPI_COMM_WORLD );
    MPI_Recv( rbuf, n, MPI_INT, 1, 0, MPI_COMM_WORLD, &status );
}
else if (rank == 1) {
    MPI_Send( sbuf, n, MPI_INT, 0, 0, MPI_COMM_WORLD );
    MPI_Recv( rbuf, n, MPI_INT, 0, 0, MPI_COMM_WORLD, &status );
}
What happens with this code? It looks like process 0 is sending a message to process 1 and that process 1 is sending a message to process 0. But more is going on here. Consider the steps that the MPI implementation must take to make this code work:
Copy the data from the MPI_Send into a temporary, system-managed buffer.
Once the MPI_Send completes (on each process), start the MPI_Recv. The data that was previously copied into a system buffer by the MPI_Send operation can now be delivered into the user's buffer (rbuf in this case).
This approach presents two problems, both related to the fact that data must be copied into a system buffer to allow the MPI_Send to complete. The first problem is obvious: any data motion takes time and reduces the performance of the code. The second problem is more subtle and important: the amount of available system buffer space always has a limit. For values of n in the above example that exceed the available buffer space, the above code will hang: neither MPI_Send can complete, and each process waits forever for the other to reach its MPI_Recv. This is true for any message-passing system, not just MPI. The amount of buffer space available for buffering a message varies among MPI implementations, ranging from many megabytes to as little as 128 bytes.
How can we write code that sends data among several processes and that does not rely on the availability of system buffers? One approach is to carefully order the send and receive operations so that each send is guaranteed to have a matching receive. For example, we can swap the order of the MPI_Send and MPI_Recv in the code for process 1:
if (rank == 0) {
    MPI_Send( sbuf, n, MPI_INT, 1, 0, MPI_COMM_WORLD );
    MPI_Recv( rbuf, n, MPI_INT, 1, 0, MPI_COMM_WORLD, &status );
}
else if (rank == 1) {
    MPI_Recv( rbuf, n, MPI_INT, 0, 0, MPI_COMM_WORLD, &status );
    MPI_Send( sbuf, n, MPI_INT, 0, 0, MPI_COMM_WORLD );
}
This can be awkward to implement, however, particularly for more complex communication patterns; in addition, it does not address the extra copy that may be performed by MPI_Send.
The approach used by MPI, following earlier message-passing systems as well as nonblocking sockets (see [48, Chapter 9]), is to split the send and receive operations into two steps: one to initiate the operation and one to complete the operation. Other operations, including other communication operations, can be issued between the two steps. For example, an MPI receive operation can be initiated by a call to MPI_Irecv and completed with a call to MPI_Wait. Because the routines that initiate these operations do not wait for them to complete, they are called nonblocking operations. The "I" in the routine name stands for "immediate"; this indicates that the routine may return immediately without completing the operation. The arguments to MPI_Irecv are the same as for MPI_Recv except for the last (status) argument. This is replaced by an MPI_Request value; it is a handle that is used to identify an initiated operation. To complete a nonblocking operation, the request is given to MPI_Wait, along with a status argument; the status argument serves the same purpose as status for an MPI_Recv. Similarly, the nonblocking counterpart to MPI_Send is MPI_Isend; this has the same arguments as MPI_Send with the addition of an MPI_Request as the last argument (in C). Using these routines, our example becomes the following:
if (rank == 0) {
    MPI_Request req1, req2;
    MPI_Isend( sbuf, n, MPI_INT, 1, 0, MPI_COMM_WORLD, &req1 );
    MPI_Irecv( rbuf, n, MPI_INT, 1, 0, MPI_COMM_WORLD, &req2 );
    MPI_Wait( &req1, &status );
    MPI_Wait( &req2, &status );
}
else if (rank == 1) {
    MPI_Request req1, req2;
    MPI_Irecv( rbuf, n, MPI_INT, 0, 0, MPI_COMM_WORLD, &req1 );
    MPI_Isend( sbuf, n, MPI_INT, 0, 0, MPI_COMM_WORLD, &req2 );
    MPI_Wait( &req1, &status );
    MPI_Wait( &req2, &status );
}
The buffer sbuf provided to MPI_Isend must not be modified until the operation is completed with MPI_Wait. Similarly, the buffer rbuf provided to MPI_Irecv must not be modified or read until the MPI_Irecv is completed.
The nonblocking communication routines allow the MPI implementation to wait until the message can be sent directly from one user buffer to another (e.g., from sbuf to rbuf) without requiring any copy or using any system buffer space.
Because it is common to start multiple nonblocking operations, MPI provides routines to test or wait for completion of any one, all, or some of the requests. For example, MPI_Waitall waits for all requests in an array of requests to complete. Figure 9.4 shows the use of nonblocking communication routines for the Jacobi example.[1]
void exchange_nbrs( double ulocal[][NY+2], int i_start, int i_end,
                    int left, int right )
{
    MPI_Status  statuses[4];
    MPI_Request requests[4];
    int c;

    /* Begin send and receive from the left neighbor */
    MPI_Isend( &ulocal[1][1], NY, MPI_DOUBLE, left, 0,
               MPI_COMM_WORLD, &requests[0] );
    MPI_Irecv( &ulocal[0][1], NY, MPI_DOUBLE, left, 0,
               MPI_COMM_WORLD, &requests[1] );

    /* Begin send and receive from the right neighbor */
    c = i_end - i_start + 1;
    MPI_Isend( &ulocal[c][1], NY, MPI_DOUBLE, right, 0,
               MPI_COMM_WORLD, &requests[2] );
    MPI_Irecv( &ulocal[c+1][1], NY, MPI_DOUBLE, right, 0,
               MPI_COMM_WORLD, &requests[3] );

    /* Wait for all communications to complete */
    MPI_Waitall( 4, requests, statuses );
}
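As an illustration of the "any one" form mentioned above, the following sketch (not part of the Jacobi code in Figure 9.4; it simply reuses that figure's requests array and loop variable) replaces the single MPI_Waitall with a loop over MPI_Waitany so that each operation can be handled as soon as it completes:

int idx;

/* Handle the four requests in whatever order they complete; a request
   completed by MPI_Waitany is set to MPI_REQUEST_NULL and is ignored
   on subsequent calls */
for (c = 0; c < 4; c++) {
    MPI_Waitany( 4, requests, &idx, MPI_STATUS_IGNORE );
    /* requests[idx] has completed; if it was one of the receives,
       the corresponding ghost cells are now valid */
}

The corresponding test routines (MPI_Test, MPI_Testany, MPI_Testall, and MPI_Testsome) check for completion without blocking.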
MPI nonblocking operations are not the same as asynchronous operations. The MPI standard does not require that nonblocking data transfers proceed concurrently with (that is, overlap) computation. MPI specifies only the semantics of the operations, not the details of the implementation choices. The MPI nonblocking routines are provided primarily for correctness (avoiding the limitations of system buffers) and performance (avoiding copies).
The one-dimensional decomposition used in the Jacobi example (Section 8.3) is simple but does not scale well and can lead to performance problems. We can analyze the performance of the Jacobi method by following the discussion in Section 8.2. Let the time to communicate n bytes be
Tcomm = s + rn,
where s is the latency and r is the (additional) time to communicate one byte. The time to compute one step of the Jacobi method, using the one-dimensional decomposition in Section 8.3, is approximately

T1d = c f n²/p + 2(s + rn),

where c is the number of floating-point operations needed to update one mesh point, f is the time to perform a floating-point operation, and p is the number of processes. Note that the cost of communication is independent of the number of processes; eventually, this cost will dominate the calculation. Hence, a better approach is to use a two-dimensional decomposition, as shown in Figure 9.5.
The time for one step of the Jacobi method with a two-dimensional decomposition is just

T2d = c f n²/p + 4(s + r n/√p).
The computation terms are identical, so this is faster than the one-dimensional decomposition as long as

4(s + r n/√p) < 2(s + rn),

which holds whenever

n > 2s/r

(assuming p ≥ 16, so that 4rn/√p ≤ rn). To implement this decomposition, we need to communicate data to four neighbors, as shown in Figure 9.6.
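For illustration only (these latency and transfer values are assumed, not taken from the text), if s were 10 microseconds and r were 10 nanoseconds per value, the two-dimensional decomposition would be preferred once n > 2s/r = 2,000, that is, for meshes larger than roughly 2000 × 2000.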
The left and right edges can be sent and received by using the same code as for the one-dimensional case. The top and bottom edges have noncontiguous data.
For example, the top edge needs to send the tenth, sixteenth, and twenty-second elements. There are four ways to move this data:
Each value can be sent separately. Because of the high latency of message passing, this approach is inefficient and normally should not be used.
The data can be copied into a temporary buffer by using a simple loop, for example,
for (i=0; i<3; i++) {
    tmp[i] = u_local[i][6];
}
MPI_Send( tmp, 3, MPI_DOUBLE, .. );
This is a common approach and, for some systems and MPI implementations, may be the most efficient.
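On the receiving side, the same idea works in reverse; in the sketch below (not in the original text; the ghost-cell column index 7 is only an assumed destination, and the source rank is elided as above), the data is received into the temporary buffer and then copied into place:

MPI_Recv( tmp, 3, MPI_DOUBLE, .., &status );
for (i=0; i<3; i++) {
    u_local[i][7] = tmp[i];   /* assumed ghost-cell location */
}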
MPI provides two routines to pack and unpack a buffer. These routines are MPI_Pack and MPI_Unpack. A buffer created with these routines should be sent and received with MPI datatype MPI_PACKED. We note, however, that these routines are most useful for complex data layouts that change frequently within a program.
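As a rough sketch of this approach (not taken from the text; packbuf and its allocation are introduced only for illustration), the three strided values used above could be packed and sent as follows, with the receiver using MPI_Unpack to recover them:

int position = 0, packsize;
char *packbuf;

/* Ask MPI how much buffer space three doubles may need when packed */
MPI_Pack_size( 3, MPI_DOUBLE, MPI_COMM_WORLD, &packsize );
packbuf = (char *) malloc( packsize );
for (i=0; i<3; i++) {
    MPI_Pack( &u_local[i][6], 1, MPI_DOUBLE,
              packbuf, packsize, &position, MPI_COMM_WORLD );
}
/* position now holds the number of packed bytes; the buffer is
   sent with datatype MPI_PACKED */
MPI_Send( packbuf, position, MPI_PACKED, .. );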
MPI provides a way to construct new datatypes representing any data layout. These routines can be optimized by the MPI implementation, in principle providing better performance than the user can achieve using a simple loop [120]. In addition, using these datatypes is crucial to achieving high performance with parallel I/O.
MPI provides several routines to create datatypes representing common patterns of data in memory. These new datatypes are called derived datatypes. For our case, MPI_Type_vector is what is needed: it creates a new MPI datatype representing data values separated by a constant stride. Here the stride is NY+2, and the number of elements is i_end - i_start + 1.
MPI_Type_vector( i_end - i_start + 1, 1, NY+2, MPI_DOUBLE,
                 &vectype );
MPI_Type_commit( &vectype );
The second argument is the block length: the number of basic datatype items (MPI_DOUBLE in this case) in each block; this is particularly useful in multicomponent PDE problems. The routine MPI_Type_commit must be called to commit the MPI datatype; this call allows the MPI implementation to optimize the datatype (the optimization is not included as part of the routines that create MPI datatypes because some complex datatypes are created recursively from other derived datatypes).
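For example, under the assumption that each mesh point stored three solution components contiguously (a hypothetical layout, not the one used in this chapter), the block length would describe all three values at each mesh point along the edge:

/* Hypothetical layout: double ulocal[][NY+2][3].  Each block is
   3 contiguous doubles, and consecutive blocks (successive i) are
   3*(NY+2) doubles apart. */
MPI_Type_vector( i_end - i_start + 1, 3, 3*(NY+2), MPI_DOUBLE,
                 &vectype3 );
MPI_Type_commit( &vectype3 );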
Using an MPI derived datatype representing a strided data pattern, we can write a version of exchange_nbrs for a two-dimensional decomposition of the mesh (exchange_nbrs2d); the code is shown in Figure 9.7. Note that we use the same derived datatype vectype for the sends and receives at both the top and the bottom; all that changes in each MPI call is the first element of the array u_local from which data is taken or into which data is moved.
void exchange_nbrs2d( double ulocal[][NY+2],
                      int i_start, int i_end, int j_start, int j_end,
                      int left, int right, int top, int bottom,
                      MPI_Datatype vectype )
{
    MPI_Status  statuses[8];
    MPI_Request requests[8];
    int c;

    /* Begin send and receive from the left neighbor */
    MPI_Isend( &ulocal[1][1], NY, MPI_DOUBLE, left, 0,
               MPI_COMM_WORLD, &requests[0] );
    MPI_Irecv( &ulocal[0][1], NY, MPI_DOUBLE, left, 0,
               MPI_COMM_WORLD, &requests[1] );

    /* Begin send and receive from the right neighbor */
    c = i_end - i_start + 1;
    MPI_Isend( &ulocal[c][1], NY, MPI_DOUBLE, right, 0,
               MPI_COMM_WORLD, &requests[2] );
    MPI_Irecv( &ulocal[c+1][1], NY, MPI_DOUBLE, right, 0,
               MPI_COMM_WORLD, &requests[3] );

    /* Begin send and receive from the top neighbor */
    MPI_Isend( &ulocal[1][NY], 1, vectype, top, 0,
               MPI_COMM_WORLD, &requests[4] );
    MPI_Irecv( &ulocal[1][NY+1], 1, vectype, top, 0,
               MPI_COMM_WORLD, &requests[5] );

    /* Begin send and receive from the bottom neighbor */
    MPI_Isend( &ulocal[1][1], 1, vectype, bottom, 0,
               MPI_COMM_WORLD, &requests[6] );
    MPI_Irecv( &ulocal[1][0], 1, vectype, bottom, 0,
               MPI_COMM_WORLD, &requests[7] );

    /* Wait for all communications to complete */
    MPI_Waitall( 8, requests, statuses );
}
When a derived datatype is no longer needed, it should be freed with MPI_Type_free. Many other routines are available for creating datatypes; for example, MPI_Type_indexed is useful for scatter-gather patterns, and MPI_Type_create_struct can be used for an arbitrary collection of memory locations.
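As a small sketch of these routines (the block lengths and displacements below are arbitrary values chosen only for illustration), an irregular set of entries in an array of doubles can be described with MPI_Type_indexed, committed, used, and then released with MPI_Type_free when it is no longer needed:

int blens[3]  = { 1, 1, 1 };    /* one double at each location */
int displs[3] = { 2, 7, 11 };   /* arbitrary, irregular offsets */
MPI_Datatype idxtype;

MPI_Type_indexed( 3, blens, displs, MPI_DOUBLE, &idxtype );
MPI_Type_commit( &idxtype );
/* ... use idxtype in sends and receives ... */
MPI_Type_free( &idxtype );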
Early implementations of derived datatypes did not achieve good performance, trading simplicity of implementation for performance. More recent implementations provide better performance, sometimes greater than is possible with straightforward user code. See [49, 120, 20] for some examples.
[1]On many systems, calling MPI_Isend before MPI_Irecv will improve performance.