9.6 Parallel I/O

MPI provides a wide variety of parallel I/O operations, more than we have space to cover here. See [50, Chapter 3] for a more thorough discussion of I/O in MPI. These operations are most useful when combined with a high-performance parallel file system, such as PVFS, described in Chapter 19.

The fundamental idea in MPI's approach to parallel I/O is that a file is opened collectively by a set of processes that are all given access to the same file. MPI thus associates a communicator with the file, allowing a flexible set of both individual and collective operations on the file.

This approach can be used directly by the application programmer, as described below. An alternative is to use libraries designed to provide efficient and flexible access to files stored in standard formats. Two such libraries are parallel NetCDF [66] and HDF5 [25].

9.6.1 A Simple Example

We first provide a simple example of how processes write contiguous blocks of data into the same file in parallel. Then we give a more complex example, in which the data in each process is not contiguous but can be described by an MPI datatype.

For our first example, let us suppose that after solving the Poisson equation as we did in Section 8.3, we wish to write the solution to a file. We do not need the values of the ghost cells, and in the one-dimensional decomposition the set of rows in each process makes up a contiguous area in memory, which greatly simplifies the program. The I/O part of the program is shown in Figure 9.11.

Start Figure
    MPI_File outfile;
    size = NX * (NY + 2);
    MPI_File_open( MPI_COMM_WORLD, "solutionfile",
                   MPI_MODE_CREATE | MPI_MODE_WRONLY,
                   MPI_INFO_NULL, &outfile );
    MPI_File_set_view( outfile,
                  rank * (NY+2) * (i_end - i_start) * sizeof(double),
                  MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL );
    MPI_File_write( outfile, &ulocal[1][0], size, MPI_DOUBLE,
                    MPI_STATUS_IGNORE );
    MPI_File_close( &outfile );
End Figure

Figure 9.11: Parallel I/O of Jacobi solution. Note that this choice of file view works only for a single output step; if output from multiple steps of the Jacobi method is needed, the arguments to MPI_File_set_view must be modified.
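If multiple steps are written, the displacement for each process's view must advance past everything written by all processes in earlier steps. A plain-C sketch of that arithmetic (the helper name and the per-process count are illustrative, not part of the example program):

```c
/* Hypothetical helper: byte displacement of this process's file view
   when `step` complete solutions already precede it in the file. Each
   of the nprocs processes contributes doubles_per_proc values per
   step, laid out process by process within each step. */
long long step_displacement(int step, int rank, int nprocs,
                            long long doubles_per_proc)
{
    long long bytes_per_proc = doubles_per_proc
                               * (long long)sizeof(double);
    return ((long long)step * nprocs + rank) * bytes_per_proc;
}
```

The value returned here would be passed as the second argument of MPI_File_set_view before each step's write.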

Recall that the data to be written from each process, not counting ghost cells but including the boundary data, is in the array ulocal[i][j] for i=i_start to i_end and j=0 to NY+1.

Note that the type of an MPI file object is MPI_File. Such file objects are opened and closed much the way normal files are opened and closed. The most significant difference is that opening a file is a collective operation over a group of processes specified by the communicator in the first argument of MPI_File_open. A single process can open a file by specifying the single-process communicator MPI_COMM_SELF. Here we want all of the processes to share the file, and so we use MPI_COMM_WORLD.

In our discussion of dynamic process management, we mentioned MPI_Info objects. An MPI info object is a collection of key=value pairs that can be used to encapsulate a variety of special-purpose information that may not be applicable to all MPI implementations. In this section we will use MPI_INFO_NULL whenever this type of argument is required, since we have no special information to convey. For details about MPI_Info, see [50, Chapter 2].

The part of the file that will be seen by each process is called the file view and is set for each process by a call to MPI_File_set_view. In our example the call is

    MPI_File_set_view( outfile, rank * (NY+2) * ( ... ),
                   MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL )

The first argument identifies the file; the second is the displacement (in bytes) into the file of where the process's view of the file is to start. Here we simply multiply the size of the data to be written by the process's rank, so that each process's view starts at the appropriate place in the file. The type of this argument is MPI_Offset, which can be expected to be a 64-bit integer on systems that support large files.
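One practical consequence: the displacement has type MPI_Offset, but the quantities multiplied in Figure 9.11 are ints, so on systems where int is 32 bits the product should be widened before multiplying. A sketch (hypothetical helper, illustrative sizes) of the safe form:

```c
/* Widening the first operand forces the whole product to be computed
   in 64-bit arithmetic, so the byte displacement stays correct even
   when it no longer fits in a 32-bit int. */
long long view_offset(int rank, int ny, int local_rows)
{
    return (long long)rank * (ny + 2) * local_rows
           * (long long)sizeof(double);
}
```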

The next argument is called the etype of the view; it specifies the unit of data in the file. Here it is just MPI_DOUBLE, since we will be writing some number of doubles. The next argument is called the filetype; it is a flexible way of describing noncontiguous views in the file. In our case, with no noncontiguous units to be written, we can just use the etype, MPI_DOUBLE. In general, any MPI predefined or derived datatype can be used for both etypes and filetypes. We explore this use in more detail in the next example.

The next argument is a string defining the data representation to be used. The "native" representation says to represent data on disk exactly as it is in memory, which provides the fastest I/O performance, at the possible expense of portability. We specify that we have no extra information by providing MPI_INFO_NULL for the final argument.

The call to MPI_File_write is then straightforward. The data to be written is a contiguous array of doubles, even though it consists of several rows of the (distributed) matrix. On each process it starts at &ulocal[1][0], so the data is described in (address, count, datatype) form, just as it would be for an MPI message. We ignore the status by passing MPI_STATUS_IGNORE. Finally, we (collectively) close the file with MPI_File_close.

9.6.2 A More Complex Example

Parallel I/O requires more than just calling MPI_File_write instead of write. The key idea is to describe the object being written as a whole, across processes, rather than the separate contribution from each process. We illustrate this with an example of a regular distributed array.

The code in Figure 9.12 writes out an array that is distributed among processes with a two-dimensional decomposition. To illustrate the expressiveness of the MPI interface, we show a complex case where, as in the Jacobi example, the distributed array is surrounded by ghost cells. This example is covered in more depth in Chapter 3 of Using MPI-2 [50], including the simpler case of a distributed array without ghost cells.

Start Figure
/* no. of processes in vertical and horizontal dimensions
   of process grid */
dims[0] = 2;   dims[1] = 3;
periods[0] = periods[1] = 1;
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm);
MPI_Comm_rank(comm, &rank);
MPI_Cart_coords(comm, rank, 2, coords);
/* no. of rows and columns in global array */
gsizes[0] = m;    gsizes[1] = n;

lsizes[0] = m/dims[0];   /* no. of rows in local array */
lsizes[1] = n/dims[1];   /* no. of columns in local array */

/* global indices of the first element of the local array */
start_indices[0] = coords[0] * lsizes[0];
start_indices[1] = coords[1] * lsizes[1];
MPI_Type_create_subarray(2, gsizes, lsizes, start_indices,
                         MPI_ORDER_C, MPI_FLOAT, &filetype);

MPI_File_open(comm, "/pfs/datafile",
              MPI_MODE_CREATE | MPI_MODE_WRONLY,
              MPI_INFO_NULL, &fh);
MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native",
                  MPI_INFO_NULL);

/* create a derived datatype that describes the layout of the local
   array in the memory buffer that includes the ghost area. This is
   another subarray datatype! */
memsizes[0] = lsizes[0] + 8; /* no. of rows in allocated array */
memsizes[1] = lsizes[1] + 8; /* no. of columns in allocated array */
start_indices[0] = start_indices[1] = 4;
/* indices of the first element of the local array in the
   allocated array */
MPI_Type_create_subarray(2, memsizes, lsizes, start_indices,
                         MPI_ORDER_C, MPI_FLOAT, &memtype);
MPI_File_write_all(fh, local_array, 1, memtype, &status);
MPI_File_close(&fh);
End Figure

Figure 9.12: C program for writing a distributed array that is also noncontiguous in memory because of a ghost area (derived from an example in [50]).

This example may look complex, but each step is relatively simple.

  1. Set up a communicator that represents a virtual array of processes that matches the way that the distributed array is distributed. This step uses the MPI_Cart_create routine to build the communicator and MPI_Cart_coords to find the coordinates of the calling process in this array of processes. This particular choice of process ordering is important because it matches the ordering required by MPI_Type_create_subarray.

  2. Create a file view that describes the part of the file that this process will write to. The MPI routine MPI_Type_create_subarray makes it easy to construct the MPI datatype that describes this region of the file. The arguments to this routine specify the dimensionality of the array (two in our case), the global size of the array, the local size (that is, the size of the part of the array on the calling process), the location of the local part (start_indices), the ordering of indices (column major is MPI_ORDER_FORTRAN, and row major is MPI_ORDER_C), and the basic datatype.

  3. Open the file for writing (MPI_MODE_WRONLY), and set the file view with the datatype we have just constructed.

  4. Create a datatype that describes the data to be written. We can use MPI_Type_create_subarray here as well to define the part of the local array that does not include the ghost points. If there were no ghost points, we could instead use MPI_FLOAT as the datatype with a count of lsizes[0]*lsizes[1] in the call to MPI_File_write_all.

  5. Perform a collective write to the file with MPI_File_write_all, and close the file.
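The subarray filetype created in step 2 is, in effect, a compact description of index arithmetic: it assigns each local element a well-defined position in the global file. A plain-C sketch (hypothetical helper, no MPI calls) of the file offset, in elements, corresponding to local element (i, j) on the process at grid coordinates (coord0, coord1):

```c
/* File offset (in elements) of local element (i, j) for the process
   at grid coordinates (coord0, coord1), under the row-major
   (MPI_ORDER_C) subarray layout used in Figure 9.12. lrows/lcols are
   the local block dimensions and gcols is the global column count. */
long long file_element_offset(int coord0, int coord1,
                              int lrows, int lcols, int gcols,
                              int i, int j)
{
    long long grow = (long long)coord0 * lrows + i;  /* global row */
    long long gcol = (long long)coord1 * lcols + j;  /* global column */
    return grow * gcols + gcol;
}
```

Describing this mapping once with a datatype, instead of computing it element by element, is what lets the collective write in step 5 transfer each process's noncontiguous pieces in a single operation.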

By using MPI datatypes to describe both the data to be written and the destination of the data in the file with a collective file write operation, the MPI implementation can make the best use of the I/O system. The result is that file I/O operations performed with MPI I/O can achieve hundredfold improvements in performance over using individual Unix I/O operations [116].

Part III: Managing Clusters