19.4 Using PVFS

In the previous edition of this book, the majority of this chapter was dedicated to the specifics of PVFS configuration and use. This information is all available at the PVFS Web site [90], in particular in the User's Guide [91]. Rather than rehash that document, we'll talk a little bit about practical aspects of using PVFS, including implications of the PVFS design on certain types of operations, managing and tuning PVFS file systems, using ROMIO with PVFS, and bug spotting. We hope that this information supplements the online documentation nicely. Section 19.5 describes PVFS2, the next generation of PVFS, which addresses many of the design limitations of PVFS.

19.4.1 Implications of the PVFS Design

The preceding sections have prepared us to discuss the implications of the PVFS design from a practical standpoint. First, PVFS does not perform client-side caching for metadata. Hence, all metadata operations have to travel across the network to the metadata server. For heavy metadata workloads, this design can cause sluggish performance.

Additionally, PVFS does not keep a file size as part of the metadata stored at the metadata server; rather, it calculates this value when it is requested. The advantage is that, during writes, the metadata need not be updated. However, a stat on a file requires not only a message to the metadata server to obtain the static metadata but also a sequence of messages to the I/O servers (performed by the metadata server) in order to obtain the partial sizes necessary to fill in the file size. The ls program performs this operation on every file in a listed directory, which can cause ls to be very slow for PVFS file systems. In practice, this makes PVFS a poor performer for small files, too, because users tend to put all the small files in one directory. Then they ls the directory and are frustrated by the delay. A pvfs-ls utility is provided with PVFS that avoids gathering this metadata, instead just printing directory contents. For users who simply want to see what resides in a directory, this is a much faster option.
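As a back-of-envelope sketch of why ls is slow, the following estimates the number of requests generated by listing a directory. It assumes one request to the metadata server per file plus one size query to each I/O server; the function name and this message accounting are illustrative assumptions, not measurements of PVFS.

```shell
# Rough estimate of the requests triggered by stat-ing every file in a
# directory. ASSUMPTION: one request to the metadata server per file, plus
# one size query from the metadata server to each I/O server.
stat_messages() {
    files=$1
    iods=$2
    echo $(( files * (1 + iods) ))
}

stat_messages 1000 4   # prints 5000
```

Under these assumptions, a single ls of 1,000 files on a four-server file system costs about 5,000 requests, whereas a plain directory listing needs none of the size queries.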

PVFS does not cache data at the client side because it has no mechanism for ensuring that cached data is kept synchronized with data in other caches or on I/O servers. Hence, all data reads and writes must cross the network as well. Thus, the size of reads and writes to large files does have a significant impact on performance, especially through the VFS interface, which has particularly high overhead. This design decision makes PVFS perform poorly for benchmarks such as Bonnie [18]. Along these same lines, executing programs stored on a PVFS volume can be quite slow because pages are read one at a time on demand.

Missing Features

Users are occasionally surprised by the fact that some features are missing from PVFS. Here's a list as of version 1.5.8:

links (both hard and symbolic)
write-sharing through mmap
flock and fcntl locks
fault tolerance (other than using RAID, described later)

That's about it! If a user requires one of these features, perhaps one of the systems described earlier in the chapter will suffice instead.

19.4.2 Managing PVFS File Systems

PVFS allows for many different possible configurations. In this section we'll discuss some of these options.

While PVFS is relatively simple for a parallel file system, it can sometimes be difficult to discover the cause of problems when they occur simply because there are many components that might be the source of trouble. Here we discuss some tools and techniques for finding problem spots.

Monitoring File System Health

The pvfs-ping utility is the most useful tool for discovering the status of a PVFS file system and has turned into something of a "Swiss army knife" for PVFS debugging at this point.

A simple example of its use is as follows:

# pvfs-ping -h localhost -f /pvfs-meta -p 3000
mgr (localhost:3000) is responding.
iod 0 (127.0.0.1:7000) is responding.
pvfs file system /pvfs-meta is fully operational.

In this case the I/O server is dead and needs to be restarted:

# pvfs-ping -h localhost -f /pvfs-meta -p 3000
mgr (localhost:3000) is responding.
pvfs-ping: unable to connect to iod at 127.0.0.1:7000.
iod 0 (127.0.0.1:7000) is down.
pvfs file system /pvfs-meta has issues.
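
For unattended monitoring, the output can be scanned for servers reported as down. The following is a minimal sketch; the function name is hypothetical, and the matched text is taken from the sample output above, so it may need adjusting for other versions.

```shell
# Read pvfs-ping output on standard input and print the lines for servers
# reported as down. Exits nonzero if any are found. The function name is
# hypothetical; the pattern follows the sample output shown above.
pvfs_down_servers() {
    awk '/ is down\./ { print; bad = 1 } END { exit bad + 0 }'
}

# Typical use (requires a PVFS installation):
#   pvfs-ping -h localhost -f /pvfs-meta -p 3000 | pvfs_down_servers \
#       || echo "one or more PVFS servers need attention"
```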

Using Multiple File Systems

Since PVFS includes no fault tolerance, for large systems it can make sense from a fault tolerance point of view to create multiple PVFS volumes. A single metadata server can serve multiple file systems if desired; however, if multiple file systems are chosen for fault tolerance reasons, it is definitely better to use multiple servers for I/O (one per file system). A single I/O server daemon (iod) cannot serve more than one file system. However, more than one daemon may be run on the same server if desired by specifying a different port value in the iod.conf file used to start the server.

Tolerating Disk Failures

Disk failures can be tolerated by using any of the many available RAID solutions under Linux, including both hardware devices and software RAID. There have been very few reported instances of data loss with PVFS because of software failures. Using RAID to tolerate disk failures is an effective mechanism for increasing the reliability of PVFS.

Increasing Usable File Descriptors

While some improvements have been made in PVFS with respect to file descriptor (FD) utilization, the servers in particular still can end up using all of their available FDs. The I/O servers will print a little message when this is about to happen:

NOTICE: exceeded 90 percent of available FDs (1024)!

Luckily this is easy to fix. The limits are set in /etc/security/limits.conf. Lines are of the following format:

<domain> <type> <item> <value>

The domain can be "*" for everyone, a userid, or a group using "@group". The type can be soft (setting the default) or hard (setting the maximum). The item parameter controls what limit this affects and can take many values, including nofile (open files). "Value" is the new value to set.

For example, the following lines would set the maximum number of FDs for root to 8192 and the default to 4096:

root hard nofile 8192
root soft nofile 4096

Alternatively, one can set only a new maximum in this file and then use limit or ulimit, as appropriate for the shell, in the startup script for the servers to raise the limit in effect.
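
A startup script using a Bourne-style shell might raise its own limit along these lines. This is a sketch only; the function name is hypothetical, and the value requested is up to the administrator.

```shell
# Raise the soft open-file limit for this shell and the processes it starts,
# capped at the hard limit. The function name is hypothetical.
raise_fd_limit() {
    want=$1
    hard=$(ulimit -H -n)
    if [ "$hard" != "unlimited" ] && [ "$want" -gt "$hard" ]; then
        want=$hard
    fi
    ulimit -S -n "$want"
}

# In a server startup script, before launching the daemon:
#   raise_fd_limit 4096
```

Capping the request at the hard limit keeps the script from failing outright when limits.conf has not been updated on a particular node.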

Migrating Metadata

When upgrading to a newer PVFS version, occasionally the format of metadata on disk changes. This is due to oversights in the original design of the metadata format. Tools are now provided that can be used to convert metadata to the new format (assuming you haven't gotten too far behind on updates).

For example, if you are moving from version 1.5.6 to version 1.5.8, a utility called migrate-1.5.6-to-1.5.8 is provided (there were no changes from 1.5.7 to 1.5.8 in the metadata format). This tool is used in conjunction with find:

# find /pvfs-meta -type f -not -name .pvfsdir -not \
  -name .iodtab -exec migrate-1.5.6-to-1.5.8 \{\} \;

Warning messages will be printed and the process aborted if the utility detects that the metadata is not the correct version. This process should be performed after stopping the mgr.
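
Before running the migration, it can be useful to see which files would be touched. The sketch below reuses the find tests from the command above but only prints the matching names; the function name is hypothetical.

```shell
# List the metadata files that the find command above would hand to the
# migration utility, without modifying anything. The function name is
# hypothetical; the find tests mirror the command in the text.
list_migration_targets() {
    find "$1" -type f -not -name .pvfsdir -not -name .iodtab -print
}

# Example: count the files that would be migrated.
#   list_migration_targets /pvfs-meta | wc -l
```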

19.4.3 Tuning PVFS File Systems

We often get questions about how to tune PVFS file systems for the best performance. Truthfully, system hardware varies widely enough that it is difficult for us to supply any single set of parameters that will work best for everyone. Instead, in this section we discuss some specific parameters common to all machines and some general techniques for improving overall PVFS file system performance. Chapters 3 and 5 include many tips for improving the overall performance of Linux nodes; all that information certainly applies to PVFS servers as well.

Of course, in addition to tuning the file system itself, many steps can be taken above the file system that can make a huge difference. Given the discussion of the PVFS design, many of these are obvious: using large requests rather than small ones, using MPI-IO so PVFS List I/O optimizations can be leveraged, and avoiding lots of metadata operations (opens, closes, and stats). Often such optimizations in application code can make more difference than any tuning within PVFS itself. An in-depth discussion of improving the performance of MPI-IO access can be found in [50].

Adjusting Socket Buffers

PVFS relies heavily on the select call and kernel handling of multiple TCP connections for parallelism. For this reason, it is often useful to tune the network-related parameters on the system. Chapter 5 covers this process in some detail; in particular increasing the wmem_max and rmem_max values is often very helpful.

Once these have been increased, the socket_buf option in the I/O server's configuration file (iod.conf) can be used to adjust the socket buffer size up to the new maximum.
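
As a sketch of the first step, the function below prints sysctl settings for the two kernel maximums. The function name is hypothetical, and the 1 Mbyte figure in the usage comment is only a placeholder, not a recommended value.

```shell
# Print sysctl settings that raise the kernel's maximum socket buffer sizes
# to the given number of bytes. The function name is hypothetical.
socket_limit_settings() {
    printf 'net.core.rmem_max = %s\n' "$1"
    printf 'net.core.wmem_max = %s\n' "$1"
}

# As root, append the settings and load them:
#   socket_limit_settings 1048576 >> /etc/sysctl.conf
#   sysctl -p
```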

Enabling DMA for Hard Drives

Chapter 3 describes the hdparm tool. It can be used to verify that DMA is turned on for the hard drives that are being used for PVFS storage and to turn this on if it is not enabled. Because PVFS pushes both the network and storage hardware, alleviating any load on the CPU is helpful. Note that DMA isn't reliable on some hardware, so you should check the support of your hardware if this isn't turned on by default.

Improving Space Utilization

Originally we thought that users would want to know where their data was striped so that they could distribute processes to match data locations. Hence, we set up default striping so that data always started on the first I/O server. It turns out that for the most part people don't care about this and rarely use this information. Additionally, when users create lots of small files, this unbalances the distribution of data across the I/O servers.

We have subsequently added a "-r" flag that can be passed to the metadata server (mgr). This flag causes the metadata server to choose a random starting I/O server whenever the application does not specify one (a starting server can be specified through the MPI-IO interface, for example). This better distributes files across the servers and has a particularly large effect in the small files case.

Here we examine the free space on the I/O servers of a PVFS file system using the additional "-s" option to pvfs-ping:

# pvfs-ping -h localhost -f /pvfs-meta -s
mgr (localhost:3000) is responding.
iod 0 (192.168.67.51:7000) is responding.
iod 0 (192.168.67.51:7000): total space = 292825 Mbytes,
 free space = 92912 Mbytes
iod 1 (192.168.67.52:7000) is responding.
iod 1 (192.168.67.52:7000): total space = 307493 Mbytes,
 free space = 121154 Mbytes
iod 2 (192.168.67.53:7000) is responding.
iod 2 (192.168.67.53:7000): total space = 307485 Mbytes,
 free space = 121155 Mbytes
iod 3 (192.168.67.54:7000) is responding.
iod 3 (192.168.67.54:7000): total space = 307493 Mbytes,
 free space = 121199 Mbytes

We see that the first I/O server has significantly less free space than the others. This will show up in the df output:

Filesystem              Size Used Avail Use% Mounted on
localhost:/pvfs-meta
                        1.2T 824G 363G 69% /pvfs

PVFS calculates the available space reported to the system as the minimum amount available on any single I/O server (in this case 92,912 Mbytes) times the number of I/O servers (in this case 4). Because so much less space is available on the first server, the reported available space is well below the total free space across the servers. Using the "-r" manager flag described above will help alleviate this problem.
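
The calculation can be reproduced from the pvfs-ping output. The following sketch, with a hypothetical function name, reads that output and applies the rule just described.

```shell
# Read "pvfs-ping -s" output on standard input and print the available space
# in Mbytes as described in the text: the smallest per-server free space
# times the number of I/O servers. The function name is hypothetical.
pvfs_reported_avail() {
    awk '/free space = / {
             free = $(NF - 1) + 0      # the number just before "Mbytes"
             if (n == 0 || free < min) min = free
             n++
         }
         END { if (n > 0) print min * n }'
}

# Typical use (requires a PVFS installation):
#   pvfs-ping -h localhost -f /pvfs-meta -s | pvfs_reported_avail
```

For the four servers shown above this gives 92912 x 4 = 371648 Mbytes, which is the 363G that df reports once divided by 1024.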

Testing Aggregate Bandwidth

Since users are mostly interested in PVFS for high performance, obtaining a baseline performance number for a particular configuration is fairly important. The pvfs-test tool supplied with PVFS can be used for this purpose. This is an MPI program that opens a file from a large number of processes and writes or reads that file in parallel with each process accessing a different large block of the file. A "-h" option will cause it to list its options. This program can be used as a simple benchmark for testing the effects of configuration changes.

Here's the output of one of our favorite runs, using 80 nodes of Chiba City (see Chapter 20) as clients for PVFS and 128 separate nodes for I/O servers back in April of 2001:

mpirun -nolocal -np 80 -machinefile mach.all pvfs-test -s 262144 -f \
 /sandbox/pvfs/testfile  -b 268435456 -i 1 -u
# Using native pvfs calls.
nr_procs = 80, nr_iter = 1, blk_sz = 268435456, nr_files = 1
# total_size = 21474836480
# Write:  min_t = 3.639028, max_t = 6.166665, mean_t = 4.755538,
 var_t = 0.334923
# Read:  min_t = 6.490499, max_t = 7.171075, mean_t = 6.977580,
 var_t = 0.023353
Write bandwidth = 3482.406857 Mbytes/sec
Read bandwidth = 2994.646755 Mbytes/sec

We did not sync after the writes ("-y" option), so the data was at the servers but not necessarily on disk. Nevertheless we were able to create a 20 Gbyte file in just over 6 seconds and read it back in just over 7 seconds. Not too shabby at the time. Note that we found a strip size of 256 Kbytes to be the best for that particular configuration, where a strip is the amount of data written to a single server (and a stripe is the amount written across all servers in the round-robin fashion).
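
The reported bandwidths can be checked by hand. The sketch below divides the total size by the slowest process's time; this formula is inferred from the numbers in the sample output rather than taken from the pvfs-test source, and the function name is hypothetical.

```shell
# Aggregate bandwidth in Mbytes/sec, where 1 Mbyte = 10^6 bytes, computed as
# total bytes divided by the slowest process's time. ASSUMPTION: this formula
# is inferred from the sample output; the function name is hypothetical.
aggregate_bw() {
    awk -v bytes="$1" -v max_t="$2" \
        'BEGIN { printf "%.2f\n", bytes / max_t / 1000000 }'
}

aggregate_bw 21474836480 6.166665   # write: prints 3482.41
aggregate_bw 21474836480 7.171075   # read:  prints 2994.65
```

Both results agree with the figures printed by pvfs-test. Note that the bandwidths use Mbytes of 10^6 bytes, whereas the 20 Gbyte file size counts Gbytes of 2^30 bytes.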

Adjusting the Default Strip Size

By default the strip size (the size of the regions distributed in round-robin fashion to I/O servers) is set to 64 Kbytes (as of version 1.5.8). For some systems, particularly ones using large RAID volumes at each I/O server, this is simply too small.

The pvfs-test tool can be used to experiment with various strip sizes in order to find a good one for a particular configuration. Using the "-y" option will help ensure more accurate results by forcing data to the disk. Once a good value has been found, an additional "-s ssize" option can be used with the metadata server in order to provide the new default value (ssize is in bytes).

It is also useful to adjust the I/O server write buffer size to be larger than this size. That value is set in the I/O server configuration file with the write_buf option (value is in Kbytes, and the default is 512 Kbytes).
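
A strip size experiment can be scripted as a simple sweep. The sketch below only prints the commands so that they can be reviewed first. It assumes that "-s" sets the strip size in bytes, as the 262144 value in the earlier sample run suggests; the function name, process count, machine file, and path are placeholders copied from that run.

```shell
# Print one pvfs-test command line per candidate strip size. Flags mirror the
# sample run shown earlier, with "-y" added to sync after writes.
# ASSUMPTION: "-s" is the strip size in bytes. The function name, process
# count, machine file, and test file path are placeholders.
strip_sweep_cmds() {
    for ssize in "$@"; do
        echo "mpirun -nolocal -np 80 -machinefile mach.all pvfs-test" \
             "-s $ssize -f /sandbox/pvfs/testfile -b 268435456 -i 1 -u -y"
    done
}

strip_sweep_cmds 65536 131072 262144 524288
```

Since placement parameters apply when a file is created, it is probably safest to remove the test file, or use a fresh name, between runs.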

19.4.4 ROMIO and PVFS

MPI-IO implementations provide a number of services beyond those of a local file interface. First and foremost, these implementations provide a portable interface to which application programmers can code. The MPI-IO implementation takes MPI-IO operations and translates them into operations that can be performed by the underlying file system. Depending on that file system, the implementation has a number of options for how it translates an MPI-IO read or write operation into file system operations. If the underlying file system supports only POSIX operations, the MPI-IO layer might convert the MPI-IO request into a collection of contiguous operations. For a file system such as PVFS, MPI-IO requests might instead be converted into List I/O operations.

The second service that MPI-IO implementations provide is I/O optimizations. As we have discussed before, the MPI-IO semantics leave some opportunities for performance optimizations that are not available under the POSIX semantics. Further, the information provided by the use of collective I/O calls provides additional opportunities for optimizations. For more information on MPI-IO in general, including examples, see Chapter 9 of this book or [50]. In this section we will touch upon building ROMIO with PVFS support and then discuss in detail the optimizations available within ROMIO that are usable with PVFS.

Building MPICH and ROMIO with PVFS Support

Chapter 8 introduced the MPICH implementation of the MPI standard. ROMIO is included as part of the MPICH package. When configuring MPICH with ROMIO and PVFS support, a few additional parameters are necessary. Particularly we want to tell ROMIO what kinds of file systems to support, link to the PVFS library, and provide the path to PVFS include files.

For example, let us assume that PVFS was previously installed into /soft/pub/packages/pvfs-1.5.8, and we want both PVFS and "regular" (UFS) file system support:

# ./configure --with-romio="-file_system=pvfs+ufs" \
   -lib="-L/soft/pub/packages/pvfs-1.5.8/lib/ -lpvfs" \
   -cflags="-I/soft/pub/packages/pvfs-1.5.8/include"

The standard MPICH build and installation procedure can be followed from here. Building with LAM is very similar.

If ROMIO is not compiled with PVFS support, it will access files only through the kernel-supported interface (i.e., a mounted PVFS file system). If PVFS support is compiled into ROMIO and you attempt to access a PVFS-mounted volume, the PVFS library will detect that these are PVFS files (if the pvfstab file is correct) and use the library calls to avoid the kernel overhead. If PVFS support is compiled into ROMIO and you attempt to access a PVFS file for which there is no mounted volume, the file name passed to the MPI-IO call must be prefixed with pvfs: to indicate that the file is a PVFS file; otherwise ROMIO will not be able to find the file.
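
A job script that launches an MPI-IO program on nodes without a mounted PVFS volume can add the prefix itself. A minimal sketch, with a hypothetical function name:

```shell
# Prefix a path with "pvfs:" unless it already carries the prefix, for use as
# the file name handed to an MPI-IO program when no PVFS volume is mounted.
# The function name is hypothetical.
romio_pvfs_name() {
    case "$1" in
        pvfs:*) printf '%s\n' "$1" ;;
        *)      printf 'pvfs:%s\n' "$1" ;;
    esac
}

romio_pvfs_name /pvfs/testfile   # prints pvfs:/pvfs/testfile
```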

ROMIO Optimizations

ROMIO implements a pair of optimizations to address inefficiencies in existing file system interfaces and to leverage additional information provided through the use of collective operations. These optimizations, as well as PVFS options such as striping parameters, are controlled through the use of the MPI_Info system, commonly known as "hints." Much of the information in this section comes from the ROMIO users guide [117]; this guide provides additional information on these topics as well as covering the use of ROMIO on file systems other than PVFS.

The first of these optimizations is data sieving [114], a technique for efficiently accessing noncontiguous regions of data in files when noncontiguous accesses are not provided as a file system primitive or where the noncontiguous access primitives are inefficient for a certain datatype. In the data sieving technique, a number of noncontiguous regions are accessed by reading a single block of data containing all of the regions, including the unwanted data between them (called "holes"). The regions of interest are then extracted from this large block by the client. This technique has the advantage of requiring only a single I/O call, at the cost of reading additional data from the disk and passing it across the network. For file systems with locking, the data sieving technique can also be used for writes through a read-modify-write process. Unfortunately, since PVFS currently has no file locking of any kind, data sieving writes are not available for PVFS.

Two hints can be used to control the application of data sieving in ROMIO for PVFS:

ind_rd_buffer_size controls the size (in bytes) of the intermediate buffer used by ROMIO when performing data sieving during read operations. Default is 4194304 (4 Mbytes). If data will not all fit into this buffer, multiple reads will be performed.
romio_ds_read determines when ROMIO will choose to perform data sieving. Valid values are enable, disable, or automatic. Default value is automatic. In automatic mode ROMIO may choose to enable or disable data sieving based on heuristics.

The second optimization is two-phase I/O [113]. Two-phase I/O, also called collective buffering, is an optimization that applies only to collective I/O operations. In two-phase I/O, the collection of independent I/O operations that make up the collective operation are analyzed to determine what data regions must be transferred (read or written). These regions are then split up among a set of aggregator processes that will actually interact with the file system. In the case of a read, these aggregators first read their regions from disk and redistribute the data to the final locations; in the case of a write, data is first collected from the processes before being written to disk by the aggregators. Figure 19.8 shows a simple example of the two-phase write using a single aggregator process. In the first phase (step), the two nonaggregator processes pass their data to the aggregator. In the second step the aggregator writes all the data to the storage system. In practice many aggregators are used to help balance the I/O rate of the aggregators to that of the I/O system. Because the MPI semantics specify results of I/O operations only in the context of the processes in the communicator that opened the file, and all these processes are involved in collective operations, two-phase I/O can be applied on PVFS files.

Figure 19.8: Two-Phase Write Steps

Six hints can be used to control the application of two-phase I/O:

cb_buffer_size controls the size (in bytes) of the intermediate buffer used in two-phase collective I/O (both reads and writes). The default is 4194304 (4 Mbytes). If the amount of data that an aggregator will transfer exceeds this value, multiple iterations of the two-phase algorithm are used to move the data.
cb_nodes controls the maximum number of aggregators to be used. By default this is set to the number of unique hosts in the communicator used when opening the file.
romio_cb_read controls when collective buffering is applied to collective read operations. Valid values are enable, disable, and automatic. Default is automatic. When enabled, all collective reads will use collective buffering. When disabled, all collective reads will be serviced with individual operations by each process. When set to automatic, ROMIO will use heuristics to determine when to enable the optimization.
romio_cb_write controls when collective buffering is applied to collective write operations. Valid values are enable, disable, and automatic. Default is automatic. See the description of romio_cb_read for an explanation of the values.
romio_no_indep_rw indicates that no independent read or write operations will be performed. This can be used to limit the number of processes that open the file.
cb_config_list provides explicit control over aggregators, allowing for particular hosts to be used for I/O. See the ROMIO users guide for more information on the use of this hint.

ROMIO Data Placement Hints

Three hints may also be used to control file data placement. These are valid only at open time:

striping_factor controls the number of I/O servers to stripe across. The default is file system dependent, but for PVFS it is -1, indicating that the file should be striped across all I/O devices.
striping_unit controls the striping unit (in bytes). For PVFS the default will be the PVFS file system default strip size.
start_iodevice determines what I/O device data will first be written to. This is a number in the range of 0 ... striping_factor - 1.

ROMIO and PVFS List I/O

Two hints are available for controlling the use of list I/O in PVFS:

romio_pvfs_listio_read has valid values enable, disable, and automatic. The default is disable. This hint takes precedence over the romio_ds_read hint.
romio_pvfs_listio_write has valid values enable, disable, and automatic. The default is disable.

Clearly, a wide variety of parameters can be used to control the behavior of ROMIO and PVFS when used together. Because no single set of parameters works best for all applications, experimentation is often necessary to attain the best set of parameters. A study examining some of these parameters has been published [26]; this can serve as a starting point for your own tuning.
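
Hints are normally set in application code through MPI_Info. For quick experiments without recompiling, some ROMIO builds can also read hints from a file named by the ROMIO_HINTS environment variable, one name and value per line. Whether the ROMIO build in use supports this is an assumption that should be checked against its users guide. The sketch below writes such a file using hint names from this section; the function name and the chosen values are illustrative only.

```shell
# Write a ROMIO hints file with one "name value" pair per line.
# ASSUMPTION: the ROMIO build in use reads the file named by the ROMIO_HINTS
# environment variable; otherwise the same hints must be set with
# MPI_Info_set in the application. The function name and the values are
# illustrative only.
write_romio_hints() {
    cat > "$1" <<'EOF'
romio_cb_write enable
cb_buffer_size 4194304
striping_unit 262144
romio_pvfs_listio_read enable
EOF
}

hints_file=${TMPDIR:-/tmp}/romio-hints.$$
write_romio_hints "$hints_file"
export ROMIO_HINTS="$hints_file"
```

Keeping each candidate set of hints in its own file makes it easy to rerun the same benchmark under different settings and compare the results.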

19.4.5 Bugs

Users sometimes encounter bugs in PVFS. When they do, we generally guide them through a predictable set of steps to help us discover where the problem lies. This section outlines this process. The purpose is not to discourage users from reporting bugs or asking for help, but to streamline the process. If you have already tried these steps, we can skip a number of email exchanges and get right to the root of the problem!

Checking the List Archives

The very first thing to do is to check the PVFS mailing list archives. These are searchable online and available from the PVFS Web site [90]. Many problems have already been reported, so checking here might provide you with an immediate solution.

Reporting Versions and Logged Output

Bugs should always be reported to the PVFS users mailing list. This is an open list for discussion of many PVFS issues, one of them being bugs. By reporting to the mailing list you reach the maximum number of people that might be able to solve your problem, and you guarantee that an archive of the discussion will be saved.

We will always ask what version of the code you are running, especially if the problem that you report looks like something that has already been fixed. The distribution and kernel version you are using are helpful as well. If the problem is related to compiling, we'll ask for configure output and a log of the make process. If the problem is a runtime one, we'll ask for any information in the logs that might help. This includes dmesg output, the pvfsd log, the iod logs, and the mgr log. By default the three types of log files are all placed in /tmp, although this can be changed with configure-time options.

Providing this information in your first message is the easiest way to get the bug reporting and fixing process started.

Client Side or Server Side

The most common runtime bugs seen in PVFS at this time concern the Linux kernel module. One of the first things that we do in the case of a runtime problem is try to determine whether the problem is related to the servers themselves or to a particular client. We usually ask the user to look at the state of other clients in order to determine this. For example, one bug that we have seen prevented new files from showing up on certain clients. One client would see the new file while others did not. By looking at the state of multiple clients, the user was able to report this back and help us narrow down the problem.

Simplifying the Scenario

The simpler the set of conditions necessary to cause the problem on your system, the more likely we are to be able to replicate it on some system we have access to. Hacking out portions of a scientific code so that it performs only I/O or writing a script that uncovers a metadata incoherence problem really helps us see what is going on and replicate the problem on our end.