The MPICH implementation of MPI is one of the most widely used. Recently, MPICH was completely rewritten; the new version, called MPICH2, includes all of MPI, both MPI-1 and MPI-2. In this section we describe how to obtain, build, and install MPICH2 on a Beowulf cluster. We then describe how to set up an MPICH2 environment in which MPI programs can be compiled, executed, and debugged. We recommend MPICH2 for all Beowulf clusters. The original MPICH is still available but is no longer being developed.
The current version of MPICH2 is available at www.mcs.anl.gov/mpi/mpich. From there one can download a gzipped tar file containing the complete MPICH2 distribution, which contains
all source code for MPICH2;
configure scripts for building MPICH2 on a wide variety of environments, including Linux clusters;
simple example programs like the ones in this chapter;
MPI compliance test programs; and
the MPD parallel process management system.
MPICH2 is designed so that a number of communication infrastructures, called "devices," can be used. The device most relevant for the Beowulf environment is the channel device (also called "ch3" because it is the third version of the channel approach for implementing MPICH); it supports a variety of communication methods and can be built to use both TCP over sockets and shared memory. In addition, MPICH2 uses a portable interface to process management systems, providing access both to external process managers (allowing the process managers direct control over starting and running the MPI processes) and to the MPD scalable process manager included with MPICH2. To run your first MPI program, carry out the following steps (assuming a C-shell):
Download mpich2.tar.gz from www.mcs.anl.gov/mpi/mpich or from ftp://ftp.mcs.anl.gov/pub/mpi/mpich2.tar.gz
tar xvfz mpich2.tar.gz ; cd mpich2-1.0
configure <configure options> >& configure.log. Most users should specify a prefix for the installation path when configuring:
configure --prefix=/usr/local/mpich2-1.0 >& configure.log
By default, this creates the channel device for communication with TCP over sockets.
make >& make.log
make install >& install.log
Add the '<prefix>/bin' directory to your path; for example, for tcsh, do
setenv PATH <prefix>/bin:$PATH
rehash
Before running your first program, you must start the mpd process manager. To run on a single node, you need only do mpd -d &. See Section 8.7.3 for details on starting mpd on multiple nodes.
mpiexec -n 4 cpi (if '.' is not in your path, you will need to use mpiexec -n 4 ./cpi).
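For users of a Bourne-style shell such as bash, the same sequence of steps can be sketched as follows. The prefix path is the one from the example above and should be adjusted to your installation; this is a setup recipe, not something to run blindly.

```shell
# Build, install, and run under bash rather than csh.
# The prefix below is the example's; adjust as needed.
tar xvfz mpich2.tar.gz
cd mpich2-1.0
./configure --prefix=/usr/local/mpich2-1.0 > configure.log 2>&1
make > make.log 2>&1
make install > install.log 2>&1
export PATH=/usr/local/mpich2-1.0/bin:$PATH   # bash equivalent of setenv; 'hash -r' replaces rehash
mpd -d &
mpiexec -n 4 ./cpi
```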
To build MPICH2 to support SMP clusters and to use shared memory to communicate data between processes on the same node, configure MPICH2 with the additional option --with-device=ch3:ssm, as in
configure --with-device=ch3:ssm --prefix=/usr/local/mpich2-1.0
In a system that contains both SMP nodes and uniprocessor nodes, or if you want an executable that can run on both kinds of nodes, use this version of the ch3 device.
Running MPI programs with the MPD process manager assumes that the mpd daemon is running on each machine in your cluster. In this section we describe how to start and manage these daemons. The mpd and related executables are built when you build and install MPICH2 with the default process manager. The code for the MPD daemons is found in '<prefix-directory>/bin', which you should ensure is in your path. A set of MPD daemons can be started with the command
mpichboot <file> <num>
where file is the name of a file containing the host names of your cluster and num is the number of daemons you want to start. The startup script uses ssh to start the daemons, but if it is more convenient, they can be started in other ways: the first daemon can be started with mpd -t, which causes it to print the port on which it listens for new mpds to connect; each subsequent mpd is then given that host and port to connect to. The mpichboot script automates this process. At any time you can see which mpds are running by using mpdtrace.
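For example, assuming a cluster whose nodes are named node0 through node3 (hypothetical names) and assuming the host file lists one host name per line, the ring can be started and inspected as follows:

```shell
# Create a host file (one host name per line; names are hypothetical)
cat > hosts <<EOF
node0
node1
node2
node3
EOF
mpichboot hosts 4    # start four daemons, one per listed host
mpdtrace             # each mpd reports itself and its neighbors
```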
An mpd is identified by its host and a port. A number of commands are used to manage the ring of mpds:
mpdhelp prints a short description of the available mpd commands.
mpdcleanup cleans up mpd if a problem occurred. For example, it can repair the local Unix socket that is used to communicate with the MPD system if the MPD ring crashed.
mpdtrace causes each mpd in the ring to respond with a message identifying itself and its neighbors.
mpdallexit causes all mpds to exit gracefully.
mpdlistjobs lists the active jobs managed by the mpds in the ring for the current user. With the command-line option -a or --all, it lists the jobs of all users.
mpdkilljob job_id kills all of the processes of the specified job.
mpdsigjob sigtype job_id delivers the specified signal to the specified job. Signals are specified using the name of the signal, e.g., SIGSTOP.
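A typical session using these commands to deal with a runaway job might look like the following sketch; <job_id> stands for whatever identifier mpdlistjobs reports.

```shell
mpdlistjobs                  # list active jobs and find the job id
mpdsigjob SIGSTOP <job_id>   # pause the job's processes while investigating
mpdkilljob <job_id>          # kill all of the job's processes
mpdallexit                   # when finished, shut down the whole ring
```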
Several options control the behavior of the daemons, allowing them to be run either by individual users or by root without conflicts. The most important is
-d background or "daemonize"; this is used to start an mpd daemon that will run without being connected to a terminal session.
MPICH2 jobs are run under the MPD process manager by using the mpiexec command. MPD's mpiexec is consistent with the specification in the MPI standard and also offers a few extensions, such as passing of environment variables to individual MPI processes. An example of the simplest way to run an MPI program is
mpiexec -n 32 cpi
which runs the MPI program cpi with 32 processes and lets the MPD process manager choose which hosts to run the processes on. Specific hosts and separate executables can be specified:
mpiexec -n 1 -host node0 manager : -n 1 -host node1 worker
A configuration file can be used when a command line in the above format would be too long:
mpiexec -configfile multiblast.cfg
where the file 'multiblast.cfg' contains
-n 1 -host node0 blastmanager
-n 1 -host node1 blastworker
...
-n 1 -host node31 blastworker
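Config files like this are easily generated with a short script. The sketch below writes the 'multiblast.cfg' of the example, assuming the same hypothetical host names node0 through node31:

```shell
# Generate multiblast.cfg: one manager on node0, workers on node1..node31.
cfg=multiblast.cfg
echo "-n 1 -host node0 blastmanager" > "$cfg"
for i in $(seq 1 31); do
    echo "-n 1 -host node$i blastworker" >> "$cfg"
done
wc -l "$cfg"    # one line per process: 32 in all
```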
One can use
to discover all the possible command-line arguments for mpiexec.
The program mpiexec runs as a separate (non-MPI) process that starts the MPI processes running the specified executable. It serves as a single-process representative of the parallel MPI processes in that signals sent to it, such as ^Z and ^C, are conveyed by the MPD system to all the processes. The output streams stdout and stderr from the MPI processes are routed back to the stdout and stderr of mpiexec. As in most MPI implementations, mpiexec's stdin is routed to the stdin of the MPI process with rank 0.
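The single-process-representative pattern can be sketched in the shell: a launcher starts several worker processes, arranges for a signal sent to it to be relayed to all of them, and exits only when they have all finished. This is only an illustration of the pattern; MPD implements it across the daemon ring, not with a local loop.

```shell
pids=""
for rank in 0 1 2 3; do
    sh -c "echo \"rank $rank: hello\"" &   # stand-in for one MPI process
    pids="$pids $!"
done
trap 'kill $pids 2>/dev/null' INT TERM     # relay ^C etc. to every worker
wait $pids                                 # return only when all workers exit
```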
Debugging parallel programs is notoriously difficult. Parallel programs are subject not only to the usual kinds of bugs but also to new kinds having to do with timing and synchronization errors. Often, the program "hangs," for example when a process is waiting for a message to arrive that is never sent or is sent with the wrong tag. Parallel bugs often disappear precisely when you add code to try to identify the bug, a particularly frustrating situation. In this section we discuss three approaches to parallel debugging.
Just as in sequential debugging, you often wish to trace interesting events in the program by printing trace messages. Usually you wish to identify a message by the rank of the process emitting it. This can be done explicitly by putting the rank in the trace message. As noted above, using the "line labels" option (-l) with mpirun in the ch_p4mpd device in MPICH adds the rank automatically.
The TotalView debugger from Etnus, Ltd. runs on a variety of platforms and interacts with many vendor implementations of MPI, including MPICH on Linux clusters. For the ch_p4 device you invoke TotalView with
mpirun -tv <other arguments>
and with the ch_p4mpd device you use
totalview mpirun <other arguments>
That is, again mpirun represents the parallel job as a whole. TotalView has special commands to display the message queues of an MPI process. It is possible to attach TotalView to a collection of processes that are already running in parallel; it is also possible to attach to just one of those processes.
Check the documentation on how to use TotalView with mpiexec in MPICH2, or with other implementations of MPI.
MPI implementations are usually configured and built by using a particular set of compilers. For example, the configure script in the MPICH implementation determines many of the characteristics of the compiler and the associated runtime libraries. As a result, it can be difficult to use a different C or Fortran compiler with a particular MPI implementation. This can be a problem for Beowulf clusters because several different compilers are commonly used.
The compilation scripts (e.g., mpicc) accept an argument to select a different compiler. For example, if MPICH is configured with gcc but you want to use pgcc to compile and build an MPI program, you can use
mpicc -cc=pgcc -o hellow hellow.c
mpif77 -fc=pgf77 -o hellowf hellowf.f
This works as long as both compilers have similar capabilities and properties. For example, they must use the same lengths for the basic datatypes, and their runtime libraries must provide the functions that the MPI implementation requires. If the compilers are similar in nature but require slightly different libraries or compiler options, then a configuration file can be provided with the -config=name option:
mpicc -config=pgcc -o hellow hellow.c
Details on the format of the configuration files can be found in the MPICH installation manual.
The same approach works for Fortran as well as for C. If, however, the Fortran compilers are not compatible (for example, if they use different values for the Fortran .true. and .false.), then you must build new libraries. MPICH2 provides a way to build just the necessary Fortran support. See the MPICH2 installation manual for details.
As this chapter is being written, the current version of MPICH2 is 0.93, and the current version of MPICH is 1.2.5.