18.2 Using Scyld Beowulf

To the typical user of a Scyld Beowulf system, there is little difference between a Scyld and a non-Scyld system. Programs are written with MPI or PVM libraries and executed interactively or submitted through PBS or some other scheduler. Some small details differ, such as how one selects a set of nodes on which to execute a program or how one views a process running on a node; for the most part, these tasks are easier under Scyld. For a system administrator, however, Scyld Beowulf offers a number of features that greatly simplify the configuration and management of a cluster, particularly a large cluster.

The ability to boot a machine over a network connection has been available for a long time, and many cluster toolkits offer mechanisms to simplify cluster installation based on protocols such as bootp and dhcp. Scyld Beowulf also offers a simplified installation procedure for nodes in the form of a boot image that can be installed on a CD or floppy disk. Unlike other systems, however, Scyld extends this concept to simplify configuration and management not just at installation time, but every time the cluster boots.

Booting a compute node in a Scyld Beowulf system starts when the node first powers up. A fully installed node has a boot image installed on its default boot device. This image is responsible for bringing up a minimal running kernel, including network services, through which the node can broadcast its readiness to boot to the beoserv daemon running on the master node. This boot image is extremely generic and is used only during this first phase of booting, so it rarely needs to be changed, even if the cluster's compute nodes are completely reconfigured. In fact, the image needs to be updated only if the Scyld beoboot system itself requires an update, which normally would not happen even when the Scyld software is upgraded; only fairly extreme circumstances requiring extensive changes to the beoboot system would call for a new boot loader. This is a critical point, because updating an on-disk boot loader can be a cumbersome task. Even when it is required, the Scyld beoboot system can usually install on-disk boot loaders onto the nodes' disks automatically from the master node, making the process much simpler than even a per-node network install.

Once the phase 1 boot loader has contacted the beoserv daemon on the master node, the daemon sends boot instructions to the node. If the node has been booted before and has a local disk partition with a kernel installed, the daemon can simply instruct the node to boot from that local image. In many cases the node will not have a local boot image. This might be because a decision has been made not to use the local disk to hold the operating system, or because we want to boot a new configuration that the node does not yet have installed, such as a kernel upgrade, a distribution upgrade, or a library upgrade. Whatever the reason, the new boot image is transferred over the network to the booting node, where it can either be loaded into a RAM disk or written to an available partition on the local disk so that future boots can be performed from local disk. Whichever the source, when the beoserv daemon instructs the node to boot a new image, an interesting process takes place: a small piece of Scyld code known as "Two-Kernel Monte" loads the new kernel and tricks the old kernel into giving up control to the new image. This bypasses the normal boot process but effectively switches kernels to the new image.

The beoserv daemon need not instruct a booting node to boot at all. The daemon maintains a database of nodes, their corresponding MAC addresses, and their current disposition in the system. A node that has never been booted before can be placed in a holding pattern to wait for the system administrator to choose to boot it. Even nodes that have been booted previously can be kept in the "down" state until the system administrator decides to bring them "up." In addition, when selecting a node to boot, the administrator can specify whether to boot a local disk image or a server image, the node's logical node number, and other configuration information.

Once a node has its boot instructions, it loads and initializes its kernel the same way any other Linux computer does. The primary difference at this point lies in the system startup scripts that are part of a Scyld Beowulf image. Put simply, a Scyld Beowulf compute node starts no daemons or services beyond those needed to bring the node to a running state, other than the bproc slave daemon. The bproc slave daemon contacts the bproc master daemon running on the master and enrolls the compute node as a slave to the master. Once this is completed, the master can initiate processes on the node, and all processes, including any service daemons, are started from the master by the beoboot system.

18.2.1 Programming and Debugging

Writing programs for a Scyld Beowulf is generally very straightforward. Most users of a distributed-memory parallel computer will use a message-passing library, such as MPI or PVM, to write their programs. Both MPICH (a popular implementation of MPI) and PVM have configurations for use with Scyld. For the most part, these packages only have to be configured to use bpsh rather than rsh or its equivalent for starting processes on the nodes. The only other issue comes in naming nodes. Under Scyld, nodes are named .-1 (the master node), .0, .1, .2, and so on, whereas under other Beowulf installations nodes may be named almost anything. Other than this, the process of creating a socket and establishing a connection is exactly the same as on any other Linux platform. MPICH and PVM are already configured with these issues in mind and can be used directly for parallel programming.
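
To make this concrete, here is a minimal sketch of such a program. Nothing in it is Scyld-specific; the source code is ordinary MPI C, and only the way the job is launched differs on a Scyld cluster.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this task's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of tasks */

        printf("task %d of %d reporting\n", rank, size);

        MPI_Finalize();
        return 0;
    }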

Running Parallel Programs

When running MPICH under Scyld, there are a number of features to be aware of. The first is that most of the controls for launching a job are set with environment variables. Thus, one can launch an MPI job without the traditional mpirun command just by setting the appropriate environment variables; for example, setting NP to 4 and then executing the program directly starts four tasks. The variable NP specifies the number of tasks to start. The variable NO_LOCAL specifies that no tasks should run on the master (the default is to run the first task on the master and the rest on the nodes). Table 18.1 lists the environment variables used when starting MPI jobs under Scyld.

Table 18.1: Environment variables used when starting MPI jobs.

    Environment Variable    mpirun flag    Description
    NP                      -np            number of tasks to run
    NO_LOCAL                -nolocal       no tasks on master
    ALL_LOCAL               -all_local     all tasks on master
    ALL_CPUS                -all_cpus      task on every CPU in the system
    EXCLUDE                 -exclude       exclude nodes
    BEOWULF_JOB_MAP         -map           specify node to use for each task

Otherwise, writing MPI programs for Scyld is about the same as for any other Beowulf system. On the other hand, things can be a little different when writing more traditional programs or doing things out of the ordinary within an MPI program. For example, Scyld does not keep copies of shared libraries on the nodes' disks unless this is specifically done as part of configuration (see the section below on administration of Scyld systems). When a program is launched, all shared libraries referenced by the program and not already loaded on the target node are loaded. On the other hand, shared libraries used by a program but not referenced directly in the calling code (that is, loaded at run time via a call to dlopen() with the path to the library) cannot be loaded by Scyld, and such a call will fail unless the referenced library is already installed on the nodes.
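
As an illustration (the library name and path here are hypothetical), a program like the following runs on the master but fails on a compute node unless the library has already been installed on the node's local file system:

    #include <stdio.h>
    #include <dlfcn.h>

    int main(void)
    {
        /* Libraries referenced at link time are copied to the node by
           Scyld when the program is launched; a library opened by path
           at run time is not. */
        void *handle = dlopen("/usr/lib/libexample.so", RTLD_NOW);
        if (handle == NULL) {
            /* Fails on a compute node unless libexample.so was
               installed there beforehand. */
            fprintf(stderr, "dlopen failed: %s\n", dlerror());
            return 1;
        }
        dlclose(handle);
        return 0;
    }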

Scyld Libraries

In addition to libraries such as MPI, programmers can take advantage of APIs provided specifically for Scyld systems. Table 18.2 outlines the libraries provided by Scyld. Of these, the bproc, perf, and beostat libraries are the most likely to be of interest to some programmers.

Table 18.2: Scyld libraries.

    libbeostat     Library for returning compute node status info
    libbeomap      Library for finding available (unloaded) nodes
    libbpsh        Library for bproc shell-like functions
    libbproc       Library for access to Bproc API
    libbpslave     Library for compute nodes to receive Bproc requests
    perf           Library for access to Pentium hardware performance counters

The bproc libraries provide access to routines for starting new processes under bproc and for moving existing processes. The perf libraries provide access to performance counters in the Pentium hardware. The beostat routines are for gathering node status information. These facilities are most likely to be useful to systems programmers rather than applications programmers.
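
As a rough sketch of how the bproc interface is used (the header name and the calls shown, bproc_rfork() and bproc_currnode(), follow the commonly documented Bproc API, but exact names and signatures may vary by version), the following program forks a copy of itself onto a compute node:

    #include <stdio.h>
    #include <sys/wait.h>
    #include <sys/bproc.h>      /* Bproc API header (assumed location) */

    int main(void)
    {
        int node = 0;           /* target compute node number */
        int pid;

        /* bproc_rfork() behaves like fork(), except that the child
           process is migrated to the given node. */
        pid = bproc_rfork(node);
        if (pid < 0) {
            perror("bproc_rfork");
            return 1;
        }
        if (pid == 0) {
            /* Child: now running on the compute node. */
            printf("child running on node %d\n", bproc_currnode());
            return 0;
        }
        /* Parent: still on the master; wait for the remote child. */
        waitpid(pid, NULL, 0);
        return 0;
    }

Such a program would be linked against libbproc (for example, with -lbproc).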

Other libraries, such as mathematical packages like LAPACK or I/O packages such as HDF, are largely independent of Scyld, unless they have the library-loading issues described above.

Debugging

Debugging parallel programs can be a complex task. Debugging programs with Scyld Beowulf can in many cases be easier than on a standard Beowulf system because of bproc process management. Typical debugging techniques for MPI programs involve the use of MPE to generate a log file and tools such as Jumpshot [126] to help in analyzing the data. Using these tools on Scyld Beowulf is no different than on any other platform. Similarly, the use of print statements in the program code is relatively straightforward because of the structure of most MPI implementations.
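
As a brief sketch of what MPE instrumentation looks like (the calls shown are those provided by the MPE library distributed with MPICH; consult the MPE documentation for the exact linking options on your installation), a pair of user-defined events can be logged around a region of interest and later viewed with Jumpshot:

    #include <mpi.h>
    #include <mpe.h>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);
        MPE_Init_log();

        /* Define a logging state bracketed by event numbers 1 and 2. */
        MPE_Describe_state(1, 2, "compute", "red");

        MPE_Log_event(1, 0, "start compute");
        /* ... the computation being analyzed goes here ... */
        MPE_Log_event(2, 0, "end compute");

        MPE_Finish_log("mylog");    /* writes the log file for Jumpshot */
        MPI_Finalize();
        return 0;
    }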

On the other hand, sometimes these tools are not as effective as we might like. Other tools such as strace, ltrace, and gdb are standard for debugging sequential programs but are often difficult to use on a parallel program because the processes are not local; they are distributed among many machines. On a standard Beowulf, the usual approach is to run all of the processes locally so that these tools can be used; on Scyld Beowulf, these tools can be used even when the processes are remote. As an example, gdb can attach to a running process, and once attached it can set breakpoints, examine and change memory, trace references, and do a number of other useful things. On a standard Beowulf, this involves logging in to the remote node the process is running on and then running gdb there. On Scyld Beowulf, we merely need to identify the process id and attach to it just like any other process.

Overall, debugging is not significantly different under Scyld Beowulf, but in some cases is a little easier and a little more flexible. For more details on these debugging tools see their respective documentation and Section 8.8.

File I/O

A critical issue in any program is I/O. Most programs read at least some data from a file and output results to a file. The I/O may be quite minimal, or it may be hundreds of megabytes. File I/O in any Beowulf system is an issue because there are several distinct ways it can be configured, and the alternatives have very different performance characteristics depending on how you use them. There are three major options: local disks on the nodes, the Network File System (NFS), or a parallel file system.

Local disks are easy to use, assuming your cluster has them. Each task in your program can simply open a file and read and write data from and to that file. The difficulty comes in coordinating the files before and after running your program. If all of your tasks need to read the same data, then you simply copy the file to all of the disks before the program runs. If they all need to read different parts of a single file, then you must divide the file accordingly and copy the correct part to each node. Similarly, after the program runs there may be many output files that need to be reassembled. Sometimes this is a complex enough task that it is not worth the trouble. If your program requires that one task write a file that is subsequently read by another task, this will not work on local disks: the reading task will see only what was written on its own node. If your program runs more than one task on a node, then you must be careful in naming files to prevent conflicts. C library routines such as mktemp() make this fairly easy to do.
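
A minimal sketch of one way to avoid such filename conflicts is shown below. It uses mkstemp(), the safer variant of the mktemp() family, together with the task's MPI rank; the path and naming pattern are arbitrary choices for illustration.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, fd;
        char path[64];
        FILE *out;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Embed the task rank in the name, then let mkstemp() make the
           rest unique on this node's local disk. */
        snprintf(path, sizeof(path), "/tmp/output.rank%d.XXXXXX", rank);
        fd = mkstemp(path);
        if (fd < 0) {
            perror("mkstemp");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        out = fdopen(fd, "w");
        fprintf(out, "results from task %d\n", rank);
        fclose(out);

        MPI_Finalize();
        return 0;
    }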

Another alternative is the use of NFS [109]. With NFS, the disk or disks on one machine can be accessed from all of the nodes. This eliminates the need to distribute and gather files before and after execution and allows the tasks to read and write portions of the same file simultaneously. The master node can act as an NFS server, which works well for small clusters, or a dedicated server might be set up. Such a machine can act as a slave, but it is usually configured as a full server just like the master, without acting as the bproc master; this prevents NFS traffic from bogging down bproc traffic to and from the master. The downside of NFS is performance: the NFS server becomes a bottleneck. Experience has shown that once a cluster reaches a size of 100 nodes or more, the potential for severe performance problems exists. Even for much smaller clusters, if a large amount of data is read or written during the execution of a job, NFS can become the limiting factor in performance; in other words, the processors have to wait on the I/O. Other issues of concern include caching and other matters related to the semantics implemented by NFS. In general, NFS was not designed to act as a parallel file system, and thus in many cases it does not behave as one would expect.

The last option for file I/O is the use of a parallel file system. Parallel file systems allow program tasks to interact with a shared file just as NFS does, but they do so by distributing the data among many servers and managing the I/O traffic across the network so as to reduce or eliminate bottlenecks. An example of a parallel file system is PVFS [22], detailed in Chapter 19. Not only do parallel file systems provide high-performance access to shared files, they also tend to offer interfaces better suited to parallel processing. As previously mentioned, there are issues of caching and other semantics that make a parallel file system better suited to parallel computing. The MPI specification includes a parallel I/O standard called MPI-IO, as detailed in Chapter 9. As discussed in Chapter 19, one implementation of MPI-IO known as ROMIO [115] works with PVFS and the MPICH implementation of MPI.
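
As a brief sketch of the MPI-IO interface (standard MPI-2 calls; the file name and record format are arbitrary), each task can write its own portion of a single shared file at a rank-dependent offset:

    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank;
        char buf[32];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Fixed-length record so offsets are easy to compute. */
        snprintf(buf, sizeof(buf), "data from task %6d\n", rank);

        /* All tasks open the same shared file collectively. */
        MPI_File_open(MPI_COMM_WORLD, "shared.out",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        /* Each task writes at an offset determined by its rank, so the
           writes do not overlap. */
        MPI_File_write_at(fh, (MPI_Offset)rank * strlen(buf),
                          buf, strlen(buf), MPI_CHAR, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }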

On the downside, some parallel file systems are so tuned for high-performance use by parallel programs that they are not particularly well suited to the common, everyday file system use that NFS handles well. In the end, it is usually best to provide all three forms of storage and let each application make use of the facilities as best it can. Users' home file systems and small configuration files work well on NFS, large data files work well on parallel file systems, and local disks are still useful for certain types of logging and other applications.



