One of the primary advantages of the single system image approach is simple system administration. Most administrative work, from simple jobs like adding users to more complex tasks like adding network drivers or kernel modules, can be done simply by manipulating the image, rather than by manually configuring each node. Experience has shown that administration of traditional Beowulf systems can be as labor intensive as managing a comparable number of workstations, with as much as one full-time administrator required per 128 nodes. While nothing in the Scyld approach makes maintaining the cluster hardware any less labor intensive, managing the OS is substantially simpler. Earlier chapters of this book, such as Chapters 5, 6, and 13, discuss the complexities of managing traditional clusters in detail. This section contrasts the Scyld approach to some common administrative tasks with the more traditional methods, to provide some insight into the process of administering Scyld clusters.
Most administration tasks can be performed using Scyld's beosetup program, which provides a GUI for all common configuration tasks. The system can also be configured and administered using command line programs and by editing the relevant configuration files with a text editor. Table 18.3 lists the major configuration files. The sections below describe common administration tasks, along with the configuration files and tools involved. System administrators new to Scyld should probably use beosetup rather than the manual approach until they are familiar with a running system.
Table 18.3: Scyld Beowulf configuration files and directories

/etc/Beowulf         | Directory with Scyld Beowulf configuration files
/etc/Beowulf/config  | Main configuration file
/etc/Beowulf/fdisk   | Default disk partitioning for nodes
/etc/Beowulf/fdisk.1 | Disk partitioning for node 1
/etc/Beowulf/fstab   | Default fstab for nodes
/etc/Beowulf/fstab.1 | Fstab for node 1
/var/Beowulf         | Node boot images
/var/log/Beowulf     | Node logging
/usr/lib/Beowulf     | Scripts and programs
While some functions of administering the cluster use the same configuration systems as normal Linux machines (such as user accounts and groups), the Beowulf-specific functions require additions to the "normal" set of Linux configuration files. Scyld encapsulates the additional configuration information into a small set of files, consistent with the Linux/UNIX administrative philosophy. The '/etc/beowulf' directory contains information about cluster configuration and node management. Node boot images and related information are kept in '/var/beowulf'. Node logging information is in the directory '/var/log/beowulf', and scripts and programs used in booting are in '/usr/lib/beoboot'.
The sections below discuss how some of the normal tasks of a cluster administrator are performed using the Scyld Beowulf OS. These tasks are broadly grouped into four categories: managing nodes, system maintenance tasks, failure detection and recovery, and finally node allocation and scheduling.
A fairly frequent task for a Beowulf system administrator is the addition, deletion, or customization of compute nodes. Like most tasks on a Scyld Beowulf, all of these are handled from the head node.
When a new node is added to the cluster, the phase 1 boot image must be booted on that node. This is the same procedure discussed previously in the installation section, and can be done via floppy, CD, or PXE boot. When the node boots, it makes a RARP request to the head node. When the head node sees this request, it examines the MAC address of the requesting node, then consults its configuration file, '/etc/Beowulf/config', to determine what to do.
If the MAC address of the node does not appear in the configuration file, the request is ignored, and the address is added to the file '/var/Beowulf/unknown_addresses'. If the MAC address of the requesting node does appear in the configuration file, there are two possibilities. If the address is labeled in the configuration file as "ignore", the request will simply be ignored. Otherwise, the head node will respond to the compute node's RARP request, and assign it a node number corresponding to the label or position in the configuration file. Nodes can be removed from the cluster by simply marking the corresponding line in the configuration file as "ignore".
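As an illustration, the relevant portion of '/etc/Beowulf/config' is essentially an ordered list of MAC addresses. The following excerpt is only a sketch: the addresses are invented, and the exact keyword placement for ignored addresses may differ between Scyld releases, so consult the comments in the file itself.

    # Hypothetical node list in /etc/Beowulf/config (addresses invented)
    node   00:50:8B:D2:39:1A     # first entry: assigned node 0
    node   00:50:8B:D2:4C:77     # second entry: assigned node 1
    ignore 00:50:8B:D2:51:03     # requests from this address are ignored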
This behavior can be modified if the beosetup GUI is used when nodes are being added. This GUI includes an option to auto-activate new nodes that appear as unknown addresses. This option is particularly useful when adding large numbers of new nodes to the cluster. beosetup also allows you to drag and drop nodes between the unknown and active lists, reorder the node list, and perform many other node setup features as mentioned in the installation section.
Among the components of a Beowulf cluster, the disk drives are some of the most susceptible to failure. Due to the short product life cycles of commodity hard drives, it is only a matter of time before a Beowulf cluster is using several different types and sizes of disk drives in its compute nodes. The single system image concept provided by Scyld and the Bproc system makes it possible to deal with frequent rebuilds of node file systems. However, it is important that the system be flexible in dealing with the different disk drives on which the image is to be stored.
The Scyld OS deals with this issue by keeping partition tables for each type of disk in the cluster in the '/etc/Beowulf' directory on the head node. These partition tables are indexed by the geometry and device number of the disk to which they apply. This allows the head node to automatically determine the appropriate partitioning for a given disk drive at boot time. To add a new type of disk drive to a cluster, the administrator can either manually add the new configuration or take it from a running node. When a new type of disk is installed on a node, the node can initially be booted using a RAM disk. The node's new disk can be partitioned using the standard fdisk command via bpsh, and the resulting partition table can then be read in by the head node in the appropriate format via the beofdisk command. beofdisk can also be used to propagate this partition table to every other node in the cluster with the same hardware, eliminating the need to manually partition each disk.
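A hedged sketch of this workflow, run from the head node, might look like the following; the beofdisk options are assumptions for illustration, so check the beofdisk man page for the exact flags on your release.

    # Partition the new drive on node 4 by hand, then capture and propagate
    # the resulting partition table (beofdisk flags shown are assumed)
    bpsh 4 fdisk /dev/hda       # run fdisk on node 4's new disk
    beofdisk -q                 # (assumed) query partition tables from the nodes
    beofdisk -w                 # (assumed) write the table to nodes with matching hardware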
The default setting in Scyld Beowulf is for each node to use a RAM disk for its root file system, and to use NFS to mount the '/home' file system from the head. However, it is a simple process for the administrator to customize compute nodes to make use of local disks, or to access any number of network or parallel file systems, either from the head or from another accessible server.
In traditional Linux systems, the file systems a node mounts are determined by the '/etc/fstab' file. In the Scyld OS, compute node 'fstab' files are kept in the '/etc/Beowulf' directory. A single file may be used to control all nodes, or, if node configurations differ, an fstab file may be created for each node. Ideally, a single fstab would suffice for an entire cluster (and frequently it does), but certain nodes may have additional disks to provide extra swap or temporary space, or to serve as I/O servers for a parallel file system. The list of network file systems available to a node may also differ, for example to allow only certain nodes access to sensitive data.
The syntax for the 'fstab' files is identical to normal Linux syntax, and allows the use of RAM disks, local disks, NFS file systems, or parallel file systems such as PVFS as described in Chapter 19.
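For example, a per-node file such as '/etc/Beowulf/fstab.1' for a node with a local disk might look like the sketch below; the device names, mount points, and the 'master' hostname standing in for the head node are placeholders, not values any particular cluster will use.

    # Illustrative /etc/Beowulf/fstab.1 (devices and hostname are placeholders)
    /dev/ram0       /          ext2    defaults    0 0
    /dev/hda2       swap       swap    defaults    0 0
    /dev/hda3       /scratch   ext2    defaults    0 0
    master:/home    /home      nfs     defaults    0 0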
The Scyld OS also provides a number of options for when node file systems should be rebuilt. To maintain a single system image, some users opt to have all local file systems on a node rebuilt each time the node is booted. This option is particularly useful when adding new nodes to the cluster. Others use the local file systems for permanent storage and never wish to rebuild them. Still others may choose, for performance reasons, to rebuild node file systems only when checks on the file system fail, indicating errors. Scyld supports all of these options, and the policy can be changed at any time through the beosetup GUI or by editing the '/etc/Beowulf/config' file and sending HUP signals to the associated beoboot and bproc daemons.
The Bproc system provided with Scyld allows jobs to be migrated quickly to the nodes by not migrating shared library code along with the process, but rather remapping these libraries within the process after it is migrated. To achieve high performance with this technique, nodes must keep a cache of the shared libraries. Administrators can easily change the list of libraries cached on the nodes to achieve good performance on any application. The '/etc/Beowulf/config' file contains a keyword libraries, after which individual libraries or whole directories of libraries can be listed. All libraries listed on this line are cached on the compute nodes when they boot.
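A hypothetical libraries entry in '/etc/Beowulf/config' might look like the following; the directory and library names are examples only, chosen to show that both whole directories and individual libraries can appear on the line.

    # Hypothetical library cache list in /etc/Beowulf/config
    libraries /lib /usr/lib /usr/local/atlas/lib/libatlas.so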
Another group of important tasks involves the overall maintenance of the system, such as controlling the state of the nodes, the boot image and kernel run by the nodes, and account management.
Compute nodes can be in any of a number of states, including up, down, unavailable, boot, reboot, error, and pwroff. As a node powers up, it moves from the down state to the boot state and, if all goes well, eventually to the up state. The state of a node can also be controlled by the administrator via the bpctl command. This command allows the administrator to set the state (among other things) of all nodes, individual nodes, or ranges of nodes. Bpctl can be used to reboot nodes, shut them down, tag them as unavailable to users, or mark them as back up.
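A few illustrative bpctl invocations are shown below; the option letters are assumptions and should be verified against the bpctl man page on your release.

    bpctl -S all -s reboot        # reboot every compute node
    bpctl -S 5 -s unavailable     # take node 5 away from users
    bpctl -S 5 -s up              # mark node 5 as available again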
Periodically, as updates become available or new drivers are added, the administrator may want to change either the phase 1 or phase 2 boot images that are given to the slave nodes. Both images can be recreated through the beosetup GUI or via the beoboot command.
The phase 1 image rarely needs to be changed. It consists simply of a small RAM disk image and a minimal kernel, and is designed to fit on a floppy disk or in a 2 megabyte partition at the start of a hard drive. The RAM disk and kernel can also be generated separately for use with a PXE boot server.
The phase 2 image contains the runtime kernel, and may need to be updated more frequently. This image is created in a format suitable for download by a phase 1 image. When the image is created, the head node must be running the same version of the kernel as the one to be placed in the phase 2 image.
Periodically, administrators may wish to update kernels on their cluster to take advantage of bug fixes, new features, etc. More frequently, an administrator may wish to add a device driver or new module to the existing kernel, and propagate this change to the slaves. The kernel used in the Scyld system is not quite the standard Linux kernel, so the recommended procedure is to download source for the new kernel from Scyld. If you wish to use a kernel version that is not available from Scyld, you should have some expertise in hacking Linux kernels, and be prepared to add in a number of additional modules for beoboot, bproc, PVFS, etc.
Adding drivers to kernels is a fairly simple task in a Scyld cluster. Most drivers are added via a dynamically loadable module, so recompiling the full kernel is not necessary. In order to add a driver, you will need to compile the module twice, once with options for the kernel on the head node, and a second time with the options for the beoboot kernel on the nodes. The correct options are shown in Table 18.4.
Table 18.4: Compiler options for building kernel modules

For uniprocessor kernel on head:     | -D__BOOT_KERNEL_SMP=0 -D__BOOT_KERNEL_UP=1
For uniprocessor kernel for BeoBoot: | -D__BOOT_KERNEL_SMP=0 -D__BOOT_KERNEL_UP=1 -D__module__beoboot
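As a rough sketch, compiling a hypothetical driver source file mydriver.c twice with these options might look like the following; the include path, optimization flags, and file names are placeholders, while the -D options come from Table 18.4.

    # Build once for the head node's uniprocessor kernel ...
    gcc -c -O2 -DMODULE -D__KERNEL__ \
        -D__BOOT_KERNEL_SMP=0 -D__BOOT_KERNEL_UP=1 \
        -I/usr/src/linux/include mydriver.c -o mydriver.o
    # ... and again for the BeoBoot kernel run on the compute nodes
    gcc -c -O2 -DMODULE -D__KERNEL__ \
        -D__BOOT_KERNEL_SMP=0 -D__BOOT_KERNEL_UP=1 -D__module__beoboot \
        -I/usr/src/linux/include mydriver.c -o mydriver_beoboot.o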
Each compiled version will need to be installed in '/lib/modules', in the appropriate directory for each kernel (the kernel for the head has an _Scyld suffix after the version number, while the kernel for the compute nodes has an _Scyldbeoboot suffix). Once the modules are created, you will need to update your beoboot images. If the module is a critical one, such as the driver for your compute nodes' primary network interface, you may need to update both the phase 1 and phase 2 kernel images.
If the new module is to be included in the phase 1 image, the '/etc/Beowulf/config' file must be edited to include the module in the module list and to determine how it is loaded. The bootmodule lines in the configuration file list all of the modules to be included in the phase 1 image. Adding new modules may require deleting some old ones if the phase 1 image must still fit on a floppy disk. If you wish the module to always be loaded, you must also add a modprobe line to the config file. If you wish it to be loaded only when the corresponding hardware is detected, the system's PCI table must be edited. Finally, new beoboot images containing the new kernel modules can be created using the beoboot command.
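A hypothetical pair of additions to '/etc/Beowulf/config' for a new network driver might look like the lines below; 'eepro100' is only an example module name.

    # Example config additions for a new network driver module
    bootmodule eepro100      # include the module in the phase 1 image
    modprobe eepro100        # load the module unconditionally at node boot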
System administrators may wish to perform site-specific customizations of the compute nodes when they boot, such as starting additional daemons or copying extra files to the nodes. At the end of the node boot cycle, each node runs a script called node_up. During its execution, this script looks in the directory '/etc/beowulf/init.d' and executes any scripts it finds there. This is where administrators can add any additional site-specific commands to be run. Any script run from this directory will have the additional environment variable $NODE defined, which contains the node number of the node on which the script is being executed. This makes it possible to have the script act only on certain nodes, or act differently on each node if desired.
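The sketch below shows what such a site-specific script might look like; the file name, daemon path, and node numbers are invented for illustration.

    #!/bin/sh
    # Hypothetical /etc/beowulf/init.d/99sitelocal script
    # ($NODE is set by node_up to the number of the node being brought up)
    # Give every node an extra scratch directory
    mkdir -p /scratch/tmp
    # In this invented layout, node 0 also runs an extra monitoring daemon
    if [ "$NODE" -eq 0 ]; then
        /usr/local/sbin/example-monitor-daemon &
    fi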
Managing user accounts on a Scyld system is just as easy as managing user accounts on a single workstation. All account management is done from the head node, using the normal Linux tools, for instance the adduser script or the passwd command, or manual editing of the '/etc/passwd' file. Compute nodes see exactly the set of user IDs and permissions that are available on the head, and need no passwords.
This removes a number of authentication problems that exist in traditional Beowulfs. For instance, as seen in Chapter 5, in a traditional Beowulf, user accounts must be added on every node with the same user ID, and passwords must be kept consistent on every node, or a central account management service such as NIS (Network Information Service) must be maintained and accessed via the network by all nodes. Typically, users wish to spawn tasks on compute nodes of the cluster without being prompted for a password. The solution to this problem is usually to maintain a 'hosts.equiv' or '.rhosts' file on every node in the Beowulf, which contains the name or network address of every other node in the Beowulf. This file must be kept up-to-date each time the cluster's configuration changes.
Managing groups is equally simple. Groups take on an added importance in Scyld clusters. In addition to the traditional use of managing file access, groups can be used to manage access to compute nodes. Groups are defined by the file '/etc/group', and can be changed by directly editing the file, or through the standard usermod, groupadd and groupdel commands.
More sophisticated mechanisms to prevent User and Group ID-space conflicts are being built into the newest version of the Scyld OS to allow for clusters with multiple heads, primarily to provide high availability or failover capabilities.
An important issue when working with a large number of nodes is detecting node failures and recovering from them. This includes systems for monitoring the nodes and strategies for replacing a failed node.
The Bproc and Beoboot packages provide useful libraries for tracking the status of your cluster from a central location. The Scyld Beowulf OS provides a number of tools that take advantage of these libraries to allow administrators to better control their clusters, as well as the APIs for the creation of more sophisticated tools.
Among the tools provided for cluster monitoring are beostatus and bpstat, which are designed for direct user interaction, and the beostat tool, which is more appropriate for embedding in scripts. beostatus provides a display of common performance metrics for each node, such as CPU, memory, and network utilization; the display can be graphical or text based. Bpstat provides a summary of the state and permissions of each node, and can also be used in conjunction with the UNIX ps command to list the compute node on which every bproc process is running. The beostat tool provides any of the information normally available in the '/proc' file system of a Linux machine for any or all of the compute nodes. The 'libbeostat' library and the bproc kernel module provide a variety of system calls and library functions that make cluster status information easily available to a programmer. These calls can be used to build more sophisticated status reporting tools, or to import status information into load management and other tools. 'libbeostat' has a library call to report each of the same fields as the beostat command line tool, ranging from node status to CPU speed and about twenty other quantities.
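A few illustrative invocations of these tools follow; the option marked as assumed may vary by release, so consult the man pages.

    bpstat                  # summarize the state and permissions of every node
    ps aux | bpstat -P      # (assumed option) annotate ps output with node numbers
    beostat                 # dump raw /proc-style data gathered from the nodes
    beostatus               # start the interactive status display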
Most of the functionality provided by the beosetup configuration tool, the beostatus monitor, and the Beowulf batch queue monitor can also be accessed through the web on a Scyld Beowulf cluster. All of these functions are provided as add-ons to the standard webmin interface for remote system administration.
Inevitably, nodes will eventually fail. This may be due to software failures somewhere along the boot process, such as file system errors or bad scripts added to the boot sequence, or to a variety of hardware failures. In the case of a software failure, the node is placed in the error state, and a complete log of both the phase 1 and phase 2 boot process is stored on a per-node basis in the directory '/var/log/Beowulf' in a file named 'node.<nodenumber>'. This makes debugging possible without having to physically access the compute nodes.
In the case of a hardware failure, Scyld provides no additional support beyond simply marking the node as being in the down or error state. A system administrator would be well advised to employ one of the cluster management techniques described in Chapter 13 to debug hardware issues.
In either case, the Scyld OS continues to function in the event of a compute node failure. Processes running on a node that fails will be lost, and it is up to the application to provide checkpoints if recovery of the job is to be possible. However, the system as a whole will continue to function, and the OS will not schedule any new tasks on the failed node. Unfortunately, some applications and/or users may hard code node numbers into the scripts that run their jobs. While this practice should generally be discouraged, system administrators can compensate for it by simply reordering the node list so that another node takes the place of the one that has failed. For instance, say a cluster has one spare node available for failover. If node 15 on that cluster fails, the administrator can either use the beosetup GUI or edit the configuration file to place the MAC address of the spare node in the 15th position on the node list. If the administrator then boots the spare node, it will come up as node 15. Users will then see the same set of nodes they always see, and service is not interrupted on any other node, though anything on the original node 15 at the time of the failure will be lost.
One of the primary chores of running a large Beowulf is allocating and scheduling nodes to particular jobs or to particular users. The Scyld/bproc system provides an elegant means for providing access to nodes, and a simple set of tools for allocation and scheduling. These mechanisms can in turn be used as a basis for building more sophisticated tools.
The core of the node allocation mechanism is the Bproc permission model. Nodes are given owners, groups, and permission bits, much like the UNIX file permission system. For nodes, the "read" and "write" bits are meaningless; only the execute bit has importance. Each node is given an owner and a group ID. The permission bits allow the administrator to define whether a node can be used by the owner, by all members of the group, or by all users. Permissions can be changed on the fly manually by the administrator, or can be set by allocation and scheduling software to restrict node access.
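The sketch below shows how an administrator might restrict a block of nodes to a particular user and group with bpctl; the option letters and node range syntax are assumptions to be verified against the bpctl man page.

    bpctl -S 0-15 -u alice       # (assumed flags) make alice the owner of nodes 0-15
    bpctl -S 0-15 -g physics     # set their group to "physics"
    bpctl -S 0-15 -m 110         # execute permission for owner and group only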
Scyld Beowulf includes a simple load management system based on the UNIX at facility known as bbq, the Beowulf Batch Queue. The bbq system queues jobs submitted by users and runs them on a first-come, first-served basis on processors deemed available by the beomap calls. The number of processors required for a particular job is determined from the user's submitted job script. A request for this number of processors is made of beomap, which returns a list of processors whose load average is below 0.8. The job is then issued to this list of processors.
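A hypothetical submission to bbq using the standard at/batch tools might look like the following; how bbq extracts the processor count from the script (here, from the mpirun -np argument) is an assumption, and the job number given to atrm is invented.

    # Create a trivial job script and queue it
    cat > myjob.sh <<'EOF'
    mpirun -np 8 ./my_mpi_app
    EOF
    batch < myjob.sh     # queue the job for first-come, first-served execution
    bbq                  # check the state of the queue
    atrm 12              # remove queued job number 12 if no longer needed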
The scheduling policy implemented in bbq can be changed by replacing 'libbeostat', or by just replacing the call get_beowulf_job_map(). bbq is a functional scheduler for simple workloads, but it lacks limits on job time, out-of-order execution, and other features expected in a modern scheduler. For a Beowulf with a fairly complicated workload, the PBS system described in Chapter 17, which has also been modified to work with Scyld Beowulf, may be a better option.
The commands listed in Table 18.5 are used to perform all of the Scyld system administration tasks. New administrators should stick to the GUI systems provided, but in some cases these commands can be very useful. Man pages are provided online with all of the details.
Table 18.5: Scyld system administration commands

atd             | Beowulf Batch Queue daemon
atrm            | Remove jobs from batch queue
batch           | Submit job to queue
bbq             | Check queue status
bdate           | Set the time and date on slave nodes
beoboot         | Generate Beowulf boot images
beoboot-install | Install beoboot on compute node drives
beofdisk        | Partition slave node disks
beoserv         | Beoboot server daemon
beostatus       | Interactive status tool
beostat         | Display raw data from libbeostat
beowebenable    | Activate web access
bpcp            | Copy files to compute nodes
bpctl           | Set node state and ownership
bpmaster        | The bproc server daemon on the head
bpsh            | Run programs on compute nodes
bpslave         | Bproc client daemon on compute nodes
bpstat          | Show node status information
linpack         | Run linpack benchmark
mpprun          | Launch a non-parallel job on compute nodes
mpirun          | Launch an MPI job on compute nodes
node_down       | Shut down compute nodes cleanly
recvstats       | Daemon to receive multicast status info for libbeostat
sendstats       | Daemon to send multicast status info for libbeostat