6.8 When Things go Wrong

In a cluster installation, literally hundreds of small items can be show-stoppers that keep an installation from working over the network. In this section, we describe some of the common problems that users might encounter. If you run into an installation problem, there are many email and web resources to check, including toolkit-specific discussion lists and the general Beowulf users list. The key thing that makes clusters different is that one relies on a network to enable installation (whether image-based or description-based).

  • MAC addresses of new nodes are never detected. There are a few things to check here. First, on motherboards with dual interfaces, make sure that you have plugged into the interface that will be labeled eth0. If you are using PXE, make certain that it is enabled on this interface. Which interface becomes eth0 is not standardized, and sometimes the fix is as simple as switching the cable. If you are still not seeing DHCPDISCOVER messages on the frontend, attach the frontend directly to the node with a standard Ethernet cross-over cable. If you now see the DHCPDISCOVER message in the logs (make sure dhcpd is running), then you have narrowed things down to the network itself. With today's managed switches, you will need to make certain that broadcast traffic is enabled on the switch itself.

  • During download of image or packages, the node just freezes. There are generally two possibilities: either the device driver for your network card is buggy or unreliable (this is actually common when new NICs are introduced), or your node hardware is simply bad (memory, processor, disk, or something else). If the problem affects all nodes, then look for something that they have in common (like the network driver). It is also possible that an image or a package is corrupted on the server itself. For RPM-based installations, the installer will often tell you on which package the installation failed, and using RPM to verify that package on the server is an easy check.

  • My network card isn't supported. This problem is much more common than you might think. NIC manufacturers ship a number of variants of a standard interface (the Intel e1000 has more than six hardware variants), and the Linux driver may not have caught up to the latest versions. You first have to determine exactly what the interface is. If you can hand-install a version of Linux on the node, you can use lspci to find out all about the devices on your PCI bus. Ethernet controllers are listed there, and you can look at the PCI ID and the text description in the PCI record. A look at the driver source code will determine whether that variant of a known device is supported. If it is supported, you then have to construct a custom installation kernel, boot floppy, or PXE image. How to do this is toolkit specific and goes quite deep into the details of a particular toolkit.
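Two of the checks above are easy to script. The following is a minimal sketch, not part of any toolkit. It assumes that the frontend runs ISC dhcpd, which logs discovery requests in the form "DHCPDISCOVER from <mac> via <interface>", and that the node's PCI listing was produced with lspci -nn, which appends the numeric [vendor:device] ID to each entry. The function names and the sample lines are hypothetical and only illustrate the format; in practice you would read the frontend's system log and the real lspci output.

```python
import re

# Assumed ISC dhcpd log wording: "DHCPDISCOVER from <mac> via <interface>".
DISCOVER_RE = re.compile(
    r"DHCPDISCOVER from ((?:[0-9a-f]{2}:){5}[0-9a-f]{2}) via ([^\s:]+)",
    re.IGNORECASE,
)

# Assumed `lspci -nn` layout: "<slot> <class> [cccc]: <text> [vvvv:dddd]".
ETHERNET_RE = re.compile(
    r"^(\S+) Ethernet controller.*?: (.+) \[([0-9a-f]{4}):([0-9a-f]{4})\]",
    re.IGNORECASE,
)


def discover_macs(log_lines):
    """Map each MAC address seen in a DHCPDISCOVER line to its interface."""
    seen = {}
    for line in log_lines:
        match = DISCOVER_RE.search(line)
        if match:
            seen[match.group(1).lower()] = match.group(2)
    return seen


def ethernet_devices(lspci_lines):
    """Return (slot, 'vendor:device', description) for Ethernet controllers."""
    devices = []
    for line in lspci_lines:
        match = ETHERNET_RE.match(line)
        if match:
            slot, description, vendor, device = match.groups()
            devices.append((slot, f"{vendor}:{device}".lower(), description))
    return devices


# Illustrative sample input only; these lines are made up for the sketch.
SAMPLE_LOG = [
    "Jan 12 10:00:01 frontend dhcpd: DHCPDISCOVER from 00:50:8B:E0:3A:7C via eth0",
    "Jan 12 10:00:02 frontend dhcpd: DHCPOFFER on 10.1.255.254 to 00:50:8b:e0:3a:7c via eth0",
    "Jan 12 10:00:09 frontend sshd[411]: Accepted publickey for root",
]
SAMPLE_LSPCI = [
    "00:1f.2 SATA controller [0106]: Intel Corporation 82801 SATA Controller [8086:2922] (rev 02)",
    "02:00.0 Ethernet controller [0200]: Intel Corporation 82574L Gigabit Network Connection [8086:10d3]",
]

if __name__ == "__main__":
    macs = discover_macs(SAMPLE_LOG)
    if not macs:
        print("No DHCPDISCOVER seen: check cabling, PXE settings, and the switch")
    for mac, interface in macs.items():
        print(f"node {mac} is reaching the frontend on {interface}")
    for slot, pci_id, description in ethernet_devices(SAMPLE_LSPCI):
        print(f"{slot}: PCI ID {pci_id} ({description})")
```

An empty result from discover_macs corresponds to the first problem above: the node's broadcasts are not reaching the frontend. The vendor:device pair reported by ethernet_devices is the value to search for in the driver source when deciding whether a particular NIC variant is supported.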

Part III: Managing Clusters