2.10 Node Choice and Cluster Construction

When building a cluster, a variety of issues must be considered. Hardware suitable to the cluster's intended use must be selected, and a vendor must be chosen. Environmental issues, such as the availability of space, cooling, and power, must be considered. Extra services, such as hardware and software maintenance, can be purchased. A working cluster can be reached by a variety of paths, each with pros and cons. See Chapter 6 for a discussion of post-purchase cluster setup.

2.10.1 Cluster Vendors

A common approach to building clusters is to find a vendor that provides integrated solutions. Many large system vendors now have products in the cluster space. They are experienced with the problems that customers have in the initial stages of cluster setup, and they know which questions should be asked up front. In many cases, the cluster can be powered on when delivered and be running applications within hours. Experienced cluster vendors optionally offer on-site hardware and software support. This approach is certainly the simplest, but it can be more expensive than the following options, because the extra services the vendors provide are reflected in the price. In many cases, however, the extra cost is well worth it.

2.10.2 White Boxes

Another common approach to building clusters is to find a vendor that builds custom computers but has no cluster expertise. The vendor builds machines to the customer's specifications, which allows the customer to specify the exact parts from which the cluster is assembled. While on-site hardware maintenance may be available, software maintenance is not. Experienced cluster builders may choose this route, as the difference between white box vendors and cluster vendors largely consists of help with cluster-specific issues.

2.10.3 DIY

The final approach to building clusters is to do everything yourself. Every detail of system configuration is controllable, from the type of power supply to the cables and the fans used for cooling. Hundreds of boxes will be delivered, containing the parts required for each cluster node. Nodes must be assembled before software can be installed. This approach provides the most flexibility, but it also has the highest potential for pitfalls.

2.10.4 Pitfalls

Many problems can manifest themselves during the construction and operation of a cluster. Some can be avoided by making proper decisions during the specification process. These problems can make clusters virtually unusable, so they should be taken seriously. The problems mentioned here can be treated as a checklist of issues to review before a cluster is set up.

It should be verified that enough power and cooling exist to properly operate the cluster. Underpowered or overheating clusters rarely perform well, and in many cases exhibit strange problems that can consume days, weeks, or months of administrator time to properly debug.
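A rough power and cooling budget can be worked out before any hardware is ordered. The following is a minimal sketch in Python, not a sizing tool: the function names, the 20 percent headroom, and the per-node wattage in the usage example are all illustrative assumptions. The only fixed facts it relies on are the standard conversions of about 3.412 BTU/hr per watt and 12,000 BTU/hr per ton of cooling. Measured draw under load should replace nameplate or guessed figures wherever possible.

```python
WATTS_TO_BTU_PER_HOUR = 3.412    # 1 W of electrical load is roughly 3.412 BTU/hr of heat
BTU_PER_HOUR_PER_TON = 12_000.0  # one "ton" of cooling removes 12,000 BTU/hr


def power_and_cooling(node_count, watts_per_node, overhead_watts=0.0, headroom=0.2):
    """Estimate total power draw and the cooling needed to remove it.

    overhead_watts covers switches, storage, and similar non-node equipment.
    headroom is a safety margin (0.2 means 20 percent); the default is an
    assumption for illustration, not a recommendation from the text.
    Returns (total_watts, btu_per_hour, cooling_tons).
    """
    if node_count < 0 or watts_per_node < 0 or overhead_watts < 0 or headroom < 0:
        raise ValueError("all inputs must be non-negative")
    load_watts = node_count * watts_per_node + overhead_watts
    total_watts = load_watts * (1.0 + headroom)
    btu_per_hour = total_watts * WATTS_TO_BTU_PER_HOUR
    cooling_tons = btu_per_hour / BTU_PER_HOUR_PER_TON
    return total_watts, btu_per_hour, cooling_tons


def amps_required(total_watts, volts):
    """Approximate current at a given supply voltage (assumes a power factor of 1)."""
    if volts <= 0:
        raise ValueError("volts must be positive")
    return total_watts / volts


if __name__ == "__main__":
    # Hypothetical cluster: 64 nodes at 250 W each, plus 1000 W of network gear.
    watts, btu, tons = power_and_cooling(64, 250.0, overhead_watts=1000.0)
    print(f"power: {watts:.0f} W, heat: {btu:.0f} BTU/hr, cooling: {tons:.2f} tons")
    print(f"current at 208 V: {amps_required(watts, 208.0):.1f} A")
```

Comparing these totals against the circuits and air conditioning actually available in the machine room gives an early warning before the cluster arrives.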

Some sort of console solution should be employed. Many hardware errors are displayed during the BIOS boot sequence. Whether or not the BIOS supports a serial console, the hardware needed to see these errors should be available. The simplest solution to this problem is a crash cart: a single keyboard, monitor, and mouse on a cart that can be connected to a machine when problems occur. More elaborate solutions can be constructed using KVM switches, or serial concentrators that provide a usable console on each machine.

Real profiling of target applications should be performed. Performance on artificial benchmarks is better than no information at all; however, these results matter little unless the primary applications run on the cluster will be benchmarks.
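The crudest useful measurement is the wall-clock time of the real application on a candidate node. The sketch below, using only the Python standard library, runs a command several times and reports the median. It is a starting point under stated assumptions rather than a full profiler: the `time_command` helper is a hypothetical name, and the workload in the usage example is a stand-in that should be replaced with the actual target application and a representative input.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run the command `argv` `runs` times; return each run's wall-clock seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the application fails,
        # so a crashed run is never mistaken for a fast one.
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Stand-in workload (an assumption for illustration): substitute the
    # real application's command line here.
    sample = [sys.executable, "-c", "sum(range(1_000_000))"]
    timings = time_command(sample, runs=3)
    print(f"median: {statistics.median(timings):.3f} s over {len(timings)} runs")
```

Running the same command on each candidate node type gives numbers that reflect the intended workload rather than a synthetic one.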

Finally, remember that everything is harder when it needs to be done multiple times. While assembling a single new machine is easy, assembling 32, 64, 96, or 128 machines is much harder. Remember that time has value. Cutting corners to save small amounts of money almost always causes problems.

Part III: Managing Clusters