1.2 Why Use a Cluster?

1.2 Why Use a Cluster?

Why use a cluster instead of a single computer? There are really two reasons: performance and fault tolerance. The original reason for the development of Beowulf clusters was to provide cost-effective computing power for scientific applications, that is, to address the needs of applications that required greater performance than was available from single (commodity) processors or affordable multiprocessors. An application may desire more computational power for many reasons, but the following three are the most common:

  • Real-time constraints, that is, a requirement that the computation finish within a certain period of time. Weather forecasting is an example. Another is processing data produced by an experiment; the data must be processed (or stored) at least as fast as it is produced.

  • Throughput. A scientific or engineering simulation may require many computations. A cluster can provide the resources to process many related simulations. On the other hand, some single simulations require so much computing power that a single processor would require days or even years to complete the calculation. An example of using a Linux Beowulf cluster for throughput is Google [13], which uses over 15,000 commodity PCs with fault-tolerant software to provide a high-performance Web search service.

  • Memory. Some of the most challenging applications require huge amounts of data as part of the simulation. A cluster provides an effective way to provide even terabytes (1012 bytes) of program memory for an application.

Clusters provide the computational power through the use of parallel programming, a technique for coordinating the use of many processors for a single problem. Part II (Parallel Programming) discusses this approach in detail. What clusters are not good for is accelerating calculations that are neither memory intensive nor processing-power intensive or (in a way that will be made precise below) that require frequent communication between the processors in the cluster.

Another reason for using clusters is to provide fault tolerance, that is, to ensure that computational power is always available. Because clusters are assembled from many copies of the same or similar components, the failure of a single part only reduces the cluster's power. Thus, clusters are particularly good choices for environments that require guarantees of available processing power, such as Web servers and systems used for data collection.

We note that fault tolerance can be interpreted in several ways. For a Web server or data handling, the cluster can be considered up as long as enough processors and network capacity are available to meet the demand. A well-designed cluster can provide a virtual guarantee of availabilty, short of a disaster such as a fire that strikes the whole cluster. Such a cluster will have virtually 100% uptime. For scientific applications, the interpretation of uptime is often different. For clusters used for scientific applications, however, particularly ones used to provide adequate memory, uptime is measured relative to the minimum size of cluster (e.g., number of nodes) that allows the applications to run. In many cases, all or nearly all of the nodes in the cluster must be available to run these applications.

Of course, many uses of clusters are a blend of these two approaches. Part III describes tools for sharing a cluster among users and, in many cases, providing support for both performance-oriented and fault-tolerant computing.

Part III: Managing Clusters