Chapter 1: So You Want to Use a Cluster

Overview

William Gropp

What is a "Beowulf Cluster" and what is it good for? Simply put, a Beowulf Cluster is a supercomputer that anyone can build and use. More specifically, a Beowulf Cluster is a parallel computer built from commodity components. This approach takes advantage of the astounding performance now available in commodity personal computers. By many measures, including computational speed, size of main memory, available disk space and bandwidth, a single PC of today is more powerful than the supercomputers of the past. By harnessing the power of tens to thousands of such low-cost but powerful processing elements, you can create a powerful supercomputer. In fact, the number 5 machine on the "Top500" list of the world's most powerful supercomputers is a Beowulf Cluster.

A Beowulf cluster is a form of parallel computer, which is nothing more than a computer that uses more than one processor. There are many different kinds of parallel computer, distinguished by the kinds of processors they use and the way in which those processors exchange data. A Beowulf cluster takes advantage of two commodity components: fast CPUs designed primarily for the personal computer market and networks designed to connect personal computers together (in what is called a local area network or LAN). Because these are commodity components, their cost is relatively low. As we will see later in this chapter, there are some performance consequences, and Beowulf clusters are not suitable for all problems. However, for the many problems for which they do work well, Beowulf clusters provide an effective and low-cost solution for delivering enormous computational power to applications and are now used virtually everywhere. This raises the following question: If Beowulf clusters are so great, why didn't they appear earlier?

Many early efforts used clusters of smaller machines, typically workstations, as building blocks in creating low-cost parallel computers. In addition, many software projects developed the basic software for programming parallel machines. Some of these made their software available for all users, and emphasized portability of the code, making these tools easily portable to new machines. But the project that truly launched clusters was the Beowulf project at the NASA Goddard Space Flight center. In 1994, Thomas Sterling, Donald Becker, and others took an early version of the Linux operating system, developed Ethernet driver software for Linux, and installed PVM (a software package for programming parallel computers) on 16 100MHz Intel 80486-based PCs. This cluster used dual 10-Mbit Ethernet to provide improved bandwidth in communications between processors, but was otherwise very simple—and very low cost.

Why did the Beowulf project succeed? Part of the answer is that it was the right solution at the right time. PCs were beginning to become competent computational platforms (a 100MHz 80486 has a faster clock than the original Cray 1, a machine considered one of the most important early supercomputers). The explosion in the size of the PC market was reducing the cost of the hardware through economies of scale. Equally important, however, was a commitment by the Beowulf project to deliver a working solution, not just a research testbed. The Beowulf project worked hard to "dot the i's and cross the t's," addressing many of the real issues standing in the way of widespread adoption of cluster technology for commodity components. This was a critical contribution; making a cluster solid and reliable often requires solving new and even harder problems; it isn't just hacking. The contribution of the community to this effort, through contributions of software and general help to others building clusters, made Beowulf clustering exciting.

Since the early Beowulf clusters, the use of commodity-off-the-shelf (COTS) components for building clusters has mushroomed. Clusters are found everywhere, from schools to dorm rooms to the largest machine rooms. Large clusters are an increasing percentage of the Top500 list. You can still build your own cluster by buying individual components, but you can also buy a preassembled and tested cluster from many vendors, including both large and well-established computer companies and companies formed just to sell clusters.

This book will give you an understanding of what Beowulfs are, where they can be used (and where they can't), and how they work. To illustrate the issues, specific operations, such as installation of a software package are described. However, this book is not a cookbook; software and even hardware change too fast for that to be practical. The best use of this book is to read it for understanding; to build a cluster, then go out and find the most up-to-date information on the web about the hardware and software.

Each of the areas discussed in this book could have its own book. In fact, many do, including books in the same MIT Press series. What this book does is give you the basic background so that you can understand Beowulf Clusters. For those areas that are central to your interest in Beowulf computing, we recommend that you read the relevant books. Some of these are described in Appendix B. For the others, this book provides a solid background for understanding how to specify, build, program, and manage a Beowulf cluster.

We begin by defining what a cluster is and why a cluster can be a good computing platform. Since not all applications are appropriate for clusters, Section 1.3 introduces techniques for estimating the performance of an application on a cluster, with an illustration drawn from technical computing. With this background, the next two sections provide two different ways to read this book. Section 1.4 provides a procedural approach, from choosing which components will constitute the cluster to determining how applications can be tuned on the cluster. Section 1.5 provides a topical approach, such as how to program it, run jobs on it, or specify a cluster's components.