J. P. Navarro
In Section I we covered the enabling technologies that make up a cluster's hardware and software components. In presenting node hardware (Chapter 2), the Linux kernel (Chapter 3), cluster networks (Chapter 4), network configuration and tuning (Chapter 5), and cluster setup (Chapter 6), we introduced the most significant concepts to consider in selecting cluster hardware and the major operating system installation and configuration activities necessary to deploy a cluster.
After completing basic hardware and operating system installation, a cluster administrator will configure cluster-wide file systems, install and configure scheduling and resource management software, and install compilers, application libraries, and other software packages needed by cluster users.
With these activities complete, a cluster should be ready for productive use. From this point forward, cluster management will focus on two kinds of activities: 1) detecting, investigating, and recovering from hardware and software failures; and 2) adapting to changing requirements that drive changes to cluster hardware, software, and usage patterns.
This chapter is organized around these two major aspects of cluster management. First, we will cover monitoring, logging, backups, configuration management, and the broader set of activities that surround detecting and recovering from failures. Second, we will discuss activities such as software upgrades and account management that are driven primarily by changing cluster requirements.
Finally, we will wrap up by discussing the differences between systems management and cluster management, which constitute the most significant cluster management challenges.
Once a cluster is available to users, it will not take long for someone to report a failure. Perhaps a hardware component such as a hard disk, node memory, or an interconnect adapter that passed initial functionality tests during installation will fail under real application load, or perhaps a software library or service that appeared to work initially will fail when used by a real user or application. These are but two of the many possible reasons why a cluster component can fail.
Investigating a failure to determine its root cause can be a challenge. Problems may be clearly hardware related, clearly software related, or in some cases not clearly either. In the following sections we will discuss the activities cluster administrators use to investigate failures, find their root causes, and return the cluster smoothly to a functional state.