13.3 Hardware Failure and Recovery

One of the most burdensome responsibilities in cluster management is dealing with the consequences of hardware failures. The impact of a failure varies drastically with how much of the cluster depends on the failing component.

At the high end of the impact spectrum are failures of components the whole cluster depends on: the file-servers serving user file-systems, infrastructure components such as management nodes and the nodes running scheduling and resource management services, and networking or interconnect components such as switches and routers. If any of these fails, the entire cluster may be unusable.

At the opposite end of the spectrum are failures that do not affect any other cluster component, for example the loss of a single compute node. When a single compute node fails, only the users active on that node are affected; work on other nodes proceeds unaffected.

Given the broad impact spectrum a failure can have, and that the failing component can be as minor as a single disk or as major as an entire cluster network, no single procedure can cover recovery from every hardware failure. In general, though, recovery follows this outline:

  1. Isolating the failed component to make sure no additional cluster activities are impacted (the first sketch after this list shows one way to script this).

  2. If the failure has a major impact, finding existing hardware that can temporarily stand in for the failed component so that recovery can begin immediately. For example, if you lose a disk, controller, or server serving critical file-systems and another server has spare capacity, you can begin immediate recovery to that alternate server (see the second sketch after this list).

  3. Getting the hardware serviced.

  4. Integrating the repaired hardware back into the cluster. If the failed component held data, such as a disk containing the operating system or user data, recovery also involves restoring the required contents to the new disk.
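
To make steps 1 and 4 concrete, the following minimal sketch drains a failed compute node out of the batch system and later returns it to service. It assumes a Slurm-managed cluster where the scontrol command is available; the node name and reason string are hypothetical placeholders.

    import subprocess

    def drain_node(node: str, reason: str) -> None:
        """Step 1: isolate the failed node so the scheduler stops placing work on it."""
        subprocess.run(
            ["scontrol", "update", f"NodeName={node}", "State=DRAIN",
             f"Reason={reason}"],
            check=True,
        )

    def resume_node(node: str) -> None:
        """Step 4: return the serviced node to the pool of schedulable resources."""
        subprocess.run(
            ["scontrol", "update", f"NodeName={node}", "State=RESUME"],
            check=True,
        )

    if __name__ == "__main__":
        drain_node("node042", "failed DIMM, awaiting service")  # hypothetical node name
        # ... hardware is serviced and reinstalled ...
        resume_node("node042")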

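Step 2 can be scripted in a similar spirit. The sketch below presses a standby server into service for a lost file-server by restoring the user file-system from a backup host and exporting it over NFS. It assumes rsync, a running NFS server, and root privileges on the standby machine; all host names, paths, and networks are hypothetical placeholders.

    import subprocess

    BACKUP_SOURCE = "backup.example.com:/srv/home/"  # hypothetical backup host and path
    RESTORE_DIR = "/srv/home"                        # local path on the standby server
    CLUSTER_NET = "10.0.0.0/16"                      # hypothetical cluster network

    def restore_from_backup() -> None:
        """Copy the latest backup of the user file-system onto the standby server."""
        subprocess.run(
            ["rsync", "-a", "--delete", BACKUP_SOURCE, RESTORE_DIR + "/"],
            check=True,
        )

    def export_filesystem() -> None:
        """Temporarily export the restored file-system over NFS to the cluster."""
        subprocess.run(
            ["exportfs", "-o", "rw,sync", f"{CLUSTER_NET}:{RESTORE_DIR}"],
            check=True,
        )

    if __name__ == "__main__":
        restore_from_backup()
        export_filesystem()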