13.5 File System Failure and Recovery

If a hardware component or software service fails, a user will normally lose no more than the intermediate results of running applications: typically a few hours of work, which can be recovered by restarting jobs. If, on the other hand, a home file system holding months of work is lost, the impact on users could be huge.

For this reason, no cluster component is more critical than the storage and file systems that hold users' applications and data.

Regardless of the hardware or software used to provide home file systems, the first line of defense is regularly scheduled backups. Backups have the added advantage that they can be used to recover data lost through human error.
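As a rough illustration of the backup-and-restore cycle described above, the following Python sketch archives a directory into a timestamped, compressed tar file and later pulls a single file back out, as one might after an accidental deletion. The function names, the archive naming scheme, and the choice of tar archives are assumptions made for this example only; a production cluster would more likely rely on dedicated backup software, run from a scheduler, writing to separate storage.

```python
import tarfile
import time
from pathlib import Path


def backup_directory(source: Path, dest_dir: Path) -> Path:
    """Write a timestamped, gzip-compressed tar archive of `source` into `dest_dir`.

    Hypothetical helper for illustration. `dest_dir` should live outside
    `source` (ideally on different hardware) so that the archive does not
    include itself and survives a failure of the original file system.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"{source.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the directory name, not the absolute path.
        tar.add(source, arcname=source.name)
    return archive


def restore_file(archive: Path, member: str, target: Path) -> None:
    """Copy one regular file named `member` out of `archive` to `target`."""
    with tarfile.open(archive, "r:gz") as tar:
        stream = tar.extractfile(member)  # raises KeyError if member is absent
        if stream is None:
            raise ValueError(f"{member} is not a regular file in {archive}")
        target.write_bytes(stream.read())
```

With these helpers, a call such as `backup_directory(Path("/home"), Path("/backup"))` (both paths are placeholders) could be run nightly, and `restore_file` would then recover an individual file that a user removed by mistake.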

Besides backups, the following hardware and software options offer improved protection from hardware and software failures:

Use of RAID 1, 3, or 5 storage that protects from individual disk failures. (RAID 0 only stripes data across disks and provides no redundancy.)
Use of journaled file systems that protect from file system corruption and provide fast recovery in the case of crashes.
Use of parallel file systems that protect from the loss of a file server by providing access to the file system through multiple machines. Commercial file systems in this category include GPFS from IBM, GFS from Sistina, and PolyServe.
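To make the disk-failure protection of parity-based RAID levels such as RAID 5 more concrete, the sketch below shows the underlying idea in Python: a parity block is the bytewise XOR of the data blocks, so any single missing block can be recomputed from the survivors. This is a toy model under simplifying assumptions (three small in-memory blocks, one parity block, hypothetical function names). It is not how a RAID controller or the Linux software RAID driver is actually implemented.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte blocks."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("all blocks must be non-empty in number and equal in length")
    # zip(*blocks) walks the blocks column by column, one byte from each.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, imagined as living on three different disks.
data = [b"node", b"data", b"here"]

# The parity block would be stored on a fourth disk.
parity = xor_blocks(data)

# Simulate losing the second disk: XOR the surviving blocks with the parity.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Here `rebuilt` equals the lost block `b"data"`, because XOR-ing a value with itself cancels out and leaves only the missing contribution. The same property explains why this scheme tolerates one failed disk per stripe but not two.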

Adapting to Changing Requirements

In previous sections of this chapter we focused on the cluster management activities involved in investigating and recovering from failures. Sometimes the recovery process will drive a change in the base hardware or software configuration. The most common example is upgrading a software package in order to fix a bug in an older version.

Even when a cluster is fully functional, the world around it is constantly evolving. Application developers enhance their code to use new compiler or library features, new users need access to the cluster, and newly revealed security vulnerabilities could leave a cluster susceptible to attack if not fixed. These are just some examples of the changes that surround a cluster. All of them make it necessary to iterate through a careful change-management process.

Examples of changes driven by changing requirements include:

Adding more disk to expand storage capacity
Upgrading the RAM or processors in nodes to increase throughput
Applying security updates to system services
Upgrading to new and improved compilers or application libraries
Creating accounts for new users
Workload management

In the following sections we will discuss cluster management activities driven by changes like these. Many factors can influence a change of requirements, but the most common are the evolving needs of existing users, the needs of new users, hardware changes driven by failures or changing capacity requirements, and the software life cycle. Collectively these changes alter the base state of a cluster and the definition of what it means for the cluster to be operational.