13.4 Software Failure

Although software failures may be similar to hardware failures in their ability to bring an entire machine down, they are also quite different in several respects.

Software failures sometimes have no available fix. If nobody has yet detected and reported the bug, a new version or patch may not exist. When this happens, the only options are to avoid the conditions that trigger the fault, report the failure to the software supplier, and either wait for a fix or try to fix the problem yourself.

Regardless of which type of software is failing (kernel, distribution, scheduling and resource management, or application support library), the best practices for avoiding software failures are:

  • Keep an eye out for new software versions and bug fixes.

  • Perform careful testing and verification prior to upgrading to new software versions.

  • Whenever possible, give yourself a way to return to the previous software version in case an upgrade has major problems.

  • Maintain good records of unresolved failures, such as the ones that disappear after a reboot.
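
The record-keeping practice in the last item can be as simple as an append-only log file. The following is a minimal sketch, assuming a JSON Lines file is an acceptable format; the `FailureRecord` fields and the function names `record_failure` and `unresolved_failures` are hypothetical and chosen only for illustration, not taken from any particular cluster toolkit.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class FailureRecord:
    """One observed software failure (all field names are illustrative)."""
    node: str             # host on which the failure was seen
    component: str        # e.g. "kernel", "scheduler", "library"
    version: str          # software version in use at the time
    symptom: str          # what was observed
    workaround: str = ""  # e.g. "reboot", or conditions to avoid
    resolved: bool = False
    timestamp: str = ""   # filled in automatically when left empty


def record_failure(log_path, record):
    """Append one failure record to a JSON Lines file."""
    if not record.timestamp:
        record.timestamp = datetime.now(timezone.utc).isoformat()
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record)) + "\n")


def unresolved_failures(log_path):
    """Return every logged record that is still marked unresolved."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    return [r for r in records if not r["resolved"]]
```

Logging a failure that vanished after a reboot, together with the software version and the workaround used, makes it possible to recognize the same fault if it recurs and to give the supplier a useful report. Reviewing the output of `unresolved_failures` before an upgrade also shows which open problems the new version should be tested against.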




Part III: Managing Clusters