Monitoring involves watching the many performance and operational variables that establish whether a cluster is running correctly and as efficiently as possible. Verifying correct operation means examining all hardware and software components and determining that they are available and operating as expected.
For example, to establish that all the expected hardware components are available, one needs to verify that all the CPUs, memory, disks, and network interfaces were detected by the operating system at boot time, and that all the other devices in a cluster that are not part of a host, such as network devices, power controllers, terminal servers, and network storage devices, are detected by the components that use them.
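As a concrete illustration, the following sketch checks that a node reports at least the CPU count and memory size expected for it. The expected values and the reliance on Linux's /proc interface are assumptions made for the example; a real deployment would draw the expected inventory from a site configuration database rather than hard-coded constants.

#!/usr/bin/env python
"""Minimal sketch of a per-node hardware inventory check on Linux.

EXPECTED_CPUS and EXPECTED_MEM_KB are hypothetical site-specific values.
"""
import re

EXPECTED_CPUS = 2                    # hypothetical: CPUs this node should report
EXPECTED_MEM_KB = 2 * 1024 * 1024    # hypothetical: roughly 2 GB of RAM

def detected_cpus():
    # Count the "processor" entries reported by the kernel.
    with open("/proc/cpuinfo") as f:
        return sum(1 for line in f if line.startswith("processor"))

def detected_mem_kb():
    # MemTotal is reported in kB in /proc/meminfo.
    with open("/proc/meminfo") as f:
        for line in f:
            m = re.match(r"MemTotal:\s+(\d+)\s+kB", line)
            if m:
                return int(m.group(1))
    return 0

if __name__ == "__main__":
    problems = []
    if detected_cpus() < EXPECTED_CPUS:
        problems.append("fewer CPUs detected than expected")
    if detected_mem_kb() < EXPECTED_MEM_KB * 0.95:  # allow for kernel overhead
        problems.append("less memory detected than expected")
    print("OK" if not problems else "; ".join(problems))

A check of this kind, run on every node and compared against a central inventory, quickly reveals hardware that silently disappeared after a reboot.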
Similarly, one can monitor the collection of software services that need to be running correctly for a cluster to be operational. Services such as schedulers, resource managers, and the node-monitoring daemons themselves must be up and responsive for user jobs and operational activities on a cluster to function.
Sometimes, even though hardware and software components are detected and operational, they may be operating in a degraded state, affecting efficient operation of the cluster. Monitoring for degraded operation is often neglected: applications may, strictly speaking, still work correctly, but not at the expected level of performance. Monitoring for degraded performance can also help predict which components are likely to fail completely in the near future. Some examples are a network cable that is producing packet loss, a disk that is very close to full, or a system process with higher-than-expected memory consumption, indicating a probable memory leak.
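As a rough illustration of such checks, the sketch below flags a nearly full root file system and packet errors on a network interface. The 90 percent threshold, the eth0 interface name, and the use of Linux's /proc/net/dev error counters are illustrative assumptions, not recommendations.

#!/usr/bin/env python
"""Minimal sketch of checks for degraded (but not failed) operation on Linux."""
import os

DISK_THRESHOLD = 0.90   # warn when a file system is more than 90% full (assumed)
INTERFACE = "eth0"      # assumed interface name

def disk_usage(path="/"):
    st = os.statvfs(path)
    used = (st.f_blocks - st.f_bfree) * st.f_frsize
    total = st.f_blocks * st.f_frsize
    return used / float(total)

def interface_errors(iface):
    # In /proc/net/dev, receive errors are the third field and transmit
    # errors the eleventh field after the interface name.
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[2]) + int(fields[10])
    return 0

if __name__ == "__main__":
    if disk_usage("/") > DISK_THRESHOLD:
        print("warning: root file system is nearly full")
    if interface_errors(INTERFACE) > 0:
        print("warning: %s reports packet errors" % INTERFACE)

Checks like these do not prove that anything has failed; they simply point at components whose behavior is drifting away from normal.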
When you combine all the possible hardware and software monitoring elements and multiply them by the number of components, you may find yourself needing to monitor thousands of operational elements just to answer the basic question of whether, and to what degree, a cluster is running normally.
Fortunately, monitoring has long been an important element of systems management, so a plethora of both commercial and open-source products is available to assist with this task. Whether you want to monitor systems, networks, or both, and whether or not you want to use protocols such as SNMP, many tools are useful for monitoring clusters. Some of the most common noncommercial cluster monitoring tools are:
Big Brother, http://bb4.com/
Clumon, http://clumon.ncsa.uiuc.edu/
Ganglia, http://ganglia.sourceforge.net/
Nagios (formerly NetSaint), http://www.nagios.org/
PARMON, http://www.cs.mu.oz.au/~raj//parmon/
Performance Co-Pilot, http://oss.sgi.com/projects/pcp/
Supermon, http://www.acl.lanl.gov/supermon/
We do not discuss these and other monitoring tools here, since many articles, papers, and discussions on cluster monitoring are available. Our main point is that these tools can be useful for measuring cluster health and summarizing cluster operational status.
Most workload management tools, including the Condor, Maui, and PBS systems discussed in this book, offer monitoring capability. Cluster managers should be very familiar with the monitoring capabilities of these tools because they summarize the most visible cluster state information: whether the nodes used by applications appear to be functional from a workload perspective, how active or busy the cluster currently is, and what the workload backlog looks like.
From a monitoring perspective, the node state information offered by workload management tools is an excellent indicator of overall cluster state and health, since it indicates both that the workload management services are running and reachable on each node and that the basic monitoring implemented by these services does not detect any type of node fault.
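As an illustration of how this information might be summarized, the following sketch counts nodes by state. It assumes that PBS's pbsnodes command is on the PATH and that it produces OpenPBS/TORQUE-style "state = ..." output; other workload managers expose similar information through their own query commands.

#!/usr/bin/env python
"""Minimal sketch: summarize node states as reported by PBS.

Assumes pbsnodes -a output in the OpenPBS/TORQUE style.
"""
import subprocess
from collections import Counter

def node_state_summary():
    output = subprocess.check_output(["pbsnodes", "-a"], universal_newlines=True)
    states = Counter()
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("state ="):
            states[line.split("=", 1)[1].strip()] += 1
    return states

if __name__ == "__main__":
    for state, count in sorted(node_state_summary().items()):
        print("%4d node(s) %s" % (count, state))

A one-line summary such as "60 nodes free, 4 nodes down" is often the fastest answer to the question of whether the cluster is healthy enough to run work.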
Both monitoring tools and log files may at times detect or record failure situations. If you do not want to inspect these log files constantly, you can use tools designed to detect trigger strings that represent failures and to report them via e-mail or other methods.
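The sketch below shows roughly how such a tool works: it scans a log file for failure patterns and mails any matches to an administrator. The log path, trigger patterns, and addresses are illustrative assumptions, and production sites typically rely on a dedicated, more robust tool (swatch and logwatch are common examples) rather than a hand-rolled script.

#!/usr/bin/env python
"""Minimal sketch of a log scanner that mails lines matching failure patterns."""
import re
import smtplib
from email.mime.text import MIMEText

LOG_FILE = "/var/log/messages"                              # assumed log location
TRIGGERS = [r"I/O error", r"link down", r"out of memory"]   # example patterns
MAIL_FROM = "monitor@cluster.example.org"                   # hypothetical addresses
MAIL_TO = "admin@cluster.example.org"

def scan(path, patterns):
    # Return every log line that matches any of the trigger patterns.
    regexes = [re.compile(p, re.IGNORECASE) for p in patterns]
    with open(path) as f:
        return [line for line in f if any(r.search(line) for r in regexes)]

def notify(lines):
    msg = MIMEText("".join(lines))
    msg["Subject"] = "cluster log triggers (%d lines)" % len(lines)
    msg["From"] = MAIL_FROM
    msg["To"] = MAIL_TO
    s = smtplib.SMTP("localhost")
    s.sendmail(MAIL_FROM, [MAIL_TO], msg.as_string())
    s.quit()

if __name__ == "__main__":
    hits = scan(LOG_FILE, TRIGGERS)
    if hits:
        notify(hits)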
When cluster resources, such as file system space, machine memory and swap, and network or file I/O bandwidth, are exhausted, the entire cluster may be affected. One possible effect is the outright failure of a component; for example, a machine with exhausted memory and swap is likely to crash or to terminate the application that is exhausting memory. Another, more difficult-to-detect effect is degraded performance.
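A simple exhaustion check might look like the following sketch, which reads Linux's /proc/meminfo and warns when free memory or swap drops below a threshold. The 5 percent and 10 percent thresholds are illustrative; appropriate values depend on the applications the cluster runs.

#!/usr/bin/env python
"""Minimal sketch of a memory/swap exhaustion check on Linux."""

def meminfo():
    # Parse /proc/meminfo into a dictionary of values in kB.
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2 and parts[0].endswith(":"):
                info[parts[0].rstrip(":")] = int(parts[1])
    return info

if __name__ == "__main__":
    m = meminfo()
    if m["MemFree"] < 0.05 * m["MemTotal"]:
        print("warning: free memory below 5% of total")
    if m.get("SwapTotal", 0) and m["SwapFree"] < 0.10 * m["SwapTotal"]:
        print("warning: free swap below 10% of total")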