16.5 Troubleshooting

Maui's diagnostic commands are a good starting point for troubleshooting scheduling issues. The diagnose command, together with checknode and checkjob, provides detailed state information about the scheduler, including its various facilities, nodes, and jobs. In addition to reporting state, these commands can trigger extensive internal sanity checks for the scheduling area of interest.

For example, if job priorities do not appear to reflect site objectives, the diagnose -p command displays the priority of every job along with the contributions of the various priority components and subcomponents. It also checks for invalid priority values and summarizes the overall contribution of each component. At a glance, this helps administrators determine whether parameters need to be adjusted and, if so, by how much.

Other diagnostic commands assist in both problem resolution and system tuning in areas such as throttling policies, reservations, fairshare, Grid scheduling, and job management. If a diagnostic command uncovers a potential problem, the issue is reported as WARNING messages appended to the normal command output. In practice, these commands identify or resolve the vast majority of scheduling issues.
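Because diagnostic commands report potential problems as WARNING messages appended to their normal output, those lines can be pulled out automatically. The following Python sketch is one way to do that, not part of Maui itself. The helper names extract_warnings and run_diagnostic are hypothetical, and the sketch assumes only that each warning line contains the literal text WARNING and that the command is on the PATH.

```python
import subprocess
from typing import List


def extract_warnings(output: str) -> List[str]:
    """Return the lines of command output that contain a WARNING message.

    Assumption: warnings are recognizable by the literal text "WARNING".
    """
    return [line.strip() for line in output.splitlines() if "WARNING" in line]


def run_diagnostic(args: List[str]) -> List[str]:
    """Run a diagnostic command and return only its WARNING lines.

    `args` is the command and its arguments, e.g. ["diagnose", "-p"].
    """
    result = subprocess.run(args, capture_output=True, text=True, check=False)
    return extract_warnings(result.stdout)
```

On a host where the Maui client commands are installed, run_diagnostic(["diagnose", "-p"]) would return any warnings raised by the priority check. An empty list means that the command reported none.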

If additional information is required, Maui writes detailed logging information to the logfile specified by the LOGFILE parameter (usually 'log/maui.log'). The LOGLEVEL and LOGFACILITY parameters control the verbosity and focus of these logs. The higher verbosity levels produce a great deal of output, however, so keeping LOGLEVEL below 4 or so unless actively tracking a problem helps prevent excessive file activity.

These logs contain several types of entries, including the following:

  • INFO: provides status information about normal scheduler operations.

  • WARNING: indicates that an unexpected condition was detected and handled.

  • ALERT: indicates that an unexpected condition occurred that could not be fully handled.

  • ERROR: indicates that a problem was detected that prevents Maui from fully operating. This may be a problem with the cluster that is outside of Maui's control or may indicate corrupt internal state information.

  • Function header: indicates when a function is called and what parameters are passed.

A simple grep through the log file will usually indicate whether any serious issues have been detected and is of significant value when obtaining support or locally diagnosing problems. If neither commands nor logs point to the source of the problem, the Maui users list (<mauiusers@supercluster.org>) or Supercluster support (<support@supercluster.org>) may be consulted for additional assistance.
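The grep described above can also be sketched in Python when a per-type count is useful, for instance when preparing a support request. This is an illustrative sketch under stated assumptions: the function names scan_log and scan_log_file are hypothetical, and it assumes each log entry carries its keyword followed by a colon (for example "ERROR:"). The exact layout of a log line may differ between Maui versions.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumption: serious entries contain their keyword followed by a colon.
SERIOUS = re.compile(r"\b(WARNING|ALERT|ERROR):")


def scan_log(lines: Iterable[str]) -> Tuple[Counter, List[str]]:
    """Count WARNING, ALERT, and ERROR entries and collect the matching lines."""
    counts = Counter()
    matches = []
    for line in lines:
        found = SERIOUS.search(line)
        if found:
            counts[found.group(1)] += 1
            matches.append(line.rstrip("\n"))
    return counts, matches


def scan_log_file(path: str) -> Tuple[Counter, List[str]]:
    """Scan the log file at `path`, e.g. the file named by the LOGFILE parameter."""
    with open(path, errors="replace") as handle:
        return scan_log(handle)
```

Calling scan_log_file("log/maui.log") from the Maui home directory would summarize the default log location. Adjust the path to match the site's LOGFILE setting.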
