Now let us take a look in more detail at each of the major activities performed by a cluster workload management system.
The first of the five aspects of workload management is queuing, or the process of collecting together "work" to be executed on a set of resources. This is also the portion most visible to the user.
The tasks the user wishes to have the computer perform, the work, are submitted to the workload management system in a container called a "batch job." The batch job consists of two primary parts: a set of resource directives (such as the amount of memory or number of CPUs needed) and a description of the task to be executed. This description contains all the information the workload management system needs in order to start a user's job when the time comes. For instance, the job description may contain information such as the name of the file to execute, a list of data files required by the job, and environment variables or command-line arguments to pass to the executable.
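The two parts of a batch job can be pictured as a simple data structure. The field names below are illustrative, not drawn from any particular workload management system:

```python
# A batch job bundles resource directives with a task description.
# All field names and values here are invented for illustration.
batch_job = {
    "resource_directives": {
        "num_cpus": 4,           # number of CPUs requested
        "memory_mb": 2048,       # amount of memory requested
        "walltime_minutes": 30,  # maximum allowed run time
    },
    "task_description": {
        "executable": "./simulate",                  # file to execute
        "arguments": ["--steps", "1000"],            # command-line arguments
        "input_files": ["mesh.dat", "params.cfg"],   # data files the job needs
        "environment": {"OMP_NUM_THREADS": "4"},     # environment variables
    },
}
```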
Once submitted to the workload management system, the batch jobs are held in a "queue" until the matching resources (e.g., the right kind of computers with the right amount of memory or number of CPUs) become available. Examples of real-life queues are lines at the bank or grocery store. Sometimes you get lucky and there's no wait, but usually you have to stand in line for a few minutes. And on days when the resources (clerks) are in high demand (like payday), the wait is substantially longer.
The same applies to computers and batch jobs. Sometimes the wait is very short, and the jobs run immediately. But more often (and thus the need for the workload management system) resources are oversubscribed, and so the jobs have to wait.
One important aspect of queues is that limits can be set that restrict access to the queue. This allows the cluster manager greater control over the usage policy of the cluster. For example, it may be desirable to have a queue available for short jobs only, analogous to the "ten items or fewer express lane" at the grocery store, providing a shorter wait for "quick tasks."
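A queue limit of this kind amounts to an admission check at submission time. The following sketch, with invented class and limit names, shows an "express" queue that rejects jobs over its walltime limit:

```python
# A minimal sketch of queue admission limits: each queue restricts the
# jobs it will accept, like an express lane for short jobs only.
from collections import deque

class Queue:
    def __init__(self, name, max_walltime_minutes):
        self.name = name
        self.max_walltime = max_walltime_minutes
        self.jobs = deque()  # jobs wait here in submission order

    def submit(self, job):
        """Accept the job only if it fits within this queue's limits."""
        if job["walltime_minutes"] > self.max_walltime:
            raise ValueError(f"job exceeds {self.name} walltime limit")
        self.jobs.append(job)

express = Queue("express", max_walltime_minutes=15)  # short jobs only
express.submit({"walltime_minutes": 10})             # accepted
# express.submit({"walltime_minutes": 120})          # would be rejected
```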
Each of the different workload management systems discussed later in this volume offers a rich variety of queue limits and attributes.
The second area of workload management is scheduling, which is simply the process of choosing the best job to run. Unlike in our real-life examples of the bank and grocery store (which employ a simple first-come, first-served model of deciding who's next), workload management systems offer a variety of ways by which the best job is identified.
As we have discussed earlier, however, best can be a tricky goal. It depends on the usage policy set by local management, the available workload, the type and availability of cluster resources, and the types of applications being run on the cluster. In general, however, scheduling can be broken into two primary activities: policy enforcement and resource optimization.
Policy encapsulates how the cluster resources are to be used, addressing such issues as priorities, traffic control, and capability vs. high throughput. Scheduling is then the act of enforcing the policy in the selection of jobs, ensuring that priorities are met and policy goals are achieved.
While implementing and enforcing the policy, the scheduler has a second set of goals. These are resource optimization goals, such as "pack jobs efficiently" or "exploit underused resources."
The difficult part of scheduling, then, is balancing policy enforcement with resource optimization in order to pick the best job to run.
Logically speaking, one can think of a scheduler as performing the following loop:
Select the best job to run, according to policy and available resources.
Start the job.
Stop the job and/or clean up after a completed job.
Repeat.
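One iteration of this loop can be sketched in a few lines. The priority-based policy below is merely a stand-in for whatever local policy dictates, and the job fields are illustrative:

```python
# One step of the scheduler loop: enforce policy (priority order) while
# optimizing resource use (start only jobs that fit the free CPUs).
def schedule_step(pending, free_cpus):
    """Return (job_to_start, remaining_pending); job is None if nothing fits."""
    # Policy enforcement: consider higher-priority jobs first.
    for job in sorted(pending, key=lambda j: -j["priority"]):
        # Resource optimization: start the job only if it fits.
        if job["cpus"] <= free_cpus:
            rest = [j for j in pending if j is not job]
            return job, rest
    return None, pending

pending = [
    {"name": "a", "priority": 1, "cpus": 8},
    {"name": "b", "priority": 5, "cpus": 2},
]
job, pending = schedule_step(pending, free_cpus=4)
# "b" is chosen: it has the highest priority among jobs fitting in 4 CPUs
```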
The nuts and bolts of scheduling is, of course, choosing and tuning the policy to meet your needs. Although different workload management systems each have their own idiosyncrasies, they typically all provide ways in which their scheduling policy can be customized. Subsequent chapters of this book discuss the various scheduling policy mechanisms available in several popular workload management systems.
Resource monitoring is the third part of any cluster workload management system. It provides necessary information to administrators, users, and the scheduling system itself on the status of jobs and resources. Resource monitoring comes into play at three critical times:
When nodes are idle, to verify that they are in working order before starting another job on them.
When nodes are busy running a job. Users and administrators may want to check the utilization of memory, CPU, network, I/O, and other system resources. Such checks are often useful in parallel programming, when users wish to verify that they have balanced their workload correctly and are effectively using all the nodes they've been allocated.
When a job completes. Here, resource monitoring is used to ensure that no processes remain from the completed job and that the node is still in working order before starting another job on it.
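The first and last of these checks, verifying a node before starting another job on it, might look like the following sketch. The metric names and thresholds are invented for illustration; real systems gather such data from a daemon on each node:

```python
# A sketch of the node health check: before starting a new job on a
# node, verify that it reports sane metrics and is free of leftovers.
def node_is_healthy(metrics, min_free_mb=512, max_load=0.5):
    """Return True if an idle node looks ready to accept a job."""
    return (metrics["free_memory_mb"] >= min_free_mb
            and metrics["load_average"] <= max_load
            and not metrics["stray_processes"])  # remnants of a prior job

ready = node_is_healthy({"free_memory_mb": 4096,
                         "load_average": 0.1,
                         "stray_processes": []})   # → True
```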
Workload management systems query the compute resources at these times and use the information to make informed decisions about running jobs. Much of the information is cached so that it can be reported quickly in answer to status requests. Some information is saved for historical analysis purposes. Still other information is used in the enforcement of local policy. The method of collection may differ in different workload management systems, but the general purposes are the same.
The fourth area, resource management, is essentially responsible for starting, stopping, and cleaning up after the jobs that are run on cluster nodes. In a batch system, resource management involves running a job for a user, under the identity of the user, on the resources the user was allocated, in such a way that the user need not be present at that time.
Many cluster workload management systems provide mechanisms to ensure the successful startup and cleanup of jobs and to maintain node status data internally, so that jobs are started only on nodes that are available and functioning correctly.
In addition, limits may need to be placed on the job and enforced by the workload management system. These limits are yet another aspect of policy enforcement, in addition to the limits on queues and those enacted by the scheduling component.
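On Unix systems, starting a job under the owner's identity and enforcing such limits might be sketched as below. This is illustrative rather than any particular system's implementation, and actually switching to another user's identity requires that the manager itself run with root privileges:

```python
# A sketch of job startup by a resource manager: fork a child, drop to
# the job owner's identity, apply a policy limit, then exec the job.
import os
import resource

def launch_job(executable, args, uid, gid, cpu_seconds):
    pid = os.fork()
    if pid == 0:                   # child: becomes the user's job
        try:
            os.setgid(gid)         # drop group privileges first
            os.setuid(uid)         # then run as the job's owner
            # Enforce a policy limit: cap total CPU time for the job.
            resource.setrlimit(resource.RLIMIT_CPU,
                               (cpu_seconds, cpu_seconds))
            os.execv(executable, [executable] + list(args))
        finally:
            os._exit(127)          # reached only if startup failed
    return pid                     # parent: records the job's pid
```

The parent later reaps the child with `os.waitpid`, which is where the cleanup and accounting steps would hook in.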
Resource management also includes removing or adding compute resources to the available pool of systems. Clusters are rarely static; systems go down, or new nodes are added. The "registration" of new nodes and the marking of nodes as unavailable are both additional aspects of resource management.
The fifth aspect of workload management is accounting and reporting. Workload accounting is the process of collecting resource usage data for the batch jobs that run on the cluster. Such data includes the job owner, resources requested by the job, and total amount of resources consumed by the job. Other data about the job may also be available, depending on the specific workload management system in use.
Cluster workload accounting data can be used for a variety of purposes, such as
producing weekly system usage reports,
preparing monthly per user usage reports,
enforcing per project allocations,
tuning the scheduling policy,
calculating future resource allocations,
anticipating future computer component requirements, and
determining areas of improvement within the computer system.
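As one example of such reporting, a per-user usage report can be rolled up from per-job accounting records. The record fields below are illustrative examples of the kind of data described above:

```python
# A sketch of workload accounting: aggregate per-job usage records
# (owner plus resources consumed) into a per-user total.
from collections import defaultdict

def per_user_cpu_hours(job_records):
    """Total CPU-hours consumed, grouped by job owner."""
    totals = defaultdict(float)
    for rec in job_records:
        totals[rec["owner"]] += rec["cpus"] * rec["hours"]
    return dict(totals)

records = [
    {"owner": "alice", "cpus": 4, "hours": 2.0},
    {"owner": "bob",   "cpus": 1, "hours": 8.0},
    {"owner": "alice", "cpus": 2, "hours": 1.5},
]
# per_user_cpu_hours(records) → {"alice": 11.0, "bob": 8.0}
```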
The data for these purposes may be collected as part of the resource monitoring tasks or may be gathered separately. In either case, data is pulled from the available sources in order to meet the objectives of workload accounting.