This section describes how to configure and customize Condor for your site. It discusses the configuration files used by Condor, describes how to configure the policy for starting and stopping jobs in your pool, and recommends settings for using Condor on a cluster.
A number of configuration files facilitate different levels of control over how Condor is configured on each machine in a pool. The top-level or global configuration file is shared by all machines in the pool. For ease of administration, this file should be located on a shared file system. In addition, each machine may have multiple local configuration files allowing the local settings to override the global settings. Hence, each machine may have different daemons running, different policies for when to start and stop Condor jobs, and so on.
All of Condor's configuration files should be owned and writable only by root. It is important to maintain strict control over these files because they contain security-sensitive settings.
The Condor project's website at www.cs.wisc.edu/condor has detailed installation instructions. For some Linux distributions, Condor is available in the native packaging format. For Linux distributions for which Condor is not natively packaged, it is available as a tar file. A Perl script is included to help install Condor and customize the configuration.
Condor has a default set of locations it uses to try to find its top-level configuration file. The locations are checked in the following order:
The file specified in the CONDOR_CONFIG environment variable.
'/etc/condor/condor_config', if it exists.
If user condor exists on your system, the 'condor_config' file in this user's home directory.
If a Condor daemon or tool cannot find its global configuration file when it starts, it will print an error message and immediately exit. Once the global configuration file has been read by Condor, however, any other local configuration files can be specified with the LOCAL_CONFIG_FILE macro.
This macro can contain a single entry if you want only two levels of configuration (global and local). If you need a more complex division of configuration values (for example, if you have machines of different platforms in the same pool and desire separate files for platform-specific settings), LOCAL_CONFIG_FILE can contain a list of files.
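For example, a two-entry list might look like the following (the file locations here are only illustrative; use whatever paths are appropriate for your site). The files are read in order, so settings in later files override those in earlier ones:

LOCAL_CONFIG_FILE = /shared/condor/etc/platform.config, /shared/condor/etc/node.local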
Condor provides other macros to help you easily define the location of the local configuration files for each machine in your pool. Most of these are special macros that evaluate to different values depending on which host is reading the global configuration file:
HOSTNAME: The hostname of the local host.
FULL_HOSTNAME: The fully qualified hostname of the local host.
TILDE: The home directory of the user condor on the local host.
OPSYS: The operating system of the local host, such as "LINUX," "WINNT4" (for Windows NT), or "WINNT5" (for Windows 2000). This is primarily useful in heterogeneous clusters with multiple platforms.
RELEASE_DIR: The directory where Condor is installed on each host. This macro is defined in the global configuration file and is set by Condor's installation program.
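For example, in a heterogeneous pool these macros can be combined so that each host first reads a platform-specific file and then its own per-host file. This layout is only a sketch; adapt the file names to your site:

LOCAL_CONFIG_FILE = $(RELEASE_DIR)/etc/$(OPSYS).config, $(RELEASE_DIR)/etc/$(HOSTNAME).local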
By default, the local configuration file is defined as
LOCAL_CONFIG_FILE = $(TILDE)/condor_config.local
Ease of administration is an important consideration in a cluster, particularly if you have a large number of nodes. To make Condor easy to configure, we highly recommend that you install all of your Condor configuration files, even the per-node local configuration files, on a shared file system. That way, you can easily make changes in one place.
You should use a subdirectory in your release directory for holding all of the local configuration files. By default, Condor's release directory contains an 'etc' directory for this purpose.
You should create separate files for each node in your cluster, using the hostname as the first half of the filename, and ".local" as the end. For example, if your cluster nodes are named "n01," "n02," and so on, the files should be called 'n01.local', 'n02.local', and so on. These files should all be placed in your 'etc' directory.
In your global configuration file, you should use the following setting to describe the location of your local configuration files:
LOCAL_CONFIG_FILE = $(RELEASE_DIR)/etc/$(HOSTNAME).local
The central manager of your pool needs special settings in its local configuration file. These attributes are set automatically by the Condor installation program. The rest of the local configuration files can be left empty at first.
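Although the installation program handles this for you, it may help to know roughly what these special settings look like. A typical central manager's local configuration file enables the condor_collector and condor_negotiator daemons via the DAEMON_LIST setting described at the end of this chapter; for example:

DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD

The SCHEDD and STARTD entries are needed only if the central manager should also accept job submissions and run jobs.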
Having your configuration files laid out in this way will help you more easily customize Condor's behavior on your cluster. We discuss other possible configuration scenarios at the end of this chapter.
Note: We recommend that you store all of your Condor configuration files under a version control system, such as CVS. While this is not required, it will help you keep track of the changes you make to your configuration, who made them, when they occurred, and why. In general, it is a good idea to store configuration files under a version control system, since none of the above concerns are specific to Condor.
Condor has a rich and highly configurable security implementation. Condor separates security into two parts: authentication and authorization. Authentication identifies the client requesting an action; it does not pass judgment on whether that client is allowed to perform the action. Condor can use many different methods for authentication, including Kerberos, X.509 public/private keys, and TCP/IP hostnames. Authentication levels and methods are automatically negotiated by Condor, and authentication can be Required, Preferred, Optional, or Never. Given the distributed nature of the daemons that implement Condor, access to these daemons is naturally host based, and host-based security is currently the default. However, host-based security is fairly easy to defeat. In any sort of untrusted environment, we strongly recommend using a more sophisticated authentication method such as X.509.
Authorization builds on top of authentication by specifying who is allowed to do what. There are four classes of access levels: Read, Write, Administrator, and Config. Each level may require a different strength of authentication and may allow a different set of clients to perform actions at that level. For example, it is very common to allow anyone who can authenticate as being from a local subnet to read information about Condor resources and jobs. At the same time, only a few people might be allowed to administer a machine, and these people may be required to identify themselves using Kerberos. The four access levels are described below:
Read: allows a client to obtain information from Condor. Examples of information that may be read are the status of the pool and the contents of the job queue.
Write: allows a client to provide information to Condor, such as submitting a job or joining the pool. Note that Write access does not imply Read access.
Administrator: allows a client to perform privileged operations, such as changing a user's priority level or starting and stopping the Condor system.
Config: allows a client to change Condor's configuration settings remotely using the condor_config_val tool's -set and -rset options. This has very serious security implications, so we recommend that you not enable Config access to any hosts.
The defaults during installation give all machines in the pool read and write access. The central manager is also given administrator access. You will probably wish to change these defaults for your site. Read the Condor Administrator's Manual for details on authentication and authorization in Condor and how to customize it for your site.
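For example, host-based authorization is controlled with the HOSTALLOW_* settings in the configuration files. A sketch for a site that grants read access to an entire local domain, write access only to the cluster nodes, and administrator access only to the central manager might look like the following (the hostnames are placeholders for your own):

HOSTALLOW_READ          = *.example.edu
HOSTALLOW_WRITE         = n*.cluster.example.edu
HOSTALLOW_ADMINISTRATOR = cm.cluster.example.edu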
Certain configuration expressions are used to control Condor's policy for executing, suspending, and evicting jobs. Their interaction can be somewhat complex. Defining an inappropriate policy impacts the throughput of your cluster and the happiness of its users. If you are interested in creating a specialized policy for your pool, we recommend that you read the Condor Administrator's Manual. Only a basic introduction follows.
All policy expressions are ClassAd expressions and are defined in Condor's configuration files. Policies are usually poolwide and are therefore defined in the global configuration file. If individual nodes in your pool require their own policy, however, the appropriate expressions can be placed in local configuration files.
The policy expressions are treated by the condor_startd as part of its machine ClassAd (along with all the attributes you can view with condor_status -long). They are always evaluated against a job ClassAd, either by the condor_negotiator when trying to find a match or by the condor_startd when it is deciding what to do with the job that is currently running. Therefore, all policy expressions can reference attributes of a job, such as the memory usage or owner, in addition to attributes of the machine, such as keyboard idle time or CPU load.
Most policy expressions are ClassAd Boolean expressions, so they evaluate to TRUE, FALSE, or UNDEFINED. UNDEFINED occurs when an expression references a ClassAd attribute that is not found in either the machine's ClassAd or the ClassAd of the job under consideration. For some expressions, this is treated as a fatal error, so you should be sure to use the ClassAd meta-operators, described in Section 15.1.2, when referring to attributes that might not be present in all ClassAds.
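For example, the =?= and =!= meta-operators ("is identical to" and "is not identical to") never evaluate to UNDEFINED, so they can guard references to attributes that may be missing. The following sketch of a START expression runs jobs only when the keyboard has been idle for fifteen minutes, and treats a missing KeyboardIdle attribute as not idle:

START = (KeyboardIdle =!= UNDEFINED) && (KeyboardIdle > 15 * 60)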
An explanation of policy expressions requires an understanding of the different stages that a job can go through from initially executing until the job completes or is evicted from the machine. Each policy expression is then described in terms of the step in the progression that it controls.
When a job is submitted to Condor, the condor_negotiator performs matchmaking to find a suitable resource to use for the computation. This process involves satisfying both the job and the machine's requirements for each other. The machine can define the exact conditions under which it is willing to be considered available for running jobs. The job can define exactly what kind of machine it is willing to use.
Once a job has been matched with a given machine, there are four states the job can be in: running, suspended, graceful shutdown, and quick shutdown. As soon as the match is made, the job sets up its execution environment and begins running.
While it is executing, a job can be suspended (for example, because of other activity on the machine where it is running). Once it has been suspended, the job can resume execution or can move on to preemption or eviction.
All Condor jobs have two methods for preemption: graceful and quick. Standard Universe jobs are given a chance to produce a checkpoint with graceful preemption. For the other universes, graceful implies that the program is told to get off the system, but it is given time to clean up after itself. On all flavors of Unix, a SIGTERM is sent during graceful shutdown by default, although users can override this default when they submit their job. A quick shutdown involves rapidly killing all processes associated with a job, without giving them any time to execute their own cleanup procedures. The Condor system performs checks to ensure that processes are not left behind once a job is evicted from a given node.
Various expressions are used to control the policy for starting, suspending, resuming, and preempting jobs.
START: when the condor_startd is willing to start executing a job.
RANK: how much the condor_startd prefers each type of job running on it. Unlike most policy expressions, RANK evaluates to a floating-point value instead of a Boolean. The condor_startd will preempt the job it is currently running if another job in the system yields a higher value for this expression.
WANT_SUSPEND: controls whether the condor_startd should even consider suspending this job or not. In effect, it determines which expression, SUSPEND or PREEMPT, should be evaluated while the job is running. WANT_SUSPEND does not control when the job is actually suspended; for that purpose, you should use the SUSPEND expression.
SUSPEND: when the condor_startd should suspend the currently running job. If WANT_SUSPEND evaluates to TRUE, SUSPEND is periodically evaluated whenever a job is executing on a machine. If SUSPEND becomes TRUE, the job will be suspended.
CONTINUE: if and when the condor_startd should resume a suspended job. The CONTINUE expression is evaluated only while a job is suspended. If it evaluates to TRUE, the job will be resumed, and the condor_startd will go back to the Claimed/Busy state.
PREEMPT: when the condor_startd should preempt the currently running job. This expression is evaluated whenever a job has been suspended. If WANT_SUSPEND evaluates to FALSE, PREEMPT is checked while the job is executing.
WANT_VACATE: whether the job should be evicted gracefully or quickly if Condor is preempting a job (because the PREEMPT expression evaluates to TRUE). If WANT_VACATE is FALSE, the condor_startd will immediately kill the job and all of its child processes whenever it must evict the application. If WANT_VACATE is TRUE, the condor_startd performs a graceful shutdown, instead.
KILL: when the condor_startd should give up on a graceful preemption and move directly to the quick shutdown.
PREEMPTION_REQUIREMENTS: used by the condor_negotiator when it is performing matchmaking, not by the condor_startd. While trying to schedule jobs on resources in your pool, the condor_negotiator considers the priorities of the various users in the system (see Section 15.5.3 for more details). If a user with a better priority has jobs waiting in the queue and no resources are currently idle, the matchmaker will consider preempting another user's jobs and giving those resources to the user with the better priority. This process is known as priority preemption. The PREEMPTION_REQUIREMENTS expression must evaluate to TRUE for such a preemption to take place.
PREEMPTION_RANK: a floating-point value evaluated by the condor_negotiator. If the matchmaker decides it must preempt a job due to user priorities, the macro PREEMPTION_RANK determines which resource to preempt. Among the set of all resources that make the PREEMPTION_REQUIREMENTS expression evaluate to TRUE, the one with the highest value for PREEMPTION_RANK is evicted.
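As a concrete illustration, consider a dedicated cluster whose nodes should do nothing but run Condor jobs. A common policy for such a pool is to always start jobs and never suspend, preempt, or kill them. A minimal sketch of this policy, suitable for the global configuration file, follows:

WANT_SUSPEND = False
WANT_VACATE  = False
START        = True
SUSPEND      = False
CONTINUE     = True
PREEMPT      = False
KILL         = False

If you also wish to disable priority preemption by the matchmaker, PREEMPTION_REQUIREMENTS = False can be set in the central manager's configuration.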
In addition to the policy expressions, you will need to modify other settings to customize Condor for your cluster.
DAEMON_LIST: the comma-separated list of daemons that should be spawned by the condor_master. As described in Section 15.3.1, which discusses the architecture of Condor, each host in your pool can play a different role depending on which daemons are started on it. You define these roles by setting DAEMON_LIST in the appropriate configuration files to enable or disable the various Condor daemons on each host.
DedicatedScheduler: the name of the dedicated scheduler for your cluster. This setting must have the form
DedicatedScheduler = "DedicatedScheduler@full.host.name.here"
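Putting these settings together, a sketch of the local configuration file for a dedicated compute node might enable only the condor_master and condor_startd and name the cluster's dedicated scheduler. The hostname is a placeholder, and the STARTD_EXPRS line assumes your Condor version advertises extra startd attributes this way:

DAEMON_LIST        = MASTER, STARTD
DedicatedScheduler = "DedicatedScheduler@front-end.example.edu"
STARTD_EXPRS       = $(STARTD_EXPRS), DedicatedScheduler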