12.4 Distributed Resource Management

If you run a lot of BLAST jobs, you need to manage them so that your computers are neither idle nor overloaded. Being organized is the simplest way to schedule jobs. If you're the only user, you can use simple scripts to iterate over the various searches and keep your computer comfortably busy. The problems start when you add multiple users. In a small group, users can cooperate with one another without any extra software. Sending email saying "hey, stay off blast-server5 until I say so" works surprisingly well. But if you have a large group or irresponsible users, you'll want some kind of distributed resource management (DRM) software.
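For the single-user case, the "simple script" can be very small. The following Python sketch is one possible approach, not a prescribed method: it runs a list of commands while keeping no more than a fixed number active at once. The names `run_job`, `run_all`, and `max_concurrent` are hypothetical, and the stand-in jobs simply sleep; in practice each entry would be the BLAST command line you normally type.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(command):
    """Run one command to completion and return (command, exit status)."""
    result = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return command, result.returncode


def run_all(commands, max_concurrent=2):
    """Run every command, with at most max_concurrent running at a time.

    Results come back in the same order as the input list.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(run_job, commands))


if __name__ == "__main__":
    # Stand-in jobs (assumption): replace each entry with a real search command.
    jobs = [[sys.executable, "-c", "import time; time.sleep(0.1)"] for _ in range(4)]
    for command, status in run_all(jobs, max_concurrent=2):
        print(status, " ".join(command))
```

Setting `max_concurrent` to roughly the number of CPUs you are willing to dedicate is a reasonable starting point. A script like this has no notion of other users or other machines, which is exactly the gap DRM software fills.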

There are a number of DRM software packages, both free and commercial. Even the free ones cost time to install and maintain, and users need training to use them. Table 12-3 lists some of the most popular packages in the bioinformatics community. Condor is an established DRM that can be downloaded for free. It is unusual in that it supports both Windows and Unix. LSF is a mature product with many bioinformatics users. It is expensive, but for large groups its robustness can justify the cost. Parasol is purpose-built for the UCSC kilocluster and gives up some generality for increased performance. PBS and ProPBS are popular DRMs, and if you're an academic user, you can get ProPBS for free. SGE is a relative newcomer but has a strong following, partly because it's an open source project.

Table 12-3. DRM software
Product	Description (as advertised)
Condor	Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job-queuing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor; Condor then places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion. http://www.cs.wisc.edu/condor
LSF	Platform LSF 5 is built on a grid-enabled, robust architecture for open, scalable, and modular environments. Platform LSF 5 is engineered for enterprise deployment. It provides unlimited scalability with support for over 100 clusters, more than 200,000 CPUs, and 500,000 active jobs. With more than 250,000 licenses spanning 1,500 customer sites, Platform LSF 5 has industrial-strength reliability to process mission-critical jobs reliably and on time. A web-based interface puts the convenience and simplicity of global access to resources into the hands of your administrators and users. Platform LSF 5, with its open, plug-in architecture, seamlessly integrates with third-party applications and heterogeneous technology platforms. http://www.platform.com
Parasol	Parasol provides a convenient way for multiple users to run large batches of jobs on computer clusters of up to thousands of CPUs. Parasol was developed initially by Jim Kent, and extended by other members of the Genome Bioinformatics Group at the University of California Santa Cruz. Parasol is currently a fairly minimal system, but what it does, it does well. It can start up 500 jobs per second. It restarts jobs in response to the inevitable systems failures that occur on large clusters. If some of your jobs die because of your program bugs, Parasol can also help manage restarting the crashed jobs after you fix your program. http://www.soe.ucsc.edu/~donnak/eng/parasol.htm
PBS	The Portable Batch System (PBS) is a flexible batch queuing and workload management system originally developed by Veridian Systems for NASA. It operates on networked, multiplatform UNIX environments, including heterogeneous clusters of workstations, supercomputers, and massively parallel systems. Development of PBS is provided by the PBS Products Department of Veridian Systems. http://www.openpbs.org
ProPBS	The PBS Pro Version 5.2 workload management solution is the professional version of the Portable Batch System. Built on the success of OpenPBS, PBS Pro goes well beyond it with the features and support you expect in a mission-critical commercial product, such as: shrink-wrapped, easy-to-install binary distributions; support on every major version of Unix and Linux; enhanced fault tolerance and scalability; enhanced scheduling algorithms; computational grid support; direct support from the team that created PBS; new, rewritten documentation; source code availability. http://www.propbs.com
SGE	The Grid Engine project is an open source community effort to facilitate the adoption of distributed computing solutions. Sponsored by Sun Microsystems and hosted by CollabNet, the Grid Engine project provides enabling distributed resource management software for wide-ranging requirements from compute farms to grid computing. http://gridengine.sunsource.net