If you're running a lot of BLAST jobs, one problem to consider is how to manage them to minimize idle time without overloading your computers. Being organized is the simplest way to schedule jobs. If you're the only user, you can use simple scripts to iterate over the various searches and keep your computer comfortably busy. The problem starts when you add multiple users. In a small group, it's possible for users to cooperate with one another without adding extra software. Sending email saying "hey, stay off blast-server5 until I say so" works surprisingly well. But if you have a large group or irresponsible users, you'll want some kind of distributed resource management (DRM) software.
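For the single-user case, the "simple script" mentioned above might look something like the following minimal sketch. It runs a list of commands while capping how many execute at once. The names `run_batch` and `blast_commands`, the default of two simultaneous jobs, and the `nr` database name are illustrative assumptions, not part of any BLAST distribution; the command line assumes the legacy NCBI `blastall` program with its `-p`, `-d`, `-i`, and `-o` options.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_batch(commands, max_jobs=2):
    """Run each command (a list of argv strings), at most max_jobs at a time.

    Returns the exit codes in the same order as the commands.
    """
    def run_one(argv):
        # Each worker thread blocks on one external process, so the pool
        # size caps how many searches run simultaneously.
        return subprocess.run(argv).returncode

    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return list(pool.map(run_one, commands))


def blast_commands(queries, database="nr", program="blastp"):
    """Build one blastall command per query file (hypothetical helper).

    Assumes legacy NCBI blastall; the database name is a placeholder.
    """
    return [
        ["blastall", "-p", program, "-d", database, "-i", q, "-o", q + ".out"]
        for q in queries
    ]
```

A call such as `run_batch(blast_commands(["q1.fa", "q2.fa", "q3.fa"]), max_jobs=2)` would keep two searches running until the list is exhausted. A sensible value for `max_jobs` depends on how many CPUs and how much memory the machine has, so treat the default as a starting point rather than a recommendation.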
There are a number of DRM software packages, both free and commercial. Even the free ones cost you time to install and maintain, and users need training to use the system. Table 12-3 lists some of the most popular packages in the bioinformatics community. Condor is an established DRM that is free to download, and it is unusual in supporting both Windows and Unix. LSF is a mature product with many bioinformatics users. It is expensive, but for large groups its robustness makes the cost justifiable. Parasol was purpose-built for the UCSC kilocluster and gives up some generality in exchange for performance. PBS and ProPBS are popular DRMs, and if you're an academic user, you can get ProPBS for free. SGE is a relative newcomer but has a strong following, partly because it's an open source project.
Product | Description (as advertised) |
---|---|
Condor | Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job-queuing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor; Condor then places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion. http://www.cs.wisc.edu/condor |
LSF | http://www.platform.com |
Parasol | Parasol provides a convenient way for multiple users to run large batches of jobs on computer clusters of up to thousands of CPUs. Parasol was developed initially by Jim Kent, and extended by other members of the Genome Bioinformatics Group at the University of California Santa Cruz. Parasol is currently a fairly minimal system, but what it does, it does well. It can start up 500 jobs per second. It restarts jobs in response to the inevitable systems failures that occur on large clusters. If some of your jobs die because of your program bugs, Parasol can also help manage restarting the crashed jobs after you fix your program. http://www.soe.ucsc.edu/~donnak/eng/parasol.htm |
PBS | The Portable Batch System (PBS) is a flexible batch queuing and workload management system originally developed by Veridian Systems for NASA. It operates on networked, multiplatform UNIX environments, including heterogeneous clusters of workstations, supercomputers, and massively parallel systems. Development of PBS is provided by the PBS Products Department of Veridian Systems. http://www.openpbs.org |
ProPBS | The PBS Pro Version 5.2 workload management solution is the professional version of the Portable Batch System. Built on the success of OpenPBS, PBS Pro goes well beyond it with the features and support you expect in a mission-critical commercial product. http://www.propbs.com |
SGE | The Grid Engine project is an open source community effort to facilitate the adoption of distributed computing solutions. Sponsored by Sun Microsystems and hosted by CollabNet, the Grid Engine project provides enabling distributed resource management software for wide-ranging requirements from compute farms to grid computing. http://gridengine.sunsource.net |