12.5 Software Tricks

In addition to choosing appropriate BLAST parameters and optimizing your hardware setup, you can use a few software tricks to increase your BLAST performance. Most of these tricks involve splitting or concatenating sequences into optimal-sized pieces because BLAST processes very large and very small sequences inefficiently.

12.5.1 Multiplexing/Query Packing

Input and output (I/O) can become a large fraction of the overall CPU load when the search parameters are insensitive, such as when running BLASTN. If you find yourself running a lot of BLASTN searches, you can pack multiple queries together and reduce the overhead of reading the database repeatedly. For example, let's say you have a collection of 100,000 ESTs from your favorite organism and you want to search them against all other ESTs in the public database. If you search them one at a time, you will perform 100,000 BLAST searches and therefore have to read the database 100,000 times. It should go without saying that caching is essential in such a task.

But what if you glue the sequences together in groups of 100? Well, you've just cut your database I/O down to 1 percent of what it used to be, which can be a significant savings. For ESTs and other sequences of this length, the speedup is typically tenfold. This technique is called multiplexing or query packing. It isn't as simple as it sounds: there must be a way to prevent alignments from bridging the sequences, the coordinates must be remapped to the original queries, and the statistics need to be recalculated. MegaBLAST, part of the NCBI-BLAST distribution, is a specialized version of BLASTN that multiplexes queries and includes a variety of other optimizations. It's really fast, and anyone doing a lot of BLASTN searches should use this program. You can find more information about MegaBLAST in Chapter 9 and Chapter 13. Query packing can also be accomplished with a single, sophisticated Perl script (see MPBLAST at http://blast.wustl.edu).
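The coordinate bookkeeping behind query packing can be sketched in a few lines of Python. This is only an illustration, not how MegaBLAST or MPBLAST work internally: the function names, the run of N characters used as a spacer, and the spacer length are all assumptions made for the example, and the sketch does not attempt to recalculate statistics.

```python
from bisect import bisect_right

def pack_queries(queries, spacer_len=50):
    """Concatenate (name, sequence) pairs into one packed sequence.

    Returns the packed string and an index of (offset, name, length)
    entries, where offset is the 0-based start of each query.
    """
    spacer = "N" * spacer_len  # hypothetical spacer to discourage bridging
    parts, index, offset = [], [], 0
    for name, seq in queries:
        index.append((offset, name, len(seq)))
        parts.append(seq)
        offset += len(seq) + spacer_len
    return spacer.join(parts), index

def unpack_coordinate(index, packed_pos):
    """Map a 1-based packed coordinate to (name, 1-based local coordinate).

    Returns None when the position falls inside a spacer or out of range.
    """
    offsets = [entry[0] for entry in index]
    i = bisect_right(offsets, packed_pos - 1) - 1
    offset, name, length = index[i]
    local = packed_pos - offset
    return (name, local) if 1 <= local <= length else None
```

For example, packing two 4-base ESTs with a 3-base spacer places the second EST at packed positions 8 to 11, and unpack_coordinate maps position 8 back to base 1 of that EST.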

12.5.2 Query Chopping

Larger sequences require more memory to search and align. This can blow away your cached database, or worse, cause the computer to start swapping (using the disk for RAM). In addition, for a variety of reasons, larger query sequences are processed less efficiently. One way to solve this problem is to divide the query sequence into several segments, search them independently, and then merge the results back together. This is called query chopping and is effectively the opposite of query packing. The main difficulty with query chopping is dealing with alignments that cross the boundaries between segments.

Both NCBI-BLAST and WU-BLAST let you specify that only a subsequence of a large query sequence is to be searched (see the -L parameter in Chapter 13 and the nwstart and nwlen parameters in Chapter 14). Currently, this works a little better for WU-BLAST because alignments seeded in a restricted region can extend outside this region, so there's no need to stitch together the alignments between neighboring segments. The following Perl script searches chromosome-sized sequences in 100-KB segments using WU-BLAST. All coordinates and statistics are identical to a search with an entire chromosome. Note that complexity filters are currently applied to the whole sequence, so apply these filters ahead of time; the script refuses to run if a filter or wordmask option appears among the parameters.

#!/usr/bin/perl -w
use strict;

# Search one large query sequence in fixed-size segments with WU-BLAST.
die "usage: $0 <wu-blast command line>\n" unless @ARGV >= 3;
my ($BLAST, $DB, $Q, @P) = @ARGV;
die "ERROR ($0): single FASTA files only\n" if `grep -c ">" $Q` > 1;
my $params = "@P";
die "ERROR ($0): filter ahead of time\n" if $params =~ /filter|wordmask/;

# Count the residues, skipping the definition line.
open(FASTA, $Q) or die "ERROR ($0): can't open $Q\n";
my $def = <FASTA>;
my $count = 0;
while (<FASTA>) {chomp; $count += length($_)}
close FASTA;

# Search each segment, passing the remaining parameters through to BLAST.
my $segment = 100000;
for (my $i = 1; $i <= $count; $i += $segment) {
    system("$BLAST $DB $Q $params nwstart=$i nwlen=$segment");
}
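
To make the segment arithmetic explicit, here is a small Python sketch of the windows that the loop above steps through. The function name is hypothetical. Unlike the Perl script, which passes the full segment length every time, this sketch clips the final window to the end of the sequence; that is a choice made here for clarity, not a requirement stated for WU-BLAST.

```python
def segment_windows(total_len, segment=100_000):
    """Return (start, length) pairs covering positions 1..total_len.

    Starts are 1-based, like the nwstart values in the Perl script,
    and the final window is clipped to the end of the sequence.
    """
    if total_len < 0 or segment <= 0:
        raise ValueError("total_len must be >= 0 and segment must be > 0")
    return [
        (start, min(segment, total_len - start + 1))
        for start in range(1, total_len + 1, segment)
    ]
```

A 250,000-base query, for instance, yields three windows starting at positions 1, 100,001, and 200,001, the last one 50,000 bases long.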

12.5.3 Database Splitting

If you have a computer cluster and a lot of individual BLAST jobs to run, you can easily split the jobs among the nodes of your cluster. But what if you have a single, slow BLAST job that you want to spread out over several computers? If your sequence is very large, you can use query chopping as described earlier and assign each computer a separate segment. But what if your sequence isn't so large? A good solution is to have each computer search only part of the database. You'll need to do a little statistical manipulation to set the effective search space to the entire database, as well as some post-processing to merge all the reports together, but overall the process is pretty simple. The hard part is making sure the database is properly segmented on the various computers.
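
The merging step can be sketched as follows in Python. It assumes that every slice was searched with the effective search space set to the whole database, so that E-values from different slices are directly comparable. The Hit record and the function name are hypothetical, and parsing the individual BLAST reports into such records is left out.

```python
from typing import Iterable, List, NamedTuple, Optional

class Hit(NamedTuple):
    """Hypothetical minimal record for one database hit."""
    subject: str
    score: float
    evalue: float

def merge_slice_hits(slices: Iterable[Iterable[Hit]],
                     limit: Optional[int] = None) -> List[Hit]:
    """Merge per-slice hit lists into one list, best hits first.

    Hits are ordered by ascending E-value, with ties broken by
    descending score. Assumes E-values were computed against the
    full database size in every slice search.
    """
    merged = sorted(
        (hit for hits in slices for hit in hits),
        key=lambda h: (h.evalue, -h.score),
    )
    return merged if limit is None else merged[:limit]
```

The optional limit plays the role of a cap on the number of reported hits; whether such a cap should also be applied per slice is a separate decision this sketch does not address.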

If you're using NCBI-BLAST, you can create database slices using alias databases as described previously. This allows a great deal more flexibility than physically splitting the databases into various parts. But remember that alias databases require that you use GI numbers in the FASTA identifier.

If you're using WU-BLAST, you can split the database dynamically. WU-BLAST has command-line parameters called dbrecmin and dbrecmax that specify the first and last database records to search. You can assign each node of the cluster a different subsection of the database by simply setting dbrecmin and dbrecmax. For example, if your database contains 100 records and you have 10 nodes, node 1 gets records 1 to 10, node 2 gets records 11 to 20, etc. To benefit from caching, each node should always be assigned the same slice, so that its part of the database stays in that node's cache from one search to the next.
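
The record arithmetic can be sketched in Python as shown here. The function name is hypothetical, and the sketch balances slices by record count only; if record lengths vary widely, slices of equal record count can still differ greatly in residues, which this example does not try to correct.

```python
def record_ranges(total_records, nodes):
    """Split records 1..total_records into contiguous (min, max) ranges.

    Returns one entry per node, suitable for dbrecmin and dbrecmax.
    Slice sizes differ by at most one record. A node that receives
    no records (more nodes than records) gets None.
    """
    if total_records < 0 or nodes <= 0:
        raise ValueError("total_records must be >= 0 and nodes must be > 0")
    base, extra = divmod(total_records, nodes)
    ranges, start = [], 1
    for node in range(nodes):
        size = base + (1 if node < extra else 0)
        if size == 0:
            ranges.append(None)
        else:
            ranges.append((start, start + size - 1))
            start += size
    return ranges
```

Because the result depends only on the record count and the number of nodes, calling the function with the same arguments always gives each node the same slice, which is what the caching advice above requires.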

12.5.4 Serial BLAST Searching

As discussed in Chapter 5, the best way to speed up BLAST searches is by making the seeding more stringent. The only problem is that low-scoring alignments may be lost. High-scoring alignments, however, are relatively resistant to changes in seeding parameters. The serial strategy takes advantage of this property; it uses an insensitive search to identify database matches and then a sensitive search to generate the alignments. An intuitive way to think about this with genomic sequence is "if I can hit just one exon, I can get the whole gene." The procedure has three steps and can be carried out with a simple script:

1. Run BLAST with insensitive parameters.
2. Build a BLAST database from the matches.
3. Run BLAST with sensitive parameters on just the matches.

NCBI-BLAST doesn't currently offer a wide range of word sizes, so serial searching is best carried out with WU-BLAST. Example 12-1 shows a script that wraps up the entire procedure.

Example 12-1. A script for serial BLAST searching

#!/usr/bin/perl -w
use strict;
die "usage: $0 <database> <query> <wordsize> <hitdist>\n" unless @ARGV == 4;
my ($DB, $Q, $W, $H) = @ARGV;
$H = $H ? "hitdist=$H" : "";
my $tmpdir = "/tmp/tt-blastx.tmpdir";
END {system("rm -rf $tmpdir") if defined $tmpdir}
system("mkdir $tmpdir") == 0 or die "ERROR ($0): can't create $tmpdir\n";
my $STD = "B=100000 V=100000 wordmask=seg";

# search
system("blastx $DB $Q W=$W T=999 $H $STD > $tmpdir/search") == 0 or die;

# collect the names of the matching database sequences from the report
open(NAME, ">$tmpdir/names") or die;
open(SEARCH, "$tmpdir/search") or die;
while (<SEARCH>) {print NAME "$1\n" if /^>(\S+)/}
close SEARCH;
close NAME;

# build second stage database
system("xdget -p -f $DB $tmpdir/names > $tmpdir/database") == 0 or die;
system("xdformat -p $tmpdir/database") == 0 or die;

# align
system("blastx $tmpdir/database $Q $STD") == 0 or die;

To demonstrate the performance of the serial strategy, the script in Example 12-1 was used to search a Caenorhabditis briggsae genomic fragment (c009500587.Contig4) against all C. elegans proteins (wormpep97). To minimize the effect of chance similarities, only alignments with at least 30 amino acids and 35 percent identity are analyzed. The search parameters, search speed, and number of HSPs found are displayed in Table 12-4. The first two rows correspond to standard, nonserial searches. Using the parameters recommended in Chapter 9 (row 2), BLASTX runs seven times faster than with the very sensitive WU-BLAST default parameters (row 1). This speed is paid for by a loss in sensitivity (186 HSPs rather than 251). The serial searches (rows 3 through 7) offer varying levels of speed and sensitivity. Only a few combinations of W and T are presented; there are many useful combinations. Of particular interest is row 4, which has approximately the same sensitivity as row 1 (248 HSPs versus 251), but runs 18 times faster. Not bad for a short script. Because BLAST is under active development, perhaps you'll see serial searching become a standard part of BLAST software.

Table 12-4. Serial BLAST performance
#	First search	Second search	Speed	Elapsed time (sec)	HSPs
1	W=3 T=12	None	1 x	883.3	251
2	W=3 T=14 hitdist=40	None	7 x	121.4	186
3	W=3 T=999 hitdist=40	W=3 T=12	14 x	62.1	230
4	W=4 T=999	W=3 T=12	18 x	49.1	248
5	W=5 T=999	W=3 T=12	50 x	17.6	219
6	W=4 T=999 hitdist=40	W=3 T=12	80 x	11.1	137
7	W=5 T=999 hitdist=40	W=3 T=12	110 x	7.9	116
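
The trade-off in Table 12-4 is easier to compare when speed and sensitivity are both expressed relative to row 1. The following Python sketch does that arithmetic using only the elapsed times and HSP counts from the table; the function and variable names are made up for the example.

```python
# (row, elapsed seconds, HSPs) copied from Table 12-4
ROWS = [
    (1, 883.3, 251),
    (2, 121.4, 186),
    (3, 62.1, 230),
    (4, 49.1, 248),
    (5, 17.6, 219),
    (6, 11.1, 137),
    (7, 7.9, 116),
]

def summarize(rows):
    """Return (row, speed-up, fraction of HSPs kept) relative to the first row."""
    _, base_time, base_hsps = rows[0]
    return [
        (row, base_time / seconds, hsps / base_hsps)
        for row, seconds, hsps in rows
    ]

for row, speedup, kept in summarize(ROWS):
    print(f"row {row}: {speedup:6.1f}x faster, {kept:.0%} of HSPs kept")
```

Read this way, row 4 keeps nearly all of the HSPs found by the default search, whereas rows 6 and 7 buy their speed by giving up roughly half of them.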