9.3 BLASTX Protocols

BLASTX is generally used to find protein coding genes in genomic DNA or to identify proteins encoded by transcripts. BLASTX runs relatively slowly, and can be the bottleneck in a sequence annotation pipeline. Most BLASTX searches are of the exploring variety. However, it is sometimes necessary to identify nearly identical sequences quickly. The last protocol gives some advice.

9.3.1 Gene Finding in Genomic DNA

Most proteins are related to other proteins. This makes BLASTX a very powerful gene-finding tool. As protein databases become larger and more diverse, BLASTX becomes even more useful because it can identify more and more genes. Approach

As with any search involving genomic DNA, the sequence must have its repeats masked, and lowercase is preferred to Ns. If possible, low-complexity sequences shouldn't be masked with Ns to avoid terminating extension if a coding region contains a region of low complexity.

Since we want to capture a range of protein similarities, we'll use the default BLOSUM62 scoring matrix, which is a good all-around matrix. We increase the threshold score from its default of 12 to 14, which increases the speed more than twofold and is still quite sensitive.

We use a higher value for E because we don't want to miss low-scoring alignments that may be real genes. We set the output report options very high so that no matches are missed simply by truncation. Some protein families have many members and may fill up a BLASTX report by themselves.

For WU-BLAST, we offer two command lines. The first is similar to the NCBI parameter set, and the second uses a single large word rather than two small words. In our tests, the second set is slightly faster and more sensitive. NCBI-BLAST parameters
blastall -p blastx -d <db> -i <genomic> -F "m S" -U -f 14 -b 10000 -v 10000 -e 100 WU-BLAST parameters
blastx <db> <genomic> wordmask=seg lcmask hitdist=40 T=14 B=10000 V=10000 E=100
blastx <db> <genomic> wordmask=seg lcmask W=4 T=20 B=10000 V=10000 E=100 Expected results

Percent identity is often a good indicator of the reliability of a protein match. Lengthy alignments above 50 percent identity don't occur by chance unless the sequence has some compositional bias. Alignments below 35 percent identity should be met with skepticism, and those below 30 percent aren't very reliable.

The ends of alignments often overrun exon boundaries. If the extensions are too long, they may force an exon into a separate alignment group. If the exon is short, it may not be able to achieve statistical significance when separated. To counteract this phenomenon, you can set the X parameters lower to prevent the extensions from going too far, but this can destroy weak alignments. Another option in WU-BLAST is to use olf and olmax (and their gapped counterparts golf and golmax) to change the overlap rules. These solutions aren't foolproof, so the best approach is to realize that this scenario could happen, and be on the lookout for gaps in coverage.

If your report is very long, you may have some unmasked repeats or low-complexity regions. A graphical report (Appendix D) can help determine where such repeats occur. Look for regions in which the alignment depth is very high. You may have to mask them by hand.

A less obvious problem is associated with GC-rich regions of DNA. These regions tend to have long open reading frames and can match various proteins. If the alignment threshold (Chapter 5) is low, these compositionally biased alignments may be combined to give significant scores. In WU-BLAST, you can set S2 higher, limit alignment groups with topcomboN, or ignore alignment groups with kap. A simpler and more general solution is to raise E.

All gene-finding procedures can be tricked by pseudogenes, and BLASTX is no exception. As discussed in the beginning of the chapter, you can't give stop codons highly negative scores to prevent them from appearing in alignments, that is, unless you use ungapped alignments, which are less sensitive. Rather than trying to remove pseudogenes, try to recognize them quickly. Most pseudogenes have internal stop codons or frame-shifts. While stop codons are easy to find because they appear as a * in an alignment, frame-shifts may not be immediately apparent. Usually, frame-shifts appear as overlapping HSPs in different frames, but they may also appear as HSPs that are very close rather than overlapping. Introns are rarely less than 30 nt, so any HSPs that are closer may result from a frame-shift. The presence of repeats overlapping or abutting an HSP is also a good indicator of a pseudogene. Retro-pseudogenes are derived from mRNAs, so look for poly-A tails or a lack of introns (you may have to do a little research to find out if other versions of the putative gene normally have introns). Also, retro-pseudogenes usually correspond to highly transcribed, conserved genes such as the protein components of the ribosome. Optimizations and variations

Chapter 12 discusses the serial search strategy as a way to vastly improve the speed of translating BLAST searches without losing sensitivity. It's really the best way to run BLASTX. The protocol here and the serial strategy assume that the proteins are reasonably similar. But what if you want to look for remote coding similarities? You can first try lowering T, which will make the search take longer. However, if you're primarily interested in only a short region, lowering T can identify distant relatives. See Section 9.2 section for sensitive searches for appropriate parameters.

9.3.2 Annotating ESTs (and Shotgun Sequence)

Given a collection of ESTs, one of the first analysis tasks is to determine what proteins they encode. The usual procedure is to find the best protein match. Before annotation, the EST may have an identifier such as:


At the end, the EST may be annotated with a FASTA definition line like this:

>my_EST_001 similar to Homo sapiens GATA transcription factor

While this definition is useful for classifying transcripts, it leads to a transitive annotation problem in which the similarity is eventually misapplied. You should use this procedure with discretion, and please don't submit such descriptions to public databases without good reason. Finally, whatever you do, don't concatenate multiple FASTA definition lines for the annotation because it can become confusing later on.

Many genome sequencing projects begin with a survey of random, whole genome shotgun reads. This protocol also works for shotgun genomic sequence. Approach

It's a good idea to check your sequences for vector and other contaminants before you begin this search, and if your sequences are genomic in origin, mask repeats as well. ESTs may also contain repeats, but for genomes that have short 3´UTRs and few repeats, this isn't necessary.

This task is much easier to accomplish with a local BLAST installation because you probably have many sequences to classify. If your FASTA file contains multiple sequences, BLAST conveniently processes each one in turn and its output will contain one report for each sequence.

The following alignment parameters are slightly less sensitive than the default parameters and are a good compromise for speed and sensitivity. Since we're really only interested in the description of the top match above some threshold, we set minimal output report parameters and E to 1e-10 (a somewhat arbitrary, but low value to prevent misclassification). Since soft masking can sometimes allow a region of low-complexity to dominate an alignment score, and since we won't look at each alignment, we err on the side of making no inference rather than making the wrong inference. We therefore use ordinary complexity filtering. In keeping with this philosophy, our choice of alignment parameters favors specificity and speed over sensitivity. For WU-BLAST, we offer two command lines. The second is slightly faster and more sensitive. NCBI-BLAST parameters
blastall -p blastx -d <db> -i <ESTs> -U -f 14 -e 1e-10 -b 0 -v 1 WU-BLAST parameters
blastx <db> <ESTs> filter=seg lcfilter hitdist=40 T=14 E=1e-10 B=0 V=1
blastx <db> <ESTs> filter=seg lcfilter W=4 T=20 E=1e-10 B=0 V=1 Expected results

Depending on the source of the sequence, there may not be many matches. With a low E value, most matches should be real similarities. Optimizations and variations

This is a task that greatly facilitated by the Bioperl tools (see www.bioperl.org). Using them simplifies running the BLAST job, parsing the output, editing sequence descriptions, and rewriting the files in FASTA format. BLAST parsers also let you apply other useful criteria for assignment?for example, requiring 40 percent identity. Lacking familiarity with Bioperl, you can pipe your report through common Unix tools. For example, you can capture the first line of the query and its best match with the following grep:

blastall -p blastx ... | grep "^Query=\|^>"

Unlike the previous protocol, serial searching isn't expected to help much because the first and second searches will find the same alignment.

9.3.3 Super-Fast BLASTX

At times you'll need to quickly map between a genomic sequence and its encoded proteins. This protocol represents the fastest way to find nearly identical matches with BLASTX. Approach

Our general approach is to set the seeding, extension, and evaluation parameters as insensitively as possible, within reason; we still want to be able to find matches to sequences with allelic differences. We don't bother changing the scoring matrix; it doesn't impact the speed of the search.

The NCBI-BLAST version of BLASTX isn't the best program for quickly finding protein matches to DNA because its maximum word size is 3. The parameters that follow use this size and a minimum distance for the two-hit algorithm (4) for an effective word size of 6 (yes, 4 is a strange way of specifying zero distance between two three-letter words). Oddly, the search runs faster when using maximum gap penalties rather than specifying no gaps with -g F. We don't set lower values for X because in practice, it has almost no impact on speed.

The WU-BLAST search uses a large word size without a neighborhood (automatically turned off at word sizes of 5 and above). The WINK parameter greatly reduces the total number of words and is one of the keys to the speed of this search. With W=6 WINK=6, the effective word size is 12. With such stringent seeding parameters, there is no need to change X or the alignment thresholds (see Chapter 5).

We also include a command line for the classic WU-BLAST 1.4, which works quite well for this task. (The 1.4 version is nearly identical to the 1.4 version of NCBI-BLAST that is no longer available). With this program, a word size of 5 performs best. Lowering X to terminate extensions early and increasing the alignment threshold S2 to reduce the computational burden of combined scores are both useful here. The values for X and S2 are calculated based on the default BLOSUM62 matrix. The most negative score is -4, so X=10 allows at least two mismatches. The average match score is about 5, and the average mismatch score is about -1.5 (you can estimate these scores quickly by looking at common amino acids such as alanine and glycine). S2=65 corresponds to a 90 percent identity alignment of 15 amino acids. NCBI-BLAST parameters
blastall -p blastx -d <db> -i <dna> -f 999 -A 4 -G 32767 -E 32767 WU-BLAST parameters
blastx <db> <dna> filter=seg W=6 WINK=6 nogap WU-BLAST 1.4 parameters
blastx <db> <dna> filter=seg W=5 S2=65 X=10 Expected results

Not all exons will be hit using such parameters, but with this mapping experiment, the general coordinates are of greatest interest. If you need an accurate alignment, this search can be followed by a more sensitive search using a serial strategy or bl2seq.

How much faster are these searches? Speed depends on sequence length and content, but as a general observation for sequences in the 50-200 Kb range, if the default NCBI parameter set is considered 1x, the fastest NCBI-BLAST is 6-8x, WU-BLAST is 100-500x, and the classic WU-BLAST 1.4 is 50-150x. Optimizations and variations

Other alignment algorithms such as BLAT (yes, the name is nearly identical), index the database as well as the query. They may prove faster, but they may also have larger resource requirements.