9.5 TBLASTX Protocols

As discussed in Chapter 2, coding sequences evolve slowly compared to surrounding DNA. This makes TBLASTX a powerful gene-prediction tool for genomes that are appropriately diverged. What is the proper evolutionary distance? Because genes and organisms change at varying rates, there is no simple answer like "100 million years." If the distance is too great, the similarities may no longer be visible, but if the distance is too small, sequence similarity loses discriminatory power because noncoding regions are conserved too. For example, there is little sense in performing TBLASTX searches between humans and E. coli (too distant) or between humans and chimpanzees (too close).

Historically, TBLASTX has not been as popular as the other BLAST programs for several reasons. First, TBLASTX is computationally intensive. Second, until recently, there were not many completely sequenced genomes. Third, when you get a match, you will rarely find a useful description for what was found, just an alignment between two potential coding sequences. As more genomes are sequenced and computer performance continues to rise, TBLASTX will become more useful.

9.5.1 Preventing Stop Codons

The scoring matrices distributed with the BLAST programs give positive scores for aligning stop codons to one another. This is unacceptable for discriminating between coding and noncoding regions. Chapter 10 covers installation of BLAST software and describes how to create derivative scoring matrices with highly negative stop codon scores. If you don't have permission to make these changes, you can create the derivatives in your home directory. In this case, you need to specify the explicit path to your matrix rather than use just the name. WU-BLAST operates a little differently, and it is more convenient to specify alternate scores on the command line. Each protocol gives an example of this. As discussed in Chapter 8, gapped alignment can skip over stop codons. For this reason, consider using ungapped alignment for your TBLASTX searches.
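As a rough illustration of what a derivative matrix looks like, the following Python sketch rewrites every score that involves a stop codon. It assumes the usual text layout of the distributed matrices (comment lines starting with #, a header row of column labels, then one labeled row per residue). The function name penalize_stops and the default penalty of -999 are choices made for this example, not part of any BLAST distribution.

```python
def penalize_stops(matrix_text, penalty=-999):
    """Return matrix text with every score involving '*' set to penalty.

    Assumes a whitespace-delimited matrix: '#' comments, one header row
    of column labels, then rows that start with their own label.
    """
    out = []
    columns = None  # column labels, read from the first non-comment row
    for line in matrix_text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            out.append(line)  # keep blank lines and comments unchanged
            continue
        if columns is None:
            columns = fields  # header row
            out.append(line)
            continue
        row_label, scores = fields[0], fields[1:]
        # Replace the score when either the row or the column is the stop symbol.
        new_scores = [
            str(penalty) if "*" in (row_label, col_label) else score
            for col_label, score in zip(columns, scores)
        ]
        out.append(row_label + " " + " ".join(s.rjust(4) for s in new_scores))
    return "\n".join(out) + "\n"
```

Writing the result to a file in your home directory and giving BLAST the explicit path to that file follows the approach described above.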

9.5.2 Finding Undocumented Genes in Genomic DNA

Gene prediction is difficult. There are no genomes for which all protein coding genes are completely known. One of the most highly investigated genomes is the human genome, but despite what you read in the news, the number of genes can't be stated with much confidence. Counting genes is easier than determining their exact structure, and as a result, there are many proteins for which the true sequence is in doubt. Many genes are still waiting to be discovered (and many documented genes aren't real genes).

9.5.2.1 Approach

TBLASTX is computationally expensive because it translates both the query and the database sequences in all six reading frames (three on each strand) and compares every query frame to every database frame. To make matters worse, the sequences and databases searched by TBLASTX tend to be large. To counteract these factors, we choose insensitive seeding parameters, which is appropriate, considering that the extension algorithm is gapless and therefore also less sensitive.
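To make the cost concrete, here is a minimal Python sketch of six-frame translation using the standard genetic code. It is an illustration only, not how BLAST implements translation; the function names are invented for this example, and codons containing anything other than A, C, G, or T are shown as X.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Because both the query and every database sequence are expanded this way, each query-subject pair involves 6 x 6 = 36 frame-against-frame comparisons.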

As with any other search employing genomic DNA, it is always a good idea to mask repeats first. Here we prefer hard masking to soft masking, combined with normal complexity filtering. Our reasoning is that low-complexity sequence is common in genomic DNA, and random word hits near low-complexity sequence may result in lengthy extensions, high alignment scores, and misleading statistical significance.
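The difference between the two masking styles can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a replacement for a real repeat-masking program: the intervals are assumed to be 0-based, half-open (start, end) pairs supplied by some external repeat finder, and the function names are invented for this example.

```python
def _apply_mask(seq, intervals, transform):
    """Apply transform to each (start, end) slice of seq (0-based, half-open)."""
    chars = list(seq)
    for start, end in intervals:
        chars[start:end] = transform(seq[start:end])
    return "".join(chars)


def hard_mask(seq, intervals):
    """Replace masked regions with N so they cannot seed or extend alignments."""
    return _apply_mask(seq.upper(), intervals, lambda s: "N" * len(s))


def soft_mask(seq, intervals):
    """Lowercase masked regions; the original residues remain in the sequence."""
    return _apply_mask(seq.upper(), intervals, str.lower)
```

With hard masking the repeat residues are gone from the sequence, so an extension cannot run through them. With soft masking they are still present, which is why extensions that start nearby can continue into the masked region.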

For WU-BLAST, we offer two command lines. The second, which uses a single, large word rather than two small words, is faster and more sensitive, but requires more memory. It also shows how to change scoring matrix values from the command line with the altscore parameter.

9.5.2.2 NCBI-BLAST

blastall -p tblastx -d <db> -i <genomic> -f 999

9.5.2.3 WU-BLAST

tblastx <db> <genomic> filter=seg W=3 T=999 hitdist=40 nogap
tblastx <db> <genomic> filter=seg W=5 T=25 nogap altscore="* any -999" altscore="any * -999"

9.5.2.4 Expected results

This protocol can be used with either genomic or EST databases. However, searching EST databases with BLASTN is usually better. This discussion focuses on genome-genome searches. For genome-EST results and interpretations, see the appropriate BLASTN protocols.

Interpreting TBLASTX alignments isn't straightforward. It's nearly impossible to look at a report full of alignments and determine gene boundaries or the exact coordinates of coding exons. TBLASTX offers testable hypotheses. Regions with strong coding similarities may or may not correspond to real genes, but they are good candidates for experimental biology. Here are a few reasons why TBLASTX might find something missed by other approaches:

Genes in genes: Most gene-prediction algorithms don't predict genes within genes. However, the fact that large introns contain genes on their opposite strand is a relatively frequent phenomenon. TBLASTX can help identify these genes because the algorithm looks for local alignment similarities and has no bias for overall gene structure.
Alternative splicing: Some genes have several alternatively spliced forms. This is especially common in certain genomes, such as mammalian ones. Gene prediction algorithms usually find a single, optimal gene structure and no alternate forms. Because TBLASTX has no knowledge of splice sites, this doesn't pose a problem. Spliced variants may also have narrow windows of expression, which makes them difficult to find when using cDNA approaches. TBLASTX is less prone to missing these exons unless they are highly diverged.
Low expression: Genes expressed at low levels may have odd codon usage, which makes them less visible to gene prediction algorithms. Because the transcripts are rare, they are also less likely to appear in cDNA libraries and EST databases. TBLASTX isn't affected by codon biases or expression levels.

9.5.2.5 Optimizations and variations

This experiment is much more efficiently run as a serial search. In this strategy, a preliminary, insensitive search identifies sequences that are similar, and a second, sensitive search produces the alignments. Chapter 12 discusses this approach. You can try the approach by running bl2seq on each sequence identified in the search.
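The second stage of such a serial search might be organized along the following lines. This Python sketch only builds the bl2seq command lines; it does not run them. It assumes that the preliminary search was saved in tabular format (blastall -m 8, where the second column is the subject identifier) and that each database sequence is available as its own FASTA file named after its identifier. The directory name, the file naming scheme, and the function names are hypothetical.

```python
def subjects_from_tabular(report_text):
    """Return unique subject IDs, in first-seen order, from tabular output."""
    subjects = []
    for line in report_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        subject = line.split("\t")[1]  # column 2 holds the subject ID
        if subject not in subjects:
            subjects.append(subject)
    return subjects


def bl2seq_commands(query_fasta, subject_ids, fasta_dir="subjects"):
    """Build one bl2seq TBLASTX command (as an argument list) per subject.

    Assumes the hypothetical layout <fasta_dir>/<subject_id>.fa.
    """
    return [
        ["bl2seq", "-p", "tblastx", "-i", query_fasta, "-j", f"{fasta_dir}/{sid}.fa"]
        for sid in subject_ids
    ]
```

Each argument list can be passed to subprocess.run, with whatever sensitive search parameters you choose appended for the second pass.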

9.5.3 Transcript-Transcript TBLASTX

When presented with a transcript of unknown function, you should first run a BLASTX search to determine whether the transcript corresponds to a known protein. But what if it doesn't? One reason it might show no similarity is that its encoded protein isn't yet in the protein database. It's also possible that the transcript doesn't encode a protein. If it does, though, it might have some undiscovered relatives, and the best place to look for them is in EST databases.

9.5.3.1 Approach

We use the same seeding parameters as in Section 9.5.2, for the same reasons. We employ soft masking here, however, because the query sequence is short and probably doesn't contain as much low-complexity sequence.

9.5.3.2 NCBI-BLAST

blastall -p tblastx -d <est_db> -i <transcript> -f 999 -F "m S"

9.5.3.3 WU-BLAST

tblastx <est_db> <transcript> wordmask=seg W=3 T=999 hitdist=40 nogap
tblastx <est_db> <transcript> wordmask=seg W=5 T=25 nogap altscore="* any -999" altscore="any * -999"

9.5.3.4 Expected results

Evolutionary distance between the sequences is the key to determining if an alignment really corresponds to a protein. If the sequences are derived from closely related species, similarity isn't much help. In such cases, try aligning the matches with bl2seq without extreme stop codon penalties. If you get alignments that cross stop codons, the sequences aren't diverged enough.

If the sequences come from distant species, the alignments probably correspond to coding regions. This interpretation is even more believable if there are matches from multiple species.

9.5.3.5 Optimizations and variations

If the sequence has a long open reading frame, it is more efficient to translate this frame first and then search with TBLASTN. Recognizing the true reading frame, though, isn't always easy (see Chapter 8). If in doubt, use this protocol.
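One simple way to pick a candidate frame is to take the longest stop-free stretch across all six frames. The Python sketch below does that under several simplifying assumptions: it uses the standard genetic code, does not require a start codon (the transcript may be partial), and breaks ties in favor of the first frame examined. The function names are invented for this example.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def longest_orf(dna):
    """Return (frame_label, peptide) for the longest stop-free stretch.

    Frames are labeled +1..+3 and -1..-3. No start codon is required.
    """
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    best = ("", "")
    for strand, seq in (("+", dna), ("-", reverse)):
        for offset in range(3):
            protein = translate(seq[offset:])
            # Stop codons ('*') split the translation into open stretches.
            for segment in protein.split("*"):
                if len(segment) > len(best[1]):
                    best = (f"{strand}{offset + 1}", segment)
    return best
```

The returned peptide can then serve as a TBLASTN query. On short or noisy transcripts the longest stretch is not necessarily the true reading frame, which is exactly the difficulty noted above.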