7.3 Where Did My Oligo Go?

First, try to identify the position of the following oligo-nucleotide in the Drosophila melanogaster genome using WU-BLASTN with its default parameters:

TACATCCGGCACTTAGCCGGGCTCG

Example 7-4 shows that the oligo isn't found in the Drosophila melanogaster genome that uses WU-BLASTN with default parameters.

Example 7-4. The oligo isn't found

Reference:  Gish, W. (1996-2000) http://blast.wustl.edu

Notice:  this program and its default parameter settings are optimized to find
nearly identical sequences rapidly.  To identify weak similarities encoded in
nucleic acid, use BLASTX, TBLASTN or TBLASTX.

Query=  oligo
        (25 letters)

Database:  na_whole-genome_genomic_dmel_RELEASE3.FASTA
           7 sequences; 124,181,667 total letters.
Searching....10....20....30....40....50....60....70....80....90....100% done

                                                                     Smallest
                                                                       Sum
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

      *** NONE ***

There are, of course, many reasons why you might not be able to identify an oligo in the Drosophila melanogaster genome. First, the oligo might contain repetitive sequence and thus be masked out. However, because WU-BLAST doesn't mask by default, that can't be the reason. Second, the assembled genome may be incomplete. Every sequenced genome to date is incomplete to some degree. In fact, a 99 percent complete 124mb genome is still missing 1.24 mega-bases of a euchromatic (nonrepetitive DNA) sequence, leaving plenty of space for an oligo to go missing in. The incompleteness of the genome is a possible explanation for our WU-BLAST result, but is it the correct one? Before concluding that the oligo falls into a sequencing gap, let's try to run NCBI-BLASTN with its default parameters. Aha! The NCBI-BLASTN results in Example 7-5 show that the oligo is present in the Drosophila melanogaster genome and the HSP is assigned a significant Expect.

Example 7-5. Using NCBI-BLASTN to find the oligo

Sequences producing significant alignments:                      (bits) Value

2R 2R.3 assembled 23-11-2001                                           50   1e-06
X X release:2 length:21666217bp Assembled X chromosome arm seque...    32   0.25 
3R 3R.3                                                                32   0.25 
U GenomicInterval:U                                                    30   0.99 
3L 3L.3 v.3e  23351213bp BCM HGSC guide:3l-mtp-eval.08apr02            28   3.9  
2L 2L release:3 length:22217931bp Assembled 2L chromosome arm se...    28   3.9  

>2R 2R.3 assembled 23-11-2001
          Length = 20302755

 Score = 50.1 bits (25), Expect = 1e-06
 Identities = 25/25 (100%)
 Strand = Plus / Plus

Query: 1        tacatccggcacttagccgggctcg 25
                |||||||||||||||||||||||||
Sbjct: 16190927 tacatccggcacttagccgggctcg 16190951

Results like these frustrate a lot of BLAST users. Why does NCBI-BLAST find the oligo when WU-BLAST doesn't? The results may seem contradictory, but they make perfect sense, and understanding why this is so will help you use Karlin-Altschul statistics to ask questions about your own BLAST results.

7.3.1 Karlin-Altschul Statistics as a Tool for Further Investigation

Parameter choice seems a likely explanation for the results shown in Examples Example 7-4 and Example 7-5. If you assume the failure of WU-BLASTN to report the alignment isn't due to a bug, maybe the hit wasn't significant in the context of the current search. By now you've been exposed to enough Karlin-Altschul statistics to know that BLAST parameters determine the significance of an alignment. The scoring scheme used in a search is the fundamental BLAST parameter, so you should begin your investigation there.

WU-BLASTN and NCBI-BLASTN have very different default scoring schemes. WU-BLASTN uses a +5/-4 scheme to score alignments by default, whereas NCBI-BLASTN uses a +1/-3 scheme. One central theorem of Karlin-Altschul statistics is that every scoring scheme is implicitly a log-odds scoring scheme, and every log-odds scoring scheme implies a target frequency. This insight is significant because it means that a scoring scheme always hunts for alignments having a particular percent identity. One great thing about Karlin-Altschul statistics is they enable you to calculate that percent identity.

Chapter 4 provides a Perl script called Qcalc for calculating various nucleotide scoring scheme target frequencies; Table 4-1 summarizes the l and percent identity implied by some common scoring schemes. The default +5/-4 scoring scheme used by WU-BLAST implies an ungapped target frequency of 65 percent identity, whereas the +1/-3 default scheme used by NCBI-BLASTN looks for alignments with 99 percent identity. But why would a +5/-4 scoring scheme miss a real alignment with 100 percent identity? The answer to this question lies in the value of l and the role played by gap penalties.

Table 7-3 gives 0.104 nats for the value of gapped l associated with the WU-BLASTN default scoring scheme and gap penalties. On the other hand, the l associated with the NCBI-BLASTN default scoring scheme and gap penalties is 1.37 nats (Table 7-4). The difference between the two values is tenfold. To see how the value of l impacts your search results, use Karlin-Altschul statistics to delve more deeply into the relationship between raw scores, normalized (bit and nat) scores, and l.

Table 7-3. Selected WU-BLASTN parameters and values from the search shown in Example 7-5
Parameter	Value
l	0.104 nats (gapped)
k	0.0151 nats (gapped)
H	0.0600 nats/aligned residue
m	25 (length of the query sequence)
n	124,181,667(number of letters in the database)
Number of sequences in database	7

Table 7-4. Selected NCBI-BLASTN parameters and values from the search shown in Example 7-5
Parameter	Value
l	1.37 nats (gapped)
k	0.711 nats (gapped)
H	1.31 nats/aligned residue
m	25 (length of the query sequence)
n	124,181,667 (number of letters in the database searched)
Number of sequences in database	7

First, see how the value of l impacts the expected HSP length. Recall that this value is the length of an alignment that has an Expect=1; more precisely, it's the expected HSP length for E equal to 1 associated with a scoring matrix for a search space of size m´*n´. This value is reported in the footer of NCBI-BLAST reports, where it's called "Effective HSP length." However, it's absent from the WU-BLAST footer, so you should calculate it with the Perl function expectedHSPlength and the information from Tables Table 7-3 and Table 7-4.

Please note that the following function shows the traditional way to calculate the expected HSP length. Recent work by Altschul and colleagues suggests that this function overestimates the effective HSP length for short sequences, and the manner in which it's calculated may change in the future. Therefore, use the value reported in the BLAST report footer, if it's available. For the purposes here, though, this function is fine.

sub expectedHSPlength{
    my $k = shift;
    my $m = shift; # actual length of query
    my $n = shift; # actual length of query
    my $h = shift; # average nats/aligned pair 

    # l = ln(kmn)/H
    return  log($k*$m*$n)/$h;

}

expectedHSPlength returns an expected HSP-length of about 16 nucleotides for the NCBI defaults. The expected length of the WU-BLASTN HSP with Expect = 1, however, is higher?about 294 nucleotides. That's a big difference. Once again, the reason for the difference lies with the scoring matrix. Recall that the implied target frequency for the NCBI-default +1/-3 scoring scheme was 99 percent, but it was 65% for the WU-BLAST defaults (see Table 4-1). This is why the effective HSP length for the WU-BLAST search is so much longer. The hypothetical 294-nucleotide alignment is expected to have a percent identity of less than 65 percent. In other words, taking into account mismatches and gaps, it needs to be 294 bases long to attain a raw score sufficient to generate an Expect of 1. Thus, the WU-BLAST defaults implicitly assume that nucleotide homologies will have low identity (<65%), but be long?294 nucleotides in the context of the Drosophila melanogaster genome. Is this biological assumption valid? Yes and no.

The WU-BLASTN defaults are well suited for detecting long regions of low identity such as poorly conserved exons. On the other hand, the NCBI-BLAST parameters are suitable for finding shorter but nearly identical sequences. Both sets of default parameters will likely fail to detect other kinds of homology, especially short, conserved sequences such as cis-regulatory elements, which tend to be highly conserved and are often less than 10 nucleotides long.

Just how short can an HSP be and still generate a significant hit using WU-BLASTN defaults? Again, Karlin-Altschul statistics provide a basis for answering this question. First you need to know what raw score corresponds to an Expect of 1. The following Perl function calculates this value:

sub rawScoreOfExpectOne {
    my $k = shift;
    my $m = shift; # actual length of query
    my $n = shift; # actual length of database
    my $l = shift; # gapped lambda in nats

    # S_E=1 = ln(kmn)/l
    return  log($k*$m*$n)/$l;
}

For the ~124 mega-base Drosophila melanogaster genome, that raw score is about 15 with the NCBI-BLASTN defaults and about 170 for WU-BLASTN. Recall that the maximum score for aligning an oligo-nucleotide 25 bases long under the +5/-4 scheme is 125 (25*5). Even an alignment with an Expect of 1 has a raw score (170) greater than the maximal attainable score for a 25-mer under the WU-BLAST defaults! This is another reason why WU-BLASTN didn't report a hit.

To determine the length of an ungapped alignment that has 100 percent identity for a given raw score, divide the raw score by the match score. By this calculation, any oligo shorter than about 15 nucleotides (15/1) for NCBI-BLASTN and 34 nucleotides (170/5) for WU-BLASTN will have an Expect > 1. This means that the NCBI-BLASTN defaults are fine for mapping oligo-nucleotides to the Drosophila melanogaster genome. On the other hand, it appears that looking for short?less than 15 base-pair?cis-regulatory elements using either version of BLASTN with the default parameters is unlikely to be successful.

So what was the unreported WU-BLASTN Expect? Let's calculate it. With the data in Table 7-3 and the previously calculated effective HSP length of 294, first calculate m´ and n´ using the Perl functions effectiveLengthSeq and effectiveLengthDB. Plugging m´ and n´ together with the WU-BLASTN l and k and a raw score of 125 into the rawScoreToExpect function gives an Expect of 281. Recall that the NCBI-BLASTN Expect was 1e^-6. That's a 281-million-fold difference. BLAST is clearly parameter-sensitive! Using the default parameters, you instructed NCBI-BLASTN to search for short highly conserved regions, and it found one. WU-BLASTN, on the other hand, is parameterized to look for large regions of relatively low percent identity. This would be fine for cross-species searches of poorly conserved exons but is inappropriate for finding oligos.

Using BLAST intelligently requires using the correct parameters for the task at hand and not placing too much faith in the reported Expect. See the section on BLAST protocols in Chapter 9 for practical suggestions on BLAST parameter choice. Remember, you get what you look for.

7.3.2 What It All Means

You now know how bit scores, sum scores, Expects, and P-values are calculated. You've also seen first-hand that scoring matrices and target frequencies aren't merely theoretical abstractions but realities that determine the outcome of a BLAST search. In some ways, choosing the right scoring scheme for a BLAST search is like choosing the right pair of eyeglasses. If your scoring scheme is too stringent, BLAST becomes nearsighted and will miss distant homologies. If your scheme is too lenient, BLAST becomes farsighted and fails to detect the obvious. Unfortunately, there's no optimal scoring scheme. As in real life, sometimes the best you can do is put on bifocals.

You've also seen that searching the same sequence and database with varied parameters can result in different alignments having very different Expects. Scores and E-values aren't implicit in a sequence or an alignment; they are solely contingent upon parameter values and the methods used to assess significance. There is nothing absolute about a BLAST significance value; it merely denotes the significance of an alignment in the context of a given search. Like everything else in bioinformatics, the biological implications of a (significant) alignment are inferred by the user and should be tested experimentally, if possible.

Hopefully, you've also learned that there is more to Karlin-Altschul statistics than simply calculating an Expect for an alignment. Karlin-Altschul statistics provide a theoretical framework from which to interpret alignment scores in the context of parameter choice. They also give you the means to tune BLAST for specific purposes. Without them, you'd have no way of knowing what a given scoring scheme was looking for, and you'd cast around in the dark for the right set of parameters. Karlin-Altschul statistics remove the mystery from parameter choice. BLAST certainly has its limitations, but thanks to its statistical foundation, at least you know what you're looking for.