4.6 Karlin-Altschul Statistics

In 1990, Samuel Karlin and Stephen Altschul published a theory of local alignment statistics. Karlin-Altschul statistics make five central assumptions:

A positive score must be possible.
The expected score must be negative.
The letters of the sequences are independent and identically distributed (IID).
The sequences are infinitely long.
Alignments don't contain gaps.

The first two assumptions are true for any scoring matrix estimated from real data. The last three assumptions are problematic because biological sequences have context dependencies, aren't infinitely long, and are frequently aligned with gaps. You now know that both alignment and sequence similarity assume independence, and that this is a necessary convenience. You will soon see how sequence length and gaps can be accounted for. For now, though, let's turn to the Karlin-Altschul equation (see Figure 4-12):

Figure 4-12. Equation 4-10

This equation states that the number of alignments expected by chance (E) during a sequence database search is a function of the size of the search space (m*n), the normalized score (lS), and a minor constant (k).

In a database search, the size of the search space is simply the product of the number of letters in the query (m) and the number of letters in the database (n). A small adjustment (k) takes into account the fact that optimal local alignment scores for alignments that start at different places in the two sequences may be highly correlated. For example, a high-scoring alignment starting at residues 1, 1 implies a pretty high alignment score for an alignment starting at residues 2, 2 as well. The value of k is usually around 0.1, and its impact on the statistics of alignment scores is relatively minor, so don't bother with its derivation. According to Figure 4-12 the relationship between the expected number of alignments and the search space (mn) is linear. If the size of the search space is doubled, the expected number of alignments with a particular score also doubles. The relationship between the expected number of alignments and score is exponential. This means that small changes in score can lead to large differences in E.

4.6.1 Gapped Alignments

In practice, gaps reduce the stringency of a scoring scheme. In other words, an alignment score of 30 occurs more often in collection of gapped alignments than it does in a similar collection of ungapped alignments. How much more often depends on the cost of the gaps relative to the scoring matrix values. To determine the statistical significance of gapped alignments with Karlin-Altschul statistics (Figure 4-12), you must find values for lambda, k, and H for a particular scoring matrix and its associated gap initiation and extension costs. Unfortunately, you can't do this analytically, and the values must be estimated empirically. The procedure involves aligning random sequences with a specific scoring scheme and observing the alignment properties (scores, target frequencies, and lengths). The ungapped scoring matrix whose behavior is most similar to the gapped scoring scheme provides values for lambda, k, and H.

In the Karlin-Altschul theory, the distribution of alignment scores follows an extreme value distribution, a distribution that looks similar to a normal (Gaussian) distribution but falls off more quickly on one side and more slowly on the other side. Experiments show that gapped alignment scores fit the extreme value distribution as well. This fit is important because it strongly suggests that applying empirically derived values for lambda, k, and H to gapped alignment is statistically valid. Table 4-2 shows how much the parameters change by allowing gaps given the BLOSUM62 scoring matrix (also see Appendix A and Appendix C).

Table 4-2. Effect of gaps on BLOSUM62
Gap open	Gap extend	l	k	H (nats)
No gaps allowed	No gaps allowed	0.318	0.134	0.40
11	2	0.297	0.082	0.27
10	2	0.291	0.075	0.23
7	2	0.239	0.027	0.10

4.6.2 Length Correction

The Karlin-Altschul equation (Figure 4-12) gives the search space between two sequences as m*n, but not all this space can be effectively explored because some portion of it lies at either end of the sequences. As discussed in Chapter 5, BLAST operates by extending seeds in the alignment space. It can't effectively extend seeds near the ends of the sequences, though, because it runs out of room.

Karlin-Altschul statistics provides a way to calculate just how long a sequence must be before it can produce an alignment with a significant Expect. This minimum length, l, is usually referred to as the expected HSP length (see Figure 4-13)

Figure 4-13. Equation 4-11

Note that the expected HSP (high scoring pair) length is dependent on the search space (m*n) and the relative entropy of the scoring scheme, H, so it varies from search to search.

To take edge effects into account when calculating an Expect, the expected HSP length is subtracted from the actual length of the query, m, and the actual number of residues in the database, n, to give their effective lengths, usually denoted by m´ and n´, respectively (see Figure 4-14 and Figure 4-15).

Figure 4-14. Equation 4-12

Figure 4-15. Equation 4-13

In a large search space, the expected HSP length may be greater than the length of the query, resulting in a negative effective length, m´. In practice, if the effective length is less than 1/k, it is set to 1/k, as doing so cancels the contribution of the short sequence to the Expect; setting m' = 1/k for example, gives , a formulation independent of m'.

Unfortunately, effective lengths of less than 1/k aren't uncommon today. Because lan, the large size on many sequence databases can result in large expected HSP lengths. In fact it's not uncommon to see expected HSP lengthsapproaching 200 when searching some of the larger protein databases. Keep in mind that the average protein is ~300 amino acids long; thus, for many searches, m´ is being set to 1/k routinely. Recent work by S.F. Altschul and colleagues has suggested that part of the problem may be that Figure 4-13 overestimates l. They have proposed another way to calculate this value that results in shorter effective HSP lengths. Thus, the method used to calculate l may change in the not so distant future.