In 1990, Samuel Karlin and Stephen Altschul published a theory of local alignment statistics. Karlin-Altschul statistics make five central assumptions:

A positive score must be possible.

The expected score must be negative.

The letters of the sequences are independent and identically distributed (IID).

The sequences are infinitely long.

Alignments don't contain gaps.

The first two assumptions are true for any scoring matrix estimated from real data. The last three assumptions are problematic because biological sequences have context dependencies, aren't infinitely long, and are frequently aligned with gaps. You now know that both alignment and sequence similarity assume independence, and that this is a necessary convenience. You will soon see how sequence length and gaps can be accounted for. For now, though, let's turn to the Karlin-Altschul equation (see Figure 4-12):

This equation states that the number of alignments expected by chance
(*E*) during a sequence database search is a
function of the size of the search space
(*m***n*), the normalized score
(*lS*), and a minor constant
(*k*).

In a
database search, the size of the search space is simply the product
of the number of letters in the query (*m*) and
the number of letters in the database (*n*). A
small adjustment (*k*) takes into account the fact
that optimal local alignment scores for alignments that start at
different places in the two sequences may be highly correlated. For
example, a high-scoring alignment starting at residues 1, 1 implies a
pretty high alignment score for an alignment starting at residues 2,
2 as well. The value of k is usually around 0.1,
and its impact on the statistics of alignment scores is relatively
minor, so don't bother with its derivation.
According to Figure 4-12 the relationship
between the expected number of alignments and the search space
(*mn*) is linear. If the size of the search space
is doubled, the expected number of alignments with a particular score
also doubles. The relationship between the expected number of
alignments and score is exponential. This means that small changes in
score can lead to large differences in *E*.

In practice, gaps reduce the stringency of a scoring scheme. In other words, an alignment score of 30 occurs more often in collection of gapped alignments than it does in a similar collection of ungapped alignments. How much more often depends on the cost of the gaps relative to the scoring matrix values. To determine the statistical significance of gapped alignments with Karlin-Altschul statistics (Figure 4-12), you must find values for lambda, k, and H for a particular scoring matrix and its associated gap initiation and extension costs. Unfortunately, you can't do this analytically, and the values must be estimated empirically. The procedure involves aligning random sequences with a specific scoring scheme and observing the alignment properties (scores, target frequencies, and lengths). The ungapped scoring matrix whose behavior is most similar to the gapped scoring scheme provides values for lambda, k, and H.

In the Karlin-Altschul theory, the distribution of alignment scores follows an extreme value distribution, a distribution that looks similar to a normal (Gaussian) distribution but falls off more quickly on one side and more slowly on the other side. Experiments show that gapped alignment scores fit the extreme value distribution as well. This fit is important because it strongly suggests that applying empirically derived values for lambda, k, and H to gapped alignment is statistically valid. Table 4-2 shows how much the parameters change by allowing gaps given the BLOSUM62 scoring matrix (also see Appendix A and Appendix C).

Gap open |
Gap extend |
l |
k |
H (nats) |
---|---|---|---|---|

No gaps allowed |
No gaps allowed |
0.318 |
0.134 |
0.40 |

11 |
2 |
0.297 |
0.082 |
0.27 |

10 |
2 |
0.291 |
0.075 |
0.23 |

7 |
2 |
0.239 |
0.027 |
0.10 |

The Karlin-Altschul equation (Figure 4-12) gives the search space between two
sequences as *m*n*, but not all this space can be
effectively explored because some portion of it lies at either end of
the sequences. As discussed in Chapter 5, BLAST
operates by extending seeds in the alignment space. It
can't effectively extend seeds near the ends of the
sequences, though, because it runs out of room.

Karlin-Altschul statistics provides a way to calculate just how long a sequence must be before it can produce an alignment with a significant Expect. This minimum length, l, is usually referred to as the expected HSP length (see Figure 4-13)

Note that the expected HSP (high scoring pair) length is dependent on the search space (m*n) and the relative entropy of the scoring scheme, H, so it varies from search to search.

To take edge effects into account when
calculating an Expect, the expected HSP length is subtracted from the
actual length of the query, m, and the actual
number of residues in the database, n, to give
their *effective* lengths, usually denoted by
m´ and
n´, respectively (see Figure 4-14 and Figure 4-15).

In a large search space, the expected HSP length may be greater than
the length of the query, resulting in a negative effective length,
m´. In practice, if the effective
length is less than 1/*k*, it is set to
1/*k*, as doing so cancels the contribution of the
short sequence to the Expect; setting
m' = 1/k for example, gives
, a formulation independent
of m'.

Unfortunately, effective lengths of less than 1/k aren't uncommon today. Because lan, the large size on many sequence databases can result in large expected HSP lengths. In fact it's not uncommon to see expected HSP lengthsapproaching 200 when searching some of the larger protein databases. Keep in mind that the average protein is ~300 amino acids long; thus, for many searches, m´ is being set to 1/k routinely. Recent work by S.F. Altschul and colleagues has suggested that part of the problem may be that Figure 4-13 overestimates l. They have proposed another way to calculate this value that results in shorter effective HSP lengths. Thus, the method used to calculate l may change in the not so distant future.