8.5 Use the Karlin-Altschul Equation to Design Experiments

The Karlin-Altschul equation is very useful for predicting the outcome of a BLAST experiment, especially in large search spaces. Suppose you want to find exons in the human genome by looking for similarities in the pufferfish genome. These genomes last shared a common ancestor about 450 million years ago. You might assume that any similarities at this distance must be due to evolutionary conservation.

Recall from Chapter 4 that the number of alignments expected by chance (E) is a function of the search space (M, N), the normalized score (λS), and a minor constant (K):

$$E = KMNe^{-\lambda S}$$

The typical cross-species parameters, +1/-1 match/mismatch, have a target frequency of 75 percent identity and yield 0.55 nats per aligned letter on average (H). A 50-bp alignment therefore contains about 27.5 nats. Substituting this normalized score into the Karlin-Altschul equation with K=0.334, M=1.5 Gbp (assuming that repeats make up half of the human genome and are excluded from the search), and N=450 Mbp (the size of the repeat-poor pufferfish genome), you expect about 230,000 alignments by chance. That's roughly the same as the number of exons in the human genome. If you want to look for 50-bp exons, you'll have to sift through a lot of false positives.
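
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration rather than part of BLAST: the function name `expected_hits` and its parameters are hypothetical, and the inputs are the rounded values quoted in the text. With these rounded inputs the result comes out at a few hundred thousand, the same order of magnitude as the figure quoted above; the exact value depends on how H and K are rounded.

```python
import math

def expected_hits(m, n, nats, k=0.334):
    """Karlin-Altschul expectation E = K * M * N * exp(-lambda*S).

    `nats` is the normalized score (lambda*S). The name and signature
    are illustrative assumptions, not a BLAST API.
    """
    return k * m * n * math.exp(-nats)

H = 0.55                 # average nats per aligned letter for +1/-1 (from the text)
length = 50              # alignment length in bp
normalized_score = H * length   # 27.5 nats

M = 1.5e9                # repeat-free half of the human genome, in bp
N = 450e6                # pufferfish genome, in bp

e_value = expected_hits(M, N, normalized_score)
print(f"Expected chance alignments of {length} bp: {e_value:,.0f}")
```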

To bring the Karlin-Altschul expectation down to something more manageable, either look for larger exons or reduce your search space. A 72-nucleotide alignment is expected only about once by chance, and an alignment the size of a typical exon (110 bp) has a probability of about 1 in 1 billion of occurring by chance. An even better approach is to restrict the search to orthologous regions about the size of a typical gene. Here 50-bp alignments have a probability of approximately 1 in 10,000.
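
To explore these trade-offs yourself, you can solve the same equation for alignment length or re-evaluate it on a smaller search space. The sketch below is hedged accordingly: the helper names are hypothetical, and the 20-kb region size is an assumption chosen only for illustration, because the text does not say what size of "typical gene" it used.

```python
import math

K = 0.334   # Karlin-Altschul constant quoted in the text
H = 0.55    # average nats per aligned letter for +1/-1 scoring

def expected_hits(m, n, length, h=H, k=K):
    """E = K * M * N * exp(-H * length) for an alignment of `length` bp."""
    return k * m * n * math.exp(-h * length)

def length_for_expectation(m, n, target_e, h=H, k=K):
    """Alignment length (bp, fractional) at which E equals target_e.

    Obtained by solving E = K*M*N*exp(-H*L) for L.
    """
    return math.log(k * m * n / target_e) / h

GENOME_M, GENOME_N = 1.5e9, 450e6   # whole-genome search space from the text

# Length at which a chance alignment is expected about once.
once_length = length_for_expectation(GENOME_M, GENOME_N, 1.0)

# Expectation for an alignment the size of a typical exon.
exon_e = expected_hits(GENOME_M, GENOME_N, 110)

# Assumed, illustrative region size: 20 kb in each genome.
REGION = 20_000
region_e = expected_hits(REGION, REGION, 50)

print(f"E = 1 at about {once_length:.1f} bp")
print(f"E for 110 bp, whole genomes: {exon_e:.2e}")
print(f"E for 50 bp, {REGION} bp regions: {region_e:.2e}")
```

When E is much smaller than 1, it is a close approximation to the probability of seeing at least one such alignment by chance, which is why small expectations can be read as probabilities.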