The most important property of a scoring matrix is its target frequencies and the expected frequencies of the individual amino acid pairs. Target frequencies represent the underlying evolutionary model. While scoring matrices don't actually contain the target frequencies, they are implicit in the scores.

The Karlin-Altschul statistical theory on which BLAST is based (discussed in the next section) states that all scoring schemes for which a positive score is possible (and the expected score is negative) have implicit target frequencies. Thus they are lod-odds scoring schemes; even a simple "+1 match -1 mismatch" scheme is implicitly a log-odds scoring scheme and has target frequencies. You'll learn how to calculate those frequencies in just a bit, but you first need to understand two additional concepts associated with scoring schemes: lambda and relative entropy.

Raw score
can be a misleading quantity because scaling factors are arbitrary. A
normalized score, corresponding to the original
lod score, is therefore a more useful measure. Converting a raw score
to a normalized score requires a matrix-specific constant called
lambda (or *l*).
Lambda is approximately the inverse of the original scaling factor,
but its value may be slightly different due to integer rounding
errors. Let's now derive lambda.

When calculating target frequencies from multiple alignments, the sum of all target frequencies naturally sums to 1 (see Figure 4-6).

Recall from Figure 4-4 that the score of two amino acids is the log-odds ratio of the observed and expected frequencies. The same equation is presented in Figure 4-7, but the lod score is replaced by the product of lambda and the raw score (in other words, lambda had a value of 1 in Figure 4-4).

Figure 4-8 rearranges Figure 4-7 to solve for pair-wise frequency.

From
Figure 4-8, you can see that a pair-wise frequency
(*q*ij) is implied from individual amino acid
frequencies (*p*i and *p*j) and
a normalized score (*lS*ij). The key to
solving for lambda is to provide the individual amino acid
frequencies (*p*i^{ }and
*p*j) and find a value for lambda where the sum of
the implied target frequencies equals one. The formulation is given
in Figure 4-9 and later in Figure 4-1.

Normally, once lambda is estimated, it is used to calculate the Expect of every HSP in the BLAST report. Unfortunately, the residue frequencies of some proteins deviate widely from the residue frequencies used to construct the original scoring matrix. Recently, some versions of PSI-BLAST and BLASTP have therefore begun to use the query and subject sequence amino acid compositions to calculate a composition-based lambda. These "hit-specific" lambdas have been shown to improve BLAST sensitivity, so this approach may see wider use in the near future.

The expected score of a scoring matrix is the sum of its raw scores weighted by their frequencies of occurrence (see Figure 4-10). The expected score should always be negative.

The
relative entropy of a scoring matrix
(*H*) conveniently summarizes the general behavior
of a scoring matrix. Its formulation is similar to the expected score
(see Figure 4-11)
but is calculated from normalized scores. *H* is
the average number of bits (or nats) per position in an alignment and
is always positive.

*H of* PAM1 is
greater than the *H* PAM120. Recall that the
PAM120 matrix is derived from mutation probabilities for PAM1
extrapolated to 120 PAMs. The PAM120 matrix is therefore less
specific, contains less information, and thus has a lower
H. Similarly, BLOSUM80 has a greater *H
*than BLOSUM62. This makes sense since BLOSUM80 was made
from sequences that were more similar to one another than BLOSUM62.

Which PAM matrix is most similar to
BLOSUM45? To answer this, you only need to determine the PAM matrix
with an *H* closest to that of the BLOSUM45
matrix. By relative entropy, PAM250 is closest to BLOSUM45, PAM120 to
BLOSUM80, and PAM180 to BLOSUM62.

Now
let's determine the target frequencies of a +1/-1
scoring scheme. We will explore this in the case of DNA
alignments where match/mismatch scoring is frequently
employed. For generality, assume that all nucleotide
frequencies are equal to 0.25. This fixes the previous
*p*^{i} and
*p*^{j} terms. Example 4-1 shows a
Perl script that contains an implementation for estimating lambda by
making increasingly refined guesses at its value. Table 4-1 displays the expected score, lambda,
*H*, and the expected percent identity for several
nucleotide scoring schemes. Note that the match/mismatch ratio
determines *H* and percent identity. As the ratio
approaches 0, lambda approaches 2 bits, and the target frequency
approaches 100 percent identity. Intuitively, this makes sense; if
the mismatch score is -, all alignments have 100
percent identity, and observing an A is the same as observing an A-A
pair.

Match |
Mismatch |
Expected score |
l (bits) |
H (bits) |
% ID |
---|---|---|---|---|---|

+10 |
-10 |
-5 |
0.158 |
0.793 |
75 |

+1 |
-1 |
-0.5 |
1.58 |
0.791 |
75 |

+1 |
-2 |
-1.25 |
1.92 |
1.62 |
95 |

+1 |
-3 |
-2 |
1.98 |
1.89 |
99 |

+5 |
-4 |
-1.75 |
0.277 |
0.519 |
65 |

#!/usr/bin/perl -w use strict; use constant Pn => 0.25; # probability of any nucleotide die "usage: $0 <match> <mismatch>\n" unless @ARGV == 2; my ($match, $mismatch) = @ARGV; my $expected_score = $match * 0.25 + $mismatch * 0.75; die "illegal scores\n" if $match <= 0 or $expected_score >= 0; # calculate lambda my ($lambda, $high, $low) = (1, 2, 0); # initial estimates while ($high - $low > 0.001) { # precision # calculate the sum of all normalized scores my $sum = Pn * Pn * exp($lambda * $match) * 4 + Pn * Pn * exp($lambda * $mismatch) * 12; # refine guess at lambda if ($sum > 1) { $high = $lambda; $lambda = ($lambda + $low)/2; } else { $low = $lambda; $lambda = ($lambda + $high)/2; } } # compute target frequency and H my $targetID = Pn * Pn * exp($lambda * $match) * 4; my $H = $lambda * $match * $targetID + $lambda * $mismatch * (1 -$targetID); # output print "expscore: $expected_score\n"; print "lambda: $lambda nats (", $lambda/log(2), " bits)\n"; print "H: $H nats (", $H/log(2), " bits)\n"; print "%ID: ", $targetID * 100, "\n";