4.4 Target Frequencies, lambda, and H

The most important property of a scoring matrix is its target frequencies and the expected frequencies of the individual amino acid pairs. Target frequencies represent the underlying evolutionary model. While scoring matrices don't actually contain the target frequencies, they are implicit in the scores.

The Karlin-Altschul statistical theory on which BLAST is based (discussed in the next section) states that all scoring schemes for which a positive score is possible (and the expected score is negative) have implicit target frequencies. Thus they are lod-odds scoring schemes; even a simple "+1 match -1 mismatch" scheme is implicitly a log-odds scoring scheme and has target frequencies. You'll learn how to calculate those frequencies in just a bit, but you first need to understand two additional concepts associated with scoring schemes: lambda and relative entropy.

4.4.1 Lambda

Raw score can be a misleading quantity because scaling factors are arbitrary. A normalized score, corresponding to the original lod score, is therefore a more useful measure. Converting a raw score to a normalized score requires a matrix-specific constant called lambda (or l). Lambda is approximately the inverse of the original scaling factor, but its value may be slightly different due to integer rounding errors. Let's now derive lambda.

When calculating target frequencies from multiple alignments, the sum of all target frequencies naturally sums to 1 (see Figure 4-6).

Figure 4-6. Equation 4-4

Recall from Figure 4-4 that the score of two amino acids is the log-odds ratio of the observed and expected frequencies. The same equation is presented in Figure 4-7, but the lod score is replaced by the product of lambda and the raw score (in other words, lambda had a value of 1 in Figure 4-4).

Figure 4-7. Equation 4-5

Figure 4-8 rearranges Figure 4-7 to solve for pair-wise frequency.

Figure 4-8. Equation 4-6

From Figure 4-8, you can see that a pair-wise frequency (qij) is implied from individual amino acid frequencies (pi and pj) and a normalized score (lSij). The key to solving for lambda is to provide the individual amino acid frequencies (piand pj) and find a value for lambda where the sum of the implied target frequencies equals one. The formulation is given in Figure 4-9 and later in Figure 4-1.

Figure 4-9. Equation 4-7

Normally, once lambda is estimated, it is used to calculate the Expect of every HSP in the BLAST report. Unfortunately, the residue frequencies of some proteins deviate widely from the residue frequencies used to construct the original scoring matrix. Recently, some versions of PSI-BLAST and BLASTP have therefore begun to use the query and subject sequence amino acid compositions to calculate a composition-based lambda. These "hit-specific" lambdas have been shown to improve BLAST sensitivity, so this approach may see wider use in the near future.

4.4.2 Relative Entropy

The expected score of a scoring matrix is the sum of its raw scores weighted by their frequencies of occurrence (see Figure 4-10). The expected score should always be negative.

Figure 4-10. Equation 4-8

The relative entropy of a scoring matrix (H) conveniently summarizes the general behavior of a scoring matrix. Its formulation is similar to the expected score (see Figure 4-11) but is calculated from normalized scores. H is the average number of bits (or nats) per position in an alignment and is always positive.

Figure 4-11. Equation 4-9

H of PAM1 is greater than the H PAM120. Recall that the PAM120 matrix is derived from mutation probabilities for PAM1 extrapolated to 120 PAMs. The PAM120 matrix is therefore less specific, contains less information, and thus has a lower H. Similarly, BLOSUM80 has a greater H than BLOSUM62. This makes sense since BLOSUM80 was made from sequences that were more similar to one another than BLOSUM62.

Which PAM matrix is most similar to BLOSUM45? To answer this, you only need to determine the PAM matrix with an H closest to that of the BLOSUM45 matrix. By relative entropy, PAM250 is closest to BLOSUM45, PAM120 to BLOSUM80, and PAM180 to BLOSUM62.

4.4.3 Match-Mismatch Scoring

Now let's determine the target frequencies of a +1/-1 scoring scheme. We will explore this in the case of DNA alignments where match/mismatch scoring is frequently employed. For generality, assume that all nucleotide frequencies are equal to 0.25. This fixes the previous pⁱ and p^j terms. Example 4-1 shows a Perl script that contains an implementation for estimating lambda by making increasingly refined guesses at its value. Table 4-1 displays the expected score, lambda, H, and the expected percent identity for several nucleotide scoring schemes. Note that the match/mismatch ratio determines H and percent identity. As the ratio approaches 0, lambda approaches 2 bits, and the target frequency approaches 100 percent identity. Intuitively, this makes sense; if the mismatch score is -, all alignments have 100 percent identity, and observing an A is the same as observing an A-A pair.

Table 4-1. Nucleotide scoring schemes
Match	Mismatch	Expected score	l (bits)	H (bits)	% ID
+10	-10	-5	0.158	0.793	75
+1	-1	-0.5	1.58	0.791	75
+1	-2	-1.25	1.92	1.62	95
+1	-3	-2	1.98	1.89	99
+5	-4	-1.75	0.277	0.519	65

Example 4-1. A Perl script for estimating lambda

#!/usr/bin/perl -w
use strict;
use constant Pn => 0.25; # probability of any nucleotide

die "usage: $0 <match> <mismatch>\n" unless @ARGV == 2;
my ($match, $mismatch) = @ARGV;
my $expected_score = $match * 0.25 + $mismatch * 0.75;
die "illegal scores\n" if $match <= 0 or $expected_score >= 0;

# calculate lambda
my ($lambda, $high, $low) = (1, 2, 0); # initial estimates
while ($high - $low > 0.001) {         # precision

    # calculate the sum of all normalized scores
    my $sum = Pn * Pn * exp($lambda * $match)    * 4
            + Pn * Pn * exp($lambda * $mismatch) * 12;

    # refine guess at lambda
    if ($sum > 1) {
        $high = $lambda;
        $lambda = ($lambda + $low)/2;
    }
    else {
        $low = $lambda;
        $lambda = ($lambda + $high)/2;
    }
}

# compute target frequency and H
my $targetID = Pn * Pn * exp($lambda * $match) * 4;
my $H = $lambda * $match    *     $targetID
      + $lambda * $mismatch * (1 -$targetID);

# output
print "expscore: $expected_score\n";
print "lambda:   $lambda nats (", $lambda/log(2), " bits)\n";
print "H:        $H nats (", $H/log(2), " bits)\n";
print "%ID:      ", $targetID * 100, "\n";