4.7 Sum Statistics and Sum Scores

BLAST uses Figure 4-16 to calculate the normalized score of an individual HSP, but it uses a different function to calculate the normalized score of a group of HSPs (see Chapter 7 for more information about sum statistics).

Figure 4-16. Equation 4-14

Before tackling the actual method used by BLAST to calculate a sum score, it's helpful to consider the problem from a general perspective. One simple and intuitive approach might be to sum the raw scores, S_r, for a set of HSPs and then convert that sum to a normalized score by multiplying by λ (see Figure 4-17).

Figure 4-17. Equation 4-15

The problem with such an approach is that summing the scores for a collection of r HSPs always results in a higher score, even if none of those HSPs has a significant score on its own. In practice, BLAST controls for this by penalizing the sum score by a factor proportional to the product of the number of HSPs, r, and the search space, as shown in Figure 4-18.

Figure 4-18. Equation 4-16
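The effect of the penalty can be illustrated with a minimal Python sketch. It assumes the penalty takes the form r ln(Kmn), where K is the Karlin-Altschul constant and mn is the search space; that form is inferred from the description above, not copied from Figure 4-18. The function names and all numeric values are hypothetical.

```python
import math

def naive_sum_score(raw_scores, lam):
    """Unpenalized sum score: lambda times the summed raw scores."""
    return lam * sum(raw_scores)

def unordered_sum_score(raw_scores, lam, K, m, n):
    """Assumed unordered-sum form: lambda * sum(S_r) - r * ln(K * m * n)."""
    r = len(raw_scores)
    return lam * sum(raw_scores) - r * math.log(K * m * n)

# Hypothetical parameters, chosen only for illustration.
lam, K = 0.318, 0.134
m, n = 300, 1_000_000

for scores in ([60], [60, 25], [60, 25, 25]):
    print(len(scores),
          round(naive_sum_score(scores, lam), 2),
          round(unordered_sum_score(scores, lam, K, m, n), 2))
```

With these values, each weak HSP raises the naive score but lowers the penalized one, because its contribution λS is smaller than ln(Kmn).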

Figure 4-18 is sometimes referred to as an unordered-sum score and is suitable for calculating the sum score for a collection of noncollinear HSPs. Ideally, though, you should use a sum score formulation that rewards a collection of HSPs if they are collinear with regard to their query and subject coordinates, because the HSPs comprising real BLAST hits often have this property. BLASTX hits, for example, often consist of collinear HSPs corresponding to the sequential exons of a gene. Figure 4-19 is a sum score formulation that does just that.

Figure 4-19. Equation 4-17

Figure 4-19 is sometimes referred to as a pair-wise ordered sum score. Note the additional term ln r!, which can be thought of as a bonus added to the sum score when the HSPs under consideration are all consistently ordered.
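As a rough illustration of the ordering bonus, the sketch below adds ln r! to an assumed unordered-sum form, λΣS_r minus r ln(Kmn). The exact form is an assumption based on the prose; the function name is hypothetical.

```python
import math

def ordered_sum_score(raw_scores, lam, K, m, n):
    """Assumed pair-wise ordered form:
    lambda * sum(S_r) - r * ln(K * m * n) + ln(r!).
    """
    r = len(raw_scores)
    ln_r_factorial = math.lgamma(r + 1)  # lgamma(r + 1) equals ln(r!)
    return lam * sum(raw_scores) - r * math.log(K * m * n) + ln_r_factorial
```

For a single HSP the bonus is ln 1! = 0, so the score is unchanged; for three consistently ordered HSPs the bonus is ln 6, or about 1.79.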

One shortcoming of Figure 4-18 and Figure 4-19 is that they impose a sizable penalty for each additional HSP raw score added to the sum score. To improve the sensitivity of its sum statistics, NCBI-BLASTX employs a modified version of the pair-wise ordered sum score (Figure 4-19) that is influenced less by the search space and contains a term for the size of the gaps between the HSPs (Figure 4-20). The advantage of this formulation is that the gap size, g, rather than the search space, mn, is multiplied by r. For short gaps and big search spaces, this formulation results in larger sum scores.

Figure 4-20. Equation 4-18
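To see why a gap-based penalty can be gentler, the following sketch compares two assumed penalty terms. The first is r ln(Kmn). The second is ln(Kmn) + (r - 1)(ln K + 2 ln g), in which only the gap size g is scaled by the number of HSPs. Both forms are assumptions reconstructed from the prose, and any factorial term in the full formulation is deliberately left out. The function names and values are hypothetical.

```python
import math

def search_space_penalty(r, K, m, n):
    """Assumed penalty when the search space is multiplied by r."""
    return r * math.log(K * m * n)

def gap_based_penalty(r, K, m, n, g):
    """Assumed penalty when only the gap size g is multiplied by (r - 1)."""
    return math.log(K * m * n) + (r - 1) * (math.log(K) + 2 * math.log(g))

# Hypothetical values: a large search space and a short gap.
K, m, n, g = 0.134, 300, 1_000_000, 50
for r in (1, 2, 3):
    print(r,
          round(search_space_penalty(r, K, m, n), 2),
          round(gap_based_penalty(r, K, m, n, g), 2))
```

Under these assumptions the two penalties agree for a single HSP, and the gap-based penalty grows more slowly as HSPs are added, which is what yields the larger sum scores.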

4.7.1 Converting a Sum Score to a Sum Probability

The aggregate pair-wise P-value for a sum score can be approximated using Figure 4-21.

Figure 4-21. Equation 4-19
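A minimal sketch of this conversion follows. It assumes the commonly cited asymptotic approximation P_r ≈ e^(-S) S^(r-1) / (r! (r-1)!), where S is the normalized sum score; check it against Figure 4-21 before relying on it. The function name is hypothetical, and the clamp to 1.0 is a convenience added here, not part of the formula.

```python
import math

def sum_p_value(sum_score, r):
    """Approximate P-value for a normalized sum score over r HSPs.

    Assumed form: exp(-S) * S**(r - 1) / (r! * (r - 1)!).
    Computed in log space to avoid overflow in the factorials.
    """
    if r < 1:
        raise ValueError("r must be at least 1")
    if sum_score <= 0:
        raise ValueError("the approximation assumes a positive sum score")
    log_p = (-sum_score
             + (r - 1) * math.log(sum_score)
             - math.lgamma(r + 1)   # ln(r!)
             - math.lgamma(r))      # ln((r - 1)!)
    # The approximation is only meaningful for large scores; clamp so the
    # result is never reported as a probability above 1.
    return min(1.0, math.exp(log_p))
```

For r = 1 the expression reduces to e^(-S), so a single HSP with a normalized score of 10 gets a P-value of about 4.5e-5.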

Thus, when sum statistics are being employed, BLAST not only uses a different score, it also uses a different formula to convert that score into a probability; the standard Karlin-Altschul equation (Figure 4-12) isn't used to convert a sum score to an Expect.

BLAST groups a set of HSPs only if their aggregate P-value is less than the P-value of any individual member, and that group is an optimal partition such that no other grouping might result in a lower P-value. Obviously, finding these optimal groupings of HSPs requires many significance tests. It is common practice in the statistical world to multiply a P-value associated with a significant discovery by some number proportional to the number of tests performed in the course of its discovery to give a test-corrected P-value. The correction function used by BLAST for this purpose is given in Figure 4-22. The resulting value, P'_r, is a pair-wise test-corrected sum-P.

Figure 4-22. Equation 4-20

In this equation, β is the gap decay constant (its value can be found in the footer of a standard BLAST report).

The final step in assigning an E-value to a group of HSPs is to adjust the pair-wise test-corrected sum-P for the size of the database. The formula used by NCBI-BLAST (Figure 4-23) divides the effective length of the database by the actual length of the particular database sequence in the alignment and then multiplies the pair-wise test-corrected sum-P by the result.

Figure 4-23. Equation 4-21

NCBI-BLAST and WU-BLAST compute combined statistical significance a little differently. The previous descriptions apply to NCBI-BLAST only. The two programs probably have many similarities, but the specific formulations for WU-BLAST are unpublished.

4.7.2 Probability Versus Expectation

While NCBI-BLAST reports an Expect, WU-BLAST reports both the E-value and a P-value. An E-value tells you how many alignments with a given score are expected by chance. A P-value tells you the probability of seeing at least one such alignment by chance. Each measure can be converted to the other using Figure 4-24 and Figure 4-25.
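A short sketch of the conversion, assuming the standard Poisson relationship P = 1 - e^(-E) and its inverse E = -ln(1 - P). The function names are hypothetical.

```python
import math

def e_to_p(e_value):
    """P = 1 - exp(-E); expm1 keeps precision when E is very small."""
    return -math.expm1(-e_value)

def p_to_e(p_value):
    """E = -ln(1 - P); log1p keeps precision when P is very small."""
    if not 0.0 <= p_value < 1.0:
        raise ValueError("p_value must be in [0, 1)")
    return -math.log1p(-p_value)
```

For small values the two measures are nearly identical (an E-value of 1e-10 corresponds to a P-value of about 1e-10), but they diverge as E grows: an E-value of 1 corresponds to a P-value of about 0.63.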