# 4.1 Introduction to Information Theory

In common usage, the word information conveys many things. Set its everyday meanings aside for a moment, because information theory gives it a precise definition: information is a decrease in uncertainty. You can also think of information as a degree of surprise.

##### Figure 4-1. Equation 4-1

`H = log2(1/p) = -log2(p)`

The information, H, associated with some probability p is by convention the base 2 logarithm of the inverse of p. Values computed with base 2 logarithms are given the unit bits, which is a contraction of the words binary and digit (it is also common to use base e, and the corresponding unit is nats). For example, if the probability that a child doesn't like ice cream is 0.25, this answer carries 2 bits of information (liking ice cream, with probability 0.75, carries 0.415 bits).
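Equation 4-1 is easy to check with a few lines of code. The sketch below (the function name `bits` is mine, not the chapter's) converts a probability into bits of information by taking the natural log and changing the base:

```perl
use strict;
use warnings;

# information in bits for a probability p: H = log2(1/p) = -log2(p)
# Perl's log() is base e, so divide by log(2) to convert to base 2
sub bits {
    my ($p) = @_;
    return -log($p) / log(2);
}

printf "p = 0.25 -> %.3f bits\n", bits(0.25);   # prints 2.000
printf "p = 0.75 -> %.3f bits\n", bits(0.75);   # prints 0.415
```

Note that rarer outcomes (smaller p) carry more bits: surprise and information go together.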

It is typical to describe information as a message of symbols emitted from a source. For example, tossing a coin is a source of head and tail symbols, and a message of such symbols might be:

`tththttt`

Similarly, the numbers 1, 2, 3, 4, 5, and 6 are symbols emitted from a six-sided die source, and the letters A, C, G, and T are emitted from a DNA source. The symbols emitted by a source have a frequency distribution. If there are n symbols and the frequency distribution is flat, as it is for a fair coin or die, the probability of any particular symbol is simply 1/n. It follows that the information of any symbol is log2(n), and this value is also the average. The formal name for the average information per symbol is entropy.
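For a flat distribution, then, the entropy is just log2(n). A one-line helper confirms the values for the sources above (a minimal sketch; the name `uniform_entropy` is mine):

```perl
use strict;
use warnings;

# average information per symbol for n equally probable symbols: log2(n)
sub uniform_entropy {
    my ($n) = @_;
    return log($n) / log(2);
}

printf "coin (n=2): %.3f bits\n", uniform_entropy(2);   # prints 1.000
printf "die  (n=6): %.3f bits\n", uniform_entropy(6);   # prints 2.585
printf "DNA  (n=4): %.3f bits\n", uniform_entropy(4);   # prints 2.000
```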

But what if all symbols aren't equally probable? To compute the entropy, you need to weight the information of each symbol by its probability of occurring. This formulation, known as Shannon's Entropy (named after Claude Shannon), is shown in Figure 4-2.

##### Figure 4-2. Equation 4-2

`H = -Σi pi log2(pi)`

Entropy (H) is the negative sum, over all n symbols, of the probability of a symbol (pi) multiplied by the log base 2 of that probability (log2 pi). Let's work through a couple of examples to make this clear. Start with the flip of a coin and assume that h and t each have a probability of 0.5 and therefore a log2 probability of -1. The entropy of a coin is therefore:

` - ( (0.5)(-1) + (0.5)(-1) ) = 1 bit`

Suppose you have a trick coin that comes up heads 3/4 of the time. Since you're a little more certain of the outcome, you expect the entropy to decrease. The entropy of your trick coin is:

` - ( (0.75)(-0.415) + (0.25)(-2) ) = 0.81 bits`

A random DNA source has an entropy of:

`- ( (0.25)(-2) + (0.25)(-2) + (0.25)(-2) + (0.25)(-2) ) = 2 bits`

However, a DNA source that emits 90 percent A or T and 10 percent G or C has an entropy of:

`- ( 2(0.45)(-1.15) + 2(0.05)(-4.32) ) = 1.47 bits`
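All four worked examples follow from the same weighted sum, so a small helper makes them easy to reproduce (a sketch; `entropy` is my name for it, and zero probabilities are skipped since p·log p is taken as 0 when p = 0):

```perl
use strict;
use warnings;

# Shannon entropy, in bits, of a probability distribution
sub entropy {
    my @probs = @_;
    my $h = 0;
    foreach my $p (@probs) {
        next if $p == 0;               # 0 * log(0) is treated as 0
        $h -= $p * log($p) / log(2);   # subtract p * log2(p)
    }
    return $h;
}

printf "fair coin:  %.2f bits\n", entropy(0.5, 0.5);                # prints 1.00
printf "trick coin: %.2f bits\n", entropy(0.75, 0.25);              # prints 0.81
printf "random DNA: %.2f bits\n", entropy(0.25, 0.25, 0.25, 0.25);  # prints 2.00
printf "skewed DNA: %.2f bits\n", entropy(0.45, 0.45, 0.05, 0.05);  # prints 1.47
```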

In these examples, you've been given the frequency distribution as some kind of truth. But it's rare to know such things a priori; usually the parameters must be estimated from actual data. You may find the following Perl program informative and entertaining: it calculates the entropy of any file.

```perl
# Shannon Entropy Calculator
my %Count;      # stores the counts of each symbol
my $total = 0;  # total symbols counted

while (<>) {                            # read lines of input
    foreach my $char (split(//, $_)) {  # split the line into characters
        $Count{$char}++;                # add one to this character's count
        $total++;                       # add one to the total count
    }
}

my $H = 0;                         # H is the entropy
foreach my $char (keys %Count) {   # iterate through characters
    my $p = $Count{$char}/$total;  # probability of character
    $H += $p * log($p);            # p * log(p)
}
$H = -$H/log(2);                   # negate sum, convert base e to base 2
print "H = $H bits\n";             # output
```