4.1 Introduction to Information Theory

In common usage, the word information conveys many things. Forget everything you know about this word, because you're about to learn a precise definition: information is a decrease in uncertainty. You can also think of information as a degree of surprise.

Suppose you're taking care of a child and the response to every question you ask is "no." The child is very predictable, and you are pretty certain of the answer the next time you ask a question. There's no surprise, no information, and no communication. If another child answers "yes" or "no" to some questions, you can communicate a little, but you could communicate more if her vocabulary were greater. If you ask "do you like ice cream," which most children do, you would be informed by either answer, but more surprised if the answer were "no." Qualitatively, you expect more information to be conveyed by a greater vocabulary and by surprising answers. Thus, the information or surprise of an answer is inversely related to its probability. Quantitatively, information is represented by either of the equivalent formulations shown in Figure 4-1.

    H = log2(1/p) = -log2(p)

Figure 4-1. Equation 4-1

The information, H, associated with some probability p, is by convention the base 2 logarithm of the inverse of p. Values converted to base 2 logarithms are given the unit bits, which is a contraction of the words binary and digit (it is also common to use base e, and the corresponding unit is nats). For example, if the probability that a child doesn't like ice cream is 0.25, this answer has 2 bits of information (liking ice cream has 0.41 bits).
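Perl's built-in log function uses base e, so converting to bits takes one extra division. The following sketch (not part of the original text; the log2 and information helpers are names chosen here for illustration) checks the ice-cream numbers above:

```perl
#!/usr/bin/perl
# Information (surprise) in bits associated with a probability p
use strict;
use warnings;

sub log2        { my ($x) = @_; return log($x) / log(2) }  # Perl's log() is base e
sub information { my ($p) = @_; return log2(1 / $p) }      # H = log2(1/p)

printf "doesn't like ice cream (p = 0.25): %.3f bits\n", information(0.25);
printf "likes ice cream (p = 0.75): %.3f bits\n", information(0.75);
```

Running it prints 2.000 bits for the "no" answer and 0.415 bits for the "yes" answer, matching the values in the text.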

It is typical to describe information as a message of symbols emitted from a source. For example, tossing a coin is a source of head (h) and tail (t) symbols, and a message of such symbols might be:

    t h h t h t t t h h
Similarly, the numbers 1, 2, 3, 4, 5, and 6 are symbols emitted from a six-sided die source, and the letters A, C, G, and T are emitted from a DNA source. The symbols emitted by a source have a frequency distribution. If there are n symbols and the frequency distribution is flat, as it is for a fair coin or die, the probability of any particular symbol is simply 1/n. It follows that the information of any symbol is log2(n), and this value is also the average. The formal name for the average information per symbol is entropy.
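The log2(n) rule for flat distributions is easy to verify for the three sources just mentioned. A minimal sketch (the log2 helper is an assumed name, not from the original text):

```perl
#!/usr/bin/perl
# Average information (entropy) of a flat distribution over n symbols is log2(n)
use strict;
use warnings;

sub log2 { my ($x) = @_; return log($x) / log(2) }

printf "coin (n = 2): %.2f bits\n", log2(2);
printf "die  (n = 6): %.2f bits\n", log2(6);
printf "DNA  (n = 4): %.2f bits\n", log2(4);
```

A fair coin yields 1 bit per toss, a fair die about 2.58 bits per roll, and a random DNA source 2 bits per nucleotide.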

But what if all symbols aren't equally probable? To compute the entropy, you need to weight the information of each symbol by its probability of occurring. This formulation, known as Shannon's Entropy (named after Claude Shannon), is shown in Figure 4-2.

    H = - Σ(i=1..n) pi log2(pi)

Figure 4-2. Equation 4-2

Entropy (H) is the negative sum over all the symbols (n) of the probability of a symbol (pi) multiplied by the log base 2 of the probability of that symbol (log2 pi). Let's work through a couple of examples to make this clear. Start with the flip of a coin and assume that h and t each have a probability of 0.5 and therefore a log2 probability of -1. The entropy of a coin is therefore:

 - ( (0.5)(-1) + (0.5)(-1) ) = 1 bit

Suppose you have a trick coin that comes up heads 3/4 of the time. Since you're a little more certain of the outcome, you expect the entropy to decrease. The entropy of your trick coin is:

 - ( (0.75)(-0.415) + (0.25)(-2) ) = 0.81 bits

A random DNA source has an entropy of:

- ( (0.25)(-2) + (0.25)(-2) + (0.25)(-2) + (0.25)(-2) ) = 2 bits

However, a DNA source that emits 90 percent A or T and 10 percent G or C has an entropy of:

- ( 2(0.45)(-1.15) + 2(0.05)(-4.32) ) = 1.47 bits
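All four worked examples follow the same pattern, so they can be checked with one small entropy function. This sketch is not from the original text; the function name and the decision to skip zero-probability terms (0 log 0 is taken as 0) are choices made here:

```perl
#!/usr/bin/perl
# Shannon entropy of a probability distribution: H = -sum( p * log2(p) )
use strict;
use warnings;

sub entropy {
    my (@probs) = @_;
    my $H = 0;
    foreach my $p (@probs) {
        next unless $p > 0;            # treat 0 * log2(0) as 0
        $H -= $p * log($p) / log(2);   # accumulate -p * log2(p)
    }
    return $H;
}

printf "fair coin:   %.2f bits\n", entropy(0.50, 0.50);
printf "trick coin:  %.2f bits\n", entropy(0.75, 0.25);
printf "random DNA:  %.2f bits\n", entropy(0.25, 0.25, 0.25, 0.25);
printf "AT-rich DNA: %.2f bits\n", entropy(0.45, 0.45, 0.05, 0.05);
```

The output reproduces the hand calculations: 1.00, 0.81, 2.00, and 1.47 bits, respectively.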

In these examples, you've been given the frequency distribution as some kind of truth. But it's rare to know such things a priori, and the parameters must be estimated from actual data. You may find the following Perl program informative and entertaining. It calculates the entropy of any file.

# Shannon Entropy Calculator
use strict;
use warnings;

my %Count;      # stores the counts of each symbol
my $total = 0;  # total symbols counted

while (<>) {                           # read lines of input
    foreach my $char (split(//, $_)) { # split the line into characters
        $Count{$char}++;               # add one to this character count
        $total++;                      # add one to total counts
    }
}

my $H = 0;                          # H is the entropy
foreach my $char (keys %Count) {    # iterate through characters
    my $p = $Count{$char}/$total;   # probability of character
    $H += $p * log($p);             # p * log(p)
}
$H = -$H/log(2);                    # negate sum, convert base e to base 2
print "H = $H bits\n";              # output