2.6 Data Structures in Action

The previous sections introduced a fair amount of new Perl syntax and capabilities. Now, let's see some of these new capabilities in action.

2.6.1 The Problem of String Matching

It is frequently important in biology to find the best possible match for a short sequence in a longer sequence; for example, between an oligonucleotide and the sequence of DNA that has been cloned in a YAC or BAC. This match need not always be perfect; frequently, what is important is to find the closest match available. This problem is known in computer science as approximate string matching, and dynamic programming is a popular technique used to compute the solution.

The problem of string matching is to find a pattern, such as a nucleotide or peptide fragment, in a longer text such as a chromosome or protein.

The problem of approximate string matching is to find a pattern in a text in which the match might not be perfect. Perhaps a few of the characters are different or missing; the problem is to find the best match possible.

2.6.2 Genetic Variability and String Matching

Biologically, approximate matches are of commanding importance. Evolutionary changes between species can make genes with essentially the same function collect a fair number of individual base changes; they may even have acquired differences in exon structure. Even within a species, individual base changes among groups in the population (single nucleotide polymorphisms) are important causes of disease and important clues in the discovery of disease-causing genes.

Mutations tend to accumulate over time in noncoding regions of DNA; mutations in coding regions tend to avoid altering critical regions essential for the functioning of the gene (where mutations may be fatal to the organism). Even a noncoding region may be critical to the regulation of a gene and thus tend to resist mutations. As a result, studying where mutations are not accumulating is often an important clue to discerning the function and control of a gene and its associated protein.

Due to the redundancy of the genetic code, mutations in DNA may not affect the coding of the associated protein. Other mutations may make a change in a coded amino acid, but it may be an amino acid with similar properties to the original amino acid; for instance, both amino acids may be hydrophilic. Tracing these kinds of mutations is another source of important information about the process of mutation. It can give vital clues to the conservation of critical coding regions and to the folding of proteins.

Biologists will have no difficulty expanding upon the preceding brief motivation for studying approximate string matching and other techniques that find similarities between biological molecules. Many standard laboratory techniques rely on the annealing of a molecule to another molecule with similar, but not necessarily exact, structure.

In the discussion that follows, you'll use your new knowledge of complex data structures in Perl to implement an algorithm that finds approximate matches of patterns in text, such as DNA fragments in chromosomes or peptide fragments in proteins.

The algorithm I present is an example of dynamic programming; it relies on computing the entries to a matrix.