In sequence analysis, homologous means derived from a common ancestor. Sequences are either homologous or they aren't. It is incorrect to say that sequences are 80 percent homologous unless you mean that there is an 80 percent chance of common ancestry. Use percent identity to describe the similarity of alignments.
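Percent identity can be computed directly from the two rows of an alignment. The sketch below is one minimal way to do it, assuming the alignment is given as two equal-length strings with "-" for gaps and that the denominator is the full alignment length (other conventions exist). The function name percent_identity is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of alignment columns in which both rows hold the same letter.

    Assumption: the denominator is the alignment length, gaps included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # Count columns where the letters agree; a gap never counts as a match.
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b)
        if x == y and x != gap
    )
    return 100.0 * matches / len(aligned_a)
```

For example, percent_identity("AC-T", "ACGT") counts three matching columns out of four.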


hydrophilic

Literally, "likes water." Water is a polar molecule that mixes well with other polar molecules. The charged amino acids K, R, D, and E are examples of hydrophilic amino acids.


hydrophobic

Literally, "fears water." Nonpolar molecules (like those in oils) don't mix well with water. The amino acids L, I, V, and F are particularly hydrophobic.


Karlin-Altschul statistics

The standard statistical theory of local alignment is often called Karlin-Altschul statistics after its founding authors.

lambda, λ

The Karlin-Altschul statistical parameter that converts a raw score to a normalized score.
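As a rough illustration of that conversion, the sketch below applies the usual Karlin-Altschul normalization to bits, S' = (lambda * S - ln K) / ln 2. It assumes a second Karlin-Altschul parameter, K, which this entry does not define, and the function name bit_score is hypothetical.

```python
import math

def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score to a normalized score in bits.

    lam and k are the Karlin-Altschul parameters for the scoring scheme;
    k (K) is an assumption here, since this entry only describes lambda.
    """
    if lam <= 0 or k <= 0:
        raise ValueError("lam and k must be positive")
    return (lam * raw_score - math.log(k)) / math.log(2)
```

Because lambda multiplies the raw score, two scoring schemes with different lambda values assign different normalized scores to the same raw score.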

local alignment

An alignment algorithm that finds the highest-scoring alignment between subsequences of two sequences. The alignment may include all letters of each sequence, but it isn't required to do so.
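A minimal dynamic-programming sketch of local alignment scoring follows. It assumes a simple match/mismatch scheme with a linear gap penalty (the default values are illustrative, not taken from the text) and returns only the best score, not the alignment itself. The function name local_alignment_score is hypothetical.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 1, mismatch: int = -1,
                          gap: int = -2) -> int:
    """Best local alignment score between any subsequences of a and b."""
    prev = [0] * (len(b) + 1)   # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment start anywhere, which is
            # what makes the alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Keeping only two rows at a time is enough when the score is all that is needed; recovering the aligned letters would require storing the full matrix or traceback pointers.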

low-complexity sequence

Regions of sequences that are highly predictable; for example, a region that is 90 percent A or T.
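One simple way to put a number on "predictable" is the Shannon entropy of the letter composition. The sketch below is only an illustration of compositional bias and is not the method used by any particular low-complexity filter. The function name shannon_entropy is hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(seq: str) -> float:
    """Shannon entropy of the letter composition of seq, in bits per letter."""
    if not seq:
        raise ValueError("sequence is empty")
    n = len(seq)
    counts = Counter(seq)
    # Sum of -p * log2(p) over the letters that actually occur.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

For DNA, the value ranges from 0 bits (a single repeated letter) to 2 bits (all four letters equally frequent), so lower values indicate a more biased composition.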


methionine

One of the 20 common amino acids. Methionine is abbreviated as M or Met and is especially important because translation of a protein begins with a methionine. There is only one codon for this amino acid: ATG.


mutation

Any change in the sequence of a DNA molecule.


N-terminus

The start of a protein. In text form, a protein's N-terminus is always at the left.


nat

Contraction for natural log digits. The base-e logarithm of a number is in units of nats.
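Nats and bits differ only in the base of the logarithm, so converting between them is a division or multiplication by ln 2. A minimal sketch, with hypothetical function names:

```python
import math

def nats_to_bits(nats: float) -> float:
    """Convert a quantity measured in nats (base e) to bits (base 2)."""
    return nats / math.log(2)

def bits_to_nats(bits: float) -> float:
    """Convert a quantity measured in bits (base 2) to nats (base e)."""
    return bits * math.log(2)
```

For example, the natural log of 8 is about 2.08 nats, which corresponds to 3 bits.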

natural selection

A theory founded by Charles Darwin that explains how organisms change over time to better fit their environment. It is based on the principles of variation, heritability, and differential reproduction.


ncRNA

The abbreviation for noncoding RNA. Some RNAs, like tRNAs or rRNAs, don't contain information for protein sequences.


Needleman-Wunsch

Global alignment is often called Needleman-Wunsch after the authors who first described the algorithm.
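A minimal sketch of global alignment scoring follows. It assumes a simple match/mismatch scheme with a linear gap penalty (the default values are illustrative, not taken from the text) and returns only the optimal score. The function name global_alignment_score is hypothetical.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1,
                           gap: int = -2) -> int:
    """Optimal score for aligning all of a against all of b."""
    # Row 0: aligning an empty prefix of a against b costs one gap per letter.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a prefix of a against an empty prefix of b.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    # The score in the last cell covers every letter of both sequences.
    return prev[len(b)]
```

Unlike a local alignment, the score here can be negative, because every letter of both sequences must be accounted for.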


nucleotide

The basic building block of nucleic acid sequences (DNA and RNA). DNA is made from A, C, G, and T, while RNA contains A, C, G, and U.


nt

The abbreviation for nucleotide.


O notation

The computational complexity of an algorithm is often described by its asymptotic behavior. O(n) problems grow linearly with the size of the input. O(log₂ n) problems grow much more slowly, and O(n²) problems grow much more quickly.


ORF

Abbreviation for open reading frame. Each strand of DNA has three frames. Any subsequence that doesn't contain stop codons in a particular frame is an open reading frame.
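The definition can be turned into a small scanner. The sketch below reports the maximal stop-free stretches in each of the three frames of one strand, using the stop codons TAA, TGA, and TAG. It is an illustration under stated assumptions: the function name, the min_codons cutoff, and the (frame, start, end) tuple layout are all hypothetical, and it does not require a start codon.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}

def open_reading_frames(dna: str, min_codons: int = 1):
    """Return (frame, start, end) for each maximal stop-free stretch.

    Coordinates are 0-based and end-exclusive, and exclude the stop codon.
    Only the given strand is scanned.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = frame
        pos = frame
        while pos + 3 <= len(dna):
            if dna[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((frame, start, pos))
                start = pos + 3   # next stretch begins after the stop codon
            pos += 3
        # Stretch that runs to the end of the sequence without a stop.
        if (pos - start) // 3 >= min_codons:
            orfs.append((frame, start, pos))
    return orfs
```

To cover the other strand as well, the same function could be run on the reverse complement of the sequence.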


orthologs

Genes that are separated by speciation (i.e., the same gene in different species). This is often approximated as the best reciprocal match between two complete genomes or proteomes.


palindrome

A palindrome in DNA is a sequence that is read the same on the plus and minus strands. For example, the sequence GAATTC is a palindrome. Palindromes and near-palindromes are often sites for DNA-protein interaction. Proteins scanning along DNA "see" a palindrome as the same sequence regardless of which direction they are moving.
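Reading the same on both strands means the sequence equals its own reverse complement. A minimal sketch of that check, assuming the sequence contains only the letters A, C, G, and T (the function names are hypothetical):

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna: str) -> str:
    """Sequence of the opposite strand, read in its own 5' to 3' direction."""
    return dna.upper().translate(_COMPLEMENT)[::-1]

def is_dna_palindrome(dna: str) -> bool:
    """True if the sequence reads the same on the plus and minus strands."""
    return dna.upper() == reverse_complement(dna)
```

With these definitions, is_dna_palindrome("GAATTC") is true, while is_dna_palindrome("GATTACA") is false.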


PAM

An acronym for Percent or Point Accepted Mutation. PAM scoring matrix names are usually followed by a number (e.g., PAM200), which indicates how many iterations of multiplication were used starting with the PAM1 matrix. A higher number indicates a more distant similarity.
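The repeated multiplication can be sketched with a toy example. The code below raises a substitution probability matrix to the n-th power by multiplying it by itself; the two-letter matrix used in the usage note is invented purely for illustration and is not the real 20 x 20 PAM1 matrix. The function names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def pam_n(pam1, n):
    """Mutation probability matrix after n rounds of multiplication by pam1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result
```

With the toy matrix [[0.99, 0.01], [0.01, 0.99]], the probability that a letter is unchanged drops as n grows, which is the sense in which higher-numbered matrices model more distant similarity. Note that this produces probabilities only; turning them into a log-odds scoring matrix is a separate step not shown here.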


paralogs

Genes that are duplicated within a single genome. Duplication sometimes allows one of the genes to take on a specialized function.


The study of evolutionary relationships among organisms.


prokaryote

Organisms whose cells don't contain membrane-bound intracellular organelles, such as a nucleus. All bacteria are prokaryotes.


proteome

The complete set of all proteins produced by a particular organism. Many proteins undergo post-translational modifications that add or subtract features from a protein. Therefore, a particular mRNA might have many different protein isoforms.


pseudogene

A sequence that looks like a gene but isn't. Most pseudogenes are derived from mRNAs that have been reverse-transcribed back to DNA and inserted into the genome. They have the hallmarks of RNA processing, notably a poly-A tail and no introns.

relative entropy

The average number of bits (or nats) per aligned letter for a given scoring scheme.
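As a worked illustration, the sketch below computes relative entropy in bits for a simplified nucleotide model. It assumes uniform base composition (each letter at 0.25) and target frequencies defined only by an expected fraction of identical pairs, with mismatches spread evenly over the 12 possible mismatched pairs. These modeling choices and the function name are assumptions made for the example.

```python
import math

def nucleotide_relative_entropy(identity: float) -> float:
    """Relative entropy in bits per aligned pair for a uniform-composition model.

    identity is the expected fraction of identical pairs, strictly between
    0 and 1.
    """
    if not 0.0 < identity < 1.0:
        raise ValueError("identity must be strictly between 0 and 1")
    p = 0.25                          # background frequency of each letter
    q_match = identity / 4            # each of the 4 identical pairs
    q_mismatch = (1 - identity) / 12  # each of the 12 mismatched pairs
    # Sum over all pairs of q_ij * log2(q_ij / (p_i * p_j)).
    h = 4 * q_match * math.log2(q_match / (p * p))
    h += 12 * q_mismatch * math.log2(q_mismatch / (p * p))
    return h
```

At 25 percent identity the target frequencies equal the background frequencies and the relative entropy is zero; as identity approaches 100 percent it approaches 2 bits per aligned pair.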


repeat

Any class of sequence that appears multiple times in a genome. Usually, gene families aren't called repeats, and the term is used for junk DNA. Some of the most common repeats in the human genome belong to the Alu and LINE families.

reverse transcriptase

A protein that creates DNA from an RNA template.


RNA

Ribonucleic acid. RNA is chemically similar to DNA but not used strictly for storage. Many RNA molecules have important functions in the cell and may even have enzymatic properties. Some of the most common functional RNA molecules include rRNAs and tRNAs.

RNA polymerase

A protein or multiprotein complex that creates RNA from a DNA template.


ribosome

A complex macromolecule made up of proteins and rRNAs. Ribosomes are responsible for translating mRNAs into proteins.


rRNA

Ribosomal RNA. The ribosome is composed of many specific RNA molecules, and these components are called rRNAs. rRNAs are some of the most abundant RNAs in a cell.


Smith-Waterman

Local alignment is often referred to as Smith-Waterman, after the authors who first described the algorithm.

start codon

ATG. Codes for the amino acid methionine. Many proteins have N-terminal post-translational modifications, and the first amino acid of the mature protein may therefore not be methionine.

stop codon

TAA, TGA, and TAG are the three codons that terminate translation.

sum statistics

A method that determines the aggregate statistical significance of multiple local alignments.

target frequency

The expected frequencies of individual letter pairings. For nucleotide scoring matrices, the target frequency is often summarized by the expected percent identity in sequences with unbiased composition.


transcriptome

The complete set of transcripts for a particular genome. This term is often used to mean the mRNAs of protein-coding genes and their alternatively spliced variants.


tRNA

The abbreviation for transfer RNA. tRNAs transfer individual amino acids to the ribosome. Each tRNA molecule has an anticodon that matches the reverse complement of the codon for the amino acid it carries.


UTR

The abbreviation for an untranslated region. The 5' and 3' ends of an mRNA have untranslated regions. These regions sometimes play regulatory roles that change the mRNA's stability, translatability, or localization.