aa

The abbreviation for amino acid, often used when describing the length of a protein (e.g., the average protein is about 300 aa long).


allele

A form of a gene. Typically, the most common form is called wild-type, and each allele is given a specific (and often obscure) name.

amino acid

The basic building block for all proteins. There are 20 common amino acids.

Arabidopsis thaliana

Known by its common name, thale cress, this mustard weed is a favorite organism for plant genetics and molecular biology. It was the first plant with a complete genomic sequence. For more information, see http://www.arabidopsis.org.


bit

The contraction for binary digit. The base-2 logarithm of a number is in units of bits.
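As a minimal sketch of the base-2 logarithm mentioned above: Perl's built-in log is the natural logarithm, so dividing by log(2) gives a value in bits. The subroutine name log2 is an illustrative choice, not something defined in the text.

```perl
use strict;
use warnings;

# Perl's log() is the natural logarithm; divide by log(2) to get bits.
sub log2 {
    my ($n) = @_;
    return log($n) / log(2);
}

# Choosing one of 4 equally likely nucleotides takes 2 bits;
# one of 20 equally likely amino acids takes a little over 4 bits.
printf "%.2f\n", log2(4);
printf "%.2f\n", log2(20);
```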


BLOSUM

The abbreviation for a blocks substitution matrix. Matrix names are followed by a number (e.g., BLOSUM62) that indicates the percent identity threshold at which aligned sequences were clustered when the matrix was built.


bp

The abbreviation for base pair. The length of DNA is usually given in bp or nt. Common measures include Kb, Mb, and Gb for thousands, millions, and billions of bp, respectively.


C-terminus

The carboxyl end of a protein, which is the last part to be synthesized. In text form, the C-terminus of the protein is always at the right.

Caenorhabditis elegans

A nematode (also called a roundworm) that is about 1 mm long and has about 1,000 cells as an adult. C. elegans was the first animal to have its complete genome sequenced. See http://www.wormbase.org.


CDS

The abbreviation for coding sequence. CDS isn't synonymous with exon, since exons may contain noncoding sequence.


codon

Three contiguous letters of DNA or RNA. Each of the 64 codons specifies either an amino acid or a translation stop.
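The following sketch splits a sequence into codons and enumerates all 64 possible triplets. The subroutine name codons and the example sequence are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Split a sequence into contiguous 3-letter codons, starting at
# offset $frame (0, 1, or 2). A trailing partial codon is dropped.
sub codons {
    my ($seq, $frame) = @_;
    $frame //= 0;
    return () if $frame >= length($seq);
    return substr($seq, $frame) =~ /(...)/g;
}

# Enumerate every possible codon: 4 x 4 x 4 = 64.
my @nt = qw(A C G T);
my @all;
for my $x (@nt) {
    for my $y (@nt) {
        for my $z (@nt) {
            push @all, "$x$y$z";
        }
    }
}

print join(" ", codons("ATGGCCTAAGG")), "\n";    # ATG GCC TAA
print scalar(@all), "\n";                        # 64
```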


complement

The complement of a DNA sequence is the sequence on the other strand. For example, the complement of ACCCGT is TGGGCA. To complement a sequence in Perl, use either of the following:

# 4-letter alphabet
$dna =~ tr/ACGT/TGCA/;
# 15-letter alphabet (IUPAC ambiguity codes)
$dna =~ tr/ACGTRYMKWSBDHVN/TGCAYRKMWSVHDBN/;

Drosophila melanogaster

The common fruit fly. This is one of the most famous organisms for genetic research and was one of the first animals whose complete genomic sequence was determined. See http://www.fruitfly.org.

dynamic programming

A common technique that reduces the computational complexity of a problem by finding and extending a partial optimization.
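As one illustration of the idea, the sketch below computes the length of the longest common subsequence of two strings. The problem and the subroutine name lcs_length are assumptions picked for the example; the entry itself doesn't name a particular algorithm.

```perl
use strict;
use warnings;

# Length of the longest common subsequence of two strings.
# $len[$i][$j] holds the best answer for the first $i letters of $s
# and the first $j letters of $t. Each cell extends an already-solved
# smaller problem instead of recomputing it.
sub lcs_length {
    my ($s, $t) = @_;
    my @len;
    $len[$_][0] = 0 for 0 .. length($s);
    $len[0][$_] = 0 for 0 .. length($t);
    for my $i (1 .. length($s)) {
        for my $j (1 .. length($t)) {
            if (substr($s, $i - 1, 1) eq substr($t, $j - 1, 1)) {
                $len[$i][$j] = $len[$i - 1][$j - 1] + 1;
            }
            else {
                my ($up, $left) = ($len[$i - 1][$j], $len[$i][$j - 1]);
                $len[$i][$j] = $up > $left ? $up : $left;
            }
        }
    }
    return $len[length($s)][length($t)];
}

print lcs_length("ACCGGT", "ACGT"), "\n";    # 4
```

The table has one cell per pair of prefixes, so the work grows with the product of the two lengths rather than with the number of possible subsequences.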

E. coli

Escherichia coli. A common bacterium normally found in your gut and a favorite organism for molecular biology research. Some variants cause food poisoning.

effective length

Karlin-Altschul statistics assume sequences of infinite length. To adjust for edge effects in real sequences, the search space is reduced by adjusting the true lengths of the sequences to effective lengths.


entropy

Randomness; disorder; unpredictability.
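One way to put a number on entropy is the Shannon entropy of a sequence's letter frequencies, measured in bits per letter. The sketch below assumes that sense of the word; the subroutine name shannon_entropy is hypothetical.

```perl
use strict;
use warnings;

# Shannon entropy, in bits per letter, of the letter frequencies
# in a sequence: H = -sum(p * log2(p)).
sub shannon_entropy {
    my ($seq) = @_;
    my $total = length($seq);
    return 0 unless $total;
    my %count;
    $count{$_}++ for split //, $seq;
    my $h = 0;
    for my $n (values %count) {
        my $p = $n / $total;
        $h -= $p * log($p) / log(2);
    }
    return $h;
}

printf "%.2f\n", shannon_entropy("ACGTACGT");    # four equally common letters
printf "%.2f\n", shannon_entropy("AAAAAAAA");    # one letter, fully predictable
```

A sequence with four equally common letters gives 2 bits per letter, while a run of a single letter gives 0.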


eukaryote

Organisms with intracellular membranous organelles such as the nucleus and mitochondria are called eukaryotes.

frame-shift mutation

A mutation that inserts or deletes a number of nucleotides that isn't a multiple of three, and therefore causes the reading frame to change.
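The multiple-of-three rule can be sketched as follows. The subroutine name is_frameshift and the two example sequences are made up for illustration.

```perl
use strict;
use warnings;

# An insertion or deletion shifts the reading frame unless its
# length is a multiple of three.
sub is_frameshift {
    my ($indel_length) = @_;
    return $indel_length % 3 != 0 ? 1 : 0;
}

# Deleting one letter changes every downstream codon.
my $orig   = "ATGGCCAAATGA";
my $mutant = "ATGCCAAATGA";    # one G deleted after ATG
print join(" ", $orig   =~ /(...)/g), "\n";    # ATG GCC AAA TGA
print join(" ", $mutant =~ /(...)/g), "\n";    # ATG CCA AAT
```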


gene

A functional unit of the genome. Unless stated otherwise, "gene" usually means a protein-coding gene, but many genes don't contain the instructions for proteins (e.g., various RNA genes).

genetic code

The mapping of codons to amino acids. See Table 2-3.
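In Perl, the mapping is naturally a hash keyed by codon. The sketch below builds the standard genetic code from a 64-letter string in T, C, A, G codon order and uses it to translate a short sequence. The names %genetic_code and translate are hypothetical, and organisms that use a variant code would need a different string.

```perl
use strict;
use warnings;

# Build the standard genetic code as a codon => amino acid hash.
# Codons are enumerated in T, C, A, G order at each position and
# '*' marks a translation stop.
my @nt = qw(T C A G);
my $aa = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %genetic_code;
my $i = 0;
for my $x (@nt) {
    for my $y (@nt) {
        for my $z (@nt) {
            $genetic_code{"$x$y$z"} = substr($aa, $i++, 1);
        }
    }
}

# Translate a DNA sequence in frame 0; unknown codons become 'X'.
sub translate {
    my ($dna) = @_;
    my $protein = '';
    for my $codon (uc($dna) =~ /(...)/g) {
        $protein .= $genetic_code{$codon} // 'X';
    }
    return $protein;
}

print translate("ATGGCCAAATGA"), "\n";    # MAK*
```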

genetic drift

The tendency of sequences to change over time by accumulating random mutations.


genome

The complete genetic material for an organism. For eukaryotes, the genome refers to the nuclear genome and doesn't include organelles.

global alignment

An alignment algorithm that requires every letter of each sequence to appear in the alignment. Globally aligning sequences of different lengths may lead to very strange alignments.
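A minimal sketch of a global alignment score, using the classic Needleman-Wunsch recurrence, is shown below. The subroutine name global_score and the scoring values (+1 match, -1 mismatch, -2 per gap letter) are assumptions for illustration only.

```perl
use strict;
use warnings;

# Needleman-Wunsch global alignment score. Every letter of both
# sequences must be aligned to a letter or to a gap.
sub global_score {
    my ($s, $t, $match, $mismatch, $gap) = @_;
    my @score;
    # Aligning a prefix against nothing costs one gap per letter.
    $score[$_][0] = $_ * $gap for 0 .. length($s);
    $score[0][$_] = $_ * $gap for 0 .. length($t);
    for my $i (1 .. length($s)) {
        for my $j (1 .. length($t)) {
            my $pair = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1)
                     ? $match : $mismatch;
            my $best = $score[$i - 1][$j - 1] + $pair;    # align two letters
            my $up   = $score[$i - 1][$j] + $gap;         # gap in $t
            my $left = $score[$i][$j - 1] + $gap;         # gap in $s
            $best = $up   if $up   > $best;
            $best = $left if $left > $best;
            $score[$i][$j] = $best;
        }
    }
    return $score[length($s)][length($t)];
}

print global_score("ACGT", "ACGT", 1, -1, -2), "\n";            # 4
print global_score("ACGTACGTACGT", "ACGT", 1, -1, -2), "\n";    # -12
```

The second call shows the effect of unequal lengths: the short sequence matches perfectly, yet the eight unavoidable gap letters drag the score well below zero.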