The sequences in BLAST databases come from sequence databases. But what are sequence databases and where do you get them? The answers to these simple questions are surprisingly complex. Sequence databases come in many shapes and sizes. Some are just collections of raw sequence data from genome sequencing projects, while others contain comprehensive information about the origin and function of the sequences. Unfortunately, there isn't a one-stop shopping place to get all the information you may want, but there is one particular service worth mentioning above all others: the International Nucleotide Sequence Database.
Probably the most important molecular biology resource is the public sequence database maintained by the International Nucleotide Sequence Database (INSD). It is composed of three parties: the DNA Data Bank of Japan (DDBJ, http://www.ddbj.nig.ac.jp), the European Molecular Biology Laboratory (EMBL, http://www.embl.org), and GenBank from the National Center for Biotechnology Information (NCBI, http://ncbi.nlm.nih.gov/GenBank). This consortium collaborates to form the largest public repository for DNA and protein sequences in the world. Because it is such an important resource, this chapter spends some time exploring it.
The amount of publicly available sequence has been growing geometrically, doubling approximately every 14 months (see Figure 11-2). Fortunately, computer technology has also kept pace. While it seems scary that GenBank is currently approaching 100 GB and will be half a terabyte in a few years, it's nice to know that this isn't going to be a problem. Not every database grows so fast, though. Organism-specific databases such as the Saccharomyces Genome Database, WormBase, and FlyBase are growing at a more moderate pace, principally because the sequence of their genomes is complete. But many new genome projects are just getting started, and they will probably grow very quickly.
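As a rough check on these figures, the doubling model can be written out as a short calculation. This is only a sketch: it assumes clean exponential growth with a fixed 14-month doubling time and a starting size of 100 GB, and the function names are illustrative.

```python
import math

DOUBLING_MONTHS = 14.0  # doubling time quoted in the text


def projected_size(start_gb, months, doubling_months=DOUBLING_MONTHS):
    """Size after `months`, assuming steady exponential growth."""
    return start_gb * 2.0 ** (months / doubling_months)


def months_to_reach(start_gb, target_gb, doubling_months=DOUBLING_MONTHS):
    """Months needed to grow from start_gb to target_gb."""
    return doubling_months * math.log2(target_gb / start_gb)


if __name__ == "__main__":
    months = months_to_reach(100, 500)
    print(f"100 GB -> 500 GB takes about {months:.1f} months")
    print(f"Size after 28 months: {projected_size(100, 28):.0f} GB")
```

Under these assumptions, growing from 100 GB to half a terabyte takes a little under three years, which is consistent with "a few years."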
Sequence databases usually offer their data in several different formats. The FASTA format is universally accepted for operating on sequences, but many sequence databases record a lot more data than just the sequence. Such extra information is commonly presented in a human-readable format called a flat file. The INSD uses two kinds of flat files. The DDBJ and GenBank flat file formats are identical, while the EMBL format is slightly different. The following DDBJ/GenBank record corresponds to a fragment of the Hoxa-11 gene from the coelacanth (the ancient fish on the cover of the book):
    LOCUS       AF287139                 606 bp    DNA     linear   VRT 10-DEC-2000
    DEFINITION  Latimeria chalumnae Hoxa-11 gene, partial cds.
    ACCESSION   AF287139
    VERSION     AF287139.1  GI:11611818
    KEYWORDS    .
    SOURCE      Latimeria chalumnae.
      ORGANISM  Latimeria chalumnae
                Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
                Coelacanthiformes; Coelacanthidae; Latimeria.
    REFERENCE   1  (bases 1 to 606)
      AUTHORS   Chiu,C.H., Nonaka,D., Xue,L., Amemiya,C.T. and Wagner,G.P.
      TITLE     Evolution of Hoxa-11 in lineages phylogenetically positioned along
                the fin-limb transition
      JOURNAL   Mol. Phylogenet. Evol. 17 (2), 305-316 (2000)
      MEDLINE   20538275
       PUBMED   11083943
    REFERENCE   2  (bases 1 to 606)
      AUTHORS   Chiu,C.-H. and Wagner,G.P.
      TITLE     Direct Submission
      JOURNAL   Submitted (14-JUL-2000) Ecology and Evolutionary Biology, Yale
                University, 165 Prospect St., New Haven, CT 06520-8106, USA
    FEATURES             Location/Qualifiers
         source          1..606
                         /organism="Latimeria chalumnae"
                         /db_xref="taxon:7897"
         CDS             <1..>606
                         /codon_start=1
                         /product="Hoxa-11"
                         /protein_id="AAG39070.1"
                         /db_xref="GI:11611819"
                         /translation="YLPSCTYYVSGPDFSSLPSFLPQTPSSRPMTYSYSSNLPQVQPV
                         REVTFRDYAIDTSNKWHPRSNLPHCYSTEEILHRDCLATTTASSIGEIFGKGNANVYH
                         PGSSTSSNFYNTVGRNGVLPQAFDQFFETAYGTTENHSSDYSADKNSDKIPSAATSRS
                         ETCRETDEKERREESSSPESSSGNNEEKSSSSSGQRTRKKRC"
    BASE COUNT      173 a    169 c    129 g    135 t
    ORIGIN
            1 tacttgccaa gttgcaccta ctacgtttcg ggtcccgatt tctccagcct cccttctttt
           61 ttgccccaga ccccgtcttc tcgccccatg acatactcct attcgtctaa tctaccccaa
          121 gttcaacctg tgagagaagt taccttcagg gactatgcca ttgatacatc caataaatgg
          181 catcccagaa gcaatttacc ccattgctac tcaacagagg agattctgca cagggactgc
          241 ctagcaacca ccaccgcttc aagcatagga gaaatctttg ggaaaggcaa cgctaacgtc
          301 taccatcctg gctccagcac ctcttctaat ttctataaca cagtgggtag aaacggggtc
          361 ctaccgcaag cctttgacca gtttttcgag acggcttatg gcacaacaga aaaccactct
          421 tctgactact ctgcagacaa gaattccgac aaaatacctt cggcagcaac ttcaaggtcg
          481 gagacttgca gggagacaga cgagaaggag agacgggaag aaagcagtag cccagagtct
          541 tcttccggca acaatgagga gaaatcaagc agttccagtg gtcaacgtac aaggaagaag
          601 aggtgc
    //
The next example is the same record in the slightly different EMBL format. Most of the data is identical between the two formats, but there are a few important differences. The VERSION field of the DDBJ/GenBank record includes a GI number (discussed below) that isn't in the EMBL record. The EMBL record contains both a creation date and a modification date, while the DDBJ/GenBank record contains only a modification date.
    ID   AF287139   standard; DNA; VRT; 606 BP.
    XX
    AC   AF287139;
    XX
    SV   AF287139.1
    XX
    DT   11-DEC-2000 (Rel. 66, Created)
    DT   11-DEC-2000 (Rel. 66, Last updated, Version 1)
    XX
    DE   Latimeria chalumnae Hoxa-11 gene, partial cds.
    XX
    KW   .
    XX
    OS   Latimeria chalumnae (coelacanth)
    OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
    OC   Coelacanthiformes; Coelacanthidae; Latimeria.
    XX
    RN   [1]
    RP   1-606
    RX   PUBMED; 11083943.
    RA   Chiu, Ch, Nonaka D., Xue L., Amemiya C.T., Wagner G.P.;
    RT   "Evolution of Hoxa-11 in Lineages Phylogenetically Positioned along the
    RT   Fin-Limb Transition";
    RL   Mol. Phylogenet. Evol. 17(2):305-316(2000).
    XX
    RN   [2]
    RP   1-606
    RA   Chiu C.-H., Wagner G.P.;
    RT   ;
    RL   Submitted (14-JUL-2000) to the
    RL   Ecology and Evolutionary Biology, Yale University, 165 Prospect St., New
    RL   Haven, CT 06520-8106, USA
    XX
    DR   SPTREMBL; Q9DDT9; Q9DDT9.
    XX
    FH   Key             Location/Qualifiers
    FH
    FT   source          1..606
    FT                   /db_xref="taxon:7897"
    FT                   /organism="Latimeria chalumnae"
    FT   CDS             <1..>606
    FT                   /codon_start=1
    FT                   /db_xref="SPTREMBL:Q9DDT9"
    FT                   /product="Hoxa-11"
    FT                   /protein_id="AAG39070.1"
    FT                   /translation="YLPSCTYYVSGPDFSSLPSFLPQTPSSRPMTYSYSSNLPQVQPVR
    FT                   EVTFRDYAIDTSNKWHPRSNLPHCYSTEEILHRDCLATTTASSIGEIFGKGNANVYHPG
    FT                   SSTSSNFYNTVGRNGVLPQAFDQFFETAYGTTENHSSDYSADKNSDKIPSAATSRSETC
    FT                   RETDEKERREESSSPESSSGNNEEKSSSSSGQRTRKKRC"
    XX
    SQ   Sequence 606 BP; 173 A; 169 C; 129 G; 135 T; 0 other;
         tacttgccaa gttgcaccta ctacgtttcg ggtcccgatt tctccagcct cccttctttt        60
         ttgccccaga ccccgtcttc tcgccccatg acatactcct attcgtctaa tctaccccaa       120
         gttcaacctg tgagagaagt taccttcagg gactatgcca ttgatacatc caataaatgg       180
         catcccagaa gcaatttacc ccattgctac tcaacagagg agattctgca cagggactgc       240
         ctagcaacca ccaccgcttc aagcatagga gaaatctttg ggaaaggcaa cgctaacgtc       300
         taccatcctg gctccagcac ctcttctaat ttctataaca cagtgggtag aaacggggtc       360
         ctaccgcaag cctttgacca gtttttcgag acggcttatg gcacaacaga aaaccactct       420
         tctgactact ctgcagacaa gaattccgac aaaatacctt cggcagcaac ttcaaggtcg       480
         gagacttgca gggagacaga cgagaaggag agacgggaag aaagcagtag cccagagtct       540
         tcttccggca acaatgagga gaaatcaagc agttccagtg gtcaacgtac aaggaagaag       600
         aggtgc                                                                  606
    //
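Because every EMBL line begins with a two-letter code, the format lends itself to simple line-by-line processing. The following is a minimal sketch rather than a full parser. It assumes the code occupies the first two columns and the data starts at column 6; `group_embl_lines` is an illustrative name, not part of any library, and the sample is a trimmed excerpt of the record above.

```python
from collections import defaultdict


def group_embl_lines(text):
    """Group the data portion of each EMBL line by its two-letter code.

    XX spacer lines, the // terminator, and sequence lines (which start
    with blanks) are skipped.
    """
    fields = defaultdict(list)
    for line in text.splitlines():
        code = line[:2]
        if code in ("XX", "//") or not code.strip():
            continue
        fields[code].append(line[5:].strip())
    return dict(fields)


SAMPLE = """\
ID   AF287139   standard; DNA; VRT; 606 BP.
XX
AC   AF287139;
XX
SV   AF287139.1
XX
DT   11-DEC-2000 (Rel. 66, Created)
DT   11-DEC-2000 (Rel. 66, Last updated, Version 1)
XX
OS   Latimeria chalumnae (coelacanth)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Coelacanthiformes; Coelacanthidae; Latimeria.
//
"""

record = group_embl_lines(SAMPLE)
print(record["SV"])
print(record["DT"])
```

Codes that repeat, such as DT and OC, end up as lists with one entry per line, which keeps the creation and modification dates separate.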
Note that the sequence data is only one part of the record; there's a lot of other useful information in here including the organism, the taxonomic classification, the authors, a reference to the scientific literature, and a feature table indicating the translation of the DNA. This is great stuff, and INSD is full of these kinds of records. But there is a downside to using the public databases. They're a bit like public parks: huge, beautiful, inexpensive to use, and valuable, but there's always someone who doesn't pick up their trash. Some sequences are erroneous, and the ancillary information is sometimes wrong and misleading. But overall, the databases are high-quality resources, and you should take a moment to applaud the scientists who contribute their sequences to the INSD, as well as the administrators and curators at DDBJ/EMBL/GenBank who do an outstanding job. Now let's take a closer look at some parts of the sequence record.
One of the most important parts of any sequence record is its database identifier, which is often called its accession number. (Although it's called a number, it may be a mixture of letters, numbers, and other symbols, but not spaces.) This tag uniquely identifies the sequence in a database. There isn't necessarily a one-to-one correspondence between sequences and tags because sequences are sometimes known by multiple unique names. The DDBJ/GenBank ACCESSION (or AC in EMBL) is the primary name for a sequence record. Another unique name is the LOCUS (or ID in EMBL). The locus is supposed to be a "short mnemonic name for the entry, chosen to suggest the sequence's definition." For example, "HSMG01" is the locus name for the database entry containing Homo sapiens myoglobin exon 1. Over time, like the names of celestial objects, locus names have become less descriptive and are often just duplicates of the accession numbers.
Sequence records can also change over time. This often happens when the record is edited to correct a sequence error. The accession number and locus don't change, but the version number is increased (VERSION in DDBJ/GenBank and SV in EMBL). In this way, an ACCESSION.VERSION points to a particular record at a particular time. It's a good idea to always refer to sequences in this way and not by ACCESSION alone or by LOCUS or ID.
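To make the ACCESSION.VERSION convention concrete, here is a small sketch that splits such an identifier and compares two of them. The names `SeqId`, `parse_seq_id`, and `is_same_record` are hypothetical, and treating unversioned identifiers as "not comparable" is a design choice of this example, not a database rule.

```python
from typing import NamedTuple, Optional


class SeqId(NamedTuple):
    accession: str
    version: Optional[int]  # None when no version was given


def parse_seq_id(text):
    """Split 'AF287139.1' into accession 'AF287139' and version 1."""
    text = text.strip()
    accession, dot, version = text.rpartition(".")
    if dot and version.isdigit():
        return SeqId(accession, int(version))
    return SeqId(text, None)


def is_same_record(a, b):
    """True only if both identifiers carry the same accession AND version."""
    id_a, id_b = parse_seq_id(a), parse_seq_id(b)
    return id_a.version is not None and id_a == id_b


print(parse_seq_id("AF287139.1"))
print(is_same_record("AF287139.1", "AF287139.2"))
```

Two bare accessions are deliberately reported as not provably the same record, since the sequence behind an accession may have been revised between the two references.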
DDBJ/GenBank records include an additional token called the GI number, which is a numeric identifier that points to a particular ACCESSION.VERSION. The GI number is especially important because NCBI-BLAST relies on it as an additional mechanism for indexing BLAST databases. This topic was covered in Section 11.2.3.
The DEFINITION is a concise description of the origin and function of a sequence, and is typically what you find in a FASTA description line. The text is structured, meaning that there are rules that define how it is produced. However, it doesn't use a controlled vocabulary, which means you can't be sure which words will or won't appear.
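The relationship between a flat file and a FASTA record can be sketched in a few lines. This example assumes the DDBJ/GenBank layout in which top-level keywords occupy the first 12 columns and continuation lines leave those columns blank. The function names are illustrative, and the choice to use ACCESSION.VERSION as the FASTA identifier is this sketch's own, not a rule imposed by either format.

```python
def genbank_header_fields(text):
    """Collect top-level keyword lines, joining their continuation lines."""
    fields = {}
    key = None
    for line in text.splitlines():
        if not line.strip():
            continue
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword and not line[0].isspace():
            key = keyword                  # new top-level keyword
            fields[key] = value
        elif not keyword and key is not None:
            fields[key] += " " + value     # continuation of the current keyword
        else:
            key = None                     # indented sub-keyword such as ORGANISM
    return fields


def fasta_record(fields, sequence, width=60):
    """Build a FASTA record: '>ACCESSION.VERSION DEFINITION' plus sequence."""
    identifier = fields["VERSION"].split()[0]
    seq = "".join(sequence.split())        # drop spaces between 10-base groups
    body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return f">{identifier} {fields['DEFINITION']}\n{body}\n"


HEADER = """\
LOCUS       AF287139                 606 bp    DNA     linear   VRT 10-DEC-2000
DEFINITION  Latimeria chalumnae Hoxa-11 gene, partial cds.
ACCESSION   AF287139
VERSION     AF287139.1  GI:11611818
"""
# First 60 bases of the record, for brevity
FIRST_60 = "tacttgccaa gttgcaccta ctacgtttcg ggtcccgatt tctccagcct cccttctttt"

print(fasta_record(genbank_header_fields(HEADER), FIRST_60), end="")
```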
KEYWORDS are a historical relic like the locus name and aren't used in modern sequence records. Avoid the temptation to believe that keywords are meaningful.
The common name for an organism is often found in the SOURCE, or in parentheses after the OS in EMBL format. The scientific name is on the ORGANISM line (OS in EMBL) and the complete taxonomic classification is given on the following lines (OC in EMBL). The complete taxonomy may be abbreviated if it's especially long.
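The organism lines are easy to take apart programmatically. The sketch below assumes the conventions visible in the coelacanth record (taxa separated by semicolons, a closing period, and an optional common name in parentheses on the EMBL OS line); both function names are hypothetical.

```python
def parse_lineage(lines):
    """Join taxonomy lines and split on ';', dropping the final period."""
    joined = " ".join(line.strip() for line in lines)
    taxa = joined.rstrip(". ").split(";")
    return [taxon.strip() for taxon in taxa if taxon.strip()]


def split_os_line(os_value):
    """Split 'Latimeria chalumnae (coelacanth)' into scientific and common name."""
    name, sep, rest = os_value.partition(" (")
    common = rest.rstrip(")") if sep else None
    return name.strip(), common


lineage = parse_lineage([
    "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;",
    "Coelacanthiformes; Coelacanthidae; Latimeria.",
])
print(lineage)
print(split_os_line("Latimeria chalumnae (coelacanth)"))
```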
The FEATURES table (FT in EMBL) lists specific regions of importance on the sequence, such as genes or repetitive elements. The general syntax of features is fairly simple: each has a key, a location, and optional qualifiers. The key tells what kind of feature it is (e.g., a gene), the location tells where it lies on the sequence (e.g., from nucleotide 100 to nucleotide 200), and the qualifiers supply additional information, such as specific names, database cross references, and experimental notes. A detailed discussion of the feature table is beyond the scope of this book. See http://www.ncbi.nih.gov/projects/collab/FT for more information.
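As an illustration of the key/location/qualifier syntax, the following sketch parses the simplest kind of location, a single span such as `100..200` or `<1..>606`, along with one qualifier. It deliberately rejects anything more complex (for example, locations built with operators such as `join` or `complement`), and the names `Span`, `parse_simple_location`, and `parse_qualifier` are invented for this example.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    partial_start: bool  # '<': the feature begins before the first base given
    partial_end: bool    # '>': the feature continues past the last base given


_SIMPLE_SPAN = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)$")


def parse_simple_location(text):
    """Parse 'start..end' with optional '<' and '>' partial markers."""
    match = _SIMPLE_SPAN.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported location: {text!r}")
    lt, start, gt, end = match.groups()
    return Span(int(start), int(end), bool(lt), bool(gt))


def parse_qualifier(text):
    """Split '/product="Hoxa-11"' into ('product', 'Hoxa-11')."""
    key, _, value = text.strip().lstrip("/").partition("=")
    return key, value.strip('"')


cds = parse_simple_location("<1..>606")
print(cds)
print(parse_qualifier('/product="Hoxa-11"'))
```

For the coelacanth CDS, both partial flags come back true, which agrees with the "partial cds" wording in the record's definition.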
INSD is just one of many important databases. Some other favorites are listed in Table 11-2.
Database | Description
---|---
RefSeq | RefSeq provides reference sequences that represent the highest quality information about a particular sequence. Each record may be constructed from several INSD records, which makes the database nonredundant. All RefSeq accession numbers are preceded by two letters and an underscore, for example XP_102310. Some types of RefSeq records have been inspected manually by curators, and they are the highest quality records (indicated below). http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html
Pfam | Pfam is a collection of multiple sequence alignments and hidden Markov models (HMMs) for many common protein domains and families. If you are interested in a particular family, such as globins, or a particular domain, such as WD-40, this is a great resource. HMMs are probabilistic models that describe how whole domains evolve, which is quite different from a scoring matrix employed by BLAST that treats each amino acid of a protein independently. http://pfam.wustl.edu, http://www.sanger.ac.uk/pfam
SWISS-PROT | The SWISS-PROT Protein Knowledgebase is a curated protein sequence database that provides a high level of annotation (such as the description of protein function, domains structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and high level of integration with other databases. http://www.ebi.ac.uk/swissprot
TrEMBL | The TrEMBL database contains translations of all coding sequences (CDS) present in the INSD, which aren't yet integrated into SWISS-PROT. TrEMBL is split into two main sections: SP-TrEMBL contains entries expected to be included in SWISS-PROT, and REM-TrEMBL contains those that aren't expected to be included. http://www.ebi.ac.uk/tremble
UniGene | UniGene is an experimental system for automatically partitioning GenBank sequences into a nonredundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and the map location. UniGene sets are available for most genomes with a lot of EST sequences. http://www.ncbi.nlm.nih.gov/UniGene
MGC | The goal of the Mammalian Gene Collection (MGC) is to provide a complete set of full-length (open reading frame) sequences and cDNA clones of expressed mammalian genes. The current focus is limited to human and mouse. http://mgc.nci.nih.gov
SGD | SGD is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae, which is commonly known as baker's or budding yeast. S. cerevisiae was the first eukaryotic genome sequenced. http://genome-www.stanford.edu/Saccharomyces
WormBase | WormBase is a comprehensive database dedicated to the biology and genome of the nematode Caenorhabditis elegans. C. elegans was the first multicellular organism to have its genome sequenced. http://www.wormbase.org
FlyBase | FlyBase is a comprehensive database for information on the genetics and molecular biology of Drosophila. It includes data from the Drosophila Genome Projects and data curated from the literature. FlyBase is a joint project with the Berkeley Drosophila Genome Project. http://www.flybase.org
TAIR | The Arabidopsis Information Resource (TAIR) provides a comprehensive resource for the scientific community working with Arabidopsis thaliana, a widely used model plant. http://www.arabidopsis.org

RefSeq accession prefixes:

Prefix | Molecule | Description
---|---|---
NC_ | Genomic | Curated complete genomic molecules including genomes, chromosomes, organelles, and plasmids.
NG_ | Genomic | Curated incomplete genomic region; primarily supplied for Homo sapiens and Mus musculus to support the NCBI Genome Annotation pipeline.
NM_ | mRNA | Curated mRNAs.
NR_ | RNA | Curated noncoding transcripts including structural RNAs, transcribed pseudogenes, and others.
NP_ | Protein | Curated proteins.
NT_ | Genomic | Intermediate genomic assemblies of BAC sequence data.
NW_ | Genomic | Intermediate genomic assemblies of Whole Genome Shotgun sequence data.
XM_ | mRNA | Homo sapiens model mRNA provided by the Genome Annotation process; sequence corresponds to the genomic contig.
XR_ | RNA | Homo sapiens model noncoding transcripts provided by the Genome Annotation process; sequence corresponds to the genomic contig.
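The RefSeq prefixes in Table 11-2 lend themselves to a simple lookup when you need to sort a mixed list of accessions. The sketch below encodes only the prefixes listed in the table; the "curated" flag is this example's reading of the table's descriptions, `classify_refseq` is a hypothetical helper, and the NM_ and NT_ accessions used to exercise it are made up. Prefixes the table doesn't list (including the XP_ of the example accession in the RefSeq description) are reported as unknown rather than guessed.

```python
# prefix -> (molecule type, described as curated in Table 11-2?)
REFSEQ_PREFIXES = {
    "NC": ("Genomic", True),
    "NG": ("Genomic", True),
    "NM": ("mRNA", True),
    "NR": ("RNA", True),
    "NP": ("Protein", True),
    "NT": ("Genomic", False),   # intermediate BAC assemblies
    "NW": ("Genomic", False),   # intermediate WGS assemblies
    "XM": ("mRNA", False),      # model mRNA
    "XR": ("RNA", False),       # model noncoding transcript
}


def classify_refseq(accession):
    """Return (molecule, curated) for a listed RefSeq prefix, else None."""
    prefix, underscore, _ = accession.strip().partition("_")
    if not underscore:
        return None             # no underscore: not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ("NM_000001", "NT_000002", "XP_102310", "AF287139"):
    print(acc, classify_refseq(acc))
```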