11.1 FASTA Files

Regardless of where you get your sequences, you will eventually want them in FASTA format because it is the standard currency for sequence data. The FASTA format has a very simple specification consisting of two parts: the definition line and the sequence lines.

The definition line is a single line that begins with the mandatory > symbol, immediately followed by an identifier and then a description. There are no spaces between the > and the identifier. The identifier itself must not contain any whitespace because whitespace is the delimiter between the identifier and the description. The description is free-form text that may contain any characters except an end-of-line character. Figure 11-1 shows a simple definition line in which the identifier is "EcoRI" and the description reads "is a restriction enzyme."

Figure 11-1. The FASTA definition line

The sequence lines follow a very simple format: they may be any length and there may be any number of them. Usually, you'll see 50, 60, or 80 characters per line, but the choice is arbitrary. Some software relies on sequence lines not being too long, so it's generally a good idea to follow the convention of 50 to 80 characters per line.
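As a minimal sketch of the format described so far, the following Python function builds one FASTA record and wraps the sequence at a fixed width. The function name format_fasta, its parameters, and the default width of 60 are illustrative choices, not part of any standard library or BLAST tool, and the repeated GAATTC sequence is made-up example data.

```python
def format_fasta(identifier, description, sequence, width=60):
    """Return one FASTA record as a string with wrapped sequence lines."""
    # Whitespace separates identifier from description, so it cannot
    # appear inside the identifier itself.
    if any(ch.isspace() for ch in identifier):
        raise ValueError("identifier must not contain whitespace")
    # The description may contain anything except an end-of-line character.
    if "\n" in description or "\r" in description:
        raise ValueError("description must fit on a single line")
    # '>' is immediately followed by the identifier, then the description.
    defline = ">" + identifier
    if description:
        defline += " " + description
    # Break the sequence into chunks of at most `width` characters.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([defline] + lines) + "\n"


# Hypothetical example data: 150 residues wrapped at 60 per line.
record = format_fasta("EcoRI", "is a restriction enzyme", "GAATTC" * 25)
print(record, end="")
```

Any width in the conventional 50 to 80 range would serve equally well; the value only needs to be short enough for software that assumes modest line lengths.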

The first and most important guideline for FASTA definition lines is that the identifier should uniquely specify the sequence in some database. Strictly speaking, both the identifier and the description are optional, but omitting them invites confusion. The following is a valid, though confusing, FASTA file: because a space follows the >, there is no identifier, and "dumb" is the description.

> dumb
GAATTC
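The splitting rule can be sketched in a few lines of Python. The function name parse_defline is hypothetical; the sketch simply treats everything between the > and the first whitespace character as the identifier and the remainder as the description.

```python
def parse_defline(line):
    """Split a FASTA definition line into (identifier, description)."""
    if not line.startswith(">"):
        raise ValueError("a definition line must begin with '>'")
    body = line[1:].rstrip("\r\n")
    # If nothing, or whitespace, directly follows '>', the identifier is
    # empty and whatever remains is the description.
    if not body or body[0].isspace():
        return "", body.strip()
    parts = body.split(None, 1)
    identifier = parts[0]
    description = parts[1].strip() if len(parts) > 1 else ""
    return identifier, description


for defline in (">EcoRI is a restriction enzyme.", "> dumb"):
    print(repr(defline), "->", parse_defline(defline))
```

For the "> dumb" line, the sketch reports an empty identifier and the description "dumb", which is exactly why that record is hard to refer to later.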

The following definition lines are confusing because the identifier isn't unique; in both, the identifier is "chromosome" and everything after the first space is description:

>chromosome 1 sequence 1
>chromosome 1 sequence 2

This is easily remedied by replacing the whitespace with some other character:

>chromosome_1-sequence.1
>chromosome_1-sequence.2
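A quick way to catch this kind of collision within a single file is to count identifiers before building a database. The sketch below is one possible approach; the helper names identifier_of and duplicate_identifiers are invented for illustration.

```python
from collections import Counter


def identifier_of(defline):
    """Return the identifier: the text between '>' and the first whitespace."""
    body = defline[1:]
    if not body or body[0].isspace():
        return ""
    return body.split(None, 1)[0]


def duplicate_identifiers(deflines):
    """Return a sorted list of identifiers that occur more than once."""
    counts = Counter(identifier_of(d) for d in deflines)
    return sorted(ident for ident, n in counts.items() if n > 1)


ambiguous = [">chromosome 1 sequence 1", ">chromosome 1 sequence 2"]
repaired = [">chromosome_1-sequence.1", ">chromosome_1-sequence.2"]
print(duplicate_identifiers(ambiguous))  # the shared identifier is reported
print(duplicate_identifiers(repaired))   # nothing is reported
```

A check like this only detects duplicates within the lines it is given; it cannot detect clashes with identifiers used by other researchers.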

On the surface, these look like good identifiers, but another researcher may have the same identifiers for completely different sequences from another organism. How can you prevent this? You can't, but you can minimize potential conflicts by including a unique tag, based on your name or institution. If your data will be made public, the best solution is to submit your sequences to the public databases and use the accession numbers they provide. If not, choose identifiers you think will be unique (and make sure you read about creating fake GI numbers in Section 11.2.3).

The sequencing world is usually very cooperative, and standards have been developed to minimize name conflicts. In particular, there is a tight collaboration between DDBJ, EMBL, and GenBank so that accession numbers among these databases are guaranteed to be unique. But this isn't true of all databases. Although the identifier "AAG39070" points to a specific DDBJ/EMBL/GenBank record, it may also point to a wholly different sequence in another database. A good way to avoid name conflicts is to make sure the identifier specifies a database in addition to some unique tag for the sequence. Let's look at how the NCBI solves this problem.

11.1.1 NCBI Identifier Format

The NCBI identifier format[1] indicates the name of the database in addition to an accession number. These tokens are separated by the "|" symbol, often called a bar or a pipe. This symbol can be confusing in some fonts, as it may look like a lowercase L or the number one. Try using a constant-width serif font such as Courier if you're having trouble telling them apart.

[1] The NCBI identifier format is used by both NCBI-BLAST and WU-BLAST software and is necessary for proper indexing of BLAST databases. You should use it as your standard format.

In general terms, an NCBI definition line has the following specification:

>database|identifier

A little knowledge can be a dangerous thing, so don't stop reading now and assume that if you follow this syntax you will have valid identifiers. The names of databases are restricted, and each has its own syntax. If you don't follow the proper syntax, you will end up confusing formatdb or xdformat, which will prevent you from performing certain operations such as retrieving sequences by accession number. And you may end up confusing people if you share your data. Table 11-1 shows the current database tokens and their syntax (for an up-to-date list, see the documentation distributed with NCBI-BLAST or WU-BLAST software). For example, if you use the "pat" database token, which corresponds to the patent database, you must supply the country and the patent number as well.

Table 11-1. NCBI identifier syntax

Database name                        Syntax
-----------------------------------  -----------------------
DDBJ                                 dbj|accession|locus
EMBL                                 emb|accession|ID
NCBI GenBank                         gb|accession|locus
NCBI GenInfo                         gi|integer
NCBI Reference Sequence              ref|accession|locus
NBRF Protein Information Resource    pir||entry
Protein Research Foundation          prf||name
SWISS-PROT                           sp|accession|entry
Brookhaven Protein Data Bank         pdb|entry|chain
Patents                              pat|country|number
GenInfo Backbone ID                  bbs|number
Local                                lcl|identifier
General                              gnl|database|identifier

Here are some real examples of NCBI identifiers:

>gi|21305377
>gb|AAM45611.1|AF384285_1
>ref|NP_104634.1|

In the final example, there is no locus, even though this is expected in the syntax. This isn't an error; the locus is just blank.
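To make the per-database syntax concrete, here is a small Python sketch that parses a single (non-compound) NCBI identifier using the field names from Table 11-1. The SYNTAX dictionary and the parse_identifier function are illustrative names, not part of NCBI-BLAST or WU-BLAST, and the sketch assumes every field in the syntax is present, even if blank.

```python
# Field names per database token, following Table 11-1. An empty name
# marks a field that the syntax leaves blank (as in pir||entry).
SYNTAX = {
    "dbj": ("accession", "locus"),
    "emb": ("accession", "ID"),
    "gb": ("accession", "locus"),
    "gi": ("integer",),
    "ref": ("accession", "locus"),
    "pir": ("", "entry"),
    "prf": ("", "name"),
    "sp": ("accession", "entry"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "lcl": ("identifier",),
    "gnl": ("database", "identifier"),
}


def parse_identifier(identifier):
    """Parse one NCBI identifier (without '>') into (token, fields)."""
    tokens = identifier.split("|")
    tag, values = tokens[0], tokens[1:]
    if tag not in SYNTAX:
        raise ValueError("unknown database token: %r" % tag)
    names = SYNTAX[tag]
    if len(values) != len(names):
        raise ValueError("%r expects %d field(s)" % (tag, len(names)))
    # Keep only the named fields; blank values are preserved as ''.
    return tag, {name: value for name, value in zip(names, values) if name}


for defline in (">gi|21305377", ">gb|AAM45611.1|AF384285_1", ">ref|NP_104634.1|"):
    print(parse_identifier(defline[1:]))
```

In the last case the locus comes back as an empty string, which mirrors the blank locus in the identifier.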

If you have a collection of your own sequences, with your own names, your best choice is to use the Local or General databases, which are designed specifically for that purpose. The advantage to using General is that you can specify your own sub-namespace in the database field. The following identifier strings are all different from one another (note that identifiers are case-sensitive):

>lcl|foo
>lcl|FOO
>gnl|mydatabase|foo
>gnl|yourdatabase|foo

If, for some reason, you don't want to use the Local or General databases, you can omit the database name and just use your own identifiers with the guidelines discussed earlier. If you're using NCBI-BLAST, your sequences are actually stored in the Local database, and the following identifiers are therefore identical:

>dna.001
>lcl|dna.001

If you retrieve sequences from the BLAST database with fastacmd, they will have lcl| prepended (even if you didn't specify it), and the description will read "No definition line found" if the original definition line didn't include a description.

If you use WU-BLAST, the previous two identifiers aren't considered identical because an additional unnamed database is separate from the Local database. Definition lines without descriptions are also reported unmodified.

11.1.1.1 Compound identifiers

It is typical for the same sequence to be known by various names. The NCBI identifier format supports this by using compound identifiers where individual identifiers are concatenated with a pipe symbol. The following identifiers are examples of such compound identifiers. In databases distributed by the NCBI, the GI number is the first identifier.

>gi|11611818|gb|AF287139.1|AF287139
>gi|1708198|sp|P80487|HHP_THICU
>gi|9910844|sp|Q9UWG2|RL3_METVA
>gi|7228451|dbj|BAA92411.1|
>gi|11277201|pir||T44712
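Because each database token takes a fixed number of fields, a compound identifier can be taken apart by reading a token, consuming its fields, and repeating. The following Python sketch assumes the field counts implied by Table 11-1; FIELD_COUNT and split_compound are hypothetical names chosen for this example.

```python
# Number of fields that follow each database token (from Table 11-1).
FIELD_COUNT = {
    "dbj": 2, "emb": 2, "gb": 2, "gi": 1, "ref": 2, "pir": 2, "prf": 2,
    "sp": 2, "pdb": 2, "pat": 2, "bbs": 1, "lcl": 1, "gnl": 2,
}


def split_compound(identifier):
    """Split a compound identifier into a list of (token, fields) pairs."""
    tokens = identifier.split("|")
    parts = []
    i = 0
    while i < len(tokens):
        tag = tokens[i]
        if tag not in FIELD_COUNT:
            raise ValueError("unknown database token: %r" % tag)
        n = FIELD_COUNT[tag]
        fields = tokens[i + 1:i + 1 + n]
        if len(fields) != n:
            raise ValueError("%r expects %d field(s)" % (tag, n))
        parts.append((tag, fields))
        i += 1 + n  # skip the token and the fields it consumed
    return parts


for ident in ("gi|1708198|sp|P80487|HHP_THICU", "gi|11277201|pir||T44712"):
    print(split_compound(ident))
```

Consuming fields by count, rather than searching for the next known token, keeps blank fields such as the empty first field of pir from being misread.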

11.1.1.2 Concatenated definition lines

If you download the nonredundant protein database from the NCBI or use one of the programs distributed with WU-BLAST that creates nonredundant databases, you will see concatenated definition lines. Each definition is separated from the next by the Control-A character (ASCII 1), a nonprinting control character that may look like a normal space, or be invisible, in text editors and word processors. When it is forced to be visible, Control-A is often written as ^A (a white character on a black background is also common). You may wonder why you can't just create a larger compound identifier rather than a concatenated definition. The reason is that identical sequences may originate from different organisms or different loci and are therefore not identical in the biological sense; they may have different descriptions, which you may want to see. The following single definition line contains concatenated definitions as well as compound identifiers.

>gi|9845511|ref|NP_008839.2| ras-related C3 botulinum toxin substrate 1 isoform 
Rac1; rho family, small GTP binding protein Rac1 [Homo sapiens]^Agi|131807|sp|P1
5154|RAC1_HUMAN Ras-related C3 botulinum toxin substrate 1 (p21-Rac1) (Ras-like 
protein TC25)^Agi|68958|pir||TVHUC1 GTP-binding protein rac1 - human^Agi|108115|
pir||G36364 GTP-binding protein rac2 - dog^Agi|280956|pir||A60347 GTP-binding pr
otein rac1 - mouse^Agi|14277763|pdb|1I4D|D Chain D, Crystal Structure Analysis O
f Rac1-Gdp Complexed With Arfaptin (P21)^Agi|14277766|pdb|1I4L|D Chain D, Crysta
l Structure Analysis Of Rac1-Gdp In Complex With Arfaptin (P41)^Agi|922|emb|CAA3
9801.1| rac2 [Canis familiaris]^Agi|53886|emb|CAA40545.1| ras-related C3 botulin
ium toxin substrate [Mus musculus]^Agi|190824|gb|AAA36537.1| ras-related C3 botu
linum toxin substrate^Agi|249582|gb|AAB22206.1| rac1 p21=small GTP-binding prote
in [human, HL60, Peptide, 192 aa]^Agi|3184510|gb|AAC18960.1| GTPase cRac1A [Gall
us gallus]^Agi|6007014|gb|AAF00714.1|AF175262_1 GTPase [Bos taurus]^Agi|8574038|
emb|CAB53579.5| Rac1 protein [Homo sapiens]^Agi|12843555|dbj|BAB26027.1| RAS-rel
ated C3 botulinum substrate 1~data source:MGD, source key:MGI:97845, evidence:IS
S~putative [Mus musculus]^Agi|13277918|gb|AAH03828.1| ras-related C3 botulinum t
oxin substrate 1 (rho family, small GTP binding protein Rac1) [Mus musculus]^Agi
|15919905|dbj|BAB69451.1| RAS-related C3 botulinum substrate 1~data source:MGD, 
source key:MGI:97845, evidence:ISS~putative [Mus musculus]^Agi|20379102|gb|AAM21
111.1|AF498964_1 small GTP binding protein RAC1 [Homo sapiens]

Most concatenated definitions aren't this long. This particular protein is highly conserved and is identical from human to chicken (Gallus gallus). You might take a moment to appreciate that in the eons during which continents have split apart and converged, this protein has remained completely unchanged.
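Splitting a concatenated definition line back into its individual definitions is straightforward once you remember that the separator is a single Control-A byte rather than the two visible characters ^A. The Python sketch below shows one way to do it; split_concatenated is a hypothetical helper, and the sample line is a shortened version of the example above with only its first two definitions.

```python
CONTROL_A = "\x01"  # the real separator; '^A' is only how it is displayed


def split_concatenated(defline):
    """Return a list of (identifier, description) pairs from one defline."""
    body = defline[1:] if defline.startswith(">") else defline
    entries = []
    for definition in body.rstrip("\r\n").split(CONTROL_A):
        # Within each definition, the first space ends the identifier.
        identifier, _, description = definition.partition(" ")
        entries.append((identifier, description.strip()))
    return entries


sample = (
    ">gi|9845511|ref|NP_008839.2| ras-related C3 botulinum toxin substrate 1 "
    "isoform Rac1; rho family, small GTP binding protein Rac1 [Homo sapiens]"
    + CONTROL_A +
    "gi|131807|sp|P15154|RAC1_HUMAN Ras-related C3 botulinum toxin "
    "substrate 1 (p21-Rac1) (Ras-like protein TC25)"
)
for identifier, description in split_concatenated(sample):
    print(identifier, "=>", description)
```

Each resulting identifier is still a compound identifier and can be broken down further if you need the individual accession numbers.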

11.1.2 Descriptions

While you should use the NCBI identifier format, there isn't a publicly recognized standard for descriptions. Some people choose to omit descriptions entirely, while others load up the definition line with the entire contents of a GenBank file. The best descriptions are both brief and informative. Descriptions from the NCBI include a short description followed by the species name in square brackets at the end of the line. This is a reasonably good practice, but you should be wary of trying to reliably parse descriptions that don't come from a controlled vocabulary. The following definition line is a real example of a difficult-to-parse description:

>gi|20820984|ref|XP_140836.1| similar to DiGeorge syndrome critical region gene DGSI 
protein [Homo sapiens] [Mus musculus]

It's hard to tell if this is a human or mouse sequence. In reality, it's a mouse sequence similar to a human protein that originates from a region involved with the genetic disease called DiGeorge Syndrome. If you use a regular expression to find all Homo sapiens sequences and you don't bind the pattern match to the end of the line, this description can fool you. This kind of problem isn't limited to FASTA files; you'll also find fields in GenBank records that have embedded GenBank tags. It's both confusing and annoying. Unfortunately, automatically generating descriptions from transitive associations is a common practice. One way to cope with this problem is to rigorously construct your own definition lines from a controlled vocabulary. Another way is to trust only the identifiers, and when you need biological information, such as the species, retrieve it directly from the parent biological database.
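The difference between an unanchored and an anchored match can be seen in a short Python sketch. It assumes the NCBI convention described above, with the species in square brackets at the very end of a single (non-concatenated) definition line; the pattern and the function name species_of are illustrative only, and the definition line is the example above joined onto one line.

```python
import re

# Match a bracketed name only when it is the last thing on the line.
SPECIES_AT_END = re.compile(r"\[([^\[\]]+)\]\s*$")


def species_of(defline):
    """Return the bracketed species at the end of the line, or None."""
    match = SPECIES_AT_END.search(defline)
    return match.group(1) if match else None


defline = (
    ">gi|20820984|ref|XP_140836.1| similar to DiGeorge syndrome critical "
    "region gene DGSI protein [Homo sapiens] [Mus musculus]"
)

# An unanchored test is fooled by the embedded human annotation.
print("Homo sapiens" in defline)
# The anchored pattern reports only the final bracketed name.
print(species_of(defline))
```

Even an anchored pattern depends on descriptions following the convention, so retrieving the species from the parent database remains the more reliable option.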