13.4 formatdb Parameters

formatdb turns FASTA files into BLAST databases. (ASN.1 format is also acceptable, but because it isn't commonly used, it isn't covered in this book. You can find more information about ASN.1 at http://www.ncbi.nlm.nih.gov/Sitemap/Summary/asn1.html/.) Chapter 11 discusses the typical methods for building BLAST databases and examines the NCBI identifier syntax required for some aspects of formatdb and blastall. Here are a few sample command lines:

formatdb -i protein_db
formatdb -p F -i nucleotide_db
zcat est*.gz | formatdb -p F -i stdin -o -n est -v 2000000000

The following reference describes each formatdb parameter and gives its default value.

-B [file]

Default: Optional

Specifies a binary GI output file. The advantage of using a binary GI file is that it's smaller than a corresponding text file and can be read directly into memory without being parsed. See the -F option.

To convert a text GI file to binary, use the following command:

formatdb -F text_gi_list -B binary_gi_list

-F [file]

Default: Optional

Specifies a GI file, either text or binary. This is used for creating an alias database that contains no sequences of its own, only pointers to sequences stored in another database (which may itself be an alias database). See the -L parameter. The underlying databases must use the NCBI FASTA identifier syntax, include GI numbers, and be indexed with -o.
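
If you need to build a text GI file from an existing FASTA file, a short script can pull the GI numbers out of the definition lines. The following Python sketch is one way to do it. It assumes that definition lines follow the NCBI syntax and begin with >gi|number|, and that a text GI file is simply one GI number per line. The function names extract_gis and write_gi_list are made up for this example and aren't part of any BLAST tool.

```python
import re

# Matches the GI number at the start of an NCBI-style definition line,
# for example ">gi|129295|sp|P01013|OVAX_CHICK".
GI_PATTERN = re.compile(r"^>gi\|(\d+)\|")


def extract_gis(fasta_lines):
    """Return the GI numbers found on definition lines, in input order."""
    gis = []
    for line in fasta_lines:
        match = GI_PATTERN.match(line)
        if match:
            gis.append(int(match.group(1)))
    return gis


def write_gi_list(gis, path):
    """Write one GI number per line (the assumed text GI file layout)."""
    with open(path, "w") as handle:
        for gi in gis:
            handle.write(f"{gi}\n")
```

You would typically filter the list returned by extract_gis down to the records you want before writing it out, then pass the resulting file to formatdb with -F.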

-i [file]

Default: Required

Sets the input FASTA file. You may specify that input come from stdin with -i stdin, but you must also set the -n parameter to give it a name. If you wish to make a single BLAST database from multiple FASTA files, pipe them to formatdb as follows:

cat file1 file2 file3 | formatdb -i stdin -n my_db

-l [file]

Default: formatdb.log

Specifies an output log file. Log messages are appended to this file.

-L [file]

Default: Optional

Creates an alias database, which has several uses. It can be a simple synonym for another database, a selection of specific records from a database (see the -F option), or a static virtual database. Alias databases have the extension .pal for proteins or .nal for nucleotides.

To create an alias database with a selected set of GI numbers:

formatdb -i db -F gi_list -L alias_name -p [T/F]

To merge databases, first create a synonymous alias and then edit it to include additional database names. Chapter 11 covers this process in more detail.
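
The editing step can also be scripted. The Python sketch below rests on an assumption you should verify against an alias file produced by your own copy of formatdb: that the member databases are listed, separated by spaces, on a line beginning with DBLIST. The function name add_to_dblist and the database names in the usage note are hypothetical.

```python
def add_to_dblist(alias_text, extra_names):
    """Return alias file text with extra_names appended to the DBLIST line.

    Assumes the alias file names its member databases on a line that
    starts with "DBLIST"; every other line is passed through unchanged.
    """
    if not extra_names:
        return alias_text
    out = []
    for line in alias_text.splitlines():
        if line.startswith("DBLIST"):
            line = line.rstrip() + " " + " ".join(extra_names)
        out.append(line)
    return "\n".join(out) + "\n"
```

For example, applying add_to_dblist to the text of a nucleotide alias file whose DBLIST line names est_human, with ["est_mouse", "est_others"] as the extra names, yields a DBLIST line that names all three databases.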

-n [string]

Default: Optional, required with -i stdin

Sets the base name for the BLAST database. If not specified, the name of the FASTA file will be used. If the input is from stdin, this parameter must be set.

-o [T/F]

Default: Optional

Creates indexes. Indexing the databases isn't required but is recommended. Alias databases that use GI lists (see -F and -L options) and the blastall -l option require indexed databases. Additionally, some blastall output options specified with the -m parameter require indexing. Indexing adds four files with extensions .nnd, .nni, .nsd, and .nsi for nucleotides and .pnd, .pni, .psd, and .psi for proteins. If you know you don't need indexes, you can save space by omitting -o.
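
If you aren't sure whether an existing database was built with -o, you can look for the index files. The Python sketch below derives the expected filenames from the extensions listed above and reports which ones are absent. It assumes a single-volume database whose files sit directly in the given directory; the function names are invented for this example.

```python
import os

# Index file extensions created by -o, keyed by whether the database is protein.
INDEX_EXTENSIONS = {
    True: [".pnd", ".pni", ".psd", ".psi"],   # protein
    False: [".nnd", ".nni", ".nsd", ".nsi"],  # nucleotide
}


def expected_index_files(db_name, protein=True):
    """Return the index filenames expected for a single-volume database."""
    return [db_name + ext for ext in INDEX_EXTENSIONS[protein]]


def missing_index_files(db_name, protein=True, directory="."):
    """Return the expected index files that are not present in directory."""
    return [
        name
        for name in expected_index_files(db_name, protein)
        if not os.path.exists(os.path.join(directory, name))
    ]
```

An empty list from missing_index_files suggests the database was indexed. Any names it returns point to a database that should be reformatted with -o before it is used with GI lists.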

If GI numbers are included and more than one sequence has the same GI number, formatdb terminates with an error. If accession numbers aren't unique, an error won't be issued (see -V).
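
Because a duplicated GI number stops formatdb, it can be worth checking a large FASTA file before formatting it. The Python sketch below reports any GI number that appears on more than one definition line. It assumes definition lines begin with >gi|number| and is not how formatdb itself performs the check; the function name duplicate_gis is hypothetical.

```python
import re
from collections import Counter

# Matches the GI number at the start of an NCBI-style definition line.
GI_PATTERN = re.compile(r"^>gi\|(\d+)\|")


def duplicate_gis(fasta_lines):
    """Return, in ascending order, GI numbers seen on more than one definition line."""
    counts = Counter()
    for line in fasta_lines:
        match = GI_PATTERN.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return sorted(gi for gi, seen in counts.items() if seen > 1)
```

Passing an open file handle to duplicate_gis works, because the function only iterates over lines. An empty result means no GI number is repeated.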

-p [T/F]

Default: T

Specifies the type of file being formatted. By default, formatdb assumes the file is protein, so you must set -p F whenever you format nucleotide databases.

-s [T/F]

Default: Optional

Creates indexes for accessions but not locus names. Must be used in conjunction with the -o parameter. For many sequences from DDBJ/GenBank/EMBL, the locus name and accession number are identical and some disk space can be saved by not including redundant information. In general, locus names are historical relics, so always include -s.

-t [string]

Default: Optional

Sets the title of the database. If this parameter isn't set, the title is the name of the FASTA file or, if -n was set, the argument of -n. -t lets you use more descriptive names that you might not want as filenames. For example:

formatdb -i proteins -t "my favorite human proteins"

In the BLAST report, this is reported in the header as:

Database: my favorite human proteins

Using this parameter can be confusing: the report shows the title rather than the filename, so tracing a report back to the database that produced it might be difficult.

-v [integer]

Default: Optional

Sets the maximum number of letters (bases or residues) stored in a single volume. Values range from 1 to 2147483647 (2^31 - 1, or roughly 2 billion). This parameter is useful if the filesystem doesn't support large files. Databases with more than [integer] letters are automatically split into volumes, and an alias is created. See Chapter 9 for more information.
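
To get a feel for how many volumes a given -v value will produce, you can count the letters in the FASTA input and divide. The Python sketch below does this. Treat the result as a rough lower bound rather than a prediction, since it assumes every volume is filled right up to the limit. The function names are invented for this example.

```python
MAX_VOLUME_LETTERS = 2147483647  # upper limit for -v given in the text


def count_letters(fasta_lines):
    """Count sequence letters, skipping definition lines that start with '>'."""
    return sum(len(line.strip()) for line in fasta_lines if not line.startswith(">"))


def estimate_volumes(total_letters, max_letters):
    """Estimate the number of volumes as the ceiling of total / max_letters."""
    if not 1 <= max_letters <= MAX_VOLUME_LETTERS:
        raise ValueError("max_letters must be between 1 and 2147483647")
    # Integer ceiling division; an empty input still counts as one volume.
    return max(1, (total_letters + max_letters - 1) // max_letters)
```

For instance, 5 billion letters formatted with -v 2000000000 need at least three volumes by this estimate.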

-V [T/F]

Default: F

Reports warning messages if sequence identifiers aren't unique. Requires the -o option.