formatdb turns FASTA files into BLAST
databases. ASN.1 format is also acceptable, but because it
isn't commonly used, it isn't
covered in this book; you can find more information about ASN.1 at
http://www.ncbi.nlm.nih.gov/Sitemap/Summary/asn1.html/.
Chapter 11 discusses the typical methods for
building BLAST databases and examines the NCBI identifier syntax
required for some aspects of formatdb and
blastall. Here are a few sample command lines:
formatdb -i protein_db
formatdb -p F -i nucleotide_db
zcat est*.gz | formatdb -p F -i stdin -o -n est -v 2000000000
The following reference lists the default value for each
formatdb parameter.
Specifies a binary GI output file. The advantage of using a binary GI
file is that it's smaller than a corresponding text
file and can be read directly into memory without being parsed. See
the -F option.
To convert a text GI file to binary, use the following command:
formatdb -F text_gi_list -B binary_gi_list
Specifies a GI file, either text or
binary. This is used for creating an alias database that
contains no sequences of its own, only pointers to sequences
stored in another database (which may be an alias database as well).
See the -L parameter. The underlying databases must use the
NCBI FASTA identifier syntax, include GI numbers, and be indexed with
-o.
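A text GI file is simply one GI number per line, so it can be assembled with standard tools. The following is a minimal sketch, assuming definition lines begin with the NCBI form >gi|number|...; the function name gi_list_for is hypothetical and not part of the BLAST distribution. It looks only at the first identifier on each definition line.

```shell
# Print the GI number of every sequence whose definition line matches a
# keyword (case-insensitive). Assumes deflines start with ">gi|<number>|".
# gi_list_for is a hypothetical helper, not a BLAST tool.
gi_list_for() {
    # $1 = keyword, $2 = FASTA file
    grep '^>gi|' "$2" | grep -i -- "$1" | cut -d'|' -f2
}
```

For example, gi_list_for kinase protein_db > gi_list would write a text GI file that could then be passed to -F.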
Sets the
input FASTA file. You may specify that input come from
stdin with -i
stdin, but you must also set the
-n parameter to give it a name. If you wish to
make a single BLAST database from multiple FASTA files, pipe them to
formatdb as follows:
cat file1 file2 file3 | formatdb -i stdin -n my_db
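One thing to watch when concatenating: if a file doesn't end with a newline, cat joins its last sequence line to the next file's definition line, and that record is silently lost. The sketch below is one way to guard against this, assuming a POSIX awk is available; cat_fasta is a hypothetical helper name.

```shell
# Concatenate FASTA files, guaranteeing that each file's last line ends
# with a newline. "awk 1" prints every input line followed by a newline.
# cat_fasta is a hypothetical helper, not a BLAST tool.
cat_fasta() {
    awk 1 "$@"
}
```

It is a drop-in replacement for cat in the pipeline above: cat_fasta file1 file2 file3 | formatdb -i stdin -n my_db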
Specifies an output log file. Log messages are appended to this file.
Creates an
alias database, which has several uses. It can be a simple synonym
for another database, a selection of specific records from a database
(see the -F option), or a static virtual database.
Alias databases have the .pal or
.nal extension, depending on whether they are
proteins or nucleotides.
To create an alias database with a selected set of GI numbers:
formatdb -i db -F gi_list -L alias_name -p [T/F]
To merge databases, first create a synonymous alias and then edit it
to include additional database names. Chapter 11
covers this process in more detail.
Default: optional; required with -i stdin
Sets the base name for the BLAST database. If not specified, the name
of the FASTA file will be used. If the input is from
stdin, this parameter must be set.
Creates indexes. Indexing the
databases isn't required but is recommended. Alias
databases that use GI lists (see -F and
-L options) and the blastall
-l option require indexed databases. Additionally,
some blastall output options specified with the
-m parameter require indexing. Indexing adds four
files with extensions .nnd,
.nni, .nsd, and
.nsi for nucleotides and
.pnd, .pni,
.psd, and .psi for
proteins. If you know you don't need indexes, you
can save space by omitting -o.
If GI numbers are included and more
than one sequence has the same GI number,
formatdb terminates with an error. If accession
numbers aren't unique, an error
won't be issued (see -V).
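Because a duplicate GI number stops formatdb partway through, it can be worth checking a large FASTA file beforehand. This is a minimal sketch, assuming definition lines begin with the NCBI form >gi|number|...; dup_gis is a hypothetical helper name, and it inspects only the first identifier on each definition line.

```shell
# Print each GI number that appears on more than one definition line.
# Assumes deflines start with ">gi|<number>|".
# dup_gis is a hypothetical helper, not a BLAST tool.
dup_gis() {
    # $1 = FASTA file
    grep '^>gi|' "$1" | cut -d'|' -f2 | sort -n | uniq -d
}
```

If dup_gis nucleotide_db prints nothing, no GI number is repeated among the first identifiers.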
Specifies the type of file
being formatted. By default, formatdb assumes
the file is protein, so you must set -p F whenever
you format nucleotide databases.
Creates indexes for accessions but
not locus names. Must be used in conjunction with the
-o parameter. For many sequences from
DDBJ/GenBank/EMBL, the locus name and accession number are identical
and some disk space can be saved by not including redundant
information. In general, locus names are historical relics, so always
include -s.
Sets the title
of the database. If this parameter isn't set,
the title will be the name of the FASTA file or the
argument of -n, if it was set.
-t lets you use more descriptive names that you
might not want as filenames. For example:
formatdb -i proteins -t "my favorite human proteins"
In the BLAST report, this title appears in the header as:
Database: my favorite human proteins
Using this parameter can be confusing, because backtracking from
reports to databases might be difficult.
Sets the maximum number of sequence letters stored in a single volume.
Values range from 1 to 2147483647 (2^31 - 1, or about 2 billion). This
parameter is useful if the filesystem doesn't
support large files. Databases with more than
[integer] letters are automatically split into volumes, and an
alias is created. See Chapter 9 for more
information.
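The arithmetic behind the limit can be checked in the shell. The sketch below assumes a shell with 64-bit integer arithmetic; volumes_needed is a hypothetical helper that gives only a lower bound, since the text doesn't say exactly where formatdb places the split between volumes.

```shell
# Upper limit quoted in the text: 2^31 - 1.
max_v=$(( (1 << 31) - 1 ))
echo "$max_v"

# Lower bound on the number of volumes for a database of $1 letters when
# each volume holds at most $2 letters (ceiling division).
# volumes_needed is a hypothetical helper, not a BLAST tool.
volumes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}
```

For instance, volumes_needed 5000000000 2000000000 suggests that a 5-billion-letter database formatted with -v 2000000000 would need at least three volumes.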
Reports warning messages if
sequence identifiers aren't unique. Requires the
-o option.