xdformat formats BLAST databases from FASTA files. It also reports
descriptive information about a database and dumps its entire content in
FASTA format.
Here are some examples:
xdformat -n files
xdformat -p files
zcat fasta.*.gz | xdformat -o my_db -n -- -
xdformat -n -i database
xdformat -n -r database > fasta_file
When indexing accession.version identifiers, you have three indexing
options:
- 0: Accession only; version isn't stored
- 1: Stored as accession.version
- 2: Stored as both accession only and accession.version
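The three options can be pictured with a small shell sketch. The helper index_keys and the identifier AF123456.2 are hypothetical and only illustrate which keys each option would make searchable; this is not how xdformat stores its index internally.

```shell
# Hypothetical helper: print the index keys each option would store
# for one accession.version identifier (modes 0, 1, 2 as listed above).
index_keys() {
    mode=$1
    id=$2
    acc=${id%%.*}    # drop ".version" to get the bare accession
    case $mode in
        0) printf '%s\n' "$acc" ;;
        1) printf '%s\n' "$id" ;;
        2) printf '%s\n%s\n' "$acc" "$id" ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

# Mode 2 prints both the bare accession and accession.version.
index_keys 2 AF123456.2
```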
Appends sequences to the named database. If the database is indexed,
the appended sequences will also be indexed.
If an invalid letter is encountered, xdformat
terminates and reports an error message. If this occurs, check the
sequence file for errors. After checking, you may either skip illegal
characters with -k or change them to a legal
character with -c. The typical operation for
nucleotides is to set -c N, and
for proteins -c X.
See also -k
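When checking the sequence file for errors, it can help to list the offending lines first. The sketch below does this outside xdformat. The ALPHABET value (IUPAC nucleotide codes) and the helper name find_bad_letters are assumptions for illustration; the set of letters xdformat itself accepts may differ.

```shell
# ALPHABET is an assumption (IUPAC nucleotide codes), not xdformat's own table.
ALPHABET='ACGTURYSWKMBDHVNacgturyswkmbdhvn'

# Hypothetical helper: read FASTA on stdin and print "line_number:line"
# for every sequence line that contains a letter outside ALPHABET.
find_bad_letters() {
    awk -v ok="$ALPHABET" '
        /^>/ { next }                              # skip definition lines
        $0 ~ ("[^" ok "]") { print NR ":" $0 }     # any letter not in the set
    '
}

printf '>seq1\nACGTNN\n>seq2\nAC?GT\n' | find_bad_letters
```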
Sets the maximum length for definition lines.
Sets a user-defined release date for the database. The date may have
63 characters at most.
See also -v
Appends information and errors to the named file.
Prefaces each sequence with the database record number in the format
of gnl|xdf|#.
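As a rough picture of the resulting definition lines, the sketch below numbers FASTA records with awk. It assumes numbering starts at 1 and that a space separates the new identifier from the original definition line; both details, and the helper name number_records, are assumptions rather than documented xdformat behavior.

```shell
# Hypothetical helper: prefix each definition line with gnl|xdf|N,
# where N is the record's position in the input (assumed to start at 1).
number_records() {
    awk '
        /^>/ { n++; sub(/^>/, ">gnl|xdf|" n " ") }
        { print }
    '
}

printf '>a\nAC\n>b\nGT\n' | number_records
```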
Reports descriptive information about a BLAST database. This is
useful for determining when a database was created, how many
sequences it contains, and if it is indexed.
Sets the maximum number of
identifiers with Control-A separators. This is useful for trimming
highly redundant sequences created with nrdb or
another redundancy purifier that uses Control-A separators.
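The effect of such a limit can be sketched with awk, which accepts Control-A as the octal escape \001. The helper trim_ids is hypothetical and works on FASTA text before formatting; it only illustrates the trimming, not xdformat's implementation. It expects a limit of at least 1.

```shell
# Hypothetical helper: keep at most $1 Control-A separated identifiers
# on each definition line; sequence lines pass through unchanged.
trim_ids() {
    awk -v max="$1" '
        /^>/ {
            n = split($0, ids, "\001")
            if (n > max) n = max
            line = ids[1]
            for (i = 2; i <= n; i++) line = line "\001" ids[i]
            print line
            next
        }
        { print }
    '
}

# Three merged identifiers are trimmed to the first two.
printf '>id1 desc\001id2 desc\001id3 desc\nACGT\n' | trim_ids 2
```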
If an invalid letter is encountered, xdformat
terminates. If this occurs, you can either skip illegal characters
with -k or change them to a legal letter with
-c. Check the errors to ensure the input file is
formatted properly.
See also -c
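The difference between skipping and replacing can be emulated on plain FASTA text. In this sketch the ALPHABET value and the helper clean_letters are assumptions; it mimics the two behaviors outside xdformat and does not claim to match its exact rules.

```shell
# ALPHABET is an assumption (IUPAC nucleotide codes), not xdformat's own table.
ALPHABET='ACGTURYSWKMBDHVNacgturyswkmbdhvn'

# Hypothetical helper: $1 is the replacement letter; pass an empty
# string to drop illegal letters instead of replacing them.
clean_letters() {
    awk -v ok="$ALPHABET" -v rep="$1" '
        /^>/ { print; next }                  # definition lines are untouched
        { gsub("[^" ok "]", rep); print }
    '
}

printf '>s\nAC?GT\n' | clean_letters N     # replace, in the spirit of -c N
printf '>s\nAC?GT\n' | clean_letters ''    # skip, in the spirit of -k
```

Replacing keeps sequence coordinates intact, while skipping shortens the sequence by one letter for every character dropped.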
Default: 100000000 (100 million letters)
Sets the maximum sequence length. For optimal performance, break up
large sequences into smaller fragments no larger than 1 million
letters.
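One way to break up large sequences before formatting is a short awk filter. This is a minimal sketch: the helper split_fasta and the _fragN naming scheme are hypothetical, fragments do not overlap, and only the first word of each definition line is kept.

```shell
# Hypothetical helper: split every sequence into fragments of at most
# $1 letters, naming them <id>_frag1, <id>_frag2, ...
split_fasta() {
    awk -v max="$1" '
        function emit(   i, part) {
            if (name == "") return
            part = 0
            for (i = 1; i <= length(seq); i += max) {
                part++
                print ">" name "_frag" part
                print substr(seq, i, max)
            }
        }
        /^>/ { emit(); name = substr($1, 2); seq = ""; next }
        { seq = seq $0 }                      # join wrapped sequence lines
        END { emit() }
    '
}

# A 10-letter sequence split into fragments of at most 4 letters.
printf '>seq1 demo\nACGTAC\nGTAC\n' | split_fasta 4
```

With a limit of 1000000, the output of such a filter could be piped into xdformat -o my_db -n -- - in the same way as the zcat example.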
Sets the minimum sequence length.
Sets the cache size for indexing. For
faster indexing, the size may be increased (for example, -M
512m).
Sets the number of
bytes of precision. The default value allows databases of up to 4
billion amino acids or 16 billion nucleotides. If you expect a
database to contain more than this limit, increasing precision by one
level multiplies the limit by 256. Setting -O is
necessary only if you append to the database because the precision
automatically increases appropriately when databases are created.
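The limits quoted above are consistent with 4 bytes of precision and, for nucleotides, 4 letters per byte. Both values are inferred from the stated 4 billion and 16 billion figures rather than taken from the xdformat documentation, and the helper max_letters is hypothetical. Under those assumptions the arithmetic looks like this:

```shell
# Hypothetical helper: largest database size in letters for
# $1 bytes of precision and $2 letters per byte (assumed 1 or 4).
max_letters() {
    bytes=$1
    limit=$2
    while [ "$bytes" -gt 0 ]; do
        limit=$((limit * 256))    # each extra byte multiplies the limit by 256
        bytes=$((bytes - 1))
    done
    echo "$limit"
}

max_letters 4 1    # 4294967296     (about 4 billion amino acids)
max_letters 4 4    # 17179869184    (about 16 billion nucleotides)
max_letters 5 1    # 1099511627776  (one more byte: 256 times larger)
```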
This option applies only when dumping the entire content of a
database with -r. -P controls
the length of the sequence lines; -P 0 puts the
whole sequence on one line.
See also -r
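The effect of the line length setting can be imitated on a bare sequence with standard tools. The helper wrap_seq is hypothetical; it expects sequence letters only (no definition line) on stdin and treats a width of 0 as "one line", mirroring the description above.

```shell
# Hypothetical helper: rewrap sequence letters to $1 letters per line;
# a width of 0 puts the whole sequence on one line.
wrap_seq() {
    if [ "$1" -eq 0 ]; then
        tr -d '\n'
        echo
    else
        { tr -d '\n'; echo; } | fold -w "$1"
    fi
}

printf 'ACGTAC\nGTAC\n' | wrap_seq 4    # three lines: ACGT, ACGT, AC
printf 'ACGTAC\nGTAC\n' | wrap_seq 0    # one line: ACGTACGTAC
```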
Certain files may contain numerous nonfatal errors in their
identifier format. -q quiets these errors.
- 0: No silencing
- 1: Silences field 1 errors
- 2: Silences field 2 errors
- 3: Silences errors in all fields
Reports (dumps) the entire database content to stdout in FASTA format.
This option lets you restrict indexing of identifiers to a particular
database name or tag. The [string] has two parts: part 1 is the name
of the database (e.g., gb for GenBank or
emb for EMBL; see Chapter 10), and part 2 is either blank or a number.
- blank: Index all identifiers.
- 0: Don't index.
- 1: Index only field 1.
- 2: Index only field 2.
Here are some examples:
- -T emb0 doesn't index EMBL records.
- -T gb1 indexes GenBank accession but not locus.
- -T gb2 indexes GenBank locus but not accession.
- -T gb indexes both accession and locus of GenBank records.
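The two-part structure of the string can be made explicit with a small shell sketch. The helper parse_tag is hypothetical and assumes the database name itself does not end in 0, 1, or 2; it only decodes the string in the way the examples above describe.

```shell
# Hypothetical helper: split a -T string into database name and
# field selector, then describe what would be indexed.
parse_tag() {
    tag=$1
    case $tag in
        *[0-2]) name=${tag%?}; field=${tag#"$name"} ;;   # trailing digit present
        *)      name=$tag;     field= ;;                 # blank second part
    esac
    case $field in
        '') what='all identifiers' ;;
        0)  what='nothing' ;;
        1)  what='field 1 only' ;;
        2)  what='field 2 only' ;;
    esac
    printf '%s: index %s\n' "$name" "$what"
}

for t in emb0 gb1 gb2 gb; do parse_tag "$t"; done
```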
Sets a user-defined version string for the database (a maximum of 63
characters).
See also -d
Databases that are formatted but
not indexed may be indexed or re-indexed (e.g., with a different
indexing scheme) with -X. In the following
examples, the two commands on Line 1 are equivalent to the one on
Line 2.
xdformat -n nt_db ; xdformat -n -X nt_db
xdformat -n -I nt_db