13.3 blastall Parameters

blastall is controlled by several parameters. Many of the parameters have default settings and don't need to be explicitly assigned. Consider this simple command:

blastall -p blastp

Behind the scenes, this command is converted to:

blastall -p blastp -d nr -i stdin -e 10 -m 0 -o stdout -F T -G 11 -E 1 -X 15 -v 500
-b 250 -f 11 -g T -a 1 -M BLOSUM62 -W 3 -z 0 -K 0 -Y 0 -T F -U F -y 0.0 -Z 0 -A 40

You can see that many parameters are set without your express knowledge. These parameters affect the results of your experiment and, as reinforced many times throughout the book, you should try to understand these parameters and set them to fit each experiment.
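Because the defaults are silent, it can help to build the command line in a script so that every setting that matters to the experiment is written down. The following Python sketch shows one way to do this. The function name blastall_args and the particular settings chosen are illustrative assumptions, not part of BLAST.

```python
def blastall_args(program, settings):
    """Return a blastall argument list with each setting spelled out.

    settings maps a one-letter flag (without the dash) to its value.
    Flags are emitted in sorted order so the command is reproducible.
    """
    args = ["blastall", "-p", program]
    for flag in sorted(settings):
        args += ["-" + flag, str(settings[flag])]
    return args


# Hypothetical settings for one experiment; adjust them to your own needs.
explicit = {"d": "nr", "e": 10, "F": "T", "M": "BLOSUM62", "v": 500, "b": 250}
print(" ".join(blastall_args("blastp", explicit)))
```

The list form can be handed to a process launcher as is, and the printed string can be pasted into a lab notebook alongside the results.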

The following reference section explains all the parameters available for blastall and lists the default values that are used if not explicitly set. The table was compiled according to the default values for the five basic programs. Although megablast can be run from within blastall (-n T), you should use the standalone program. The parameters for it are presented later in the chapter.

-a [integer]

Default: 1

Programs: All

Sets the number of processors to use on a multiprocessor system. If you have multiple queries, you will get better throughput by running several BLAST searches at the same time. For insensitive searches such as default BLASTN, setting -a to a higher value may not appreciably improve speed if disk I/O is the bottleneck.

-A [integer]

Default: blastn 0, others 40

Programs: All

Sets the multiple-hit window size. When BLAST is set to two-hit mode, this option requires two word hits on the same diagonal to be within [integer] letters of each other in order to extend from either one. The larger the [integer], the more sensitive BLAST will be. Setting [integer] to 0 sets the default behavior of 40, except for blastn, whose default is single word hit. To specify one-hit behavior, set -P 1.
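The two-hit rule can be sketched in a few lines of Python. This is a simplified illustration, not the BLAST implementation: it assumes each word hit is recorded as a (query position, subject position) pair, treats the diagonal as the difference between the two positions, and ignores details such as overlapping hits. The function name is hypothetical.

```python
def triggers_extension(hit_a, hit_b, window=40):
    """Return True if two word hits satisfy a simplified two-hit rule.

    Each hit is a (query_pos, subject_pos) tuple. Two hits lie on the
    same diagonal when query_pos - subject_pos is equal for both.
    """
    (qa, sa), (qb, sb) = hit_a, hit_b
    same_diagonal = (qa - sa) == (qb - sb)
    distance = abs(qb - qa)
    return same_diagonal and 0 < distance <= window


print(triggers_extension((10, 110), (35, 135)))             # same diagonal, 25 apart
print(triggers_extension((10, 110), (60, 160)))             # same diagonal, 50 apart
print(triggers_extension((10, 110), (60, 160), window=60))  # larger window
```

The third call shows why a larger window is more sensitive: a pair of hits that is too far apart for a window of 40 is accepted when the window grows to 60.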

-b [integer]

Default: 250

Programs: All

Truncates the report to [integer] number of alignments. There is no warning when you exceed this limit, so it's generally a good idea to set [integer] very high unless you're interested only in the top hits.

-B [integer]

Default: Optional

Programs: blastn, tblastn

Sets the number of queries to concatenate in a single search. Concatenating queries accelerates the search because the database is scanned just one time. This is the principle underlying megablast, but the implementation is different in blastall.

This option is new in Version 2.2.6 and still experimental. The specified [integer] must be the number of sequences in the query file. If it's less, only the first set of [integer] sequences is used. Also, the output is very different from what you would expect. All the query names are listed, and then all the one-line summaries are given, followed by the alignments, and finally, one footer is produced for the whole report. Given this format, it's very difficult to discern which alignments belong to which query. This option should not be used in its current implementation.

-d [database]

Default: nr

Programs: All

Identifies the database to search. [database] must already be formatted by formatdb. BLAST looks for [database] in the following order: the local directory, the BLASTDB environment variable (Unix only), and finally, the location specified in the .ncbirc file.

You can merge multiple databases into a single virtual database by putting the individual databases in quotes. For example, to merge the nt and est databases, use: -d "nt est". You can't mix nucleotide and amino acid databases. The statistics reported are based on the sizes of the combined databases. Virtual databases may exceed file size limits imposed by the operating system.

-D [1..23]

Default: 1

Programs: tblastn, tblastx

The genetic code to use for translation of the database nucleotide sequence. See http://www.ncbi.nlm.nih.gov/htbin-post/Taxonomy for updates.

Options

1: Standard Nuclear Genetic Code
2: Vertebrate Mitochondrial
3: Yeast Mitochondrial
4: Mold, Protozoan, and Coelenterate Mitochondrial
5: Invertebrate Mitochondrial
6: Ciliate Nuclear
9: Echinoderm Mitochondrial
10: Euplotid Nuclear
11: Bacterial and Plant Plastid
12: Alternative Yeast Nuclear
13: Ascidian Mitochondrial
14: Flatworm Mitochondrial
15: Blepharisma Nuclear
16: Chlorophycean Mitochondrial
21: Trematode Mitochondrial
22: Scenedesmus obliquus Mitochondrial
23: Thraustochytrium Mitochondrial
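To see why the choice of genetic code matters, consider how the same nucleotide sequence translates under codes 1 and 2. The Python sketch below is only an illustration. The two 64-letter strings are written out from the NCBI genetic code tables with codons ordered by the bases T, C, A, G, and the function and variable names are assumptions made for this example.

```python
BASES = "TCAG"

# Amino acids for the 64 codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
CODES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    2: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",
}


def translate(dna, code=1):
    """Translate dna in frame 1 using the given genetic code number.

    Only A, C, G, and T are handled; any other letter raises ValueError.
    """
    table = CODES[code]
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3].upper()
        index = (16 * BASES.index(codon[0])
                 + 4 * BASES.index(codon[1])
                 + BASES.index(codon[2]))
        protein.append(table[index])
    return "".join(protein)


seq = "ATGTGAATAAGA"
print(translate(seq, code=1))  # M*IR
print(translate(seq, code=2))  # MWM*
```

Under the standard code, TGA is a stop codon and AGA encodes arginine. Under the vertebrate mitochondrial code, TGA encodes tryptophan, ATA encodes methionine, and AGA is a stop codon, so a search run with the wrong code can break or mistranslate a real coding region.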

-e [real number]

Default: 10

Programs: All

Sets the threshold expectation value for keeping alignments. This is the E from the Karlin-Altschul equation that describes how often an alignment with a given score is expected to occur at random.
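The Karlin-Altschul equation, E = Kmn e^(-lambda S), can be evaluated directly to see how the threshold works. The Python sketch below is a minimal illustration that ignores the edge-effect corrections BLAST applies to the query and database lengths. The values used for lambda and K in the usage lines are placeholders; the real values depend on the scoring matrix and gap penalties.

```python
import math


def expect_value(score, m, n, lam, k):
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S).

    score is the raw alignment score, m the query length, and n the
    database length, both in letters.
    """
    return k * m * n * math.exp(-lam * score)


def keep_alignment(score, m, n, lam, k, threshold=10.0):
    """Return True if the alignment passes an -e style threshold."""
    return expect_value(score, m, n, lam, k) <= threshold


# Illustrative parameters only (lam and k are assumed values).
for s in (40, 60, 80):
    e = expect_value(s, m=300, n=500_000_000, lam=0.267, k=0.041)
    print(s, e, keep_alignment(s, 300, 500_000_000, 0.267, 0.041))
```

Because E falls exponentially with the score, a modest increase in score moves an alignment from well above the default threshold of 10 to far below it.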

-E [integer]

Default: blastn 2, others 1

Programs: All

The penalty for each gap character. The -G parameter controls the initial cost of opening a gap. Note that -E 0 is synonymous with the default behavior, so it's impossible to set -E to zero unless -g F is set, which turns gapping off. The default gap cost, for programs other than blastn, depends on the scoring matrix. The value shown here is for the default BLOSUM62 matrix. See Appendix C for a complete list of default and legal gap penalties.

-f [integer]

Defaults: blastp 11, blastx 12, tblastn 13, tblastx 13

Programs: blastp, blastx, tblastn, tblastx

Neighborhood word threshold score. Only those words scoring equal to or greater than [integer] will seed alignments.

-F [T/F], -F [string]

Default: T, but see below

Programs: All

Filters the query sequence for low-complexity subsequences. The default setting is T. Complexity filtering is generally a good idea, but it may break long HSPs into several smaller HSPs due to low-complexity segments. This can cause some alignments to fall below the significance threshold and be lost. To prevent this, either turn off filtering (not recommended) or use soft masking, in which the filter is used only in the word seeding phase, but not the extension phase.

The parameter argument's [string] form follows a nonintuitive syntax. If the string begins with an m, soft masking is turned on. Filtering programs are specified by a single capital letter: D for DUST, R for human repeats, V for vector sequences, S for SEG, and C for coiled-coil. D, R, and V are used only for blastn searches, and S and C are used for all other programs. More than one filter may be specified, and additional parameters may be passed to the programs. See the following tables and the -U parameter used for filtering lowercase letters in the query sequence.

To use R or V, the correct database files must be downloaded and installed in the BLASTDB directory. For human repeats, three databases are needed: humlines.lib, humsines.lib, and retrovir.lib. For vector filtering, use the UniVec_Core database (ftp://ftp.ncbi.nih.gov/pub/UniVec/).

String options for blastn

Behavior	Parameter format
No complexity filter	-F ""
Default (DUST)	-F "D"
Soft masking	-F "m D"
Lowercase soft masking	-F "m" -U
Soft masking of DUST and lowercase letters	-F "m D" -U
Mask human repeats	-F "R"
Mask vector sequences	-F "V"
Soft-masking of human repeats and vector	-F "m R;V"

String options for blastp, blastx, tblastn, and tblastx

Behavior	Parameter format
No complexity filter	-F ""
Default (SEG)	-F "S"
Soft masking	-F "m S"
Lowercase soft masking	-F "m" -U
Coiled-coil	-F "C"
SEG plus coiled-coil	-F "S;C"
SEG with settings for windowsize, locut, and hicut	-F "S 10 1.0 1.5"
As above, plus coiled coil and soft masking (including lowercase)	-F "m S 10 1.0 1.5; C" -U
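Because the string syntax is easy to get wrong, you may prefer to assemble it in a script. The following Python sketch builds filter strings in the forms shown in the tables above. The function name and its arguments are assumptions made for illustration, and the sketch joins multiple filters with a bare semicolon.

```python
VALID_FILTERS = set("DRVSC")


def filter_string(filters, soft=False):
    """Build an argument for -F.

    filters is a list of (letter, params) pairs, for example
    [("S", [10, 1.0, 1.5]), ("C", [])]. soft=True prefixes the
    string with "m" to request soft masking.
    """
    parts = []
    for letter, params in filters:
        if letter not in VALID_FILTERS:
            raise ValueError("unknown filter: " + letter)
        parts.append(" ".join([letter] + [str(p) for p in params]))
    body = ";".join(parts)
    if soft:
        return ("m " + body).strip()
    return body


print(filter_string([("D", [])], soft=True))            # m D
print(filter_string([("R", []), ("V", [])], soft=True))  # m R;V
print(filter_string([("S", [10, 1.0, 1.5])]))            # S 10 1.0 1.5
```

Remember to quote the resulting string on the command line so that the shell passes it to blastall as a single argument.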

-g [T/F]

Default: T

Programs: blastn, blastp, blastx, tblastn

Performs gapped alignment. Setting this to F invokes the older, ungapped style of alignment. You can't perform gapped alignments with tblastx, regardless of this setting.

-G [integer]

Defaults: blastn 5, others 11

Programs: All

Initial penalty for opening a gap of length 0. The penalty for extending the gap is controlled by the -E parameter. -G 0 invokes the default behavior, so setting -G to zero is impossible unless -g F is set, which turns gapping off. The default gap costs for programs other than blastn depend on the scoring matrix; the value here is for the default BLOSUM62 matrix. See Appendix C for a complete list of default and legal gap penalties.
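Taken together, -G and -E define an affine gap cost: opening a gap costs G, and each gap character adds E. A short Python sketch of the arithmetic follows; the function name is an assumption made for this example.

```python
def gap_cost(length, open_penalty=11, extend_penalty=1):
    """Cost of one gap of the given length under affine penalties.

    The opening penalty is charged once (a gap of length 0), and the
    extension penalty is charged for every gap character.
    """
    return open_penalty + length * extend_penalty


print(gap_cost(1))        # 12 with the BLOSUM62 defaults (-G 11 -E 1)
print(gap_cost(3))        # 14
print(gap_cost(3, 5, 2))  # 11 with the blastn defaults (-G 5 -E 2)
```

Note that a single gap of three letters costs 14 under the protein defaults, whereas three separate one-letter gaps cost 36, which is why affine penalties favor a few long gaps over many short ones.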

-i [input file]

Default: stdin

Programs: All

If -i isn't included on the command line, BLAST expects input from stdin (i.e., it will wait indefinitely for you to type in a FASTA file from the keyboard). The following commands are therefore equivalent:

blastall -p blastn -d nt -i query
blastall -p blastn -d nt < query
cat query | blastall -p blastn -d nt
cat query | blastall -p blastn -d nt -i stdin

If the input file contains multiple sequences, BLAST will be run on each sequence in order, and the resulting output will contain concatenated BLAST reports.

-I [T/F]

Default: F

Programs: All

Shows GenInfo Identifier (GI) numbers in definition lines. A GI is a unique numeric identifier assigned to a sequence in GenBank. Each GI corresponds to one accession.version pair.

-J [T/F]

Default: F

Programs: All

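Believes the query defline. When set to T, BLAST parses the identifiers on the query's definition line instead of treating the line as free text. This option must be set for the ASN.1 output formats (see the -m parameter).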

-K [integer]

Default: 0 - Off

Programs: All

The number of best hits from a region to keep. This option is useful when you want to limit the number of alignments that might pile up in one section of the query. This is most useful if the settings of -b or -v are low, and the abundant alignments push lower scoring alignments off the end of the report. If set, a value of 100 is recommended.

-l [file]

Default: Optional

Programs: All

Restricts database search to a list of GIs found in [file]. The database sequences must have NCBI-compliant identifiers, including GI numbers, and the database must be indexed (by running formatdb with the -o option). The [file] must be in the same directory as the database or in the directory from which blastall is called. [file] may be in text format with one GI per line or in binary format (see the -B parameter for formatdb).

-L [string]

Default: Optional

Programs: All

The location on query sequence. This lets you limit the search to a subsequence of the query sequence. For example, to search just the letters from 21 to 50, add the following parameter:

-L "21,50"

The alignments won't extend outside the specified region. In older versions of BLAST, -L set the size of the region under control of the -K parameter.
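If you generate many restricted searches from a script, a pair of small helpers keeps the coordinates straight. The Python sketch below assumes that -L coordinates are 1-based and inclusive, as the example of letters 21 to 50 suggests. Both function names are hypothetical.

```python
def location_arg(start, stop):
    """Format the argument for -L from 1-based, inclusive coordinates."""
    if not 1 <= start <= stop:
        raise ValueError("need 1 <= start <= stop")
    return "{},{}".format(start, stop)


def query_region(sequence, start, stop):
    """Return the letters of sequence that -L start,stop would cover."""
    return sequence[start - 1:stop]


query = "ACGT" * 25                      # a made-up 100-letter query
print(location_arg(21, 50))              # 21,50
print(len(query_region(query, 21, 50)))  # 30
```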

-m [0..11]

Default: 0

Programs: All

Sets the alignment viewing options. Appendix A gives examples of these display options.

Options

0: Pairwise
1: Query-anchored, showing identities, no gaps in query (gaps are shown as a tree-like thing in subjects), identities shown as ".", positives uppercase, negatives lowercase
2: Query-anchored, no identities, no gaps in query, negatives lowercase
3: Flat query-anchored, show identities, padding through all sequences
4: Flat query-anchored, no identities, padding through all sequences
5: Query-anchored, no identities and blunt ends (dashes [-] are used to blunt the ends)
6: Flat query-anchored, no identities and blunt ends, ([-] to ends)
7: XML output
8: Tabular
9: Tabular with comment lines
10: ASN.1 in text format ([-J] must be set for this option to work)
11: ASN.1 in binary format ([-J] must be set for this option to work)
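The tabular formats (-m 8 and -m 9) are the easiest to process with a script. The Python sketch below parses such output into dictionaries. It assumes the usual 12 tab-separated columns (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, and bit score) and skips the comment lines that -m 9 adds. The field names and the sample line are made up for illustration.

```python
FIELDS = ["query", "subject", "pct_identity", "length", "mismatches",
          "gap_opens", "q_start", "q_end", "s_start", "s_end",
          "evalue", "bit_score"]
INT_FIELDS = ("length", "mismatches", "gap_opens",
              "q_start", "q_end", "s_start", "s_end")
FLOAT_FIELDS = ("pct_identity", "evalue", "bit_score")


def parse_tabular(lines):
    """Parse -m 8 or -m 9 style lines into a list of dictionaries."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank line or -m 9 comment line
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError("expected 12 columns: " + line)
        hit = dict(zip(FIELDS, columns))
        for name in INT_FIELDS:
            hit[name] = int(hit[name])
        for name in FLOAT_FIELDS:
            hit[name] = float(hit[name])
        hits.append(hit)
    return hits


# A made-up line in the assumed format.
sample = "\t".join(["q1", "s1", "98.50", "200", "3", "0",
                    "1", "200", "11", "210", "1e-105", "380"])
for hit in parse_tabular(["# comment line", sample]):
    print(hit["subject"], hit["evalue"], hit["bit_score"])
```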

-M [matrix file]

Default: BLOSUM62

Programs: All except blastn

Designates a protein similarity matrix. This is used in all BLAST programs except blastn. Matrices are sought in the following order: in the local directory, in the location specified in the .ncbirc file, in a local data directory, and finally, in the BLASTMAT environment variable (only on Unix systems). Other matrices in the standard distribution include BLOSUM45, BLOSUM80, PAM30, and PAM70.

You can use custom matrix files, but it requires modifying the source code and defining the new matrix with all of its associated statistics for different affine gap combinations and recompiling the binary. Using these custom files isn't recommended because it requires the arduous task of calculating gapped values for lambda and maintaining a derivative branch of the source code.

-n [T/F]

Default: F

Programs: blastn

Sets the blastn program to the megablast mode, which is optimized to find near identities very quickly. The following lines are equivalent:

blastall -p blastn -n T -d est -i my_file
megablast -d est -i my_file -D 2

More program options are available if you run the megablast executable (see Section 13.6).

-o [output file]

Default: Optional

Programs: All

Designates an output file for the search results. If not used, output is printed to stdout. The following commands are equivalent:

blastall -p blastn -d nt -i query -o output
blastall -p blastn -d nt -i query > output

-p [program name]

Default: None, required parameter

Choices: blastn, blastp, blastx, tblastn, tblastx, psitblastn

When choosing psitblastn, the -R [checkpoint file] must also be specified. This special use of blastall uses the output PSSM checkpoint file of PSI-BLAST (see blastpgp -C option), combined with the protein query sequence, to implement a tblastn search against a nucleotide database.

-P [0/1]

Default: blastn 1, others 0

Programs: All

Specifies the two-hit or single-hit algorithm. The two-hit option requires two word hits on the same diagonal to extend from either one. When set to two-hit mode, the -A parameter specifies how close the two hits have to be to trigger extension.

Options

0: Two hit
1: Single hit

-q [negative integer]

Default: -3

Programs: blastn only

Sets the penalty for a nucleotide mismatch. Also see -r. The choices of [integer] for -q and -r are very important because they determine your target frequencies. The default values -r 1 -q -3 are most effective for aligning sequences that are 99 percent identical. See Appendix B for more information on nucleotide scoring schemes.
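A quick calculation shows how strongly the match and mismatch values shape the results. The Python sketch below scores an ungapped nucleotide alignment and computes the percent identity at which the expected score per column drops to zero. It is a simple arithmetic illustration, not the target-frequency calculation itself, and the function names are assumptions.

```python
def ungapped_score(matches, mismatches, reward=1, penalty=-3):
    """Raw score of an ungapped alignment under -r reward and -q penalty."""
    return matches * reward + mismatches * penalty


def break_even_identity(reward=1, penalty=-3):
    """Fraction of identical columns at which the average score is zero.

    Solves p * reward + (1 - p) * penalty = 0 for p.
    """
    return -penalty / (reward - penalty)


print(ungapped_score(90, 10))      # 60 for a 100-column, 90% identical alignment
print(break_even_identity())       # 0.75 with the defaults
print(break_even_identity(1, -1))  # 0.5 with a gentler mismatch penalty
```

With the defaults, an alignment must stay above 75 percent identity just to keep a positive score, which is why these values suit nearly identical sequences and perform poorly on divergent ones.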

-Q [1..23]

Default: 1

Programs: blastx, tblastx

Genetic code to use for translation of the query nucleotide sequence. See the -D parameter for list of genetic codes.

-r [integer]

Default: 1

Programs: blastn only

Sets the score of a nucleotide match. See the -q parameter and Appendix B.

-R [checkpoint file]

Default: Optional

Programs: psitblastn

Designates the PSI-BLAST checkpoint file to be used in the psitblastn search. -p must be set to psitblastn. The input must be a protein sequence and be the same one used with blastpgp -C to generate the [checkpoint file].

-S [1..3]

Default: 3

Programs: blastn, blastx, tblastx

Chooses which strand of DNA-based queries is searched.

Options

1: Top strand
2: Bottom strand
3: Both strands

For example, the following command searches only the query's top strand.

blastall -p blastn -d nt -i query -S 1
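The bottom strand is the reverse complement of the sequence in the query file. The following Python sketch shows how to compute it, which can be handy when checking the coordinates of a minus-strand hit by hand. The function name is hypothetical, and only the letters A, C, G, and T are complemented; other letters pass through unchanged.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def bottom_strand(sequence):
    """Return the reverse complement of a nucleotide sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


print(bottom_strand("ATGC"))  # GCAT
print(bottom_strand("AAGT"))  # ACTT
```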

-t [integer]

Default: 0

Programs: tblastn

Length of the largest intron allowed in tblastn for linking HSPs. A default of 0 means that linking is turned off.

-T [T/F]

Default: F

Programs: All

Produces HTML output with <anchor> links from the summary at the top of the report to the alignments farther below. This option should be used only with the standard report format (-m 0).

-v [integer]

Default: 500

Programs: All

Sets the number of database sequences for which to show the one-line summary descriptions at the top of a BLAST report. You won't be warned if you exceed [integer]. Also see the -b parameter.

-w [integer]

Default: 0

Programs: blastx only

Sets the frame shift penalty for the Out Of Frame (OOF) algorithm of blastx. When -w is set, it invokes the OOF mode of BLAST, which lets alignments proceed across reading frames. The expect values calculated from OOF blastx are only approximate, and BLAST issues the following warning when OOF is invoked:

[NULL_Caption] WARNING: test500: Out-of-frame option
selected, Expect values are only approximate and 
calculated not assuming out-of-frame alignments

The out-of-frame alignments are signified by slashes that indicate the +1 (/), +2 (//), -1 (\), and -2 (\\) frameshifts. The following is a sample OOF alignment:

Query: 23  PLIRNSL/YCINC\\A//QSIIRAHVKGPYLTRWVVNC/E\TCSKGYAKTPGASTDLLLL 160
           PLIRNSL YCINC     QSIIRAHVKGPYLTRWVVNC   TCSKGYAKTPGASTDLLLL
Sbjct: 1   PLIRNSL YCINC  X  QSIIRAHVKGPYLTRWVVNC X TCSKGYAKTPGASTDLLLL 53

Query: 161 YKTRNSLTSASSLSPVRSQRMI/N\SFPRFQGHLVVSG/S\SAHNR/FS\FNRDSPRGSG 322
           YKTRNSLTSASSLSPVRSQRMI   SFPRFQGHLVVSG   SAHNR F  FNRDSPRGSG
Sbjct: 54  YKTRNSLTSASSLSPVRSQRMI X SFPRFQGHLVVSG X SAHNR FX FNRDSPRGSG 107

Query: 323 SYCSREPMGQIKIRRTHTDDKLFR/ND\SRHTRAGDGLNI//TLA\\RDPSFLSRVYNAN 484
           SYCSREPMGQIKIRRTHTDDKLFR    SRHTRAGDGLNI   L   RDPSFLSRVYNAN
Sbjct: 108 SYCSREPMGQIKIRRTHTDDKLFR XX SRHTRAGDGLNI  XLX  RDPSFLSRVYNAN 161

Query: 485 SYLHI 499
           SYLHI
Sbjct: 162 SYLHI 166
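When many OOF alignments need to be reviewed, it can be useful to count the frameshifts in each one. The Python sketch below tallies the slash markers in the query line of an alignment. It assumes that markers appear only as runs of one or two slashes or backslashes, as in the sample above, and the function name is made up for this example.

```python
import re
from collections import Counter

# Marker text mapped to the frameshift it signifies.
SHIFT = {"/": 1, "//": 2, "\\": -1, "\\\\": -2}


def frameshifts(aligned_query):
    """Count frameshift markers in the query line of an OOF alignment.

    Returns a dictionary that maps each shift (+1, +2, -1, -2) to the
    number of times it occurs. A run of three or more slashes raises
    KeyError because it is not a marker this sketch knows about.
    """
    counts = Counter()
    for marker in re.findall(r"/+|\\+", aligned_query):
        counts[SHIFT[marker]] += 1
    return dict(counts)


# The first query line of the sample alignment.
line = r"PLIRNSL/YCINC\\A//QSIIRAHVKGPYLTRWVVNC/E\TCSKGYAKTPGASTDLLLL"
print(frameshifts(line))
```

For the line shown, the tally is two +1 shifts and one each of +2, -1, and -2.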

-W [integer]

Defaults: blastn 11, others 3

Programs: All

Sets the word size for the initial word search. The minimum word size for blastn is 7. Word sizes for blastp, blastx, tblastn, and tblastx are 2 or 3.

-X [integer]

Default: blastn 30, others 15

Programs: All, except tblastx

Sets the X2 dropoff value for gapped alignments. The value is measured in bits. Smaller values of X2 result in earlier termination of extensions. Adjusting this parameter is generally unnecessary.

-y [integer]

Default: blastn 20, others 7

Programs: All

Sets the X1 dropoff value (in bits) for extensions. The lower X1 is set, the shorter the extension will be. It's rarely necessary to adjust this parameter.

-Y [real number]

Default: 0

Programs: All

The effective length of the search space. This is the size of the database multiplied by the size of the query, or MN from the Karlin-Altschul equation.

If -Y is unset or set to 0, the actual size of the database and query is used.

-z [real number]

Default: 0

Programs: All

The effective length of the database. This option is useful for maintaining consistent statistics over time as databases grow.

If -z is unset or set to 0, the actual effective length of the database is used.
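To see why a fixed database length keeps statistics comparable, recall that the expected number of chance alignments is proportional to the size of the search space. The Python sketch below rescales an E-value from one database length to another under that assumption. It ignores edge-effect corrections, and the function name and numbers are illustrative only.

```python
def rescale_evalue(evalue, actual_length, fixed_length):
    """Convert an E-value computed at actual_length to fixed_length.

    Assumes E is directly proportional to the database length.
    """
    return evalue * fixed_length / actual_length


# The same alignment score reported against a database that has doubled.
e_last_year = 1e-3  # database of 1 billion letters (made-up numbers)
e_this_year = 2e-3  # database of 2 billion letters
print(rescale_evalue(e_this_year, 2_000_000_000, 1_000_000_000))
```

Rescaling the newer value to the older length recovers the original E-value, which is the effect of passing the same -z value to every search.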

-Z [integer]

Default: 25

Programs: All

Sets the X3 dropoff value (in bits) for extensions but is bounded by the value for X2. It's generally not necessary to adjust this parameter.