megablast
is similar to blastn but optimized to find near
identities very quickly. It's much faster than the
standard blastn, partly because it uses query
packing. Its extension algorithm differs from that of standard
blastn and isn't designed for
cross-species searches. Many parameters are identical between
megablast and blastall, but
some are unique to one program or the other, and some parameters with
the same symbol do different things.
Here are a few example command lines:
megablast -d my_db -i my_query -F "m D"
megablast -d my_db -i my_query -D 2 -t 18 -W 11
The number of processors; same as blastall.
The two-hit algorithm window size;
same as blastall.
The number of database sequences to
show; same as blastall, if -D
2 is set.
The database; same as blastall.
The type of output. The -m option applies only if
-D 2 is set here.
Options
- 0
-
One-line output for each alignment in the form of:
'subject-id'=='[+-]query-id' (s_beg q_beg s_end q_end)
Score
For example:
'AF071362'=='+AF071357' (1 715 200 920) 8
With non-affine gapping parameters (the default), Score is the
total number of differences (mismatches + gaps). With affine
gapping, it's the actual raw score.
- 1
-
Same as the output of -D 0, but
additionally shows the endpoints and percent identity for each
ungapped segment in the alignment.
#'>AF071362'=='+AF071357' (1 715 200 920) 8
a {
s 8
b 1 715
e 200 920
l 1 715 26 740 (96)
l 27 742 27 742 (100)
l 28 744 47 763 (100)
l 48 765 50 767 (100)
l 51 769 60 778 (100)
l 61 780 133 852 (100)
l 134 854 200 920 (99)
}
- s
-
Score.
- b
-
Begin coordinates for the subject and query, respectively.
- e
-
End coordinates for subject and query, respectively.
- l
-
Coordinates for each ungapped segment with the percent identity in
parentheses at the end.
- 2
-
A traditional BLAST output.
- 3
-
A tab-delimited, one-line format. The 12 reported tab-delimited
fields are as follows:
- Query
- Subject
- Percent identity
- Alignment length
- Mismatches
- Gap openings
- Query start
- Query end
- Subject start
- Subject end
- E value
- Bit score
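As a rough illustration of the one-line (-D 0) and tab-delimited
(-D 3) formats described above, here is a minimal Python sketch that
parses a single line of each. The class and function names are
hypothetical, the field names simply follow the list above, and the
tab-delimited sample line in the usage example is invented rather than
real megablast output.

```python
import re
from typing import NamedTuple


class OneLineHit(NamedTuple):
    """One alignment from -D 0 output (names are illustrative)."""
    subject_id: str
    strand: str
    query_id: str
    s_beg: int
    q_beg: int
    s_end: int
    q_end: int
    score: int


class TabularHit(NamedTuple):
    """One alignment from -D 3 output, in the documented field order."""
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    mismatches: int
    gap_openings: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    e_value: float
    bit_score: float


# 'subject-id'=='[+-]query-id' (s_beg q_beg s_end q_end) score
_D0 = re.compile(
    r"^'(?P<subject>[^']+)'=='(?P<strand>[+-])(?P<query>[^']+)'\s+"
    r"\((?P<s_beg>\d+)\s+(?P<q_beg>\d+)\s+(?P<s_end>\d+)\s+(?P<q_end>\d+)\)\s+"
    r"(?P<score>\d+)\s*$"
)


def parse_d0_line(line: str) -> OneLineHit:
    """Parse one -D 0 line; raise ValueError if it doesn't match."""
    m = _D0.match(line.strip())
    if m is None:
        raise ValueError(f"not a -D 0 line: {line!r}")
    return OneLineHit(
        m["subject"], m["strand"], m["query"],
        int(m["s_beg"]), int(m["q_beg"]),
        int(m["s_end"]), int(m["q_end"]),
        int(m["score"]),
    )


def parse_d3_line(line: str) -> TabularHit:
    """Parse one -D 3 line of 12 tab-delimited fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 12:
        raise ValueError(f"expected 12 fields, got {len(f)}")
    return TabularHit(
        f[0], f[1], float(f[2]),
        int(f[3]), int(f[4]), int(f[5]),
        int(f[6]), int(f[7]), int(f[8]), int(f[9]),
        float(f[10]), float(f[11]),
    )


if __name__ == "__main__":
    print(parse_d0_line("'AF071362'=='+AF071357' (1 715 200 920) 8"))
    # Invented sample line, for illustration only.
    print(parse_d3_line("q1\ts1\t98.5\t200\t3\t0\t1\t200\t11\t210\t1e-100\t380"))
```

Note that -D 0 lists subject coordinates before query coordinates,
whereas -D 3 lists the query columns first, so the two parsers keep
separate field names rather than sharing one record type.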
The
expectation value; same as blastall. However,
the default is a very large number, so there is
effectively no cutoff.
Setting -E and
-G turns on affine gapping (same as standard
blastall). This causes
megablast to use more memory and
isn't necessary when the sequences are expected to
be nearly identical. When -E and
-G aren't set, the gap extension
penalty is calculated from the match score (-r) and
mismatch penalty (-q) as E = r/2 - q, rounded
down to the nearest integer. For the default +1/-3 scoring
(r = 1, q = -3), this gives 1/2 + 3 = 3.5, so the
gap extension penalty equals 3.
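The arithmetic for the default gap extension penalty can be sketched
in a few lines of Python. The function name is hypothetical, and the
sketch assumes the mismatch penalty is passed as a negative number, as
in the +1/-3 example.

```python
import math


def default_gap_extension(match_score: int, mismatch_penalty: int) -> int:
    """Return E = r/2 - q, rounded down to the nearest integer.

    match_score is r (positive); mismatch_penalty is q (negative).
    """
    return math.floor(match_score / 2 - mismatch_penalty)


if __name__ == "__main__":
    # Default +1/-3 scoring: 1/2 - (-3) = 3.5, rounded down to 3.
    print(default_gap_extension(1, -3))
```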
Shows full IDs of the database
sequences in the output. The default is only the accession, or just
the GI if no accession is given. Applies to -D
0, -D 1, and
-D 3.
Filters the query sequence; same as
blastall.
Setting -E and -G turns on
affine gapping (same as standard blastall). This
causes megablast to use more memory and
isn't necessary when the sequences are expected to
be nearly identical.
The maximum number of HSPs to save
per database sequence. The default of 0 means
"unlimited."
The query file; same as blastall.
Shows GI numbers in database
deflines; same as blastall.
Can be used only with -D 2.
Restricts search to a list of GI numbers; same as
blastall.
The location on query sequence; same
as blastall.
Alignment view options. Requires -D
2, in which case it's the same as
blastall.
Default: 20000000 (20 million)
The maximum total length of queries for a single search. Reducing
this number reduces the amount of memory required by
megablast.
Uses dynamic programming extension
for affine gap scores. The default is to use a greedy algorithm for
extension.
The
type of discontiguous template. To use discontiguous seeding,
-t must be set to 16, 18, or 21, and
-W must be 11 or 12.
Discontiguous templates don't require the usual
exact word match employed by the other BLAST programs, but use a
template pattern that must be matched to seed an alignment. If a
template is specified by 1s and 0s, for example, with 1 representing
required matches and 0 representing residues that need not match,
then you can represent a template size 16 with a word size of 11 as:
1,110,010,110,110,111
Options
- 0
-
Coding template. This discontiguous
template uses a pattern of 110 to match coding
sequence where the third codon position is variable (and therefore
set to 0 and not required to match). Here are all coding template
combinations:
110,110,110,110,110,1 [11 of 16]
111,110,110,110,110,1 [12 of 16]
10,110,110,010,110,110,1 [11 of 18]
10,110,110,110,110,110,1 [12 of 18]
10,010,110,010,110,010,110,1 [11 of 21]
10,010,110,110,110,010,110,1 [12 of 21]
- 1
-
Optimal. This template pattern tries to minimize the correlation
between successive words. Here are all optimal template combinations:
1,110,010,110,110,111 [11 of 16]
1,110,110,110,110,111 [12 of 16]
111,010,010,110,010,111 [11 of 18]
111,010,110,010,110,111 [12 of 18]
111,010,010,100,010,010,111 [11 of 21]
111,010,010,110,010,010,111 [12 of 21]
- 2
-
Simultaneous optimal and coding. This option increases sensitivity by
allowing seeding from a match to either template at a given position.
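To make the template idea concrete, here is a minimal Python sketch
that checks each listed pattern against its stated word size and
template size, and tests whether two windows of sequence satisfy a
template. It is an illustration of the concept only, not megablast's
seeding code. The names are hypothetical, and the commas in the
listed patterns are assumed to be grouping separators with no meaning
of their own.

```python
CODING, OPTIMAL = 0, 1

# (template type, word size, template size) -> pattern as listed above
TEMPLATES = {
    (CODING, 11, 16): "110,110,110,110,110,1",
    (CODING, 12, 16): "111,110,110,110,110,1",
    (CODING, 11, 18): "10,110,110,010,110,110,1",
    (CODING, 12, 18): "10,110,110,110,110,110,1",
    (CODING, 11, 21): "10,010,110,010,110,010,110,1",
    (CODING, 12, 21): "10,010,110,110,110,010,110,1",
    (OPTIMAL, 11, 16): "1,110,010,110,110,111",
    (OPTIMAL, 12, 16): "1,110,110,110,110,111",
    (OPTIMAL, 11, 18): "111,010,010,110,010,111",
    (OPTIMAL, 12, 18): "111,010,110,010,110,111",
    (OPTIMAL, 11, 21): "111,010,010,100,010,010,111",
    (OPTIMAL, 12, 21): "111,010,010,110,010,010,111",
}


def normalize(template: str) -> str:
    """Drop the grouping commas, leaving only 1s and 0s."""
    return template.replace(",", "")


def is_consistent(template: str, word_size: int, template_size: int) -> bool:
    """True if the pattern has template_size positions, word_size of them 1."""
    bits = normalize(template)
    return (
        set(bits) <= {"0", "1"}
        and len(bits) == template_size
        and bits.count("1") == word_size
    )


def template_hit(template: str, query_window: str, subject_window: str) -> bool:
    """True if the windows agree at every position marked 1."""
    bits = normalize(template)
    if not (len(bits) == len(query_window) == len(subject_window)):
        raise ValueError("windows must be as long as the template")
    return all(
        q == s
        for bit, q, s in zip(bits, query_window, subject_window)
        if bit == "1"
    )


if __name__ == "__main__":
    for (kind, w, t), pattern in TEMPLATES.items():
        assert is_consistent(pattern, w, t), (kind, w, t)

    optimal_11_16 = TEMPLATES[(OPTIMAL, 11, 16)]
    query = "ACGTACGTACGTACGT"
    # Differs only at the fourth base, a position marked 0: still a hit.
    print(template_hit(optimal_11_16, query, "ACGAACGTACGTACGT"))
    # Differs at the first base, a position marked 1: not a hit.
    print(template_hit(optimal_11_16, query, "CCGTACGTACGTACGT"))
```

Under the same assumptions, the simultaneous option (2) would amount
to accepting a position when either the coding or the optimal
template reports a hit there.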
Output file; same as blastall.
Percent identity cutoff. Alignments
with a percent identity below [real number]
aren't reported. If using -D
0, all alignments are kept regardless of percent
identity (no trace-back is performed, so percent identity
can't be calculated).
The maximum number of positions for a hash value. If set to nonzero,
redundant subsequences will be masked in the word seeding phase. This
allows a simple type of filtering by masking out subsequences that
occur in the query sequences more than [integer]
times. When the word size (-W) is set to 16 or
higher, -P applies to subsequences of length 12;
it applies to subsequences of length 8 when -W is
set to less than 16.
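The filtering idea can be sketched as follows in Python: count how
often each fixed-length subsequence occurs across the queries and flag
those that exceed the limit. This is a conceptual illustration only.
The function names are hypothetical, and counting every overlapping
window is an assumption, since the text doesn't say how megablast
enumerates positions internally.

```python
from collections import Counter
from typing import Iterable, Set


def masked_subsequence_length(word_size: int) -> int:
    """Length of the subsequences counted: 12 if -W >= 16, else 8."""
    return 12 if word_size >= 16 else 8


def overrepresented(queries: Iterable[str], word_size: int,
                    max_positions: int) -> Set[str]:
    """Return subsequences occurring more than max_positions times.

    A max_positions of 0 disables the filter, so nothing is flagged.
    """
    if max_positions == 0:
        return set()
    k = masked_subsequence_length(word_size)
    counts: Counter = Counter()
    for seq in queries:
        # Assumption: every overlapping window of length k is counted.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return {word for word, n in counts.items() if n > max_positions}


if __name__ == "__main__":
    repeat = "ACGTACGT" * 5  # a 40-base, highly repetitive query
    print(overrepresented([repeat], word_size=11, max_positions=8))
```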
Mismatch penalty; same as blastall.
Masked query output. Each query sequence is written to
[file], but with every region that produced a hit replaced by Ns. This
works only in conjunction with -D 2.
Match score; same as blastall.
Reports a short log message at the end of the run.
The minimum hit score to report. All
alignments scoring less than [integer]
aren't reported. By default, this is set to the word
size, which results in all hits being reported.
The strands to search; same as blastall.
Sets discontiguous template size.
This, combined with a word size (-W) of either
11 or 12 and the template type (-N), turns on
discontiguous megablast.
The HTML output; same as blastall, but is active
only if -D 2 is set.
Lowercase filtering; same as
blastall.
The number of one-line descriptions. Same as
blastall if -D
2 is set.
Word size. The default word size is
very high because sequences aligned by megablast
are expected to be nearly identical. For discontiguous searches
(-t), word size can be only 11 or 12.
megablast generates words every four bases
(similar to the WU-BLAST wink parameter), so using
a word size divisible by four ensures that all words of that size
will be found.
The X dropoff value for a gapped
alignment; same as blastall.
The X dropoff value for an ungapped extension; same as
blastall.
The effective length of a database; same as
blastall.
The X dropoff value for a dynamic
programming gapped extension.