6.2 Alignments

The alignments and alignment statistics reported by BLAST differ slightly from program to program. The rest of this chapter describes the details of BLASTP, BLASTN, BLASTX, TBLASTN, and TBLASTX alignments and shows how to recognize alignment groups.

6.2.1 BLASTP

BLASTP alignments are the simplest to understand. Figure 6-2 shows the anatomy of a typical BLASTP alignment.

Figure 6-2. A BLASTP alignment

Here are the parts you should pay attention to:

Score: This value is computed from the scoring matrix and gap penalties. A higher score indicates greater similarity. The raw score is shown without units, and the normalized score is followed by "bits."
Database sequence: The complete FASTA definition line is reported here along with the length of the sequence. All the alignments between the query and a specific database sequence are collectively called a hit. The hit in Figure 6-2 contains one alignment.
Expect: The number of alignments expected at random given the size of the search space, the scoring matrix, and the gap penalties. The lower the E-value, the less likely this is a random similarity.
Statistics lines: Score, E-value, and percent identity always appear here. Depending on the program, percent positive scoring, P-value, group, gaps, strand, and reading frame may also be reported.
Coordinates: The coordinates of each sequence are indicated at the beginning and end of each line. The single alignment in Figure 6-2 is long enough that it is reported on three separate lines.
Alignment line: Letters that are identical between two sequences are reported here. Those that have positive scores in the scoring matrix are displayed with a plus sign. Gaps and nonpositive scores are blank.
Query and Sbjct: The query sequence is always listed first. The database sequence is abbreviated as Sbjct (subject).
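The rule for the alignment line can be sketched in a few lines of Python. This is only an illustration, not how either BLAST implementation builds its output: the function name `midline` and the tiny set of positive-scoring pairs are hypothetical stand-ins for a real scoring matrix lookup.

```python
# Hypothetical set of mismatched residue pairs that score above zero.
# A real report would consult the full scoring matrix used for the search.
POSITIVE_PAIRS = {("K", "R"), ("I", "V")}


def midline(query, sbjct, positive_pairs=POSITIVE_PAIRS):
    """Build the alignment line for two aligned strings of equal length."""
    out = []
    for q, s in zip(query, sbjct):
        if q == "-" or s == "-":
            out.append(" ")   # gaps are blank
        elif q == s:
            out.append(q)     # identical letters are repeated
        elif (q, s) in positive_pairs or (s, q) in positive_pairs:
            out.append("+")   # positive-scoring mismatch
        else:
            out.append(" ")   # nonpositive score is blank
    return "".join(out)


query = "MKV-LA"
sbjct = "MRIGLW"
print("Query:", query)
print("      ", midline(query, sbjct))
print("Sbjct:", sbjct)
```

With the assumed pair set, the middle line reads `M++ L `: one identity, two positive-scoring mismatches, a blank under the gap, another identity, and a blank for the nonpositive pair.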

The database sequence may be several lines long if the BLAST database is a nonredundant database with concatenated definition lines. For more on this topic, see Chapter 11. The WU-BLAST format differs slightly from the NCBI format: gaps aren't reported on the statistics line, and the P-value (displayed as P or Sum P) is always reported in addition to the Expect.
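When many reports must be processed, the statistics lines can be pulled apart with a regular expression. The sketch below assumes NCBI-style lines of the form `Name = value` separated by commas; the sample lines contain made-up values, and `parse_statistics` is a hypothetical helper, not part of any BLAST distribution. WU-BLAST lines would need a different pattern.

```python
import re

# Matches "Name = value" pairs, where the value runs up to the next comma.
STAT_PATTERN = re.compile(r"(Score|Expect|Identities|Positives|Gaps)\s*=\s*([^,]+)")


def parse_statistics(lines):
    """Collect the fields found on the statistics lines into a dictionary."""
    stats = {}
    for line in lines:
        for key, value in STAT_PATTERN.findall(line):
            stats[key] = value.strip()
    return stats


# Illustrative statistics lines with invented values.
sample = [
    " Score = 94.7 bits (234), Expect = 2e-19",
    " Identities = 50/100 (50%), Positives = 70/100 (70%), Gaps = 4/100 (4%)",
]
stats = parse_statistics(sample)
for name in ("Score", "Expect", "Identities", "Positives", "Gaps"):
    print(name, "->", stats[name])
```

The values are kept as strings so that nothing is lost in conversion; a caller can convert the Expect value with `float()` when a numeric comparison is needed.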

6.2.2 BLASTN

DNA is a double-stranded molecule, and genes may occur on either strand. This fact makes BLASTN alignments a little more difficult to interpret than BLASTP alignments. When a query sequence is searched against a database, both strands of the query are examined. The plus strand is the sequence in the FASTA file. The minus strand is the reverse complement of this sequence. If the similarity between the query and subject sequences is on the same strand, both sequences are labeled as being on the plus strand and the coordinates increase from left to right (Figure 6-3a). Since BLAST just aligns letters and has no model of genes or other features, it is impossible to determine on which strand a gene lies from a BLASTN alignment. Even if an alignment is labeled as "Plus/Plus," the encoded gene may be on the minus strand.

When the minus strand of the query sequence is similar to a database sequence, the alignment is reported with either the subject or query sequence in reversed coordinates. In NCBI-BLAST, the subject coordinates are flipped (Figure 6-3b), but in WU-BLAST, the query coordinates are flipped (Figure 6-3c).
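The relationship between the two strands can be made concrete with a short sketch. It assumes 1-based coordinates, as used in BLAST reports, and the helper names are hypothetical.

```python
# Translation table for complementing unambiguous nucleotides.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the minus strand of a plus-strand sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def minus_to_plus(pos, length):
    """Map a 1-based minus-strand position to its plus-strand coordinate."""
    return length - pos + 1


plus = "AACGTTTGCA"
minus = reverse_complement(plus)
print(minus)

# A match covering minus-strand positions 3..6 of a 10-letter sequence
# corresponds to plus-strand coordinates that run from high to low.
begin = minus_to_plus(3, len(plus))
end = minus_to_plus(6, len(plus))
print(begin, end)
```

Because the mapped coordinates decrease from left to right, a flipped sequence is easy to spot in a report: its begin coordinate is larger than its end coordinate.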

Figure 6-3. BLASTN alignments: (a) NCBI-BLAST, same strand; (b) NCBI-BLAST, different strand; (c) WU-BLAST, different strand

Table 6-1 shows how strand is displayed in the five standard BLAST programs.

Table 6-1. Strandedness
Program	Plus / Plus	Plus / Minus	Minus / Plus	Minus / Minus
BLASTP	Always	Never (proteins don't have strand)	Never (proteins don't have strand)	Never (proteins don't have strand)
BLASTN	Same strand	NCBI-BLAST flips the subject sequence	WU-BLAST flips the query sequence	Never
BLASTX	The query sequence is labeled as Frame +1, +2, +3	Never	The query sequence is labeled as Frame -1, -2, -3	Never
TBLASTN	The subject sequence is labeled as Frame +1, +2, +3	The subject sequence is labeled as Frame -1, -2, -3	Never	Never
TBLASTX	Any combination of frames	Any combination of frames	Any combination of frames	Any combination of frames

Here are a few minor notes:

Both NCBI-BLAST and WU-BLAST change the alignment format for BLASTN to represent matches as vertical bars. Because match/mismatch scoring is used, positive scoring mismatches are not displayed.
NCBI-BLAST displays nucleotide sequences in lowercase, whereas WU-BLAST displays them in uppercase.

6.2.3 BLASTX

Alignments from BLASTX are complicated by both strand and reading frame. The query sequence is translated in three frames on both the plus and minus strands. Chapter 2 discusses the reading frame in more detail. With three nucleotides per codon, the coordinates of the query sequence increase by threes (Figure 6-4a). On the plus strand, the reading frame is computed relative to the start of the plus strand; reading frame 1 starts at position 1 and reading frame 2 starts at position 2. On the minus strand, the reading frame is calculated relative to the reverse complement of the plus strand; the last letter of the FASTA file starts frame -1 and the second-to-last letter starts frame -2. Minus strand matches invert the query coordinates (Figure 6-4b).
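The frame rules described above can be written down as a small sketch. The function names are hypothetical, coordinates are 1-based positions on the plus strand of the query, and the arithmetic follows directly from the definitions in the text rather than from either BLAST implementation.

```python
def frame_start(frame, length):
    """Plus-strand coordinate of the first nucleotide of a reading frame."""
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of +1, +2, +3, -1, -2, -3")
    if frame > 0:
        return frame            # frame +1 starts at 1, +2 at 2, +3 at 3
    return length + frame + 1   # frame -1 starts at the last letter


def codon_coordinates(frame, length, codon_index):
    """Begin and end coordinates of the codon at a 0-based index in a frame."""
    start = frame_start(frame, length)
    if frame > 0:
        begin = start + 3 * codon_index
        return begin, begin + 2
    begin = start - 3 * codon_index
    return begin, begin - 2     # minus-strand coordinates run backward


def frame_of(begin, length, minus_strand=False):
    """Reading frame of a codon whose first nucleotide is at `begin`."""
    if minus_strand:
        return -((length - begin) % 3 + 1)
    return (begin - 1) % 3 + 1


length = 100
print(codon_coordinates(1, length, 1))
print(codon_coordinates(-1, length, 1))
print(frame_of(97, length, minus_strand=True))
```

For a 100-nucleotide query, the second codon of frame +1 spans coordinates 4 to 6, while the second codon of frame -1 spans 97 down to 95, which is why minus strand matches show inverted query coordinates.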

Figure 6-4. BLASTX alignments (ovals indicate that nucleotide coordinates increase by threes (a) and are reversed for minus strand matches (b))

6.2.4 TBLASTN

TBLASTN alignments are very similar to BLASTX alignments, except that the roles of the database and query are exchanged. Therefore, the coordinates of the database sequence increase by threes, and the database sequence has flipped coordinates on the minus strand.

6.2.5 TBLASTX

TBLASTX has more complicated alignments because both the query and the database have strand and frame. Figure 6-5 shows examples of all strand combinations. One of the most confusing aspects of TBLASTX alignments is that a number of different frames may represent the same region from both the query and subject. A TBLASTX alignment between two genomic sequences often highlights shared coding sequences. However, the correct frame of the encoded proteins can't be determined from a TBLASTX report. Chapter 8 and Chapter 9 discuss techniques that make TBLASTX more discriminating.

Figure 6-5. TBLASTX alignments (coordinates increase by threes and may have any combination of frames)

6.2.6 Alignment Groups

Alignment groups are one of the most confusing aspects of the BLAST report. Chapter 4 and Chapter 5 discuss how and why alignments are sometimes grouped to increase their statistical significance. However, the standard BLAST format doesn't make this structure easy to see. Figure 6-6 shows the scores reported for various alignments in a single database hit. The groups can be inferred from the Expect values because the alignments in a group are reported with the same E-value. If several groups happen to share an E-value, it is more difficult to determine which alignments belong to which groups.
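Inferring groups from Expect values amounts to collecting the alignments of one hit by their reported E-value. The sketch below uses invented alignments and a hypothetical `group_by_expect` helper. It compares E-values as the strings printed in the report, which avoids floating-point surprises, and it cannot separate two distinct groups that happen to share an E-value.

```python
from collections import defaultdict


def group_by_expect(alignments):
    """Collect the alignments of a single hit by their reported E-value."""
    groups = defaultdict(list)
    for aln in alignments:
        groups[aln["expect"]].append(aln)
    return dict(groups)


# Invented alignments from one database hit (scores in bits).
hit = [
    {"score": 120.0, "expect": "3e-40"},
    {"score": 45.1, "expect": "3e-40"},
    {"score": 38.2, "expect": "2e-05"},
    {"score": 33.0, "expect": "3e-40"},
    {"score": 30.4, "expect": "2e-05"},
]

for expect, members in group_by_expect(hit).items():
    print(expect, [m["score"] for m in members])
```

Here the five alignments fall into two candidate groups, one with three members and one with two.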

Figure 6-6. Alignment groups (groups can be inferred from Expect values)

By default, WU-BLAST alignment groups are just as difficult to recognize as NCBI-BLAST groups. WU-BLAST has a very useful command-line option called topcomboN that organizes and limits the number of groups. Chapter 8 discusses topcomboN in more detail. Figure 6-7 shows how groups are organized by strand and then by Sum P-value for a single database hit. Groups are labeled and need not be inferred. Notice that some groups contain only one alignment.