9.2 BLASTP Protocols

Most BLASTP searches are exploratory: you're trying to learn about your query sequence by comparing it to other proteins. You might also want to determine whether particular regions are strongly or weakly conserved, or to gather proteins for building a phylogenetic tree. In any case, your main concern is how deeply you want to explore. The following protocols offer three levels of sensitivity.

9.2.1 The Standard BLASTP Search

Probably the most common BLAST search is BLASTP with default parameters. It is used in various settings because it balances speed and sensitivity. For example, if you want to compare all the proteins between two organisms, this is a good place to start. If the proteomes are very distant, the default parameters may not be ideal because alignments containing less than 35 percent identity aren't as easily detected. If the proteomes are very close, the standard search is still a good strategy because not all proteins evolve at the same rate, and some may diverge rather quickly.

Approach

We'll make only one adjustment to the default NCBI parameters. We use soft masking instead of normal complexity filtering so the entire alignment is scored. The WU-BLAST parameters are approximately the same as those of NCBI-BLAST.

NCBI-BLAST parameters

blastall -p blastp -d <db> -i <query> -F "m S"

WU-BLAST parameters

blastp <db> <query> hitdist=40 wordmask=seg postsw

Expected results

If you don't find any database hits, your query sequence may correspond to a novel protein. On the other hand, it may be that the parameters of the search are obscuring the similarity. If your query is very short, it may be difficult for it to achieve statistical significance. In this case, first try raising E. However, this step alone may not be enough, and you may have to change to a scoring matrix with a higher value of H (bits per aligned letter), such as BLOSUM80.
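
To see why a short query struggles to reach significance, it can help to work through the arithmetic. The sketch below assumes the standard Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, and it uses raw query and database lengths rather than the effective lengths BLAST actually computes, so treat the numbers as rough. The function names and the query and database sizes in the usage lines are illustrative only.

```python
import math


def required_bits(query_len, db_len, e_value):
    """Bit score S' at which E = m * n * 2**(-S') equals e_value.

    Simplification: raw lengths are used. BLAST applies edge-effect
    corrections (effective lengths), so real thresholds differ slightly.
    """
    return math.log2(query_len * db_len / e_value)


def min_aligned_letters(bits_needed, bits_per_letter):
    """Approximate alignment length needed, given H in bits per aligned letter.

    H is an average over typical alignments for a matrix, so this is an
    estimate, not a guarantee for any particular pair of sequences.
    """
    return math.ceil(bits_needed / bits_per_letter)


if __name__ == "__main__":
    # Hypothetical search: 30-residue query, database of 100 million letters.
    for e in (10.0, 1000.0):
        bits = required_bits(30, 100_000_000, e)
        print(f"E={e:g}: need about {bits:.1f} bits")
```

Raising E lowers the required bit score by the base-2 logarithm of the factor, and a matrix with a larger H reaches a given bit score in fewer aligned letters, which is the reasoning behind both suggestions above.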

If you want to find remote homologies with short query sequences, be prepared for many false-positive alignments. If your sequence has a long, low-complexity region, be sure to have soft masking turned on. It's difficult to find collagens, for example, if complexity filters are destroying most of the alignment. Finally, try the slower, more sensitive search described in Section 9.2.3.

If you find that you have hundreds of database hits, you may be overrunning the output reporting parameters (-v and -b in NCBI-BLAST, V and B in WU-BLAST). If this is a concern, simply increase these values. However, if you're interested in only the top hits, you can either set E lower or use a search strategy designed for more similar sequences (below).

Optimizations and variations

The two protocols below offer speed-sensitivity tradeoffs. For more subtle changes, try altering T. If you use WU-BLAST, set W=4 and scale up T appropriately.
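
If you want to compare several values of T side by side, it may be convenient to generate the command lines rather than type them. The following is a minimal sketch that only builds argument lists for the NCBI-BLAST standard search; it doesn't run anything. The function name, the database and query file names, and the particular T values are hypothetical. The flags themselves (-f for T, -v and -b for the reporting limits) are the ones used in these protocols.

```python
import shlex


def standard_blastp_args(db, query, threshold=None, max_hits=None):
    """Build a blastall argument list for the standard BLASTP search.

    threshold -> -f (T, the neighborhood word score threshold)
    max_hits  -> -v and -b (one-line descriptions and alignments reported)
    Options left as None are omitted so blastall uses its defaults.
    """
    args = ["blastall", "-p", "blastp", "-d", db, "-i", query, "-F", "m S"]
    if threshold is not None:
        args += ["-f", str(threshold)]
    if max_hits is not None:
        args += ["-v", str(max_hits), "-b", str(max_hits)]
    return args


if __name__ == "__main__":
    # Hypothetical file names and T values; adjust to your own data.
    for t in (10, 11, 12):
        args = standard_blastp_args("proteome.fa", "query.fa", threshold=t)
        print(shlex.join(args))  # quotes "m S" so the line can be pasted into a shell
```

Keeping the command as a list means the "m S" filter string stays a single argument if you later hand it to subprocess.run.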

9.2.2 Fast, Insensitive Search

Increased speed is one reason to use an insensitive search. This is particularly true when performing multiple searches. Another reason is to increase the information content in the alignments, which is helpful for short query sequences whose alignments might otherwise fall below the significance threshold. As a rule, the insensitive search shouldn't be used for sequences that are expected to have less than 50 percent identity.

Approach

A simple way to make BLASTP faster is to ignore neighborhood words and require that seeds be formed from identical words. Because the sequences are expected to be very similar, we choose the BLOSUM80 scoring matrix and set a low value for E. The proper value for E depends on the query length and database size, so treat the value given next as a starting point.

NCBI-BLAST parameters

blastall -p blastp -d <db> -i <query> -F "m S" -f 999 -M BLOSUM80 -G 9 -E 2 -e 1e-5

WU-BLAST parameters

blastp <db> <query> wordmask=seg W=3 T=999 matrix=BLOSUM80 Q=11 R=2 postsw E=1e-5

Expected results

Most of your alignments should have high percent identities. You will find some that dip to 30 percent, but this doesn't ensure that you can find such alignments in general. With such insensitive parameters, it is unlikely that you will overrun the output cutoffs, but it's worth checking anyway. Set -v and -b higher (V and B in WU-BLAST), or decrease E as you see fit.

Optimizations and variations

You can't make the NCBI-BLAST search much faster than it is because the parameters are already near optimum. Setting the two-hit distance lower gives a minute increase in speed that isn't worth the loss in sensitivity. You can play around with the WU-BLAST seeding parameters by changing W, T, and hitdist.
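
Checking whether a report has overrun the output cutoffs can be scripted. The sketch below assumes the search was rerun with NCBI-BLAST tabular output (-m 8), in which each line is one alignment and the first two tab-separated columns are the query and subject identifiers. It counts distinct subjects per query and flags any query that reaches the limit you set with -v and -b. The function name and sample data are hypothetical, and reaching the limit is only a hint that hits may have been cut off.

```python
from collections import defaultdict


def queries_at_limit(tabular_lines, limit):
    """Return queries whose number of distinct subjects reaches `limit`.

    tabular_lines: iterable of tab-separated lines (query id, subject id, ...).
    Blank lines, comment lines starting with '#', and lines with fewer
    than two fields are skipped.
    """
    subjects = defaultdict(set)
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        subjects[fields[0]].add(fields[1])
    return sorted(q for q, seen in subjects.items() if len(seen) >= limit)


if __name__ == "__main__":
    # Hypothetical three-alignment report; q1 hits two distinct subjects.
    sample = ["q1\ts1\t90.0", "q1\ts2\t80.0", "q2\ts1\t99.0"]
    print(queries_at_limit(sample, limit=2))
```

Any query the function returns is a candidate for rerunning with higher reporting limits or a lower E.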

9.2.3 Slow, Sensitive Search

If you're having a hard time finding sequences similar to your query or if you're looking for distant relatives, you may have more success with sensitive parameters.

Approach

We recommend lowering T and choosing a scoring matrix designed for greater divergence. The NCBI-BLAST and WU-BLAST scoring parameters are slightly different because they don't have the same built-in estimates for lambda. As usual, we suggest soft masking to align the low-complexity sequence properly; here it's particularly important because we want to make sure that all positive scores are counted. When searching for remote similarities, some real signals can have very low scores. For this reason, even though it will make the report longer, we set E higher. For the same reason, we increase the output reporting parameters.

NCBI-BLAST parameters

blastall -p blastp -d <db> -i <query> -f 9 -F "m S" -M BLOSUM45 -e 100 -b 10000 -v 10000

WU-BLAST parameters

blastp <db> <query> T=9 wordmask=seg hitdist=60 matrix=BLOSUM50 Q=13 R=1 E=100 B=10000 V=10000

Expected results

Whenever you increase sensitivity, expect a decrease in specificity. These parameters are very sensitive, so many of the alignments may be chance similarities and of no biological significance. On the other hand, some biological signals aren't modeled well by BLAST statistics and what may appear as a very low score may be of real interest. Reading a BLAST report containing thousands of alignments isn't always entertaining, so if you're looking for something specific, such as an alignment to a particular region, you may be able to automate the reading with a BLAST parser.

Optimizations and variations

The probability model of BLAST assumes that amino acid pairings are independent of their neighbors. But some domains have characteristic signatures. So if your protein belongs to a family of related proteins, you may be able to find more distant relatives by choosing a program that uses position-specific scoring, such as PSI-BLAST (a position-specific scoring matrix) or HMMER (a profile hidden Markov model). However, if your query is a novel protein, the best you can do is make your search parameters more sensitive.

To increase sensitivity even more, turn off the two-hit algorithm. In NCBI-BLAST, set -P 1 and in WU-BLAST, remove hitdist=60.
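
For reports this long, the automated reading mentioned under Expected results might look like the following sketch. It assumes NCBI-BLAST tabular output (-m 8), whose twelve tab-separated columns are query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, and bit score. It keeps only alignments that overlap a region of the query you care about. The function name, the Hit tuple, and the sample lines are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float
    q_start: int
    q_end: int
    e_value: float
    bits: float


def hits_in_region(lines: Iterable[str], start: int, end: int) -> Iterator[Hit]:
    """Yield hits whose query coordinates overlap [start, end] (1-based, inclusive)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 12:
            continue  # skip comments and malformed lines
        q_start, q_end = sorted((int(fields[6]), int(fields[7])))
        if q_start <= end and q_end >= start:
            yield Hit(fields[0], fields[1], float(fields[2]),
                      q_start, q_end, float(fields[10]), float(fields[11]))


if __name__ == "__main__":
    # Hypothetical report lines; in practice, pass an open file object.
    sample = [
        "q\ts1\t35.0\t80\t50\t2\t10\t90\t5\t84\t1e-05\t45.1",
        "q\ts2\t28.0\t40\t28\t1\t200\t240\t1\t40\t3.2\t28.9",
    ]
    for hit in hits_in_region(sample, 80, 120):
        print(hit.subject, hit.q_start, hit.q_end, hit.e_value)
```

Because the function accepts any iterable of lines, it can read a saved report one line at a time without loading thousands of alignments into memory.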