As we said earlier, most searches can be categorized as either mapping or exploring searches. When sequences are expected to be nearly identical, you should use the +1/-3 match-mismatch parameters, which have a target frequency of 99 percent identity. Cross-species exploration requires a change in the scoring parameters and word size. We like +1/-1 for both its simplicity and its 75 percent identity target frequency. The choice of word size depends on balancing sensitivity and specificity. The default word size of 11 is too risky; use 9, which corresponds to a little more stringency than three identical amino acids because there's no allowance for degenerate codons. The choice of gap costs depends on the size of the expected gap. For simulating sequencing errors, the gap costs should be uniform and relatively high, but for modeling amino acid gaps or nucleotide hybridization bubbles, the cost of extension should be lower.
Many kinds of experiments, both molecular and computational, employ short nucleotide sequences called oligonucleotides, or just oligos (oligo is Greek for few). For example, the polymerase chain reaction (PCR) is a routine laboratory procedure for amplifying a specific nucleotide sequence from DNA or indirectly from RNA. In PCR, oligos are used as primers for DNA replication, and the subsequence between the oligos is amplified. The most important feature of oligos is that they may be short enough to give rise to many false-positive matches. In a test tube, we would say the oligo hybridizes nonspecifically, and in a BLAST experiment, we would say the alignments have high expectations.
Our goal here is to simulate the interaction between an oligo and a genome in a test tube. The thermodynamics of annealing are complex, and depending on the conditions of the experiment (temperature, salt concentration, and the length and composition of the oligo), some mismatches between the sequences and even gaps may be possible. Still, the sequences are expected to be nearly identical, so we use the corresponding match-mismatch parameters. The default word size is fine here; we don't increase it because a fortuitous mismatch can prevent seeding for a short oligo. Complexity filtering is turned off because we want the entire oligo to match, and low complexity isn't expected to be a problem with such a short query sequence. Because there is quite a bit of variation from one oligo to the next, we can't set a specific E value. Instead, we use the default and visually inspect the report after the search.
blastall -p blastn -d <genome> -i <oligo> -G 2 -E 1 -F F
megablast -d <genome> -i <oligo> -W 11 -F F -D 2
blastn <genome> <oligo> M=1 N=-3 Q=3 R=1
There may be several alignments between the oligo and genome, and not all of them may align end to end. If you are simulating PCR, mismatches at the 3´ end of the oligo are of particular interest because they may prevent priming.
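As a rough aid for the inspection described above, the following sketch reports how many bases at the 3´ end of an oligo fall outside an alignment. It assumes 1-based, inclusive query coordinates taken from the report and given in the oligo's own 5´-to-3´ orientation. The function names and the `max_unaligned` cutoff are hypothetical, and the check says nothing about mismatches inside the alignment, which you still need to inspect by eye.

```python
def unaligned_3prime_bases(oligo_length, q_start, q_end):
    """Return how many bases at the 3' end of the oligo lie outside the alignment.

    Coordinates are 1-based and inclusive, as in BLAST reports, and are
    assumed to be expressed on the oligo (query) in its 5'->3' orientation.
    """
    lo, hi = sorted((q_start, q_end))
    if lo < 1 or hi > oligo_length:
        raise ValueError("alignment coordinates fall outside the oligo")
    return oligo_length - hi


def may_prime(oligo_length, q_start, q_end, max_unaligned=0):
    """Hypothetical rule: priming requires the last bases of the oligo to align."""
    return unaligned_3prime_bases(oligo_length, q_start, q_end) <= max_unaligned
```

For a 20-mer whose alignment covers query positions 1 through 17, `unaligned_3prime_bases(20, 1, 17)` returns 3, which flags the hit as a poor candidate for priming under this simple rule.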
If you don't find any hits, the oligo may be too short for its alignments to achieve statistical significance. For short oligos, even a 100 percent matching alignment may have a score that is expected at random in a large search space. Try raising E. Also, make sure that the scoring scheme favors near identity. Otherwise, lambda may transform the score to a very low amount of information, and you may not be able to set E high enough to recover the alignment. Other possibilities include too large a value for W or the use of complexity filters.
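To see why a short oligo can fail to reach significance, it helps to work out the Karlin-Altschul expectation for a perfect match. The sketch below uses E = K·m·n·exp(-λS) without edge-effect corrections. The values of λ and K are assumptions for ungapped +1/-3 scoring with equal base frequencies; the values printed at the end of your own BLAST report are the ones to trust. The 3 Gbp genome size is also just an illustration.

```python
import math

# Assumed ungapped Karlin-Altschul parameters for +1/-3 scoring with
# equal base frequencies; check the footer of your own BLAST report.
LAMBDA = 1.374
K = 0.711


def expected_hits(raw_score, query_length, database_length, lam=LAMBDA, k=K):
    """E = K * m * n * exp(-lambda * S), ignoring edge-effect corrections."""
    return k * query_length * database_length * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Convert a raw score to bits with the same lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    genome = 3_000_000_000  # hypothetical 3 Gbp search space
    for oligo_length in (15, 18, 20, 25):
        # A perfect match scores +1 per base, so S equals the oligo length.
        e = expected_hits(oligo_length, oligo_length, genome)
        print(f"{oligo_length}-mer: {bit_score(oligo_length):.1f} bits, E = {e:.3g}")
```

Under these assumptions, a perfect 15-mer match in a search space of this size has an expectation above the default threshold of 10, while a perfect 20-mer falls well below 1. That is the behavior the paragraph above describes.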
If you find too many hits, increase the stringency of the search by decreasing E. The suggested scoring scheme is already pretty strict, but you may want to set the gap penalties higher or turn off gapping entirely if you find too many gapped alignments. It may be that the query is just found in many places. If you don't care about the details of the alignments, tabular format is convenient to parse and takes up much less space. See Appendix A and Appendix E to learn how to report in tabular format.
If you have many oligos to map, a technique called query packing (see Chapter 10) can greatly improve your speed. If you're interested only in exact matches, you can set the word size to the same size as the oligo. This is probably not a good simulation of what happens in a test tube, but it will make the search faster. Here, you might consider using MegaBLAST rather than BLASTN because it automatically packs queries and uses large word sizes, but make sure the word size isn't larger than the oligo. If you want to search for oligos cross-species, be prepared to sift through many alignments because the expectation for low-scoring alignments may be very high.
Many BLASTN searches fall into the same general category in which a moderately sized DNA sequence (usually around 500 bp) is used to query a genome. There are separate protocols for spliced query sequences, searching EST databases, and exploring distantly related sequences.
Our alignment parameters favor near identity and use a large word size to make the search faster. It is probably not necessary to set a value for E because the word size alone provides specificity. But if you lower the word size, you will want to set E to a small value that depends on your search space. The value presented here is only a suggestion. As with any search involving genomic DNA, you should mask repeats before you begin. See Chapter 7 for more details.
blastall -p blastn -d <genome> -i <dna> -G 1 -E 3 -W 30 -F "m D" -U -e 1e-20
megablast -d <genome> -i <dna> -F "m D" -U -D 2
blastn <genome> <dna> M=1 N=-3 Q=3 R=3 W=30 wordmask=seg lcmask E=1e-20
True matches between the query and the genome ought to align from end to end with near identity. Because BLAST is a local alignment program, you can't require the alignment to cover the entire length of the sequence, so you just have to look for this property in the output. If the alignment doesn't go end to end, the sequence quality of the query might drop at the ends (which happens with raw sequencing reads).
Even if you mask repeats, they can still cause trouble. Some repeats aren't very abundant, or are limited to a particular region/chromosome and therefore may not be part of your repeat library. If your report is particularly long, look for regions that are over-represented in alignments. Graphical reports are very useful (see Appendix D). You may have to mask or omit troublesome regions by hand if they continue to give you problems.
Low-complexity sequences can also be problematic because not all instances are caught by the default parameters of complexity filters. You can further reduce nonspecific hits by filtering both the query and the database, but since the database is case-insensitive, soft-masking isn't an option. If you find that repeats or low-complexity matches dominate your report, you will probably have to run the report through a parser and select the hits that are nearly full length.
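One way to select the nearly full-length hits is to parse tabular output. The sketch below assumes the 12-column NCBI-BLAST tabular layout (`-m 8`) and a dictionary of query lengths that you supply yourself, because the table doesn't record them. The function name and both cutoffs are hypothetical starting points rather than recommended values.

```python
import csv
import io

# Column order of NCBI-BLAST tabular output (-m 8); no header line is assumed.
FIELDS = ["query", "subject", "pct_identity", "length", "mismatches",
          "gap_opens", "q_start", "q_end", "s_start", "s_end",
          "evalue", "bit_score"]


def near_full_length_hits(handle, query_lengths, min_coverage=0.9, min_identity=95.0):
    """Yield tabular rows whose HSP spans most of the query with high identity."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        q_lo, q_hi = sorted((int(hit["q_start"]), int(hit["q_end"])))
        coverage = (q_hi - q_lo + 1) / query_lengths[hit["query"]]
        if coverage >= min_coverage and float(hit["pct_identity"]) >= min_identity:
            yield hit


if __name__ == "__main__":
    # Fabricated rows, for illustration only.
    sample = (
        "read1\tchr1\t99.60\t500\t2\t0\t1\t500\t1000\t1499\t0.0\t950\n"
        "read1\tchr2\t98.00\t50\t1\t0\t200\t249\t77\t126\t1e-15\t84.2\n"
    )
    for hit in near_full_length_hits(io.StringIO(sample), {"read1": 500}):
        print(hit["query"], hit["subject"], hit["pct_identity"])
```

Note that the filter works on individual HSPs. A query that is legitimately split into several HSPs would need its intervals merged before coverage is computed.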
Genomes sometimes have regions of large duplications. While you may expect a single near-identity alignment, you can find multiple matches if your query has paralogs. Depending on how much time has passed since the duplication event, paralogs may be very distant or identical to one another.
If you don't care much about the alignments, tabular output will help you read the report (see Appendix A and Appendix E). If you have a number of query sequences, this is a really good place to use MegaBLAST; it was designed for tasks such as this. For sequences that aren't expected to be identical, see Section 9.1.4.
Determining the correct structure of eukaryotic protein-coding genes isn't an easy task because genes are broken up into exons. One of the most accurate methods for determining exon-intron structure involves mapping transcripts back to their origin in a genome. This procedure sounds simple, but it is actually a bit complicated, and its difficulties shouldn't be underestimated. A related, but more difficult, problem is mapping transcripts between species (see the following protocol). Also see Section 9.1.5.
Most exons are 100-200 bp long, but there is a large range from only a few nucleotides to several kilobases. Generally, 99 percent of exons are larger than 50 bp, so large word sizes work fine. We suggest that you use typical near-identity parameters, but choose a word size that isn't quite as large as the previous protocol because it may be difficult to seed short exons with low quality sequence. As with any DNA search, the query should be repeat-masked prior to the search, and lowercase masking is preferred. Including a low value for E reduces many nonspecific hits. The proper value for E depends on the length of the query and the size of the database. The value given here is only a suggestion.
blastall -p blastn -d <genome> -i <cDNA> -G 1 -E 3 -W 15 -F "m D" -U -e 1e-20
megablast -d <genome> -i <cDNA> -W 12 -F "m D" -U -t 21 -N 0 -D 2
blastn <genome> <cDNA> M=1 N=-3 Q=3 R=3 W=15 wordmask=seg lcmask E=1e-20
BLAST performs local alignments and doesn't explicitly model exon-intron structure or splice sites. For this reason, HSP endpoints aren't expected to correspond to exon boundaries, though they sometimes do. It is common for the alignment to extend a few bp beyond the exon boundary on each side, but in low-quality sequence, the alignment may be shorter than the exon. To determine whether you missed a short exon, look for a region of the query that isn't represented in any alignment. A graphical report is useful here (see Appendix D). If you find such a region, you may wish to search just this piece against the intron between neighboring exons with the bl2seq program using a shorter word size.
The same issues involving unidentified repeats, low-complexity sequence, and paralogs that you encounter when mapping nonspliced sequences also apply here. Pseudogenes may also pose a problem. They are fairly easy to detect because they look like mRNAs embedded in the genome rather than real genes. See Section 9.1.5.
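Finding query regions that aren't represented in any alignment is simple interval bookkeeping. The following sketch assumes you have already collected the query coordinates of every HSP for one cDNA (1-based, inclusive); the function name and the `min_length` option are hypothetical.

```python
def uncovered_regions(query_length, hsps, min_length=1):
    """Return 1-based inclusive query intervals that no HSP covers.

    hsps is an iterable of (q_start, q_end) pairs taken from the report.
    """
    intervals = sorted(tuple(sorted(pair)) for pair in hsps)
    gaps = []
    next_uncovered = 1  # first query position not yet covered by an HSP
    for start, end in intervals:
        if start - next_uncovered >= min_length:
            gaps.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    if query_length - next_uncovered + 1 >= min_length:
        gaps.append((next_uncovered, query_length))
    return gaps
```

For example, `uncovered_regions(1000, [(1, 300), (290, 520), (561, 1000)])` returns `[(521, 560)]`, a 40 bp stretch that could be a missed short exon and a candidate for a bl2seq search with a shorter word size.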
If you have several sequences, MegaBLAST is a better choice than BLASTN. If your sequences come from different species, also see Section 9.1.4. Several programs model exon-intron structure, and they often give accurate results. But don't expect them to work every time because small exons, low quality sequence, repeats, gene duplications, etc., also affect these tools. Some of the most popular programs include SIM4, SPIDEY, and EST2GENOME. If you want to align ESTs and genomes from distant species, EST2GENOME is the best choice because it doesn't seed alignments with words.
Comparative sequence analysis is a powerful approach for finding biologically important sequences. You may search for protein-coding genes, regulatory elements, RNA genes, or other regions of interest. In most cases, you expect the sequences to be similar but probably not identical. Most changes will probably be nucleotide substitutions, insertions, or deletions, but some may be more extreme. For example, genes may gain/lose exons or introns, repetitive elements may be inserted/deleted, and large-scale duplications, inversions, and deletions and other rearrangements may occur. Be cautious. This book doesn't include multiple cross-species protocols, so use this one to modify the other BLASTN protocols.
Because we don't expect the sequences to be identical, we use relaxed parameters for both seeding and alignment: typical exploration parameters (+1/-1 and a word size of 9) with soft masking. These parameters are similar to the following repetitive element identification parameters, but we choose higher gap penalties here because functional sequences usually have few gaps. The choice of E is left to you because there are many appropriate values, depending on your level of stringency. Should you set it high, you may also want to increase the output reporting options (-b and -v in NCBI-BLAST; B and V in WU-BLAST).
blastall -p blastn -d <genome> -i <dna> -r 1 -q -1 -G 1 -E 2 -W 9 -F "m D" -U
blastn <genome> <dna> M=1 N=-1 Q=3 R=2 W=9 wordmask=seg lcmask
Be on the lookout for repetitive and low-complexity sequence and pseudogenes. Cross-species alignments are difficult to interpret because many factors impact DNA evolution. Not all sequences evolve at the same rate, and it is very easy to confuse signal and noise. It's a good idea to approach your findings with skepticism. Sequences that are nearly identical may indicate a very important biological signal, or they may represent sequencing contamination.
Low-scoring alignments may be coincidental similarities of no biological significance. If your search space is large, even high-scoring alignments are expected by chance. Work out the Karlin-Altschul expectation and search fabricated sequences to appreciate how frequently false positive alignments occur. That said, some biological signals are short and may be buried in the stochastic noise. The best way to deal with them is to reduce your search space. For example, if you are interested in determining if there is a short region of interest within the intron of a gene, try aligning the intron with the orthologous intron from another genome rather than the entire genome.
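One simple way to fabricate control sequences is to shuffle your real query so that its length and base composition are preserved. The sketch below is a minimal version of that idea; the function names are hypothetical, and a plain shuffle destroys dinucleotide and higher-order structure, so treat the resulting false positive rate as a rough guide only.

```python
import random


def shuffled_copy(sequence, rng):
    """Return a random permutation of sequence with identical base composition."""
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)


def shuffled_fasta(name, sequence, copies, seed=0, width=60):
    """Return FASTA text holding several shuffled versions of one sequence."""
    rng = random.Random(seed)  # seeded so the control set is reproducible
    records = []
    for i in range(1, copies + 1):
        shuffled = shuffled_copy(sequence, rng)
        lines = [shuffled[j:j + width] for j in range(0, len(shuffled), width)]
        records.append(f">{name}_shuffled_{i}\n" + "\n".join(lines) + "\n")
    return "".join(records)
```

Searching these records with the same parameters and database as the real query shows how many alignments, and with what scores, turn up when no biological relationship is possible.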
If you want to identify orthologs between genomes, the most common approach is to label the best reciprocal match as the ortholog. This approach can be confounded by paralogs, so take synteny into account if possible and look for homology that extends to neighboring genes.
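The best reciprocal match rule can be sketched in a few lines once you have reduced each search to (query, subject, score) triples, for example from tabular output. The function names are hypothetical, ties are broken by whichever hit is seen first, and the sketch makes no attempt to handle paralogs or synteny.

```python
def best_hits(rows):
    """Map each query to its highest-scoring subject.

    rows is an iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in rows:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_matches(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs in which each sequence is the other's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)
```

A gene whose best match in the other genome points back to a different gene is left unpaired, which is often the first hint that paralogs are involved.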
Changing word size and scoring parameters are some of the most obvious alterations you can make to the protocol. Adjusting word size by a single point can alter the speed by a factor of 3 (this is a rough estimate and applies only to relatively small word sizes). How seeding affects sensitivity depends on what you're searching for and the expected divergence. Other useful scoring schemes are given in Appendix B.
If you're interested in coding sequence similarities, TBLASTX is a better choice for more distant relationships. But since this program runs relatively slowly, you are better off with BLASTN for closer relatives. As a rule, if the expected identity is less than 70 percent, switch to TBLASTX.
MegaBLAST isn't recommended for cross-species searches. The new discontiguous version is designed for this task, but the effective word size, 14, is too high.
Identifying genes in genomic sequence is a difficult and important task. None of the many experimental and computational approaches is foolproof. One useful technique is to identify related transcripts. The most common form of transcript information comes from ESTs.
ESTs are sequencing reads derived from the ends of cDNAs, and they therefore conceptually correspond to the transcripts of protein-coding genes. But not all ESTs encode proteins: mRNAs have untranslated regions at both ends, and many ESTs don't actually correspond to genes. Various techniques are employed to increase the proportion of less abundant transcripts, and while these techniques are useful for discovering genes with low levels of expression, they tend to increase the fraction of nontranscript sequences (otherwise known as "junk"). As a result, some EST collections contain a lot of sequence that doesn't correspond to any protein. Unfortunately, it is no simple task to determine which EST sequences in a database correspond to transcripts and which are junk.
Even though we expect many ESTs to align to the genomic DNA with near-identity, exploration parameters are often more appropriate than mapping parameters. There may be genes for which matching transcripts haven't yet been isolated but for which similar transcripts are available. These genes may come from the same or different species.
As usual, you should repeat-mask the sequence prior to the search and (preferred, but not necessary) use lowercase masking rather than Ns. Set a low E value to cut down on false positive alignments, and set the output options high because some regions are highly expressed and may prevent the display of real, low-scoring alignments.
After this search is performed, you will probably want to use specialized alignment algorithms to determine the exon-intron boundaries. See Section 9.1.1.5 in the protocol in Section 9.1.3.
blastall -p blastn -d <est_db> -i <genomic> -r 1 -q -1 -G 1 -E 2 -W 9 -F "m D" -U -e 1e-20 -b 100000 -v 100000
blastn <est_db> <genomic> M=1 N=-1 Q=3 R=2 W=9 wordmask=seg lcmask E=1e-20 B=100000 V=100000
The first thing to remember is that not all transcript matches correspond to a gene. There is quite a bit of variability from one region of a genome to the next; some regions have very few nontranscript matches while others are completely covered in junk alignments. Several features can separate transcripts from junk. Here are a few rules to remember:
Eukaryotic genes usually have introns, so if a database match has only one HSP, it may be junk. However, some genes have only one exon, and some exons are longer than sequencing reads, so you can't rely on this rule. If the exon contains coding sequence, there ought to be a large ORF. It is possible for what should be a single HSP to look like multiple HSPs if the extension terminates (low sequence quality and hard-masking cause this). True splicing events are easily identified from their coordinates; there ought to be a large coordinate gap in the genome but not the EST.
Exons almost never overlap a repetitive element and are usually at least 20 bp away. If you mask repeats with Ns, you won't find repeat overlaps, so look for HSPs that abut repeats. Low complexity is a completely different issue, and transcripts often overlap short, low-complexity regions.
Most real genes are evolutionarily conserved. Therefore, ESTs from multiple species ought to align to a gene if the organisms aren't too diverged. However, just because ESTs pile up on a particular region doesn't mean that a gene is there. Many pseudogenes have this property, as do unmasked repeats and low entropy regions.
cDNA libraries constructed with subtractive/selective hybridization, micro-dissected tissues, or PCR amplification usually have a lot of nontranscript sequences. You may wish to track down the literature references for suspect ESTs to determine how their cDNA libraries were constructed.
Genes are regulated in time and space, so not all of them may be present in a particular cDNA library. For example, for ethical reasons, it is more difficult to find genes expressed in the human egg than the chicken egg. Regulation also occurs at the level of splicing, so some exons may be absent at one time or another.
cDNA libraries are usually constructed using poly-T primers to bind to the poly-A tail of mRNAs. If the genomic sequence has a long run of As, the resulting ESTs may all appear to end there; the real transcript may be much longer.
You can often discriminate exons from junk by simply viewing how the alignments stack up on the genome. Exonic regions generally have short HSPs with numerous alignments (except if the exon is very long). Junk regions usually have long alignments with little overlap. Visually, exons look like towers, and junk looks like stepping stones. But internal priming can make junk look like it has a defined endpoint. A graphical report is very handy here (see Appendix D).
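The splicing rule above (a large coordinate gap in the genome but not in the EST) is easy to check mechanically. The sketch below assumes you have gathered the HSPs for one EST on one genomic strand as (est_start, est_end, genome_start, genome_end) tuples with 1-based coordinates and start no greater than end. The function name and both thresholds are hypothetical and should be tuned to your organism.

```python
def looks_spliced(hsps, min_intron=50, max_query_gap=10):
    """Return True if two consecutive HSPs imply an intron.

    hsps holds (est_start, est_end, genome_start, genome_end) tuples for one
    EST on one genomic strand. Minus-strand matches are assumed to have been
    converted to increasing coordinates beforehand.
    """
    ordered = sorted(hsps)
    for (_, e_end, _, g_end), (e_start, _, g_start, _) in zip(ordered, ordered[1:]):
        est_gap = e_start - e_end - 1       # unaligned EST bases between HSPs
        genome_gap = g_start - g_end - 1    # genomic bases between HSPs
        if abs(est_gap) <= max_query_gap and genome_gap >= min_intron:
            return True
    return False
```

Two HSPs that are adjacent in the EST but about 2 kb apart in the genome pass the test. Two HSPs separated by the same few bases in both sequences do not, because that pattern is more likely an extension that terminated in low-quality or masked sequence.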
If your database is very large, you may consider increasing the word size. This will, of course, reduce sensitivity, but if you're only interested in nearly identical ESTs, you can change to typical mapping parameters, and you may want to change to MegaBLAST as well. To increase sensitivity, rather than decreasing the word size, you may consider TBLASTX if you are most interested in the coding sequences.
cDNA libraries are often redundant, with a handful of highly expressed genes making up most transcripts. Clustering transcripts to create a representative set with less redundancy is therefore a common task. A variant of clustering is extension, in which ESTs are assembled into larger, more complete entities. Clustering and extension are difficult even for seasoned bioinformatics professionals. Treat this protocol as a starting or learning point. BLASTN isn't the best program for this specific task. Several software packages for clustering and extension already exist, and this protocol can help you understand their features.
This is an "all versus all BLASTN" procedure, so your computational time may be immense if you have a lot of sequences. It's one of the few cases when hard-masking is preferable because repeats and low complexity can confuse clustering or extension if the wrong associations are made. To err on the side of safety, we recommend masking your sequences before creating your database.
We expect the alignments to be nearly identical, except for sequencing errors and allelic differences (polymorphisms), so we use typical mapping parameters and a very large word size (WU-BLAST parameters use a slightly smaller word size; you can include WINK to reduce the number of seeds because this combination is more efficient). It may seem risky to use large words with data that is expected to contain sequencing errors, but because the dataset is potentially very large and we're primarily interested in long, highly specific alignments, the risk is worth taking. The word size has enough specificity that it is probably not necessary to set E, but we do so just in case. Finally, we set the output options to "high" in case some clusters are particularly deep.
blastall -p blastn -d <db> -i <EST> -G 1 -E 3 -W 30 -U -v 10000 -b 10000 -e 1e-10
blastn <db> <EST> M=1 N=-3 Q=3 R=3 W=15 WINK=15 filter=seg lcmask V=10000 B=10000 E=1e-10
In the simplest case, EST overlaps can be followed in either direction to create longer, virtual transcripts. For clustering, the representative EST is usually the one with the most matches (the longest). These straightforward expectations have many potential problems. Here are some of the common ones:
Alignments may not be able to cross long repetitive regions. It is therefore possible for multiple HSPs to be present for sequences that are 100 percent identical. A second alignment with unmasked sequences can solve this problem.
Some genes have multiple promoters or undergo alternative splicing and therefore produce multiple forms of transcripts. As a result, transcripts that are identical for much of their length may have discrepancies that correspond to unique or variant exons.
Cloning artifacts and lane tracking errors may join two sequences artificially. It is difficult to differentiate between chimeric sequences and isoform variants with just transcript alignments. Mapping the ESTs in their source genome is the best way to sort this issue out.
Some genes exist in multiple copies in a genome. These may be completely identical to one another or quite diverged. Determining whether two nearly identical ESTs come from the same gene isn't simple because it depends on the sequencing error rate and the level of polymorphism. Mapping transcripts to their genomic source can help solve this problem.
The presence of a poly-A tail is often taken as meaning the end of a transcript, but it may just be a run of A's in the middle of an exon. Real poly-A tails often have an AATAAA consensus sequence upstream, but a more reliable measure examines the genomic source to determine if the A's come from the genome or were added to a transcript.
Given a database of DNA sequences, it is often necessary to rapidly group related sequences for further analysis or simply identify redundancy at some level. One approach is to use BLASTN with rapid, insensitive search parameters, and then parse the output for the desired properties (e.g., 97 percent identity over at least 90 percent of the sequence length), and finally group all reads that are directly or indirectly (transitively) associated. Bioperl tools can automate such a procedure, but it takes a little work. The NCBI-BLAST distribution includes a standalone program called blastclust that is designed for just this task.
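If you want to try the parse-and-group approach yourself, the grouping step is single-linkage clustering, which a union-find structure handles neatly. The sketch below assumes you have already reduced the BLASTN output to (query, subject, percent identity, alignment length) tuples. The function names are hypothetical, and measuring coverage against the shorter of the two sequences is our assumption, since other definitions are equally defensible.

```python
def passing_pairs(hits, lengths, min_identity=97.0, min_coverage=0.9):
    """Yield (query, subject) pairs that satisfy identity and coverage cutoffs.

    hits: iterable of (query, subject, pct_identity, alignment_length).
    Coverage is measured against the shorter sequence (an assumption).
    """
    for query, subject, identity, aln_len in hits:
        if query == subject:
            continue  # ignore self matches
        shorter = min(lengths[query], lengths[subject])
        if identity >= min_identity and aln_len / shorter >= min_coverage:
            yield query, subject


def cluster_transitively(ids, pairs):
    """Group ids that are linked directly or indirectly by any pair."""
    parent = {name: name for name in ids}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for name in ids:
        clusters.setdefault(find(name), []).append(name)
    # Largest clusters first, then alphabetical for a stable order.
    return sorted((sorted(members) for members in clusters.values()),
                  key=lambda members: (-len(members), members))
```

Because the grouping is transitive, one chimeric or repeat-containing read can merge two unrelated clusters, so it is worth inspecting unusually large groups.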
Two protocols are given below: one for clustering ESTs that are expected to be nearly identical across the length of the read (99 percent identity, 90 percent coverage), and another for shotgun sequences that have high identity over a smaller region of the read (97 percent identity, 10 percent coverage). The alignment parameters are preset for near identity, but some differences that may be the result of sequencing errors or polymorphism are allowed. Unlike other BLAST programs, blastclust doesn't allow soft masking.
blastclust -i <fasta file> -o <output file> -p F -L 0.9 -S 99 -b F
blastclust -i <fasta file> -o <output file> -p F -L 0.1 -S 97 -b F
The output from blastclust is one line for each cluster. Each line contains the identifiers for sequences in the cluster and may therefore be very long.
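Because each cluster occupies a single line, the output is easy to summarize. The sketch below assumes that identifiers within a line are separated by whitespace; the function names are hypothetical.

```python
def read_clusters(handle):
    """Return a list of clusters, each a list of sequence identifiers."""
    return [line.split() for line in handle if line.strip()]


def summarize(clusters):
    """Count clusters, sequences, and singletons, and find the largest cluster."""
    sizes = [len(cluster) for cluster in clusters]
    return {
        "clusters": len(clusters),
        "sequences": sum(sizes),
        "singletons": sum(1 for size in sizes if size == 1),
        "largest": max(sizes, default=0),
    }
```

You would typically call `read_clusters` on the open output file, for example `with open(path) as handle: clusters = read_clusters(handle)`, where `path` is whatever you passed to `-o`.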
Repetitive elements are a problem because they may lead to false associations. This is especially true in the shotgun sequence approach where you're looking for high identity over short stretches. In contrast, the EST approach requires a high-identity match over a large portion of the sequence, making it less prone to small repeat or domain problems.
Vectors are DNA sequences used to clone (copy) fragments of DNA. They are commonly used in DNA sequencing. For various reasons, the vector DNA may inadvertently be present in a sequencing read. Therefore, a common practice in sequencing labs is to identify and remove vector sequences. This protocol describes how to identify vectors but not actually clip them. This additional step can be accomplished in many ways and is easily automated using the Bioperl tools.
Our goal is to take a batch of sequencing reads and search them against a database of vector sequences (a comprehensive database is distributed in GenBank). We expect vector sequences to align with near identity, so our parameters reflect this. The parameters here are almost the same as for oligo mapping because vector contamination may be relatively short. However, we add complexity filters because raw sequencing reads sometimes have an abundance of low complexity sequence, and we change the gap parameters to better simulate sequence error.
blastall -p blastn -d <vector_db> -i <read> -G 1 -E 3 -W 10
blastn <vector_db> <read> M=1 N=-3 Q=3 R=3 W=10 filter=seg
Vector similarity usually occurs on one end of the query sequence, but it may not extend all the way to the end of the read if the sequence quality drops and the alignment deteriorates. It's difficult to tell the difference between a short piece of vector contamination and a fortuitous similarity. If the alignment is at the end of the read and longer than 15 nucleotides, it's a good bet that the alignment is to vector.
Overcalling and undercalling are two potential problems in vector clipping. Poor sequence quality may lead to undercalling, so quality clipping usually precedes vector clipping (this isn't a BLAST-based procedure). Undercalling can also occur if the value of E is set too low, because short but genuine vector alignments are then discarded. Overcalling can result from believing that all alignments reported are vector similarities when they are really only expected at one end of the sequence.
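The rule of thumb above can be written as a small filter over the read coordinates of each alignment. In this sketch, `min_length` comes from the 15-nucleotide guideline, while `end_margin` (how close to a read end still counts as "at the end") is a hypothetical allowance for alignments that stop short when quality drops. Coordinates are assumed to be 1-based and inclusive.

```python
def likely_vector(read_length, q_start, q_end, min_length=15, end_margin=5):
    """Return True if an alignment is long enough and sits at either read end."""
    lo, hi = sorted((q_start, q_end))
    long_enough = (hi - lo + 1) > min_length
    at_end = lo <= 1 + end_margin or hi >= read_length - end_margin
    return long_enough and at_end
```

Alignments that fail this test aren't necessarily harmless, but treating internal or very short hits as vector is one of the ways overcalling creeps in.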
Query packing speeds up this BLAST search by a factor of 10. The standard vector database has many more vectors than may be used by the sequencing lab, so a good way to increase your efficiency is to minimize the vector database. You can use MegaBLAST here, though the default large word size poses a small risk for short regions of vector contamination.
Eukaryotic genomes often contain an abundance of repetitive elements. There are many kinds of repetitive elements, and these sequences may comprise most of a genome. Libraries of repetitive sequences are available from GenBank and elsewhere (http://www.girinst.org/Repbase_Update.html), but for newly sequenced organisms, you may have to build your own library.
Finding repetitive elements requires relaxed search parameters because their sequences are free to drift, and there is usually quite a bit of divergence within a particular family. We recommend soft masking rather than ordinary complexity filtering to ensure that the elements containing a low complexity sequence are aligned over their entire length. When choosing a value for W, we try to balance speed and sensitivity. While it may make sense to choose a very small value to ensure that all repeats are found, doing so isn't practical if you have to process many sequences. For WU-BLAST, we include the kap parameter, which omits calculating scores for combinations of alignments.
blastall -p blastn -d <repeat_db> -i <dna> -r 1 -q -1 -G 2 -E 2 -W 9 -F "m D"
blastn <repeat_db> <dna> M=1 N=-1 Q=2 R=2 W=9 wordmask=seg kap
Alignments between repeats are expected to range from perfect identity to complete obscurity. The common classification scheme applies a score cutoff to discriminate between repeats and nonrepeats. Identifying the proper score threshold takes some experimentation because each repeat family has its own length and expected divergence. Overall, the score threshold determines the balance between undercalling and overcalling.
Some repetitive elements are mobile and may therefore insert themselves into other elements. If an insertion occurs near the end of an element, the alignment on the shorter side may fall below the score threshold.
RepeatMasker (http://repeatmasker.genome.washington.edu) is the standard program used to identify and mask repeats. It uses a range of word sizes, scoring matrices, and cutoffs to optimize the sensitivity for each repeat family. One of its special features is that it clips full-length elements from sequences and performs a second round of searches with a "compressed" sequence. This enables it to find nested repeats. If your favorite genome is supported by RepeatMasker, it is probably better to use this software than write your own. However, if you want/need to do your own repeat masking, you will find that Bioperl tools are an enormous help.
This protocol departs from the usual format because it is especially difficult and requires more than a single BLAST search. Contaminants come in many forms. Some, such as mitochondrial DNA mixed with nuclear DNA, are easily detected with near-identity parameters. But cross-species contamination is very difficult to detect. If you find an exact match between two genomes, is it contamination or a highly conserved region? There's no simple answer. Some genomes, however, have specific signatures. For example, the human genome has many primate-specific Alu repetitive elements. If you find many Alu elements in a database of corn sequence, they probably come from a contaminant.
The most critical part of contaminant detection is having representative databases. You can't find contaminants for which you have no sequences. On the other hand, if your sequence database is too large, you may spend an inordinate amount of time looking for contaminants. Repetitive element databases are good representative databases, and a reasonable approach to contaminant detection is to look for repeats that match other genomes better than your genome of interest. This won't catch everything, but it will tell you how much of a contaminant problem you may have.