8.12 Use Caution When Searching Raw Sequencing Reads

The largest sources of raw sequencing reads are the early stages of genome projects and EST sequencing. Most sequencing reads have an error rate of about 1 percent. This rate isn't uniform; there is a spike near the beginning of the read and a gradual increase toward the end. In addition, some regions have intrinsically high error rates due to compositional properties such as high GC content. DNA sequencing involves several steps, and there are abundant opportunities for mechanical and human error. Because a BLASTN seed requires an exact word match, a single miscalled base inside a word keeps that word from seeding, so you need to be careful when using large word sizes. For redundant sequence collections, such as 3x shotgun coverage of a genome, large word sizes are fine, because another read covering the same region is likely to seed even when one read does not. But if missing a single alignment is troublesome, scale down the word size to keep sequencing errors from preventing seeding.
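
The effect of word size can be estimated with a short calculation. The sketch below is illustrative only: it assumes errors are independent and uniform across the read (which, as noted above, real reads are not) and that the aligned sequences would otherwise be identical. The function name `seed_probability` and its parameters are hypothetical and not part of any BLAST program.

```python
def seed_probability(aligned_length, word_size, error_rate):
    """Chance that at least one error-free word of word_size bases
    occurs in an aligned region of aligned_length bases.

    Assumption: every base is miscalled independently with the same
    probability error_rate. Real reads have position-dependent errors.
    """
    p_ok = 1.0 - error_rate
    # run[k] = probability that the current streak of correct bases has
    # length k and no full word has been seen so far.
    run = [0.0] * word_size
    run[0] = 1.0
    found = 0.0
    for _ in range(aligned_length):
        nxt = [0.0] * word_size
        for k, prob in enumerate(run):
            nxt[0] += prob * error_rate      # an error resets the streak
            if k + 1 == word_size:
                found += prob * p_ok         # streak completes a word
            else:
                nxt[k + 1] += prob * p_ok    # streak grows by one base
        run = nxt
    return found


if __name__ == "__main__":
    # Compare a few word sizes over a short, 50-base overlap at 1% error.
    for w in (11, 20, 30, 40):
        print(w, round(seed_probability(50, w, 0.01), 4))
```

Under these assumptions, the probability falls as the word size approaches the length of the region you hope to align, which is why short overlaps between error-prone reads are the ones most likely to be missed with large words.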

Raw sequencing reads may be contaminated by a variety of sources. Cloning vectors are one expected source. Depending on the sequencing center, the vectors may or may not have been clipped from the sequence. Other kinds of contamination are also possible. Nuclear DNA is sometimes contaminated with mitochondrial or viral DNA, and any collection of sequences can be contaminated with DNA from another organism (genome centers usually sequence more than one entity at a time, and sometimes there's a mix-up of who did what and when). ESTs sometimes have their poly-A tail intact, and whether or not this is contamination is a matter of perspective. Taken together, there are many opportunities for contamination, and it's a good idea to be cautious when using raw sequencing reads.
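
As a rough illustration of that caution, the following sketch trims a trailing poly-A run and flags reads that share an exact stretch of sequence with a known contaminant such as a cloning vector. Everything here is an assumption made for the example: the function names, the 10-base tail threshold, and the 20-base match length are hypothetical, only the forward strand is checked, and an exact-match screen will miss contaminant sequence that itself contains sequencing errors.

```python
import re


def trim_poly_a(seq, min_tail=10):
    """Remove a trailing run of at least min_tail A's.

    The min_tail threshold is an arbitrary choice for this sketch.
    """
    match = re.search(r"[Aa]{%d,}$" % min_tail, seq)
    return seq[:match.start()] if match else seq


def shares_kmer(read, contaminant, k=20):
    """Return True if read and contaminant share any exact k-mer.

    Forward strand only; a real screen would also check the reverse
    complement and tolerate mismatches.
    """
    read, contaminant = read.upper(), contaminant.upper()
    kmers = {contaminant[i:i + k] for i in range(len(contaminant) - k + 1)}
    return any(read[i:i + k] in kmers for i in range(len(read) - k + 1))


if __name__ == "__main__":
    vector = "GATTACAGATTACACCGGTTAACCGGTT"   # made-up "vector" sequence
    read = "TTTT" + vector[3:25] + "CCCC" + "A" * 15
    trimmed = trim_poly_a(read)
    print(len(read), len(trimmed), shares_kmer(trimmed, vector))
```

Whether to trim the poly-A tail at all depends on the question being asked, so a screen like this is better used to flag reads for inspection than to discard them automatically.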