8.10 Be Skeptical of Hypothetical Proteins

Amino acid sequencing is more difficult than nucleic acid sequencing, so the sequences of most proteins are inferred by translating DNA. Some inferences come from gene predictions and others come from transcript translations. Finding the correct structure of genes in genomic DNA is very difficult; algorithms are incomplete approximations, and people make mistakes. Some research groups are conservative and report proteins only when there is good evidence. Others submit hypothetical proteins in the hope that they will be useful (and they often are). As a result, many proteins in the public databases are slightly incorrect or even fictitious. Unfortunately, hypothetical gene structures aren't always clearly labeled.

The most accurate protein sequences come from translating full-length cDNAs. But determining the protein encoded by a transcript isn't as simple as it sounds. While there is usually only one long open reading frame (ORF), the longest ORF won't necessarily correspond to a real protein. Be suspicious of all short proteins. Even in a full-length cDNA with a very large ORF, determining the start of translation isn't straightforward. The first methionine in the longest ORF is usually picked as the start of translation, but this is a rule of convenience, not a biological truth. When the true start lies at a downstream methionine, the predicted protein carries extra residues at its beginning, and many protein sequences have such erroneous N-terminal extensions.
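The "first methionine in the longest ORF" convention can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names (translate, longest_orf, candidate_starts) are hypothetical, the toy cDNA is invented, only the three forward frames are searched, the standard genetic code is used, and an ORF is taken to be any stop-free stretch of a frame. It is not how any particular database or annotation pipeline works.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# no alternative translation tables are needed for this toy example).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, frame=0):
    """Translate one forward frame; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(frame, len(seq) - 2, 3)
    )


def longest_orf(seq):
    """Return (frame, offset, peptide) for the longest stop-free stretch.

    offset is the position of the peptide within the frame's translation,
    counted in amino acids. Only the three forward frames are examined.
    """
    best = (0, 0, "")
    for frame in range(3):
        pos = 0
        for segment in translate(seq, frame).split("*"):
            if len(segment) > len(best[2]):
                best = (frame, pos, segment)
            pos += len(segment) + 1  # +1 skips the stop codon
    return best


def candidate_starts(orf):
    """Indices of every in-frame methionine, each a possible start site."""
    return [i for i, aa in enumerate(orf) if aa == "M"]


if __name__ == "__main__":
    # Invented toy cDNA with two in-frame ATG codons.
    cdna = "CTGGCAATGAAACCCATGGGGTTTGATAGCTAA"
    frame, offset, orf = longest_orf(cdna)
    print("longest ORF:", orf, "in frame", frame)
    for i in candidate_starts(orf):
        print("possible protein from M at", i, ":", orf[i:])
```

For the toy sequence, the convention would report the protein starting at the first methionine, but the sequence alone gives no way to rule out the second one. If translation really begins at the downstream methionine, the reported protein carries an N-terminal extension that does not exist in the cell.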