Amino acid sequencing is more difficult than nucleic acid sequencing,
so the sequences of most proteins are inferred from DNA
translations. Some inferences come
from gene predictions and others come from transcript translations.
Finding the correct structure of genes in genomic DNA is very
difficult; algorithms are incomplete approximations, and people make
mistakes. Some research groups are conservative and only report
proteins when there is good evidence. Others submit hypothetical
proteins and hope that they will be useful (and they often are). As a
result, many protein sequences in the public databases are slightly incorrect
or even fictitious. Unfortunately, hypothetical gene structures
aren't always clearly labeled.

The most accurate protein sequences come from translating full-length
cDNAs. But determining the protein encoded by a transcript
isn't as simple as it sounds. While there is usually
only one long open reading frame (ORF), the longest ORF
won't necessarily correspond to a real protein. Be
suspicious of all short proteins. Even in a full-length cDNA with a
very large ORF, determining the start of translation
isn't straightforward. The first methionine in the
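longest ORF is usually picked as the start of translation, but this is
a rule of convenience, not a biological truth. When the real start is a
methionine farther downstream, the predicted sequence carries extra
residues at its beginning, which is why many protein sequences
have erroneous N-terminal extensions.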
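
The "longest ORF, first methionine" convention described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular database or annotation pipeline works: it uses the standard genetic code, scans only the three forward frames of the cDNA, and treats an ORF as a stretch of codons with no stop. The function names (translate, longest_orf, predicted_protein, alternative_starts) and the toy sequence are made up for this example.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq):
    """Translate DNA starting at its first base; unrecognized codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def longest_orf(cdna):
    """Return (frame, peptide) for the longest stop-free stretch in frames 0, 1, 2."""
    best = (0, "")
    for frame in range(3):
        for segment in translate(cdna[frame:]).split("*"):
            if len(segment) > len(best[1]):
                best = (frame, segment)
    return best


def predicted_protein(orf):
    """Trim an ORF to its first methionine: the rule of convenience, not a proof."""
    start = orf.find("M")
    return orf[start:] if start != -1 else ""


def alternative_starts(orf):
    """Positions of every methionine in the ORF; each is a possible true start."""
    return [i for i, aa in enumerate(orf) if aa == "M"]


cdna = "GCTCTGATGGCTAAAATGGAACGTTGA"  # hypothetical toy sequence
frame, orf = longest_orf(cdna)
print(frame, orf)               # 0 ALMAKMER
print(predicted_protein(orf))   # MAKMER
print(alternative_starts(orf))  # [2, 5]
```

In this toy case the convention reports MAKMER. If translation really began at the second methionine, the true protein would be MER and the first three residues would be an erroneous N-terminal extension. Nothing in the sequence alone says which reading is correct.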