10.4 A Genetic Information Model


Let's envision a genetic-based information model. One way to ground the model is to start with a DNA sequence. NCBI uses this starting point because GenBank is commissioned as a sequence database, as opposed to a protein database such as Swiss-Prot. It is instructive to follow this starting point and see where it leads in terms of the variety of information that can be attached to a sequence that contains a gene. The information model and some of the data within it are somewhat contrived to bring out the key points. A detailed and accurate biological model is beyond the scope of this chapter.

Listing 10.1 is an XML document derived from an NCBI DNA sequence entry. For each sequence, a name or definition is provided along with a unique accession number to identify the sequence. This information is contained within the header element; also included in the header element are keywords. Notice how each keyword is contained within its own <keyword> tag. This makes database searching based on keywords much more efficient.

Listing 10.1 DNA Sequence Entry
<dna_sequence_entry>
 <header>
    <definition>
       Human Cu/Zn superoxide dismutase (SOD1) gene
    </definition>
    <accession_no>
       <base_no>L44135</base_no>
       <version>L44135.1</version>
       <GI>1237400</GI>
    </accession_no>
    <keyword>Cu/Zn superoxide dismutase</keyword>
    <keyword>Human SOD1 gene</keyword>
</header>
<source>
    <name>Human Cu/Zn superoxide dismutase (SOD1) gene, exon
    1.</name>
    <organism>Homo sapiens</organism>
    <taxonomy>
       <cell_type>Eukaryota</cell_type>
       <kingdom>Metazoa</kingdom>
       <phylum>Chordata</phylum>
       <subphylum>Vertebrata</subphylum>
       <class>Mammalia</class>
       <infraclass>Eutheria</infraclass>
       <order>Primates</order>
       <family>Hominidae</family>
       <genus>Homo</genus>
       <species>sapiens</species>
    </taxonomy>
</source>
<reference>
   <author> Levanon,D. </author>
   <author> Lieman-Hurwitz,J </author>
   <title>
      Architecture and anatomy of the chromosomal locus in human chromosome 21 encoding the Cu/Zn superoxide dismutase
   </title>
   <journal> EMBO J. 4 (1), 77-84 (1985)</journal>
  </reference>
  <dna_sequence>
gtaccctgtttacatcattttgccattttcgcgtactgcaaccggcgggccacgccgtgaaaagaag
gttgttttctccacagtttcggggttctggacgtttcccggctgcggggcggggggagtctccggcg
cacgcggccccttggcccgccccagtcattcccggccactcgcgacccgaggctgccgcagggggcg
ggctgagcgcgtgcgaggccattggtttggggcc . . .
 </dna_sequence>
 <features>
    <protein>
       <type>CDS</type>
       <location>1..799</location>
       <gene>SOD1</gene>
       <codon_start>1</codon_start>
       <chromosome>21</chromosome>
       <map>21q22.1</map>
       <product>Cu/Zn-superoxide dismutase</product>
       <protein_id>AAB05661.1</protein_id>
       <db_xref>GI:1237407</db_xref>
       <amino_acid_sequence>
MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDL . . .
       </amino_acid_sequence>
    </protein>
 </features>
</dna_sequence_entry>
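
The benefit of giving each keyword its own element can be sketched in Python with the standard library's ElementTree module. This is a minimal illustration, not part of the NCBI tooling: the `<entries>` wrapper, the second entry, and the function name `find_by_keyword` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Two abbreviated entries modeled on Listing 10.1.
# The <entries> wrapper and the second entry are hypothetical.
ENTRIES_XML = """
<entries>
  <dna_sequence_entry>
    <header>
      <definition>Human Cu/Zn superoxide dismutase (SOD1) gene</definition>
      <keyword>Cu/Zn superoxide dismutase</keyword>
      <keyword>Human SOD1 gene</keyword>
    </header>
  </dna_sequence_entry>
  <dna_sequence_entry>
    <header>
      <definition>Hypothetical example entry</definition>
      <keyword>hypothetical keyword</keyword>
    </header>
  </dna_sequence_entry>
</entries>
"""


def find_by_keyword(root, keyword):
    """Return the definitions of entries that carry an exact keyword match."""
    hits = []
    for entry in root.iter("dna_sequence_entry"):
        # Each keyword is its own element, so no string splitting is needed.
        keywords = [(k.text or "").strip() for k in entry.findall("header/keyword")]
        if keyword in keywords:
            hits.append(entry.findtext("header/definition").strip())
    return hits


root = ET.fromstring(ENTRIES_XML)
print(find_by_keyword(root, "Human SOD1 gene"))
```

Because the match is made against whole `<keyword>` values, a partial string such as "SOD1" finds nothing here; a real search layer would decide separately whether substring matching is wanted.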

The next element in Listing 10.1 is source information. This information lets us know where and from what organism the sequence came. For most biologists the hierarchy in Listing 10.2 will be quite familiar.

Listing 10.2 Linnaean Classification of Humans
Kingdom:  Animalia
 Phylum:  Chordata
     Subphylum:  Vertebrata
        Class:  Mammalia
           Subclass:  Theria
              Infraclass:  Eutheria
                   Order:  Primates
                      Suborder:  Anthropoidea
                           Superfamily:  Hominoidea
                                  Family:  Hominidae
                                         Genus:  Homo
                                               Species:  sapiens

Listing 10.2 was taken from one of many Web sites that detail Linnaean classifications (http://anthro.palomar.edu/animal/humans.htm). While the degree of detail expressed in Listing 10.2 (outlining 12 classification categories from kingdom to species) may change depending on the source, it serves to illustrate the natural and efficient mechanism used to classify organisms. We have added a cell type category to cover the Eukaryota/Prokaryota breakdown of a cell. Many databases put the primary taxonomic information within a single <taxonomy> tag (Eukaryota to Homo). Packing most of the classification categories into one data item fails to respect the natural hierarchy of Linnaean classification and fails to unleash the strengths of XML. By mapping the natural hierarchical structure into XML, database searches for DNA sequences based on organism classification become more natural and efficient. It is surprising how often this point is missed.
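
As a rough sketch of why per-rank tags help, the following Python fragment looks up and filters on individual ranks by tag name. It uses the rank tags from Listing 10.1 in abbreviated form; the helper names `rank_of` and `matches` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated <source> element using the per-rank tags from Listing 10.1.
SOURCE_XML = """
<source>
  <organism>Homo sapiens</organism>
  <taxonomy>
    <cell_type>Eukaryota</cell_type>
    <kingdom>Metazoa</kingdom>
    <phylum>Chordata</phylum>
    <class>Mammalia</class>
    <order>Primates</order>
    <family>Hominidae</family>
    <genus>Homo</genus>
    <species>sapiens</species>
  </taxonomy>
</source>
"""


def rank_of(source, rank):
    """Look up one taxonomic rank by tag name; None if the rank is absent."""
    return source.findtext(f"taxonomy/{rank}")


def matches(source, **ranks):
    """True if every requested rank has the requested value."""
    return all(rank_of(source, rank) == value for rank, value in ranks.items())


source = ET.fromstring(SOURCE_XML)
print(rank_of(source, "order"))
# "class" is a Python keyword, so the ranks are passed as an unpacked dict.
print(matches(source, **{"class": "Mammalia", "order": "Primates"}))
```

With a single flat taxonomy string, the same query would require parsing the string and guessing which position corresponds to which rank.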

The <reference> element in Listing 10.1 contains journal article information related to the DNA sequence. Often the researcher(s) who sequenced the section of DNA submits a journal article describing the sequence and information related to the sequence. The actual DNA sequence is contained within the <dna_sequence> element.

Finally, Listing 10.1 contains a features section, which holds information about features found within the DNA sequence. In Listing 10.1, a section of the DNA (nucleotides 447 through 1934) codes for a protein sequence. Some of the basic information on the protein is contained within the <protein> element.

If life were simple, we could stop here. We may want to add or remove some of the descriptive information about a sequence, but we might be able to lock down a data structure for capturing sequence information. Of course, life is not this simple, and we must consider how we might attach additional information to our base model and what forms this information might come in.

As noted previously, one of the sections in the XML in Listing 10.1 is a features section. In the brief biology introduction (Section 10.2), it was stated that DNA is composed of genes, and genes code for proteins. The feature shown in Listing 10.1 highlights a section of the DNA sequence, nucleotides 447 through 1934, which is a gene segment that codes for a protein. The protein ID is given along with the protein amino acid sequence. The more one understands and discovers about biology, the more information one would like to attach to the given DNA sequence. Here is a list of a few items that come to mind:

  • DNA sequences are composed of genes. The actual gene section (or gene sections if there are multiple genes) of the sequence needs to be identified and noted in the information model (the XML). This would be noted by opening up a <gene> . . . </gene> element within the features section. The gene name, reference number, location, plus other information would be included within the element.

  • Genes have promoter sections that control when the given gene is expressed. A gene is expressed if it is actively being transcribed into mRNA. Transcription factors (which are specialized proteins) bind to sections of the DNA in the promoter region to control gene expression and expression rates. The promoter sections need to be identified along with the specific transcription factors that control the gene at hand. Different genes use different transcription factors and different control mechanisms. This implies that, within a gene element, we would have a <promoter> . . . </promoter> element. Within the promoter element, we would have <transcription_factor> . . . </transcription_factor> elements. Within the transcription factor elements, there might be information as to where the promoter section binds with the transcription factor, and under what conditions.

  • Genes have regions that code for proteins called exons, and interspersed within the gene may be regions that are not part of the coding for a protein; these regions are called introns. The information model must be able to handle the identification of these sections.

  • The same gene may code for several different proteins through a mix-and-match use of its exon sections; this is called alternative splicing of the mRNA. Within each gene section, the various proteins that the gene codes for would be inserted. This would move the protein element in the example XML document in Listing 10.1 to be within a gene element. As we learn what control factors determine which proteins are made at which times, these control factors will be added to our model.

  • Proteins control almost every biological process in a living organism and are used in the basic structure of organisms. In order to identify what a given gene is controlling or affects, one must know the function of the protein that the gene codes for. A protein function element could be added to the protein element to encode this information.

  • A gene may be involved directly in a given organism trait such as eye color, hair color, baldness, number of fingers, and so on. The phenotype of a gene is this outward characteristic expression of the gene. For each gene we may want to have a <phenotype> . . . </phenotype> element. The subelements of the phenotype would depend heavily on the particular gene.

  • A gene that goes awry can be the root cause of any number of diseases including sickle cell anemia, Huntington disease, cystic fibrosis, blindness, all kinds of cancers, and so on. A single nucleotide polymorphism (SNP) is a single change in one base-pair on a DNA sequence. Sickle cell anemia is caused by an error in the gene that tells the body how to make hemoglobin. The defective gene tells the body to make the abnormal hemoglobin that results in deformed red blood cells. There is one amino acid substitution, a valine for glutamic acid in the beta sixth position that forms sickle beta chains, which is caused by a SNP on chromosome 11 where the beta chain of hemoglobin is coded. Sections for SNPs and related diseases may be added to capture this information.
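
The sickle cell item in the list above can be made concrete with a small Python sketch. The text gives only the amino acid substitution; the codon change used here (GAG to GTG) is standard background knowledge rather than something stated in this chapter, the codon table is deliberately partial, and `apply_snp` is a hypothetical helper.

```python
# Partial codon table: only the two codons this example needs.
CODON_TO_AMINO_ACID = {"GAG": "Glu", "GTG": "Val"}


def apply_snp(codon, position, new_base):
    """Return the codon with the base at a 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


normal_codon = "GAG"                             # codes for glutamic acid
sickle_codon = apply_snp(normal_codon, 1, "T")   # single base change: A -> T

print(normal_codon, CODON_TO_AMINO_ACID[normal_codon])
print(sickle_codon, CODON_TO_AMINO_ACID[sickle_codon])
```

A single changed base is enough to swap glutamic acid for valine, which is the kind of fact an SNP section in the model would need to record along with the related disease.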

Listing 10.3 shows the updated DNA sequence entry with some of this new information added.

Listing 10.3 Updated DNA Sequence Entry
<dna_sequence_entry>
 <header>
    <definition>
       Human Cu/Zn superoxide dismutase (SOD1) gene
    </definition>
    <accession_no>
       <base_no>L44135</base_no>
       <version>L44135.1</version>
       <GI>1237400</GI>
    </accession_no>
    <keyword>Cu/Zn superoxide dismutase</keyword>
    <keyword>Human SOD1 gene</keyword>
 </header>
 <source>
    <name> Human Cu/Zn superoxide dismutase (SOD1) gene, exon
    1.</name>
    <organism>Homo sapiens</organism>
    <taxonomy>
       <cell_type>Eukaryota</cell_type>
       <kingdom>Metazoa</kingdom>
       <phylum>Chordata</phylum>
       <subphylum>Vertebrata</subphylum>
       <class>Mammalia</class>
       <infraclass>Eutheria</infraclass>
       <order>Primates</order>
       <family>Hominidae</family>
       <genus>Homo</genus>
       <species>sapiens</species>
    </taxonomy>
 </source>
 <reference>
    <author> Levanon,D. </author>
    <author> Lieman-Hurwitz,J </author>
    <title>
       Architecture and anatomy of the chromosomal locus in human chromosome 21 encoding the Cu/Zn superoxide dismutase
    </title>
    <journal> EMBO J. 4 (1), 77-84 (1985)</journal>
 </reference>
 <dna_sequence>
gtaccctgtttacatcattttgccattttcgcgtactgcaaccggcgggccacgccgtgaaaagaag
gttgttttctccacagtttcggggttctggacgtttcccggctgcggggcggggggagtctccggcg
cacgcggccccttggcccgccccagtcattcccggccactcgcgacccgaggctgccgcagggggcg
ggctgagcgcgtgcgaggccattggtttggggcc . . .
 </dna_sequence>
 <features>
    <gene>
       <location>1..799</location>
       <phenotype>
          <eye_color>blue</eye_color>
       </phenotype>
       <promoter_section>
          <location>100..447</location>
          <transcription_factor>
             Various transcription factor info
          </transcription_factor>
          <transcription_factor>
             Various transcription factor info
          </transcription_factor>
       </promoter_section>
       <SNP>
          mutation changing codon 102 from Asp->Gly and causing amyotrophic lateral sclerosis
       </SNP>
       <exon>5</exon>
       <protein>
          <type>CDS</type>
          <location>1..799</location>
          <gene>SOD1</gene>
          <codon_start>1</codon_start>
          <chromosome>21</chromosome>
          <map>21q22.1</map>
          <product>Cu/Zn-superoxide dismutase</product>
          <protein_id>AAB05661.1</protein_id>
          <db_xref>GI:1237407</db_xref>
          <amino_acid_sequence>
MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSRKHGG . . .
          </amino_acid_sequence>
       </protein>
       <related_diseases>
          Disease Information, AMYOTROPHIC LATERAL SCLEROSIS Lou Gehrig's disease
       </related_diseases>
    </gene>
 </features>
</dna_sequence_entry>

It would be easy to imagine many more examples of all the possible information that can be associated with one DNA sequence. From this illustration, one quickly begins to realize why biological information is complex to model, and this example touched on only some of the issues at hand. Microbiology is still very much in the discovery phase. It is impossible to predict all of the information, context plus data, that researchers will want to capture against a given sequence. Therefore, it is imperative that the information model be flexible and easily scalable.

A few points about the DNA sequence model presented in Listing 10.3 are worth noting. First, all of the data items are contained within elements and not within attributes. Many of the XML schemas being developed for bioinformatics, and other fields as well, place a variety of their data items within attributes. While this is syntactically correct XML, it is poor information modeling. This practice blurs the distinction between an attribute and a data element. An attribute should tell us how to process or interpret data enclosed within the tag element and apply to all items within the tag element; it should not be data.

The tag structure in the DNA sequence model gives the entire context of related data items. We know based on tag structure, for instance, that "CDS" is a type of a protein coded by a gene that is a feature within a DNA sequence.
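
One way to see this context is to recover the chain of enclosing tags for a data item. The sketch below, in Python with the standard ElementTree module, walks an abbreviated version of the features section from Listing 10.3; the function name `paths_to_text` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated features section following the nesting used in Listing 10.3.
FEATURES_XML = """
<dna_sequence_entry>
  <features>
    <gene>
      <protein>
        <type>CDS</type>
        <product>Cu/Zn-superoxide dismutase</product>
      </protein>
    </gene>
  </features>
</dna_sequence_entry>
"""


def paths_to_text(element, target, prefix=()):
    """Yield the tag path to every leaf element whose text equals target."""
    path = prefix + (element.tag,)
    is_leaf = len(element) == 0
    if is_leaf and (element.text or "").strip() == target:
        yield "/".join(path)
    for child in element:
        yield from paths_to_text(child, target, path)


root = ET.fromstring(FEATURES_XML)
print(list(paths_to_text(root, "CDS")))
```

The resulting path reads from the outside in: sequence entry, features, gene, protein, type, which is the same context the sentence above spells out in words.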

Hierarchy is used to show relationships within the model. For instance, a protein is contained within a gene. If the same gene codes for multiple proteins, multiple protein elements will be within the gene element. If the DNA sequence contains several genes, there will be several gene elements, each containing its own protein elements.
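
The containment relationship can be read back directly from the nesting. The following Python sketch is illustrative only: the gene and product names are placeholders, and the `<name>` child of `<gene>` is an assumption, since Listing 10.3 does not include one.

```python
import xml.etree.ElementTree as ET

# Hypothetical features section: two genes, one of which codes for two proteins.
FEATURES_XML = """
<features>
  <gene>
    <name>geneA</name>
    <protein><product>protein A1</product></protein>
    <protein><product>protein A2</product></protein>
  </gene>
  <gene>
    <name>geneB</name>
    <protein><product>protein B1</product></protein>
  </gene>
</features>
"""


def proteins_by_gene(features):
    """Map each gene's name to the products of the proteins nested inside it."""
    return {
        gene.findtext("name"): [p.findtext("product") for p in gene.findall("protein")]
        for gene in features.findall("gene")
    }


features = ET.fromstring(FEATURES_XML)
print(proteins_by_gene(features))
```

No join key or cross-reference is needed to tie a protein to its gene; the parent element carries that relationship.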

Finally, the model represents heterogeneous information, which is not obvious looking at one DNA sequence entry. If we looked across many DNA sequence entries, we would see that each DNA sequence contained different types of information. Some of the basic information types would remain the same; all would have a header element and a <dna_sequence> element, but not all would contain gene elements. This is due to the fact that much of the DNA in humans does not code for proteins and therefore does not contain genes. It is still a mystery what 90 percent of human DNA does! Based on this fact alone, the model must be flexible. Since genes that are in the DNA code for such a wide variety of proteins that are involved in a vast array of functions, it should be clear that the types of information captured for each gene will also vary widely. Therefore, our information model captures heterogeneous information.
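
Code that consumes such entries has to treat most elements as optional. The Python sketch below assumes a hypothetical `<entries>` wrapper and two made-up entries, one with a gene and one without, and summarizes both with the same function.

```python
import xml.etree.ElementTree as ET

# Two hypothetical entries: only the first contains a features section.
ENTRIES_XML = """
<entries>
  <dna_sequence_entry>
    <header><definition>entry with a gene</definition></header>
    <dna_sequence>gtaccctg</dna_sequence>
    <features>
      <gene><protein><product>Cu/Zn-superoxide dismutase</product></protein></gene>
    </features>
  </dna_sequence_entry>
  <dna_sequence_entry>
    <header><definition>non-coding entry</definition></header>
    <dna_sequence>acgtacgt</dna_sequence>
  </dna_sequence_entry>
</entries>
"""


def summarize(entry):
    """Summarize an entry; anything beyond header and sequence is optional."""
    return {
        "definition": entry.findtext("header/definition"),
        "sequence_length": len(entry.findtext("dna_sequence", default="").strip()),
        # iterfind yields nothing when the entry has no genes, giving an empty list.
        "products": [p.text for p in entry.iterfind("features/gene/protein/product")],
    }


entries = ET.fromstring(ENTRIES_XML).findall("dna_sequence_entry")
summaries = [summarize(entry) for entry in entries]
for summary in summaries:
    print(summary)
```

The entry without genes is not an error case; it simply produces an empty product list, which is the behavior a flexible model calls for.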

