10.3 Life Sciences Are Turning to XML to Model Their Information

It has been said that the difficulties in dealing with bioinformatics data come more from its idiosyncrasies than its quantity, and there certainly is no lack of quantity of information (Achard et al. 2001). Biological data are complex to model, and because of this complexity, our models must have the flexibility to grow. The complexity comes in part from the large variety of data types and their many interrelationships. In addition, new data types emerge regularly, and these new types modify our perception of the old types.

François Rechenmann confirms this synopsis by stating, "It is not so much the volume of data which characterizes biology, as it is the data's diversity and heterogeneity" (Rechenmann 2000). He wraps up his editorial with the statement, "The crucial role of computer science is now unanimously recognized for the management and the analysis of biological data; through knowledge modeling, it could be brought to play a role in life sciences similar to the role mathematics plays in physical sciences."

Since the earliest origins of scientific thought, there have been efforts to collect, organize (i.e., categorize), and analyze data. A major shift in scientific thought occurred in 1735 with the publication of Systema Naturae by Carolus Linnaeus. While the desire to organize data for eventual analysis predated this publication, the Linnaean classification system transformed the world of biology. By establishing a common system for identifying living organisms, it opened countless avenues for analyzing the information gathered and allowed the overall knowledge base of biological science to grow enormously.

It is interesting to note that the Linnaean system, like XML, uses a hierarchical structure. The Linnaean system was revolutionary, and XML is revolutionary, for the same reason: both are flexible structures that can accommodate new information as it is discovered. Just as the Linnaean system of nomenclature allowed researchers to communicate in exact terms about specific organisms, XML, with its similar hierarchical nature, will allow researchers from widely diverse backgrounds to communicate information that leads to knowledge development.

Information is the combination of data and context. If someone gave you the sequence of letters "seat" without any context, you would have no way of knowing what he or she was referring to. It could be an uncomfortable seat or a short amino acid sequence contained within the aspartokinase I protein in E. coli. The point is that data without context is incomplete; only when we combine data with its context do we have information. A system that locks context and allows only data to be modified, as many traditional data models do, is unsuitable for knowledge management. Ideally, a knowledge management system will handle context (metadata) as freely and fluidly as it handles data (Direen et al. 2001).
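The "seat" example can be made concrete with a small sketch. The Python below uses the standard library's xml.etree.ElementTree module to parse a tiny XML fragment in which the same four letters appear twice. The element and attribute names (items, furniture, description, protein, sequence) are invented for illustration and are not taken from BIOML, BSML, or any other markup language discussed in this section.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names, chosen only for this illustration.
# The same data ("seat") appears twice, in two different contexts.
DOC = """
<items>
  <furniture><description>seat</description></furniture>
  <protein name="aspartokinase I" organism="E. coli">
    <sequence type="amino-acid">seat</sequence>
  </protein>
</items>
"""


def describe(root):
    """Return (context path, data) pairs for every leaf element with text."""
    results = []

    def walk(elem, path):
        here = path + [elem.tag]
        children = list(elem)
        # A leaf element carries the data; its chain of ancestors is the context.
        if not children and elem.text and elem.text.strip():
            results.append(("/".join(here), elem.text.strip()))
        for child in children:
            walk(child, here)

    walk(root, [])
    return results


root = ET.fromstring(DOC)
pairs = describe(root)
for context, data in pairs:
    print(f"{context}: {data}")
# items/furniture/description: seat
# items/protein/sequence: seat
```

The data in both leaves is identical; only the path of enclosing tags tells a reader, or a program, whether "seat" is a piece of furniture or a run of amino acids.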

XML is well suited for knowledge management systems because it pairs data with its context via the hierarchical tag structure. XML is the first widely adopted standard for expressing information as opposed to just data. XML is also rapidly becoming an accepted information-modeling language for bioinformatics. A number of groups are developing standard XML markup languages in this area, including the following:

BIOML (BIOpolymer Markup Language): BIOML is used to describe experimental information about proteins, genes, and other biopolymers. A BIOML document will describe a physical object (e.g., a particular protein) in such a way that all known experimental information about that object can be associated with the object in a logical and meaningful way. The information is nested at different levels of complexity and fits with the tree-leaf structure inherent in XML. BIOML is a product of core working groups at Proteometrics, LLC and Proteometrics Canada, Ltd. More information is available at http://www.bioml.com/BIOML/.
BSML (Bioinformatic Sequence Markup Language): The National Human Genome Research Institute (NHGRI) funded the development of BSML in 1997 as a public domain standard for the bioinformatics community. Among the early goals for BSML was to create a data representation model for sequences and their annotations. This model would enable linking the behavior of the display object to the sequences, annotations, and links it represents. The linking capability of BSML has paralleled the evolution of storage, analysis, and linking of biological data on computer networks. Sequence-related phenomena from the biomolecular level to the complete genome can be described using BSML. This flexibility provides a needed medium for genomics research. LabBook, Inc. (http://www.labbook.com/) is the author and owner of the copyrights for BSML.
PSDML: The Protein Information Resource (PIR) database is a partnership between the National Biomedical Research Foundation at Georgetown University Medical Center, the Munich Information Center for Protein Sequences, and the Japan International Protein Information Database. The PIR is an annotated, public-domain sequence database that allows sequence similarity and text searching. The Protein Sequence Database Markup Language (PSDML) is used to store protein information in the PIR database. More information is available at http://pir.georgetown.edu/.
GAME: Genome Annotation Markup Elements can be used to represent features or annotations about specific regions of a sequence. GAME can also distinguish annotations by their origin, for example, features generated by a sequence analysis program versus those generated by a professional lab worker. The goal is to facilitate the exchange of genomic annotations between researchers, genome centers, and model organism databases, allowing each to specify the conclusions drawn from its analysis and to share these descriptions with the others in XML. The first widely used version of GAME was created at the BDGP (Berkeley Drosophila Genome Project) by Suzanna Lewis and Erwin Frise. More information is available at http://www.bioxml.org/Projects/game/game0.1.html.
SBML (Systems Biology Markup Language): The Systems Biology Workbench (SBW) project at the California Institute of Technology seeks to provide for the sharing of models and resources between simulation and analysis tools for systems biology. One of the two approaches being pursued to attain this goal is the incremental development of the Systems Biology Markup Language (SBML). SBML is an XML-based representation of biochemical network models. More information is available at http://www.cds.caltech.edu/erato/index.html.
CellML: CellML is an XML-based markup language whose purpose is to store and exchange computer-based biological models. CellML has primarily been developed by Physiome Sciences, Inc., in Princeton, New Jersey, and the Bioengineering Institute at the University of Auckland. CellML allows models to be shared even when they were built with different model-building software, and it allows components to be reused between models. This capability accelerates the model-building process. More information is available at http://www.cellml.org/.
MAGE-ML: Microarray Gene Expression Markup Language has been automatically derived from the Microarray Gene Expression Object Model (MAGE-OM). MAGE-ML is based on XML and is designed to describe and communicate information about microarray-based experiments. The information can describe microarray designs, manufacturing information, experiment setup and execution information, gene expression data, and data analysis results. The Object Management Group (OMG) was primarily involved in the creation and distribution of MAGE-OM standards through the Gene Expression Request For Proposal (RFP). MAGE-ML replaced MAML (Microarray Markup Language) as of February 2002. More information is available at http://www.mged.org/Workgroups/MAGE/introduction.html.

These are a few of the XML markup languages being defined for the bioinformatics arena. Hank Simon (Simon 2001) provides a more complete list, and Paul Gordon has a Web site devoted to XML for molecular biology (http://www.visualgenomics.ca/gordonp/xml/).

XML's unprecedented flexibility and extensibility in capturing information are what make it ideal for knowledge management and for use in bioinformatics.
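One way to picture this extensibility is the following minimal sketch, again in Python with xml.etree.ElementTree. It assumes a hypothetical protein record and a hypothetical annotation element, neither drawn from any real schema. A consumer written against the original record keeps working after a new kind of information, with its own context, is added to the tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout, for illustration only.
ORIGINAL = (
    "<protein>"
    "<name>aspartokinase I</name>"
    "<organism>E. coli</organism>"
    "</protein>"
)


def protein_name(record):
    """A consumer written before any extension: it reads only <name>."""
    return record.findtext("name")


record = ET.fromstring(ORIGINAL)

# Later, a new kind of information is added as a new child element that
# carries its own context (an attribute). Existing elements are untouched.
annotation = ET.SubElement(record, "annotation", {"source": "hypothetical-lab"})
annotation.text = "contains the short sequence 'seat'"

# Round-trip through text to mimic storing and re-reading the document.
extended = ET.tostring(record, encoding="unicode")
reparsed = ET.fromstring(extended)

print(protein_name(reparsed))                     # aspartokinase I
print(reparsed.find("annotation").get("source"))  # hypothetical-lab
```

The older consumer ignores the element it does not know about, while newer code can query it. In a fixed-column data model, by contrast, adding a new kind of context would typically require changing the schema first.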
