10.1 Introduction

The publication in 1953 of the seminal paper by Watson and Crick proposing the DNA double helix set the course for an eventual explosion of scientific data related to DNA research. As we entered the twenty-first century, the Human Genome Project revolutionized the field of biology and created a need for efficient ways to manage the enormous quantities of information stemming from gene research. New high-throughput sequencing technology now allows researchers to read genetic code at prodigious rates, generating vast amounts of new data. The need to manage the abundance of information from the Human Genome Project spawned the new field of bioinformatics. The following are a few of the key events in biology from the Watson and Crick discovery up to the first drafts of the human genome:

April 25, 1953: James Watson and Francis Crick publish in Nature: "Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid."
1955: Frederick Sanger sequences the first protein; he later develops methods for sequencing DNA.
1966: The genetic code is cracked, showing how proteins are made from DNA instructions.
1980: A human protein project is proposed but not funded.
1990: The Human Genome Project (HGP) is launched by the public sector as an international effort to map all human genes. The U.S. HGP begins officially in 1990 as a $3 billion, 15-year program to find the estimated 80,000–100,000 human genes and determine the sequence of the 3 billion nucleotides that make up the human genome.
1998: A private-sector rival, Celera Genomics (headed by Craig Venter), joins the race.
June 2000: Celera and the Human Genome Project (headed by Francis Collins) celebrate separate drafts of the human genome.
February 2001: The draft human genome sequence is published in Nature and Science.

The information generated by the Human Genome Project is only the tip of the iceberg; the project is just the beginning of the new information coming from the field of biology. Knowing the sequence of the more than 3 billion base pairs that make up the DNA in all 23 pairs of human chromosomes is comparable to having a byte dump of the Linux operating system without any of the source code or code documentation or, for that matter, without knowing the target machine for the machine code. The reverse-engineering problem from this starting point is astronomical!

Legislation sponsored by the late Senator Claude Pepper of Florida established the National Center for Biotechnology Information (NCBI at http://www.ncbi.nlm.nih.gov/) on November 4, 1988. NCBI was established as a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH) to act as a national resource for molecular biology information. Other nations and regions have established their own centers and databases for biotechnology information, including Europe (http://www.ebi.ac.uk/) and Japan (http://www.ddbj.nig.ac.jp), alongside curated protein databases such as Swiss-Prot (http://us.expasy.org/sprot/). Many other public and private databases have been established or are being established to capture the huge volumes of new information coming out of biological research today.

The intent of amassing information in these various public and private databases goes beyond creating repositories of data. A primary intent of these repositories is to build, manage, and provide access to knowledge. A knowledge management system is a system that allows one to capture, store, query, and disseminate information that is gained through experience or study.

A key problem with many current data repositories is that the schema, which includes the data structure and the type of data stored, must be predefined. This implies that the creator of the repository (or the database) must have a priori knowledge of all information to be stored in the database, which precludes handling "information gained through experience or study." XML provides a flexible, extensible information model that can grow with the discovery process. To support a growing knowledge base and the discovery process, the database technology must be as flexible, extensible, and schema-independent as the XML information model.
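The contrast between a predefined schema and an extensible model can be made concrete with a small sketch. The Python fragment below is illustrative only: the element names (gene, name, sequence, annotation, pathway), the identifiers, and the values are hypothetical placeholders, not drawn from any real database. It assumes only the standard-library xml.etree.ElementTree module, and shows that a record can gain a new kind of element after the fact while existing queries keep working unchanged.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: element names and values are illustrative only.
# The second gene already carries an element the first one lacks.
RECORDS = """
<genes>
  <gene id="g1">
    <name>GENE_A</name>
    <sequence>ATGGCC</sequence>
  </gene>
  <gene id="g2">
    <name>GENE_B</name>
    <sequence>ATGTTT</sequence>
    <annotation source="later-study">placeholder finding</annotation>
  </gene>
</genes>
"""


def gene_names(root):
    """Return the name of every gene, whatever else each record contains."""
    return [gene.findtext("name") for gene in root.iter("gene")]


def genes_with(root, tag):
    """Return the ids of genes that have a child element named `tag`."""
    return [g.get("id") for g in root.iter("gene") if g.find(tag) is not None]


def add_finding(root, gene_id, tag, text):
    """Attach a new child element to one gene; no schema change is needed."""
    gene = root.find(f"./gene[@id='{gene_id}']")
    if gene is None:
        raise KeyError(gene_id)
    child = ET.SubElement(gene, tag)
    child.text = text
    return child


root = ET.fromstring(RECORDS)

# A new kind of information arrives after the records were created.
add_finding(root, "g1", "pathway", "placeholder pathway")

print(gene_names(root))                # existing query is unaffected
print(genes_with(root, "annotation"))  # genes carrying the older extra element
print(genes_with(root, "pathway"))     # genes carrying the newly added element
```

In a repository with a fixed, predefined schema, recording the new "pathway" information would typically require altering the schema first; here the record simply grows, which is the property the paragraph above asks of the underlying database technology.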
