10.5 NeoCore XMS*


Material in this section is taken in part from (Direen et al. 2001). ©2001 IEEE. Reprinted with permission from Proceedings, 23rd Annual Conference, IEEE/EMBS.

Traditional databases were designed to manage data by creating a static framework to contain dynamic data, meaning that data elements can be managed only as long as all metadata (the data's context) has been established in advance. This methodology falls short, by half, of true information or knowledge management. The solution to this dilemma is to manage metadata in exactly the same dynamic way as the data itself. This offers two significant (if not profound) advantages. First, constraints on the use of dynamic data types are removed: new data types may simply be defined and added to the database. Second, the processes of database design and configuration (usually the most labor- and time-consuming aspects of system design) are reduced to practically nothing. NeoCore XMS (XML Management System) has been designed with precisely this feature: metadata and data are handled in the same dynamic way. NeoCore XMS was designed to achieve the following goals:

  • Dynamic management of metadata. All information is represented in an internal format called "information couplets": pairings of data and complete metadata. Information couplets are treated as patterns, and those patterns can be arbitrary. Consequently, there are no rows or columns, and there is no need to predefine indices. Patterns can be as common or as unique as desired. A piece of information can be associated with a single fragment of information without having to be predefined, preallocated, or added to another information fragment. This means that an entirely new data type can be added to an application at any time without having to do anything to the database.

  • Immediate availability of information. Information is "indexed" as soon as it is posted. In fact, there is no separate indexing process because there is no need to specify what should be indexed. NeoCore XMS automatically generates data patterns that allow information to be retrieved based on fully or partially qualified queries. Complex queries are accommodated through a hierarchical vector convergence algorithm, which quickly converges sets of individual pattern matches to locate information fragments based on multiple criteria. The vector convergence process operates on any information that has been posted, requiring no predefinition. The structure of the indices created within the XMS gives flat access time to all nodes within the system. There are no access time penalties based on the structure of the data stored in the XMS.

  • Schema independence. NeoCore XMS was designed to be oblivious to schemas or DTDs, which is a more important feature than it may initially seem. Most XML data management systems require that all XML information be described by a schema or DTD for data-mapping reasons. The problem schema dependence imposes is that it destroys the ability to have heterogeneous data within similar document types. For example, suppose you want to add a new field to some, but not all, documents of the same type. How will the database know which schema applies unless a new schema is supplied? How will it know whether all the other documents of the same type need to be changed, or whether they should be treated as different document types? How will other applications know about this new document type? Ultimately, schema dependence makes it virtually impossible to use XML's most attractive feature, its extensibility. Schema independence implies that no database design needs to be done, and no penalty is imposed for change.

  • Scalability. NeoCore XMS was designed to manage huge repositories of XML documents of all types. This was achieved by treating XML documents as aggregations of information. NeoCore XMS is aware of, but functionally oblivious to, the document-centric structure of XML information. XML data management systems are notorious for not scaling well, both as document size increases and as the number of documents increases. NeoCore XMS was designed to scale seamlessly to very large information management requirements, exhibiting remarkably flat performance as system size increases. Individual document size has no bearing on performance.

  • Efficient use of storage. Information couplets represent a fundamental means of storing and managing information. Breaking the information into couplets, along with efficient indexing using NeoCore's patented Digital Pattern Processing (DPP) technology, creates a very efficient storage format. The upshot is that the total amount of storage used, for everything combined, is between one and two times the size of the XML documents alone. This includes the documents, all indices, and access control information: everything. This compares very favorably with all other methods of managing information, requiring less than half the storage of any database management system, and a tiny fraction of the space used by DOMs (Document Object Models).
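
The couplet idea in the goals above can be made concrete with a small sketch. It is only an illustration, assuming that a couplet can be modeled as an element's full path (the metadata) paired with its text (the data); it is not NeoCore's DPP technology or its vector convergence algorithm, whose internals are not described here. The names couplets, CoupletStore, post, match, and converge are hypothetical. Attributes and text that follows child elements are ignored for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def couplets(doc_id, xml_text):
    """Yield (doc_id, metadata_path, data) for every element that carries text."""
    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        text = (elem.text or "").strip()
        if text:
            yield doc_id, here, text
        for child in elem:
            yield from walk(child, here)

    yield from walk(ET.fromstring(xml_text), "")


class CoupletStore:
    """Toy store: every couplet is indexed when posted; no schema is declared."""

    def __init__(self):
        self.by_pair = defaultdict(set)  # (path, data) -> document ids
        self.by_path = defaultdict(set)  # path only    -> document ids
        self.by_data = defaultdict(set)  # data only    -> document ids

    def post(self, doc_id, xml_text):
        for doc, path, data in couplets(doc_id, xml_text):
            self.by_pair[(path, data)].add(doc)
            self.by_path[path].add(doc)
            self.by_data[data].add(doc)

    def match(self, path=None, data=None):
        """Fully qualified (path and data) or partially qualified lookup."""
        if path is not None and data is not None:
            return set(self.by_pair.get((path, data), ()))
        if path is not None:
            return set(self.by_path.get(path, ()))
        if data is not None:
            return set(self.by_data.get(data, ()))
        raise ValueError("supply a path, a data value, or both")

    def converge(self, *criteria):
        """Combine several pattern matches by intersecting their result sets."""
        sets = [self.match(**c) for c in criteria]
        return set.intersection(*sets) if sets else set()


store = CoupletStore()
store.post("p1", "<protein><name>kinase A</name><organism>E. coli</organism></protein>")
# The second document carries an extra <color> element; nothing is predefined.
store.post("p2", "<protein><name>kinase B</name><organism>E. coli</organism>"
                 "<color>blue</color></protein>")

print(sorted(store.match(path="/protein/organism", data="E. coli")))  # ['p1', 'p2']
print(sorted(store.match(data="blue")))                               # ['p2']
print(sorted(store.converge({"path": "/protein/color"},
                            {"path": "/protein/organism", "data": "E. coli"})))  # ['p2']
```

Each lookup is a dictionary access, so in this toy model the cost of a single match does not depend on how deeply the element was nested in its document. The second document introduces a new element without any change to the store.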

Because it achieves these goals, NeoCore XMS is a strong fit for managing complex biological information. Through NeoCore's membership and involvement with the Center for Computational Biology, NeoCore XMS is being tested in a research environment with biological information from a variety of sources. The Center for Computational Biology (CCB, http://www.cudenver.edu/ccb/) was created by Colorado University at Denver in association with the Colorado University Health Science Center for the purpose of bringing together computer science, mathematics/statistics, and biology (including relevant elements of chemistry and physics) in order to tackle many of the difficult problems coming out of the exploding field of bioinformatics. NeoCore XMS is being tested in the Computational Pharmacology Group located at the Health Science Center and directed by Dr. Lawrence Hunter.

Ron Taylor, a member of the Computational Pharmacology Group, has been running extensive tests on NeoCore XMS using a variety of biological information. Taylor has also been developing specific interfaces to NeoCore XMS for the group's work. In one test, the entire Swiss-Prot protein database was loaded into NeoCore XMS. In addition, several bacterial genomes from NCBI were loaded into the database along with various ligand reaction, pathway, enzyme, and compound information. Together these form a very diverse, heterogeneous body of information from disparate sources. Some of the findings of this test are:

  • To load information into NeoCore XMS, a simple load command is issued with the file containing well-formed XML. NeoCore XMS determines the structure of the information based on the XML it receives, stores the information, and then fully indexes the information. A schema for the information is not required and there is no database design effort whatsoever to configure the database for the type of information being entered. The various types of information are simply loaded.

  • The Swiss-Prot protein information consisted of 101,602 protein documents. The entire XML file was over 350MB, which loaded in approximately 30 minutes on a single-processor Windows 2000 machine with 512MB of RAM. This time includes storing all of the information and fully indexing it. The 350MB XML file consumed approximately 400MB of NeoCore XMS database resources, including all of the indexing. This means the footprint of the Swiss-Prot protein information was only 1.14 times the size of the original XML. NeoCore XMS has the option of indexing data for data-only queries. This allows, for instance, finding all occurrences of "blue" in the database regardless of the context. This option added another 85MB to the footprint. The 30 minutes of load time also included this indexing. The same database held an additional 73MB (13,977 records) of NCBI gene data, plus another 26MB of other ligand data.

  • Querying, retrieving, and accessing the stored data was, in most cases, substantially faster than with the other database technologies the pharmacology group was using.

  • Adding new structural information to a given protein record is as easy as targeting the location within the desired XML document using a simple XPath statement inside an insert command and providing the XML segment to be inserted. Only the targeted record is changed. Because of the way NeoCore XMS is designed, adding new information structure to one document does not add unused fields to all similar documents in the database. This means that the space inside NeoCore XMS required to add unique information to a given record is limited to the specific information. There is no penalty for heterogeneous information. Information can be added as it is discovered.

  • A Perl API was created by Ron Taylor to access NeoCore XMS. Specific storage and retrieval modules allow the laboratory easy access to the database for handling Affymetrix gene expression data.
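
The footprint and load figures reported in the findings above can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text; it assumes the file was exactly 350MB (the text says "over 350MB") and that the 85MB data-only index is in addition to the 400MB footprint, so the derived rates are approximate.

```python
XML_MB = 350          # size of the Swiss-Prot XML file, as reported
STORE_MB = 400        # database footprint, including structural indexing
DATA_INDEX_MB = 85    # extra footprint for the optional data-only index
DOCUMENTS = 101_602   # protein documents loaded
LOAD_SECONDS = 30 * 60

base_ratio = STORE_MB / XML_MB                    # footprint vs. source XML
full_ratio = (STORE_MB + DATA_INDEX_MB) / XML_MB  # with the data-only index
docs_per_second = DOCUMENTS / LOAD_SECONDS
mb_per_second = XML_MB / LOAD_SECONDS

print(f"footprint ratio:          {base_ratio:.2f}")       # 1.14
print(f"with data-only index:     {full_ratio:.2f}")       # 1.39
print(f"documents loaded per sec: {docs_per_second:.0f}")  # 56
print(f"XML loaded per sec (MB):  {mb_per_second:.2f}")    # 0.19
```

Both ratios fall inside the one-to-two-times range stated in the storage goal.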

The key point is that no database design is required in order to work with NeoCore XMS. Once information has been described in well-formed XML, the information may be stored, retrieved, modified, or deleted from the database via a simple HTTP interface. The information stored may be very heterogeneous, and structure can be added without changing the database in any way. These properties are essential for working with biological data. In addition, the storage footprint is efficient, and access is fast.
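
The effect of adding structure to a single record can be sketched as follows. The syntax of the NeoCore XMS insert command and of its HTTP interface is not given in this section, so this example does not attempt to reproduce either. It uses Python's standard xml.etree.ElementTree on in-memory documents as a stand-in, and the function name insert_fragment, the document ids, and the element names are all hypothetical.

```python
import xml.etree.ElementTree as ET


def insert_fragment(documents, doc_id, xpath, fragment_xml):
    """Append an XML fragment under the first node matching xpath in ONE document."""
    target = documents[doc_id].find(xpath)
    if target is None:
        raise KeyError(f"no node matches {xpath!r} in document {doc_id!r}")
    target.append(ET.fromstring(fragment_xml))


# Two records of the same "type"; no schema is declared anywhere.
documents = {
    "P1": ET.fromstring("<protein><name>kinase A</name><features/></protein>"),
    "P2": ET.fromstring("<protein><name>kinase B</name><features/></protein>"),
}
p2_before = ET.tostring(documents["P2"], encoding="unicode")

# Add a new kind of element to P1 only, targeted by a simple path expression.
insert_fragment(documents, "P1", "features", '<bindingSite ligand="ATP"/>')

p2_after = ET.tostring(documents["P2"], encoding="unicode")
print(ET.tostring(documents["P1"], encoding="unicode"))
print(p2_before == p2_after)  # True: the untouched record is byte-for-byte the same
```

Only the targeted record grows. The other record of the same type gains no empty placeholder for the new element, which is the property the text describes as having no penalty for heterogeneous information.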

