2.3 XML Storage

XML documents can have some schematic information (e.g., in the form of a DTD or W3C XML Schema), but they are not required to. Even if a schema exists, comments and processing instructions may occur at any place without previous declaration in the schema. Thus, the classical database approach of handling objects of a predefined type cannot be applied to the storage of XML. It is mandatory that schemas remain optional. Schemas may also be partial (i.e., describe only parts of the data, as discussed later in this chapter), and they can be easily modified even for existing data.

The descriptive power of DTDs is not sufficient for database purposes. For example, DTDs lack information such as data types, which is needed for the proper indexing of information. As a consequence, DTDs are not a sufficient basis for an XML database schema. In 2001, W3C published the recommendation for W3C XML Schema, a schema definition language that covers most of the expressive power of DTDs but also extends this power with a number of new concepts. In particular, an elaborate type system has been added, which makes XML Schema a suitable basis for a database schema description. In addition, XML Schema offers extensibility features that can be used to enhance standard schematic descriptions with database-specific information without compromising the interpretability of the schema by nonproprietary standard tools. Tamino XML Server uses this concept and supports the schematic description of documents via W3C XML Schema.

2.3.1 Collections and Doctypes

A Tamino database consists of multiple so-called collections (see Figure 2.3). These collections are just containers to group documents together. Each document stored in Tamino's data store resides in exactly one collection. A collection has an associated set of W3C XML Schema descriptions. In each schema description, doctypes can be defined using a Tamino-specific notation in the extensibility area of W3C XML Schema (appinfo element). A doctype identifies one of the global elements declared in a W3C XML Schema as the root element. Within a collection, each document is stored as a member of exactly one doctype.

Figure 2.3. Organization of Data in a Tamino Database

The root element of the document identifies the doctype. As a consequence, within a collection, there is a 1:1 relationship between the doctype and root element type. If a document is to be stored in a collection, and no doctype corresponds to the document's root element, such a doctype is created dynamically. In this case, there is no associated user-defined schema. In cases where an associated user-defined schema exists, Tamino validates incoming documents against this schema.

Tamino's configuration data (e.g., character-handling information, information about available server extensions, etc.) are also stored in XML documents in (system) doctypes in (system) collections. Consequently, configuration can be done by storing or modifying XML documents via the normal Tamino interface.

Tamino XML Server can also store arbitrary non-XML objects, for example, images, sound files, MS Word documents, HTML pages, and so on. These are organized in a dedicated doctype called nonXML. When these objects are read from Tamino XML Server, Tamino sets the appropriate MIME type.

Tamino assigns an identifier (called ino:id) to each document or non-XML object. In addition, the user can specify a name. This name must be unique within a doctype and can be used for directly addressing the document or object via a URL.

2.3.2 Schemas

As already mentioned, schemas for XML documents play a different role than in relational databases: they are much more loosely coupled to documents and might well describe only parts of a document. With the wildcard mechanism in XML Schema, it is possible to allow subtrees (using the any element) or attributes (using the anyAttribute element) to occur in specified places of a document, without a detailed description of these subtrees or attributes. Three processContents options control the behavior of the validation:

strict: Requires that all elements or attributes that occur at the corresponding location are declared as global items and match the declaration.
lax: Requires that those elements and/or attributes declared as global items match the declaration.
skip: Does not require any checks against declarations.

In addition, the namespace of such elements or attributes can be restricted. For example, a declaration that allows for completely unrestricted subtrees of XML elements below an element myelement looks like that shown in Listing 2.1:

Listing 2.1 Subtree of XML Elements

<element name="myelement">
 <complexType>
    <sequence>
       <any maxOccurs="unbounded" processContents="skip"/>
    </sequence>
 </complexType>
</element>
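
The difference between the three processContents options can be sketched in a few lines of Python. This is only an illustrative model, not an XML Schema processor: the table GLOBAL_DECLARATIONS and the function wildcard_accepts are hypothetical names, and a "declaration" is reduced to a simple check on the element's text.

```python
# Hypothetical model of wildcard validation; not a real XML Schema processor.
# A global declaration is reduced to a check on the element's text content.
GLOBAL_DECLARATIONS = {"Name": str.isalpha}


def wildcard_accepts(name, text, process_contents):
    """Return True if an item matched by a wildcard passes validation."""
    if process_contents not in ("strict", "lax", "skip"):
        raise ValueError("unknown processContents value: " + process_contents)
    if process_contents == "skip":
        return True                       # no checks against declarations
    check = GLOBAL_DECLARATIONS.get(name)
    if check is None:
        # strict demands a global declaration; lax tolerates its absence
        return process_contents == "lax"
    return check(text)                    # declared items must match in both modes
```

Under this model an undeclared element passes with lax or skip but fails with strict, whereas a declared element with non-matching content fails with both strict and lax.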

These capabilities of W3C XML Schema already provide some flexibility for the documents associated with this schema and are fully supported by Tamino. However, some scenarios require even higher flexibility. Consider the case of electronic data interchange, where a standard schema that all participants can understand is required. There might be some need for unilateral extension of the standard schema, be it due to a new version of the standard schema, or due to the need for certain participants to enhance the commonly understood information by proprietary bits. Such extensions are not preplanned; hence they cannot be represented in the schema, and W3C XML Schema has no means to support such extensions. For such cases, Tamino XML Server has introduced the open content option. If this option is specified, the document is validated against the schema. If information items are found that are not described in the schema, they are accepted nevertheless. Consider the very simple XML schema for Tamino shown in Listing 2.2.

Listing 2.2 Simple XML Schema

<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema xmlns:xs = "http://www.w3.org/2001/XMLSchema"
xmlns:tsd = "http://namespaces.softwareag.com/tamino/TaminoSchemaDefinition">
  <xs:annotation>
    <xs:appinfo>
      <tsd:schemaInfo name = "City">
        <tsd:collection name = "mycollection"></tsd:collection>
        <tsd:doctype name = "City">
          <tsd:logical>
            <tsd:content>open</tsd:content>
          </tsd:logical>
        </tsd:doctype>
      </tsd:schemaInfo>
    </xs:appinfo>
  </xs:annotation>
  <xs:element name = "City">
    <xs:complexType>
      <xs:sequence>
        <xs:element name = "Monument" minOccurs = "0" maxOccurs = "unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name = "Name" type = "xs:string"/>
              <xs:element name = "Description" type = "xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name = "Name" type = "xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

The schema describes documents that contain information about monuments in a city. A city has a name and may have zero or more monuments. You will notice the xs:annotation element as a child of the xs:schema element. This element type has been introduced in the W3C XML Schema recommendation to allow applications to add their annotations to an XML schema without compromising the interpretability of the schema by other applications. Tamino XML Server uses this feature, adding its information below the xs:appinfo child. This information is Tamino specific. For this reason, the names used are from a Tamino namespace rather than from the XML Schema namespace. The Tamino information comprises the name of the schema when it is stored in Tamino, the name of the collection it applies to, and the name of the doctype(s) defined in this schema. For each doctype, open or closed content can be specified. In this example, open content has been specified. Hence, Tamino accepts the document shown in Listing 2.3 without complaining about validation errors.

Listing 2.3 Undeclared Attribute and Element

<?xml version = "1.0" encoding = "UTF-8"?>
<City Name="Darmstadt">
 <Monument built="1897-1899">
    <Name>Russian Chapel</Name>
    <Location>Mathildenhöhe</Location>
    <Description>Built for Nikolai II, czar of Russia.</Description>
 </Monument>
</City>

The Monument element carries an undeclared attribute, built, and has an undeclared child element, Location. If <tsd:content>closed</tsd:content> had been specified, either of these would have caused a validation error.
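
The effect of the open content option can be approximated with a small Python sketch. It is a simplified model under stated assumptions: the DECLARED table is transcribed by hand from Listing 2.2, the function names are hypothetical, and only the presence of undeclared attributes and elements is checked (order, cardinality, and types are ignored).

```python
import xml.etree.ElementTree as ET

# Hand-transcribed from Listing 2.2: element -> (declared attributes, declared children)
DECLARED = {
    "City": ({"Name"}, {"Monument"}),
    "Monument": (set(), {"Name", "Description"}),
    "Name": (set(), set()),
    "Description": (set(), set()),
}


def undeclared_items(elem, prefix=""):
    """Yield the paths of attributes and elements that the schema does not declare."""
    path = f"{prefix}/{elem.tag}"
    attrs, children = DECLARED.get(elem.tag, (set(), set()))
    for attr in elem.attrib:
        if attr not in attrs:
            yield f"{path}/@{attr}"
    for child in elem:
        if child.tag not in children:
            yield f"{path}/{child.tag}"
        else:
            yield from undeclared_items(child, path)


def accepts(xml_text, content="open"):
    """Open content tolerates undeclared items; closed content rejects them."""
    extras = list(undeclared_items(ET.fromstring(xml_text)))
    return content == "open" or not extras


DOC = """<City Name="Darmstadt">
 <Monument built="1897-1899">
    <Name>Russian Chapel</Name>
    <Location>Mathildenh\u00f6he</Location>
    <Description>Built for Nikolai II, czar of Russia.</Description>
 </Monument>
</City>"""
```

For the document of Listing 2.3, undeclared_items reports /City/Monument/@built and /City/Monument/Location, so accepts(DOC, "open") holds while accepts(DOC, "closed") does not.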

XML schemas can evolve in many aspects: Attributes and elements can be added or removed, and types can change (e.g., by modifying or adding restricting facets). If a schema is modified for a doctype for which documents are stored in Tamino XML Server, Tamino guarantees the validity of these documents with respect to the new schema. For some modifications, validity can be guaranteed without accessing the documents (e.g., when adding an optional attribute in the case of closed content). For other modifications, Tamino revalidates existing documents in the course of the schema modification.

2.3.3 Access to Other Databases: Tamino X-Node

As already mentioned, data stored in relational databases or in Adabas can be integrated into documents stored in Tamino via the X-Node component. For the user, the fact that parts of the data reside in another data source is transparent. These data behave just like regular parts of a document. They are also declared as part of a document in a Tamino schema. As an example, suppose a relational database contains statistical data about cities. We want to enhance the city information stored in Tamino by the number of inhabitants. Any update of the statistics should be immediately reflected in the documents delivered by Tamino. Thus, we do not replicate the information into Tamino, but we access the information every time it is needed. When accessing a document of the doctype City, Tamino looks up the external database for the inhabitants' information and integrates it into the resulting document as if the information were stored in Tamino. Modification of data stored in another database would be possible in the same way: When you store a City document that contains information about the number of inhabitants, the external database is updated. In many scenarios, an update of the external database is not desired. In this case, you can tell Tamino XML Server not to propagate changes.

Listing 2.4 is a Tamino schema snippet that includes a corresponding X-Node definition.

The correspondence between data stored in Tamino and data stored in the other database system must be explicitly described. In addition, user and password to access the other database can be specified, and some other database information as well (e.g., the encoding used in the external database).

Listing 2.4 Tamino Schema Example

<xs:element name = "City">
  <xs:annotation>
    <xs:appinfo>
      <tsd:elementInfo>
        <tsd:physical>
          <tsd:map>
            <tsd:subTreeSQL table = "Cities" datasource = "mydb">
              <tsd:primarykeyColumn>name</tsd:primarykeyColumn>
              <tsd:accessPredicate>
              name=<tsd:nodeParameter>/City/@Name</tsd:nodeParameter>
              </tsd:accessPredicate>
            </tsd:subTreeSQL>
            <tsd:ignoreUpdate></tsd:ignoreUpdate>
          </tsd:map>
        </tsd:physical>
      </tsd:elementInfo>
    </xs:appinfo>
  </xs:annotation>
  <xs:complexType>
    <xs:sequence>
      <xs:element name = "Monument" minOccurs = "0" maxOccurs = "unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element name = "Name" type = "xs:string"></xs:element>
            <xs:element name = "Description" type = "xs:string"></xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
    <xs:attribute name = "Name" type = "xs:string" use = "required">
      <xs:annotation>
        <xs:appinfo>
          <tsd:attributeInfo>
            <tsd:physical>
              <tsd:map>
                <tsd:nodeSQL column = "name"></tsd:nodeSQL>
              </tsd:map>
            </tsd:physical>
          </tsd:attributeInfo>
        </xs:appinfo>
      </xs:annotation>
    </xs:attribute>
    <xs:attribute name = "Inhabitants" type = "xs:string">
      <xs:annotation>
        <xs:appinfo>
          <tsd:attributeInfo>
            <tsd:physical>
              <tsd:map>
                <tsd:nodeSQL column = "POPULATION"></tsd:nodeSQL>
              </tsd:map>
            </tsd:physical>
          </tsd:attributeInfo>
        </xs:appinfo>
      </xs:annotation>
    </xs:attribute>
  </xs:complexType>
</xs:element>

Attached to the City element, a connection between this element and the table Cities in database mydb is defined. The accessPredicate defines how rows of this table and elements in corresponding documents relate: Here, equality of the Name attribute of the XML element City to the name column of Cities is required. Consequently, a mapping of this attribute to the column is specified. From rows matching this criterion, the POPULATION column is included as XML attribute Inhabitants into the City element. The mapping on table level defines ignoreUpdate. As a consequence, should an XML document of this type be stored in Tamino with the Inhabitants attribute contained, the new value would not be propagated to the database mydb.
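
The read and write behavior described here can be imitated in Python. The sketch below is not how X-Node is implemented; it uses an in-memory SQLite table as a stand-in for the external database mydb, the population value is made-up sample data, and compose_city and store_city are hypothetical helper names.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Stand-in for the external database "mydb"; SQLite is used for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Cities (name TEXT PRIMARY KEY, POPULATION TEXT)")
db.execute("INSERT INTO Cities VALUES ('Darmstadt', '12345')")  # made-up sample value


def compose_city(stored_xml):
    """On read: find the row where name = /City/@Name and add it as Inhabitants."""
    city = ET.fromstring(stored_xml)
    row = db.execute("SELECT POPULATION FROM Cities WHERE name = ?",
                     (city.get("Name"),)).fetchone()
    if row is not None:
        city.set("Inhabitants", row[0])
    return city


def store_city(xml_text, ignore_update=True):
    """On write: propagate Inhabitants to the table unless ignore_update is set."""
    city = ET.fromstring(xml_text)
    inhabitants = city.attrib.pop("Inhabitants", None)
    if inhabitants is not None and not ignore_update:
        db.execute("UPDATE Cities SET POPULATION = ? WHERE name = ?",
                   (inhabitants, city.get("Name")))
    # Only the part that is not mapped to the table stays in the XML store.
    return ET.tostring(city, encoding="unicode")
```

With ignore_update left at its default, a stored Inhabitants value is dropped and the next read still returns the value from the table, which mirrors the effect of ignoreUpdate in Listing 2.4.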

2.3.4 Mapping Data to Functions: Tamino X-Tension

For access to data stored in other databases, the correspondence between parts of an XML document and data in the database can easily be described in a declarative manner. For other data sources (e.g., ERP systems), a more procedural mapping is needed. For this purpose, Tamino's X-Tension component supports mapping functions. A Tamino X-Tension package can be made up of map-in and map-out functions, event-handling functions, and query functions (discussed shortly). These functions can be written in Java or as COM objects (on Windows platforms). The administrator can specify whether they run in the same address space as the Tamino Server (which is faster) or in a separate address space (which is safer). X-Tension functions are loaded dynamically when they are referenced. They can be added to an online Tamino Server without interruption to normal operations.

Map-in functions accept whole documents or parts of documents as parameters. The functions are responsible for storing the XML passed to them. This includes the option to pass documents to middleware systems such as Software AG's EntireX Communicator for further processing. Analogously, map-out functions output XML documents or parts thereof. With this mechanism, the logic to store parts of a document somewhere can be described programmatically. The X-Tension mechanism has full access to Tamino functionality via callbacks. As a consequence, it is also possible to store the XML passed to an X-Tension function in Tamino. While this may seem strange at first glance, it can make sense in scenarios where the same information is received in multiple different formats. In this case, the map-in function can transform the data into a standard format (e.g., using XSLT style sheets) and then store it in Tamino. Then, data can be retrieved in the standard format or, via the map-out function, in the format sent to Tamino. X-Tension mapping is specified in the Tamino schema as shown in Listing 2.5.

Listing 2.5 X-Tension Mapping Example

<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema xmlns:xs = "http://www.w3.org/2001/XMLSchema"
xmlns:tsd = "http://namespaces.softwareag.com/tamino/TaminoSchemaDefinition">
  <xs:annotation>
    <xs:appinfo>
      <tsd:schemaInfo name = "City">
        <tsd:collection name = "mycollection1"></tsd:collection>
        <tsd:doctype name = "City">
          <tsd:logical>
            <tsd:content>open</tsd:content>
          </tsd:logical>
        </tsd:doctype>
      </tsd:schemaInfo>
    </xs:appinfo>
  </xs:annotation>
  <xs:element name = "City">
    <xs:annotation>
      <xs:appinfo>
        <tsd:elementInfo>
          <tsd:physical>
            <tsd:map>
              <tsd:xTension>
                <tsd:onProcess>transform.transformIn</tsd:onProcess>
                <tsd:onCompose>transform.transformOut</tsd:onCompose>
              </tsd:xTension>
            </tsd:map>
          </tsd:physical>
        </tsd:elementInfo>
      </xs:appinfo>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element name = "Monument" minOccurs = "0" maxOccurs = "unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name = "Description" type = "xs:string"></xs:element>
            </xs:sequence>
            <xs:attribute name = "Name" type = "xs:string"></xs:attribute>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name = "Cityname" type = "xs:string"></xs:attribute>
    </xs:complexType>
  </xs:element>
</xs:schema>

When a document is stored in the City doctype in mycollection1, the function transform.transformIn is called. This user-provided function applies some transformations to the document in order to match the schema for City in collection mycollection (it renames the attribute Cityname to Name and turns the Monument attribute Name into a child element Name) and then stores the result in doctype City in collection mycollection. Analogously, the function transform.transformOut retrieves the document from mycollection and applies the reverse transformations.
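
Real X-Tension functions are written in Java or as COM objects, and the source does not show their code. The following Python sketch only illustrates the two transformations just described; transform_in and transform_out are hypothetical names, and storing to or retrieving from a collection is left out.

```python
import xml.etree.ElementTree as ET


def transform_in(xml_text):
    """Map-in direction: Cityname -> Name, Monument/@Name -> Monument/Name."""
    city = ET.fromstring(xml_text)
    if "Cityname" in city.attrib:
        city.set("Name", city.attrib.pop("Cityname"))
    for monument in city.findall("Monument"):
        name = monument.attrib.pop("Name", None)
        if name is not None:
            child = ET.Element("Name")
            child.text = name
            monument.insert(0, child)      # Name precedes Description
    return ET.tostring(city, encoding="unicode")


def transform_out(xml_text):
    """Map-out direction: the reverse of transform_in."""
    city = ET.fromstring(xml_text)
    if "Name" in city.attrib:
        city.set("Cityname", city.attrib.pop("Name"))
    for monument in city.findall("Monument"):
        child = monument.find("Name")
        if child is not None:
            monument.set("Name", child.text or "")
            monument.remove(child)
    return ET.tostring(city, encoding="unicode")
```

Applying transform_out to the result of transform_in yields a document with the same attributes and elements as the one originally sent.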

Again, the fact that mapping is used is transparent to the user. In particular, it does not affect the available functionality on the data. For example, there are no restrictions on the queries allowed on such data.

For an X-Tension function, the information about transactional events (commit, rollback) is usually important. If the X-Tension function is used to store data in a transactional external system, the corresponding transaction on the foreign system has to be rolled back if the Tamino transaction is rolled back. If the external system does not support transactions, data stored by the X-Tension function in the course of a Tamino transaction has to be explicitly removed on rollback. To enable such actions, X-Tension packages can register callback functions that are invoked on events such as commit or rollback of a transaction, end of session, and so on.
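
A minimal Python sketch of this callback pattern follows. The Transaction class, its register method, and the map_in function are hypothetical and do not reproduce the actual X-Tension interface; the list external_store stands in for an external system without transaction support.

```python
class Transaction:
    """Toy transaction that notifies registered callbacks on commit or rollback."""

    def __init__(self):
        self._callbacks = {"commit": [], "rollback": []}

    def register(self, event, callback):
        self._callbacks[event].append(callback)

    def commit(self):
        for callback in self._callbacks["commit"]:
            callback()

    def rollback(self):
        for callback in self._callbacks["rollback"]:
            callback()


external_store = []   # stands in for a non-transactional external system


def map_in(txn, data):
    """Store data externally and register an explicit undo step for rollback."""
    external_store.append(data)
    txn.register("rollback", lambda: external_store.remove(data))
```

Because the external system cannot roll back on its own, the undo step removes the data explicitly when the surrounding transaction is rolled back.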

As a special case, mapping does not need to be symmetric. For example, one can include random values in documents delivered by Tamino if the map-out function is a random number generator and no map-in function is specified. Analogously, one may specify a map-in function that sends its input via e-mail to a certain recipient or passes it to a workflow system. The map-out function might then deliver status information (e.g., where the data currently are in the workflow) rather than the information passed in.

2.3.5 Internationalization Issues

XML is based on Unicode (Unicode Consortium 2000). Consequently, Tamino internally works with Unicode only. However, not all systems interacting with Tamino are based on Unicode. Hence, Tamino has to handle encoding differences and encoding conversions. The first obvious place where Tamino can encounter a non-Unicode encoding is the XML declaration of a document:

<?xml version="1.0" encoding="iso-8859-1"?>

The XML 1.0 specification requires XML processors to understand the Unicode encodings UTF-8 and UTF-16, and it leaves open whether an XML processor accepts other encodings. Tamino XML Server supports many non-Unicode encodings, including the ISO-Latin family. When such documents are sent to Tamino XML Server, Tamino converts them into Unicode before processing them and also resolves character references by replacing them with the corresponding Unicode characters. Analogously, users can specify an encoding when retrieving data. In this case, Tamino converts the query results to the desired encoding before sending them to the user.
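
The same sequence of steps (decode according to the declared encoding, resolve character references, re-encode on output) can be observed with Python's standard XML parser. This is a sketch of the general technique, not of Tamino's implementation; the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# A document as it might arrive: ISO-8859-1 bytes containing a character reference.
incoming = ('<?xml version="1.0" encoding="iso-8859-1"?>'
            '<Location>Mathildenh\u00f6he &#233;</Location>').encode("iso-8859-1")

# Parsing honors the declared encoding and resolves &#233; to the character U+00E9.
root = ET.fromstring(incoming)
text = root.text                      # a Unicode string

# On output, the caller chooses the target encoding.
as_utf8 = ET.tostring(root, encoding="utf-8")
as_latin1 = ET.tostring(root, encoding="iso-8859-1")
```

The byte sequences for the same text differ between as_utf8 and as_latin1, while the parsed value in text is identical regardless of how the document was encoded on arrival.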

This is, however, not the only place where encoding issues occur. On the HTTP level, messages can also carry encoding information. This encoding information can even differ from the information included in the XML documents. Here, Tamino XML Server has to do conversions as well.

Among the facets for string types defined by W3C XML Schema, there is no collation facet. In the context of a database, where sorting is an important operation, it is highly desirable to be able to influence the order of strings according to language-specific rules. For example, according to traditional Spanish collation rules, in which ll counts as a letter of its own that follows l, the word llamar is sorted after the word luz (i.e., "llamar" > "luz"). Because W3C XML Schema does not support the concept of user-defined facets, Tamino adds collation information in the appinfo element as shown in Listing 2.6:

Listing 2.6 Collation Information

<xs:attribute name = "Name" type = "xs:string" use = "required">
  <xs:annotation>
    <xs:appinfo>
      <tsd:attributeInfo>
        <tsd:logical>
          <tsd:collation>
            <tsd:language value = "es"></tsd:language>
          </tsd:collation>
        </tsd:logical>
      </tsd:attributeInfo>
    </xs:appinfo>
  </xs:annotation>
</xs:attribute>
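
What a language-specific collation changes can be shown with a short Python sketch of traditional Spanish ordering, in which ch and ll are treated as single letters. It is a simplified illustration under stated assumptions: accented vowels and other characters outside the listed alphabet simply sort last, and spanish_key is a hypothetical helper, not part of any Tamino interface.

```python
# Traditional Spanish alphabet, in which "ch" and "ll" count as single letters.
ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l",
            "ll", "m", "n", "\u00f1", "o", "p", "q", "r", "s", "t", "u", "v",
            "w", "x", "y", "z"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}


def spanish_key(word):
    """Split a word into collation letters and return the list of their ranks."""
    word = word.lower()
    ranks, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        letter = pair if pair in ("ch", "ll") else word[i]
        ranks.append(RANK.get(letter, len(ALPHABET)))  # unknown letters sort last
        i += len(letter)
    return ranks


words = ["luz", "llamar", "lago"]
by_code_point = sorted(words)                 # plain character order
by_collation = sorted(words, key=spanish_key)  # traditional Spanish order
```

Plain character order places llamar before luz, whereas the collation key places it after luz because ll ranks behind every sequence that starts with a single l.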

2.3.6 Indexing

Indexes are indispensable in database systems because otherwise large quantities of data could not be queried in a satisfactory way. Tamino XML Server supports three types of indexes that are maintained whenever documents are stored, modified, or deleted.

The standard index is a value-based index of the kind well known from relational databases. It provides fast lookup when searching for elements or attributes having certain values, or for relational expressions on such values (e.g., find all books with a price less than 50). In the City example presented earlier, a standard index on the Name attribute of City would accelerate searches for a specific city. Indexes are also defined in the Tamino schema, again using the appinfo mechanism of W3C XML Schema.

Standard indexes are type-aware: Indexes on numerical values support numerical order (i.e., 5 < 10); indexes on values of a textual data type are ordered lexicographically (i.e., "5" > "10").
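
The consequence of type awareness for range queries can be sketched with a sorted list in Python. StandardIndex below is a hypothetical toy class, not Tamino's index structure; the convert argument plays the role of the declared data type.

```python
import bisect


class StandardIndex:
    """Sorted list of (value, doc_id) pairs; `convert` fixes the value type."""

    def __init__(self, convert):
        self.convert = convert
        self.entries = []

    def add(self, raw_value, doc_id):
        bisect.insort(self.entries, (self.convert(raw_value), doc_id))

    def less_than(self, raw_limit):
        """Return the ids of documents whose value is below the limit."""
        limit = self.convert(raw_limit)
        # (limit,) sorts before every (limit, doc_id), so this finds the first
        # entry whose value is >= limit.
        pos = bisect.bisect_left(self.entries, (limit,))
        return [doc_id for _, doc_id in self.entries[:pos]]


numeric = StandardIndex(float)   # numerical data type
textual = StandardIndex(str)     # textual data type
for doc_id, price in enumerate(["5", "10", "50"]):
    numeric.add(price, doc_id)
    textual.add(price, doc_id)
```

Asking each index for values below "10" gives different answers: the numeric index returns the document with price 5, while the textual index returns nothing because "10" is the smallest string.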

Global elements may be referenced in multiple contexts. If an index is to be established only for a subset of these contexts, a which element can be used to specify the paths for which an index is to be created.

Text indexes are the prerequisite for efficient text retrieval functionality. In text indexing, the words contained in an element or attribute are indexed, such that the search for words within the content of an element or attribute is accelerated. Note that text indexes can be defined not only on leaf elements, but also on elements that contain other elements. Thus, it is possible to text-index whole subtrees or even a whole document. In any case, the index is based on the result of applying the text() function known from XPath (Clark and DeRose 1999) to the element or attribute. This result is tokenized, and each token is included in the index. Note that this tokenization is a nontrivial task. Even in English text, where words are separated by whitespace and therefore are easily recognizable, the role of punctuation characters such as colon and dash has to be defined. Do they separate tokens, or are they part of a token? For other languages, the same characters may have to be treated differently. Based on decades of experience with text retrieval at Software AG, Tamino defines default handling of such characters that fits the needs of most character-based languages. However, Tamino XML Server offers a configuration mechanism to override its default handling for dedicated characters.
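
The following Python sketch shows one simple way to build such an index and how the treatment of a character such as the dash changes the tokens. It is an assumption-laden illustration: the functions tokenize and build_text_index and the parameter token_chars are hypothetical, and the rules are far simpler than those of a production text retrieval system.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def tokenize(text, token_chars="-"):
    """Split on anything that is neither a word character nor in token_chars."""
    pattern = r"[\w" + re.escape(token_chars) + r"]+"
    return [token.lower() for token in re.findall(pattern, text)]


def build_text_index(docs, path, token_chars="-"):
    """Map each token found in the subtree at `path` to the ids of its documents."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        for elem in ET.fromstring(xml_text).iterfind(path):
            subtree_text = " ".join(elem.itertext())   # text of the whole subtree
            for token in tokenize(subtree_text, token_chars):
                index[token].add(doc_id)
    return index


docs = {1: '<City Name="Darmstadt"><Monument><Name>Russian Chapel</Name>'
           '<Description>Built 1897-1899: a well-known chapel.</Description>'
           '</Monument></City>'}
index = build_text_index(docs, "Monument")
```

With the dash treated as part of a token, well-known and 1897-1899 are indexed as single tokens; passing token_chars="" splits them into separate tokens, while the colon acts as a separator in both cases.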

The words in some languages are not separated by whitespace, and tokenization has to work differently (e.g., it must be dictionary-based). This holds for Japanese, Chinese, Korean, and so on. Tamino XML Server supports tokenization for these languages.

Non-XML objects representing text are automatically text-indexed to provide basic search facilities on them.

A special index for text search is the so-called word fragment index. It is used to speed up wildcard search in cases where the search term uses a wildcard for both the prefix and the postfix of the words to be searched.
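
The source does not describe how the word fragment index is organized internally. One common technique for this kind of search is an n-gram index, sketched below in Python; WordFragmentIndex, fragments, and search_infix are hypothetical names, and the fragment length of three is an arbitrary choice.

```python
from collections import defaultdict


def fragments(word, n=3):
    """All substrings of length n (the whole word if it is shorter than n)."""
    return {word[i:i + n] for i in range(max(1, len(word) - n + 1))}


class WordFragmentIndex:
    def __init__(self, words, n=3):
        self.n = n
        self.words = set(words)
        self.by_fragment = defaultdict(set)
        for word in self.words:
            for fragment in fragments(word, n):
                self.by_fragment[fragment].add(word)

    def search_infix(self, term):
        """Answer the wildcard query *term* without scanning every word."""
        if len(term) < self.n:
            candidates = self.words              # too short to use the index
        else:
            postings = [self.by_fragment.get(f, set())
                        for f in fragments(term, self.n)]
            candidates = set.intersection(*postings)
        # Verify candidates, since sharing all fragments does not imply a match.
        return sorted(word for word in candidates if term in word)


idx = WordFragmentIndex(["chapel", "chaplain", "apple", "shape"])
```

A query such as idx.search_infix("hap") only inspects the words that share the fragment hap, which is what makes a search with wildcards on both sides cheaper than a scan of the full vocabulary.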

There is also an XML-specific type of index, called the structure index. It comes in two flavors. The condensed structure index keeps the information about all paths that occur in any instance of a specific doctype. This can accelerate query execution in many cases (e.g., for documents without an associated schema, doctypes with open content, or when xs:anyAttribute is used). If such an index does not exist, misspelled names might lead to a sequential scan of all documents of a doctype. This index also helps when a schema is modified: The validity of some changes can be assessed without looking at the instances of a doctype. The full structure index records not only the existence of paths in a doctype, but also the documents in which each path occurs. This can be used for optimization whenever a query asks for optional parts of a doctype (e.g., elements or attributes with minOccurs=0 or children of them).
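
The two flavors can be modeled in Python as a set of paths and a mapping from paths to document ids. This is a conceptual sketch only; paths_of and build_structure_indexes are hypothetical names, and the sample documents are invented.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def paths_of(elem, prefix=""):
    """Yield every element path and attribute path that occurs at or below elem."""
    path = f"{prefix}/{elem.tag}"
    yield path
    for attr in elem.attrib:
        yield f"{path}/@{attr}"
    for child in elem:
        yield from paths_of(child, path)


def build_structure_indexes(docs):
    condensed = set()            # which paths occur anywhere in the doctype
    full = defaultdict(set)      # path -> ids of the documents containing it
    for doc_id, xml_text in docs.items():
        for path in paths_of(ET.fromstring(xml_text)):
            condensed.add(path)
            full[path].add(doc_id)
    return condensed, full


docs = {
    1: '<City Name="A"><Monument built="x"><Name>n</Name></Monument></City>',
    2: '<City Name="B"/>',
}
condensed, full = build_structure_indexes(docs)
```

A query on a misspelled path such as /City/Monumnet can be answered as empty from condensed alone, and a query on the optional path /City/Monument needs to touch only the documents listed in full for that path.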

2.3.7 Organization on Disk

Tamino databases are made up of two persistent parts (called spaces):

The data space contains all the documents and objects stored in Tamino. Doctypes are organized in clusters in order to accelerate sequential access. Based on the document size, Tamino XML Server chooses appropriate compression techniques. Tamino's choice can, however, be overruled by the specification of a dedicated compression method in the Tamino schema of the corresponding doctype.
The index space contains the index data for the documents stored in Tamino.

Both spaces can be distributed over many different volumes of external storage, thus allowing Tamino to store terabytes of data. Tamino can be configured to automatically extend spaces when they run full.

For transactional logging, Tamino has a journal space, which is a fixed-size container. It is used as a circular buffer containing all logs necessary to roll back transactions, or to redo transactions after a system crash. Long-time logging, which can be used to bring the database to a current state after the restoration of a previously done backup, is stored in sequential files.
