In this section we discuss some of the key features of eXist, namely schema-less data storage, collections, index-based query processing, and XPath extensions for performing full-text searches.
eXist provides schema-less storage of XML documents. Documents are not required to have an associated schema or document type definition (DTD). In other words, they need only be well formed. This has some major benefits. Many documents found in practice lack a valid document type definition, and XML authors often create the DTD or schema only after writing the document to which it applies. DTDs may also evolve over a long period of time, so that documents follow slightly different versions of the same DTD or schema. An XML database should therefore support queries on similar documents that do not share exactly the same structure.
Inside the database, documents are managed in hierarchical collections. From a user's point of view, this is comparable to storing files in a file system. Collections may be arbitrarily nested. They are not bound to a predefined schema, so the number of document types used by documents in one collection is not constrained. Arbitrary documents may be mixed inside the same collection.
Users may query a distinct part of the collection hierarchy or even all the documents contained in the database using XPath syntax with extensions.
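The collection model described above can be illustrated with a small sketch. It is not eXist's API: the `STORE` dictionary, the collection paths, the sample documents, and the `query` helper are all hypothetical, and Python's `xml.etree.ElementTree` supports only a limited subset of XPath. The sketch merely shows documents of different types mixed in nested collections and a query restricted to one part of the hierarchy.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory stand-in for a collection hierarchy:
# collection path -> {document name: XML source}.
STORE = {
    "/db/shakespeare/plays": {
        "hamlet.xml": "<PLAY><TITLE>Hamlet</TITLE><SPEAKER>HAMLET</SPEAKER></PLAY>",
    },
    "/db/shakespeare/notes": {
        # A different document type in the same subtree; no schema is involved.
        "memo.xml": "<memo><TITLE>Casting notes</TITLE></memo>",
    },
    "/db/library": {
        "book.xml": "<book><TITLE>XML Databases</TITLE></book>",
    },
}


def query(collection, path):
    """Evaluate an ElementTree path against every document stored in
    `collection` or in any collection nested below it."""
    base = collection.rstrip("/")
    prefix = base + "/"
    hits = []
    for coll_path in sorted(STORE):
        if coll_path == base or coll_path.startswith(prefix):
            for name in sorted(STORE[coll_path]):
                root = ET.fromstring(STORE[coll_path][name])
                hits.extend(el.text for el in root.iterfind(path))
    return hits


# Query one branch of the hierarchy, then the whole database.
print(query("/db/shakespeare", ".//TITLE"))
print(query("/db", ".//TITLE"))
```

The same path expression is evaluated against every document in scope, whatever its root element, which is the behavior a schema-less store has to support.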
Evaluating structured queries against possibly large collections of unconstrained documents poses a major challenge to storage organization and query processing. To speed up query processing, some kind of index structure is needed. eXist uses a numerical indexing scheme to identify XML nodes in the index. The indexing scheme not only links index entries to the actual DOM nodes in the XML store, but also provides quick identification of possible relationships between nodes in the document node tree, such as parent-child or ancestor-descendant relationships. Based on these features, eXist's query engine uses fast path join algorithms to evaluate XPath expressions, while conventional approaches are typically based on top-down or bottom-up traversals of the document tree. It has been shown that path join algorithms outperform tree-traversal based implementations by an order of magnitude (Li and Moon 2001; Srivastava et al. 2002). We will provide details on the technical background at the end of this chapter.
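To make the idea of a numerical indexing scheme more concrete, the following sketch assumes one possible scheme: a level-order numbering of a complete k-ary tree, in which the root receives the identifier 1 and the parent of any node can be computed arithmetically from the node's own identifier. This is an illustrative assumption, not a description of eXist's actual implementation, and all function names and the sample document are hypothetical. The point is that parent-child and ancestor-descendant steps can be answered by joining lists of identifiers, without traversing the document tree.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def max_fanout(elem):
    """Largest number of child elements of any node in the tree (at least 1)."""
    return max([len(elem), 1] + [max_fanout(c) for c in elem])


def build_index(root):
    """Return (k, index), where index maps element name -> sorted node ids.

    Assumed numbering: the root is 1 and the j-th child (1-based) of
    node i receives the id k * (i - 1) + j + 1.
    """
    k = max_fanout(root)
    index = defaultdict(list)
    stack = [(root, 1)]
    while stack:
        elem, node_id = stack.pop()
        index[elem.tag].append(node_id)
        for j, child in enumerate(elem, start=1):
            stack.append((child, k * (node_id - 1) + j + 1))
    return k, {tag: sorted(ids) for tag, ids in index.items()}


def parent_id(node_id, k):
    """Id of the parent, computed from the id alone; None for the root."""
    return None if node_id == 1 else (node_id - 2) // k + 1


def child_join(parents, children, k):
    """Ids in `children` whose parent id occurs in `parents`."""
    parent_set = set(parents)
    return [c for c in children if parent_id(c, k) in parent_set]


def descendant_join(ancestors, descendants, k):
    """Ids in `descendants` that have some ancestor in `ancestors`."""
    ancestor_set = set(ancestors)
    result = []
    for d in descendants:
        p = parent_id(d, k)
        while p is not None:
            if p in ancestor_set:
                result.append(d)
                break
            p = parent_id(p, k)
    return result


DOC = (
    "<book><title>T</title><chapter><title>C1</title>"
    "<section><title>S1</title></section></chapter></book>"
)
k, index = build_index(ET.fromstring(DOC))

# chapter/title: only the title that is a direct child of a chapter.
print(child_join(index["chapter"], index["title"], k))
# chapter//title: every title nested anywhere below a chapter.
print(descendant_join(index["chapter"], index["title"], k))
```

Both joins operate only on the identifier lists stored in the index. The document tree is consulted once, when the index is built, and never during query evaluation.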
Indexing is applied to all nodes in the document, including elements, attributes, text, and comments. In contrast to other approaches, it is not necessary to create indexes explicitly. All indexes are managed by the database engine. However, it is possible to restrict the automatic full-text indexing to defined parts of a document.
During development, the main focus has been to support document-centric as opposed to data-centric documents. Document-centric documents usually address human users. They typically contain a lot of mixed content, longer sections of text, and less machine-readable data. However, querying these types of documents is not very well supported by standard XPath. eXist thus provides a number of extensions to efficiently process full-text queries. An additional index structure keeps track of word occurrences and assists the user in querying textual content. Special full-text operators and functions are available to query text nodes as well as attribute values.
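The role of a word-occurrence index can be sketched as follows. This is a minimal illustration under stated assumptions, not eXist's index structure or query syntax: the tokenizer, the use of document-order positions as node references, the sample document, and the helper names `build_word_index`, `match_all`, and `match_any` are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

WORD = re.compile(r"\w+")


def tokenize(text):
    """Lower-cased word tokens of a string."""
    return [w.lower() for w in WORD.findall(text)]


def build_word_index(root):
    """Map each word to the set of element positions (document order)
    whose own text nodes or attribute values contain it."""
    index = defaultdict(set)
    for pos, elem in enumerate(root.iter()):
        # Text nodes that belong directly to this element in mixed content:
        # its leading text plus the text following each child element.
        chunks = [elem.text or ""]
        chunks += [child.tail or "" for child in elem]
        chunks += list(elem.attrib.values())
        for chunk in chunks:
            for word in tokenize(chunk):
                index[word].add(pos)
    return index


def match_all(index, query):
    """Positions of elements containing every word of the query."""
    sets = [index.get(w, set()) for w in tokenize(query)]
    return sorted(set.intersection(*sets)) if sets else []


def match_any(index, query):
    """Positions of elements containing at least one word of the query."""
    sets = [index.get(w, set()) for w in tokenize(query)]
    return sorted(set.union(*sets)) if sets else []


DOC = (
    '<article><para id="intro">XML databases store <em>documents</em>'
    " natively.</para><para>Full-text search needs a word index.</para>"
    '<note topic="search">See above.</note></article>'
)
index = build_word_index(ET.fromstring(DOC))

print(match_all(index, "word index"))
print(match_any(index, "natively documents"))
print(match_all(index, "search"))
```

Because the lookup touches only the word index, a keyword query does not need to scan the text of every document. Attribute values are indexed alongside text nodes, so the query for "search" finds both the second paragraph and the note.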