P.2 XML Concepts

This section provides an overview of basic XML concepts: DTDs, XML schemas, DOM, and SAX.

P.2.1 DTDs and XML Schemas

Both DTDs and XML schemas are mechanisms for defining the structure of XML documents. They determine which elements an XML document can contain, how those elements are to be used, what default values their attributes can have, and so on. Given a DTD or XML schema and its corresponding XML document, a parser can validate whether the document conforms to the desired structure and constraints. This is particularly useful in data exchange scenarios, because DTDs and XML schemas provide and enforce a common vocabulary for the data to be exchanged.

XML DTDs are a subset of SGML (Standard Generalized Markup Language) DTDs. An XML DTD lists the elements and attributes that may appear in a document and the context in which they are to be used. It can also list any elements a document cannot contain. However, beyond the coarse occurrence indicators ?, *, and +, it cannot constrain the exact number of instances of a particular element within a document, nor can it specify the type of data within each element. Consequently, DTDs are better suited to document-centric XML than to data-centric XML, because data-typing and instantiation constraints are less critical in the former case. In practice, however, they are used for both types of documents.

Listing P.5 shows a DTD for the simple XML document in Listing P.2. It describes which primitive elements form valid components of the three composite ones: person, name, and address. The keyword #PCDATA signifies that an element contains only parsed character data, with no tags or child elements.

Listing P.5 A DTD for the Simple XML Document in Listing P.2

<!ELEMENT person (name, address)>
<!ELEMENT name (surname, firstname)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT address (housenumber, street, town, postcode, country)>
<!ELEMENT housenumber (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT town (#PCDATA)>
<!ELEMENT postcode (#PCDATA)>
<!ELEMENT country (#PCDATA)>
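To make the content models in Listing P.5 concrete, the following is a minimal sketch of the ordering check that a validating parser would perform. It is hand-rolled for illustration and is not a real DTD validator. The sample document is hypothetical data shaped like Listing P.2, and the names CONTENT_MODEL, PCDATA, and conforms are assumptions introduced here.

```python
import xml.etree.ElementTree as ET

# Content models transcribed from Listing P.5: each composite element maps
# to the exact ordered sequence of child elements it must contain.
CONTENT_MODEL = {
    "person": ["name", "address"],
    "name": ["surname", "firstname"],
    "address": ["housenumber", "street", "town", "postcode", "country"],
}

# Elements declared as (#PCDATA) in Listing P.5.
PCDATA = {"surname", "firstname", "housenumber", "street",
          "town", "postcode", "country"}


def conforms(element):
    """Return True if element and all its descendants match Listing P.5."""
    children = [child.tag for child in element]
    if element.tag in PCDATA:
        # Character data only: no child elements are allowed.
        return not children
    expected = CONTENT_MODEL.get(element.tag)
    if expected is None:
        # The element is not declared in the DTD at all.
        return False
    if children != expected:
        return False
    return all(conforms(child) for child in element)


# Hypothetical sample data; the values are not taken from Listing P.2.
GOOD = """<person>
  <name><surname>Smith</surname><firstname>John</firstname></name>
  <address>
    <housenumber>12</housenumber><street>High Street</street>
    <town>Lancaster</town><postcode>LA1 1AA</postcode><country>UK</country>
  </address>
</person>"""

# The same data with surname and firstname swapped, which violates
# the content model (surname, firstname).
BAD = GOOD.replace(
    "<surname>Smith</surname><firstname>John</firstname>",
    "<firstname>John</firstname><surname>Smith</surname>",
)

good_ok = conforms(ET.fromstring(GOOD))
bad_ok = conforms(ET.fromstring(BAD))
print(good_ok, bad_ok)
```

The sketch checks only element order and nesting, which is all that Listing P.5 declares. A real validating parser would also handle attributes, entities, and the occurrence indicators.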

XML schemas differ from DTDs in that the XML schema definition language is based on XML itself. As a result, unlike DTDs, the set of constructs available for defining an XML document is extensible. XML schemas also support namespaces and richer, more complex structures than DTDs. In addition, stronger typing constraints can be placed on the data enclosed by a tag, because a range of primitive data types such as string, decimal, and integer are supported. This makes XML schemas highly suitable for defining data-centric documents. Another significant advantage is that XML schema definitions can exploit the same data management mechanisms as those designed for XML documents, because an XML schema is itself an XML document. This is in direct contrast with DTDs, which require specific support to be built into an XML data management system.

Listing P.6 shows an XML schema for the simple XML document in Listing P.2. The sequence tag is a compositor indicating an ordered sequence of subelements. Two other compositors, choice and all, are also available. Note that, as shown for the address element, it is possible to constrain the minimum and maximum number of instances of an element within a document. Although not shown in the example, it is also possible to define custom complex and simple types. For instance, a complex type Address could have been defined for the address element.

Listing P.6 An XML Schema for the Simple XML Document in Listing P.2

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="person">
      <xs:complexType>
         <xs:sequence>
            <xs:element name="name">
               <xs:complexType>
                  <xs:sequence>
                     <xs:element name="surname" type="xs:string"/>
                     <xs:element name="firstname" type="xs:string"/>
                  </xs:sequence>
               </xs:complexType>
            </xs:element>
            <xs:element name="address" minOccurs="0" maxOccurs="1">
               <xs:complexType>
                  <xs:sequence>
                     <xs:element name="housenumber" type="xs:integer"/>
                     <xs:element name="street" type="xs:string"/>
                     <xs:element name="town" type="xs:string"/>
                     <xs:element name="postcode" type="xs:string"/>
                     <xs:element name="country" type="xs:string"/>
                  </xs:sequence>
               </xs:complexType>
            </xs:element>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>
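Because an XML schema is itself an XML document, ordinary XML tooling can read it. The sketch below illustrates this under stated assumptions: it uses Python's standard ElementTree module (chosen for this example, not prescribed by the text) to parse an abbreviated excerpt of Listing P.6 and list the element declarations it contains. The variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Abbreviated excerpt of Listing P.6: the name element is omitted for brevity.
SCHEMA = """<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="person">
      <xs:complexType>
         <xs:sequence>
            <xs:element name="address" minOccurs="0" maxOccurs="1">
               <xs:complexType>
                  <xs:sequence>
                     <xs:element name="housenumber" type="xs:integer"/>
                     <xs:element name="street" type="xs:string"/>
                     <xs:element name="town" type="xs:string"/>
                     <xs:element name="postcode" type="xs:string"/>
                     <xs:element name="country" type="xs:string"/>
                  </xs:sequence>
               </xs:complexType>
            </xs:element>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>"""

root = ET.fromstring(SCHEMA)

# Walk every xs:element declaration in document order and collect its name,
# declared type, and occurrence bounds. XML Schema defaults both minOccurs
# and maxOccurs to 1 when the attribute is absent.
declared = [
    (el.get("name"), el.get("type"),
     el.get("minOccurs", "1"), el.get("maxOccurs", "1"))
    for el in root.iter(f"{{{XS}}}element")
]

for name, type_, low, high in declared:
    print(name, type_, low, high)
```

No schema-specific parser is needed here: the same calls that read any namespaced XML document also read the schema. A DTD, by contrast, would need a dedicated parser for its own syntax.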

P.2.2 DOM and SAX

DOM and SAX are the two main APIs for manipulating XML documents in an application. They are now part of the Java API for XML Processing (JAXP version 1.1). DOM is the W3C standard Document Object Model, an operating system- and programming language-independent model for storing and manipulating hierarchical documents in memory. A DOM parser parses an XML document and builds a DOM tree, which can then be used to traverse the various nodes. However, the entire tree has to be constructed before traversal can commence. As a result, memory management is an issue when manipulating large XML documents. Building the full tree is highly resource intensive, especially when only a small section of the document is to be manipulated.
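The parse-then-traverse pattern described above can be sketched as follows. This is an illustration only: it uses Python's standard xml.dom.minidom module in place of a JAXP DOM parser, and the sample document is hypothetical data shaped like Listing P.2.

```python
from xml.dom.minidom import parseString

# Hypothetical sample data; the values are not taken from Listing P.2.
PERSON_XML = """<person>
  <name><surname>Smith</surname><firstname>John</firstname></name>
  <address>
    <housenumber>12</housenumber><street>High Street</street>
    <town>Lancaster</town><postcode>LA1 1AA</postcode><country>UK</country>
  </address>
</person>"""

# The whole document is parsed and the complete tree is built in memory
# before any node can be visited.
document = parseString(PERSON_XML)

# Traverse: collect the element children of the root, skipping the
# whitespace text nodes that sit between them.
child_names = [
    node.tagName
    for node in document.documentElement.childNodes
    if node.nodeType == node.ELEMENT_NODE
]

# Read a single value from deep inside the tree.
town_node = document.getElementsByTagName("town")[0]
town = town_node.firstChild.data

# Modify the tree in place; the change is visible to later lookups.
town_node.firstChild.data = "Morecambe"
updated = document.getElementsByTagName("town")[0].firstChild.data

print(child_names, town, updated)
```

Even though only the town element is read and changed, the parser still materializes every node of the document, which is the cost the paragraph above refers to.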

SAX, the Simple API for XML, is a de facto standard. It differs from DOM in that it uses an event-driven model. Each time a start tag, end tag, or processing instruction is encountered, the program is notified. As a result, the whole document does not need to be parsed before it is manipulated; sections of the document can be processed as they are parsed. SAX is therefore better suited than DOM to manipulating large documents.
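The event-driven model can be sketched with a small handler that is notified as tags are encountered. As an assumption for illustration, the example uses Python's standard xml.sax module, not the Java SAX API. The class name TownHandler and the sample data are hypothetical.

```python
import xml.sax

# Hypothetical sample data; the values are not taken from Listing P.2.
PERSON_XML = """<person>
  <name><surname>Smith</surname><firstname>John</firstname></name>
  <address>
    <housenumber>12</housenumber><street>High Street</street>
    <town>Lancaster</town><postcode>LA1 1AA</postcode><country>UK</country>
  </address>
</person>"""


class TownHandler(xml.sax.ContentHandler):
    """Records every tag event and captures the text of the town element."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.town = ""
        self._in_town = False

    def startElement(self, name, attrs):
        # Called by the parser each time a start tag is encountered.
        self.events.append(("start", name))
        if name == "town":
            self._in_town = True

    def endElement(self, name):
        # Called by the parser each time an end tag is encountered.
        self.events.append(("end", name))
        if name == "town":
            self._in_town = False

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_town:
            self.town += content


handler = TownHandler()
xml.sax.parseString(PERSON_XML.encode("utf-8"), handler)

print(handler.town, len(handler.events))
```

The handler keeps only the state it needs (a flag and one string), so no tree is built. The events list is kept here purely to make the notifications visible, and a handler for a large document would omit it.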
