Hack 68 Validate an XML Document with a DTD


XML inherited the Document Type Definition (DTD) from SGML. It is the native language for validating XML, though it is not itself in XML syntax, and is interwoven into the XML 1.0 specification (http://www.w3.org/TR/2004/REC-xml-20040204/). Using non-XML syntax, a DTD defines the structure or content model of a valid XML instance. A DTD can define elements, attributes, entities, and notations, and can contain comments (just like XML comments), conditional sections, and a structure unique to DTDs called parameter entities. DTDs can be internal or external to an XML document, or both. This hack shows you how to implement all the basic structures of a DTD.

Example 5-1 shows external.xml, and Example 5-2 shows a DTD against which external.xml is valid. The external DTD is called order.dtd. This is also known as an external subset. This DTD is a local file in this example, but it could also exist across a network.

Example 5-1. external.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE order SYSTEM "order.dtd">

<order id="TDI-983857">
 <store>Prineville</store>
 <product>feed-grade whole oats</product>
 <package>sack</package>
 <weight std="lbs.">50</weight>
 <quantity>23</quantity>
 <price cur="USD">
  <high>5.99</high>
  <regular>4.99</regular>
  <discount>3.99</discount>
 </price>
 <ship>the back of Tom's pickup</ship>
</order>

5.2.1 External Subset

The XML declaration on line 1 declares that this document does not stand alone. That's because on line 2, external.xml references the DTD order.dtd. The file order.dtd is considered an external entity and is called an external subset. The SYSTEM keyword on line 2 indicates that the DTD will be identified by a system identifier, which for all practical purposes is a URL for a local or remote file.

In the DTD order.dtd (Example 5-2), all the valid structures found in external.xml are declared. The document element of external.xml is order (line 4), which has child elements that describe the pieces of a purchase order, including information on the store, product, product packaging, product weight, quantity, price, and shipping method. Validate external.xml against its associated DTD order.dtd by using RXP, xmlvalid, or xmllint on the command line, or use RXP online or the Brown University STG online validator [Hack #9].

Example 5-2. order.dtd
<?xml encoding="UTF-8"?>
<!-- Order DTD -->
<!ELEMENT order (store+,product,package?,weight?,quantity,price,ship*)>
<!-- id = part number -->
<!ATTLIST order id ID #REQUIRED
                xmlns CDATA #FIXED "http://www.wyeast.net/order"
                date CDATA #IMPLIED>
<!ELEMENT store (#PCDATA)>
<!ELEMENT product (#PCDATA)>
<!ELEMENT package (#PCDATA)>
<!ELEMENT weight (#PCDATA)>
<!ATTLIST weight std NMTOKEN #REQUIRED>
<!ELEMENT quantity (#PCDATA)>
<!ELEMENT price (high?,regular,discount?,total?)>
<!ATTLIST price cur (USD|CAD|AUD|EUR) "USD">
<!ELEMENT high (#PCDATA)>
<!ELEMENT regular (#PCDATA)>
<!ELEMENT discount (#PCDATA)>
<!ELEMENT ship (#PCDATA)>

5.2.1.1 The text declaration

A text declaration (http://www.w3.org/TR/2004/REC-xml-20040204/#sec-TextDecl) is similar to an XML declaration (see "The XML Declaration" in Chapter 1), except that version information (e.g., version="1.0") is optional; encoding declarations, such as encoding="UTF-8", are required; and there are no standalone declarations (e.g., standalone="no").

5.2.1.2 Element type declarations and content models

Most of the lines in this DTD contain element type declarations (http://www.w3.org/TR/2004/REC-xml-20040204/#elemdecls). This is one of several kinds of markup declarations (http://www.w3.org/TR/2004/REC-xml-20040204/#dt-markupdecl) that may appear in a DTD. The simplest, on lines 8 through 11, line 13, and lines 16 through 19, have content models for parsed character data (#PCDATA), which means that these elements must contain only text, with no element children. The elements declared on lines 3 and 14 (order and price) have content models that include only child elements. The commas (,) between the names in these content models mean that the child elements must appear in the order listed. The +, ?, and * symbols denote occurrence constraints, meaning that the child elements may occur only a given number of times: + means that the element may occur one or more times; ? means the element may occur zero or one time (that is, it's optional); and * means the element may occur zero or more times. An element name with no occurrence symbol after it must occur exactly once.
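
As a rough illustration of these occurrence constraints, here is a minimal sketch that uses an internal subset so that it is self-contained. The element names memo, to, cc, subject, and para are hypothetical and are not part of order.dtd.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE memo [
<!-- Hypothetical content model:
     to      one or more   (+)
     cc      zero or one   (?)
     subject exactly one   (no symbol)
     para    zero or more  (*)
     The commas fix the order of the children. -->
<!ELEMENT memo (to+,cc?,subject,para*)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT cc (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT para (#PCDATA)>
]>
<memo>
 <to>Tom</to>
 <to>Prineville store</to>
 <subject>Oat order</subject>
</memo>
```

This instance should validate: to appears twice, cc and para are left out, and subject appears once. Moving subject ahead of the to elements, or adding a second subject, would make it invalid.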

5.2.1.3 Attribute-list declarations

The DTD order.dtd has three attribute-list declarations on lines 5, 12, and 15. You can declare one or more attributes at a time, hence the phrase attribute list. The first declares three attributes, id, xmlns, and date. XML attributes declared in DTDs must have one of 10 possible types: CDATA, ID, IDREF, IDREFS, ENTITY, ENTITIES, NMTOKEN, NMTOKENS, NOTATION, and enumeration (see http://www.w3.org/TR/2004/REC-xml-20040204/#sec-attribute-types for an explanation of all the types).

The attribute id on line 5 is of type ID, which must be an XML name (http://www.w3.org/TR/2004/REC-xml-20040204/#NT-Name) and must be unique (http://www.w3.org/TR/2004/REC-xml-20040204/#id). It is also required (#REQUIRED); that is, it must appear in any valid instance of the DTD.

Emulating Namespace Support in DTDs

DTDs do not directly support XML namespaces (http://www.w3.org/TR/xml-names11), but you can use a few tricks to imitate namespace support. Here is how to do it: the attribute xmlns (line 6) has a fixed value of http://www.wyeast.net/order. The #FIXED keyword means that the attribute must always have the provided default value. When an instance of this DTD is processed, for example by the command rxp -aV or xmllint --valid, it will contain the namespace declaration xmlns="http://www.wyeast.net/order". If you want to use prefixed elements, for example, change line 6 to read: xmlns:order CDATA #FIXED "http://www.wyeast.net/order". Then add the prefix order: to all the element declarations in the DTD; for example, <!ELEMENT store (#PCDATA)> becomes <!ELEMENT order:store (#PCDATA)> and so forth. Be cautious: you will want to use defaulted attributes as namespace declarations only when you are certain that your instance will use the namespace.
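
A minimal sketch of the prefixed variant, cut down to two elements and moved into an internal subset so that it is self-contained (this trimmed document is an illustration, not one of the book's example files). A DTD treats order:store as a single name, colon and all, so the DOCTYPE name, the names inside the content model, and the element name in the ATTLIST declaration take the prefix as well.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE order:order [
<!ELEMENT order:order (order:store+)>
<!-- The fixed attribute doubles as the namespace declaration
     for the order prefix -->
<!ATTLIST order:order
          xmlns:order CDATA #FIXED "http://www.wyeast.net/order">
<!ELEMENT order:store (#PCDATA)>
]>
<order:order>
 <order:store>Prineville</order:store>
</order:order>
```

The instance never writes xmlns:order itself. It relies on the processor reading the DTD and supplying the defaulted attribute, which is why the caution above matters.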


On line 7, the attribute date is declared. The #IMPLIED keyword means that the attribute may or may not appear in a legal instance. CDATA means that the value of date will be a string.

The std attribute for the weight element is declared on line 12. It is required (#REQUIRED) and is of type NMTOKEN. A name token is a single, atomic unit: a string with no whitespace. The attribute-list declaration on line 15 declares the cur (currency) attribute for the price element. The default value in quotes is USD (United States dollar), with possible values USD, CAD (Canadian dollar), AUD (Australian dollar), and EUR (Euro).
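
To see the default value at work, consider this small sketch, which reuses the price declarations from order.dtd in an internal subset. The trimmed content model (regular only) is an assumption made to keep the example short.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE price [
<!-- Trimmed content model, for illustration only -->
<!ELEMENT price (regular)>
<!-- cur is left out of the instance below, so a processor that
     reads this declaration supplies the default, cur="USD" -->
<!ATTLIST price cur (USD|CAD|AUD|EUR) "USD">
<!ELEMENT regular (#PCDATA)>
]>
<price>
 <regular>4.99</regular>
</price>
```

Writing cur="EUR" on price would also be valid, while a value outside the enumeration, such as cur="GBP", would not be.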

5.2.2 Internal Subset

You can also have a DTD that is internal to an XML document. This is called the internal subset. internal.xml is an example of an XML document that contains an internal subset (Example 5-3). The DTD is stored in the DOCTYPE declaration, which encloses markup declarations in square brackets ([ ]); see lines 2 and 21.

Example 5-3. internal.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE order [
<!-- Order DTD -->
<!ELEMENT order (store+,product,package?,weight?,quantity,price,ship*)>
<!-- id = part number -->
<!ATTLIST order id ID #REQUIRED
                xmlns CDATA #FIXED "http://www.wyeast.net/order"
                date CDATA #IMPLIED>
<!ELEMENT store (#PCDATA)>
<!ELEMENT product (#PCDATA)>
<!ELEMENT package (#PCDATA)>
<!ELEMENT weight (#PCDATA)>
<!ATTLIST weight std NMTOKEN #REQUIRED>
<!ELEMENT quantity (#PCDATA)>
<!ELEMENT price (high?,regular,discount?,total?)>
<!ATTLIST price cur (USD|CAD|AUD|EUR) "USD">
<!ELEMENT high (#PCDATA)>
<!ELEMENT regular (#PCDATA)>
<!ELEMENT discount (#PCDATA)>
<!ELEMENT ship (#PCDATA)>
]>

<order id="TDI-983857">
 <store>Prineville</store>
 <product>feed-grade whole oats</product>
 <package>sack</package>
 <weight std="lbs.">50</weight>
 <quantity>23</quantity>
 <price cur="USD">
  <high>5.99</high>
  <regular>4.99</regular>
  <discount>3.99</discount>
 </price>
 <ship>the back of Tom's pickup</ship>
</order>

On line 1, the document internal.xml is declared to be standalone; i.e., it does not depend on markup declarations in an external entity. Notice that there is no SYSTEM keyword or system identifier (URL). This is because the markup declarations are enclosed in the document type declaration, rather than in an external entity. The document type declaration (lines 2 through 21) contains the same declarations as order.dtd, and the document itself (lines 23 through 35) is the same as external.xml, except for the DOCTYPE.

5.2.2.1 Using an internal subset and an external subset together

The document both.xml, shown in Example 5-4, uses both an internal subset and an external subset (both.dtd in Example 5-5). Notice how the document type declaration uses the SYSTEM keyword and a system identifier (both.dtd), and also encloses markup declarations in square brackets ([ ]). The advantage of this syntax is that DTDs can be developed and used in a modular fashion, and documents can be validated with these modules whether they exist locally or in disparate locations (across the Internet).

Example 5-4. both.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE order SYSTEM "both.dtd" [
<!-- Order DTD -->
<!ELEMENT order (store+,product,package?,weight?,quantity,price,ship*)>
<!-- id = part number -->
<!ATTLIST order id ID #REQUIRED
                xmlns CDATA #FIXED "http://www.wyeast.net/order"
                date CDATA #IMPLIED>
<!ELEMENT store (#PCDATA)>
<!ELEMENT product (#PCDATA)>
<!ELEMENT package (#PCDATA)>
<!ELEMENT weight (#PCDATA)>
<!ATTLIST weight std NMTOKEN #REQUIRED>
<!ELEMENT quantity (#PCDATA)>
<!ELEMENT ship (#PCDATA)>
]>

<order id="TDI-983857">
 <store>Prineville</store>
 <product>feed-grade whole oats</product>
 <package>sack</package>
 <weight std="lbs.">50</weight>
 <quantity>23</quantity>
 <price cur="USD">
  <high>5.99</high>
  <regular>4.99</regular>
  <discount>3.99</discount>
 </price>
 <ship>the back of Tom's pickup</ship>
</order>

Example 5-5. both.dtd
<!ELEMENT price (high?,regular,discount?,total?)>
<!ATTLIST price cur (USD|CAD|AUD|EUR) "USD">
<!ELEMENT high (#PCDATA)>
<!ELEMENT regular (#PCDATA)>
<!ELEMENT discount (#PCDATA)>

5.2.3 Parameter Entities

A parameter entity (PE) is a special entity that can be used only in a DTD; parameter entity references are not recognized in the content of an XML document. A PE provides a way to store information and then reuse that information elsewhere, multiple times. A good example of this can be found in the way the XHTML 1.0 strict DTD (http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd) defines a set of core attributes. Here is a fragment from the DTD:

<!-- core attributes common to most elements
  id       document-wide unique id
  class    space separated list of classes
  style    associated style info
  title    advisory title/amplification
-->
<!ENTITY % coreattrs
 "id          ID             #IMPLIED
  class       CDATA          #IMPLIED
  style       %StyleSheet;   #IMPLIED
  title       %Text;         #IMPLIED"
  >

Lines 1 through 6 of this fragment contain a comment explaining the purpose of four attributes, id, class, style, and title. Starting on line 7, an entity is declared. The percent sign (%) is a flag to the XML processor saying that this is a parameter entity. The information in double quotes makes up part of an attribute-list declaration that is reused three times in the DTD.

Where normal entity references [Hack #4] begin with an ampersand (&), parameter entity references begin with a percent sign (%). Lines 10 and 11 show the parameter entity references %StyleSheet; and %Text;, which are defined elsewhere in the DTD as:

<!ENTITY % StyleSheet "CDATA">
    <!-- style sheet data -->

<!ENTITY % Text "CDATA">
    <!-- used for titles etc. -->

%StyleSheet; and %Text; expand to CDATA. As you can see, a parameter entity can contain a reference to another parameter entity. In fact, the attrs parameter entity in xhtml1-strict.dtd references coreattrs and two other parameter entities:

<!ENTITY % attrs "%coreattrs; %i18n; %events;">

attrs, in turn, is used over 60 times in the DTD, so you can see that parameter entities are a handy way to reuse information in a DTD.
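
The same technique could be applied to the order DTD. The following is only a sketch, assuming an external DTD file, because parameter entity references inside a markup declaration are permitted only in the external subset, not in an internal one. The entity names pcdata and currencies are hypothetical.

```xml
<?xml encoding="UTF-8"?>
<!-- Sketch of an external DTD file. The entity names pcdata and
     currencies are hypothetical. -->
<!ENTITY % pcdata "(#PCDATA)">
<!ENTITY % currencies "USD|CAD|AUD|EUR">

<!ELEMENT price (high?,regular,discount?,total?)>
<!-- %currencies; expands to the list of allowed values -->
<!ATTLIST price cur (%currencies;) "USD">
<!-- %pcdata; expands to (#PCDATA) in each declaration -->
<!ELEMENT high %pcdata;>
<!ELEMENT regular %pcdata;>
<!ELEMENT discount %pcdata;>
```

Each entity has to be declared before the first reference to it. Adding a currency then means editing one entity declaration rather than hunting through attribute lists.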

5.2.4 Other Things That Can Go in a DTD

This section briefly covers several other things you can include in DTDs: comments, conditional sections, unparsed entities, and notations.

5.2.4.1 Comments

DTDs can contain XML-style comments [Hack #1]. For example, the pair of comments used on lines 2 and 4 in Example 5-2 are formed just as they would be in an XML document.

5.2.4.2 Conditional sections

Conditional sections allow you to include or exclude declarations in a DTD conditionally. This feature can help you develop a DTD while you are still trying out different content models. Look at this fragment from conditional.dtd:

<![INCLUDE[
<!ATTLIST price cur (USD|CAD|AUD|EUR) "USD">
]]>
<![IGNORE[
<!ATTLIST price cur (USD|EUR) "USD">
]]>

The structure that starts with the word INCLUDE indicates that the following declaration (which must be complete) is to be included in the DTD at validation time. The section marked IGNORE, however, is ignored. The following fragment, also in conditional.dtd, shows how you can turn these sections on or off with parameter entities.

<!ENTITY % on 'INCLUDE' >
<!ENTITY % off 'IGNORE' >
...
<![%on;[
<!ELEMENT price (high?,regular,discount?,total?)>
]]>
<![%off;[
<!ELEMENT price (regular,discount,total)>
]]>

Conditional sections are an interesting hack in themselves, but they are frequently considered more complicating than helpful.

5.2.4.3 Unparsed entities and notations

An unparsed entity is a resource upon which XML places no constraints. It can consist of a chunk of XML, non-XML text, a graphical file, a binary file, or any other electronic resource. An unparsed entity has a name that is associated with a system identifier or a public identifier.

For example, in DocBook [Hack #62], a module of the DTD (dbnotnx.mod, under the subdirectory docbook-4.3CR in this book's file archive) is dedicated to notations. Here is a notation from that module that associates the name GIF89a with a public identifier -//CompuServe//NOTATION Graphics Interchange Format 89a//EN:

<!NOTATION GIF89a               PUBLIC
"-//CompuServe//NOTATION Graphics Interchange Format 89a//EN">

Here is another example from the same module that uses a system identifier for the name PNG:

<!NOTATION PNG          SYSTEM "http://www.w3.org/TR/REC-png">

Elsewhere in a DTD that includes this module, you could declare several entities like this:

<!ENTITY % dbnotnx SYSTEM "dbnotnx.mod">
%dbnotnx;
...
<!ENTITY g001 SYSTEM "g001.gif" NDATA GIF89a>
<!ENTITY g002 SYSTEM "g002.png" NDATA PNG>
...
<!ELEMENT graphic EMPTY>
<!ATTLIST graphic img ENTITY #REQUIRED>

The entity declarations associate entity names with files and with the names of notations. The presence of the NDATA keyword indicates an unparsed entity. Then, in an instance, you could refer to the entity in an attribute, like this:

<graphic img="g001"/>
...
<graphic img="g002"/>

The syntax for unparsed entities is the most awkward and forbidding of any syntax in XML. The use of unparsed entities is rare, and the applications that support them are even rarer. If people want to display graphics, they usually transform their XML into HTML or XHTML and use the ubiquitously supported img tag.
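
For reference, the declarations discussed in this section can be combined into one small document. This is a minimal sketch: the figures wrapper element is hypothetical, and the notation is declared inline rather than pulled in from the DocBook module.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE figures [
<!-- The notation must be declared for the NDATA reference to be valid -->
<!NOTATION PNG SYSTEM "http://www.w3.org/TR/REC-png">
<!-- NDATA marks g002 as an unparsed entity in the PNG notation -->
<!ENTITY g002 SYSTEM "g002.png" NDATA PNG>
<!-- figures is a hypothetical wrapper element -->
<!ELEMENT figures (graphic+)>
<!ELEMENT graphic EMPTY>
<!ATTLIST graphic img ENTITY #REQUIRED>
]>
<figures>
 <graphic img="g002"/>
</figures>
```

The img attribute holds the bare entity name, with no ampersand or semicolon. What happens to g002.png after that is left entirely to the application.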