Section 15.3. Understanding XML DTDs

To use a markup language defined with XML, you should be able to read and understand the elements and entities found in its XML DTD. But don't be put off: while XML DTDs are verbose, filled with obscure punctuation, and designed primarily for computer consumption, they are actually easy to understand once you get past all the syntactic sugar. Remember, your brain is better at languages than any computer is.

As we said previously, an XML DTD is a collection of XML entity and element declarations and comments. Entities are name/value pairs that make the DTD easier to read and understand, while elements are the actual markup tags defined by the DTD, like HTML's <p> or <h1> tags. The DTD also describes the content and grammar for each tag in the language. Along with the element declarations, you'll also find attribute declarations that define the attributes authors may use with the tags defined by the element declarations.

The declarations can appear in almost any order (a parameter entity must be declared before it is referenced), although the careful DTD author arranges them so that humans can easily find and understand them, computers notwithstanding. The beloved DTD author includes lots of comments, too, that explain the declarations and how they can be used to create a document. Throughout this chapter, we use examples taken from the XHTML 1.0 DTD, which can be found in its entirety at the W3C web site. Although it is lengthy, you'll find this DTD to be well written, complete, and, with a little practice, easy to understand.

XML also provides for conditional sections within a DTD, allowing groups of declarations to be optionally included or excluded by the DTD parser. This is useful when a DTD actually defines several versions of a markup language; the desired version can be derived by including or excluding appropriate sections. The XHTML 1.0 DTD, for example, defines both the "regular" version of HTML and a version that supports frames. By allowing the parser to include only the appropriate sections of the DTD, the rules for the <html> tag can change to support either a <body> tag or a <frameset> tag, as needed.
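As a rough sketch of how such a switch might look, the fragment below uses two parameter entities to choose between two declarations of the html element. The entity names are hypothetical and the content models are simplified; nothing here is copied from the XHTML DTD. Conditional sections are permitted only in the external DTD subset, so this would live in a separate DTD file rather than inside a document's DOCTYPE declaration:

```xml
<!-- Hypothetical switches: swap the two values to select the frameset variant -->
<!ENTITY % regular.version "INCLUDE">
<!ENTITY % frames.version  "IGNORE">

<![%regular.version;[
<!-- Read by the parser, because the keyword resolves to INCLUDE -->
<!ELEMENT html (head, body)>
]]>

<![%frames.version;[
<!-- Skipped by the parser, because the keyword resolves to IGNORE -->
<!ELEMENT html (head, frameset)>
]]>
```

Because the keywords are supplied through parameter entities, changing two entity values is enough to select the other version of the language.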

15.3.1 Comments

The syntax for comments within an XML DTD is exactly like that for HTML comments: comments begin with <!-- and end with -->. Everything between these two delimiters is ignored by the XML processor. Comments may not be nested.
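A small illustration, using a made-up element declaration, of comments sitting alongside a declaration in a DTD:

```xml
<!-- Paragraphs hold text only in this hypothetical language. -->
<!ELEMENT p (#PCDATA)>

<!--
  A comment may span several lines, but it may not contain
  another comment or a pair of adjacent hyphens.
-->
```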

15.3.2 Entities

An entity is a fancy term for a constant. Entities are crucial to creating modular, easily understood DTDs. Although they may differ in many ways, all entities associate a name with a string of characters. When you use the entity name elsewhere within a DTD, or in an XML document, language parsers replace the name with the corresponding characters. Drawing an example from HTML, the &lt; entity is replaced by the < character wherever it appears in an HTML document.

Entities come in two flavors: parsed and unparsed. Parsed entities are processed by an XML processor; unparsed ones are ignored. The vast majority of entities are parsed. An unparsed entity is reserved for use within attribute lists of certain tags; it is nothing more than a replacement string used as a value for a tag attribute.

You can further divide the group of parsed entities into general entities and parameter entities. General entities are used in the XML document, while parameter entities are used in the XML DTD.

You may not realize that you've been using general entities within your HTML documents all along. For example, the entity for the copyright (©) symbol (&copy;) is a general entity defined in the HTML DTD. Like all general entities, it is referenced by preceding its name with the ampersand character and following it with a semicolon. All of the other general entities you know and love are listed in Appendix F.

To make life easier, XML predefines the five most common general entities, which can be used in any XML document. While it is still preferred that they be explicitly defined in any DTD that uses them, these five entities are always available to any XML author:

&amp;			&

&apos;			'

&gt;			>

&lt;			<

&quot;			"
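For instance, a document might use all five references in ordinary text. The tip element here is hypothetical and serves only to hold the text:

```xml
<!-- A hypothetical element; each reference stands in for a literal character -->
<tip>Write &lt;br /&gt; for a line break, &amp; quote values as &quot;left&quot; or &apos;right&apos;.</tip>
```

After the references are replaced, the text of the element reads: Write <br /> for a line break, & quote values as "left" or 'right'.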

You'll find parameter entities littered throughout any well-written DTD, including the HTML DTD. Parameter entities have a percent sign (%) preceding their names. The percent sign tells the XML processor to look up the entity name in the DTD's list of parameter entities, insert the value of the entity into the DTD in place of the entity reference, and process the value of the entity as part of the DTD.

That last bit is important. By processing the contents of the parameter entity as part of the DTD, the XML processor allows you to place any valid XML content in a parameter entity. Many parameter entities contain lengthy XML definitions and may even contain other entity definitions. Parameter entities are the workhorses of the XML DTD; creating DTDs without them would be extremely difficult.^[5]

^[5] C and C++ programmers may recognize that the entity mechanism in XML is similar to the #define macro mechanism in C and C++. The XML entities provide only simple character-string substitution and do not employ C's more elaborate macro parameter mechanism.
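To illustrate how parameter entities can build on one another, here is a simplified sketch loosely modeled on the style of the XHTML DTD. The shortened element lists are assumptions chosen for illustration, and the fragment is meant for an external DTD file, since parameter entity references inside other declarations are not allowed in a document's internal subset:

```xml
<!-- Two small building blocks: lists of element names -->
<!ENTITY % fontstyle "tt | i | b">
<!ENTITY % phrase    "em | strong | code">

<!-- A larger entity assembled from the two above -->
<!ENTITY % inline "%fontstyle; | %phrase;">

<!-- After substitution, p may hold text mixed with tt, i, b, em, strong, or code -->
<!ELEMENT p (#PCDATA | %inline;)*>
```

Adding a new inline element to the language then means editing one entity value instead of every declaration that lists the inline elements.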

15.3.3 Entity Declarations

Let's define an entity with the <!ENTITY> tag in an XML DTD. Inside the tag, supply the entity name and its value; a percent sign placed before the name marks the entity as a parameter entity instead of a general one:

<!ENTITY name value>

<!ENTITY % name value>

The first version creates a general entity; the second, because of the percent sign, creates a parameter entity.

For both entity types, the name is simply a sequence of characters beginning with a letter, colon, or underscore and followed by any combination of letters, numbers, periods, hyphens, underscores, or colons. The only restriction is that names may not begin with the sequence "xml" (either upper- or lowercase).

The entity value is either a character string within quotes (unlike HTML markup, you must use quotes even if it is a string of contiguous letters) or a reference to another document containing the value of the entity. For these external entity values, you'll find either the keyword SYSTEM, followed by the URL of the document containing the entity value, or the keyword PUBLIC, followed by the formal name of the document and its URL.

A few examples will make this clear. Here is a simple general entity declaration:

<!ENTITY fruit "kumquat or other similar citrus fruit">

In this declaration, the entity "&fruit;" within the document is replaced with the phrase "kumquat or other similar citrus fruit" wherever it appears.
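Put together, a minimal document using this entity might look like the following sketch. The recipe element is a hypothetical name, and the declarations are placed in the document's internal DTD subset to keep the example in one piece:

```xml
<?xml version="1.0"?>
<!DOCTYPE recipe [
  <!ELEMENT recipe (#PCDATA)>
  <!ENTITY fruit "kumquat or other similar citrus fruit">
]>
<!-- The parser replaces the reference with the entity's value -->
<recipe>Garnish with one &fruit;, thinly sliced.</recipe>
```

An application reading this document sees the text "Garnish with one kumquat or other similar citrus fruit, thinly sliced."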

Similarly, here is a parameter entity declaration:

<!ENTITY % ContentType "CDATA">

Anywhere the reference %ContentType; appears in your DTD, it is replaced with the word "CDATA". This is the typical way to use parameter entities: to create a more descriptive term for a generic parameter that will be used many times in a DTD.
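A short sketch of that usage follows. The declarations are simplified and illustrative, not quoted from a published DTD, and they belong in an external DTD file, because a parameter entity reference may not appear inside another declaration in a document's internal subset:

```xml
<!ENTITY % ContentType "CDATA">

<!ELEMENT link EMPTY>
<!-- The processor reads the next declaration as: type CDATA #IMPLIED -->
<!ATTLIST link
  type  %ContentType;  #IMPLIED
>
```

A reader scanning the attribute list learns that type holds a content type, which the bare keyword CDATA would not convey.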

Here is an external general entity declaration:

<!ENTITY boilerplate SYSTEM "http://server.com/boilerplate.txt">

It tells the XML processor to retrieve the contents of the file boilerplate.txt from server.com and use it as the value of the boilerplate entity. Anywhere you use &boilerplate; in your document, the contents of the file are inserted as part of your document content.
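In a document, the declaration and its reference could be combined as in the sketch below. The letter element is hypothetical, and the example assumes the file at that URL contains text that is well-formed as element content. Note that a nonvalidating processor is not required to fetch external entities:

```xml
<?xml version="1.0"?>
<!DOCTYPE letter [
  <!ELEMENT letter (#PCDATA)>
  <!ENTITY boilerplate SYSTEM "http://server.com/boilerplate.txt">
]>
<!-- The contents of boilerplate.txt take the place of the reference -->
<letter>Thank you for your order. &boilerplate;</letter>
```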

Here is an external parameter entity declaration, lifted from the HTML DTD, that references a public external document:

<!ENTITY % HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" 

    "xhtml-lat1.ent">

It defines an entity named HTMLlat1 whose contents are to be taken from the public document identified as -//W3C//ENTITIES Latin 1 for XHTML//EN. If the processor does not have a copy of this document available, it can use the URL xhtml-lat1.ent to find it. This particular public document is actually quite lengthy, containing all of the general entity declarations for the Latin 1 character encodings for HTML.^[6] Accordingly, simply writing this in the HTML DTD:

%HTMLlat1;

causes all of those general entities to be defined as part of the language.

^[6] You can enjoy this document for yourself at http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent.

A DTD author can use the PUBLIC and SYSTEM external values with general and parameter entity declarations. You should structure your external definitions to make your DTDs and documents easy to read and understand.

You'll recall that we began the section on entities with a mention of unparsed entities whose only purpose is to be used as values to certain attributes. You declare an unparsed entity by appending the keyword NDATA to an external general entity declaration, followed by the name of a notation that identifies the format of the entity's data. If we wanted to convert our general boilerplate entity to an unparsed general entity for use as an attribute value, we could say:

<!ENTITY boilerplate SYSTEM "http://server.com/boilerplate.txt" NDATA text>

With this declaration, attributes defined as type ENTITY (as described in Section 15.5.1) could use boilerplate as one of their values.
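A fuller sketch shows the pieces a validating processor would expect to find together. The name after NDATA must match a declared notation, so the example adds a <!NOTATION> declaration; the notation's identifier, the attachment element, and its source attribute are all assumptions made for illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE attachment [
  <!-- Declares the notation named after NDATA below -->
  <!NOTATION text SYSTEM "text/plain">
  <!ENTITY boilerplate SYSTEM "http://server.com/boilerplate.txt" NDATA text>

  <!ELEMENT attachment EMPTY>
  <!ATTLIST attachment source ENTITY #REQUIRED>
]>
<!-- The attribute value is the bare entity name, with no ampersand or semicolon -->
<attachment source="boilerplate"/>
```

The processor does not read the file itself; it passes the entity's location and notation along to the application, which decides what to do with them.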

15.3.4 Elements

Elements are definitions of the tags that can be used in documents based on your XML markup language. In some ways, element declarations are easier than entity declarations, since all you need to do is specify the name of the tag and what sort of content that tag may contain:

<!ELEMENT name contents>

The name follows the same rules as names for entity definitions. The contents section may be one of four types described here:

The keyword EMPTY defines a tag with no content, like <hr> or <br> in HTML. Empty elements in XML get a bit of special handling, as described in Section 15.4.5.
The keyword ANY indicates that the tag can have any content, without restriction or further processing by the XML processor.
The content may be a set of grammar rules that defines the order and nesting of tags within the defined element. This content type is used when the tag being defined contains only other tags, without conventional content allowed directly within the tag. In HTML, the <ul> tag is such a tag, as it can contain only <li> tags.
Mixed content, denoted by the keyword #PCDATA followed by a list of element names separated by vertical bars (|), is enclosed in parentheses; when element names are listed, an asterisk follows the closing parenthesis. This content type allows tags to have user-defined content, along with other markup elements. The <li> tag, for example, may contain user-defined content as well as other tags.
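The four content types can be sketched side by side as follows. The element names borrow from HTML, but the content models are simplified for illustration and the anything element is hypothetical. In XML, a mixed-content group lists #PCDATA first and separates the alternatives with vertical bars:

```xml
<!-- EMPTY: no content at all -->
<!ELEMENT br EMPTY>

<!-- ANY: any mix of text and declared elements -->
<!ELEMENT anything ANY>

<!-- Element content: one or more li elements and nothing else -->
<!ELEMENT ul (li)+>

<!-- Mixed content: text interleaved with ul or em elements, in any order -->
<!ELEMENT li (#PCDATA | ul | em)*>
<!ELEMENT em (#PCDATA)>
```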