The grammar of human language is rich with a variety of sentence structures, verb tenses, and all sorts of irregular constructs and exceptions to the rules. Nonetheless, you mastered most of it by the age of three. Computer language grammars typically are simple, regular, and have few exceptions. In fact, computer grammars use only four rules to define how elements of a language may be arranged: sequence, choice, grouping, and repetition.
Sequence rules define the exact order in which elements appear in a language. For instance, if a sequence grammar rule states that element A is followed by B and then by C, your document must provide elements A, B, and C in that exact order. A missing element (A and C, but no B, for example), an extra element (A, B, E, then C), or an element out of place (C, A, then B) violates the rule and does not match the grammar.
In many grammars, XML included, sequences are defined by simply listing the appropriate elements, in order and separated by commas. Accordingly, our example sequence in the DTD would appear simply as A, B, C.
Choice grammar rules provide flexibility by letting the document author choose one element from among a group of valid elements defined by the DTD author. For example, a choice rule might state that you may choose elements D, E, or F; any one of these three elements would satisfy the grammar. Like many other grammars, XML denotes choice rules by listing the appropriate choices separated by a vertical bar (|). Thus, our simple choice would be written in the DTD as D | E | F. If you read the vertical bar as the word or, choice rules become easy to understand.
Grouping rules collect two or more rules into a single rule, building richer, more usable languages. For example, a grouping rule might allow a sequence of elements, followed by a choice, followed by a sequence. You can indicate groups within a rule by enclosing them in parentheses in the DTD. For example:
Document ::= A, B, C, (D | E | F), G
requires that a document begin with elements A, B, and C, followed by a choice of one element out of D, E, or F, followed by element G.
Repetition rules let you repeat one or more elements some number of times. With XML, as with many other languages, repetition is denoted by appending a special character suffix to an element or group within a rule. Without the special character, that element or group must appear exactly once in the rule. Special characters include the plus sign (+), meaning that the element may appear one or more times in the document; the asterisk (*), meaning that the element may appear zero or more times; and the question mark (?), meaning that the element may appear either zero or one time.
For example, the rule:
Document ::= A, B?, C*, (D | E | F)+, G*
creates an unlimited number of correct documents with the elements A through G. According to the rule, each document must begin with A, optionally followed by B, followed by zero or more occurrences of C, followed by at least one, but perhaps more, of either D, E, or F, followed by zero or more Gs. All of these documents (and many others!) match this rule:
ABCDG ACCCFFGGG ACDFDFGG
You might want to work through these examples to prove to yourself that they are, in fact, correct with respect to the repetition rule.
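One way to check your work is to translate the rule into a regular expression, since sequence, choice, grouping, and repetition all have direct regular-expression equivalents. The following is a minimal sketch, assuming each element is written as a single letter; the names DOCUMENT_RULE and matches_rule are invented for this illustration and are not part of XML or any DTD tool.

```python
import re

# Document ::= A, B?, C*, (D | E | F)+, G*
# Sequence becomes concatenation, the choice group becomes a character
# class, and the ?, *, and + suffixes carry over unchanged.
DOCUMENT_RULE = re.compile(r"AB?C*[DEF]+G*")


def matches_rule(document: str) -> bool:
    """Return True if the whole document matches the rule."""
    return DOCUMENT_RULE.fullmatch(document) is not None


# The three sample documents from the text, plus three that break the rule.
for sample in ["ABCDG", "ACCCFFGGG", "ACDFDFGG", "ABG", "BCDG", "ADCG"]:
    print(sample, matches_rule(sample))
```

The last three samples fail for different reasons: ABG has no D, E, or F; BCDG does not begin with A; and ADCG places C after the choice group.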
By now you can probably imagine that specifying an entire language grammar in a single rule is difficult, although possible. Unfortunately, the result would be one long, nearly unreadable rule. To remedy this situation, the items in a rule may themselves be rules containing other elements and rules. In these cases, the items in a grammar that are themselves rules are known as nonterminals, while the items that are elements in the language are known as terminals. Eventually, all the nonterminals must reference rules that create sequences of terminals, or the grammar would never produce a valid document.
For example, we can express our sample grammar in two rules:
Document ::= A, B?, C*, Choices+, G*
Choices ::= D | E | F
In this example, Document and Choices are nonterminals, while A, B, C, D, E, F, and G are terminals.
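To see that the two-rule grammar says the same thing as the single rule, you can substitute each nonterminal with the body of its rule until only terminals remain. The sketch below assumes the rules are stored as plain strings in a dictionary; RULES and expand are hypothetical names chosen for this example, and the function does not guard against rules that refer to themselves.

```python
import re

# Each key is a nonterminal; each value is the body of its rule.
RULES = {
    "Document": "A, B?, C*, Choices+, G*",
    "Choices": "D | E | F",
}


def expand(name: str, rules: dict) -> str:
    """Replace every nonterminal in rule `name` with its parenthesized body."""

    def substitute(match):
        token = match.group(0)
        if token in rules:  # a nonterminal: expand it recursively
            return "(" + expand(token, rules) + ")"
        return token  # a terminal: leave it alone

    return re.sub(r"[A-Za-z]+", substitute, rules[name])


print(expand("Document", RULES))  # A, B?, C*, (D | E | F)+, G*
```

The parentheses added around each substituted body are what keep the repetition suffix attached to the whole choice rather than to its last element.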
There is no requirement in XML (or most other grammars) that dictates or limits the number of nonterminals in your grammar. Most grammars use nonterminals wherever it makes sense for clarity and ease of use.
The rules for defining the contents of an element match the grammar rules we just discussed. You may use sequences, choices, groups, and repetition to define the allowable contents of an element. The nonterminals in rules must be names of other elements defined in your DTD.
A few examples show how this works. Consider the declaration of the <html> tag, taken from the HTML DTD:
<!ELEMENT html (head, body)>
This defines the element named html whose content is a head element followed by a body element. Notice that you do not enclose the element names in angle brackets within the DTD; that notation is used only when the elements are actually used in a document.
Within the HTML DTD, you can find the declaration of the <head> tag:
<!ELEMENT head (%head.misc;,
    ((title, %head.misc;, (base, %head.misc;)?) |
     (base, %head.misc;, (title, %head.misc;))))>
Gulp. What on earth does this mean? First, notice that a parameter entity named head.misc is used several times in this declaration. Let's go get it:
<!ENTITY % head.misc "(script|style|meta|link|object)*">
Now things are starting to make sense: head.misc defines a group of elements, from which you may choose one. However, the trailing asterisk indicates that you may include zero or more of these elements. The net result is that anywhere %head.misc; appears, you can include zero or more script, style, meta, link, or object elements, in any order. Sound familiar?
Returning to the head declaration, we see that we are allowed to begin with any number of the head miscellaneous elements. We must then make a choice: either a group consisting of a title element, optional miscellaneous items, and an optional base element followed by miscellaneous items; or a group consisting of a base element, miscellaneous items, a title element, and some more miscellaneous items.
Why such a convoluted rule for the <head> tag? Why not just write:
<!ELEMENT head (script|style|meta|link|object|base|title)*>
which allows any number of head elements to appear, or none at all? The HTML standard requires that every <head> tag contain exactly one <title> tag. It also allows for only one <base> tag, if any. Otherwise, the standard does allow any number of the other head elements, in any order.
Put simply, the head element declaration, while initially confusing, forces the XML processor to ensure that exactly one title element appears in the head element and that, if specified, just one base element appears as well. It then allows for any of the other head elements, in any order.
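If you want to convince yourself that the declaration really enforces these constraints, you can model it as a regular expression over the names of the child elements. This is only an illustrative sketch, not how an XML processor validates content: MISC, HEAD, and valid_head are names invented here, and each child name is followed by a single space so that the pattern can match whole names.

```python
import re

# %head.misc; is (script|style|meta|link|object)*
MISC = r"(?:(?:script|style|meta|link|object) )*"

# head is %head.misc; followed by one of two groups:
#   title, %head.misc;, (base, %head.misc;)?
#   base, %head.misc;, title, %head.misc;
HEAD = re.compile(
    rf"{MISC}(?:title {MISC}(?:base {MISC})?|base {MISC}title {MISC})"
)


def valid_head(children) -> bool:
    """Return True if the list of child element names satisfies the rule."""
    text = "".join(name + " " for name in children)
    return HEAD.fullmatch(text) is not None


print(valid_head(["meta", "title", "link"]))    # title with misc elements
print(valid_head(["base", "meta", "title"]))    # base may come before title
print(valid_head(["meta", "link"]))             # missing title
print(valid_head(["title", "base", "base"]))    # more than one base
```

The first two calls succeed and the last two fail, which matches the requirement of exactly one title and at most one base.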
This one example demonstrates a lot of the power of XML: the ability to define commonly used elements using parameter entities and the use of grammar rules to dictate document syntax. If you can work through the head element declaration and understand it, you are well on your way to reading any XML DTD.
Mixed element content extends the element grammar rules to include the special #PCDATA keyword. PCDATA stands for "parsed character data" and signifies that the content of the element will be parsed by the XML processor for general entity references. After the entities are replaced, the character data is passed to the XML application for further processing.
What this boils down to is that parsed character data is the actual content of your XML document. Elements that accept parsed character data may contain plain ol' text, plus whatever other tags you allow, as defined in the DTD. For example, the declaration:
<!ELEMENT title (#PCDATA)>
means that the title element may contain only text with entities. No other tags are allowed, just as in the HTML standard.
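A short experiment shows what "text with entities" looks like to an application. The sketch below uses the ElementTree parser from the Python standard library, which checks well-formedness but does not validate against a DTD; is_pcdata_only is a hypothetical helper written for this example that simply tests whether an element has any child elements.

```python
import xml.etree.ElementTree as ET


def is_pcdata_only(xml_text: str) -> bool:
    """Return True if the root element contains character data but no child elements."""
    return len(ET.fromstring(xml_text)) == 0


# The parser replaces the entity reference before handing the text over.
title = ET.fromstring("<title>Fish &amp; Chips</title>")
print(title.text)  # Fish & Chips

print(is_pcdata_only("<title>Fish &amp; Chips</title>"))
print(is_pcdata_only("<title>My <b>home</b> page</title>"))
```

The second title is well-formed XML, but it would not satisfy a (#PCDATA) content model because it contains a b element.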
A more complex example is the <p> tag, whose element declaration is:
<!ELEMENT p %Inline;>
Another parameter entity! The %Inline; entity is defined in the HTML DTD as:
<!ENTITY % Inline "(#PCDATA | %inline; | %misc;)*">
which expands to these entities when you replace the parameters:
<!ENTITY % special "br | span | bdo | object | img | map">
<!ENTITY % fontstyle "tt | i | b | big | small">
<!ENTITY % phrase "em | strong | dfn | code | q | sub | sup | samp | kbd | var | cite | abbr | acronym">
<!ENTITY % inline.forms "input | select | textarea | label | button">
<!ENTITY % misc "ins | del | script | noscript">
<!ENTITY % inline "a | %special; | %fontstyle; | %phrase; | %inline.forms;">
What do we make of all this? The %Inline; entity defines the contents of the p element as parsed character data, plus any of the elements defined by %inline; and any defined by %misc;. Notice that case does matter: %Inline; is different from %inline;.
The %inline; entity includes lots of stuff: special elements, font-style elements, phrase elements, and inline form elements. The %misc; entity includes the ins, del, script, and noscript elements. You can read the HTML DTD for the other entity declarations to see which elements are also allowed as the contents of a p element.
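Expanding the parameter entities by hand is tedious, so here is a small sketch that does the substitution mechanically. It assumes the entity values are copied into a dictionary exactly as declared above; ENTITIES and expand_entities are names made up for this example, and the function performs plain text replacement rather than full DTD parsing.

```python
import re

ENTITIES = {
    "special": "br | span | bdo | object | img | map",
    "fontstyle": "tt | i | b | big | small",
    "phrase": "em | strong | dfn | code | q | sub | sup | samp | kbd | "
              "var | cite | abbr | acronym",
    "inline.forms": "input | select | textarea | label | button",
    "misc": "ins | del | script | noscript",
    "inline": "a | %special; | %fontstyle; | %phrase; | %inline.forms;",
    # Names are case-sensitive, so "Inline" and "inline" are separate keys.
    "Inline": "(#PCDATA | %inline; | %misc;)*",
}

REFERENCE = re.compile(r"%([A-Za-z][\w.]*);")


def expand_entities(text: str, entities: dict) -> str:
    """Replace %name; references repeatedly until none remain."""
    while REFERENCE.search(text):
        text = REFERENCE.sub(lambda m: entities[m.group(1)], text)
    return text


expanded = expand_entities("%Inline;", ENTITIES)
# Drop the surrounding "(" and ")*", then split the choices on "|".
allowed = [name.strip() for name in expanded.strip("()*").split("|")]
print(expanded)
print(len(allowed))  # 35
```

The resulting list holds #PCDATA plus the 34 element names contributed by %inline; and %misc;.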
Why did the HTML DTD authors break up all these elements into separate groups? If they were simply defining elements to be included in the p element, they could have built a single long list. However, HTML has rules that govern where inline elements may appear in a document. The authors grouped elements that are treated similarly into separate entities that could be referenced several times in the DTD. This makes the DTD easier to read and understand, as well as easier to maintain when a change is needed.
Elements whose content is defined to be empty deserve a special mention. XML introduced notational rules for empty elements, different from the traditional HTML rules that govern them.
HTML authors are used to specifying an empty element as a single tag, like <br> or <img>. XML requires that every element have an opening and a closing tag, so an image tag would be written as <img></img>, with no embedded content. Other empty elements would be written in a similar manner.
Since this format works well for non-empty tags but is a bit of overkill for empty ones, you can use a special shorthand notation for empty tags. To write an empty tag in XML, just place a slash (/) immediately before the closing angle bracket of the tag. Thus, a line break may be written as <br/> and an image tag might be specified as <img src="myimage.gif"/>. Notice that the attributes of the empty element, if any, appear before the closing slash and bracket.
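You can confirm that the shorthand and the long form mean the same thing by parsing both and comparing the results. The sketch below uses the ElementTree parser from the Python standard library; describe and is_well_formed are hypothetical helpers written for this illustration.

```python
import xml.etree.ElementTree as ET

SHORT = '<p>one<br/>two<img src="myimage.gif"/></p>'
LONG = '<p>one<br></br>two<img src="myimage.gif"></img></p>'


def describe(xml_text: str):
    """List each child element's tag, attributes, and text content."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.attrib, child.text or "") for child in root]


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(describe(SHORT) == describe(LONG))      # both notations parse identically
print(is_well_formed("<p>one<br>two</p>"))    # HTML-style <br> has no closing tag
```

The first comparison is true because the parser treats the two notations identically, while the traditional HTML form is rejected because the br element is never closed.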