2.1 Tags

If XML markup is a structural skeleton for a document, then tags are the bones. They mark the boundaries of elements, allow insertion of comments and special instructions, and declare settings for the parsing environment. A parser, the front line of any program that processes XML, relies on tags to help it break down documents into discrete XML objects. There are a handful of different XML object types, listed in Table 2-1.

Table 2-1. Types of tags in XML

| Object | Purpose | Example |
|---|---|---|
| empty element | Represent information at a specific point in the document. | `<xref linkend="abc"/>` |
| container element | Group together elements and character data. | `<p>This is a paragraph.</p>` |
| declaration | Add a new parameter, entity, or grammar definition to the parsing environment. | `<!ENTITY author "Erik Ray">` |
| processing instruction | Feed a special instruction to a particular type of software. | `<?print-formatter force-linebreak?>` |
| comment | Insert an annotation that will be ignored by the XML processor. | `<!-- here's where I left off -->` |
| CDATA section | Create a section of character data that should not be parsed, preserving any special characters inside it. | `<![CDATA[Ampersands galore! &&&&&&]]>` |
| entity reference | Command the parser to insert some text stored elsewhere. | `&company-name;` |
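
To see how a parser separates these object types in practice, here is a minimal sketch using Python's standard `xml.parsers.expat` module. The sample document and the `classify` helper are assumptions made for illustration; they simply combine the example tags from Table 2-1 into one small document and record what the parser reports.

```python
import xml.parsers.expat

# Hypothetical sample document built from the examples in Table 2-1.
DOC = """<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ENTITY author "Erik Ray">
]>
<doc>
  <!-- here's where I left off -->
  <?print-formatter force-linebreak?>
  <p>This is a paragraph by &author;.</p>
  <xref linkend="abc"/>
  <![CDATA[Ampersands galore! &&&&&&]]>
</doc>"""


def classify(document):
    """Return (object type, detail) pairs in the order the parser reports them."""
    seen = []

    def on_element(name, attrs):
        seen.append(("element", name))

    def on_entity_decl(name, *rest):
        seen.append(("declaration", name))

    def on_pi(target, data):
        seen.append(("processing instruction", target))

    def on_comment(data):
        seen.append(("comment", data.strip()))

    def on_cdata_start():
        seen.append(("CDATA section", ""))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_element
    parser.EntityDeclHandler = on_entity_decl
    parser.ProcessingInstructionHandler = on_pi
    parser.CommentHandler = on_comment
    parser.StartCdataSectionHandler = on_cdata_start
    parser.Parse(document, True)  # True marks this as the final chunk
    return seen


for kind, detail in classify(DOC):
    print(kind, detail)
```

The entity reference `&author;` does not show up as a separate event here, because the parser replaces it with the declared text before handing the character data on.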

Elements are the most common XML object type. They break up the document into smaller and smaller cells, nesting inside one another like boxes. Figure 2-1 shows the document in Chapter 1 partitioned into separate elements. Each of these pieces has its own properties and role in a document, so we want to divide them up for separate processing.

Figure 2-1. Telegram with element boundaries visible
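
The box-inside-box nesting of elements can also be shown programmatically. The following sketch assumes Python's standard `xml.etree.ElementTree` module; the sample markup and the `outline` helper are hypothetical and are not taken from the telegram document.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the element names are illustrative only.
SAMPLE = (
    "<memo><to>Sarah</to>"
    "<body><p>First <emphasis>point</emphasis>.</p></body></memo>"
)


def outline(element, depth=0):
    """Return (depth, tag) pairs for an element and everything nested in it."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


# Indent each element name by its nesting depth.
for depth, tag in outline(ET.fromstring(SAMPLE)):
    print("  " * depth + tag)
```

Each level of indentation in the printed outline corresponds to one box nested inside another.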

Inside element start tags, you will sometimes see extra characters next to the element name, in the form name="value". These are attributes. They associate information with an element that may be inappropriate to include as character data. In the telegram example earlier, look for an attribute in the start tag of the telegram element.
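
As a small illustration of how a program sees attributes, this sketch reads the `linkend` attribute from the `xref` example in Table 2-1. It assumes Python's standard `xml.etree.ElementTree` module; the surrounding `p` element and the `attributes_of` helper are hypothetical.

```python
import xml.etree.ElementTree as ET


def attributes_of(markup, tag):
    """Return the attributes of the first element named `tag` as a dict."""
    root = ET.fromstring(markup)
    element = root if root.tag == tag else root.find(f".//{tag}")
    if element is None:
        raise ValueError(f"no <{tag}> element found")
    return dict(element.attrib)


# The attribute is kept apart from the character data around it.
print(attributes_of('<p>See <xref linkend="abc"/> for details.</p>', "xref"))
```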

Declarations are never seen inside elements, but may appear at the top of the document or in an external document type definition file. They are important in setting parameters for the parsing session. They define rules for validation or declare special entities to stand in for text.
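
To make the role of declarations concrete, the sketch below collects entity declarations placed at the top of a document, before the root element. It assumes Python's standard `xml.parsers.expat` module; the sample document, the value given to `company-name`, and the `declared_entities` helper are all hypothetical.

```python
import xml.parsers.expat

# Hypothetical document: the declarations sit in the document type
# declaration at the top, never inside an element.
DOC = """<!DOCTYPE doc [
  <!ENTITY author "Erik Ray">
  <!ENTITY company-name "Example Press">
]>
<doc/>"""


def declared_entities(document):
    """Return a dict mapping each declared internal entity to its text."""
    entities = {}

    def on_entity(name, is_parameter, value, base, system_id, public_id, notation):
        if value is not None:  # internal entities carry their text in `value`
            entities[name] = value

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = on_entity
    parser.Parse(document, True)
    return entities


print(declared_entities(DOC))
```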

The next three objects are used to alter the parser's behavior as it reads through the document. Processing instructions are software-specific directives embedded in the markup for convenience (e.g., storing page numbers for a particular formatter). Comments are regions of text that the parser should strip out before processing, because they are meaningful only to the author. CDATA sections are special regions in which the parser should temporarily suspend its tag recognition.
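
The effect of these three objects on what a program receives can be sketched as follows. This example assumes Python's standard `xml.etree.ElementTree` module, whose default parser discards comments and processing instructions; other tools may keep them. The sample markup and the `visible_text` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup combining a comment, a processing instruction,
# and a CDATA section that contains something that looks like a tag.
MARKUP = (
    "<p>Draft<!-- here's where I left off --> text. "
    "<?print-formatter force-linebreak?>"
    "<![CDATA[Ampersands galore! &&&&&& <b>not a tag</b>]]></p>"
)


def visible_text(markup):
    """Return the character data that survives parsing."""
    return "".join(ET.fromstring(markup).itertext())


root = ET.fromstring(MARKUP)
print(visible_text(MARKUP))
print(len(root))  # number of child elements found inside <p>
```

The comment and the processing instruction leave no trace in the text, while everything inside the CDATA section, including the ampersands and the `<b>` that would otherwise start an element, comes through as plain character data.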

Rounding out the list are entity references, commands that tell the parser to insert predefined pieces of text in the markup. These objects don't follow the pattern of other tags in their appearance. Instead of angle brackets for delimiters, they use the ampersand and semicolon.
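
Here is a brief sketch of entity references being replaced by their text. It assumes Python's standard `xml.etree.ElementTree` module; the `memo` element, the value declared for `company-name`, and the `expand` helper are hypothetical. `&amp;` is one of the entities predefined by XML itself, so it needs no declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one declared entity and one predefined entity.
DOC = """<!DOCTYPE memo [
  <!ENTITY company-name "Example Press">
]>
<memo>Published by &company-name; &amp; friends.</memo>"""


def expand(document):
    """Return the root element's text after the parser expands references."""
    return ET.fromstring(document).text


print(expand(DOC))
```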

In upcoming sections, I'll explain each of these objects in more detail.