Hack 25 Include Text and Documents with Entities


You can insert external text and even documents into XML documents by using external entities.

XML comes with a native mechanism for including text from both internal and external sources. The mechanism is called entities (http://www.w3.org/TR/REC-xml.html/#sec-physical-struct). This feature allows you to make XML documents modular. Entities can be declared and stored internally in a document, in an external file, and even across a network. Entities are declared in DTDs and can contain just small bits of non-XML text, XML markup, or even large amounts of text.

XML has a concept of a document entity, which is a starting point for an XML processor. A document entity, from one standpoint, may exist in a file with an associated name. However, from the standpoint of the XML spec, a document entity does not have a name and might be an input stream that has no means of identification at all.


The rather minimal XML document entity.xml declares one internal entity (line 3) and two external entities (lines 4 and 5), as shown in Example 2-17.

Example 2-17. entity.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE time [
<!ENTITY tm "59">
<!ENTITY tme SYSTEM "tm.ent">
<!ENTITY rmt SYSTEM "http://www.wyeast.net/rmt.ent">
]>

<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>&tm;</minute>
 <second>&tme;</second>
 <meridiem>p.m.</meridiem>
 &rmt;
</time>

This kind of DTD is called the internal subset because it is internal to the XML document itself. You can have an internal subset, an external subset, or both at the same time (see [Hack #68]).

The XML 1.0 spec allows for validating and non-validating processors. A validating processor checks the document against its DTD; a non-validating processor checks only that the document is well-formed. A non-validating processor still has to use the declarations it reads in the internal subset, so internal entities are expanded, but it is not required to resolve external entities. See http://www.w3.org/TR/2004/REC-xml-20040204/#proc-types.


Line 3 contains a declaration for an internal, parsed entity. tm is the entity name and the text in quotes (59) is replacement text. A reference to this entity [Hack #4] is on line 11, &tm;. Entity references begin with an ampersand (&) and end with a semicolon (;), with the entity name sandwiched in between (tm). When processed with entity replacement "turned on," the reference on line 11 will be replaced by the replacement text 59.
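As a minimal sketch of internal entity replacement, the following Python snippet parses a trimmed-down copy of entity.xml that keeps only the internal entity. The choice of Python and its standard xml.etree.ElementTree module is an assumption made for illustration; the hack itself uses command-line tools. The variable names are likewise illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of entity.xml: only the internal entity tm is declared.
DOC = """<!DOCTYPE time [
<!ENTITY tm "59">
]>
<time timezone="PST">
 <hour>11</hour>
 <minute>&tm;</minute>
</time>"""

# The underlying parser replaces &tm; with the declared replacement text.
root = ET.fromstring(DOC)
minute_text = root.find("minute").text
print(minute_text)
```

The external entities are left out here on purpose, because this parser does not fetch external files on its own.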

The entity declared on line 4 is an external, parsed entity. Its replacement text is found in the external local file tm.ent. (The suffix .ent is certainly not required; it's just a convention that some folks use for naming entity files.) When processed, the reference &tme; on line 12 will be replaced by the little fragment of text found in the file tm.ent:

<?xml encoding="UTF-8"?>59

Right before the text 59 is a text declaration (http://www.w3.org/TR/REC-xml.html/#sec-TextDecl). It looks like an XML declaration [Hack #1] minus the version information. The version information is allowed here, but unlike the XML declaration, it is not required. Text declarations allow you to explicitly assign an encoding to an external entity file. If a text declaration is present, it must have an encoding declaration and must appear at the beginning of the entity.

The final entity, declared on line 5, is also an external, parsed entity, like the one defined on the line before it. The difference is that this entity's replacement text comes from an external file out on the Web, http://www.wyeast.net/rmt.ent. The file rmt.ent contains markup and looks like this:

<?xml encoding="UTF-8"?><atomic signal="true"/>

A reference to the rmt entity turns up on line 14 of entity.xml (&rmt;). When processed, the reference is replaced with the markup stored in rmt.ent.
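The way a processor resolves external parsed entities can be sketched with Python's standard xml.parsers.expat module, which calls a handler each time it meets a reference to an external entity. This is only an illustration under stated assumptions: the two entity files are replaced by an in-memory dictionary (ENTITY_FILES, a hypothetical stand-in so that nothing is read from disk or fetched from the network), the document is condensed onto fewer lines, and parse_with_entities is an illustrative helper name, not part of any library.

```python
import xml.parsers.expat

# Hypothetical in-memory stand-ins for the two entity files. A real
# resolver would read tm.ent from disk and fetch rmt.ent over HTTP.
ENTITY_FILES = {
    "tm.ent": b'<?xml encoding="UTF-8"?>59',
    "http://www.wyeast.net/rmt.ent":
        b'<?xml encoding="UTF-8"?><atomic signal="true"/>',
}

# Condensed copy of entity.xml.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE time [
<!ENTITY tm "59">
<!ENTITY tme SYSTEM "tm.ent">
<!ENTITY rmt SYSTEM "http://www.wyeast.net/rmt.ent">
]>
<time timezone="PST"><minute>&tm;</minute><second>&tme;</second>&rmt;</time>"""


def parse_with_entities(doc, files):
    """Parse doc and return a list of events with entities expanded."""
    events = []
    parser = xml.parsers.expat.ParserCreate()

    def install(p):
        # Attach the same handlers to the main parser and to child parsers.
        p.StartElementHandler = lambda name, attrs: events.append(("start", name, attrs))
        p.EndElementHandler = lambda name: events.append(("end", name))
        p.CharacterDataHandler = lambda data: events.append(("text", data))
        p.ExternalEntityRefHandler = resolve

    def resolve(context, base, system_id, public_id):
        # Called for &tme; and &rmt;. Parse the entity's content with a
        # child parser so its text and markup join the event stream.
        child = parser.ExternalEntityParserCreate(context)
        install(child)
        child.Parse(files[system_id], True)
        return 1  # a nonzero return value tells expat to continue

    install(parser)
    parser.Parse(doc, True)
    return events


events = parse_with_entities(DOC, ENTITY_FILES)
for event in events:
    print(event)
```

Each external entity starts with a text declaration, which the child parser consumes before reporting the entity's content. The sketch does not handle an external entity that itself refers to another external entity.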

You can process this document at the command line to expand the entities using a tool like rxp or xmllint. Here's an example for xmllint using the --noent switch, which, despite its name, turns entity substitution on so that entity references are replaced by their replacement text:

xmllint --noent entity.xml

xmllint will yield the output shown in Example 2-18, provided that you have a connection to the Internet at runtime.

Example 2-18. xmllint output from entity.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE time [
<!ENTITY tm "59">
<!ENTITY tme SYSTEM "tm.ent">
<!ENTITY rmt SYSTEM "http://www.wyeast.net/rmt.ent">
]>
<!-- a time instant -->
<time timezone="PST">
 <hour>11</hour>
 <minute>59</minute>
 <second>59
</second>
 <meridiem>p.m.</meridiem>
 <atomic signal="true"/>
</time>

The entity references are gone, replaced by the declared replacement text, including the empty element tag on line 14. Microsoft Internet Explorer (IE) can process these entities as well. Figure 2-25 shows entity.xml as displayed in IE.

Figure 2-25. entity.xml in IE


2.16.1 Unparsed Entities and Notations

XML also supports unparsed entities. The XML specification states that:

"An unparsed entity is a resource whose contents may or may not be text, and if text, may be other than XML. Each unparsed entity has an associated notation, identified by name. Beyond a requirement that an XML processor make the identifiers for the entity and notation available to the application, XML places no constraints on the contents of unparsed entities." (See http://www.w3.org/TR/2004/REC-xml-20040204/#dt-unparsed.)

Unparsed entities are declared in DTDs together with notations. A notation identifies, by name, the format of an unparsed entity. To see how unparsed entities and notations work together, see [Hack #68].
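As a rough sketch of how a processor makes these identifiers available to an application, the following Python snippet uses the standard xml.parsers.expat module to report a notation and an unparsed entity as they are declared. Everything in the sample DTD is hypothetical and chosen only for illustration: the notation name gif, its system identifier image/gif, the entity name clock, and the file clock.gif do not come from this hack. The helper name collect_declarations is also an assumption.

```python
import xml.parsers.expat

# Hypothetical document: one notation and one unparsed (NDATA) entity.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE time [
<!NOTATION gif SYSTEM "image/gif">
<!ENTITY clock SYSTEM "clock.gif" NDATA gif>
]>
<time/>"""


def collect_declarations(doc):
    """Return (notations, unparsed entities) declared in doc's DTD."""
    notations = {}
    unparsed = {}
    parser = xml.parsers.expat.ParserCreate()

    def on_notation(name, base, system_id, public_id):
        notations[name] = system_id

    def on_entity(name, is_parameter, value, base,
                  system_id, public_id, notation_name):
        # Only unparsed entities carry a notation name.
        if notation_name is not None:
            unparsed[name] = (system_id, notation_name)

    parser.NotationDeclHandler = on_notation
    parser.EntityDeclHandler = on_entity
    parser.Parse(doc, True)
    return notations, unparsed


notations, unparsed = collect_declarations(DOC)
print(notations)
print(unparsed)
```

The parser never opens clock.gif. It only hands the entity name, the system identifier, and the notation name to the application, which decides what to do with them.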