Let me tackle that question by sorting the kinds of problems for which you would use XML.
Just about every software application needs to store some data. There are look-up tables, work files, preference settings, and so on. XML makes it very easy to do this. Say, for example, you've created a calendar program and you need a way to store holidays. You could hardcode them, of course, but that's kind of a hassle since you'd have to recompile the program if you need to add to the list. So you decide to save this data in a separate file using XML. Example 1-4 shows how it might look.
<caldata>
  <holiday type="international">
    <name>New Year's Day</name>
    <date><month>January</month><day>1</day></date>
  </holiday>
  <holiday type="personal">
    <name>Erik's birthday</name>
    <date><month>April</month><day>23</day></date>
  </holiday>
  <holiday type="national">
    <name>Independence Day</name>
    <date><month>July</month><day>4</day></date>
  </holiday>
  <holiday type="religious">
    <name>Christmas</name>
    <date><month>December</month><day>25</day></date>
  </holiday>
</caldata>
Now all your program needs to do is read in the XML file and convert the markup into some convenient data structure using an XML parser. This software component reads and digests XML into a more usable form. There are lots of libraries that will do this, as well as standalone programs. Outputting XML is just as easy as reading it. Again, there are modules and libraries people have written that you can incorporate in any program.
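As a rough illustration of that round trip, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The function names load_holidays and dump_holidays are made up for this example, and the data is a shortened copy of the holiday file; a real calendar program would likely read from a file rather than a string.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the holiday data (two of the four entries).
CALDATA = """<caldata>
  <holiday type="international">
    <name>New Year's Day</name>
    <date><month>January</month><day>1</day></date>
  </holiday>
  <holiday type="national">
    <name>Independence Day</name>
    <date><month>July</month><day>4</day></date>
  </holiday>
</caldata>"""


def load_holidays(xml_text):
    """Parse the markup and return a list of plain dictionaries."""
    root = ET.fromstring(xml_text)
    holidays = []
    for holiday in root.findall("holiday"):
        holidays.append({
            "type": holiday.get("type"),
            "name": holiday.findtext("name"),
            "month": holiday.findtext("date/month"),
            "day": int(holiday.findtext("date/day")),
        })
    return holidays


def dump_holidays(holidays):
    """Rebuild the element tree from the dictionaries and serialize it."""
    root = ET.Element("caldata")
    for h in holidays:
        holiday = ET.SubElement(root, "holiday", type=h["type"])
        ET.SubElement(holiday, "name").text = h["name"]
        date = ET.SubElement(holiday, "date")
        ET.SubElement(date, "month").text = h["month"]
        ET.SubElement(date, "day").text = str(h["day"])
    return ET.tostring(root, encoding="unicode")


holidays = load_holidays(CALDATA)
# Writing the data out and reading it back should give the same records.
round_trip = load_holidays(dump_holidays(holidays))
```

The dictionaries are only one possible in-memory form; the point is that the parser does the tokenizing, and the program deals with names and values.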
XML is a very good choice for storing data in many cases. It's easy to parse and write, and it's open for users to edit themselves. Parsers have mechanisms to verify syntax and completeness, so you can protect your program from corrupted data. XML works best for small data files or for data that is not meant to be searched randomly. A novel is a good example of a document that is not randomly accessed (unless you are one of those people who peek at the ending of a novel before finishing), whereas a telephone directory is randomly accessed and therefore may not be the best choice to put in a single, enormous XML document.
If you want to store huge amounts of data and need to retrieve it quickly, you probably don't want to use XML. It's a sequential storage medium, meaning that any search would have to go through most of the document. A database program like Oracle or MySQL would scale much better, caching frequently used data and using indexes to zero in on records with lightning speed.
I mentioned before that a large class of XML documents are narrative, meaning they are for human consumption. But we don't expect people to actually read text with XML markup. Rather, the XML must be processed to put the data in a presentable form. XML has a number of strategies and tools for turning the unappealing mishmash of marked-up plain text into eye-pleasing views suitable for web pages, magazines, or whatever you like.
Most XML markup languages focus on the task of how to organize information semantically. That is, they describe the data for what it is, not in terms of how it should look. Example 1-2 encodes a mathematical equation, but it does not look like something you'd write on a blackboard or see in a textbook. How you get from the raw data to the finished product is called formatting.
There are a number of different strategies for formatting. The simplest is to apply a Cascading Style Sheet (CSS) to it. This is a separate document (not itself XML) that contains mappings from element names to presentation details (font style, color, margins, and so on). A formatting XML processor, such as a web browser, reads the XML data file and the stylesheet, then produces a formatted page by applying the stylesheet's instructions to each element. Example 1-5 shows a typical example of a CSS stylesheet.
telegram {
  display: block;
  background-color: tan;
  color: black;
  font-family: monospace;
  padding: 1em;
}
message {
  display: block;
  margin: .5em;
  padding: .5em;
  border: thin solid brown;
  background-color: wheat;
  white-space: normal;
}
to:before {
  display: block;
  color: black;
  content: "To: ";
}
from:before {
  display: block;
  color: black;
  content: "From: ";
}
subject:before {
  color: black;
  content: "Subject: ";
}
to, from, subject {
  display: block;
  color: blue;
  font-size: large;
}
emphasis {
  font-style: italic;
}
name {
  font-weight: bold;
}
villain {
  color: red;
  font-weight: bold;
}
To apply this stylesheet, you need to add a special instruction to the source document. It looks like this:
<?xml-stylesheet type="text/css" href="ex2_memo.css"?>
This is a processing instruction, not an element. It will be ignored by any XML processing software that doesn't handle CSS stylesheets.
To see the result, you can open the document in a web browser that accepts XML and can format with CSS. Figure 1-1 shows a screenshot of how it looks in Safari version 1.0 for Mac OS X.
CSS is limited to cases where the output text will be in the same order as the input data. It would not be so useful if you wanted to show only an excerpt of the data, or if you wanted it to appear in a different order from the data. For example, suppose you collected a lot of phone numbers in an XML file and then wanted to generate a telephone directory from that. With CSS, there is no way to sort the listings in alphabetical order, so you'd have to do the sorting in the XML file first.
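Sorting the XML file first can be done with a few lines of code. The following is a minimal sketch in Python; the phonebook and entry element names and the phone numbers are invented for illustration, since the text does not show a phone list format.

```python
import xml.etree.ElementTree as ET

# Hypothetical phone list; element names and numbers are made up.
PHONEBOOK = """<phonebook>
  <entry><name>Timeslip, Colonel</name><phone>555-0142</phone></entry>
  <entry><name>Bellum, Sarah</name><phone>555-0117</phone></entry>
  <entry><name>Riceway, Indigo</name><phone>555-0199</phone></entry>
</phonebook>"""


def sort_entries(xml_text):
    """Return the root element with its entries in alphabetical order."""
    root = ET.fromstring(xml_text)
    # Slice assignment replaces the children in place with the sorted list.
    root[:] = sorted(root, key=lambda entry: entry.findtext("name"))
    return root


sorted_root = sort_entries(PHONEBOOK)
sorted_names = [entry.findtext("name") for entry in sorted_root]
```

The sorted tree could then be written back out and styled with CSS, since the output order now matches the input order.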
A more powerful technique is to transform the XML. Transformation is a process that breaks apart an XML document and builds a new one. The new document may or may not use the same markup language (in fact, XML is only one option; you can transform XML into any kind of text). With transformation, you can sort elements, throw out parts you don't want, and even generate new data such as headers and footers for pages. Transformation in XML is typically done with the language XSLT, essentially a programming language optimized for transforming XML. It requires a transformation instruction which happens to be called a stylesheet (not to be confused with a CSS stylesheet). The process looks like the diagram in Figure 1-2.
A popular use of transformations is to change a non-presentation XML data file into a format that combines data with presentational information. Typically, this format will throw away semantic information in favor of device-specific and highly presentational descriptions. For example, elements that distinguish between filenames and emphasized text would be replaced with tags that turn on italic formatting. Once you lose the semantic information, it is much harder to transform the document back to the original data-specific format. That is okay, because what we get from presentational formats is the ability to render a pleasing view on screen or printed page.
There are many presentational formats. Public domain varieties include the venerable troff, which dates back to the first Unix system, and TeX, which is still popular in universities. Adobe's PostScript and PDF and Microsoft's Rich Text Format (RTF) are also good candidates for presentational formats. There are even some XML formats that can be included in this domain. XHTML is rather generic and presentational for narrative documents. SVG, a graphics description language, is another format you could transform to from a more semantic language.
Example 1-6 shows an XSLT stylesheet that changes any telegram document into HTML. Notice that XSLT is itself an XML application, using namespaces (an XML syntax for grouping elements by adding a name prefix) to distinguish between XSLT commands and the markup to be output. For every element type in the source document's markup language, there is a corresponding rule in the stylesheet describing how to handle it. I don't expect you to understand this code right now. There is a whole chapter on XSLT (Chapter 7) after which it will make more sense to you.
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <xsl:template match="telegram">
    <html>
      <head><title>telegram</title></head>
      <body>
        <div style="background-color: wheat; padding: 1em; ">
          <h1>telegram</h1>
          <xsl:apply-templates/>
        </div>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="from">
    <h2><xsl:text>from: </xsl:text><xsl:apply-templates/></h2>
  </xsl:template>

  <xsl:template match="to">
    <h2><xsl:text>to: </xsl:text><xsl:apply-templates/></h2>
  </xsl:template>

  <xsl:template match="subject">
    <h2><xsl:text>subj: </xsl:text><xsl:apply-templates/></h2>
  </xsl:template>

  <xsl:template match="message">
    <blockquote>
      <font style="font-family: monospace">
        <xsl:apply-templates/>
      </font>
    </blockquote>
  </xsl:template>

  <xsl:template match="emphasis">
    <i><xsl:apply-templates/></i>
  </xsl:template>

  <xsl:template match="name">
    <font color="blue"><xsl:apply-templates/></font>
  </xsl:template>

  <xsl:template match="villain">
    <font color="red"><xsl:apply-templates/></font>
  </xsl:template>

  <xsl:template match="graphic">
    <img width="100">
      <xsl:attribute name="src">
        <xsl:value-of select="@fileref"/>
      </xsl:attribute>
    </img>
  </xsl:template>

</xsl:transform>
When applied against the document in Example 1-1, this script produces the following HTML. Figure 1-3 shows how it looks in a browser.
<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
<title>telegram</title>
</head>
<body><div style="background-color: wheat; padding: 1em; ">
<h1>telegram</h1>
<h2>to: Sarah Bellum</h2>
<h2>from: Colonel Timeslip</h2>
<h2>subj: Robot-sitting instructions</h2>
<blockquote><font style="font-family: monospace">Thanks for watching my robot pal
<font color="blue">Zonky</font> while I'm away. He needs to be recharged
<i>twice a day</i> and if he starts to get cranky, give him a quart of oil.
I'll be back soon, after I've tracked down that evil mastermind
<font color="red">Dr. Indigo Riceway</font>.
</font></blockquote>
</div></body>
</html>
Transforming XML into HTML is fine for online viewing. It is not so good for print media, however. HTML was never designed to handle the complex formatting of printed documents, with headers and footers, multiple columns, and page breaks. For that, you would want to transform into a richer format such as PDF. A direct transformation into PDF is not so easy to do, however. It requires extensive knowledge of the PDF specification, which is huge and difficult, and much of the content of a PDF file is compressed.
A better solution is to transform your XML into an intermediate format, one that is generic and easy for humans to understand. This is XSL-FO, the style language for formatting objects. A formatting object is an abstract representation for a portion of a formatted page. You use XSLT to map elements to formatting objects, and an XSL formatter turns the formatting objects into pages, paragraphs, graphics, and other presentational components. The process is illustrated in Figure 1-4.
The source document on the left is first transformed, using an XSLT stylesheet and an XSLT processor, into a formatting object tree. This intermediate file is then fed into the XSL formatter, which processes it into a presentational format, such as PDF. The beauty of this system is that it is modular. You can use any compliant XSLT processor and XSL formatter. You don't need to know anything about the presentational format because XSL is so generic, describing layout and style attributes in the most declarative form. I will describe XSL in more detail in Chapter 8.
Finally, if stylesheets do not fit the bill, which may be the case if your source data is just too raw for direct transformation, then you may find a programming solution to be to your liking. Although XSLT has much to offer in transformation, it tends to be rather weak in some areas, such as processing character data. I often find that, despite my best efforts to stay inside the XSLT paradigm, I sometimes have to resort to writing a program that preprocesses my XML data before a transformation. Or I may have to write a program that does the whole processing from source to presentational format. That option is always available, and we will see it in detail in Chapter 10.
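To give a feel for the programming route, here is a minimal sketch in Python that turns a small telegram into HTML without any stylesheet. It is not the approach developed in Chapter 10, only an illustration: the function names, the INLINE mapping, and the choice of HTML tags are assumptions made for this example, and the telegram is a shortened stand-in for the one in Example 1-1.

```python
import xml.etree.ElementTree as ET
from html import escape

# Shortened stand-in for the telegram document.
TELEGRAM = """<telegram>
  <to>Sarah Bellum</to>
  <from>Colonel Timeslip</from>
  <subject>Robot-sitting instructions</subject>
  <message>Thanks for watching <name>Zonky</name>. Recharge him <emphasis>twice a day</emphasis>.</message>
</telegram>"""

# Hypothetical mapping from inline elements to presentational HTML tags.
INLINE = {"emphasis": "i", "name": "b", "villain": "b"}


def render_inline(elem):
    """Convert mixed content (text plus inline elements) to HTML."""
    parts = [escape(elem.text or "", quote=False)]
    for child in elem:
        tag = INLINE.get(child.tag, "span")
        parts.append(f"<{tag}>{render_inline(child)}</{tag}>")
        # Text that follows a child element is stored in its tail.
        parts.append(escape(child.tail or "", quote=False))
    return "".join(parts)


def telegram_to_html(xml_text):
    """Build a small HTML fragment from a telegram document."""
    root = ET.fromstring(xml_text)
    lines = ["<div>"]
    for field, label in (("to", "to"), ("from", "from"), ("subject", "subj")):
        value = escape(root.findtext(field, default=""), quote=False)
        lines.append(f"<h2>{label}: {value}</h2>")
    message = root.find("message")
    if message is not None:
        lines.append(f"<blockquote>{render_inline(message)}</blockquote>")
    lines.append("</div>")
    return "\n".join(lines)


html_page = telegram_to_html(TELEGRAM)
```

A program like this has full access to string handling, which is where it can do things that are awkward in XSLT, at the cost of hard-coding rules that a stylesheet would state declaratively.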
Trust is important for data: trust that it hasn't been corrupted, truncated, mistyped, or left incomplete. Broken documents can confuse software, format as gibberish, and result in erroneous calculations. Documents submitted for publishing need to be complete and use only the markup that you specify. Transmitting and converting documents always entails risk that some information may be lost.
XML gives you the ability to guarantee a minimal level of trust in data. There are several mechanisms. First, there is well-formedness. Every XML parser is required to report syntax errors in markup. Missing tags, malformed tags, illegal characters, and other problems should be immediately reported to you. Consider this simple document with a few errors in it:
<announcement>
  <TEXT>Hello, world! I'm using XML & it's a lot of fun.</Text>
</anouncement>
When I run an XML well-formedness checker on it, here is what I get:
> xwf t.xml t.xml:2: error: xmlParseEntityRef: no name <TEXT>Hello, world! I'm using XML & it's a lot of fun.</Text> ^ t.xml:2: error: Opening and ending tag mismatch: TEXT and Text <TEXT>Hello, world! I'm using XML & it's a lot of fun.</Text> ^ t.xml:3: error: Opening and ending tag mismatch: announcement and anouncement </anouncement> ^
It caught two mismatched tags and an illegal character. And not only did it tell me what was wrong, it showed me where the errors were, so I can go back and correct them more easily. Checking if a document is well-formed can pick up a lot of problems:
Mismatched tags, a common occurrence if you are typing in the XML by hand. The start and end tags have to match exactly in case and spelling.
Truncated documents, which would be missing at least part of the outermost element (both start and end tags must be present).
Illegal characters, including reserved markup delimiters like <, >, and &. There is a special syntax for complex or reserved characters, which looks like &lt; for <. If any part of that is missing, the parser will get suspicious. Parsers should also warn you if characters in a particular encoding are not correctly formed, which may indicate that the document was altered in a recent transmission. For example, transferring a file through FTP as ASCII text can sometimes strip out the high bit characters.
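A well-formedness check is easy to call from a program, too. The sketch below uses Python's xml.etree.ElementTree, which is built on a non-validating parser; check_well_formed is a name made up for this example. Unlike the checker shown earlier, this parser stops at the first error instead of reporting all of them.

```python
import xml.etree.ElementTree as ET

# A copy of the broken announcement, and a corrected version of it.
BROKEN = """<announcement>
  <TEXT>Hello, world! I'm using XML & it's a lot of fun.</Text>
</anouncement>"""

FIXED = """<announcement>
  <TEXT>Hello, world! I'm using XML &amp; it's a lot of fun.</TEXT>
</announcement>"""


def check_well_formed(xml_text):
    """Return None if the text parses, else (line, column) of the first error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError carries the location where the parser gave up.
        return err.position
    return None


broken_result = check_well_formed(BROKEN)
fixed_result = check_well_formed(FIXED)
```

Here the first problem the parser meets is the bare ampersand on the second line, so that is the location it reports; the mismatched tags would surface only after that error is fixed.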
The well-formedness check has its limits. The parser doesn't know if you are using the right elements in the right places. For example, you might have an XHTML document with a p element inside the head, which is illegal. To catch this kind of problem, you need to test if the document is a valid instance of XHTML. The tool for this is a validating parser.
A validating parser works by comparing a document against a set of rules called a document model. One kind of document model is a document type definition (DTD). It declares all the elements that are allowed in a document and describes in detail what kind of elements they can contain. Example 1-7 is a small DTD for telegrams.
<!ELEMENT telegram (from,to,subject,graphic?,message)>
<!ATTLIST telegram pri CDATA #IMPLIED>
<!ELEMENT from (#PCDATA)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT graphic EMPTY>
<!ATTLIST graphic fileref CDATA #REQUIRED>
<!ELEMENT message (#PCDATA|emphasis|name|villain)*>
<!ELEMENT emphasis (#PCDATA)>
<!ELEMENT name (#PCDATA)>
Before submitting the telegram document to a parser, I need to add this line to the top:
<!DOCTYPE telegram SYSTEM "/location/of/dtd">
Where "/location..." is the path to the DTD file on my system. Now I can run a validating parser on the telegram document. Here's the output I get:
> xval ex1_memo.xml ex1_memo.xml:13: validity error: No declaration for element villain mastermind <villain>Dr. Indigo Riceway</villain>. ^ ex1_memo.xml:15: validity error: Element telegram content doesn't follow the DTD </telegram> ^
Oops! I forgot to declare the villain element, so I'm not allowed to use it in a telegram. No problem; it's easy to add new elements. This shows how you can detect problems with structure and grammar in a document.
The most important benefit to using a DTD is that it allows you to enforce and formalize a markup language. You can make your DTD public by posting it on the web, which is what organizations like the W3C do. For instance, you can look at the DTD for "strict" XHTML version 1.0 at http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd. It's a compact and portable specification, though a little dense to read.
One limitation of DTDs is that they don't do much checking of text content. You can declare an element to contain text (called PCDATA in XML), or not, and that's as far as you can go. You could not check whether an element that should be filled out is empty, or if it follows the wrong pattern. Say, for example, I wanted to make sure that the to element in the telegram isn't empty, so I have at least someone to give it to. With a DTD, there is no way to test that.
An alternative document modeling scheme provides the solution. XML Schemas provide much more detailed control over a document, including the ability to compare text with a pattern you define. Example 1-8 shows a schema that will test a telegram for completely filled-out elements.
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="telegram" type="telegramtype" />

  <xs:complexType name="telegramtype">
    <xs:sequence>
      <xs:element name="to" type="texttype" />
      <xs:element name="from" type="texttype" />
      <xs:element name="subject" type="texttype" />
      <xs:element name="graphic" type="graphictype" minOccurs="0" />
      <xs:element name="message" type="messagetype" />
    </xs:sequence>
    <xs:attribute name="pri" type="xs:token" />
  </xs:complexType>

  <xs:simpleType name="texttype">
    <xs:restriction base="xs:string">
      <xs:minLength value="1" />
    </xs:restriction>
  </xs:simpleType>

  <xs:complexType name="graphictype">
    <xs:attribute name="fileref" type="xs:anyURI" use="required" />
  </xs:complexType>

  <xs:complexType name="messagetype" mixed="true">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:element name="emphasis" type="xs:string" />
      <xs:element name="name" type="xs:string" />
      <xs:element name="villain" type="xs:string" />
    </xs:choice>
  </xs:complexType>

</xs:schema>
So there are several levels of quality assurance available in XML. You can rest assured that your data is in a good state if you've validated it.
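To make the minimum-length rule concrete, the sketch below hand-codes that one constraint in Python. It is not a schema validator (Python's standard library does not include one), and the names REQUIRED_TEXT and empty_fields are invented for this example; it only mimics the check that the to, from, and subject elements contain at least one character.

```python
import xml.etree.ElementTree as ET

# Elements that must contain at least one character of text.
REQUIRED_TEXT = ("to", "from", "subject")


def empty_fields(xml_text):
    """Return the names of required elements that are missing or empty."""
    root = ET.fromstring(xml_text)
    problems = []
    for name in REQUIRED_TEXT:
        # findtext gives None for a missing element, "" for an empty one.
        text = root.findtext(name)
        if text is None or len(text) < 1:
            problems.append(name)
    return problems


complete = ("<telegram><to>Sarah Bellum</to><from>Colonel Timeslip</from>"
            "<subject>Robot-sitting instructions</subject></telegram>")
incomplete = ("<telegram><to></to><from>Colonel Timeslip</from>"
              "<subject>Robot-sitting instructions</subject></telegram>")

complete_problems = empty_fields(complete)
incomplete_problems = empty_fields(incomplete)
```

Stating the rule in a schema instead keeps it out of the program, so every tool that validates the document applies the same test.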
XML wants to be useful to the widest possible community. Things that have limited other markup languages from worldwide acceptance have been reworked. The character set, for starters, is Unicode, which supports hundreds of scripts: Latin, Nordic, Arabic, Cyrillic, Hebrew, Chinese, Mongolian, and many more. It also has ample supplies of literary and scientific symbols. You'd be hard-pressed to think of something you can't express in XML. To be flexible, XML also supports many character encodings.
The difference between a character set and a character encoding can be a little confusing. A character set is a collection of symbols, or glyphs. For example, ASCII is a set of 128 characters: simple Roman letters, numerals, symbols, and a few device codes. A character encoding is a scheme for representing the characters numerically. All text is just a string of numbers that tell a program what symbols to render on screen. An encoding may be as simple as mapping each byte to a unique glyph. Sometimes the number of characters is so large that a different scheme is required.
For example, UTF-8 is an encoding for the Unicode character set. It uses an ingenious algorithm to represent the most common characters in one byte, some less common ones in two bytes, rarer ones in three bytes, and the rest in four. Because the one-byte range coincides with ASCII, the many files that contain only ASCII text are already valid UTF-8, and most UTF-8 documents are compatible with most older, 1-byte character processing software.
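The variable-length scheme is easy to observe from a program. This short Python sketch encodes four sample characters, chosen here only for illustration, and records how many bytes each one takes in UTF-8.

```python
# Sample characters, written as escapes so the source stays plain ASCII:
# "A" (U+0041), c with cedilla (U+00E7), euro sign (U+20AC),
# and the musical G clef (U+1D11E).
SAMPLES = ["A", "\u00e7", "\u20ac", "\U0001d11e"]


def utf8_length(character):
    """Return the number of bytes UTF-8 uses for one character."""
    return len(character.encode("utf-8"))


lengths = [utf8_length(ch) for ch in SAMPLES]

# Text that is pure ASCII produces identical bytes in both encodings.
ascii_compatible = "Hello, world!".encode("ascii") == "Hello, world!".encode("utf-8")
```

The four samples take one, two, three, and four bytes respectively, and the final comparison shows why plain ASCII files need no conversion.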
There are many other encodings, such as UTF-16 and ISO-8859-1. You can specify the character encoding you want to use in the XML prologue like this:
<?xml version="1.0" encoding="iso-8859-1"?>
This goes at the very top of an XML document so it can prepare the XML parser for the text to follow. The encoding parameter and, in fact, the whole prologue, is optional. Without an explicit encoding parameter, the XML processor will assume you want UTF-8 or UTF-16, depending on the first few bytes of the file.
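The effect of the encoding parameter can be seen by parsing the same word stored two ways. In this Python sketch, the word element and its content are made up for illustration; one copy is ISO-8859-1 bytes with an explicit declaration, the other is UTF-8 bytes with no prologue at all.

```python
import xml.etree.ElementTree as ET

# The same document text, written with an escape for c with cedilla.
declared = '<?xml version="1.0" encoding="iso-8859-1"?>\n<word>gar\u00e7on</word>'
undeclared = "<word>gar\u00e7on</word>"

# Store one copy as ISO-8859-1 bytes and the other as UTF-8 bytes.
latin1_bytes = declared.encode("iso-8859-1")
utf8_bytes = undeclared.encode("utf-8")

# The parser uses the declaration for the first and its default for the second.
word_from_latin1 = ET.fromstring(latin1_bytes).text
word_from_utf8 = ET.fromstring(utf8_bytes).text
```

The bytes on disk differ (one byte for the cedilla in ISO-8859-1, two in UTF-8), but both parse to the same text.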
It is inconvenient to insert exotic characters from a common terminal. XML provides a shorthand, called character references. If you want a letter "c" with a cedilla (ç), you can express it numerically like this: &#231; (decimal) or &#xE7; (hexadecimal), both of which use the position of the character in Unicode as an identifier.
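A parser resolves both numeric forms to the same character. The Python sketch below parses a made-up word element that uses the decimal and hexadecimal forms side by side, then serializes it again with the standard library's default ASCII output.

```python
import xml.etree.ElementTree as ET

# One word written twice: decimal reference, then hexadecimal reference.
elem = ET.fromstring("<word>gar&#231;on gar&#xE7;on</word>")

# After parsing, the references are ordinary characters in the text.
decoded = elem.text

# The number in the reference is the character's Unicode position.
code_point = ord("\u00e7")

# Serializing to ASCII bytes turns the character back into a reference.
ascii_bytes = ET.tostring(elem)
```

Because the serializer falls back to numeric references for anything outside the output encoding, a document can stay pure ASCII and still carry any Unicode character.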
Often, there may be one or more translations of a document. You can keep them all together using XML's built-in support for language qualifiers. In this piece of XML, two versions of the same text are kept together for convenience, differentiated by labels:
<para xml:lang="en">There is an answer.</para>
<para xml:lang="de">Es gibt eine Antwort.</para>
This same system can even be used with dialects within a language. In this case, both are English, but from different locales:
<para xml:lang="en-US">Consult the program.</para>
<para xml:lang="en-GB">Consult the programme.</para>
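A program can use these labels to pick the version it wants. The following Python sketch is one way to do it; the doc wrapper element and the paragraphs_for function are assumptions made for this example. The xml prefix is bound to a fixed namespace, which is why the attribute is looked up under the long name in XML_LANG.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical wrapper document holding several language versions.
DOC = """<doc>
  <para xml:lang="en-US">Consult the program.</para>
  <para xml:lang="en-GB">Consult the programme.</para>
  <para xml:lang="de">Es gibt eine Antwort.</para>
</doc>"""


def paragraphs_for(xml_text, wanted):
    """Return paragraph texts labeled with the wanted language or a dialect of it."""
    wanted = wanted.lower()
    result = []
    for para in ET.fromstring(xml_text).iter("para"):
        lang = (para.get(XML_LANG) or "").lower()
        # "en" should match "en", "en-US", and "en-GB", but not "eng".
        if lang == wanted or lang.startswith(wanted + "-"):
            result.append(para.text)
    return result


british = paragraphs_for(DOC, "en-GB")
any_english = paragraphs_for(DOC, "en")
german = paragraphs_for(DOC, "de")
```

Asking for "en" returns both English paragraphs, while asking for "en-GB" narrows the result to the British spelling.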