A.1 Comparing HTML and XML

HTML and XML are siblings: they are both children of Standard Generalized Markup Language (SGML). Therefore, an XML document looks somewhat like an HTML page. Example A-1 is an XML document that represents a basic address book.

Example A-1. Simple address book

    <person id="1">
        <!--Rasmus Lerdorf-->
    </person>
    <person id="2">
        <!--Zeev Suraski-->
        <city>Tel Aviv</city>
    </person>

A.1.1 Similarities

Even though Example A-1 is XML, it looks similar to HTML. There are opening and closing tags, and the elements they delimit can contain text, other elements, or both. Elements can also have attributes, such as id="1" for the person element.

XML also uses the same syntax as HTML for comments (<!--Rasmus Lerdorf-->) and entities (&lt;). Like HTML, XML restricts where you can use literal ampersands, greater- and less-than signs, and quotation marks.

A.1.2 Differences

There are a few differences between HTML and XML. First, an XML element must have both an opening and closing tag. To represent an element without text (also known as an empty element), place a closing slash at the end of the tag: <img src="php.png" />. You can also have completely blank elements. For instance, since Zeev Suraski lives in Israel, he doesn't have a U.S. state, so that element is completely empty: <state/>.
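The closing-tag rule is easy to check with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module (the choice of language and the is_well_formed helper are assumptions made for illustration; the text does not prescribe either). It shows that an HTML-style unclosed tag is rejected, and that <state/> and <state></state> parse to the same empty element.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the string parses as XML, False otherwise.

    Hypothetical helper for this illustration only.
    """
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# An HTML-style <img> with no closing slash leaves the element unclosed.
unclosed = is_well_formed('<p><img src="php.png"></p>')

# The same element with a closing slash is a valid empty element.
closed = is_well_formed('<p><img src="php.png" /></p>')

# The short and long forms of an empty element are equivalent:
# neither has text, and neither has child elements.
short_form = ET.fromstring("<state/>")
long_form = ET.fromstring("<state></state>")
same = (short_form.text, len(short_form)) == (long_form.text, len(long_form))

print(unclosed, closed, same)
```

The helper only tests well-formedness; it says nothing about whether a document follows a particular schema.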

Attributes must have either double or single quotes around their values. You cannot have <person id=1>.
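As a quick illustration of the quoting rule, here is a sketch that again assumes Python's standard xml.etree.ElementTree module. Both quoting styles yield the same attribute value, while the unquoted form raises a parse error.

```python
import xml.etree.ElementTree as ET

# Double and single quotes are both acceptable around attribute values.
quoted_forms = ['<person id="1"/>', "<person id='1'/>"]
ids = [ET.fromstring(markup).get("id") for markup in quoted_forms]

# An unquoted value is not well-formed XML, so the parser refuses it.
try:
    ET.fromstring("<person id=1/>")
    unquoted_error = None
except ET.ParseError as exc:
    unquoted_error = str(exc)

print(ids)
print(unquoted_error is not None)
```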

Element names are also case-sensitive. <email> and <EMAIL> are not identical. To circumvent this restriction, some XML processors have a case-folding setting. When case folding is enabled, all element and attribute names are converted to the same case before processing.
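The sketch below, assuming Python's standard xml.etree.ElementTree module, shows case sensitivity in practice. ElementTree has no case-folding switch, so the fold_case function is a hypothetical stand-in that uppercases names after parsing; it illustrates the idea only and is not how any particular processor implements the setting. The email addresses are made-up sample data.

```python
import xml.etree.ElementTree as ET

# <email> and <EMAIL> are two different elements to an XML parser.
record = ET.fromstring(
    "<person>"
    "<email>first@example.com</email>"
    "<EMAIL>second@example.com</EMAIL>"
    "</person>"
)
lower = record.find("email").text
upper = record.find("EMAIL").text

# Opening and closing tags must match in case as well.
try:
    ET.fromstring("<email>first@example.com</EMAIL>")
    mismatch_ok = True
except ET.ParseError:
    mismatch_ok = False


def fold_case(element):
    """Uppercase every element and attribute name in the tree (hypothetical helper)."""
    for node in element.iter():
        node.tag = node.tag.upper()
        folded = {name.upper(): value for name, value in node.attrib.items()}
        node.attrib.clear()
        node.attrib.update(folded)
    return element


# After folding, both elements answer to the same name.
folded_record = fold_case(record)
emails = [node.text for node in folded_record.findall("EMAIL")]

print(lower, upper, mismatch_ok, emails)
```

Folding after parsing cannot rescue a document whose tags are mismatched in case, because the parser rejects that document before any folding happens.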

XML also has a few features that HTML doesn't, including XML declarations, processing instructions, and CDATA sections.

A.1.2.1 XML declarations

An XML declaration specifies the XML version and document encoding settings for the file. For instance:

<?xml version="1.0" encoding="UTF-8" ?>

This example tells the XML processor that the file is an XML Version 1.0 file and its contents are encoded using UTF-8, a form of Unicode.

An XML declaration appears only at the start of an XML file and is optional. The default values are those used in the example: Version 1.0 of XML and UTF-8 Unicode encoding.
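To see why the declared encoding matters, consider this sketch, which assumes Python's standard xml.etree.ElementTree module (built on the Expat parser rather than libxml2) and a made-up city element. The same text is parsed correctly from ISO-8859-1 bytes when the declaration names that encoding, and from UTF-8 bytes with no declaration at all. ISO-8859-1 bytes with no declaration fail, because the parser falls back to UTF-8.

```python
import xml.etree.ElementTree as ET

# ISO-8859-1 bytes, with a declaration that says so.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><city>Zürich</city>'
).encode("iso-8859-1")

# UTF-8 bytes need no declaration: UTF-8 is the default.
utf8_doc = "<city>Zürich</city>".encode("utf-8")

from_latin1 = ET.fromstring(latin1_doc).text
from_utf8 = ET.fromstring(utf8_doc).text

# ISO-8859-1 bytes without a declaration are read as UTF-8 and rejected.
try:
    ET.fromstring("<city>Zürich</city>".encode("iso-8859-1"))
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(from_latin1 == from_utf8, undeclared_ok)
```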

Internally, libxml2 stores all content as UTF-8. While libxml2 has native support for only a few encodings, it can optionally hook into your system's iconv library to expand the number of encodings it can use. Since encoding support varies widely from system to system, see http://www.xmlsoft.org/encoding.html for more information on how libxml2 handles this issue.

A.1.2.2 Processing instructions

The XML declaration shares its syntax with a more general feature known as processing instructions (PIs for short). When you want to pass information to your XML processor, use a PI. For example:

<title><?pdf font="Gill Sans" ?>Upgrading to PHP 5</title>

This tells a PDF-formatting script to set the font to Gill Sans.

PIs have this standard syntax:

<?target data ... ?>

The XML declaration shown earlier has a target of xml, and its data is version="1.0" encoding="UTF-8".
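A parser that exposes PIs hands you the target and the data as two separate strings. As a rough sketch, assuming Python's standard xml.dom.minidom module, here is how the title example breaks down:

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<title><?pdf font="Gill Sans" ?>Upgrading to PHP 5</title>'
)
title = doc.documentElement

# The PI is the first child of <title>; the text node follows it.
pi = title.firstChild
is_pi = pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE

target = pi.target       # the word right after "<?"
data = pi.data.strip()   # everything else, up to "?>"
text = title.lastChild.data

print(is_pi, target, data, text)
```

The data is an opaque string as far as the parser is concerned. Even though it looks like an attribute here, it is up to the application that recognizes the pdf target to interpret it.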

It is uncommon to find an XML document that uses a PI whose target doesn't begin with xml.

A.1.2.3 Character Data sections

One difficulty with using HTML and XML is that you're always forced to escape your data, for example by calling htmlentities( ). XML has a way to indicate that your text should be treated as literal text: a block called a CDATA section, short for Character Data. It's written like this:

<![CDATA[<img src="logo.png" alt="A & P">]]>

You don't need to encode < and & inside a CDATA block, but the block cannot contain the sequence ]]>, because that sequence ends it. The awkward syntax comes from XML's ancestor, SGML.
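The following sketch, assuming Python's standard xml.etree.ElementTree and xml.sax.saxutils modules, shows that escaping and CDATA are two routes to the same parsed text. The snippet element and the wrap_cdata helper are hypothetical names used only for this illustration. The helper uses a common workaround for the ]]> restriction: it splits the sequence across two adjacent CDATA blocks.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = '<img src="logo.png" alt="A & P">'

# Route 1: escape the markup characters as entities.
escaped_doc = "<snippet>" + escape(raw) + "</snippet>"

# Route 2: leave the text alone and wrap it in a CDATA block.
cdata_doc = "<snippet><![CDATA[" + raw + "]]></snippet>"

from_escaped = ET.fromstring(escaped_doc).text
from_cdata = ET.fromstring(cdata_doc).text


def wrap_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any "]]>" across two blocks (hypothetical helper)."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Text that contains the forbidden sequence still survives a round trip.
tricky = "a ]]> b"
round_trip = ET.fromstring("<snippet>" + wrap_cdata(tricky) + "</snippet>").text

print(from_escaped == raw, from_cdata == raw, round_trip == tricky)
```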