Hack 69 Validate an XML Document with XML Schema


XML Schema is the W3C evolution of the DTD. It is complex but powerful, in wide use but not always popular. This hack will help you start writing schemas in this format.

XML Schema is a recommendation of the W3C, written in three parts. Part 0 is a nice little primer (http://www.w3.org/TR/xmlschema-0/) that gets you started with the language. Part 1 describes the structures of XML Schema (http://www.w3.org/TR/xmlschema-1/); it is a long spec (about 200 pages when printed) and is rather complex. Part 2 defines datatypes (http://www.w3.org/TR/xmlschema-2/) and has been more gladly received than Part 1, though it is considered by some to be ad hoc and not without anomalies.

XML mensch James Clark (http://www.jclark.com) has said of Part 1 that "it is without doubt the hardest to understand specification that I have ever read" (http://www.imc.org/ietf-xml-use/mail-archive/msg00217.html). Many others who have read the spec, or have attempted to read it, heartily agree with James. This is unfortunate, as it has placed many schema writers and companies in the uncomfortable position of using and supporting a difficult spec from the W3C, a widely accepted (though not always highly regarded) source. Happily, there are alternatives such as RELAX NG [Hack #72], and tools such as Trang (http://www.thaiopensource.com/relaxng/trang.html) that can conveniently convert RELAX NG to XML Schema [Hack #76].

5.3.1 A Quick Introduction to XML Schema

We'll start out by taking a look at the schema time.xsd, which was introduced but not explored in depth in [Hack #14]. It is displayed in Example 5-6.

Example 5-6. time.xsd
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="time">
 <xs:complexType>
  <xs:sequence>
   <xs:element name="hour" type="xs:string"/>
   <xs:element name="minute" type="xs:string"/>
   <xs:element name="second" type="xs:string"/>
   <xs:element name="meridiem" type="xs:string"/>
   <xs:element name="atomic">
    <xs:complexType>
      <xs:attribute name="signal" type="xs:string" use="required"/>
    </xs:complexType>
   </xs:element>
  </xs:sequence>
  <xs:attribute name="timezone" type="xs:string" use="required"/>
 </xs:complexType>
</xs:element>

</xs:schema>

The document element of an instance of XML Schema is always schema (line 2). The namespace name is http://www.w3.org/2001/XMLSchema and the common prefix for the namespace is xs: (also line 2). Starting on line 4, the element time is declared. This is called a global element declaration (http://www.w3.org/TR/xmlschema-0/#Globals); because it is the only global declaration in the schema, the schema will anticipate that time will be the top-level or document element in an instance.

The complexType element (line 5) indicates that the time element has a complex type; that is, it can have attributes and element child content (http://www.w3.org/TR/xmlschema-0/#DefnDeclars). Contrariwise, elements with simple types cannot have attributes or element children. I think this terminology makes things harder to grasp than is necessary, but that's the way it is in XML Schema.

On lines 6 through 16, the sequence element specifies the order in which elements must appear in an instance. So the element declarations (lines 7 through 15) for the elements hour, minute, second, meridiem, and atomic, must appear in that order. The element names are given in the name attributes of the element, and all but the atomic element will have a string datatype (http://www.w3.org/TR/xmlschema-2/#string), as indicated by the type attribute.

Starting on line 11 is the declaration for the atomic element, which is different from the others. It is considered an anonymous type definition (http://www.w3.org/TR/xmlschema-0/#InlineTypDefn) because it is a complex type declaration without a name (that is, there is no name attribute on the complexType element start tag). The definition for time (starting on line 4) also is an anonymous type definition. atomic has a signal attribute (declared in the attribute element on line 13) whose type is string, and is required (hence the use attribute with a value of required).

Finally, on line 17, the required timezone attribute is declared. This declaration, way down near the bottom of the schema, applies to the time element. Its type is string, and it is also required.
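For orientation, an instance that this schema would accept might look like the following. This is only a sketch: the element and attribute names come from the declarations just described, but the values (PST, 11, 59, and so on) are assumptions made for illustration and need not match the time.xml in the file archive.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical instance; all values are illustrative assumptions -->
<time timezone="PST">
 <!-- Children must appear in the order given by the sequence -->
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
 <!-- atomic declares no child content, only the required signal attribute -->
 <atomic signal="true"/>
</time>
```

Reordering the children, or dropping the timezone or signal attribute, should cause a validator to reject the document.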

Next, you need to become acquainted with the named complex type structure in XML Schema, as well as simple types. These structures can be named and reused. Example 5-7 shows a new version of our previous schema, complex.xsd, using these complex types and two derived simple types.

Example 5-7. complex.xsd
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="time" type="Time"/>

<xs:complexType name="Time">
  <xs:sequence>
   <xs:element ref="hour"/>
   <xs:element ref="minute"/>
   <xs:element ref="second"/>
   <xs:element ref="meridiem"/>
   <xs:element name="atomic" type="Atomic"/>
  </xs:sequence>
   <xs:attribute name="timezone" type="xs:string" use="required"/>
</xs:complexType>

<xs:element name="hour" type="Digits"/>
<xs:element name="minute" type="Digits"/>
<xs:element name="second" type="Digits"/>
<xs:element name="meridiem" type="AmPm"/>

<xs:complexType name="Atomic">
  <xs:attribute name="signal" type="xs:string" use="required"/>
</xs:complexType>

<xs:simpleType name="Digits">
 <xs:restriction base="xs:string">
  <xs:pattern value="\d\d"/>
 </xs:restriction>
</xs:simpleType>

<xs:simpleType name="AmPm">
 <xs:restriction base="xs:string">
  <xs:enumeration value="a.m."/>
  <xs:enumeration value="p.m."/>
 </xs:restriction>
</xs:simpleType>

</xs:schema>

When the time element is declared on line 4, rather than using a built-in type, its type is set to be the complex type named Time, which starts on line 6 (you could use time instead of Time as the name and it would not conflict with the name time used in an element declaration). Note the ref attributes on lines 8 through 11, which refer to element declarations on lines 17 through 20 (this is superfluous, but serves to illustrate how ref works). On line 12, the element atomic is of type Atomic, a complex type that contains only an attribute declaration (line 22).

The element declarations on lines 17, 18, and 19 are of type Digits, a simple type (line 26) that is a restriction of a string. The pattern facet element (line 28) restricts the content to two digits with the regular expression \d\d (http://www.w3.org/TR/xmlschema-0/#regexAppendix). The meridiem element is of type AmPm, an enumeration (http://www.w3.org/TR/xmlschema-0/#CreatDt) that can contain either of the values a.m. or p.m. (see line 33).
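To see what the Digits and AmPm types buy you, consider an instance that complex.xsd should reject. The document below is a made-up example (its values are assumptions chosen to break the rules), with comments marking which content violates which simple type.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical instance that complex.xsd should reject -->
<time timezone="PST">
 <hour>9</hour>            <!-- invalid: Digits requires exactly two digits -->
 <minute>05</minute>       <!-- valid: two digits -->
 <second>5.0</second>      <!-- invalid: "." is not a digit -->
 <meridiem>PM</meridiem>   <!-- invalid: not one of a.m. or p.m. -->
 <atomic signal="true"/>
</time>
```

Because every element in time.xsd is a plain string, that schema would accept the same document; only the derived simple types in complex.xsd catch these values.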

5.3.2 Validation with XML Schema Tools

Now let's validate time.xml against time.xsd or complex.xsd. There are a number of tools readily available to do this. We'll use three here: an online XSD Schema Validator, available from Got Dot Net (http://www.gotdotnet.com), and the command-line validators xmllint (http://www.xmlsoft.org) and xsv (http://www.ltg.ed.ac.uk/~ht/xsv-status.html).

XSD Schema Validator

In a web browser, go to http://apps.gotdotnet.com/xmltools/xsdvalidator/ (Figure 5-1). Click the Browse button next to the first text box, and the File Upload dialog box appears. Select time.xsd or complex.xsd from the working directory where the file archive was extracted, then click Open. Again, click the Browse button next to the third text box. Select time.xml in the File Upload dialog, and then click Open. Having selected both files, click the Submit button. Upon success, the browser will display the message "Validated OK!" and display the validated file. By selecting one or the other file alone, you can also use this service to check only an instance of an XML Schema for validity or only an XML document for well-formedness.

Figure 5-1. Got Dot Net's XSD Schema Validator in Firefox

5.3.3 xmllint

The command-line tool xmllint was discussed and demonstrated in [Hack #9]. To use this tool to validate against XML Schema, all you need to do is use the --schema option. With xmllint installed and in the path, enter the command:

xmllint --schema time.xsd time.xml

or:

xmllint --schema complex.xsd time.xml

When validation succeeds, xmllint displays the validated instance without reporting any errors. You can submit one or more XML instances at the end of the command line for validation.

5.3.4 xsv

xsv is an XML Schema validator that is available both online and as a command-line tool (http://www.ltg.ed.ac.uk/~ht/xsv-status.html). It was developed by Henry S. Thompson and Richard Tobin of the University of Edinburgh. It is available for the Windows platform (ftp://ftp.cogsci.ed.ac.uk/pub/XSV/XSV26.EXE), in Python as an RPM (RPM Package Manager) package (ftp://ftp.cogsci.ed.ac.uk/pub/XSV/XSV-2.6-2.noarch.rpm), or as a tar ball (ftp://ftp.cogsci.ed.ac.uk/pub/XSV/XSV-2.6.tar.gz).

We will use only the command-line version of this tool. To use the online version of this validator, go to http://www.w3.org/2001/03/webdata/xsv.

Once xsv is installed and in your path, you can use it to validate time.xml with time.xsd by typing:

xsv time.xml time.xsd

or:

xsv time.xml complex.xsd

By default, xsv reports its validation results with an XML document, as shown here (for time.xsd):

<?xml version='1.0'?>
<xsv xmlns="http://www.w3.org/2000/05/xsv" docElt="{None}time"
     instanceAssessed="true" instanceErrors="0" rootType="[Anonymous]"
     schemaDocs="time.xsd" schemaErrors="0"
     target="file:///C:/Hacks/examples/time.xml" validation="strict"
     version="XSV 2.6-2 of 2004/02/04 11:33:42">
  <schemaDocAttempt URI="file:///C:/Hacks/examples/time.xsd" outcome="success"
     source="command line"/>
</xsv>


In the file archive there is a stylesheet that transforms this result into HTML; it's called xsv.xsl. To put it to work, use xsv with the -o switch for the output file and the -s switch for the XSLT stylesheet:

xsv -o xsvresult.xml -s xsv.xsl time.xml time.xsd

The -s switch places an XML stylesheet PI in the resulting file. You can then display the file in a browser that supports client-side XSLT, and it will be transformed as shown in Figure 5-2.
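A stylesheet PI is an ordinary xml-stylesheet processing instruction that sits before the document element. The exact pseudo-attributes that xsv writes are not shown in this hack, so treat the following as an assumption about the general shape of xsvresult.xml rather than as xsv's literal output:

```xml
<?xml version='1.0'?>
<!-- Assumed form of the PI; the actual output of xsv may differ in detail -->
<?xml-stylesheet type="text/xsl" href="xsv.xsl"?>
<xsv xmlns="http://www.w3.org/2000/05/xsv">
  <!-- validation results, as in the report shown earlier -->
</xsv>
```

A browser that supports client-side XSLT reads the href, fetches the stylesheet from the same location as the result file, and renders the transformed output instead of the raw XML.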

Figure 5-2. The transformed result of xsv validation in Firefox

5.3.5 Other XML Schema Features

Here are some additional interesting features from XML Schema.

choice, group, all

These three elements help you to construct content models. choice allows one of its children to appear in an instance, literally a choice of two or more options. group collects declarations into a single unit. all allows each of its child elements to appear once or not at all, in any order (http://www.w3.org/TR/xmlschema-0/#groups).


annotation

The annotation element, with its children appinfo and documentation, can annotate and document a schema or provide information about an application (http://www.w3.org/TR/xmlschema-0/#CommVers).

include, import

With the include element, you can include other schemas as part of a schema definition. The import element allows you to borrow definitions from other namespaces (see http://www.w3.org/TR/xmlschema-0/#IPO and http://www.w3.org/TR/xmlschema-0/#import).


restriction, extension

With the restriction and extension elements, it is possible to create new types by deriving from existing types. You can, for example, add additional elements to an existing complex type or restrict or change facets in a simple type (see http://www.w3.org/TR/xmlschema-0/#DerivExt and http://www.w3.org/TR/xmlschema-0/#DerivByRestrict).

Matching any name

You can match any element name with any and any attribute name with anyAttribute wildcards (http://www.w3.org/TR/xmlschema-0/#any).

Fixed and default values for elements

XML 1.0 gave us fixed and default values for attributes, and XML Schema extends that capability to elements by using the fixed and default attributes on the element declaration (http://www.w3.org/TR/xmlschema-0/#OccurenceConstraints).

List and union types

A list type in XML Schema allows you to define a whitespace-separated list of values in an attribute or element. Union types allow values that are of any one of a number of simple types, such as integer and string (http://www.w3.org/TR/xmlschema-0/#ListDt and http://www.w3.org/TR/xmlschema-0/#UnionDt).

Substitution groups

You can substitute one element for another. You can also create abstract elements. An abstract element can be the head of a substitution group, so that other elements can take its place, but it cannot be used within an XML document itself (http://www.w3.org/TR/xmlschema-0/#SubsGroups).


redefine

With the redefine element, you can redefine simple types, complex types, attribute groups, etc., from external schema files (http://www.w3.org/TR/xmlschema-0/#Redefine).

Identity constraints

You can use ID, IDREF, and IDREFS to constrain the identity of elements and attributes in XML Schema. You can also constrain values within a scope so that they are unique (unique), unique and present (key), or refer to a unique or key constraint (keyref) (http://www.w3.org/TR/xmlschema-0/#specifyingUniqueness and http://www.w3.org/TR/xmlschema-0/#specifyingKeys&theirRefs).

Nil values

This feature (xsi:nil in an instance and nillable="true" on an element declaration) allows you to give meaning to a nil value for an element (http://www.w3.org/TR/xmlschema-0/#Nils).
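Several of these features are easier to recognize once you have seen them in a schema. The sketch below is not part of the file archive; every element and type name in it (clock, HourList, IntegerOrAuto, and so on) is invented for illustration. It combines an annotation, a choice, an element default, a nillable element, a list type, a union type, and a unique constraint.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical schema; every name here is invented for illustration -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:annotation>
 <xs:documentation>Sketch of several XML Schema features.</xs:documentation>
</xs:annotation>

<xs:element name="clock">
 <xs:complexType>
  <xs:sequence>
   <!-- choice: exactly one of these two children may appear -->
   <xs:choice>
    <xs:element name="analog" type="xs:string"/>
    <xs:element name="digital" type="xs:string"/>
   </xs:choice>
   <!-- default: used when the element is present but empty -->
   <xs:element name="format" type="xs:string" default="12-hour"/>
   <!-- nillable: an instance may carry xsi:nil="true" on battery -->
   <xs:element name="battery" type="xs:integer" nillable="true"/>
   <xs:element name="alarms" type="HourList"/>
   <xs:element name="offset" type="IntegerOrAuto"/>
   <xs:element name="reading" maxOccurs="unbounded">
    <xs:complexType>
     <xs:attribute name="id" type="xs:string" use="required"/>
    </xs:complexType>
   </xs:element>
  </xs:sequence>
 </xs:complexType>
 <!-- unique: no two reading children of one clock may share an id -->
 <xs:unique name="uniqueReading">
  <xs:selector xpath="reading"/>
  <xs:field xpath="@id"/>
 </xs:unique>
</xs:element>

<!-- list type: whitespace-separated integers, such as "6 7 22" -->
<xs:simpleType name="HourList">
 <xs:list itemType="xs:integer"/>
</xs:simpleType>

<!-- union type: either an integer or the literal string auto -->
<xs:simpleType name="IntegerOrAuto">
 <xs:union memberTypes="xs:integer AutoKeyword"/>
</xs:simpleType>

<xs:simpleType name="AutoKeyword">
 <xs:restriction base="xs:string">
  <xs:enumeration value="auto"/>
 </xs:restriction>
</xs:simpleType>

</xs:schema>
```

You can check a schema like this on its own with any of the validators described earlier before writing instances against it.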

5.3.6 See Also

  • DecisionSoft's online schema validator: http://tools.decisionsoft.com/schemaValidate.html

  • XML Schema, by Eric van der Vlist (O'Reilly)

  • Definitive XML Schema, by Priscilla Walmsley (Prentice Hall PTR)

  • xframe schema-based programming project: http://xframe.sourceforge.net/xframe.html

  • xframe xsddoc documentation toolkit for XML Schema: http://xframe.sourceforge.net/xsddoc.html