4.5 Schematron

Schematron takes a different approach from the schema languages we've seen so far. Rather than being prescriptive, as in "this element has the following content model," it relies on a series of Boolean tests. Depending on the result of a test, the schema will output some predetermined message.

The tests are based on XPath, which is a very granular and exhaustive set of node examination tools. Relying on XPath is clever, taking much of the complexity out of the schema language. XPath, which is used in places such as XSLT and some implementations of DOM, can scratch an itch that blunter tools like DTDs can't reach. As Rick Jelliffe, the creator of Schematron, says, it's like "a feather duster for the furthest corners of a room where the vacuum cleaner (DTD) cannot reach."

4.5.1 Overview

The basic structure of a Schematron schema is this:

<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern>
    <rule context="XPath Expression">
      <assert test="XPath Expression">
        message
      </assert>
      <report test="XPath Expression">
        message
      </report>
      ...more tests...
    </rule>
    ...more rules...
  </pattern>
  ...more patterns...
</schema>

A pattern in Schematron does not carry the same meaning as patterns in RELAX NG. Here, it's just a logical grouping of rules. If your schema is testing books, one pattern may hold rules for chapters while another groups rules for appendixes. So think of this as more of a higher-level, conceptual testing pattern, rather than as a specific node-matching pattern.

The context for each test is determined by a rule. Its context attribute contains an XSLT pattern that matches nodes. Each node found becomes the context node, on which all tests inside the rule are applied.
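
To make the idea of a context pattern more concrete, here is a minimal sketch of a few rules for a book schema. The element names (chapter, appendix, para), the role attribute, and the messages are all made up for illustration; only the rule, assert, and report elements come from Schematron itself.

```xml
<!-- Hypothetical contexts for a book schema; all element and
     attribute names here are illustrative assumptions. -->
<pattern name="Context examples">
  <!-- Matches every chapter element anywhere in the document. -->
  <rule context="chapter">
    <assert test="title">A chapter needs a title.</assert>
  </rule>
  <!-- Matches only title elements whose parent is an appendix. -->
  <rule context="appendix/title">
    <report test="string-length(.) = 0">Empty appendix title.</report>
  </rule>
  <!-- Matches only para elements that carry a role attribute. -->
  <rule context="para[@role]">
    <assert test="@role = 'note' or @role = 'warning'">Unknown role.</assert>
  </rule>
</pattern>
```

In each case the tests are evaluated relative to the matched node, so test="title" in the first rule looks for a title child of the chapter that was matched.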

The children of a rule, report and assert, each apply a test to the context node. The test is another XPath expression, stored in a test attribute. report's contents will be output if its XPath expression evaluates to "true." assert is just the opposite, outputting its contents if its test evaluates to "false."

XPath expressions are very good at describing XML nodes and reasonably good at matching text patterns. Here's how you might test an email address:

<rule context="email">
  <p>Found an email address...</p>
  <assert test="contains(.,'@')">Error: no @ in email</assert>
  <assert test="contains(.,'.')">Error: no dot in email</assert>
  <report test="string-length(.) &gt; 20">Warning: email is unusually long</report>
</rule>
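
To see how these tests play out, consider the following hypothetical input. The contacts wrapper and the addresses are invented for this sketch, and the comments assume the length check is evaluated with XPath's string-length() function.

```xml
<!-- Hypothetical input; the contacts wrapper element is an assumption. -->
<contacts>
  <!-- Both asserts pass and the length test is false: no messages. -->
  <email>ann@example.org</email>
  <!-- Fails both asserts: "no @ in email" and "no dot in email". -->
  <email>bob</email>
  <!-- Passes both asserts, but is longer than 20 characters, so the
       "unusually long" warning is reported. -->
  <email>a.very.long.name@mail.example.org</email>
</contacts>
```

Note that the tests only look for characters anywhere in the string, so a value such as "@." would pass both asserts. A stricter check would need more elaborate XPath.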

To summarize, running a Schematron validator on a document works like this. First, parse the document to build a document tree in memory. Then, for each rule, obtain a context node using its XPath locator expression. For each assert or report in the rule, evaluate the XPath expression for a Boolean value, and conditionally output text. The idea is that whenever something is found that is not right with the document, the Schematron processor should output a message to that effect. You can think of Schematron as a language for generating validation reports.

One interesting feature of Schematron is that its documentation is a part of the language itself. Rather than rely on comments or the namespace hack from RELAX NG, this language explicitly defines elements and attributes to hold commentary. The root element, schema, has an optional child title to name the schema, and pattern elements have a name attribute for identifying rule groups. A Schematron validator will use that attribute to label each pattern of testing in its output. There is also a set of tags for formatting text, borrowed from HTML, such as p and span.

Let's look at an example. Below is a schema to test a report document. There are two kinds of reports we allow: one with a body and another with a set of at least three sections.

<schema xmlns="http://www.ascc.net/xml/schematron">
  <title>Test: Report Document Validity</title>

  <pattern name="Type 1">
    <p>Type 1 reports should have a title and a body.</p>
    <rule context="/">
      <assert test="report">Wrong root element. This isn't a report.</assert>
    </rule>
    <rule context="report">
      <assert test="title">Darn! It's missing a title.</assert>
      <report test="title">Yup, found a title.</report>
      <assert test="body">Yikes! It's missing a body.</assert>
      <report test="body">Yup, found a body.</report>
    </rule>
  </pattern>

  <pattern name="Type 2">
    <p>Type 2 reports should have a title and <em>at least
      three</em> sections.</p>
    <rule context="/">
      <assert test="report">Wrong root element. This isn't a report.</assert>
    </rule>
    <rule context="report">
      <assert test="title">Darn! It's missing a title.</assert>
      <report test="title">Yup, found a title.</report>
      <assert test="count(section)&gt;2">There are not enough section
        elements in this report.</assert>
      <report test="count(section)&gt;2">Plenty of sections, so I'm
        happy.</report>
    </rule>
  </pattern>
</schema>

Now, let's run the Schematron validator on this document:

<report>
  <title>A ridiculous report</title>
  <body>
    <para>Here's a paragraph.</para>
    <para>Here's a paragraph.</para>
  </body>
</report>

I used a version of Schematron that outputs its report in HTML form. Figure 4-1 shows how it looks in my browser.

Figure 4-1. A Schematron report
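
For comparison, here is a sketch of a document of the second kind. Its title and section content are invented. Assuming the validator runs every pattern against the document, the "Type 2" pattern should report a title and plenty of sections, while the "Type 1" pattern should complain that the body is missing.

```xml
<!-- Hypothetical Type 2 report; the content is made up. -->
<report>
  <title>A less ridiculous report</title>
  <section><para>First section.</para></section>
  <section><para>Second section.</para></section>
  <section><para>Third section.</para></section>
</report>
```

Because the schema never chooses between the two patterns, a document of either kind will always draw a complaint from the other pattern. It is up to the reader of the report to decide which pattern applies.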

4.5.2 Abstract Rules

An abstract rule allows you to reuse rules when they are likely to appear often in the schema. The syntax is the same, with the additional attribute abstract set to yes and an id with some unique value. Another rule will reference the id with a rule attribute in an extends child element. See the following example.

<rule id="inline" abstract="yes">
  <report test="*">Error! Element inside inline.</report>
  <assert test="text()">Strange, there's no text inside this inline.</assert>
</rule>
<rule context="bold">
  <extends rule="inline"/>
</rule>
<rule context="emphasis">
  <extends rule="inline"/>
</rule>
<rule context="quote">
  <extends rule="inline"/>
</rule>
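
As far as I understand Schematron 1.5, a rule that extends an abstract rule can also carry tests of its own. The sketch below assumes a hypothetical link element with an href attribute; neither name appears in the example above.

```xml
<!-- Hypothetical link element: inherits the tests from the abstract
     "inline" rule and adds one test of its own. -->
<rule context="link">
  <extends rule="inline"/>
  <assert test="@href">A link needs an href attribute.</assert>
</rule>
```

This keeps the shared checks in one place while leaving room for element-specific ones.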