Hack 78 Use RELAX NG to Generate DTD Customizations

RELAX NG enables you to create a customized subset or extension of a DTD much more easily than doing it the old-fashioned way.

You may find at some point in your XML work, especially if you're working with a large, complex XML vocabulary such as DocBook (http://www.docbook.org), that you don't use or want the full set of elements that the vocabulary provides, and would much rather work with a smaller, custom subset that excludes the elements you don't need.

If you're working with a RELAX NG schema, you'll be happy to find out that making a subset of it is a relatively trivial task, especially if you're using a RELAX NG compact syntax (RNC) schema instead of a RELAX NG XML-syntax (RNG) schema. You basically just need to create a file in which you include (by reference) the schema you want to subset, and then list out all the elements you want to exclude, giving the notAllowed pattern for the content of the element.

On the other hand, if you're working with a DTD, you'll find that trying to create a subset of it the old-fashioned way (by making a DTD customization layer or by editing the DTD directly) can be a much more complicated and time-consuming task.

The good news is that you don't need to do it the old-fashioned way because you can use some existing free software to quickly and easily generate a custom DTD: Norm Walsh's Perl script flatten.pl (http://cvs.sourceforge.net/viewcvs.py/*checkout*/docbook/cvstools/flatten), James Clark's Trang and Jing (http://www.thaiopensource.com/relaxng/), and David Tolpin's XSLT stylesheet incelim.xsl (http://ftp.davidashen.net/incelim/).

The problem with trying to subset a DTD the old-fashioned way is that it requires you to:

  • Search through the documentation for the DTD (or through the DTD itself) to identify all the parameter entities and/or parent-element content models that contain the elements you want to exclude.

  • Redefine all content models and parameter entities that contain any of the elements you want to exclude.

That can be much more of a hassle than it sounds, and those are just the steps that you need to take to make a subset. If you want to add any elements, it can end up being a lot more complicated.

The solution to all that customization hassle is to instead use a RELAX NG compact-syntax schema along with a couple of very clever tools. This enables you to generate a subset of a DTD using a process that:

  • Does not require you to know or care what parent elements contain the elements you want to exclude.

  • Does not require you to redefine the content of any of the parent elements that contain the elements you want to exclude.

The first part of the process, getting some tools installed and doing a couple of preliminary steps, is the only part that takes any real amount of time, and even then it doesn't really take long. The remaining steps, creating the customization file to describe your subset and generating your custom DTD, take very little time at all.

5.12.1 Generating an RNC Schema

To make creating your DTD subset as easy as possible, you need to have a flattened RELAX NG compact-syntax (RNC) version of the DTD you want to subset. This section describes a simple process you can use to do that. (If you already have an RNC version of your DTD, you can just skip this section and go to the next.) The process requires a few tools that you'll need to download, but they're all relatively easy to install and use.

Flattening your DTD

To ensure that everything works as expected, you need to generate a flattened version of your DTD source; i.e., a standalone version of the DTD in which all entity references to external files have been replaced with the contents of the referenced files. (If you're not working with a modular DTD, that is, if your DTD does not use entity references to include other files, you can skip this section and go on to the next.)

To flatten your DTD source:

  1. Locate Norm Walsh's Perl script flatten.pl in the working directory of examples, or check for a later version (download it and place it in the working directory).

  2. Run the following command:

    perl flatten.pl docbook.dtd > docbook-flat.dtd

    flatten.pl generates the file docbook-flat.dtd in the same directory as your docbook.dtd file.

Generating an RNC schema from your flattened DTD

To create an RNC schema from a flattened DTD:

  1. Make sure Trang and Jing are available (http://www.thaiopensource.com/download/).

  2. With trang.jar in your current directory (or substituting the full path to it, because java -jar ignores the classpath), run the following command:

    java -jar trang.jar docbook-flat.dtd docbook.rng

    Trang generates docbook.rng in the same directory as your docbook-flat.dtd file.

  3. Run the following:

    java -jar trang.jar docbook.rng docbook.rnc

    Trang generates docbook.rnc in the same directory as your docbook.rng file.

The reason for the two-step process of first creating an RNG version and then an RNC version is that you'll need to have both versions around to get things to work smoothly. You need the RNG version in order to be able to convert back to DTD syntax, and you could actually do everything you need to do using only the RNG version. But having a version in the RNC syntax, which is designed for ease of authoring and readability, makes the customization process easier.
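The commands in this section can also be strung together in a small driver script. What follows is only a minimal sketch, not part of the original toolchain: it assumes that perl and java are on your PATH, that flatten.pl and trang.jar sit in the working directory, and the function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def flatten_command(dtd: str) -> list[str]:
    """Command that prints a flattened copy of the DTD on stdout."""
    return ["perl", "flatten.pl", dtd]


def trang_command(source: str, target: str) -> list[str]:
    """Command that asks Trang to convert source into target.

    Trang infers the input and output formats from the file extensions.
    """
    return ["java", "-jar", "trang.jar", source, target]


def generate_rnc(dtd: str = "docbook.dtd") -> None:
    """Run flatten.pl, then Trang twice: DTD -> RNG -> RNC.

    Output files are written to the current working directory.
    """
    stem = Path(dtd).stem                  # "docbook" for "docbook.dtd"
    flat = f"{stem}-flat.dtd"
    with open(flat, "wb") as out:          # same effect as "> docbook-flat.dtd"
        subprocess.run(flatten_command(dtd), stdout=out, check=True)
    subprocess.run(trang_command(flat, f"{stem}.rng"), check=True)
    subprocess.run(trang_command(f"{stem}.rng", f"{stem}.rnc"), check=True)
```

Calling generate_rnc("docbook.dtd") runs the three steps in order and stops at the first one that fails, because check=True raises an exception on a nonzero exit status.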

5.12.2 Creating an RNC Schema Customization File

The elegance and simplicity of the design of RELAX NG and its compact syntax make the process of creating your actual RNC customization file the quickest and easiest part of this whole solution.

To make your RNC schema customization file, simply create a custom.rnc file similar to Example 5-16, replacing docbook.rnc with your own RNC file and replacing the element names on the left of the equals signs with the names of the elements you want to exclude.

Example 5-16. RNC customization file custom.rnc
include "docbook.rnc" {
  confdates = notAllowed
  confgroup = notAllowed
  confnum = notAllowed
  confsponsor = notAllowed
  conftitle = notAllowed
  contractnum = notAllowed
  contractsponsor = notAllowed
  msg = notAllowed
  msgaud = notAllowed
  msgentry = notAllowed
  msgexplan = notAllowed
  msginfo = notAllowed
  msglevel = notAllowed
  msgmain = notAllowed
  msgorig = notAllowed
  msgrel = notAllowed
  msgset = notAllowed
  msgsub = notAllowed
  msgtext = notAllowed
  simplemsgentry = notAllowed
}

That's it. Really.
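If your list of excluded elements is long, or you keep it somewhere else, you could generate the customization file instead of typing it. The following is a rough sketch under that assumption; make_customization is a hypothetical helper, not something Trang or RELAX NG provides.

```python
def make_customization(base_schema: str, excluded: list[str]) -> str:
    """Return RNC text that includes base_schema and disallows each name."""
    lines = [f'include "{base_schema}" {{']
    # One override per excluded element, in the order given.
    lines += [f"  {name} = notAllowed" for name in excluded]
    lines.append("}")
    return "\n".join(lines) + "\n"


# Print a small customization for three of the message elements.
print(make_customization("docbook.rnc", ["msgset", "msgentry", "simplemsgentry"]))
```

Saving the returned text as custom.rnc gives you a file of the same shape as Example 5-16.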

5.12.3 Compiling Your Customization File

James Clark's Trang is an extremely powerful tool for converting among various schema types: from RNG, RNC, DTD, or XML to RNG, RNC, DTD, or XSD. One of Trang's current limitations is that it can't convert RELAX NG schemas to DTDs if those schemas include other schemas by reference and also override element definitions from those referenced schemas. What that means is that Trang, on its own, can't convert your RNC customization file to a DTD.

This is where the next tool in the chain comes in: David Tolpin's incelim.xsl stylesheet. The name of the stylesheet describes what it does: it resolves any includes in RELAX NG schemas by literally inserting the contents of them into its output, and then it eliminates from its output all definitions of elements that are overridden in a customization file.

It is basically doing something very similar to what Norm Walsh's flatten.pl tool does for DTDs, but Tolpin prefers to describe the process as creating a "compiled" version of the schema. In his own words:

incelim is a RELAX NG splicer. It takes a RELAX NG grammar in XML syntax, expands all includes and externalRefs, and optionally replaces references to text, empty, or notAllowed with the patterns. The result is a "compiled" schema convenient for distribution, as well as for consumption by tools which do not yet support include and externalRef.

Note the part that says it "takes a RELAX NG grammar in XML syntax." That means you'll first need to transform your RNC customization file into RNG syntax before using incelim.xsl.
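To make the "include, then eliminate" idea concrete, here is a deliberately simplified sketch of it. It is not incelim.xsl and does not reproduce its behavior in full: it handles only top-level include elements and define overrides, ignores nested includes, externalRef, and start overrides, and the splice and load_schema names are invented for this illustration.

```python
import xml.etree.ElementTree as ET

RNG = "{http://relaxng.org/ns/structure/1.0}"


def splice(custom_xml, load_schema):
    """Expand each <include> and let its <define> overrides win.

    load_schema is a callable that returns the XML text for an href.
    """
    custom = ET.fromstring(custom_xml)
    merged = ET.Element(RNG + "grammar")
    for child in custom:
        if child.tag != RNG + "include":
            merged.append(child)
            continue
        overrides = child.findall(RNG + "define")
        overridden = {d.get("name") for d in overrides}
        base = ET.fromstring(load_schema(child.get("href")))
        for item in base:
            # Drop base definitions that the customization overrides.
            if item.tag == RNG + "define" and item.get("name") in overridden:
                continue
            merged.append(item)
        # The overriding definitions take the place of the dropped ones.
        merged.extend(overrides)
    return merged
```

Given a customization that overrides msg with notAllowed, the merged grammar contains every definition from the base schema except the original msg, plus the overriding one.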

5.12.4 Converting Your RNC Customization File to RNG XML Syntax

Like all XSLT stylesheets, incelim.xsl needs well-formed XML as input. So, because RELAX NG compact syntax (RNC) is non-XML syntax, you'll need to convert your RNC customization file into RELAX NG XML syntax (RNG) before using incelim.xsl. This is another task for Trang.

To transform your custom.rnc RNC customization file to RNG, just run the following command:

java -jar trang.jar custom.rnc custom.rng

Trang generates custom.rng in the same directory as your custom.rnc file. It'll look something like Example 5-17.

Example 5-17. RNG customization file custom.rng
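<grammar xmlns="http://relaxng.org/ns/structure/1.0">
<include href="docbook.rng">
 <define name="beginpage">
  <notAllowed/>
 </define>
 <define name="confdates">
  <notAllowed/>
 </define>
 <define name="confgroup">
  <notAllowed/>
 </define>
 <define name="confnum">
  <notAllowed/>
 </define>
 <define name="confsponsor">
  <notAllowed/>
 </define>
 <define name="conftitle">
  <notAllowed/>
 </define>
 <!-- remaining definitions follow the same pattern -->
</include>
</grammar>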

Compare Example 5-17 to Example 5-16 and you'll see why most people prefer to create and edit RELAX NG using the RNC syntax, and just convert that to RNG syntax when they need to.

5.12.5 Using incelim.xsl to Compile Your RNG Customization File

To compile your RNG customization file with incelim.xsl, you'll need to use an XSLT engine: either Michael Kay's Saxon or Daniel Veillard's xsltproc. You can download Saxon from http://saxon.sourceforge.net; if you are running Cygwin or a Linux distribution that includes libxslt (the library, built on libxml2, that xsltproc ships with), you already have xsltproc.

Make sure to use the latest version of xsltproc, because xsltproc compiled against libxml v20604, libxslt v10102, and libexslt v802 and earlier versions cannot be used with incelim.xsl. This is due to a bug in the implementation of exsl:node-set() in those earlier versions.
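If you want to check your xsltproc before running the stylesheet, something like the following sketch may help. It assumes that the output of xsltproc --version contains text of the form "libxml N, libxslt N and libexslt N" and reads the first such triple; it also takes the conservative reading that a build is suspect if any one of the three numbers is at or below those listed above. The function names are hypothetical.

```python
import re

# The versions named in the text as the last ones known not to work.
BROKEN = {"libxml": 20604, "libxslt": 10102, "libexslt": 802}


def parse_versions(version_output: str) -> dict[str, int]:
    """Extract the three library version numbers from the version text."""
    found = re.search(r"libxml (\d+), libxslt (\d+) and libexslt (\d+)", version_output)
    if found is None:
        raise ValueError("unrecognized xsltproc version output")
    return dict(zip(BROKEN, map(int, found.groups())))


def looks_too_old(version_output: str) -> bool:
    """True if any library is at or below the versions listed as broken."""
    versions = parse_versions(version_output)
    return any(versions[name] <= BROKEN[name] for name in BROKEN)
```

You would pass it the captured output of xsltproc --version and upgrade if it returns True.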

To use incelim.xsl:

  1. incelim.xsl and other stylesheets are in a subdirectory (incelim) of the file archive and should already be in your working directory. If you want the latest version, download incelim.xsl again and place it in the incelim subdirectory. It's actually a set of stylesheets with the incelim.xsl stylesheet just acting as a core file, so make sure to keep all the files in the same directory.

  2. Run the following command, using whichever XSLT engine (Saxon or xsltproc) you prefer, and replacing the pathname with an appropriate path. In Saxon:

    java -jar saxon7.jar custom.rng incelim/incelim.xsl > 

    or in xsltproc:

    xsltproc incelim/incelim.xsl custom.rng > custom-compiled.rng

    Either command generates a custom-compiled.rng file in the same directory as your custom.rng file. If you look at the contents of that file, you'll see that it represents the complete contents of your original DTD/schema, minus all the elements you excluded in your customization file.

5.12.6 Generating Your DTD Subset

The final step, getting your customization file back into DTD syntax, is another easy one. Just run the following command:

java -jar trang.jar custom-compiled.rng custom.dtd

Trang generates a custom.dtd in the same directory as your custom-compiled.rng file.
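Because you are likely to repeat the last three conversions every time you change custom.rnc, it can be handy to wrap them in one script. This is only a sketch under stated assumptions: it uses the xsltproc variant of the incelim step, expects trang.jar and the incelim subdirectory in the working directory, and the function names are invented for illustration.

```python
import subprocess


def trang(source: str, target: str) -> list[str]:
    """Trang command converting source to target (formats come from extensions)."""
    return ["java", "-jar", "trang.jar", source, target]


def incelim(source: str) -> list[str]:
    """xsltproc command that writes the compiled schema to stdout."""
    return ["xsltproc", "incelim/incelim.xsl", source]


def build_dtd(stem: str = "custom") -> None:
    """custom.rnc -> custom.rng -> custom-compiled.rng -> custom.dtd"""
    rng, compiled, dtd = f"{stem}.rng", f"{stem}-compiled.rng", f"{stem}.dtd"
    subprocess.run(trang(f"{stem}.rnc", rng), check=True)
    with open(compiled, "wb") as out:      # same effect as "> custom-compiled.rng"
        subprocess.run(incelim(rng), stdout=out, check=True)
    subprocess.run(trang(compiled, dtd), check=True)
```

Calling build_dtd() regenerates all three files; if Trang rejects the compiled schema, the exception raised by check=True tells you which step failed.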

It's possible that Trang will fail to convert your customization and instead just emit one or more error messages similar to the following:

custom-compiled.rng:1329:error: sorry, cannot handle this kind of 


If you get an error message like that, don't panic: it probably just indicates that your customization has left behind an element that no longer has any real content because you've removed all of its possible child elements. That is, it's probably an element you wanted to remove but just overlooked.

The fix is to:

  1. Go to the part of your custom-compiled.rng file where Trang says it's having a problem (the number in the error message is a line number in the file).

  2. Identify the name of the problem element.

  3. Go back to the earlier section "Creating an RNC Schema Customization File" and add the problem element to the list of excluded elements in your customization file.

  4. Repeat the previous steps to regenerate the custom.rng, custom-compiled.rng, and custom.dtd files.
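The first two steps of that fix can be partly automated. The sketch below pulls the file name and line number out of an error message shaped like the one shown above, then looks upward from that line for the nearest define element. That upward search is a heuristic of my own, not something Trang offers, and it assumes the name attribute immediately follows the define tag name; both function names are hypothetical.

```python
import re
from typing import Optional

# Matches "file:line:error:" and tolerates an optional column number.
ERROR_LINE = re.compile(r"^(?P<file>.+?):(?P<line>\d+):(?:\d+:)?\s*error:")
DEFINE = re.compile(r'<define\s+name="([^"]+)"')


def error_location(message: str) -> tuple[str, int]:
    """Split an error message into (file name, line number)."""
    found = ERROR_LINE.match(message)
    if found is None:
        raise ValueError("not a recognized error message")
    return found.group("file"), int(found.group("line"))


def enclosing_define(schema_lines: list[str], line_number: int) -> Optional[str]:
    """Name of the nearest <define> at or above the 1-based line number."""
    for text in reversed(schema_lines[:line_number]):
        found = DEFINE.search(text)
        if found:
            return found.group(1)
    return None
```

Reading custom-compiled.rng into a list of lines and passing it along with the reported line number gives you a candidate name to add to your customization file; check it by eye before relying on it.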

Once you begin using custom.dtd, you'll see that it omits all elements you excluded in your RNC customization file.

- Michael Smith