Hack 70 Validate Multiple Documents Against an XML Schema at Once

A Xerces module allows you to validate more than one XML instance at a time against an XML Schema. This hack shows you how to use the Java class xni.XMLGrammarBuilder.

This book describes several online and command-line validators that let you check whether a document conforms to a W3C XML Schema definition. Some are faster than others, and some are more suitable for a particular platform. The special advantage of the Xerces Java xni.XMLGrammarBuilder sample application (which, being a Java program, runs on any platform) is its ability to validate multiple documents in a single run. This sample application is packaged in the xercesSamples.jar file included with the Java Xerces distribution (http://xml.apache.org/xerces2-j/), which is part of the file archive that came with the book.

If you work with XML, you've probably received an email at work that says "here's the data" and includes a ZIP file full of XML files or, worse, a bunch of files all attached individually to the email. Before doing anything with those files, you probably want to validate them to check whether the email's sender is passing along any problems to you. You could write a Perl script to generate a batch file that calls your favorite parser for each file, or you could enter the command to parse the first file, press your cursor-up key to retrieve that command, modify it, run it again, and repeat these steps for every remaining file. Or you could use the xni.XMLGrammarBuilder utility and do it all in one command. (Because the program does this by storing a compiled version of the schema in memory and then reusing it for each document instance, the integrity checks that it does while compiling make it a useful schema development tool as well; see [Hack #71].)
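The reuse described above (compile the schema once, then check every instance against the compiled form) can be sketched with the standard JAXP validation API, javax.xml.validation, that ships with the JDK. This is not the source of xni.XMLGrammarBuilder, which is built on Xerces's own XNI interfaces; the class name BatchValidate and its validateAll method are invented here for illustration, and the message layout only imitates the tool's output.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of Xerces: compile one schema, validate many documents.
public class BatchValidate {

    public static List<String> validateAll(Source schemaSource, List<Source> instances)
            throws SAXException, IOException {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        // The schema is parsed and checked once; the compiled form is reused below.
        Schema schema = factory.newSchema(schemaSource);
        final List<String> problems = new ArrayList<>();

        ErrorHandler collector = new ErrorHandler() {
            public void warning(SAXParseException e) { problems.add(format("Warning", e)); }
            public void error(SAXParseException e) { problems.add(format("Error", e)); }
            public void fatalError(SAXParseException e) throws SAXParseException {
                problems.add(format("Fatal Error", e));
                throw e; // give up on this document; the loop moves on to the next one
            }
        };

        for (Source instance : instances) {
            Validator validator = schema.newValidator(); // cheap: shares the compiled schema
            validator.setErrorHandler(collector);
            try {
                validator.validate(instance);
            } catch (SAXParseException e) {
                // Already recorded by fatalError above.
            }
        }
        return problems;
    }

    private static String format(String severity, SAXParseException e) {
        return "[" + severity + "] " + e.getSystemId() + ":" + e.getLineNumber()
                + ":" + e.getColumnNumber() + ": " + e.getMessage();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: java BatchValidate schema.xsd doc.xml ...");
            return;
        }
        List<Source> instances = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            instances.add(new StreamSource(new File(args[i])));
        }
        for (String line : validateAll(new StreamSource(new File(args[0])), instances)) {
            System.out.println(line);
        }
    }
}
```

Because the Schema object is built before the loop starts, a mistake in the schema itself surfaces immediately, before any instance document is read.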

The following listings show you two short XML documents. I won't take up space showing you the multidoc.xsd schema that they point to; take my word for it that ZZ is not one of the valid zone values and oomph is not a valid child of the para element. Here is multidoc1.xml:

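<sample zone="ZZ"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="multidoc.xsd">
  <title>Peyton Place</title>
  <para>Indian summer is like a woman.</para>
</sample>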

Here is multidoc2.xml:

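<sample zone="Z1"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="multidoc.xsd">
  <title>Moby Dick</title>
  <para>Call me Ishmael.</para>
  <para>I <oomph>alone</oomph> survived to tell the tale.</para>
</sample>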

Before executing the command that follows, make sure that your classpath includes both the xercesImpl.jar and the xercesSamples.jar files (Version 2.6.2 or later) that come with the Java Xerces distribution. You can download the Xerces distribution from http://xml.apache.org/xerces2-j/download.cgi. In the following command line, the -a switch identifies the XSD schema and -i shows the list of documents to validate:

java -cp xercesImpl.jar;xercesSamples.jar xni.XMLGrammarBuilder -a multidoc.xsd -i multidoc1.xml multidoc2.xml
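When the documents arrive as a directory full of files, typing every filename after -i gets tedious. The following is a minimal sketch that assembles the same command for every .xml file in a directory and picks the classpath separator for the current platform with File.pathSeparator. The class GrammarBuilderCommand is hypothetical, and the sketch assumes that both JAR files sit in the current directory and that java is on your path.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical launcher: builds the xni.XMLGrammarBuilder command line shown above.
public class GrammarBuilderCommand {

    public static List<String> build(String schema, List<String> instances) {
        // ";" on Windows, ":" on Unix-like systems
        String classpath = String.join(File.pathSeparator, "xercesImpl.jar", "xercesSamples.jar");
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "java", "-cp", classpath, "xni.XMLGrammarBuilder", "-a", schema, "-i"));
        cmd.addAll(instances);
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java GrammarBuilderCommand schema.xsd directory");
            return;
        }
        // Collect every .xml file in the given directory, in a stable order.
        File[] files = new File(args[1]).listFiles((dir, name) -> name.endsWith(".xml"));
        if (files == null || files.length == 0) {
            System.err.println("no .xml files found in " + args[1]);
            return;
        }
        Arrays.sort(files);
        List<String> instances = new ArrayList<>();
        for (File f : files) {
            instances.add(f.getPath());
        }
        // Run the validator and pass its output straight through to the console.
        new ProcessBuilder(build(args[0], instances)).inheritIO().start().waitFor();
    }
}
```

Passing the arguments as a list to ProcessBuilder also avoids shell quoting problems with filenames that contain spaces.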

On Unix-like systems, separate the JAR filenames with a colon (:) instead of a semicolon. xni.XMLGrammarBuilder then lists each document's problems:

[Error] multidoc1.xml:3:54: cvc-enumeration-valid: Value 'ZZ' is
not facet-valid with respect to enumeration '[Z1, Z2, Z3, Z4, Z5,
Z6]'. It must be a value from the enumeration.
[Error] multidoc1.xml:3:54: cvc-attribute.3: The value 'ZZ' of
attribute 'zone' on element 'sample' is not valid with respect to
its type, 'zoneCodes'.
[Error] multidoc2.xml:6:18: cvc-complex-type.2.4.a: Invalid content
was found starting with element 'oomph'. One of '{"":emph}' is
expected.

The error in multidoc1.xml generated two error messages, and the error in multidoc2.xml generated one, each with information about the location and nature of the error.
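With dozens of files, a per-file tally of those messages tells you quickly which documents need attention. The sketch below assumes that each message occupies a single line of the form [Severity] file:line:column: text, a shape inferred from the sample output above rather than from any documented contract (the listing above is wrapped only to fit the page). The class name ErrorTally is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts validator messages per file.
public class ErrorTally {

    // Assumed line shape: [Severity] file:line:column: message
    private static final Pattern LINE =
            Pattern.compile("^\\[([^\\]]+)\\] (.+):(\\d+):(\\d+): (.*)$");

    public static Map<String, Integer> countByFile(List<String> outputLines) {
        Map<String, Integer> counts = new LinkedHashMap<>(); // keeps first-seen order
        for (String line : outputLines) {
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                counts.merge(m.group(2), 1, Integer::sum); // group 2 is the file name
            }
            // Lines that do not match the assumed shape are ignored.
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        for (Map.Entry<String, Integer> e : countByFile(lines).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

If the messages have been saved to a file (say, a hypothetical errors.txt), run java ErrorTally < errors.txt to print one count per document.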

Entering the following line with no parameters gives you an overview of xni.XMLGrammarBuilder's command-line options.

java -cp xercesImpl.jar;xercesSamples.jar xni.XMLGrammarBuilder

These options are shown here:

usage: java xni.XMLGrammarBuilder [-p config_file] -d uri ...
| [-f|-F] -a uri ... [-i uri ...]

  -p config_file:   configuration to use for instance validation
  -d    grammars to preparse are DTD external subsets
  -f  | -F    Turn on/off Schema full checking (default off)
  -a uri ...  Provide a list of schema documents
  -i uri ...  Provide a list of instance documents to validate

NOTE:  both -d and -a cannot be specified!

See the samples directory and documentation that accompanies Xerces Java for more detailed documentation on xni.XMLGrammarBuilder.

-Bob DuCharme