You want to ensure that the XML you're processing conforms to a DTD or XML Schema.
To validate against a DTD, use the XML::LibXML module:
use XML::LibXML;

my $parser = XML::LibXML->new;
$parser->validation(1);
$parser->parse_file($FILENAME);
To validate against a W3C Schema, use the XML::Xerces module:
use XML::Xerces;

my $parser = XML::Xerces::DOMParser->new;
$parser->setValidationScheme($XML::Xerces::DOMParser::Val_Always);
my $error_handler = XML::Xerces::PerlErrorHandler->new( );
$parser->setErrorHandler($error_handler);
$parser->parse($FILENAME);
The libxml2 library, upon which XML::LibXML is based, can validate as it parses. The validation method on the parser enables this option. At the time of this writing, XML::LibXML could validate only with DOM parsing; validation is not available with SAX-style parsing.
Example 22-7 is a DTD for the books.xml file in Example 22-1.
<!ELEMENT books (book*)>
<!ELEMENT book (title,edition,authors,isbn)>
<!ELEMENT authors (author*)>
<!ELEMENT author (firstname,lastname)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT edition (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT isbn (#PCDATA)>
<!ATTLIST book
  id CDATA #REQUIRED
>
To make XML::LibXML parse the DTD, add this line to the books.xml file:
<!DOCTYPE books SYSTEM "books.dtd">
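If editing the document to add a DOCTYPE is not an option, XML::LibXML can also check an already-parsed document against a DTD loaded separately. The following is a minimal sketch, assuming an XML::LibXML release that provides the XML::LibXML::Dtd class and the document methods is_valid and validate. The inline DTD and sample document are abbreviated stand-ins for books.dtd and books.xml, not the real files.

```perl
use strict;
use warnings;
use XML::LibXML;

# Abbreviated stand-in for books.dtd (illustrative only)
my $dtd_text = <<'DTD';
<!ELEMENT books (book*)>
<!ELEMENT book (title,edition)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT edition (#PCDATA)>
<!ATTLIST book id CDATA #REQUIRED>
DTD

# Sample document with no DOCTYPE declaration of its own
my $xml_text = <<'XML';
<books>
  <book id="1"><title>Perl Cookbook</title><edition>2</edition></book>
</books>
XML

my $dtd = XML::LibXML::Dtd->parse_string($dtd_text);
my $doc = XML::LibXML->new->parse_string($xml_text);

# is_valid returns true or false; validate would die with a message instead
my $ok = $doc->is_valid($dtd) ? 1 : 0;
print $ok ? "valid\n" : "invalid\n";
```

Use is_valid when a yes/no answer is enough, and validate (wrapped in eval) when you want the parser's explanation of what went wrong.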
Example 22-8 is a simple driver used to parse and validate.
#!/usr/bin/perl -w
# bookchecker - parse and validate the books.xml file
use XML::LibXML;

$parser = XML::LibXML->new;
$parser->validation(1);
$parser->parse_file("books.xml");
When the document validates, the program produces no output: XML::LibXML successfully parses the document into a DOM structure that is quietly destroyed when the program ends. Edit the books.xml file, however, and you see the errors that XML::LibXML emits when it discovers broken XML.
For example, changing the id attribute to unique_id causes this error message:
'books.xml:0: validity error: No declaration for attribute unique_id of element book
<book unique_id="1">
^
books.xml:0: validity error: Element book does not carry attribute id
</book>
^
' at /usr/local/perl5-8/Library/Perl/5.8.0/darwin/XML/LibXML.pm line 405.
 at checker-1 line 7
XML::LibXML does a good job of reporting unknown attributes and tags. However, it's not so good at reporting out-of-order elements. If you return books.xml to its correct state, and then swap the order of a title and an edition element, you get this message:
'books.xml:0: validity error: Element book content does not follow the DTD
</book>
^
' at /usr/local/perl5-8/Library/Perl/5.8.0/darwin/XML/LibXML.pm line 405.
 at checker-1 line 7
In this case, XML::LibXML says that something in the book element didn't follow the DTD, but it couldn't tell us precisely what it violated in the DTD or how.
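Because the parser reports validity errors by dying, a driver that needs to keep running after a failure can trap the exception with eval. This is a minimal sketch, assuming the DTD is embedded in the document's internal subset so that the example is self-contained. The sample data is illustrative, and the exact wording of the message depends on the installed libxml2.

```perl
use strict;
use warnings;
use XML::LibXML;

# title and edition are deliberately swapped to break the content model
my $xml_text = <<'XML';
<?xml version="1.0"?>
<!DOCTYPE books [
<!ELEMENT books (book*)>
<!ELEMENT book (title,edition)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT edition (#PCDATA)>
<!ATTLIST book id CDATA #REQUIRED>
]>
<books>
  <book id="1"><edition>2</edition><title>Perl Cookbook</title></book>
</books>
XML

my $parser = XML::LibXML->new;
$parser->validation(1);

# parse_string dies on a validity error, so trap it with eval
my $doc   = eval { $parser->parse_string($xml_text) };
my $error = $@;

if ($error) {
    print "Validation failed: $error";
} else {
    print "Document is valid\n";
}
```

Trapping the error this way lets a program validate a batch of files and report every failure, rather than stopping at the first invalid document.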
At the time of this writing, you must use XML::Xerces to validate while using SAX, or to validate against W3C Schema. Both of these features (and RelaxNG validation) are planned for XML::LibXML, but weren't available at the time of printing.
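Later releases of XML::LibXML did gain schema support through the XML::LibXML::Schema class (and RelaxNG support through XML::LibXML::RelaxNG). If your installed version includes it, W3C Schema validation might look like the following sketch. The inline schema and document are made-up stand-ins for illustration, not a schema for books.xml.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical schema describing a single book element
my $xsd_text = <<'XSD';
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title"   type="xs:string"/>
        <xs:element name="edition" type="xs:positiveInteger"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
XSD

my $schema = XML::LibXML::Schema->new(string => $xsd_text);
my $doc    = XML::LibXML->new->parse_string(
    '<book id="1"><title>Perl Cookbook</title><edition>2</edition></book>'
);

# validate dies if the document violates the schema
my $valid = eval { $schema->validate($doc); 1 } ? 1 : 0;
print $valid ? "valid\n" : "invalid: $@";
```

To load the schema from a file instead of a string, pass location => $SCHEMAFILE to the constructor.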
Here's how you build a DOM tree while validating a DTD using XML::Xerces:
use XML::Xerces;

# create a new parser that always validates
my $p = XML::Xerces::DOMParser->new( );
$p->setValidationScheme($XML::Xerces::DOMParser::Val_Always);

# make it die when things fail to parse
my $error_handler = XML::Xerces::PerlErrorHandler->new( );
$p->setErrorHandler($error_handler);

$p->parse($FILENAME);
To validate against a schema, you must tell XML::Xerces where the schema is and that it should be used:
$p->setFeature("http://xml.org/sax/features/validation", 1);
$p->setFeature("http://apache.org/xml/features/validation/dynamic", 0);
$p->setFeature("http://apache.org/xml/features/validation/schema", $SCHEMAFILE);
You can pass three possible values to setValidationScheme:
$XML::Xerces::DOMParser::Val_Always
$XML::Xerces::DOMParser::Val_Never
$XML::Xerces::DOMParser::Val_Auto
The default is to never validate. Always validating raises an error if the file does not have a DTD or Schema. Auto raises an error only if the file has a DTD or Schema and fails to validate against it.
XML::Xerces requires the Apache Xerces C++ XML parsing library, available from http://xml.apache.org/xerces-c. At the time of writing, the XML::Xerces module required an archived, older version of the Xerces library (1.7.0) and was appallingly lacking in documentation: you can learn how it works only by reading the documentation for the C++ library and consulting the examples in the samples/ directory of the XML::Xerces distribution.
The documentation for the CPAN module XML::LibXML; http://xml.apache.org/xerces-c; http://xml.apache.org/xerces-p/