4.1 Basic Concepts

In the general sense of the word, a schema is a generic representation of a class of things. For example, a schema for restaurant menus could be the phrase "a list of dishes available at a particular eating establishment." A schema may resemble the thing it describes, the way a "smiley face" represents an actual human face. The information contained in a schema allows you to identify when something is or is not a representative instance of the concept.

In the XML context, a schema is a pass-or-fail test for documents.^[1] A document that passes the test is said to conform to it, or be valid. Testing a document with a schema is called validation. A schema ensures that a document fulfills a minimum set of requirements, finding flaws that could result in anomalous processing. It also may serve as a way to formalize an application, being a publishable object that describes a language in unambiguous rules.

^[1] Technically, schemas validate on an element-by-element and attribute-by-attribute basis. It is possible to test a subtree alone for validity and determine that some parts are valid while others are not. This process is rather complex and beyond the scope of this book.

4.1.1 Validation

An XML schema is like a program that tells a processor how to read a document. In that respect it is very similar to transformations, a topic we'll discuss later. The processor reads the rules and declarations in the schema and uses this information to build a specific type of parser, called a validating parser. The validating parser takes an XML instance as input and produces a validation report as output. At a minimum, this report is a return code: true if the document is valid, false otherwise. Optionally, the parser can create a Post-Schema-Validation Infoset (PSVI), which includes information about data types and structure that may be used for further processing.
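As a rough sketch of that pipeline, the following Python builds a validator from a set of rules and applies it to two instances. The schema format here is a toy invented for illustration (a dictionary mapping each element name to the child names it allows), and the names `TOY_SCHEMA` and `build_validator` are hypothetical. No real schema language works this simply, but the shape is the same: rules go in, a validating function comes out, and that function turns a document into a report.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy schema: element name -> set of permitted child element names.
TOY_SCHEMA = {
    "menu": {"dish"},
    "dish": {"name", "price"},
    "name": set(),
    "price": set(),
}

def build_validator(schema):
    """Read the schema's rules and return a function that validates documents."""
    def validate(xml_text):
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError as exc:
            # A document that is not well formed cannot be valid either.
            return False, [f"not well formed: {exc}"]
        errors = []
        for parent in root.iter():
            if parent.tag not in schema:
                errors.append(f"undeclared element <{parent.tag}>")
                continue
            for child in parent:
                if child.tag not in schema[parent.tag]:
                    errors.append(f"<{child.tag}> not allowed inside <{parent.tag}>")
        # The report: a pass-or-fail flag plus the reasons for any failure.
        return not errors, errors
    return validate

validate_menu = build_validator(TOY_SCHEMA)
good = "<menu><dish><name>Soup</name><price>4.50</price></dish></menu>"
bad = "<menu><dish><nmae>Soup</nmae></dish></menu>"

print(validate_menu(good))
print(validate_menu(bad))
```

Both documents are well formed, so an ordinary parser accepts them. Only the validator notices the misspelled element in the second one.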

Validation happens on at least four levels:

Structure: The use and placement of markup elements and attributes.
Data typing: Patterns of character data (e.g., numbers, dates, text).
Integrity: The status of links between nodes and resources.
Business rules: Miscellaneous tests such as spelling checks, checksum results, and so on.

Structural validation is the most important, and schemas are best prepared to handle this level. Data typing is often useful, especially in "data-style" documents, but not widely supported. Testing integrity is less common and somewhat problematic to define. Business rules are often checked by applications.
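The four levels can be told apart in a small hand-written checker. This is only a sketch: the order document, its element names, the `check_levels` function, and the set of known customer IDs are all made up for the example, and a real schema language would express most of these tests declaratively rather than in code.

```python
import xml.etree.ElementTree as ET
from datetime import date
from decimal import Decimal, InvalidOperation

# Hypothetical order document with one problem at each level except structure.
ORDER = """<order>
  <customer ref="c9"/>
  <placed>2003-02-30</placed>
  <quantity>3</quantity>
  <unit-price>5.00</unit-price>
  <total>12.00</total>
</order>"""

def check_levels(xml_text, known_customers):
    """Return a dict mapping each validation level to a list of problems."""
    root = ET.fromstring(xml_text)
    report = {"structure": [], "data typing": [], "integrity": [], "business rules": []}

    # Structure: every required child element must be present.
    for name in ("customer", "placed", "quantity", "unit-price", "total"):
        if root.find(name) is None:
            report["structure"].append(f"missing <{name}>")
    if report["structure"]:
        return report  # the later checks assume the structure is sound

    # Data typing: character data must match the expected pattern.
    try:
        date.fromisoformat(root.findtext("placed").strip())
    except ValueError:
        report["data typing"].append("<placed> is not a valid date")
    numbers = {}
    for name in ("quantity", "unit-price", "total"):
        try:
            numbers[name] = Decimal(root.findtext(name).strip())
        except InvalidOperation:
            report["data typing"].append(f"<{name}> is not a number")

    # Integrity: a reference must point at something that exists.
    ref = root.find("customer").get("ref")
    if ref not in known_customers:
        report["integrity"].append(f"unknown customer {ref!r}")

    # Business rule: the total must equal quantity times unit price.
    if len(numbers) == 3 and numbers["quantity"] * numbers["unit-price"] != numbers["total"]:
        report["business rules"].append("total does not match quantity x unit price")
    return report

report = check_levels(ORDER, known_customers={"c1", "c2"})
for level, problems in report.items():
    print(level, "->", problems or "ok")
```

The sample order is structurally complete, yet it fails the other three levels: February 30 is not a date, customer `c9` is unknown, and the total does not add up.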

4.1.2 A History of Schema Languages

There are many different kinds of XML schemas, each with its own strengths and weaknesses.

4.1.2.1 DTD

The oldest and most widely supported schema language is the Document Type Definition (DTD). Borrowed from SGML, a simplified DTD was included in the XML Core recommendation. Though a DTD isn't necessary to read and process an XML document, it can be a useful component for a document, providing the means to define macro-like entities and other conveniences. DTDs were the first widely used method to formally define languages like HTML.

4.1.2.2 W3C XML Schema

As soon as XML hit the streets, developers began to clamor for an alternative to DTDs. DTDs don't support namespaces, which appeared after the XML 1.0 specification. They also have very weak data typing, being mostly markup-focused. The W3C formed a working group for XML Schema and began to receive proposals for what would later become their W3C XML Schema recommendation.

Following are some of the proposals made by various groups.

XML-Data: Submitted by Arbortext, DataChannel, Inso Corporation, Microsoft, and the University of Edinburgh in January 1998, this technical note put forth many of the features incorporated in W3C Schema, and many others that were left out, such as a mechanism for declaring entities and object-oriented programming support. Microsoft implemented a version of this called XML-Data Reduced (XDR).
Document Content Description (DCD): IBM, Microsoft, and Textuality submitted this proposal in July 1998 as an attempt to integrate XML-Data with the Resource Description Framework (RDF). It introduced the idea of making elements and attributes interchangeable.
Schema for Object-Oriented XML (SOX): As the name implies, this technical note was influenced by programming needs, incorporating concepts such as interfaces and parameters. It was submitted in July 1998 by Veo Systems/Commerce One. They have created an implementation that they use today.
Document Definition Markup Language (DDML): This proposal came out of discussions on the XML-Dev mailing list. It took the information expressed in a DTD and formatted it as XML, leaving support for data types to other specifications.

Informed by these proposals, the W3C XML Schema Working Group arrived at a recommendation in May 2001, composed of three parts (XMLS0, XMLS1, and XMLS2) named Primer, Structures, and Datatypes, respectively. Although some of the predecessors are still in use, all involved parties agreed that they should be retired in favor of the one, true W3C XML Schema.

4.1.2.3 RELAX NG

An independent effort by a creative few coalesced into another schema language called RELAX NG (pronounced "relaxing"). It is the merging of Regular Language Description for XML (RELAX) and Tree Regular Expressions for XML (TREX). Like W3C Schema, it supports namespaces and datatypes. It also includes some unique innovations, such as interchangeability of elements and attributes in content descriptions and more flexible content models.

RELAX, a product of the Japanese Standards Association's INSTAC XML Working Group, led by Murata Makoto, was designed to be an easy alternative to XML Schema. "Tired of complex specifications?" the home page asks. "You can relax!" Unlike W3C Schema, with its broad scope and steep learning curve, RELAX is simple to implement and use.

You can think of RELAX as DTDs (formatted in XML) plus datatypes inherited from W3C Schema's datatype set. As a result, it is nearly painless to migrate from DTDs to RELAX and, if you want to do so later, fairly easy to migrate from RELAX to W3C Schemas. It supported two levels of conformance: "classic," which was just like DTD validation plus datatype checking, and "fully relaxed," which added more features.

The theoretical basis of RELAX is Hedge Automata tree processing. While you don't need to know anything about Hedge Automata to use RELAX or RELAX NG, these mathematical foundations make it easier to write efficient code implementing RELAX NG. Murata Makoto has demonstrated a RELAX NG implementation which occupies 27K on a cell phone, including both the schema and the XML parser.

At about the same time RELAX was taking shape, James Clark of the Thai Open Source Software Center was developing TREX. It came out of work on XDuce, a typed programming language for manipulating XML markup and data. XDuce (a contraction of "XML" and "transduce") is a transformation language that takes an XML document as input, extracts data, and outputs another document in XML or another format. TREX uses XDuce's type system and adds various features into an XML-based language. XDuce appeared in March 2000, followed by TREX in January 2001.

Like RELAX, TREX uses a very clear and flexible language that is easy to learn, read, and implement. Definitions of elements and attributes are interchangeable, greatly simplifying the syntax. It has full support for namespaces, mixed content, and unordered content, things that are missing from, or very difficult to achieve with, DTDs. Like RELAX, it uses the W3C XML Schema datatype set, reducing the learning curve further.

RELAX NG (new generation) combines the best features from both RELAX and TREX in one XML-based schema language. It was first announced in May 2001, and an OASIS Technical Committee headed by James Clark and Murata Makoto oversees its development. It was approved as a Draft International Standard by ISO/IEC.

4.1.2.4 Schematron

Also worth noting is Schematron, first proposed by Rick Jelliffe of the Academia Sinica Computing Centre in 1999. It uses XPath expressions to define validation rules and is one of the most flexible schema languages around.
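The idea of rules written as path expressions can be imitated in a few lines of Python. This is not Schematron syntax, and Python's `xml.etree.ElementTree` supports only a limited subset of XPath. The `RULES` table, the `run_rules` function, and the menu document are all assumptions made for illustration. Each rule names a context (which nodes to inspect) and a path that must match something inside every such node.

```python
import xml.etree.ElementTree as ET

# Each rule: (context path, path that must match inside each context node, message).
# This mimics the "pick a context, assert something about it" idea only.
RULES = [
    (".//dish", "name", "a dish must have a name"),
    (".//dish", "price[@currency]", "a dish must have a price with a currency"),
]

def run_rules(xml_text, rules):
    """Return the message of every rule that fails, once per failing node."""
    root = ET.fromstring(xml_text)
    failures = []
    for context_path, test_path, message in rules:
        for node in root.findall(context_path):
            if node.find(test_path) is None:
                failures.append(message)
    return failures

MENU = """<menu>
  <dish><name>Soup</name><price currency="USD">4.50</price></dish>
  <dish><price>7.00</price></dish>
</menu>"""

for problem in run_rules(MENU, RULES):
    print(problem)
```

Because each rule is an independent test rather than part of a grammar, rules of this kind can be added or dropped without touching the others, which is a large part of the flexibility mentioned above.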

4.1.3 Do You Need Schemas?

It may seem like schemas are a lot of work, and you'd be right to think so. In designing a schema, you are forced to think hard about how your language is structured. As your language evolves, you have to update your schema, which is like maintaining a piece of software. There will be bugs, version tracking, usability issues, and even the occasional overhaul to consider. So with all this overhead, is it really worth it?

First, let's look at the benefits:

A schema can function as a publishable specification. There is simply no better way to describe a language than with a schema. A schema is, after all, a "yes or no" test for document conformance. It's designed to be readable by humans and machines alike. DTDs are very reminiscent of Backus-Naur Form (BNF) grammars, which are used to describe programming languages. Other schema languages, such as RELAX NG, are intuitive and very easy to read. So if you need to disseminate information on how to use a markup language, a schema is not a bad way to do it.
A schema will catch higher-level mistakes. Sure, there are well-formedness rules to protect your software from errors in basic syntax, but do they go far enough? What if a required field of information is missing? Or someone has consistently misspelled an element name? Or a date was entered in the wrong format? These are things only a validating parser can detect.
A schema is portable and efficient. Writing a program to test a document is an option, but it may not be the best one. Software can be platform-dependent, difficult to install, and bulky to transfer. A schema, however, is compact and optimized for one purpose: validation. It's easy to hand someone a schema, and you know it has to work for them because its syntax is governed by a standard specification. And since many schemas are based on XML, they can be edited in XML editors and tested by well-formedness checkers.
A schema is extensible. Schemas are designed to support modularity. If you want to maintain a set of similar languages, or versions of them, they can share common components. For example, DTDs allow you to declare general entities for special characters or frequently used text. They may be so useful that you want to export them to other languages.

Using a schema also has some drawbacks:

A schema reduces flexibility. The expressiveness of schemas varies considerably, and each standard tends to have its flaws. For example, DTDs are notorious for their incompatibility with namespaces. They are also inefficient at specifying a content model that contains required children that may appear in any order. While other schema languages improve upon DTDs, they will always have limitations of one sort or another.
Schemas can be obstacles for authors. In spite of advances in XML editors with fancy graphical interfaces, authoring in XML will never be as easy as writing in a traditional word processor. Time spent thinking about which element to use in a given context is time not spent on thinking about the document's content, which is the original reason for writing it. Some editors supply a menu of elements to select from that changes depending on the context. Even so, depending on the language and the tools, authoring can still be confusing and frustrating for the layperson.
You have to maintain it. With a schema, you have one more tool to debug and update. Like software, it will have bugs, versions, and even its own documentation. It's all too easy to damage a schema by deleting an imported component or introducing a syntax error. Older documents may not validate if you update the schema, forcing you to make retroactive changes to them. One silver lining is that, except for DTDs, most schema languages are based on XML, which allows you to use XML editors to make changes.
Designing it will be hard. Schemas are tricky documents to compose. You have to really think about how the elements will fit together, what kinds of data will be input, and whether there are special cases to accommodate. If you're just starting out with a language, there are many needs you don't know about until you start using it, creating a bit of a bootstrapping problem.
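The first drawback, about required children in any order, is easy to make concrete. XML DTDs have no operator meaning "each of these once, in any order," so the content model has to spell out the orderings. The sketch below generates such a model. The function name `any_order_model` and the element names are hypothetical, and the model is factored so that no two alternatives begin with the same element, which keeps it deterministic.

```python
def any_order_model(names):
    """Build a DTD content model requiring each name exactly once, in any order.

    Intended for two or more names; a single name is returned unchanged.
    """
    if len(names) == 1:
        return names[0]
    branches = []
    for i, first in enumerate(names):
        # Choose which element comes first, then allow the rest in any order.
        rest = names[:i] + names[i + 1:]
        branches.append(f"({first}, {any_order_model(rest)})")
    return "(" + " | ".join(branches) + ")"

# Two required children are manageable.
print("<!ELEMENT entry " + any_order_model(["title", "author"]) + ">")

# Each extra child multiplies the size of the declaration.
for n in range(2, 7):
    names = [f"e{i}" for i in range(n)]
    print(n, "children:", len(any_order_model(names)), "characters")
```

The size of the declaration grows roughly factorially with the number of children, which is why DTD authors usually settle for a fixed order or a looser model that accepts too much.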

To make the decision easier, think about it this way. A schema is basically a quality-control tool. If you are reasonably certain that your documents are good enough for processing, then you have no need for schemas. However, if you want extra assurance that your documents are complete and structurally sound, and the work you save fixing mistakes outweighs the work you will spend maintaining a schema, then you should look into it.

One thing to consider is whether a human will be involved with producing a document. No matter how careful we are, we humans tend to make a lot of mistakes. Validation can find those problems and save frustration later. But software-created documents tend to be very predictable and probably never need to be validated.

The really hard question to answer is not whether you need a schema, but which standard to use. There are a few very valuable choices that I will be describing in the rest of the chapter. I hope to provide you with enough information to decide which one is right for your application.