Section 15.1. Languages and Metalanguages

A language consists of symbols that we assemble in a meaningful way to express ourselves and pass along information that is intelligible to others. For example, English is a language with rules (grammar) that define how to put its symbols (words) together to form sentences, paragraphs, and, ultimately, books like the one you are holding. If you know the words and understand the grammar, you can read the book, even if you don't necessarily understand its contents.

An important difference between human and computer-based languages is that human languages are self-describing. We use English sentences and paragraphs to define how to create correct English sentences and paragraphs. Our brains are marvelous machines that have no problem understanding that you can use a language to describe itself. However, computer languages are not so rich and computers are not so bright that you could easily define a computer language with itself. Instead, we can define one language, a metalanguage, that defines the rules and symbols of another language.

Software developers can use a metalanguage to define the rules for defining a language and then define one or more languages based on those rules.[1] The metalanguage also guides developers creating the automated agents that display or otherwise process the contents of documents that authors have created using that language.

[1] The use of metalanguages has long been popular in the world of computer programming. The C programming language, for instance, has a set of rules and symbols defined by one of several metalanguages, including yacc. Developers use yacc to create compilers, which in turn process language source files into computer-intelligible programs (hence, its name: Yet Another Compiler Compiler). yacc's only purpose is to help developers create new programming languages.

XML is a metalanguage created by the W3C and used by developers to define markup languages such as XHTML. Browser developers rely on XML's metalanguage rules to create automated processes that read the language definition of XHTML and implement the processes that ultimately display or otherwise process XHTML documents.

Why bother with a markup metalanguage? Because, as the familiar proverb goes, the W3C wants to teach us how to fish so we can feed ourselves for a lifetime. With XML, there is a standardized way to define markup languages that are customized for different needs, rather than having to rely upon HTML extensions. Mathematicians need a way to express mathematical notations; composers need a way to present musical scores; businesses want their web sites to take sales orders from customers; physicians look to exchange medical records; plant managers want to run their factories from web-based documents. All these groups need an acceptable, resilient way to express these different kinds of information, so that the software industry can develop the programs that process and display these diverse documents.

XML provides the answer. Each content sector (the business group, the factory-automation consortium, the trade association) may define a markup language to suit its particular needs for information exchange and processing over the Web. Computer programmers can create XML-compliant processes, called parsers, that read the new language definitions and allow the server to process the documents of those languages.
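To illustrate how a single parser can serve many markup languages, here is a minimal sketch that uses the ElementTree parser from Python's standard library. The two vocabularies (an order and a musical score) and the helper name tag_names are invented for this example. ElementTree checks only that a document is well formed; it does not read a DTD to validate the document, so this shows just the generic parsing step.

```python
import xml.etree.ElementTree as ET

# Two hypothetical vocabularies, invented for this sketch.
ORDER = "<order><item sku='K-100'>Kumquat jam</item><quantity>3</quantity></order>"
SCORE = "<score><measure><note pitch='C4'/><note pitch='E4'/></measure></score>"


def tag_names(xml_text):
    """Return the name of every element in document order."""
    root = ET.fromstring(xml_text)
    return [element.tag for element in root.iter()]


if __name__ == "__main__":
    # The same parser handles both documents without knowing either vocabulary.
    for document in (ORDER, SCORE):
        print(tag_names(document))
```

The parser needs no knowledge of either vocabulary in advance; deciding what an item or a note means is left to the application built on top of it.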

15.1.1 Creation Versus Display

While there is no limit to the kinds of markup languages you can create with XML, displaying your documents may be more complicated. When you write HTML, a browser understands what to do with the <h1> tag because it is defined in the HTML DTD, and browsers have been programmed to display all standard HTML tags.

With XML, you might create a DTD[2] for describing recipes. It would be a great way to capture and standardize all those kumquat recipes you've been collecting in your kitchen drawers. With special <ingredient> and <portion> tags, the recipes are easy to define and understand. However, browsers won't know what to do with these new tags unless you attach a style sheet that defines their handling. Without a style sheet, XML-capable browsers such as Internet Explorer 5 and 6 and Netscape 6 render these tags in a very generic way, certainly not the flourishing presentation your kumquat recipes deserve.

[2] An alternative to DTDs is XML Schemas. Schemas offer features related to data typing and are more programmatically oriented than document oriented. For more information, check out XML Schema by Eric van der Vlist (O'Reilly).
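As a rough sketch of what such a recipe language might look like, the following Python example embeds a small DTD in a recipe document and reads the ingredients back out. Only the <ingredient> and <portion> tags come from the text; the <recipe> and <title> elements, the content models, and the helper name ingredients are assumptions made for illustration. The standard-library parser used here accepts the DTD but does not validate against it.

```python
import xml.etree.ElementTree as ET

# A hypothetical recipe document with its DTD in the internal subset.
# Only <ingredient> and <portion> are named in the text; the rest is assumed.
RECIPE = """<!DOCTYPE recipe [
  <!ELEMENT recipe (title, ingredient+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT ingredient (#PCDATA | portion)*>
  <!ELEMENT portion (#PCDATA)>
]>
<recipe>
  <title>Kumquat Marmalade</title>
  <ingredient><portion>2 cups</portion> sliced kumquats</ingredient>
  <ingredient><portion>1 cup</portion> sugar</ingredient>
</recipe>"""


def ingredients(xml_text):
    """Return (portion, ingredient name) pairs from a recipe document."""
    root = ET.fromstring(xml_text)
    pairs = []
    for ingredient in root.findall("ingredient"):
        portion = ingredient.find("portion")
        # The ingredient name is the text that follows the <portion> element.
        pairs.append((portion.text, (portion.tail or "").strip()))
    return pairs


if __name__ == "__main__":
    for portion, name in ingredients(RECIPE):
        print(portion, "of", name)
```

A program can extract the data this way without any style sheet; the style sheet matters only when the document has to be displayed.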

Even with style sheets, there are limitations to presenting XML-based information. Let's say you want to create something more challenging, such as a DTD for musical notation or silicon chip design. While describing these data types in a DTD is possible, displaying this information graphically is certainly beyond the capabilities of any style sheets we've seen yet; properly displaying this type of graphically rich information would require a specialized rendering tool.

Nonetheless, your recipe DTD is a great tool for capturing and sharing recipes. As we'll see later in this chapter, XML isn't simply about creating markup languages for displaying content in browsers. It has great promise for sharing and managing information, so that those precious kumquat dishes will be preserved for many generations to come. Just bear in mind that, in addition to writing a DTD to describe your new XML-based markup language, in most cases you will want to supplement the DTD with a style sheet.[3]

[3] In fact, it is possible to write XML documents using only a style sheet. DTDs are highly recommended but optional. See http://www.w3c.org/TR/xml-stylesheet/ for details.
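A style sheet is typically attached to an XML document with an xml-stylesheet processing instruction placed before the root element. The sketch below, using Python's standard minidom module, shows a document without a DTD that carries such an instruction and how a program might find it. The file name recipe.css, the document content, and the helper name stylesheet_instructions are hypothetical.

```python
from xml.dom import Node, minidom

# A hypothetical document with no DTD, only a style sheet reference.
DOCUMENT = (
    '<?xml-stylesheet type="text/css" href="recipe.css"?>\n'
    "<recipe><title>Kumquat Pie</title></recipe>"
)


def stylesheet_instructions(xml_text):
    """Return the data of each xml-stylesheet instruction in the prolog."""
    document = minidom.parseString(xml_text)
    return [
        node.data
        for node in document.childNodes
        if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
        and node.target == "xml-stylesheet"
    ]


if __name__ == "__main__":
    print(stylesheet_instructions(DOCUMENT))
```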

15.1.2 A Little History

To complete your education in the whys and wherefores of markup languages, it helps to know how all these markup languages came to be.

In the beginning, there was SGML, the Standard Generalized Markup Language. SGML was intended to be the only markup metalanguage, from which all other markup languages would be created. Everything from hieroglyphics to HTML can be defined using SGML, negating the need for any other metalanguage.

The problem with SGML is that it is so broad and all-encompassing that mere mortals cannot use it. Using SGML effectively requires very expensive and complex tools that are completely beyond the scope of regular people who just want to bang out an HTML document in their spare time. As a result, other markup languages that are greatly reduced in scope and much easier to use have been created. The HTML standards themselves were initially defined using a subset of SGML that eliminated many of the more esoteric features. The DTD in Appendix D uses this subset of SGML to define the HTML 4.01 standard.

Recognizing that SGML was too unwieldy to describe HTML in a useful way and that there was a growing need to define other HTML-like markup languages, the World Wide Web Consortium (W3C) defined XML. XML is a formal markup metalanguage that uses select features of SGML to define markup languages in a style similar to that of HTML. It eliminates many SGML elements that aren't applicable to languages like HTML and simplifies other elements to make them easier to use and understand.

XML is a middle ground between SGML and HTML, a useful tool for defining a wide variety of markup languages. XML is becoming increasingly important as the Web extends beyond browsers and moves into the realm of direct data interchange between people, computers, and disparate systems. A small number of people wind up creating new markup languages with XML, and many more people want to be able to understand XML DTDs in order to use all these new markup languages.