Hack 20 Create Well-Formed XML with Minimal Manual Tagging Using an SGML Parser

Convert minimal markup into XML with James Clark's SP.

The problem of converting plain text into basic, well-formed XML occurs over and over again in XML processing. As a general rule, I like to get data into XML as quickly as possible and leave it in XML for as long as possible (preferably forever). The sooner I can get data into XML, the sooner I can bring all my XML-processing tools and knowledge to bear on the data-processing challenges.

When the volume of markup to be created is small, hand-editing using one-off text editor macros is a powerful technique. For higher volumes of markup, a custom program is often the best way to go; Python, Ruby, and Perl, for example, all excel at this sort of work.

Sometimes, the quickest way to get data into XML is by combining judicious use of hand-edits and automatic addition of the markup required using an SGML parser. XML is a subset of a much larger markup technology standard known as SGML (ISO 8879:1986), which has been an international standard since 1986. SGML provides a variety of mechanisms, not found in XML, to minimize the amount of tagging required in documents. Collectively, these techniques are known as markup minimization features. By using an SGML parser to process text, it is possible to take advantage of the tag minimization features to automatically add markup and help create well-formed XML documents.

In these examples, we will use James Clark's SP SGML parser. You can download it from http://www.jclark.com/sp/. The examples in this hack assume that SP has been installed in the working directory for the book's files.

2.11.1 From HTML to XML

You may already be familiar with some of SGML's tag minimization capabilities, as they are used extensively in HTML. (HTML is an example of an SGML application, and by far the most successful one in the world.)

The most common tag minimization technique from SGML used in HTML is known as tag omission. Here is a small HTML document, min.html, which, thanks to SGML's tag omission features, is valid per the HTML DTD:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML Strict//EN">

<title>Hello World</title>

<p>Hello World

Note that numerous HTML tags that you normally see have been omitted from the document: there is no head element, no body element, and no html element. The end tag of the p element has also been omitted.

Using the nsgmls command-line application that ships with SP, we can parse this document against the HTML DTD on Windows using this command:

nsgmls -c pubtext/html.soc min.html >nul

Or on Unix by using this:

nsgmls -c pubtext/html.soc min.html >/dev/null

The -c command-line option is used to tell the parser where to find the HTML DTDs. These are shipped in the pubtext subdirectory of SP, which came with the archive of files for the book. I have redirected normal output to the null device in the examples. The fact that no errors are displayed on the screen tells us that, from an SGML perspective, the document is both well-formed and valid per the HTML DTD.
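The same check can be scripted rather than read off the screen. The following is a minimal Python sketch, assuming that nsgmls is on the PATH and that it writes its error messages to standard error, as the behavior described above suggests. The helper names nsgmls_command and is_valid are hypothetical and are not part of SP.

```python
import subprocess


def nsgmls_command(document, catalog=None):
    """Build the nsgmls argument list; -c names the catalog, as in the text."""
    cmd = ["nsgmls"]
    if catalog is not None:
        cmd += ["-c", catalog]
    cmd.append(document)
    return cmd


def is_valid(document, catalog=None):
    """Return True when nsgmls prints no error messages.

    Assumption: errors are reported on standard error.
    """
    result = subprocess.run(
        nsgmls_command(document, catalog),
        stdout=subprocess.DEVNULL,  # same effect as >nul or >/dev/null
        stderr=subprocess.PIPE,
        text=True,
    )
    return result.stderr.strip() == ""
```

Calling is_valid("min.html", "pubtext/html.soc") from the directory that holds the files would then stand in for the command lines shown above on either platform.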

SP also ships with sx, a utility for converting documents from SGML to XML. Using the sx utility, we can now automatically add all the tags needed to make min.html a well-formed XML document. Run it on Windows or Unix like this:

sx -c pubtext/html.soc -xno-nl-in-tag min.html >min.xml

The -x command-line option passes an output option to sx; here, no-nl-in-tag tells sx not to add newlines inside the tags it creates. By default, sx breaks lines inside tags to avoid producing very long lines of XML output. For a complete list of sx options, see doc/sx.htm in the SP distribution.

The resultant file, min.xml, is shown in Example 2-7 and is indented for clarity.

Example 2-7. min.xml

<?xml version="1.0"?>

<HTML VERSION="-//IETF//DTD HTML 2.0 Strict//EN" SDAFORM="Book">

 <HEAD>

  <TITLE SDAFORM="Ti">Hello World</TITLE>

 </HEAD>

 <BODY>

  <P SDAFORM="Para">Hello World</P>

 </BODY>

</HTML>

There are a total of ten tags in this document, of which sx has added seven automatically, while we contributed only three manually: a 70 percent savings on manual markup!
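The arithmetic can be checked mechanically. This is a rough Python sketch that counts start and end tags with a regular expression, which is adequate for these two small files but is not a general parser. The constant and function names are made up for the illustration.

```python
import re

MIN_HTML = """<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML Strict//EN">
<title>Hello World</title>
<p>Hello World
"""

MIN_XML = """<?xml version="1.0"?>
<HTML VERSION="-//IETF//DTD HTML 2.0 Strict//EN" SDAFORM="Book">
 <HEAD>
  <TITLE SDAFORM="Ti">Hello World</TITLE>
 </HEAD>
 <BODY>
  <P SDAFORM="Para">Hello World</P>
 </BODY>
</HTML>
"""

# Matches start and end tags only. Declarations (<!...) and processing
# instructions (<?...) are skipped because a letter must follow "<" or "</".
TAG = re.compile(r"</?[A-Za-z][^>]*>")


def count_tags(text):
    return len(TAG.findall(text))


typed = count_tags(MIN_HTML)   # tags we wrote by hand
total = count_tags(MIN_XML)    # tags in the sx output
added = total - typed          # tags supplied by the parser
print(typed, total, added, f"{100 * added // total}%")
```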

In addition to adding start and end tags as required, sx has also added attributes called SDAFORM and VERSION. These are examples of defaulted attribute values. Defaulting attribute values is a form of markup minimization that, unlike SGML's tag minimization, is included in the XML standard.
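To see attribute defaulting at work in plain XML, here is a small Python sketch. It assumes the declaration sits in the internal DTD subset, because Python's standard parser does not read external DTDs by default. The one-element document is invented for the illustration and is not the HTML DTD.

```python
import xml.etree.ElementTree as ET

# The start tag carries no attributes; the default comes from the ATTLIST
# declaration in the internal subset.
DOC = """<?xml version="1.0"?>
<!DOCTYPE P [
  <!ELEMENT P (#PCDATA)>
  <!ATTLIST P SDAFORM CDATA "Para">
]>
<P>Hello World</P>
"""

p = ET.fromstring(DOC)
print(p.attrib)
```

The printed attribute dictionary should include SDAFORM even though the start tag never mentions it, which is the same effect sx made visible by writing the defaulted values into min.xml.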

2.11.2 Marking Up the Names of People

A common problem in XML data processing is dealing with the names of people. Many applications require that people's names be split into two parts: a family name and a given name. In the general case, doing this across all languages and cultures is very complex at best and impossible at worst. Even within a limited set of languages/cultures, the complexity of the problem rapidly manifests itself. Consider the following text file (names.txt), which contains the names of three people:

Asmar Hohsen  Mickey Joe Mac Entaggart   Javier Ausas Lopez de Castro

Splitting these names into their given name and surname component parts requires the application of complex rules, rules that are very difficult to explain to a computer. We can take advantage of our human ability to out-guess machines to get this data into an XML form quickly by using an SGML parser. The critical human interventions we need to make are:

Split the list into separate names using the whitespace information and our best guess as to where the boundaries lie.
Mark the point where the surname begins, changing the order of given name and surname, as needed.

Here is an SGML document created with the minimal amount of markup added. A Name tag is used to mark the start of each name, and an S tag is used to mark the point where a surname starts (names.sgml):

<!DOCTYPE Names SYSTEM "names.dtd">

<Name>Hohsen  <S>Asmar <Name>Mickey Joe <S>MacEntaggart 

<Name>Javier <S>Ausas Lopez de Castro

Now we need to create a DTD to describe the Names document type. In XML, it would look like this (namex.dtd):

<!ELEMENT Names (Name*)>

<!ELEMENT Name (F,S)>

<!ELEMENT F (#PCDATA)>

<!ELEMENT S (#PCDATA)>

To make it SGML compatible, we need to make a minor alteration (names.dtd):

<!ELEMENT Names o o (Name*)>

<!ELEMENT Name o o (F,S)>

<!ELEMENT F o o (#PCDATA)>

<!ELEMENT S o o (#PCDATA)>

Note the pair of lowercase o's (o o) between the element type name and the content model of each element type declaration. The o stands for omissible and indicates that documents may omit the start tag (first o) and the end tag (second o).
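To make the inference concrete, here is a rough Python emulation of what the parser works out from these declarations. It is a sketch for this one DTD only, not an SGML parser: it treats every Name tag as the start of a new entry, takes the text before the S tag as the implied F element, and closes everything that is still open. Unlike sx, it trims whitespace and matches tag names case-sensitively. The function name expand_names is hypothetical.

```python
import re
from xml.sax.saxutils import escape


def expand_names(minimal):
    """Emulate, for the Names DTD only, the tags an SGML parser would infer."""
    out = ["<names>"]
    # Every <Name> tag starts a new entry; text before the first one is ignored.
    for entry in re.split(r"<Name>", minimal)[1:]:
        # <S> marks where the surname starts; what precedes it is the implied F.
        given, sep, surname = entry.partition("<S>")
        if not sep:
            raise ValueError("entry has no <S> tag: %r" % entry)
        out.append(" <name>")
        out.append("  <f>%s</f>" % escape(given.strip()))
        out.append("  <s>%s</s>" % escape(surname.strip()))
        out.append(" </name>")
    out.append("</names>")
    return "\n".join(out)


print(expand_names(
    "<Name>Mickey Joe <S>MacEntaggart\n"
    "<Name>Javier <S>Ausas Lopez de Castro\n"
))
```

A hand-written converter like this has to be redone for every document type, whereas the SGML parser derives the same behavior from the DTD.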

Now we can parse the document with the nsgmls utility to check for errors. On Windows, the command is:

nsgmls names.sgml >nul

On Unix, the command is:

nsgmls names.sgml >/dev/null

The fact that no error messages appear on the screen tells us that the document is well-formed and valid per names.dtd. Now we can proceed to use the sx utility to generate fully marked-up XML from this document. On Windows or Unix, the command is:

sx -x no-nl-in-tag -x lower names.sgml >names.xml

Note the addition of another -x switch with lower. This will produce tag names in lowercase. The resultant XML file is names.xml, which is indented for clarity (Example 2-8).

Example 2-8. names.xml

<?xml version="1.0"?>

<names>

 <name>

  <f>Hohsen </f>

  <s>Asmar </s>

 </name>

 <name>

  <f>Mickey Joe </f>

  <s>MacEntaggart   </s>

 </name>

 <name>

  <f>Javier </f>

  <s>Ausas Lopez de Castro</s>

 </name>

</names>
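Once the names are in XML, ordinary XML tooling takes over. As a minimal sketch, the following Python reads a two-entry excerpt of names.xml and prints each person surname first. The function name read_names is hypothetical, and the whitespace trimming is a choice made here rather than something sx does.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0"?>
<names>
 <name>
  <f>Mickey Joe </f>
  <s>MacEntaggart   </s>
 </name>
 <name>
  <f>Javier </f>
  <s>Ausas Lopez de Castro</s>
 </name>
</names>
"""


def read_names(xml_text):
    """Return (given, surname) pairs with surrounding whitespace trimmed."""
    root = ET.fromstring(xml_text)
    return [
        ((name.findtext("f") or "").strip(), (name.findtext("s") or "").strip())
        for name in root.findall("name")
    ]


for given, surname in read_names(SAMPLE):
    print(f"{surname}, {given}")
```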

You can't do that with just plain old XML!