Chapter 1: The Basics of XML

This chapter is intended for the FileMaker Pro database designer. You will be presented with examples of markup languages and a brief history of XML. You will begin to understand why XML can be important to you and how XML documents are structured. You will learn about some of the other standards based on XML for document presentation. If examples of similar usage in FileMaker Pro are helpful, you will find them here next to the XML examples.

1.1 A Brief History of XML

Extensible Markup Language (XML) is based upon SGML (Standard Generalized Markup Language). The simplest explanation of SGML is that it is a method of writing documents with special formatting instructions, or markup, included. A publishing editor makes notations in the margin of a document to alert an author to changes that are needed. These notations are markup of the document and, indeed, this is where the term "markup" originated. Markup allows the SGML or XML document to be distributed electronically while preserving the format or style of the text. An SGML document contains both the content and the markup. The emphasis is placed on the formatting rather than the content; otherwise, you would simply have an ordinary document.

SGML can be used to facilitate the publishing of documents as electronic or printed copy. Some programs that read the markup may also translate the styles, for example, to Braille readers and printers. The same document might be viewed on a smaller screen such as those on personal digital assistants (PDAs) or pagers and cellular telephones. The markup can mean something completely different based upon the final destination of the document and the translation to another format. Using stylesheets or transformation methods, a single document with content and markup can be changed upon output.

1.11 Markup Simplified

To help you understand markup, four examples are given in this section. All four produce the same result but take very different routes to get there. The first example illustrates that "there may be more than you see" on a monitor or printed page. The second example uses Rich Text Format (RTF) to show a way to embed formatting in a document for transportability. The third example shows the PostScript file (commands) that produces the desired results consistently on a laser printer. The fourth example uses the nested tag style found in SGML, HTML, and XML documents. You will begin to see how this final markup method can provide the formatting that you don't see in the first example and the transportability and consistency of the second and third, along with additional information about the document and its contents.

Example 1: Text Containing Bold Formatting

This has bold words in a sentence.

Using a word processor or electronic text editor, you may simply click on the word or phrase and apply the text style with special keystrokes (such as Control+B or Command+B) or choose Bold from a menu. On the word processor or computer screen, you can easily read the text, but you do not see the machine description, or code, describing how this text is to be displayed. You may not care how or why that happens, but the computer needs the instructions to comply with your wishes for a format change.

If you save the document and display or print it later, you want the computer to reproduce the document exactly as you designed it. Your computer knows what the stored code (or character markup) means for that text. A problem may arise if you place that code on another operating system or have a different word processor. There may be a different interpretation of the code that produces undesired results. This markup is consistent only if all other variables are equal. The next example uses a text encoding method to change the machine or application code into something more standard and portable.

Example 2: Revealing the Markup in Some Text Editors

{This has }{\b bold words}{ in a sentence.
\par }}

The above sentence shows Rich Text Format (RTF) markup interspersed with and surrounding the words of a document. The characters "{", "}", and "\" all mean something in this document but have nothing to do with the content. Rich Text Format markup is used by many word processors to change the visual format of the displayed text. As each new style is encountered, the formatting changes without changing the content of the document. A document becomes easily transportable to other word processors by using Rich Text Format. Each application that knows how to interpret Rich Text Format can show the intent of the author. This book was composed on a word processor, saved as RTF, and electronically submitted to the publisher. Regardless of the application, electronic device, or operating system used to create the document, the styling is preserved.

Rich Text Format markup adds no other information about the text. We may not know who wrote the sentence or when it was written. This information can be included as part of the content of the document but may be difficult to extract. We may have no control over the formatting and may not be allowed to change it for use with other devices. Using a translation application, we can convert it to the next example, the commands our printer understands.

Example 3: PostScript Printer Commands for the Document

%%Title: ()
%%Creator: ()
%%CreationDate: (10:29 AM Saturday, May 26, 2001)
%%For: ()
%%Pages: 1
%%DocumentFonts: Times-Roman Times-Bold
%%DocumentData: Clean7Bit
%%PageOrder: Ascend
%%Orientation: Portrait
   // more code here has been snipped for brevity //
gS 0 0 2300 3033 rC
250 216 :M
f57 sf
(This has )S
431 216 :M
f84 sf
.032 .003(bold words)J
669 216 :M
f57 sf
( in a sentence.)S

The third example, above, is the same text used in the previous two examples and printed to a file as a PostScript document. It uses a different markup even though it is the same text and same document. PostScript is a language, developed by Adobe in 1985, that describes the document for printers, imagesetters, and screen displays. These files can also be converted to Adobe Portable Document Format (.pdf). The markup retains the document or image style so that it can be printed exactly the same way every time. It is a language that is specific to these PostScript devices. An application can translate this document to make it portable, too.

Example 4: Rules-based Nested Structure Used for Document Markup

<? Command: use stylesheet1 for external rules ?>
<document author="Beverly" creationDate="06 AUG 2001">
      <paragraph importance="highest">
            <sentence>This has <b>bold words</b> in a sentence.</sentence>
      </paragraph>
      <paragraph importance="optional">
            <sentence>The styling may be lost.</sentence>
      </paragraph>
</document>

Unlike the Rich Text Format, nested markup may also contain a description of the text contents. The markup is often called a tag and may define various rules for the document. Sometimes the rules are internal such as "<b>" and "</b>" or external such as a stylesheet (set of rules) to apply to the whole document or portions of a document.
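
As a minimal sketch of how a program might read these descriptions, the following Python uses the standard library's xml.etree.ElementTree module on a well-formed copy of Example 4. Closing tags are written out here so that the parser accepts the text, and the stylesheet command line is left out because it is illustrative rather than valid XML. The function name describe_document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of Example 4 (closing tags written out,
# illustrative stylesheet command omitted).
EXAMPLE_4 = """
<document author="Beverly" creationDate="06 AUG 2001">
  <paragraph importance="highest">
    <sentence>This has <b>bold words</b> in a sentence.</sentence>
  </paragraph>
  <paragraph importance="optional">
    <sentence>The styling may be lost.</sentence>
  </paragraph>
</document>
"""


def describe_document(xml_text):
    """Return the document attributes and its sentences grouped by importance."""
    root = ET.fromstring(xml_text.strip())
    sentences = {}
    for paragraph in root.findall("paragraph"):
        importance = paragraph.get("importance")
        for sentence in paragraph.findall("sentence"):
            # itertext() also collects the text inside nested tags such as <b>.
            text = "".join(sentence.itertext())
            sentences.setdefault(importance, []).append(text)
    return {
        "author": root.get("author"),
        "created": root.get("creationDate"),
        "sentences": sentences,
    }


info = describe_document(EXAMPLE_4)
print(info["author"], info["created"])
print(info["sentences"]["highest"])
```

The author and creation date come straight from the attributes of the document tag, which is the kind of information the Rich Text Format example could not easily supply.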

There can be rules for characters, words, sentences, paragraphs, and the entire document. Characters inherit the rules of the word they are in. Words inherit the rules of the sentence, and sentences inherit the rules of the paragraph. The rules may not be just the formatting or style of the text but may also allow for flexibility in display.

<sentence color="blue">Some markup allows for a
<text color="red">change</text> in the document.</sentence>

Some formatting rules may also be different and change the inherited rules. All of the characters and words in the sentence above have a rule telling them to be blue. The text color can change to red without changing the sentence's blue color. In this nested markup, only the inner tags make the rule change.
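
The inheritance described above can be sketched in a few lines of Python. This is only an illustration, assuming the color attribute is the rule being inherited; the function name colored_runs is hypothetical, and the sentence is written on one line with a space where the original has a line break.

```python
import xml.etree.ElementTree as ET

SNIPPET = (
    '<sentence color="blue">Some markup allows for a '
    '<text color="red">change</text> in the document.</sentence>'
)


def colored_runs(element, inherited=None):
    """Yield (text, color) pairs; an inner tag overrides the inherited color."""
    color = element.get("color", inherited)
    if element.text:
        yield element.text, color
    for child in element:
        # The child applies its own rule, if any, to its own contents only.
        yield from colored_runs(child, color)
        if child.tail:
            # Text after the child's closing tag belongs to the parent again.
            yield child.tail, color


runs = list(colored_runs(ET.fromstring(SNIPPET)))
for text, color in runs:
    print(repr(text), color)
```

Only the word inside the text tag is reported as red. The words before and after it keep the blue rule of the enclosing sentence.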

Whether you use Rich Text Format or the nested structure found in SGML, HTML, and XML, changing the content of the words and phrases in the document does not change the style, the format, or the rules. Documents created with markup can be consistent. As the content changes, the style, formatting, and rules remain the same. The portability of documents containing markup to various applications and systems makes them very attractive. Standards have been recommended to ensure that every document that uses these standards will maintain portability.

1.12 The Standard in SGML

Charles Goldfarb, Ed Mosher, and Ray Lorie created Generalized Markup Language (GML) in 1969. These authors wanted to adapt documents to make them readable by various applications and operating systems. They also saw the need to make the markup standard to industries with diverse requirements. Two or more companies could agree on the markup used in order to facilitate the exchange of information. Different standards could be designed for each industry yet could have elements common to them all.

Another requirement for GML was to have rules for documents. To maintain an industry standard, rules could be created to define a document. One rule could define the type of content allowed within the document. Another rule could define the structure of the document. You might say these rules could be the map of the document. If you had the map, you could go to any place on the map. Using this kind of markup, you could locate and extract portions of the document more easily.

GML evolved and was renamed Standard Generalized Markup Language (SGML). In 1986 the International Organization for Standardization (ISO) adopted SGML as standard ISO 8879. SGML is now used worldwide for the exchange of information.

1.13 SGML Used as Basis for HTML and XML

When the World Wide Web was developed in 1989, Tim Berners-Lee used SGML as a basis for Hypertext Markup Language (HTML). HTML is a document standard for the Internet. Although the set of rules for HTML is limited, HTML still fulfills many of the SGML goals. The HTML markup includes text formatting for the display of content to web browsers and hyperlinks to connect separate documents. An example of this markup for web browsers is shown in Listing 1.1. HTML is application independent, and documents using HTML can be viewed with various operating systems.

Listing 1.1: Example of Hypertext Markup Language
            <TITLE>My Document in HTML</TITLE>
            <H1>This Is The Top Level Heading</H1>
            Here is content<BR>
            followed by another line.
            I can include images <IMG SRC="mygraphic.gif"> in a line
            of text!<BR>
            Good-bye for now.<BR>
            <A HREF="anotherPage.html">Go to another page with this</A>

Unlike SGML, HTML was not originally designed to be open to the creation of new markup. However, custom HTML markup was designed for separate applications, and documents lost some of their ability to be easily portable to other applications and systems. One application had defined a rule one way, and another had defined it differently or could not understand all the rules. Hypertext Markup Language became nonstandard.

1.14 HTML Can Become XHTML

XHTML is a standard for revising HTML to make Hypertext Markup Language documents more compatible with XML. You will learn more about HTML and XHTML in Chapter 6, "Using HTML and XHTML to Format Web Pages." You can also read more about XHTML from the World Wide Web Consortium at the Hypertext Markup Language home page. The example of XHTML in Listing 1.2, below, is very similar to Listing 1.1. XHTML is HTML with minor revisions to some of the tags: for example, tag names are lowercase and empty tags such as <br /> are closed.

Listing 1.2: Example of XHTML
            <title>My Document in XHTML</title>
            <h1>This Is The Top Level Heading</h1>
            Here is content<br />
            followed by another line.
            <hr />
            I can include images <img src="mygraphic.gif" /> in a
            line of text!<br />
            Good-bye for now.
            <a href="anotherPage.html">Links to another page are the
            same in XHTML</a>

1.15 XML as a Standard

The World Wide Web Consortium (W3C) set up a task force for recommending a language more useful to electronic transmission and display of documents. They wanted this language to be based on SGML but not as complex. They wanted the language to be more flexible than HTML but maintain standards. The first working draft of the Extensible Markup Language (XML) specification was published in November 1996, and XML 1.0 became a W3C Recommendation in February 1998. Related W3C work on programming access to such documents is described in the "Document Object Model (DOM) Activity Statement".

You may see many similarities between HTML and XML. A Hypertext Markup Language document contains a nested structure. With minor adjustments, an HTML document could be an XHTML document and usable as an XML document. However, HTML is used more for display and formatting of the data, while Extensible Markup Language generally separates the data descriptions from the text styles. XML allows the data to be transformed more easily for display on different devices.
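
To see why those minor adjustments matter, here is a small sketch that hands one line from Listing 1.1 and the matching line from Listing 1.2 to an XML parser. It assumes Python's standard xml.etree.ElementTree module. The surrounding body element is added here only because XML requires a single root element, and the function name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

# One line from each listing, wrapped in a <body> root element (an assumption,
# added because an XML document needs exactly one root).
HTML_FRAGMENT = "<body>Here is content<BR>followed by another line.</body>"
XHTML_FRAGMENT = "<body>Here is content<br />followed by another line.</body>"


def is_well_formed(text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(HTML_FRAGMENT))   # the <BR> tag is never closed
print(is_well_formed(XHTML_FRAGMENT))  # <br /> closes itself
```

A web browser will display both fragments, but an XML parser accepts only the second, because every tag that is opened must also be closed.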