1.4 XML Character Conventions

To keep XML documents well formed, you should remember the requirements and recommendations for naming elements, attributes, and documents. While the recommendations are not requirements, you may find later that they facilitate the exchange of data. Here you will learn about white space and end-of-line characters, and how Unicode and ASCII, the standards for character representation, are used in XML documents. More about the name of entities, such as links, can be found in section 1.51, "URI, URL, and URN."

1.41 White Space and End-of-Line Characters

White space is not just the space character between words. White space is a set of invisible characters that perform visual spacing of the words and lines of text. These characters are introduced in Table 1.2. White space is important if you are displaying or printing text. The beginning of this paragraph, for example, would be difficult to read if there were no spaces between the words or if a new line began at the wrong place. Below is an example of improper white space.

Whitespaceisnot justthespac
echaracter betweenwords.

Table 1.2: White space characters

Character         ASCII (decimal)   Unicode
space             32                #x20
horizontal tab    9                 #x9
carriage return   13                #xD
line feed         10                #xA

White space inside the content of an XML document is significant: the XML processor passes it along to the application, so the characters you place within your content are retained where you intended. White space in the content of an HTML document, by contrast, is collapsed when displayed, so multiple spaces become one space. White space in XML markup, such as the spaces or line breaks that separate attributes inside a tag, carries no significance for the XML processor, so using it to make a document more human readable is permissible (and advisable). White space is not allowed inside an element or attribute name, however, because it marks where the name ends. You and the XML processors would have difficulty determining the element name in the example below because of the improper white space.

<!-- incorrect element -->
<an element name attribute="here you go" />
<!-- should be: -->
<anElementName attribute="here you go" />
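
The difference between the two elements can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed is an assumption made for this illustration and is not part of any library.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# The two elements from the example above.
bad = '<an element name attribute="here you go" />'
good = '<anElementName attribute="here you go" />'

print(is_well_formed(bad))   # the name broken up by spaces is rejected
print(is_well_formed(good))  # the single-token name is accepted
```

In the first element the parser reads "an" as the element name and then expects each following word to be an attribute with a value, which is why the document is rejected.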

The end-of-line character is the special white space that we rarely see as we type a new line or a new paragraph of text. You press the Return or Enter key and magically you can begin typing to the left and one line down in the document. You do not actually see any "character" there, although one or more exists in the electronic document. Your word processor or text editor may have a utility to toggle the display of white space on and off. The paragraph symbol (¶) may be shown at the end of a line or paragraph if the toggle is on.

Figure 1.4: Showing invisibles

Where Do We Get These End-of-Line Characters?

If you have ever typed on an old manual (non-electric) typewriter, you probably pulled a lever to return the carriage (the type head) to the left margin and you made the roller feed the paper up one line (or more for multiple spacing). When the process for document composition is automated, printers and teletype machines have to be given precise instructions for everything they do. The two instructions for the location of the print head are carriage return and line feed. The return to the beginning of a line does not necessarily mean that you want the line to feed down at the same time. Separating these two instructions allows for printing text on top of text in the same line and creating unique symbols or simulated graphics from a limited set of characters.

Using the End-of-Line Characters

Electronic typewriters and computers include a Return or Enter key for the end-of-line action. A single keystroke sends a signal to the system processor, which takes the return to the left margin and moves down a line when the text is displayed on a monitor or as a printed document. A new line is created when the instruction for end of line is received. We also may see the text flow to the next line if the screen is a particular width. This is not a new line but is called text wrap and is the continuation of the same line. End-of-line or new line instructions may be called a hard return or end of paragraph. Hard returns occur only where you specifically press the Return or Enter key.

The end-of-line character is different on various systems. On the classic Macintosh operating system, the end-of-line character is the carriage return. The UNIX operating system, including Mac OS X, uses line feed for the end-of-line character. Carriage return and line feed are used together, in that order, on the Windows operating system. The document is stored with these invisible characters wherever there is an end of line. Sometimes they are not interpreted correctly by applications if the document is written on one system and read on another. You may have seen text break in the wrong places or contain a box character in place of an invisible character the application cannot interpret.

XML documents can be processed on any operating system. If the document contains carriage returns, line feeds, or the carriage return and line feed pair, an XML processor converts each end of line to a single line feed character (Unicode #xA, decimal 10) before passing the content on to the application. This keeps the document consistent for further processing.
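
The end-of-line handling can be observed directly. The sketch below, using Python's standard xml.etree.ElementTree module, parses content that mixes all three end-of-line styles and compares the result with a hand-written conversion. The function name normalize_eol and the sample note element are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

def normalize_eol(text: str) -> str:
    """Convert CR LF pairs and lone CRs to a single LF."""
    # Replace the two-character pair first so it does not become two LFs.
    return text.replace("\r\n", "\n").replace("\r", "\n")

# Windows (CR LF), classic Macintosh (CR), and UNIX (LF) line ends, mixed.
content = "one\r\ntwo\rthree\nfour"
raw = "<note>" + content + "</note>"

# Parse from bytes so that nothing alters the line ends before the parser sees them.
parsed_text = ET.fromstring(raw.encode("utf-8")).text

print(repr(parsed_text))
print(parsed_text == normalize_eol(content))
```

The order of the two replacements in normalize_eol matters: handling the lone carriage return first would turn each Windows line end into two line feeds.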

1.42 Unicode vs. ASCII

There are so many ways to say the same thing and so little time! We have graphical representations for many of our spoken languages. These are our written languages. Machines need a way to transmit a representation of our spoken and written languages. Just like typing white space characters, other characters on a computer keyboard send a signal for each key or combination of keys. This signal is a numerical representation of the key pressed. Most systems base these values on the ASCII character set, which defines 128 characters; the extended 8-bit character sets built on ASCII define 256. A sort will often use the same numerical values. Some of these characters can be found in Listing 1.16. An exercise to create the character set in HTML is also included in this section.

Listing 1.16: Sample ASCII codes and character representation
65   A
66   B
67   C
97   a
98   b
99   c
59   ;
49   1
50   2
51   3
184  π
60   <
163  £

This representation can be used to translate text from one written language to another representation of the same language. Note these special symbols: the Greek pi (π), Scandinavian o-slash (ø), and British pound symbol (£). Their codes lie above 127, outside the American Standard Code for Information Interchange (ASCII) proper, in extended character sets that differ from one system to another. ASCII by itself is therefore quite limited for international use. It has no way to represent Japanese, Chinese, and other ideographic writing systems, or many specialized symbols. The extended sets can also be limiting if different applications and systems do not translate the numerical representations identically.
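
The relationship between a character and its number can be explored in Python with the built-in chr and ord functions. This is a minimal sketch; the helper fits_in_ascii is an assumption made for the illustration. Python reports Unicode code points, so the numbers it prints for symbols above 127 need not match a platform-specific extended character set.

```python
# Codes from Listing 1.16 that fall inside 7-bit ASCII (0 to 127).
for code in (65, 66, 67, 97, 98, 99, 59, 49, 50, 51, 60):
    print(code, chr(code))

def fits_in_ascii(ch: str) -> bool:
    """Return True if the character can be encoded as 7-bit ASCII."""
    try:
        ch.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

# A letter, and the three special symbols mentioned in the text.
for ch in ("A", "£", "ø", "π"):
    print(ch, ord(ch), hex(ord(ch)), fits_in_ascii(ch))
```

Only the letter A passes the ASCII test; the pound sign, o-slash, and pi all require a larger character set.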

Exercise 1.2: Create Your Own ASCII Table

  1. Open FileMaker Pro.

  2. Create a database called ASCII.FP5 and define these four fields:

    • ASCII (number)

    • Character (calculated, text result, = "&#" & ASCII & ";")

    • HTML (text)

    • gCounter (global number)

  3. Create the script Create ASCII Table:

    Set Error Capture [ On ]
    Show All Records
    Delete All Records [ No dialog ]
    # Comment: Set the counter to zero
    Set Field [ "gCounter", "0" ]
    Loop
      New Record/Request
      Set Field [ "ASCII", "gCounter" ]
      Set Field [ "HTML", "If(ASCII = 0,  "<html><head><title>ASCII</title></head>¶
        <body><table border=0>¶
        <tr><th>ASCII</th><th>Character</th></tr>¶", "") &
        "<tr><td>" & ASCII & "</td><td>" & Character & "</td></tr>¶" &
      If(ASCII = 255, "</table></body></html>", "")" ]
      Set Field [ "gCounter", "gCounter + 1" ]
      Exit Loop If [ "gCounter = 256" ]
    End Loop
    Export Records [ Filename: "ASCII.html"; Export Order: HTML (Text) ]
     [ Restore export order, No dialog ]

After you perform the script and export this table, you can open the document in a text editor to see the results. You can also open the document in your browser to see the characters created. You may get different results from the same document if you change the font type or size in your browser preferences. Viewing the same document on different systems may also produce different results as the character mapping may be different.
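
If FileMaker Pro is not at hand, the same kind of table can be produced with a few lines of Python. This is a sketch of the idea behind the exercise, not a translation of the script: the function name build_ascii_table, the page title, and the column headings are assumptions made for the illustration.

```python
def build_ascii_table(limit: int = 256) -> str:
    """Return an HTML table pairing each code with its numeric character reference."""
    lines = [
        "<html><head><title>ASCII</title></head>",
        "<body><table border=0>",
        "<tr><th>ASCII</th><th>Character</th></tr>",
    ]
    for code in range(limit):
        # "&#65;" is the numeric character reference a browser renders as "A".
        lines.append(f"<tr><td>{code}</td><td>&#{code};</td></tr>")
    lines.append("</table></body></html>")
    return "\n".join(lines)

html = build_ascii_table()
print("\n".join(html.splitlines()[:5]))  # show the first few lines only
```

Saving the returned text to a file such as ASCII.html and opening it in a browser gives a result comparable to the exported FileMaker records.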

A standard (ISO/IEC 10646) has been devised for representing characters used for electronic transmission. Information about the International Organization for Standardization can be found at http://www.iso.ch/iso/en/ISOOnline.frontpage. The matching character set, maintained by the Unicode Consortium and kept synchronized with ISO/IEC 10646, is called Unicode. If you tested the above exercise, you may have seen that the same numeric code is not always rendered as the same character when you change your browser default font or move to another system. The Unicode standard was created to avoid these problems by assigning each character a single number, regardless of platform. Unicode attempts to include characters such as those used for scientific symbols and non-English text, thus making it a UNIversal CODE set. The first 128 characters of Unicode are the same as the ASCII table; beyond that range, Unicode and the various extended character sets may differ.
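
The overlap between ASCII and Unicode, and the disagreement among extended character sets, can be verified with Python's built-in codecs. This is a minimal sketch; the choice of byte 163 and of the latin-1 and cp437 encodings is an assumption made for the illustration.

```python
import unicodedata

# Every 7-bit ASCII byte decodes to the Unicode code point with the same number.
same = all(bytes([n]).decode("ascii") == chr(n) for n in range(128))
print(same)

# Above 127, the meaning of a byte depends on the character set in use.
byte = bytes([163])
print(byte.decode("latin-1"))  # ISO 8859-1
print(byte.decode("cp437"))    # the original IBM PC character set

# In Unicode, the number 163 always names one specific character.
print(unicodedata.name(chr(163)))
```

The two decodings of the same byte produce different characters, which is the kind of mismatch a single universal code set is meant to remove.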

1.43 Names Using Alphanumeric Characters

The use of white space can cause problems when naming your XML elements. Other characters not in the ASCII and Unicode tables might also be a problem for all systems to process. Even within those first 128 characters, you will have control characters that may not be visible. If you follow the recommendation of only using alphanumeric characters for naming entities, you will be assured of compatibility with most systems and applications. The common letters and numbers have ASCII and Unicode equivalents. These ranges can be found in Table 1.3.

Table 1.3: Alphanumeric, ASCII, and Unicode equivalents

Characters   ASCII (decimal)   Unicode
0-9          48-57             #x0030-#x0039
A-Z          65-90             #x0041-#x005A
a-z          97-122            #x0061-#x007A

FileMaker Pro Help makes recommendations for naming fields. Figure 1.5 is a screen shot of this information. The same recommendations might apply to all object names, such as file names, value list names, relationship names, layout names, and script names. Your preference may work well for single databases or complete sets of databases, but for XML or any web publishing, you may need to reconsider current choices.

Figure 1.5: Naming fields in FileMaker Pro
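
The alphanumeric recommendation from this section can be turned into a simple check to run over field names, layout names, or element names before they are published as XML. This is a minimal sketch: the function name is_safe_name is an assumption made for the illustration, and the rule that a name must begin with a letter is stricter than XML itself, which also permits an underscore as the first character but never a digit.

```python
import string

# The ranges 0-9, A-Z, and a-z recommended for names.
ALLOWED = set(string.ascii_letters + string.digits)

def is_safe_name(name: str) -> bool:
    """Return True if name starts with a letter and uses only A-Z, a-z, and 0-9."""
    return (
        bool(name)
        and name[0] in string.ascii_letters
        and all(ch in ALLOWED for ch in name)
    )

for candidate in ("anElementName", "an element name", "1stField", "Café"):
    print(candidate, is_safe_name(candidate))
```

Names that fail the check are not necessarily invalid in every system, but renaming them avoids surprises when the data is exchanged.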