9.3 Specifying Human Languages

Specifying a character encoding is crucial for correctly processing and displaying an XML document in a multilingual world. But there is a higher level to address than just the symbols on the page. Different languages may use the same characters. If a document is encoded with UTF-8, how can you know if it is speaking Vietnamese or Italian?

You may wonder why the language matters, since software ought to handle all documents the same way regardless of what language they are written in. But the push for globalization is not a dream shared by everybody. Sure, we all want equal access to resources, but we would also like to keep our uniqueness intact. So many developers would love a way to know in advance what language a reader prefers to use, and to have some automatic means of serving that preference.

XML and many related standards have included some devices to allow special handling based on language. You can use labels to create variations on a document and to customize its appearance and behavior. I will describe a few of the important mechanisms in this section.

9.3.1 The xml:lang Attribute and Language Codes

XML defines the attribute xml:lang as a language label for any element. There is no official action that an XML processor must take when encountering this attribute, but we can imagine some future applications. For example, search engines could be designed to pay attention to the language of a document and use it to categorize its entries. The search interface could then provide a menu for languages to include or exclude in a search. Another use for xml:lang might be to combine several versions of a text in one document, each version labeled with a different language. A web browser could be set to ignore all but a particular language, filtering the document so that it displays only what the reader wants. Or, if you're writing a book that includes text in different languages, you could configure your spellchecker to use a different dictionary for each version.
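As a sketch of the multi-version idea, the following hypothetical document carries the same sentence in three languages. The element names notice and para are assumptions made for illustration; only xml:lang comes from XML itself, and its two-letter values are explained next.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical vocabulary: "notice" and "para" are illustrative names. -->
<notice>
  <!-- One version of the text per language, each labeled with xml:lang -->
  <para xml:lang="en">Please keep off the grass.</para>
  <para xml:lang="fr">Prière de ne pas marcher sur la pelouse.</para>
  <para xml:lang="it">Si prega di non calpestare l'erba.</para>
</notice>
```

An application that filters by language could keep the one para whose label matches the reader's preference and drop the others.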

The attribute's value is a string containing a two-letter language code, like so:

xml:lang="en"

The code "en" stands for English. The language codes, standardized in ISO 639, are case-insensitive, so there are 26² = 676 possible codes. Three-letter codes are also specified, but XML only recognizes two-letter codes; this could be a problem in a world with thousands of different languages, dialects, and subdialects.

Fortunately, we can also specify a language variant using a qualifier, like this:

xml:lang="en-US"

This refers to the American variant of English. By convention, we usually write the language code in lowercase and the qualifier in uppercase. Now we can separate different kinds of English like so:

<para xml:lang="en-US">Please consult the program.</para>
<para xml:lang="en-GB">Please consult the programme.</para>

If for some reason you need to define your own language, you can do so by using x as the language code, followed by a hyphen and a name of your choosing. Some examples could include: x-pascal, x-java, x-martian, and x-babytalk.
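A minimal sketch of how such private codes might label content that is not in any registered human language. The element names and the x-pascal and x-babytalk labels are purely illustrative; no standard assigns them a meaning, so only software you control will know what to do with them.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical vocabulary: "examples", "listing", and "para" are illustrative names. -->
<examples>
  <!-- The x- prefix marks a private, user-defined language code -->
  <listing xml:lang="x-pascal">writeln('Hello, world');</listing>
  <para xml:lang="x-babytalk">Goo goo ga ga.</para>
</examples>
```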

9.3.2 Language Support in Stylesheets

CSS and XSLT both have tests that let you specify different behaviors depending on the language of your audience. For example, your document may contain an element that renders as a note with a stylesheet-generated title "CAUTION." In a German translation, you may want it to say "VORSICHT" instead. The following sections describe how this conditional behavior can be implemented.

9.3.2.1 CSS and the :lang( ) pseudo-class

Cascading Style Sheets Level 2 includes a pseudo-class, :lang( ), for adding language options to a stylesheet. For XML documents, it determines an element's language from the xml:lang attribute on that element or on the nearest ancestor that has one. For example, the following rule changes the color of French phrase elements to red:

phrase:lang(fr) { color: red; }

9.3.2.2 XSLT and the lang( ) function

XSLT also pays attention to language. In Chapter 7, we discussed Boolean functions and their roles in conditional template rules. One important function is lang( ), which returns true if the current node's language, as set by the nearest xml:lang attribute on the node or one of its ancestors, is the same as or a sublanguage of the argument. So lang('de') is true for de and also for de-CH. Consider the following template:

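<xsl:template match="para">
  <xsl:choose>
    <!-- German text gets the German heading -->
    <xsl:when test="lang('de')">
      <h1>ACHTUNG</h1>
      <xsl:apply-templates/>
    </xsl:when>
    <!-- Every other language falls through to the default heading -->
    <xsl:otherwise>
      <h1>ATTENTION</h1>
      <xsl:apply-templates/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>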

The XSLT template rule outputs the word "ACHTUNG" if the language is de, or "ATTENTION" otherwise. Let's apply this rule to the following input tree:

<warning xml:lang="de">
  <para>Bitte, kein rauchen.</para>
</warning>

The para inherits its language property from the warning that contains it, and the first choice in the template rule will be used.
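The same decision can also be made in the match pattern rather than inside xsl:choose. The following stylesheet is a sketch of that alternative, not the only way to do it. It assumes an HTML-like output using div and p elements (names chosen here for illustration) and borrows the CAUTION and VORSICHT titles from the note example in the introduction to this section.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- German warnings; lang('de') also accepts sublanguages such as de-CH -->
  <xsl:template match="warning[lang('de')]">
    <div class="warning">
      <h1>VORSICHT</h1>
      <xsl:apply-templates/>
    </div>
  </xsl:template>

  <!-- Fallback for warnings in any other language -->
  <xsl:template match="warning">
    <div class="warning">
      <h1>CAUTION</h1>
      <xsl:apply-templates/>
    </div>
  </xsl:template>

  <!-- Paragraphs are handled the same way in every language -->
  <xsl:template match="para">
    <p><xsl:apply-templates/></p>
  </xsl:template>

</xsl:stylesheet>
```

Applied to the warning element shown above, the first rule is chosen, because a pattern with a predicate has a higher default priority than a bare element name. Splitting the rules this way keeps each language's output in its own template, which may be easier to maintain as more languages are added.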