10.10 Internationalization

Most programs present some information to users as text. Such text should be understandable and acceptable to the user. For example, in some countries and cultures, the date "March 7" can be concisely expressed as "3/7". Elsewhere, "3/7" indicates "July 3", and the string that means "March 7" is "7/3". In Python, such cultural conventions are handled with the help of standard module locale.

Similarly, a greeting can be expressed in one natural language by the string "Benvenuti", while in another language the string to use is "Welcome". In Python, such translations are handled with the help of standard module gettext.

Both kinds of issues are commonly called internationalization (often abbreviated i18n, as there are 18 letters between i and n in the full spelling). This is actually a misnomer, as the issues also apply to programs used within one nation by users of different languages or cultures.

10.10.1 The locale Module

Python's support for cultural conventions is patterned on that of C, slightly simplified. In this architecture, a program operates in an environment of cultural conventions known as a locale. The locale setting permeates the program and is typically set early on in the program's operation. The locale is not thread-specific, and module locale is not thread-safe. In a multithreaded program, set the program's locale before starting secondary threads.

If a program does not call locale.setlocale, the program operates in a neutral locale known as the C locale. The C locale is named from this architecture's origins in the C language, and is similar, but not identical, to the U.S. English locale. Alternatively, a program can find out and accept the user's default locale. In this case, module locale interacts with the operating system (via the environment, or in other system-dependent ways) to establish the user's preferred locale. Finally, a program can set a specific locale, presumably determining which locale to set on the basis of user interaction, or via persistent configuration settings such as a program initialization file.
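The choices just described can be sketched as follows, assuming a current Python 3 interpreter. The helper name init_locale and its policy of falling back to the C locale are illustrative assumptions, not part of module locale. The sketch sets the locale once, before any secondary thread starts.

```python
import locale
import threading

def init_locale(name=''):
    """Set the program-wide locale once; '' requests the user's default.

    Hypothetical helper: falls back to the neutral C locale when the
    requested locale is not available on this system.
    """
    try:
        return locale.setlocale(locale.LC_ALL, name)
    except locale.Error:
        return locale.setlocale(locale.LC_ALL, 'C')

def report(results, number):
    # Secondary threads only use the locale; they never call setlocale.
    results.append(locale.str(number))

def main():
    init_locale()              # set the locale before starting any thread
    results = []
    threads = [threading.Thread(target=report, args=(results, n * 0.5))
               for n in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling init_locale('C') instead keeps the program in the neutral locale, which is also what happens if setlocale is never called.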

A locale setting is normally performed across the board, for all relevant categories of cultural conventions. This wide-spectrum setting is denoted by the constant attribute LC_ALL of module locale. However, the cultural conventions handled by module locale are grouped into categories, and in some cases a program can choose to mix and match categories to build up a synthetic composite locale. The categories are identified by the following constant attributes of module locale:

LC_COLLATE

String sorting: affects functions strcoll and strxfrm in locale

LC_CTYPE

Character types: affects aspects of module string (and string methods) that have to do with letters, lowercase, and uppercase

LC_MESSAGES

Messages: may affect messages displayed by the operating system, for example function os.strerror and module gettext

LC_MONETARY

Formatting of currency values: affects function locale.localeconv

LC_NUMERIC

Formatting of numbers: affects functions atoi, atof, format, localeconv, and str in locale

LC_TIME

Formatting of times and dates: affects function time.strftime

The settings of some categories (denoted by the constants LC_CTYPE, LC_TIME, and LC_MESSAGES) affect some of the behavior of other modules (string, time, os, and gettext, as indicated). The settings of other categories (denoted by the constants LC_COLLATE, LC_MONETARY, and LC_NUMERIC) affect only some functions of locale.
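A composite locale might be built one category at a time, as in the following sketch for a current Python 3 interpreter. The helper set_category is hypothetical, and the locale name 'de_DE.UTF-8' is only an assumption: it may not be installed on a given system, in which case the category keeps its previous setting.

```python
import locale

# Start from the neutral C locale for every category.
locale.setlocale(locale.LC_ALL, 'C')

def set_category(category, name):
    """Try to set one category; return True on success (hypothetical helper)."""
    try:
        locale.setlocale(category, name)
        return True
    except locale.Error:
        return False      # that locale is not available here

# Mix and match: change only the conventions for times and dates.
time_changed = set_category(locale.LC_TIME, 'de_DE.UTF-8')

# Each category can be queried separately; omitting the second argument
# of setlocale leaves the setting unchanged and just returns it.
numeric_setting = locale.setlocale(locale.LC_NUMERIC)
time_setting = locale.setlocale(locale.LC_TIME)
```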

Module locale supplies functions to query, change, and manipulate locales, as well as functions that implement the cultural conventions of locale categories LC_COLLATE, LC_MONETARY, and LC_NUMERIC.

atof

atof(str)

Converts string str to a floating-point value according to the current LC_NUMERIC setting.

atoi

atoi(str)

Converts string str to an integer according to the LC_NUMERIC setting.
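As a minimal sketch of atof and atoi, assuming a current Python 3 interpreter, the following snippet parses numbers in the C locale and then builds a string with whatever decimal point the current LC_NUMERIC setting prescribes.

```python
import locale

locale.setlocale(locale.LC_NUMERIC, 'C')   # neutral numeric conventions

x = locale.atof('3.25')    # the C locale uses '.' as the decimal point
n = locale.atoi('42')

# A string built with the locale's own decimal point parses back
# correctly whatever the LC_NUMERIC setting is.
point = locale.localeconv()['decimal_point']
y = locale.atof('3' + point + '25')
```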

format

format(fmt,num,grouping=0)

Returns the string obtained by formatting number num according to the format string fmt and the LC_NUMERIC setting. Except for cultural convention issues, the result is like fmt%num. If grouping is true, format also groups digits in the result string according to the LC_NUMERIC setting. For example:

>>> locale.setlocale(locale.LC_NUMERIC,'en')
'English_United States.1252'
>>> locale.format('%s',1000*1000)
'1000000'
>>> locale.format('%s',1000*1000,1)
'1,000,000'

When the numeric locale is U.S. English, and argument grouping is true, format supports the convention of grouping digits by threes with commas.
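In later Python versions, function format was superseded by locale.format_string, which accepts the same fmt, num, and grouping arguments. The following sketch assumes such a version and uses the C locale, which is always available. Since the C locale defines no grouping convention, asking for grouping changes nothing there.

```python
import locale

locale.setlocale(locale.LC_NUMERIC, 'C')

plain = locale.format_string('%d', 1000 * 1000)
grouped = locale.format_string('%d', 1000 * 1000, grouping=True)

# In the C locale both calls produce the same digits, because that
# locale specifies no digit grouping.
same_in_c_locale = (plain == grouped)
```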

getdefaultlocale

getdefaultlocale(envvars=['LANGUAGE','LC_ALL',
                 'LC_CTYPE','LANG'])

Examines the environment variables whose names are specified by argument envvars, in order. The first variable found in the environment determines the default locale. getdefaultlocale returns a pair of strings (lang,encoding) compliant with RFC 1766 (except for the 'C' locale), such as ('en_US','ISO8859-1'). Each item of the pair may be None if getdefaultlocale is unable to discover what value the item should have.
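The lookup order can be illustrated with a small sketch. Function first_locale_setting is a hypothetical helper, not part of module locale. It mimics only the search through the environment and does not parse the value it finds into a (lang,encoding) pair.

```python
import os

# Variable names in the default search order.
DEFAULT_ENVVARS = ('LANGUAGE', 'LC_ALL', 'LC_CTYPE', 'LANG')

def first_locale_setting(environ=None, envvars=DEFAULT_ENVVARS):
    """Return (variable, value) for the first variable that is set, or None.

    Hypothetical helper: environ is any mapping and defaults to os.environ.
    """
    if environ is None:
        environ = os.environ
    for name in envvars:
        value = environ.get(name)
        if value:                 # skip variables that are unset or empty
            return name, value
    return None
```

For example, with both LANG and LC_ALL set, the helper reports LC_ALL, since it comes earlier in the search order.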

getlocale

getlocale(category=LC_CTYPE)

Returns a pair of strings (lang,encoding) with the current setting for the given category. The category cannot be LC_ALL.

localeconv

localeconv()

Returns a dictionary d containing the cultural conventions specified by categories LC_NUMERIC and LC_MONETARY of the current locale. While LC_NUMERIC is best used indirectly via other functions of module locale, the details of LC_MONETARY are accessible only through d. Currency formatting is different for local and international use. The U.S. currency symbol, for example, is '$' for local use only. '$' would be ambiguous in international use, since the same symbol is also used for other currencies called "dollars" (Canadian, Australian, Hong Kong, etc.). In international use, therefore, the U.S. currency symbol is the unambiguous string 'USD'. The keys into d to use for currency formatting are the following strings:

'currency_symbol'

Currency symbol to use locally

'frac_digits'

Number of fractional digits to use locally

'int_curr_symbol'

Currency symbol to use internationally

'int_frac_digits'

Number of fractional digits to use internationally

'mon_decimal_point'

String to use as the "decimal point" for monetary values

'mon_grouping'

List of digit grouping numbers for monetary values

'mon_thousands_sep'

String to use as digit-groups separator for monetary values

'negative_sign', 'positive_sign'

String to use as the sign symbol for negative (positive) monetary values

'n_cs_precedes', 'p_cs_precedes'

True if the currency symbol comes before negative (positive) monetary values

'n_sep_by_space', 'p_sep_by_space'

True if a space separates the currency symbol from negative (positive) monetary values

'n_sign_posn', 'p_sign_posn'

Numeric code to use to format negative (positive) monetary values:

0

The value and the currency symbol are placed inside parentheses

1

The sign is placed before the value and the currency symbol

2

The sign is placed after the value and the currency symbol

3

The sign is placed immediately before the value

4

The sign is placed immediately after the value

CHAR_MAX

The current locale does not specify any convention for this formatting

d['mon_grouping'] is a list of numbers of digits to group when formatting a monetary value. When d['mon_grouping'][-1] is locale.CHAR_MAX, there is no further grouping beyond the indicated numbers of digits. When d['mon_grouping'][-1] is 0, grouping continues indefinitely, as if d['mon_grouping'][-2] were endlessly repeated. locale.CHAR_MAX is also the value of all entries in d for which the current locale does not specify any convention.
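The sign-position codes listed above for 'n_sign_posn' and 'p_sign_posn' can be sketched as a small function. Function place_sign and its parameters are hypothetical. It takes the already formatted value, the currency symbol, and the sign as plain strings, and it ignores the separate question of spacing.

```python
def place_sign(value, symbol, sign, sign_posn, cs_precedes=True):
    """Combine a formatted value, a currency symbol, and a sign string.

    Hypothetical helper: sign_posn is one of the codes 0 to 4; any other
    code (such as CHAR_MAX) means that no convention is specified.
    """
    if sign_posn == 3:
        value = sign + value          # sign immediately before the value
    elif sign_posn == 4:
        value = value + sign          # sign immediately after the value
    text = symbol + value if cs_precedes else value + symbol
    if sign_posn == 0:
        return '(' + text + ')'       # parentheses around value and symbol
    if sign_posn == 2:
        return text + sign            # sign after value and symbol
    if sign_posn in (3, 4):
        return text                   # sign was already attached above
    # Code 1, or no convention specified: this sketch puts the sign first.
    return sign + text
```

With a real locale, the arguments would come from d = locale.localeconv(): for a negative amount, d['currency_symbol'], d['negative_sign'], d['n_sign_posn'], and d['n_cs_precedes'].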

normalize

normalize(localename)

Returns a string, suitable as an argument to setlocale, that is the normalized equivalent to localename. If normalize cannot normalize string localename, then normalize returns localename unchanged.
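A brief sketch of both behaviors follows. The second name is deliberately made up, so normalize has nothing to map it to.

```python
import locale

# A short, well-known name gains an explicit encoding part.
full_name = locale.normalize('en_US')

# A hypothetical name that normalize does not know comes back unchanged.
unknown = locale.normalize('zz_not_a_locale')
```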

resetlocale

resetlocale(category=LC_ALL)

Sets the locale for category to the default given by getdefaultlocale.

setlocale

setlocale(category,locale=None)

Sets the locale for category to the given locale, if not None, and returns the setting (the existing one when locale is None; otherwise, the new one). locale can be a string, or a pair of strings (lang,encoding). When locale is the empty string '', setlocale sets the user's default locale.

str

str(num)

Like locale.format('%.12g',num).

strcoll

strcoll(str1,str2)

Like cmp(str1,str2), but according to the LC_COLLATE setting.

strxfrm

strxfrm(str)

Returns a string sx such that the built-in comparison (e.g., by cmp) of strings so transformed is equivalent to calling locale.strcoll on the original strings. strxfrm lets you use the decorate-sort-undecorate (DSU) idiom for sorts that involve locale-conformant string comparisons. However, if all you need is to sort a list of strings in a locale-conformant way, strcoll's simplicity can make it faster. The following example shows two ways of performing such a sort; in this case, the simple variant is often faster than the DSU one:

import locale

# simpler and often faster
def locale_sort_simple(list_of_strings):
    list_of_strings.sort(locale.strcoll)

# less simple and often slower
def locale_sort_DSU(list_of_strings):
    auxiliary_list = [(locale.strxfrm(s), s) for s in list_of_strings]
    auxiliary_list.sort()
    list_of_strings[:] = [s for junk, s in auxiliary_list]
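In later Python versions, the sort method of lists takes a key function rather than a comparison function. Under that assumption, a locale-conformant sort might be sketched in the following two ways. functools.cmp_to_key adapts strcoll to the key protocol, while strxfrm can serve as the key directly, so no explicit auxiliary list is needed. The function names are illustrative.

```python
import functools
import locale

def locale_sorted_strcoll(strings):
    # Adapt the two-argument comparison function to a sort key.
    return sorted(strings, key=functools.cmp_to_key(locale.strcoll))

def locale_sorted_strxfrm(strings):
    # strxfrm transforms each string once; the sort compares the results.
    return sorted(strings, key=locale.strxfrm)
```

Both functions return a new list and honor whatever LC_COLLATE setting is current when they are called.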

10.10.2 The gettext Module

A key issue in internationalization is the ability to use text in different natural languages, a task also called localization. Python supports localization via module gettext, inspired by GNU gettext. Module gettext is optionally able to use the latter's infrastructure and APIs, but is simpler and more general. You do not need to install or study GNU gettext to use Python's gettext effectively.

10.10.2.1 Using gettext for localization

gettext does not deal with automatic translation between natural languages. Rather, gettext helps you extract, organize, and access the text messages that your program uses. Use each string literal subject to translation, also known as a message, as the argument of a function named _ (underscore) rather than using it directly. gettext normally installs a function named _ in the __builtin__ module. To ensure that your program can run with or without gettext, conditionally define a do-nothing function, also named _, that just returns its argument unchanged. Then, you can safely use _('message') wherever you would normally use the literal 'message'. The following example shows how to start a module for conditional use of gettext:

try:
    _
except NameError:
    def _(s): return s

def greet():
    print _('Hello world')

If some other module has installed gettext before you run the previous code, function greet outputs a properly localized greeting. Otherwise, greet outputs the string 'Hello world' unchanged.

Edit your sources, decorating all message literals with function _. Then, use any of various tools to extract messages into a text file (normally named messages.pot), and distribute the file to the people who translate messages into the natural languages you support. Python supplies a script pygettext.py (in directory Tools/i18n in the Python source distribution) to perform message extraction on your Python sources.

Each translator edits messages.pot and produces a text file of translated messages with extension .po. Compile the .po files into binary files with extension .mo, suitable for fast searching, using any of various tools. Python supplies a script Tools/i18n/msgfmt.py usable for this purpose. Finally, install each .mo file with a suitable name in an appropriate directory.
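To make the role of the compiled .mo file more concrete, the following sketch builds a tiny catalog in memory and hands it to gettext.GNUTranslations, assuming a current Python 3 interpreter. Function build_mo is a hypothetical, much simplified stand-in for what a tool such as msgfmt.py does: it handles ASCII messages only and writes no header entry, plural forms, or hash table. In practice, use a real tool to compile .po files.

```python
import gettext
import io
import struct

def build_mo(messages):
    """Return the bytes of a minimal .mo catalog for a dict of ASCII strings.

    Simplified sketch: no header entry, no plural forms, no hash table.
    """
    keys = sorted(messages)
    ids = b''.join(k.encode('ascii') + b'\0' for k in keys)
    strs = b''.join(messages[k].encode('ascii') + b'\0' for k in keys)
    header_size = 7 * 4
    ids_start = header_size + 16 * len(keys)   # after both index tables
    strs_start = ids_start + len(ids)
    id_index, str_index = [], []
    id_pos = str_pos = 0
    for k in keys:
        id_len, str_len = len(k), len(messages[k])
        id_index += [id_len, ids_start + id_pos]
        str_index += [str_len, strs_start + str_pos]
        id_pos += id_len + 1                   # skip the NUL terminator
        str_pos += str_len + 1
    # Magic number, format revision, message count, offsets of the two
    # index tables, and an empty hash table (size 0, offset 0).
    header = struct.pack('<7I', 0x950412de, 0, len(keys),
                         header_size, header_size + 8 * len(keys), 0, 0)
    index = struct.pack('<%dI' % (4 * len(keys)), *(id_index + str_index))
    return header + index + ids + strs

mo_bytes = build_mo({'Welcome': 'Benvenuti'})
catalog = gettext.GNUTranslations(io.BytesIO(mo_bytes))
greeting = catalog.gettext('Welcome')
```

A message that is missing from the catalog, such as catalog.gettext('Goodbye') here, comes back unchanged.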

Conventions about which directories and names are suitable and appropriate differ among platforms and applications. gettext's default is subdirectory share/locale/<lang>/LC_MESSAGES/ of directory sys.prefix, where <lang> is the language's code (normally two letters). Each file is typically named <name>.mo, where <name> is the name of your application or package.
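The default location can be spelled out with a short sketch. Function default_mo_path is a hypothetical helper that only builds the path described above. It does not check whether the file exists.

```python
import os
import sys

def default_mo_path(name, lang, localedir=None):
    """Build the conventional path of a catalog (hypothetical helper)."""
    if localedir is None:
        localedir = os.path.join(sys.prefix, 'share', 'locale')
    return os.path.join(localedir, lang, 'LC_MESSAGES', name + '.mo')
```

For instance, default_mo_path('my_app', 'it') names the file where an Italian catalog for an application called my_app would be looked up by default.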

Once you have prepared and installed your .mo files, you normally execute, from somewhere in your application, code such as the following:

import os, gettext
os.environ.setdefault('LANG', 'en')          # application-default language
gettext.install('your_application_name')

This ensures that calls such as _('message') henceforward return the appropriate translated strings. You can choose different ways to access gettext functionality in your program, for example if you also need to localize C-coded extensions, or to switch back and forth between different languages during a run. Another important consideration is whether you're localizing a whole application, or just a package that is separately distributed.

10.10.2.2 Essential gettext functions

Module gettext supplies many functions; this section documents the ones that are most often used.

install

install(domain,localedir=None,unicode=False)

Installs in Python's built-in namespace a function named _ that performs translations specified by file <lang>/LC_MESSAGES/<domain>.mo in directory localedir, with language code <lang> as per getdefaultlocale. When localedir is None, install uses directory os.path.join(sys.prefix,'share','locale'). When unicode is true, function _ accepts and returns Unicode strings rather than plain strings.
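The following sketch shows the effect of install, assuming a current Python 3 interpreter, where install takes no unicode argument and quietly falls back to untranslated messages when it finds no catalog. The domain name 'hypothetical_app' is made up, so no .mo file is expected to exist for it.

```python
import gettext

# 'hypothetical_app' is an assumed domain with no installed catalog.
gettext.install('hypothetical_app')

# install has placed _ in the built-in namespace, so this module can
# use it without importing or defining it.
greeting = _('Hello world')
```

With a real catalog installed for the domain and the user's language, the same call to _ would return the translated message instead.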

translation

translation(domain,localedir=None,languages=None)

Searches for a .mo file similarly to function install. When languages is None, translation looks in the environment for the lang to use, like install. However, languages can also be a list of one or more lang names, in which case translation uses the first of these names for which it finds a .mo file. (The colon-separated form applies to the values of the environment variables, not to this argument.) Returns an instance object that supplies methods gettext (to translate a plain string), ugettext (to translate a Unicode string), and install (to install gettext or ugettext under name _ into Python's built-in namespace).

Function translation offers more detailed control than install, which is like translation(domain,localedir).install(unicode). With translation, you can localize a single package without affecting the built-in namespace by binding name _ on a per-module basis, for example with:

_ = translation(domain).ugettext

translation also lets you switch globally between several languages, since you can pass an explicit languages argument, keep the resulting instance, and call the install method of the appropriate language as needed:

import gettext
translators = {}
def switch_to_language(lang, domain='my_app', use_unicode=False):
    if lang not in translators:
        # languages must be a list of language codes
        translators[lang] = gettext.translation(domain, languages=[lang])
    translators[lang].install(use_unicode)

