Locale Support

A locale is a named group of properties that defines culture-specific conventions. Each locale is made up of one or more categories. Each category controls the behavior of a set of features. For example, the LC_MONETARY category contains information about how monetary values are formatted in some specific territory. The ISO and IEEE (POSIX) standards bodies have stated that a locale should include information such as the ordering of date components, the formatting of numbers, and the language preferred for message text.

PostgreSQL makes use of the locale-processing facilities provided by the host operating system. When you log into your operating system, you are automatically assigned a locale. On a Linux host (and most Unix hosts), you can find your current locale using the locale command:


$ locale

LANG=en_US

LC_CTYPE="en_US"

LC_NUMERIC="en_US"

LC_TIME="en_US"

LC_COLLATE="en_US"

LC_MONETARY="en_US"

LC_MESSAGES="en_US"

LC_PAPER="en_US"

LC_NAME="en_US"

LC_ADDRESS="en_US"

LC_TELEPHONE="en_US"

LC_MEASUREMENT="en_US"

LC_IDENTIFICATION="en_US"

LC_ALL=

You can see that I am using a locale named en_US. Locale names are composed of multiple parts. The first component identifies a language. In my case, the language is en, meaning English. The second (optional) component identifies a country, region, or territory where the language is used. I am in the U.S., so my country code is set to US. You can think of en_US as meaning "English as spoken in the U.S.", as opposed to en_AU, which means "English as spoken in Australia." The third component of a locale name is an optional codeset. I'll talk more about codesets later in this chapter. Finally, a locale name may include modifiers, such as "@euro" to indicate that the locale uses the Euro for currency values.

Language IDs are usually two characters long, written in lowercase, and chosen from the ISO 639 list of country codes. Territories are usually two characters long, written in uppercase, and chosen from the ISO 3166 standard.

The POSIX (and ISO) standards define two special locales named C and POSIX. The C and POSIX locales are defined so that they can be used in many different locations.

Table 20.1 shows a few locale names taken from my Linux host.

Table 20.1. Sample Locale Names

Locale Name

Language

Region

Codeset

Modifier

sv_FI

Swedish

Finland

   

sv_FI@euro

Swedish

Finland

 

Euro is used in this locale

sv_FI.utf8

Swedish

Finland

UTF-8

 

sv_FI.utf8@euro

Swedish

Finland

UTF-8

Euro is used in this locale

sv_SE

Swedish

Sweden

   

sv_SE.utf8

Swedish

Sweden

UTF-8

 

en_AU

English

Australia

   

en_AU.utf8

English

Australia

UTF-8

 

en_IE

English

Ireland

   

en_IE@euro

English

Ireland

 

Euro is used in this locale

en_IE.utf8

English

Ireland

UTF-8

 

en_IE.utf8@euro

English

Ireland

UTF-8

Euro is used in this locale

My Red Hat Linux system defines 277 locales. Each locale is broken down into a set of categories. Most locale implementations define (at least) the categories shown in Table 20.2. Some operating systems define additional categories.

Table 20.2. Locale Information Categories

Category

Influences

Used By

LC_MESSAGES

Message formatting and message language

Client/Server

LC_MONETARY

Monetary value formatting

Server

LC_NUMERIC

Numeric value formatting

Server

LC_TIME

Date and time formatting

Not used

LC_CTYPE

Character classifications (uppercase, punctuation, and so on)

Server

LC_COLLATE

Collating order for string values

Cluster

LC_ALL

All of the above

See all of the above

Enabling Locale Support

When you build PostgreSQL from scratch, locale support is not included unless you include --enable-locale when you configure the source code. If you enable locale support, you should also enable NLS (National Language Support) ?without NLS, you will always see PostgreSQL messages in English. Here is an example showing how to enable both:


$ ./configure --enable-locale --enable-nls

You choose a locale by setting one or more environment variables. There are three levels of environment variables that you can use. At the bottom level, you can set the LANG environment variable to the locale that you want to use. For example, if you want all features to run in a French context unless overridden, set LANG=fr_FR. You can mix locales by defining LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_CTYPE, and/or LC_COLLATE. The LC_xxx environment variables override LANG. If you are working with a data base that stores French names, for example, you may still want to see PostgreSQL messages in English. In this case, you would set LANG=fr_FR and LC_MESSAGES=en_US. At the top level, LC_ALL overrides any other locale-related environment variables: If you want everything to run in German (as spoken in Germany), set LC_ALL=de_DE.

Effects of Locale Support

Let's see what happens when you change locales.

The first category in Table 20.2, LC_MESSAGES, determines the language that PostgreSQL uses when displaying message text. I've been running with LC_MESSAGES set to en_US when I run psql, so messages are displayed in English:


$ psql -d movies

Welcome to psql, the PostgreSQL interactive terminal.



Type:  \copyright for distribution terms

       \h for help with SQL commands

       \? for help on internal slash commands

       \g or terminate with semicolon to execute query

       \q to quit



movies=#

Let's try setting LC_MESSAGES to fr_CA (French as spoken in Canada):


$ LC_MESSAGES=fr_CA psql -d movies

Bienvenu à psql, l'interface interactif de PostgreSQL.



Tapez:  \copyright pour l'information de copyright

        \h pour l'aide-mémoire sur les commandes SQL

        \? pour l'aide-mémoire sur les commandes internes

        \g ou point-virgule pour exécuter une requête

        \q pour quitter



movies=#

Voilà! The client messages are now in French.

When you are running a PostgreSQL client connected to a PostgreSQL server, there are three locales in use: the client locale, the server locale, and the cluster locale.

Some locale properties affect the server, some affect the client, and a few are stored with the database cluster itself (see Table 20.2). The LC_MESSAGES category affects both the client and server because each can produce message text.

Now, let's try a few of the other categories.

The server uses the LC_MONETARY category to control the way in which monetary values are formatted. I've modified the customers table in my database to include a balance column (using the MONEY data type). Here is the new column, shown in the en_US locale:


movies=# SELECT * FROM customers;

 id |    customer_name     |  phone   | birth_date |  balance

----+----------------------+----------+------------+------------

  4 | Wonderland, Alice N. | 555-1122 | 1969-03-05 |

  1 | Jones, Henry         | 555-1212 | 1970-10-10 |     $10.00

  2 | Rubin, William       | 555-2211 | 1972-07-10 |  $1,000.00

  3 | Panky, Henry         | 555-1221 | 1968-01-21 | $10,000.00

(4 rows)

Now, I'll stop my server, set the LC_MONETARY environment variable to fr_FR (French as spoken in France), and restart the server. Note that you must restart the server before a change to LC_MONETARY can take effect (you can't change monetary formatting on a per-connection basis):


$ pg_ctl stop

waiting for postmaster to shut down......done

postmaster successfully shut down



$ export LC_MONETARY=fr_FR

$ pg_ctl -l /tmp/pg.log start

postmaster successfully started

Now, when I query the customers table, the monetary values are formatted using the fr_FR locale:


movies=# SELECT * FROM customers;

 id |    customer_name     |  phone   | birth_date |  balance

----+----------------------+----------+------------+------------

  4 | Wonderland, Alice N. | 555-1122 | 1969-03-05 |

  1 | Jones, Henry         | 555-1212 | 1970-10-10 |     F10,00

  2 | Rubin, William       | 555-2211 | 1972-07-10 |  F1 000,00

  3 | Panky, Henry         | 555-1221 | 1968-01-21 | F10 000,00

(4 rows)

Notice that MONEY values are now formatted using French preferences.

The LC_NUMERIC category determines which characters will be used for grouping, the currency symbol, positive and negative signs, and the decimal point. Currently, LC_NUMERIC is used only by the TO_CHAR() function. The LC_NUMERIC category affects the server.

PostgreSQL currently does not use the LC_TIME category (each date/time value can include an explicit time zone).

LC_CTYPE is consulted whenever PostgreSQL needs to categorize a character. The server locale determines which characters are considered uppercase, lowercase, numeric, punctuation, and so on. The most obvious uses of LC_COLLATE are the LOWER(), UPPER(), and INITCAP() string functions. LC_COLLATE is also used when evaluating regular expressions and the LIKE operator.

LC_COLLATE affects the result of an ORDER BY clause that sorts by a string value. LC_COLLATE also affects how an index that covers a string value is built. Setting LC_COLLATE ensures that strings are ordered properly for your locale.

Let's look at an example. Create two new database clusters and insert the same values into each one. The first database uses the French locale for collating:


$ PGDATA=/usr/local/locale_FR LC_COLLATE=fr_FR initdb

...

Success. You can now start the database server using:

    postmaster -D /usr/local/locale_FR

or

    pg_ctl -D /usr/local/locale_FR -l logfile start



$ PGDATA=/usr/local/locale_FR pg_ctl start

postmaster successfully started



$ PGDATA=/usr/local/locale_FR createdb french_locale

CREATE DATABASE



$ PGDATA=/usr/local/locale_FR psql -q -d french_locale

french_locale=# CREATE TABLE sort_test ( pkey char );

CREATE TABLE

french_locale=# INSERT INTO sort_test VALUES ('a');

INSERT

french_locale=# INSERT INTO sort_test VALUES ('ä');

INSERT

french_locale=# INSERT INTO sort_test VALUES ('b');

INSERT

french_locale=# SELECT * FROM sort_test;

 pkey

------

 a

 ä

 b

 (3 rows)

french_locale=# \q

Now, repeat this procedure but set LC_COLLATE=en_US before creating the database cluster:


$ PGDATA=/usr/local/locale_EN LC_COLLATE=en_US initdb

...

Success. You can now start the database server using:

    postmaster -D /usr/local/locale_EN

or

    pg_ctl -D /usr/local/locale_EN -l logfile start



$ PGDATA=/usr/local/locale_EN pg_ctl start

postmaster successfully started



$ PGDATA=/usr/local/locale_EN createdb english_locale

CREATE DATABASE



$ PGDATA=/usr/local/locale_EN psql -q -d locale_test

english_locale=# CREATE TABLE sort_test ( pkey char );

CREATE TABLE

english_locale=# INSERT INTO sort_test VALUES ('a');

INSERT

english_locale=# INSERT INTO sort_test VALUES ('ä');

INSERT

english_locale=# INSERT INTO sort_test VALUES ('b');

INSERT

english_locale=# SELECT * FROM sort_test;

 pkey

------

 a

 b

 ä

(3 rows)

locale_test=# \q

Notice that the collation sequence has, in fact, changed. With LC_COLLATE set to fr_FR, you see a,ä,b. With LC_COLLATE set to en_US, the ORDER BY clause returns a,b,ä.

The LC_COLLATE category is honored only when you run the initdb command. Imagine what would happen if you were trying to alphabetize a long list of customer names, but the collation rules changed every few minutes. You'd end up with quite a mess?each portion of the final list would be built with a different ordering. If you could change the collating sequence each time you started a client application, indexes would not be built reliably.



    Part II: Programming with PostgreSQL