1.4 Character Sets

The character sets that C++ uses at compile time and runtime are implementation-defined. A source file is read as a sequence of characters in the physical character set. When a source file is read, the physical characters are mapped to the compile-time character set, which is called the source character set. The mapping is implementation-defined, but many implementations use the same character set for both, so no real conversion takes place.

At the very least, the source character set always includes the characters listed below. The numeric values of these characters are implementation-defined.

Space
Horizontal tab
Vertical tab
Form feed
Newline
a ... z
A ... Z
0 ... 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
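
Because the numeric values are implementation-defined, portable code refers to these characters by their literals rather than by number. The following is a minimal sketch of that idea; the names kBasicSourceChars and is_basic_source_char are hypothetical, chosen only for illustration, and the code assumes a C++11 compiler for nullptr.

```cpp
#include <cstring>

// Hypothetical helper data: the 96 characters from the list above, written
// as literals so that no numeric character codes appear in the program.
const char kBasicSourceChars[] =
    " \t\v\f\n"                       // space, both tabs, form feed, newline
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'"; // backslash and double quote are escaped

// Returns true if c is one of the characters listed above.
bool is_basic_source_char(char c) {
    // strchr also matches the terminating null character, so exclude it.
    return c != '\0' && std::strchr(kBasicSourceChars, c) != nullptr;
}
```

With this sketch, is_basic_source_char('a') is true and is_basic_source_char('@') is false, regardless of which numeric values the implementation assigns to those characters.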

The runtime character set, called the execution character set, might be different from the source character set (though it is often the same). If the character sets are different, the compiler automatically converts all character and string literals from the source character set to the execution character set. The basic execution character set includes all the source characters listed above, plus the characters listed below. The execution character set is a superset of the basic execution character set; additional characters are implementation-defined and might vary depending on locale.

Alert
Backspace
Carriage return
Null
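
These four characters have no printable form, so a program writes them as escape sequences in character and string literals. As a small illustrative sketch (the function name control_name is hypothetical), the following maps each one to its name without relying on any numeric value:

```cpp
#include <string>

// Hypothetical helper: names the four extra characters by way of their
// escape sequences, so the code never depends on their numeric values.
std::string control_name(char c) {
    switch (c) {
        case '\a': return "alert";
        case '\b': return "backspace";
        case '\r': return "carriage return";
        case '\0': return "null";
        default:   return "other";
    }
}
```

For example, control_name('\r') yields "carriage return", and any character that is not one of the four yields "other".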

Conceptually, source characters are mapped to Unicode (ISO/IEC 10646) and from Unicode to the execution character set. You can specify any Unicode character in the source file as a universal character in the form \uXXXX (lowercase u) or \UXXXXXXXX (uppercase U), in which 0000XXXX or XXXXXXXX is the hexadecimal value for the character. Note that you must use exactly four or eight hexadecimal digits. You cannot use a universal character to specify any character that is in the source character set or in the range 0-0x20 or 0x7F-0x9F (inclusive).
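
The two forms can be compared directly in a compiler that supports them. The sketch below assumes C++11 or later, which is newer than the language this section describes: it relies on char32_t literals (the U prefix), whose value is defined to be the ISO/IEC 10646 code point, and on static_assert. The function name pi_code_point is hypothetical.

```cpp
// Assumes C++11 or later for char32_t literals and static_assert.

// The four-digit and eight-digit forms name the same character.
static_assert(U'\u03c0' == U'\U000003c0', "both forms designate U+03C0");

// Hypothetical helper: returns the code point of the Greek small letter pi.
char32_t pi_code_point() {
    return U'\u03c0';
}
```

Here pi_code_point() returns 0x3C0. An ordinary char or wchar_t literal containing the same universal character would instead hold whatever value the execution character set assigns to it.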

How universal characters map to the execution character set is implementation-defined. Some compilers don't recognize universal characters at all, or support them only in limited contexts.

Typically, you would not write a universal character manually. Instead, you might use a source editor that lets you edit source code in any language, and the editor would store source files in a manner that is appropriate for a particular compiler. When necessary, the editor would write universal character names for characters that fall outside the compiler's source character set. That way, you might write the following in the editor:

const long double π = 3.1415926535897932385L;

and the editor might write the following in the source file:

const long double \u03c0 = 3.1415926535897932385L;

The numerical values for characters in all character sets are implementation-defined, with the following restrictions:

  • The null character always has a value that contains all zero bits.

  • The digit characters have sequential values, starting with the character 0 and ending with 9.
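
These two guarantees are what make the common digit-conversion idiom portable. The following is a minimal sketch, assuming C++11 or later for static_assert; the function names digit_value and digit_char are hypothetical.

```cpp
#include <cassert>

// Both restrictions can be checked at compile time.
static_assert('\0' == 0, "the null character has all zero bits");
static_assert('9' - '0' == 9, "the digit characters are sequential");

// Hypothetical helper: converts a digit character to its numeric value.
int digit_value(char c) {
    assert(c >= '0' && c <= '9');  // precondition: c is a decimal digit
    return c - '0';                // valid because digits are sequential
}

// Hypothetical helper: converts a value in the range 0-9 to its character.
char digit_char(int value) {
    assert(value >= 0 && value <= 9);
    return static_cast<char>('0' + value);
}
```

No similar guarantee exists for letters, so an expression such as c - 'a' is not portable; in EBCDIC, for instance, the letters do not occupy one contiguous range.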

The space, horizontal tab, vertical tab, form feed, and newline characters are called whitespace characters. In most cases, whitespace characters only separate tokens and are otherwise ignored. (Comments are like whitespace; see Section 1.3 earlier in this chapter.)
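
To illustrate how whitespace separates pieces of text, here is a small sketch that counts whitespace-delimited chunks in a string. It is not a C++ tokenizer (a real compiler also splits adjacent tokens that have no whitespace between them, and treats comments and literals specially), and the names is_whitespace and count_chunks are hypothetical. The sketch assumes C++11 for the range-based for loop.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true for the five whitespace characters named above.
bool is_whitespace(char c) {
    return c == ' ' || c == '\t' || c == '\v' || c == '\f' || c == '\n';
}

// Hypothetical helper: counts the runs of non-whitespace characters in text.
std::size_t count_chunks(const std::string& text) {
    std::size_t count = 0;
    bool in_chunk = false;
    for (char c : text) {
        if (is_whitespace(c)) {
            in_chunk = false;       // whitespace ends the current chunk
        } else if (!in_chunk) {
            in_chunk = true;        // first character of a new chunk
            ++count;
        }
    }
    return count;
}
```

Which whitespace character is used, and how many appear in a row, makes no difference: count_chunks("int x = 42 ;") and count_chunks("int\tx\n=\v42\f;") both return 5.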