C.3 Encodings

A full 32 bits of encoding space leaves plenty of room for every character we might want to represent, but it has its own problems. If we need 4 bytes for every character we encode, files (or strings, or streams) become rather verbose. Furthermore, such verbose files are likely to cause a variety of problems for legacy tools. As a solution, Unicode is itself often encoded using "Unicode Transformation Formats" (abbreviated as UTF-*). The encodings UTF-8 and UTF-16 use rather clever techniques to encode characters in a variable number of bytes, with the most common case using just the number of bits indicated in the encoding name. In addition, the byte value ranges used in multibyte characters are chosen to be friendly to existing tools. UTF-32 is also an available encoding, one that simply uses all four bytes for every character in a fixed-width encoding.
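The trade-off between the three formats can be seen by counting bytes per character. The following is a minimal sketch, assuming Python 3 and its standard codecs; the sample characters and the helper name encoded_sizes are chosen here purely for illustration.

```python
# Compare how many bytes each Unicode Transformation Format needs per character.
# Sample characters (illustrative): "e", e-umlaut, euro sign, musical G clef.
samples = ["e", "\u00eb", "\u20ac", "\U0001d11e"]

def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16, and UTF-32."""
    # The "-le" codec variants are used so that no byte order mark is prepended.
    codecs_to_try = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(ch.encode(codec)) for codec in codecs_to_try)

for ch in samples:
    print("U+%04X" % ord(ch), encoded_sizes(ch))
# U+0065 (1, 2, 4)
# U+00EB (2, 2, 4)
# U+20AC (3, 2, 4)
# U+1D11E (4, 4, 4)
```

Only UTF-32 is the same width for every character. UTF-8 and UTF-16 grow as needed, which is why ASCII-heavy text stays compact in UTF-8.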

The design of UTF-8 is such that US-ASCII characters are simply encoded as themselves. For example, the English letter "e" is encoded as the single byte 0x65 in both ASCII and in UTF-8. However, the non-English "e-umlaut" (an "e" with a diacritic), which is Unicode character 0x00EB, is encoded with the two bytes 0xC3 0xAB. In contrast, the UTF-16 representation of every character is always at least 2 bytes (and sometimes 4 bytes). UTF-16 has the rather straightforward representations of the letters "e" and "e-umlaut" as 0x65 0x00 and 0xEB 0x00, respectively (shown here in little-endian byte order). So where does the odd value for the e-umlaut in UTF-8 come from? Here is the trick: No byte of a multibyte encoded UTF-8 character is allowed to be in the 7-bit range used by ASCII, to avoid confusion. So the UTF-8 scheme uses some bit shifting and encodes every Unicode character using up to 6 bytes (the current standard, RFC 3629, limits this to 4 bytes). But the byte values allowed in each position are arranged in such a manner as not to allow confusion of byte positions (for example, if you read a file nonsequentially).
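The bit shifting for the e-umlaut can be worked through by hand. The sketch below assumes Python 3. The function names encode_two_byte and utf8_byte_roles are hypothetical helpers introduced for illustration, and only the two-byte form of UTF-8 is handled. A two-byte character has the bit pattern 110xxxxx 10xxxxxx, so neither byte can fall in the ASCII range, and a continuation byte can never be mistaken for the start of a character.

```python
def encode_two_byte(code_point):
    """Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte UTF-8 range is handled in this sketch")
    lead = 0xC0 | (code_point >> 6)    # 110xxxxx: the upper five bits
    cont = 0x80 | (code_point & 0x3F)  # 10xxxxxx: the lower six bits
    return bytes([lead, cont])

def utf8_byte_roles(text):
    """Label each UTF-8 byte of text as 'ascii', 'lead', or 'continuation'."""
    roles = []
    for byte in text.encode("utf-8"):
        if byte < 0x80:
            role = "ascii"         # 0xxxxxxx: a complete one-byte character
        elif (byte & 0xC0) == 0x80:
            role = "continuation"  # 10xxxxxx: never starts a character
        else:
            role = "lead"          # 11xxxxxx: starts a multibyte character
        roles.append((hex(byte), role))
    return roles

print(encode_two_byte(0x00EB))     # b'\xc3\xab'
print(utf8_byte_roles("e"))        # [('0x65', 'ascii')]
print(utf8_byte_roles("\u00eb"))   # [('0xc3', 'lead'), ('0xab', 'continuation')]
```

Because the role of a byte can be read from its top bits alone, a reader that lands in the middle of a file can skip continuation bytes until it reaches the next ASCII or lead byte.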

Let's look at another example, just to see it laid out. Here is a simple text string encoded in several ways. The view presented is similar to what you would see in a hex-mode file viewer. This way, it is easy to see both a likely on-screen character representation (on a legacy, non-Unicode terminal) and a representation of the underlying hexadecimal values each byte contains:

Hex view of several character string encodings

------------------- Encoding = us-ascii --------------------------
55 6E 69 63 6F 64 65 20 20 20 20 20 20 20 20 20 | Unicode
------------------- Encoding = utf-8 -----------------------------
55 6E 69 63 6F 64 65 20 20 20 20 20 20 20 20 20 | Unicode
------------------- Encoding = utf-16 ----------------------------
FF FE 55 00 6E 00 69 00 63 00 6F 00 64 00 65 00 |   U n i c o d e
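A view like the one above can be approximated with a few lines of code. This is a sketch under stated assumptions, not the tool used to produce the listing: it assumes Python 3, the helper name hex_view is hypothetical, nonprintable bytes are shown as spaces, and the UTF-16 byte order mark is written explicitly in little-endian order so that the result does not depend on the machine.

```python
import codecs

def hex_view(data, width=16):
    """Return hex-viewer style lines for a bytes object, width bytes per line."""
    lines = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        hex_part = " ".join("%02X" % b for b in chunk)
        # Printable ASCII is shown as itself; every other byte becomes a space.
        text_part = "".join(chr(b) if 0x20 <= b < 0x7F else " " for b in chunk)
        # Pad the hex column so the "|" separator lines up on a short last row.
        lines.append("%-*s | %s" % (width * 3 - 1, hex_part, text_part))
    return lines

padded = "Unicode".ljust(16)
views = [
    ("us-ascii", padded.encode("us-ascii")),
    ("utf-8", padded.encode("utf-8")),
    ("utf-16", codecs.BOM_UTF16_LE + "Unicode".encode("utf-16-le")),
]
for name, data in views:
    print("Encoding = %s" % name)
    for line in hex_view(data):
        print(line)
```

The us-ascii and utf-8 rows come out identical because the string contains only ASCII characters, while the utf-16 row starts with the byte order mark FF FE and interleaves a zero byte after each letter.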