Recipe 8.20 Reading or Writing Unicode from a Filehandle

8.20.1 Problem

You have a file containing text in a particular encoding, and when you read data from it into a Perl string, Perl treats it as a series of 8-bit bytes. You'd like to work with characters instead of bytes because a character in your encoding can take more than one byte. Also, if Perl doesn't know about your encoding, it may fail to identify certain characters as letters. Similarly, you may want to output text in a particular encoding.

8.20.2 Solution

Use I/O layers to tell Perl that data from that filehandle is in a particular encoding.

open(my $ifh, "<:encoding(ENCODING_NAME)", $filename)
    or die "can't open $filename for reading: $!";
open(my $ofh, ">:encoding(ENCODING_NAME)", $filename)
    or die "can't open $filename for writing: $!";

8.20.3 Discussion

Perl's text manipulation functions handle UTF-8 strings just as well as they do 8-bit data; they just need to know what type of data they're working with. Each string in Perl is internally marked as either UTF-8 or 8-bit data. The encoding(...) layer converts data between various external encodings and the internal UTF-8 within Perl. This is done by way of the Encode module.
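Here's a minimal sketch of the difference the layer makes. It writes the two UTF-8 bytes for U+00C4 to a scratch file, then reads them back once as raw bytes and once through an encoding layer. The helper name slurp_with and the use of File::Temp are our own choices for illustration, not part of the recipe.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file holding the two UTF-8 bytes for U+00C4.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);                          # write raw bytes, no conversion
print {$tmp} "\xC3\x84";
close($tmp) or die "can't close $path: $!";

# Hypothetical helper: read a whole file through the given layer string.
sub slurp_with {
    my ($layers, $file) = @_;
    open(my $fh, "<$layers", $file)
        or die "can't open $file: $!";
    local $/;                           # slurp mode
    my $data = <$fh>;
    close($fh);
    return $data;
}

my $bytes = slurp_with(":raw", $path);              # undecoded bytes
my $chars = slurp_with(":encoding(UTF-8)", $path);  # decoded characters

printf "raw:     length %d\n", length($bytes);
printf "decoded: length %d, code point U+%04X\n",
    length($chars), ord($chars);
```

Read raw, the file yields a two-element string of bytes; read through the layer, it yields a single character whose code point is U+00C4.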

In the section on Unicode Support in Perl back in the Introduction to Chapter 1, we explained how under Unicode, every different character had a different code point (i.e., a different number) associated with it. Assigning all characters unique code points solves many problems. No longer does the same number, like 0xC4, represent one character under one character repertoire (e.g., a LATIN CAPITAL LETTER A WITH DIAERESIS under ISO-8859-1) and a different character in another repertoire (e.g., a GREEK CAPITAL LETTER DELTA under ISO-8859-7).

Unique code points neatly remove that ambiguity, but they still leave one important issue: the precise format used in memory or on disk for each code point. If most code points fit in 8 bits, it would seem wasteful to use, say, a full 32 bits for each character. But if every character is the same size as every other character, the code is easier to write and may be faster to execute.

This has given rise to different encoding systems for storing Unicode, each offering distinct advantages. Fixed-width encodings fit every code point into the same number of bits, which simplifies programming but at the expense of some wasted space. Variable-width encodings use only as much space as each code point requires, which saves space but complicates programming.
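The trade-off is easy to see by counting bytes. The following sketch, which assumes four sample code points picked purely for illustration, encodes each one with the Encode module's encode function in a variable-width format (UTF-8) and a fixed-width one (UTF-32BE).

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample code points of increasing magnitude (chosen for illustration).
my @code_points = (0x41, 0xC4, 0x20AC, 0x1F600);

my (@utf8_sizes, @utf32_sizes);
for my $cp (@code_points) {
    my $char = chr($cp);
    # encode() returns a string of bytes, so length() counts bytes here.
    push @utf8_sizes,  length(encode("UTF-8",    $char));
    push @utf32_sizes, length(encode("UTF-32BE", $char));
    printf "U+%04X  UTF-8: %d byte(s)  UTF-32BE: %d byte(s)\n",
        $cp, $utf8_sizes[-1], $utf32_sizes[-1];
}
```

UTF-8 needs one, two, three, and four bytes for these characters respectively, while UTF-32BE spends four bytes on every one of them.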

One further complication is combined characters, which may look like single letters on paper but in code require multiple code points. When you see a capital A with two dots above it (a diaeresis) on your screen, it may not even be character U+00C4. As explained in Recipe 1.8, Unicode supports the idea of combining characters, where you start with a base character and add non-spacing marks to it. U+0308 is a "COMBINING DIAERESIS", so you could use a capital A (U+0041) followed by U+0308, written "A\x{308}" in a Perl string, to produce the same output.
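A short sketch shows that the two spellings really are different strings to Perl, even though they render alike. It assumes you are willing to bring in the standard Unicode::Normalize module, which this recipe doesn't otherwise use, to put both strings into the same normalization form before comparing them.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{C4}";      # LATIN CAPITAL LETTER A WITH DIAERESIS
my $combined    = "A\x{308}";    # capital A + COMBINING DIAERESIS

printf "precomposed: %d code point(s)\n", length($precomposed);
printf "combined:    %d code point(s)\n", length($combined);

# A plain string comparison sees two different sequences of code points.
print "raw strings differ\n" if $precomposed ne $combined;

# Normalizing both sides to the same form makes them comparable.
print "equal after NFC\n" if NFC($precomposed) eq NFC($combined);
print "equal after NFD\n" if NFD($precomposed) eq NFD($combined);
```

NFC composes to the single code point where one exists, and NFD decomposes to base character plus marks, so either form works as long as both strings use the same one.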

The following table shows the old ISO 8859-1 way of writing a capital A with a diaeresis, in which the logical character code and the physical byte layout enjoyed an identical representation, and the new way under Unicode. We'll include both ways of writing that character: one precomposed in one code point and the other using two code points to create a combined character.

	Old way	New way
	Ä	A	Ä	Ä
Character(s)	0xC4	U+0041	U+00C4	U+0041 U+0308
Character repertoire	ISO 8859-1	Unicode	Unicode	Unicode
Character code(s)	0xC4	0x0041	0x00C4	0x0041 0x0308
Encoding		UTF-8	UTF-8	UTF-8
Byte(s)	0xC4	0x41	0xC3 0x84	0x41 0xCC 0x88

The internal format used by Perl is UTF-8, a variable-width encoding system. One reason for this choice is that legacy ASCII requires no conversion for UTF-8, looking in memory exactly as it did before: just one byte per character. Character U+0041 is just 0x41 in memory. Legacy data sets don't increase in size, and even those using Western character sets like ISO 8859-n grow only slightly, since in practice you still have a favorable ratio of regular ASCII characters to 8-bit accented characters.

Just because Perl uses UTF-8 internally doesn't preclude using other formats externally. Perl automatically converts all data between UTF-8 and whatever encoding you've specified for that handle. The Encode module is used implicitly when you specify an I/O layer of the form ":encoding(...)". For example:

binmode(FH, ":encoding(UTF-16BE)")
    or die "can't binmode to utf-16be: $!";

or directly in the open:

open(FH, "< :encoding(UTF-32)", $pathname)
    or die "can't open $pathname: $!";
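To watch the conversion happen, here's a sketch of a round trip. It writes one character through a UTF-16BE layer set with binmode, inspects the bytes that landed on disk, and then reads the character back through the matching layer given in open. The scratch file from File::Temp and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
close($tmp);

# Write U+00C4 through an encoding layer applied with binmode.
open(my $out, ">", $path) or die "can't open $path: $!";
binmode($out, ":encoding(UTF-16BE)")
    or die "can't binmode to utf-16be: $!";
print {$out} "\x{C4}";
close($out) or die "can't close $path: $!";

# Look at the bytes on disk, bypassing any decoding.
open(my $raw, "<:raw", $path) or die "can't open $path: $!";
my $bytes = do { local $/; <$raw> };
close($raw);
my $hex = join " ", map { sprintf "%02x", $_ } unpack("C*", $bytes);
print "on disk: $hex\n";

# Read the character back, naming the layer directly in the open.
open(my $in, "<:encoding(UTF-16BE)", $path)
    or die "can't open $path: $!";
my $char = do { local $/; <$in> };
close($in);
printf "in Perl: U+%04X\n", ord($char);
```

The file holds the two bytes 00 c4, yet the program only ever handles the single character U+00C4 on either side of the filehandle.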

Here's a comparison of actual byte layouts of those two sequences, both representing a capital A with diaeresis, under several other popular formats:

	U+00C4	U+0041 U+0308
UTF-8	c3 84	41 cc 88
UTF-16BE	00 c4	00 41 03 08
UTF-16LE	c4 00	41 00 08 03
UTF-16	fe ff 00 c4	fe ff 00 41 03 08
UTF-32LE	c4 00 00 00	41 00 00 00 08 03 00 00
UTF-32BE	00 00 00 c4	00 00 00 41 00 00 03 08
UTF-32	00 00 fe ff 00 00 00 c4	00 00 fe ff 00 00 00 41 00 00 03 08
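You can regenerate a table like this one yourself. The sketch below calls the Encode module's encode function directly rather than going through a filehandle; hex_bytes is a hypothetical helper name introduced for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a string and render it as hex bytes.
sub hex_bytes {
    my ($encoding, $string) = @_;
    return join " ", map { sprintf "%02x", $_ }
                     unpack("C*", encode($encoding, $string));
}

my $precomposed = "\x{C4}";      # one code point
my $combined    = "A\x{308}";    # base character plus combining mark

for my $encoding (qw(UTF-8 UTF-16BE UTF-16LE UTF-16
                     UTF-32LE UTF-32BE UTF-32)) {
    printf "%-9s %-24s %s\n", $encoding,
        hex_bytes($encoding, $precomposed),
        hex_bytes($encoding, $combined);
}
```

When asked to encode as plain UTF-16 or UTF-32, Encode writes big-endian data with a byte-order mark in front, which is where the leading fe ff bytes in those rows come from.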

The wider encodings can chew up memory quickly. Matters are further complicated by the fact that some computers are big-endian and others little-endian. So multibyte encoding formats that don't specify their endianness require a special byte-order mark (in UTF-16, the bytes "FE FF" for big-endian versus "FF FE" for little-endian), usually needed only at the start of the stream.
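As a rough illustration of how the mark is used, the following sketch hands the Encode module's decode function two hand-built UTF-16 byte strings, one starting with the bytes FE FF (big-endian) and one with FF FE (little-endian). The byte strings are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same character, U+00C4, in both byte orders, each led by its mark.
my $big_endian    = "\xFE\xFF\x00\xC4";
my $little_endian = "\xFF\xFE\xC4\x00";

# Plain "UTF-16" tells Encode to work out the byte order from the mark.
my $from_big    = decode("UTF-16", $big_endian);
my $from_little = decode("UTF-16", $little_endian);

printf "big-endian input:    %d char(s), U+%04X\n",
    length($from_big), ord($from_big);
printf "little-endian input: %d char(s), U+%04X\n",
    length($from_little), ord($from_little);
```

Both inputs decode to the same single character, and the mark itself is consumed rather than returned as data. If you name the byte order explicitly, as in UTF-16BE or UTF-16LE, no mark is expected.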

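If you're reading or writing UTF-8 data, use the :utf8 layer. Because Perl natively uses UTF-8, the :utf8 layer bypasses the Encode module for performance. The price of that shortcut is that :utf8 does not check that incoming data is well-formed UTF-8; when you need that check, use :encoding(UTF-8) instead.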
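For output, the two ways of asking for UTF-8 should produce the same bytes. This sketch checks that by printing one string through each layer to an in-memory filehandle; bytes_via is a hypothetical helper, and the sample string is our own.

```perl
use strict;
use warnings;

# Hypothetical helper: print text through a layer into an in-memory file
# and return the bytes that were written.
sub bytes_via {
    my ($layer, $text) = @_;
    my $buffer = "";
    open(my $fh, ">$layer", \$buffer)
        or die "can't open in-memory file: $!";
    print {$fh} $text;
    close($fh) or die "can't close in-memory file: $!";
    return $buffer;
}

my $text = "A\x{C4}\x{308}";     # three characters of different widths

my $via_utf8   = bytes_via(":utf8", $text);
my $via_encode = bytes_via(":encoding(UTF-8)", $text);

my $hex = join " ", map { sprintf "%02x", $_ } unpack("C*", $via_utf8);
print "bytes written: $hex\n";
print "both layers wrote identical bytes\n" if $via_utf8 eq $via_encode;
```

The difference between the layers lies in how the bytes get there, not in what ends up in the file.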

The Encode module understands many aliases for encodings, so ascii, US-ascii, and ISO-646-US are synonymous. Read the Encode::Supported manpage for a list of available encodings. Perl supports not only standard names but vendor-specific encodings, too; for example, the Western European characters that iso-8859-1 covers are stored as cp850 on DOS, cp1252 on Windows, MacRoman on a Mac, and hp-roman8 on HP systems. The Encode module knows each of these by name, but they are distinct encodings that assign different byte values to the same characters, not aliases for one another, so you must name the one your data actually uses.
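If you're unsure whether a name is one Encode knows, you can ask it. Here's a small sketch using find_encoding from the Encode module to resolve the three ASCII names mentioned above to their canonical name; the %canonical hash is just scaffolding for the example.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

my %canonical;
for my $name (qw(ascii US-ascii ISO-646-US)) {
    # find_encoding returns an encoding object, or undef for unknown names.
    my $encoding = find_encoding($name)
        or die "unknown encoding name: $name";
    $canonical{$name} = $encoding->name;
    printf "%-12s => %s\n", $name, $canonical{$name};
}
```

All three names resolve to the same canonical name, which is what lets you use them interchangeably in an ":encoding(...)" layer.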