Recipe 8.21 Converting Microsoft Text Files into Unicode

8.21.1 Problem

You have a text file written on a Microsoft computer that looks like garbage when displayed. How do you fix this?

8.21.2 Solution

Set the appropriate encoding layer on the input handle so the data is converted into Unicode as it is read:

binmode(IFH, ":encoding(cp1252)")
    || die "can't binmode to cp1252 encoding: $!";

8.21.3 Discussion

Suppose someone sends you a file in cp1252 format, Microsoft's default in-house 8-bit character set. Files in this format can be annoying to read: while they might claim to be Latin1, they are not, and if you look at them with Latin1 fonts loaded, you'll get garbage on your screen. A simple solution is as follows:

open(MSMESS, "< :crlf :encoding(cp1252)", $inputfile)
    || die "can't open $inputfile: $!";

Now data read from that handle will be automatically converted into Unicode. It will also be processed in CRLF mode, which is needed on systems that don't use that sequence to indicate end of line.
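To see the conversion at work, here is a minimal sketch that writes a few raw cp1252 bytes to a temporary file and reads them back through the same layers. The temporary file, the sample bytes, and the variable names are made up for illustration. The bytes 0x93 and 0x94 are cp1252's curly double quotes, which Unicode places at U+201C and U+201D.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write raw cp1252 bytes: curly quotes (0x93, 0x94) around "hi", then CRLF.
my ($tmp, $inputfile) = tempfile(UNLINK => 1);
binmode($tmp);                          # raw bytes, no translation
print {$tmp} "\x93hi\x94\r\n";
close($tmp) || die "can't close $inputfile: $!";

# Read it back with the layers from the Discussion.
open(my $fh, "< :crlf :encoding(cp1252)", $inputfile)
    || die "can't open $inputfile: $!";
my $line = <$fh>;
close($fh);

chomp($line);                           # the CRLF arrived as a plain "\n"

# Each character is now a Unicode code point, not a cp1252 byte.
my @codepoints = map { sprintf("U+%04X", ord($_)) } split(//, $line);
print "@codepoints\n";                  # U+201C U+0068 U+0069 U+201D
```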

You probably won't be able to write out this text as Latin1. That's because cp1252 includes characters that don't exist in Latin1. You'll have to leave it in Unicode, and displaying Unicode properly may not be as easy as you wish, because finding tools to work with Unicode is something of a quest in its own right. Most web browsers support ISO 10646 fonts; that is, Unicode fonts (see http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html). Whether your text editor does is a different matter, although both emacs and vi (actually, vim, not nvi) have mechanisms for handling Unicode. The authors used the following xterm(1) command to look at text:

xterm -n unicode -u8 -fn -misc-fixed-medium-r-normal--20-200-75-75-c-100-iso10646-1

But many open questions still exist, such as cutting and pasting of Unicode data between windows.

The www.unicode.org site has help for finding and installing suitable tools for a variety of platforms, including both Unix and Microsoft systems.
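If you want to check for yourself that such text can't be written out as Latin1, the following sketch uses the standard Encode module rather than I/O layers. The sample string is invented for illustration. Passing Encode::FB_CROAK asks encode to die when it meets a character the target encoding lacks, and Encode::LEAVE_SRC keeps it from modifying the source string.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# cp1252 bytes: curly quote, "caf", e-acute, curly quote
my $text = decode("cp1252", "\x93caf\xE9\x94");

my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

# The curly quotes have no Latin1 equivalent, so this encode() dies.
my $latin1 = eval { encode("iso-8859-1", $text, $check) };
print defined($latin1) ? "encoded as Latin1\n"
                       : "not representable in Latin1: $@";

# UTF-8 can hold every character in the string.
my $utf8 = encode("UTF-8", $text, $check);
printf "UTF-8 length: %d bytes\n", length($utf8);
```

The e-acute alone would have survived the trip to Latin1; it is the curly quotes that force you to stay in Unicode.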

You'll also need to tell Perl it's all right to emit Unicode. If you don't, you'll get a "Wide character in print" warning every time you try. Assuming you're running in an xterm like the one shown previously (or its equivalent for your system) that has Unicode fonts available, you could just do this:

binmode(STDOUT, ":utf8");

But that requires the rest of your program to emit Unicode, which might not be convenient. When writing new programs specifically designed for this, though, it might not be too much trouble.
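The effect of the layer on that warning can be seen in a small experiment. This is only a sketch: it prints to an in-memory filehandle instead of STDOUT so that it can run anywhere, and it collects warnings with a $SIG{__WARN__} handler. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

open(my $out, ">", \my $buffer)
    || die "can't open in-memory handle: $!";

print {$out} "\x{201C}quoted\x{201D}\n";    # no layer yet: Perl warns
my $before = @warnings;

binmode($out, ":utf8")
    || die "can't binmode to utf8: $!";
print {$out} "\x{201C}quoted\x{201D}\n";    # layer set: no new warning
my $after = @warnings - $before;
close($out);

print "warnings without layer: $before, with layer: $after\n";
```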

As of v5.8.1, Perl offers a couple of other means of getting this effect. The -C command-line switch controls some Unicode features related to your runtime environment. This way you can set those features on a per-command basis without having to edit the source code.

The -C switch can be followed by either a number or a list of option letters. Some available letters, their numeric values, and effects are as follows:

Letter	Number	Meaning
I	1	STDIN is assumed to be in UTF-8
O	2	STDOUT will be in UTF-8
E	4	STDERR will be in UTF-8
S	7	I + O + E
i	8	UTF-8 is the default PerlIO layer for input streams
o	16	UTF-8 is the default PerlIO layer for output streams
D	24	i + o
A	32	The @ARGV elements are expected to be strings encoded in UTF-8

You may use letters or numbers. If you use numbers, you have to add them up. For example, -COE and -C6 are synonyms: both select UTF-8 on STDOUT and STDERR.
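The arithmetic is easy to get wrong by hand, so here is a small sketch that turns a string of option letters into the equivalent number. The values are copied from the table above. The function name c_switch_number is a made-up helper, not something Perl provides.

```perl
use strict;
use warnings;

# Values copied from the table of -C option letters.
my %flag = (I => 1, O => 2, E => 4, S => 7,
            i => 8, o => 16, D => 24, A => 32);

# c_switch_number is a hypothetical helper, not part of Perl.
sub c_switch_number {
    my ($letters) = @_;
    my $n = 0;
    for my $letter (split //, $letters) {
        die "unknown -C letter '$letter'\n" unless exists $flag{$letter};
        $n |= $flag{$letter};   # OR, so "SI" doesn't count I twice
    }
    return $n;
}

printf "-C%s is -C%d\n", $_, c_switch_number($_) for qw(OE SD SDA);
```

This prints -C6 for OE, -C31 for SD, and -C63 for SDA.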

One last approach is to use the PERL_UNICODE environment variable. If set, it contains the same value as you would use with -C. For example, with the xterm that has Unicode fonts loaded, you could do this in a POSIX shell:

sh% export PERL_UNICODE=6

or this in the csh:

csh% setenv PERL_UNICODE 6

The advantage of using the environment variable is that you don't have to edit the source code as the binmode call would require, and you don't even need to change the command invocation as setting -C would require.
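One way to convince yourself that the switch and the environment variable do the same job is to ask a child perl what it thinks its settings are. The sketch below reads the special variable ${^UNICODE}, which reflects the -C settings in effect. It assumes a Unix-like system where the list form of a piped open is available, and unicode_flags is a made-up helper name.

```perl
use strict;
use warnings;

# unicode_flags is a hypothetical helper: it runs a child perl with
# the given switches and returns what that child sees in ${^UNICODE}.
sub unicode_flags {
    my @switches = @_;
    open(my $pipe, "-|", $^X, @switches, "-e", 'print ${^UNICODE}')
        || die "can't run $^X: $!";
    my $flags = <$pipe>;
    close($pipe);
    return $flags;
}

my $via_switch = unicode_flags("-C6");      # set on the command line

local $ENV{PERL_UNICODE} = 6;               # set in the environment
my $via_env = unicode_flags();

print "switch: $via_switch, environment: $via_env\n";
```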