Recipe 1.8 Treating Unicode Combined Characters as Single Characters

1.8.1 Problem

You have a Unicode string that contains combining characters, and you'd like to treat each base character, together with the combining characters that follow it, as a single logical character.

1.8.2 Solution

Process them using \X in a regular expression.

$string = "fac\x{0327}ade";         # "façade"
$string =~ /fa.ade/;                # fails
$string =~ /fa\Xade/;               # succeeds

@chars = split(//, $string);        # 7 letters in @chars
@chars = $string =~ /(.)/g;         # same thing
@chars = $string =~ /(\X)/g;        # 6 "letters" in @chars

1.8.3 Discussion

In Unicode, you can combine a base character with one or more non-spacing characters following it; these are usually diacritics, such as accent marks, cedillas, and tildes. Because Unicode also includes precombined characters, mostly to accommodate legacy character sets, there can be two or more ways of writing the same thing.

For example, the word "façade" can be written with one character between the two a's, "\x{E7}", a character right out of Latin1 (ISO 8859-1). That character might be encoded as a two-byte sequence under the UTF-8 encoding that Perl uses internally, but those two bytes still count as only one character. That works just fine.
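As a small illustration of the character-versus-byte distinction, the sketch below counts the precombined character both ways. It assumes the standard Encode module to produce an explicit UTF-8 byte string; the variable names are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $cedilla_c = "\x{E7}";                     # U+00E7, precombined c with cedilla
my $bytes     = encode("UTF-8", $cedilla_c);  # explicit UTF-8 byte string

my $char_count = length($cedilla_c);          # length counts characters
my $byte_count = length($bytes);              # here each "character" is one byte

printf "%d character, %d bytes\n", $char_count, $byte_count;
# prints "1 character, 2 bytes"
```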

There's a thornier issue. Another way to write U+00E7 is with two different code points: a regular "c" followed by "\x{0327}". Code point U+0327 is a non-spacing combining character that means to go back and put a cedilla underneath the preceding base character.
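To make the two spellings concrete, this sketch lists the code points in each and compares the strings. The helper name code_points is hypothetical, chosen only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: show a string as a list of U+XXXX code points.
sub code_points {
    my ($text) = @_;
    return join " ", map { sprintf "U+%04X", ord } split //, $text;
}

my $precombined = "fa\x{E7}ade";        # one code point for the c with cedilla
my $combining   = "fac\x{0327}ade";     # "c" followed by a combining cedilla

my $pre_points  = code_points($precombined);
my $comb_points = code_points($combining);
my $identical   = $precombined eq $combining ? "yes" : "no";

print "$pre_points\n";      # U+0066 U+0061 U+00E7 U+0061 U+0064 U+0065
print "$comb_points\n";     # U+0066 U+0061 U+0063 U+0327 U+0061 U+0064 U+0065
print "identical: $identical\n";        # identical: no
```

Although both strings display as the same word, eq compares code points, so they are not equal.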

There are times when you want Perl to treat each combined character sequence as one logical character. But because they're distinct code points, Perl's character-related operations, including substr, length, and regular expression metacharacters such as those in /./ or /[^abc]/, treat non-spacing combining characters as separate characters.
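A short sketch of that behavior, using the decomposed "façade" from the Solution; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $string = "fac\x{0327}ade";              # decomposed "façade"

my $code_points = length($string);          # 7: the cedilla counts separately
my $third       = substr($string, 2, 1);    # just "c", without its cedilla
my $fourth      = substr($string, 3, 1);    # the bare combining cedilla, U+0327
my $sequences   = () = $string =~ /\X/g;    # 6: one per combined sequence

printf "%d code points, %d combined sequences\n", $code_points, $sequences;
# prints "7 code points, 6 combined sequences"
```

Here substr splits the base character from its mark, while counting matches of \X keeps them together.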

In a regular expression, the \X metacharacter matches an extended Unicode combining character sequence, and is exactly equivalent to (?:\PM\pM*) or, in long-hand:

(?x:                # begin non-capturing group
        \PM         # one character without the M (mark) property,
                    #   such as a letter
        \pM         # one character that does have the M (mark) property,
                    #   such as an accent mark
        *           # and you can have any number of marks, including none
)                   # end non-capturing group
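As a rough check of that equivalence, the sketch below splits this recipe's sample strings both ways and compares the results. It covers only these samples: newer Perl releases define \X as an extended grapheme cluster, so the two patterns may differ on other input, such as a carriage return followed by a line feed.

```perl
use strict;
use warnings;

my @samples   = ("fac\x{0327}ade", "anne\x{301}e", "nin\x{303}o");
my $all_agree = 1;

for my $sample (@samples) {
    my @by_x    = $sample =~ /\X/g;             # built-in metacharacter
    my @by_long = $sample =~ /(?:\PM\pM*)/g;    # spelled-out pattern
    # Join with "|" so the sequence boundaries take part in the comparison.
    $all_agree = 0 unless join("|", @by_x) eq join("|", @by_long);
}

print $all_agree ? "same split on all samples\n" : "splits differ\n";
```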

Otherwise simple operations become tricky if these beasties are in your string. Consider the approaches for reversing a word by character from the previous recipe. Written with combining characters, "année" and "niño" can be expressed in Perl as "anne\x{301}e" and "nin\x{303}o".

binmode(STDOUT, ":utf8");           # emit wide characters as UTF-8
for $word ("anne\x{301}e", "nin\x{303}o") {
    printf "%s simple reversed to %s\n", $word,
        scalar reverse $word;
    printf "%s better reversed to %s\n", $word,
        join("", reverse $word =~ /\X/g);
}

That produces:

année simple reversed to éenna
année better reversed to eénna
niño simple reversed to õnin
niño better reversed to oñin

In the lines marked "simple reversed", the diacritical mark jumped from one base character to another. That's because a combining character always follows its base character, and you've reversed the whole string, so each mark now follows a different base. Grabbing entire sequences of a base character plus any combining characters that follow, then reversing that list, avoids the problem.
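The same idea can be wrapped in small helpers. This is a minimal sketch: reverse_sequences and sequence_substr are hypothetical names, not built-ins, and sequence_substr assumes a non-negative offset and length.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse by combined sequence, not by code point.
sub reverse_sequences {
    my ($text) = @_;
    return join "", reverse $text =~ /\X/g;
}

# Hypothetical helper: like substr, but offset and length count combined
# sequences. Assumes both are non-negative.
sub sequence_substr {
    my ($text, $offset, $length) = @_;
    my @sequences = $text =~ /\X/g;
    # A slice past the end yields undef entries, so drop those.
    return join "", grep { defined } @sequences[$offset .. $offset + $length - 1];
}

my $reversed = reverse_sequences("nin\x{303}o");          # "on\x{303}in"
my $third    = sequence_substr("fac\x{0327}ade", 2, 1);   # "c\x{0327}"
```

Each helper splits the string once with /\X/g and works on the resulting list, so a mark never gets separated from its base character.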

1.8.4 See Also

The perlre(1) and perluniintro(1) manpages; Chapter 15 of Programming Perl; Recipe 1.9