Recipe 6.2 Matching Letters

6.2.1 Problem

You want to see whether a string contains only alphabetic characters.

6.2.2 Solution

The obvious character class for matching regular letters isn't good enough in the general case:

if ($var =~ /^[A-Za-z]+$/) {
    # it is purely alphabetic
}

because it doesn't pay attention to letters with diacritics or characters from other writing systems. The best solution is to use Unicode properties:

if ($var =~ /^\p{Alphabetic}+$/) {   # or just /^\pL+$/
    print "var is purely alphabetic\n";
}
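The gap between the two patterns is easy to see by running both against the same words. The following is a minimal sketch, assuming the script is saved as UTF-8 and run on a Perl recent enough to support Unicode properties; the helper names is_ascii_alpha and is_unicode_alpha are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                              # the string literals below are UTF-8 source text
binmode(STDOUT, ":encoding(UTF-8)");   # encode output so non-ASCII letters print cleanly

# Hypothetical helpers: each returns 1 if the whole string matches, else 0.
sub is_ascii_alpha   { return $_[0] =~ /^[A-Za-z]+$/       ? 1 : 0 }
sub is_unicode_alpha { return $_[0] =~ /^\p{Alphabetic}+$/ ? 1 : 0 }

for my $word ("silly", "façade", "niño", "abc123") {
    printf "%-8s ascii=%d unicode=%d\n",
        $word, is_ascii_alpha($word), is_unicode_alpha($word);
}
```

Only the ASCII-only word passes the [A-Za-z] test, while the property-based test also accepts the accented words; both reject the string containing digits.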

On older releases of Perl that don't support Unicode, your only real option is to use either a negated character class:

if ($var =~ /^[^\W\d_]+$/) {
    print "var is purely alphabetic\n";
}

or, if supported, POSIX character classes:

if ($var =~ /^[[:alpha:]]+$/) {
    print "var is purely alphabetic\n";
}

But these don't work for non-ASCII letters unless you use locale and the system you're running on actually supports POSIX locales.

6.2.3 Discussion

Apart from Unicode properties or POSIX character classes, Perl can't directly express "something alphabetic" independent of locale, so we have to be more clever. The \w regular expression notation matches one alphabetic, numeric, or underscore character, hereafter known as an "alphanumunder" for short. Therefore, \W is one character that is not one of those. The negated character class [^\W\d_] specifies a character that must be neither a non-alphanumunder, a digit, nor an underscore. That leaves nothing but alphabetics, which is what we were looking for.
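The double negation can be checked one character at a time. This is a small sketch using plain ASCII input; is_letter_char is a hypothetical helper name, not something from the recipe.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the single character survives the
# double negation in [^\W\d_], that is, it is a \w character that is
# neither a digit nor an underscore.
sub is_letter_char { return $_[0] =~ /^[^\W\d_]$/ ? 1 : 0 }

for my $ch ("a", "Z", "7", "_", "!", " ") {
    printf "'%s' => %s\n", $ch, is_letter_char($ch) ? "letter" : "not a letter";
}
```

The letters pass; the digit and underscore are excluded explicitly, and the punctuation and space are excluded because they match \W.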

Here's how you'd use this in a program:

use locale;
use POSIX 'locale_h';

# the following locale string might be different on your system
unless (setlocale(LC_ALL, "fr_CA.ISO8859-1")) {
    die "couldn't set locale to French Canadian\n";
}

while (<DATA>) {
    chomp;
    if (/^[^\W\d_]+$/) {
        print "$_: alphabetic\n";
    } else {
        print "$_: line noise\n";
    }
}

__END__
silly
façade
coöperate
niño
Renée
Molière
hæmoglobin
naïve
tschüß
random!stuff#here

POSIX character classes help a little here; available ones are alpha, alnum, ascii, blank, cntrl, digit, graph, lower, print, punct, space, upper, word, and xdigit. These are valid only within a square-bracketed character class specification:

$phone =~ /\b[:digit:]{3}[[:space:][:punct:]]?[:digit:]{4}\b/;     # WRONG
$phone =~ /\b[[:digit:]]{3}[[:space:][:punct:]]?[[:digit:]]{4}\b/; # RIGHT
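To see why the first form is wrong, it helps to test what a lone [:digit:] actually matches. In the sketch below (helper names are hypothetical), the unwrapped form is an ordinary character class containing a colon and the letters d, i, g, and t; Perl normally warns about this, so the warning is silenced locally.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.
sub wrong_digit {
    no warnings 'regexp';   # silence the "POSIX syntax [: :] belongs inside character classes" warning
    return $_[0] =~ /^[:digit:]$/ ? 1 : 0;   # really the set ':', 'd', 'i', 'g', 't'
}
sub right_digit { return $_[0] =~ /^[[:digit:]]$/ ? 1 : 0 }

for my $ch ("5", "d", ":") {
    printf "'%s' wrong=%d right=%d\n", $ch, wrong_digit($ch), right_digit($ch);
}
```

The unwrapped class rejects the digit and accepts the letter and the colon, which is the opposite of what was intended.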

It would be easier to use properties instead, because they don't have to occur only within other square brackets:

$phone =~ /\b\p{Number}{3}[\p{Space}\p{Punctuation}]?\p{Number}{4}\b/;
$phone =~ /\b\pN{3}[\s\pP]?\pN{4}\b/;   # abbreviated form (\s for Space; \pS would mean Symbol)

Match any one character with Unicode property prop using \p{prop}; to match any character lacking that property, use \P{prop} or [^\p{prop}]. The relevant property when looking for alphabetics is Alphabetic. The closely related general category Letter, which can be abbreviated as just L, covers letters proper and is slightly narrower than Alphabetic. Other relevant properties include UppercaseLetter, LowercaseLetter, and TitlecaseLetter; their short forms are Lu, Ll, and Lt, respectively.
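The short forms can be tried out on individual characters. The sketch below is illustrative only: letter_case is a hypothetical helper, and U+01C5 (the digraph "Dž") is used as the titlecase sample.

```perl
use strict;
use warnings;

# Hypothetical helper: reports the case category of the first character
# of its argument, using the short property names Lu, Ll, Lt, and L.
sub letter_case {
    my ($ch) = @_;
    return "uppercase"    if $ch =~ /^\p{Lu}/;
    return "lowercase"    if $ch =~ /^\p{Ll}/;
    return "titlecase"    if $ch =~ /^\p{Lt}/;
    return "other letter" if $ch =~ /^\pL/;
    return "not a letter";   # for non-empty input, /^\PL/ would match here
}

# "\x{01C5}" is U+01C5, a titlecase letter.
for my $ch ("A", "a", "\x{01C5}", "7") {
    printf "U+%04X => %s\n", ord($ch), letter_case($ch);
}
```

Code points are printed rather than the characters themselves so that the output stays plain ASCII regardless of terminal encoding.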

6.2.4 See Also

The treatment of locales in Perl in perllocale(1); your system's locale(3) manpage; Recipe 6.12, which discusses locales in greater depth; the "Perl and the POSIX Locale" section of Chapter 7 of Mastering Regular Expressions, along with much of that book's Chapter 3.