A.16 Regular Expressions

Regular expressions are, in effect, an extra language that lives inside the Perl language. In Perl, they have quite a lot of features. First, I'll summarize how regular expressions work in Perl; then, I'll present some of their many features.

A.16.1 Overview

Regular expressions describe patterns in strings. The pattern described by a single regular expression may match many different strings.

Regular expressions are used in pattern matching, that is, when you look to see if a certain pattern exists in a string. They can also change strings, as with the s/// operator, which replaces the pattern, if found, with a replacement. The related tr/// operator transliterates several characters into replacement characters throughout a string; it is usually discussed alongside pattern matching, although it takes lists of characters rather than true regular expressions. Regular expressions are case-sensitive unless explicitly told otherwise.
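As a minimal sketch of the two string-changing operators just mentioned, the following copies a string and then modifies the copy. The sample string and the variable names are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my $dna = 'ACGTACGTTT';

# s/// replaces the first match of the pattern with the replacement.
(my $first_only = $dna) =~ s/ACG/acg/;

# tr/// maps each character in the first list to its counterpart in
# the second list; the lists are plain characters, not patterns.
(my $rna = $dna) =~ tr/T/U/;

print "$first_only\n";   # prints acgTACGTTT
print "$rna\n";          # prints ACGUACGUUU
```

Copying into a new variable inside the parentheses leaves the original $dna untouched, which makes the effect of each operator easy to compare.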

The simplest pattern match is a string that matches itself. For instance, to see if the pattern 'abc' appears in the string 'abcdefghijklmnopqrstuvwxyz', write the following in Perl:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';
if( $alphabet =~ /abc/ ) {
    print $&;
}

The =~ operator binds a pattern match to a string. /abc/ is the pattern abc, enclosed in forward slashes to indicate that it's a regular-expression pattern. $& is set to the part of the string that matched, if any. In this case, the match succeeds, since 'abc' appears in the string $alphabet, and the code just given prints out abc.

Regular expressions are made from two kinds of characters. Many characters, such as 'a' or 'Z', match themselves. Metacharacters have a special meaning in the regular-expression language. For instance, parentheses are used to group other characters and don't match themselves. If you want to match a metacharacter such as ( in a string, you have to precede it with the backslash metacharacter, writing \( in the pattern.
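The difference between an escaped and an unescaped metacharacter can be sketched as follows. The sample string is hypothetical; the point is only that \( and \) match literal parentheses, while bare parentheses group.

```perl
use strict;
use warnings;

my $text = 'f(x) = 2x';   # hypothetical sample string

# Escaped parentheses match literal ( and ) characters.
my $escaped = $text =~ /f\(x\)/ ? 'match' : 'no match';

# Unescaped parentheses only group, so this pattern looks for "fx".
my $grouped = $text =~ /f(x)/ ? 'match' : 'no match';

print "escaped: $escaped\n";   # prints escaped: match
print "grouped: $grouped\n";   # prints grouped: no match
```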

There are three basic ideas behind regular expressions. The first is concatenation: two items next to each other in a regular-expression pattern (that's the string between the forward slashes in the examples) must match two items next to each other in the string being matched (the $alphabet in the examples). So, to match 'abc' followed by 'def', concatenate them in the regular expression:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';
if( $alphabet =~ /abcdef/ ) {
    print $&;
}

This prints:

abcdef

The second major idea is alternation. Items separated by the | metacharacter match any one of the items. For example, the following:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';
if( $alphabet =~ /a(b|c|d)c/ ) {
    print $&;
}

prints:

abc

The example also shows how parentheses group things in a regular expression. The parentheses are metacharacters that aren't matched in the string; rather, they group the alternation, given as b|c|d, meaning any one of b, c, or d at that position in the pattern. Since b is actually in $alphabet at that position, the alternation, and indeed the entire pattern a(b|c|d)c, matches in the $alphabet. (One additional point: ab|cd means (ab)|(cd), not a(b|c)d.)
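The precedence point in the parenthetical can be checked with a short sketch. The three test strings are hypothetical; each is tried against both patterns, and $& reports what matched.

```perl
use strict;
use warnings;

my @lines;
for my $string ('ab', 'cd', 'acd') {
    # $& holds the matched text after a successful match.
    my $alternated = $string =~ /ab|cd/   ? $& : 'no match';
    my $grouped    = $string =~ /a(b|c)d/ ? $& : 'no match';
    push @lines, "$string: ab|cd gives $alternated; a(b|c)d gives $grouped";
}
print "$_\n" for @lines;
# prints:
# ab: ab|cd gives ab; a(b|c)d gives no match
# cd: ab|cd gives cd; a(b|c)d gives no match
# acd: ab|cd gives cd; a(b|c)d gives acd
```

Without parentheses, the alternation splits the whole pattern into ab and cd, so even 'acd' matches only through its 'cd' substring.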

The third major idea of regular expressions is repetition (or closure). This is indicated in a pattern with the quantifier metacharacter *, sometimes called the Kleene star after one of the inventors of regular expressions. When * appears after an item, it means that the item may appear 0, 1, or any number of times at that place in the string. So, for example, all of the following pattern matches will succeed:

'AC' =~ /AB*C/;
'ABC' =~ /AB*C/;
'ABBBBBBBBBBBC' =~ /AB*C/;

A.16.2 Metacharacters

The following are metacharacters:

\ | ( ) [ { ^ $ * + ? .

A.16.2.1 Escaping with \

A backslash \ before a metacharacter causes it to match itself; for instance, \\ matches a single \ in the string.

A.16.2.2 Alternation with |

The pipe | indicates alternation, as described previously.

A.16.2.3 Grouping with ( )

The parentheses ( ) provide grouping, as described previously.

A.16.2.4 Character classes

Square brackets [ ] specify a character class. A character class matches one character, which can be any character specified. For instance, [abc] matches either a, or b, or c at that position (so it's the same as a|b|c). A-Z is a range that matches any uppercase letter, a-z matches any lowercase letter, and 0-9 matches any digit. For instance, [A-Za-z0-9] matches any single letter or digit at that position. If the first character in a character class is ^, any character except those specified matches; for instance, [^0-9] matches any character that isn't a digit.
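A brief sketch of a range and a negated class follows. Both sample strings and all variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical DNA string containing one ambiguous base.
my $sequence = 'ACGTNACGT';

# [^ACGT] matches the first character that is not A, C, G, or T.
my $unexpected = $sequence =~ /[^ACGT]/ ? $& : '';

# [0-9] matches the first digit in the string.
my $label = 'gene42';
my $first_digit = $label =~ /[0-9]/ ? $& : '';

print "$unexpected\n";    # prints N
print "$first_digit\n";   # prints 4
```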

A.16.2.5 Matching any character with a dot

The period or dot . represents any character except a newline. (The pattern modifier /s makes it also match a newline.) So, . is like a character class that specifies every character other than a newline.
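The effect of /s on the dot can be seen in a small sketch, assuming a hypothetical two-line string in which a newline sits where the dot must match.

```perl
use strict;
use warnings;

my $text = "line one\nline two";   # hypothetical two-line string

# The dot must match the newline between "one" and "line".
my $without_s = $text =~ /one.line/  ? 'match' : 'no match';
my $with_s    = $text =~ /one.line/s ? 'match' : 'no match';

print "without /s: $without_s\n";   # prints without /s: no match
print "with /s: $with_s\n";         # prints with /s: match
```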

A.16.2.6 Beginning and end of strings with ^ and $

The ^ metacharacter doesn't match a character; rather, it asserts that the item that follows must be at the beginning of the string. Similarly, the $ metacharacter doesn't match a character but asserts that the item that precedes it must be at the end of the string (or before the final newline). For example: /^Watson and Crick/ matches if the string starts with Watson and Crick; and /Watson and Crick$/ matches if the string ends with Watson and Crick or Watson and Crick\n.
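The two anchors can be tried on a string that ends in a newline, as in this sketch; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $line = "Watson and Crick\n";

# ^ requires the match to begin at the start of the string.
my $starts_with_watson = $line =~ /^Watson/ ? 'yes' : 'no';

# $ allows the match to end just before a final newline.
my $ends_with_crick = $line =~ /Crick$/ ? 'yes' : 'no';

# Crick is in the string, but not at the beginning.
my $starts_with_crick = $line =~ /^Crick/ ? 'yes' : 'no';

print "$starts_with_watson $ends_with_crick $starts_with_crick\n";
# prints yes yes no
```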

A.16.2.7 Quantifiers

These metacharacters indicate the repetition of an item. The * metacharacter indicates zero, one, or more of the preceding item. The + metacharacter indicates one or more of the preceding item. The brace { } metacharacters let you specify exactly the number of previous items, or a range. For instance, {3} means exactly three of the preceding item; {3,7} means three, four, five, six, or seven of the preceding item; and {3,} means three or more of the preceding item. The ? matches none or one of the preceding item.
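To compare the quantifiers side by side, the following sketch tests three hypothetical strings against four anchored patterns. The anchors ^ and $ force each pattern to account for the whole string.

```perl
use strict;
use warnings;

my @rows;
for my $string ('AC', 'ABC', 'ABBBC') {
    my $star  = $string =~ /^AB*C$/     ? 'yes' : 'no';   # zero or more B
    my $plus  = $string =~ /^AB+C$/     ? 'yes' : 'no';   # one or more B
    my $maybe = $string =~ /^AB?C$/     ? 'yes' : 'no';   # zero or one B
    my $range = $string =~ /^AB{3,7}C$/ ? 'yes' : 'no';   # three to seven B
    push @rows, "$string: * $star, + $plus, ? $maybe, {3,7} $range";
}
print "$_\n" for @rows;
# prints:
# AC: * yes, + no, ? yes, {3,7} no
# ABC: * yes, + yes, ? yes, {3,7} no
# ABBBC: * yes, + yes, ? no, {3,7} yes
```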

A.16.2.8 Making quantifiers match minimally with ?

The quantifiers just shown are greedy (or maximal) by default, meaning that they match as many items as possible. Sometimes, you want a minimal match that will match as few items as possible. You get that by following each of * + {} ? with a ?. So, for instance, *? tries to match as few as possible, perhaps even none, of the preceding item before it tries to match one or more of the preceding item. Here's a maximal match:

'hear ye hear ye hear ye' =~ /hear.*ye/;
print $&;

This matches 'hear' followed by .* (as many characters as possible), followed by 'ye', and prints:

hear ye hear ye hear ye

Here is a minimal match:

'hear ye hear ye hear ye' =~ /hear.*?ye/;
print $&;

This matches 'hear' followed by .*? (the fewest number of characters possible), followed by 'ye', and prints:

hear ye

A.16.3 Capturing Matched Patterns

You can place parentheses around parts of the pattern for which you want to know the matched string. Take, for example, the following:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';
$alphabet =~ /k(lmnop)q/;
print $1;

This prints:

lmnop

You can place as many pairs of parentheses in a regular expression as you like; Perl automatically stores their matched substrings in special variables named $1, $2, and so on. The matches are numbered in order of the left-to-right appearance of their opening parenthesis.

Here's a more intricate example of capturing parts of a matched pattern in a string:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';
$alphabet =~ /(((a)b)c)/;
print "First pattern = ", $1,"\n";
print "Second pattern = ", $2,"\n";
print "Third pattern = ", $3,"\n";

This prints:

First pattern = abc
Second pattern = ab
Third pattern = a
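When the groups sit side by side rather than nested, the numbering simply runs left to right. The sketch below also tests that the match succeeded before reading $1 and $2, since those variables keep their old values after a failed match; the variable names are illustrative.

```perl
use strict;
use warnings;

my $names = 'Watson and Crick';
my ($first, $second) = ('', '');

# Read the capture variables only after a successful match.
if ( $names =~ /([A-Za-z]+) and ([A-Za-z]+)/ ) {
    ($first, $second) = ($1, $2);
}

print "First pattern = $first\n";     # prints First pattern = Watson
print "Second pattern = $second\n";   # prints Second pattern = Crick
```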

A.16.4 Metasymbols

Metasymbols are sequences of two or more characters consisting of backslashes before normal characters. These metasymbols have special meanings in Perl regular expressions (and in double-quoted strings for most of them). There are quite a few of them, but that's because they're so useful. Table A-3 lists most of these metasymbols. The column "Atomic" indicates Yes if the metasymbol matches an item, No if the metasymbol just makes an assertion, and - if it takes some other action.

Table A-3. Alphanumeric metasymbols
Symbol	Atomic	Meaning
\0	Yes	Match the null character (ASCII NULL)
\NNN	Yes	Match the character given in octal, up to 377
\1, \2, ...	Yes	Match the nth previously captured string (n is a decimal number)
\a	Yes	Match the alarm character (BEL)
\A	No	True at the beginning of a string
\b	Yes	Match the backspace character (BS); only inside a character class
\b	No	True at word boundary
\B	No	True when not at word boundary
\cX	Yes	Match the control character Control-X
\d	Yes	Match any digit character
\D	Yes	Match any nondigit character
\e	Yes	Match the escape character (ASCII ESC, not backslash)
\E	-	End case (\L, \U) or metaquote (\Q) translation
\f	Yes	Match the formfeed character (FF)
\G	No	True at end-of-match position of prior m//g
\l	-	Lowercase the next character only
\L	-	Lowercase till \E
\n	Yes	Match the newline character (usually NL, but CR on Macs)
\Q	-	Quote (de-meta) metacharacters till \E
\r	Yes	Match the return character (usually CR, but NL on Macs)
\s	Yes	Match any whitespace character
\S	Yes	Match any nonwhitespace character
\t	Yes	Match the tab character (HT)
\u	-	Titlecase the next character only
\U	-	Uppercase (not titlecase) till \E
\w	Yes	Match any "word" character (alphanumerics plus _ )
\W	Yes	Match any nonword character
\x{abcd}	Yes	Match the character given in hexadecimal
\z	No	True at end of string only
\Z	No	True at end of string or before optional newline
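A few of the metasymbols in Table A-3 can be seen working together in the following sketch. All strings and variable names are hypothetical; the metasymbols shown are \b, \w, \s, \d, the backreference \1, and the \Q...\E pair.

```perl
use strict;
use warnings;

# \b, \w, \s and the backreference \1 together find a doubled word.
my $text = 'Paris in the the spring';
my $repeated = $text =~ /\b(\w+)\s+\1\b/ ? $1 : '';

# \d+ matches a run of digits.
my $position = 'chr7:12345';
my $offset = $position =~ /:(\d+)/ ? $1 : '';

# \Q...\E turns the dot in an interpolated string into a literal dot.
my $literal  = 'a.c';
my $unquoted = 'abc' =~ /$literal/      ? 'match' : 'no match';
my $quoted   = 'abc' =~ /\Q$literal\E/  ? 'match' : 'no match';

print "$repeated\n";   # prints the
print "$offset\n";     # prints 12345
print "$unquoted\n";   # prints match
print "$quoted\n";     # prints no match
```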

A.16.5 Extending Regular-Expression Sequences

Table A-4 includes several useful features that have been added to Perl's regular-expression capabilities.

Table A-4. Extended regular-expression sequences
Extension	Atomic	Meaning
(?#...)	No	Comment, discard
(?:...)	Yes	Cluster-only parentheses, no capturing
(?imsx-imsx)	No	Enable/disable pattern modifiers
(?imsx-imsx:...)	Yes	Cluster-only parentheses plus modifiers
(?=...)	No	True if lookahead assertion succeeds
(?!...)	No	True if lookahead assertion fails
(?<=...)	No	True if lookbehind assertion succeeds
(?<!...)	No	True if lookbehind assertion fails
(?>...)	Yes	Match nonbacktracking subpattern
(?{...})	No	Execute embedded Perl code
(??{...})	Yes	Match regex from embedded Perl code
(?(...)...|...)	Yes	Match with if-then-else pattern
(?(...)...)	Yes	Match with if-then pattern
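Three of the extensions in Table A-4 are sketched below: cluster-only parentheses, a positive lookahead, and a negative lookahead. The sample strings and variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# (?:...) groups the alternation without capturing, so the two-digit
# year is the first (and only) captured group.
my $date = '2003-09-15';
my $short_year = $date =~ /(?:19|20)([0-9][0-9])-/ ? $1 : '';

# (?=...) requires " USD" to follow but leaves it out of the match.
my $price = 'price: 100 USD';
my $amount = $price =~ /[0-9]+(?= USD)/ ? $& : '';

# (?!...) rejects any "foo" that is immediately followed by "bar".
my $words = 'foobar foobaz';
my $not_bar = $words =~ /foo(?!bar)[a-z]+/ ? $& : '';

print "$short_year\n";   # prints 03
print "$amount\n";       # prints 100
print "$not_bar\n";      # prints foobaz
```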

A.16.6 Pattern Modifiers

Pattern modifiers are single-letter commands placed after the closing forward slash that delimits a regular expression or a substitution; they change the behavior of some regular-expression features. Table A-5 lists the most common pattern modifiers, followed by an example.

Table A-5. Pattern modifiers
Modifier	Meaning
/i	Ignore upper- or lowercase distinctions
/s	Let . match newline
/m	Let ^ and $ match next to embedded \n
/x	Ignore (most) whitespace and permit comments in patterns
/o	Compile pattern once only
/g	Find all matches, not just the first one

As an example, say you were looking for a name in text, but you didn't know if the name had an initial capital letter or was all capitalized. You can use the /i modifier, like so:

$text = "WATSON and CRICK won the Nobel Prize";
$text =~ /Watson/i;
print $&;

This matches (since /i causes upper- and lowercase distinctions to be ignored) and prints out the matched string WATSON.
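The /g, /m, and /x modifiers can be sketched in the same way. The data and variable names here are hypothetical; note that the comments inside the /x pattern avoid the $ and / characters, which Perl would otherwise interpret as part of the pattern.

```perl
use strict;
use warnings;

# /g in list context returns every match, not just the first.
my $dna   = 'ATGCATGCATGG';
my @hits  = $dna =~ /ATG/g;
my $count = scalar @hits;

# /m lets ^ match after each embedded newline; with /g and a group,
# list context returns the captured text of every match.
my $text   = "first line\nsecond line\n";
my @starts = $text =~ /^([a-z]+)/mg;

# /x ignores whitespace in the pattern and permits comments.
my $phone  = '555-1234';
my $prefix = $phone =~ /
    ( [0-9]{3} )   # three-digit prefix, the first captured group
    -              # literal hyphen
    [0-9]{4}       # four-digit line number
/x ? $1 : '';

print "$count\n";     # prints 3
print "@starts\n";    # prints first second
print "$prefix\n";    # prints 555
```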
