Regular expressions are, in effect, an extra language that lives inside the Perl language. In Perl, they have quite a lot of features. First, I'll summarize how regular expressions work in Perl; then, I'll present some of their many features.
Regular expressions describe patterns in strings. The pattern described by a single regular expression may match many different strings.
Regular expressions are used in pattern matching, that is, when you look to see if a certain pattern exists in a string. They can also change strings, as with the s/// operator, which replaces the pattern, if found, with a replacement string. The tr operator, which transliterates several characters into replacement characters throughout a string, is usually discussed alongside them, although it works on lists of characters rather than on true regular expressions. Regular expressions are case-sensitive unless explicitly told otherwise.
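As a rough illustration of substitution, transliteration, and case sensitivity, here is a minimal sketch. The sample string $dna and the variable names are assumptions made for this example only; the /g and /i modifiers it uses are described in the modifier table later in this section.

```perl
use strict;
use warnings;

# Hypothetical sample string, chosen only for illustration.
my $dna = 'ATGCATGC';

# s/// replaces the first occurrence of the pattern with the replacement.
(my $first_only = $dna) =~ s/ATG/atg/;       # atgCATGC

# With the /g modifier, every occurrence is replaced.
(my $all = $dna) =~ s/ATG/atg/g;             # atgCatgC

# tr/// maps each listed character to its counterpart in the second list.
(my $complement = $dna) =~ tr/ACGT/TGCA/;    # TACGTACG

# Matching is case-sensitive unless the /i modifier is given.
my $strict_match  = ($dna =~ /atg/)  ? 1 : 0;   # 0
my $relaxed_match = ($dna =~ /atg/i) ? 1 : 0;   # 1

print "$first_only $all $complement $strict_match $relaxed_match\n";
```

Copying the string first, as in (my $copy = $dna) =~ s/.../.../, leaves the original untouched, which is often convenient when both versions are needed.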
The simplest pattern match is a string that matches itself. For instance, to see if the pattern 'abc' appears in the string 'abcdefghijklmnopqrstuvwxyz', write the following in Perl:
$alphabet = 'abcdefghijklmnopqrstuvwxyz'; if( $alphabet =~ /abc/ ) { print $&; }
The =~ operator binds a pattern match to a string. /abc/ is the pattern abc, enclosed in forward slashes to indicate that it's a regular-expression pattern. $& is set to the substring that matched, if any. In this case, the match succeeds, since 'abc' appears in the string $alphabet, and the code just given prints out abc.
Regular expressions are made from two kinds of characters. Many characters, such as 'a' or 'Z', match themselves. Metacharacters have a special meaning in the regular-expression language. For instance, parentheses are used to group other characters and don't match themselves. If you want to match a metacharacter such as ( in a string, you have to precede it with the backslash metacharacter, writing \( in the pattern.
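The difference between a metacharacter and its escaped form can be sketched as follows. The string $text and the variable names are made up for this example; quotemeta is a Perl built-in that backslash-escapes the non-word characters in a string.

```perl
use strict;
use warnings;

# Hypothetical text containing literal parentheses.
my $text = 'f(x) = 2x';

# A backslash makes ( and ) match literal parentheses.
my $literal = ($text =~ /f\(x\)/) ? $& : '';     # f(x)

# Unescaped, the parentheses only group, so this pattern looks for "fx".
my $grouped = ($text =~ /f(x)/) ? 1 : 0;         # 0, since "fx" never appears

# quotemeta escapes the metacharacters in an arbitrary string.
my $needle  = '(x)';
my $escaped = quotemeta $needle;                 # \(x\)
my $found   = ($text =~ /$escaped/) ? 1 : 0;     # 1

print "$literal $grouped $escaped $found\n";
```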
There are three basic ideas behind regular expressions. The first is concatenation: two items next to each other in a regular-expression pattern (that's the string between the forward slashes in the examples) must match two items next to each other in the string being matched (the $alphabet in the examples). So, to match 'abc' followed by 'def', concatenate them in the regular expression:
$alphabet = 'abcdefghijklmnopqrstuvwxyz'; if( $alphabet =~ /abcdef/ ) { print $&; }
This prints:
abcdef
The second major idea is alternation. Items separated by the | metacharacter match any one of the items. For example, the following:
$alphabet = 'abcdefghijklmnopqrstuvwxyz'; if( $alphabet =~ /a(b|c|d)c/ ) { print $&; }
prints:
abc
The example also shows how parentheses group things in a regular expression. The parentheses are metacharacters that aren't matched in the string; rather, they group the alternation, given as b|c|d, meaning any one of b, c, or d at that position in the pattern. Since b is actually in $alphabet at that position, the alternation, and indeed the entire pattern a(b|c|d)c, matches in the $alphabet. (One additional point: ab|cd means (ab)|(cd), not a(b|c)d.)
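The precedence point in the parenthetical can be checked with a small sketch. The test strings 'acd' and 'ab' are chosen only for illustration.

```perl
use strict;
use warnings;

# ab|cd means "ab" or "cd"; a(b|c)d means "a", then "b" or "c", then "d".
my $m1 = ('acd' =~ /ab|cd/)   ? $& : '';   # cd  (the "cd" alternative matches)
my $m2 = ('acd' =~ /a(b|c)d/) ? $& : '';   # acd (the whole string matches)
my $m3 = ('ab'  =~ /ab|cd/)   ? $& : '';   # ab  (the "ab" alternative matches)
my $m4 = ('ab'  =~ /a(b|c)d/) ? $& : '';   # empty: there is no trailing "d"

print "[$m1] [$m2] [$m3] [$m4]\n";
```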
The third major idea of regular expressions is repetition (or closure). This is indicated in a pattern with the quantifier metacharacter *, sometimes called the Kleene star after one of the inventors of regular expressions. When * appears after an item, it means that the item may appear 0, 1, or any number of times at that place in the string. So, for example, all of the following pattern matches will succeed:
'AC' =~ /AB*C/; 'ABC' =~ /AB*C/; 'ABBBBBBBBBBBC' =~ /AB*C/;
The following are metacharacters:
\ | ( ) [ { ^ $ * + ? .
A backslash \ before a metacharacter causes it to match itself; for instance, \\ matches a single \ in the string.
The pipe | indicates alternation, as described previously.
The parentheses ( ) provide grouping, as described previously.
Square brackets [ ] specify a character class. A character class matches one character, which can be any character specified. For instance, [abc] matches either a, or b, or c at that position (so it's the same as a|b|c). Within a class, A-Z is a range that matches any uppercase letter, a-z matches any lowercase letter, and 0-9 matches any digit. For instance, [A-Za-z0-9] matches any single letter or digit at that position. If the first character in a character class is ^, any character except those specified matches; for instance, [^0-9] matches any character that isn't a digit.
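A short sketch of character classes, ranges, and negated classes follows. The identifier string $id is a made-up example.

```perl
use strict;
use warnings;

# Hypothetical identifier mixing letters, an underscore, and digits.
my $id = 'Sample_42b';

my $first_digit = ($id =~ /[0-9]/)        ? $& : '';   # 4
my $first_upper = ($id =~ /[A-Z]/)        ? $& : '';   # S
my $first_other = ($id =~ /[^A-Za-z0-9]/) ? $& : '';   # _ (not a letter or digit)
my $lower_run   = ($id =~ /[a-z]+/)       ? $& : '';   # ample

# A class can also be used in a substitution, here to strip all digits.
(my $no_digits = $id) =~ s/[0-9]//g;                   # Sample_b

print "$first_digit $first_upper $first_other $lower_run $no_digits\n";
```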
The period or dot . represents any character except a newline. (The pattern modifier /s makes it also match a newline.) So, . is like a character class that specifies every character except the newline.
The ^ metacharacter doesn't match a character; rather, it asserts that the item that follows must be at the beginning of the string. Similarly, the $ metacharacter doesn't match a character but asserts that the item that precedes it must be at the end of the string (or before the final newline). For example: /^Watson and Crick/ matches if the string starts with Watson and Crick; and /Watson and Crick$/ matches if the string ends with Watson and Crick or Watson and Crick\n.
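The anchors can be tried out with a sketch like the following, which reuses the Watson and Crick string from the text; the variable names are illustrative.

```perl
use strict;
use warnings;

# A line that ends with a newline, as lines read from a file usually do.
my $line = "Watson and Crick\n";

my $starts = ($line =~ /^Watson/) ? 1 : 0;             # 1
my $ends   = ($line =~ /Crick$/)  ? 1 : 0;             # 1, $ allows a final newline
my $middle = ($line =~ /^and/)    ? 1 : 0;             # 0, "and" is not at the start
my $whole  = ($line =~ /^Watson and Crick$/) ? 1 : 0;  # 1

print "$starts $ends $middle $whole\n";
```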
The quantifier metacharacters *, +, ?, and { } indicate the repetition of an item. The * metacharacter indicates zero, one, or more of the preceding item. The + metacharacter indicates one or more of the preceding item. The brace { } metacharacters let you specify an exact number of the preceding item, or a range. For instance, {3} means exactly three of the preceding item; {3,7} means three, four, five, six, or seven of the preceding item; and {3,} means three or more of the preceding item. The ? matches none or one of the preceding item.
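To compare the quantifiers side by side, the sketch below filters one list of test strings through several patterns. The list @strings is invented for the example, and the patterns are anchored with ^ and $ so that each must account for the whole string.

```perl
use strict;
use warnings;

# Hypothetical test strings: A, then zero to four Bs, then C.
my @strings = ('AC', 'ABC', 'ABBC', 'ABBBC', 'ABBBBC');

my @star     = grep { /^AB*C$/ }     @strings;   # all five strings
my @plus     = grep { /^AB+C$/ }     @strings;   # ABC ABBC ABBBC ABBBBC
my @optional = grep { /^AB?C$/ }     @strings;   # AC ABC
my @exactly  = grep { /^AB{3}C$/ }   @strings;   # ABBBC
my @range    = grep { /^AB{2,3}C$/ } @strings;   # ABBC ABBBC
my @at_least = grep { /^AB{3,}C$/ }  @strings;   # ABBBC ABBBBC

print "star:     @star\n";
print "plus:     @plus\n";
print "optional: @optional\n";
print "exactly:  @exactly\n";
print "range:    @range\n";
print "at least: @at_least\n";
```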
The quantifiers just shown are greedy (or maximal) by default, meaning that they match as many items as possible. Sometimes, you want a minimal match that will match as few items as possible. You get that by following each of * + {} ? with a ?. So, for instance, *? tries to match as few as possible, perhaps even none, of the preceding item before it tries to match one or more of the preceding item. Here's a maximal match:
'hear ye hear ye hear ye' =~ /hear.*ye/; print $&;
This matches 'hear' followed by .* (as many characters as possible), followed by 'ye', and prints:
hear ye hear ye hear ye
Here is a minimal match:
'hear ye hear ye hear ye' =~ /hear.*?ye/; print $&;
This matches 'hear' followed by .*? (the fewest number of characters possible), followed by 'ye', and prints:
hear ye
You can place parentheses around parts of the pattern for which you want to know the matched string. Take, for example, the following:
$alphabet = 'abcdefghijklmnopqrstuvwxyz'; $alphabet =~ /k(lmnop)q/; print $1;
This prints:
lmnop
You can place as many pairs of parentheses in a regular expression as you like; Perl automatically stores their matched substrings in special variables named $1, $2, and so on. The matches are numbered in order of the left-to-right appearance of their opening parenthesis.
Here's a more intricate example of capturing parts of a matched pattern in a string:
$alphabet = 'abcdefghijklmnopqrstuvwxyz';
$alphabet =~ /(((a)b)c)/;
print "First pattern = ", $1,"\n";
print "Second pattern = ", $2,"\n";
print "Third pattern = ", $3,"\n";
This prints:
First pattern = abc
Second pattern = ab
Third pattern = a
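Captured substrings can also be collected directly, because a match in list context returns the captures. The sketch below assumes a made-up record format ($record and its field names are hypothetical).

```perl
use strict;
use warnings;

# Hypothetical record; the field layout is invented for this example.
my $record = 'gene=BRCA1 start=100 end=250';

# In list context, a successful match returns the captured substrings
# in the same order as $1, $2, and $3.
my ($name, $start, $end) =
    $record =~ /gene=([A-Za-z0-9]+) start=([0-9]+) end=([0-9]+)/;

my $length = $end - $start;                    # 150

# A failed match returns an empty list.
my @none = 'no numbers here' =~ /([0-9]+)/;

print "$name spans $length positions\n";
print "captures from failed match: ", scalar(@none), "\n";
```

Assigning the captures to named variables right away avoids relying on $1, $2, and so on, which are overwritten by the next successful match.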
Metasymbols are sequences of two or more characters consisting of backslashes before normal characters. These metasymbols have special meanings in Perl regular expressions (and in double-quoted strings for most of them). There are quite a few of them, but that's because they're so useful. Table A-3 lists most of these metasymbols. The column "Atomic" indicates Yes if the metasymbol matches an item, No if the metasymbol just makes an assertion, and - if it takes some other action.
Symbol | Atomic | Meaning |
---|---|---|
\0 | Yes | Match the null character (ASCII NUL) |
\NNN | Yes | Match the character given in octal, up to 377 |
\n (n a number, as in \1) | Yes | Match nth previously captured string (decimal) |
\a | Yes | Match the alarm character (BEL) |
\A | No | True at the beginning of a string |
\b (inside a character class) | Yes | Match the backspace character (BS) |
\b | No | True at word boundary |
\B | No | True when not at word boundary |
\cX | Yes | Match the control character Control-X |
\d | Yes | Match any digit character |
\D | Yes | Match any nondigit character |
\e | Yes | Match the escape character (ASCII ESC, not backslash) |
\E | - | End case (\L, \U) or metaquote (\Q) translation |
\f | Yes | Match the formfeed character (FF) |
\G | No | True at end-of-match position of prior m//g |
\l | - | Lowercase the next character only |
\L | - | Lowercase till \E |
\n | Yes | Match the newline character (usually NL, but CR on Macs) |
\Q | - | Quote (de-meta) metacharacters till \E |
\r | Yes | Match the return character (usually CR, but NL on Macs) |
\s | Yes | Match any whitespace character |
\S | Yes | Match any nonwhitespace character |
\t | Yes | Match the tab character (HT) |
\u | - | Titlecase the next character only |
\U | - | Uppercase (not titlecase) till \E |
\w | Yes | Match any "word" character (alphanumerics plus _ ) |
\W | Yes | Match any nonword character |
\x{abcd} | Yes | Match the character given in hexadecimal |
\z | No | True at end of string only |
\Z | No | True at end of string or before optional newline |
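A few of these metasymbols are sketched below: the atomic \d and \w, the assertion \b, and the translation escapes \Q, \E, \u, and \L. All strings and variable names are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical sample text.
my $text = 'cat catalog 42';

my $digits = ($text =~ /\d+/) ? $& : '';    # 42  (first run of digits)
my $word   = ($text =~ /\w+/) ? $& : '';    # cat (first run of word characters)

# \b asserts a word boundary, so "cat" inside "catalog" does not count.
my $in_catalog  = ('catalog'   =~ /\bcat\b/) ? 1 : 0;   # 0
my $in_sentence = ('a cat sat' =~ /\bcat\b/) ? 1 : 0;   # 1

# \Q...\E quotes metacharacters in an interpolated variable.
my $price   = 'cost: $5.00 (approx)';
my $literal = '$5.00 (approx)';
my $quoted  = ($price =~ /\Q$literal\E/) ? 1 : 0;       # 1

# \u and \L also work in double-quoted strings.
my $name  = 'wATSON';
my $title = "\u\L$name\E";                              # Watson

print "$digits $word $in_catalog $in_sentence $quoted $title\n";
```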
Table A-4 includes several useful features that have been added to Perl's regular-expression capabilities.
Extension | Atomic | Meaning |
---|---|---|
(?#...) | No | Comment, discard |
(?:...) | Yes | Cluster-only parentheses, no capturing |
(?imsx-imsx) | No | Enable/disable pattern modifiers |
(?imsx-imsx:...) | Yes | Cluster-only parentheses plus modifiers |
(?=...) | No | True if lookahead assertion succeeds |
(?!...) | No | True if lookahead assertion fails |
(?<=...) | No | True if lookbehind assertion succeeds |
(?<!...) | No | True if lookbehind assertion fails |
(?>...) | Yes | Match nonbacktracking subpattern |
(?{...}) | No | Execute embedded Perl code |
(??{...}) | Yes | Match regex from embedded Perl code |
(?(...)...\|...) | Yes | Match with if-then-else pattern |
(?(...)...) | Yes | Match with if-then pattern |
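Some of the more commonly used extensions are sketched below: cluster-only parentheses, lookahead, lookbehind, and an inline modifier. The sample strings are invented for the example.

```perl
use strict;
use warnings;

# (?:...) groups without capturing, so the first capture is the suffix.
my ($suffix) = 'abcabcXYZ' =~ /(?:abc)+([A-Z]+)/;       # XYZ

# Hypothetical price list.
my $prices = 'apple 30 USD, pear 45 EUR';

# (?=...) requires " USD" to follow but does not include it in the match.
my ($usd) = $prices =~ /(\d+)(?= USD)/;                 # 30

# (?!...) requires that " USD" does not follow. The \b keeps the engine
# from settling for part of a number, such as the "3" in "30".
my ($other) = $prices =~ /(\d+)\b(?! USD)/;             # 45

# (?<=...) requires "id:" to precede the digits.
my ($id) = 'id:123' =~ /(?<=id:)(\d+)/;                 # 123

# (?i) switches on case-insensitive matching inside the pattern.
my $inline_i = ('WATSON' =~ /(?i)watson/) ? 1 : 0;      # 1

print "$suffix $usd $other $id $inline_i\n";
```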
Pattern modifiers are single-letter commands placed after the final forward slash that delimits a regular expression or a substitution. They change the behavior of some regular-expression features. Table A-5 lists the most common pattern modifiers, followed by an example.
Modifier | Meaning |
---|---|
/i | Ignore upper- or lowercase distinctions |
/s | Let . match newline |
/m | Let ^ and $ match next to embedded \n |
/x | Ignore (most) whitespace and permit comments in patterns |
/o | Compile pattern once only |
/g | Find all matches, not just the first one |
As an example, say you were looking for a name in text, but you didn't know if the name had an initial capital letter or was all capitalized. You can use the /i modifier, like so:
$text = "WATSON and CRICK won the Nobel Prize"; $text =~ /Watson/i; print $&;
This matches (since /i causes upper- and lowercase distinctions to be ignored) and prints out the matched string WATSON.
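The other modifiers can be explored the same way. The following sketch uses a made-up two-line string and shows /g, /m, /s, and /x, alone and combined with /i.

```perl
use strict;
use warnings;

# Hypothetical two-line text.
my $text = "Watson and Crick\nwatson and crick\n";

# /g in list context returns every match; /i ignores case.
my @all = $text =~ /watson/gi;              # Watson, watson

# With /m, ^ also matches just after an embedded newline.
my @with_m    = $text =~ /^(\w+)/mg;        # Watson, watson
my @without_m = $text =~ /^(\w+)/g;         # Watson only

# Without /s the dot cannot match the newline between the two lines.
my $dot_plain = ($text =~ /Crick.watson/)  ? 1 : 0;     # 0
my $dot_s     = ($text =~ /Crick.watson/s) ? 1 : 0;     # 1

# /x ignores whitespace in the pattern and permits comments.
my $verbose = ($text =~ /
    Watson \s+    # first name, then whitespace
    and    \s+    # the word "and", then whitespace
    Crick         # second name
/x) ? 1 : 0;                                            # 1

print "@all | @with_m | @without_m | $dot_plain $dot_s $verbose\n";
```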