Recipe 6.10 Speeding Up Interpolated Matches

6.10.1 Problem

You want your function or program to take one or more regular expressions as arguments, but doing so seems to run slower than using literals.

6.10.2 Solution

To overcome this bottleneck, if you have only one pattern whose value won't change during the entire run of a program, store it in a string and use /$pattern/o:

while ($line = <>) {
    if ($line =~ /$pattern/o) {
        # do something
    }
}

However, that won't work for more than one pattern. Precompile the pattern strings using the qr// operator, then match each result against each of the targets:

@pats = map { qr/$_/ } @strings;
while ($line = <>) {
    for $pat (@pats) {
        if ($line =~ /$pat/) {
            # do something;
        }
    }
}

6.10.3 Discussion

When Perl compiles a program, it converts patterns into an internal form. This conversion occurs at compile time for patterns without variables, but at runtime for those that do. Interpolating variables into patterns, as in /$pattern/, can slow your program down, sometimes substantially. This is particularly noticeable when $pattern changes often.

The /o modifier locks in the values from variables interpolated into the pattern. That is, variables are interpolated only once: the first time the match is run. Because Perl ignores any later changes to those variables, make sure to use it only on unchanging variables.
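
The locking behavior is easy to observe directly. The following is a minimal sketch, not part of the recipe's own listings; the helper name matches_once is hypothetical. It calls the same /o match twice with different pattern strings and shows that the second pattern is silently ignored:

```perl
use strict;
use warnings;

# Hypothetical helper: because of /o, the pattern is compiled on the
# first call only, and later values of $pattern are ignored.
sub matches_once {
    my ($text, $pattern) = @_;
    return $text =~ /$pattern/o ? 1 : 0;
}

my $first  = matches_once("apple pie",    "apple");   # compiles /apple/
my $second = matches_once("banana split", "banana");  # still tests /apple/

print "first=$first second=$second\n";   # prints "first=1 second=0"
```

The second call fails even though "banana" plainly occurs in its target, because the match operator is still using the pattern compiled from "apple".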

Using /o on patterns without interpolated variables doesn't hurt, but it also doesn't help. The /o modifier is also of no help when you have an unknown number of regular expressions and need to check one or more strings against all of these patterns, since you need to vary the patterns' contents. Nor is it of any use when the interpolated variable is a function argument, since each call to the function gives the variable a new value.
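
For the function-argument case, one workable approach is to have callers pass qr// objects rather than pattern strings, so the function never recompiles anything. The sketch below assumes a hypothetical helper, grep_all, that takes a reference to an array of qr// objects followed by the lines to test; neither the name nor the calling convention comes from the recipe.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines that match every supplied pattern.
# Each pattern is expected to be a qr// object compiled by the caller.
sub grep_all {
    my ($patterns, @lines) = @_;
    my @hits;
    LINE: for my $line (@lines) {
        for my $re (@$patterns) {
            next LINE unless $line =~ $re;   # uses the precompiled form
        }
        push @hits, $line;
    }
    return @hits;
}

my @wanted = (qr/\bpop\b/i, qr/\bMI\b/);
my @found  = grep_all(\@wanted,
                      "Pop sells well in MI",
                      "soda in MI",
                      "pop in MINE");

print "$_\n" for @found;   # prints "Pop sells well in MI"
```

Only the first line survives: the second lacks "pop", and in the third "MI" is not followed by a word boundary.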

Example 6-4 shows the slow but straightforward technique for matching many patterns against many lines. The array @popstates contains the standard two-letter abbreviations for some of the places in the heartland of North America where we normally refer to soft drinks as pop (soda to us means either plain soda water or else handmade delicacies from the soda fountain at the corner drugstore, preferably with ice cream). The goal is to print any line of input that contains any of those abbreviations, matching them at word boundaries only. It doesn't use /o, because the variable that holds the pattern keeps changing.

Example 6-4. popgrep1
  #!/usr/bin/perl
  # popgrep1 - grep for abbreviations of places that say "pop"
  # version 1: slow but obvious way
  @popstates = qw(CO ON MI WI MN);
  LINE: while (defined($line = <>)) {
      for $state (@popstates) {
          if ($line =~ /\b$state\b/) {  # this is  s l o o o w
              print $line; next LINE;   # the line is in $line, not $_
          }
      }
  }

Such a direct, obvious, brute-force approach is also distressingly slow, because Perl has to recompile all patterns with each line of input. A better solution is the qr// operator (used in Example 6-5), which first appeared in Perl 5.005 and offers a way to step around this bottleneck. The qr// operator quotes and possibly compiles its string argument, returning a scalar to use in later pattern matches. If that scalar is used by itself in the interpolated match, Perl uses the cached compiled form and so avoids recompiling the pattern.
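
As a small illustration of what qr// returns, the sketch below (variable names are illustrative only) shows that the result is a Regexp object, that it can be matched against directly, and that modifiers such as /i travel with it when it is embedded in a larger pattern:

```perl
use strict;
use warnings;

my $re = qr/\bco\b/i;      # compiled once; /i is stored with the pattern

print ref($re), "\n";      # prints "Regexp"

# Used by itself: Perl can use the precompiled form directly.
my $alone = "Shipped to CO today" =~ $re ? 1 : 0;

# Embedded in a larger pattern: the inner part stays case-insensitive,
# while the surrounding literal text remains case-sensitive.
my $embedded = "to: CO" =~ /^to:\s+$re\z/ ? 1 : 0;
my $outer_cs = "TO: CO" =~ /^to:\s+$re\z/ ? 1 : 0;

print "alone=$alone embedded=$embedded outer_cs=$outer_cs\n";
# prints "alone=1 embedded=1 outer_cs=0"
```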

Example 6-5. popgrep2
  #!/usr/bin/perl
  # popgrep2 - grep for abbreviations of places that say "pop"
  # version 2: fast way using qr//
  @popstates = qw(CO ON MI WI MN);
  @poppats = map { qr/\b$_\b/ } @popstates;
  LINE: while (defined($line = <>)) {
      for $pat (@poppats) {
          if ($line =~ /$pat/) {        # this is fast
              print $line; next LINE;   # the line is in $line, not $_
          }
      }
  }

Print the array @poppats and you'll see strings like this (Perl 5.14 and later show the prefix as (?^: rather than (?-xism:):

(?-xism:\bCO\b)
(?-xism:\bON\b)
(?-xism:\bMI\b)
(?-xism:\bWI\b)
(?-xism:\bMN\b)

Those strings are what a qr// value stringifies to when printed, and what gets interpolated when the value is used to build up a larger pattern. Each value also carries a cached, compiled version of that string as a pattern, and this is what Perl uses when the interpolation into a match or substitution operator contains nothing else.
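
To illustrate building a larger pattern from the stringified forms, the following sketch joins the individual qr// objects into one alternation and compiles the result once. It is one possible variation, not the recipe's own method, and the names $alternation and $any_state are illustrative.

```perl
use strict;
use warnings;

my @popstates = qw(CO ON MI WI MN);
my @poppats   = map { qr/\b$_\b/ } @popstates;

# join() uses each object's stringified form; the combined alternation
# is then compiled a single time as a new pattern.
my $alternation = join '|', @poppats;
my $any_state   = qr/$alternation/;

my @lines = ("Born in MI", "A COLD MINE", "ON the road");
my @hits  = grep { $_ =~ $any_state } @lines;

print "$_\n" for @hits;   # prints "Born in MI" and "ON the road"
```

With a single combined pattern, the inner loop over @poppats is no longer needed; each input line is tested once.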

6.10.4 See Also

The qr// operator in perlop(1) and in the section on "The qr// quote regex operator" in Chapter 5 of Programming Perl.