Recipe 6.6 Matching Within Multiple Lines

6.6.1 Problem

You want to use regular expressions on a string containing more than one logical line, but the special characters . (any character but newline), ^ (start of string), and $ (end of string) don't seem to work for you. This might happen if you're reading in multiline records or the whole file at once.

6.6.2 Solution

Use /m, /s, or both as pattern modifiers. /s allows . to match a newline (normally it doesn't). If the target string has more than one line in it, /foo.*bar/s could match a "foo" on one line and a "bar" on a following line. This doesn't affect dots in character classes like [#%.], since they are literal periods anyway.
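
To see the effect of /s in isolation, here is a minimal sketch. The two-line record in $record is made up for illustration; the only difference between the two matches is the presence of /s.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-line record, invented for this illustration.
my $record = "foo one\ntwo bar\n";

# Without /s, dot stops at the newline, so "foo" and "bar" can't be joined.
my $without_s = $record =~ /foo.*bar/  ? "match" : "no match";

# With /s, dot matches the newline too, so the pattern spans both lines.
my $with_s    = $record =~ /foo.*bar/s ? "match" : "no match";

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
```

Only the second pattern should succeed, because "foo" and "bar" sit on different lines of the record.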

The /m modifier allows ^ and $ to match immediately after and before an embedded newline, respectively. /^=head[1-7]/m would match that pattern not just at the beginning of the record, but anywhere right after a newline as well.
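
The following sketch counts matches with and without /m. The POD-like text in $record is a hypothetical sample, not taken from any real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical POD-like record; the text is invented for the demo.
my $record = "Some intro text\n=head1 NAME\n=head2 SYNOPSIS\n";

# Without /m, ^ matches only at the very start of the string.
my @plain = $record =~ /^=head[1-7]/g;

# With /m, ^ also matches right after each embedded newline.
my @multi = $record =~ /^=head[1-7]/mg;

# With /m, $ matches right before each newline, so we get the
# last word of every line rather than just of the record.
my @ends = $record =~ /(\w+)$/mg;

printf "without /m: %d match(es)\n", scalar @plain;
printf "with /m:    %d match(es)\n", scalar @multi;
print  "line-final words: @ends\n";
```

Because the record starts with ordinary text, the pattern without /m finds no headings, while the /m version finds both.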

6.6.3 Discussion

A common, brute-force approach to parsing documents where newlines are not significant is to read the file one paragraph at a time (or sometimes even the entire file as one string) and then extract tokens one by one. If the pattern involves dot, such as .+ or .*?, and must match across newlines, you need to do something special to make dot match a newline; ordinarily, it does not. When you've read more than one line into a string, you'll probably prefer to have ^ and $ match beginning- and end-of-line, not just beginning- and end-of-string.

The difference between /m and /s is important: /m allows ^ and $ to match next to an embedded newline, whereas /s allows . to match newlines. You can even use them together; they're not mutually exclusive.

Example 6-2 creates a simplistic filter to strip HTML tags out of each file in @ARGV and then send those results to STDOUT. First we undefine the record separator so each read operation fetches one entire file. (There could be more than one file, because @ARGV could have several arguments in it. If so, each readline would fetch the entire contents of one file.) Then we strip out instances of beginning and ending angle brackets, plus anything in between them. We can't use just .* for two reasons: first, it would match closing angle brackets, and second, the dot wouldn't cross newline boundaries. Using .*? in conjunction with /s solves these problems.

Example 6-2. killtags

  #!/usr/bin/perl
  # killtags - very bad html tag killer
  undef $/;           # each read is whole file
  while (<>) {        # get one whole file at a time
      s/<.*?>//gs;    # strip tags (terribly)
      print;          # print file to STDOUT
  }

Because the closing delimiter is just a single character, it would be much faster to use s/<[^>]*>//gs, but that's still a naïve approach: it doesn't correctly handle tags inside HTML comments or angle brackets in quotes (<IMG SRC="here.gif" ALT="<<Ooh la la!>>">). Recipe 20.6 explains how to avoid these problems.
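
A short sketch makes both points concrete. The input strings are hypothetical: $simple holds a tag that spans two lines, and $tricky is the quoted-bracket case mentioned above with some trailing text added.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical inputs for illustration only.
my $simple = "<b\nclass=\"x\">bold</b> text";
my $tricky = '<IMG SRC="here.gif" ALT="<<Ooh la la!>>">after';

# Both approaches handle a tag that contains a newline:
# .*? needs /s for that, while [^>] matches a newline on its own.
(my $lazy    = $simple) =~ s/<.*?>//gs;
(my $negated = $simple) =~ s/<[^>]*>//g;

# Neither understands quoting, so the first > inside the ALT
# attribute ends the "tag" and debris is left behind.
(my $leftover = $tricky) =~ s/<[^>]*>//g;

print "lazy:     $lazy\n";
print "negated:  $negated\n";
print "leftover: $leftover\n";
```

The first two results are identical, but the third still contains stray characters from the attribute value, which is why a real parser is the safer choice for arbitrary HTML.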

Example 6-3 takes a plain text document and looks for lines at the start of paragraphs that look like "Chapter 20: Better Living Through Chemisery". It wraps these with an appropriate HTML level-one header. Because the pattern is relatively complex, we use the /x modifier so we can embed whitespace and comments.

Example 6-3. headerfy

  #!/usr/bin/perl
  # headerfy: change certain chapter headers to html
  $/ = '';
  while (<>) {               # fetch a paragraph
      s{
          \A                  # start of record
          (                   # capture in $1
              Chapter         # text string
              \s+             # mandatory whitespace
              \d+             # decimal number
              \s*             # optional whitespace
              :               # a real colon
              .*              # anything not a newline till end of line
          )
      }{<H1>$1</H1>}gx;
      print;
  }

Here it is as a one-liner from the command line for those of you for whom the extended comments just get in the way of understanding:

% perl -00pe 's{\A(Chapter\s+\d+\s*:.*)}{<H1>$1</H1>}gx' datafile

This problem is interesting because we need to be able to specify start-of-record and end-of-line in the same pattern. We could normally use ^ for start-of-record and $ for end-of-record, but to anchor at end-of-line with $ we would have to add the /m modifier, which changes the meaning of both ^ and $. So instead of using ^ to match beginning-of-record, we use \A, which means beginning-of-record whether or not /m is in effect. The pattern as written doesn't need /m at all: without /s, dot can't match a newline, so .* stops at the end of the first line by itself. We're not using it here, but in case you're interested, the version of $ that always matches end-of-record with an optional newline, even in the presence of /m, is \Z. To match the real end without the optional newline, use \z.
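
To compare these anchors side by side, here is a small sketch that applies each one under /m. The two-line paragraph in $para is a made-up sample.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-line paragraph ending in a newline.
my $para = "Chapter 1: Intro\nChapter 2: More\n";

# Start anchors under /m: ^ matches on every line, \A only at the start.
my @caret = $para =~ /^(Chapter \d+)/mg;
my @big_a = $para =~ /\A(Chapter \d+)/mg;

# End anchors under /m: $ matches on every line, \Z only at the end
# (allowing a final newline), and \z only at the very end.
my @dollar  = $para =~ /(\w+)$/mg;
my @big_z   = $para =~ /(\w+)\Z/mg;
my @small_z = $para =~ /(\w+)\z/mg;

print "^  : @caret\n";
print "\\A : @big_a\n";
print "\$  : @dollar\n";
print "\\Z : @big_z\n";
printf "\\z : %d match(es)\n", scalar @small_z;
```

The \z pattern finds nothing here because the last word is followed by a newline, so it does not sit at the true end of the string.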

The following example demonstrates using /s and /m together: we want ^ to match the beginning of any line in the paragraph, and we also want dot to match a newline. The predefined variable $. represents the record number of the filehandle most recently read from using readline(FH) or <FH>. The predefined variable $ARGV is the name of the file that's automatically opened by implicit <ARGV> processing.

$/ = '';            # paragraph read mode
while (<ARGV>) {
    while (/^START(.*?)^END/gsm) {  # /g advances past each match, so the loop ends
                                    # /s makes . span line boundaries
                                    # /m makes ^ match right after newlines
        print "chunk $. in $ARGV has <<$1>>\n";
    }
}

If you're already committed to the /m modifier, use \A and \Z for the old meanings of ^ and $, respectively. But what if you've used the /s modifier and want the original meaning of dot? You use [^\n].
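
As a brief illustration of that last point, this sketch captures with a plain dot and with [^\n], both under /s. The text in $text is a hypothetical sample.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-line string for the demo.
my $text = "first line\nsecond line\n";

# Under /s, a greedy dot swallows everything, newlines included.
my ($with_dot) = $text =~ /(.*)/s;

# [^\n] keeps the pre-/s meaning of dot even though /s is in effect.
my ($with_class) = $text =~ /([^\n]*)/s;

printf "dot under /s captured %d characters\n", length $with_dot;
print  "[^\\n] under /s captured '$with_class'\n";
```

The first capture is the whole string, while the second stops at the end of the first line.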

Finally, although $ and \Z can match one before the end of a string if that last character is a newline, \z matches only at the very end of the string. We can use lookaheads to define the other two as shortcuts involving \z:

Anchor            Equivalent using \z
$ without /m      (?=\n?\z)
$ with /m         (?=\n)|\z
\Z always         (?=\n?\z)
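
One way to check equivalences of this kind is to compare the offsets at which each anchor matches. The sketch below assumes the lookahead forms (?=\n?\z) for $ without /m and for \Z, and (?=\n)|\z for $ with /m. The helper positions() and the sample string are hypothetical names introduced only for this demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the offsets (space-separated) at which
# a zero-width pattern matches in a string.
sub positions {
    my ($re, $str) = @_;
    my @at;
    while ($str =~ /$re/g) {
        push @at, pos($str);
    }
    return "@at";
}

# Hypothetical sample: two lines with a trailing newline.
my $s = "ab\ncd\n";

my @pairs = (
    [ '$ without /m', qr/$/,  qr/(?=\n?\z)/ ],
    [ '$ with /m',    qr/$/m, qr/(?=\n)|\z/ ],
    [ '\Z',           qr/\Z/, qr/(?=\n?\z)/ ],
);

for my $pair (@pairs) {
    my ($name, $anchor, $lookahead) = @$pair;
    printf "%-14s anchor at [%s], lookahead at [%s]\n",
        $name, positions($anchor, $s), positions($lookahead, $s);
}
printf "%-14s anchor at [%s]\n", '\z', positions(qr/\z/, $s);
```

On each line the two offset lists should agree. Only \z is restricted to the single offset at the very end of the string.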

6.6.4 See Also

The $/ variable in perlvar(1) and in the "Per-Filehandle Variables" section of Chapter 28 of Programming Perl; the /s and /m modifiers in perlre(1) and "The Fine Print" section of Chapter 2 of Programming Perl; the "Anchors and Other Zero-Width Assertions" section in Chapter 3 of Mastering Regular Expressions; Chapter 8, where we talk more about the special variable $/.