Recipe 6.15 Greedy and Non-Greedy Matches

6.15.1 Problem

You have a pattern with a greedy quantifier like *, +, ?, or { }, and you want to stop it from being greedy.

A classic example is the naïve substitution to remove tags from HTML. Although it looks appealing, s#<TT>.*</TT>##gsi deletes everything from the first open TT tag through the last closing one. This would turn "Even <TT>vi</TT> can edit <TT>troff</TT> effectively." into "Even  effectively.", completely changing the meaning of the sentence!

6.15.2 Solution

Replace the offending greedy quantifier with the corresponding non-greedy version. That is, change *, +, ?, and { } into *?, +?, ??, and { }?, respectively.

6.15.3 Discussion

Perl has two sets of quantifiers: the maximal ones (*, +, ?, and { }) and the minimal ones (*?, +?, ??, and { }?). Less formally, these two sorts of quantifiers are often referred to as greedy and non-greedy (or sometimes lazy), respectively. For instance, given the string "Perl is a Swiss Army Chainsaw!", the pattern /(r.*s)/ matches "rl is a Swiss Army Chains", whereas /(r.*?s)/ matches "rl is".
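The Swiss Army Chainsaw example can be run as a short script. This is only a sketch; the variable names ($text, $greedy, $lazy) are chosen here for illustration.

```perl
use strict;
use warnings;

my $text = "Perl is a Swiss Army Chainsaw!";

# In list context, a match returns its captures.
my ($greedy) = $text =~ /(r.*s)/;    # maximal: runs through the last "s"
my ($lazy)   = $text =~ /(r.*?s)/;   # minimal: stops at the first "s"

print "greedy: $greedy\n";   # greedy: rl is a Swiss Army Chains
print "lazy:   $lazy\n";     # lazy:   rl is
```

Both patterns start matching at the same "r"; only the point where the match ends differs.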

With maximal quantifiers, when you ask to match a variable number of times, such as zero or more times for * or one or more times for +, the matching engine prefers the "or more" portion of that description. Thus /foo.*bar/ matches the first "foo" through the last "bar" in the string, rather than only through the next "bar" as some might expect. That's because the greedy .* first expands to the rest of the string, but since that wouldn't leave any characters for "bar" to match, the engine backs up one character at a time until it finds "bar".

To make any repetition operator match minimally instead of maximally, add an extra ?. So *? still matches zero or more times, but rather than match as much as it can, the way * would, it matches as little as it can.
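To see each maximal quantifier next to its minimal counterpart, the following sketch applies all eight to the same string. The test string "aaaa", the {2,3} bounds, and the %got hash are illustrative choices, not part of the recipe.

```perl
use strict;
use warnings;

my $text = "aaaa";

# Each entry: a label, then a pattern anchored at the start of the string.
my @cases = (
    [ 'a*',       qr/^(a*)/       ],
    [ 'a*?',      qr/^(a*?)/      ],
    [ 'a+',       qr/^(a+)/       ],
    [ 'a+?',      qr/^(a+?)/      ],
    [ 'a?',       qr/^(a?)/       ],
    [ 'a??',      qr/^(a??)/      ],
    [ 'a{2,3}',   qr/^(a{2,3})/   ],
    [ 'a{2,3}?',  qr/^(a{2,3}?)/  ],
);

my %got;
for my $case (@cases) {
    my ($label, $re) = @$case;
    ($got{$label}) = $text =~ $re;      # capture what the quantifier consumed
    printf "%-9s captured '%s'\n", $label, $got{$label};
}
```

The maximal forms capture "aaaa", "aaaa", "a", and "aaa"; the minimal forms capture "", "a", "", and "aa". Nothing follows the quantifier in these patterns, so each minimal form settles for the fewest repetitions it is allowed.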

# greedy pattern
s/<.*>//gs;              # try to remove tags, very badly

# nongreedy pattern
s/<.*?>//gs;             # try to remove tags, better (but still rather badly)

This approach doesn't remove tags from all possible HTML correctly, because a single regular expression is seldom an acceptable replacement for a real parser. See Recipe 20.6 for the right way.
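A quick way to compare the two substitutions is to run both on the sentence from the Problem section, then try the non-greedy one on input it handles badly. The $tricky string, with a ">" inside an attribute value, is a made-up example for illustration.

```perl
use strict;
use warnings;

my $html = "Even <TT>vi</TT> can edit <TT>troff</TT> effectively.";

# Copy first, then substitute on the copy, so $html stays untouched.
(my $greedy = $html) =~ s/<.*>//gs;     # first "<" through the last ">"
(my $lazy   = $html) =~ s/<.*?>//gs;    # each "<" through the next ">"

print "greedy: $greedy\n";   # greedy: Even  effectively.
print "lazy:   $lazy\n";     # lazy:   Even vi can edit troff effectively.

# A ">" inside an attribute value ends the minimal match too early.
my $tricky = 'See <img alt="a > b"> here';
(my $mangled = $tricky) =~ s/<.*?>//gs;
print "tricky: $mangled\n";  # tricky: See  b"> here
```

The last result shows why the non-greedy version is still only "better": it has no notion of quoting, comments, or any other HTML syntax.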

Minimal matching isn't all it's cracked up to be. Don't fall into the trap of thinking that including the partial pattern BEGIN.*?END in a pattern amidst other elements will always match the shortest amount of text between occurrences of BEGIN and END. Consider the pattern /BEGIN(.*?)END/. If matched against the string "BEGIN and BEGIN and END", $1 would contain " and BEGIN and " (surrounding spaces included), because the match starts at the first BEGIN and the minimal quantifier only controls where it ends. This is probably not what you want.
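The behavior is easy to confirm. In this sketch the capture is printed inside square brackets so the surrounding spaces are visible; the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "BEGIN and BEGIN and END";

# The engine commits to the leftmost BEGIN, then .*? grows only as far
# as needed to reach an END.
my ($between) = $text =~ /BEGIN(.*?)END/;

print "[$between]\n";   # [ and BEGIN and ]
```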

Imagine trying to pull out everything between bold-italic pairs:

<b><i>this</i> and <i>that</i> are important</b> Oh, <b><i>me too!</i></b>

A pattern to find only text between bold-italic HTML pairs, that is, text that doesn't include them, might appear to be:

m{ <b><i>(.*?)</i></b> }sx

You might be surprised to learn that the pattern doesn't do that. Many people incorrectly understand this as matching a "<b><i>" sequence, then anything up to a "</i></b>" sequence, leaving the intervening text in $1. While it often works out that way due to the input data, that's not what it says. There's nothing in that pattern that says .*? can't match "<b><i>" again (and again and again) before it comes to "</i></b>". If the intention were to extract only stuff between "<b><i>" and its corresponding "</i></b>", with no other bold-italic tags in between, that pattern would be incorrect.
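Running the pattern against the sample line above makes the problem concrete. This is a sketch with illustrative variable names.

```perl
use strict;
use warnings;

my $html = '<b><i>this</i> and <i>that</i> are important</b> Oh, <b><i>me too!</i></b>';

# The first "</i></b>" in the string is the one after "me too!", so the
# minimal match has to stretch across everything in between.
my ($captured) = $html =~ m{ <b><i>(.*?)</i></b> }sx;

print "$captured\n";
# this</i> and <i>that</i> are important</b> Oh, <b><i>me too!
```

The capture swallows a closing bold tag and a second opening bold-italic pair, which is exactly what the pattern was meant to exclude.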

If the delimiting string in question is just one character, and if what follows the minimal match is not a literal character, a negated character class is remarkably more efficient than a minimal match, as in /X([^X]*)X/. But the general way to say "match BEGIN, then not BEGIN, then END" for any arbitrary values of BEGIN and END would be as follows (this also stores the intervening part in $1):

/BEGIN((?:(?!BEGIN).)*)END/s

or, more legibly:

m{
  BEGIN               # locate initial portion
  (                   # save this group into $1
      (?:             # non-capturing group
          (?! BEGIN)  # assert: can't be at another BEGIN
          .           # now match any one character
      ) *             # entire group 0 or more times
  )                   # end $1 group
  END                 # locate final portion
}sx
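As a check, here is the commented pattern compiled with qr and applied to the earlier "BEGIN and BEGIN and END" string. The names $guarded and $between are illustrative.

```perl
use strict;
use warnings;

my $text = "BEGIN and BEGIN and END";

my $guarded = qr{
  BEGIN               # locate initial portion
  (                   # save this group into $1
      (?:             # non-capturing group
          (?! BEGIN)  # assert: can't be at another BEGIN
          .           # now match any one character
      ) *             # entire group 0 or more times
  )                   # end $1 group
  END                 # locate final portion
}sx;

# The attempt at the first BEGIN fails because the group may not cross
# the second BEGIN, so the match starts at the second one instead.
my ($between) = $text =~ $guarded;

print "[$between]\n";   # [ and ]
```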

However, this might not be what you're after, either. The greedy star quantifier means that the non-BEGIN portion in $1 will be maximized, so the match runs from the last BEGIN through the last END, not the first END that follows it. So if your string were:

$_ = "BEGIN1 BEGIN2 BEGIN3 3END 2END 1END";

$1 would contain "3 3END 2END 1". Making the quantifier a minimal matching one:

/BEGIN((?:(?!BEGIN).)*?)END/s

puts "3 3" in $1 for you. Now add another lookahead negation, (?!END), next to the existing one. Written out with plenty of whitespace, we now have:

m{
    BEGIN           # locate initial portion
    (               # save this group into $1
        (?:         # non-capturing group
            (?! BEGIN   )   # can't be at a BEGIN 
            (?! END     )   # also can't be at an END
            .               # finally, match any one char
        ) *         # repeat entire group ad libitum
    )               # end $1 capture
    END
}sx
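The three variants discussed so far can be compared side by side on the nested string. This is a sketch; the variable names are chosen here for illustration.

```perl
use strict;
use warnings;

my $text = "BEGIN1 BEGIN2 BEGIN3 3END 2END 1END";

# One lookahead, greedy star: last BEGIN through the last END.
my ($greedy)  = $text =~ /BEGIN((?:(?!BEGIN).)*)END/s;

# One lookahead, minimal star: last BEGIN through the next END.
my ($minimal) = $text =~ /BEGIN((?:(?!BEGIN).)*?)END/s;

# Two lookaheads, greedy star: the group cannot step onto an END at all.
my ($both)    = $text =~ /BEGIN((?:(?!BEGIN)(?!END).)*)END/s;

print "greedy:  [$greedy]\n";    # greedy:  [3 3END 2END 1]
print "minimal: [$minimal]\n";   # minimal: [3 3]
print "both:    [$both]\n";      # both:    [3 3]
```

On this input the minimal version and the two-lookahead version agree. The difference is that the two-lookahead version states the restriction explicitly instead of relying on how the quantifier prefers to match.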

Instead of adding another lookahead, another possibility is to use alternation within the existing one: (?!BEGIN|END). Applying this approach to the HTML-matching code, we end up with something like:

m{ <b><i>(  (?: (?!</b>|</i>). )*  ) </i></b> }sx

or perhaps:

m{ <b><i>(  (?: (?!</[ib]>). )*  ) </i></b> }sx
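Applied to the sample line from earlier in this Discussion, both spellings should pick out only the pair that has no closing bold or italic tag inside it. The sketch below uses /g in list context to collect every capture; the array names are illustrative.

```perl
use strict;
use warnings;

my $html = '<b><i>this</i> and <i>that</i> are important</b> Oh, <b><i>me too!</i></b>';

# Lookahead written with alternation.
my @with_alternation = $html =~ m{ <b><i>(  (?: (?!</b>|</i>). )*  ) </i></b> }gsx;

# Same idea with a character class.
my @with_class       = $html =~ m{ <b><i>(  (?: (?!</[ib]>). )*  ) </i></b> }gsx;

print "alternation: @with_alternation\n";   # alternation: me too!
print "class:       @with_class\n";         # class:       me too!
```

The first bold-italic pair is skipped because its capture would have to cross a closing italic tag before reaching the closing sequence.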

Jeffrey Friedl points out that this quick-and-dirty method isn't particularly efficient. He suggests crafting a more elaborate pattern when speed matters, such as:

m{
    <b><i>
    [^<]*  # stuff not possibly bad, and not possibly the end.
    (?:
        # at this point, we can have '<' if not part of something bad
        (?! </?[ib]>  )    # what we can't have
        <                  # okay, so match the '<'
        [^<]*              # and continue with more safe stuff
    ) *
    </i> </b>
}sx

This is a variation on Jeffrey's unrolling-the-loop technique, described in Chapter 6 of Mastering Regular Expressions, Second Edition.
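To try the unrolled pattern, the sketch below wraps the middle portion in capturing parentheses so the text between the tags lands in $1. Those parentheses are an addition made here for illustration (the pattern as shown above captures nothing), and the second test string with a harmless tt tag is a made-up input.

```perl
use strict;
use warnings;

my $unrolled = qr{
    <b><i>
    (                           # capture added for this demonstration
        [^<]*                   # safe stuff: anything but '<'
        (?:
            (?! </?[ib]> )      # the '<' must not begin a b or i tag
            <                   # okay, so match the '<'
            [^<]*               # and continue with more safe stuff
        ) *
    )
    </i> </b>
}sx;

my $sample = '<b><i>this</i> and <i>that</i> are important</b> Oh, <b><i>me too!</i></b>';
my $nested = '<b><i>see <tt>vi</tt> now</i></b>';

my ($from_sample) = $sample =~ $unrolled;
my ($from_nested) = $nested =~ $unrolled;

print "$from_sample\n";   # me too!
print "$from_nested\n";   # see <tt>vi</tt> now
```

No timings are shown here; any speed difference depends on the input and the Perl version, so it is worth measuring with the Benchmark module on your own data before committing to the longer pattern.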