Recipe 6.16 Detecting Doubled Words

6.16.1 Problem

You want to check for doubled words in a document.

6.16.2 Solution

Use backreferences in your pattern.

6.16.3 Discussion

Parentheses in a pattern make the matching engine remember what text that portion of the pattern matched. Later in the pattern, refer to the actual string that matched with \1 (indicating the string matched by the first set of parentheses), \2 (for the string matched by the second set of parentheses), and so on. Don't use $1 within a regex, because it would be a variable interpolated before the match began. The pattern /([A-Z])\1/ matches a capital letter followed not just by any capital letter, but by whichever one was just matched (i.e., captured by the first set of parentheses in that pattern).

The next sample code reads its input files by paragraph, with the definition of paragraph following Perl's notion of a paragrapha chunk of text terminated by two or more contiguous newlines. Within each paragraph, the code finds all doubled words. It ignores case and can match across newlines.

Here we use /x to embed whitespace and comments to improve readability. The /i permits both instances of "is" in the sentence "Is is this ok?" to match, even though they differ in case. We use /g in a while loop to keep finding doubled words until we run out of text.

$/ = '';                      # paragrep mode
while (<>) {
    while ( m{
                \b            # start at a word boundary (begin letters)
                (\S+)         # find chunk of non-whitespace
                \b            # until another word boundary (end letters)
                (
                    \s+       # separated by some whitespace
                    \1        # and that very same chunk again
                    \b        # until another word boundary
                ) +           # one or more sets of those
             }xig
         )
    {
        print "dup word '$1' at paragraph $.\n";
    }
}

That code finds the duplicated test in the following paragraph:

This is a test
test of the doubled word finder.

Word boundary anchors surrounding \S+ are often a bad idea because they do something you might not be expecting. That's because word boundaries in Perl are defined as transitions between alphanumunders (that's a \w) and either the edge of the string or a non-alphanumunder. Surrounding \S+ with \b subtly changes \S+ from its normal meaning of one or more non-whitespace characters to a stretch of non-whitespace whose first and last character must be an alphanumunder.

Sometimes, though, this might be just what you're looking for. Consider the string:

$string = q("I can't see this," she remarked.);

@a = $string =~ /\b\S+\b/g;
@b = $string =~ /\S+/g;

The elements of @a are now:

0  I
1  can't
2  see
3  this
4  she
5  remarked

but those of @b are:

0  "I
1  can't
2  see
3  this,"
4  she
5  remarked.

Here's another interesting demonstration of backreferences. Imagine two words in which the end of the first word is the same as the start of the next one, such as "nobody" and "bodysnatcher". You'd like to find that overlapping part and come up with "nobodysnatcher". This is a variant on the doubled word problem.

Conventional character-by-character processing the way a C programmer would write it would take a great deal of tricky code. But with a backtracking pattern matcher, it just takes one simple pattern match.

$a = 'nobody';
$b = 'bodysnatcher';
if ("$a $b" =~ /^(\w+)(\w+) \2(\w+)$/) {
    print "$2 overlaps in $1-$2-$3\n";
}
body overlaps in no-body-snatcher

You might think that $1 would first grab up all of "nobody" due to greediness. It doesfor a while. But once it's done so, there aren't any more characters to put in $2. So the engine backs up, and $1 begrudgingly gives up one character to $2. The space character matches successfully, but then sees \2, which currently holds a lone "y". The next character in the string is not a "y", but a "b". This makes the engine back up, eventually forcing $1 to surrender enough to $2 that the pattern can match some string, a space, and then that same string again.

That won't quite work out if the overlap is itself the product of a doubling, as in "rococo" and "cocoon". The preceding algorithm would have decided that the overlapping string, $2, must be just "co" rather than "coco". But we don't want a "rocococoon"; we want a "rococoon". Adding a minimal matching quantifier to the $1 part gives the much better pattern:

/^(\w+?)(\w+) \2(\w+)$/,

which solves this problem.

Backtracking is more powerful than you might imagine. Example 6-8 offers another take on the prime factorization problem from Chapter 1.

Example 6-8. prime-pattern

  #!/usr/bin/perl
  # prime_pattern -- find prime factors of argument using pattern matching
  for ($N = ('o' x shift); $N =~ /^(oo+?)\1+$/; $N =~ s/$1/o/g) {
      print length($1), " ";
  }
  print length ($N), "\n";

Although not practical, this approach marvelously demonstrates the power of backtracking.

Here's another example. Using a brilliant insight first illustrated by Doug McIlroy (or so says Andrew Hume), you can find solutions to Diophantine equations of order one with regular expressions. Consider the equation 12x + 15y + 16z = 281. Can you think of possible values for x, y, and z? Perl can!

# solve for 12x + 15y + 16z = 281, maximizing x
if (($X, $Y, $Z)  =
   (('o' x 281)  =~ /^(o*)\1{11}(o*)\2{14}(o*)\3{15}$/))
{
    ($x, $y, $z) = (length($X), length($Y), length($Z));
    print "One solution is: x=$x; y=$y; z=$z.\n";
} else {
    print "No solution.\n";
}
One solution is: x=17; y=3; z=2.

Because the first o* was greedy, x was allowed to grow as large as it could. Changing one or more * quantifiers to *?, +, or +? can produce different solutions.

('o' x 281)  =~ /^(o+)\1{11}(o+)\2{14}(o+)\3{15}$/
One solution is: x=17; y=3; z=2
('o' x 281)  =~ /^(o*?)\1{11}(o*)\2{14}(o*)\3{15}$/
One solution is: x=0; y=7; z=11.
('o' x 281)  =~ /^(o+?)\1{11}(o*)\2{14}(o*)\3{15}$/
One solution is: x=1; y=3; z=14.

An important lesson to be learned from these amazing feats of mathematical prowess by a lowly pattern matcher is that a pattern-matching engine, particularly a backtracking one, very much wants to give you an answer, and it will work phenomenally hard to do so. But solving a regular expression with backreferences can take time exponentially proportional to the length of the input to complete. For all but trivial inputs, such algorithms make continental drift seem brisk.

6.16.4 See Also

The explanation of backreferences in the "Regular Expressions" section of perlre(1), and in "The Little Engine That /Could(n't)?/" section of Chapter 5 of Programming Perl; the "The Doubled-Word Thing" section in Chapter 2 of Mastering Regular Expressions