Recipe 1.22 Soundex Matching

1.22.1 Problem

You have two English surnames and want to know whether they sound somewhat similar, regardless of spelling. This would let you offer users a "fuzzy search" of names in a telephone book to catch "Smith" and "Smythe" and others within the set, such as "Smite" and "Smote".

1.22.2 Solution

Use the standard Text::Soundex module:

use Text::Soundex;
$CODE  = soundex($STRING);
@CODES = soundex(@LIST);

Or use the CPAN module Text::Metaphone:

use Text::Metaphone;
$phoned_words = Metaphone('Schwern');

1.22.3 Discussion

The soundex algorithm hashes words (particularly English surnames) into a small space using a simple model that approximates an English speaker's pronunciation of the words. Roughly speaking, each word is reduced to a four-character string. The first character is an uppercase letter; the remaining three are digits. By comparing the soundex values of two strings, we can guess whether they sound similar.

The following program prompts for a name and looks for similarly sounding names from the password file. This same approach works on any database with names, so you could key the database on the soundex values if you wanted to. Such a key wouldn't be unique, of course.

use Text::Soundex;
use User::pwent;

print "Lookup user: ";
chomp($user =<STDIN>);
exit unless defined $user;
$name_code = soundex($user);

while ($uent = getpwent( )) {
    ($firstname, $lastname) = $uent->gecos =~ /(\w+)[^,]*\b(\w+)/;

    if ($name_code eq soundex($uent->name) ||
        $name_code eq soundex($lastname)   ||
        $name_code eq soundex($firstname)  )
        printf "%s: %s %s\n", $uent->name, $firstname, $lastname;

The Text::Metaphone module from CPAN addresses the same problem in a different and better way. The soundex function returns a letter and a three-digit code that maps just the beginning of the input string, whereas Metaphone returns a code as letters of variable length. For example:

                            soundex  metaphone  

    Christiansen            C623     KRSXNSN
    Kris Jenson             K625     KRSJNSN

    Kyrie Eleison           K642     KRLSN
    Curious Liaison         C624     KRSLSN

To get the most of Metaphone, you should also use the String::Approx module from CPAN, described more fully in Recipe 6.13. It allows for there to be errors in the match and still be successful. The edit distance is the number of changes needed to go from one string to the next. This matches a pair of strings with an edit distance of two:

if (amatch("string1", [2], "string2") {  }

There's also an adist function that reports the edit distance. The edit distance between "Kris Jenson" "Christiansen" is 6, but between their Metaphone encodings is only 1. Likewise, the distance between the other pair is 8 originally, but down to 1 again if you compare Metaphone encodings.

use Text::Metaphone qw(Metaphone);
use String::Approx  qw(amatch);

if (amatch(Metaphone($s1), [1], Metaphone($s1)) {
    print "Close enough!\n";

This would successfully match both of our example pairs.

1.22.4 See Also

The documentation for the standard Text::Soundex and User::pwent modules; the Text::Metaphone and String::Approx modules from CPAN; your system's passwd(5) manpage; Volume 3, Chapter 6 of The Art of Computer Programming, by Donald E. Knuth (Addison-Wesley)