Recipe 4.16 Sorting a List by Computable Field

4.16.1 Problem

You want to sort a list by something more complex than a simple string or numeric comparison.

This is common when working with objects ("sort by the employee's salary") or complex data structures ("sort by the third element in the array that this is a reference to"). It's also applicable when you want to sort by more than one key; for instance, sorting by birthday and then by name when multiple people share the same birthday.

4.16.2 Solution

Use the customizable comparison routine in sort:

@ordered = sort { compare( ) } @unordered;

You can speed this up by precomputing the field.

@precomputed = map { [compute( ),$_] } @unordered;
@ordered_precomputed = sort { $a->[0] <=> $b->[0] } @precomputed;
@ordered = map { $_->[1] } @ordered_precomputed;

And, finally, you can combine the three steps:

@ordered = map { $_->[1] }
           sort { $a->[0] <=> $b->[0] }
           map { [compute( ), $_] }
           @unordered;

4.16.3 Discussion

The use of a comparison routine was explained in Recipe 4.15. As well as using built-in operators like <=>, you can construct more complex tests:

@ordered = sort { $a->name cmp $b->name } @employees;

You often see sort used like this in part of a foreach loop:

foreach $employee (sort { $a->name cmp $b->name } @employees) {
    print $employee->name, " earns \$", $employee->salary, "\n";
}

If you're going to do a lot of work with elements in a particular order, it's more efficient to sort once and work from that:

@sorted_employees = sort { $a->name cmp $b->name } @employees;
foreach $employee (@sorted_employees) {
    print $employee->name, " earns \$", $employee->salary, "\n";
}
# load %bonus
foreach $employee (@sorted_employees) {
    if ( $bonus{ $employee->ssn } ) {
      print $employee->name, " got a bonus!\n";
    }
}

We can put multiple comparisons in the routine and separate them with ||. || is a short-circuit operator: it returns the first true value it finds. This means we can sort by one kind of comparison, but if the elements are equal (the comparison returns 0), we can sort by another. This has the effect of a sort within a sort:

@sorted = sort { $a->name cmp $b->name
                           ||
                  $b->age <=> $a->age } @employees;

This first considers the names of the two employees to be compared. If they're not equal, || stops and returns the result of the cmp (effectively sorting them in ascending order by name). If the names are equal, though, || keeps testing and returns the result of the <=> (sorting them in descending order by age). The result is a list that is sorted by name and by age within groups of the same name.

Let's look at a real-life example of sorting. First we fetch all system users, represented as User::pwent objects. Then we sort them by name and print the sorted list:

use User::pwent qw(getpwent);
@users = ( );
# fetch all users
while (defined($user = getpwent)) {
    push(@users, $user);
}
@users = sort { $a->name cmp $b->name } @users;
foreach $user (@users) {
    print $user->name, "\n";
}

We can have more than simple comparisons, or combinations of simple comparisons. This code sorts a list of names by comparing the second letters of the names. It gets the second letters by using substr:

@sorted = sort { substr($a,1,1) cmp substr($b,1,1) } @names;

and here we sort by string length:

@sorted = sort { length $a <=> length $b } @strings;

The sort function calls the code block each time it needs to compare two elements, so the number of comparisons grows dramatically with the number of elements we're sorting. Sorting 10 elements requires (on average) 46 comparisons, but sorting 1,000 elements requires 14,000 comparisons. A time-consuming operation like a split or a subroutine call for each comparison can easily make your program crawl.

Fortunately, we can remove this bottleneck by running the operation once per element prior to the sort. Use map to store the results of the operation in an array whose elements are anonymous arrays containing both the computed field and the original field. Then we sort this array of arrays on the precomputed field and use map to get the sorted original data. This map-sort-map concept is useful and common, so let's look at it in depth.

Let's apply map-sort-map to the sorting by string length example:

@temp   = map  {  [ length $_, $_ ]  } @strings;
@temp   = sort { $a->[0] <=> $b->[0] } @temp;
@sorted = map  {        $_->[1]      } @temp;

The first line creates a temporary array of strings and their lengths, using map. The second line sorts the temporary array by comparing the precomputed lengths. The third line turns the sorted temporary array of strings and lengths back into a sorted array of strings. This way, we calculate the length of each string only once.

Because the input to each line is the output of the previous line (the @temp array we make in line 1 is fed to sort in line 2, and that output is fed to map in line 3), we can combine it into one statement and eliminate the temporary array:

@sorted = map  { $_->[1] }
          sort { $a->[0] <=> $b->[0] }
          map  { [ length $_, $_ ] }
          @strings;

The operations now appear in reverse order. When you meet a map-sort-map, you should read it from the bottom up to determine the function:

@strings: The last part is the data to be sorted. Here it's just an array, but later we'll see that this can be a subroutine or even backticks. Anything that returns a list is fair game.
map: The map closer to the bottom builds the temporary list of anonymous arrays. This list contains the precomputed fields (length $_) and also records the original element ($_) by storing both in an anonymous array. Look at this map line to find out how the fields are computed.
sort: The sort line sorts the list of anonymous arrays by comparing the precomputed fields. It won't tell you much, other than whether the list is sorted in ascending or descending order.
map: The map at the top of the statement turns the sorted list of anonymous arrays back into a list of the sorted original elements. It will generally be the same for every map-sort-map.

Here's a more complicated example, which sorts by the first number that appears on each line in @fields:

@temp          = map  {    [ /(\d+)/, $_ ]  } @fields;
@sorted_temp   = sort { $a->[0] <=> $b->[0] } @temp;
@sorted_fields = map  {       $_->[1]       } @sorted_temp;

The regular expression mumbo jumbo in the first line extracts the first number from the line being processed by map. We use the regular expression /(\d+)/ in a list context to extract the number.

We can remove the temporary arrays in that code, giving us:

@sorted_fields = map  { $_->[1] }
                 sort { $a->[0] <=> $b->[0] }
                 map  { [ /(\d+)/, $_ ] }
                 @fields;

This final example compactly sorts colon-separated data, as from Unix's passwd file. It sorts the file numerically by the fourth field (group id), then numerically by the third field (user id), and then alphabetically by the first field (username).

print map  { $_->[0] }             # whole line
      sort {
              $a->[1] <=> $b->[1]  # gid
                      ||
              $a->[2] <=> $b->[2]  # uid
                      ||
              $a->[3] cmp $b->[3]  # login
      }
      map  { [ $_, (split /:/)[3,2,0] ] }
      `cat /etc/passwd`;

4.16.4 See Also

The sort function in perlfunc(1) and Chapter 29 of Programming Perl; the cmp and <=> operators in perlop(1) and Chapter 3 of Programming Perl; Recipe 4.15