5.6 Sorting Internationalized Strings

One big advantage with Strings is that they are built (almost) from the ground up to support internationalization. This means that the Unicode character set is the lingua franca in Java. Unfortunately, because Unicode uses two-byte characters, many string libraries based on one-byte characters that can be ported into Java do not work so well. Most string-search optimizations use tables to assist string searches, but the table size is related to the size of the character set. For example, a traditional Boyer-Moore string search takes a great deal of memory and a long initialization phase to use with Unicode.

The Boyer-Moore String-Search Algorithm

The Boyer-Moore string search uses a table of characters to skip comparisons. Here's a simple example with none of the complexities. Assume you are matching "abcd" against a string. The "abcd" is aligned against the first four characters of the string. The fourth character of the string is checked first. If that fourth character is none of a, b, c, or d, the "abcd" can be skipped to be matched against the fifth to eighth characters, and the matching proceeds in the same way. If instead the fourth character of the string is b, the "abcd" can be skipped to align the b against the fourth character, and the matching proceeds as before. For optimum speed, this algorithm requires several arrays giving skip distances for each possible character in the character set. For more detail, see The Art of Computer Programming by Donald Knuth (Addison-Wesley) or the paper "Fast Algorithms for Sorting and Searching Strings," by Jon Bentley and Robert Sedgewick, Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, January 1997.

Furthermore, sorting international Strings requires the ability to handle many kinds of localization issues, such as the sorted location for accented characters, characters that can be treated as character pairs, and so on. In these cases, it is difficult (and usually impossible) to handle the general case yourself. It is almost always easier to use the String helper classes Java provides, for example, the java.text.Collator class.^[7]

^[7] The code that handles this type of work didn't really start to get integrated in Java until 1.1 and did not start to be optimized until JDK 1.2. An article by Laura Werner of IBM in the February 1999 issue of the Java Report, "Efficient Text Searching in Java," covers the optimizations added to the java.text.Collator class for JDK 1.2. There is also a useful StringSearch class available at the IBM alphaWorks site (http://alphaworks.ibm.com/tech/stringsearch).

Using the java.text.CollationKey object to represent each string is a standard optimization for repeated comparisons of internationalized Strings. You can use this when sorting an array of Strings, for example. CollationKeys perform more than twice as fast as using java.text.Collator.compare( ) . It is probably easiest to see how to use collation keys with a particular example. So let's look at tuning an internationalized String sort.

For this, I use a standard quicksort algorithm (the quicksort implementation can be found in Section 11.9). The only modification to the standard quicksort is that for each optimization, the quicksort needs to be adjusted to use the appropriate comparison method and the appropriate data type. For example, the generic quicksort that sorts an array of Comparable objects has the signature:

public static void quicksort(Comparable[  ] arr, int lo, int hi)

and uses the Comparable.compareTo(Object) method when comparing two Comparable objects. On the other hand, a generic quicksort that sorts objects based on a java.util.Comparator has the signature:

public static void quicksort(Object[  ] arr, int lo, int hi, Comparator c)

and uses the java.util.Comparator.compare(Object, Object) method when comparing any two objects. (See java.util.Arrays.sort( ) for a specific example.) In each case the underlying algorithm is the same. Only the comparison method changes (and in general the data type too, though not in these examples where the data type was Object).

The obvious first test, to get a performance baseline, is the straightforward internationalized sort:

public runsort(  ) {
  quicksort(stringArray,0,stringArray.length-1, Collator.getInstance(  ));
}
public static void quicksort(String[  ] arr, int lo, int hi, java.text.Collator c)
{
  ...
  int mid = ( lo + hi ) / 2;
  String middle = arr[ mid ]; //String data type
  ...
  //uses Collator.compare(String, String)
  if( c.compare(arr[ lo ], middle) > 0 )
  ...
}

I use a large dictionary of words for the array of strings, inserted in random order, and I use the same random order for each of the tests. The first test took longer than expected. Looking at the Collator class, I can see that it does a huge amount of work, and I cannot possibly bypass its internationalized support if I want to support internationalized strings.^[8]

^[8] The kind of investment made in building such global support is beyond most projects; it is almost always much cheaper to buy the support. In this case, Taligent put a huge number of man years into the globalization you get for free with the JDK.

However, as previously mentioned, the java.util.CollationKey class is specifically designed to provide for this type of speedup. It is simple to convert the sort in order to use this. You still need the Collator to generate the CollationKeys, so add a conversion method. The sort now looks like:

public runsort(  ) {
  quicksort(stringArray,0,stringArray.length-1, Collator.getInstance(  ));
}
public static void quicksort(String[  ] arr, int lo, int hi, Collator c)
{
  //convert to an array of CollationKeys
  CollationKey keys[  ] = new CollationKey[arr.length];
  for (int i = arr.length-1; i >= 0; i--)
    keys[i] = c.getCollationKey(arr[i]);
  
  //Run the sort on the collation keys
  quicksort_collationKey(keys, 0, arr.length-1);
  
  //and unwrap so that we get our Strings in sorted order
  for (int i = arr.length-1; i >= 0; i--)
    arr[i] = keys[i].getSourceString(  );
}
public static void quicksort_collationKey(CollationKey[  ] arr, int lo, int hi)
{
  ...
  int mid = ( lo + hi ) / 2;
  CollationKey middle = arr[ mid ];  //CollationKey data type
  ...
  //uses CollationKey.compareTo(CollationKey)
  if( arr[ lo ].compareTo(middle) > 0 )
  ...
}

Normalizing the time for the first test to 100%, this test is much faster and takes less than half the time (see Table 5-10). This is despite the extra cost imposed by a whole new populated array of CollationKey objects, one for each string. Can it do better? Well, there is nothing further in the java.text package that suggests so. Instead look at the String class, and consider its implementation of the String.compareTo( ) method. This is a simple lexicographic ordering, basically treating the char array as a sequence of numbers and ordering sequence pairs as if there is no meaning to the object being Strings. Obviously, this is useless for internationalized support, but it is much faster. A quick test shows that sorting the test String array using the String.compareTo( ) method takes just 2% of time of the first test, which seems much more reasonable.

But is this test incompatible with the desired internationalized sort? Well, maybe not. Sort algorithms usually execute faster if they operate on a partially sorted array. Perhaps using the String.compareTo( ) sort first might bring the array considerably closer to the final ordering of the internationalized sort, and at a fairly low cost. Testing this is straightforward:

public runsort(  ) {
  quicksort(stringArray,0,stringArray.length-1, Collator.getInstance(  ));
}
public static void quicksort(String[  ] arr, int lo, int hi, Collator c)
{
  //simple sort using String.compareTo(  )
  simple_quicksort(arr, lo, hi);
  
  //Full international sort on a hopefully partially sorted array
  intl_quicksort(arr, lo, hi, c);
}
public static void simple_quicksort(String[  ] arr, int lo, int hi)
{
  ...
  int mid = ( lo + hi ) / 2;
  String middle = arr[ mid ];  //uses String data type
  ...
  //uses String.compareTo(String)
  if( arr[ lo ].compareTo(middle) > 0 )
  ...
}
public static void intl_quicksort(String[  ] arr, int lo, int hi, Collator c)
{
  //convert to an array of CollationKeys
  CollationKey keys[  ] = new CollationKey[arr.length];
  for (int i = arr.length-1; i >= 0; i--)
    keys[i] = c.getCollationKey(arr[i]);
  
  //Run the sort on the collation keys
  quicksort_collationKey(keys, 0, arr.length-1);
  
  //and unwrap so that we get our Strings in sorted order
  for (int i = arr.length-1; i >= 0; i--)
    arr[i] = keys[i].getSourceString(  );
}
public static void quicksort_collationKey(CollationKey[  ] arr, int lo, int hi)
{
  ...
  int mid = ( lo + hi ) / 2;
  CollationKey middle = arr[ mid ]; //CollationKey data type
  ...
  //uses CollationKey.compareTo(CollationKey)
  if( arr[ lo ].compareTo(middle) > 0 )
  ...
}

This double-sorting implementation reduces the international sort time to a quarter of the original test time (see Table 5-10). Partially sorting the list first using a much simpler (and quicker) comparison test has doubled the speed of the total sort as compared to using only the CollationKeys optimization.

Table 5-10. Timings using different sorting strategies
Sort using:	1.1.8	1.2.2	1.3.1	1.3.1-server	1.4.0	1.4.0-server	1.4.0-Xint
Collator	266%	100%	65%	22%	56%	51%	1235%
CollationKeys	80%	39%	24%	18%	27%	15%	226%
Sorted twice	35%	18%	9.8%	7.9%	12%	8.5%	133%
String.compareTo( )	2.3%	2.3%	1.8%	1.5%	1.9%	1.4%	23%

Of course, these optimizations have improved the situation only for the particular locale I have tested (my default locale is set for US English). However, running the test in a sampling of other locales (European and Asian locales), I find similar relative speedups. Without using locale-specific dictionaries, this locale variation test may not be fully valid, but the speedup will likely hold across all Latinized alphabets. You can also create a simple partial-ordering class-specific sort to some locales, which provides a similar speedup. For example, by duplicating the effect of using String.compareTo( ), you can provide the basis for a customized partial sorter:

public class PartialSorter {
  String source;
  char[  ] stringArray;
  public Sorting(String s)
  {
    //retain the original string
    source = s;
    //and get the array of characters for our customized comparison
    stringArray = new char[s.length(  )];
    s.getChars(0, stringArray.length, stringArray, 0);
  }
  /* This compare method should be customized for different locales */
  public static int compare(char[  ] arr1, char[  ] arr2)
  {
    //basically the String.compareTo(  ) algorithm
    int n = Math.min(arr1.length, arr2.length);
    for (int i = 0; i < n; i++)
    {
      if (arr1[i] != arr2[i])
        return arr1[i] - arr2[i];
    }
    return arr1.length - arr2.length;
  }
  public static void quicksort(String[  ] arr, int lo, int hi)
  {
    //convert to an array of PartialSorters
    PartialSorter keys[  ] = new PartialSorter[arr.length];
    for (int i = arr.length-1; i >= 0; i--)
      keys[i] = new PartialSorter(arr[i]);
    quicksort_mysorter(keys, 0, arr.length-1);
    //and unwrap so that we get our Strings in sorted order
    for (int i = arr.length-1; i >= 0; i--)
      arr[i] = keys[i].source;
  }
  public static void quicksort_mysorter(PartialSorter[  ] arr, int lo, int hi)
  {
    ...
    int mid = ( lo + hi ) / 2;
    PartialSorter middle = arr[ mid ]; //PartialSorter data type
    ...
    //Use the PartialSorter.compare(  ) method to compare the char arrays
    if( compare(arr[ lo ].stringArray, middle.stringArray) > 0 ) 
    ...
  }
}

This PartialSorter class works similarly to the CollationKey class, wrapping a string and providing its own comparison method. The particular comparison method shown here is just an implementation of the String.compareTo( ) method. It is pointless to use it exactly as defined here because object-creation overhead means that using the PartialSorter is twice as slow as using the String.compareTo( ) directly. But customizing the PartialSorter.compare( ) method for any particular locale is a reasonable task: remember, we are interested only in a simple algorithm that handles a partial sort, not the full intricacies of completely accurate locale-specific comparison.

Generally, you cannot expect to support internationalized strings and retain the performance of simple one-byte-per-character strings. But, as shown here, you can certainly improve the performance.