7.2 Tuning a Loop

Let's look at an example of tuning a loop. In the java.io package, the Reader (and Writer) classes provide character-based I/O (as opposed to byte-based I/O). The InputStreamReader provides a bridge from byte to character streams. It reads bytes and translates them into characters according to a specified character encoding. If no encoding is specified, a default converter class is provided. For applications that spend a significant amount of time reading, it is not unusual to see the convert( ) method of this encoding class high up on a profile of how the application time is spent.

It is instructive to examine how this particular conversion method functions and to see the effect of a tuning exercise. Examining the bytecodes of the convert( ) method[2] where most of the time is being spent, you can see that the bytecodes correspond to the following method (the Exception used is different; I have just used the generic Exception class):

[2] The convert method is a method in one of the sun.* packages, so the source code is not available. I have chosen the convert method from the default class used in some ASCII environments, the ISO 8859_1 conversion class.

public int convert(byte input[], int byteStart, int byteEnd,
                  char output[], int charStart, int charEnd)
  throws Exception
{
  int charOff = charStart;
  for(int byteOff = byteStart; byteOff < byteEnd;)
  {
    if(charOff >= charEnd)
      throw new Exception();
    int i1 = input[byteOff++];
    if(i1 >= 0)
      output[charOff++] = (char)i1;
    else
      output[charOff++] = (char)(256 + i1);
  }
  
  return charOff - charStart;
}

Basically, the method takes a byte array (input) and converts the elements from byteStart to byteEnd of that array into characters. The conversion of bytes to chars is straightforward, consisting of mapping positive byte values to the same char value, and mapping negative byte values to the char with value (byte value + 256). These chars are put into the passed char array (output) from indexes charStart to charEnd.
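As a quick sanity check, that mapping can be expressed as a tiny standalone helper (the class and method names here are mine, not part of the original class):

```java
// Hypothetical helper illustrating the byte-to-char mapping that
// convert() performs; not part of the original sun.* class.
public class ByteCharMapDemo {
    // Positive byte values map to the same char value; negative
    // byte values map to (byte value + 256), i.e., chars 128-255.
    static char toChar(byte b) {
        int i = b;
        return i >= 0 ? (char) i : (char) (256 + i);
    }

    public static void main(String[] args) {
        System.out.println((int) toChar((byte) 65));   // 65 ('A')
        System.out.println((int) toChar((byte) -1));   // 255
        System.out.println((int) toChar((byte) -128)); // 128
    }
}
```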

It doesn't seem that there is much scope for tuning. There is the obvious first test (the out-of-range check on charOff), which is performed every time through the loop, and you can certainly move that out. But let's start by trying to tune the data conversion itself. First, check whether casts on data types are efficient; it takes only a quick test to find out. Add a static char array to the class, containing the char values 0 to 127 at elements 0 to 127. Calling this array MAP1, test the following altered method:

public int convert(byte input[], int byteStart, int byteEnd,
                  char output[], int charStart, int charEnd)
  throws Exception
{
  int charOff = charStart;
  for(int byteOff = byteStart; byteOff < byteEnd;)
  {
    if(charOff >= charEnd)
      throw new Exception();
    int i1 = input[byteOff++];
    if(i1 >= 0)
      output[charOff++] = MAP1[i1];
    else
      output[charOff++] = (char)(256 + i1);
  }
  
  return charOff - charStart;
}
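For reference, MAP1 itself might be initialized like this (a sketch: MAP1 is the name used in the text, but the holder class and the initializer are my assumptions):

```java
// Sketch of the static lookup table described in the text.
// The class name is hypothetical; MAP1 is the name used in the text.
public class Map1Holder {
    static final char[] MAP1 = new char[128];
    static {
        for (int i = 0; i < 128; i++) {
            MAP1[i] = (char) i; // element i holds char value i
        }
    }
}
```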

On the basis of the original method taking a normalized 100.0 seconds in test runs, this alternative averages 111 seconds over a set of test runs (some VMs, notably the server-mode HotSpot VMs, show even worse performance). This tells us that casts are not especially slow, but it hasn't made the method any faster. However, the second cast also involves an addition, so perhaps we can do better there. Unfortunately, there is no obvious way to use a negative value as an index into the array without executing some offset operation, so we won't gain time. For completeness, test this too (with an index offset given by i1+128): the average comes in at the 110-second mark, not significantly better than the last test and definitely worse than the original.

Array-lookup speeds are highly dependent on the processor and the memory-access instructions available from the processor. The lookup speed is also dependent on the compiler taking advantage of the fastest memory-access instructions available. It is possible that other processors, VMs, or compilers will produce lookups faster than the cast.

But we have gained an extra option from these two tests. It is now clear that we can map all the bytes to chars through an array. Perhaps we can eliminate the test for positiveness applied to the byte (i.e., if(i1 >= 0)) and use a char array to map all the bytes directly. And indeed we can. Use the index conversion from the second test (an index offset given by i1+128), with a static char array that contains just char values 128 to 255 at elements 0 to 127 in the array, and char values 0 to 127 at elements 128 to 255 in the array.
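Such a combined table might be initialized as follows (again a sketch: MAP3 is the name used in the text, the holder class and initializer are mine):

```java
// Sketch of the combined lookup table: element (128 + byteValue)
// holds the char that byteValue converts to. The class name is
// hypothetical; MAP3 is the name used in the text.
public class Map3Holder {
    static final char[] MAP3 = new char[256];
    static {
        for (int i = 0; i < 128; i++) {
            MAP3[i] = (char) (128 + i); // elements 0-127: chars 128-255
            MAP3[128 + i] = (char) i;   // elements 128-255: chars 0-127
        }
    }
}
```

With this layout, a byte i1 of -1 gives index 127 and char value 255, while a byte of 65 gives index 193 and char value 65, matching the original conversion.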

The method now looks like:

public int convert(byte input[], int byteStart, int byteEnd,
                  char output[], int charStart, int charEnd)
  throws Exception
{
  int charOff = charStart;
  for(int byteOff = byteStart; byteOff < byteEnd;)
  {
    if(charOff >= charEnd)
      throw new Exception();
    int i1 = input[byteOff++];
    output[charOff++] = MAP3[128 + i1];
  }
  
  return charOff - charStart;
}

We have eliminated one boolean test each time through the loop at the expense of a slightly more expensive data-conversion method (an array access rather than the cast). The average test result is now slightly faster than the original method, but different VMs show different speedups at this stage: the 1.1.6, 1.1.8, 1.2.2, 1.3.1, 1.4.1 server, and 1.4.1 interpreted VMs are 5% to 30% faster, whereas the 1.2.0, 1.3.1 server, and 1.4.0 client VMs are 5% to 15% slower.

Cleaning up the method slightly, we can see that the temporary variable, i1, which was previously required for the test, is no longer needed. Being assiduous tuners and clean coders, we eliminate it and retest so that we have a new baseline to start from. Astonishingly (to me at least), this speeds up the test measurably in some VMs. The average test time is now even better, though again, a couple of VMs are still slower than the original method. Some VMs incurred a definite overhead from the redundant temporary variable in the loop: a lesson to keep in mind for general tuning.

It may be worth testing to see if an int array performs better than the char array (MAP3) previously used, since ints are the faster data type. And indeed, changing the type of this array and putting a char cast in the loop improves times slightly for some but not all VMs, and on average times are worse. More to the point, after this effort, we have not really managed a speedup consistent enough or good enough to justify the time spent on this tuning exercise.

Now I'm out of original ideas, but one of my readers, Jesper Larsson from Sweden, thought of a better way to map the bytes to chars. Jesper noticed that the conversion corresponds to a simple bitwise operation, guaranteed by the Java language specification to work. The resulting method uses the following bitwise operation:

    output[charOff++] = (char)(input[byteOff++] & 0xFF);

instead of the previously used array map:

    output[charOff++] = (char) MAP5[input[byteOff++]+128];
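The bitwise version works because a byte is sign-extended when it is widened to an int, and masking with 0xFF keeps just the low eight bits, yielding exactly the 0-to-255 value the map lookup produced. A quick demonstration (the class name is mine):

```java
// Demonstrates why (b & 0xFF) gives the unsigned value of a byte.
public class MaskDemo {
    public static void main(String[] args) {
        byte b = -1;                                 // bit pattern 0xFF
        System.out.println((int) b);                 // -1 (sign-extended)
        System.out.println(b & 0xFF);                // 255 (low 8 bits only)
        System.out.println((int) (char) (b & 0xFF)); // 255
    }
}
```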

All the VMs except the 1.4.0 server mode show Jesper's optimization to be significantly better. And the 1.4.0 server mode is slower only because it has already done a brilliant job of optimizing the earlier changes: in absolute time, the 1.4.0 server mode at this stage is nearly twice as fast as any other VM (probably from loop unrolling; see the later discussion).

Now we will apply the standard optimizations. Start by eliminating expressions that do not need to be repeatedly evaluated inside the loop, and move the other boolean test (the one that triggers the out-of-range Exception) out of the loop. The method now looks like this:

public int convert(byte input[], int byteStart, int byteEnd,
                  char output[], int charStart, int charEnd)
  throws Exception
{
  int max = byteEnd;
  boolean throwException = false;
  if ( byteEnd-byteStart > charEnd-charStart )
  {
    max = byteStart+(charEnd-charStart);
    throwException = true;
  }
  
  int charOff = charStart;
  for(int byteOff = byteStart; byteOff < max;)
  {
    output[charOff++] = (char)(input[byteOff++] & 0xFF);
  }
  if(throwException)
    throw new Exception();
  
  return charOff - charStart;
}

I am taking the trouble to make the method functionally identical to the original. The original version filled in the array right up to the point where the out-of-range condition was encountered, so I do the same. If you instead throw the exception as soon as the index is established to be out of range, the code becomes slightly more straightforward. Other than that, the loop is the same as before, but without the out-of-range test and without the temporary assignment. The average test result is now the fastest we have obtained in any test, on all VMs: we have shaved off a third to a half of the time spent in this loop, mainly by eliminating tests that were originally run on each loop iteration. This speedup applied to every VM tested.
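To be confident that the restructured method really is functionally identical, it can be checked against the behavior just described: the output array is filled right up to charEnd before the exception is thrown. A minimal harness (the convert( ) body is from the text; the harness class and test data are mine):

```java
// Minimal check of the hoisted-test version: the output array must
// be filled up to charEnd before the exception is thrown, matching
// the original method's behavior. Harness names are hypothetical.
public class ConvertCheck {
    static int convert(byte[] input, int byteStart, int byteEnd,
                       char[] output, int charStart, int charEnd)
        throws Exception {
        int max = byteEnd;
        boolean throwException = false;
        if (byteEnd - byteStart > charEnd - charStart) {
            max = byteStart + (charEnd - charStart);
            throwException = true;
        }
        int charOff = charStart;
        for (int byteOff = byteStart; byteOff < max;) {
            output[charOff++] = (char) (input[byteOff++] & 0xFF);
        }
        if (throwException)
            throw new Exception();
        return charOff - charStart;
    }

    public static void main(String[] args) throws Exception {
        byte[] in = {65, -1, 0};
        char[] out = new char[3];
        int n = convert(in, 0, 3, out, 0, 3);
        System.out.println(n + " " + (int) out[1]); // 3 255
        // Output too small: the array is filled, then the exception thrown.
        char[] small = new char[2];
        try {
            convert(in, 0, 3, small, 0, 2);
        } catch (Exception e) {
            System.out.println((int) small[0] + " " + (int) small[1]); // 65 255
        }
    }
}
```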

Loop unrolling is another standard optimization that eliminates yet more tests. Let's partially unroll the loop and see what sort of gain we get. In practice, the optimal amount of unrolling depends on how the application uses the convert( ) method, for example, on the size of the typical array being converted. In any case, we unroll by a factor of 10 here to see the effect.

Optimal loop unrolling depends on a number of factors, including the underlying operating system and hardware. Loop unrolling is ideally achieved by way of an optimizing compiler rather than by hand. HotSpot interacts with manual loop unrolling in a highly variable way: sometimes HotSpot makes the unoptimized loop faster, sometimes the manually unrolled loop comes out faster. Table 8-1 and Table 8-2 show HotSpot producing both faster and slower times for the same manually unrolled loop, depending on the data being processed. These two tables show the results from the same optimized program being run against files with long lines (Table 8-1) and files with short lines (Table 8-2). Of all the VMs tested, only the HotSpot VM produces inconsistent results, with a speedup when processing the long-line files but a slowdown when processing the short-line files. (The last two lines of each table show the difference between the original loop and the manually unrolled loop.)

The method now looks like this:

public int convert(byte input[], int byteStart, int byteEnd,
                  char output[], int charStart, int charEnd)
  throws Exception
{
  //Set the maximum index of the input array to wind to
  int max = byteEnd;
  boolean throwException = false;
  if ( byteEnd-byteStart > charEnd-charStart )
  {
    //If the byte array length is larger than the char array length
    //then we will throw an exception when we get to the adjusted max
    max = byteStart+(charEnd-charStart);
    throwException = true;
  }
  
  //charOff is the 'current' index into 'output'
  int charOff = charStart;
  
  //Check that we have at least 10 elements for our
  //unrolled part of the loop
  if (max-byteStart > 10)
  {
    //shift max down by 10 so that we have some elements
    //left over before we run out of groups of 10
    max -= 10;
    int byteOff = byteStart;
    //The loop test only tests every 10th test compared
    //to the normal loop. All the increments are done in
    //the loop body. Each line increments the byteoff by 1
    //until it's incremented by 10 after 10 lines. Then the test
    //checks that we are still under max - if so then loop again.
    for(; byteOff < max;)
    {
      output[charOff++] = (char) (input[byteOff++] & 0xFF);
      output[charOff++] = (char) (input[byteOff++] & 0xFF);
      output[charOff++] = (char) (input[byteOff++] & 0xFF);
      output[charOff++] = (char) (input[byteOff++] & 0xFF);
      output[charOff++] = (char) (input[byteOff++] & 0xFF);
      output[charOff++] = (char) (input[byteOff++] & 0xFF);
      output[charOff++] = (char) (input[byteOff++] & 0xFF);
      output[charOff++] = (char) (input[byteOff++] & 0xFF);
      output[charOff++] = (char) (input[byteOff++] & 0xFF);
      output[charOff++] = (char) (input[byteOff++] & 0xFF);
    }
  
    //We exited the loop because the byteoff went over the max.
    //Fortunately we kept back 10 elements so that we didn't go
    //too far past max. Now add the 10 back, and go into the
    //normal loop for the last few elements.
    max += 10;
    for(; byteOff < max;)
    {
      output[charOff++] = (char) (input[byteOff++] & 0xFF);
    }
  }
  else
  {
    //If we're in this conditional, then there aren't even
    //10 elements to process, so obviously we don't want to
    //do the unrolled part of the method.
    for(int byteOff = byteStart; byteOff < max;)
    {
      output[charOff++] = (char) (input[byteOff++] & 0xFF);
    }
  }
  //Finally if we indicated that the method needed an exception
  //thrown, we do it now.
  if(throwException)
    throw new Exception();
  
  return charOff - charStart;  
}

The average test result is now around 50% of the original method time for almost all the VMs. Only the 1.4.0 server-mode VM has a different result. Though still faster than the original method and almost all the other VMs, the last manual loop unrolling actually slowed down the 1.4.0 VM compared to running it with the earlier optimized method. This is likely to be caused by the 1.4 server VM doing a far better job of unrolling the loop than our handcrafted unroll.

It's good news that this kind of optimization is finally being applied efficiently by the VM. But from a performance-tuning point of view, this means that it is difficult to know whether to unroll the loop manually or not. Obviously, if you know exactly which VM your application runs on, you can establish whether the unrolling optimization produces faster code. But if your application could be used under any VM, the decision is more complex. The slower VMs benefit from manual unrolling, whereas the faster, server-mode VMs still remain faster in absolute terms even after being slowed down by manual loop unrolling. This suggests that, at least for the time being, manual loop unrolling is worth considering.

It is worth repeating that the speedup we have obtained is mainly a result of eliminating tests that were originally run in each loop iteration. For tight loops (i.e., loops that have a small amount of actual work that needs to be executed on each iteration), the overhead of tests is definitely significant.

It is also important during the tuning exercise to run the various improvements under different VMs and determine that the improvements are generally applicable. My tests indicate that these improvements are generally valid for all runtime environments. (One development environment with a very slow VM, an order of magnitude slower than the Sun VM without a JIT, showed only a small improvement. However, it is not generally a good idea to base performance tests on development environments.)

For a small Java program that does simple filtering or conversion of data from text files, this convert( ) method could take 40% of the total program time. Improving this one method as shown can shave 20% from the time of the whole program, which is a good gain for a relatively small amount of work (it took me longer to write this section than to tune the convert( ) method).