8.3 From Raw I/O to Smokin' I/O

So far we have looked only at general points about I/O and logging. Now we look at an example of tuning I/O performance. The example consists of reading lines from a large file. This section was inspired by an article from Sun Engineering,[4] though I go somewhat further along the tuning cycle.

[4] "Java Performance I/O Tuning," Java Developer's Journal, Volume 2, Issue 11. See http://www.JavaDevelopersJournal.com.

The initial attempt at file I/O might be to use the FileInputStream to read through a file. Note that DataInputStream has a readLine( ) method (now deprecated because it is byte-based rather than char-based, but ignore that for the moment), so you wrap the FileInputStream with the DataInputStream, and run. The code looks like:

DataInputStream in = new DataInputStream(new FileInputStream(file));
String line;
while ( (line = in.readLine()) != null)
{
  doSomethingWith(line);
}
in.close();

For these timing tests, I use two different files, a 1.8 MB file with about 20,000 lines (long lines), and a one-third of a megabyte file with about 34,000 lines (short lines). I test using several VMs to show the variations across VMs and the challenges in improving performance across different runtime environments. To make comparisons simpler, I report the times as normalized to 100% for the JDK 1.2.2 VM with JIT. The long-line case and the short-line case are normalized separately. Tests are averages across at least three test runs. For the baseline test, I have the following chart (see Table 8-1 and Table 8-2 for full results). Note that the server mode results show the second run of tests, after HotSpot has had a chance to apply its optimizations.
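As a rough sketch of the kind of measurement harness involved (my illustration; the book's actual test code is not shown), each test simply wraps the read loop in wall-clock timing and averages several runs:

```java
// Minimal timing-harness sketch (an assumption, not the book's code):
// wrap each I/O test in wall-clock timing and average several runs,
// as the tests here are averages across at least three runs.
public class TimingHarness {
    static long timeMillis(Runnable task) {
        long start = System.currentTimeMillis();
        task.run();
        return System.currentTimeMillis() - start;
    }

    static long averageMillis(Runnable task, int runs) {
        long total = 0;
        for (int i = 0; i < runs; i++)
            total += timeMillis(task);
        return total / runs;   // average over the requested runs
    }
}
```

The averaged times are then normalized against the JDK 1.2.2 buffered baseline to give the percentages in the tables.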

Short lines                  1.1.8  1.2.2[5]  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Unbuffered input stream        86%      100%   109%           98%   103%          115%        146%

Long lines                   1.1.8  1.2.2[5]  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Unbuffered input stream        83%      100%   108%           95%    95%          110%        170%

[5] The short-line 1.2 and long-line 1.2 cases have been separately normalized to 100%. All short-line times are relative to the short-line 1.2, and all long-line times are relative to the long-line 1.2.

The first test in absolute times is really dreadful because the I/O is executed one byte at a time. This performance is the result of using a plain FileInputStream without buffering the I/O; the process is completely I/O-bound. For this reason, I expected the absolute times of the various VMs to be similar, since the CPU is not the bottleneck. Curiously, though, they vary. Possibly the underlying native call implementation differs between VM versions, but I am not interested enough to spend time determining why there should be differences for the unbuffered case. After all, no one uses unbuffered I/O. Everyone knows you should buffer your I/O (except when memory is really at a premium, as in an embedded system).

So let's immediately move to wrapping the FileInputStream with a BufferedInputStream.[6]

[6] Buffering I/O does not require the use of buffered classes. You can buffer I/O directly from the FileInputStream class and other low-level classes by passing arrays to the read( ) and write( ) methods. This means you need to handle buffer overflows yourself.
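The do-it-yourself buffering mentioned in the footnote can be sketched as follows (my illustration, not the book's code): read large chunks from the unbuffered FileInputStream into your own byte array by passing the array to read(), and handle the chunk boundaries yourself:

```java
import java.io.*;

// Sketch of buffering "by hand" (my illustration): pass an array to
// read() on the low-level FileInputStream instead of wrapping it in a
// BufferedInputStream. You then handle chunk boundaries (e.g., lines
// straddling two chunks) yourself.
public class ManualBuffering {
    public static long countBytes(String filename) throws IOException {
        FileInputStream in = new FileInputStream(filename);
        byte[] buf = new byte[8192];          // our own buffer
        long total = 0;
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            // process buf[0..n-1] here; note a line may straddle chunks
            total += n;
        }
        in.close();
        return total;
    }
}
```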

The code has only slight changes, in the constructor:

//DataInputStream in = new DataInputStream(new FileInputStream(file));
DataInputStream in = new DataInputStream(
    new BufferedInputStream(new FileInputStream(file)));
while ( (line = in.readLine()) != null)
{
  doSomethingWith(line);
}
in.close();

However, the times are already faster by an order of magnitude, as you can see in the following charts:

Short lines                  1.1.8  1.2.2  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Unbuffered input stream        86%   100%   109%           98%   103%          115%        146%
Buffered input stream           7%     6%     3%            3%     3%            2%         21%

Long lines                   1.1.8  1.2.2  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Unbuffered input stream        83%   100%   108%           95%    95%          110%        170%
Buffered input stream           6%     5%     3%            3%     3%            2%         24%

The lesson is clear, if you haven't already had it drummed home somewhere else: buffered I/O performs much better than unbuffered I/O. Having established that buffered I/O is better than unbuffered, you renormalize your times on the buffered I/O case so that you can compare any improvements against the normal case.

So far, we have used only the default buffer, which is a 2048-byte buffer (contrary to the JDK 1.1.6 documentation, which states it is 512 bytes; always check the source on easily changeable things like this). Perhaps a larger buffer would be better. Let's try 8192 bytes:

//DataInputStream in = new DataInputStream(new FileInputStream(file));
//DataInputStream in = new DataInputStream(
//    new BufferedInputStream(new FileInputStream(file)));
DataInputStream in = new DataInputStream(
    new BufferedInputStream(new FileInputStream(file), 8192));
while ( (line = in.readLine()) != null)
{
  doSomethingWith(line);
}
in.close();

Short lines                  1.1.8  1.2.2  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Unbuffered input stream      1551%  1808%  1965%         1764%  1872%         2088%       2646%
Buffered input stream         132%   100%    48%           60%    48%           36%        373%
8K buffered input stream      136%    96%    36%           57%    48%           36%        361%

Long lines                   1.1.8  1.2.2  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Unbuffered input stream      1655%  1992%  2169%         1895%  1895%         2201%       3385%
Buffered input stream         123%   100%    65%           59%    57%           37%        487%
8K buffered input stream      123%    99%    64%           59%    56%           37%        484%

The variations are large, but there is a mostly consistent pattern. The 8K buffer doesn't seem to be significantly better than the default.

Let's get back to the fact that we are using a deprecated method, readLine( ). You should really be using Readers instead of InputStreams, according to the Javadoc, for full portability, etc. Let's move to Readers and ascertain what this change costs us:

//DataInputStream in = new DataInputStream(new FileInputStream(file));
//DataInputStream in = new DataInputStream(
//    new BufferedInputStream(new FileInputStream(file)));
//DataInputStream in = new DataInputStream(
//    new BufferedInputStream(new FileInputStream(file), 8192));
BufferedReader in = new BufferedReader(new FileReader(file));
while ( (line = in.readLine()) != null)
{
  doSomethingWith(line);
}
in.close();

Short lines                  1.1.8  1.2.2  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Buffered input stream         132%   100%    48%           60%    48%           36%        373%
8K buffered input stream      136%    96%    36%           57%    48%           36%        361%
Buffered reader               192%    96%    56%           24%    60%           24%        590%

Long lines                   1.1.8  1.2.2  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Buffered input stream         123%   100%    65%           59%    57%           37%        487%
8K buffered input stream      123%    99%    64%           59%    56%           37%        484%
Buffered reader                53%    43%    43%           20%    52%           24%        582%

These results tell us that someone at Sun spent time optimizing Readers. You can reasonably use Readers in most situations where you would have used an InputStream. Some situations can show a performance decrease, but generally there is a performance increase. Note that if you are running your own versions of these tests, you need to repeat some measurements within the VM, even in plain JIT VMs, to eliminate the JIT compiler overhead.
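The in-VM repetition mentioned above can be sketched like this (my illustration): run the test once untimed so the JIT can compile the hot methods, then time a later run and report only that figure:

```java
// Sketch of discarding the JIT warmup run (an assumption about the
// methodology, not the book's code): the first run pays the JIT
// compilation cost, so only a subsequent run is measured.
public class JitWarmup {
    static long timedSecondRun(Runnable task) {
        task.run();                                 // warmup run, discarded
        long start = System.currentTimeMillis();
        task.run();                                 // measured run
        return System.currentTimeMillis() - start;
    }
}
```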

Now let's get down to some real tuning. So far we have just been working from bad coding to good working practice. The final version so far uses buffered Reader classes for I/O, as recommended by Sun. Can we do better? Well of course, but now let's get down and dirty. You know from general tuning practices that creating objects is overhead you should try to avoid. Up until now, we have used the readLine( ) method, which returns a string. Suppose you work on that string and then discard it, as is the typical situation. You would do better to avoid the String creation altogether. Also, if you want to process the String, then for performance purposes you are better off working directly on the underlying char array. Working on char arrays is quicker since you can avoid the String method overhead (or, more likely, the need to copy the String into a char array buffer to work on it). See Chapter 5 for more details on this technique.

Basically, this means that you need to implement the readLine( ) functionality with your own buffer while passing the buffer to the method that does the string processing. The following implementation uses its own char array buffer. It reads in characters to fill the buffer, then runs through the buffer looking for ends of lines. Each time the end of a line is found, the buffer, together with the start and end index of the line in that buffer, is passed to the doSomething( ) method for processing. This implementation avoids both the String-creation overhead and the subsequent String-processing overhead, but these are not included in any timings here. The only complication comes when you reach the end of the buffer and you need to fill it with the next chunk from the file, but you also need to retain the line fragment from the end of the last chunk. It is unlikely your 8192-char chunk will end exactly on an end of line, so there are almost always some characters left to be carried over to the next chunk. To handle this, simply copy the characters to the beginning of the buffer and read the next chunk into the buffer starting from after those characters. The commented code looks like this:

public static void myReader(String string)
    throws IOException
  {
    //Do the processing myself, directly from a FileReader.
    //But don't create strings for each line, just leave it
    //as a char array
    FileReader in = new FileReader(string);
    int defaultBufferSize = 8192;
    int nextChar = 0;
    char[] buffer = new char[defaultBufferSize];

    char c;
    int leftover;
    int length_read;
    int startLineIdx = 0;

    //First fill the buffer once before we start
    int nChars = in.read(buffer, 0, defaultBufferSize);
    boolean checkFirstOfChunk = false;

    for(;;)
    {
      //Work through the buffer looking for end-of-line characters.
      //Note that the JDK does the eol search as follows:
      //it hardcodes both of the characters \r and \n as end-of-line
      //characters, and considers either to signify the
      //end of the line. In addition, if the end-of-line character
      //is determined to be \r, and the next character is \n,
      //it winds past the \n. This way it allows the reading of
      //lines from files written on any of the three systems
      //currently supported (Unix with \n, Windows with \r\n,
      //and Mac with \r), even if you are not running on any of these.
      for (; nextChar < nChars; nextChar++)
      {
        if (((c = buffer[nextChar]) == '\n') || (c == '\r'))
        {
          //We found a line, so pass it for processing
          doSomethingWith(buffer, startLineIdx, nextChar-1);

          //And then increment the cursors. nextChar is
          //automatically incremented by the loop,
          //so we only need to worry if 'c' is \r
          if (c == '\r')
          {
            //need to consider if we are at end of buffer
            if (nextChar == (nChars - 1) )
              checkFirstOfChunk = true;
            else if (buffer[nextChar+1] == '\n')
              nextChar++;
          }
          startLineIdx = nextChar + 1;
        }
      }

      leftover = 0;
      if (startLineIdx < nChars)
      {
        //We have some characters left over at the end of the chunk,
        //so carry them over to the beginning of the next chunk.
        leftover = nChars - startLineIdx;
        System.arraycopy(buffer, startLineIdx, buffer, 0, leftover);
      }
      do
      {
        length_read = in.read(buffer, leftover,
              buffer.length-leftover );
      } while (length_read == 0);
      if (length_read > 0)
      {
        nextChar -= nChars;
        nChars = leftover + length_read;
        startLineIdx = nextChar;
        if (checkFirstOfChunk)
        {
          checkFirstOfChunk = false;
          if (buffer[0] == '\n')
          {
            nextChar++;
            startLineIdx = nextChar;
          }
        }
      }
      else
      { /* EOF */
        in.close();
        return;
      }
    }
  }
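The doSomethingWith( ) callback receives the shared buffer plus the start and end indices (inclusive) of one line. A trivial callback of my own devising, which gathers line statistics without ever creating a String, might look like:

```java
// Hypothetical callback for myReader() (my example, not the book's):
// it works directly on the shared char buffer, so no String is created.
public class LineStats {
    static int lineCount = 0;
    static long charCount = 0;

    // startIdx..endIdx (inclusive) delimit one line within buffer
    static void doSomethingWith(char[] buffer, int startIdx, int endIdx) {
        lineCount++;
        charCount += endIdx - startIdx + 1;
    }
}
```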

The following chart shows the new times:

Short lines                  1.1.8  1.2.2  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Buffered input stream         132%   100%    48%           60%    48%           36%        373%
Buffered reader               192%    96%    56%           24%    60%           24%        590%
Custom-built reader            28%    24%    36%          164%    34%           24%        420%

Long lines                   1.1.8  1.2.2  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Buffered input stream         123%   100%    65%           59%    57%           37%        487%
Buffered reader                53%    43%    43%           20%    52%           24%        582%
Custom-built reader            28%    28%    38%           37%    37%           22%        547%

The timings are the best so far, with the single exception of the 1.3.1 server mode VM, and most times are significantly better than before.[7] The 1.3.1 server mode results are quite peculiar and look as if a compiler bug of some sort caused inefficient code to be generated. If you were running this code under a 1.3.1 server VM, you would need to track down what produced the anomalous times by changing the code a little at a time until you produced a workaround.

[7] Note that the HotSpot timings are, once again, for the second run of the repeated tests. No other VMs exhibited consistent variations between the first and second run tests. See Table 8-1 and Table 8-2 for the full set of results.

You can try one more thing: performing the byte-to-char conversion yourself. The code comes from Chapter 7, where we looked at this conversion in detail. The changes are straightforward. Change the FileReader to a FileInputStream and add a byte array buffer of the same size as the char array buffer:

//    FileReader in = new FileReader(string);
//this last line becomes
    FileInputStream in = new FileInputStream(string);
    int defaultBufferSize = 8192;
    //and add the byte array buffer
    byte[] byte_buffer = new byte[defaultBufferSize];

You also need to change the read( ) calls to read into the byte buffer, adding a convert( ) call after these. The first read( ) is changed like this:

//First fill the buffer once before we start
//  this next line becomes a byte read followed by a convert() call
//  int nChars = in.read(buffer, 0, defaultBufferSize);
    int nChars = in.read(byte_buffer, 0, defaultBufferSize);
    convert(byte_buffer, 0, nChars, buffer, 0, nChars, MAP3);

The second read( ) in the main loop is also changed, but the conversion isn't done immediately here. It's done just after the number of characters, nChars, is set, a few lines later:

//      length_read = in.read(buffer, leftover,
//              buffer.length-leftover );
//becomes
        length_read = in.read(byte_buffer, leftover,
              buffer.length-leftover);
      } while (length_read == 0);
      if (length_read > 0)
      {
        nextChar -= nChars;
        nChars = leftover + length_read;
        startLineIdx = nextChar;
        //And add the conversion here
        convert(byte_buffer, leftover, nChars, buffer,
              leftover, nChars, MAP3);
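The convert( ) method and its MAP3 table are defined in Chapter 7 and are not reproduced here. As a sketch of the idea for the hardcoded 8859_1 case (the signature and body below are my reconstruction, not the book's actual Chapter 7 code): each byte indexes a 256-entry char table, which for ISO 8859_1 is simply the identity mapping.

```java
// Reconstruction sketch of a map-driven byte-to-char convert() for the
// 8859_1 case (an assumption; the book's convert() and MAP3 live in
// Chapter 7). For ISO 8859_1 every byte value maps to the char with
// the same value.
public class ByteToChar {
    static final char[] MAP3 = new char[256];
    static {
        for (int i = 0; i < 256; i++)
            MAP3[i] = (char) i;   // identity mapping for 8859_1
    }

    static void convert(byte[] input, int byteStart, int byteEnd,
                        char[] output, int charStart, int charEnd,
                        char[] map) {
        for (int i = byteStart, j = charStart; i < byteEnd && j < charEnd; i++, j++)
            output[j] = map[input[i] & 0xFF];   // zero-extend, then map
    }
}
```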

Measuring the performance with these changes, the times are now significantly better in almost every case:

Short lines                  1.1.8  1.2.2  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Buffered input stream         132%   100%    48%           60%    48%           36%        373%
Custom-built reader            28%    24%    36%          164%    34%           24%        420%
Custom reader and converter    12%    12%    20%           12%    20%           12%        120%

Long lines                   1.1.8  1.2.2  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Buffered input stream         123%   100%    65%           59%    57%           37%        487%
Custom-built reader            28%    28%    38%           37%    37%           22%        547%
Custom reader and converter     9%    10%    19%           29%    17%           14%        125%

Once again, all the VMs produce their best time except for the 1.3.1 server mode long-line case (which was faster with the BufferedReader).[8] All the times are now under one second, even on a slow machine. Subsecond times are notoriously variable, although in my tests the results were fairly consistent.

[8] This shows that HotSpot is quite variable with its optimizations. HotSpot sometimes makes an unoptimized loop faster, and sometimes the manually unrolled loop comes out faster. Tables 8-1 and 8-2 show HotSpot producing both faster and slower times for the same manually unrolled loop, depending on the data being processed (i.e., short lines or long lines).

We have, however, hardcoded in the ISO 8859_1 type of byte-to-char conversion rather than supporting the generic case (where the conversion type is specified as a property). But this conversion represents a common class of character-encoding conversions, and you could fall back on the method used in the previous test where the conversion is specified differently (in the System property file.encoding). Often, you will read from files you know and whose format you understand and can predict. In those cases, building in the appropriate encoding is not a problem.
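When you cannot predict the file's encoding, the portable route is to let InputStreamReader do the byte-to-char conversion, naming the charset explicitly rather than relying on the file.encoding property. A small helper of my own devising (the class and method names here are hypothetical) might look like:

```java
import java.io.*;

// Hypothetical helper (my sketch): open a file for line reading with an
// explicitly named character encoding, letting InputStreamReader handle
// the byte-to-char conversion instead of hardcoding 8859_1 or relying
// on the file.encoding system property.
public class EncodedRead {
    public static BufferedReader open(String filename, String encoding)
            throws IOException {
        return new BufferedReader(
            new InputStreamReader(new FileInputStream(filename), encoding));
    }
}
```

This trades the speed of the hardcoded converter for correctness across encodings you did not anticipate.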

Using a buffered reader is adequate for most purposes. But we have seen that it is possible to speed up I/O even further if you're willing to spend the effort. Avoiding the creation of intermediate Strings gives you a good gain. This is true for both reading and writing and allows you to work on the char arrays directly. Working directly on char arrays is usually better for performance, but also more work. In specialized cases, you might want to consider taking control of every aspect of the I/O right down to the byte-to-char encoding, but for this you need to consider how to maintain compatibility with the JDK.

Table 8-1 and Table 8-2 summarize all the results from these experiments.

Table 8-1. Timings of the long-line tests normalized to the JDK 1.2.2 buffered input stream test

Long lines                   1.1.8  1.2.2  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Unbuffered input stream      1655%  1992%  2169%         1895%  1895%         2201%       3385%
Buffered input stream         123%   100%    65%           59%    57%           37%        487%
8K buffered input stream      123%    99%    64%           59%    56%           37%        484%
Buffered reader                53%    43%    43%           20%    52%           24%        582%
Custom-built reader            28%    28%    38%           37%    37%           22%        547%
Custom reader and converter     9%    10%    19%           29%    17%           14%        125%

Table 8-2. Timings of the short-line tests normalized to the JDK 1.2.2 buffered input stream test

Short lines                  1.1.8  1.2.2  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Unbuffered input stream      1551%  1808%  1965%         1764%  1872%         2088%       2646%
Buffered input stream         132%   100%    48%           60%    48%           36%        373%
8K buffered input stream      136%    96%    36%           57%    48%           36%        361%
Buffered reader               192%    96%    56%           24%    60%           24%        590%
Custom-built reader            28%    24%    36%          164%    34%           24%        420%
Custom reader and converter    12%    12%    20%           12%    20%           12%        120%