5.4 Strings Versus char Arrays

In one of my first programming courses, in the language C, our instructor made an interesting comment. He said, "C has lightning-fast string handling because it has no string type." He went on to explain this apparent paradox by pointing out that in C, any null-terminated sequence of bytes can be considered a string: this convention is supported by all string-handling functions. The point is that because the convention is adhered to fairly rigorously, you are not restricted to the standard string-handling functions: any string manipulation you want can be executed directly on the byte array, allowing you to bypass or rewrite whichever string-handling functions you need to speed up. Because you are not forced to go through a restricted set of manipulation functions, it is always possible to optimize code using your own hand-crafted functions. Furthermore, some string-manipulating functions operate directly on the original byte array rather than creating a copy. This can be a source of bugs, but is another reason speed can be optimized.

In Java, the inability to subclass String or access its internal char array means you cannot use the techniques applied in C. Even if you could subclass String, that would not sidestep the other half of the problem: many methods elsewhere in the JDK operate on or return copies of a String. Generally, there is no way to avoid using String objects in code that interfaces with classes external to your application. But internally, you can provide your own char array type that allows you to manipulate strings according to your needs.
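To make the contrast concrete, here is a minimal sketch (my own illustration, not code from the rest of this chapter) of the kind of direct, in-place manipulation a char array permits but an immutable String never does:

public class InPlaceExample
{
  //A minimal illustration: uppercase a region of a char array in place,
  //creating no objects at all. The equivalent String operation
  //(String.toUpperCase()) must create a new String and a new char array.
  public static void toUpperCase(char[] buf, int off, int len)
  {
    for (int i = off; i < off + len; i++)
      buf[i] = Character.toUpperCase(buf[i]);
  }

  public static void main(String[] args)
  {
    char[] text = "hello world".toCharArray();
    toUpperCase(text, 0, 5);   //modifies the array directly
    System.out.println(text);  //prints "HELLO world"
  }
}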

As an example, let's look at a couple of simple text-parsing problems: first, counting the words in a body of text, and second, using a filter to select lines of a file based on whether they contain a particular string.

5.4.1 Word-Counting Example

Let's look at the typical Java approach to counting words in a text. I use the StreamTokenizer for the word count, as that class is tailor-made for this kind of problem.

The word count is fairly easy to implement. The only difficulty comes in defining what a word is and coaxing the StreamTokenizer to agree with that definition. To keep things simple, I define a word as any contiguous sequence of alphanumeric characters. This means that words with apostrophes and numbers with decimal points count as two words, but I'm more interested in the performance than the niceties of word definitions here, and I want to keep the implementation simple. The implementation looks like this:

public static void wordcount(String filename)
  throws IOException
{
  int count = 0;
  //create the tokenizer, and initialize it
  FileReader r = new FileReader(filename);
  StreamTokenizer rdr = new StreamTokenizer(r);
  rdr.resetSyntax();
  rdr.wordChars('a', 'z'); //words include any lowercase character
  rdr.wordChars('A', 'Z'); //words include any uppercase character
  rdr.wordChars('0', '9'); //words include any digit
  //everything else is whitespace
  rdr.whitespaceChars(0, '0'-1);
  rdr.whitespaceChars('9'+1, 'A'-1);
  rdr.whitespaceChars('Z'+1, 'a'-1);
  rdr.whitespaceChars('z'+1, '\uffff');
  int token;
  //loop getting each token (word) from the tokenizer
  //until we reach the end of the file
  while( (token = rdr.nextToken()) != StreamTokenizer.TT_EOF)
  {
    //if the token is a word, count it; otherwise it is whitespace
    if (token == StreamTokenizer.TT_WORD)
      count++;
  }
  System.out.println(count + " words found.");
  r.close();
}

Now, for comparison, let's implement a more efficient version using char arrays. The word-count algorithm is relatively straightforward: test for sequences of alphanumerics and skip anything else. The only slight complication comes when you refill the buffer with the next chunk from the file. You need to avoid counting one word as two if it falls across the junction of two reads into the buffer, but this turns out to be easy to handle: simply remember the last character of the previous chunk and, if it was alphanumeric, skip any alphanumeric characters at the beginning of the next chunk (i.e., continue with the word until it terminates). The implementation looks like this:

public static void cwordcount(String filename)
  throws IOException
{
  int count = 0;
  FileReader rdr = new FileReader(filename);
  //buffer to hold the characters read in
  char[] buf = new char[8192];
  int len;
  int idx = 0;
  //initialize so that our 'current' character is in whitespace
  char c = ' ';
  //read in as large a chunk as possible each time,
  //until there is nothing left to read
  while( (len = rdr.read(buf, 0, buf.length)) != -1)
  {
    idx = 0;
    int start;
    //if we are already in a word, then skip the rest of it
    if (Character.isLetterOrDigit(c))
      while( (idx < len) && Character.isLetterOrDigit(buf[idx]) )
        {idx++;}
    while(idx < len)
    {
      //skip non-alphanumeric characters
      while( (idx < len) && !Character.isLetterOrDigit(buf[idx]) )
        {idx++;}
      //skip the word
      start = idx;
      while( (idx < len) && Character.isLetterOrDigit(buf[idx]) )
        {idx++;}
      if (start < len)
      {
        count++; //count the word
      }
    }
    //remember the last character so we know whether to carry on a word
    c = buf[idx-1];
  }
  System.out.println(count + " words found.");
  rdr.close();
}

You can compare this implementation with the one using the StreamTokenizer. All tests use the same large text file for counting the words. The time taken by the StreamTokenizer implementation running in JDK 1.2.2 with the JIT compiler is normalized to 100% (see Table 5-6). Interestingly, the StreamTokenizer test takes a similar amount of time when run without the JIT compiler.

Table 5-6. Word counter timings using wordcount or cwordcount methods

VM           1.1.8   1.2.2   1.3.1   1.3.1 -server   1.4.0   1.4.0 -server   1.2 no JIT
wordcount    153%    100%    232%    230%            11%     9.3%            171%
cwordcount   1.7%    1.6%    2.3%    1.9%            2.3%    1.9%            33%

These results are quite curious. I suspect the huge differences have something to do with StreamTokenizer being a severely underoptimized class, as well as being too generic a tool for this particular test.

Looking at object usage,[4] you find that the StreamTokenizer implementation winds through over 1.5 million temporary objects, whereas the char array implementation uses only around 20 objects. Now the results are more explicable. Object-creation differences of this order of magnitude impose a huge overhead on the StreamTokenizer implementation, explaining why it is so much slower than the char array implementation. The object-creation overhead also explains why the JIT and non-JIT tests took similar times for the StreamTokenizer: object creation requires similar amounts of time in the pre-1.4 VMs, and the performance of the StreamTokenizer is clearly limited by the number of objects it uses (see Chapter 4 for further details). The times also show that the VMs are getting much better at reducing object-creation and garbage-collection overhead. The 1.4 test has the advantage of the latest object-creation/garbage-collection techniques, in addition to more optimized byte-to-char conversion from the 1.4 java.nio package.

[4] Object monitoring is easily done using the monitoring tools from Chapter 2, both the object-creation monitor and the -verbosegc option with an explicit System.gc() at the end of the test.
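For example, a test wrapper along these lines would do (a sketch only; the class name is mine, and either word-count method can be substituted):

import java.io.IOException;

//A sketch of a test wrapper. Run with "java -verbosegc WordCountTest <file>"
//to see the garbage-collection activity generated by the test.
public class WordCountTest
{
  public static void main(String[] args) throws IOException
  {
    long start = System.currentTimeMillis();
    //call the implementation under test here, e.g.
    //wordcount(args[0]); or cwordcount(args[0]);
    long elapsed = System.currentTimeMillis() - start;
    System.out.println("time: " + elapsed + "ms");
    //force a final collection so the -verbosegc output
    //accounts for all the objects the test left behind
    System.gc();
  }
}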

5.4.2 Line Filter Example

For the example of a filter to select lines of a file, I'll use the simple BufferedReader.readLine() method. This contrasts with the previous methodology using a dedicated class (StreamTokenizer), which turned out to be extremely inefficient. The readLine() method should present more of a performance-tuning challenge, as it is a much simpler method and so should be more efficient. I'll also add case-independence to the filtering, i.e., a line will be selected even if the case of its characters does not exactly match the case of the characters in the filter.

The filter using BufferedReader and Strings is easily implemented. The search phrase is uppercased at the beginning, and each line of text is uppercased so that lines are selected independently of case. I include an option to print only the count of matching lines. The only slightly complex tweak is that I want to avoid any dependence on I/O in the timings, as this is not an I/O test, so I map the file contents into memory and use a CharArrayReader rather than a FileReader:

public static void filter1(String filter, String filename, boolean print)
    throws IOException
  {
    count = 0;
//    BufferedReader rdr = new BufferedReader(new FileReader(filename));
    BufferedReader rdr = new BufferedReader(new CharArrayReader(buf));
    String line;
    String ufilter = filter.toUpperCase();
    while( (line = rdr.readLine()) != null)
    {
      if (line.toUpperCase().indexOf(ufilter) != -1)
      {
        count++;
        if (print)
          System.out.println(line);
      }
    }
    System.out.println(count + " lines matched.");
    rdr.close();
  }

Clearly it is not optimal to generate an extra string for every line, as toUpperCase() does. String doesn't provide any simple case-independent search alternatives, though regionMatches() can be used by testing iteratively through each line. For completeness, we'll measure that solution too:

public static void filter2(String filter, String filename, boolean print)
    throws IOException
  {
    count = 0;
//    BufferedReader rdr = new BufferedReader(new FileReader(filename));
    BufferedReader rdr = new BufferedReader(new CharArrayReader(buf));
    String line;
    int filterLength = filter.length();
    while( (line = rdr.readLine()) != null)
    {
      for(int i = 0; i < line.length(); i++)
        if (line.regionMatches(true, i, filter, 0, filterLength))
        {
          count++;
          if (print)
            System.out.println(line);
          break;
        }
    }
    System.out.println(count + " lines matched.");
    rdr.close();
  }

Now let's consider how to handle this filter using char arrays. As in the previous example, you read chunks into your char array. However, this example is a bit more complicated than the word-count example. Here you need to test for a match against another char array, look for line endings, and handle reforming lines that are broken between read() calls in a more complete manner than for the word count.

Internationalization doesn't change this example in any obvious way. Both the readLine() implementation and the char array implementation stay the same whatever language the text contains.

This statement about internationalization is slightly disingenuous. In fact, searches in some languages allow words to match even if they are spelled differently. For example, when searching for a French word that contains an accented letter, the user might expect a nonaccented spelling to match. This is similar to searching for the word "color" and expecting to match the British spelling "colour."

Such sophistication depends on how extensively the application supports this variation in spelling. The java.text.Collator class has four "strength" levels that support variations in the precision of word comparisons. Both implementations for the example in this section correspond to matches using the Collator.IDENTICAL strength together with the Collator.NO_DECOMPOSITION mode.
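To give a flavor of what those strength levels mean, here is a small sketch (the words are just examples) showing how a Collator's strength changes which strings compare as equal:

import java.text.Collator;
import java.util.Locale;

//A small illustration of Collator strength levels. At PRIMARY strength
//only base letters are significant, so accent differences are ignored;
//at IDENTICAL strength the character sequences must match exactly.
public class CollatorExample
{
  public static void main(String[] args)
  {
    Collator c = Collator.getInstance(Locale.FRENCH);

    c.setStrength(Collator.PRIMARY);
    System.out.println(c.equals("côte", "cote"));  //true: accent ignored

    c.setStrength(Collator.IDENTICAL);
    System.out.println(c.equals("côte", "cote"));  //false: must be identical
  }
}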

The full commented listing for the char array implementation is shown shortly. Looking at the code, it is clearly more complicated than using BufferedReader.readLine(). Obviously, you have to work a lot harder to get the performance you want. The result, though, is that some tests run as much as five times faster using the char array implementation (see Table 5-7 and Table 5-8). The line lengths of the test files make a big difference, hence the variation in results. In addition, the char array implementation creates only 1% as many objects as the BufferedReader.readLine() implementation.

Table 5-7. Filter timings using filter or cfilter method on a short-line file

VM                        1.1.8   1.2.2   1.3.1   1.3.1 -server   1.4.0   1.4.0 -server   1.4.0 -Xint
filter1 (uppercased)      168%    100%    49%     40%             49%     31%             626%
filter2 (regionMatches)   140%    88%     53%     48%             60%     49%             1137%
cfilter (proprietary)     17%     17%     19%     17%             20%     15%             337%

Table 5-8. Filter timings using filter or cfilter method on a long-line file

VM                        1.1.8   1.2.2   1.3.1   1.3.1 -server   1.4.0   1.4.0 -server   1.4.0 -Xint
filter1 (uppercased)      184%    100%    64%     54%             71%     47%             829%
filter2 (regionMatches)   183%    138%    113%    107%            124%    103%            2332%
cfilter (proprietary)     33%     33%     37%     28%             38%     27%             633%

I have implemented a relatively straightforward class for the char array parsing. If you look in more detail at what you are doing, you can apply further optimizations and make the routine even faster (see, for example, Chapter 7 and Chapter 8).

Tuning like this takes effort, but you can see that it is possible to use char arrays to very good effect for most types of String manipulation. If you are an object purist, you may want to encapsulate the char array access. Otherwise, you may be content to expose external access through static methods. In any case, it is worth investing some time and effort in creating a usable char-handling class. Usually this creation is a single, up-front effort. If the classes are well constructed, you can use them consistently within your applications, and this effort pays off handsomely when it comes to tuning (or, occasionally, the lack of a need to tune).
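As a flavor of what such a class might provide, here is a minimal sketch (the class and method names are mine, and this is not the MatchReader class listed below): a case-insensitive search over a char array that creates no temporary objects at all, unlike the line.toUpperCase().indexOf() approach used in filter1:

//A sketch of a reusable char-handling utility method: find the first
//case-insensitive occurrence of match within buf[off..off+len),
//returning its index or -1. No temporary objects are created.
public class CharUtils
{
  public static int indexOfIgnoreCase(char[] buf, int off, int len,
                                      char[] match)
  {
    int last = off + len - match.length;
    outer:
    for (int i = off; i <= last; i++)
    {
      for (int j = 0; j < match.length; j++)
        if (Character.toUpperCase(buf[i+j]) !=
            Character.toUpperCase(match[j]))
          continue outer;
      return i;
    }
    return -1;
  }
}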

Here is the commented char array implementation that executes a line-by-line string-matching filter on a file:

/** Note that this implementation may have problems with
    text lines longer than 8192 characters.
*/
class MatchReader
{
  
  public static void filter(String filter, String filename, boolean print)
    throws IOException
  {
//    MatchReader rdr = new MatchReader(new FileReader(filename), filter);
    MatchReader rdr = new MatchReader(new CharArrayReader(FilterComparison.buf), filter);
    rdr.filter(print);
  }

  static final int BUFFER_SIZE = 8192;
  char[] buffer;
  int bufferSize;
  int bufferPos;
  char[] matchString;
  Reader reader;
  Writer sysout;

  public MatchReader(Reader rdr, String match)
    throws IOException
  {
    reader = rdr;
    matchString = new char[match.length()];
    match.toUpperCase().getChars(0, match.length(), matchString, 0);
    buffer = new char[BUFFER_SIZE];
    bufferSize = 0;
    bufferPos = 0;
    sysout = new OutputStreamWriter(System.out);
    fillBuffer();
  }
  
  /* return true if more characters were read, otherwise false */
  private boolean fillBuffer()
    throws IOException
  {
    int len;
    boolean added = false;
    while(bufferSize < buffer.length)
    {
      len = reader.read(buffer, bufferSize, buffer.length-bufferSize);
      if (len <= 0)
        return added;
      else
        bufferSize += len;
      added = true;
    }
    return added;
  }

  public void filter(boolean print)
    throws IOException
  {
    int count = 0;
    while( nextMatchedLine() )
    {
      count++;
      if (print)
        printCurrentLine();
      else
        nextLine();
    }
    System.out.println(count + " lines matched.");
    close();
  }
  
  public void close()
    throws IOException
  {
    buffer = null;
    matchString = null;
    reader.close();
  }

  /* Return true if we matched a line,
   * false if there were no more matches
   */
  public boolean nextMatchedLine()
    throws IOException
  {
    while(!scrollToNextMatchInCurrentBuffer())
    {
      if (!refillBuffer())
      {
        //No more characters to read, just make sure
        //that no more lines are left in the buffer
        return scrollToNextMatchInCurrentBuffer();
      }
    }
    return true;
  }

  private boolean scrollToNextMatchInCurrentBuffer()
  {
    //Simple linear search
    //No need to try to match beyond the end of the buffer
    int highIdx = bufferSize-matchString.length;
    for (; bufferPos <= highIdx; bufferPos++)
    {
      if (matches())
        return true;
    }
    return false;
  }

  private boolean matches()
  {
    //Assume that this is only called if the match
    //characters can fit into the remaining buffer
    for(int j = 0; j < matchString.length; j++)
      if(Character.toUpperCase(buffer[bufferPos+j]) != matchString[j])
        return false;
    return true;
  }
  
  private boolean refillBuffer()
    throws IOException
  {
    return refillBuffer(bufferSize - 1);
  }

  private boolean refillBuffer(int lastIdx)
    throws IOException
  {
    //Find the start of the last line in the buffer,
    //move that to the start of the buffer,
    //then append some more to the buffer.
    while( (lastIdx > 0) && (buffer[lastIdx] != '\n') && (buffer[lastIdx] != '\r') )
      lastIdx--;
    if ( (buffer[lastIdx] == '\n') || (buffer[lastIdx] == '\r') )
    {
      //Found the most recent newline character
      bufferSize -= lastIdx+1;
      System.arraycopy(buffer, lastIdx+1, buffer, 0, bufferSize);
      bufferPos = 0; //be safe
      return fillBuffer();
    }
    else
    {
      //reached the beginning of the buffer and we still don't have a newline
      return fillBuffer();
    }
  }
  
  /* Scroll to just after the next newline character */
  public void nextLine()
    throws IOException
  {
    while(!scrollToNextLineInCurrentBuffer())
    {
      if (!refillBuffer())
      {
        //No more characters to read, just make sure
        //that no more lines are left in the buffer
        scrollToNextLineInCurrentBuffer();
        return; //avoid looping forever on a final line with no newline
      }
    }
  }

  private boolean scrollToNextLineInCurrentBuffer()
  {
    //Simple linear search
    //No need to try to match beyond the end of the buffer
    int highIdx = bufferSize-1;
    for (; bufferPos <= highIdx; bufferPos++)
    {
      if ( (buffer[bufferPos] == '\n') || (buffer[bufferPos] == '\r') )
      {
        bufferPos++;
        return true;
      }
    }
    return false;
  }

  private void printCurrentLine()
    throws IOException
  {
    //Move the start of the current line back to beginning of
    //the buffer, fill it up, and find the next line
    refillBuffer(bufferPos);
    if (scrollToNextLineInCurrentBuffer())
      sysout.write(buffer, 0, bufferPos-1); //exclude the newline character
    else
      sysout.write(buffer, 0, bufferPos);   //final line with no newline
    sysout.write(FilterComparison.NewLine);
    sysout.flush();
  }
}

The individual methods listed here are fairly basic. As with the JDK methods, I assume a line termination is indicated by a newline or return character. Otherwise, the main effort comes in writing efficient array-matching methods. In this example, I did not try hard to look for the very best array-matching algorithms. Instead, I used straightforward algorithms for clarity, since these are fast enough for the example. There are many sources describing more sophisticated array-matching algorithms; for example, the University of Rouen in France has a nice site listing "Exact String Matching Algorithms" at http://www-igm.univ-mlv.fr/~lecroq/string/.
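As an indication of what those algorithms look like, here is a hedged sketch of the Horspool variant of Boyer-Moore search adapted to char arrays (not the implementation used in the timings above). It skips ahead by more than one character on most mismatches, which can significantly reduce comparisons for longer patterns:

//A sketch of Boyer-Moore-Horspool search over char arrays. The skip
//table is indexed by the low 8 bits of each character; collisions only
//make skips shorter (more conservative), never incorrect.
public class HorspoolSearch
{
  public static int indexOf(char[] text, char[] pat)
  {
    int m = pat.length;
    int n = text.length;
    if (m == 0)
      return 0;
    //bad-character skip table: how far the pattern may shift when
    //the text character under its last position mismatches
    int[] skip = new int[256];
    java.util.Arrays.fill(skip, m);
    for (int i = 0; i < m-1; i++)
      skip[pat[i] & 0xFF] = m-1-i;
    int pos = 0;
    while (pos <= n-m)
    {
      int j = m-1;
      while (j >= 0 && text[pos+j] == pat[j])
        j--;
      if (j < 0)
        return pos; //full match found at pos
      pos += skip[text[pos+m-1] & 0xFF];
    }
    return -1;
  }
}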

5.4.3 Line Filtering with Regular Expressions

JDK 1.4 includes native support for regular expressions in one of the core packages, java.util.regex. String methods were also added as shortcuts to using the regular-expression objects. For example, the simplest of the new methods is String.matches(String regex), which returns true if the string matches the regular-expression argument.

Regular expressions are a pattern-matching language that provides a powerful mechanism to determine if sequences of characters contain particular patterns and to extract those patterns. Almost every type of parsing is much easier and more flexible using regular expressions. For more details about regular expressions, see Mastering Regular Expressions by Jeffrey Friedl (O'Reilly & Associates).

However, be aware that the methods in the String class are adequate for one-off uses of regular expressions but are inefficient for repeated application of a regular expression. The String methods both compile (into a java.util.regex.Pattern object) and apply the regular expression on each execution, whereas it is more efficient to compile a regular expression once and apply it repeatedly using the same object. For example, this statement:

boolean result = string.matches(regex);

executes the equivalent of:

Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(string);
boolean result = m.matches();

If the regular expression is to be reapplied to multiple strings, the efficient solution calls the Pattern.compile() method only once, but that option is not available if you use the shortcut String.matches(String regex) method.
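For example, a sketch of the compile-once approach applied to several strings (the data here is purely illustrative):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

//A sketch showing a Pattern compiled once and reused. The Matcher can
//also be reused by resetting it with each new input string.
public class PatternReuse
{
  public static void main(String[] args)
  {
    String[] lines = {"alpha", "beta", "alphabet"}; //illustrative data
    Pattern p = Pattern.compile("alpha.*");         //compile once
    Matcher m = p.matcher("");
    for (int i = 0; i < lines.length; i++)
    {
      m.reset(lines[i]);                            //reuse for each string
      System.out.println(lines[i] + ": " + m.matches());
    }
  }
}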

The line-filtering example in the previous section is a fairly simple problem and doesn't need the full power of regular expressions, but since we have already seen the equivalent functionality in alternative implementations, it is worth looking at the cost of using regular expressions to handle the filtering. The method required is straightforward, but I'll walk through it in case you are unfamiliar with regular expressions.

First, the regular-expression pattern needs to be compiled into a Pattern object. The pattern itself needs to be wrapped with some characters to indicate that we are searching one line at a time: the ^ character denotes the beginning of a line, and the $ character denotes the end of a line. The .* pattern simply indicates that anything can match between the beginning and the end of the line as long as the filter string is included. In addition, the Pattern object needs to know that we are searching line-by-line, as it also supports searching text while treating line endings simply as any other characters, so we use the MULTILINE flag. We also need the CASE_INSENSITIVE flag to make the match case-insensitive. Next, we will use a java.nio.CharBuffer to wrap the characters, a convenient mechanism to present the text to the Matcher. We could use a String, but if we were actually reading from a file the most efficient mechanism would be to use a CharBuffer on a FileChannel, so I'll stay with the CharBuffer.

Finally, we simply loop, repeatedly matching the regular expression against the text using the Matcher.find() method. The Matcher.group() method provides us with the previously matched line of text if we need it for printing:

public static void filter3(String filter, String filename, boolean print)
    throws IOException
  {
    count = 0;
    Pattern p = Pattern.compile("^.*" + filter + ".*$",
                  Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
    CharBuffer cbuff = CharBuffer.wrap(buf);
    Matcher m = p.matcher(cbuff);
    while( m.find() )
    {
      count++;
      if (print)
        System.out.println(m.group());
    }
    System.out.println(count + " lines matched.");
  }

The results of testing this method along with the previous filtering methods are shown in Table 5-9. For our simple line-filtering problem, using regular expressions is slower than the other implementations, but not hugely slower. This shows that the regular-expression implementation in JDK 1.4 is pretty efficient, given how much more it is doing (and can do) compared with the other methods. It certainly looks like you can use regular expressions with some confidence that the implementation is reasonably efficient.

Table 5-9. Filter timings using various filter methods in JDK 1.4

VM                            1.4.0   1.4.0 -server   1.4.0 -Xint
filter1 (uppercased)          100%    65%             1320%
filter2 (regionMatches)       128%    104%            2390%
cfilter (proprietary)         42%     31%             707%
filter3 (regular expression)  143%    88%             3666%