5.1 The Performance Effects of Strings

Let's first look at the advantages of the String implementation:

  • Compilation creates unique strings. At compile time, strings are resolved as far as possible. This includes applying the concatenation operator and converting other literals to strings. So "hi7" and ("hi"+7) both get resolved at compile time to the same string, and are identical objects in the class string pool (see Section ...). Compilers differ in their ability to achieve this resolution. You can always check your compiler (e.g., by decompiling some statements involving concatenation) and switch to a different compiler if needed.

  • Because String objects are immutable, a substring operation doesn't need to copy the entire underlying sequence of characters. Instead, a substring can use the same char array as the original string and simply refer to a different start point and endpoint in the char array. This means that substring operations are efficient in both time and memory: the extra object is just a wrapper on the same underlying char array with different pointers into that array.[1]

    [1] Strings are implemented in the JDK as an internal char array with index offsets (actually a start offset and a character count). This basic structure is extremely unlikely to be changed in any version of Java.

  • Strings have strong support for internationalization. It would take a large effort to reproduce the internationalization support for an alternative class.

  • The close relationship with StringBuffers allows Strings to reference the same char array used by the StringBuffer. This is a double-edged sword. For typical practice, when you use a StringBuffer to manipulate and append characters and data types, and then convert the final result to a String, this works just fine. The StringBuffer provides efficient mechanisms for growing, inserting, appending, altering, and other types of String manipulation. The resulting String then efficiently references the same char array with no extra character copying. This is very fast and reduces the number of objects being used to a minimum by avoiding intermediate objects. However, if the StringBuffer object is subsequently altered, the char array in that StringBuffer is copied into a new char array that is now referenced by the StringBuffer. The String object retains the reference to the previously shared char array. This means that copying overhead can occur at unexpected points in the application. Instead of the copying occurring at the toString( ) method call, as might be expected, any subsequent alteration of the StringBuffer causes a new char array to be created and an array copy to be performed. To make the copying overhead occur at predictable times, you could explicitly execute some method that makes the copying occur, such as StringBuffer.setLength( ). This allows StringBuffers to be reused with more predictable performance.
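
The compile-time resolution described in the first advantage above can be checked directly with a reference comparison. The following is a minimal sketch; the class and method names are hypothetical and chosen only for illustration. It relies on the language rule that compile-time constant expressions resolve to the same pooled String, whereas a concatenation involving a non-constant value is performed at run time and yields a new object.

```java
// Hypothetical demo class; all names here are illustrative only.
final class ConstantStringDemo {

    // "hi" + 7 is a compile-time constant expression, so the compiler
    // folds it to "hi7" and both references point at the pooled object.
    static boolean literalAndFoldedConcatAreSameObject() {
        String a = "hi7";
        String b = "hi" + 7;
        return a == b; // reference comparison, intentionally not equals()
    }

    // 'n' is a method parameter, not a compile-time constant, so this
    // concatenation happens at run time and creates a new String object.
    static boolean runtimeConcatIsSameObject(int n) {
        String a = "hi7";
        String b = "hi" + n;
        return a == b;
    }

    public static void main(String[] args) {
        System.out.println(literalAndFoldedConcatAreSameObject()); // prints true
        System.out.println(runtimeConcatIsSameObject(7));          // prints false
    }
}
```

If the first call ever printed false, that would suggest the compiler in use is not resolving the concatenation at compile time, which is the situation the text recommends checking for.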

The disadvantages of the String implementation are:

  • Not being able to subclass String means that it is not possible to add behavior to String for your own needs.

  • The previous point means that all access must be through the restricted set of currently available String methods, imposing extra overhead.

  • The only way to increase the number of methods allowing efficient manipulation of String characters is to copy the characters into your own array and manipulate them directly, in which case String is imposing an extra step and extra objects you may not need.

  • char arrays are faster to process directly.

  • The tight coupling with StringBuffer can lead to unexpectedly high memory usage. When StringBuffer.toString( ) creates a String, the current underlying array holds the string, regardless of the size of the array (i.e., the capacity of the StringBuffer). For example, a StringBuffer with a capacity of 10,000 characters can build a string of 10 characters. However, that 10-character String continues to use a 10,000-char array to store the 10 characters. If the StringBuffer is now reused to create another 10-character string, the StringBuffer first creates a new internal 10,000-char array to build the string with; then the new String also uses that 10,000-char array to store the 10 characters. Obviously, this process can continue indefinitely, using vast amounts of memory where not expected.
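
The gap between a StringBuffer's capacity and the length of the string it holds, which is the source of the memory problem in the last point above, can be observed with the public capacity( ) and length( ) methods. This is a minimal sketch with hypothetical class and method names. It only reports the two sizes; whether the String returned by toString( ) actually shares the oversized array depends on the JDK implementation, so the sketch makes no claim about that.

```java
// Hypothetical demo class; all names here are illustrative only.
final class BufferCapacityDemo {

    // Reuses the given buffer to build 'text', then reports
    // { capacity, length } so the unused space is visible.
    static int[] capacityAndLength(StringBuffer sb, String text) {
        sb.setLength(0);   // discard any previous contents, keep the array
        sb.append(text);
        return new int[] { sb.capacity(), sb.length() };
    }

    public static void main(String[] args) {
        // Deliberately oversized buffer, as in the example in the text.
        StringBuffer sb = new StringBuffer(10000);

        int[] first = capacityAndLength(sb, "0123456789");
        System.out.println("capacity=" + first[0] + ", length=" + first[1]);
        // prints capacity=10000, length=10

        // Reusing the buffer for another short string leaves the
        // capacity unchanged, so the same imbalance recurs.
        int[] second = capacityAndLength(sb, "abcdefghij");
        System.out.println("capacity=" + second[0] + ", length=" + second[1]);
        // prints capacity=10000, length=10
    }
}
```

One way to limit the effect, under the assumption that the final size is roughly known in advance, is to construct the StringBuffer with a capacity close to the expected string length rather than a generous upper bound.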

The advantages of Strings can be summed up as ease of use, internationalization support, and compatibility with existing interfaces. Most methods expect a String object rather than a char array, and String objects are returned by many methods. The disadvantage of Strings boils down to inflexibility. With extra work, most things you can do with String objects can be done faster and with less intermediate object-creation overhead by using your own set of char array manipulation methods.
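
As an illustration of what "your own set of char array manipulation methods" might look like, here is a minimal sketch of a word counter that scans a char array directly and creates no intermediate String objects. The class and method names are hypothetical, and the sketch treats only space, tab, and newline as separators, which is an assumption that suits language-independent character processing rather than internationalized text.

```java
// Hypothetical helper class; all names here are illustrative only.
final class CharArrayDemo {

    // Counts runs of non-separator characters in a single pass over
    // the array, without creating substrings or other temporary objects.
    static int countWords(char[] chars) {
        int words = 0;
        boolean inWord = false;
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            boolean separator = (c == ' ' || c == '\t' || c == '\n');
            if (separator) {
                inWord = false;
            } else if (!inWord) {
                inWord = true; // first character of a new word
                words++;
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // The one conversion step: copy the String's characters out once.
        char[] chars = "the quick  brown fox".toCharArray();
        System.out.println(countWords(chars)); // prints 4
    }
}
```

Whether this is actually faster than the equivalent String-based code depends on the JVM and the data, so any such replacement should be measured at the bottleneck before being adopted.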

For most performance tuning, you pinpoint a bottleneck and make localized changes to objects and methods that speed up that bottleneck. But String tuning often involves converting to char arrays, whereas you rarely come across public methods or interfaces that deal in char arrays. This makes it difficult to switch between Strings and char arrays in any localized way. As a consequence, you either have to switch back and forth between Strings and char arrays, or you have to make extensive modifications that can reach across many application boundaries. I have no easy solution for this problem. String tuning can get messy.

Sun recognizes that Strings are not the optimal solution in many cases and has added a CharSequence interface in JDK 1.4 that String and other classes implement. New methods have been added that operate on CharSequence objects rather than requiring Strings. For example, the regular expression classes accept CharSequence objects. This doesn't necessarily help your particular bottleneck, and CharSequences still access the char elements through a charAt( ) method, but it does at least increase the options available for optimizing applications.
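
To show how CharSequence widens the options, the following minimal sketch declares its methods against CharSequence so that a String, a StringBuffer, or a char array wrapped in a java.nio.CharBuffer can all be passed without first being converted to a String. The class, method, and field names are hypothetical; Pattern.matcher( ) and CharBuffer.wrap( ) are the standard JDK 1.4 calls.

```java
import java.nio.CharBuffer;
import java.util.regex.Pattern;

// Hypothetical demo class; all names here are illustrative only.
final class CharSequenceDemo {

    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Accepts any CharSequence; characters are still read via charAt().
    static int countDigits(CharSequence seq) {
        int count = 0;
        for (int i = 0; i < seq.length(); i++) {
            char c = seq.charAt(i);
            if (c >= '0' && c <= '9') {
                count++;
            }
        }
        return count;
    }

    // The regular expression classes take a CharSequence directly.
    static boolean isAllDigits(CharSequence seq) {
        return DIGITS.matcher(seq).matches();
    }

    public static void main(String[] args) {
        char[] raw = { '4', '2' };

        System.out.println(countDigits("a1b22"));                     // prints 3
        System.out.println(isAllDigits(new StringBuffer("12345")));   // prints true
        System.out.println(isAllDigits(CharBuffer.wrap(raw)));        // prints true
        System.out.println(isAllDigits("12a45"));                     // prints false
    }
}
```

Wrapping a char array in a CharBuffer does not copy the characters, so this is one way to hand char array data to a CharSequence-based API while keeping the array available for direct manipulation elsewhere.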

It is difficult to handle String internationalization capabilities using raw char arrays. But in many cases, internationalized Strings form a specific subset of String usage in an application, mainly in the user interface, and that subset of Strings rarely causes bottlenecks. You should differentiate between Strings that need internationalization and those that are simply processing characters, independent of language. These latter Strings can be replaced for tuning with char arrays.[2] Internationalization-dependent Strings are more difficult to tune, and I provide some examples of tuning these later in the chapter. Note also that internationalized Strings can be treated as char arrays for some types of processing without any problems; see Section 5.4.2 later in this chapter.

[2] My editor Mike Loukides summarized this succinctly with the statement, "Avoid using String objects if you don't intend to represent text."