10.4 Synchronization Overhead

There are two separate costs of synchronization. First, there is the operational cost of managing the monitors. This overhead can be significant: acquiring and testing for locks on the monitor for every synchronized method and block can impose a lot of overhead. Attempting to acquire a lock must itself be a synchronized activity within the VM; otherwise, two threads can simultaneously execute the lock-acquisition code. This overhead can be reduced by clever techniques in the VM, but never completely eliminated. The next section addresses this overhead and looks at ways to avoid it whenever possible.

Attempts to lock on different objects in two threads must still be synchronized to ensure that the object identity check and granting of the lock are handled atomically. This means that even attempting to get a lock on any object by two or more threads at the same time can still cause a performance degradation, as the VM grants only one thread at a time access to the lock-acquisition routine.

In some VMs, synchronizing static methods takes significantly longer than synchronizing nonstatic methods, suggesting that code is global in these VMs for the static synchronizations. (This is not strictly speaking a bug, but certainly not optimal for performance.)

The second cost of synchronization is in what it actually does. Synchronization serializes execution of a set of statements so that only one thread at a time executes that set. Whenever multiple threads simultaneously try to execute the same synchronized block, those threads are effectively run together as one single thread. This completely negates the purpose of having multiple threads and is potentially a huge bottleneck in any program. On machines with multiple CPUs, you can leave all but one CPU idle when serialized execution occurs. The later section "Avoiding Serialized Execution" addresses techniques for avoiding serialized execution where possible.

10.4.1 Desynchronization and Synchronized Wrappers

As we just noted, synchronized methods have performance costs. In fact, for short methods, using a synchronized method can mean that the basic time involved in calling the method is significantly larger than the time for actually running it. The overhead of calling an unsynchronized method can be much smaller than that of calling a synchronized method.

You should be aware of when you do not need to synchronize. Read-only objects never need synchronization. Stateless objects (including no-static state) almost never need synchronization on their methods. (There are certain unusual implementations when methods may be altering state directly in another shared object, where synchronization would be required.) Some objects with state may have no need for synchronization because access to the object is highly restricted, and the synchronization is handled by other objects. Some objects can implement a copy-on-write mechanism (StringBuffer uses this; see Chapter 5). You can define copy-on-write in such a way to allow multithreaded updates of that object.

Many multithreaded applications actually use most of their objects in a single-threaded manner. Each individual thread maintains its own references to most objects, with just a few data or utility objects actually being used by multiple threads. From a performance standpoint, it seems a shame to have the overhead of synchronized objects on many classes where synchronization is not needed or used. On the other hand, when you design and build a particular class, it is seldom possible to anticipate that it will never be shared among several threads, so to be on the safe side, typically the class is built with synchronization.

When you have identified a bottleneck that uses synchronized objects, you can sometimes remove synchronization on those objects by giving different threads their own unsynchronized copies of those objects. This is especially easy to achieve when you use objects that have an unsynchronized implementation held in a synchronized wrapper.

The idea behind synchronized wrappers is that you build your class completely unsynchronized, as if it is to be used single-threaded. But you also provide a wrapper class with exactly the same interface. The difference in the wrapper class is that all methods that require synchronization are defined with the synchronized modifier. The wrapper could be a subclass with methods reimplemented, but more typically, it is a separate class that holds an internal reference to an instance of the unsynchronized class and wraps all the methods to call that internal object. Using synchronized wrappers allows you the benefits of thread-safe objects by default, while still retaining the capability to selectively use unsynchronized versions of those classes in bottlenecks.

From Java 2, the framework of using synchronized wrappers has become standard. All the new collection classes in java.util are now defined unsynchronized, with synchronized wrappers available. Old collection classes (e.g., Hashtable, Vector) that are already synchronized remain so for backward compatibility. The wrappers are usually generic, so you can actually create wrapped synchronized objects from any object of the right type.

I include a short example of the synchronized-wrapper framework here for clarity. If class UnsyncedAdder is defined as follows:

public interface Adder {
  public void add(int aNumber);
}
   
public class UnsyncedAdder
  implements Adder
{
  int total;
  int numAdditions;
  public void add(int aNumber) {total += aNumber; numAdditions++;}
}

Then the synchronized wrapper for this class can be:

public class SyncedAdder
  implements Adder
{
  Adder adder;
  public SyncedAdder(Adder a) {adder = a;}
  public synchronized void add(int aNumber) { adder.add(aNumber);}
}

Obviously, you refer to Adder objects in your code; don't refer explicitly to concrete implementations of Adder classes (such as UnsyncedAdder and SyncedAdder) except in the constructor or factory classes. Note that the synchronized wrapper is completely generic. It wraps any implementation of Adder, providing synchronization on the add( ) method irrespective of the underlying concrete implementation of the Adder class.

Using unsynchronized classes gives a performance advantage, but it is a maintenance drawback. There is every likelihood that initial implementation of any application will use the unsynchronized classes by default, leading to many subtle threading bugs that can be a debugging and maintenance nightmare. Typical development scenarios then try to identify which objects need to be synchronized for the application, and then wrap those objects in their synchronized wrappers.

Under the stress of project milestones, I know of one project where the developers went through all their code with a recursive routine, chopping out every synchronized keyword in method declarations. This seemed quicker than carefully tuning the code, and did in fact give a performance improvement. They put a few synchronized keywords back in after the regression tests. This type of tuning is exactly the opposite of what I recommend.

Instead, you should use synchronized wrapped objects throughout the application by default, but ensure that you have the capability to easily replace these with the unsynchronized underlying objects. (Remember, tuning is better done after the application works correctly, not at the beginning.) When you come to tune the application, identify the bottlenecks. Then, when you find that a particular class needs to be speeded up, determine whether that class can be used unsynchronized. If so, replace it with its unsynchronized underlying object, and document this thoroughly. Any changes in the application must reexamine these particular tuning changes to ensure that these objects do not subsequently need to become synchronized.^[1]

^[1] When the design indicates that a class or a set of methods should definitely be synchronized or definitely does not need synchronization, then of course you should apply that design decision. For example, stateless objects can often be specified with no synchronization. However, there are many classes and methods where this decision is uncertain, and this is where my recommendation applies.

Be aware, though, that there is no win-win situation here. If you tend toward unsynchronized classes by default, you leave your application open to corruption. If you prefer my recommended "synchronized by default" approach, your application has an increased chance of encountering deadlocks. On the basis that deadlocks are both more obvious and easier to fix than corrupt objects, I prefer the "synchronized by default" option. Implementing with interfaces and synchronized wrappers gives you an easy way to selectively back out of synchronization problems.

The next test gives you an idea of the relative performance of synchronized and unsynchronized methods, and of synchronized wrappers. The test compares synchronized (Vector ), unsynchronized (ArrayList), and synchronized wrapper (ArrayList wrapped) classes.

package tuning.threads;
   
import java.util.*;
   
public class ListTesting
{
  public static final int CAPACITY = 100000;
  public static void main(String args[  ])
  {
    //In order to isolate the effects of synchronization, we make sure
    //that the garbage collector doesn't interfere with the test. So
    //we use a bunch of pre-allocated, pre-sized collections, and
    //populate those collections with pre-existing objects. No objects
    //will be created or released during the timing phase of the tests.
    List[  ] l = {new Vector(CAPACITY), new Vector(CAPACITY),
      new Vector(CAPACITY), new ArrayList(CAPACITY),
      new ArrayList(CAPACITY), new ArrayList(CAPACITY),
      Collections.synchronizedList(new ArrayList(CAPACITY)),
      Collections.synchronizedList(new ArrayList(CAPACITY)),
      Collections.synchronizedList(new ArrayList(CAPACITY))};
    int REPEAT = (args.length > 0) ? Integer.parseInt(args[0]) : 100;
   
    //Vary the order.
    test(l[0], REPEAT, "Vector");
    test(l[6], REPEAT, "sync ArrayList" );
    test(l[3], REPEAT, "ArrayList");
    test(l[1], REPEAT, "Vector");
    test(l[4], REPEAT, "ArrayList");
    test(l[7], REPEAT, "sync ArrayList" );
    test(l[2], REPEAT, "Vector");
    test(l[5], REPEAT, "ArrayList");
    test(l[8], REPEAT, "sync ArrayList" );
  }
   
  public static void test(List l, int REPEAT, String ltype)
  {
    //need to initialize for set(  ) to work. Don't measure this time
    for (int j = 0; j < CAPACITY; j++)
      l.add(Boolean.FALSE);
   
    long time = System.currentTimeMillis(  );
    //The test sets elements repeatedly. The set methods are
    //very similar. Apart from synchronization, the Vector.set(  )
    //is slightly more efficient than the ArrayList.set(  ), which
    //is in turn more efficient than the wrapped ArrayList because
    //there is one extra layer of method calls for the wrapped object.
    for (int i = REPEAT; i > 0; i--)
      for (int j = 0; j < CAPACITY; j++)
        l.set(j, Boolean.TRUE);
    System.out.println(ltype + " took " +
        (System.currentTimeMillis(  )-time));
  }
}

The normalized results from running this test are shown in Table 10-3.

Table 10-3. Timings of the various array-manipulation tests, normalized to the JDK 1.2 Vector test
	1.2.2	1.3.1_02	1.3.1_02-server	1.4.0	1.4.0-server	1.4.0-Xint
Vector	100%	16%	15%	20%	12%	134%
ArrayList	15%	14%	9%	14%	13%^[2]	166%
Wrapped ArrayList	131%	23%	18%	26%	19%	231%

^[2] Note that the Vector.set( ) method implementation is slightly more efficient (faster) than the ArrayList.set( ) implementation, so if there were no effect from the synchronization, the Vector test could be slightly faster than the ArrayList test.

There are some reports that the latest VMs have negligible overhead for synchronized methods; however, my own tests show that synchronized methods continue to incur some overhead (VMs up to and including JDK 1.4). The 1.4 server-mode test is the only VM that shows negligible overhead from synchronized methods. This comes from server mode's aggressive inlining together with being able to analyze the requirement for acquiring the lock. In this case, the test is fairly simple, and it looks like the 1.4 server mode is able to establish that the lock does not need acquiring on each pass of the loop and to correctly optimize the situation. In more complex real-world situations, server mode is not always able to optimize away the lock acquisition so well. On the other hand, I shouldn't underplay the fact that the latest 1.3 and 1.4 VMs all do very well in minimizing the synchronization overhead (especially the 1.4 server mode), so much so that synchronization overhead should not be an issue for most applications.

10.4.2 Avoiding Serialized Execution

One way of completely avoiding the requirement to synchronize methods is to use separate objects and storage structures for different threads. Care must be taken to avoid calling synchronized methods from your own methods, or you will lose all your carefully built benefits. For example, Hashtable access and update methods are synchronized, so using one in your storage structure can eliminate any desired benefit. Prior to JDK 1.2, there is no unsynchronized hash table in the JDK, and you have to build or buy your own unsynchronized version. From JDK 1.2, unsynchronized collection classes are available, including Map classes.

As an example of implementing this framework, I look at a simple set of global counters, keyed on a numeric identifier. Basically, the concept is a global counter to which any thread can add a number. This concept is extended slightly to allow for multiple counters, each counter having a different key. String keys are more useful, but for simplicity I use integer keys in this example. To use String keys, an unsynchronized Map replaces the arrays.

The simple, straightforward version of the class looks like this:

package tuning.threads;
   
public class Counter1
{
  //For simplicity make just 10 counters
  static long[  ] vec = new long[10]; 
   
  public static void initialize(int key)
  {
    vec[key] = 0;
  }
   
  //And also just make key the index into the array
  public static void addAmount(int key, long amount)
  {
    //This is not atomically synchronized since we do an array
    //access together with an update, which are two operations.
    vec[key] += amount;
  }
   
  public static long getAmount(int key)
  {
    return vec[key];
  }
}

This class is basic and easy to understand. Unfortunately, it is not thread-safe and leads to corrupt counter values when used. A test run on a particular single-processor system with four threads running simultaneously, each adding the number 1 to the same key 10 million times, gives a final counter value of around 26 million instead of the correct 40 million.^[3] On the positive side, the test is blazingly fast, taking very little time to complete and get the wrong answer.

^[3] The results discussed are for one particular test run. On other test runs, the final value is different, but it is almost never the correct value (40 million). If I use a faster CPU or a lower total count, the threads can get serialized by the operating system (by finishing quickly enough), leading to consistently correct results for the total count. But those correct results are an artifact of the environment and are not guaranteed. Other system loads and environments generate corrupt values.

To get the correct behavior, you need to synchronize the update methods in the class. Here is Counter2, which is just Counter1 with the methods synchronized:

package tuning.threads;
   
public class Counter2
{
  //For simplicity make just 10 counters
  static long[  ] vec = new long[10]; 
   
  public static synchronized void initialize(int key)
  {
    vec[key] = 0;
  }
   
  //And also make the just make key the index into the array
  public static synchronized void addAmount(int key, long amount)
  {
    //Now the method is synchronized, so we will always
    //complete any particular update
    vec[key] += amount;
  }
  public static synchronized long getAmount(int key)
  {
    return vec[key];
  }
}

Now you get the correct answer of 40 million. Unfortunately, the test takes 20 times longer to execute (see Table 10-4). Avoiding the synchronization is going to be more work. To do this, create a set of counters, one for each thread, and update each thread's counter separately.^[4] When you want to see the global total, you need to sum the counters across the threads.

^[4] ThreadLocal variables might be appropriate here, but not in JDK 1.2, which used an underlying implementation of a synchronized map to allocate pre-thread objects. That implementation would defeat our intention to avoid synchronization completely. JDK 1.3 uses an instance variable in the Thread object to hold an unsynchronized map and would work.

The class definition follows:

package tuning.threads;
   
public class Counter3
{
  //support up to 10 threads of 10 counters
  static long vec[  ][  ] = new long[10][  ];
   
  public static synchronized void initialize(CounterTest t)
  {
    //For simplicity make just 10 counters per thread
    vec[t.num] = new long[10];
  }
   
  public static void addAmount(int key, long amount)
  {
    //Use our own threads to make the mapping easier,
    //and to illustrate the technique of customizing threads.
    //For generic Thread objects, could use an unsynchronized 
    //HashMap or other Map,
    //Or use ThreadLocal if JDK 1.2 is available
   
    //We use the num instance variable of the CounterTest
    //object to determine which array we are going to increment.
    //Since each thread is different, here is no conflict.
    //Each thread updates its own counter.
    long[  ] arr =  vec[((CounterTest) Thread.currentThread(  )).num];
    arr[key] += amount;
  }
  public static synchronized long getAmount(int key)
  {
    //The current amount must be aggregated across the thread
    //storage arrays. This needs to be synchronized, but
    //does not matter here as I just call it at the end.
    long amount = 0;
    for (int threadnum = vec.length-1; threadnum >= 0 ; threadnum--)
    {
      long[  ] arr = vec[threadnum];
      if (arr != null)
        amount += arr[key];
    }
    return amount;
  }
}

Using Counter3, you get the correct answer for the global counter, and the test is quicker than Counter2. The relative timings for a range of VMs are listed in Table 10-4.

Table 10-4. Timings of the various counter tests, normalized to the JDK 1.2 Counter2 test
	1.1.8	1.2.2	1.3.1_02	1.3.1_02-server	1.4.0	1.4.0-server	1.4.0-Xint
Counter2	24.8%	100%	124%	130%	123%	131%	131%
Counter3	1.5%	1.7%	2.2%	2%	2.4%	0.01%	5.7%
Counter1 (incorrect result)	0.2%	0.2%	0.2%	0.01%	0.1%	0.01%	1.9%

The serialized execution avoidance class is a significant improvement on the synchronized case. The Counter2 timings can be extremely variable. This variation is generated from the nature of multithreaded context switching, together with the fact that the activity taking much of the time in this test is lock management. Switching is essentially unpredictable, and the amount of switching and where it occurs affects how often the VM has to release and reacquire locks in different threads. Nevertheless, across a number of measurements, Counter3 was always faster than Counter2, normally orders of magnitude faster.

The listed times were measured on a single-processor machine. Consider what happens on a multiprocessor machine where the threads can run on different CPUs (i.e., where the Java runtime and operating system support preemptive thread scheduling on separate CPUs). Counter3 (the serialized execution avoidance class) is parallelized automatically and scales very nicely. This same test with Counter3 running on a four-CPU machine tends towards one-quarter of the single-CPU time, assuming that the four CPUs have the same power as the single CPU we tested earlier. On the other hand, the synchronized version of the counter, Counter2, always has serialized execution (that's what synchronized does). Consequently, it does not scale and generally performs no better than in the single-CPU test (except for the advantage of running the OS on another CPU).