3.7 Faster VMs

VM runtimes and Java compilers vary enormously over time and across vendors. More and more optimizations are finding their way into both VMs and compilers. Many possible compiler optimizations are considered in later sections of this chapter. In this section I focus on VM optimizations.

3.7.1 VM Speed Variations

Different VMs have different running characteristics. Some VMs are intended purely for development and are highly suboptimal in terms of performance. These VMs may have huge inefficiencies, even in such basic operations as casting between different numeric types, as was the case with one development VM I used. It provided the foundation of an excellent development environment (actually my preferred environment) but was all but useless for performance testing. Any data type manipulation other than with ints or booleans produced highly varying and misleading times.

It is important to run any tests involving timing or profiling in the same VM that will run the application. You should test your application in the current "standard" VMs if your target environment is not fully defined.
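One way to keep timings tied to the VM that produced them is to print the VM's identity alongside every measurement. The following is a minimal sketch, assuming Java 5 or later (for System.nanoTime). The class name VmInfo and the casting workload are hypothetical, chosen only to echo the numeric-cast example above; the system properties java.vm.name, java.vm.version, and java.vm.vendor are standard.

```java
final class VmInfo {
    /** Describes the running VM using standard system properties. */
    static String describeVm() {
        return System.getProperty("java.vm.name") + " "
             + System.getProperty("java.vm.version") + " ("
             + System.getProperty("java.vm.vendor") + ")";
    }

    /** Hypothetical workload: repeated casts between numeric types. */
    static long castWorkload(int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            double d = (double) i;      // int to double
            total += (long) (d / 2);    // double to long (truncates)
        }
        return total;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long result = castWorkload(10_000_000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // Report the VM together with the timing so results are never mixed up.
        System.out.println(describeVm() + ": " + elapsedMs
                + " ms (result " + result + ")");
    }
}
```

Running the same class under each candidate VM gives lines that can be compared directly, with no doubt about which VM produced which figure.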

There is, of course, nothing much you can do about speeding up any one VM (apart from heap tuning or upgrading the CPUs). But you should be aware of the different VMs available, whether or not you control the deployment environment of your application. If you control the target environment, you can choose your VM appropriately. If you do not control the environment on which your application runs, remember that performance is partly user expectation. If you tell your user that VM "A" gives such and such a performance for your application, but VM "B" gives much slower performance, then you at least inform your user community of the implications of their choice of VM. This might also pressure vendors with slower VMs to improve them.

3.7.2 VMs with JIT Compilers

The basic bytecode interpreter VM executes by decoding and executing bytecode. This is slow and is pure overhead, adding nothing to the functionality of the application. A just-in-time (JIT) compiler in a virtual machine eliminates much of this overhead by doing the bytecode fetch and decode just once. The first time a method is loaded, the decoded instructions are converted into machine code native to the CPU the system is running on. After that, future invocations of that method no longer incur the interpreter overhead. However, a JIT must be fast at compiling to avoid slowing the runtime, so extensive optimizations within the compile phase are unlikely. This means that the compiled code is often not as fast as it could be. A JIT also imposes a significantly larger memory footprint on the process.

Without a JIT, you might have to optimize your bytecodes for a particular platform. Optimizing the bytecode for one platform can conceivably make that code run slower on another platform (though a speedup is usually reflected to some extent on all platforms). A JIT compiler can theoretically optimize the same code differently for different underlying CPUs, thus getting the best of all worlds.

In his tests,[4] Mark Roulo found that a good JIT reduced the overhead of method calls from a best of 280 CPU clock cycles in the fastest non-JIT VM to just 2 clock cycles in the JIT VM. In a direct comparison of method call times for this JIT VM against a compiled C++ program, the Java method call time was found to be just one clock cycle slower than C++: fast enough for almost any application. However, object creation is not sped up by anywhere near this amount, which means that with a JIT VM, object creation is relatively more expensive (and consequently more important when tuning) than with a non-JIT VM.

[4] "Accelerate Your Java Apps," JavaWorld, September 1998, http://www.javaworld.com/javaworld/jw-09-1998/jw-09-speed.html.
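The relative cost of method calls and object creation can be observed with a rough timing sketch like the one below. This is not Roulo's benchmark; the class name CallVersusCreate and the loop sizes are hypothetical, and micro-benchmarks of this kind are easily distorted by the JIT itself. On VMs that accept an interpreter-only switch (for example, -Xint on Sun-derived VMs, which I assume here), running the class once with and once without that switch gives a feel for the JIT's effect on each operation.

```java
final class CallVersusCreate {
    private int counter;

    /** Trivial instance method; the loop below measures the cost of calling it. */
    int increment() {
        return ++counter;
    }

    /** Nanoseconds taken to make n calls to increment(). */
    static long timeMethodCalls(int n) {
        CallVersusCreate target = new CallVersusCreate();
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            target.increment();
        }
        long elapsed = System.nanoTime() - start;
        // Check the result so the loop has an observable effect.
        if (target.counter != n) {
            throw new IllegalStateException("unexpected count");
        }
        return elapsed;
    }

    /** Nanoseconds taken to create n small objects. */
    static long timeObjectCreation(int n) {
        // Keep references in an array so the allocations are harder to optimize away.
        Object[] keep = new Object[1024];
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            keep[i & 1023] = new Object();
        }
        long elapsed = System.nanoTime() - start;
        if (n > 0 && keep[0] == null) {
            throw new IllegalStateException("no object stored");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int n = 5_000_000;
        System.out.println("method calls:    " + timeMethodCalls(n) + " ns");
        System.out.println("object creation: " + timeObjectCreation(n) + " ns");
    }
}
```

The absolute numbers mean little; what matters is how the ratio between the two lines changes between the interpreted and the JIT-compiled run.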

3.7.3 VM Startup Time

The time your application takes to start depends on a number of factors. First, there is the time taken by the operating system to start the executable process. This time is mostly independent of the VM, though the size of the executable and the size and number of shared libraries needed to start the VM process have some effect. But the main time cost is mapping the various elements into system memory. This time can be shortened by having as much as possible already in system memory. The most obvious way to have the shared libraries already in system memory is to have recently started a VM. If the VM was recently started, even for a short time, the operating system is likely to have cached the shared libraries in system memory, so the next startup is quicker. A better but more complicated way of having the executable elements in memory is to have the relevant files mapped onto a memory-resident filesystem; see Section 14.1.3 for more on how to manage this. Yet another option is to use a prestarted VM; see Section 3.5, earlier in this chapter. A prestarted VM also partially addresses the startup overhead discussed in the next paragraph.

The second component in the startup time of the VM is the time taken to manage the VM runtime initializations. This is purely dependent on the VM system implementation. Interpreter VMs generally have faster startup times than JIT VMs because the JIT VMs need to manage extra compilations during the startup and initial classloading phases. Starting with SDK 1.3, Sun tried to improve VM startup time. VMs are now already differentiated by their startup times; for example, the 1.3 VM has a deliberately shortened startup time compared to 1.2. HotSpot has the more leisurely startup time acceptable for long-running server processes. In the future you can expect to see VMs differentiated by their startup times even more.
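The VM's own initialization time can be approximated from inside the application. The sketch below assumes a VM that provides the java.lang.management API (Java 5 or later); the class name StartupProbe is hypothetical. Reading the uptime as the first statement of main gives roughly the time the VM spent before handing control to application code. It excludes operating system work done before the VM recorded its start time, and it includes the small cost of loading the management classes themselves.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;

final class StartupProbe {
    /** Milliseconds since the VM started, as reported by the VM itself. */
    static long millisSinceVmStart() {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        return runtime.getUptime();
    }

    public static void main(String[] args) {
        // Taken before any application work: approximates VM initialization time.
        long beforeApplicationCode = millisSinceVmStart();
        System.out.println("VM initialization took roughly "
                + beforeApplicationCode + " ms before main() ran");
    }
}
```

Comparing this figure across VMs, or across settings of one VM, is a simple way to see the startup differences described above.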

Finally, the application architecture and class file configuration determine the last component of startup time. The application may require many classes and extensive initializations before the application is started, or it may be designed to start up as quickly as possible. It is useful to bear in mind the user perception of application startup when designing the application. For example, if you can create the startup window as quickly as possible and then run any initializations in the background without blocking windowing activity, the user sees this as a faster startup than if you waited for initializations to finish before creating the initial window. This design takes more work but improves the startup time the user perceives.
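The window-first design can be sketched as follows, assuming Java 8 or later and Swing. The class name FastStartup, the 200 ms sleep standing in for real initialization, and the "Ready" status text are all hypothetical. The window is shown immediately, the slow work runs on a background thread, and the result is handed back to the event thread when it is done.

```java
import java.awt.GraphicsEnvironment;
import java.util.concurrent.CompletableFuture;
import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.SwingUtilities;

final class FastStartup {
    /** Stand-in for slow application initialization (hypothetical). */
    static String initialize() {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "Ready";
    }

    /** Starts initialization on a background thread and returns immediately. */
    static CompletableFuture<String> initializeInBackground() {
        return CompletableFuture.supplyAsync(FastStartup::initialize);
    }

    public static void main(String[] args) {
        if (GraphicsEnvironment.isHeadless()) {
            // No display available: just run the initialization and report.
            System.out.println(initializeInBackground().join());
            return;
        }
        SwingUtilities.invokeLater(() -> {
            JLabel status = new JLabel("Loading...");
            JFrame frame = new JFrame("Startup sketch");
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.add(status);
            frame.setSize(300, 100);
            frame.setVisible(true);   // the user sees a window right away
            // Initialize off the event thread, then update the label on it.
            initializeInBackground().thenAccept(result ->
                    SwingUtilities.invokeLater(() -> status.setText(result)));
        });
    }
}
```

The total work is unchanged; only its ordering differs, which is why the gain is in what the user perceives and not in the time to full readiness.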

The number of classes that need to be loaded before the application starts is part of the application initializations, and again the application design affects this time. In Section 3.12, later in this chapter, I discuss the effects of class file configuration on startup time. Section 13.3 also includes an example of designing an application to minimize startup time.
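To see how many classes an initialization sequence pulls in, the count of loaded classes can be sampled before and after it. This is a minimal sketch assuming the java.lang.management API (Java 5 or later); the class name LoadedClassCount and the stand-in initialization are hypothetical. Many VMs also accept a -verbose:class switch that lists each class as it loads, which I assume is available on your VM.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class LoadedClassCount {
    /** Number of classes currently loaded in this VM. */
    static int loadedClasses() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    /** Stand-in for application initialization (hypothetical). */
    static List<String> initializeApplication() {
        List<String> settings = new ArrayList<>();
        settings.add("example");
        return settings;
    }

    public static void main(String[] args) {
        int before = loadedClasses();
        initializeApplication();
        int after = loadedClasses();
        System.out.println("classes loaded before initialization: " + before);
        System.out.println("classes loaded during initialization: " + (after - before));
    }
}
```

The difference is only approximate, since the VM may load or unload classes for its own purposes between the two samples.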

3.7.4 Other VM Optimizations

On the VM side, improvements are possible using JIT compilers to compile methods to machine code, using algorithms for code caching, applying intelligent analyses of runtime code, etc. Some bytecodes allow the system to bypass table lookups that would otherwise need to be executed. But these bytecodes take extra effort to apply to the VM. Using these techniques, an intelligent VM could skip some runtime steps after parts of the application have been resolved.

Generally, a VM with a JIT compiler gives a huge boost to a Java application and is probably the quickest and simplest way to improve performance. The most optimistic predictions say that using optimizing compilers to generate bytecodes, together with VMs with intelligent JIT (re)compilers, puts Java performance on a par with or even above that of an equivalent natively compiled C++ application. Theoretically, better performance is possible. Having a runtime environment adapt to the running characteristics of a program should, in theory at least, provide better performance than a statically compiled application. A similar argument runs in CPU design circles where dynamic rescheduling of instructions to take pipelining into account allows CPUs to process instructions out of order. But at the time of writing this book, we are not particularly close to proving this theory for the average Java application. The time available for a VM to do something other than the most basic execution and bytecode translation is limited. The following quote about dynamic scheduling in CPUs also applies to adaptive VMs:

At runtime, the CPU knows almost everything, but it knows everything almost too late to do anything about it. (Tom R. Halfhill quoting Gerrit A. Slavenburg, "Inside IA-64," Byte, June 1998)

As an example of an "intelligent" VM, Sun's HotSpot VM is targeted precisely to this area of adaptive optimization. This VM includes some basic improvements (all of which are also present in VMs from other vendors) such as using direct pointers instead of Java handles (which may be a security issue),[5] improved thread synchronization, a generational garbage collector, speedups to object allocation, and an improved startup time (by not JIT-compiling methods initially). In addition to these basic improvements, HotSpot includes adaptive optimization, which works as follows: HotSpot runs the application initially in interpreted mode (as if there is no JIT compiler present) while a profiler identifies the bottlenecks in the application. Then, an optimizing JIT compiler compiles into native machine code only those hotspots in the application that are causing the bottlenecks. Because only a small part of the application is targeted, the JIT compiler (which might in this case be more realistically called an "after-a-while" compiler rather than a "just-in-time" compiler) can spend extra time compiling those targeted parts of the application, thus allowing more than the most basic compiler optimizations to be applied.

[5] A handle is a pointer to a pointer. Java uses handles to ensure security so that one object cannot gain direct access to another object without the security capabilities of Java being able to intervene.
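The idea of interpreting first and compiling only what proves to be hot can be illustrated with a toy model. This is not how HotSpot is implemented; it is a sketch under the assumption that a method is flagged as a hotspot simply by counting its invocations. The class name AdaptiveMethod, the two function objects standing in for interpreted and compiled code, and the threshold are all hypothetical.

```java
import java.util.function.IntUnaryOperator;

/** Toy model: a method that stays "interpreted" until it has been called often enough. */
final class AdaptiveMethod {
    private final IntUnaryOperator slowPath;  // stands in for interpreted bytecode
    private final IntUnaryOperator fastPath;  // stands in for compiled machine code
    private final int hotThreshold;           // hypothetical call count that marks a hotspot
    private int invocations;

    AdaptiveMethod(IntUnaryOperator slowPath, IntUnaryOperator fastPath, int hotThreshold) {
        this.slowPath = slowPath;
        this.fastPath = fastPath;
        this.hotThreshold = hotThreshold;
    }

    int invoke(int arg) {
        if (invocations < hotThreshold) {
            invocations++;                    // profiling: count calls while interpreted
            return slowPath.applyAsInt(arg);
        }
        return fastPath.applyAsInt(arg);      // hot: use the optimized version from now on
    }

    /** True once the method has been called often enough to count as a hotspot. */
    boolean isHot() {
        return invocations >= hotThreshold;
    }

    public static void main(String[] args) {
        AdaptiveMethod square = new AdaptiveMethod(x -> x * x, x -> x * x, 3);
        for (int i = 0; i < 5; i++) {
            System.out.println(square.invoke(i) + " hot=" + square.isHot());
        }
    }
}
```

Methods that never reach the threshold never pay any compilation cost, which is what frees the compiler to spend more effort on the few that do.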

Consider the example where 20% of the code accounts for 80% of the running application time. Here, a classic JIT compiler might improve the whole application by 30%: the application would now take 70% of the time it originally took.

The HotSpot compiler ignores the nonbottlenecked code, instead focusing on getting the 20% of hotspot code to run twice as fast. The 80% of application time is halved to just 40% of the original time. Adding in the remaining 20% means that the application now runs in 60% of the original time. (These statistics are purely for illustration purposes.)
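The arithmetic in this illustration can be checked with a few lines of code. The sketch below uses only the figures given above (a 30% overall improvement for the classic JIT; 80% of the run time made twice as fast for HotSpot); the class and method names are hypothetical.

```java
final class HotspotArithmetic {
    /**
     * Fraction of the original run time that remains when only the hot
     * portion of the run time is sped up by the given factor.
     */
    static double remainingTime(double hotFraction, double hotSpeedup) {
        return hotFraction / hotSpeedup + (1.0 - hotFraction);
    }

    public static void main(String[] args) {
        double classicJit = 1.0 - 0.30;             // whole application 30% faster
        double hotSpot = remainingTime(0.80, 2.0);  // 80% of the time halved, 20% untouched
        System.out.printf("classic JIT: %.0f%% of original time%n", classicJit * 100);
        System.out.printf("HotSpot:     %.0f%% of original time%n", hotSpot * 100);
    }
}
```

The same method shows how sensitive the result is to the assumptions: if the hot code accounted for only 50% of the run time, doubling its speed would leave 75% of the original time.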

Note, however, that HotSpot tries too hard sometimes. For example, HotSpot can speculatively optimize on the basis of guessing the type of particular objects. If that guess turns out to be wrong, HotSpot has to deoptimize the code, which results in some wildly variable timings.
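A situation that invites this kind of type guess is a call site that sees only one concrete class for a long time and then suddenly sees another. The sketch below sets up that situation; whether a given VM actually optimizes and then deoptimizes here is an assumption, and the timings it prints will differ from VM to VM and run to run. All names (TypeGuessSketch, Shape, Square, Circle) are hypothetical.

```java
import java.util.Arrays;

final class TypeGuessSketch {
    interface Shape { double area(); }

    static final class Square implements Shape {
        private final double side;
        Square(double side) { this.side = side; }
        public double area() { return side * side; }
    }

    static final class Circle implements Shape {
        private final double radius;
        Circle(double radius) { this.radius = radius; }
        public double area() { return Math.PI * radius * radius; }
    }

    static double sink;  // accumulates results so the work is not discarded

    static double totalArea(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) {
            total += s.area();   // call site whose receiver type a VM may guess
        }
        return total;
    }

    static long timeNanos(Shape[] shapes, int repeats) {
        long start = System.nanoTime();
        for (int r = 0; r < repeats; r++) {
            sink += totalArea(shapes);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        Shape[] shapes = new Shape[10_000];
        Arrays.fill(shapes, new Square(2));
        for (int round = 0; round < 5; round++) {
            System.out.println("squares only: " + timeNanos(shapes, 1_000) + " ns");
        }
        // Introduce a second receiver type at the same call site.
        for (int i = 0; i < shapes.length; i += 2) {
            shapes[i] = new Circle(1);
        }
        for (int round = 0; round < 5; round++) {
            System.out.println("mixed types:  " + timeNanos(shapes, 1_000) + " ns");
        }
    }
}
```

If the first "mixed types" round is markedly slower than the rounds that follow it, that is consistent with the VM discarding and rebuilding optimized code, though other causes such as garbage collection cannot be ruled out from timings alone.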

So far, I have no evidence that optimizations I have applied in the past (and detailed in this book) have caused any problems after upgrading compilers and VMs. However, it is important to note that the performance characteristics of your application may change with different VMs and compilers, and not necessarily always for the better. Be on the lookout for any problems a new compiler and VM may bring. For example, the technique of loading classes explicitly from a new thread after application startup could conceivably conflict with a particular JIT VM's caching mechanism and actually slow down the startup sequence of your application. I have no evidence for this; I am just speculating on possible conflicts.