Before diving into the actual tuning, there are a number of considerations that will make your tuning phase run more smoothly and result in clearly achieved objectives.
Any application must meet the needs and expectations of its users, and a large part of those needs and expectations is performance. Before you start tuning, it is crucial to identify the target response times for as much of the system as possible. At the outset, you should agree with your users (directly if you have access to them, or otherwise through representative user profiles, market information, etc.) what the performance of the application is expected to be.
The performance should be specified for as many aspects of the system as possible, including:
Multiuser response times depending on the number of users (if applicable)
Systemwide throughput (e.g., number of transactions per minute for the system as a whole, or response times on a saturated network, again if applicable)
The maximum number of users, data, files, file sizes, objects, etc., the application supports
Any acceptable and expected degradation in performance between minimal, average, and extreme values of supported resources
Agree on target values and acceptable variances with the customer or potential users of the application (or whoever is responsible for performance) before starting to tune. Otherwise, you will not know where to target your effort, how far you need to go, whether particular performance targets are achievable at all, and how much tuning effort those targets may require. But most importantly, without agreed targets, whatever you achieve will tend to become the starting point.
The following scenario is not unusual: a manager sees horrendous performance, perhaps a function that was expected to be quick, but takes 100 seconds. His immediate response is, "Good grief, I expected this to take no more than 10 seconds." Then, after a quick round of tuning that identifies and removes a huge bottleneck, function time is down to 10 seconds. The manager's response is now, "Ah, that's more reasonable, but of course I actually meant to specify 3 seconds. I just never believed you could get down so far after seeing it take 100 seconds. Now you can start tuning." You do not want your initial achievement to go unrecognized (especially if money depends on it), and it is better to know at the outset what you need to reach. Agreeing on targets before tuning makes everything clear to everyone.
After establishing targets with the users, you need to set benchmarks. These are precise specifications stating what part of the code needs to run in what amount of time. Without first specifying benchmarks, your tuning effort is driven only by the target, "It's gotta run faster," which is a recipe for wasted effort. You must ask, "How much faster and in which parts, and for how much effort?" Your benchmarks should target a number of specific functions of the application, preferably from the user perspective (e.g., from the user pressing a button until the reply is returned or the function being executed is completed).
You must specify target times for each benchmark. You should specify ranges: for example, best times, acceptable times, etc. These times are often specified in frequencies of achieving the targets. For example, you might specify that function A take not more than 3 seconds to execute from user click to response received for 80% of executions, with another 15% of response times allowed to fall in the 3- to 5-second range, and 5% in the 5- to 10-second range. Note that the earlier section on user perceptions indicates that the user will see this function as having a 5-second response time (the 90th percentile value) if you achieve the specified ranges.
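The banded targets in this example can be checked mechanically against a list of measured times. The following is a minimal sketch, not a prescribed implementation: the class `ResponseTimeTargets` and its method names are hypothetical, times are assumed to be in milliseconds, and the percentile uses the simple nearest-rank definition.

```java
import java.util.Arrays;

/** Hypothetical helper that checks measured response times against banded targets. */
final class ResponseTimeTargets {

    /** Fraction of samples that completed at or below the given limit (milliseconds). */
    static double fractionWithin(long[] timesMillis, long limitMillis) {
        if (timesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long count = Arrays.stream(timesMillis).filter(t -> t <= limitMillis).count();
        return (double) count / timesMillis.length;
    }

    /** Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it. */
    static long percentile(long[] timesMillis, double pct) {
        long[] sorted = timesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** The bands from the text: 80% within 3 s, 95% within 5 s, and everything within 10 s. */
    static boolean meetsFunctionATargets(long[] timesMillis) {
        return fractionWithin(timesMillis, 3_000) >= 0.80
            && fractionWithin(timesMillis, 5_000) >= 0.95
            && fractionWithin(timesMillis, 10_000) >= 1.0;
    }
}
```

Calling `percentile(times, 90)` on the same samples gives the value the user is likely to perceive as the response time of the function.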
You should also have a range of benchmarks that reflect the contributions of different components of the application. If possible, it is better to start with simple tests so that the system can be understood at its basic levels, and then work up from these tests. In a complex application, this helps to determine the relative costs of subsystems and which components are most in need of performance-tuning.
The following point is critical: without clear performance objectives, tuning will never be completed. This is a common syndrome on single-developer or small-group projects, where code keeps being tweaked as better implementations or cleverer code are thought up.
Your general benchmark suite should be based on real functions used in the end application, but at the same time should not rely on user input, as this can make measurements difficult. Any variability in input times or any other part of the application should either be eliminated from the benchmarks or precisely identified and specified within the performance targets. There may be variability, but it must be controlled and reproducible.
There are tools for testing applications in various ways.[2] These tools focus mostly on testing the robustness of the application, but as long as they measure and report times, they can also be used for performance testing. However, because their focus tends to be on robustness testing, many tools interfere with the application's performance, and you may not find a tool you can use adequately or cost-effectively. If you cannot find an acceptable tool, the alternative is to build your own harness.
[2] You can search the Web for "java+perf+test" to find performance-testing tools. In addition, some Java profilers are listed in Chapter 19.
Your benchmark harness can be as simple as a class that sets some values and then starts the main( ) method of your application. A slightly more sophisticated harness might turn on logging and timestamp all output for later analysis. GUI-run applications need a more complex harness and require either an alternative way to execute the graphical functionality without going through the GUI (which may depend on whether your design can support this), or a screen event capture and playback tool (several such tools exist[3]). In any case, the most important requirement is that your harness correctly reproduce user activity and data input and output. Normally, whatever regression-testing apparatus you have (and presumably are already using) can be adapted to form a benchmark harness.
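As a rough illustration of the simplest kind of harness, the sketch below sets a value and then starts an application's main method through reflection, timing the call. Everything named here is an assumption made for the example: `DemoApp` stands in for the real application, `SimpleHarness` is a hypothetical class, and the `benchmark.mode` system property is invented.

```java
import java.lang.reflect.Method;

/** Stand-in for the real application; purely illustrative. */
final class DemoApp {
    static int runs = 0;

    public static void main(String[] args) {
        // A real application would do its work here.
        if ("true".equals(System.getProperty("benchmark.mode"))) {
            runs++;
        }
    }
}

/** Hypothetical minimal harness: sets a value, then times one run of main. */
final class SimpleHarness {

    /** Starts the application's main method and returns the elapsed time in nanoseconds. */
    static long timeMain(Class<?> app, String... args) {
        System.setProperty("benchmark.mode", "true"); // invented setting for this sketch
        try {
            Method main = app.getMethod("main", String[].class);
            long start = System.nanoTime();
            main.invoke(null, (Object) args); // cast so the array is passed as one argument
            return System.nanoTime() - start;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not run " + app.getName(), e);
        }
    }
}
```

A call such as `SimpleHarness.timeMain(DemoApp.class)` returns one timing; a more sophisticated harness would add logging and timestamps around the same call.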
[3] JDK 1.3 introduced a java.awt.Robot class, which provides for generating native system-input events, primarily to support automated testing of Java GUIs.
The benchmark harness should not test the quality or robustness of the system. Operations should be normal: startup, shutdown, and uninterrupted functionality. The harness should support the different configurations your application operates under. Any randomized inputs should be controlled: the random sequence used in tests must be reproducible. You should use a realistic amount of randomized data and input. It is helpful if the benchmark harness includes support for logging statistics and easily allows new tests to be added. The harness should be able to reproduce and simulate all user input, including GUI input, and should test the system across all scales of intended use up to the maximum numbers of users, objects, throughputs, etc. You should also validate your benchmarks, checking some of the values against actual clock time to ensure that no systematic or random bias has crept into the benchmark harness.
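One common way to keep randomized input reproducible is to drive it from a fixed seed, since `java.util.Random` produces the same sequence for the same seed. The sketch below assumes that approach; the class `ReproducibleInput` and its parameters are hypothetical.

```java
import java.util.Random;

/** Hypothetical generator of randomized but reproducible benchmark input. */
final class ReproducibleInput {

    /** Returns count pseudorandom values in [0, bound); the same seed gives the same values. */
    static int[] generate(long seed, int count, int bound) {
        Random random = new Random(seed);
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = random.nextInt(bound);
        }
        return values;
    }
}
```

Recording the seed alongside each benchmark result makes it possible to replay exactly the same input later.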
For the multiuser case, the benchmark harness must be able to simulate multiple users working, including variations in user access and execution patterns. Without this support for variations in activity, the multiuser tests inevitably miss many bottlenecks encountered in actual deployment and, conversely, do encounter artificial bottlenecks that are never encountered in deployment, wasting time and resources. It is critical in multiuser and distributed applications that the benchmark harness correctly reproduce user-activity variations, delays, and data flows.
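A multiuser simulation along these lines might be sketched as follows. This is only an outline under stated assumptions: each simulated user is a thread, variation is limited to a randomized "think time" before each action, and the class `MultiUserSimulation` with its parameters is hypothetical. A realistic harness would also vary the data and the mix of actions per user.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: simulated users with varying delays between actions. */
final class MultiUserSimulation {

    /** Runs the action for each simulated user and returns every per-action time in nanoseconds. */
    static List<Long> run(int users, int actionsPerUser, int maxThinkMillis, Runnable action) {
        ExecutorService pool = Executors.newFixedThreadPool(users);
        try {
            List<Future<List<Long>>> pending = new ArrayList<>();
            for (int u = 0; u < users; u++) {
                final long seed = u; // per-user seed keeps each user's delays reproducible
                Callable<List<Long>> user = () -> {
                    Random random = new Random(seed);
                    List<Long> times = new ArrayList<>();
                    for (int i = 0; i < actionsPerUser; i++) {
                        Thread.sleep(random.nextInt(maxThinkMillis + 1)); // simulated user delay
                        long start = System.nanoTime();
                        action.run();
                        times.add(System.nanoTime() - start);
                    }
                    return times;
                };
                pending.add(pool.submit(user));
            }
            List<Long> all = new ArrayList<>();
            for (Future<List<Long>> result : pending) {
                all.addAll(result.get()); // waits for that user to finish
            }
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("simulation failed", e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Because the delays differ per user, the actions are not artificially synchronized into bursts of activity and inactivity.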
Each run of your benchmarks needs to be under conditions that are as identical as possible; otherwise, it becomes difficult to pinpoint why something is running faster (or slower) than in another test. The benchmarks should be run multiple times, and the full list of results retained, not just the average and deviation or the ranged percentages. Also note the time of day that benchmarks are being run and any special conditions that apply, e.g., weekend or after hours in the office. Sometimes the variation can give you useful information. It is essential that you always run an initial benchmark to precisely determine the initial times. This is important because, together with your targets, the initial benchmarks specify how far you need to go and highlight how much you have achieved when you finish tuning.
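Retaining the full list of results, together with when and under what conditions each run took place, can be as simple as the sketch below. The class `BenchmarkRuns` and its fields are hypothetical; summary figures such as the mean are derived from the retained list rather than stored in place of it.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical record of every benchmark run, not just summary statistics. */
final class BenchmarkRuns {

    /** One retained measurement: start time, noted conditions, and elapsed time. */
    static final class Run {
        final Instant startedAt;
        final String conditions;
        final long elapsedNanos;

        Run(Instant startedAt, String conditions, long elapsedNanos) {
            this.startedAt = startedAt;
            this.conditions = conditions;
            this.elapsedNanos = elapsedNanos;
        }
    }

    private final List<Run> runs = new ArrayList<>();

    /** Times one execution of the benchmark and retains the result. */
    void measure(String conditions, Runnable benchmark) {
        Instant startedAt = Instant.now();
        long start = System.nanoTime();
        benchmark.run();
        runs.add(new Run(startedAt, conditions, System.nanoTime() - start));
    }

    /** Every retained run, in execution order. */
    List<Run> all() {
        return Collections.unmodifiableList(runs);
    }

    /** Mean elapsed time, computed from the retained list (NaN if there are no runs). */
    double meanNanos() {
        return runs.stream().mapToLong(r -> r.elapsedNanos).average().orElse(Double.NaN);
    }
}
```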
It is more important to run all benchmarks under the same conditions than to achieve the end-user environment for those benchmarks, though you should try to target the expected environment. It is possible to switch environments by running all benchmarks on an identical implementation of the application in two environments, thus rebasing your measurements. But this can be problematic: it requires detailed analysis because different environments usually have different relative performance between functions (thus your initial benchmarks could be skewed compared with the current measurements).
Each set of changes (and preferably each individual change) should be followed by a run of benchmarks to precisely identify improvements (or degradations) in the performance across all functions. A particular optimization may improve the performance of some functions while at the same time degrading the performance of others, and obviously you need to know this. Each set of changes should be driven by identifying exactly which bottleneck is to be improved and how much of a speedup is expected. Rigorously using this methodology provides a precise target for your effort.
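Comparing the benchmark run that follows a change with the previous run can be automated so that degradations in other functions are not overlooked. The sketch below assumes timings are held as a map from a function name to a time in milliseconds; the class `BenchmarkComparison` is hypothetical, and a real comparison would probably allow for normal run-to-run variation rather than flagging any increase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical comparison of per-function timings before and after a change. */
final class BenchmarkComparison {

    /** Returns the names of functions whose time increased, in alphabetical order. */
    static List<String> degraded(Map<String, Long> before, Map<String, Long> after) {
        List<String> worse = new ArrayList<>();
        for (Map.Entry<String, Long> entry : new TreeMap<>(before).entrySet()) {
            Long now = after.get(entry.getKey());
            if (now != null && now > entry.getValue()) {
                worse.add(entry.getKey());
            }
        }
        return worse;
    }
}
```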
You need to verify that any particular change does improve performance. It is tempting to change something small that you are sure will give an "obvious" improvement, without bothering to measure the performance change for that modification (because "it's too much trouble to keep running tests"). But you could easily be wrong. Jon Bentley once discovered that eliminating code from some simple loops can actually slow them down.[4] If a change does not improve performance, you should revert to the previous version.
[4] Jon Bentley, "Code Tuning in Context," Dr. Dobb's Journal, May 1999. An empty loop in C ran slower than one that contained an integer increment operation.
The benchmark suite should not interfere with the application. Be on the lookout for artificial performance problems caused by the benchmarks themselves. This is very common if no thought is given to normal variation in usage. A typical situation might be benchmarking multiuser systems with lack of user simulation (e.g., user delays not simulated, causing much higher throughput than would ever be seen; user data variation not simulated, causing all tests to try to use the same data at the same time; activities artificially synchronized, giving bursts of activity and inactivity; etc.). Be careful not to measure artificial situations, such as full caches with exactly the data needed for the test (e.g., running the test multiple times sequentially without clearing caches between runs). There is little point in performing tests that hit only the cache, unless this is the type of work the users will always perform.
When tuning, you need to alter any benchmarks that are quick (under five seconds) so that the code applicable to the benchmark is tested repeatedly in a loop to get a more consistent measure of where any problems lie. By comparing timings of the looped version with a single-run test, you can sometimes identify whether caches and startup effects are altering times in any significant way.
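A looped version of a quick benchmark, alongside a single-run timing for comparison, might look like the following sketch. The class `LoopedBenchmark` and the iteration count are assumptions for illustration; a large gap between the single-run time and the per-iteration average suggests that caches or startup effects are influencing the measurement.

```java
/** Hypothetical helper for timing quick code once and repeatedly in a loop. */
final class LoopedBenchmark {

    /** Elapsed nanoseconds for a single execution. */
    static long timeOnce(Runnable code) {
        long start = System.nanoTime();
        code.run();
        return System.nanoTime() - start;
    }

    /** Average nanoseconds per execution when the code is run repeatedly in a loop. */
    static double averageOverLoop(Runnable code, int iterations) {
        if (iterations <= 0) {
            throw new IllegalArgumentException("iterations must be positive");
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            code.run();
        }
        return (System.nanoTime() - start) / (double) iterations;
    }
}
```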
Optimizing code can introduce new bugs, so the application should be tested during the optimization phase. A particular optimization should not be considered valid until the application using that optimization's code path has passed quality assessment.
Optimizations should also be completely documented. It is often useful to retain the previous code in comments for maintenance purposes, especially as some kinds of optimized code can be more difficult to understand (and therefore to maintain).
It is typically better (and easier) to tune multiuser applications in single-user mode first. Many multiuser applications can obtain 90% of their final tuned performance if you tune in single-user mode, and then identify and tune just a few major multiuser bottlenecks (which are typically a sort of give-and-take between single-user performance and general system throughput). Occasionally, though, there will be serious conflicts that are revealed only during multiuser testing, such as transaction conflicts that can slow an application to a crawl. These may require a redesign or rearchitecting of the application. For this reason, some basic multiuser tests should be run as early as possible to flush out potential multiuser-specific performance problems.
Tuning distributed applications requires access to the data being transferred across the various parts of the application. At the lowest level, this can be a packet sniffer on the network or server machine. One step up from this is to wrap all the external communication points of the application so that you can record all data transfers. Relay servers are also useful. These are small applications that just reroute data between two communication points. Most useful of all is a trace or debug mode in the communications layer that allows you to examine the higher-level calls and communication between distributed parts.
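Wrapping an external communication point so that data transfers are recorded can be sketched with a stream wrapper. The example below is one possible approach, not the only one: `RecordingOutputStream` is a hypothetical class that keeps an in-memory copy of everything written, which suits short test runs but not long-running traffic.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical wrapper that forwards all writes and keeps a copy for later analysis. */
final class RecordingOutputStream extends FilterOutputStream {

    private final ByteArrayOutputStream record = new ByteArrayOutputStream();

    RecordingOutputStream(OutputStream target) {
        super(target);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        record.write(b);
    }

    @Override
    public void write(byte[] buffer, int offset, int length) throws IOException {
        out.write(buffer, offset, length);
        record.write(buffer, offset, length);
    }

    /** A copy of every byte written through this wrapper so far. */
    byte[] recorded() {
        return record.toByteArray();
    }
}
```

The same idea applies to the input side, and a wrapper like this could be placed around the streams of a socket to capture what one part of the application sends to another.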