2.3 Method Calls

Most profiling tools provide a profile of method calls, showing where the bottlenecks in your code are and helping you decide where to target your efforts. By showing which methods and lines take the most time, a good profiling tool can quickly pinpoint bottlenecks.

Most method profilers work by sampling the call stack at regular intervals and recording the methods on the stack.^[7] This regular snapshot identifies the method currently being executed (the method at the top of the stack) and all the methods below, to the program's entry point. By accumulating the number of hits on each method, the resulting profile usually identifies where the program is spending most of its time. This profiling technique assumes that the sampled methods are representative, i.e., if 10% of stacks sampled show method foo( ) at the top of the stack, then the assumption is that method foo( ) takes 10% of the running time. However, this is a sampling technique , so it is not foolproof: methods can be missed altogether or have their weighting misrecorded if some of their execution calls are missed. But usually only the shortest tests are skewed. Any reasonably long test (i.e., seconds rather than milliseconds) normally gives correct results.

^[7] A variety of profiling metrics, including the way different metrics can be used, are reported in "A unifying approach to performance analysis in the Java environment" by Alexander, Berry, Levine, and Urquhart, IBM Systems Journal, Vol. 39, No. 1, http://www.research.ibm.com/journal/sj/391/alexander.html. Specifically, see Table 2-1 in this paper.

This sampling technique can be difficult to get right. It is not enough to simply sample the stack. The profiler must also ensure that it has a coherent stack state, so the call must be synchronized across stack activities, possibly by temporarily stopping the thread. The profiler also needs to make sure that multiple threads are treated consistently and that the timing involved in its activities is accounted for without distorting the regular sample time. Also, too short a sample interval causes the program to become extremely slow, while too long an interval results in many method calls being missed and misrepresentative profile results being generated.

The JDK comes with a minimal profiler, obtained by running a program using the java executable with the -Xrunhprof option (-prof before JDK 1.2, -Xprof with HotSpot). This option produces a profile data file called java.hprof.txt (java.prof before 1.2). The filename can be specified by using the modified option -Xrunhprof: file=<filename> (-prof:<filename> before 1.2).

2.3.1 Profiling Methodology

When using a method profiler, the most useful technique is to target the top five to ten methods and choose the quickest to fix. The reason for this is that once you make one change, the profile tends to be different the next time, sometimes markedly so. This way, you can get the quickest speedup for a given effort.

However, it is also important to consider what you are changing so you know what your results are. If you select a method that is taking 10% of the execution time and then halve the time that method takes, you speed up your application by 5%. On the other hand, targeting a method that takes up only 1% of execution time gives you a maximum of only 1% speedup to the application, no matter how much effort you put in.

Similarly, if you have a method that takes 10% of the time but is called a huge number of times, with each individual method call being quite short, you are less likely to speed up the method. On the other hand, if you can eliminate some significant fraction of the calling methods (the methods that call the method that takes 10% of the time), you might gain speed that way.

Let's look at the profile output from a short program that repeatedly converts some numbers to strings and inserts them into a hash table:

package tuning.profile;
import java.util.*;
  
public class ProfileTest
{
  
  public static void main(String[  ] args)
  {
    //Repeat the loop this many times
    int repeat = 2000;
  
    //Two arrays of numbers, eight doubles and ten longs
    double[  ] ds = {Double.MAX_VALUE, -3.14e-200D,
      Double.NEGATIVE_INFINITY, 567.89023D, 123e199D,
      -0.000456D, -1.234D, 1e55D};
    long[  ] ls = {2283911683699007717L, -8007630872066909262L,
      4536503365853551745L, 548519563869L, 45L,
      Long.MAX_VALUE, 1L, -9999L, 7661314123L, 0L};
  
    //Initializations
    long time;
    StringBuffer s = new StringBuffer(  );
    Hashtable h = new Hashtable(  );
    System.out.println("Starting test");
    time = System.currentTimeMillis(  );
  
    //Repeatedly add all the numbers to a stringbuffer
    //and also put them into a hash table
    for (int i = repeat; i > 0; i--)
    {
        s.setLength(0);
        for (int j = ds.length-1; j >= 0; j--)
        {
            s.append(ds[j]);
            h.put(new Double(ds[j]), Boolean.TRUE);
        }
        for (int j = ls.length-1; j >= 0; j--)
        {
            s.append(ls[j]);
            h.put(new Long(ls[j]), Boolean.FALSE);
        }
    }
    time = System.currentTimeMillis(  ) - time;
    System.out.println("  The test took " + time + " milliseconds");
  }
}

The relevant output from running this program with the JDK 1.2 method profiling option follows. (See Section 2.3.2 later in this chapter for a detailed explanation of the 1.2 profiling option and its output.)

CPU SAMPLES BEGIN (total = 15813) Wed Jan 12 11:26:47 2000
rank   self  accum   count trace method
   1 54.79% 54.79%    8664   204 java/lang/FloatingDecimal.dtoa
   2 11.67% 66.46%    1846   215 java/lang/Double.equals
   3 10.18% 76.64%    1609   214 java/lang/FloatingDecimal.dtoa
   4  3.10% 79.74%     490   151 java/lang/FloatingDecimal.dtoa
   5  2.90% 82.63%     458   150 java/lang/FloatingDecimal.<init>
   6  2.11% 84.74%     333   213 java/lang/FloatingDecimal.<init>
   7  1.23% 85.97%     194   216 java/lang/Double.doubleToLongBits
   8  0.97% 86.94%     154   134 sun/io/CharToByteConverter.convertAny
   9  0.94% 87.88%     148   218 java/lang/FloatingDecimal.<init>
  10  0.82% 88.69%     129   198 java/lang/Double.toString
  11  0.78% 89.47%     123   200 java/lang/Double.hashCode
  12  0.70% 90.17%     110   221 java/lang/FloatingDecimal.dtoa
  13  0.66% 90.83%     105   155 java/lang/FloatingDecimal.multPow52
  14  0.62% 91.45%      98   220 java/lang/Double.equals
  15  0.52% 91.97%      83   157 java/lang/FloatingDecimal.big5pow
  16  0.46% 92.44%      73   158 java/lang/FloatingDecimal.constructPow52
  17  0.46% 92.89%      72   133 java/io/OutputStreamWriter.write

In this example, I extracted only the top few lines from the profile summary table. The methods are ranked according to the percentage of time they take. Note that the trace does not identify actual method signatures, only method names. The top three methods take, respectively, 54.79%, 11.67%, and 10.18% of the time taken to run the full program.^[8]

^[8] The samples that count toward a particular method's execution time are those where the method itself is executing at the time of the sample. If method foo( ) were calling another method when the sample was taken, that other method would be at the top of the stack instead of foo( ). So you do not need to worry about the distinction between foo( )'s execution time and the time spent executing foo( )'s callees. Only the method at the top of the stack is tallied.

The fourth method in the list takes 3.10% of the time, so clearly you need look no further than the top three methods to optimize the program. The methods ranked first, third, and fourth are the same method, possibly called in different ways. Obtaining the traces for these three entries from the relevant section of the profile output (trace 204 for the first entry, and traces 215 and 151 for the second and fourth entries), you get:

TRACE 204:
java/lang/FloatingDecimal.dtoa(FloatingDecimal.java:Compiled method)
java/lang/FloatingDecimal.<init>(FloatingDecimal.java:Compiled method)
java/lang/Double.toString(Double.java:Compiled method)
java/lang/String.valueOf(String.java:Compiled method)
TRACE 214:
java/lang/FloatingDecimal.dtoa(FloatingDecimal.java:Compiled method)
TRACE 151:
java/lang/FloatingDecimal.dtoa(FloatingDecimal.java:Compiled method)
java/lang/FloatingDecimal.<init>(FloatingDecimal.java:Compiled method)
java/lang/Double.toString(Double.java:132)
java/lang/String.valueOf(String.java:2065)

In fact, both traces 204 and 151 are the same stack, but trace 151 provides line numbers for two of the methods. Trace 214 is a truncated entry, and is probably the same stack as the other two (these differences highlight the fact that the JDK profiler sometimes loses information).

All three entries refer to the same stack: an inferred call from the StringBuffer to append a double, which calls String.valueOf( ), which calls Double.toString( ), which in turn creates a FloatingDecimal object. (<init> is the standard way to write a constructor call; <clinit> is the standard way to show a class initializer being executed. These are also the actual names for constructors and static initializers in the class file.) FloatingDecimal is private to the java.lang package, which handles most of the logic involved in converting floating-point numbers. FloatingDecimal.dtoa( ) is the method called by the FloatingDecimal constructor that converts the binary floating-point representation of a number into its various parts of digits before the decimal point, after the decimal point, and the exponent. FloatingDecimal stores the digits of the floating-point number as an array of chars when the FloatingDecimal is created; no strings are created until the floating-point number is converted to a string.

Since this stack includes a call to a constructor, it is worth checking the object-creation profile to see whether you are generating an excessive number of objects; object creation is expensive, and a method that generates many new objects is often a performance bottleneck. (I show the object-creation profile and how to generate it later in this chapter under Section 2.4.) The object-creation profile shows that a large number of extra objects are being created, including a large number of FDBigInt objects that are created by the new FloatingDecimal objects.

Clearly, FloatingDecimal.dtoa( ) is the primary method to optimize in this case. Almost any improvement in this one method translates directly to a similar improvement in the overall program. However, normally only Sun can modify this method, and even if you want to modify it, it is long and complicated and takes an excessive amount of time to optimize unless you are already familiar with both floating-point binary representation and converting that representation to a string format.

Normally when tuning, the first alternative to optimizing FloatingDecimal.dtoa( ) is to examine the other significant bottleneck method, Double.equals( ) , which was second in the summary. Even though this entry takes up only 11.67% compared to over 68% for the FloatingDecimal.dtoa( ) method, it may be an easier optimization target. But note that while a small 10% improvement in the FloatingDecimal.dtoa( ) method translates into a 6% improvement for the program as a whole, the Double.equals( ) method needs to be speeded up to be more than twice as fast to get a similar 6% improvement for the full program.

The trace corresponding to this second entry in the summary example turns out to be another truncated trace, but the example shows the same method in 14th position, and the trace for that entry identifies the Double.equals( ) call as coming from the Hashtable.put( ) call. Unfortunately for tuning purposes, the Double.equals( ) method itself is already quite fast and cannot be optimized further.

When methods cannot be directly optimized, the next best choice is to reduce the number of times they are called or even avoiding the methods altogether. (In fact, eliminating method calls is actually the better tuning choice, but it is often considerably more difficult to achieve and so is not a first-choice tactic for optimization.) The object-creation profile and the method profile together point to the FloatingDecimal class as being a huge bottleneck, so avoiding this class is the obvious tuning tactic here. In Chapter 5, I employ this technique, avoiding the default call through the FloatingDecimal class for the case of converting floating-point numbers to Strings, and I obtain an order-of-magnitude improvement. Basically, the strategy is to create a more efficient routine to run the equivalent conversion functionality and then replace the calls to the underperforming FloatingDecimal methods with calls to more efficient optimized methods.

The 1.1 profiling output is quite different and much less like a standard profiler's output. Running the 1.1 profiler with this program (details of this output are described in Section 2.3.4) gives:

count callee caller time
21 java/lang/System.gc(  )V  
    java/lang/FloatingDecimal.dtoa(IJI)V 760
8 java/lang/System.gc(  )V  
    java/lang/Double.equals(Ljava/lang/Object;)Z 295
2 java/lang/Double.doubleToLongBits(D)J 
    java/lang/Double.equals(Ljava/lang/Object;)Z 0

I have shown only the top four lines from the output. This output actually identifies both the FloatingDecimal.dtoa( ) and the Double.equals( ) methods as taking the vast majority of the time, and the percentages (given by the reported times) are listed as around 70% and 25% of the total program time for the two methods, respectively. Since the "callee" for these methods is listed as System.gc( ), this also indicates that the methods are significantly involved in memory creation, and suggests that the next tuning step might be to analyze the object-creation output for this program.

The best way to avoid the Double.equals( ) method is to replace the hash table with another implementation that stores double primitive data types directly rather than requiring the doubles to be wrapped in a Double object. This allows the = = operator to make the comparison in the put( ) method, thus completely avoiding the Double.equals( ) call. This is another standard tuning tactic, replacing a data structure with one more appropriate and faster for the task.

2.3.2 Java 2 cpu=samples Profile Output

The default profile output gained from executing with -Xrunhprof in Java 2 is not useful for method profiling. The default output generates object-creation statistics from the heap as the dump (output) occurs. By default, the dump occurs when the application terminates; you can modify the dump time by typing Ctrl-\ on Solaris and other Unix systems, or Ctrl-Break on Windows. To get a useful method profile, you need to modify the profiler options to specify method profiling. A typical call to achieve this is:

% java -Xrunhprof:cpu=samples,thread=y <classname>

(Note that in a Windows command-line prompt, you need to surround the options with double quotes because the equals sign is considered a meta character.) The full list of options available with -Xrunhprof can be viewed using the -Xrunhprof:help option.

Note that -Xrunhprof has an "h" in it. There seems to be an undocumented feature of the VM in which the option -Xrun<something> makes the VM try to load a shared library called <something>, for example, using -Xrunprof results in the VM trying to load a shared library called prof. This can be quite confusing if you are not expecting it. In fact, -Xrunhprof loads the hprof shared library.

The profiling option in JDKs 1.2/1.3/1.4 can be pretty flaky. Several of the options can cause the runtime to crash (core dump). The output is a large file because huge amounts of trace data are written rather than summarized. Since the profile option is essentially a Sun engineering tool, it is pretty rough, especially since Sun has a separate (not free) profile tool for its engineers. Another tool that Sun provides to analyze the output of the profiler is the Heap Analysis Tool (search http://www.java.sun.com for "HAT"). But this tool analyzes only the object-creation statistics output gained with the default profile output, so it is not that useful for method profiling (see Section 2.4 for slightly more about this tool).

Nevertheless, I expect the free profiling option to stabilize and be more useful in future versions. The output when run with the options already listed (cpu=samples, thread=y) already results in fairly usable information. This profiling mode operates by periodically sampling the stack. Each unique stack trace provides a TRACE entry in the second section of the file, describing the method calls on the stack for that trace. Multiple identical samples are not listed; instead, the number of their "hits" is summarized in the third section of the file. The profile output file in this mode has three sections:

Section 1

A standard header section describing possible monitored entries in the file. For example:

WARNING!  This file format is under development, and is subject to
change without notice.
  
This file contains the following types of records:
  
THREAD START
THREAD END      mark the lifetime of Java threads
  
TRACE           represents a Java stack trace.  Each trace consists
                of a series of stack frames.  Other records refer to
                TRACEs to identify (1) where object allocations have
                taken place, (2) the frames in which GC roots were
                found, and (3) frequently executed methods.

Section 2

Individual entries describing monitored events, i.e., threads starting and terminating, but mainly sampled stack traces. For example:

THREAD START (obj=8c2640, id = 6, name="Thread-0", group="main")
THREAD END (id = 6)
TRACE 1:
<empty>
TRACE 964:
java/io/ObjectInputStream.readObject(ObjectInputStream.java:Compiled method)
java/io/ObjectInputStream.inputObject(ObjectInputStream.java:Compiled method)
java/io/ObjectInputStream.readObject(ObjectInputStream.java:Compiled method)
java/io/ObjectInputStream.inputArray(ObjectInputStream.java:Compiled method)
TRACE 1074:
java/io/BufferedInputStream.fill(BufferedInputStream.java:Compiled method)
java/io/BufferedInputStream.read1(BufferedInputStream.java:Compiled method)
java/io/BufferedInputStream.read(BufferedInputStream.java:Compiled method)
java/io/ObjectInputStream.read(ObjectInputStream.java:Compiled method)

Section 3

A summary table of methods ranked by the number of times the unique stack trace for that method appears. For example:

CPU SAMPLES BEGIN (total = 512371) Thu Aug 26 18:37:08 1999
rank   self  accum   count trace method
   1 16.09% 16.09%   82426  1121 java/io/FileInputStream.read
   2  6.62% 22.71%   33926   881 java/io/ObjectInputStream.allocateNewObject
   3  5.11% 27.82%   26185   918 java/io/ObjectInputStream.inputClassFields
   4  4.42% 32.24%   22671   887 java/io/ObjectInputStream.inputObject
   5  3.20% 35.44%   16392   922 java/lang/reflect/Field.set

Section 3 is the place to start when analyzing this profile output. It consists of a table with six fields:

rank

This column simply counts the entries in the table, starting with 1 at the top and incrementing by 1 for each entry.

self

The self field is usually interpreted as a percentage of the total running time spent in this method. More accurately, this field reports the percentage of samples that have the stack given by the trace field. Here's a one-line example:

rank   self  accum   count trace method 
   1 11.55% 11.55%   18382   545 java/lang/FloatingDecimal.dtoa

This example shows that stack trace 545 occurred in 18,382 of the sampled stack traces, and this is 11.55% of the total number of stack trace samples made. It indicates that this method was probably executing for about 11.55% of the application execution time because the samples are at regular intervals. You can identify the precise trace from the second section of the profile output by searching for the trace with identifier 545. For the previous example, this trace was:

TRACE 545: (thread=1)
java/lang/FloatingDecimal.dtoa(FloatingDecimal.java:Compiled method)
java/lang/FloatingDecimal.<init>(FloatingDecimal.java:Compiled method)
java/lang/Double.toString(Double.java:Compiled method)
java/lang/String.valueOf(String.java:Compiled method)

This TRACE entry clearly identifies the exact method and its caller. Note that the stack is reported to a depth of four methods. This is the default depth; the depth can be changed using the depth parameter to -Xrunhprof, e.g., -Xrunhprof:depth=6,cpu=samples,....

accum

This field is a running additive total of all the self field percentages as you go down the table. For the Section 3 example shown previously, the third line lists 27.82% for the accum field, indicating that the sum total of the first three lines of the self field is 27.82%.

count

This field indicates how many times the unique stack trace that gave rise to this entry was sampled while the program ran.

trace

This field shows the unique trace identifier from the second section of profile output that generated this entry. The trace is recorded only once in the second section no matter how many times it is sampled; the number of times that this trace has been sampled is listed in the count field.

method

This field shows the method name from the top line of the stack trace referred to from the trace field, i.e., the method that was running when the stack was sampled.

This summary table lists only the method name and not its argument types. Therefore, it is frequently necessary to refer to the stack itself to determine the exact method if the method is an overloaded method with several possible argument types. (The stack is given by the trace identifier in the trace field, which in turn references the trace from the second section of the profile output.) If a method is called in different ways, it may also give rise to different stack traces. Sometimes the same method call can be listed in different stack traces due to lost information. Each of these different stack traces results in a different entry in the third section of the profiler's output, even though the method field is the same. For example, it is perfectly possible to see several lines with the same method field, as in the following table segment:

rank   self  accum   count trace method
  95   1.1% 51.55%    110    699 java/lang/StringBuffer.append
 110   1.0% 67.35%    100    711 java/lang/StringBuffer.append
 128   1.0% 85.35%     99    332 java/lang/StringBuffer.append

When traces 699, 711, and 332 are analyzed, one trace might be StringBuffer.append(boolean), while the other two traces could both be StringBuffer.append(int), but called from two different methods (and so giving rise to two different stack traces and consequently two different lines in the summary example). Note that the trace does not identify actual method signatures, only method names. Line numbers are given if the class was compiled so that line numbers remain. This ambiguity can be a nuisance at times.

The profiler in this mode (cpu=samples) suffices when you have no better alternative. It does have an effect on real measured times, slowing down operations by variable amounts even within one application run. But it normally indicates major bottlenecks, although sometimes a little extra work is necessary to sort out multiple identical method-name references.

Using the alternative cpu=times mode, the profile output gives a different view of application execution. In this mode, the method times are measured from method entry to method exit, including the time spent in all other calls the method makes. This profile of an application provides a tree-like view of where the application is spending its time. Some developers are more comfortable with this mode for profiling the application, but I find that it does not directly identify bottlenecks in the code.

2.3.3 HotSpot and 1.3 -Xprof Profile Output

HotSpot does not support the standard Java 2 profiler detailed in the previous section; it supports a separate profiler using the -Xprof option. JDK 1.3 supports the HotSpot profiler as well as the standard Java 2 profiler. The HotSpot profiler has no further options available to modify its behavior; it works by sampling the stack every 10 milliseconds.

The output, printed to standard output, consists of a number of sections. Each section lists entries in order of the number of ticks counted while the method was executed. The various sections include methods executing in interpreted and compiled modes, and VM runtime costs as well:

Section 1

A one-line header. For example:

Flat profile of 7.55 secs (736 total ticks): main

Section 2

A list of methods sampled while running in interpreted mode. The methods are listed in order of the total number of ticks counted while the method was at the top of the stack. For example:

Interpreted + native   Method                        
  3.7%    23  +     4    tuning.profile.ProfileTest.main
  2.4%     4  +    14    java.lang.FloatingDecimal.dtoa
  1.4%     3  +     7    java.lang.FDBigInt.<init>

Section 3

A list of methods sampled while running in compiled mode. The methods are listed in order of the total number of ticks counted while the method was at the top of the stack. For example:

Compiled + native   Method                        
 13.5%    99  +     0    java.lang.FDBigInt.quoRemIteration
  9.8%    71  +     1    java.lang.FDBigInt.mult
  9.1%    67  +     0    java.lang.FDBigInt.add

Section 4

A list of external (non-Java) method stubs, defined using the native keyword. Listed in order of the total number of ticks counted while the method was at the top of the stack. For example:

Stub + native   Method                        
  2.6%    11  +     8    java.lang.Double.doubleToLongBits
  0.7%     2  +     3    java.lang.StrictMath.floor
  0.5%     3  +     1    java.lang.Double.longBitsToDouble

Section 5

A list of internal VM function calls. Listed in order of the total number of ticks counted while the method was at the top of the stack. Not tuneable. For example:

Runtime stub + native  Method                        
  0.1%     1  +     0    interpreter_entries
  0.1%     1  +     0    Total runtime stubs

Section 6

Other miscellaneous entries not included in the previous sections:

Thread-local ticks:
  1.4%    10             classloader
  0.1%     1             Interpreter
 11.7%    86             Unknown code

Section 7

A global summary of ticks recorded. This includes ticks from the garbage collector, thread-locking overhead, and other miscellaneous entries:

Global summary of 7.57 seconds:
100.0%   754             Received ticks
  1.9%    14             Received GC ticks
  0.3%     2             Other VM operations

The entries at the top of Section 3 are the methods that probably need tuning. Any method listed near the top of Section 2 should have been targeted by the HotSpot optimizer and may be listed lower down in Section 3. Such methods may still need to be optimized, but it is more likely that the methods at the top of Section 3 need optimizing. The ticks for the two sections are the same, so you can easily compare the time taken up by the top methods in the different sections and decide which to target.

2.3.4 JDK 1.1.x -prof and Java 2 cpu=old Profile Output

The JDK 1.1.x method-profiling output, obtained by running with the -prof option, is quite different from the normal 1.2 output. This output format is supported in Java 2, using the cpu=old variation of the -Xrunhprof option. This output file consists of four sections:

Section 1

The method profile table showing cumulative times spent in each method executed. The table is sorted on the first count field. For example:

callee caller time
29 java/lang/System.gc(  )V  
        java/io/FileInputStream.read([B)I 10263
1 java/io/FileOutputStream.writeBytes([BII)V 
        java/io/FileOutputStream.write([BII)V 0

Section 2

One line describing high-water gross memory usage. For example:

handles_used: 1174, handles_free: 339046, heap-used: 113960, heap-free: 21794720

The line reports the number of handles and the number of bytes used by the heap memory storage over the application's lifetime. A handle is an object reference. The number of handles used is the maximum number of objects that existed at any one time in the application (handles are recycled by the garbage collector, so over its lifetime the application could have used many more objects than are listed). Heap measurements are given in bytes.

Section 3

Reports the number of primitive data type arrays left at the end of the process, just before process termination. For example:

sig  count     bytes  indx
[C     174     19060     5
[B       5     19200     8

This section has four fields. The first field is the primitive data type (array dimensions and data type given by letter codes listed shortly), the second field is the number of arrays, and the third is the total number of bytes used by all the arrays. This example shows 174 char arrays with a combined space of 19,060 bytes, and 5 byte arrays with a combined space of 19,200 bytes.

The reported data does not include any arrays that may have been garbage-collected before the end of the process. For this reason, the section is of limited use. You could use the -noasyncgc option to try to eliminate garbage collection (if you have enough memory; you may also need -mx with a large number to boost the maximum memory available). If you do, also use -verbosegc so that if garbage collection is forced, you at least know that it has occurred and can get the basic number of objects and bytes reclaimed.

Section 4

The fourth section of the profile output is the per-object memory dump. Again, this includes only objects left at the end of the process just before termination, not objects that may have been garbage-collected before the end of the process. For example:

*** tab[267] p=4bba378 cb=1873248 cnt=219 ac=3 al=1103
  Ljava/util/HashtableEntry; 219 3504
  [Ljava/util/HashtableEntry; 3 4412

This dump is a snapshot of the actual object table. The fields in the first line of an entry are:

***tab[<index>]: The entry location as listed in the object table. The index is of no use for performance tuning.
p=<hex value>: Internal memory locations for the instance and class; of no use for performance tuning.
cb=<hex value>: Internal memory locations for the instance and class; of no use for performance tuning.
cnt=<integer>: The number of instances of the class reported on the next line.
ac=<integer>: The number of instances of arrays of the class reported on the next line.
al=<integer>: The total number of array elements for all the arrays counted in the previous (ac) field.

This first line of the example is followed by lines consisting of three fields: first, the class name prefixed by the array dimension if the line refers to the array data; next, the number of instances of that class (or array class); and last, the total amount of space used by all the instances, in bytes. So the example reports that there are 219 HashtableEntry instances that comprise (collectively) a total of 3504 bytes,^[9] and three HashtableEntry arrays having 1103 array indexes (which amounts to 4412 bytes total, as each entry is a 4-byte object handle).

^[9] A HashtableEntry has one int and three object handle instance variables, each of which takes 4 bytes, so each HashtableEntry is 16 bytes.

Sections 3 and 4 give snapshots of the object table memory and can be used in an interesting way: to run a garbage collection just before termination of your application. That leaves in the object table all the objects that are rooted^[10] by the system and by your application (from static variables). If this snapshot shows significantly more objects than you expect, you may be referencing more objects than you realized.

^[10] Objects rooted by the system are objects that the JVM runtime keeps alive as part of its runtime system. Rooted objects generally cannot be garbage-collected because they are referenced in some way from other objects that cannot be garbage-collected. The roots of these non-garbage-collectable objects are normally objects referenced from the stack, objects referenced from static variables of classes, and special objects the runtime system ensures are kept alive.

The first section of the profile output is the most useful. It consists of multiple lines, each specifying a method and its caller, together with the total cumulative time spent in that method and the total number of times it was called from that caller. The first line of this section specifies the four fields in the profile table in this section: count, callee, caller, and time. They are detailed here:

count

The total number of times the callee method was called from the caller method, accumulating multiple executions of the caller method. For example, if foo1( ) calls foo2( ) 10 times every time foo1( ) is executed, and foo1( ) was itself called three times during the execution of the program, the count field should hold the value 30 for the callee-caller pair foo2( )-foo1( ). The line in the table should look like this:

30 x/y/Z.foo2(  )V x/y/Z.foo1(  )V 1263

(assuming the foo*( ) methods are in class x.y.Z and they both have a void return). The actual reported numbers may be less than the true number of calls: the profiler can miss calls.

callee

The method that was called count times in total from the caller method. The callee can be listed in other entries as the callee method for different caller methods.

caller

The method that called the callee method count times in total.

time

The cumulative time (in milliseconds) spent in the callee method, including time when the callee method was calling other methods (i.e., when the callee method was in the stack but not at the top, and so was not the currently executing method).

If each of the count calls in one line took exactly the same amount of time, then one call from caller to callee took time divided by count milliseconds.

This first section is normally sorted into count order. However, for this profiler, the time spent in methods tends to be more useful. Because the times in the time field include the total time that the callee method was anywhere on the stack, interpreting the output of complex programs can be difficult without processing the table to subtract subcall times. This format is different from the 1.2 output with cpu=samples specified, and is similar to a 1.2 profile with cpu=times specified.

The lines in the profile output are unique for each callee-caller pair, but any one callee method and any one caller method can (and normally do) appear in multiple lines. This is because any particular method can call many other methods, and so the method registers as the caller for multiple callee-caller pairs. Any particular method can also be called by many other methods, and so the method registers as the callee for multiple callee-caller pairs.

The methods are written out using the internal Java syntax listed in Table 2-1.

Table 2-1. Internal Java syntax for -prof output format
Internal symbol	Java meaning
/	Replaces the . character in package names (e.g., java/lang/String stands for java.lang.String)
B	byte
C	char
D	double
I	int
F	float
J	long
S	short
V	void
Z	boolean
[	One array dimension (e.g., `[[B` stands for a two-dimensional array of bytes, such as `new` `byte[3][4]`)
L<classname>;	A class (e.g., Ljava/lang/String; stands for java.lang.String)

There are free viewers, including source code, for viewing this file:

Vladimir Bulatov's HyperProf (search for HyperProf on the Web)
Greg White's ProfileViewer (search for ProfileViewer on the Web)

The biggest drawback to the 1.1 profile output is that threads are not shown at all. This means that it is possible to get time values for method calls that are longer than the total time spent in running the application, since all the call times from multiple threads are added together. It also means that you cannot determine from which thread a particular method call was made. Nevertheless, after re-sorting the section on the time field rather than the count field, the profile data is useful enough to suffice as a method profiler when you have no better alternative.

One problem I've encountered is the limited size of the list of methods that can be held by the internal profiler. Technically, this limitation is 10,001 entries in the profile table, and there is presumably one entry per method. There are four methods that help you avoid the limitation by profiling only a small section of your code:

sun.misc.VM.suspendJavaMonitor(  )
sun.misc.VM.resumeJavaMonitor(  )
sun.misc.VM.resetJavaMonitor(  )
sun.misc.VM.writeJavaMonitorReport(  )

These methods also allow you some control over which parts of your application are profiled and when to dump the results.

2.3.5 Java 2 -Xhprof Option

The -Xhprof option seems to be simply an alternative name for the -Xrunhprof option. I believe that originally it was called -Xhprof, and then the option was left in although all subsequent documentation used -Xrunhprof.

2.3.6 Java 2 -Xaprof option

-Xaprof appears to be a simple allocation profiler. It prints the number and total size of instances allocated per class, including array classes, accumulating instances across all threads and creation points. In fact, it seems to be very similar to the tool I describe in the next section. Like other VM profiling tools, it is unfortunately not 100% stable (for example, it core dumps with my 1.4 Windows VM). Nevertheless, it is useful when it works, and it was introduced with 1.3.

Using this profiler to monitor the tuning.profile.ProfileTest class used in the example from the "Profiling Methodology" section results in the following output:

Allocation profile (sizes in bytes, cutoff = 0 bytes):
  
_ _ _ _ _ _Size_ _Instances_ _Average_ _Class_ _ _ _ _ _ _ _ _ _ _ _ _ 
  13491592     186025       73  [I
   5634592      86602       65  [C
   2496352     156022       16  java.lang.FDBigInt
    875112      36463       24  java.lang.String
    768000      16000       48  java.lang.FloatingDecimal
    320000      20000       16  java.lang.Long
    256000      16000       16  java.lang.Double
     29832         14     2131  [B
     14256        594       24  java.util.Hashtable$Entry
      8960          6     1493  [S
      8112         25      324  [Ljava.util.Hashtable$Entry;
      2448        102       24  java.lang.StringBuffer
      2312          3      771  [Ljava.lang.FDBigInt;
      1600         24       67  [Ljava.lang.Object;
       584          9       65  [Ljava.util.HashMap$Entry;
       528         22       24  java.util.Locale
       440         11       40  java.util.Hashtable
       432          9       48  java.util.HashMap
       392          4       98  [D
       376          3      125  [J
       320          7       46  [Ljava.lang.String;
       256          4       64  java.lang.Thread
...
23916904     518123       46  --total--

The listing has been truncated, but a full listing is output down to objects and arrays with only one instance created. The primitive arrays are listed using the one-character labels from Table 2-1. The listing is fairly clear. All instances created at any time by the VM are included, whether they have been garbage-collected or not. The first column is the total size in bytes taken up by all the instances summed together; the second column provides the number of those instances created; and the third divides the first column by the second column to give an average size per object in bytes.

The only disadvantage seems to be that you cannot take a snapshot. There seems to be no way of registering only those objects created between time 1 (e.g., after initialization) and time 2. Otherwise, this is another useful tool to add to your armory.