The Java bytecode specification allows a switch statement to be compiled into one of two different bytecodes. One compiled switch type works as follows:
Given a particular value passed to the switch block to be compared, the passed value is successively compared against the value associated with each case statement in order. If, after testing all cases, no statements match, then the default label is matched. When a case statement that matches is found, the body of that statement and all subsequent case bodies are executed (until one body exits the switch statement, or the last one is reached).
The operation of this switch statement is equivalent to holding an ordered collection of values that are compared to the passed value, one after the other in order, until a match is determined. This means that the time taken for the switch to find the case that matches depends on how many case statements there are and where in the list the matched case is. If no cases match and the default must be used, that always takes the longest matching time.
The other switch bytecode works for switch statements where the case values all lie (or can be made to lie) in a particular range. It works as follows:
Given a particular value passed to the switch block to be compared, the passed value is tested to see if it lies in the range. If it does not, the default label is matched; otherwise, the offset of the case is calculated and the corresponding case is matched directly. The body of that matched label and all subsequent case bodies are executed (until one body exits the switch statement, or the last one is reached).
For this latter switch bytecode, the time taken for the switch statement to match the case is constant. The time is not dependent on the number of cases in the switch, and if no cases match, the time to carry out the matching and go to the default is still the same. This switch statement operates as an ordered collection with the switch value first being checked to see if it is a valid index into the ordered collection, and then that value is used as the index to arrive immediately at the matched location.
Clearly, the second type of switch statement is faster than the first. Sometimes compilers can add dummy cases to a switch statement, converting the first type of switch into the second (faster) kind. (A compiler is not obliged to use the second type of switch bytecode at all, but generally it does if it can easily be used.) You can determine which switch a particular statement has been compiled into using javap, the disassembler available with the JDK. Using the -c option so that the code is disassembled, examine the method that contains the switch statement. It contains either a "tableswitch" bytecode identifier or a "lookupswitch" bytecode identifier. The tableswitch keyword is the identifier for the faster (second) type of switch.
If you identify a bottleneck that involves a switch statement, do not leave the decision to the compiler. You are better off constructing switch statements that use contiguous ranges of case values, ideally by inserting dummy case statements to specify all the values in the range, or possibly by breaking up the switch into multiple switches that each use contiguous ranges. You may need to apply both of these optimizations as in the next example.
Our tuning.loop.SwitchTest class provides a repeated test on three methods with switch statements and one other array-access method for comparison. The first method, switch1( ), contains some noncontiguous values for the cases, with each returning a particular integer value. The second method, switch2( ), converts the single switch statement in switch1( ) into four switch statements, with some of those four switch statements containing extra dummy cases to make each switch statement contain a contiguous set of cases. This second method, switch2( ), is functionally identical to switch1( ).
The third method, switch3( ), replaces the cases with a contiguous set of cases, integers 1 to 13. This method is not directly comparable to the first two methods; it is present as a control test. The fourth method, switch4( ), is functionally identical to switch3( ) but uses an array access instead of the switch statement, essentially doing in Java code what the compiler implicitly does in bytecodes for switch3( ). I run two sets of tests. The first set of tests, labeled "varying," passes in a different integer for each call to the switches. This means that most of the time, the default label is matched. The second set of tests, labeled "constant," alternates between passing in the integers 7 and 8 to the switches. Interestingly, my original test passed in only the integer 8, but HotSpot server mode optimized that to call the method only once and reuse the result, hence the need for the alternation. The results are shown in Table 7-2 for various VMs.
1.1.8 |
1.2.2 |
1.3.1 |
1.3.1-server |
1.4.0 |
1.4.0-server |
1.4.0-Xint |
||
---|---|---|---|---|---|---|---|---|
1 |
switch1 varying |
101% |
100% |
59% |
31% |
51% |
35% |
284% |
2 |
switch2 varying |
44% |
44% |
80% |
36% |
81% |
36% |
341% |
3 |
switch3 varying |
14% |
14% |
51% |
28% |
52% |
25% |
213% |
4 |
switch4 varying |
11% |
11% |
27% |
9% |
17% |
14% |
47% |
5 |
switch1 constant |
73% |
73% |
86% |
50% |
84% |
20% |
68% |
6 |
switch2 constant |
55% |
55% |
103% |
41% |
104% |
16% |
67% |
7 |
switch3 constant |
50% |
50% |
93% |
32% |
90% |
20% |
56% |
8 |
switch4 constant |
30% |
30% |
89% |
24% |
72% |
24% |
79% |
There is a big difference in optimizations gained depending on whether the VM has a plain JIT or uses HotSpot technology. The times are all relative to the JDK 1.2.2 "switch1 varying" case. From the variation in timings, it is not clear whether the HotSpot technology fails to compile the handcrafted switch in an optimal way or whether it does optimally compile all the switch statements but adds overhead that cancels some of the optimizations. From the first line, it is clear that HotSpot does a great job of compiling the original unoptimized switch. Comparing the times across the different VMs in the second line, for the optimized switch, we can see that client-mode HotSpot does really badly. It appears that the way you optimize your switch statement is heavily dependent on which VM runs your application. This is unfortunate.
For the JIT results, the first and second lines of output show the speedup you can get by recrafting the switch statements. Here, both switch1( ) and switch2( ) are using the default for most of the tests. In this situation, switch1( ) requires 13 failed comparisons before executing the default statement. switch2( ), on the other hand, checks the value against the range of each of its four switch statements, then immediately executes the default statement.
The first and third lines of output show the worst-case comparison for the two types of switch statements. In this test, switch1( ) almost always fails all its comparison tests. On the other hand, switch3( ), with the contiguous range, is much faster than switch1( ) (JIT cases only). This is exactly what is expected, as the average case for switch1( ) here consists of 13 failed comparisons followed by a return statement. The average case for switch3( ) in this test is only a pair of checks followed by a return statement. The two checks are that the integer is smaller than or equal to 13 and larger than or equal to 1. Both checks fail in most of the calls for this "varying" case. Again, the HotSpot values are not so good.
Even when the case statement in switch1( ) is always matched, the fifth and sixth lines show that switch2( ) can be faster (though again, not with HotSpot client mode). In this test, the matched statement is about halfway down the list of cases in switch1( ), so the seven or so failed comparisons for switch1( ) compared to two range checks for switch2( ) should translate into switch2( ) being more than twice as fast as switch1( ).
In each set of tests, switch2( ), which is functionally identical to switch1( ), is faster. The output for switch4( ) is included for comparison, and it turns out to be faster than the functionally identical switch3( ), thus indicating that it is worth considering dispensing with the switch tests completely when you can convert the switch to an array access. In this example, the switch merely returns an integer, so the conversion to an array access is feasible; in general, it may be difficult to convert a set of body statements into an array access and subsequent processing:
package tuning.loop; public class SwitchTest { //Use a default size for the loop of 1 million iterations static int SIZE = 10000000; public static void main(String args[ ]) { //Allow an argument to set the size of the loop. if (args.length != 0) SIZE = Integer.parseInt(args[0]); int result = 0; //run tests looking mostly for the default (switch //test uses many different values passed to it) long time = System.currentTimeMillis( ); for (int i = SIZE; i >=0 ; i--) result += switch1(i); System.out.println("Switch1 took " + (System.currentTimeMillis( )-time) + " millis to get " + result); //and the same code to test timings on switch2( ), //switch3( ) and switch4( ) ... //run tests using one particular passed value (8) result = 0; time = System.currentTimeMillis( ); for (int i = SIZE; i >=0 ; i--) result += switch1(i%2= =0 ? 7 : 8); System.out.println("Switch1 took " + (System.currentTimeMillis( )-time) + " millis to get " + result); //and the same code to test timings on switch2( ), //switch3( ) and switch4( ) ... } public static int switch1(int i) { //This is one big switch statement with 13 case statements //in no particular order. switch(i) { case 318: return 99; case 320: return 55; case 323: return -1; case 14: return 6; case 5: return 8; case 123456: return 12; case 7: return 15; case 8: return 29; case 9: return 11111; case 123457: return 12345; case 112233: return 6666; case 112235: return 9876; case 112237: return 12; default: return -1; } } public static int switch2(int i) { //In this method we break up the 13 case statements from //switch1( ) into four almost contiguous ranges. Then we //add in a few dummy cases so that the four ranges are //definitely contiguous. This should ensure that the compiler //will generate the more optimal tableswitch bytcodes switch(i) { case 318: return 99; case 319: break; //dummy case 320: return 55; case 321: break; //dummy case 322: break; //dummy case 323: return -1; } switch(i) { case 5: return 8; case 6: break; //dummy case 7: return 15; case 8: return 29; case 9: return 11111; case 10: break; //dummy case 11: break; //dummy case 12: break; //dummy case 13: break; //dummy case 14: return 6; } switch(i) { case 112233: return 6666; case 112234: break; //dummy case 112235: return 9876; case 112236: break; //dummy case 112237: return 12; } switch(i) { case 123456: return 12; case 123457: return 12345; default: return -1; } } public static int switch3(int i) { switch(i) { //13 contiguous case statements as a kind of fastest control case 1: return 99; case 2: return 55; case 3: return -1; case 4: return 6; case 5: return 8; case 6: return 12; case 7: return 15; case 8: return 29; case 9: return 11111; case 10: return 12345; case 11: return 6666; case 12: return 9876; case 13: return 12; default: return -1; } } final static int[ ] RETURNS = { 99, 55, -1, 6, 8, 12, 15, 29, 11111, 12345, 6666, 9876, 12 }; public static int switch4(int i) { //equivalent to switch3( ), but using an array lookup //instead of a switch statement. if (i < 1 || i > 13) return -1; else return RETURNS[i-1]; } }