Many of these suggestions apply only after a bottleneck has been identified:
Test your benchmarks on each version of Java available to you (classes, compiler, and VM) to identify any performance improvements.
Test performance using the target VM or "best practice" VMs.
Avoid using VM options that are detrimental to performance.
Include some tests of the garbage collector appropriate to your application, so that you can identify changes that minimize the cost of garbage collection in your application.
Run your application with both the -verbosegc option (-verbose:gc on later VMs) and full application tracing turned on to see when the garbage collector kicks in and what it is doing.
Vary the -Xmx and -Xms option values (maximum and initial heap size) to determine the optimal memory sizes for your application.
Fine-tuning the heap is possible, but requires knowledge of the GC algorithm and the many parameter options available.
Sharing memory between multiple VMs is easy with the Echidna library. This can also provide prestarted VMs for faster startup.
Use -noclassgc/-Xnoclassgc to disable garbage collection of classes, so that classes are not repeatedly unloaded and reloaded.
Replace generic classes with more specific implementations dedicated to the data type being manipulated, e.g., implement a LongVector to hold longs rather than using a Vector object with Long wrappers.
Extend collection classes so that queries on the collection can access its internal arrays directly.
Replace collection objects with arrays where the collection object is a bottleneck.
Try various compilers. Look for compilers targeted at optimizing performance: these can provide a cheap, significant speedup across runtime environments.
Use the -O option (but always check that it does not produce slower code).
Identify the optimizations a compiler is capable of so that you do not negate the optimizations.
Use a decompiler to determine precisely the optimizations generated by a particular compiler.
Consider using a preprocessor to apply some standard compiler optimizations more precisely.
Remember that an optimizing compiler can only optimize the algorithm you wrote; it cannot replace it with a different one. A better algorithm is usually faster than an optimized slow algorithm.
Include optimizing compilers from the early stages of development.
Make sure that the deployed classes have been compiled with the correct compilers.
Convert any loop that makes native method calls so that the loop runs inside the native call rather than in Java. Pass any loop iteration parameters to the native call.
Minimize the number of data transfers through the JNI. Direct (native) ByteBuffers can help.
Deliver classes in uncompressed format in ZIP or JAR files (unless network download time is significant, in which case files should be compressed).
Use a customized classloader running in a separate thread to load class files.
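
As an illustration of the advice above to replace generic classes with type-specific ones, here is a minimal sketch of a LongVector. The checklist names the class but gives no implementation, so the field names, the doubling growth policy, and the sum() query are all assumptions made for this example. The point it shows is that values stay in a primitive long[] and no Long wrapper objects are created.

```java
import java.util.Arrays;

// Hypothetical growable vector of primitive longs; a sketch, not a library class.
final class LongVector {
    private long[] data;   // backing array of primitives, no Long wrappers
    private int size;      // number of elements actually stored

    LongVector(int initialCapacity) {
        if (initialCapacity < 0) {
            throw new IllegalArgumentException("negative capacity: " + initialCapacity);
        }
        // Keep at least one slot so that doubling always grows the array.
        data = new long[Math.max(1, initialCapacity)];
    }

    // Appends a value, doubling the backing array when it is full (assumed policy).
    void add(long value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[size++] = value;
    }

    long get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return data[index];
    }

    int size() {
        return size;
    }

    // A query that works directly on the internal array, with no boxing or casts.
    long sum() {
        long total = 0;
        for (int i = 0; i < size; i++) {
            total += data[i];
        }
        return total;
    }
}
```

Whether this beats a generic collection in your application is something to confirm with your own benchmarks, since the gain depends on the VM and on how the collection is used.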
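
The last suggestion, loading classes in a separate thread, can be approximated without writing a full ClassLoader subclass. The sketch below is a simplification under stated assumptions: it uses the standard Class.forName call on a low-priority daemon thread to load and initialize a list of classes ahead of first use. The class name BackgroundPreloader, its methods, and the choice to skip classes that fail to load are hypothetical and not taken from the checklist.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: preloads classes on a background thread.
final class BackgroundPreloader {
    private final AtomicInteger loaded = new AtomicInteger();

    // Starts a daemon thread that loads and initializes each named class.
    Thread start(final List<String> classNames, final ClassLoader loader) {
        Thread t = new Thread(new Runnable() {
            @Override
            public void run() {
                for (String name : classNames) {
                    try {
                        // 'true' also runs static initializers, so first use is cheap later.
                        Class.forName(name, true, loader);
                        loaded.incrementAndGet();
                    } catch (ClassNotFoundException e) {
                        // Assumed policy: a missing class is skipped, not fatal.
                    } catch (LinkageError e) {
                        // Assumed policy: a class that fails to link or initialize is skipped.
                    }
                }
            }
        }, "class-preloader");
        t.setDaemon(true);                    // never keeps the VM alive
        t.setPriority(Thread.MIN_PRIORITY);   // yield to application threads
        t.start();
        return t;
    }

    // Number of classes loaded successfully so far.
    int loadedCount() {
        return loaded.get();
    }
}
```

A caller would typically start the preloader early in startup and carry on with other work; joining the returned thread is only needed if the application must know that preloading has finished.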