Here's a strategy I have found works well when attacking performance problems:
1. Identify the main bottlenecks (look for about the top five, but go higher or lower if you prefer).
2. Choose the quickest and easiest one to fix, and address it (except in distributed applications, where the topmost bottleneck is usually the one to attack; see the following paragraph).
3. Repeat from Step 1.
This procedure gets your application tuned the quickest. The advantage of choosing the "quickest to fix" of the top few bottlenecks, rather than the absolute topmost problem, is that once a bottleneck has been eliminated, the characteristics of the application change, and the previous topmost bottleneck may no longer need to be addressed at all. In distributed applications, however, I advise you to target the topmost bottleneck: their characteristics are such that the main bottleneck is almost always the best one to fix and, once fixed, the next main bottleneck usually appears in a completely different component of the system.
Although this strategy is simple and actually quite obvious, I nevertheless find that I have to repeat it again and again: once programmers get the bit between their teeth, they just love to apply themselves to the interesting parts of the problems. After all, who wants to unroll loop after boring loop when there's a nice juicy caching technique you're eager to apply?
You should always treat the actual identification of the cause of the performance bottleneck as a science, not an art. The general procedure is straightforward:
1. Measure the performance by using profilers and benchmark suites and by instrumenting code (a minimal instrumentation sketch follows this list).
2. Identify the locations of any bottlenecks.
3. Think of a hypothesis for the cause of the bottleneck.
4. Consider any factors that may refute your hypothesis.
5. Create a test to isolate the factor identified by the hypothesis.
6. Test the hypothesis.
7. Alter the application to reduce the bottleneck.
8. Test that the alteration improves performance, and measure the improvement (include regression-testing the affected code).
9. Repeat from Step 1.
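As a concrete illustration of the instrumentation mentioned in Step 1, the following is a minimal sketch of hand-timing a suspect method. The ReportGenerator class and its generate() method are hypothetical stand-ins for whatever code you want to measure:

    // Minimal hand-instrumentation: time repeated calls to a suspect method.
    // ReportGenerator and generate() are hypothetical stand-ins.
    public class TimingHarness {
        public static void main(String[] args) {
            ReportGenerator generator = new ReportGenerator();

            // Warm up first so JIT compilation doesn't distort the measurement.
            for (int i = 0; i < 1000; i++) {
                generator.generate("warmup");
            }

            long start = System.currentTimeMillis();
            for (int i = 0; i < 10000; i++) {  // repeat to smooth out noise
                generator.generate("sample input");
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println("10000 calls took " + elapsed + " ms");
        }
    }

    class ReportGenerator {
        String generate(String input) {
            return input.trim() + "-report";  // placeholder for the real work
        }
    }

Timing like this is cruder than a profiler, but it is quick to add and gives you a repeatable number against which to measure any alteration.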
Here's how the procedure applies in a particular example:
1. You run the application through your standard profiler (measurement).
2. You find that the code spends a huge 11% of its time in one method (identification of bottleneck).
3. Looking at the code, you find a complex loop and guess this is the problem (hypothesis).
4. You see that it is not iterating that many times, so the bottleneck could possibly be outside the loop (confounding factor).
5. You could vary the loop iteration as a test to see whether that identifies the loop as the bottleneck. Instead, you try to optimize the loop by reducing the number of method calls it makes (see the sketch following this list): this provides a test of whether the loop is the bottleneck and, at the same time, a possible solution. In doing this, you are combining two steps, Steps 5 and 7. Although this is frequently the way tuning actually goes, be aware that it can make the tuning process longer: if there is no speedup, it may be because your optimization did not actually make things faster, in which case you have neither confirmed nor eliminated the loop as the cause of the bottleneck.
6. Rerunning the profiler on the altered application shows that this method's share of the time has dropped to just 4%. The method may still be a candidate for further optimization, but it is confirmed as the bottleneck, and your change has improved performance.
7. (Already done, combined with Step 5.)
8. (Already done, combined with Step 6.)
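To make Step 5 of the example concrete, here is a minimal sketch of the kind of loop optimization described: reducing the number of method calls the loop makes by hoisting loop-invariant calls out of the body. The Point class and the distance calculation are hypothetical; the real code would be whatever loop your profiler flagged:

    import java.util.List;

    // Hypothetical loop flagged by the profiler: before and after reducing
    // the number of method calls made per iteration.
    public class LoopTuning {

        // Before: size(), getX(), and getY() are called on every iteration.
        static double totalDistanceSlow(List<Point> points, Point origin) {
            double total = 0.0;
            for (int i = 0; i < points.size(); i++) {
                Point p = points.get(i);
                double dx = p.getX() - origin.getX();
                double dy = p.getY() - origin.getY();
                total += Math.sqrt(dx * dx + dy * dy);
            }
            return total;
        }

        // After: the loop-invariant calls are hoisted out of the loop,
        // so each iteration makes three fewer method calls.
        static double totalDistanceFast(List<Point> points, Point origin) {
            double total = 0.0;
            int size = points.size();   // hoisted: called once, not per iteration
            double ox = origin.getX();  // hoisted
            double oy = origin.getY();  // hoisted
            for (int i = 0; i < size; i++) {
                Point p = points.get(i);
                double dx = p.getX() - ox;
                double dy = p.getY() - oy;
                total += Math.sqrt(dx * dx + dy * dy);
            }
            return total;
        }
    }

    class Point {
        private final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
        double getX() { return x; }
        double getY() { return y; }
    }

If rerunning the profiler on the faster version shows the method's percentage dropping, you have simultaneously confirmed the hypothesis and reduced the bottleneck; if it shows no change, remember the caveat in Step 5: you have confirmed nothing either way.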