13.4 Design and Architecture

Many design-stage decisions affect performance. These include how long a transaction will be; how often data or objects need to be updated; where objects will be located; whether objects are persistent and how persistence is achieved; how data is manipulated; how components interact; and how tightly coupled subsystems are. They also include determining responses to errors, retry frequencies, and alternative routes for solving tasks.

As I mentioned in the last section, the general technique for performance tuning during the analysis and design phases is to predict performance based on the best available data.[6] During the design phase, a great deal of prototype testing is possible, and all such tests should feed data back to help predict the performance of the application. Any predictions indicating a problem with performance should be addressed at the design phase, prior to coding. If necessary, it is better to revisit the analysis and alter specifications rather than leave any indicated performance issues unresolved.

[6] See Loosley and Douglas.

At each stage, part of the design objective should be to predict the performance of the application. (Note that when I refer to the design phase, I include both logical and physical design; physical design is often called architecture.) The design phase usually includes determining the target platforms, and any predictions must be tailored to the limitations of those platforms. This is especially important for embedded Java systems (e.g., applets and servlets), for environments where a specific nonstandard target VM must be used, and for environments where the target VM is highly variable (i.e., unknown). In all these cases, the target Java runtime system performance cannot be inferred from using the latest standard VM, and performance prediction must be targeted at the known system or at the worst-performing Java runtime system. (Alternatively, the design phase may rule out some runtime systems as being unsupported by the application.)

Any decoupling, indirection, abstraction, or extra layers in the design are highly likely to be candidates for causing performance problems. You should include all these elements in your design if they are called for. But you need to be careful to design using interfaces in such a way that the concrete implementation allows any possible performance optimizations to be incorporated. Design elements that block, copy, queue, or distribute also frequently cause performance problems. These elements can be difficult to optimize, and the design should focus attention on them and ensure that they can either be replaced or that their performance is targeted.[7] Asynchronous and background events can affect times unpredictably, and their effects need to be clearly identified by benchmark testing.

[7] For example, in Chapter 10, we considered a load-balancing solution that included a queue. The queue is a potential bottleneck, and care must be taken to ensure that the queue does not unnecessarily delay requests as they pass through.

13.4.1 Shared Resources

Resources that must be shared by several users, processes, or threads are always a potential source of bottlenecks. When a resource is shared, the resource usually requires its various sharers to use it one at a time to avoid a conflict of states and corruption. During the design phase, you should try to identify all shared resources and predict what performance limitations they impose on the application. Be careful to consider the fully scaled version of the application, i.e., with as many users, objects, files, network connections, etc., as are possible according to the application specifications. Considering fully scaled versions of the application is important because shared resources are highly nonlinear in performance: performance usually degrades gently at the bottleneck as the number of sharers increases, up to a point at which it suddenly and catastrophically collapses as the number of sharers increases further.

If the performance prediction indicates that a particular shared resource is likely to impose too high a performance cost, alternative designs that bypass or reduce the performance cost of that shared resource need to be considered. For example, multiple processes or threads updating a shared collection have to synchronize their updates to avoid corrupting the collection. If this synchronized update is identified as a performance problem, an alternative is to allow each process or thread to update its own collection, and wrap the collections in another collection object that provides global access to all the collections. This solution was illustrated in Section 10.4.2.
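As a sketch of that alternative (the class and method names here are illustrative, not those of the earlier example): each thread adds to its own unsynchronized list via a ThreadLocal, and a wrapper object provides combined access across all of the lists.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

//Hypothetical sketch: each thread appends to its own unsynchronized list,
//and a wrapper offers combined access to all of them on demand.
public class PerThreadCollection
{
  //Registry of every per-thread list; touched only when a new thread first
  //appears and when the combined contents are requested.
  private final List allLists = new ArrayList();

  //Gives each thread its own ArrayList, so add( ) needs no synchronization.
  private final ThreadLocal localList = new ThreadLocal() {
    protected Object initialValue() {
      List list = new ArrayList();
      synchronized (allLists) { allLists.add(list); }
      return list;
    }
  };

  //Hot path: only the owning thread ever updates its own list.
  public void add(Object o) {
    ((List) localList.get()).add(o);
  }

  //Global access: copy out the contents of every per-thread list.
  //(A production version would need to consider visibility of in-progress
  //updates; this is only a sketch of the structure.)
  public List globalContents() {
    List result = new ArrayList();
    synchronized (allLists) {
      for (Iterator i = allLists.iterator(); i.hasNext(); )
        result.addAll((List) i.next());
    }
    return result;
  }
}

The frequent operation (the add) avoids synchronization entirely; the cost is pushed onto the much rarer global-access operation.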

Failing to identify a shared resource at the design phase can be expensive. In some cases, a simple class substitution of a redesigned class can reduce the performance drawback of the shared resource to acceptable performance levels. But in many cases, a complete redesign of part or all of the application may be needed to achieve adequate performance.

13.4.1.1 Transactions

The purpose of a transaction is to ensure consistency when using shared resources. If there are no possible conflicts across sharers of those resources, there is no need for a transaction. Removing unnecessary transactions is the simplest and most effective performance optimization for applications that include transactions. So, if you do not need a transaction, do not use one. Most systems that provide transactions usually have a "transactionless" mode, i.e., a way to access any shared resources without entering a defined transaction. This mode normally has better performance than the transaction mode.

When transactions are absolutely necessary, your design goal should be to minimize the time spent in the transaction. If transactions extend for too long and cause performance problems, a complete redesign of a significant part of the application is often needed.

You also need to be aware of the shared resources used by the transacting system itself. Any system providing transaction semantics to your application uses an internal set of shared resources: this is necessary to ensure that the transactions are mutually consistent. These transaction-system internal shared resources invariably have some product-specific idiosyncrasies that result in their being used more or less efficiently. These idiosyncrasies can have a large effect on performance. Many products have a performance-tuning section within their documentation, detailing how best to take advantage of idiosyncrasies in their product (more usually termed "features").

Even where short transactions are designed into an application, the application may enter unexpectedly long transactions in two common situations. The first situation is when bugs occur in the transaction, and the second, when the user has control over the transaction. Because unintended long transactions can occur, transactions should always have a timeout imposed on them: this is usually fairly easy to incorporate in Java using a separate high-priority timeout thread.
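As a sketch of such a timeout, assuming a hypothetical Transaction type with isComplete( ) and abort( ) methods (stand-ins for whatever transaction API is actually in use):

//Hypothetical sketch: Transaction, isComplete( ), and abort( ) stand in for
//whatever transaction API the application actually uses.
interface Transaction
{
  boolean isComplete();
  void abort();
}

public class TransactionTimeout extends Thread
{
  private final Transaction transaction;
  private final long timeoutMillis;

  public TransactionTimeout(Transaction transaction, long timeoutMillis) {
    this.transaction = transaction;
    this.timeoutMillis = timeoutMillis;
    setPriority(Thread.MAX_PRIORITY);  //run promptly when the timeout expires
    setDaemon(true);                   //do not keep the VM alive on our account
  }

  public void run() {
    try {
      Thread.sleep(timeoutMillis);
    } catch (InterruptedException e) {
      return;  //the transaction completed in time and cancelled the timer
    }
    if (!transaction.isComplete())
      transaction.abort();  //force the overlong transaction to terminate
  }
}

The code that opens the transaction would start one of these threads at the beginning of the transaction and interrupt it on commit.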

A standard way to convert naturally long transactions into short ones is to maintain sets of changes, rather like undo/redo logs. In this design pattern, changes are abstracted into separate objects, and ordered lists of these changes can be "played." With this design pattern, the changes that occur in what would be a long transaction are leisurely collected without entering the transaction, and then a short transaction rapidly "plays" all the changes. This pattern cannot be used exactly as described if the precise time of a particular change is important. However, variations of this pattern can be applied to many cases.
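A minimal sketch of the pattern, with hypothetical names; beginTransaction( ) and commit( ) are placeholders for the real transaction API:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

//Hypothetical sketch of the change-list pattern: changes are collected
//outside any transaction and then "played" inside one short transaction.
public class ChangeLog
{
  //Each recorded change knows how to apply itself to the shared target.
  public interface Change {
    void apply(Object target);
  }

  private final List changes = new ArrayList();

  //Collected at leisure, with no transaction open.
  public void record(Change change) {
    changes.add(change);
  }

  //Played quickly: the transaction is open only while the changes are applied.
  public void play(Object target) {
    beginTransaction();
    for (Iterator i = changes.iterator(); i.hasNext(); )
      ((Change) i.next()).apply(target);
    commit();
    changes.clear();
  }

  private void beginTransaction() { /* assumed: start the short transaction */ }
  private void commit()           { /* assumed: commit the short transaction */ }
}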

13.4.1.2 Locking

Locking is a technique for ensuring access to a shared resource while maintaining coherence of state within the shared resource. There are a variety of lock types, from exclusive (only one sharer has any type of access) to various types of shared locks (allowing any of a set of sharers simultaneous access, or all sharers access to a restricted set of capabilities of the shared resource, or both of these combined[8]).

[8] For example, one type of write-lock allows read access by multiple sharers to the shared resource while restricting write access to just one sharer.

Locking can be expensive. Overhead includes the locking and unlocking overhead itself; the fact that locks must themselves be shared resources, implying extra shared-resource considerations; the explicit serialization of activities that results from using locks; and the possibility of deadlock when two sharers each try to obtain a lock held by the other, causing both sharers to "freeze" activity (see Chapter 10 for a concrete example).

These drawbacks mean you should consider locking only when the design absolutely requires it. For instance, locking must be used when there is a requirement for definite deterministic noncorrupted access to a shared resource. To illustrate: a bank account with no overdraft facilities must serialize access to the account and ensure that each deposit and withdrawal takes place without any other activities legally occurring at the same time. The balance accessed for display also needs to be accurate. You do not want to display a balance of $100 at the ATM window, then have the ATM deny withdrawal of $50 because the actual balance is lower due to a check for $55 being processed at the same time. From the bank's point of view, both transactions might go through, and the bank is owed $5 by someone it did not want to lend to. Or the customer is given the wrong information and suffers frustration. To avoid these two situations, locking is required.

Occasionally, locking improves performance by reducing the conflicts otherwise generated by simultaneous access to a shared resource. Consider the situation in which objects are concurrently added to a collection. You can define the addition operation to provide exclusive updates either by locking the update method or by throwing an exception if the update is not exclusive. The lockable update method can be defined easily with a synchronized method:

public synchronized void add(Object o)
{
  unsynchronized_add(o);
}

The exception-throwing method would be a little more complex:

public void add(Object o)
  throws InUseException
{
  //throws an InUseException if I am currently already in use
  setInUse( );
  try {
    unsynchronized_add(o);
  } finally {
    setNotInUse( );  //release the flag even if the add fails
  }
}

The advantage of the second definition is that the locking overhead is avoided during the update.[9] This definition is suitable for cases where there is unlikely to be much concurrent execution of the add( ) method: the exception is not thrown very often. But when there are frequent simultaneous updates to the collection, you will encounter the exception more often than not. For this latter situation, your performance will be better if you use the first synchronized implementation of the add( ) method and explicitly serialize the updates.

[9] In fact, the setInUse( ) method probably needs to be synchronized, so this pattern is useful only for avoiding synchronizing methods that might take a long time. The long synchronization is replaced by a short synchronization and a possible thrown exception.
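For completeness, setInUse( ) and setNotInUse( ) might be sketched as follows, keeping the synchronized section as short as possible (InUseException is the application-defined exception from the example above):

private boolean inUse = false;

//Quick synchronized check-and-set: fails fast rather than blocking for the
//whole (possibly long) update.
private synchronized void setInUse()
  throws InUseException
{
  if (inUse)
    throw new InUseException();
  inUse = true;
}

private synchronized void setNotInUse() {
  inUse = false;
}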

13.4.1.3 Parallelism

For performance reasons, you should try to design parallelism into the application wherever possible. The general guideline is to assume that you parallelize every activity. One of the tasks for the design phase, then, is to identify what cannot be parallelized. This guideline is fairly cost-effective. It is always easy to move from a parallelized design back to the nonparallelized version, since the nonparallelized version is essentially a degenerate case of the more general version. But retrofitting parallelism to the application is often considerably more difficult. Starting with an application designed to work without any parallelism and trying to restructure it to add in parallelism can be extremely difficult and expensive.

Any parallelism designed into the application should take advantage of multiple processors. This can be evaluated in the design phase by predicting the performance of the application on single- and multiple-CPU machines.

Once the application has been designed to run with parallelism, you can decide at the implementation stage, or possibly even at runtime, whether to use the parallelism. A low degree of parallelism (e.g., 5 threads, not 500 threads) almost always improves the performance of an application. But parallelism has overhead that can swamp the advantages. The overhead comes from contention in trying to use shared resources and delays from the communication needed for synchronization. Additional overhead comes from starting extra threads and distributing and coordinating work between the threads (there may also be overhead from caches that deal with twice the data throughput in the same space).

When designing the application to run activities in parallel, you need to focus on shared resources, especially the time spent using these resources. Increasing the time spent exclusively using a shared resource adversely affects all other activities using that resource. For example, the CPU is the most basic shared resource. The more separate threads (and processes) using the CPU, the smaller the time slices allocated to each thread relative to the time spent waiting to acquire the CPU by each thread. Consequently, the time actually taken for computing any particular activity becomes longer, since this is the sum of the time slices allocated to carry out the computation together with the sum of times waiting to gain a time slice.

And the degradation is not linear: it gets dramatically worse as the CPU approaches saturation. Consider a CPU that is currently 10% used. If there is a computation that normally takes five seconds when this CPU has no work, then as the CPU can currently allocate 90% of its power to that computation, the computation will instead take just over 10% longer: the actual expected time is 5/0.9 = 5.55 seconds.

If instead the CPU were 40% utilized (i.e., 60% available), the expected time for that computation would instead be 5/0.6 = 8.3 seconds. Now look what happens to a 90% utilized CPU (10% available). The expected time for the computation is now 5/0.1 = 50 seconds. And a 99% busy CPU is going to make this computation take 500 seconds (see Table 13-2). You can see the need to keep spare capacity in the CPU to avoid this runaway degradation in performance.

Table 13-2. Theoretical computation time (single-threaded) depending on CPU availability

CPU used    CPU available     Computation time
0% used     100% available    5 seconds
10% used    90% available     5/0.9 = 5.55 seconds
40% used    60% available     5/0.6 = 8.3 seconds
90% used    10% available     5/0.1 = 50 seconds
99% used    1% available      5/0.01 = 500 seconds

You can also predict the effect threading can have if you can parallelize any particular calculation, even on a single-CPU machine. If one thread does the calculation using 10% of the CPU in five seconds, and you can fully parallelize the calculation, then two threads (ideally) each take half the time to do their half of the calculation. Assuming that the calculation does not saturate the CPU when running,[10] then if the two halves run together, each half takes 2.5 seconds on an unutilized CPU. But since there are two threads and each thread takes 10% of the CPU, each thread sees only 90% availability of the CPU. This means that each half calculation takes 2.5/0.9 = 2.8 seconds. Both calculations run at the same time (that is why the CPU has double the utilization), so this is also the total time taken. Time-slicing adds some additional overhead, but this will leave the expected time well below the three-second mark.

[10] CPU availability is indicated in the example since the calculation loads the CPU only by 10%; presumably, there is some disk activity required by the calculation.

So even on a single-CPU machine, parallelizing this calculation enables it to run faster. This can happen because the calculation considered here is not a pure CPU calculation: it obviously spends time doing some I/O (perhaps a database query), and thus it can be parallelized effectively. If the calculation were number crunching of some sort, the CPU utilization would be 100%, and parallelizing the calculation would actually slow it down.

For example, suppose the number-crunching calculation took five seconds and caused a 100% CPU utilization on an otherwise unworked machine. Running the same calculation on a 50% utilized machine would take 5/0.5 = 10 seconds. So theoretically, if you can parallelize this calculation into two equal halves running together on an otherwise unutilized machine, each half is allocated 50% of the CPU utilization. Each takes half the time of the unparallelized calculation running on a 50% utilized machine (which we just calculated to take 10 seconds), i.e., each parallelized half calculation takes 10/2 = 5 seconds, both running simultaneously. So the total time taken is still five seconds, and there is no overall speedup. If you add in the slight factor due to CPU time-slicing overhead, the total time increases beyond the five-second mark, so it is actually slower to parallelize this calculation. This is what we should intuitively expect for any process that already takes up all the CPU's power.

Now what about the multiple CPU case: do we get a benefit here? Well, for a two-CPU machine, the CPU synchronization overhead may be 5% (this is normally an overestimate). In this case, each part of the parallelized application effectively gets a 5% utilized CPU of its own. For the example, the expected times taken are 2.5/0.95 = 2.63 seconds. And since the two threads are running in parallel, this is also the total expected time taken. See Table 13-3.[11]

[11] In the case of the number-crunching calculation, you have the exact same calculation resulting in 2.63 seconds. So again, as you intuitively expect, the two-CPU machine lets the CPU-swamping parallelized calculation take just over half the time of the original unparallelized version.

Table 13-3. Theoretical computation time depending on number of CPUs and non-CPU-bound threads

                                 CPU used      CPU available   Computation time
1 CPU, 1 thread                  10%           90%             5 seconds
1 CPU, 2 threads serialized      10%/thread    90%/thread      2.5 + 2.5 = 5 seconds
1 CPU, 2 threads parallelized    10%/thread    90%/thread      max(2.5/0.9, 2.5/0.9) = 2.8 seconds
2 CPUs, 2 threads parallelized   5%/CPU        95%/CPU         max(2.5/0.95, 2.5/0.95) = 2.6 seconds

However, CPU overhead is increased for each additional CPU, as they all have to synchronize with each other. This means that almost another 5% utilization is added to the overhead of each CPU: in fact, Dan Graham of IBM has determined that the overhead is multiplicative, so that if two CPUs each have a 5% utilization (0.95 x 100% free) from CPU parallelism, then three CPUs each have a 9.75% utilization (0.95 x 0.95 x 100% free), and four CPUs each have a 14.26% utilization (0.95 x 0.95 x 0.95 x 100% free), and so on. See Table 13-4.

Table 13-4. Theoretical computation time depending on number of CPUs and threads

                                  CPU used      CPU available   Computation time
1 CPU, 1 thread                   0%            100%            100 seconds
2 CPUs, 2 threads parallelized    5%/CPU        95%/CPU         100/0.95/2 = 52.6 seconds
3 CPUs, 3 threads parallelized    9.75%/CPU     90.25%/CPU      100/0.9025/3 = 36.9 seconds
9 CPUs, 9 threads parallelized    34%/CPU       66%/CPU         100/0.66/9 = 16.8 seconds
10 CPUs, 10 threads parallelized  37%/CPU       63%/CPU         100/0.63/10 = 15.9 seconds
19 CPUs, 19 threads parallelized  60.3%/CPU     39.7%/CPU       100/0.397/19 = 13.26 seconds
20 CPUs, 20 threads parallelized  62.3%/CPU     37.7%/CPU       100/0.377/20 = 13.26 seconds
21 CPUs, 21 threads parallelized  64.2%/CPU     35.8%/CPU       100/0.358/21 = 13.30 seconds
30 CPUs, 30 threads parallelized  77.4%/CPU     22.6%/CPU       100/0.226/30 = 14.75 seconds

Clearly, there are diminishing returns from adding CPUs. In fact, at some point, adding CPUs actually makes performance worse. For example, let's suppose our number-crunching application is fully parallelizable to any number of CPUs, and that on a single unutilized CPU it takes 100 seconds. On a two-CPU machine, it takes 100 seconds divided by 2 (the number of CPUs, which is how many parts you can parallelize the calculation by) and then divided by 0.95 (the fraction of each CPU left available after the parallelization overhead), giving 52.6 seconds.

For three CPUs, this time is 100/(0.95 x 0.95 x 3) = 36.9 seconds. So far, so good. Now, let's move on to 20 CPUs. This works out as 100/(0.95^19 x 20) = 13.26 seconds. But 21 CPUs takes 100/(0.95^20 x 21) = 13.30 seconds, actually more time. In fact, for this particular sequence, 20 CPUs gives the minimum time. Beyond that, the overhead of parallelizing CPUs makes things successively worse, and each additional CPU makes the fully parallelized calculation take longer.

In addition, well before the 20th CPU is added, you reach a point where each additional CPU is no longer cost-effective: 6 CPUs give a time of 21.5 seconds for the calculation, while 7 CPUs reduce that by only a couple of seconds, to 19.4 seconds. A 10% reduction in time does not justify the cost of an extra CPU.
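The figures in Table 13-4 come from the simple estimate time = base/(0.95^(n-1) x n), and a few lines of code reproduce them and locate the point of diminishing returns (the 5% multiplicative per-CPU overhead is, of course, just the assumption used in this example):

//Reproduces the Table 13-4 estimates: time(n) = base / (0.95^(n-1) * n),
//assuming a fully parallelizable task and a multiplicative 5% overhead per CPU.
public class CpuScalingEstimate
{
  public static void main(String[] args) {
    double base = 100.0;            //seconds on one unutilized CPU
    double overheadFactor = 0.95;   //assumed per-CPU synchronization overhead
    for (int cpus = 1; cpus <= 30; cpus++) {
      double available = Math.pow(overheadFactor, cpus - 1);
      double time = base / (available * cpus);
      System.out.println(cpus + " CPUs: " + time + " seconds");
    }
  }
}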

The general calculation presented here applies to other shared resources too. In a similar way, you can determine the performance effects of adding other additional shared resources to the application and predict whether the advantages will outweigh the disadvantages.

Note that these are all general predictions, useful for estimating the benefits of adding shared resources. The actual tested results can sometimes differ dramatically. For example, parallelizing some searches can provide a tenfold speedup on a single CPU because the increased variation in starting points of the solution space means that the probability of one of the searches starting much nearer the solution is greatly increased (see Section 10.9 in Chapter 10). But this is an exception. The cutoff where adding a shared resource gives a useful speedup is usually quite small, so you can mostly assume that a little parallelizing is good, but a lot of parallelizing is too much of a good thing.

All the calculations we made in this section assumed full load-balancing. Each thread (sharer) took exactly the same share of time to complete its task, and thus the total time was that of any one sharer since they all operated simultaneously. In reality, this is unlikely. If the sharers are unbalanced (as they usually are), the sharer that takes the longest to complete its activity is the one limiting the performance of the system. And the less balanced the various sharers, the worse the performance. This is extremely important as the application scales across different workloads. An unbalanced workload means that one resource is used far more intensively than others. It also means that all other parallel resources are being underutilized, and that the overused resource is highly likely to be a performance bottleneck in the system.

13.4.1.4 Data parallelism

If you have a large amount of data that needs to reside on disk, a typical strategy for improving access and searches of the data is to split up the data among many different files (preferably on separate disks). This is known as partitioning the data. Partitioning the data provides support for parallel access to the data, which takes advantage of I/O and CPU parallelism.

There are many data-partitioning schemes. Some of the more popular are:

Schema partitioning

Separates the data into logically distinct datasets and allocates each dataset to a separate file/disk.

Hash partitioning

Places data in multiple files/disks with location based on a hash function (a minimal sketch follows this list).

Range partitioning

Splits data into ranges, and each range is allocated a separate file/disk; for example, a-c in disk1, d-f in disk2, etc.

Expression partitioning

Uses a logical expression to determine the mapping of data to file/disk. If the partitions become unbalanced, the expression must be refined and the data repartitioned.

Round-robin partitioning

Allocates data to disks sequentially.
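For example, a minimal sketch of hash partitioning, with assumed file-array and method names (not any particular library's API):

import java.io.File;

//Minimal sketch of hash partitioning: the partition file for a record is
//chosen from the hash of its key.
public class HashPartitioner
{
  private final File[] partitions;   //assumed: one file per disk/partition

  public HashPartitioner(File[] partitions) {
    this.partitions = partitions;
  }

  //Map a key to one of the partition files.
  public File partitionFor(Object key) {
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % partitions.length;  //force non-negative
    return partitions[index];
  }
}

Records with the same key always map to the same file, so a lookup needs to read only one partition, and independent partitions can be searched in parallel.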

Partitioning schemes work best when used with indexes. Indexes also make searches much faster.

Although your design does not need to support a specific partitioning scheme, it should support partitioning in general if it is relevant to your application.

13.4.2 Scaling

The performance characteristics of an application vary with the number of different factors the application can deal with. These variable factors can include the number of users, the amount of data dealt with, the number of objects used by the application, etc. During the design phase, whenever considering performance, you should consider how the performance scales as the load on the application varies. It is usually not possible to predict (or measure) the performance for all possible variations of these factors. But you should select several representative sets of values for the factors, and predict (and measure) performance for these sets. The sets should include factors for when the application:

  • Has a light load

  • Has a medium load

  • Has a heavy load

  • Has a varying load predicted to represent normal operating conditions

  • Has spiked loads (where the load is mostly "normal" but occasionally spikes to the maximum supported)

  • Consistently has the maximum load the application was designed to support

You need to ensure that your scaling conditions include variations in threads, objects, and users, and variations in network conditions if appropriate. Measure response times and throughput for the various different scenarios and decide whether any particular situation needs optimizing for throughput of the system as a whole or for response times for individual users.
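A bare-bones harness for this kind of measurement might look like the following sketch; the load levels, request counts, and the doRequest( ) body are all placeholders to be replaced with application-specific work, and individual response times can be recorded inside doRequest( ) in the same way.

//Rough sketch of a scaling measurement: run the same task from an increasing
//number of threads and report elapsed time and throughput for each load.
public class ScalingTest
{
  public static void main(String[] args) throws InterruptedException {
    int[] loads = {1, 5, 20, 50};   //light, medium, heavy, maximum
    for (int i = 0; i < loads.length; i++)
      measure(loads[i]);
  }

  static void measure(int threads) throws InterruptedException {
    final int requestsPerThread = 100;
    Thread[] workers = new Thread[threads];
    long start = System.currentTimeMillis();
    for (int t = 0; t < threads; t++) {
      workers[t] = new Thread() {
        public void run() {
          for (int r = 0; r < requestsPerThread; r++)
            doRequest();
        }
      };
      workers[t].start();
    }
    for (int t = 0; t < threads; t++)
      workers[t].join();
    long elapsed = System.currentTimeMillis() - start;
    int total = threads * requestsPerThread;
    System.out.println(threads + " threads: " + elapsed + " ms total, " +
                       (total * 1000.0 / elapsed) + " requests/second");
  }

  static void doRequest() {
    //stand-in for the real operation: parsing, database access, a remote call
    try { Thread.sleep(10); } catch (InterruptedException e) { }
  }
}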

It is clear that many extra factors need to be taken into account during scaling. The tools you have for profiling scaling behavior are fairly basic: essentially, only graphs of response times or throughput against scaled parameters. It is typical to have a point at which the application starts to have bad scaling behavior: the knee or elbow in the response-time curve. At that point, the application has probably reached some serious resource conflict that requires tuning so that "nice" scaling behavior can be extended further. Clearly, tuning for scaling behavior is likely to be a long process, but you cannot shortcut this process if you want to be certain your application scales.[12]

[12] By including timer-based delays in the application code, at least one multiuser application has deliberately slowed response times for low-scaled situations. The artificial delay is reduced or cut out at higher scaling values. The users perceive a system with a similar response time under most loads.

13.4.3 Distributed Applications

The essential design points for ensuring good performance of distributed applications are:

  • Supporting asynchronous communications

  • Decoupling process activities from each other in such a way that no process is forced to wait for others (using queues achieves this)

  • Supporting parallelism in the design of the workflows

Determining the bottleneck in a distributed application requires looking at the throughput of every component:

  • Client and server processes

  • Network transfer rates (peak and average)

  • Network interface card throughput

  • Router speeds

  • Disk I/O

  • Middleware/queuing transfer rates

  • Database access, update, and transaction rates

  • Operating-system loads

Tuning any component other than the current bottleneck gives no improvement. Peak performance of each component is rarely achieved. You need to assume average rates of performance from the underlying resource and expect performance based on those average rates.

Distributed applications tend to exaggerate any performance characteristics. So when performance is bad, the application tends to slow significantly more than in nondistributed applications. The distributed design aspects should emphasize asynchronous and concurrent operations. Typical items to include in the design are:

  • Queues

  • Asynchronous communications and activities

  • Parallelizable activities

  • Minimized serialization points

  • Balanced workloads across multiple servers

  • Redundant servers and automatic switching capabilities

  • Activities that can be configured at runtime to run in different locations

  • Short transactions

The key to good performance in a distributed application is to minimize the amount of communication necessary. Performance problems tend to be caused by too many messages flying back and forth between distributed components. Bell's rule of networking applies: "Money can buy bandwidth, but latency is forever."[13]

[13] Thomas E. Bell, "Performance of distributed systems," a paper presented at the ICCM Capacity Management Forum 7, San Francisco, October 1993.

Unfortunately, communication overhead can be incurred by many different parts of a distributed application. There are some general high-level guidelines:

  • Allow the application to be partitioned according to the data and processing power. Any particular task should be able to run in several locations, and the location that provides the best performance should be chosen at runtime. Usually the best location for the task is where the data required for the task is stored, as transferring data tends to be a significant overhead.

  • Avoid generating distributed garbage. Distributed garbage collection can be a severe overhead on any distributed application.

  • Reduce the costs of keeping data synchronized by minimizing the duplication of data.

  • Reduce data-transfer costs by duplicating data. This conflicts directly with the last point, so the two techniques must be balanced to find the optimal data duplication points.

  • Cache distributed data wherever possible.

  • Use compression to reduce the time taken to transfer large amounts of data.

13.4.4 Object Design

My advice for object design is to use interfaces and interface-like patterns throughout the code. Although there are slightly higher runtime costs from using interfaces, that cost is well outweighed by the benefits of being able to easily replace one object implementation with another. Using interfaces means you can design with the option to replace any class or component with a faster one. Consider also where the design requires comparison by identity or by equality and where these choices can be made at implementation time.

The JDK classes are not all designed with interfaces. Those JDK classes and other third-party classes that do not have interface definitions should be wrapped by your own classes so that their use can be made more generic. (Applications that need to minimize download time, such as applets, may need to avoid the extra overhead that wrapping causes.)

Object creation is one significant place where interfaces fall down, since interfaces do not support constructor declarations, and constructors cannot return an object of a different class. To handle object creation in a way similar to interfaces, you should use the factory pattern. The factory design pattern recommends that object creation be centralized in a particular factory method. So rather than calling new Something( ) when you want to create an instance of the Something class, you call a method such as SomethingFactory.getNewSomething( ), which creates and returns a new instance of the Something class. Again, this pattern has performance costs, as there is the overhead of an extra method call for every object creation, but the pattern provides more flexibility when it comes to tuning.
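A minimal sketch of the pattern, using the Something name from the text plus an assumed SomethingImpl implementation class:

//Hypothetical sketch: Something is an application interface, SomethingImpl
//one concrete implementation of it.
interface Something { }

class SomethingImpl implements Something { }

public class SomethingFactory
{
  //Called instead of new SomethingImpl( ), so a faster implementation or a
  //recycled instance can later be substituted in this one place.
  public static Something getNewSomething() {
    return new SomethingImpl();
  }
}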

Design for reusable objects: do not unnecessarily throw away objects. The factory design pattern can help, as it supports the recycling of objects. Canonicalize objects where possible (see Section 4.2.4). Keep in mind that stateless objects can usually be safely shared, so try to move to stateless objects where appropriate.

Using stateless objects is a good way to support changing algorithms easily by implementing different algorithms in particular types of objects. For example, see Section 9.2, where different sorting algorithms are implemented in various sorting classes. The resulting objects can be interchanged whenever the sorting algorithm needs to be varied.
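For example, a sketch of such interchangeable, stateless algorithm objects (these names are illustrative, not the classes used in Chapter 9):

//Illustrative names only: one stateless algorithm object per sorting strategy.
public interface Sorter
{
  void sort(int[] data);
}

//One strategy; being stateless, a single instance can be shared freely.
class InsertionSorter implements Sorter
{
  public void sort(int[] data) {
    for (int i = 1; i < data.length; i++) {
      int value = data[i];
      int j = i - 1;
      while (j >= 0 && data[j] > value) {
        data[j + 1] = data[j];
        j--;
      }
      data[j + 1] = value;
    }
  }
}

Swapping in a different Sorter implementation changes the algorithm without touching the calling code.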

Consider whether to optimize objects for update or access. For example, a "statistics-calculating" object might update its average and standard deviation each time a value is added to it, thus slowing down updates but making access of those statistics lightning-fast. Or, the object could simply store added values and calculate the average and standard deviation each time those statistics are accessed, making the update as fast as possible, but increasing the time for statistics access.
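A sketch of the access-optimized variant (the update-optimized variant would simply store the values and compute the statistics on demand); the incremental mean and squared-deviation update shown is one standard way to do it:

//Sketch of a statistics object optimized for access: the running mean and
//sum of squared deviations are maintained on every update, so average( )
//and standardDeviation( ) are cheap to call.
public class RunningStatistics
{
  private int count = 0;
  private double mean = 0.0;
  private double sumSquaredDeviations = 0.0;

  public void add(double value) {
    count++;
    double delta = value - mean;
    mean += delta / count;                        //incremental mean update
    sumSquaredDeviations += delta * (value - mean);
  }

  public double average() {
    return mean;
  }

  public double standardDeviation() {
    return count > 1 ? Math.sqrt(sumSquaredDeviations / (count - 1)) : 0.0;
  }
}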

13.4.5 Techniques for Predicting Performance

Predicting performance is the mainstay of performance tuning at the analysis and design stages. Often it is the experience of the designers that steers design one way or another. Knowing why a particular design element has caused bad performance in another project allows the experienced designer to alter the design in just the right way to get good performance.

Some general guidelines can guide the application designer and avoid bad performance. In the following sections we consider some of these guidelines.

13.4.5.1 Factor in comparative performance of operations

Different types of operations have widely varying execution times. Some design abstractions decouple the type of intercomponent-communication mechanism from any specific implementation. The design allows the intercomponent communication to be based on a local or a remote call, so components can be placed very flexibly. However, the performance of different types of calls varies hugely, and this variation helps determine whether a particular design can perform fast enough.

Specifically, if local procedure calls have an average time overhead of one unit, a local interprocess call incurs an overhead of about 100 units. On the same scale, a remote procedure call (RPC) on a local area network takes closer to 1000 time units, and an RPC routed across the Internet likely takes over 10,000 time units.

Applying these variations to the design and factoring the number of messages that components need to send to each other may rule out some distributed architectures. Alternatively, the overhead predictions may indicate that a redesign is necessary to reduce the number of intercomponent messages.

Note also that process startup overhead may need to be considered. For example, Common Gateway Interface (CGI) scripts for HTTP servers typically need to be started for every message sent to the server. For this type of design, the time taken to start up a script is significant, and when many scripts are started together, this can slow down the server considerably. Similarly, if your design allows many thread startups within short intervals, you need to determine whether the architecture can handle this, or if it may be a better option to redesign the applications to use thread pools (see Section 10.7 in Chapter 10).

13.4.5.2 Consider the relative costs of different types of accesses and updates

Accesses and updates to system memory are always going to be significantly faster than accesses and updates to other memory media. For example, reads from a local disk can be a thousand times slower than memory access, and disk writes are typically half as fast as disk reads. Random access of disks is significantly slower than sequential access.

Recognizing these variations may steer your design to alternatives you might otherwise not have considered. For example, one application server that supports a shared persistent cache redesigned the persistent cache update mechanism to take account of these different update times (the GemStone application server, http://www.gemstone.com). The original architecture performed transactional updates to objects by writing the changes to the objects on the disk, which required random disk access and updates. The modified architecture wrote all changes to shared memory as well as to a sequential journaling log file (for crash recovery). Another asynchronous process handled flushing the changes from shared memory to the objects stored on disk. Because disk navigation to the various objects was significant, this change in architecture improved performance by completely removing that bottleneck from the transaction.

13.4.5.3 Use simulations and benchmarks

Ideally, you have a detailed simulation of your application that allows you to predict the performance under any set of conditions. More usually, you have a vague simulation that has some characteristics similar to your intended application. It is important to keep striving for the full detailed simulation to be able to predict the performance of the application. But since your resources are limited, you need to project measurements as close as possible to your target application.

You should try to include loads and delays in your simulation that approximate to the expected load of the application. Try to acquire the resources your finished application will use, even if those resources are not used in the simulation. For example, spawn as many threads as you expect the application to use, even if the threads do little more than sleep restlessly.[14]

[14] Sleeping restlessly is calling Thread.sleep( ) in a loop, with the sleep time set to some value that requires many loop iterations before the loop terminates. Other activities can be run intermittently in the loop to simulate work.
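A sketch of such a "restless sleeper" thread, under the assumption that it only needs to occupy a thread slot and optionally perform intermittent simulated work:

//Sketch of a placeholder thread that occupies a thread slot (stack, scheduling
//overhead) without consuming much CPU.
public class RestlessSleeper extends Thread
{
  public void run() {
    //sleep in many short increments so simulated work can be slotted in
    for (int i = 0; i < 6000; i++) {      //roughly ten minutes in total
      try {
        Thread.sleep(100);
      } catch (InterruptedException e) {
        return;
      }
      //intermittent simulated work could be run here
    }
  }
}

Spawning as many of these as the finished application will use threads keeps the thread-related costs present in the simulation.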

Graphing the results from increasing various application-specific parameters allows you to predict the performance of the application under a variety of conditions. It is worth checking vendor or standard benchmarks if you need some really basic statistics, but bear in mind that those benchmarks seldom have much relevance to a particular application.

13.4.5.4 Consider the total work done and the design overhead

Try stripping your design to the bare essentials or going back to the specification. Consider how to create a special-purpose implementation that handles the specification for a specific set of inputs. This can give you an estimate of the actual work your application will do. Now consider your design and look at the overhead added by the design for each piece of functionality. This provides a good way to focus on the overhead and determine if it is excessive.

13.4.5.5 Focus on shared resources

Shared resources almost always cause performance problems if they have not been designed to optimize performance. Ensure that any simulation correctly simulates the sharing of resources, and use prediction analyses such as those in Section 13.4.1.3 earlier in this chapter to predict the behavior of multiple objects using shared resources.

13.4.5.6 Predict the effects of parallelism

Consider what happens when your design is spread over multiple threads, processes, CPUs, machines, etc. This analysis can be quite difficult without a simulation and test bed, but it can help to identify whether the design limits the use of parallelism.

13.4.5.7 Assess the costs of data conversions

Many applications convert data between different types (e.g., between strings and numbers). From your design, you should be able to determine the frequency and types of data conversions, and it is fairly simple to create small tests that determine the costs of the particular conversions you are using. Don't forget to include any concurrency or use of shared resources in the tests. Remember that external transfer of objects or data normally includes some data conversions. The cost of data conversion may be significant enough to direct you to alter your design.
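A crude sketch of such a test, timing String-to-int conversion and back (the repetition count and the conversions themselves should mirror what your design actually does, and VM warm-up effects mean the results are indicative only):

//Rough micro-test of one data conversion: String to int and back.
public class ConversionCost
{
  public static void main(String[] args) {
    int repetitions = 1000000;
    String s = "123456";

    long start = System.currentTimeMillis();
    int parsed = 0;
    for (int i = 0; i < repetitions; i++)
      parsed = Integer.parseInt(s);          //String -> int
    long parseTime = System.currentTimeMillis() - start;

    start = System.currentTimeMillis();
    String formatted = null;
    for (int i = 0; i < repetitions; i++)
      formatted = Integer.toString(parsed);  //int -> String
    long formatTime = System.currentTimeMillis() - start;

    System.out.println("parseInt: " + parseTime + " ms, toString: " +
                       formatTime + " ms for " + repetitions + " conversions");
  }
}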

13.4.5.8 Determine whether batch processing is faster

Some repeated tasks can be processed as a batch instead of one at a time. Batch processing can take advantage of a number of efficiencies, such as accessing and creating some objects just once, eliminating some tests for shared resources, processing tasks in optimal order, avoiding repeated searches, etc.

If any particular set of tasks could be processed in batch mode, consider the effect this would have on your application and how much faster the processing could be. The simplest conceptual example is adding characters one by one to a StringBuffer, as opposed to collecting the characters into a char array and appending them all in one call. Appending via the char array is much faster for any significant number of characters.
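A sketch of that comparison: appending each character individually versus appending the whole char array in one call.

//Compares per-character appends with one batched append of a char array.
public class BatchAppend
{
  public static void main(String[] args) {
    int size = 1000000;
    char[] chars = new char[size];
    for (int i = 0; i < size; i++)
      chars[i] = (char) ('a' + (i % 26));

    long start = System.currentTimeMillis();
    StringBuffer oneByOne = new StringBuffer(size);
    for (int i = 0; i < size; i++)
      oneByOne.append(chars[i]);             //one method call per character
    long charTime = System.currentTimeMillis() - start;

    start = System.currentTimeMillis();
    StringBuffer batched = new StringBuffer(size);
    batched.append(chars);                   //one call appends the whole batch
    long batchTime = System.currentTimeMillis() - start;

    System.out.println("per-char: " + charTime + " ms, batched: " +
                       batchTime + " ms");
  }
}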