4.4 Performance

As was mentioned earlier, clusters are built in most cases to harness the resources of many machines to solve a single problem. In order for any problem to be solved on many hosts in a faster time than the single machine execution time, these hosts need to coordinate.

During application execution, this coordination takes the form of messages transmitted from one node to another. The communication patterns of these messages vary widely; some programs will spend most of their execution time performing computation, with very occasional messages reporting results and receiving a new assignment. This sort of program will typically perform independently of network performance. Other programs are constantly communicating between the parallel processes; small variances in network performance can cause huge differences in application performance in these cases.

4.4.1 Hardware Performance

Network performance can be characterized in terms of three basic metrics: latency, bandwidth, and topology. Latency is the time for a message to travel from the sender to a receiver via the network. Bandwidth is the rate at which data can be transmitted. The topology of the network is the underlying "shape" of the network. These attributes are the key determinants of network-based application performance for all applications. However, the nature of the application determines which of these attributes, if any, are important with respect to performance. Section 1.3 introduced the analysis of application performance with respect to latency and bandwidth in abstract terms; in this section we'll discuss these from the standpoint of the network hardware.

Latency is the measure of time for a message to transit the network from a sender to a receiver. Latency is important to application performance for a number of reasons. Whenever synchronous communication occurs, the receiver is waiting for messages to arrive. Fundamentally, this is the speed at which nodes in the cluster can coordinate themselves during a parallel computation. Application latency can range from upwards of 100 microseconds down to approximately 4 microseconds.

Bandwidth is the most straightforward metric of networks. It is the rate of data transmission. This is also an extremely important metric, as it governs how fast data can be exchanged between nodes.

There are many types of descriptions of bandwidth in a system, so some clarification is necessary. A network is composed of nodes with network interfaces, a set of switches, and network links connecting these parts together into some topology. All components in this system have individual bandwidth limitations, so determining what the actual available network bandwidth can be tricky. Also, as all of these function as limiting factors, many factors must be considered together in order to form a complete picture of available network bandwidth.

The most common network bandwidth quoted is the bandwidth of an individual link in the system. For example, gigabit Ethernet networks are composed of links running at 1 gigabit per second. Current generation single link bandwidths currently range from 100 Mbps (12.5 MB/s) to nearly 4 Gbps (500 MB/s). In the next year, products featuring link speeds between 4–10 Gbps are expected. It is worth noting that some network interfaces include multiple links in order to increase the available aggregate bandwidth.

Bandwidth available within network switch complexes also effects the usable bandwidth for nodes on a network. In a network composed of a single switch, the switch backplane bandwidth is an important factor. Backplane bandwidth is how much traffic the switch can handle simultaneously. Some switches, typically cheaper ones, are also limited in the number of packets per second (PPS) they are able to handle. While all of the links will still run at full speed, these two limitations cause packets to get dropped within the switch itself.

Bisection bandwidth is the other important measure of network bandwidth. Bisection bandwidth is defined as the minimum of the aggregate bandwidth between any two halves of a system. When communication is occurring between a number of stations on the network at the same time, contention inside of the switching complex can reduce the bandwidth available to communications regardless of the speed of links in the network. In many cases, individual links may not be usable because of a lack of available bisection bandwidth.

4.4.2 Software Performance

Network software is essentially responsible for moving data from system main memory to the NIC for transmission, and vice versa. This involves translating data into a format suitable for transmission, and translating data back from this formate upon receipt. Performance in this process is limited by a few factors. The first of these is the use of data copies in libraries and protocol stacks. In many cases, data starts in the user application, where it is copied into the network stack and processed. After this has completed, the data is copied across the I/O bus and is transmitted. In the case of inbound messages, data is received on the NIC, copied across the I/O bus into the network stack, processed, and finally copied into the application's memory.

A more optimized scheme would be to copy data directly from application memory to the NIC for transmission. This would avoid one of the copies mentioned in the previous scenario for each direction. This has been implemented in two ways. The first is user-level networking. In this case, all networking code exists in the user application; kernel facilities are used only to access the NIC. The other way to implement this is to use NICs with hardware network protocol processing support. This allows the NIC to process packets into an application usable form without involving the kernel at all.

Another performance problem is caused by the network stack's usage of the system CPU for computationally intensive tasks. One example of this is computation of TCP checksums. The performance of early generation gigabit Ethernet NICs were severely limited by the ability of the system CPU to compute TCP checksums quickly enough. Moreover, this computation also hampers the node's ability to perform its primary computational tasks, like the execution of user applications. This problem can only be solved by the addition of NIC support for network protocol processing.

As we mentioned previously, performance problems are caused by the frequent generation of interrupts during the usage of high speed network interfaces. Any time an interrupt is received, the current running task is stopped, causing a context shift. If this occurs every time a packet is received on the network, the host CPU will spend all of its time context shifting, without accomplishing much in between. NIC hardware assistance can help with this issue in two ways. Interrupt coalescing helps with this issue quite a bit. However, even when using interrupt coalescing, interrupt load scales with the number of packets received, not with the number of messages received. If more network protocol processing can be done in hardware, the host CPU will get interrupted less often, with more benefit.

Many of these issues have been addressed in both software and hardware. High end NIC manufacturers have begun addressing these issues, interconnect vendors more so than Ethernet NIC vendors. Myricom already addresses most of these issues in their software releases, because their hardware already supports these features. More details on tuning network software are presented in Chapter 5.