4.1 Network Hardware

Networks are composed of several types of components. First, there are the nodes (or peers) on the network. Each of these will have one or more network interface cards on its I/O buses. The term "card" is figurative in some cases; network interfaces have been integrated into many motherboards in recent years. Every interface is connected to the network fabric by a network link. The network fabric is composed of some number of network devices, interconnected in some topology. The functionality and performance of a network are composites of the particular components used.

4.1.1 Host Interfaces

All peers in a network must have an interface into the network itself. These interfaces usually take the form of add-on network peripherals. These network interface cards (NICs) are usually I/O boards that plug into the system. On PC hardware, the most common bus type is PCI (or its newer replacement, PCI-X).

The function of these NICs is to allow nodes on the network to send and receive messages on the network. In order to support these operations, NICs have several parts. One component is hardware that interfaces with the physical layer of the network, the wires that carry data in a network. This hardware will work with either copper or fiber physical layers. It can convert messages from data used on the NIC and in the host stack to wire format messages for transmission, and provides the reverse functionality for message receipt.

Another portion of the NIC performs a similar task for the I/O bus. For the purpose of simplicity, we will assume the NIC in question is PCI based. In order for applications running under the host operating system to transmit a message, the message data needs to be copied to the NIC from the application so that the actual message can be prepared for transmission. This copy is done over the PCI bus from the system's main memory. So this second part of the NIC is responsible for collecting data from the PCI bus for network transmission and transmitting data received off the network over the PCI bus to the system main memory.

All network access on a peer goes through a NIC. This means the rate at which data can be transmitted is limited both by the rate at which data can be copied into and out of the NIC via the I/O bus and by the rate at which the NIC can transmit and receive data on the network; the slower of the two sets the ceiling. In the days of 100 Mbps Ethernet, the speed of the links connecting nodes to the network was typically the limiting factor in hardware performance. At this point, high-end network vendors are able to nearly saturate even the fastest I/O buses available.
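The bottleneck argument above can be sketched in a few lines of Python. This is only an illustration: the bus figures are theoretical peaks computed as bus width times clock rate, they ignore protocol overhead, and the function names are ours.

```python
# Theoretical peak rates in megabits per second (Mb/s).
# Assumption: peak = bus width (bits) x clock (MHz); real buses deliver less.
BUS_PEAK_MBPS = {
    "PCI 32-bit/33 MHz": 32 * 33,
    "PCI-X 64-bit/133 MHz": 64 * 133,
}


def host_limit_mbps(bus_mbps, link_mbps):
    """A NIC can move data no faster than the slower of its two sides."""
    return min(bus_mbps, link_mbps)


def bottleneck(bus_mbps, link_mbps):
    """Name the side of the NIC that limits throughput."""
    return "bus" if bus_mbps < link_mbps else "link"


for bus_name, bus_mbps in BUS_PEAK_MBPS.items():
    for link_mbps in (100, 1000, 10000):
        limit = host_limit_mbps(bus_mbps, link_mbps)
        side = bottleneck(bus_mbps, link_mbps)
        print(f"{bus_name:22s} link={link_mbps:>5} Mb/s "
              f"limit={limit:>5} Mb/s bottleneck={side}")
```

With a 100 Mb/s link the link is the limit on either bus; with a 10 Gb/s link, even the faster bus in this table becomes the limit.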

4.1.2 Network Links

Network links are the channels connecting interfaces to devices and interconnecting devices. Fiber and copper are the typical link media, and the choice of medium affects several other properties of the link. Link speeds vary widely; 10 and 100 Mb Ethernet (Mb is megabit, not to be confused with MB, or megabyte) is still in common use, running at 10 and 100 Mb/s, respectively. Current-generation high-end interconnect links function at rates in excess of 2–3 Gb/s. Emerging technologies, like 10 Gb Ethernet and 4X InfiniBand, feature link speeds near 10 Gb/s.
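Because link speeds are quoted in bits and file sizes in bytes, a quick conversion helps when estimating transfer times. The sketch below assumes decimal units (1 MB = 8 Mb) and an ideal link with no protocol overhead; the function names are illustrative.

```python
def mbps_to_mbytes_per_s(megabits_per_s):
    """Convert a link rate in Mb/s to MB/s (8 bits per byte)."""
    return megabits_per_s / 8


def transfer_seconds(size_mbytes, link_mbps):
    """Ideal time to move size_mbytes over a link running at link_mbps."""
    return size_mbytes * 8 / link_mbps


for link_mbps in (10, 100, 1000, 10000):
    print(f"{link_mbps:>5} Mb/s = {mbps_to_mbytes_per_s(link_mbps):7.2f} MB/s, "
          f"100 MB takes {transfer_seconds(100, link_mbps):6.2f} s")
```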

Some network links are full duplex. If a link is full duplex, no action of the two devices on the network segment can cause a collision on the link. If a link is half duplex, simultaneous transmission by multiple hosts can cause a collision. Collisions cause a few types of performance degradation. First, the average latency of messages varies with the overall usage of the network, since messages will frequently need to be retransmitted or will have to wait before transmission can occur. Second, the aggregate bandwidth available to the entire network is lower because of the cost of collision detection and retransmission. Also, a network featuring half-duplex links will typically form a single collision domain. This means that the amount of bandwidth available to all hosts combined is that of a single link. This is undesirable when compared with switched, full-duplex networks, which provide up to the full bandwidth of all links.
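The difference between a single collision domain and a switched network can be made concrete with upper-bound arithmetic. The sketch below is a simplification under stated assumptions: it counts one direction of traffic, ignores collision overhead (which only makes the shared case worse), and uses function names of our own choosing.

```python
def shared_aggregate_mbps(link_mbps, num_hosts):
    """Single collision domain: all hosts share one link's worth of bandwidth."""
    return link_mbps


def switched_aggregate_mbps(link_mbps, num_hosts):
    """Switched, full duplex: up to the full bandwidth of every host link."""
    return link_mbps * num_hosts


num_hosts, link_mbps = 16, 100
shared = shared_aggregate_mbps(link_mbps, num_hosts)
switched = switched_aggregate_mbps(link_mbps, num_hosts)
print(f"shared:   {shared} Mb/s total, {shared / num_hosts:.2f} Mb/s per host")
print(f"switched: {switched} Mb/s total, {switched / num_hosts:.2f} Mb/s per host")
```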

The ability to operate in either full- or half-duplex mode for any link in a network is governed by the devices at either end. Some devices are limited in terms of supported operational modes. Hubs are unable to function in full-duplex environments due to their basic design. Some Ethernet interfaces are unable to run in full-duplex mode. All Ethernet devices, by specification, are able to run in half-duplex mode.

4.1.3 Network Devices

A network device is hardware that interconnects some number of network links. The network device uses one of a number of algorithms to process and forward the traffic between hosts. The style of traffic forwarding affects the properties of the whole network greatly; different algorithms yield different behavior of the network under load. These devices also vary widely in terms of media, performance, and price.

The two main classifications of network devices are hubs and switches. A hub implicitly forms a single collision domain. That is, any traffic received on any port is transmitted to all other ports on the hub. All links connected to these devices are half duplex. Hubs are typically among the least expensive network devices. They were most common in the days of 10 and 100 Mbps Ethernet; gigabit hubs are unheard of. Hubs will only function with network link types that allow for contention. Ethernet does, though many other networks currently in use do not. This sort of contention detection and correction comes at some cost. When all of the links connected to a hub are suffering from contention simultaneously, the aggregate bandwidth available to clients drops to about 35% of the link bandwidth. As mentioned previously, hubs cannot use full-duplex links, due to their basic design. For these reasons, hubs are less desirable in the cluster environment.

Switches have become the standard network device in the last few years because of their plummeting cost and their performance benefits. Ethernet switches maintain network state information that maps known Ethernet hardware (MAC) addresses to the port they were last seen on. This means that when traffic for a new destination is processed by the switch, the switch will only have to flood (broadcast on all links) the first packet; the destination's response will cause an entry to be created in the MAC address table of the switch, and all subsequent packets will be forwarded directly to the proper port. This approach is extremely effective in small environments. A relatively small number of packets are flooded, allowing all links to be used efficiently. The switch is able to cache nearly complete network state, and the network can be used near-optimally. In more complex networks, the simplicity of this approach makes it difficult to achieve the performance one might want.
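The learn-and-flood behavior described above can be sketched as a toy model. This is a minimal illustration, not a description of any particular switch: the class name, the port numbering, and the MAC addresses are all hypothetical, and real switches also age out table entries, which is omitted here.

```python
class LearningSwitch:
    """Toy model of MAC address learning in an Ethernet switch."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.mac_table = {}  # MAC address -> port it was last seen on

    def handle_frame(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame is sent out on."""
        # Learn (or refresh) where the sender lives.
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood on every port except the ingress port.
            return [p for p in range(self.num_ports) if p != in_port]
        if out_port == in_port:
            # Destination is on the segment the frame came from: drop it.
            return []
        return [out_port]


switch = LearningSwitch(num_ports=4)
host_a, host_b = "00:00:00:00:00:0a", "00:00:00:00:00:0b"

first = switch.handle_frame(0, host_a, host_b)  # B unknown: flooded
reply = switch.handle_frame(2, host_b, host_a)  # A known: forwarded directly
later = switch.handle_frame(0, host_a, host_b)  # B now known: forwarded directly
print(first, reply, later)
```

Only the first frame is flooded; the reply teaches the switch where the second host lives, and every later frame uses a single port.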

Many switches have limitations in terms of the quantity of traffic they can process. This limit is described in one of two ways. The term backplane bandwidth is used to describe the aggregate amount of bandwidth a switch can handle at once. For example, a switch that has a backplane bandwidth of 16 Gbps is able to process the load generated by 16 clients, each with a 1 Gbps NIC. The other way this capacity is described in specifications is in packets per second, or PPS. A switch may also be said to be non-blocking. This means that any configuration of clients that can be connected can be supported by the switch without packet loss due to internal bandwidth limitations. The backplane bandwidth in these cases is at least the sum of the individual bandwidths of all links that can be connected to the switch.
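The non-blocking condition reduces to a comparison between the backplane bandwidth and the sum of the port speeds. The sketch below follows the accounting used in the example above (one direction per port); vendor data sheets may count both directions of a full-duplex port, so treat the convention and the function names as assumptions.

```python
def required_backplane_gbps(port_speeds_gbps):
    """Bandwidth needed to serve every port at line rate simultaneously."""
    return sum(port_speeds_gbps)


def is_non_blocking(backplane_gbps, port_speeds_gbps):
    """True if internal bandwidth can never be the cause of packet loss."""
    return backplane_gbps >= required_backplane_gbps(port_speeds_gbps)


sixteen_ports = [1] * 16      # 16 clients, each with a 1 Gbps NIC
twenty_four_ports = [1] * 24  # same backplane, more ports

print(is_non_blocking(16, sixteen_ports))
print(is_non_blocking(16, twenty_four_ports))
```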

In complex networks, many network switches will be interconnected. This is required because single switches have limited bandwidth and port counts; in large configurations, multiple switches must be used together to provide enough capacity. Clients on one switch are collectively limited to the speed of the connecting link when communicating with clients on another switch. For this reason, switches are typically connected with multiple links, which allows more packets to be exchanged by clients on different switches. This is referred to as trunking, or link aggregation.

The algorithm used to forward packets in Ethernet switches has been modified to allow for channels made up of multiple links. These channels are treated like normal links. A variety of hashing algorithms are used to distribute the network traffic across the underlying links. Many of these algorithms use peer configuration information, like IP address or NIC hardware address. Many of them do not work very well in cluster environments because of the uniformity in hardware and software. In most clusters, hosts are configured with sequential IP addresses. Most clusters also have homogeneous hardware, and it is not uncommon for cluster nodes to have sequential, or at least very similar, NIC hardware addresses. Both of these facts make many hashing algorithms suboptimal in clusters. Round-robin distribution spreads traffic well, but tends to cause packet reordering, which causes problems in higher layers of the network software. Because of these problems, Ethernet switch complexes tend to be used for network-intensive tasks only in smaller environments. In small environments, clients will have good connectivity to a large fraction of the system because of a shared switch. In larger configurations, inter-client connectivity is diminished because inter-switch connectivity is typically poor.
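To see how address-based hashing can go wrong with sequential addresses, consider the sketch below. The XOR-and-modulo hash is a simplified stand-in chosen for illustration, not any vendor's actual algorithm, and the addresses and traffic pattern are hypothetical: eight hosts on one switch each exchange data with their counterpart on a second switch across a four-link trunk.

```python
import ipaddress
from collections import Counter

NUM_LINKS = 4  # links in the trunk between the two switches


def xor_hash_link(src_ip, dst_ip, num_links=NUM_LINKS):
    """Pick a trunk link from the XOR of source and destination addresses."""
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return (src ^ dst) % num_links


# Hosts 10.0.0.16-23 sit on one switch, 10.0.0.24-31 on the other.
# Host i talks to its counterpart i + 8, a common pairwise exchange pattern.
flows = [(f"10.0.0.{16 + i}", f"10.0.0.{24 + i}") for i in range(8)]

hashed = Counter(xor_hash_link(src, dst) for src, dst in flows)
# Round robin: eight successive packets are assigned to links in turn.
round_robin = Counter(n % NUM_LINKS for n in range(8))

print("address hash:", dict(hashed))
print("round robin: ", dict(round_robin))
```

Every source and destination pair differs by the same bit pattern, so every flow lands on the same link while the other three sit idle. Round robin uses all four links evenly, but the sketch does not model the packet reordering that it can introduce.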

In order to address these sorts of problems in large switch complexes, some vendors, such as Myricom, use source routing. This means that each packet handled by the network contains a complete route to its destination. With this information in the packet, the switch simply forwards the packet to the next hop in the stored route. This is a more scalable approach, because the switches process traffic identically whether there are 2 or 1024 nodes in the network. On the other hand, the clients need to do a lot more work. Each client needs to maintain a set of routes to all other clients in the network. This can be a complicated task; it requires complete knowledge of the whole network topology. However, it allows more flexibility for the clients of the system. This leads to better network performance overall, especially on large systems.
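A minimal sketch of source routing follows. It is a generic illustration rather than a model of any vendor's wire format: the topology table, switch names, and packet layout are hypothetical. The point is that each switch only consumes the next output port from the route carried in the packet and keeps no per-destination state.

```python
# Hypothetical wiring: (switch, output port) -> what is attached to that port.
TOPOLOGY = {
    ("sw1", 3): "sw2",
    ("sw2", 1): "sw3",
    ("sw3", 0): "hostB",
}


def deliver(packet, first_switch):
    """Walk a source-routed packet through the fabric, one hop per route entry."""
    node = first_switch
    hops = []
    while packet["route"]:
        # The switch strips the first route entry and uses it as the output port.
        out_port = packet["route"].pop(0)
        hops.append((node, out_port))
        node = TOPOLOGY[(node, out_port)]
    return node, hops


# The sending client computed the full route from its knowledge of the topology.
packet = {"payload": b"hello", "route": [3, 1, 0]}
destination, hops = deliver(packet, "sw1")
print(destination, hops)
```

All of the routing intelligence lives in the client that built the `route` list, which is why each client needs a view of the whole topology.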

4.1.4 Topology

Many small cluster networks are extremely simple, consisting of a single network device and a number of clients. This configuration is advantageous in the following way. A single network device, by definition, does not need to connect to other devices in the network fabric. This means that all hosts are equally well connected to all other hosts in the system. There are no issues of traffic distribution as discussed previously, and the MAC address-based forwarding scheme described previously for Ethernet switches works well. Hardware performance in these configurations is typically governed by the performance of the single switch.

Once multiple switches become involved, things become more complicated. Hosts on the same switch enjoy lower latency to one another than hosts on different switches do. All of the switches need to be interconnected. Depending on the network topology, packets may be handled by multiple switches during delivery. In some cases, packets may even be handled by all switches.

Multiple network links may be aggregated in order to improve connectivity between switches. Traffic needs to be distributed across these links. If the switches are multiply interconnected, the path between any two given hosts on the network may no longer be fixed.

The topology of the system will impact the overall performance of the network for clients. The primary metric of this is bisection bandwidth. Bisection bandwidth is the maximum amount of bandwidth that an arbitrary half of the nodes on the network can use to communicate with the other half. In simpler networks, this is usually determined by finding the limiting factor in communication between two regions of the network. In the single-switch case, this is usually the backplane bandwidth of the switch. In the case of multiple Ethernet switches, this is usually the set of uplinks between switches.
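For the simplest multi-switch case, two switches joined by a trunk, the limiting factor can be computed directly. The sketch below assumes the worst-case split puts each switch's hosts in opposite halves and that neither backplane is the limit; the function name and the example numbers are illustrative.

```python
def two_switch_bisection_gbps(hosts_per_switch, host_link_gbps,
                              num_uplinks, uplink_gbps):
    """Bandwidth available between the two halves of a two-switch network."""
    offered = hosts_per_switch * host_link_gbps  # what one half could send
    trunk = num_uplinks * uplink_gbps            # what the uplinks can carry
    return min(offered, trunk)


# 16 hosts per switch at 1 Gb/s, joined by a trunk of four 1 Gb/s links.
bisection = two_switch_bisection_gbps(16, 1, 4, 1)
full = 16 * 1
print(f"bisection bandwidth: {bisection} Gb/s "
      f"({bisection / full:.0%} of full bisection)")
```

Raising the number of uplinks to 16 in this example would bring the trunk up to the offered load and give full bisection bandwidth.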

Complex networks are usually built in order to provide full bisection bandwidth to cluster nodes. This means that any half of the network can communicate with the other half at line rate; i.e., the network itself doesn't limit communication between any set of nodes in the system. In small configurations, this can be achieved with a single switch. Once the network outgrows a single switch, the topology becomes more complicated. These configurations are composed of two types of switches. Some switches connect clients to the network. Others only connect switches to other switches. On any switch connected directly to clients, one port must be connected to another switch for each port connected to a client. This is required to allow data to flow at full rate between clients connected to different switches. Switches connected only to switches are used to distribute traffic between the switches connected to clients. As these configurations get larger, the second category of switches grows quickly. In larger configurations, half or more of the ports available on switches are used as inter-switch links, not as client ports.
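The port accounting for such a configuration can be sketched for a two-level layout built from identical switches. This is one possible construction, assumed here for illustration: each client-facing (edge) switch gives half its ports to clients and half to uplinks, and the switch-only (core) switches absorb all of the uplinks. The function and key names are ours.

```python
import math


def two_level_plan(num_clients, ports_per_switch):
    """Count switches and ports for a two-level full-bisection layout."""
    client_ports_per_edge = ports_per_switch // 2  # other half are uplinks
    edge = math.ceil(num_clients / client_ports_per_edge)
    if edge > ports_per_switch:
        raise ValueError("too many clients for two levels of switches")
    uplinks = edge * client_ports_per_edge         # one uplink per client port
    core = math.ceil(uplinks / ports_per_switch)
    total_ports = (edge + core) * ports_per_switch
    inter_switch_ports = 2 * uplinks               # each uplink uses two ports
    return {
        "edge_switches": edge,
        "core_switches": core,
        "inter_switch_ports": inter_switch_ports,
        "inter_switch_fraction": inter_switch_ports / total_ports,
    }


plan = two_level_plan(num_clients=144, ports_per_switch=24)
print(plan)
```

For 144 clients on 24-port switches, this layout needs 12 edge switches and 6 core switches, and two thirds of all switch ports carry inter-switch traffic rather than client traffic.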
