5.2 Internet Protocol Stack

Simple Beowulf clusters are built with commodity networking hardware, typically Ethernet based, and communicate using standard networking protocols such as TCP/IP. Before examining UNIX networking concepts and services, as well as the configuration of a simple cluster, it is important that you understand the protocols involved in network communication. Understanding the protocols will be necessary when performing advanced configuration, troubleshooting problems or attempting to improve performance.

Networking protocols are built, at least conceptually, in layers. Figure 5.1 depicts the layers involved in TCP and UDP communication. In the paragraphs that follow, we will describe the layers from the bottom up, focusing on details important to our later discussions. While a full discussion of IP networking is beyond the scope of this chapter, the interested reader will find that [28, 110] discuss the topic in great detail. In addition, a more general discussion of network hardware, software, and protocols can be found in Chapter 4.

Figure 5.1: Layering of network protocols

A combination of the network interface card and the associated driver is responsible for sending frames out to other devices on the local area network. The maximum amount of data that can be placed in a frame is otherwise known as the maximum transmission unit (MTU). The MTU for an Ethernet device depends on which specification the device implements, but most devices have a MTU of 1500 bytes. Some newer Ethernet devices can be configured to send and receive jumbo frames, resulting in a MTU as large as 9000 bytes. Jumbo frames and their implications will be discussed further in Section 5.5.4.

The Internet Protocol (IP) is the building block for TCP and UDP. IP is a communication protocol for transferring messages known as datagrams between machines, even machines on different networks. An IP datagram consists of a header plus data. The header contains, among other things, the addresses for the source and destination machines and the length of the datagram (in bytes). The destination address is used by special network devices known as routers to forward (or route) the datagram between networks until the datagram reaches its destination. Section 5.3.1 contains a more detailed discussion of IP addresses and routing.

The length field of the datagram header is only 16 bits wide. As a result, the combination of the datagram header and data can be at most 65,535 bytes in length. However, as you might have guessed, IP datagrams are transmitted on the underlying network using frames, a network whose MTU is generally much smaller 65,535 bytes. To solve this problem, IP datagrams larger than the MTU are fragmented into a series of IP packets and reassembled by the receiver. In addition, fragmentation may occur if a packet is routed through any network having a smaller MTU.

IP is what is known as an unreliable, unordered, and connectionless protocol. Unreliable suggests that datagrams sent using IP may not arrive at their destination. Although the protocol makes every effort to deliver the datagram, network misconfiguration, resource exhaustion, or outright failure may result in data loss. Unordered indicates that datagrams that do arrive at their destination may arrive in a different order from the one in which they were sent. And finally, connectionless implies that no state is maintained at the sender or the receiver between datagrams.

The User Datagram Protocol (UDP) is a thin layer on top of IP. Like IP, UDP is unreliable, unordered and connectionless. The primary contribution of UDP is the addition of ports. IP only identifies the source and destinations machines, not which application or service was involved in the communication. The port is an integer identifier that allows multiple flows of communication to exist between a pair of machines and ensures that the datagrams are delivered to the appropriate application or service.

The Transmission Control Protocol (TCP), also layered on top of IP, is substantially more complex that UDP. TCP provides a bidirectional connection over which a stream of bytes is reliably communicated. Like UDP, TCP uses ports. A connection is uniquely identified by a four-tuple (source address, source port, destination address, destination port). Using this four-tuple, the TCP layer can locate the structures maintaining the state of the connection.

With TCP, data in the stream is divided into segments for transmission. These segments, plus a TCP header, are encapsulated into an IP datagram. To avoid fragmentation, which can adversely affect performance, the maximum segment size (MSS) is advertised when the connection is formed so that the segment data plus the TCP and IP headers do not exceed the MTU of the underlying network. On a local-area network (LAN), the MSS can be computed by subtracting the size of the TCP header from the network device's MTU.

TCP connections that reach outside of the LAN are more difficult as the MTU of all the networks involved is unknown when the connection is formed. In this case, most TCP/IP implementations assume an initial MTU of 576 bytes, unless an alternative value is specified by the system administrator. A discovery process is then employed to determine a MTU that is acceptable for all networks involved in the connection. Since the primary focus of this chapter is the cluster network, a discussion of wide-area network MTU discovery is unwarranted. However, the interested reader will find introductions to the topic in [28, 110] and a detailed discussion in [74].

TCP uses a coupling of positive acknowledgments and a sliding window protocol. Positive acknowledgments and data buffering along with timeouts and retransmission provide the reliability. The sliding window protocol allows the sender to have multiple unacknowledged segments outstanding, substantially increase throughput. Additionally, the protocol provides the receiver with the ability to advertise the amount of buffer space available at its end of the connection. By knowing the amount of available space at the receiver, the sender can avoid transmitting more data than can be accommodated by the receiver. This is known as flow control. More detailed discussions of these topics, and TCP as a whole, can be found in [28, 110, 87].

This concludes our high-level overview of the Internet Protocol stack. Building and operating a Beowulf cluster by no means necessitates mastering these protocols; however, a basic understanding is required. After all, it is these protocols that enable network communication. In the coming section, we will discuss a series of networking concepts and services which are built upon these very protocols.