5.7 Troubleshooting

Although cluster networks are typically robust, they are still sometimes responsible for unexpected behavior in a cluster. Some of these problems are caused by hardware failures, but more often they result from improper software configuration or corrupted data in the system. Because the cluster network is not always easily identifiable as the cause of a problem, we present some simple techniques that cluster administrators can employ while tracking down network-related problems. We also illustrate the use of popular network troubleshooting tools by walking the reader through some common failure/recovery scenarios. For a more complete treatment, the reader may refer to one of the many books devoted to network troubleshooting [103]. This section brings some potential pitfalls to the reader's attention, but it is intended mainly as a starting point for administrators attempting to track down bugs in the system.

To diagnose a cluster network problem, we must first understand the various levels of the cluster and how each might cause a problem. It is usually good practice to start at the application level, work our way down through the OS and the logical network, and finish by checking hardware. An example of an application problem is a user specifying an incorrect hostname or port in their application. OS-level problems, which range from service configuration to driver problems, offer a wide variety of debugging challenges. Logical network problems include improper firewall rules or routing configurations, and hardware issues range from bad switch ports to damaged cables. By attacking problems from the top of this chain, we can eliminate higher-level problems before getting lost in lower-level details that may have nothing to do with the original problem.

Before we begin the failure scenarios and their solutions, we need a toolkit of utilities that can help us determine the source of a problem.

ping. The faithful UNIX command ping has proven to be one of the most useful utilities in UNIX history. It relies on the ICMP echo mechanism: when an echo request packet is sent to a remote machine or gateway, the remote machine sends back an echo reply packet, and ping reports the round-trip time of each exchange. Essentially, we will use ping to get a first impression of whether a host is alive on the network.
netstat. Linux provides a utility netstat which allows us to inspect the current network connection status of our machine. We use it to see which ports our machine has open, which remote machines are currently connected to us, what state our TCP connections are in, etc.
iperf. The iperf utility is a network performance testing tool. As a modern utility for testing network bandwidth, it measures both TCP and UDP throughput, includes support for multicast performance testing, and supports IPv6.
nmap. The nmap utility is used to probe the network accessibility of a remote machine. It can be used to essentially "map" a network by finding which machines are alive on the network and what ports they currently have open.
telnet. Although the use of the telnet remote login service is most likely disabled on any reasonable modern OS (or should be), the client program, telnet, has other useful applications. To telnet we can specify a hostname and a port to connect to, at which point the client makes a straightforward TCP connection to the remote host/port and allows us to send and receive character streams to/from the remote host. This usage model can be quite helpful when testing basic machine connectivity.
User applications. Often, one of the best tools for finding problems, and sometimes for solving them, is the actual application code being run on the system. After all, if our users are having no problems, are there actually any problems?

Now that we have some useful tools in our toolbox, we can examine some problem scenarios and see how we can diagnose them and then attempt to solve them. The reader should bear in mind that real-life problems will not mirror our examples exactly, and our procedures are meant to illustrate a general process, not a specific solution.

When I try to rsh/ssh to a remote machine, it fails.

Most often this problem is caused by improper software configuration. First, following our own advice, we should quickly check the sshd/rshd configuration files to see if anything is obviously misconfigured. If the services appear to be configured correctly, we step down to the OS/network level. For the ssh/rsh tools to function properly, the two machines in question must be visible to each other on the network (connected), and they must be able to correctly identify each other when a connection is attempted. We use ping and telnet to determine whether both of these conditions are satisfied.

log into source machine
ping destination machine
log into destination machine
ping source machine

This process will give us a very crude notion of whether the machines can contact each other over the network. If the above process fails, skip down to the next scenario, "ping doesn't seem to be working," to try resolving the problem, then return to this scenario if there is still a problem with rsh/ssh.
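The ping step of this procedure can also be scripted. The following is a minimal sketch, assuming a ping binary that accepts the -c (count) option, as the common Linux implementation does; the function names, the default count, and the timeout are illustrative choices, not part of any standard tool. It would be run once on each machine, naming the other machine as the target.

```python
import subprocess


def ping_command(host, count=3):
    """Build a ping invocation that sends `count` echo requests and then exits."""
    return ["ping", "-c", str(count), host]


def can_ping(host, count=3, timeout=15):
    """Return True if ping exits successfully, False otherwise.

    False is also returned when ping cannot be run at all or does not
    finish within `timeout` seconds (both values are assumptions).
    """
    try:
        result = subprocess.run(
            ping_command(host, count),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

On the source machine one might call can_ping("remote.myu.edu"), and on the destination machine can_ping("source.myu.edu"); a False result in either direction points to the ping scenario that follows.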

Both ssh and rsh use TCP to establish their initial connection. We can test simple TCP connectivity using the telnet command. Start by logging into the source machine, then telnet to the destination machine on the service's port. If a connection is established, one should see output of the following form.

    source.myu.edu % telnet remote.myu.edu 514
    Trying 192.168.13.7...
    Connected to remote.myu.edu.
    Escape character is '^]'.
    Connection closed by foreign host.
    source.myu.edu %

For ssh, replace rshd's port number (514 in the above example) with sshd's port, 22. Current port assignments should be verified by looking in the machine's '/etc/services' file. If for some reason the two machines were able to ping one another but not send TCP traffic to the specified port, we would expect the session to look similar to the following.

    source.myu.edu % telnet remote.myu.edu 514
    Trying 192.168.13.7...
    telnet: connect to address 192.168.13.7: Connection refused
    source.myu.edu %

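If this occurs, either no service is listening on that port of the destination machine, or a firewall or router is rejecting the traffic. (A firewall that silently drops packets instead leaves telnet waiting at the "Trying" line until it times out.) After confirming that the daemon is running, refer to the scenario below entitled "ssh works, but ... does not" for more details on how to track this down.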

If we can ping our remote machine and telnet to the port in question, our problem is most likely a simple configuration file problem (we're most likely to see an error message reporting a permission problem or similar). Check the utility's documentation to learn more on how to set up the servers (sshd for ssh problems, inetd/xinetd for rsh problems) to accept remote logins/commands.
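The telnet test above can be approximated in a few lines of Python when many hosts or ports need checking. This is only a sketch: probe_tcp is a hypothetical helper name, the five-second timeout is an arbitrary assumption, and the returned labels are our own. It attempts a plain TCP connection and reports whether it was accepted, refused, or left unanswered.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Attempt a TCP connection and classify the outcome.

    Returns "connected", "refused", "timeout", or "error: <reason>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The remote side answered, but nothing accepted the connection.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout, e.g. packets silently dropped.
        return "timeout"
    except OSError as exc:
        # Anything else, such as a hostname that does not resolve.
        return "error: %s" % exc
```

For example, probe_tcp("remote.myu.edu", 514) corresponds to the rsh test shown earlier and probe_tcp("remote.myu.edu", 22) to the ssh test. A "refused" result matches the failed telnet session above, while a "timeout" result more often suggests that packets are being dropped along the way.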

ping doesn't seem to be working.

If our simple ping procedure is failing, either the machines are not properly configured for the network they're connected to, our name resolution configuration is incorrect, our firewall is improperly configured, or we are having hardware problems.

To confirm that our machines are properly configured to have a presence on their networks, we can attempt to ping some external machine (the gateway perhaps, some internal web site, etc.). If either machine cannot ping any external machine, there is most likely a problem with the way the network interface is configured on that machine (see Section 5.4.2) or with bad hardware/cables. If both are alive and able to ping a common third machine, then we should try to ping with IP addresses as opposed to hostnames. Using the ifconfig utility, we can acquire both machines' IP addresses, which can then be used instead of hostnames in a repeat of our ping procedure. If this fails, please refer to the problem scenario below entitled "ssh works, but ... does not." If pinging with IP addresses works but pinging with hostnames does not, then we know we have a problem with the way our machines are resolving hostname-to-address mappings (or the reverse mappings). We should consider how our systems are supposed to resolve these mappings ('/etc/hosts', NIS, DNS, or all three) and check the appropriate configuration files to make sure both sides are properly set up to resolve hostnames (refer to Section 5.4.3 for details).
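The decision process in this scenario can be summarized as a small sketch. Both function names below are hypothetical, and the returned strings are simply labels for the causes named in the text. The resolve helper shows which IPv4 addresses the local machine currently maps a hostname to, which can be compared against the address reported by ifconfig on the other machine.

```python
import socket


def resolve(hostname):
    """Return the sorted IPv4 addresses this machine maps `hostname` to.

    An empty list means the local resolver could not map the name.
    """
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def diagnose_ping(external_ok, ip_ok, hostname_ok):
    """Map three ping observations to the next thing to check.

    external_ok: both machines can ping a common third machine
    ip_ok:       the machines can ping each other by IP address
    hostname_ok: the machines can ping each other by hostname
    """
    if not external_ok:
        return "interface configuration or hardware/cables"
    if not ip_ok:
        return "firewall or routing"
    if not hostname_ok:
        return "name resolution"
    return "ping is working"
```

If resolve("remote.myu.edu") on the source machine returns an empty list, or an address other than the one the remote machine actually holds, the hostname mappings are the place to look.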

ssh works, but ping/rsh/application/etc does not.

If one finds that some specific application is functioning properly while others are failing, the problem usually lies in the misconfiguration of the failing application(s). Great care should first be put into determining whether the cause of the failure is specific to an application. If the failure continues when all configurations appear correct, we should turn our attention to router/firewall-based causes. Remember that just as we can configure a firewall to allow only certain traffic, we can also configure it to deny certain traffic. We should check to make sure our firewall isn't explicitly denying our service's traffic. Another possibility is that we have forgotten to include a firewall rule that a service needs in order to have all of its network requirements fulfilled. Often a service requires only one port for the initial connection from a client but uses other ports once the connection succeeds, and we must allow connections on all needed ports for such a service to operate. The rsh service, for example, accepts the client's connection on port 514 and then opens a second connection back to the client to carry standard error. Note that commonly we need to allow only a single open port in one direction, but many ports must be unblocked in the other direction. The firewall must be configured to accommodate these types of service behavior.

The user's application is running, but it seems like the network is slowing it down.

If everything appears to be functioning, but is simply performing poorly or with wildly varying performance, we can usually use iperf to quantify the problem. Below is an example of running the simplest test (TCP bandwidth, default window size) between a pair of machines.

    # This is the server command
    remote.myu.edu % iperf -s

    # This is the client command
    source.myu.edu % iperf -c remote.myu.edu

Both processes will show that a test has started, and after a few seconds each will report the number of seconds taken, the size of the total transfer, and the calculated bandwidth of the connection. Try running this benchmark a few times to see whether your network is supplying the expected performance. On an unloaded system, one should expect iperf to report approximately 95 percent of the total link bandwidth, the remainder being consumed by headers and other control traffic. If iperf is giving you the expected measurements, there may be something wrong with the application that is showing poor network performance. Otherwise, the problem could be a bad port, a bad cable, or even the network card driver.
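One plausible accounting for the 95 percent figure is per-frame overhead on Ethernet. The sketch below assumes a standard 1500-byte MTU, 20-byte IP and TCP headers without options, and 38 bytes of Ethernet framing per packet; the function names and the 5 percent tolerance are illustrative assumptions, not values taken from iperf.

```python
def tcp_payload_fraction(mtu=1500, ip_header=20, tcp_header=20, frame_overhead=38):
    """Fraction of on-wire Ethernet bytes that carry TCP payload.

    frame_overhead = 14 (Ethernet header) + 4 (frame check sequence)
                   + 8 (preamble) + 12 (inter-frame gap) bytes.
    """
    payload = mtu - ip_header - tcp_header
    return payload / (mtu + frame_overhead)


def meets_expectation(measured_mbps, link_mbps, tolerance=0.05):
    """True if a measurement is within `tolerance` of the ideal payload rate."""
    expected = link_mbps * tcp_payload_fraction()
    return measured_mbps >= expected * (1.0 - tolerance)
```

With the defaults the fraction is 1460/1538, or about 0.949, so an ideal TCP stream on a 1000 Mbit/s link would carry roughly 949 Mbit/s of payload. Under these assumptions a reading of 940 Mbit/s passes the check, while a reading of 600 Mbit/s would justify a closer look at ports, cables, and drivers.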

Nothing works!

A good rule of thumb when nothing seems to be working is to follow the chain of checks laid out in this section: application errors, local host service configuration, local name resolution configuration, logical network failures (firewalls), and hardware failures. Most problems that appear in a cluster network lie in one or more of these steps, and careful consideration at each step before moving on to the next should flush out the problem.