17.6 Troubleshooting

The following is a list of common problems and recommended solutions. Additional information is always available on the PBS Web sites.

17.6.1 Clients Unable to Contact Server

If a client command (such as qstat or qmgr) is unable to connect to a Server, there are several possible errors to check. If the error return is 15034, No server to connect to, check (1) that there is indeed a Server running and (2) that the default Server information is set correctly. The client commands attempt to connect to the Server specified on the command line if one is given or, if not, to the Server specified in the default server file, '/usr/spool/PBS/default_server'.
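The lookup order described above can be sketched in Python. This is only an illustration, not PBS's own code: the function name resolve_server is hypothetical, and reading the first non-blank line of the file is an assumption about its layout. Only the file path comes from the text.

```python
from pathlib import Path

# Path of the default server file as given in the text.
DEFAULT_SERVER_FILE = "/usr/spool/PBS/default_server"


def resolve_server(cli_server=None, default_file=DEFAULT_SERVER_FILE):
    """Return the Server name a client would try, or None if none is set.

    Hypothetical helper: a Server named on the command line wins;
    otherwise the first non-blank line of the default server file is
    used (an assumption about the file's layout).
    """
    if cli_server:
        return cli_server
    try:
        text = Path(default_file).read_text()
    except OSError:
        return None  # file missing or unreadable: no default Server
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return None
```

A result of None corresponds to the situation in which a client would have no Server to connect to, so it is a quick way to confirm check (2).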

If the error return is 15007, No permission, check for (2) as above. Also check that the executable pbs_iff is located in the search path for the client and that it is setuid root. Additionally, try running pbs_iff by typing

        pbs_iff server_host 15001

where server_host is the name of the host on which the Server is running and 15001 is the port on which the Server is listening (if the Server was started with a different port number, use that number instead of 15001). The executable pbs_iff should print out a string of garbage characters and exit with a status of 0. The garbage is the encrypted credential that would be used by the command to authenticate the client to the Server. If pbs_iff fails to print the garbage or exits with a nonzero status, either the Server is not running or it was installed with a different encryption system from that used for pbs_iff.
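The pbs_iff checks described above could be scripted roughly as follows. This is a sketch under stated assumptions: the helper names is_setuid_root and check_pbs_iff are hypothetical, and the only facts taken from the text are that pbs_iff must be in the search path, must be setuid root, and should print its credential and exit with status 0.

```python
import os
import shutil
import stat
import subprocess


def is_setuid_root(st_mode, st_uid):
    """True if the mode has the setuid bit set and the owner is root (uid 0)."""
    return bool(st_mode & stat.S_ISUID) and st_uid == 0


def check_pbs_iff(server_host, port=15001):
    """Run the checks from the text; return a list of problem strings."""
    path = shutil.which("pbs_iff")
    if path is None:
        return ["pbs_iff is not in the search path"]
    problems = []
    st = os.stat(path)
    if not is_setuid_root(st.st_mode, st.st_uid):
        problems.append(f"{path} is not setuid root")
    # pbs_iff should print the encrypted credential and exit with status 0.
    result = subprocess.run([path, server_host, str(port)], capture_output=True)
    if result.returncode != 0 or not result.stdout:
        problems.append("pbs_iff printed nothing or exited with nonzero status")
    return problems
```

An empty list from check_pbs_iff("server_host") would suggest that the cause of the 15007 error lies elsewhere.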

17.6.2 Nodes Down

The PBS Server determines the state of nodes (up or down) by communicating with MOM on each node. The state of nodes may be listed by two commands: qmgr and pbsnodes.

        % qmgr
        Qmgr: list node @active

        % pbsnodes  -a
        Node jupiter
                state = down, state-unknown
                properties = sparc, mine
                ntype = cluster

A node in PBS may be marked down in one of two substates. For example, the state shown above for node "jupiter" indicates that the Server has not had contact with MOM on that node since the Server came up. Check to see whether a MOM is running on the node. If there is a MOM and if the MOM was just started, the Server may have attempted to poll her before she was up. The Server should see her during the next polling cycle in ten minutes. If the node is still marked down, state-unknown after ten minutes, either the node name specified in the Server's node file does not map to the real network hostname or there is a network problem between the Server's host and the node.

If the node is listed as

        % pbsnodes  -a
        Node jupiter
                state = down
                properties = sparc, mine
                ntype = cluster

then the Server has been able to communicate with MOM on the node in the past, but she has not responded recently. The Server will send a ping PBS message to every free node each ping cycle (10 minutes). If a node does not acknowledge the ping before the next cycle, the Server will mark the node down.
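To tell the two down substates apart across many nodes, the pbsnodes listing can be scanned with a short script. The sketch below assumes the output has exactly the layout of the listings shown in this section (a "Node name" line followed by indented "key = value" lines); the function name down_nodes and the two labels are hypothetical.

```python
def down_nodes(pbsnodes_output):
    """Map each down node to 'never contacted' or 'stopped responding'.

    Assumes the text layout shown in this section for `pbsnodes -a`.
    """
    result = {}
    node = None
    for line in pbsnodes_output.splitlines():
        words = line.split()
        if len(words) == 2 and words[0] == "Node":
            node = words[1]  # start of a new node entry
        elif node and line.strip().startswith("state ="):
            states = [s.strip() for s in line.split("=", 1)[1].split(",")]
            if "down" in states:
                if "state-unknown" in states:
                    # No contact with MOM since the Server came up.
                    result[node] = "never contacted"
                else:
                    # MOM answered in the past but missed a ping.
                    result[node] = "stopped responding"
    return result
```

Nodes reported as "never contacted" point toward a MOM that is not running, a node name mismatch, or a network problem; nodes reported as "stopped responding" had a working MOM at some earlier point.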

17.6.3 Nondelivery of Output

If the output of a job cannot be delivered to the user, it is saved in a special directory '/usr/spool/PBS/undelivered' and mail is sent to the user. The typical causes of nondelivery are the following:

  • The destination host is not trusted and the user does not have a .rhosts file.

  • An improper path was specified.

  • A directory in the specified destination path is not writable.

  • The user's .cshrc on the destination host generates output when executed.

  • The '/usr/spool/PBS/spool' directory on the execution host does not have the correct permissions. This directory must have mode 1777 (drwxrwxrwt).
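The spool directory permissions can be verified with a few lines of Python. This is a minimal sketch; the function name spool_mode_ok is hypothetical, and the default path is the one given in the text.

```python
import os
import stat


def spool_mode_ok(path="/usr/spool/PBS/spool"):
    """True if path is a directory whose permission bits are exactly 1777."""
    st = os.stat(path)
    return stat.S_ISDIR(st.st_mode) and stat.S_IMODE(st.st_mode) == 0o1777
```

Mode 1777 is world-writable with the sticky bit set, which a long directory listing shows as drwxrwxrwt.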

17.6.4 Job Cannot Be Executed

If a user receives a mail message containing a job identifier and the line "Job cannot be executed," the job was aborted by MOM when she tried to place it into execution. The complete reason can be found in one of two places: MOM's log file or the standard error file of the user's job.

If the second line of the message is "See Administrator for help," then MOM aborted the job before the job's files were set up. The reason will be noted in MOM's log. Typical reasons are a bad user/group account or a system error.

If the second line of the message is "See job standard error file," then MOM had already created the job's file, and additional messages were written to standard error.
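The rule for where to look can be captured in a small helper, for example when sorting such mail automatically. This is an illustrative sketch: the function name where_to_look is hypothetical, and it matches only the three phrases quoted in this section, ignoring case.

```python
def where_to_look(mail_body):
    """Return where the full reason for a failed job start is recorded.

    Returns None if the mail is not a "Job cannot be executed" message.
    """
    text = mail_body.lower()
    if "job cannot be executed" not in text:
        return None
    if "see administrator for help" in text:
        # MOM aborted the job before its files were set up.
        return "MOM's log file"
    if "see job standard error file" in text:
        # MOM had already created the job's file.
        return "job's standard error file"
    return "unknown"
```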
