Jobs submitted to the cluster can encounter a variety of issues that slow or even halt execution. In some cases, users can diagnose and repair these problems themselves. In this document, you will learn how to monitor your submitted jobs, make modifications to them, and troubleshoot common job issues.
Job monitoring enables users to determine the status of their submitted or active jobs. Tracking the status of these jobs can help identify issues that require remediation. Several tools on the cluster facilitate job monitoring.
qstat is one of these tools.
qstat outputs a summary of the jobs a user has submitted to the cluster, both interactive and non-interactive. To review a brief summary, execute
qstat -a. Figure 2.1 shows the output of
qstat with this option. To review a verbose summary, execute
qstat -f. Most situations will not require verbose output, but it can be useful for troubleshooting scenarios.
In total, qstat -a outputs ten columns. Most of the columns are self-explanatory. Table 2.1 describes the purpose of columns that are not readily obvious.
|NDS||Shows the amount of nodes requested by the job. qsub’s
|TSK||Shows the amount of cores requested by the job. qsub's
|S||Shows the job’s status. Table 2.2 defines the possible job statuses.|
|C||Job has completed execution. Please note that this does not indicate the job successfully completed, only that it was processed by the scheduler and executed.|
|E||Job has exited after execution.|
|H||Job is held. It will not execute until it is released.|
|Q||Job is queued and eligible to execute. When sufficient resources become available, the job will execute.|
|R||Job is currently running.|
|T||Job is being moved to a new location.|
|W||Job is waiting for its execution time to be reached.|
qstat -a outputs summarized information on every job a user has submitted, it is not well-suited to find detailed information on individual jobs. To monitor individual jobs, either the
qstat -f <job-ID> or the
checkjob utilities can be used. Figure 2.2 shows the output of
checkjob. Figure 2.3 presents the syntax for the
checkjob utility. The <job-ID> argument accepts the job’s numeric identifier followed by the batch server’s hostname. In Figure 2.2, the job’s identifier is 2804750.apollo-acf. When you submit a job, the scheduler outputs its identifier. You may also use
qstat -a to quickly find job identifiers.
Another useful job monitoring utility is showq. When it is executed without any options or arguments, it shows the cluster’s entire queue. With the -w user option, it will only show the queue for the specified user. Figure 2.4 shows the output of showq with the -w user option. Observe that showq divides its output into active, eligible, and blocked jobs. This provides a convenient way to determine the current status of all your submitted jobs. You could also use the -w option with group to identify the queue for a specific group on the cluster.
Jobs may require alteration to execute after they are submitted. While syntax errors in a job script will prevent the job from being submitted, errors with the values provided in the job script may not. These situations can be remedied by users. Please note that if the job is running and requires modification, an administrator must make those changes. Please submit a ticket to email@example.com if you need a running job to be modified.
In general, the process for modifying a job in the queue is as follows.
- Use one of the utilities mentioned in the Job Monitoring section to gather information about the jobs that require alteration. You will need the job’s identifier and the parameters that require modification.
- Use the
qholdcommand to remove the job(s) from the queue. Figure 3.1 shows the syntax for this command. Replace <job-ID> with the job’s numeric identifier followed by the hostname of the batch server (apollo-acf).
- Use the qalter command to modify the job. For instance, to modify the amount of nodes requested by the job, execute the command shown in Figure 3.2.
- Use the qrls command to put the job back into the scheduler’s queue. Figure 3.3 shows the syntax for this command.
- Continue to monitor the job and make additional modifications, if necessary.
If you determine that the job must be completely removed from the queue, use the
qdel command. The job will be completely purged from the queue. Figure 3.4 shows how to use the
Common Job Issues
Various issues can affect the submission or execution of jobs. In these situations, users can ordinarily remedy the problems themselves without requiring assistance. This section covers the most common problems users encounter with their jobs and provides instructions to resolve them.
Job Script Errors
Syntax errors and invalid arguments are frequent in job scripts. If the job cannot be submitted, verify that the job script conforms to the appropriate syntax for the interpreter and uses proper arguments for each PBS option. For example, consider the error in Figure 4.1. In this case, the user provided the ppn value after a colon rather than an equal (=) sign. The scheduler will not permit a job with this error to be queued.
Another possible error is shown in Figure 4.2. Here, the user provided two environment variables with the incorrect syntax. The leading dollar sign ($) should be removed to ensure that the variables are passed to the job. Additionally, there should be no spaces between variables, only commas.
Jobs that occupy the queue for extended periods of time have likely requested more resources than the cluster has available. In these situations, modify the resource requirements of the job to reflect what the job needs to execute. For instance, a job that requests four nodes is unlikely to execute in a timely fashion. Instead, it should request two nodes with a ppn (processors per node) value that indicates how many cores the job requires. Figure 4.3 shows the alteration that should be made. Review the process for altering jobs outlined in the Modifying Jobs section before attempting to make any modifications.
If the job specifies the -n option to the qsub command or in the job script, it should be removed unless the job requires an entire node. Figure 4.4 shows how to unset the node-exclusive option from a submitted job. The output of
qstat -f will reflect the change under the “node_exclusive” key.
Using the Wrong Partition or Feature
Jobs may request a partition that does not exist. Figure 4.5 illustrates an invalid partition specification. “beacon_gpu” is not a partition; it is a feature. The scheduler will not queue this job with this partition selected. It should be changed to the valid partition “beacon” so that it will be queued.
Other jobs may request a partition to which the submitting user has no access. In these cases, the partition should be changed to one that the user can access. Figure 4.6 shows how to use the qalter command to change the job’s partition. Remember to follow the process for altering jobs outlined in the Modifying Jobs section before altering a job.
Though less likely, errors involving features can occur. Requesting features like “gpu” or “bigcore” will prevent the job from being queued. Features should only be used to specify the exact node set on which the jobs should run. Figure 4.7 demonstrates a change to a job that targets the general partition and the sigma feature, which will place the job on a Sigma node.
For more information about partitions and features, please review the Running Jobs document.
Using the Wrong Project or Reservation
If the wrong project or reservation is used to submit a job, the scheduler may refuse to place it in the queue or it may occupy the queue for an indefinite time. Before submitting a job, review the project(s) to which you belong in the User Portal. Use the project’s identifier to submit your jobs to the scheduler. By default, all UTK users belong to the ACF-UTK0011 project, while all UTHSC users belong to the ACF-UTHSC0001 project. Figure 4.8 shows how to modify the project used by a submitted job.
Reservations are generally only available for a limited time. For example, a class may have a two-day reservation to have exclusive access to some of the nodes in the cluster. The reservation identifier should be provided to all the users who require access to the reservation. The showres command can also be used to output the reservations to which you belong. If you use the wrong reservation or mistype the identifier, use the command in Figure 4.9 to change it.
Some projects have access to private condos. In these cases, the QoS (quality of service) attribute must be specified in order to access the condo. Review the Writing Job Scripts and Running Jobs documents for more information. If you need to change the QoS attribute of a queued job, use the command in Figure 4.10 to modify it.
Last Updated: 04 / 28 / 2020