squeue

squeue is a scheduler command used to view information about jobs located in the SLURM scheduling queue.

Documentation

Documentation is available on the system using the command:

$ man squeue

Usage

The squeue command is used to pull up information about jobs in the queue. By default it lists the job ID, partition, job name, username, job state, elapsed time, number of nodes, and node list (or, for jobs that are not running, the reason) for all jobs queued or running within SLURM.

Display all jobs queued on the system:

$ squeue

To limit this command to only display a single user’s jobs, the --user flag can be used:

$ squeue --user=$USER

To limit this command to only display your own jobs, the --me flag can be used:

$ squeue --me

Further information without abbreviation can be shown by using the --long flag:

$ squeue --me --long

The squeue command can also report the estimated start time of pending jobs by using the --start flag:

$ squeue --me --start

The accuracy of squeue --start estimates varies with queue dynamics and resource availability (which is affected by maintenance, node failures, etc.), so treat them as a guideline rather than a guarantee.

When monitoring a job, you may wish to refresh the output at a regular interval. This can be achieved by using the --iterate flag, replacing n_seconds with the number of seconds between updates:

$ squeue --me --start --iterate=n_seconds

You can stop this command by pressing Ctrl + C.
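As a concrete sketch of the interval polling described above, the wrapper below refreshes the estimated start times of your own jobs every 30 seconds unless another interval is given. The function name watch_my_jobs and the 30-second default are illustrative assumptions, not part of SLURM.

```shell
# Hypothetical helper: poll the estimated start times of your own jobs.
# The first argument is the refresh interval in seconds (assumed default: 30).
watch_my_jobs() {
    squeue --me --start --iterate="${1:-30}"
}

# Usage (on a system with SLURM); press Ctrl + C to stop:
#   watch_my_jobs 60
```

A short interval puts extra load on the scheduler, so a value of tens of seconds or more is usually a sensible choice.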

Example output:

$ squeue
        JOBID   PARTITION   NAME      USER  ST       TIME  NODES NODELIST(REASON)
        1234567 interacti   bash   foo1bar   R   17:19:40      1 bessemer-node001
        1234568 sheffield job.sh   foo1bar   R   17:21:40      1 bessemer-node046
        1234569 sheffield job.sh   foo1bar  PD   17:22:40      1 (Resources)
        1234570 sheffield job.sh   foo1bar  PD   16:47:06      1 (Priority)
        1234571       gpu job.sh   foo1bar   R 1-19:46:53      1 bessemer-node026
        1234572       gpu job.sh   foo1bar   R 1-19:46:54      1 bessemer-node026
        1234573       gpu job.sh   foo1bar   R 1-19:46:55      1 bessemer-node026
        1234574       gpu job.sh   foo1bar   R 1-19:46:56      1 bessemer-node026
        1234575       gpu job.sh   foo1bar  PD       9:04      1 (ReqNodeNotAvail, UnavailableNodes:bessemer-node026)
        1234576 sheffield job.sh   foo1bar  PD    2:57:24      1 (QOSMaxJobsPerUserLimit)

The ST column above shows the state of each job, here running (“R”) and pending (“PD”). For pending jobs, the final column gives the reason, for example a requested node that is currently unavailable (ReqNodeNotAvail) or a user reaching the maximum number of jobs they can run simultaneously in a QOS (QOSMaxJobsPerUserLimit).
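For a quick summary of output like the example above, the state column can be tallied with standard tools. The sketch below assumes the default column layout (state code in the fifth field) and job names that contain no spaces; count_job_states is a hypothetical helper name, not a SLURM command.

```shell
# Hypothetical helper: count jobs per state code.
# Reads default-format squeue output on stdin and skips the header line.
# Assumes the state code (ST) is the fifth whitespace-separated field.
count_job_states() {
    awk 'NR > 1 { count[$5]++ } END { for (s in count) print s, count[s] }' | sort
}

# Usage (on a system with SLURM):
#   squeue --me | count_job_states
```

Each output line holds a state code followed by the number of jobs in that state.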

A list of the most relevant job states and reasons can be seen below:

SLURM Job States:

Jobs typically pass through several states in the course of their execution. The typical states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED.

Status          Code   Explanation
COMPLETED       CD     The job has completed successfully.
COMPLETING      CG     The job is finishing but some processes are still active.
CANCELLED       CA     The job was explicitly cancelled by the user or system administrator.
FAILED          F      The job terminated with a non-zero exit code or other failure condition.
PENDING         PD     The job is waiting for resource allocation. It will eventually run.
PREEMPTED       PR     The job was terminated because of preemption by another job.
RUNNING         R      The job is currently allocated to a node and is running.
SUSPENDED       S      A running job has been stopped with its cores released to other jobs.
STOPPED         ST     A running job has been stopped with its cores retained.
OUT_OF_MEMORY   OOM    The job experienced an out-of-memory error.
TIMEOUT         TO     The job exited because it reached its walltime limit.
NODE_FAIL       NF     The job terminated due to failure of one or more allocated nodes.

A full list of job states can be found at: https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES

SLURM Job Reasons:

These codes identify the reason that a job is waiting for execution. A job may be waiting for more than one reason, in which case only one of those reasons is displayed.

Reason Code               Explanation
Priority                  One or more higher priority jobs are in the queue. Your job will eventually run.
Dependency                This job is waiting for a dependent job to complete and will run afterwards.
Resources                 The job is waiting for resources to become available and will eventually run.
InvalidAccount            The job’s account is invalid. Cancel the job and rerun with the correct account.
InvalidQOS                The job’s QoS is invalid. Cancel the job and rerun with the correct QoS.
QOSGrpMaxJobsLimit        The maximum number of jobs for your job’s QoS has been reached; the job will run eventually.
PartitionMaxJobsLimit     The maximum number of jobs for your job’s partition has been reached; the job will run eventually.
AssociationMaxJobsLimit   The maximum number of jobs for your job’s association has been reached; the job will run eventually.
JobLaunchFailure          The job could not be launched. This may be due to a file system problem, invalid program name, etc.
NonZeroExitCode           The job terminated with a non-zero exit code.
SystemFailure             Failure of the Slurm system, a file system, the network, etc.
TimeLimit                 The job exhausted its time limit.
WaitingForScheduling      No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason.
BadConstraints            The job’s constraints cannot be satisfied.

A full list of job reasons can be found at: https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES
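To pick out just the pending jobs and their reason codes, the default squeue output can be filtered with awk. This is a minimal sketch that assumes the default column layout (state in the fifth field, reason from the eighth field onward) and job names without spaces; pending_reasons is a hypothetical helper name.

```shell
# Hypothetical helper: print "JOBID (Reason)" for each pending job.
# Reads default-format squeue output on stdin and skips the header line.
# Fields from the eighth onward are rejoined because a reason may contain
# spaces, e.g. "(ReqNodeNotAvail, UnavailableNodes:...)".
pending_reasons() {
    awk 'NR > 1 && $5 == "PD" {
        reason = $8
        for (i = 9; i <= NF; i++) reason = reason " " $i
        print $1, reason
    }'
}

# Usage (on a system with SLURM):
#   squeue --me | pending_reasons
```

If parsing the default layout proves fragile, squeue can be asked for only the relevant fields instead, for example with --states=PENDING and a --format string such as "%i %r" (job ID and reason).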