squeue
squeue is a scheduler command used to view information about jobs in the SLURM scheduling queue.
Documentation
Documentation is available on the system using the command
$ man squeue
Usage
The squeue command displays information about jobs in the queue. By default it lists the job ID, partition, job name, username, job state, elapsed time, number of nodes, and node list (or the reason a job is pending) for all jobs queued or running within SLURM.
Display all jobs queued on the system:
$ squeue
To limit this command to only display a single user’s jobs, the --user flag can be used:
$ squeue --user=$USER
To limit this command to only display your own jobs, the --me flag can be used:
$ squeue --me
Further information without abbreviation can be shown by using the --long flag:
$ squeue --me --long
The squeue command can also report the estimated start time of a pending job when the --start flag is used:
$ squeue --me --start
The accuracy of squeue --start estimates varies with factors such as queue dynamics and resource availability (affected by maintenance, node failures, etc.), so treat them as a guideline rather than a guarantee.
When checking the status of a job you may wish to refresh the output at a regular interval. This can be achieved by using the --iterate flag with a number of seconds:
$ squeue --me --start --iterate=n_seconds
You can stop this command by pressing Ctrl + C.
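In a batch script, where an interactive refreshing display is not useful, the same polling idea can be sketched as a loop that waits until a job no longer appears in the queue. This is a minimal sketch rather than a site-recommended method: the function name wait_for_job and the 60 second default interval are assumptions made for illustration, and the loop relies on the standard --jobs and --noheader flags of squeue.

```shell
# Hypothetical helper: block until a job no longer appears in the queue.
# $1: job ID; $2: polling interval in seconds (assumed default: 60).
wait_for_job() {
    job_id=$1
    interval=${2:-60}
    # --noheader suppresses the column titles, so empty output means the
    # job is no longer listed. Error messages (for example for a job ID
    # that has already been purged) are discarded and treated the same way.
    while [ -n "$(squeue --jobs="$job_id" --noheader 2>/dev/null)" ]; do
        sleep "$interval"
    done
}

# Typical use (not run here):
#   wait_for_job 1234567 30 && echo "job 1234567 has left the queue"
```

Note that the loop only reports that the job has left the queue; it does not say whether the job completed successfully. Keep the interval reasonably long so that the scheduler is not queried more often than necessary.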
Example output:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1234567 interacti bash foo1bar R 17:19:40 1 bessemer-node001
1234568 sheffield job.sh foo1bar R 17:21:40 1 bessemer-node046
1234569 sheffield job.sh foo1bar PD 17:22:40 1 (Resources)
1234570 sheffield job.sh foo1bar PD 16:47:06 1 (Priority)
1234571 gpu job.sh foo1bar R 1-19:46:53 1 bessemer-node026
1234572 gpu job.sh foo1bar R 1-19:46:54 1 bessemer-node026
1234573 gpu job.sh foo1bar R 1-19:46:55 1 bessemer-node026
1234574 gpu job.sh foo1bar R 1-19:46:56 1 bessemer-node026
1234575 gpu job.sh foo1bar PD 9:04 1 (ReqNodeNotAvail, UnavailableNodes:bessemer-node026)
1234576 sheffield job.sh foo1bar PD 2:57:24 1 (QOSMaxJobsPerUserLimit)
The ST column above shows the state of each job, including running (R) and pending (PD). Pending jobs list a reason in the final column, for example waiting for free resources (Resources), queuing behind higher priority jobs (Priority), a requested node that is currently unavailable (ReqNodeNotAvail), or a user reaching the maximum number of jobs they can run simultaneously in a QOS (QOSMaxJobsPerUserLimit).
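When the queue listing is long, a per-state summary can be easier to read than the raw output. The following is a minimal sketch that counts jobs by state code. The function name count_states is hypothetical, and the sketch assumes the default column layout shown above (state code in the fifth field) and job names that contain no spaces.

```shell
# Hypothetical helper: count jobs per state code.
# Reads squeue's default output on stdin; assumes ST is the fifth field.
count_states() {
    awk 'NR > 1 { count[$5]++ }                       # skip the header line
         END { for (s in count) print s, count[s] }' | sort
}

# Typical use (not run here):
#   squeue --me | count_states
```

Applied to the example output above, this would report 4 pending (PD) and 6 running (R) jobs.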
A list of the most relevant job states and reasons can be seen below:
SLURM Job States:
Jobs typically pass through several states in the course of their execution. The typical states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED.
Status | Code | Explanation |
---|---|---|
COMPLETED | CD | The job has completed successfully. |
COMPLETING | CG | The job is finishing but some processes are still active. |
CANCELLED | CA | Job was explicitly cancelled by the user or system administrator. |
FAILED | F | The job terminated with a non-zero exit code and failed to execute. |
PENDING | PD | The job is waiting for resource allocation. It will eventually run. |
PREEMPTED | PR | The job was terminated because of preemption by another job. |
RUNNING | R | The job currently is allocated to a node and is running. |
SUSPENDED | S | A running job has been stopped with its cores released to other jobs. |
STOPPED | ST | A running job has been stopped with its cores retained. |
OUT_OF_MEMORY | OOM | Job experienced out of memory error. |
TIMEOUT | TO | Job exited because it reached its walltime limit. |
NODE_FAIL | NF | Job terminated due to failure of one or more allocated nodes. |
A full list of job states can be found at: https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES
SLURM Job Reasons:
These codes identify the reason that a job is waiting for execution. A job may be waiting for more than one reason, in which case only one of those reasons is displayed.
Reason Code | Explanation |
---|---|
Priority | One or more higher priority jobs are in the queue. Your job will eventually run. |
Dependency | This job is waiting for a dependent job to complete and will run afterwards. |
Resources | The job is waiting for resources to become available and will eventually run. |
InvalidAccount | The job’s account is invalid. Cancel the job and rerun with the correct account. |
InvalidQOS | The job’s QoS is invalid. Cancel the job and rerun with the correct QoS. |
QOSGrpMaxJobsLimit | Maximum number of jobs for your job’s QoS has been met; job will run eventually. |
PartitionMaxJobsLimit | Maximum number of jobs for your job’s partition has been met; job will run eventually. |
AssociationMaxJobsLimit | Maximum number of jobs for your job’s association has been met; job will run eventually. |
JobLaunchFailure | The job could not be launched. This may be due to a file system problem, invalid program name, etc. |
NonZeroExitCode | The job terminated with a non-zero exit code. |
SystemFailure | Failure of the Slurm system, a file system, the network, etc. |
TimeLimit | The job exhausted its time limit. |
WaitingForScheduling | No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason. |
BadConstraints | The job’s constraints cannot be satisfied. |
A full list of job reasons can be found at: https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES
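To see at a glance why jobs are waiting, the reason can be pulled out of the final column of the default squeue output. The following is a minimal sketch: the function name pending_reasons is hypothetical, and it assumes the default column layout (state in the fifth field, reason from the eighth field onwards) and job names that contain no spaces.

```shell
# Hypothetical helper: print "JOBID REASON" for each pending job.
# Reads squeue's default output on stdin.
pending_reasons() {
    awk 'NR > 1 && $5 == "PD" {
        # A reason may itself contain spaces, so rejoin fields 8..NF.
        reason = $8
        for (i = 9; i <= NF; i++) reason = reason " " $i
        gsub(/^\(|\)$/, "", reason)    # strip the surrounding parentheses
        print $1, reason
    }'
}

# Typical use (not run here):
#   squeue --me | pending_reasons
```

Each reason printed can then be looked up in the table above or in the full list of job reasons.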