Job Submission and Control

Introduction

As mentioned in the what is HPC section, HPC clusters like Bessemer and Stanage use a program called a scheduler to control and submit work to appropriate nodes.

All user work is dispatched to a cluster using a tool called a job scheduler. A job scheduler is a tool used to manage, submit and fairly queue users’ jobs in the shared environment of an HPC cluster. A cluster will normally use a single scheduler and allow a user to request either an immediate interactive job, or a queued batch job.

Here at the University of Sheffield, on both Bessemer and Stanage we use the SLURM scheduler, which follows three basic principles:

  • it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work,

  • it provides a framework for starting, executing, and monitoring work on the set of allocated nodes,

  • it arbitrates contention for resources by managing a queue of pending work.


Key Concepts

Tip

If you are not familiar with basic computer architecture we highly recommend reading our General Computer Architecture Quick Start page before continuing.

Our documentation assumes familiarity with several scheduler and job concepts, which are explained below:

Types of Job

There are two types of job on any scheduler, interactive and batch:

Interactive jobs are requested and run immediately, providing the user with a bash shell (or a shell of their choosing) in which they can run their software or scripts.

Typically only very few nodes in an HPC cluster are dedicated solely to interactive jobs, and interactive jobs require the resources to be available instantaneously when the request is made or the request will fail. This means that interactive requests cannot always be fulfilled, particularly when requesting multiple cores.

Batch jobs are the other kind of job where a user prepares a batch submission script which both requests the resources for the job from the scheduler and contains the execution commands for a given program to run. On job submission, the scheduler will add it to the chosen queue and run your job when resources become available.

Any task that can be executed without any user intervention while it is running can be submitted as a batch job. This excludes jobs that require a Graphical User Interface (GUI); however, many common GUI applications such as ANSYS or MATLAB can also be used without their GUIs.

If you wish to use a cluster for interactive work and/or running applications like MATLAB or ANSYS using GUIs, you will need to request an interactive job from the scheduler.

If you wish to use a cluster to dispatch a very large ANSYS model you will need to request a batch job from the scheduler and prepare an appropriate batch script.

Note

Long running jobs should use the batch submission system rather than requesting an interactive session for a very long time. Doing this will lead to better cluster performance for all users.

Queues and partitions

Queues, called partitions in SLURM, are queues of jobs submitted to a scheduler for it to run. They can have an assortment of constraints such as a job size limit, a job time limit and a list of users permitted to use them. Some nodes will be configured to accept jobs only from certain queues, e.g. department-specific nodes.

All jobs are dispatchable

When a user requests that a job (either a batch or an interactive session) is run on the cluster, the scheduler will run jobs from the queue based on a set of rules, priorities and availabilities.

How and where a job can run is set when the job is requested, based on the resource amounts requested as well as the chosen queue (assuming the user has permission to use that queue).

This means that not all interactive jobs are possible as the resources may not be available. It also means that the amount of time it takes for any batch job to run is dependent on how large the job resource request is, which queue it is in, what resources are available in that queue and how much previous resource usage the user has. The larger a resource request is, the longer it will take to wait for those resources to become available and the longer it will take for subsequent jobs to queue as a result of the fair scheduling algorithm.

Fair scheduling

Job schedulers are typically configured to use a fair-share / wait time system. In short, the scheduler assesses your previous CPU time and memory time (consumption) to give a requested job a priority. Subsequently it uses how long your job has had to wait in order to bump up that priority. Once your job is the highest priority, the job will then run when the requested resources become available on the system. Your running total for CPU time / memory time usage will decay over time but in general the more resources you request and for longer, the lower your initial job priority gets and the longer you have to wait behind other people’s jobs.

If you are seeing one job finish and another immediately begin, this is not an intentional chaining setting on the scheduler’s part. It is most likely a reflection of your subsequent jobs waiting for resources to become available: when your running job finishes, it frees up the resources for the next.

As a natural consequence of backfilling into any trapped resources, you may see small time, memory and core request jobs with a lower priority running before your own with a higher priority. This is because they are small enough to utilize the trapped resource before the job trapping those resources is finished. This is not unfair and it would be inefficient and irresponsible for us to intentionally block a job from running simply because the priority is lower than a larger job that won’t fit in that trapped resource.


Job Submission / Control on Stanage & Bessemer

Tip

The Stanage & Bessemer clusters have been configured with resource request limits. Please see our Choosing appropriate compute resources page for further information.

Interactive Jobs

SLURM uses a single command to launch interactive jobs:

  • srun Standard SLURM command supporting graphical applications.

Usage of the command is as follows:

$ srun --pty bash -i

You can configure the resources available to the interactive session by adding command line options. For example to start an interactive session with access to 16 GB of RAM:

$ srun --mem=16G --pty bash -i

To start a session with access to 2 cores, use either:

$ srun --cpus-per-task=2 --pty bash -i #2 cores per task, 1 task and 1 node per job default. Preferred!
$ srun --ntasks-per-node=2 --pty bash -i #2 tasks per node, 1 core per task and 1 node per job default.

Please take care with your chosen options as usage in concert with other options can be multiplicative.

A further explanation of why you may use the tasks options or cpus options can be found here.
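The multiplicative effect mentioned above can be sketched with a little shell arithmetic. This is only an illustration, assuming the total core count is the simple product of nodes, tasks per node and cores per task; the function name total_cores is hypothetical and is not part of SLURM.

```shell
#!/bin/bash
# Hypothetical helper: estimate how many cores a request adds up to.
# Assumes total cores = nodes x tasks per node x cores per task.
total_cores() {
    local nodes="${1:-1}" tasks_per_node="${2:-1}" cpus_per_task="${3:-1}"
    echo $(( nodes * tasks_per_node * cpus_per_task ))
}

total_cores 1 1 2   # --cpus-per-task=2 on its own
total_cores 1 2 1   # --ntasks-per-node=2 on its own
total_cores 1 2 2   # both options together on one node
```

The first two calls each describe a 2 core session, while combining both options on a single node asks for 4 cores rather than 2.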

A table of common interactive job options is given below; any of these can be combined together to request more resources.

Slurm Command                 Description
-t min or -t days-hh:mm:ss    Specify the total maximum wall clock execution time for the job. The upper limit is 08:00:00. Note: these limits may differ for reservations / projects.
--mem=xxG                     Specify the maximum amount (xx) of real memory to be requested per node. If the real memory usage of your job exceeds this value multiplied by the number of cores / nodes you requested then your job will be killed.
-c nn or --cpus-per-task=nn   Cores per task; take care with your chosen number of tasks.
--ntasks-per-node=nn          Tasks per node; take care with your chosen number of cores per node. The default is one task per node, but note that other options can adjust the default of 1 core per task e.g. --cpus-per-task.

Rejoining an interactive job

If we lose connection to an interactive job, we can use the sattach command which attaches to a running Slurm job step. Just keep in mind that sattach doesn’t work for external or batch steps, as they aren’t set up for direct attachment.

Example:

[te1st@login1 [stanage] ~]$ squeue --me
        JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        833300 interacti     bash   te1st  R      31:22      1 node001
[te1st@login1 [stanage] ~]$ sattach 833300.0
[te1st@node001 [stanage] ~]$ echo $SLURM_JOB_ID
833300

Here we attached to step 0 of SLURM job 833300. For more information type man sattach.

Batch Jobs

Tip

Batch jobs have larger resource limits than interactive jobs! For guidance on what these limits are and how best to select resources please see our Choosing appropriate compute resources page.

SLURM uses a single command to submit batch jobs:

  • sbatch Standard SLURM command with no support for interactivity or graphical applications.

The Slurm docs have a complete list of available sbatch options.

Batch submission scripts are submitted to the scheduler as below:

sbatch submission.sh

Note the job submission number. For example:

Submitted batch job 1226
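If you want to reuse the job number in later commands, it can be pulled out of that message. The following is a minimal sketch that assumes the message keeps the "Submitted batch job <id>" wording shown above; the function name job_id_from_sbatch_output is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric job ID from sbatch's message.
job_id_from_sbatch_output() {
    # Assumes the "Submitted batch job <id>" wording; prints the 4th word.
    awk '/^Submitted batch job/ { print $4 }' <<< "$1"
}

job_id=$(job_id_from_sbatch_output "Submitted batch job 1226")
echo "$job_id"
```

In practice the function could wrap the real command, for example job_id=$(job_id_from_sbatch_output "$(sbatch submission.sh)"). sbatch also offers a --parsable option that prints just the job ID (followed by the cluster name where one applies), which avoids the parsing step altogether.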

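You can check your output log or error log file as below. Unless your script sets --output or --error, SLURM writes both to slurm-<job-id>.out in the directory you submitted from:

cat slurm-1226.out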

There are numerous further options you can request in your batch submission files which are detailed below:

Name your job submission:

#SBATCH --job-name=JOB_NAME

Specify a number of nodes:

#SBATCH --nodes=1

Warning

Note that the Bessemer free queues do not permit the use of more than 1 node per job.

Specify a number of tasks per node:

#SBATCH --ntasks-per-node=4

Specify a number of tasks:

#SBATCH --ntasks=4

Specify a number of cores per task:

#SBATCH --cpus-per-task=4

Request a specific amount of memory per node:

#SBATCH --mem=16G

Request a specific amount of memory per CPU core:

#SBATCH --mem-per-cpu=16G

Specify the job output log file name:

#SBATCH --output=output.%j.test.out

Request a specific amount of time:

#SBATCH --time=00:30:00

Request job update email notifications (combine with --mail-type, e.g. --mail-type=FAIL, to choose which events trigger an email):

#SBATCH --mail-user=username@sheffield.ac.uk

For the full list of the available options please visit the SLURM manual webpage for sbatch here: https://slurm.schedmd.com/sbatch.html

Here is an example SLURM batch submission script that runs a fictitious program called foo:

#!/bin/bash
# Request 5 gigabytes of real memory (mem)
#SBATCH --mem=5G

# load the module for the program we want to run
module load apps/gcc/foo

# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res

Some things to note:

  • The first line always needs to be #!/bin/bash (to tell the scheduler that this is a bash batch script).

  • Comments start with a #.

  • It is always best to fully specify your job’s resources in your submission script.

  • All SLURM scheduler options start with #SBATCH.

  • Use the SLURM option --ntasks=nn to set the number of “tasks” for programs using distributed parallelism (MPI).

  • Use the SLURM option --ntasks-per-node=nn to set the number of “tasks per node” for programs using distributed parallelism (MPI). Note that the Bessemer free queues do not permit the use of more than 1 node per job.

  • Use the SLURM option --cpus-per-task=nn to set the number of “cores per task” for programs using shared memory parallelism.

  • You will often require one or more module commands in your submission file to make programs and libraries available to your scripts. Many applications and libraries are available as modules on Bessemer and Stanage.

Here is a more complex example that requests more resources:

#!/bin/bash
# Request 16 gigabytes of real memory (RAM) 4 cores *4G = 16
#SBATCH --mem=16G
# Request 4 cores
#SBATCH --cpus-per-task=4
# Email notifications to me@somedomain.com
#SBATCH --mail-user=me@somedomain.com
# Email notifications if the job fails
#SBATCH --mail-type=FAIL
# Change the name of the output log file.
#SBATCH --output=output.%j.test.out
# Rename the job's name
#SBATCH --job-name=my_job


# Load the modules required by our program
module load compilers/gcc/5.2
module load apps/gcc/foo

# Set the OMP_NUM_THREADS environment variable to the number of requested cores (4)
# This is needed to ensure efficient core usage.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res
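One detail worth guarding against in scripts like the one above: SLURM only sets SLURM_CPUS_PER_TASK when --cpus-per-task has been requested, so a script that drops that option would export an empty thread count. A minimal defensive sketch, assuming a single thread is an acceptable fallback for your program:

```shell
#!/bin/bash
# Fall back to 1 thread when SLURM_CPUS_PER_TASK is unset or empty,
# e.g. when --cpus-per-task was not requested or the script is run
# outside of a SLURM job.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with ${OMP_NUM_THREADS} OpenMP thread(s)"
```

The same pattern lets you dry-run a submission script on a login node or in an interactive session without editing it.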

Tip

Bessemer currently supports running preemptable jobs. These are jobs which have been set to run in a reserved queue’s node when those nodes are idle. These reserved queues are typically private (research group-owned or dept-owned) nodes, but these resources will be reclaimed (and the associated jobs preempted) if members of those groups/departments submit jobs that can only start if those resources are repurposed.

For more details on running preemptable jobs on Bessemer please see: Preemptable jobs

Monitoring running Jobs

There are two commands to monitor running and queued jobs: squeue and sstat.

The squeue command is used to pull up information about jobs in the queue. By default this command will list the job ID, partition, job name, username, job status, elapsed time, number of nodes, and name of nodes (or the reason a job is pending) for all jobs queued or running within SLURM.

Display all jobs queued on the system:

$ squeue

To limit this command to only display a single user’s jobs the --user flag can be used:

$ squeue --user=$USER

To limit this command to only display your own jobs, the --me flag can be used:

$ squeue --me

Further information without abbreviation can be shown by using the --long flag:

$ squeue --me --long

The squeue command also provides a method to calculate the estimated start time for a job by using the --start flag:

$ squeue --me --start

The accuracy of squeue --start estimates varies due to factors like queue dynamics, resource availability (affected by maintenance, node failures, etc), making it a guideline rather than a guarantee.

When checking the status of a job you may wish to check for updates at a time interval. This can be achieved by using the --iterate flag and a number of seconds:

$ squeue --me --start --iterate=n_seconds

You can stop this command by pressing Ctrl + C.

Example output:

$ squeue
        JOBID   PARTITION   NAME      USER  ST       TIME  NODES NODELIST(REASON)
        1234567 interacti   bash   foo1bar   R   17:19:40      1 bessemer-node001
        1234568 sheffield job.sh   foo1bar   R   17:21:40      1 bessemer-node046
        1234569 sheffield job.sh   foo1bar  PD   17:22:40      1 (Resources)
        1234570 sheffield job.sh   foo1bar  PD   16:47:06      1 (Priority)
        1234571       gpu job.sh   foo1bar   R 1-19:46:53      1 bessemer-node026
        1234572       gpu job.sh   foo1bar   R 1-19:46:54      1 bessemer-node026
        1234573       gpu job.sh   foo1bar   R 1-19:46:55      1 bessemer-node026
        1234574       gpu job.sh   foo1bar   R 1-19:46:56      1 bessemer-node026
        1234575       gpu job.sh   foo1bar  PD       9:04      1 (ReqNodeNotAvail, UnavailableNodes:bessemer-node026)
        1234576 sheffield job.sh   foo1bar  PD    2:57:24      1 (QOSMaxJobsPerUserLimit)

The ST column above shows each job’s state, including running (“R”) and pending (“PD”). The final column gives the reason for a pending state, for example a requested node that is unavailable because it is full of jobs (ReqNodeNotAvail) or a user reaching the maximum number of jobs they can run simultaneously in a QOS (QOSMaxJobsPerUserLimit).
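When you have many jobs, a quick tally per state can be easier to read than the full listing. The sketch below assumes the default squeue column layout shown in the example output, where the state code is the fifth column; the function name summarise_states and the sample text are illustrative only.

```shell
#!/bin/bash
# Hypothetical helper: count jobs per state from squeue-style text on stdin.
summarise_states() {
    # Skip the header line, tally the fifth column (ST), sort for stable output.
    awk 'NR > 1 { count[$5]++ } END { for (s in count) print s, count[s] }' | sort
}

# A shortened copy of the example output, used here in place of a live squeue.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1234567 interacti bash foo1bar R 17:19:40 1 bessemer-node001
1234569 sheffield job.sh foo1bar PD 17:22:40 1 (Resources)
1234570 sheffield job.sh foo1bar PD 16:47:06 1 (Priority)'

printf '%s\n' "$sample" | summarise_states
```

On a cluster the same function can read live data with squeue --me | summarise_states.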

A list of the most relevant job states and reasons can be seen below:

SLURM Job States:

Jobs typically pass through several states in the course of their execution. The typical states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED.

Status          Code   Explanation
COMPLETED       CD     The job has completed successfully.
COMPLETING      CG     The job is finishing but some processes are still active.
CANCELLED       CA     Job was explicitly cancelled by the user or system administrator.
FAILED          F      The job terminated with a non-zero exit code and failed to execute.
PENDING         PD     The job is waiting for resource allocation. It will eventually run.
PREEMPTED       PR     The job was terminated because of preemption by another job.
RUNNING         R      The job currently is allocated to a node and is running.
SUSPENDED       S      A running job has been stopped with its cores released to other jobs.
STOPPED         ST     A running job has been stopped with its cores retained.
OUT_OF_MEMORY   OOM    Job experienced out of memory error.
TIMEOUT         TO     Job exited because it reached its walltime limit.
NODE_FAIL       NF     Job terminated due to failure of one or more allocated nodes.

A full list of job states can be found at: https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES

SLURM Job Reasons:

These codes identify the reason that a job is waiting for execution. A job may be waiting for more than one reason, in which case only one of those reasons is displayed.

Reason Code               Explanation
Priority                  One or more higher priority jobs are in the queue. Your job will eventually run.
Dependency                This job is waiting for a dependent job to complete and will run afterwards.
Resources                 The job is waiting for resources to become available and will eventually run.
InvalidAccount            The job’s account is invalid. Cancel the job and rerun with the correct account.
InvalidQOS                The job’s QoS is invalid. Cancel the job and rerun with the correct QoS.
QOSGrpMaxJobsLimit        Maximum number of jobs for your job’s QoS has been met; job will run eventually.
PartitionMaxJobsLimit     Maximum number of jobs for your job’s partition has been met; job will run eventually.
AssociationMaxJobsLimit   Maximum number of jobs for your job’s association has been met; job will run eventually.
JobLaunchFailure          The job could not be launched. This may be due to a file system problem, invalid program name, etc.
NonZeroExitCode           The job terminated with a non-zero exit code.
SystemFailure             Failure of the Slurm system, a file system, the network, etc.
TimeLimit                 The job exhausted its time limit.
WaitingForScheduling      No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason.
BadConstraints            The job’s constraints cannot be satisfied.

A full list of job reasons can be found at: https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES

The sstat command can be used to display status information about a user’s currently running jobs such as the CPU usage, task or node information and memory consumption.

The command can be invoked as follows with a specific job ID:

$ sstat --jobs=job-id

And to display specific information you can use the --format flag to choose your output:

$ sstat --jobs=job-id --format=var_1,var_2, ... , var_N

Some of these variables are listed in the table below:

sstat format variable names

Variable    Description
AveCPU      Average (system + user) CPU time of all tasks in job.
AveRSS      Average resident set size of all tasks in job.
AveVMSize   Average Virtual Memory size of all tasks in job.
JobID       The id of the Job.
MaxRSS      Maximum resident set size of all tasks in job.
MaxVMSize   Maximum Virtual Memory size of all tasks in job.
NTasks      Total number of tasks in a job or step.

A full list of variables for the --format flag can be found with the --helpformat flag or by visiting the slurm page on sstat.

Stopping or cancelling Jobs

Jobs can be stopped or cancelled using the scancel command and the job’s ID (replacing job-id with the number):

$ scancel job-id

To cancel multiple jobs you can supply a space separated list of job IDs:

$ scancel job-id1 job-id2 job-id3

Investigating finished Jobs

Jobs which have already finished can be investigated using the seff script, which takes the job’s ID and gives a summary of important job information:

$ seff job-id

For example, on the Stanage cluster:

$ seff 64626
Job ID: 64626
Cluster: stanage.alces.network
User/Group: a_user/clusterusers
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 1
CPU Utilized: 00:02:37
CPU Efficiency: 35.68% of 00:07:20 core-walltime
Job Wall-clock time: 00:03:40
Memory Utilized: 137.64 MB (estimated maximum)
Memory Efficiency: 1.71% of 7.84 GB (3.92 GB/core)
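The two efficiency figures in this output can be reproduced by hand, which helps when deciding how far to scale a request down. The sketch below assumes that core-walltime is the wall-clock time multiplied by the allocated cores (2 nodes with 1 core each here) and that 1 GB is counted as 1024 MB; the function names are hypothetical and this is not seff's own code.

```shell
#!/bin/bash
# CPU efficiency: CPU seconds used as a percentage of (wall seconds x cores).
cpu_efficiency() {
    awk -v used="$1" -v wall="$2" -v cores="$3" \
        'BEGIN { printf "%.2f\n", 100 * used / (wall * cores) }'
}

# Memory efficiency: MB used as a percentage of the GB requested.
memory_efficiency() {
    awk -v used_mb="$1" -v requested_gb="$2" \
        'BEGIN { printf "%.2f\n", 100 * used_mb / (requested_gb * 1024) }'
}

# 00:02:37 of CPU time is 157 s; 00:03:40 of wall-clock time is 220 s.
cpu_efficiency 157 220 2
memory_efficiency 137.64 7.84
```

With the values above these print 35.68 and 1.71, matching the seff report. A memory efficiency this low suggests the job could request far less memory.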

For even more depth, the sacct command can be used to display status information about a user’s historical jobs.

The command can be used as follows with the job’s ID:

$ sacct --jobs=job-id

Or to view information about all of a specific user’s jobs:

$ sacct --user=$USER

By default the sacct command will only bring up information about the user’s jobs from the current day. By using the --starttime flag the command will look further back to the given date, e.g.:

$ sacct --user=$USER --starttime=YYYY-MM-DD

Like the sstat command, the --format flag can be used to choose the command output:

$ sacct --user=$USER --format=var_1,var_2, ... ,var_N

sacct format variable names

Variable       Description
Account        The account the job ran under.
AveCPU         Average (system + user) CPU time of all tasks in job.
AveRSS         Average resident set size of all tasks in job.
AveVMSize      Average Virtual Memory size of all tasks in job.
CPUTime        Formatted (Elapsed time * CPU) count used by a job or step.
Elapsed        The job’s elapsed time formatted as DD-HH:MM:SS.
ExitCode       The exit code returned by the job script or salloc.
JobID          The id of the Job.
JobName        The name of the Job.
MaxRSS         Maximum resident set size of all tasks in job.
MaxVMSize      Maximum Virtual Memory size of all tasks in job.
MaxDiskRead    Maximum number of bytes read by all tasks in the job.
MaxDiskWrite   Maximum number of bytes written by all tasks in the job.
ReqCPUS        Requested number of CPUs.
ReqMem         Requested amount of memory.
ReqNodes       Requested number of nodes.
NCPUS          The number of CPUs used in a job.
NNodes         The number of nodes used in a job.
User           The username of the person who ran the job.

A full list of variables for the --format flag can be found with the --helpformat flag or by visiting the slurm page on sacct.

Debugging failed Jobs

If one of your jobs has failed and you need to debug why this has occurred, you should consult the job records held by the scheduler using the sacct command referenced above, as well as the generated job logs.

These output and error log files are generated in the job working directory. Unless you set a different name with the --output option, the file name takes the form slurm-$SLURM_JOB_ID.out, where $SLURM_JOB_ID is the scheduler-provided job ID. Looking at these logs should indicate the source of any issues.

sacct will also give a job’s state and ExitCode field with each job.

The ExitCode is the return value of the exiting program/script. It can be a user defined value if the job is finished with a call to ‘exit(number)’. Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a Reason of “NonZeroExitCode”.

The job logs may also include a “derived exit code” field. This is set to the value of the highest exit code returned by all of the job’s steps (srun invocations).
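When reading the ExitCode column it helps to know that sacct prints it as two numbers separated by a colon. The sketch below assumes the first number is the exit code and the second is the number of the signal that terminated the job (0 if none); the function name explain_exit_code is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: turn an "exit:signal" pair into a short description.
explain_exit_code() {
    local code="${1%%:*}" signal="${1##*:}"
    if [ "$signal" -ne 0 ]; then
        echo "killed by signal $signal"
    elif [ "$code" -ne 0 ]; then
        echo "failed with exit code $code"
    else
        echo "exited cleanly"
    fi
}

explain_exit_code "0:0"
explain_exit_code "1:0"
explain_exit_code "0:9"
```

A pair such as 0:9 (signal 9, SIGKILL) often means the job was killed from outside, for example for exceeding a limit, so it is worth checking the job State alongside it.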


Cluster job resource limits

While the Sheffield clusters have very large amounts of resources to use for your jobs there are limits applied in order for the schedulers to function. The limits below apply to the default free queues. Other queues may have different settings.

Warning

You must ensure that your jobs do not attempt to exceed these limits as the schedulers are not forgiving and will summarily kill any job which exceeds the requested limits without warning.

CPU Limits

Warning

Please note that for either cluster the larger the number of cores you request in an interactive job the more likely the request is to fail as the requested resource is not immediately available.

CPU Allocation Limits Table

Scheduler Type     Interactive Job cores (Default / Min / Max)   Batch Job cores (Default / Min / Max)   Submission Argument
SLURM (Stanage)    1 / 1 / ~11264 (MPI), 64 (SMP)                1 / 1 / ~11264 (MPI), 64 (SMP)          -c <nn>
SLURM (Bessemer)   1 / 1 / 40                                    1 / 1 / 40                              -c <nn>

Time Limits

Time Allocation Limits Table

Scheduler Type     Interactive Job (Default / Max)   Batch Job (Default / Max)   Submission Argument
SLURM (Stanage)    8 / 8 hrs                         8 / 96 hrs                  --time=<days-hh:mm:ss>
SLURM (Bessemer)   8 / 8 hrs                         8 / 168 hrs                 --time=<days-hh:mm:ss>
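Because the limits are stated in hours while --time takes days-hh:mm:ss, it can be handy to convert a request before submitting. This is a rough sketch that only handles the full hh:mm:ss and days-hh:mm:ss forms; the function names are hypothetical and the limit passed in is whatever applies to your queue.

```shell
#!/bin/bash
# Convert [days-]hh:mm:ss into whole hours, rounding any remainder up.
slurm_time_to_hours() {
    local spec="$1" days=0 h m s
    if [[ "$spec" == *-* ]]; then
        days="${spec%%-*}"
        spec="${spec#*-}"
    fi
    IFS=: read -r h m s <<< "$spec"
    # 10# forces base 10 so that values such as "08" are not read as octal.
    local total=$(( (10#$days * 24 + 10#$h) * 3600 + 10#$m * 60 + 10#$s ))
    echo $(( (total + 3599) / 3600 ))
}

# Succeeds when the request ($1) fits within a limit given in hours ($2).
within_time_limit() {
    [ "$(slurm_time_to_hours "$1")" -le "$2" ]
}

slurm_time_to_hours "4-00:00:00"
within_time_limit "4-00:00:00" 96 && echo "fits the 96 hour batch limit"
```

A request of 4-00:00:00 is exactly 96 hours, so it fits the Stanage batch limit, whereas the same request would be too long for an 8 hour interactive session.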

Memory Limits

Memory Allocation Limits Table

                                                SLURM (Stanage)                    SLURM (Bessemer)
                                                Cross node MPI execution enabled   Single node execution only
Default Job Memory Request                      4016 MB                            2 GB
Standard Nodes                                  251 GB                             192 GB
Large RAM Nodes                                 1007 GB                            N/A
Very Large RAM Nodes                            2014 GB                            N/A
Interactive Job: Maximum Possible Request       251 GB                             192 GB
Batch Job (SMP): Maximum Regular Node Request   251 GB                             192 GB
Batch Job (SMP): Maximum Possible Request       2014 GB                            192 GB
Batch Job (MPI): Maximum Possible Request       ~74404 GB                          192 GB
Submission Argument on a per node (job) basis   --mem=<nn>                         --mem=<nn>
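Since --mem is a per node request while --mem-per-cpu scales with the cores used, it is easy to exceed a node's memory without noticing. The sketch below assumes whole gigabytes and that all requested cores land on a single node; the function names are hypothetical and the limits are the standard node figures from the table above.

```shell
#!/bin/bash
# Per node memory implied by a --mem-per-cpu request (whole GB).
mem_per_node_gb() {
    local mem_per_cpu_gb="$1" cores_on_node="$2"
    echo $(( mem_per_cpu_gb * cores_on_node ))
}

# Succeeds when a per node request (GB) fits within the node limit (GB).
fits_on_node() {
    local request_gb="$1" limit_gb="$2"
    [ "$request_gb" -le "$limit_gb" ]
}

request=$(mem_per_node_gb 16 4)   # --mem-per-cpu=16G with 4 cores on one node
if fits_on_node "$request" 192; then
    echo "${request} GB fits on a standard Bessemer node"
fi
```

Here --mem-per-cpu=16G with 4 cores amounts to 64 GB on the node, well within the 192 GB available on a standard Bessemer node.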

Advanced / Automated job submission and management

Further information on advanced or automated job submission and management can be found on our dedicated pages: Advanced Job Submission and Control and Advanced Job Profiling and Analysis.

Reference information and further resources

Quick reference information for the SLURM scheduler used on both the Stanage and Bessemer clusters can be found in the Scheduler Reference Info section.

An SGE to SLURM conversion guide is provided in the Quick Reference section.