Job Submission and Control
Introduction
As mentioned in the what is HPC section, HPC clusters like Bessemer and Stanage use a program called a scheduler to control and submit work to appropriate nodes.
All user work is dispatched to a cluster using a tool called a job scheduler. A job scheduler is a tool used to manage, submit and fairly queue users’ jobs in the shared environment of a HPC cluster. A cluster will normally use a single scheduler and allow a user to request either an immediate interactive job, or a queued batch job.
Here at the University of Sheffield, both Bessemer and Stanage use the SLURM scheduler. Like most job schedulers, it has three basic functions:
it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work,
it provides a framework for starting, executing, and monitoring work on the set of allocated nodes,
it arbitrates contention for resources by managing a queue of pending work.
Key Concepts
Tip
If you are not familiar with basic computer architecture we highly recommend reading our General Computer Architecture Quick Start page before continuing.
Several concepts relating to schedulers and jobs are used throughout our documentation and are explained below:
Types of Job
There are two types of job on any scheduler, interactive and batch:
Interactive jobs are requested and run immediately, providing the user with a bash shell (or a shell of their choosing) in which they can run their software or scripts.
Typically only very few nodes in a HPC cluster are dedicated solely to interactive jobs, and interactive jobs require the resources to be available instantaneously when the request is made or the request will fail. This means that interactive requests cannot always be fulfilled, particularly when requesting multiple cores.
Batch jobs are the other kind of job where a user prepares a batch submission script which both requests the resources for the job from the scheduler and contains the execution commands for a given program to run. On job submission, the scheduler will add it to the chosen queue and run your job when resources become available.
Any task that can be executed without any user intervention while it is running can be submitted as a batch job. This excludes jobs that require a Graphical User Interface (GUI), however, many common GUI applications such as ANSYS or MATLAB can also be used without their GUIs.
If you wish to use a cluster for interactive work and/or running applications like MATLAB or ANSYS using GUIs, you will need to request an interactive job from the scheduler.
If you wish to use a cluster to dispatch a very large ANSYS model you will need to request a batch job from the scheduler and prepare an appropriate batch script.
Note
Long running jobs should use the batch submission system rather than requesting an interactive session for a very long time. Doing this will lead to better cluster performance for all users.
Queues and partitions
Queues, called partitions in SLURM, hold the jobs submitted to a scheduler until it can run them. They can have an assortment of constraints such as a job size limit, a job time limit and restrictions on which users are permitted to use them. Some nodes will be configured to accept jobs only from certain queues, e.g. department-specific nodes.
All jobs are dispatchable
When a user requests that a job (either a batch or an interactive session) is run on the cluster, the scheduler will run jobs from the queue based on a set of rules, priorities and availabilities.
How and where a job can run are set when the job is requested, based on the resource amounts requested as well as the chosen queue (assuming the user has permission to use that queue).
This means that not all interactive jobs are possible as the resources may not be available. It also means that the amount of time it takes for any batch job to run is dependent on how large the job resource request is, which queue it is in, what resources are available in that queue and how much previous resource usage the user has. The larger a resource request is, the longer it will take to wait for those resources to become available and the longer it will take for subsequent jobs to queue as a result of the fair scheduling algorithm.
Fair scheduling
Job schedulers are typically configured to use a fair-share / wait time system. In short, the scheduler assesses your previous CPU time and memory time (consumption) to give a requested job a priority. Subsequently it uses how long your job has had to wait in order to bump up that priority. Once your job is the highest priority, the job will then run when the requested resources become available on the system. Your running total for CPU time / memory time usage will decay over time but in general the more resources you request and for longer, the lower your initial job priority gets and the longer you have to wait behind other people’s jobs.
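As a rough illustration of how recorded usage can decay over time, here is a toy calculation. It is not the scheduler's actual formula: it simply assumes that usage halves after a fixed number of days. The 14-day half-life and the helper name decayed_usage are illustrative assumptions, not settings taken from our clusters.

```shell
#!/bin/bash
# Toy model only: assumes recorded usage decays exponentially with a fixed half-life.
# decayed_usage USAGE DAYS_ELAPSED HALF_LIFE_DAYS   (hypothetical helper)
decayed_usage() {
    awk -v u="$1" -v d="$2" -v h="$3" 'BEGIN { printf "%.2f\n", u * 0.5 ^ (d / h) }'
}

# 100 core-hours of usage, viewed 14 days later with an assumed 14-day half-life
decayed_usage 100 14 14
```

Under these assumptions the 100 core-hours count as 50.00 after one half-life, so older usage weighs less on the priority of new jobs.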
If you are seeing one job start and another immediately begin this is not an intentional chaining setting on the scheduler’s part. This is quite likely simply a reflection of your subsequent jobs waiting for resources to become available and it just so happens that your running job finishes freeing up the resources for the next.
As a natural consequence of backfilling into any trapped resources - you may see small time, memory and core request jobs with a lower priority running before your own with a higher priority. This is because they are small enough to utilize the trapped resource before the job trapping those resources is finished. This is not unfair and it would be inefficient and irresponsible for us to intentionally block a job from running simply because the priority is lower than a larger job that won’t fit in that trapped resource.
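The backfilling idea can be sketched with a toy check. This is not the scheduler's real algorithm, only the intuition behind it: a small job may run early if it fits in the idle cores and would finish before those cores are needed by the larger job. The helper name can_backfill and all the numbers are hypothetical.

```shell
#!/bin/bash
# Toy backfill check (an illustration, not the scheduler's real algorithm).
# can_backfill JOB_CORES JOB_MINUTES IDLE_CORES GAP_MINUTES
# Succeeds when the job fits in the idle cores and finishes before the gap closes.
can_backfill() {
    [ "$1" -le "$3" ] && [ "$2" -le "$4" ]
}

# Assume 2 idle cores are trapped for 30 minutes behind a large pending job
if can_backfill 1 20 2 30; then echo "1-core, 20-minute job: can backfill"; fi
if ! can_backfill 4 20 2 30; then echo "4-core, 20-minute job: must wait"; fi
```

This is also why accurate, modest time requests can help small jobs start sooner.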
Job Submission / Control on Stanage & Bessemer
Tip
The Stanage & Bessemer clusters have been configured with resource request limits. Please see our Choosing appropriate compute resources page for further information.
Interactive Jobs
SLURM uses a single command to launch interactive jobs:
srun: Standard SLURM command supporting graphical applications.
Usage of the command is as follows:
$ srun --pty bash -i
You can configure the resources available to the interactive session by adding command line options. For example to start an interactive session with access to 16 GB of RAM:
$ srun --mem=16G --pty bash -i
To start a session with access to 2 cores, use either:
$ srun --cpus-per-task=2 --pty bash -i #2 cores per task, 1 task and 1 node per job default. Preferred!
$ srun --ntasks-per-node=2 --pty bash -i #2 tasks per node, 1 core per task and 1 node per job default.
Please take care with your chosen options as usage in concert with other options can be multiplicative.
A further explanation of why you may use the tasks options or cpus options can be found here.
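To see why combining options can be multiplicative, the sketch below multiplies nodes, tasks per node and cores per task. The helper name total_cores is hypothetical, and the calculation is a simplification that ignores any site-specific limits.

```shell
#!/bin/bash
# total_cores NODES TASKS_PER_NODE CPUS_PER_TASK   (hypothetical helper)
# Multiplies the three values to show how many cores a request implies.
total_cores() {
    echo $(( $1 * $2 * $3 ))
}

# --ntasks-per-node=2 combined with --cpus-per-task=2 on the default 1 node
total_cores 1 2 2
```

Here the two options together imply 4 cores rather than 2, which is easy to request by accident.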
A table of common interactive job options is given below; any of these can be combined together to request more resources.
Slurm Command | Description
---|---
--time=hh:mm:ss | Specify the total maximum wall clock execution time for the job. The upper limit is 08:00:00. Note: these limits may differ for reservations / projects.
Rejoining an interactive job
If we lose connection to an interactive job, we can use the sattach command, which attaches to a running Slurm job step.
Just keep in mind that sattach doesn’t work for external or batch steps, as they aren’t set up for direct attachment.
Example:
[te1st@login1 [stanage] ~]$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
833300 interacti bash te1st R 31:22 1 node001
[te1st@login1 [stanage] ~]$ sattach 833300.0
[te1st@node001 [stanage] ~]$ echo $SLURM_JOB_ID
833300
[te1st@bessemer-login1 ~]$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
833300 interacti bash te1st R 31:22 1 node001
[te1st@bessemer-login1 ~]$ sattach 833300.0
[te1st@bessemer-node001 ~]$ echo $SLURM_JOB_ID
833300
Here we attached to SLURM job 833300 step 0. For more information, type man sattach.
Batch Jobs
Tip
Batch jobs have larger resource limits than interactive jobs! For guidance on what these limits are and how best to select resources please see our Choosing appropriate compute resources page.
SLURM uses a single command to submit batch jobs:
sbatch: Standard SLURM command with no support for interactivity or graphical applications.
The Slurm docs have a complete list of available sbatch options.
Batch submission scripts are submitted as below:
sbatch submission.sh
Note the job submission number. For example:
Submitted batch job 1226
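If you want to use the job number in later commands, one option is to extract it from that message. The sketch below assumes the message format shown above; the helper name parse_job_id is hypothetical. (sbatch also has a --parsable option that prints the job ID on its own.)

```shell
#!/bin/bash
# parse_job_id "Submitted batch job 1226"  -> prints 1226
# Hypothetical helper; assumes the job ID is the last word of the message.
parse_job_id() {
    echo "$1" | awk '{ print $NF }'
}

# On a cluster this might be used as: jobid=$(parse_job_id "$(sbatch submission.sh)")
jobid=$(parse_job_id "Submitted batch job 1226")
echo "Job ID is ${jobid}"
```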
You can check your output log or error log file as below (by default the log is named slurm-<job ID>.out unless you set --output):
cat slurm-1226.out
There are numerous further options you can request in your batch submission files which are detailed below:
Name your job submission:
#SBATCH --job-name=JOB_NAME
Specify a number of nodes:
#SBATCH --nodes=1
Warning
Note that the Bessemer free queues do not permit the use of more than 1 node per job.
Specify a number of tasks per node:
#SBATCH --ntasks-per-node=4
Specify a number of tasks:
#SBATCH --ntasks=4
Specify a number of cores per task:
#SBATCH --cpus-per-task=4
Request a specific amount of memory per node (for a single-node job this is the memory for the whole job):
#SBATCH --mem=16G
Request a specific amount of memory per CPU core:
#SBATCH --mem-per-cpu=16G
Specify the job output log file name:
#SBATCH --output=output.%j.test.out
Request a specific amount of time:
#SBATCH --time=00:30:00
Request job update email notifications:
#SBATCH --mail-user=username@sheffield.ac.uk
For the full list of the available options please visit the SLURM manual webpage for sbatch here: https://slurm.schedmd.com/sbatch.html
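Of the options above, the two memory flags are the easiest to confuse, because --mem-per-cpu scales with the number of cores requested while --mem does not. The arithmetic can be sketched as follows for a single task on one node; the helper name job_memory_gb is hypothetical.

```shell
#!/bin/bash
# job_memory_gb MEM_PER_CPU_GB CPUS_PER_TASK   (hypothetical helper)
# Memory, in GB, that a single-task --mem-per-cpu request adds up to.
job_memory_gb() {
    echo $(( $1 * $2 ))
}

job_memory_gb 4 4    # --mem-per-cpu=4G with --cpus-per-task=4 matches --mem=16G
job_memory_gb 16 4   # --mem-per-cpu=16G with --cpus-per-task=4 asks for 64G
```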
Here is an example SLURM batch submission script that runs a fictitious program called foo:
#!/bin/bash
# Request 5 gigabytes of real memory (mem)
#SBATCH --mem=5G
# load the module for the program we want to run
module load apps/gcc/foo
# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res
Some things to note:
The first line always needs to be #!/bin/bash (to tell the scheduler that this is a bash batch script).
Comments start with a #.
It is always best to fully specify your job’s resources in your submission script.
All Slurm scheduler options start with #SBATCH.
You should use the SLURM option --ntasks=nn to set the number of “tasks”, for programs using distributed parallelism (MPI).
You should use the SLURM option --ntasks-per-node=nn to set the number of “tasks per node”, for programs using distributed parallelism (MPI). Note that the Bessemer free queues do not permit the use of more than 1 node per job.
You should use the SLURM option --cpus-per-task=nn to set the number of “cores per task”, for programs using shared memory parallelism.
You will often require one or more module commands in your submission file to make programs and libraries available to your scripts. Many applications and libraries are available as modules on Bessemer and Stanage.
Here is a more complex example that requests more resources:
#!/bin/bash
# Request 16 gigabytes of real memory (RAM) 4 cores *4G = 16
#SBATCH --mem=16G
# Request 4 cores
#SBATCH --cpus-per-task=4
# Email notifications to me@somedomain.com
#SBATCH --mail-user=me@somedomain.com
# Email notifications if the job fails
#SBATCH --mail-type=FAIL
# Change the name of the output log file.
#SBATCH --output=output.%j.test.out
# Rename the job's name
#SBATCH --job-name=my_job
# Load the modules required by our program
module load compilers/gcc/5.2
module load apps/gcc/foo
# Set the OMP_NUM_THREADS environment variable to the number of requested cores (4)
# This is needed to ensure efficient core usage.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res
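The script above relies on SLURM_CPUS_PER_TASK, which Slurm only sets when --cpus-per-task is requested. A defensive variant, sketched below, falls back to a single thread when the variable is unset, for example when testing the script outside a job. The helper name omp_threads is hypothetical.

```shell
#!/bin/bash
# omp_threads: prints the core count Slurm allocated per task,
# or 1 if SLURM_CPUS_PER_TASK is unset (hypothetical helper).
omp_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

export OMP_NUM_THREADS="$(omp_threads)"
echo "Using ${OMP_NUM_THREADS} OpenMP thread(s)"
```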
Tip
Bessemer currently supports running preemptable jobs. These are jobs which have been set to run in a reserved queue’s node when those nodes are idle. These reserved queues are typically private (research group-owned or dept-owned) nodes, but these resources will be reclaimed (and the associated jobs preempted) if members of those groups/departments submit jobs that can only start if those resources are repurposed.
For more details on running preemptable jobs on Bessemer please see: Preemptable jobs
Monitoring running Jobs
There are two commands to monitor running and queued jobs:
The squeue command is used to pull up information about jobs in the queue. By default this command will list the job ID, partition, username, job status, number of nodes, and name of nodes for all jobs queued or running within SLURM.
Display all jobs queued on the system:
$ squeue
To limit this command to only display a single user’s jobs the --user flag can be used:
$ squeue --user=$USER
To limit this command to only display your own jobs, the --me flag can be used:
$ squeue --me
Further information without abbreviation can be shown by using the --long flag:
$ squeue --me --long
The squeue command also provides a method to calculate the estimated start time for a job by using the --start flag:
$ squeue --me --start
The accuracy of squeue --start estimates varies due to factors like queue dynamics and resource availability (affected by maintenance, node failures, etc.), making it a guideline rather than a guarantee.
When checking the status of a job you may wish to check for updates at a time interval. This can be achieved by using the --iterate flag and a number of seconds:
$ squeue --me --start --iterate=n_seconds
You can stop this command by pressing Ctrl + C.
Example output:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1234567 interacti bash foo1bar R 17:19:40 1 bessemer-node001
1234568 sheffield job.sh foo1bar R 17:21:40 1 bessemer-node046
1234569 sheffield job.sh foo1bar PD 17:22:40 1 (Resources)
1234570 sheffield job.sh foo1bar PD 16:47:06 1 (Priority)
1234571 gpu job.sh foo1bar R 1-19:46:53 1 bessemer-node026
1234572 gpu job.sh foo1bar R 1-19:46:54 1 bessemer-node026
1234573 gpu job.sh foo1bar R 1-19:46:55 1 bessemer-node026
1234574 gpu job.sh foo1bar R 1-19:46:56 1 bessemer-node026
1234575 gpu job.sh foo1bar PD 9:04 1 (ReqNodeNotAvail, UnavailableNodes:bessemer-node026)
1234576 sheffield job.sh foo1bar PD 2:57:24 1 (QOSMaxJobsPerUserLimit)
States shown above indicate job states including running “R” and Pending “PD” with various reasons for pending states including a node (ReqNodeNotAvail) full of jobs and a user hitting the max limit for numbers of jobs they can run simultaneously in a QOS (QOSMaxJobsPerUserLimit).
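When the list is long, it can help to count jobs by state. The sketch below does this with awk, assuming the default column layout shown above (state in the 5th column) and job names without spaces. The helper name count_states is hypothetical, and the sample rows are copied from the example output so that the snippet runs without a scheduler.

```shell
#!/bin/bash
# count_states: reads squeue-style output on stdin and prints "STATE COUNT" per state.
# Assumes the default layout, where ST is the 5th whitespace-separated column.
count_states() {
    awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample rows from the example output; on a cluster use: squeue --me | count_states
count_states <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1234567 interacti bash foo1bar R 17:19:40 1 bessemer-node001
1234569 sheffield job.sh foo1bar PD 17:22:40 1 (Resources)
1234570 sheffield job.sh foo1bar PD 16:47:06 1 (Priority)
EOF
```

For the three sample rows this prints "PD 2" and "R 1".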
A list of the most relevant job states and reasons can be seen below:
SLURM Job States:
Jobs typically pass through several states in the course of their execution. The typical states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED.
Status | Code | Explanation
---|---|---
COMPLETED | CD | The job has completed successfully.
COMPLETING | CG | The job is finishing but some processes are still active.
CANCELLED | CA | Job was explicitly cancelled by the user or system administrator.
FAILED | F | The job terminated with a non-zero exit code and failed to execute.
PENDING | PD | The job is waiting for resource allocation. It will eventually run.
PREEMPTED | PR | The job was terminated because of preemption by another job.
RUNNING | R | The job currently is allocated to a node and is running.
SUSPENDED | S | A running job has been stopped with its cores released to other jobs.
STOPPED | ST | A running job has been stopped with its cores retained.
OUT_OF_MEMORY | OOM | Job experienced out of memory error.
TIMEOUT | TO | Job exited because it reached its walltime limit.
NODE_FAIL | NF | Job terminated due to failure of one or more allocated nodes.
A full list of job states can be found at: https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES
SLURM Job Reasons:
These codes identify the reason that a job is waiting for execution. A job may be waiting for more than one reason, in which case only one of those reasons is displayed.
Reason Code | Explanation
---|---
Priority | One or more higher priority jobs are in the queue. Your job will eventually run.
Dependency | This job is waiting for a dependent job to complete and will run afterwards.
Resources | The job is waiting for resources to become available and will eventually run.
InvalidAccount | The job’s account is invalid. Cancel the job and rerun with the correct account.
InvalidQOS | The job’s QoS is invalid. Cancel the job and rerun with the correct QoS.
QOSGrpMaxJobsLimit | Maximum number of jobs for your job’s QoS have been met; job will run eventually.
PartitionMaxJobsLimit | Maximum number of jobs for your job’s partition have been met; job will run eventually.
AssociationMaxJobsLimit | Maximum number of jobs for your job’s association have been met; job will run eventually.
JobLaunchFailure | The job could not be launched. This may be due to a file system problem, invalid program name, etc.
NonZeroExitCode | The job terminated with a non-zero exit code.
SystemFailure | Failure of the Slurm system, a file system, the network, etc.
TimeLimit | The job exhausted its time limit.
WaitingForScheduling | No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason.
BadConstraints | The job’s constraints cannot be satisfied.
A full list of job reasons can be found at: https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES
The sstat command can be used to display status information about a user’s currently running jobs such as the CPU usage, task or node information and memory consumption.
The command can be invoked as follows with a specific job ID:
$ sstat --jobs=job-id
And to display specific information you can use the --format flag to choose your output:
$ sstat --jobs=job-id --format=var_1,var_2, ... , var_N
Some of these variables are listed in the table below:
Variable | Description
---|---
AveCPU | Average (system + user) CPU time of all tasks in job.
AveRSS | Average resident set size of all tasks in job.
AveVMSize | Average Virtual Memory size of all tasks in job.
JobID | The ID of the job.
MaxRSS | Maximum resident set size of all tasks in job.
MaxVMSize | Maximum Virtual Memory size of all tasks in job.
NTasks | Total number of tasks in a job or step.
A full list of variables for the --format flag can be found with the --helpformat flag or by visiting the Slurm page on sstat.
Stopping or cancelling Jobs
Jobs can be stopped or cancelled using the scancel command and the job’s ID (replacing job-id with the number):
$ scancel job-id
To cancel multiple jobs you can supply a space separated list:
$ scancel job-id1 job-id2 job-id3
Investigating finished Jobs
Jobs which have already finished can be investigated using the seff script, which takes the job’s ID and gives a summary of important job information:
$ seff job-id
For example, on the Stanage cluster:
$ seff 64626
Job ID: 64626
Cluster: stanage.alces.network
User/Group: a_user/clusterusers
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 1
CPU Utilized: 00:02:37
CPU Efficiency: 35.68% of 00:07:20 core-walltime
Job Wall-clock time: 00:03:40
Memory Utilized: 137.64 MB (estimated maximum)
Memory Efficiency: 1.71% of 7.84 GB (3.92 GB/core)
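The efficiency figures in this output can be reproduced by hand, which helps when deciding how much to request next time. The sketch below reworks the numbers from the example above; the helper names to_seconds and percent are hypothetical, and to_seconds only handles the HH:MM:SS form.

```shell
#!/bin/bash
# to_seconds HH:MM:SS -> seconds (hypothetical helper; no day field handled)
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# percent USED TOTAL -> percentage to two decimal places (hypothetical helper)
percent() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", 100 * a / b }'
}

cpu_used=$(to_seconds 00:02:37)     # "CPU Utilized"
core_wall=$(to_seconds 00:07:20)    # 2 cores x 00:03:40 wall-clock time
percent "$cpu_used" "$core_wall"    # CPU efficiency
percent 137.64 8028.16              # memory efficiency: 7.84 GB = 8028.16 MB
```

This prints 35.68 and 1.71, matching the CPU and memory efficiency lines reported by seff. A memory efficiency this low suggests the job could request far less memory.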
Finished jobs can be investigated in even more depth using the sacct command, which displays status information about a user’s historical jobs.
The command can be used as follows with the job’s ID:
$ sacct --jobs=job-id
Or to view information about all of a specific user’s jobs:
$ sacct --user=$USER
By default the sacct command will only bring up information about the user’s jobs from the current day. By using the --starttime flag the command will look further back to the given date, e.g.:
$ sacct --user=$USER --starttime=YYYY-MM-DD
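Rather than typing the date by hand, it can be computed. The sketch below assumes GNU date (as found on typical Linux systems), whose -d option accepts relative dates; the helper name days_ago is hypothetical.

```shell
#!/bin/bash
# days_ago N: prints the date N days ago as YYYY-MM-DD.
# Assumes GNU date; BSD/macOS date uses different options.
days_ago() {
    date -d "$1 days ago" +%Y-%m-%d
}

# Print the sacct command that would list the last week of jobs
echo "sacct --user=\$USER --starttime=$(days_ago 7)"
```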
Like the sstat command, the --format flag can be used to choose the command output:
$ sacct --user=$USER --format=var_1,var_2, ... ,var_N
Variable | Description
---|---
Account | The account the job ran under.
AveCPU | Average (system + user) CPU time of all tasks in job.
AveRSS | Average resident set size of all tasks in job.
AveVMSize | Average Virtual Memory size of all tasks in job.
CPUTime | Formatted (Elapsed time * CPU) count used by a job or step.
Elapsed | The job’s elapsed time formatted as DD-HH:MM:SS.
ExitCode | The exit code returned by the job script or salloc.
JobID | The ID of the job.
JobName | The name of the job.
MaxRSS | Maximum resident set size of all tasks in job.
MaxVMSize | Maximum Virtual Memory size of all tasks in job.
MaxDiskRead | Maximum number of bytes read by all tasks in the job.
MaxDiskWrite | Maximum number of bytes written by all tasks in the job.
ReqCPUS | Requested number of CPUs.
ReqMem | Requested amount of memory.
ReqNodes | Requested number of nodes.
NCPUS | The number of CPUs used in a job.
NNodes | The number of nodes used in a job.
User | The username of the person who ran the job.
A full list of variables for the --format flag can be found with the --helpformat flag or by visiting the Slurm page on sacct.
Debugging failed Jobs
If one of your jobs has failed and you need to debug why this has occurred, you should consult the job records held by the scheduler using the sacct command referenced above, as well as the generated job logs.
These output and error log files will be generated in the job working directory, named either with the output log file name you specified or, by default, in the form slurm-$SLURM_JOB_ID.out, where $SLURM_JOB_ID is the scheduler provided job ID.
Looking at these logs should indicate the source of any issues.
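To work out which log file belongs to a given job, it may help to expand the file name pattern yourself. The sketch below substitutes the job ID for %j, which is the only pattern it handles; the helper name expand_output_name is hypothetical.

```shell
#!/bin/bash
# expand_output_name PATTERN JOB_ID   (hypothetical helper)
# Replaces every %j in the pattern with the job ID; other patterns are not handled.
expand_output_name() {
    echo "$1" | sed "s/%j/$2/g"
}

expand_output_name "slurm-%j.out" 1226          # default log name
expand_output_name "output.%j.test.out" 1226    # name set with --output
```

For job 1226 this prints slurm-1226.out and output.1226.test.out.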
sacct will also give a job’s state and ExitCode field with each job.
The ExitCode is the return value of the exiting program/script. It can be a user defined value if the job is finished with a call to ‘exit(number)’. Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a Reason of “NonZeroExitCode”.
The job logs may also include a “derived exit code” field. This is set to the value of the highest exit code returned by all of the job’s steps (srun invocations).
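sacct reports the ExitCode field as two numbers separated by a colon: the exit code, then the signal that terminated the process if there was one. A minimal sketch for interpreting such a value is shown below; the helper name describe_exit and its messages are hypothetical.

```shell
#!/bin/bash
# describe_exit "EXIT:SIGNAL"   (hypothetical helper)
# Splits an sacct-style ExitCode value and prints a short interpretation.
describe_exit() {
    code=${1%%:*}      # part before the colon
    signal=${1##*:}    # part after the colon
    if [ "$signal" -ne 0 ]; then
        echo "terminated by signal $signal"
    elif [ "$code" -ne 0 ]; then
        echo "failed with exit code $code"
    else
        echo "completed successfully"
    fi
}

describe_exit 0:0
describe_exit 1:0
describe_exit 0:9
```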
Cluster job resource limits
While the Sheffield clusters have very large amounts of resources to use for your jobs there are limits applied in order for the schedulers to function. The limits below apply to the default free queues. Other queues may have different settings.
Warning
You must ensure that your jobs do not attempt to exceed these limits as the schedulers are not forgiving and will summarily kill any job which exceeds the requested limits without warning.
CPU Limits
Warning
Please note that for either cluster the larger the number of cores you request in an interactive job the more likely the request is to fail as the requested resource is not immediately available.
Scheduler Type | No. CPU Cores Available | No. CPU Cores Available | Submission Argument
---|---|---|---
SLURM (Stanage) | 1 / 1 / ~11264 (MPI), 64 (SMP) | 1 / 1 / ~11264 (MPI), 64 (SMP) |
SLURM (Bessemer) | 1 / 1 / 40 | 1 / 1 / 40 |
Time Limits
Scheduler Type | Interactive Job | Batch Job | Submission Argument
---|---|---|---
SLURM (Stanage) | 8 / 8 hrs | 8 / 96 hrs |
SLURM (Bessemer) | 8 / 8 hrs | 8 / 168 hrs |
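Before submitting, a requested time can be checked against the limits in the table above. The sketch below handles only the HH:MM:SS and D-HH:MM:SS forms of a time request (Slurm accepts others too), and the helper names time_to_seconds and within_limit are hypothetical.

```shell
#!/bin/bash
# time_to_seconds TIME: converts HH:MM:SS or D-HH:MM:SS to seconds (hypothetical helper).
time_to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1 * 86400 + $2 * 3600 + $3 * 60 + $4
        else         print $1 * 3600 + $2 * 60 + $3
    }'
}

# within_limit TIME LIMIT_HOURS: succeeds if TIME does not exceed the limit.
within_limit() {
    [ "$(time_to_seconds "$1")" -le $(( $2 * 3600 )) ]
}

# 4 days against the 96 hour Stanage batch limit and the 8 hour interactive limit
if within_limit 4-00:00:00 96; then echo "4 days fits the batch limit"; fi
if ! within_limit 4-00:00:00 8; then echo "4 days exceeds the interactive limit"; fi
```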
Memory Limits
Memory Limit | SLURM (Stanage) Cross node MPI execution enabled | SLURM (Bessemer) Single node execution only
---|---|---
Default Job Memory Request | 4016 MB | 2 GB
Standard Nodes | 251 GB | 192 GB
Large RAM Nodes | 1007 GB | N/A
Very Large RAM Nodes | 2014 GB | N/A
Interactive Job: Maximum Possible Request | 251 GB | 192 GB
Batch Job (SMP): Maximum Regular Node Request | 251 GB | 192 GB
Batch Job (SMP): Maximum Possible Request | 2014 GB | 192 GB
Batch Job (MPI): Maximum Possible Request | ~74404 GB | 192 GB
Submission Argument on a per node (job) basis | --mem=<nn> | --mem=<nn>
Advanced / Automated job submission and management
Further information on advanced or automated job submission and management can be found on our dedicated pages: Advanced Job Submission and Control and Advanced Job Profiling and Analysis.
Reference information and further resources
Quick reference information for the SLURM scheduler used on both the Stanage and Bessemer clusters can be found in the Scheduler Reference Info section.
An SGE to SLURM conversion guide is provided in the Quick Reference section.