Job Submission and Control

Introduction

As mentioned in the What is HPC? section, HPC clusters like ShARC, Bessemer and Stanage use a program called a scheduler to control and submit work to appropriate nodes.

All user work is dispatched to a cluster using a tool called a job scheduler. A job scheduler is a tool used to manage, submit and fairly queue users’ jobs in the shared environment of a HPC cluster. A cluster will normally use a single scheduler and allow a user to request either an immediate interactive job, or a queued batch job.

Here at the University of Sheffield, we use 2 different schedulers, the SGE scheduler on ShARC and the more modern SLURM scheduler on Bessemer and Stanage. Both have the same purpose, use similar commands and work on the same three basic principles:

  • they allocate exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work,

  • they provide a framework for starting, executing, and monitoring work on the set of allocated nodes,

  • they arbitrate contention for resources by managing a queue of pending work.


Key Concepts

Tip

If you are not familiar with basic computer architecture we highly recommend reading our General Computer Architecture Quick Start page before continuing.

When engaging with our documentation, several concepts relating to schedulers and jobs must be well understood; these are explained below:

Types of Job

There are two types of job on any scheduler, interactive and batch:

Interactive jobs are requested and run immediately, providing the user with a bash shell (or another shell of their choosing) in which they can then run their software or scripts.

Typically only very few nodes in an HPC cluster are dedicated solely to interactive jobs, and interactive jobs require the resources to be available instantaneously when the request is made, or the request will fail. This means that interactive requests cannot always be fulfilled, particularly when requesting multiple cores.

Batch jobs are the other kind of job where a user prepares a batch submission script which both requests the resources for the job from the scheduler and contains the execution commands for a given program to run. On job submission, the scheduler will add it to the chosen queue and run your job when resources become available.

Any task that can be executed without any user intervention while it is running can be submitted as a batch job. This excludes jobs that require a Graphical User Interface (GUI); however, many common GUI applications such as ANSYS or MATLAB can also be used without their GUIs.

If you wish to use a cluster for interactive work and/or running applications like MATLAB or ANSYS using GUIs, you will need to request an interactive job from the scheduler.

If you wish to use a cluster to dispatch a very large ANSYS model, you will need to request a batch job from the scheduler and prepare an appropriate batch script.

Note

Long running jobs should use the batch submission system rather than requesting an interactive session for a very long time. Doing this will lead to better cluster performance for all users.

Queues and partitions

Queues, or partitions in SLURM, hold jobs submitted to a scheduler until they can be run. They can have an assortment of constraints such as job size limits, job time limits and permitted users, and some nodes will be configured to accept jobs only from certain queues, e.g. department-specific nodes.

All jobs are dispatchable

When a user requests that a job (either a batch job or an interactive session) is run on the cluster, the scheduler will run jobs from the queue based on a set of rules, priorities and availabilities.

How and where a job can run are set when the job is requested, based on the resources requested as well as the chosen queue (assuming the user has permission to use that queue).

This means that not all interactive jobs are possible as the resources may not be available. It also means that the amount of time it takes for any batch job to run is dependent on how large the job resource request is, which queue it is in, what resources are available in that queue and how much previous resource usage the user has. The larger a resource request is, the longer it will take to wait for those resources to become available and the longer it will take for subsequent jobs to queue as a result of the fair scheduling algorithm.

Fair scheduling

Job schedulers are typically configured to use a fair-share / wait time system. In short, the scheduler assesses your previous CPU time and memory time (consumption) to give a requested job a priority. Subsequently it uses how long your job has had to wait in order to bump up that priority. Once your job is the highest priority, the job will then run when the requested resources become available on the system. Your running total for CPU time / memory time usage will decay over time but in general the more resources you request and for longer, the lower your initial job priority gets and the longer you have to wait behind other people’s jobs.

If you see one of your jobs finish and another immediately begin, this is not an intentional chaining setting on the scheduler's part. It is most likely a reflection of your subsequent jobs waiting for resources to become available; it just so happens that your running job finishing frees up the resources for the next.

As a natural consequence of backfilling into any trapped resources, you may see jobs with small time, memory and core requests, and a lower priority, running before your own job with a higher priority. This is because they are small enough to utilize the trapped resources before the job trapping those resources finishes. This is not unfair; it would be inefficient and irresponsible for us to intentionally block a job from running simply because its priority is lower than that of a larger job that won't fit in the trapped resources.


Job Submission / Control on ShARC

Interactive Jobs

There are three commands for requesting an interactive shell using SGE:

  • qrsh - No support for graphical applications. Standard SGE command.

  • qsh - Supports graphical applications. Standard SGE command.

  • qrshx - Supports graphical applications. Superior to qsh and is unique to Sheffield’s clusters.

Usage of these commands is as follows:

$ qrshx

You can configure the resources available to the interactive session by adding command line options. For example to start an interactive session with access to 16 GB of RAM:

$ qrshx -l rmem=16G

To start a session with access to 2 cores in the SMP parallel environment:

$ qrshx -pe smp 2

A table of common interactive job options is given below; any of these can be combined together to request more resources.

SGE Command           Description
-l h_rt=hh:mm:ss      Specify the total maximum wall clock execution time for the job.
                      The upper limit is 08:00:00. Note: these limits may differ for
                      reservations / projects.
-l rmem=xxG           Specify the maximum amount (xx) of real memory to be requested
                      per CPU core. If the real memory usage of your job exceeds this
                      value multiplied by the number of cores / nodes you requested,
                      your job will be killed.
-pe <env> <nn>        Specify a parallel environment (env) and a number of processor
                      cores (nn), e.g. -pe smp 4 for SMP jobs or -pe mpi 4 for MPI jobs.

Note that ShARC has multiple parallel environments; the current list can be found on the ShARC Parallel Environments page.
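
For example, the options above can be combined to request a two-core SMP interactive session with 4 GB of RAM per core for two hours:

$ qrshx -l h_rt=02:00:00 -l rmem=4G -pe smp 2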

Batch Jobs

Tip

Batch jobs have larger resource limits than interactive jobs! For guidance on what these limits are and how best to select resources please see our Choosing appropriate compute resources page.

There is a single command to submit jobs via SGE:

  • qsub - Standard SGE command with no support for interactivity or graphical applications.

Batch submission scripts are submitted to the scheduler as follows:

qsub submission.sh

Note the job submission number. For example:

Your job 12345678 ("submission.sh") has been submitted

You can check your output log or error log file when the job is finished.

cat submission.sh.o12345678
cat submission.sh.e12345678

There are numerous further options you can request in your batch submission files which are detailed below:

Pass through current shell environment (sometimes important):

#$ -V

Name your job submission:

#$ -N test_job

Specify a parallel environment for SMP jobs where N is a number of cores:

#$ -pe smp N

Specify a parallel environment for MPI jobs where N is a number of cores:

#$ -pe mpi N

Request a specific amount of memory where N is a number of gigabytes per core:

#$ -l rmem=NG

Request a specific amount of time in hours, minutes and seconds:

#$ -l h_rt=hh:mm:ss

Request email notifications on start, end and abort:

#$ -M me@somedomain.com
#$ -m abe

For the full list of the available options please visit the SGE manual webpage for qsub here: http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html

Here is an example SGE batch submission script that runs a fictitious program called foo:

#!/bin/bash
# Request 5 gigabytes of real memory (mem)
#$ -l rmem=5G

# load the module for the program we want to run
module load apps/gcc/foo

# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res

Some things to note:

  • The first line always needs to be #!/bin/bash (to tell the scheduler that this is a bash batch script).

  • Comments start with a #.

  • It is always best to fully specify a job's resources in your submission script.

  • All SGE Scheduler options, such as the amount of memory requested, start with #$

  • You will often require one or more module commands in your submission file. These make programs and libraries available to your scripts. Many applications and libraries are available as modules on ShARC.

Here is a more complex example that requests more resources:

#!/bin/bash
# Request 16 gigabytes of real memory (RAM) 4 cores *4G = 16
#$ -l rmem=4G
# Request 4 cores in an OpenMP environment
#$ -pe openmp 4
# Email notifications to me@somedomain.com
#$ -M me@somedomain.com
# Email notifications if the job aborts
#$ -m a
# Name the job
#$ -N my_job
# Request 24 hours of time
#$ -l h_rt=24:00:00

# Load the modules required by our program
module load compilers/gcc/5.2
module load apps/gcc/foo

# Set the OMP_NUM_THREADS environment variable to the number of requested
# cores ($NSLOTS). This is needed to ensure efficient core usage.
export OMP_NUM_THREADS=$NSLOTS

# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res
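
For distributed (MPI) programs the script differs mainly in the parallel environment requested and in how the program is launched. Below is a minimal sketch following the same fictitious foo program; the module name is illustrative and whether mpirun needs an explicit core count depends on the MPI build you load (see the ShARC Parallel Environments page):

#!/bin/bash
# Request 8 cores in the MPI parallel environment
#$ -pe mpi 8
# Request 2 gigabytes of real memory per core (8 cores * 2G = 16G in total)
#$ -l rmem=2G
# Request 1 hour of run time
#$ -l h_rt=01:00:00
# Name the job
#$ -N my_mpi_job

# Load the module for the program we want to run (illustrative)
module load apps/gcc/foo

# With a tightly integrated MPI build, mpirun picks up the allocated
# core count ($NSLOTS) from the scheduler
mpirun foo foo.dat foo.res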

Monitoring running Jobs

Running and queued jobs can be monitored using a single command, qstat:

Display your own jobs queued on the system

$ qstat

Show a specific running or queueing job’s details:

qstat -j jobid

Display all jobs queued on the system

$ qstat -u "*"

Display all jobs queued by the username foo1bar

$ qstat -u foo1bar

Display all jobs in the openmp parallel environment

$ qstat -pe openmp

Display all jobs in the queue named foobar

$ qstat -q foobar.q

Example output:

$ qstat -u "*"
job-ID  prior   name       user          state submit/start at     queue                              slots   ja-task-ID
------------------------------------------------------------------------------------------------------------------------
1234567 0.00000 INTERACTIV foo1bar       dr    12/24/2021 07:13:20 interactive.q@sharc-node004.sh     1
1234568 0.00000 job.sh     foo1bar       r     01/22/2022 05:37:31 all.q@sharc-node019.shef.ac.uk     16
1234569 0.00000 job.sh     foo1bar       r     01/23/2022 07:41:18 all.q@sharc-node084.shef.ac.uk     16
1234570 0.00000 job.sh     foo1bar       Rr    01/23/2022 08:03:22 all.q@sharc-node068.shef.ac.uk     16
1234571 0.00076 job.sh     foo1bar       qw    01/23/2022 07:06:18                                    1
1234572 0.00067 job.sh     foo1bar       hqw   01/23/2022 07:06:18                                    1
1234573 0.00000 job.sh     foo1bar       Eqw   01/21/2022 13:50:55                                    1
1234574 0.00000 job.sh     foo1bar       t     01/24/2022 13:04:25 all.q@sharc-node159.shef.ac.uk     1        22964

SGE Job States:

State      Explanation                          SGE State Letter Code/s
Pending    pending, queued                      qw
Pending    pending, user and/or system hold     hqw
Running    running                              r
Error      all pending states with error        Eqw, Ehqw, EhRqw

Key: q: queueing, r: running, w: waiting, h: on hold, E: error, R: re-run, s: job suspended, S: queue suspended, t: transferring, d: deletion.

Note

A full list of SGE and DRMAA states can be found here

Stopping or cancelling Jobs

Jobs can be stopped or cancelled using the qdel command:

A job can be cancelled using the qdel command as shown below, swapping out 123456 for your own job ID number:

$ qdel 123456
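
qdel also accepts a username, which can be used to remove all of your own queued and running jobs at once:

$ qdel -u $USER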

Investigating finished Jobs

Jobs which have already finished can be investigated using the qacct command:

The qacct command can be used to display status information about a user’s historical jobs.

Running the qacct command alone will provide a summary of used resources from the current month for the user running the command.

The command can be used as follows with the job’s ID to get job specific info:

$ qacct -j job-id

Or to view information about all of a specific user’s jobs:

$ qacct -j -u $USER

By default the qacct command will only bring up summary info about the user’s jobs from the current accounting file (which rotates monthly). Further detail about the output metrics and how to query jobs older than a month can be found on the dedicated qacct page.

Debugging failed Jobs

Note

One common form of job failure on ShARC is caused by Windows style line endings. If you see an error reported by qacct of the form:

failed searching requested shell because:

Or by qstat of the form:

failed: No such file or directory

You must replace these line endings as detailed in the FAQ.
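
For illustration, one common way to fix this, assuming the dos2unix utility is available on the login node (a plain sed substitution achieves the same; my_job.sh stands in for your own submission script):

# Convert Windows (CRLF) line endings to Unix (LF) in place
dos2unix my_job.sh

# Equivalent fix if dos2unix is not installed
sed -i 's/\r$//' my_job.sh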

If one of your jobs has failed and you need to debug why this has occurred, you should consult the job records held by the scheduler using the qacct command referenced above, as well as the generated job logs.

These output and error log files will be generated in the job working directory with the structure $JOBNAME.o$JOBID and $JOBNAME.e$JOBID where $JOBNAME is the user chosen name of the job and $JOBID is the scheduler provided job id. Looking at these logs should indicate the source of any issues.

The qacct info will contain two important metrics, the exit code and failure code.

The exit code is the return value of the exiting program/script. It can be a user defined value if the job is finished with a call to ‘exit(number)’. For abnormally terminated jobs it is the signal number + 128.

As an example, an exit code of 137 gives 137 - 128 = 9, i.e. signal 9 (SIGKILL): the job was sent the KILL signal and was killed, most likely by the scheduler.
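
You can do this arithmetic directly in the shell if you wish:

$ echo $((137 - 128))    # exit code 137 corresponds to signal 9 (SIGKILL)
9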

The failure code indicates why a job was abnormally terminated (by the scheduler). An incomplete table of common failure codes is shown below:

code    meaning
100     failure after job
37      qmaster enforced h_rt limit (job ran out of time)
30      rescheduling on application error
28      no current working dir
27      no shell
26      failure opening output
21      failure in recognizing job
19      no exit status
8       failure in prolog
1       failure before job (execd)


Job Submission / Control on Bessemer

Interactive Jobs

SLURM uses a single command to launch interactive jobs:

  • srun Standard SLURM command supporting graphical applications.

Usage of the command is as follows:

$ srun --pty bash -i

You can configure the resources available to the interactive session by adding command line options. For example to start an interactive session with access to 16 GB of RAM:

$ srun --mem=16G --pty bash -i

To start a session with access to 2 cores, use either:

$ srun --cpus-per-task=2 --pty bash -i #2 cores per task, 1 task and 1 node per job default. Preferred!
$ srun --ntasks-per-node=2 --pty bash -i #2 tasks per node, 1 core per task and 1 node per job default.

Please take care with your chosen options as usage in concert with other options can be multiplicative.

A further explanation of why you may use the tasks options or cpus options can be found here.

A table of common interactive job options is given below; any of these can be combined together to request more resources.

Slurm Command                    Description
-t min or -t days-hh:mm:ss       Specify the total maximum wall clock execution time for
                                 the job. The upper limit is 08:00:00. Note: these limits
                                 may differ for reservations / projects.
--mem=xxG                        Specify the maximum amount (xx) of real memory to be
                                 requested per node. If the real memory usage of your job
                                 exceeds this value multiplied by the number of cores /
                                 nodes you requested, your job will be killed.
-c nn or --cpus-per-task=nn      Cores per task; take care with your chosen number of tasks.
--ntasks-per-node=nn             Tasks per node; take care with your chosen number of cores
                                 per node. The default is one task per node, but other
                                 options (e.g. --cpus-per-task) adjust the default of one
                                 core per task.
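
For example, the options above can be combined to request a 2-core interactive session with 8 GB of RAM for 2 hours:

$ srun --time=02:00:00 --mem=8G --cpus-per-task=2 --pty bash -i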

Batch Jobs

Tip

Batch jobs have larger resource limits than interactive jobs! For guidance on what these limits are and how best to select resources please see our Choosing appropriate compute resources page.

SLURM uses a single command to submit batch jobs:

  • sbatch Standard SLURM command with no support for interactivity or graphical applications.

The Slurm docs have a complete list of available sbatch options.

Batch submission scripts are submitted to the scheduler as follows:

sbatch submission.sh

Note the job submission number. For example:

Submitted batch job 1226

You can check your output log or error log file as below:

cat JOB_NAME-1226.out

There are numerous further options you can request in your batch submission files which are detailed below:

Name your job submission:

#SBATCH --comment=JOB_NAME

Specify a number of nodes:

#SBATCH --nodes=1

Warning

Note that the Bessemer free queues do not permit the use of more than 1 node per job.

Specify a number of tasks per node:

#SBATCH --ntasks-per-node=4

Specify a number of tasks:

#SBATCH --ntasks=4

Specify a number of cores per task:

#SBATCH --cpus-per-task=4

Request a specific amount of memory per job:

#SBATCH --mem=16G

Specify the job output log file name:

#SBATCH --output=output.%j.test.out

Request a specific amount of time:

#SBATCH --time=00:30:00

Request job update email notifications:

#SBATCH --mail-user=username@sheffield.ac.uk

For the full list of the available options please visit the SLURM manual webpage for sbatch here: https://slurm.schedmd.com/sbatch.html

Here is an example SLURM batch submission script that runs a fictitious program called foo:

#!/bin/bash
# Request 5 gigabytes of real memory (mem)
#SBATCH --mem=5G

# load the module for the program we want to run
module load apps/gcc/foo

# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res

Some things to note:

  • The first line always needs to be #!/bin/bash (to tell the scheduler that this is a bash batch script).

  • Comments start with a #.

  • It is always best to fully specify a job's resources in your submission script.

  • All Slurm Scheduler options start with #SBATCH

  • You should use the SLURM option --ntasks=nn (number of "tasks") for programs using distributed parallelism (MPI).

  • You should use the SLURM option --ntasks-per-node=nn (number of "tasks per node") for programs using distributed parallelism (MPI). Note that the Bessemer free queues do not permit the use of more than 1 node per job.

  • You should use the SLURM option --cpus-per-task=nn (number of "cores per task") for programs using shared memory parallelism (SMP or OpenMP).

  • You will often require one or more module commands in your submission file to make programs and libraries available to your scripts. Many applications and libraries are available as modules on Bessemer.

Here is a more complex SMP example that requests more resources:

#!/bin/bash
# Request 16 gigabytes of real memory (RAM) 4 cores *4G = 16
#SBATCH --mem=16G
# Request 4 cores
#SBATCH --cpus-per-task=4
# Email notifications to me@somedomain.com
#SBATCH --mail-user=me@somedomain.com
# Email notifications if the job fails
#SBATCH --mail-type=FAIL
# Change the name of the output log file.
#SBATCH --output=output.%j.test.out
# Rename the job's name
#SBATCH --comment=my_smp_job


# Load the modules required by our program
module load compilers/gcc/5.2
module load apps/gcc/foo

# Set the OMP_NUM_THREADS environment variable to the number of requested
# cores ($SLURM_CPUS_PER_TASK). This is needed to ensure efficient core usage.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res
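
For comparison, below is a minimal sketch of a distributed (MPI) submission script using --ntasks. The foo program is fictitious as above and the module name is illustrative; check module avail for the MPI builds installed on Bessemer:

#!/bin/bash
# Request 8 MPI tasks on a single node
# (the Bessemer free queues do not permit more than 1 node per job)
#SBATCH --nodes=1
#SBATCH --ntasks=8
# Request 16 gigabytes of real memory for the job
#SBATCH --mem=16G
# Request 1 hour of run time
#SBATCH --time=01:00:00
# Name the job
#SBATCH --comment=my_mpi_job

# Load the modules required by our program (illustrative)
module load apps/gcc/foo

# srun launches one copy of the program per requested task
srun foo foo.dat foo.res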

Preemptable Jobs

Under certain conditions, Slurm on Bessemer allows jobs running in higher-priority Partitions (sets of nodes) to preempt jobs in lower-priority Partitions. When a higher priority job preempts a lower priority job, the lower priority job is stopped (and by default cancelled) and the higher priority job takes its place.

Specifically, Slurm allows users to run interactive sessions and batch jobs using idle resources in private (research group-owned or dept-owned) nodes, but these resources will be reclaimed (and the associated jobs preempted) if members of those groups/departments submit jobs that can only start if those resources are repurposed.

Note

Support for preemptable jobs has been enabled on a trial basis and will be disabled if it impacts on priority access by groups / departments to private nodes they have purchased.

An example of the use of preemptable jobs:

  1. Researcher A wants to run a job using 2 GPUs. All ‘public’ GPUs are being used by jobs, but some GPUs in a private node belonging to research group X are idle.

  2. Researcher A decides that they want to use those idle GPUs but they aren’t a member of research group X; however, they are happy to take the risk of their job being preempted by a member of research group X.

  3. Researcher A submits a job and makes it preemptable (by submitting it to the preempt Partition using --partition=preempt).

  4. The job starts running on a node which is a member of the preempt and research-group-X Partitions.

  5. Researcher B is a member of research group X and submits a job to the research-group-X Partition.

  6. This job can only start if the resources being used by the first job are reclaimed.

  7. As a result, Slurm preempts the first job with this second job, and the first job is cancelled.

  8. The second job runs to completion.

Tips for using preemptable jobs:

  • Ensure that you’re able to reliably re-submit your preemptable job if it is preempted before completion. A common way of doing this is to write out state/progress information periodically whilst the job is running.

  • Select a sensible frequency for writing out state/progress information or you may cause poor performance due to storage write speed limits.

Monitoring running Jobs

There are two commands to monitor running and queued jobs:

The squeue command is used to pull up information about jobs in the queue. By default it will list the job ID, partition, username, job status, number of nodes and node names for all jobs queued or running within SLURM.

Display all jobs queued on the system:

$ squeue

To limit this command to only display a single user’s jobs the --user flag can be used:

$ squeue --user=$USER

Further information without abbreviation can be shown by using the --long flag:

$ squeue --user=$USER --long

The squeue command also provides a method to calculate the estimated start time for a job by using the --start flag:

$ squeue --user=$USER --start

When checking the status of a job you may wish to check for updates at a time interval. This can be achieved by using the --iterate flag and a number of seconds:

$ squeue --user=$USER --start --iterate=n_seconds

You can stop this command by pressing Ctrl + C.

Example output:

$ squeue
        JOBID   PARTITION   NAME      USER  ST       TIME  NODES NODELIST(REASON)
        1234567 interacti   bash   foo1bar   R   17:19:40      1 bessemer-node001
        1234568 sheffield job.sh   foo1bar   R   17:21:40      1 bessemer-node046
        1234569 sheffield job.sh   foo1bar  PD   17:22:40      1 (Resources)
        1234570 sheffield job.sh   foo1bar  PD   16:47:06      1 (Priority)
        1234571       gpu job.sh   foo1bar   R 1-19:46:53      1 bessemer-node026
        1234572       gpu job.sh   foo1bar   R 1-19:46:54      1 bessemer-node026
        1234573       gpu job.sh   foo1bar   R 1-19:46:55      1 bessemer-node026
        1234574       gpu job.sh   foo1bar   R 1-19:46:56      1 bessemer-node026
        1234575       gpu job.sh   foo1bar  PD       9:04      1 (ReqNodeNotAvail, UnavailableNodes:bessemer-node026)
        1234576 sheffield job.sh   foo1bar  PD    2:57:24      1 (QOSMaxJobsPerUserLimit)

The states shown above include Running ("R") and Pending ("PD"), with various reasons for pending states, e.g. a requested node being full of jobs (ReqNodeNotAvail) or a user hitting the maximum number of jobs they can run simultaneously in a QOS (QOSMaxJobsPerUserLimit).

A list of the most relevant job states and reasons can be seen below:

SLURM Job States:

Jobs typically pass through several states in the course of their execution. The typical states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED.

Status          Code   Explanation
COMPLETED       CD     The job has completed successfully.
COMPLETING      CG     The job is finishing but some processes are still active.
CANCELLED       CA     The job was explicitly cancelled by the user or system administrator.
FAILED          F      The job terminated with a non-zero exit code and failed to execute.
PENDING         PD     The job is waiting for resource allocation. It will eventually run.
PREEMPTED       PR     The job was terminated because of preemption by another job.
RUNNING         R      The job is currently allocated to a node and is running.
SUSPENDED       S      A running job has been stopped with its cores released to other jobs.
STOPPED         ST     A running job has been stopped with its cores retained.
OUT_OF_MEMORY   OOM    The job experienced an out-of-memory error.
TIMEOUT         TO     The job exited because it reached its walltime limit.
NODE_FAIL       NF     The job terminated due to failure of one or more allocated nodes.

A full list of job states can be found at: https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES

SLURM Job Reasons:

These codes identify the reason that a job is waiting for execution. A job may be waiting for more than one reason, in which case only one of those reasons is displayed.

Reason Code              Explanation
Priority                 One or more higher priority jobs are queued ahead of yours. Your
                         job will eventually run.
Dependency               This job is waiting for a dependent job to complete and will run
                         afterwards.
Resources                The job is waiting for resources to become available and will
                         eventually run.
InvalidAccount           The job's account is invalid. Cancel the job and rerun with the
                         correct account.
InvalidQoS               The job's QoS is invalid. Cancel the job and rerun with the
                         correct QoS.
QOSGrpMaxJobsLimit       The maximum number of jobs for your job's QoS has been met; the
                         job will run eventually.
PartitionMaxJobsLimit    The maximum number of jobs for your job's partition has been met;
                         the job will run eventually.
AssociationMaxJobsLimit  The maximum number of jobs for your job's association has been
                         met; the job will run eventually.
JobLaunchFailure         The job could not be launched. This may be due to a file system
                         problem, an invalid program name, etc.
NonZeroExitCode          The job terminated with a non-zero exit code.
SystemFailure            Failure of the Slurm system, a file system, the network, etc.
TimeLimit                The job exhausted its time limit.
WaitingForScheduling     No reason has been set for this job yet. Waiting for the scheduler
                         to determine the appropriate reason.
BadConstraints           The job's constraints cannot be satisfied.

A full list of job reasons can be found at: https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES

The sstat command can be used to display status information about a user’s currently running jobs such as the CPU usage, task or node information and memory consumption.

The command can be invoked as follows with a specific job ID:

$ sstat --jobs=job-id

And to display specific information you can use the --format flag to choose your output:

$ sstat --jobs=job-id --format=var_1,var_2, ... , var_N

Some of these variables are listed in the table below:

sstat format variable names

Variable    Description
AveCPU      Average (system + user) CPU time of all tasks in job.
AveRSS      Average resident set size of all tasks in job.
AveVMSize   Average Virtual Memory size of all tasks in job.
JobID       The id of the Job.
MaxRSS      Maximum resident set size of all tasks in job.
MaxVMSize   Maximum Virtual Memory size of all tasks in job.
NTasks      Total number of tasks in a job or step.

A full list of variables for the --format flag can be found with the --helpformat flag or by visiting the slurm page on sstat.
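
For example, to check the CPU time and peak memory use of a running job (replace job-id with your own job's ID; depending on how the job was launched you may need to query the batch step explicitly, e.g. --jobs=job-id.batch):

$ sstat --jobs=job-id --format=JobID,AveCPU,MaxRSS,MaxVMSize,NTasks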

Stopping or cancelling Jobs

Jobs can be stopped or cancelled using the scancel command:

You can stop jobs with the scancel command and the job’s ID (replacing job-id with the number):

$ scancel job-id

To cancel multiple jobs you can supply a comma separated list:

$ scancel job-id1, job-id2, job-id3
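
scancel also accepts a --user flag, which can be used to cancel all of your own jobs at once:

$ scancel --user=$USER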

Investigating finished Jobs

Jobs which have already finished can be investigated using the sacct command:

The sacct command can be used to display status information about a user’s historical jobs.

The command can be used as follows with the job’s ID:

$ sacct --jobs=job-id

Or to view information about all of a specific user’s jobs:

$ sacct --user=$USER

By default the sacct command will only bring up information about the user's jobs from the current day. By using the --starttime flag the command will look further back to the given date, e.g.:

$ sacct --user=$USER --starttime=YYYY-MM-DD

Like the sstat command, the --format flag can be used to choose the command output:

$ sacct --user=$USER --format=var_1,var_2, ... ,var_N

sacct format variable names

Variable      Description
Account       The account the job ran under.
AveCPU        Average (system + user) CPU time of all tasks in job.
AveRSS        Average resident set size of all tasks in job.
AveVMSize     Average Virtual Memory size of all tasks in job.
CPUTime       Formatted (Elapsed time * CPU count) used by a job or step.
Elapsed       The job's elapsed time, formatted as DD-HH:MM:SS.
ExitCode      The exit code returned by the job script or salloc.
JobID         The id of the Job.
JobName       The name of the Job.
MaxRSS        Maximum resident set size of all tasks in job.
MaxVMSize     Maximum Virtual Memory size of all tasks in job.
MaxDiskRead   Maximum number of bytes read by all tasks in the job.
MaxDiskWrite  Maximum number of bytes written by all tasks in the job.
ReqCPUS       Requested number of CPUs.
ReqMem        Requested amount of memory.
ReqNodes      Requested number of nodes.
NCPUS         The number of CPUs used in a job.
NNodes        The number of nodes used in a job.
User          The username of the person who ran the job.

A full list of variables for the --format flag can be found with the --helpformat flag or by visiting the slurm page on sacct.
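
For example, to summarise your recent jobs' resource usage and outcomes (the start date is illustrative):

$ sacct --user=$USER --starttime=2023-01-01 --format=JobID,JobName,Elapsed,ReqMem,MaxRSS,NCPUS,State,ExitCode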

Debugging failed Jobs

If one of your jobs has failed and you need to debug why this has occurred, you should consult the job records held by the scheduler using the sacct command referenced above, as well as the generated job logs.

These output and error log files will be generated in the job working directory, either with the output log file name you chose or with the default name of the form slurm-$JOBID.out, where $JOBID is the scheduler-provided job ID. Looking at these logs should indicate the source of any issues.

sacct will also give a job’s state and ExitCode field with each job.

The ExitCode is the return value of the exiting program/script. It can be a user defined value if the job is finished with a call to ‘exit(number)’. Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a Reason of “NonZeroExitCode”.

The job logs may also include a “derived exit code” field. This is set to the value of the highest exit code returned by all of the job’s steps (srun invocations).


Job Submission / Control on Stanage

Tip

The Stanage cluster has been configured to have the same default resource request limits as the ShARC cluster. Please see our Choosing appropriate compute resources page for further information.

Interactive Jobs

SLURM uses a single command to launch interactive jobs:

  • srun Standard SLURM command supporting graphical applications.

Usage of the command is as follows:

$ srun --pty bash -i

You can configure the resources available to the interactive session by adding command line options. For example to start an interactive session with access to 16 GB of RAM:

$ srun --mem=16G --pty bash -i

To start a session with access to 2 cores, use either:

$ srun --cpus-per-task=2 --pty bash -i #2 cores per task, 1 task and 1 node per job default. Preferred!
$ srun --ntasks-per-node=2 --pty bash -i #2 tasks per node, 1 core per task and 1 node per job default.

Please take care with your chosen options as usage in concert with other options can be multiplicative.

A further explanation of why you may use the tasks options or cpus options can be found here.

A table of common interactive job options is given below; any of these can be combined together to request more resources.

Slurm Command                    Description
-t min or -t days-hh:mm:ss       Specify the total maximum wall clock execution time for
                                 the job. The upper limit is 08:00:00. Note: these limits
                                 may differ for reservations / projects.
--mem=xxG                        Specify the maximum amount (xx) of real memory to be
                                 requested per node. If the real memory usage of your job
                                 exceeds this value multiplied by the number of cores /
                                 nodes you requested, your job will be killed.
-c nn or --cpus-per-task=nn      Cores per task; take care with your chosen number of tasks.
--ntasks-per-node=nn             Tasks per node; take care with your chosen number of cores
                                 per node. The default is one task per node, but other
                                 options (e.g. --cpus-per-task) adjust the default of one
                                 core per task.

Batch Jobs

Tip

Batch jobs have larger resource limits than interactive jobs! For guidance on what these limits are and how best to select resources please see our Choosing appropriate compute resources page.

SLURM uses a single command to submit batch jobs:

  • sbatch Standard SLURM command with no support for interactivity or graphical applications.

The Slurm docs have a complete list of available sbatch options.

Batch submission scripts are submitted to the scheduler as follows:

sbatch submission.sh

Note the job submission number. For example:

Submitted batch job 1226

You can check your output log or error log file as below:

cat JOB_NAME-1226.out

There are numerous further options you can request in your batch submission files which are detailed below:

Name your job submission:

#SBATCH --comment=JOB_NAME

Specify a number of nodes:

#SBATCH --nodes=1

Specify a number of tasks per node:

#SBATCH --ntasks-per-node=4

Specify a number of tasks:

#SBATCH --ntasks=4

Specify a number of cores per task:

#SBATCH --cpus-per-task=4

Request a specific amount of memory per job:

#SBATCH --mem=16G

Specify the job output log file name:

#SBATCH --output=output.%j.test.out

Request a specific amount of time:

#SBATCH --time=00:30:00

Request job update email notifications:

#SBATCH --mail-user=username@sheffield.ac.uk

For the full list of the available options please visit the SLURM manual webpage for sbatch here: https://slurm.schedmd.com/sbatch.html

Here is an example SLURM batch submission script that runs a fictitious program called foo:

#!/bin/bash
# Request 5 gigabytes of real memory (mem)
#SBATCH --mem=5G

# load the module for the program we want to run
module load apps/gcc/foo

# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res

Some things to note:

  • The first line always needs to be #!/bin/bash (to tell the scheduler that this is a bash batch script).

  • Comments start with a #.

  • It is always best to fully specify a job's resources in your submission script.

  • All Slurm Scheduler options start with #SBATCH

  • You should use the SLURM option --ntasks=nn (number of "tasks") for programs using distributed parallelism (MPI).

  • You should use the SLURM option --ntasks-per-node=nn (number of "tasks per node") for programs using distributed parallelism (MPI).

  • You should use the SLURM option --cpus-per-task=nn (number of "cores per task") for programs using shared memory parallelism (SMP or OpenMP).

  • You will often require one or more module commands in your submission file to make programs and libraries available to your scripts.


Cluster job resource limits

While the Sheffield clusters have very large amounts of resources available for your jobs, limits are applied in order for the schedulers to function. The limits below apply to the default free queues. Other queues may have different settings.

Warning

You must ensure that your jobs do not attempt to exceed these limits as the schedulers are not forgiving and will summarily kill any job which exceeds the requested limits without warning.

CPU Limits

Please note that the CPU limits depend on the chosen parallel environment for ShARC jobs, with SMP-type jobs limited to a maximum of 16 cores in either job type. Please also note that interactive jobs with more than 16 cores are only available in the MPI parallel environment.

Warning

Please note that for either cluster the larger the number of cores you request in an interactive job the more likely the request is to fail as the requested resource is not immediately available.

CPU Allocation Limits Table

Scheduler Type     No. CPU cores available,          No. CPU cores available,          Submission Argument
                   Interactive Job                   Batch Job
                   (Default / Min / Max)             (Default / Min / Max)
SLURM (Stanage)    1 / 1 / ~11264 (MPI), 64 (SMP)    1 / 1 / ~11264 (MPI), 64 (SMP)    -c <nn>
SLURM (Bessemer)   1 / 1 / 40                        1 / 1 / 40                        -c <nn>
SGE (ShARC)        1 / 1 / ~1536 (MPI), 16 (SMP)     1 / 1 / ~1536 (MPI), 16 (SMP)     -pe <env> <nn>

Time Limits

Time Allocation Limits Table

Scheduler Type     Interactive Job     Batch Job          Submission Argument
                   (Default / Max)     (Default / Max)
SLURM (Stanage)    8 / 8 hrs           8 / 96 hrs         --time=<days-hh:mm:ss>
SLURM (Bessemer)   8 / 8 hrs           8 / 168 hrs        --time=<days-hh:mm:ss>
SGE (ShARC)        8 / 8 hrs           8 / 96 hrs         -l h_rt=<hh:mm:ss>

Memory Limits

Memory Allocation Limits Table

SLURM (Stanage)
  Standard nodes: 251 GB; Large RAM nodes: 1007 GB; Very Large RAM nodes: 2014 GB
  Interactive job (Default / Max): 2 GB / 251 GB
  Batch job (Default / Max): 2 GB / 251 GB (SMP), ~74404 GB (MPI)
  Submission argument (per job basis): --mem=<nn>

SLURM (Bessemer)
  Standard nodes: 192 GB; Large RAM nodes: N/A; Very Large RAM nodes: N/A
  Interactive job (Default / Max): 2 GB / 192 GB
  Batch job (Default / Max): 2 GB / 192 GB
  Submission argument (per job basis): --mem=<nn>

SGE (ShARC)
  Standard nodes: 64 GB; Large RAM nodes: 256 GB; Very Large RAM nodes: N/A
  Interactive job (Default / Max): 2 GB / 64 GB
  Batch job (Default / Max): 2 GB / 64 GB (SMP), ~6144 GB (MPI)
  Submission argument (per core basis): -l rmem=<nn>

Advanced / Automated job submission and management

The Distributed Resource Management Application API (DRMAA) is available on both clusters and can be used with advanced scripts or a computational pipeline manager (such as Ruffus).

For further detail see our guide to the DRMAA API.

Reference information and further resources

Quick reference information for the SGE scheduler (ShARC), Bessemer scheduler (SLURM) and Stanage scheduler can be found in the Scheduler Reference Info section.

Stanford Research Computing Center provide a SGE to SLURM conversion guide.