Job Submission and Control

Introduction

As mentioned in the What is HPC? section, HPC clusters like ShARC, Bessemer and Stanage use a program called a scheduler to control and submit work to appropriate nodes.

All user work is dispatched to a cluster using a tool called a job scheduler. A job scheduler is a tool used to manage, submit and fairly queue users’ jobs in the shared environment of a HPC cluster. A cluster will normally use a single scheduler and allow a user to request either an immediate interactive job, or a queued batch job.

Here at the University of Sheffield, we use 2 different schedulers, the SGE scheduler on ShARC and the more modern SLURM scheduler on Bessemer and Stanage. Both have the same purpose, use similar commands and work on the same three basic principles:

  • they allocate exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work,

  • they provide a framework for starting, executing, and monitoring work on the set of allocated nodes,

  • they arbitrate contention for resources by managing a queue of pending work.


Key Concepts

Tip

If you are not familiar with basic computer architecture we highly recommend reading our General Computer Architecture Quick Start page before continuing.

When engaging with our documentation, several concepts relating to schedulers and jobs must be well understood; these are explained below:

Types of Job

There are two types of job on any scheduler, interactive and batch:

Interactive jobs are requested and run immediately, providing the user with a bash shell (or another shell of their choosing) in which they can then run their software or scripts.

Typically only very few nodes in an HPC cluster are dedicated solely to interactive jobs, and interactive jobs require the resources to be available instantaneously when the request is made, or the request will fail. This means that interactive requests cannot always be fulfilled, particularly when requesting multiple cores.

Batch jobs are the other kind of job where a user prepares a batch submission script which both requests the resources for the job from the scheduler and contains the execution commands for a given program to run. On job submission, the scheduler will add it to the chosen queue and run your job when resources become available.

Any task that can be executed without any user intervention while it is running can be submitted as a batch job. This excludes jobs that require a Graphical User Interface (GUI); however, many common GUI applications such as ANSYS or MATLAB can also be used without their GUIs.

If you wish to use a cluster for interactive work and/or running applications like MATLAB or ANSYS using GUIs, you will need to request an interactive job from the scheduler.

If you wish to use a cluster to dispatch a very large ANSYS model, you will need to request a batch job from the scheduler and prepare an appropriate batch script.

Note

Long running jobs should use the batch submission system rather than requesting an interactive session for a very long time. Doing this will lead to better cluster performance for all users.

Queues and partitions

Queues, or partitions in SLURM, hold jobs submitted to a scheduler until they can be run. They can have an assortment of constraints such as job size limits, job time limits and permitted users, and some nodes will be configured to accept jobs only from certain queues, e.g. department-specific nodes.

All jobs are dispatchable

When a user requests that a job (either a batch job or an interactive session) is run on the cluster, the scheduler will run jobs from the queue based on a set of rules, priorities and availabilities.

How and where a job can run are set when the job is requested, based on the resources requested as well as the chosen queue (assuming the user has permission to use that queue).

This means that not all interactive jobs are possible as the resources may not be available. It also means that the amount of time it takes for any batch job to run is dependent on how large the job resource request is, which queue it is in, what resources are available in that queue and how much previous resource usage the user has. The larger a resource request is, the longer it will take to wait for those resources to become available and the longer it will take for subsequent jobs to queue as a result of the fair scheduling algorithm.

Fair scheduling

Job schedulers are typically configured to use a fair-share / wait time system. In short, the scheduler assesses your previous CPU time and memory time (consumption) to give a requested job a priority. Subsequently it uses how long your job has had to wait in order to bump up that priority. Once your job is the highest priority, the job will then run when the requested resources become available on the system. Your running total for CPU time / memory time usage will decay over time but in general the more resources you request and for longer, the lower your initial job priority gets and the longer you have to wait behind other people’s jobs.

If you see one of your jobs finish and another immediately begin, this is not an intentional chaining setting on the scheduler's part. It is most likely a reflection of your subsequent jobs waiting for resources to become available; it just so happens that your running job finishing frees up the resources for the next.

As a natural consequence of backfilling into any trapped resources, you may see jobs with small time, memory and core requests, and a lower priority, running before your own job with a higher priority. This is because they are small enough to utilize the trapped resources before the job trapping those resources finishes. This is not unfair; it would be inefficient and irresponsible for us to intentionally block a job from running simply because its priority is lower than that of a larger job that won't fit in the trapped resources.


Job Submission / Control on ShARC

Interactive Jobs

There are three commands for requesting an interactive shell using SGE:

  • qrsh - No support for graphical applications. Standard SGE command.

  • qsh - Supports graphical applications. Standard SGE command.

  • qrshx - Supports graphical applications. Superior to qsh and is unique to Sheffield’s clusters.

Usage of these commands is as follows:

$ qrshx

You can configure the resources available to the interactive session by adding command line options. For example to start an interactive session with access to 16 GB of RAM:

$ qrshx -l rmem=16G

To start a session with access to 2 cores in the SMP parallel environment:

$ qrshx -pe smp 2

A table of common interactive job options is given below; any of these can be combined together to request more resources.

SGE Command           Description
-l h_rt=hh:mm:ss      Specify the total maximum wall clock execution time for the job.
                      The upper limit is 08:00:00. Note: these limits may differ for
                      reservations / projects.
-l rmem=xxG           Specify the maximum amount (xx) of real memory to be requested
                      per CPU core. If the real memory usage of your job exceeds this
                      value multiplied by the number of cores / nodes you requested,
                      your job will be killed.
-pe <env> <nn>        Specify a parallel environment (env) and a number of processor
                      cores (nn), e.g. -pe smp 4 for SMP jobs or -pe mpi 4 for MPI jobs.

Note that ShARC has multiple parallel environments; the current list can be found on the ShARC Parallel Environments page.
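
For example, the options above can be combined to request a two-core SMP interactive session with 4 GB of RAM per core for two hours:

$ qrshx -l h_rt=02:00:00 -l rmem=4G -pe smp 2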

Batch Jobs

Tip

Batch jobs have larger resource limits than interactive jobs! For guidance on what these limits are and how best to select resources please see our Choosing appropriate compute resources page.

There is a single command to submit jobs via SGE:

  • qsub - Standard SGE command with no support for interactivity or graphical applications.

Batch submission scripts are submitted to the scheduler as follows:

qsub submission.sh

Note the job submission number. For example:

Your job 12345678 ("submission.sh") has been submitted

You can check your output log or error log file when the job is finished.

cat submission.sh.o12345678
cat submission.sh.e12345678

There are numerous further options you can request in your batch submission files which are detailed below:

Pass through current shell environment (sometimes important):

#$ -V

Name your job submission:

#$ -N test_job

Specify a parallel environment for SMP jobs where N is a number of cores:

#$ -pe smp N

Specify a parallel environment for MPI jobs where N is a number of cores:

#$ -pe mpi N

Request a specific amount of memory where N is a number of gigabytes per core:

#$ -l rmem=NG

Request a specific amount of time in hours, minutes and seconds:

#$ -l h_rt=hh:mm:ss

Request email notifications on start, end and abort:

#$ -M me@somedomain.com
#$ -m abe

For the full list of the available options please visit the SGE manual webpage for qsub here: http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html

Here is an example SGE batch submission script that runs a fictitious program called foo:

#!/bin/bash
# Request 5 gigabytes of real memory (mem)
#$ -l rmem=5G

# load the module for the program we want to run
module load apps/gcc/foo

# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res

Some things to note:

  • The first line always needs to be #!/bin/bash (to tell the scheduler that this is a bash batch script).

  • Comments start with a #.

  • It is always best to fully specify a job's resources in your submission script.

  • All SGE Scheduler options, such as the amount of memory requested, start with #$

  • You will often require one or more module commands in your submission file. These make programs and libraries available to your scripts. Many applications and libraries are available as modules on ShARC.

Here is a more complex example that requests more resources:

#!/bin/bash
# Request 16 gigabytes of real memory (RAM) 4 cores *4G = 16
#$ -l rmem=4G
# Request 4 cores in an OpenMP environment
#$ -pe openmp 4
# Email notifications to me@somedomain.com
#$ -M me@somedomain.com
# Email notifications if the job aborts
#$ -m a
# Name the job
#$ -N my_job
# Request 24 hours of time
#$ -l h_rt=24:00:00

# Load the modules required by our program
module load compilers/gcc/5.2
module load apps/gcc/foo

# Set the OMP_NUM_THREADS environment variable to the number of requested
# cores ($NSLOTS). This is needed to ensure efficient core usage.
export OMP_NUM_THREADS=$NSLOTS

# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res
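
For distributed (MPI) programs the script differs mainly in the parallel environment requested and in how the program is launched. Below is a minimal sketch following the same fictitious foo program; the module name is illustrative and whether mpirun needs an explicit core count depends on the MPI build you load (see the ShARC Parallel Environments page):

#!/bin/bash
# Request 8 cores in the MPI parallel environment
#$ -pe mpi 8
# Request 2 gigabytes of real memory per core (8 cores * 2G = 16G in total)
#$ -l rmem=2G
# Request 1 hour of run time
#$ -l h_rt=01:00:00
# Name the job
#$ -N my_mpi_job

# Load the module for the program we want to run (illustrative)
module load apps/gcc/foo

# With a tightly integrated MPI build, mpirun picks up the allocated
# core count ($NSLOTS) from the scheduler
mpirun foo foo.dat foo.res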

Monitoring running Jobs

Running and queued jobs can be monitored using a single command, qstat:

Display your own jobs queued on the system

$ qstat

Show a specific running or queueing job’s details:

qstat -j jobid

Display all jobs queued on the system

$ qstat -u "*"

Display all jobs queued by the username foo1bar

$ qstat -u foo1bar

Display all jobs in the openmp parallel environment

$ qstat -pe openmp

Display all jobs in the queue named foobar

$ qstat -q foobar.q

Example output:

$ qstat -u "*"
job-ID  prior   name       user          state submit/start at     queue                              slots   ja-task-ID
------------------------------------------------------------------------------------------------------------------------
1234567 0.00000 INTERACTIV foo1bar       dr    12/24/2021 07:13:20 interactive.q@sharc-node004.sh     1
1234568 0.00000 job.sh     foo1bar       r     01/22/2022 05:37:31 all.q@sharc-node019.shef.ac.uk     16
1234569 0.00000 job.sh     foo1bar       r     01/23/2022 07:41:18 all.q@sharc-node084.shef.ac.uk     16
1234570 0.00000 job.sh     foo1bar       Rr    01/23/2022 08:03:22 all.q@sharc-node068.shef.ac.uk     16
1234571 0.00076 job.sh     foo1bar       qw    01/23/2022 07:06:18                                    1
1234572 0.00067 job.sh     foo1bar       hqw   01/23/2022 07:06:18                                    1
1234573 0.00000 job.sh     foo1bar       Eqw   01/21/2022 13:50:55                                    1
1234574 0.00000 job.sh     foo1bar       t     01/24/2022 13:04:25 all.q@sharc-node159.shef.ac.uk     1        22964

SGE Job States:

State      Explanation                          SGE State Letter Code/s
Pending    pending, queued                      qw
Pending    pending, user and/or system hold     hqw
Running    running                              r
Error      all pending states with error        Eqw, Ehqw, EhRqw

Key: q: queueing, r: running, w: waiting, h: on hold, E: error, R: re-run, s: job suspended, S: queue suspended, t: transferring, d: deletion.

Note

A full list of SGE and DRMAA states can be found here

Stopping or cancelling Jobs

Jobs can be stopped or cancelled using the qdel command:

A job can be cancelled using the qdel command as shown below, swapping out 123456 for your own job ID number:

$ qdel 123456
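
qdel also accepts a username, which can be used to remove all of your own queued and running jobs at once:

$ qdel -u $USER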

Investigating finished Jobs

Jobs which have already finished can be investigated using the qacct command:

The qacct command can be used to display status information about a user’s historical jobs.

Running the qacct command alone will provide a summary of used resources from the current month for the user running the command.

The command can be used as follows with the job’s ID to get job specific info:

$ qacct -j job-id

Or to view information about all of a specific user’s jobs:

$ qacct -j -u $USER

By default the qacct command will only bring up summary info about the user’s jobs from the current accounting file (which rotates monthly). Further detail about the output metrics and how to query jobs older than a month can be found on the dedicated qacct page.

Debugging failed Jobs

Note

One common form of job failure on ShARC is caused by Windows style line endings. If you see an error reported by qacct of the form:

failed searching requested shell because:

Or by qstat of the form:

failed: No such file or directory

You must replace these line endings as detailed in the FAQ.
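
For illustration, one common way to fix this, assuming the dos2unix utility is available on the login node (a plain sed substitution achieves the same; my_job.sh stands in for your own submission script):

# Convert Windows (CRLF) line endings to Unix (LF) in place
dos2unix my_job.sh

# Equivalent fix if dos2unix is not installed
sed -i 's/\r$//' my_job.sh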

If one of your jobs has failed and you need to debug why this has occurred, you should consult the job records held by the scheduler using the qacct command referenced above, as well as the generated job logs.

These output and error log files will be generated in the job working directory with the structure $JOBNAME.o$JOBID and $JOBNAME.e$JOBID where $JOBNAME is the user chosen name of the job and $JOBID is the scheduler provided job id. Looking at these logs should indicate the source of any issues.

The qacct info will contain two important metrics, the exit code and failure code.

The exit code is the return value of the exiting program/script. It can be a user defined value if the job is finished with a call to ‘exit(number)’. For abnormally terminated jobs it is the signal number + 128.

As an example, an exit code of 137 gives 137 - 128 = 9, i.e. signal 9 (SIGKILL): the job was sent the KILL signal and was killed, most likely by the scheduler.
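
You can do this arithmetic directly in the shell if you wish:

$ echo $((137 - 128))    # exit code 137 corresponds to signal 9 (SIGKILL)
9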

The failure code indicates why a job was abnormally terminated (by the scheduler). An incomplete table of common failure codes is shown below:

code    meaning
100     failure after job
37      qmaster enforced h_rt limit (job ran out of time)
30      rescheduling on application error
28      no current working dir
27      no shell
26      failure opening output
21      failure in recognizing job
19      no exit status
8       failure in prolog
1       failure before job (execd)


Job Submission / Control on Bessemer

Interactive Jobs

SLURM uses a single command to launch interactive jobs:

  • srun Standard SLURM command supporting graphical applications.

Usage of the command is as follows:

$ srun --pty bash -i

You can configure the resources available to the interactive session by adding command line options. For example to start an interactive session with access to 16 GB of RAM:

$ srun --mem=16G --pty bash -i

To start a session with access to 2 cores, use either:

$ srun --cpus-per-task=2 --pty bash -i #2 cores per task, 1 task and 1 node per job default. Preferred!
$ srun --ntasks-per-node=2 --pty bash -i #2 tasks per node, 1 core per task and 1 node per job default.

Please take care with your chosen options as usage in concert with other options can be multiplicative.

A further explanation of why you may use the tasks options or cpus options can be found here.

A table of common interactive job options is given below; any of these can be combined together to request more resources.

Slurm Command                    Description
-t min or -t days-hh:mm:ss       Specify the total maximum wall clock execution time for
                                 the job. The upper limit is 08:00:00. Note: these limits
                                 may differ for reservations / projects.
--mem=xxG                        Specify the maximum amount (xx) of real memory to be
                                 requested per node. If the real memory usage of your job
                                 exceeds this value multiplied by the number of cores /
                                 nodes you requested, your job will be killed.
-c nn or --cpus-per-task=nn      Cores per task; take care with your chosen number of tasks.
--ntasks-per-node=nn             Tasks per node; take care with your chosen number of cores
                                 per node. The default is one task per node, but other
                                 options (e.g. --cpus-per-task) adjust the default of one
                                 core per task.
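
For example, the options above can be combined to request a 2-core interactive session with 8 GB of RAM for 2 hours:

$ srun --time=02:00:00 --mem=8G --cpus-per-task=2 --pty bash -i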

Batch Jobs

Tip

Batch jobs have larger resource limits than interactive jobs! For guidance on what these limits are and how best to select resources please see our Choosing appropriate compute resources page.

SLURM uses a single command to submit batch jobs:

  • sbatch Standard SLURM command with no support for interactivity or graphical applications.

The Slurm docs have a complete list of available sbatch options.

Batch submission scripts are submitted to the scheduler as follows:

sbatch submission.sh

Note the job submission number. For example:

Submitted batch job 1226

You can check your output log or error log file as below:

cat JOB_NAME-1226.out

There are numerous further options you can request in your batch submission files which are detailed below:

Name your job submission:

#SBATCH --comment=JOB_NAME

Specify a number of nodes:

#SBATCH --nodes=1

Warning

Note that the Bessemer free queues do not permit the use of more than 1 node per job.

Specify a number of tasks per node:

#SBATCH --ntasks-per-node=4

Specify a number of tasks:

#SBATCH --ntasks=4

Specify a number of cores per task:

#SBATCH --cpus-per-task=4

Request a specific amount of memory per job:

#SBATCH --mem=16G

Specify the job output log file name:

#SBATCH --output=output.%j.test.out

Request a specific amount of time:

#SBATCH --time=00:30:00

Request job update email notifications:

#SBATCH --mail-user=username@sheffield.ac.uk

For the full list of the available options please visit the SLURM manual webpage for sbatch here: https://slurm.schedmd.com/sbatch.html

Here is an example SLURM batch submission script that runs a fictitious program called foo:

#!/bin/bash
# Request 5 gigabytes of real memory (mem)
#SBATCH --mem=5G

# load the module for the program we want to run
module load apps/gcc/foo

# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res

Some things to note:

  • The first line always needs to be #!/bin/bash (to tell the scheduler that this is a bash batch script).

  • Comments start with a #.

  • It is always best to fully specify a job's resources in your submission script.

  • All Slurm Scheduler options start with #SBATCH

  • You should use the SLURM option --ntasks=nn (number of "tasks") for programs using distributed parallelism (MPI).

  • You should use the SLURM option --ntasks-per-node=nn (number of "tasks per node") for programs using distributed parallelism (MPI). Note that the Bessemer free queues do not permit the use of more than 1 node per job.

  • You should use the SLURM option --cpus-per-task=nn (number of "cores per task") for programs using shared memory parallelism (SMP or OpenMP).

  • You will often require one or more module commands in your submission file to make programs and libraries available to your scripts. Many applications and libraries are available as modules on Bessemer.

Here is a more complex SMP example that requests more resources:

#!/bin/bash
# Request 16 gigabytes of real memory (RAM) 4 cores *4G = 16
#SBATCH --mem=16G
# Request 4 cores
#SBATCH --cpus-per-task=4
# Email notifications to me@somedomain.com
#SBATCH --mail-user=me@somedomain.com
# Email notifications if the job fails
#SBATCH --mail-type=FAIL
# Change the name of the output log file.
#SBATCH --output=output.%j.test.out
# Rename the job's name
#SBATCH --comment=my_smp_job


# Load the modules required by our program
module load compilers/gcc/5.2
module load apps/gcc/foo

# Set the OMP_NUM_THREADS environment variable to the number of requested
# cores ($SLURM_CPUS_PER_TASK). This is needed to ensure efficient core usage.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res
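
For comparison, below is a minimal sketch of a distributed (MPI) submission script using --ntasks. The foo program is fictitious as above and the module name is illustrative; check module avail for the MPI builds installed on Bessemer:

#!/bin/bash
# Request 8 MPI tasks on a single node
# (the Bessemer free queues do not permit more than 1 node per job)
#SBATCH --nodes=1
#SBATCH --ntasks=8
# Request 16 gigabytes of real memory for the job
#SBATCH --mem=16G
# Request 1 hour of run time
#SBATCH --time=01:00:00
# Name the job
#SBATCH --comment=my_mpi_job

# Load the modules required by our program (illustrative)
module load apps/gcc/foo

# srun launches one copy of the program per requested task
srun foo foo.dat foo.res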

Preemptable Jobs

Under certain conditions, Slurm on Bessemer allows jobs running in higher-priority Partitions (sets of nodes) to preempt jobs in lower-priority Partitions. When a higher priority job preempts a lower priority job, the lower priority job is stopped (and by default cancelled) and the higher priority job takes its place.

Specifically, Slurm allows users to run interactive sessions and batch jobs using idle resources in private (research group-owned or dept-owned) nodes, but these resources will be reclaimed (and the associated jobs preempted) if members of those groups/departments submit jobs that can only start if those resources are repurposed.

Note

Support for preemptable jobs has been enabled on a trial basis and will be disabled if it impacts on priority access by groups / departments to private nodes they have purchased.

An example of the use of preemptable jobs:

  1. Researcher A wants to run a job using 2 GPUs. All ‘public’ GPUs are being used by jobs, but some GPUs in a private node belonging to research group X are idle.

  2. Researcher A decides that they want to use those idle GPUs but they aren’t a member of research group X; however, they are happy to take the risk of their job being preempted by a member of research group X.

  3. Researcher A submits a job and makes it preemptable (by submitting it to the preempt Partition using --partition=preempt).

  4. The job starts running on a node which is a member of the preempt and research-group-X Partitions.

  5. Researcher B is a member of research group X and submits a job to the research-group-X Partition.

  6. This job can only start if the resources being used by the first job are reclaimed.

  7. As a result, Slurm preempts the first job with this second job, and the first job is cancelled.

  8. The second job runs to completion.

Tips for using preemptable jobs:

  • Ensure that you’re able to reliably re-submit your preemptable job if it is preempted before completion. A common way of doing this is to write out state/progress information periodically whilst the job is running.

  • Select a sensible frequency for writing out state/progress information or you may cause poor performance due to storage write speed limits.

Monitoring running Jobs

There are two commands to monitor running and queued jobs:

The squeue command is used to pull up information about jobs in the queue. By default it will list the job ID, partition, username, job status, number of nodes and node names for all jobs queued or running within SLURM.

Display all jobs queued on the system:

$ squeue

To limit this command to only display a single user’s jobs the --user flag can be used:

$ squeue --user=$USER

Further information without abbreviation can be shown by using the --long flag:

$ squeue --user=$USER --long

The squeue command also provides a method to calculate the estimated start time for a job by using the --start flag:

$ squeue --user=$USER --start

When checking the status of a job you may wish to check for updates at a time interval. This can be achieved by using the --iterate flag and a number of seconds:

$ squeue --user=$USER --start --iterate=n_seconds

You can stop this command by pressing Ctrl + C.

Example output:

$ squeue
        JOBID   PARTITION   NAME      USER  ST       TIME  NODES NODELIST(REASON)
        1234567 interacti   bash   foo1bar   R   17:19:40      1 bessemer-node001
        1234568 sheffield job.sh   foo1bar   R   17:21:40      1 bessemer-node046
        1234569 sheffield job.sh   foo1bar  PD   17:22:40      1 (Resources)
        1234570 sheffield job.sh   foo1bar  PD   16:47:06      1 (Priority)
        1234571       gpu job.sh   foo1bar   R 1-19:46:53      1 bessemer-node026
        1234572       gpu job.sh   foo1bar   R 1-19:46:54      1 bessemer-node026
        1234573       gpu job.sh   foo1bar   R 1-19:46:55      1 bessemer-node026
        1234574       gpu job.sh   foo1bar   R 1-19:46:56      1 bessemer-node026
        1234575       gpu job.sh   foo1bar  PD       9:04      1 (ReqNodeNotAvail, UnavailableNodes:bessemer-node026)
        1234576 sheffield job.sh   foo1bar  PD    2:57:24      1 (QOSMaxJobsPerUserLimit)

The states shown above include Running ("R") and Pending ("PD"), with various reasons for pending states, e.g. a requested node being full of jobs (ReqNodeNotAvail) or a user hitting the maximum number of jobs they can run simultaneously in a QOS (QOSMaxJobsPerUserLimit).

A list of the most relevant job states and reasons can be seen below:

SLURM Job States:

Jobs typically pass through several states in the course of their execution. The typical states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED.

Status          Code   Explanation
COMPLETED       CD     The job has completed successfully.
COMPLETING      CG     The job is finishing but some processes are still active.
CANCELLED       CA     The job was explicitly cancelled by the user or system administrator.
FAILED          F      The job terminated with a non-zero exit code and failed to execute.
PENDING         PD     The job is waiting for resource allocation. It will eventually run.
PREEMPTED       PR     The job was terminated because of preemption by another job.
RUNNING         R      The job is currently allocated to a node and is running.
SUSPENDED       S      A running job has been stopped with its cores released to other jobs.
STOPPED         ST     A running job has been stopped with its cores retained.
OUT_OF_MEMORY   OOM    The job experienced an out-of-memory error.
TIMEOUT         TO     The job exited because it reached its walltime limit.
NODE_FAIL       NF     The job terminated due to failure of one or more allocated nodes.

A full list of job states can be found at: https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES

SLURM Job Reasons:

These codes identify the reason that a job is waiting for execution. A job may be waiting for more than one reason, in which case only one of those reasons is displayed.

Reason Code              Explanation
Priority                 One or more higher priority jobs are queued ahead of yours. Your
                         job will eventually run.
Dependency               This job is waiting for a dependent job to complete and will run
                         afterwards.
Resources                The job is waiting for resources to become available and will
                         eventually run.
InvalidAccount           The job's account is invalid. Cancel the job and rerun with the
                         correct account.
InvalidQoS               The job's QoS is invalid. Cancel the job and rerun with the
                         correct QoS.
QOSGrpMaxJobsLimit       The maximum number of jobs for your job's QoS has been met; the
                         job will run eventually.
PartitionMaxJobsLimit    The maximum number of jobs for your job's partition has been met;
                         the job will run eventually.
AssociationMaxJobsLimit  The maximum number of jobs for your job's association has been
                         met; the job will run eventually.
JobLaunchFailure         The job could not be launched. This may be due to a file system
                         problem, an invalid program name, etc.
NonZeroExitCode          The job terminated with a non-zero exit code.
SystemFailure            Failure of the Slurm system, a file system, the network, etc.
TimeLimit                The job exhausted its time limit.
WaitingForScheduling     No reason has been set for this job yet. Waiting for the scheduler
                         to determine the appropriate reason.
BadConstraints           The job's constraints cannot be satisfied.

A full list of job reasons can be found at: https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES

The sstat command can be used to display status information about a user’s currently running jobs such as the CPU usage, task or node information and memory consumption.

The command can be invoked as follows with a specific job ID:

$ sstat --jobs=job-id

And to display specific information you can use the --format flag to choose your output:

$ sstat --jobs=job-id --format=var_1,var_2, ... , var_N

Some of these variables are listed in the table below:

sstat format variable names

Variable    Description
AveCPU      Average (system + user) CPU time of all tasks in job.
AveRSS      Average resident set size of all tasks in job.
AveVMSize   Average Virtual Memory size of all tasks in job.
JobID       The id of the Job.
MaxRSS      Maximum resident set size of all tasks in job.
MaxVMSize   Maximum Virtual Memory size of all tasks in job.
NTasks      Total number of tasks in a job or step.

A full list of variables for the --format flag can be found with the --helpformat flag or by visiting the slurm page on sstat.
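
For example, to check the CPU time and peak memory use of a running job (replace job-id with your own job's ID; depending on how the job was launched you may need to query the batch step explicitly, e.g. --jobs=job-id.batch):

$ sstat --jobs=job-id --format=JobID,AveCPU,MaxRSS,MaxVMSize,NTasks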

Stopping or cancelling Jobs

Jobs can be stopped or cancelled using the scancel command:

You can stop jobs with the scancel command and the job’s ID (replacing job-id with the number):

$ scancel job-id

To cancel multiple jobs you can supply a comma separated list:

$ scancel job-id1, job-id2, job-id3
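
scancel also accepts a --user flag, which can be used to cancel all of your own jobs at once:

$ scancel --user=$USER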

Investigating finished Jobs

Jobs which have already finished can be investigated using the sacct command:

The sacct command can be used to display status information about a user’s historical jobs.

The command can be used as follows with the job’s ID:

$ sacct --jobs=job-id

Or to view information about all of a specific user’s jobs:

$ sacct --user=$USER

By default the sacct command will only bring up information about the user's jobs from the current day. By using the --starttime flag the command will look further back to the given date, e.g.:

$ sacct --user=$USER --starttime=YYYY-MM-DD

Like the sstat command, the --format flag can be used to choose the command output:

$ sacct --user=$USER --format=var_1,var_2, ... ,var_N

sacct format variable names

Variable      Description
Account       The account the job ran under.
AveCPU        Average (system + user) CPU time of all tasks in job.
AveRSS        Average resident set size of all tasks in job.
AveVMSize     Average Virtual Memory size of all tasks in job.
CPUTime       Formatted (Elapsed time * CPU count) used by a job or step.
Elapsed       The job's elapsed time, formatted as DD-HH:MM:SS.
ExitCode      The exit code returned by the job script or salloc.
JobID         The id of the Job.
JobName       The name of the Job.
MaxRSS        Maximum resident set size of all tasks in job.
MaxVMSize     Maximum Virtual Memory size of all tasks in job.
MaxDiskRead   Maximum number of bytes read by all tasks in the job.
MaxDiskWrite  Maximum number of bytes written by all tasks in the job.
ReqCPUS       Requested number of CPUs.
ReqMem        Requested amount of memory.
ReqNodes      Requested number of nodes.
NCPUS         The number of CPUs used in a job.
NNodes        The number of nodes used in a job.
User          The username of the person who ran the job.

A full list of variables for the --format flag can be found with the --helpformat flag or by visiting the slurm page on sacct.
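
For example, to summarise your recent jobs' resource usage and outcomes (the start date is illustrative):

$ sacct --user=$USER --starttime=2023-01-01 --format=JobID,JobName,Elapsed,ReqMem,MaxRSS,NCPUS,State,ExitCode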

Debugging failed Jobs

If one of your jobs has failed and you need to debug why this has occurred, you should consult the job records held by the scheduler using the sacct command referenced above, as well as the generated job logs.

These output and error log files will be generated in the job working directory, either with the output log file name you chose or with the default name of the form slurm-$JOBID.out, where $JOBID is the scheduler-provided job ID. Looking at these logs should indicate the source of any issues.

sacct will also give a job’s state and ExitCode field with each job.

The ExitCode is the return value of the exiting program/script. It can be a user defined value if the job is finished with a call to ‘exit(number)’. Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a Reason of “NonZeroExitCode”.

The job logs may also include a “derived exit code” field. This is set to the value of the highest exit code returned by all of the job’s steps (srun invocations).


Job Submission / Control on Stanage

Tip

The Stanage cluster has been configured to have the same default resource request limits as the ShARC cluster. Please see our Choosing appropriate compute resources page for further information.

Interactive Jobs

SLURM uses a single command to launch interactive jobs:

  • srun Standard SLURM command supporting graphical applications.

Usage of the command is as follows:

$ srun --pty bash -i

You can configure the resources available to the interactive session by adding command line options. For example to start an interactive session with access to 16 GB of RAM:

$ srun --mem=16G --pty bash -i

To start a session with access to 2 cores, use either:

$ srun --cpus-per-task=2 --pty bash -i #2 cores per task, 1 task and 1 node per job default. Preferred!
$ srun --ntasks-per-node=2 --pty bash -i #2 tasks per node, 1 core per task and 1 node per job default.

Please take care with your chosen options as usage in concert with other options can be multiplicative.

A further explanation of why you may use the tasks options or cpus options can be found here.

A table of common interactive job options is given below; any of these can be combined together to request more resources.

Slurm Command                    Description
-t min or -t days-hh:mm:ss       Specify the total maximum wall clock execution time for
                                 the job. The upper limit is 08:00:00. Note: these limits
                                 may differ for reservations / projects.
--mem=xxG                        Specify the maximum amount (xx) of real memory to be
                                 requested per node. If the real memory usage of your job
                                 exceeds this value multiplied by the number of cores /
                                 nodes you requested, your job will be killed.
-c nn or --cpus-per-task=nn      Cores per task; take care with your chosen number of tasks.
--ntasks-per-node=nn             Tasks per node; take care with your chosen number of cores
                                 per node. The default is one task per node, but other
                                 options (e.g. --cpus-per-task) adjust the default of one
                                 core per task.

Batch Jobs

Tip

Batch jobs have larger resource limits than interactive jobs! For guidance on what these limits are and how best to select resources please see our Choosing appropriate compute resources page.

SLURM uses a single command to submit batch jobs:

  • sbatch Standard SLURM command with no support for interactivity or graphical applications.

The Slurm docs have a complete list of available sbatch options.

Batch submission scripts are submitted to the scheduler as follows:

sbatch submission.sh

Note the job submission number. For example:

Submitted batch job 1226

You can check your output log or error log file as below:

cat JOB_NAME-1226.out

There are numerous further options you can request in your batch submission files which are detailed below:

Name your job submission:

#SBATCH --comment=JOB_NAME

Specify a number of nodes:

#SBATCH --nodes=1

Specify a number of tasks per node:

#SBATCH --ntasks-per-node=4

Specify a number of tasks:

#SBATCH --ntasks=4

Specify a number of cores per task:

#SBATCH --cpus-per-task=4

Request a specific amount of memory per job:

#SBATCH --mem=16G

Specify the job output log file name:

#SBATCH --output=output.%j.test.out

Request a specific amount of time:

#SBATCH --time=00:30:00

Request job update email notifications:

#SBATCH --mail-user=username@sheffield.ac.uk

For the full list of the available options please visit the SLURM manual webpage for sbatch here: https://slurm.schedmd.com/sbatch.html

Here is an example SLURM batch submission script that runs a fictitious program called foo:

#!/bin/bash
# Request 5 gigabytes of real memory (mem)
#SBATCH --mem=5G

# load the module for the program we want to run
module load apps/gcc/foo

# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res

Some things to note:

  • The first line always needs to be #!/bin/bash (to tell the scheduler that this is a bash batch script).

  • Comments start with a #.

  • It is always best to fully specify a job's resources in your submission script.

  • All Slurm Scheduler options start with #SBATCH

  • You should use the SLURM option --ntasks=nn (number of "tasks") for programs using distributed parallelism (MPI).

  • You should use the SLURM option --ntasks-per-node=nn (number of "tasks per node") for programs using distributed parallelism (MPI).

  • You should use the SLURM option --cpus-per-task=nn (number of "cores per task") for programs using shared memory parallelism (SMP or OpenMP).

  • You will often require one or more module commands in your submission file to make programs and libraries available to your scripts.


Cluster job resource limits

While the Sheffield clusters have very large amounts of resources available for your jobs, limits are applied in order for the schedulers to function. The limits below apply to the default free queues. Other queues may have different settings.

Warning

You must ensure that your jobs do not attempt to exceed these limits as the schedulers are not forgiving and will summarily kill any job which exceeds the requested limits without warning.

CPU Limits

Please note that the CPU limits depend on the chosen parallel environment for ShARC jobs, with SMP-type jobs limited to a maximum of 16 cores in either job type. Please also note that interactive jobs with more than 16 cores are only available in the MPI parallel environment.

Warning

Please note that for either cluster the larger the number of cores you request in an interactive job the more likely the request is to fail as the requested resource is not immediately available.

CPU Allocation Limits Table

Scheduler Type     No. CPU cores available,          No. CPU cores available,          Submission Argument
                   Interactive Job                   Batch Job
                   (Default / Min / Max)             (Default / Min / Max)
SLURM (Stanage)    1 / 1 / ~11264 (MPI), 64 (SMP)    1 / 1 / ~11264 (MPI), 64 (SMP)    -c <nn>
SLURM (Bessemer)   1 / 1 / 40                        1 / 1 / 40                        -c <nn>
SGE (ShARC)        1 / 1 / ~1536 (MPI), 16 (SMP)     1 / 1 / ~1536 (MPI), 16 (SMP)     -pe <env> <nn>

Time Limits

Time Allocation Limits Table

Scheduler Type     Interactive Job     Batch Job          Submission Argument
                   (Default / Max)     (Default / Max)
SLURM (Stanage)    8 / 8 hrs           8 / 96 hrs         --time=<days-hh:mm:ss>
SLURM (Bessemer)   8 / 8 hrs           8 / 168 hrs        --time=<days-hh:mm:ss>
SGE (ShARC)        8 / 8 hrs           8 / 96 hrs         -l h_rt=<hh:mm:ss>

Memory Limits

Memory Allocation Limits Table

SLURM (Stanage)
  Standard nodes: 251 GB; Large RAM nodes: 1007 GB; Very Large RAM nodes: 2014 GB
  Interactive job (Default / Max): 2 GB / 251 GB
  Batch job (Default / Max): 2 GB / 251 GB (SMP), ~74404 GB (MPI)
  Submission argument (per job basis): --mem=<nn>

SLURM (Bessemer)
  Standard nodes: 192 GB; Large RAM nodes: N/A; Very Large RAM nodes: N/A
  Interactive job (Default / Max): 2 GB / 192 GB
  Batch job (Default / Max): 2 GB / 192 GB
  Submission argument (per job basis): --mem=<nn>

SGE (ShARC)
  Standard nodes: 64 GB; Large RAM nodes: 256 GB; Very Large RAM nodes: N/A
  Interactive job (Default / Max): 2 GB / 64 GB
  Batch job (Default / Max): 2 GB / 64 GB (SMP), ~6144 GB (MPI)
  Submission argument (per core basis): -l rmem=<nn>

Advanced / Automated job submission and management

The Distributed Resource Management Application API (DRMAA) is available on both clusters and can be used with advanced scripts or a computational pipeline manager (such as Ruffus).

For further detail see our guide to the DRMAA API.

Reference information and further resources

Quick reference information for the SGE scheduler (ShARC), Bessemer scheduler (SLURM) and Stanage scheduler can be found in the Scheduler Reference Info section.

Stanford Research Computing Center provide a SGE to SLURM conversion guide.