Job Submission and Control
Introduction
As mentioned in the what is HPC section, HPC clusters like ShARC, Bessemer and Stanage use a program called a scheduler to control and submit work to appropriate nodes.
All user work is dispatched to a cluster using a job scheduler: a tool that manages, submits and fairly queues users’ jobs in the shared environment of an HPC cluster. A cluster will normally use a single scheduler and allow a user to request either an immediate interactive job or a queued batch job.
Here at the University of Sheffield, we use two different schedulers: the SGE scheduler on ShARC and the more modern SLURM scheduler on Bessemer and Stanage. Both have the same purpose, use similar commands and work on the same three basic principles:
they allocate exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work,
they provide a framework for starting, executing, and monitoring work on the set of allocated nodes,
they arbitrate contention for resources by managing a queue of pending work.
Key Concepts
Tip
If you are not familiar with basic computer architecture we highly recommend reading our General Computer Architecture Quick Start page before continuing.
When engaging with our documentation several concepts must be well understood with reference to schedulers and jobs which will be explained below:
Types of Job
There are two types of job on any scheduler, interactive and batch:
Interactive jobs are requested and run immediately, giving the user a bash shell (or a shell of their choosing) in which they can run their software or scripts.
Typically only very few nodes in an HPC cluster are dedicated solely to interactive jobs, and an interactive job needs its resources to be available at the instant the request is made, otherwise the request will fail. This means that interactive requests cannot always be fulfilled, particularly when requesting multiple cores.
Batch jobs are the other kind of job where a user prepares a batch submission script which both requests the resources for the job from the scheduler and contains the execution commands for a given program to run. On job submission, the scheduler will add it to the chosen queue and run your job when resources become available.
Any task that can be executed without any user intervention while it is running can be submitted as a batch job. This excludes jobs that require a Graphical User Interface (GUI), however, many common GUI applications such as ANSYS or MATLAB can also be used without their GUIs.
If you wish to use a cluster for interactive work and/or running applications like MATLAB or ANSYS using GUIs, you will need to request an interactive job from the scheduler.
If you wish to use a cluster to dispatch a very large ANSYS model you will need to request a batch job from the scheduler and prepare an appropriate batch script.
Note
Long running jobs should use the batch submission system rather than requesting an interactive session for a very long time. Doing this will lead to better cluster performance for all users.
Queues and partitions
Queues, or partitions in SLURM, are queues of jobs submitted to a scheduler for it to run. They can have an assortment of constraints, such as a job size limit, a job time limit or a list of users permitted to use them. Some nodes are configured to accept jobs only from certain queues, e.g. department specific nodes.
All jobs are dispatchable
When a user requests that a job (either a batch job or an interactive session) is run on the cluster, the scheduler will run jobs from the queue based on a set of rules, priorities and availabilities.
How and where a job can run are set when the job is requested, based on the resource amounts requested as well as the chosen queue (assuming the user has permission to use that queue).
This means that not all interactive jobs are possible as the resources may not be available. It also means that the amount of time it takes for any batch job to run is dependent on how large the job resource request is, which queue it is in, what resources are available in that queue and how much previous resource usage the user has. The larger a resource request is, the longer it will take to wait for those resources to become available and the longer it will take for subsequent jobs to queue as a result of the fair scheduling algorithm.
Fair scheduling
Job schedulers are typically configured to use a fair-share / wait time system. In short, the scheduler assesses your previous CPU time and memory time (consumption) to give a requested job a priority. Subsequently it uses how long your job has had to wait in order to bump up that priority. Once your job is the highest priority, the job will then run when the requested resources become available on the system. Your running total for CPU time / memory time usage will decay over time but in general the more resources you request and for longer, the lower your initial job priority gets and the longer you have to wait behind other people’s jobs.
If you are seeing one job start and another immediately begin this is not an intentional chaining setting on the scheduler’s part. This is quite likely simply a reflection of your subsequent jobs waiting for resources to become available and it just so happens that your running job finishes freeing up the resources for the next.
As a natural consequence of backfilling into any trapped resources - you may see small time, memory and core request jobs with a lower priority running before your own with a higher priority. This is because they are small enough to utilize the trapped resource before the job trapping those resources is finished. This is not unfair and it would be inefficient and irresponsible for us to intentionally block a job from running simply because the priority is lower than a larger job that won’t fit in that trapped resource.
Job Submission / Control on ShARC
Interactive Jobs
There are three commands for requesting an interactive shell using SGE:
qrsh - No support for graphical applications. Standard SGE command.
qsh - Supports graphical applications. Standard SGE command.
qrshx - Supports graphical applications. Superior to qsh and is unique to Sheffield’s clusters.
Usage of these commands is as follows:
$ qrshx
You can configure the resources available to the interactive session by adding command line options. For example to start an interactive session with access to 16 GB of RAM:
$ qrshx -l rmem=16G
To start a session with access to 2 cores in the SMP parallel environment:
$ qrshx -pe smp 2
A table of common interactive job options is given below; any of these can be combined together to request more resources.
SGE Command | Description |
---|---|
-l h_rt=hh:mm:ss | Specify the total maximum wall clock execution time for the job. The upper limit is 08:00:00. Note: these limits may differ for reservations / projects. |
-pe env N | Specify a parallel environment and a number of cores, for example -pe smp 2. |
Note that ShARC has multiple parallel environments, the current parallel environments can be found on the ShARC Parallel Environments page.
Batch Jobs
Tip
Batch jobs have larger resource limits than interactive jobs! For guidance on what these limits are and how best to select resources please see our Choosing appropriate compute resources page.
There is a single command to submit jobs via SGE:
qsub - Standard SGE command with no support for interactivity or graphical applications.
Submit a batch submission script as shown below:
qsub submission.sh
Note the job submission number. For example:
Your job 12345678 ("submission.sh") has been submitted
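You can check your output log or error log file when the job is finished. The log files are named after the job, which defaults to the name of the submission script:
cat submission.sh.o12345678
cat submission.sh.e12345678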
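When submissions are scripted it can help to capture the job number rather than copying it by hand. The following is a minimal sketch, assuming the confirmation line has exactly the form shown above; parse_sge_job_id and sge_log_names are hypothetical helper names, not SGE commands, and array job confirmations are not handled.

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric job ID from the
# confirmation line printed by qsub.
parse_sge_job_id() {
    printf '%s\n' "$1" | sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Hypothetical helper: print the output and error log names that
# follow the $JOBNAME.o$JOBID / $JOBNAME.e$JOBID pattern.
sge_log_names() {
    local job_name="$1" job_id="$2"
    printf '%s.o%s\n%s.e%s\n' "$job_name" "$job_id" "$job_name" "$job_id"
}
```

On the cluster this could be used as job_id=$(parse_sge_job_id "$(qsub submission.sh)"), followed by sge_log_names submission.sh "$job_id" to list the expected log files.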
There are numerous further options you can request in your batch submission files which are detailed below:
Pass through current shell environment (sometimes important):
#$ -V
Name your job submission:
#$ -N test_job
Specify a parallel environment for SMP jobs where N is a number of cores:
#$ -pe smp N
Specify a parallel environment for MPI jobs where N is a number of cores:
#$ -pe mpi N
Request a specific amount of memory where N is a number of gigabytes per core:
#$ -l rmem=NG
Request a specific amount of time in hours, minutes and seconds:
#$ -l h_rt=hh:mm:ss
Request email notifications on start, end and abort:
#$ -M me@somedomain.com
#$ -m abe
For the full list of the available options please visit the SGE manual webpage for qsub here: http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html
Here is an example SGE batch submission script that runs a fictitious program called foo:
#!/bin/bash
# Request 5 gigabytes of real memory (mem)
#$ -l rmem=5G
# load the module for the program we want to run
module load apps/gcc/foo
# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res
Some things to note:
The first line always needs to be #!/bin/bash (to tell the scheduler that this is a bash batch script).
Comments start with a #.
It is always best to fully specify a job’s resources with your submission script.
All SGE Scheduler options, such as the amount of memory requested, start with #$
You will often require one or more module commands in your submission file. These make programs and libraries available to your scripts. Many applications and libraries are available as modules on ShARC.
Here is a more complex example that requests more resources:
#!/bin/bash
# Request 4 gigabytes of real memory (RAM) per core: 4 cores * 4G = 16G in total
#$ -l rmem=4G
# Request 4 cores in an OpenMP environment
#$ -pe openmp 4
# Email notifications to me@somedomain.com
#$ -M me@somedomain.com
# Email notifications if the job aborts
#$ -m a
# Name the job
#$ -N my_job
# Request 24 hours of time
#$ -l h_rt=24:00:00
# Load the modules required by our program
module load compilers/gcc/5.2
module load apps/gcc/foo
# Set the OMP_NUM_THREADS environment variable to the number of
# requested cores (4). This is needed to ensure efficient core usage.
export OMP_NUM_THREADS=$NSLOTS
# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res
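Because rmem is requested per core, the total memory available to the job in the example above is the per-core amount multiplied by the number of cores. A small sketch of that arithmetic follows; total_rmem_gb is a hypothetical helper name used only for illustration.

```shell
#!/bin/bash
# Hypothetical helper: rmem is requested per core, so the job's total
# is (gigabytes per core) x (number of cores).
total_rmem_gb() {
    local per_core_gb="$1" cores="$2"
    echo $(( per_core_gb * cores ))
}

# The example above: 4G per core on 4 cores.
total_rmem_gb 4 4
```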
Monitoring running Jobs
Running and queued jobs are monitored with a single command, qstat:
Display your own jobs queued on the system
$ qstat
Show a specific running or queueing job’s details:
qstat -j jobid
Display all jobs queued on the system
$ qstat -u "*"
Display all jobs queued by the username foo1bar
$ qstat -u foo1bar
Display all jobs in the openmp parallel environment
$ qstat -pe openmp
Display all jobs in the queue named foobar
$ qstat -q foobar.q
Example output:
$ qstat -u "*"
job-ID prior name user state submit/start at queue slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------
1234567 0.00000 INTERACTIV foo1bar dr 12/24/2021 07:13:20 interactive.q@sharc-node004.sh 1
1234568 0.00000 job.sh foo1bar r 01/22/2022 05:37:31 all.q@sharc-node019.shef.ac.uk 16
1234569 0.00000 job.sh foo1bar r 01/23/2022 07:41:18 all.q@sharc-node084.shef.ac.uk 16
1234570 0.00000 job.sh foo1bar Rr 01/23/2022 08:03:22 all.q@sharc-node068.shef.ac.uk 16
1234571 0.00076 job.sh foo1bar qw 01/23/2022 07:06:18 1
1234572 0.00067 job.sh foo1bar hqw 01/23/2022 07:06:18 1
1234573 0.00000 job.sh foo1bar Eqw 01/21/2022 13:50:55 1
1234574 0.00000 job.sh foo1bar t 01/24/2022 13:04:25 all.q@sharc-node159.shef.ac.uk 1 22964
SGE Job States:
State | Explanation | SGE State Letter Code/s |
---|---|---|
Pending | pending, queued | qw |
Pending | pending, user and/or system hold | hqw |
Running | running | r |
Error | all pending states with error | Eqw, Ehqw, EhRqw |
Key: q: queueing, r: running, w: waiting, h: on hold, E: error, R: re-run, s: job suspended, S: queue suspended, t: transferring, d: deletion.
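With many jobs in the queue, a per-state tally can be easier to read than the full listing. The sketch below assumes qstat output laid out as in the example above (two header lines, state in the fifth column, job names without spaces); count_sge_states is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: read qstat output on standard input, skip the
# two header lines and count how many jobs are in each state (column 5).
count_sge_states() {
    awk 'NR > 2 && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | LC_ALL=C sort
}
```

On the cluster this could be run as qstat | count_sge_states.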
Note
A full list of SGE and DRMAA states can be found here
Stopping or cancelling Jobs
Jobs can be stopped or cancelled using the qdel command as shown, swapping out 123456 for your own job id number:
$ qdel 123456
Investigating finished Jobs
Jobs which have already finished can be investigated using the qacct command, which displays status information about a user’s historical jobs.
Running the qacct command alone will provide a summary of used resources from the current month for the user running the command.
The command can be used as follows with the job’s ID to get job specific info:
$ qacct -j job-id
Or to view information about all of a specific user’s jobs:
$ qacct -j -u $USER
By default the qacct command will only bring up summary info about the user’s jobs from the current accounting file (which rotates monthly). Further detail about the output metrics and how to query jobs older than a month can be found on the dedicated qacct page.
Debugging failed Jobs
Note
One common form of job failure on ShARC is caused by Windows style line endings. If you see an error reported by qacct of the form:
failed searching requested shell because:
Or by qstat of the form:
failed: No such file or directory
You must replace these line endings as detailed in the FAQ.
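As an illustration only (the FAQ remains the reference for the recommended method), a script can be checked for carriage returns and cleaned with standard tools. has_crlf and strip_crlf are hypothetical helper names, and the sketch removes every carriage return in the file, not only those at line ends.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit status 0) if the file contains
# at least one carriage return character.
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Hypothetical helper: rewrite the file without carriage returns,
# going through a temporary copy so permissions are preserved.
strip_crlf() {
    local tmp status
    tmp="$(mktemp)" || return 1
    tr -d '\r' < "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

For example, has_crlf submission.sh && strip_crlf submission.sh would clean a script before it is submitted.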
If one of your jobs has failed and you need to debug why this has occurred you should consult the job records held by the scheduler with the qacct command referenced above as well as the generated job logs.
These output and error log files will be generated in the job working directory with the structure $JOBNAME.o$JOBID and $JOBNAME.e$JOBID where $JOBNAME is the user chosen name of the job and $JOBID is the scheduler provided job id.
Looking at these logs should indicate the source of any issues.
The qacct info will contain two important metrics, the exit code and failure code.
The exit code is the return value of the exiting program/script. It can be a user defined value if the job is finished with a call to ‘exit(number)’. For abnormally terminated jobs it is the signal number + 128.
As an example, an exit code of 137 gives 137 - 128 = 9, i.e. signal 9 (SIGKILL): the job was sent the KILL signal and was killed, likely by the scheduler.
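The same calculation can be sketched in the shell. signal_from_exit_code is a hypothetical helper name; kill -l is the standard shell builtin that translates a signal number into its name.

```shell
#!/bin/bash
# Hypothetical helper: for exit codes above 128 print the signal number
# (exit code - 128); for other exit codes print nothing.
signal_from_exit_code() {
    local code="$1"
    if [ "$code" -gt 128 ]; then
        echo $(( code - 128 ))
    fi
}

sig="$(signal_from_exit_code 137)"
echo "signal number: $sig"
# kill -l translates the signal number into its name.
echo "signal name: $(kill -l "$sig")"
```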
The failure code indicates why a job was abnormally terminated (by the scheduler). An incomplete table of common failure codes is shown below:
code | meaning |
---|---|
100 | failure after job |
37 | qmaster enforced h_rt limit (Job ran out of time.) |
30 | rescheduling on application error |
28 | no current working dir |
27 | no shell |
26 | failure opening output |
21 | failure in recognizing job |
19 | no exit status |
8 | failure in prolog |
1 | failure before job (execd) |
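The table above can be turned into a small lookup for use when triaging many jobs. This is a sketch under stated assumptions: qacct_field and sge_failure_meaning are hypothetical helper names, the qacct -j output is assumed to print one field name and value per line with fields named failed and exit_status, and a failure code of 0 is taken to mean no failure.

```shell
#!/bin/bash
# Hypothetical helper: read qacct -j output on standard input and print
# the value of the named field (assumed layout: "name value ...").
qacct_field() {
    awk -v key="$1" '$1 == key { print $2; exit }'
}

# Hypothetical helper: map a failure code to the meaning in the table above.
sge_failure_meaning() {
    case "$1" in
        100) echo "failure after job" ;;
        37)  echo "qmaster enforced h_rt limit (job ran out of time)" ;;
        30)  echo "rescheduling on application error" ;;
        28)  echo "no current working dir" ;;
        27)  echo "no shell" ;;
        26)  echo "failure opening output" ;;
        21)  echo "failure in recognizing job" ;;
        19)  echo "no exit status" ;;
        8)   echo "failure in prolog" ;;
        1)   echo "failure before job (execd)" ;;
        0)   echo "no failure" ;;   # assumption: 0 means the job did not fail
        *)   echo "not listed in the table above" ;;
    esac
}
```

On the cluster this could be combined as sge_failure_meaning "$(qacct -j job-id | qacct_field failed)".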
Job Submission / Control on Bessemer
Interactive Jobs
SLURM uses a single command to launch interactive jobs:
srun - Standard SLURM command supporting graphical applications.
Usage of the command is as follows:
$ srun --pty bash -i
You can configure the resources available to the interactive session by adding command line options. For example to start an interactive session with access to 16 GB of RAM:
$ srun --mem=16G --pty bash -i
To start a session with access to 2 cores, use either:
$ srun --cpus-per-task=2 --pty bash -i #2 cores per task, 1 task and 1 node per job default. Preferred!
$ srun --ntasks-per-node=2 --pty bash -i #2 tasks per node, 1 core per task and 1 node per job default.
Please take care with your chosen options as usage in concert with other options can be multiplicative.
A further explanation of why you may use the tasks options or cpus options can be found here.
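To illustrate why combining these options can be multiplicative, the sketch below computes the total number of cores requested from the number of nodes, tasks per node and cores per task, each defaulting to 1 as noted in the comments above. slurm_total_cores is a hypothetical helper name, not a SLURM command.

```shell
#!/bin/bash
# Hypothetical helper: total cores = nodes x tasks per node x cores per task.
slurm_total_cores() {
    local nodes="${1:-1}" tasks_per_node="${2:-1}" cpus_per_task="${3:-1}"
    echo $(( nodes * tasks_per_node * cpus_per_task ))
}

slurm_total_cores 1 1 2   # --cpus-per-task=2 alone
slurm_total_cores 1 2 1   # --ntasks-per-node=2 alone
slurm_total_cores 1 2 2   # both options together
```

Requesting both options together asks for four cores rather than two, which is why the options should be chosen with care.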
A table of common interactive job options is given below; any of these can be combined together to request more resources.
Slurm Command | Description |
---|---|
--time=hh:mm:ss | Specify the total maximum wall clock execution time for the job. The upper limit is 08:00:00. Note: these limits may differ for reservations / projects. |
Batch Jobs
Tip
Batch jobs have larger resource limits than interactive jobs! For guidance on what these limits are and how best to select resources please see our Choosing appropriate compute resources page.
SLURM uses a single command to submit batch jobs:
sbatch - Standard SLURM command with no support for interactivity or graphical applications.
The Slurm docs have a complete list of available sbatch options.
Submit a batch submission script as shown below:
sbatch submission.sh
Note the job submission number. For example:
Submitted batch job 1226
You can check your output log file as below. Unless you set a different name with --output, the log file is named slurm-JOBID.out:
cat slurm-1226.out
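When submissions are scripted, the job number can be captured from the confirmation line. The following is a minimal sketch, assuming the line has exactly the form shown above; parse_slurm_job_id is a hypothetical helper name. sbatch also offers a --parsable option that prints the job ID in a machine-readable form, which may be preferable where available.

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric job ID from the
# confirmation line printed by sbatch.
parse_slurm_job_id() {
    printf '%s\n' "$1" | sed -n 's/^Submitted batch job \([0-9][0-9]*\).*/\1/p'
}
```

On the cluster this could be used as job_id=$(parse_slurm_job_id "$(sbatch submission.sh)"), and the result passed to the monitoring commands described later in this section.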
There are numerous further options you can request in your batch submission files which are detailed below:
Name your job submission:
#SBATCH --job-name=JOB_NAME
Specify a number of nodes:
#SBATCH --nodes=1
Warning
Note that the Bessemer free queues do not permit the use of more than 1 node per job.
Specify a number of tasks per node:
#SBATCH --ntasks-per-node=4
Specify a number of tasks:
#SBATCH --ntasks=4
Specify a number of cores per task:
#SBATCH --cpus-per-task=4
Request a specific amount of memory per job:
#SBATCH --mem=16G
Specify the job output log file name:
#SBATCH --output=output.%j.test.out
Request a specific amount of time:
#SBATCH --time=00:30:00
Request job update email notifications:
#SBATCH --mail-user=username@sheffield.ac.uk
For the full list of the available options please visit the SLURM manual webpage for sbatch here: https://slurm.schedmd.com/sbatch.html
Here is an example SLURM batch submission script that runs a fictitious program called foo:
#!/bin/bash
# Request 5 gigabytes of real memory (mem)
#SBATCH --mem=5G
# load the module for the program we want to run
module load apps/gcc/foo
# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res
Some things to note:
The first line always needs to be #!/bin/bash (to tell the scheduler that this is a bash batch script).
Comments start with a #.
It is always best to fully specify a job’s resources with your submission script.
All Slurm Scheduler options start with #SBATCH
You should use the SLURM option --ntasks=nn to set the number of “tasks”, for programs using distributed parallelism (MPI).
You should use the SLURM option --ntasks-per-node=nn to set the number of “tasks per node”, for programs using distributed parallelism (MPI). Note that the Bessemer free queues do not permit the use of more than 1 node per job.
You should use the SLURM option --cpus-per-task=nn to set the number of “cores per task”, for programs using shared memory parallelism (SMP or OpenMP).
You will often require one or more module commands in your submission file to make programs and libraries available to your scripts. Many applications and libraries are available as modules on Bessemer.
Here is a more complex SMP example that requests more resources:
#!/bin/bash
# Request 16 gigabytes of real memory (RAM) 4 cores *4G = 16
#SBATCH --mem=16G
# Request 4 cores
#SBATCH --cpus-per-task=4
# Email notifications to me@somedomain.com
#SBATCH --mail-user=me@somedomain.com
# Email notifications if the job fails
#SBATCH --mail-type=FAIL
# Change the name of the output log file.
#SBATCH --output=output.%j.test.out
# Name the job
#SBATCH --job-name=my_smp_job
# Load the modules required by our program
module load compilers/gcc/5.2
module load apps/gcc/foo
# Set the OMP_NUM_THREADS environment variable to the number of
# requested cores (4). This is needed to ensure efficient core usage.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res
Preemptable Jobs
Under certain conditions, Slurm on Bessemer allows jobs running in higher-priority Partitions (sets of nodes) to preempt jobs in lower-priority Partitions. When a higher priority job preempts a lower priority job, the lower priority job is stopped (and by default cancelled) and the higher priority job takes its place.
Specifically, Slurm allows users to run interactive sessions and batch jobs using idle resources in private (research group-owned or dept-owned) nodes, but these resources will be reclaimed (and the associated jobs preempted) if members of those groups/departments submit jobs that can only start if those resources are repurposed.
Note
Support for preemptable jobs has been enabled on a trial basis and will be disabled if it impacts on priority access by groups / departments to private nodes they have purchased.
An example of the use of preemptable jobs:
Researcher A wants to run a job using 2 GPUs. All ‘public’ GPUs are being used by jobs, but some GPUs in a private node belonging to research group X are idle.
Researcher A decides that they want to use those idle GPUs but they aren’t a member of research group X; however, they are happy to take the risk of their job being preempted by a member of research group X.
Researcher A submits a job and makes it preemptable (by submitting it to the preempt Partition using --partition=preempt).
The job starts running on a node which is a member of the preempt and research-group-X Partitions.
Researcher B is a member of research group X and submits a job to the research-group-X Partition.
This job can only start if the resources being used by the first job are reclaimed.
As a result, Slurm preempts the first job in favour of the second job, and the first job is cancelled.
The second job runs to completion.
Tips for using preemptable jobs:
Ensure that you’re able to reliably re-submit your preemptable job if it is preempted before completion. A common way of doing this is to write out state/progress information periodically whilst the job is running.
Select a sensible frequency for writing out state/progress information or you may cause poor performance due to storage write speed limits.
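The two tips above can be sketched as a toy checkpoint and restart loop. This is an illustration under stated assumptions rather than a recommended implementation: STATE_FILE, TOTAL_STEPS, CHECKPOINT_EVERY and run_with_checkpoints are hypothetical names, and the real work is represented by a placeholder.

```shell
#!/bin/bash
# Hypothetical settings for this sketch.
STATE_FILE="${STATE_FILE:-progress.state}"
TOTAL_STEPS="${TOTAL_STEPS:-10}"
CHECKPOINT_EVERY="${CHECKPOINT_EVERY:-5}"

run_with_checkpoints() {
    local start=0 step
    # Resume from the last recorded step if a state file exists.
    if [ -f "$STATE_FILE" ]; then
        start="$(cat "$STATE_FILE")"
    fi
    step=$(( start + 1 ))
    while [ "$step" -le "$TOTAL_STEPS" ]; do
        : # ... one unit of real work would go here ...
        # Record progress only every CHECKPOINT_EVERY steps (and at the
        # end) to limit the number of writes to storage.
        if [ $(( step % CHECKPOINT_EVERY )) -eq 0 ] || [ "$step" -eq "$TOTAL_STEPS" ]; then
            echo "$step" > "$STATE_FILE"
        fi
        step=$(( step + 1 ))
    done
    echo "finished at step $(cat "$STATE_FILE")"
}
```

If the job is preempted and re-submitted, calling run_with_checkpoints again continues from the last step written to the state file instead of starting from the beginning.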
Monitoring running Jobs
There are two commands to monitor running and queued jobs, squeue and sstat:
The squeue command is used to pull up information about jobs in the queue. By default this command will list the job ID, partition, username, job status, number of nodes, and name of nodes for all jobs queued or running within SLURM.
Display all jobs queued on the system:
$ squeue
To limit this command to only display a single user’s jobs the --user flag can be used:
$ squeue --user=$USER
Further information without abbreviation can be shown by using the --long flag:
$ squeue --user=$USER --long
The squeue command also provides a method to calculate the estimated start time for a job by using the --start flag:
$ squeue --user=$USER --start
When checking the status of a job you may wish to check for updates at a time interval. This can be achieved by using the --iterate flag and a number of seconds:
$ squeue --user=$USER --start --iterate=n_seconds
You can stop this command by pressing Ctrl + C.
Example output:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1234567 interacti bash foo1bar R 17:19:40 1 bessemer-node001
1234568 sheffield job.sh foo1bar R 17:21:40 1 bessemer-node046
1234569 sheffield job.sh foo1bar PD 17:22:40 1 (Resources)
1234570 sheffield job.sh foo1bar PD 16:47:06 1 (Priority)
1234571 gpu job.sh foo1bar R 1-19:46:53 1 bessemer-node026
1234572 gpu job.sh foo1bar R 1-19:46:54 1 bessemer-node026
1234573 gpu job.sh foo1bar R 1-19:46:55 1 bessemer-node026
1234574 gpu job.sh foo1bar R 1-19:46:56 1 bessemer-node026
1234575 gpu job.sh foo1bar PD 9:04 1 (ReqNodeNotAvail, UnavailableNodes:bessemer-node026)
1234576 sheffield job.sh foo1bar PD 2:57:24 1 (QOSMaxJobsPerUserLimit)
States shown above indicate job states including running “R” and Pending “PD” with various reasons for pending states including a node (ReqNodeNotAvail) full of jobs and a user hitting the max limit for numbers of jobs they can run simultaneously in a QOS (QOSMaxJobsPerUserLimit).
A list of the most relevant job states and reasons can be seen below:
SLURM Job States:
Jobs typically pass through several states in the course of their execution. The typical states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED.
Status | Code | Explanation |
---|---|---|
COMPLETED | CD | The job has completed successfully. |
COMPLETING | CG | The job is finishing but some processes are still active. |
CANCELLED | CA | Job was explicitly cancelled by the user or system administrator. |
FAILED | F | The job terminated with a non-zero exit code and failed to execute. |
PENDING | PD | The job is waiting for resource allocation. It will eventually run. |
PREEMPTED | PR | The job was terminated because of preemption by another job. |
RUNNING | R | The job currently is allocated to a node and is running. |
SUSPENDED | S | A running job has been stopped with its cores released to other jobs. |
STOPPED | ST | A running job has been stopped with its cores retained. |
OUT_OF_MEMORY | OOM | Job experienced out of memory error. |
TIMEOUT | TO | Job exited because it reached its walltime limit. |
NODE_FAIL | NF | Job terminated due to failure of one or more allocated nodes. |
A full list of job states can be found at: https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES
SLURM Job Reasons:
These codes identify the reason that a job is waiting for execution. A job may be waiting for more than one reason, in which case only one of those reasons is displayed.
Reason Code | Explanation |
---|---|
Priority | One or more higher priority jobs is in queue for running. Your job will eventually run. |
Dependency | This job is waiting for a dependent job to complete and will run afterwards. |
Resources | The job is waiting for resources to become available and will eventually run. |
InvalidAccount | The job’s account is invalid. Cancel the job and rerun with correct account. |
InvalidQOS | The job’s QoS is invalid. Cancel the job and rerun with correct QoS. |
QOSGrpMaxJobsLimit | Maximum number of jobs for your job’s QoS have been met; job will run eventually. |
PartitionMaxJobsLimit | Maximum number of jobs for your job’s partition have been met; job will run eventually. |
AssociationMaxJobsLimit | Maximum number of jobs for your job’s association have been met; job will run eventually. |
JobLaunchFailure | The job could not be launched. This may be due to a file system problem, invalid program name, etc. |
NonZeroExitCode | The job terminated with a non-zero exit code. |
SystemFailure | Failure of the Slurm system, a file system, the network, etc. |
TimeLimit | The job exhausted its time limit. |
WaitingForScheduling | No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason. |
BadConstraints | The job’s constraints can not be satisfied. |
A full list of job reasons can be found at: https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES
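To see at a glance why jobs are waiting, the pending reasons in squeue output can be tallied. The sketch below assumes the default column layout shown in the example output earlier (state in the fifth column, reason in parentheses from the eighth column onwards) and job names without spaces; count_pending_reasons is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: read squeue output on standard input and count
# pending (PD) jobs by the first reason listed in the last column.
count_pending_reasons() {
    awk 'NR > 1 && $5 == "PD" {
             reason = $8
             for (i = 9; i <= NF; i++) reason = reason " " $i
             gsub(/[()]/, "", reason)   # drop the surrounding parentheses
             sub(/,.*/, "", reason)     # keep only the leading reason code
             count[reason]++
         }
         END { for (r in count) print r, count[r] }' | LC_ALL=C sort
}
```

On the cluster this could be run as squeue --user=$USER | count_pending_reasons.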
The sstat command can be used to display status information about a user’s currently running jobs such as the CPU usage, task or node information and memory consumption.
The command can be invoked as follows with a specific job ID:
$ sstat --jobs=job-id
And to display specific information you can use the --format flag to choose your output:
$ sstat --jobs=job-id --format=var_1,var_2, ... , var_N
Some of these variables are listed in the table below:
Variable | Description |
---|---|
AveCPU | Average (system + user) CPU time of all tasks in job. |
AveRSS | Average resident set size of all tasks in job. |
AveVMSize | Average Virtual Memory size of all tasks in job. |
JobID | The id of the Job. |
MaxRSS | Maximum resident set size of all tasks in job. |
MaxVMSize | Maximum Virtual Memory size of all tasks in job. |
NTasks | Total number of tasks in a job or step. |
A full list of variables for the --format flag can be found with the --helpformat flag or by visiting the slurm page on sstat.
Stopping or cancelling Jobs
Jobs can be stopped or cancelled using the scancel command and the job’s ID (replacing job-id with the number):
$ scancel job-id
To cancel multiple jobs you can supply a space separated list:
$ scancel job-id1 job-id2 job-id3
Investigating finished Jobs
Jobs which have already finished can be investigated using the sacct command, which displays status information about a user’s historical jobs.
The command can be used as follows with the job’s ID:
$ sacct --jobs=job-id
Or to view information about all of a specific user’s jobs:
$ sacct --user=$USER
By default the sacct command will only bring up information about the user’s jobs from the current day. By using the --starttime flag the command will look further back to the given date, e.g.:
$ sacct --user=$USER --starttime=YYYY-MM-DD
Like the sstat command, the --format flag can be used to choose the command output:
$ sacct --user=$USER --format=var_1,var_2, ... ,var_N
Variable | Description |
---|---|
Account | The account the job ran under. |
AveCPU | Average (system + user) CPU time of all tasks in job. |
AveRSS | Average resident set size of all tasks in job. |
AveVMSize | Average Virtual Memory size of all tasks in job. |
CPUTime | Formatted (Elapsed time * CPU) count used by a job or step. |
Elapsed | Job’s elapsed time formatted as DD-HH:MM:SS. |
ExitCode | The exit code returned by the job script or salloc. |
JobID | The id of the Job. |
JobName | The name of the Job. |
MaxRSS | Maximum resident set size of all tasks in job. |
MaxVMSize | Maximum Virtual Memory size of all tasks in job. |
MaxDiskRead | Maximum number of bytes read by all tasks in the job. |
MaxDiskWrite | Maximum number of bytes written by all tasks in the job. |
ReqCPUS | Requested number of CPUs. |
ReqMem | Requested amount of memory. |
ReqNodes | Requested number of nodes. |
NCPUS | The number of CPUs used in a job. |
NNodes | The number of nodes used in a job. |
User | The username of the person who ran the job. |
A full list of variables for the --format flag can be found with the --helpformat flag or by visiting the slurm page on sacct.
Debugging failed Jobs
If one of your jobs has failed and you need to debug why this has occurred you should consult the job records held by the scheduler with the sacct command referenced above as well as the generated job logs.
By default, output and error messages are written to a single log file in the job working directory, named in the form slurm-$JOBID.out where $JOBID is the scheduler provided job id, unless a different output log file name was requested with --output.
Looking at these logs should indicate the source of any issues.
sacct will also give a job’s state and ExitCode field with each job.
The ExitCode is the return value of the exiting program/script. It can be a user defined value if the job is finished with a call to ‘exit(number)’. Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a Reason of “NonZeroExitCode”.
The job logs may also include a “derived exit code” field. This is set to the value of the highest exit code returned by all of the job’s steps (srun invocations).
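When reading sacct output, it can help to split the ExitCode field into its parts. The sketch below assumes the field is displayed as two numbers separated by a colon, the exit code followed by the signal number (for example 0:9), which is the usual sacct presentation; describe_exit_code is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: interpret an ExitCode value of the assumed
# form "exit:signal", e.g. 0:0, 1:0 or 0:9.
describe_exit_code() {
    local exit_part="${1%%:*}" signal_part="${1##*:}"
    if [ "$signal_part" -ne 0 ]; then
        echo "terminated by signal $signal_part"
    elif [ "$exit_part" -ne 0 ]; then
        echo "failed with exit code $exit_part"
    else
        echo "completed with exit code 0"
    fi
}
```

For example, a value taken from sacct --jobs=job-id --format=JobID,State,ExitCode could be passed to describe_exit_code to get a one-line summary.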
Job Submission / Control on Stanage
Tip
The Stanage cluster has been configured to have the same default resource request limits as the ShARC cluster. Please see our Choosing appropriate compute resources page for further information.
Interactive Jobs
SLURM uses a single command to launch interactive jobs:
srun - Standard SLURM command supporting graphical applications.
Usage of the command is as follows:
$ srun --pty bash -i
You can configure the resources available to the interactive session by adding command line options. For example to start an interactive session with access to 16 GB of RAM:
$ srun --mem=16G --pty bash -i
To start a session with access to 2 cores, use either:
$ srun --cpus-per-task=2 --pty bash -i #2 cores per task, 1 task and 1 node per job default. Preferred!
$ srun --ntasks-per-node=2 --pty bash -i #2 tasks per node, 1 core per task and 1 node per job default.
Please take care with your chosen options as usage in concert with other options can be multiplicative.
A further explanation of why you may use the tasks options or cpus options can be found here.
A table of common interactive job options is given below; any of these can be combined together to request more resources.
Slurm Command | Description |
---|---|
| Specify the total maximum wall clock execution time for the job. The upper limit is 08:00:00. Note: these limits may differ for reservations/projects. |
Batch Jobs
Tip
Batch jobs have larger resource limits than interactive jobs! For guidance on what these limits are and how best to select resources please see our Choosing appropriate compute resources page.
SLURM uses a single command to submit batch jobs:
sbatch Standard SLURM command with no support for interactivity or graphical applications.
The Slurm docs have a complete list of available sbatch options.
Batch submission scripts are submitted as below:
sbatch submission.sh
Note the job submission number. For example:
Submitted batch job 1226
You can check your output log or error log file as below:
cat JOB_NAME-1226.out
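When scripting around sbatch it is often convenient to capture the job number rather than copy it by hand. The sketch below extracts it from the confirmation message shown above; job_id_from_message is a hypothetical helper name, and the message is assumed to have exactly the format of that example.

```shell
# job_id_from_message is a hypothetical helper (not a SLURM command).
# It expects a message of the form "Submitted batch job <number>".
job_id_from_message() {
    # The job number is the fourth whitespace-separated field.
    echo "$1" | awk '{ print $4 }'
}

job_id=$(job_id_from_message "Submitted batch job 1226")
echo "Job ID: ${job_id}"   # prints: Job ID: 1226
```

Alternatively, sbatch accepts a --parsable option which prints only the job ID (followed by a semicolon and the cluster name where one is present), so job_id=$(sbatch --parsable submission.sh) avoids the parsing step.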
There are numerous further options you can request in your batch submission files which are detailed below:
Name your job submission:
#SBATCH --job-name=JOB_NAME
Specify a number of nodes:
#SBATCH --nodes=1
Specify a number of tasks per node:
#SBATCH --ntasks-per-node=4
Specify a number of tasks:
#SBATCH --ntasks=4
Specify a number of cores per task:
#SBATCH --cpus-per-task=4
Request a specific amount of memory per job:
#SBATCH --mem=16G
Specify the job output log file name:
#SBATCH --output=output.%j.test.out
Request a specific amount of time:
#SBATCH --time=00:30:00
Request job update email notifications (SLURM only sends these if --mail-type is also set, e.g. --mail-type=ALL):
#SBATCH --mail-user=username@sheffield.ac.uk
For the full list of the available options please visit the SLURM manual webpage for sbatch here: https://slurm.schedmd.com/sbatch.html
Here is an example SLURM batch submission script that runs a fictitious program called foo:
#!/bin/bash
# Request 5 gigabytes of real memory (mem)
#SBATCH --mem=5G
# load the module for the program we want to run
module load apps/gcc/foo
# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res
Some things to note:
The first line always needs to be #!/bin/bash (to tell the scheduler that this is a bash batch script).
Comments start with a #.
It is always best to fully specify a job’s resources with your submission script.
All Slurm Scheduler options start with #SBATCH
You should use the SLURM option --ntasks=nn (number of “tasks”) for programs using distributed parallelism (MPI).
You should use the SLURM option --ntasks-per-node=nn (number of “tasks per node”) for programs using distributed parallelism (MPI).
You should use the SLURM option --cpus-per-task=nn (number of “cores per task”) for programs using shared memory parallelism (SMP or OpenMP).
You will often require one or more module commands in your submission file to make programs and libraries available to your scripts.
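As a sketch of how these notes fit together for a shared memory (OpenMP) program, the snippet below writes a small submission script to a file without submitting it. The file name smp_job.sh and the program my_openmp_program are hypothetical placeholders, and the resource values are arbitrary examples rather than recommendations. SLURM_CPUS_PER_TASK is an environment variable that SLURM sets inside the job when --cpus-per-task is given.

```shell
# Write an example submission script to smp_job.sh (nothing is submitted here).
cat > smp_job.sh <<'EOF'
#!/bin/bash
# One task with four cores, for a shared memory (OpenMP) program
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=00:30:00

# Match the OpenMP thread count to the cores SLURM allocated
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# my_openmp_program is a placeholder for your own executable
./my_openmp_program
EOF

# Show the scheduler directives that were written
grep '^#SBATCH' smp_job.sh
```

The resulting file could then be submitted with sbatch smp_job.sh. Any module commands your program needs would go before the line that runs it.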
Cluster job resource limits
While the Sheffield clusters have very large amounts of resources to use for your jobs, there are limits applied in order for the schedulers to function. The limits below apply to the default free queues. Other queues may have different settings.
Warning
You must ensure that your jobs do not attempt to exceed these limits as the schedulers are not forgiving and will summarily kill any job which exceeds the requested limits without warning.
CPU Limits
Please note that the CPU limits do depend on the chosen parallel environment for ShARC jobs, with SMP type jobs limited to a maximum of 16 cores in either job type. Please also note that interactive jobs with more than 16 cores are only available in the MPI parallel environment.
Warning
Please note that on any cluster, the larger the number of cores you request in an interactive job, the more likely the request is to fail because the requested resource is not immediately available.
Scheduler Type | No. CPU Cores Available | No. CPU Cores Available | Submission Argument |
---|---|---|---|
SLURM (Stanage) | 1 / 1 / ~11264 (MPI), 64 (SMP) | 1 / 1 / ~11264 (MPI), 64 (SMP) | |
SLURM (Bessemer) | 1 / 1 / 40 | 1 / 1 / 40 | |
SGE (ShARC) | 1 / 1 / ~1536 (MPI), 16 (SMP) | 1 / 1 / ~1536 (MPI), 16 (SMP) | |
Time Limits
Scheduler Type | Interactive Job | Batch Job | Submission Argument |
---|---|---|---|
SLURM (Stanage) | 8 / 8 hrs | 8 / 96 hrs | |
SLURM (Bessemer) | 8 / 8 hrs | 8 / 168 hrs | |
SGE (ShARC) | 8 / 8 hrs | 8 / 96 hrs | |
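Because jobs that exceed their limits are killed, it can be worth checking a time request before submitting. The sketch below converts a request written as hours:minutes:seconds into seconds and compares it against a limit given in hours, using the 96 hour batch figure from the table above as the example. The function names are our own, and only the hh:mm:ss form is handled; other time formats the scheduler accepts are outside the scope of this sketch.

```shell
# to_seconds HH:MM:SS -> total number of seconds (hypothetical helper).
# awk is used so that fields with leading zeros such as "08" are read as decimal.
to_seconds() {
    echo "$1" | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }'
}

# within_limit HH:MM:SS LIMIT_HOURS -> succeeds if the request fits the limit.
within_limit() {
    [ "$(to_seconds "$1")" -le $(( $2 * 3600 )) ]
}

if within_limit "00:30:00" 96; then
    echo "00:30:00 fits within 96 hours"
fi
if ! within_limit "120:00:00" 96; then
    echo "120:00:00 exceeds 96 hours"
fi
```

Both messages are printed, since 30 minutes is within the limit and 120 hours is not.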
Memory Limits
Scheduler Type | Standard Nodes | Large RAM Nodes | Very Large RAM Nodes | Interactive Job | Batch Job | Submission Argument |
---|---|---|---|---|---|---|
SLURM (Stanage) | 251 GB | 1007 GB | 2014 GB | 2 GB / 251 GB | 2 GB / 251 GB (SMP) ~74404 GB (MPI) | Per job basis |
SLURM (Bessemer) | 192 GB | N/A | N/A | 2 GB / 192 GB | 2 GB / 192 GB | Per job basis |
SGE (ShARC) | 64 GB | 256 GB | N/A | 2 GB / 64 GB | 2 GB / 64 GB (SMP) ~6144 GB (MPI) | Per core basis |
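The difference between the two bases in the last column matters when moving a job between schedulers. As a rough sketch: where memory is requested per core, the total a job receives is the per-core figure multiplied by the number of cores, whereas a per job request states the total directly. The function name below is our own and the figures are arbitrary examples.

```shell
# total_mem_gb CORES GB_PER_CORE -> total memory in GB (hypothetical helper).
total_mem_gb() {
    echo $(( $1 * $2 ))
}

cores=4
per_core_gb=8
total=$(total_mem_gb "$cores" "$per_core_gb")

echo "Per core basis: ${cores} cores x ${per_core_gb} GB = ${total} GB in total"
echo "Equivalent per job request: --mem=${total}G"
```

Here a 4 core job asking for 8 GB per core corresponds to a single 32 GB request on a per job basis.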
Advanced / Automated job submission and management
The Distributed Resource Management Application API (DRMAA) is available on both clusters and can be used with advanced scripts or a computational pipeline manager (such as Ruffus).
For further detail see our guide to the DRMAA API.
Reference information and further resources
Quick reference information for the SGE scheduler (ShARC) and the SLURM scheduler (Bessemer and Stanage) can be found in the Scheduler Reference Info section.
Stanford Research Computing Center provide a SGE to SLURM conversion guide.