Attention
The ShARC HPC cluster was decommissioned on the 30th of November 2023 at 17:00. It is no longer possible for users to access that cluster.
Choosing appropriate compute resources
Introduction
Choosing appropriate resources for your jobs is essential to ensuring that your jobs are scheduled as quickly as possible while wasting as few resources as possible.
The key resources you need to optimise for are:
the cluster you run your job on
the amount of time (wallclock time) your job requests
the number of CPU cores your job requests
the amount of memory your job requests
the filestore your job reads from and writes to
It is important to be aware that the resource requests you make are not flexible: if your job exceeds what you have requested, the scheduler will terminate it abruptly and without any warning. This means that it is safest to overestimate your job’s requirements if they cannot be known accurately and precisely in advance (a short example of such a request follows the list below).
This does not mean that you should set extremely large values for these resource requests, for several reasons, the most important being:
Large allocations will take longer to queue and start.
Allocations larger than the scheduler can ever satisfy with the available resources will never start.
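For example, a minimal sketch of a batch script header (SLURM syntax) that modestly overestimates its requirements - the values and comments are illustrative assumptions only, and each resource is discussed in its own section below:
#SBATCH --time=02:00:00    # job normally finishes well within 2 hours
#SBATCH --cpus-per-task=4  # program is configured to use 4 threads
#SBATCH --mem=8G           # observed peak memory usage is comfortably below 8 GB
my_program < my_input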
Cluster choice
We have three clusters for you to choose from:
Stanage (Our newest and most powerful yet, launched in March 2023).
Bessemer (Launched in 2018).
ShARC (Launched in 2016).
It is also important to note that the Sheffield HPC clusters have been designed to fulfil different purposes. Stanage and ShARC are for the most part `capability` clusters designed to run larger compute jobs that will use multiple nodes. Bessemer is a `capacity` cluster designed to run smaller compute jobs which will fit on a single node.
You should prioritise putting smaller core-count jobs onto Bessemer and massively parallel MPI jobs onto Stanage or ShARC.
In addition, Stanage and Bessemer have newer CPUs with more modern features. All of the clusters share similar file storage areas, each of which is tuned for certain workloads, although the Bessemer and Stanage clusters do not have a /data filestore.
The specifications for each cluster are detailed on the Stanage specifications, Bessemer specifications and ShARC specifications pages.
Time Allocation Limits
| Scheduler Type | Interactive Job (default / max) | Batch Job (default / max) | Submission Argument |
|---|---|---|---|
| SLURM (Stanage) | 8 / 8 hrs | 8 / 96 hrs | --time=hh:mm:ss |
| SLURM (Bessemer) | 8 / 8 hrs | 8 / 168 hrs | --time=hh:mm:ss |
| SGE (ShARC) | 8 / 8 hrs | 8 / 96 hrs | -l h_rt=hh:mm:ss |
The time allocation limits differ between job types and clusters - a summary of these differences can be seen above. Time requirements are highly dependent on how many CPU cores your job uses - using more cores may significantly decrease the time the job spends running, depending on how well the software you are using supports parallelisation. Further details on CPU core selection can be found in the CPU allocation section below.
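The submission argument shown in the table can be supplied either on the command line or as a directive inside the batch script; a minimal sketch, where my_job.sh is a hypothetical script name:
$ sbatch --time=08:00:00 my_job.sh    # SLURM (Stanage / Bessemer)
$ qsub -l h_rt=08:00:00 my_job.sh     # SGE (ShARC)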
Determining time requirements using timing commands in your script
A way of deducing the “wall clock” time used by a job is to use the date command within the batch script file. The date command is part of the Linux operating system. Here is an example for SLURM (Stanage and Bessemer):
#SBATCH --time=00:10:00
date
my_program < my_input
date
When the above script is submitted (via sbatch), the job output file will contain the date and time at each invocation of the date command. You can then calculate the difference between these date/times to determine the actual time taken.
On ShARC (SGE), the equivalent is:
#$ -l h_rt=10:00:00
date
my_program < my_input
date
When the above script is submitted (via qsub), the job output file will contain the date and time at each invocation of the date command. You can then calculate the difference between these date/times to determine the actual time taken.
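Rather than subtracting the two timestamps by hand, you could record seconds since the epoch and have the script print the elapsed time itself; a minimal sketch that works in both SLURM and SGE batch scripts (my_program and my_input are the placeholder names used above):
start=$(date +%s)    # wallclock start, in seconds since the epoch
my_program < my_input
end=$(date +%s)      # wallclock end
echo "Elapsed wallclock: $((end - start)) seconds"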
Determining time used by your jobs
The time used by a job is typically quantified as 2 values by the scheduler:
the “wallclock” (the time your job took if measured by a clock on your wall)
the consumed “CPU time” (a number of seconds of compute, derived directly from the amount of CPU time used on all cores of a job).
These values can be determined using the seff or qacct commands as shown below:
On Stanage, the seff script can be used with the job’s ID to give a summary of important job information, including the wallclock time:
$ seff 64626
Job ID: 64626
Cluster: stanage.alces.network
User/Group: a_user/clusterusers
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 1
CPU Utilized: 00:02:37
CPU Efficiency: 35.68% of 00:07:20 core-walltime
Job Wall-clock time: 00:03:40
Memory Utilized: 137.64 MB (estimated maximum)
Memory Efficiency: 1.71% of 7.84 GB (3.92 GB/core)
Here we can see the wallclock was 03:40 (220s) and the consumed CPU time was 02:37 (157s). As this job requested 2 cores (2 nodes * 1 core), we can also see there was a maximum core-walltime of 07:20 (440s) available. The CPU Efficiency follows as (157/440)*100=35.68%.
On Bessemer, the seff script is used in the same way:
$ seff 64626
Job ID: 64626
Cluster: bessemer
User/Group: a_user/a_group
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:02:37
CPU Efficiency: 35.68% of 00:07:20 core-walltime
Job Wall-clock time: 00:03:40
Memory Utilized: 137.64 MB (estimated maximum)
Memory Efficiency: 1.71% of 7.84 GB (3.92 GB/core)
Here we can see the wallclock was 03:40 (220s) and the consumed CPU time was 02:37 (157s). As this job requested 2 cores (1 node * 2 cores), we can also see there was a maximum core-walltime of 07:20 (440s) available. The CPU Efficiency follows as (157/440)*100=35.68%.
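On the SLURM clusters (Stanage and Bessemer) these values can also be queried for a finished job with sacct; a minimal sketch using the example job ID from above and standard sacct format fields, where Elapsed is the wallclock and TotalCPU is the consumed CPU time:
$ sacct -j 64626 --format=JobID,Elapsed,TotalCPU,NCPUS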
On ShARC, the qacct command can be used as follows to display the CPU time as well as the wallclock:
$ qacct -j 628 | grep -E "ru_wallclock|cpu"
ru_wallclock 13s
cpu 4.187s
Here we can see the wallclock time is 13s and the consumed CPU time was 4.187s.
CPU Allocation Limits
| Scheduler Type | Interactive Job (min / default / max cores) | Batch Job (min / default / max cores) | Submission Argument |
|---|---|---|---|
| SLURM (Stanage) | 1 / 1 / ~11264 (MPI), 64 (SMP) | 1 / 1 / ~11264 (MPI), 64 (SMP) | --ntasks= (MPI) or --cpus-per-task= (SMP) |
| SLURM (Bessemer) | 1 / 1 / 40 | 1 / 1 / 40 | --ntasks= (MPI) or --cpus-per-task= (SMP) |
| SGE (ShARC) | 1 / 1 / ~1536 (MPI), 16 (SMP) | 1 / 1 / ~1536 (MPI), 16 (SMP) | -pe <environment> <cores> |
The CPU allocation limits differ between job types and clusters - a summary of these differences can be seen above. It is important to note that SLURM and SGE request CPUs on a different basis, as detailed above.
Determining CPU requirements:
In order to determine your CPU requirements, you should investigate if your program / job supports parallel execution. If the program only supports serial processing, then you can only use 1 CPU core and should be using Bessemer (faster CPUs) to do so.
If your job / program supports multiple cores, you need to assess whether it supports SMP (symmetric multiprocessing) where you can only use CPUs on 1 node or MPI (message passing interface) where you can access as many nodes, CPUs and cores as are available.
For SMP only type parallel processing jobs: you can use a maximum of 64 cores on Stanage, 40 cores on Bessemer and 16 cores on ShARC. Ideally you should use Stanage or Bessemer, as not only can you access more cores, the cores themselves are more modern.
For multi-node MPI type parallel processing jobs: these can run on both Stanage and ShARC, and although you can access as many cores as are available, you must weigh how long the job will queue waiting for resources against the decrease in time for the job to complete its computation.
Single node MPI type parallel jobs can run on ShARC (when running in the SMP parallel environment), Stanage and Bessemer.
For both parallel processing methods you should run several test jobs with various numbers of cores, using the tips from the Time allocation section, to assess what factor of speedup/slowdown is attained for queuing, computation and the total time for job completion (a sketch of such a scaling test follows below).
Remember, the larger your request, the longer it will take for the resources to become available and the time taken to queue is highly dependent on other cluster jobs.
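One way to run such a scaling test on a SLURM cluster is to submit the same script at several core counts and compare the reported times afterwards; a minimal sketch, where my_job.sh is a hypothetical SMP job script that picks up its core count from the scheduler environment:
for n in 1 2 4 8 16; do
    sbatch --cpus-per-task=$n --job-name=scaling_$n my_job.sh
done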
Some additional important considerations to make are:
Amdahl’s law - an increase in cores or computational power will not scale in a perfectly linear manner. Using 2 cores will not be twice as fast as a single core, and the proportional time reduction from using more cores will decrease with larger core counts (see the formula sketch after this list).
Job workload optimisation is highly dependent on the workload type - workloads can be CPU, memory bandwidth or IO (reading and writing to disk) limited - detailed exploration and profiling of workloads is beyond the scope of this guide.
Trying a smaller job (or preferably a set of smaller jobs of different sizes) will allow you to extrapolate likely resource requirements but you must remain aware of the limitations as stated above.
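The first of these considerations is usually stated as Amdahl’s law; a sketch of the formula in LaTeX notation, where $p$ is the fraction of the work that can be parallelised and $N$ is the number of cores:
S(N) = \frac{1}{(1 - p) + p / N}
For example, with $p = 0.9$, running on $N = 16$ cores gives a speedup of only $1 / (0.1 + 0.9/16) = 6.4$, and no number of cores can push the speedup beyond $1 / (1 - p) = 10$.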
Determining job CPU efficiencies
When quantifying CPU usage efficiency, two values are important:
the “wallclock” (the time your job took if measured by a clock on your wall)
the consumed “CPU time” (a number of seconds of compute, derived directly from the amount of CPU time used on all cores of a job).
To optimise your CPU requests you can investigate how efficiently your job is making use of your requested cores with the seff or qacct command:
On Stanage, the seff script can be used with the job’s ID to give a summary of important job information, including the wallclock time:
$ seff 64626
Job ID: 64626
Cluster: stanage.alces.network
User/Group: a_user/clusterusers
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 1
CPU Utilized: 00:02:37
CPU Efficiency: 35.68% of 00:07:20 core-walltime
Job Wall-clock time: 00:03:40
Memory Utilized: 137.64 MB (estimated maximum)
Memory Efficiency: 1.71% of 7.84 GB (3.92 GB/core)
Here we can see the wallclock was 03:40 (220s) and the consumed CPU time was 02:37 (157s). As this job requested 2 cores (2 nodes * 1 core), we can also see there was a maximum core-walltime of 07:20 (440s) available. The CPU Efficiency follows as (157/440)*100=35.68%.
The ideal value for CPU efficiency is 100%.
If the observed efficiency is roughly 100/n %, where n is the number of requested cores, you are likely to be using a single threaded program (which cannot benefit from multiple cores) or a multithreaded program incorrectly configured to use the multiple cores requested. In general, you should request a single core for single threaded programs and ensure multicore programs are correctly configured, requesting as few cores as possible to shorten your queue time.
On Bessemer, the seff script is used in the same way:
$ seff 64626
Job ID: 64626
Cluster: bessemer
User/Group: a_user/a_group
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:02:37
CPU Efficiency: 35.68% of 00:07:20 core-walltime
Job Wall-clock time: 00:03:40
Memory Utilized: 137.64 MB (estimated maximum)
Memory Efficiency: 1.71% of 7.84 GB (3.92 GB/core)
Here we can see the wallclock was 03:40 (220s) and the consumed CPU time was 02:37 (157s). As this job requested 2 cores (1 node * 2 cores), we can also see there was a maximum core-walltime of 07:20 (440s) available. The CPU Efficiency follows as (157/440)*100=35.68%.
On ShARC, CPU efficiency can be computed from the qacct job info as cpuefficiency = cpu / (ru_wallclock*slots).
For example:
$ qacct -j 628 | grep -E 'slots|ru_wallclock|cpu'
slots 1
ru_wallclock 13s
cpu 4.187s
Where efficiency is 4.187/(13*1)=0.3221, or 32.21%.
This can be calculated on the command line with:
$ qacct -j 628 | grep -E "slots|ru_wallclock|cpu" | sed "s/[^0-9.]//g" | awk "{num[NR] = \$1} END {result = num[3] / (num[2] * num[1]) * 100; printf \"%.2f%%\\n\", result}"
32.21%
You may wish to add this as an alias in your .bashrc file:
alias qcpueff='grep -E "slots|ru_wallclock|cpu" | sed "s/[^0-9.]//g" | awk "{num[NR] = \$1} END {result = num[3] / (num[2] * num[1]) * 100; printf \"%.2f%%\\n\", result}"'
You could then call this as:
$ qacct -j 628 | qcpueff
32.21%
Memory Allocation Limits
| Scheduler Type | Standard Nodes | Large RAM Nodes | Very Large RAM Nodes | Interactive Job (default / max) | Batch Job (default / max) | Submission Argument |
|---|---|---|---|---|---|---|
| SLURM (Stanage) | 251 GB | 1007 GB | 2014 GB | 4016 MB / 251 GB | 4016 MB / 251 GB (SMP), ~74404 GB (MPI) | --mem= (per node basis) |
| SLURM (Bessemer) | 192 GB | N/A | N/A | 2 GB / 192 GB | 2 GB / 192 GB | --mem= (per node, i.e. per job, basis) |
| SGE (ShARC) | 64 GB | 256 GB | N/A | 2 GB / 64 GB | 2 GB / 64 GB (SMP), ~6144 GB (MPI) | -l rmem= (per core basis) |
The memory allocation limits differ between job types and clusters - a summary of these differences can be seen above. It is important to note that SLURM and SGE request memory on a different basis, as detailed above.
Determining memory requirements:
By using the email parameters of the qsub or sbatch commands:
Submit your job via qsub or sbatch, specifying very generous memory and time requirements to ensure that it runs to completion, and use the -M and -m abe (SGE) or --mail-user= and --mail-type=ALL (SLURM) parameters to receive an email report. The email will list the maximum memory usage ( maxvmem / MaxVMSize ) as well as the wallclock time used by the job. For example, on Stanage or Bessemer (SLURM):
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --mail-user=joe.blogs@sheffield.ac.uk
#SBATCH --mail-type=ALL
myprog < mydata.txt > myresults.txt
And on ShARC (SGE):
#$ -l h_rt=01:00:00
#$ -l rmem=8G
#$ -m abe
#$ -M joe.blogs@sheffield.ac.uk
myprog < mydata.txt > myresults.txt
When the job completes, you will receive an email reporting the memory and time usage figures.
By using the qstat/qacct or seff/sstat/sacct commands:
On Stanage, for a quick summary, the seff command can be used with the job’s ID to give a summary of important job information, including the memory usage / efficiency:
$ seff 64626
Job ID: 64626
Cluster: stanage.alces.network
User/Group: a_user/clusterusers
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 1
CPU Utilized: 00:02:37
CPU Efficiency: 35.68% of 00:07:20 core-walltime
Job Wall-clock time: 00:03:40
Memory Utilized: 137.64 MB (estimated maximum)
Memory Efficiency: 1.71% of 7.84 GB (3.92 GB/core)
For more specific info, you can use the sacct / sstat commands:
While a job is still running find out its job id by:
sacct
And check its current usage of memory by (you may need to append .batch to the job ID to see the batch step):
sstat -j job_id --format='JobID,MaxVMSize,MaxRSS'
If your job has already finished you can list the memory usage with sacct:
sacct --format='JobID,Elapsed,MaxVMSize,MaxRSS'
It is the MaxVMSize / MaxRSS figures that you will need to use to determine the --mem= parameter for your next job.
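As a rough, unofficial rule of thumb (an assumption rather than documented policy), you might set that parameter to the observed maximum plus some headroom; for example, if MaxRSS peaked at around 6 GB you might request:
#SBATCH --mem=8G    # hypothetical: ~6 GB observed peak plus headroom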
On Bessemer, for a quick summary, the seff command gives the same information:
$ seff 64626
Job ID: 64626
Cluster: bessemer
User/Group: a_user/a_group
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:02:37
CPU Efficiency: 35.68% of 00:07:20 core-walltime
Job Wall-clock time: 00:03:40
Memory Utilized: 137.64 MB (estimated maximum)
Memory Efficiency: 1.71% of 7.84 GB (3.92 GB/core)
On ShARC (SGE), you can detect the memory used by your job while it is running by using the qstat command as follows:
While a job is still running find out its job id by:
qstat
And check its current usage of memory by:
qstat -F -j job_id | grep mem
If your job has already finished you can list the memory usage with qacct:
qacct -j job_id | grep vmem
The reported figures will indicate:
the currently used memory ( vmem ).
maximum memory needed since startup ( maxvmem ).
It is the maxvmem figure that you will need to use to determine the -l rmem= parameter for your next job.
For example:
$ qacct -j 628 | grep vmem
maxvmem 4.807G
category -u username -l h_vmem=8G -pe smp 1 -P SHEFFIELD
$ qstat -F -j 628 | grep vmem
usage 1: cpu=77:57:45, mem=61962.33471 GB s, io=38.68363 GB, vmem=4.334G, maxvmem=4.807G
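Following on from this, a hedged rule of thumb (an assumption, not an official recommendation) is to request the observed maxvmem plus some headroom; for the 4.807G seen above you might request:
#$ -l rmem=6G    # ~4.8G observed peak plus headroom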
Filestore Limits / file store performance characteristics
Every HPC user is allocated a file-storage area of their own. Please read the section on Filestores for the clusters for further information. Any attempt to exceed a file store quota during the execution of a job can have disastrous consequences. This is because any program or package writing into files will produce a fatal error and stop if the filestore limit happens to be exceeded during that operation.
Filestore limits are not associated with jobs and can not be specified while submitting a job. Users must make sure that there is sufficient spare space in their filestore areas before submitting any job that is going to produce large amounts of output. It may be necessary for users to use multiple filestores during longer projects or even within one job.
The quota command can be used to check your current filestore allocation and usage.
Each filestore has the relevant details and performance characteristics listed within the section on Filestores; these will indicate where your program is best suited to run from.
Determining how much storage space your jobs will need:
One method of determining how much space your jobs are likely to consume is to run an example job within a specific directory, saving the output inside it.
Once the run has completed you can determine the amount of storage taken by the job by running:
du -sh my_directory_name
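For example, a minimal sketch of this approach inside a batch script, where test_run is an arbitrary directory name and my_program / my_input are the placeholders used earlier:
mkdir -p test_run
cd test_run
my_program < ../my_input
cd ..
du -sh test_run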
Special limits and alternative queues
If you have paid for a reservation, or your research group or department has purchased additional resources, there may be other accounts and partitions you can specify which will override the normal limits imposed on cluster jobs.
If you have access to additional queues / partitions and want to know their limitations, you can use the following commands to explore this.
Listing Queues
On Stanage you can list the queues with:
sinfo -a
As Stanage has non-homogeneous nodes you can list more information by formatting the output, e.g.:
sinfo -o "%P %l %c %D " # PARTITION TIMELIMIT CPUS NODES
On Bessemer you can list the queues with:
sinfo -a
As Bessemer has non-homogeneous nodes you can list more information by formatting the output, e.g.:
sinfo -o "%P %l %c %D " # PARTITION TIMELIMIT CPUS NODES
On ShARC you can list the queues with:
qconf -sql
You can then list your specific queue properties with:
qconf -sq queue_name
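Once you know the partition / queue name (and, where relevant, the account or project you have been given access to), you can request it in your submission script; a minimal sketch in which every name is a hypothetical placeholder:
#SBATCH --partition=a_partition    # SLURM (Stanage / Bessemer)
#SBATCH --account=an_account
#$ -q a_queue.q    # SGE (ShARC)
#$ -P a_project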