Choosing appropriate compute resources
Introduction
Choosing appropriate resources for your jobs is essential to ensuring that they are scheduled as quickly as possible while wasting as few resources as possible.
The key resources you need to optimise for are:
Time (run time)
Number of CPU cores
Amount of memory
Filestore usage
It is important to be aware that the resource requests you make are not flexible: if your job exceeds what you have requested for it, the scheduler will terminate your job abruptly and without warning. This means it is safest to overestimate your job’s requirements if they cannot be known accurately and precisely in advance.
This does not mean that you should set extremely large values for these resource requests, for several reasons, the most important being:
Large allocations will take longer to queue and start.
Allocations larger than the scheduler can ever satisfy with the available resources will never start.
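For example, a minimal SLURM batch script that states these requests explicitly might look like the following sketch; the resource values and program name are purely illustrative and should be replaced with figures appropriate to your own job:

#!/bin/bash
#SBATCH --time=01:00:00       # wall clock (run time) limit, hh:mm:ss
#SBATCH --cpus-per-task=4     # number of CPU cores on a single node
#SBATCH --mem=8G              # total memory for the job
my_program < my_input         # your own program and input file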
Cluster choice
There are three clusters for you to choose from:
Stanage (Our newest and most powerful yet, launched in March 2023).
Bessemer (Launched in 2018).
ShARC (Launched in 2016).
It is also important to note that the Sheffield HPC clusters have been designed to fulfil different purposes. Stanage and ShARC are for the most part `capability` clusters designed to run larger compute jobs that will use multiple nodes. Bessemer is a `capacity` cluster designed to run smaller compute jobs which will fit on a single node.
You should prioritise putting smaller core count jobs onto Bessemer and massively parallel jobs (using a form of MPI) onto Stanage or ShARC.
In addition, Stanage and Bessemer have newer CPUs with more modern features. All of the clusters have similar file storage areas, each of which is tuned for certain workloads, although the Bessemer and Stanage clusters do not have a /data filestore.
The specifications for each cluster are detailed on the Stanage specifications, Bessemer specifications and ShARC specifications pages.
Time Allocation Limits
| Scheduler Type | Interactive Job (default / maximum) | Batch Job (default / maximum) | Submission Argument |
|---|---|---|---|
| SLURM (Stanage) | 8 hrs / 8 hrs | 8 hrs / 96 hrs | `--time=hh:mm:ss` |
| SLURM (Bessemer) | 8 hrs / 8 hrs | 8 hrs / 168 hrs | `--time=hh:mm:ss` |
| SGE (ShARC) | 8 hrs / 8 hrs | 8 hrs / 96 hrs | `-l h_rt=hh:mm:ss` |
The time allocation limits differ between job types and by cluster - a summary of these differences can be seen above. Time requirements depend heavily on how many CPU cores your job uses: more cores may significantly decrease the time the job spends running, depending on how well the software you are using supports parallelisation. Further details on CPU core selection can be found in the CPU cores allocation section.
Determining time requirements using timing commands in your script
A way of deducing the “wall clock” time used by a job is to use the date command within the batch script file. The date command is part of the Linux operating system. Here is an example for each scheduler:
On Stanage and Bessemer (SLURM):
#SBATCH --time=00:10:00
date
my_program < my_input
date
When the above script is submitted (via sbatch), the job output file will contain the date and time at each invocation of the date command. You can then calculate the difference between these date/times to determine the actual time taken.
On ShARC (SGE):
#$ -l h_rt=10:00:00
date
my_program < my_input
date
When the above script is submitted (via qsub), the job output file will contain the date and time at each invocation of the date command. You can then calculate the difference between these date/times to determine the actual time taken.
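If you would rather have the elapsed time calculated for you, date can also print seconds since the epoch, and the difference can be reported at the end of the script. This is a minimal sketch using standard shell arithmetic (my_program and my_input are placeholders); it works the same way whether the script is submitted via sbatch or qsub:

start=$(date +%s)              # seconds since the epoch at job start
my_program < my_input
end=$(date +%s)                # seconds since the epoch at job end
echo "Elapsed wall clock time: $((end - start)) seconds"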
CPU Allocation Limits
| Scheduler Type | Interactive Job (min / default / max cores) | Batch Job (min / default / max cores) | Submission Argument |
|---|---|---|---|
| SLURM (Stanage) | 1 / 1 / ~11264 (MPI), 64 (SMP) | 1 / 1 / ~11264 (MPI), 64 (SMP) | `--cpus-per-task=<nn>` (SMP) or `--ntasks=<nn>` (MPI) |
| SLURM (Bessemer) | 1 / 1 / 40 | 1 / 1 / 40 | `--cpus-per-task=<nn>` |
| SGE (ShARC) | 1 / 1 / ~1536 (MPI), 16 (SMP) | 1 / 1 / ~1536 (MPI), 16 (SMP) | `-pe <env> <nn>` |
The CPU allocation limits will differ between job types and by cluster - a summary of these differences can be seen above. It is important to note that SLURM and SGE will request CPU on a different basis as detailed above.
Determining CPU requirements:
In order to determine your CPU requirements, you should investigate if your program / job supports parallel execution. If the program only supports serial processing, then you can only use 1 CPU core and should be using Bessemer (faster CPUs) to do so.
If your job / program supports multiple cores, you need to assess whether it supports SMP (symmetric multiprocessing) where you can only use CPUs on 1 node or MPI (message passing interface) where you can access as many nodes, CPUs and cores as are available.
For SMP only type parallel processing jobs: you can use a maximum of 64 cores on Stanage, 40 cores on Bessemer and 16 cores on ShARC. Ideally you should use Stanage or Bessemer, as you will not only have access to more cores but also to more modern ones.
For multiple node MPI type parallel processing jobs: these can run on both Stanage and ShARC and, although you can access as many cores as are available, you must weigh how long a job will take to queue while waiting for resources against the decrease in time for the job to complete its computation.
Single node MPI type parallel jobs can run on ShARC (when running in the SMP parallel environment), Stanage and Bessemer.
For both parallel processing methods you should run several test jobs using the tips from the Time allocation section with various numbers of cores to assess what factor of speedup/slowdown is attained for queuing and computation / the total time for job completion.
Remember, the larger your request, the longer it will take for the resources to become available and the time taken to queue is highly dependent on other cluster jobs.
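To make the two request styles concrete, the sketch below shows how a SLURM batch script might request cores for an SMP job and, separately, for an MPI job. The core counts and program names are illustrative only, and the MPI example assumes your program has been built against an MPI library.

SMP (all cores on one node):

#SBATCH --cpus-per-task=8                      # 8 cores on a single node
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK    # if the program is threaded with OpenMP
my_smp_program

MPI (tasks spread across nodes):

#SBATCH --nodes=2                              # request 2 nodes
#SBATCH --ntasks-per-node=4                    # 4 MPI ranks per node, 8 in total
srun my_mpi_program                            # launch the ranks with srun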
Some additional important considerations to make are:
Amdahl’s law - an increase in cores or computational power will not scale in a perfectly linear manner. Using 2 cores will not be twice as fast as a single core - and the proportional time reduction from using more cores will decrease with larger core counts.
Job workload optimisation is highly dependent on the workload type - workloads can be CPU, memory bandwidth or IO (reading and writing to disk) limited - detailed exploration and profiling of workloads is beyond the scope of this guide.
Trying a smaller job (or preferably a set of smaller jobs of different sizes) will allow you to extrapolate likely resource requirements but you must remain aware of the limitations as stated above.
Memory Allocation Limits
| Scheduler Type | Standard Nodes | Large RAM Nodes | Very Large RAM Nodes | Interactive Job (default / max) | Batch Job (default / max) | Submission Argument |
|---|---|---|---|---|---|---|
| SLURM (Stanage) | 251 GB | 1007 GB | 2014 GB | 2 GB / 251 GB | 2 GB / 251 GB (SMP), ~74404 GB (MPI) | `--mem=<nn>` (per job basis) |
| SLURM (Bessemer) | 192 GB | N/A | N/A | 2 GB / 192 GB | 2 GB / 192 GB | `--mem=<nn>` (per job basis) |
| SGE (ShARC) | 64 GB | 256 GB | N/A | 2 GB / 64 GB | 2 GB / 64 GB (SMP), ~6144 GB (MPI) | `-l rmem=<nn>` (per core basis) |
The memory allocation limits will differ between job types and by cluster - a summary of these differences can be seen above. It is important to note that SLURM and SGE will request memory on a different basis as detailed above.
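To make the per job versus per core distinction concrete, the two fragments below both reserve roughly 16 GB in total for a 4 core job; the values are illustrative only. On SLURM the memory is requested once for the whole job, while on SGE it is requested per core and multiplied by the number of slots.

SLURM (Stanage and Bessemer):

#SBATCH --cpus-per-task=4
#SBATCH --mem=16G             # 16 GB shared by the whole job

SGE (ShARC):

#$ -pe smp 4
#$ -l rmem=4G                 # 4 GB per core, 16 GB in total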
Determining memory requirements:
By using the emailing parameters of the qsub or sbatch command:
Submit your job (via qsub or sbatch) specifying very generous memory and time requirements to ensure that it runs to completion, and also use the -M and -m abe parameters (SGE) or the --mail-user= and --mail-type=ALL parameters (SLURM) to receive an email report. The mail message will list the maximum memory usage ( maxvmem / MaxVMSize ) as well as the wallclock time used by the job.
On Stanage and Bessemer (SLURM):
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --mail-user=joe.blogs@sheffield.ac.uk
#SBATCH --mail-type=ALL
myprog < mydata.txt > myresults.txt
On ShARC (SGE):
#$ -l h_rt=01:00:00
#$ -l rmem=8G
#$ -m abe
#$ -M joe.blogs@sheffield.ac.uk
myprog < mydata.txt > myresults.txt
When the job completes, you will receive an email reporting the memory and time usage figures.
By using the qstat or sstat / sacct commands:
You can detect the memory used by your job while it is running by using the sacct command for SLURM as follows:
While a job is still running find out its job id by:
sacct
And check its current usage of memory by:
sstat job_id --format='JobID,MaxVMSize,MaxRSS'
If your job has already finished you can list the memory usage with sacct:
sacct --format='JobID,Elapsed,MaxVMSize,MaxRSS'
It is the MaxVMSize / MaxRSS figures that you will need to use to determine the --mem= parameter for your next job.
You can detect the memory used by your job while it is running by using the qstat command for SGE as follows:
While a job is still running find out its job id by:
qstat
And check its current usage of memory by:
qstat -F -j job_id | grep mem
If your job has already finished you can list the memory usage with qacct:
qacct -j job_id
The reported figures will indicate:
the currently used memory ( vmem ).
maximum memory needed since startup ( maxvmem ).
It is the maxvmem figure that you will need to use to determine the -l rmem= parameter for your next job.
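Whichever cluster you use, it is sensible to add some headroom to the observed peak before setting the request for your next run, since memory usage can vary between runs. A rough sketch for SLURM (the job id and figures are illustrative; sacct normally reports MaxRSS in kibibytes with a trailing K):

sacct -j 1234567 --format='JobID,Elapsed,MaxVMSize,MaxRSS'   # inspect the finished job's peak memory
# if MaxRSS was reported as roughly 6291456K (about 6 GB), request the
# observed peak plus some headroom in the next submission:
#SBATCH --mem=8G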
Filestore Limits / filestore performance characteristics
Every HPC user is allocated a file-storage area of their own. Please read the section on Filestores for the clusters for further information. Any attempt to exceed a filestore quota during the execution of a job can have disastrous consequences: any program or package writing to files will produce a fatal error and stop if the filestore limit is exceeded during that operation.
Filestore limits are not associated with jobs and can not be specified while submitting a job. Users must make sure that there is sufficient spare space in their filestore areas before submitting any job that is going to produce large amounts of output. It may be necessary for users to use multiple filestores during longer projects or even within one job.
The quota command can be used to check your current filestore allocation and usage.
Each filestore has its relevant details and performance characteristics listed within the section on Filestores; these will indicate where your program is best suited to run from.
Determining how much storage space your jobs will need:
One method of determining how much space your jobs are likely to consume is to run an example job within a specific directory, saving all of its output inside that directory.
Once the run has completed you can determine the amount of storage taken by the job by running:
du -sh my_directory_name
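If you intend to run many similar jobs, multiply the figure reported by du by the number of jobs you plan to run and compare the result with the spare capacity reported by quota before submitting the full set. For example (the directory name and job count are illustrative):

du -sh my_directory_name      # space used by one representative job's output
quota                         # current filestore allocation and usage
# if one job produced 2 GB of output and you plan to run 50 similar jobs,
# you will need roughly 100 GB of spare space before submitting them all.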
Special limits and alternative queues
If you have paid for a reservation, or your research group or department has purchased additional resources, there may be other accounts and partitions you can specify which will override the normal limits imposed on cluster jobs.
If you have access to additional queues / partitions and want to know their limitations, you can use the following commands to explore this.
Listing Queues
On Stanage and Bessemer you can list the queues with:
sinfo -a
As both clusters have non-homogeneous nodes, you can list more information by formatting the output, e.g.:
sinfo -o "%P %l %c %D " # PARTITION TIMELIMIT CPUS NODES
On ShARC you can list the queues with:
qconf -sql
You can then list your specific queue properties with:
qconf -sq queue_name
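Once you know the name of a partition or queue that you have access to, you can target it directly when submitting. The lines below are a sketch only: the partition, account, queue and project names are placeholders that must be replaced with the ones you have actually been granted.

On Stanage and Bessemer (SLURM):

#SBATCH --partition=my_partition    # placeholder partition name
#SBATCH --account=my_project        # placeholder account name

On ShARC (SGE):

#$ -q my_queue.q                    # placeholder queue name
#$ -P my_project                    # placeholder project name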