Choosing appropriate compute resources

Introduction

Choosing appropriate resources for your jobs is essential to ensuring your jobs will be scheduled as quickly as possible while wasting as few resources as possible.

The key resources you need to optimise for are:

  • Time

  • CPU cores

  • Memory

  • Filestore

It is important to be aware that the resource requests that you make are not flexible: if your job exceeds what you have requested for it, the scheduler will terminate your job abruptly and without any warning. This means that it is safest to overestimate your job’s requirements if they cannot be accurately and precisely known in advance.

This does not mean that you can set extremely large values for these resource requests, for several reasons. The most important are:

  • Large allocations will take longer to queue and start.

  • Allocations larger than the scheduler can ever satisfy with the available resources will never start.

Cluster choice

The Sheffield HPC clusters have been designed to fulfil different purposes. ShARC is for the most part a capability cluster designed to run larger compute jobs that will use multiple nodes. Bessemer is a capacity cluster designed to run smaller compute jobs which will fit on a single node. In addition, Bessemer has more modern, faster CPUs but does not have a /data filestore.

With this in mind, you should prioritise putting jobs with smaller core counts onto Bessemer and massively parallel jobs that use a form of MPI onto ShARC.

The specifications for each cluster are detailed in the ShARC specifications and Bessemer specifications pages.


Time Allocation Limits

Time Allocation Limits Table

Scheduler Type     Interactive Job (Default / Max)   Batch Job (Default / Max)   Submission Argument
SGE (ShARC)        8 / 8 hrs                         8 / 96 hrs                  -l h_rt=<hh:mm:ss>
SLURM (Bessemer)   8 / 8 hrs                         8 / 168 hrs                 --time=<days-hh:mm:ss>

The time allocation limits differ between job types and between clusters; the table above summarises these differences. Time requirements are highly dependent on how many CPU cores your job is using: using more cores may significantly decrease the amount of time the job spends running, depending on how well the software you are using supports parallelisation. Further details on choosing CPU cores can be found in the CPU Allocation Limits section.

Determining time requirements using timing commands in your script

One way of determining the “wall clock” time used by a job is to use the date or the timeused command within the script file. The date command is part of the Linux operating system, whereas the timeused command is specific to our clusters and reports the usage figures directly, so you do not have to calculate them manually from two date commands. Here are some examples:

Using the date command:

#$ -l h_rt=10:00:00
date                     # print the start date and time
my_program < my_input
date                     # print the end date and time

When the above script is submitted (via qsub), the job output file will contain the date and time at each invocation of the date command. You can then calculate the difference between these date/times to determine the actual time taken.
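If you would rather not subtract the two timestamps by hand, one possible approach is to record the time in seconds with date +%s before and after the program and format the difference. This is a minimal sketch rather than part of the cluster tooling: the helper name format_elapsed is hypothetical, and sleep 1 stands in for my_program < my_input.

```shell
#!/bin/bash
# Hypothetical helper: convert a number of seconds into hh:mm:ss.
format_elapsed() {
    local total=$1
    printf '%02d:%02d:%02d\n' $((total / 3600)) $((total % 3600 / 60)) $((total % 60))
}

start=$(date +%s)      # seconds since the epoch at the start
sleep 1                # placeholder for: my_program < my_input
end=$(date +%s)        # seconds since the epoch at the end

echo "Wall clock time used: $(format_elapsed $((end - start)))"
```

The printed figure can then be compared directly with the h_rt value requested for the job.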

Using the timeused command:

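#$ -l h_rt=10:00:00
export TIMECOUNTER=0     # 0 makes the first timeused call initialise the timer
source timeused          # initialise the timer
my_program < my_input
source timeused          # report the time elapsed since the first call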

When the above script is submitted, the first invocation of the timeused command will initialise the timer counter because the TIMECOUNTER variable is set to 0. The subsequent invocations will report the time in hours, minutes and seconds since the first invocation.


CPU Allocation Limits

CPU Allocation Limits Table

Scheduler Type     No. CPU Cores Available     No. CPU Cores Available     Submission Argument
                   Interactive Job             Batch Job
                   (Default / Min / Max)       (Default / Min / Max)
SGE (ShARC)        1 / 1 / 16                  1 / 1 / ~1536               -pe <env> <nn>
SLURM (Bessemer)   1 / 1 / 40                  1 / 1 / 40                  -c <nn>

The CPU allocation limits differ between job types and between clusters; the table above summarises these differences. Note that SGE and SLURM request CPU cores on a different basis, as shown by the submission arguments above.

ShARC Parallel Environments

The parallel environments available on ShARC are listed below. Bessemer has only the SMP environment, which is set by default.

ShARC Parallel Environments Table

Parallel Environment Name <env>   Parallel Environment Description
smp                               Symmetric multiprocessing or ‘Shared Memory Parallel’ environment. Limited to a single node and therefore 16 cores on a normal ShARC node.
openmp                            A ‘Shared Memory Parallel’ environment supporting OpenMP execution. Limited to a single node and therefore 16 cores on a normal ShARC node.
mpi                               Message Passing Interface. Can use as many nodes or cores as desired.
mpi-rsh                           The same as the mpi parallel environment but configured to use RSH instead of SSH for certain software like ANSYS.

Other parallel environments not mentioned do exist for specific purposes. Those who require these will be informed directly or via signposting in other documentation.

A current list of environments on ShARC can be generated using the qconf -spl command.

Determining CPU requirements:

In order to determine your CPU requirements, you should investigate if your program / job supports parallel execution. If the program only supports serial processing, then you can only use 1 CPU core and should be using Bessemer (faster CPUs) to do so.

If your job / program supports multiple cores, you need to assess whether it supports SMP (symmetric multiprocessing) where you can only use CPUs on 1 node or MPI (message passing interface) where you can access as many nodes, CPUs and cores as are available.

For SMP only type parallel processing jobs: you can use a maximum of 16 cores on ShARC and 40 cores on Bessemer. Ideally you should use Bessemer, as it gives you access to both more cores and more modern cores.

For multiple node MPI type parallel processing jobs: these can only run on ShARC. Although you can access as many cores as are available, you must weigh how long a job will spend queuing for resources against the reduction in the time the job takes to complete its computation.

Single node MPI type parallel jobs can run on both ShARC (when running in the SMP parallel environment) and Bessemer.

For both parallel processing methods, you should run several test jobs with various numbers of cores, using the tips from the Time Allocation Limits section, to assess how the queuing time, the computation time and the total time to job completion change with the core count.
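Such a set of test jobs could be prepared along the following lines. This is only a sketch for ShARC: my_job.sh is a hypothetical job script, the core counts are arbitrary, and the function print_scaling_submissions (also hypothetical) only prints the qsub commands so that you can check them before running them yourself.

```shell
#!/bin/bash
# Print (not run) one ShARC submission command per core count to test.
# my_job.sh is a hypothetical job script name used for illustration.
print_scaling_submissions() {
    local script=$1
    shift
    local cores
    for cores in "$@"; do
        echo "qsub -pe smp ${cores} -N scaling_${cores} ${script}"
    done
}

print_scaling_submissions my_job.sh 1 2 4 8 16
```

Giving each job a distinct name with -N makes it easier to match the timing figures in the output files to the core count that produced them.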

Remember, the larger your request, the longer it will take for the resources to become available and the time taken to queue is highly dependent on other cluster jobs.

Some additional important considerations to make are:

  • Amdahl’s law - an increase in cores or computational power will not scale in a perfectly linear manner. Using 2 cores will not be twice as fast as a single core - and the proportional time reduction from using more cores will decrease with larger core counts.

  • Job workload optimisation is highly dependent on the workload type - workloads can be CPU, memory bandwidth or IO (reading and writing to disk) limited - detailed exploration and profiling of workloads is beyond the scope of this guide.

  • Trying a smaller job (or preferably a set of smaller jobs of different sizes) will allow you to extrapolate likely resource requirements but you must remain aware of the limitations as stated above.
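As a rough illustration of the first point, Amdahl's law can be written as follows, where p is the fraction of the run time that can be parallelised and n is the number of cores (both symbols are introduced here for illustration only):

```latex
S(n) = \frac{1}{(1 - p) + \frac{p}{n}}
```

For example, assuming p = 0.9, 16 cores give S(16) = 1 / (0.1 + 0.05625) = 6.4 rather than 16, and no number of cores can give a speedup above 1 / (1 - p) = 10.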


Memory Allocation Limits

Memory Allocation Limits Table

Scheduler Type     Standard Nodes   Large RAM Nodes   Interactive Job    Batch Job                            Submission Argument
                                                      (Default / Max)    (Default / Max)
SGE (ShARC)        64 GB            256 GB            2 GB / 64 GB       2 GB / 64 GB (SMP), ~6144 GB (MPI)   Per core basis: -l rmem=<nn>
SLURM (Bessemer)   192 GB           N/A               2 GB / 192 GB      2 GB / 192 GB                        Per job basis: --mem=<nn>

The memory allocation limits differ between job types and between clusters; the table above summarises these differences. Note that SGE and SLURM request memory on a different basis: SGE requests are per core (-l rmem=<nn>), whereas SLURM requests are per job (--mem=<nn>).
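Because SGE counts memory per core and SLURM counts it per job, the same job needs different figures on each cluster. The sketch below assumes a hypothetical job using 4 cores and 8 GB per core and prints the corresponding request lines for each scheduler; the helper name total_mem_gb and the values are illustrative only.

```shell
#!/bin/bash
# Hypothetical helper: total memory (GB) of a job given cores and GB per core.
total_mem_gb() {
    local cores=$1 per_core_gb=$2
    echo $((cores * per_core_gb))
}

cores=4          # assumed core count, for illustration only
per_core_gb=8    # assumed memory per core in GB, for illustration only

# SGE (ShARC): memory is requested per core.
echo "#\$ -pe smp ${cores}"
echo "#\$ -l rmem=${per_core_gb}G"

# SLURM (Bessemer): memory is requested for the whole job.
echo "#SBATCH -c ${cores}"
echo "#SBATCH --mem=$(total_mem_gb "$cores" "$per_core_gb")G"
```

With these values the SGE request is rmem=8G per core, while the equivalent SLURM request is --mem=32G for the whole job.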

Determining memory requirements:

By using the emailing parameters of the qsub or sbatch command:

Submit your job with qsub or sbatch, specifying very generous memory and time requirements to ensure that it runs to completion, and use the -M and -m abe (SGE) or --mail-user= and --mail-type=ALL (SLURM) parameters to receive an email report. The mail message will list the maximum memory usage (maxvmem / MaxVMSize) as well as the wall clock time used by the job.

Here is an example job script for SGE:

#$ -l h_rt=01:00:00
#$ -l rmem=8G
#$ -m abe
#$ -M joe.blogs@sheffield.ac.uk
myprog < mydata.txt > myresults.txt

Here is an example job script for SLURM:

#!/bin/bash
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --mail-user=joe.blogs@sheffield.ac.uk
#SBATCH --mail-type=ALL
myprog < mydata.txt > myresults.txt

When the job completes, you will receive an email reporting the memory and time usage figures.


By using the qstat or sstat / sacct command:

You can detect the memory used by your job while it is running by using the qstat command for SGE as follows:

While a job is still running find out its job id by:

qstat

And check its current usage of memory by:

qstat -F -j job_id | grep mem

If your job has already finished you can list the memory usage with qacct:

qacct -j job_id

The reported figures will indicate:

  • the currently used memory ( vmem ).

  • maximum memory needed since startup ( maxvmem ).

It is the maxvmem figure that you will need to use to determine the -l rmem= parameter for your next job.


On SLURM, you can check the memory used by your job while it is running with the sstat command, and after it has finished with the sacct command, as follows:

While a job is still running find out its job id by:

sacct

And check its current usage of memory by:

sstat job_id --format='JobID,MaxVMSize'

If your job has already finished you can list the memory usage with sacct:

sacct --format='JobID,Elapsed,MaxVMSize'

It is the MaxVMSize figure that you will need to use to determine the --mem= parameter for your next job.
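Since jobs that exceed their request are terminated, you may want to add some headroom when turning the reported peak into a new request. The sketch below is one possible way to do that, not a site policy: it assumes the peak figure is given in kilobytes, and the helper name mem_request_gb and the 20% margin are assumptions chosen for illustration.

```shell
#!/bin/bash
# Hypothetical helper: turn a peak usage in kilobytes into a request in whole
# gigabytes, adding a safety margin given as a percentage (an assumed policy).
mem_request_gb() {
    local peak_kb=$1 margin_pct=$2
    local with_margin_kb=$(( peak_kb * (100 + margin_pct) / 100 ))
    # Round up to the next whole GB (1 GB = 1048576 KB).
    echo $(( (with_margin_kb + 1048575) / 1048576 ))
}

# Example: a reported peak of 5000000 KB with a 20% margin.
echo "--mem=$(mem_request_gb 5000000 20)G"
```

The same arithmetic applies to the maxvmem figure on SGE, remembering that -l rmem= is requested per core rather than per job.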

Filestore Limits / Filestore Performance Characteristics

Every HPC user is allocated a file storage area of their own. Please read the section on Filestores for further information about the filestores on each cluster. Any attempt to exceed a filestore quota during the execution of a job can have disastrous consequences, because any program or package writing to files will produce a fatal error and stop if the filestore limit is exceeded during that operation.

Filestore limits are not associated with jobs and cannot be specified while submitting a job. Users must make sure that there is sufficient spare space in their filestore areas before submitting any job that is going to produce large amounts of output. It may be necessary for users to use multiple filestores during longer projects or even within one job.

The quota command can be used to check your current filestore allocation and usage.

The details and performance characteristics of each filestore are listed in the section on Filestores, which will indicate where your program is best suited to run from.


Determining how much storage space your jobs will need:

One method to determine how much space your jobs are likely to consume is to run an example job within a specific directory and save its output there.

Once the run has completed you can determine the amount of storage taken by the job by running:

du -sh my_directory_name
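If you plan to run many similar jobs, the size of one example run can be scaled up to give a rough total. The sketch below assumes that every run produces about the same amount of output; the helper name estimate_total_kb and the figure of 50 runs are hypothetical and used for illustration only.

```shell
#!/bin/bash
# Hypothetical helper: estimated space in KB for a number of similar runs,
# given the size in KB of one run's output directory.
estimate_total_kb() {
    local one_run_kb=$1 runs=$2
    echo $((one_run_kb * runs))
}

dir=${1:-.}    # directory holding one example run (defaults to the current one)
runs=50        # assumed number of planned runs, for illustration only
# du -sk prints the size in KB, a tab, then the directory name.
one_run_kb=$(du -sk "$dir" | cut -f1)
echo "Estimated space for ${runs} runs: $(estimate_total_kb "$one_run_kb" "$runs") KB"
```

The estimate can then be compared with the free space reported by the quota command before the jobs are submitted.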

Special limits and alternative queues

If you have paid for a reservation, or your research group or department has purchased additional resources, there may be other accounts and partitions you can specify which will override the normal limits imposed on cluster jobs.

If you have access to additional queues / partitions and want to know their limitations, you can use the following commands to explore them.


Listing Queues

ShARC

On ShARC you can list the queues with:

qconf -sql

You can then list your specific queue properties with:

qconf -sq queue_name

Bessemer

On Bessemer you can list the queues with:

sinfo -h

As Bessemer has non-homogeneous nodes, you can list more information by formatting the output, e.g.:

sinfo -o "%P %l %c %D "  # PARTITION TIMELIMIT CPUS NODES