Attention
High-priority access is not currently available, but will be soon.
Priority access to HPC resources
Summary
The HPC High-Priority Service is a paid add-on for front of the queue access, complementing the existing free at point of use access.
High-priority jobs are expected to start within 5–15 min in most cases, depending on the state of the cluster.
You purchase a capacity allowance (e.g., 16 CPU cores / 1 GPU). Jobs totalling those resources start now; additional jobs queue to start immediately after.
A purchase is not a reservation or a dedicated node; instead, idle capacity is shared with other users. Your funds expand the cluster permanently.
Staff can purchase resources in monthly increments, pausing and resuming as needed, and share the purchase with others.
To use a high-priority purchase, add this to your job submission script:
#SBATCH --account=<your_account>
#SBATCH --partition=hp-<cpu|a100|h100|h100-nvl>
HPC High-Priority Service
The University of Sheffield’s High-Performance Computing (HPC) clusters are available to registered staff and students on a free at point of use, shared basis. The HPC High-Priority Service offers improved access, with:
Reduced queue times
Increased availability of CPU, RAM, or GPU resources
In cases where you require either more HPC resources or faster access to HPC resources, please look at the guidance on Purchasing computational resources and/or contact the IT Services’ Research and Innovation team at an early stage.
About the HPC High-Priority Service
The University of Sheffield High Performance Computing (HPC) clusters offer a High-Priority Service. The service allows you to purchase a HPC High-Priority Service account, which will grant you high-priority access to resources equivalent to a fraction or multiple of a node. Types of purchases include CPU or GPU nodes, for example ½ of a Stanage General CPU node or one Stanage A100 GPU node. Jobs submitted using a HPC High-Priority Service account will move to the front of the queue for the duration of your purchase.
With a University of Sheffield HPC High-Priority Service account you can run jobs totalling the purchased resources with increased priority, resulting in the job running on the next available resources in the cluster (i.e., as soon as possible). For example, if you have purchased 16 cores and are currently running no other HPC High-Priority Service jobs, you could submit either two 8 core jobs or one 16 core job to run as soon as possible. An additional 8 core job would start immediately after one of the running HPC High-Priority Service jobs completed. You will not be able to run a HPC High-Priority Service job larger than the amount of resources you have purchased, but you can still run larger jobs as part of the regular free HPC offering.
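To make the 16-core example above concrete, a minimal submission script might look like the following sketch. The account name, time limit, and workload are hypothetical placeholders; substitute your own.

```shell
#!/bin/bash
#SBATCH --account=c162408_ab1xyz   # your HPC High-Priority Service account (hypothetical)
#SBATCH --partition=hp-cpu         # CPU high-priority partition
#SBATCH --cpus-per-task=16         # must not exceed your purchased 16 cores
#SBATCH --time=01:00:00

srun my_program                    # replace with your actual workload
```

Submitting two copies of this script with `--cpus-per-task=8` instead would use the same 16-core purchase as two simultaneous 8-core jobs.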
HPC High-Priority Service jobs once submitted will be near or at the top of the queue and are expected to start between 5–15 minutes after submission in most cases (this may be longer for GPU jobs). In some instances a HPC High-Priority Service job may not be the next to start depending on the size and type of resources requested (e.g., a regular GPU job might start ahead of a HPC High-Priority Service CPU job). HPC High-Priority Service jobs may run on a range of different nodes in the cluster, ensuring availability of compute resources.
You may submit multiple HPC High-Priority Service jobs, provided any individual job is smaller than or equal to the size of your HPC High-Priority Service resources. These jobs will be queued so that when one job finishes, the next starts immediately. You may use your HPC High-Priority Service resources constantly for the duration of your purchase.
Additional benefits of the HPC High-Priority Service
The University of Sheffield HPC High-Priority Service replaces dedicated nodes. The cost of a HPC High-Priority Service account instead goes towards funding the expansion of the HPC system. Your purchased resources are available to you when needed whilst also expanding the cluster for all users when those resources would otherwise be idle. Further, unlike purchasing new hardware, there is only a short lead time for purchasing a HPC High-Priority Service account. This means if you decide that you need additional high-priority resources, these can be made available in weeks not months.
Note
The HPC High-Priority Service aims to provide a dedicated-node-like experience without the use of dedicated nodes.
As HPC High-Priority Service jobs can run on any node, or combination of nodes, there will never be extended down time as the result of a single faulty node. You also have significant flexibility in compute topology, with the ability to run your job on multiple nodes, for example, to increase memory bandwidth.
The HPC High-Priority Service gives increased flexibility, with the ability to pause and resume your purchase. For example, when purchasing one year of high-priority access, you can use three months of compute for initial investigations, pause whilst those results are analysed, then use your final nine months for the full study. You pay for the compute resources you need only when you need them.
The University of Sheffield Sustainability Strategy commits the university to net-zero by 2030. HPC clusters use large amounts of energy and are included in this strategy. One of the largest single consumers of cluster energy is idle nodes, in particular dedicated compute resources purchased for specific tasks sitting unused. The University of Sheffield HPC High-Priority Service provides many similar benefits to dedicated nodes, but instead when you are not using your HPC High-Priority Service resources, those resources can be used by others. The removal of dedicated nodes contributes towards improving carbon efficiency. Learn more about sustainable computational science.
What the HPC High-Priority Service is not
A University of Sheffield HPC High-Priority Service account is neither a dedicated node nor reservation. Your HPC High-Priority Service account entitles you to substantially reduced queue times for running jobs totalling the resources purchased, but unlike a dedicated node or reservation, these resources are available to others whilst you are not using them. If you absolutely need a reservation for teaching or a dedicated node, then please contact the IT Services’ Research and Innovation team for advice at an early stage.
Queue times are expected to be 5–15 minutes in most cases, but this cannot be guaranteed due to the complexity of managing jobs for hundreds of cluster users; queue times may also be longer for GPU jobs. High-priority access through the University of Sheffield HPC High-Priority Service is provided on a first-come, first-served basis. Once running, your jobs will have continuous use of your resources.
How the HPC High-Priority Service impacts on free use
University of Sheffield HPC High-Priority Service high-priority jobs do not affect jobs from non-paying staff and students, other than short delays in queue time depending on the number of HPC High-Priority Service jobs waiting to run. Job age, fair share, etc., function the same for both High-Priority Service jobs and free jobs, except that the two job types do not interact with each other. To ensure our HPC clusters remain available for all users, at least 20% of the cluster is reserved exclusively for free use. The remainder of the cluster is shared, with a small fraction reserved for High-Priority Service jobs and the rest available to both High-Priority Service and free jobs.
If you use a HPC High-Priority Service account, your high-priority jobs will not affect any other jobs you run using the regular free offering. If you have only purchased CPU cores, you will still be able to run free GPU jobs the same way any regular cluster user can and vice versa.
The Slurm queue and job priority
To manage the work, or jobs, submitted to the University of Sheffield’s High-Performance Computing (HPC) clusters by multiple users fairly, our HPC systems run the Slurm scheduler. When a job is submitted, it is queued by Slurm and a priority for the job is assigned. Job priority is then used as part of determining the order of the queue and hence the next job to run. Generally jobs with higher priority will start sooner.
Job priority is calculated as a function of job age, fair share (determined by your recent cluster use), and other factors. For example, if you have used the cluster significantly in the last week, your job priority will be lower. High-priority jobs get a significant boost to job priority, which increases their position in the queue and causes high-priority jobs to start sooner. Usually a high-priority job will start next if you have no other high-priority jobs running.
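Slurm can show you the individual factors behind each pending job's priority. As a sketch (on a cluster where the standard `sprio` utility is available), you can inspect your own pending jobs with:

```shell
# List your pending jobs with their priority broken down by factor
# (age, fair share, partition, QOS); -l gives the long, per-factor output.
sprio -u $USER -l
```

Comparing the fair-share and QOS columns before and after heavy cluster use illustrates why recent usage lowers the priority of your subsequent jobs.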
Getting a High-Priority Service account
Attention
We will shortly be offering HPC High-Priority Service accounts on Stanage.
The University of Sheffield HPC High-Priority Service is available for staff upon request by following the guidance on Purchasing computational resources and/or contacting the IT Services’ Research and Innovation team. You can pause and resume a single purchase, in effect splitting it up, and you can share your HPC High-Priority Service accounts with others. If you would like to do either, please say so as part of your request.
HPC High-Priority Service accounts are available in one-month increments, from a minimum duration of one month up to cluster retirement. The minimum purchase is ¼ of a CPU node (16 cores), ¼ of an A100 GPU node (one A100 GPU), ½ of an H100 GPU node (one H100 GPU), or ¼ of an H100 NVL GPU node (one H100 NVL GPU). The maximum purchase sizes are: two large memory nodes (2 TB of RAM), two very large memory CPU nodes (4 TB of RAM), three A100 nodes (twelve A100 GPUs), one H100 GPU node (two H100 GPUs), and ½ of an H100 NVL GPU node (two H100 NVL GPUs). There is no maximum general CPU node purchase. Purchases should be even fractions of a single node or multiples of whole nodes. Learn more about the different nodes on Stanage.
When purchasing a HPC High-Priority Service account, your funds contribute towards the costs of growing the cluster.
Using the High-Priority Service
Once your University of Sheffield HPC High-Priority Service account purchase is confirmed, you will be given an account name in the format of c162408_ab1xyz or c162408_project.
The account name encodes the purchase type, the number of cores, and when the purchase started.
(In this example: a (C)PU node, (16) cores, started 24/08, i.e. August 2024.)
While batch jobs are preferred, both interactive and batch jobs can be run with increased priority using the HPC High-Priority Service. HPC High-Priority Service jobs will need to include the account as part of the submission script as well as the high-priority partition for your resource type. High-priority partitions are given in the following table:
Resource type          | High-priority partition
CPU resources          | hp-cpu
A100 GPU resources     | hp-a100
H100 GPU resources     | hp-h100
H100 NVL GPU resources | hp-h100-nvl
Batch HPC High-Priority Service usage
Any existing job submission script can be submitted using your HPC High-Priority Service resources by running:
sbatch --account=c162408_ab1xyz --partition=hp-cpu script.sh
where c162408_ab1xyz is your HPC High-Priority Service account, hp-cpu is the CPU node high-priority partition, and script.sh is the name of your existing job submission script.
Job submission scripts can also be modified to run using your HPC High-Priority Service resources by adding:
#SBATCH --account=c162408_ab1xyz
#SBATCH --partition=hp-cpu
to the header of your submission script, where c162408_ab1xyz is your HPC High-Priority Service account and hp-cpu is the CPU node high-priority partition.
Interactive HPC High-Priority Service usage
You may use your HPC High-Priority Service resources interactively by running:
srun --account=c162408_ab1xyz --partition=hp-cpu --pty bash -i
where c162408_ab1xyz is your HPC High-Priority Service account and hp-cpu is the CPU node high-priority partition.
Running HPC High-Priority Service GPU jobs
To run a HPC High-Priority Service GPU job, you will need to specify GPUs using the gpu QOS and other GPU specific scheduler flags (see Using GPUs on Stanage for more details).
Submit existing job submission scripts to a HPC High-Priority Service account for GPUs by running:
sbatch --account=a12509_ab1xyz --partition=hp-a100 --qos=gpu --gres=gpu:1 script.sh
where a12509_ab1xyz is your HPC High-Priority Service account, hp-a100 is the A100 node high-priority partition, and script.sh is the name of your existing job submission script.
sbatch --account=h12509_ab1xyz --partition=hp-h100 --qos=gpu --gres=gpu:1 script.sh
where h12509_ab1xyz is your HPC High-Priority Service account, hp-h100 is the H100 node high-priority partition, and script.sh is the name of your existing job submission script.
sbatch --account=n12510_ab1xyz --partition=hp-h100-nvl --qos=gpu --gres=gpu:1 script.sh
where n12510_ab1xyz is your HPC High-Priority Service account, hp-h100-nvl is the H100 NVL node high-priority partition, and script.sh is the name of your existing job submission script.
Alternatively, modify job submission scripts to run using your HPC High-Priority Service resources by adding the following to the header of your submission script:
#SBATCH --account=a12509_ab1xyz
#SBATCH --partition=hp-a100
where a12509_ab1xyz is your HPC High-Priority Service account and hp-a100 is the A100 node high-priority partition.
#SBATCH --account=h12509_ab1xyz
#SBATCH --partition=hp-h100
where h12509_ab1xyz is your HPC High-Priority Service account and hp-h100 is the H100 node high-priority partition.
#SBATCH --account=n12510_ab1xyz
#SBATCH --partition=hp-h100-nvl
where n12510_ab1xyz is your HPC High-Priority Service account and hp-h100-nvl is the H100 NVL node high-priority partition.
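Note that the in-script headers above still need the GPU QOS and GPU count. As a complete sketch (account name hypothetical), a one-GPU A100 high-priority job header combining all four directives might look like:

```shell
#SBATCH --account=a12509_ab1xyz    # A100 HPC High-Priority Service account (hypothetical)
#SBATCH --partition=hp-a100        # A100 node high-priority partition
#SBATCH --qos=gpu                  # GPU QOS, required for GPU jobs
#SBATCH --gres=gpu:1               # request one A100 GPU
```

For H100 or H100 NVL jobs, swap in the matching account and hp-h100 or hp-h100-nvl partition.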
Managing your HPC High-Priority Service account
You may share your HPC High-Priority Service resources with other users by adding them to your account. To request this, please contact the IT Services’ Research and Innovation team.
You may pause your HPC High-Priority Service purchase in monthly increments to be resumed at a later date. To pause your purchase, please contact the IT Services’ Research and Innovation team. Keep in mind that your HPC High-Priority Service purchase on Stanage may not continue beyond the end date of Stanage.
Jobs running under your HPC High-Priority Service account may be listed by running:
squeue -A c162408_ab1xyz
where c162408_ab1xyz is your HPC High-Priority Service account.
You can list your HPC High-Priority Service accounts by running:
sacctmgr show association where user=$USER format=Account -P --noheader
In addition to any HPC High-Priority Service accounts you have access to, this list will include the free use account default.
Your HPC High-Priority Service account will expire once it has reached the end of its agreed duration. Once expired, you will no longer be able to submit or run jobs under this account. While any running jobs will finish, any submitted jobs will queue indefinitely.
Note
If your HPC High-Priority Service account has expired, you will continue to be able to submit jobs to the cluster as a regular user; you should no longer specify --account or a high-priority partition.
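If a job submitted under an expired account is stuck queuing indefinitely, one approach (job ID hypothetical) is to cancel it and resubmit without the high-priority options:

```shell
# Cancel the job stuck under the expired high-priority account,
# then resubmit the same script to the free service
# (no --account or high-priority --partition overrides).
scancel 123456
sbatch script.sh
```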
No files or folders will be deleted as part of your HPC High-Priority Service account expiring.
If you would like to extend your HPC High-Priority Service, please follow the steps of Getting a High-Priority Service account, mentioning that you would like to extend an existing account.
Troubleshooting High-Priority access
When submitting a job I receive an AssocGrpSubmitJobsLimit error
Your high-priority access has been disabled, either because you requested this or because your purchase has expired. If your purchase has expired and you wish to purchase additional high-priority access, please contact the IT Services’ Research and Innovation team.
Glossary
High-priority access
The general term referring to being able to run Slurm jobs with increased job priority and availability of compute resources.
High-priority partition
The partitions used by the HPC High-Priority Service to distinguish high-priority jobs from free jobs. These partitions control which nodes high-priority jobs run on. You must specify the high-priority partition matching your resource purchase type as part of your job submission script to make use of your HPC High-Priority Service account. See Using the High-Priority Service for a list of high-priority partitions.
HPC High-Priority Service account
The Slurm account purchased to enable high-priority access to a HPC cluster. You must specify this account as part of your job submission script to make use of your HPC High-Priority Service account. See Getting a High-Priority Service account for more information.
HPC High-Priority Service job
A Slurm job submitted using a HPC High-Priority Service account. Also referred to as a high-priority job. See also Job Submission and Control.
HPC High-Priority Service resources
The number of CPUs, amount of RAM, or number of GPUs assigned to a HPC High-Priority Service account. Jobs submitted using a HPC High-Priority Service account may not use more than these resources. See Getting a High-Priority Service account for the types of resources you can purchase.
HPC High-Priority Service
The University of Sheffield’s offering to purchase enhanced access to HPC resources, consisting of reduced queue times and increased availability of CPU, RAM, or GPU resources. This service is not free, but funds paid will be used to expand the HPC cluster. Learn more about the HPC High-Priority Service.