Using GPUs on Stanage
There are three types of GPU node on Stanage, which differ in GPU architecture (NVIDIA A100, H100, and H100 NVL), the number of GPUs per node, and GPU interconnect technology (including bandwidth) (see Stanage hardware specifications). At present you need to decide which node type to target when submitting a batch job or starting an interactive session on a worker node.
Before proceeding, ensure you’ve worked through our introductory GPU tutorial.
Interactive use of the GPUs
Note
See requesting an interactive session on slurm if you’re not already familiar with the concept.
Attention
Interactive use of GPUs is strongly discouraged, as they are a valuable and limited resource. Please use interactive GPU sessions only for short debugging, essential visualisation, or compiling GPU-enabled software. All other GPU workloads must be submitted as batch jobs.
To start an interactive session with access to one GPU on a GPU node (Stanage hardware specifications):
srun --partition=gpu --qos=gpu --gres=gpu:1 --mem=82G --pty bash
srun --partition=gpu-h100 --qos=gpu --gres=gpu:1 --mem=82G --pty bash
srun --partition=gpu-h100-nvl --qos=gpu --gres=gpu:1 --mem=96G --pty bash
Note: you can now request GPUs using --gpus=N on Stanage (as an alternative to --gres=gpu:N), following a recent Slurm upgrade.
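For example, the first interactive command above could be written equivalently using the newer syntax (same A100 partition and memory request):
srun --partition=gpu --qos=gpu --gpus=1 --mem=82G --pty bash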
Interactive sessions may default to only a few GB of host (CPU) memory, which is far less than the GPU device memory available (80 GB on A100/H100 GPUs, or 94 GB on H100 NVL GPUs). This mismatch can cause problems — for example, transfers between CPU and GPU may fail if there isn’t enough host memory available.
The examples above deliberately request slightly more CPU memory than the total memory associated with the requested GPU(s).
Please also carefully consider your --cpus-per-task and --time requests - shorter sessions tend to start sooner.
Warning
Usage of the H100 GPUs requires the --partition=gpu-h100 and --gres=gpu:1 arguments to be set in your submission scripts.
Usage of the H100 NVL GPUs requires the --partition=gpu-h100-nvl and --gres=gpu:1 arguments to be set in your submission scripts.
This is to ensure usage is “opt in” by users, as the slightly different architecture of these GPUs compared with the existing A100 GPUs may necessitate changes to batch submission scripts and selected software versions.
Submitting GPU batch jobs
Note
To run batch jobs on GPU nodes, ensure your job submission script includes a request for GPUs,
e.g. for two GPUs use --gres=gpu:2:
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --gres=gpu:2
#SBATCH --mem=82G
# Your code below...
#!/bin/bash
#SBATCH --partition=gpu-h100
#SBATCH --qos=gpu
#SBATCH --gres=gpu:2
#SBATCH --mem=82G
# Your code below...
#!/bin/bash
#SBATCH --partition=gpu-h100-nvl
#SBATCH --qos=gpu
#SBATCH --gres=gpu:2
#SBATCH --mem=96G
# Your code below...
Requesting GPUs and multiple CPU cores from the scheduler
To request four separate Slurm tasks within a job, each of which has eight CPU cores and with four (A100) GPUs available to the entire job (shared between tasks):
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:4 # 4 GPUs for job
Note that:
The GPUs are (unintuitively) shared between the Slurm tasks (see the check after this list).
It’s not possible to request --gpus-per-node, --gpus-per-task or --gpus-per-socket on Stanage at this time.
Not all nodes have four GPUs (Stanage hardware specifications).
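As a quick way to see this sharing in practice (a minimal sketch using standard Slurm and NVIDIA tools; adapt it to your own job), each task in the job above can list the GPUs visible to it:
# Each of the 4 tasks reports which GPUs it can see
srun --ntasks=4 bash -c 'echo "Task $SLURM_PROCID sees:"; nvidia-smi -L'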
Architecture differences
CPU Architecture
Stanage GPU nodes contain a mixture of CPU architectures. The A100 and H100 GPU nodes use AMD-based CPUs while the H100 NVL GPU nodes use Intel-based CPUs. In some cases your software may need to be re-compiled depending on which GPU node it is run on.
GPU Architecture
The A100 GPUs are based on the Ampere architecture (sm80) while the H100 and H100 NVL GPUs are based on the slightly newer Hopper architecture (sm90). In some cases your software may need to be re-compiled depending on which GPU it is run on.
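If you compile your own CUDA code with nvcc, one option is to build a fat binary targeting both architectures so the same executable runs on any of the GPU node types (a minimal sketch; my_app.cu is a placeholder source file and your module/toolchain setup will differ):
# Target Ampere (sm80, A100) and Hopper (sm90, H100 / H100 NVL) in one binary
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -o my_app my_app.cu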
Note
While both the H100-NVL and H100 GPU nodes contain “H100” GPUs, the H100 NVL variant in the H100-NVL nodes is substantially different from the H100 GPUs in the H100 nodes. The H100 NVL is a newer GPU featuring 15% more CUDA and tensor cores alongside 95% more memory bandwidth than the older H100 GPU. In some workloads the H100 NVL may be up to 50% faster in practice.
Stanage GPU Resources
GPU-enabled Software
Applications
None yet
Libraries
Development Tools
Choosing appropriate GPU compute resources
On Stanage, GPU nodes are shared resources. While GPU requests are the most obvious factor, your CPU and RAM requests also determine whether other jobs can run on the same node and how efficiently the scheduler can pack work onto the hardware.
Caution
Request what you need, not just whatever happens to work. Over-requesting GPU, CPU, or RAM makes scheduling slower, wastes capacity, and blocks other users.
Below are sensible per-GPU guidelines to help you choose the right GPU type and request reasonable CPU and RAM. Following these suggestions will help reduce the risk of blocking others from using resources.
Note
When we say memory/RAM below, we mean host (CPU) memory unless we explicitly say GPU device memory.
Host (CPU) memory = --mem (Slurm can kill your job if you exceed this).
GPU device memory = memory on the GPU itself (errors usually mention CUDA and “out of memory”).
GPU node types at a glance
| Type | GPU Device Mem (GB) | GPUs/Node | Memory/Node (GB) | CPUs/Node |
|---|---|---|---|---|
| A100 | 80 | 4 | 503 | 48 |
| H100 | 80 | 2 | 503 | 48 |
| H100 NVL | 94 | 4 | 503 | 96 |
We recommend:
Min host RAM / CPU per GPU: a sensible baseline to avoid common bottlenecks and keep GPU jobs running smoothly (especially for data loading and preprocessing when the host code is parallelised).
Max host RAM / CPU per GPU: the upper bound for fair sharing and efficient scheduling.
| Type | Min RAM/GPU (GB) | Max RAM/GPU (GB) | Min CPUs/GPU | Max CPUs/GPU |
|---|---|---|---|---|
| A100 | 82 | 120 | 8 | 12 |
| H100 | 82 | 240 | 12 | 24 |
| H100 NVL | 96 | 120 | 12 | 24 |
The minimum recommended host memory (RAM) per GPU is intentionally slightly higher than the GPU device memory.
If you don’t know what you need yet
Start at the recommended minimum (see the example script after this list)
Increase host memory only if your job fails with host OOM (Slurm “Out Of Memory”)
If you hit GPU device OOM (CUDA “out of memory”), reduce the problem size or use a GPU with more device memory
If your GPU utilisation is low due to data loading/preprocessing, you may need more CPUs if your workflow supports parallelism (e.g. multiple dataloader workers)
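As a starting point, a single-GPU A100 job at the recommended minimums from the table above could look like this (a sketch; adjust the partition, CPUs and memory for other GPU types):
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=82G
# Your code below...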
Tip
We recommend monitoring your job’s GPU utilisation. See Monitoring GPU usage, which includes a tip for live monitoring.
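One way to watch a running job’s GPUs live is to attach an extra step to it (a sketch; replace <JOBID> with your job’s ID, and the --overlap flag may or may not be required depending on the Slurm configuration):
# Refresh nvidia-smi output every 5 seconds inside the running job's allocation
srun --jobid=<JOBID> --overlap --pty nvidia-smi --loop=5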
Why memory requests matter (especially on A100 and H100 NVL)
A100 and H100 NVL nodes have 4 GPUs sharing the same 503 GB of host (CPU) RAM. If one job asks for too much host memory per GPU, it can prevent other users’ jobs from fitting on the node even when GPUs are still free, resulting in GPU resource stranding.
A100 or H100 NVL GPU resource stranding example (4 GPUs, 503 GB RAM)
If three 1-GPU jobs request 150 GB each, then:
150 + 150 + 150 = 450 GB used
503 - 450 = 53 GB remaining
A fourth job cannot start if it requests more than 53 GB, even though one GPU is still available. This remaining memory (53 GB) is below our minimum recommended memory per GPU for both A100 and H100 NVL nodes.
Resource stranding refers to GPUs or other resources that are unable to be used because a complementary resource, such as RAM in the example above, is unavailable. Particularly on A100 and H100 NVL nodes, being greedy with RAM risks GPUs becoming stranded and unusable for the duration of other jobs.
This is why we encourage users to request only what they need, and to use the recommended per-GPU memory guidance where appropriate.
Note
Memory matters on H100 nodes too — but because there are only 2 GPUs per node, GPU resource stranding is less likely unless the job requests very high RAM.
Hard maximums for host memory per GPU
These are the upper limits you can technically request while still (potentially) leaving enough resources on the node for other users. Only use these maximums if your workflow absolutely requires them.
These maximum limits are approximate, and assume there is still enough RAM left on the node for at least one other minimum-sized GPU job to run (with a small amount of headroom).
A100 extreme max: 2 GPUs at 82G and 2 GPUs at 165G can coexist
H100 extreme max: 1 GPU at 82G and 1 GPU at 400G can coexist
H100 NVL extreme max: 2 GPUs at 96G and 2 GPUs at 150G can coexist
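As a worked check against the 503 GB of RAM per node: on an A100 node, 2 × 82 GB + 2 × 165 GB = 494 GB, leaving roughly 9 GB of headroom; on an H100 node, 82 GB + 400 GB = 482 GB; on an H100 NVL node, 2 × 96 GB + 2 × 150 GB = 492 GB.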
These are guidelines rather than hard limits — the maximums are intentionally below the theoretical per-GPU maximum to leave a small amount of headroom for the node OS and other processes.
If you need to request more than these values, you will likely prevent other jobs from sharing the node and may leave GPUs unused. At that point, consider whether your workflow should be adjusted (e.g. fewer concurrent jobs per node, or different resource choices) to make better use of the allocated hardware.
Important
For advice, consult our IT Services’ Research and Innovation team early in your workflow.
GPU selection decision tree
This section provides a rule-of-thumb decision tree to help you choose between GPU node types based on performance, GPU device memory, and how well your job can share a node with other users.
GPU selection isn’t always clear-cut, so treat this as guidance rather than a definitive answer — if you’re unsure, start with the recommended minimum resources and adjust based on your job’s behaviour.
Note
Firstly you should consider the availability of each GPU type:
A100 (52) ████████████████████████████████████████████████████
H100 (12) ████████████
H100 NVL (16) ████████████████
Key: █ = 1 GPU
See also: Available GPU resources
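To check the current state of the GPU partitions yourself, a sinfo query along these lines may help (a sketch; the format string prints partition, generic resources, node count and node state):
sinfo --partition=gpu,gpu-h100,gpu-h100-nvl --format="%P %G %D %T"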
Start
|
+-- Can your code actually use more than 1 GPU?
| |
| +-- No --> Choose any suitable single-GPU option
| | |
| | |-- Need >80 GB GPU device memory?
| | | |
| | | +-- Yes --> H100 NVL
| | | +-- No --> A100 or H100
| | |
| | |-- Need very high host (CPU) RAM per GPU (>120 GB/GPU)?
| | |
| | +-- Yes --> Prefer H100 (lower stranding risk)
| | +-- No --> Any
| |
| +-- Yes --> (multi-GPU)
| |
| |-- Need 3–4 GPUs on one node?
| | |
| | +-- Yes --> A100 or H100 NVL
| | +-- No --> Any
| |
| +-- Need higher GPU-to-GPU bandwidth?
| |
| +-- Yes --> Prefer A100 (NVLink) or H100 NVL
| +-- No --> Any
|
+-- Need the fastest GPU compute performance / likely to hit the 96h walltime limit?
|
+-- Yes --> Prefer H100 NVL
+-- No --> Any
Training materials
The Research Software Engineering team have developed an undergraduate teaching module on CUDA;
lecture notes and lecture recordings for that module are accessible here for anyone with a University account.