Using GPUs on Stanage

There are three types of GPU node on Stanage, which differ in GPU architecture (NVIDIA A100, H100, and H100 NVL), the number of GPUs per node, and GPU interconnect technology (including bandwidth) (see Stanage hardware specifications). At present you need to decide which node type to target when submitting a batch job or starting an interactive session on a worker node.

Before proceeding, ensure you’ve worked through our introductory GPU tutorial.

Interactive use of the GPUs

Note

See requesting an interactive session on slurm if you’re not already familiar with the concept.

Attention

Interactive use of GPUs is strongly discouraged, as they are a valuable and limited resource. Please use interactive GPU sessions only for short debugging, essential visualisation, or compiling GPU-enabled software. All other GPU workloads must be submitted as batch jobs.

To start an interactive session with access to one GPU on a GPU node (Stanage hardware specifications):

srun --partition=gpu --qos=gpu --gres=gpu:1 --mem=82G --pty bash

Note: you can now request GPUs using --gpus=N on Stanage (as an alternative to --gres=gpu:N), following a recent Slurm upgrade.
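
For example, under the newer syntax the same session could be requested as follows (a minimal sketch, assuming --gpus is available as described in the note above):

srun --partition=gpu --qos=gpu --gpus=1 --mem=82G --pty bash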

Interactive sessions may default to only a few GB of host (CPU) memory, which is far less than the GPU device memory available (80 GB on A100/H100 GPUs, or 94 GB on H100 NVL GPUs). This mismatch can cause problems — for example, transfers between CPU and GPU may fail if there isn’t enough host memory available.

The examples above deliberately request slightly more host (CPU) memory than the device memory of the requested GPU(s).

Please also consider your --cpus-per-task and --time requests carefully: shorter sessions tend to start sooner.
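
As an illustration, a short debugging session on one GPU with four CPU cores and a one-hour limit might look like the following (the core count and time limit here are arbitrary examples, not recommendations):

srun --partition=gpu --qos=gpu --gres=gpu:1 --cpus-per-task=4 --time=01:00:00 --mem=82G --pty bash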

Warning

  • Usage of the H100 GPUs requires the --partition=gpu-h100 and --gres=gpu:1 arguments to be set in your submission scripts.

  • Usage of the H100 NVL GPUs requires the --partition=gpu-h100-nvl and --gres=gpu:1 arguments to be set in your submission scripts.

This is to ensure usage is “opt in”: these GPUs have a slightly different architecture from the existing A100 GPUs, which may necessitate changes to batch submission scripts and selected software versions.
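
For example, a minimal batch script header targeting a single H100 GPU might look like the sketch below; we assume here that the same --qos=gpu used in the A100 examples also applies on the H100 partitions, so check the Stanage hardware specifications page if unsure:

#!/bin/bash
#SBATCH --partition=gpu-h100   # or gpu-h100-nvl for the H100 NVL nodes
#SBATCH --qos=gpu              # assumed to match the A100 examples; verify before use
#SBATCH --gres=gpu:1
#SBATCH --mem=82G

# Your code below...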

Submitting GPU batch jobs

Note

See submitting jobs on slurm if you’re not already familiar with the concept.

Each user can use at most 12 GPUs concurrently (A100 + H100 combined); further jobs will wait until prior ones release GPUs. (Reduced from 16 in Aug 2025.)

To run batch jobs on GPU nodes, ensure your job submission script includes a request for GPUs, e.g. for two GPUs use --gres=gpu:2:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --gres=gpu:2
#SBATCH --mem=82G

# Your code below...

Requesting GPUs and multiple CPU cores from the scheduler

To request four separate Slurm tasks within a job, each with eight CPU cores, and four (A100) GPUs available to the entire job (shared between tasks):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:4       # 4 GPUs for job

Note that:

  • The GPUs are (unintuitively) shared between the Slurm tasks (see the sketch after this list).

  • It’s not possible to request --gpus-per-node, --gpus-per-task or --gpus-per-socket on Stanage at this time.

  • Not all nodes have four GPUs (Stanage hardware specifications).
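
Because the GPUs are shared between the Slurm tasks, each task sees the job’s GPUs rather than a dedicated device. One common pattern (a sketch, not a Stanage-specific requirement; ./my_gpu_program is a placeholder) is to bind each task to a single GPU using its local task ID:

# Launch 4 tasks; restrict each task to one of the job's 4 GPUs.
# SLURM_LOCALID is the task's index on the node (0-3 here).
srun --ntasks=4 bash -c 'export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID; ./my_gpu_program'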

Architecture differences

CPU Architecture

Stanage GPU nodes contain a mixture of CPU architectures. The A100 and H100 GPU nodes use AMD-based CPUs while the H100 NVL GPU nodes use Intel-based CPUs. In some cases your software may need to be re-compiled depending on which GPU node it is run on.

GPU Architecture

The A100 GPUs are based on the Ampere architecture (sm80) while the H100 and H100 NVL GPUs are based on the slightly newer Hopper architecture (sm90). In some cases your software may need to be re-compiled depending on which GPU it is run on.
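
If you compile your own CUDA code, one way to avoid recompiling per node type is to build a fat binary that targets both architectures. A minimal sketch (my_kernel.cu and my_program are placeholder names; load a suitable CUDA module first, see GPU-enabled Software):

# Build for both Ampere (sm_80) and Hopper (sm_90) in a single binary
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -o my_program my_kernel.cu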

Note

While both the H100-NVL and H100 GPU nodes contain “H100” GPUs, the H100 NVL variant in the H100-NVL nodes is substantially different from the H100 GPUs in the H100 nodes. The H100 NVL is a newer GPU with 15% more CUDA and tensor cores and 95% more memory bandwidth than the older H100. In some workloads the H100 NVL may be up to 50% faster in practice.

Stanage GPU Resources

GPU-enabled Software

Choosing appropriate GPU compute resources

On Stanage, GPU nodes are shared resources. While GPU requests are the most obvious factor, your CPU and RAM requests also determine whether other jobs can run on the same node and how efficiently the scheduler can pack work onto the hardware.

Caution

Request what you need, not just whatever happens to work. Over-requesting GPU, CPU, or RAM makes scheduling slower, wastes capacity, and blocks other users.

Below are sensible per-GPU guidelines to help you choose the right GPU type and request reasonable CPU and RAM. Following these suggestions will help reduce the risk of blocking others from using resources.

Note

When we say memory/RAM below, we mean host (CPU) memory unless we explicitly say GPU device memory.

  • Host (CPU) memory = --mem (Slurm can kill your job if you exceed this).

  • GPU device memory = memory on the GPU itself (errors usually mention CUDA and “out of memory”).
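
If you are unsure which limit you are hitting, you can inspect both from inside a job. A minimal sketch (SLURM_MEM_PER_NODE is only set when memory is requested with --mem):

# Host (CPU) memory granted to the job by Slurm, in MB (from --mem)
echo "Host memory for this job: ${SLURM_MEM_PER_NODE} MB"

# GPU device memory is fixed by the hardware; query it with nvidia-smi
nvidia-smi --query-gpu=name,memory.total --format=csv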

GPU node types at a glance

Type       GPU Device Mem (GB)   GPUs/Node   Memory/Node (GB)   CPUs/Node
---------  --------------------  ----------  -----------------  ---------
A100       80                    4           503                48
H100       80                    2           503                48
H100 NVL   94                    4           503                96

We recommend

  • Min host RAM / CPU per GPU: a sensible baseline to avoid common bottlenecks and keep GPU jobs running smoothly (especially for data loading and preprocessing when the host code is parallelised).

  • Max host RAM / CPU per GPU: the upper bound for fair sharing and efficient scheduling.

Type       Min RAM/GPU (GB)   Max RAM/GPU (GB)   Min CPUs/GPU   Max CPUs/GPU
---------  -----------------  -----------------  -------------  ------------
A100       82                 120                8              12
H100       82                 240                12             24
H100 NVL   96                 120                12             24

The minimum recommended host memory (RAM) per GPU is intentionally slightly higher than the GPU device memory.
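
Putting the recommended minimums into practice, a batch job using a single A100 GPU might request the following (a sketch based on the table above; adjust the partition, cores, and memory for other GPU types):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8   # recommended minimum CPUs per A100 GPU
#SBATCH --mem=82G           # recommended minimum host RAM per A100 GPU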

If you don’t know what you need yet

  • Start at the recommended minimum

  • Increase host memory only if your job fails with host OOM (Slurm “Out Of Memory”)

  • If you hit GPU device OOM (CUDA “out of memory”), reduce the problem size or use a GPU with more device memory

  • If your GPU utilisation is low due to data loading/preprocessing, you may need more CPUs if your workflow supports parallelism (e.g. multiple dataloader workers)

Tip

  • We recommend monitoring your job’s GPU utilisation. See Monitoring GPU usage, which includes a tip for live monitoring.
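
As a quick sketch of live monitoring (the Monitoring GPU usage page above is the authoritative reference), you can watch GPU utilisation and device memory from a shell on the node running your job:

# Refresh nvidia-smi output every 5 seconds
watch -n 5 nvidia-smi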

Why memory requests matter (especially on A100 and H100 NVL)

A100 and H100 NVL nodes have 4 GPUs sharing the same 503 GB of host (CPU) RAM. If one job asks for too much host memory per GPU, it can prevent other users’ jobs from fitting on the node even when GPUs are still free, resulting in GPU resource stranding.

A100 or H100 NVL GPU resource stranding example (4 GPUs, 503 GB RAM)

If three 1-GPU jobs request 150 GB each, then:

  • 150 + 150 + 150 = 450 GB used

  • 503 - 450 = 53 GB remaining

A fourth job cannot start if it requests more than 53 GB, even though one GPU is still available. This remaining memory (53 GB) is below our minimum recommended memory per GPU for both A100 and H100 NVL nodes.

Resource stranding refers to GPUs (or other resources) that cannot be used because a complementary resource, such as RAM in the example above, is unavailable. On A100 and H100 NVL nodes in particular, over-requesting RAM risks leaving GPUs stranded and unusable for the duration of those jobs.

This is why we encourage users to request only what they need, and to use the recommended per-GPU memory guidance where appropriate.

Note

Memory matters on H100 nodes too — but because there are only 2 GPUs per node, GPU resource stranding is less likely unless the job requests very high RAM.

Hard maximums for host memory per GPU

These are the upper limits you can request while still, in principle, leaving room on the node for other users’ jobs. You should only use these maximums if your workflow absolutely requires them.

These maximum limits are approximate, and assume there is still enough RAM left on the node for at least one other minimum-sized GPU job to run (with a small amount of headroom).

  • A100 extreme max: 2 GPUs at 82G and 2 GPUs at 165G can coexist

  • H100 extreme max: 1 GPU at 82G and 1 GPU at 400G can coexist (see the sketch below)

  • H100 NVL extreme max: 2 GPUs at 96G and 2 GPUs at 150G can coexist

These are guidelines rather than hard limits — the maximums are intentionally below the theoretical per-GPU maximum to leave a small amount of headroom for the node OS and other processes.

If you need to request more than these values, you will likely prevent other jobs from sharing the node and may leave GPUs unused. At that point, consider whether your workflow should be adjusted (e.g. fewer concurrent jobs per node, or different resource choices) to make better use of the allocated hardware.
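
As an illustration of the H100 “extreme max” case above (only appropriate if your workflow genuinely requires it), the large-memory job’s header might look like the following sketch (we again assume --qos=gpu applies on the H100 partition):

#!/bin/bash
#SBATCH --partition=gpu-h100
#SBATCH --qos=gpu            # assumed to match the A100 examples; verify before use
#SBATCH --gres=gpu:1
#SBATCH --mem=400G           # of 503G on the node: ~100G remains, enough for one 82G minimum GPU job plus headroom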

Important

For advice, consult our IT Services’ Research and Innovation team early in your workflow.

GPU selection decision tree

This section provides a rule-of-thumb decision tree to help you choose between GPU node types based on performance, GPU device memory, and how well your job can share a node with other users.

GPU selection isn’t always clear-cut, so treat this as guidance rather than a definitive answer — if you’re unsure, start with the recommended minimum resources and adjust based on your job’s behaviour.

Note

Firstly you should consider the availability of each GPU type:

A100      (52)  ████████████████████████████████████████████████████
H100      (12)  ████████████
H100 NVL  (16)  ████████████████
Key: █ = 1 GPU

See also: Available GPU resources

Start
 |
 +-- Can your code actually use more than 1 GPU?
 |        |
 |        +-- No --> Choose any suitable single-GPU option
 |        |           |
 |        |           |-- Need >80 GB GPU device memory?
 |        |           |        |
 |        |           |        +-- Yes --> H100 NVL
 |        |           |        +-- No --> A100 or H100
 |        |           |
 |        |           |-- Need very high host (CPU) RAM per GPU (>120 GB/GPU)?
 |        |                    |
 |        |                    +-- Yes --> Prefer H100 (lower stranding risk)
 |        |                    +-- No --> Any
 |        |
 |        +-- Yes --> (multi-GPU)
 |                    |
 |                    |-- Need 3–4 GPUs on one node?
 |                    |        |
 |                    |        +-- Yes --> A100 or H100 NVL
 |                    |        +-- No --> Any
 |                    |
 |                    +-- Need higher GPU-to-GPU bandwidth?
 |                             |
 |                             +-- Yes --> Prefer A100 (NVLink) or H100 NVL
 |                             +-- No --> Any
 |
 +-- Need the fastest GPU compute performance / likely to hit 96h wall?
          |
          +-- Yes --> Prefer H100 NVL
          +-- No --> Any

Training materials