Using GPUs on Stanage

There are two types of GPU node in Stanage, which differ in GPU architecture (NVIDIA A100 vs H100), the number of GPUs per node and the GPU interconnect technology (including bandwidth) (see Stanage hardware specifications). At present you need to decide which node type to target when submitting a batch job or starting an interactive session on a worker node.

Interactive use of the GPUs

Note

See requesting an interactive session on slurm if you’re not already familiar with the concept.

To start an interactive session with access to one GPU on a GPU node (Stanage hardware specifications):

srun --partition=gpu --qos=gpu --gres=gpu:a100:1 --pty bash

Note it’s not possible to request GPUs using --gpus=N on Stanage at this time (unlike on Bessemer).

Warning

During the 2-week introduction phase of the H100 GPUs on the Stanage cluster, using the H100 GPUs requires the --partition=gpu-h100 and --gres=gpu:1 arguments to be set in your submission scripts. This ensures that usage is “opt in”, as the slightly different architecture of these GPUs compared with the existing A100 GPUs may necessitate changes to batch submission scripts and to the software versions selected.

Eventually the H100 GPUs will be brought into the general GPU partition, at which point --partition=gpu will be required to access H100s (or any GPUs). At that stage any submission using the generic --gres=gpu:1 will be scheduled on the first available GPU of any type; requesting a specific type of GPU will then require selection via the --gres=gpu:h100:1 or --gres=gpu:a100:1 arguments.

When these latter changes are made, we will give advance notice via email and via amendments to this documentation.
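Based on the flags described above, an interactive session on an H100 node during the introduction phase could be requested as follows (a sketch; the --qos=gpu argument is assumed to carry over from the A100 example):

```shell
# H100 introduction phase only: note the dedicated gpu-h100 partition
# and the generic GPU request (no GPU type is given to --gres here)
srun --partition=gpu-h100 --qos=gpu --gres=gpu:1 --pty bash
```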

Interactive sessions provide you with 2 GB of CPU RAM by default, which is significantly less than the amount of GPU RAM available on a single GPU. This can lead to issues where your session has insufficient CPU RAM to transfer data to and from the GPU. As such, it is recommended that you request enough CPU memory to communicate properly with the GPU, e.g.:

# NB Each NVIDIA A100 (and H100) GPU in Stanage has 80GB of GPU RAM
srun --partition=gpu --qos=gpu --gres=gpu:1 --mem=82G --pty bash

The above will give you 2 GB more CPU RAM than the 80 GB of GPU RAM available on an NVIDIA A100 (or H100) GPU.
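Once the session starts, you can confirm which GPU has been allocated. For example (a sketch; exact output depends on the node you land on):

```shell
# Show the allocated GPU(s), driver version and GPU memory usage
nvidia-smi

# Slurm also exposes the allocated GPU index/indices to your session
echo "$CUDA_VISIBLE_DEVICES"
```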

Submitting GPU batch jobs

Note

See submitting jobs on slurm if you’re not already familiar with the concept.

To run batch jobs on GPU nodes, ensure your job submission script includes a request for GPUs, e.g. for two GPUs use --gres=gpu:2:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --gres=gpu:2
#SBATCH --mem=82G

# Your code below...
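If the script above is saved as, say, gpu_job.sh (a hypothetical filename), it can then be submitted to the scheduler with:

```shell
sbatch gpu_job.sh
```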

Requesting GPUs and multiple CPU cores from the scheduler

To request four separate Slurm tasks within a job, each of which has four CPU cores and with four (A100) GPUs available to the entire job (shared between tasks):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:4       # 4 GPUs for job

Note that:

  • The GPUs are (unintuitively) shared between the Slurm tasks.

  • It’s not possible to request --gpus-per-node, --gpus-per-task or --gpus-per-socket on Stanage at this time (unlike on Bessemer).

  • Not all nodes have four GPUs (Stanage hardware specifications).
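To illustrate the first point above, each task launched with srun inside such a job sees the whole job-level GPU allocation (a sketch; the echoed message is illustrative, not from Stanage):

```shell
# Launch one copy of the command per Slurm task (4 tasks here);
# each task reports the same GPU indices, since the job-level
# --gres=gpu:4 allocation is shared between tasks
srun bash -c 'echo "task $SLURM_PROCID sees GPUs: $CUDA_VISIBLE_DEVICES"'
```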

Stanage GPU Resources

GPU-enabled Software

Training materials