Using GPUs on Bessemer

Interactive use of the GPUs

Note

See requesting an interactive session on slurm if you’re not already familiar with the concept.

To start using the GPU enabled nodes interactively, type:

srun --partition=gpu --qos=gpu --nodes=1 --gpus-per-node=1 --pty bash

The --gpus-per-node=1 parameter determines how many GPUs you are requesting (just one in this case); don’t forget to specify --nodes=1 too. The maximum number of GPUs allowed per job is currently 4, as Bessemer only permits single-node jobs and its GPU nodes each contain up to 4 GPUs. If you think you would benefit from more than 4 GPUs in a single job then consider requesting access to JADE (the Tier 2 GPU cluster).
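
For example, to request all 4 GPUs in a node for a single interactive job (the current per-job maximum), you could run:

srun --partition=gpu --qos=gpu --nodes=1 --gpus-per-node=4 --pty bash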

Interactive sessions provide you with 2 GB of CPU RAM by default, which is significantly less than the amount of GPU RAM available on a single GPU. This can lead to issues where your session has insufficient CPU RAM to transfer data to and from the GPU. As such, it is recommended that you request enough CPU memory to communicate properly with the GPU:

# NB Each NVIDIA V100 GPU has 32GB of RAM
srun --partition=gpu --qos=gpu --nodes=1 --gpus-per-node=1 --mem=34G --pty bash

The above will give you 2 GB more CPU RAM than the 32 GB of GPU RAM available on the NVIDIA V100.

Note

Some private GPU nodes have only 16 GB of GPU RAM per GPU; if you use a private GPU node, check how much GPU memory is available per GPU and size your CPU memory request accordingly.
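
For example, on a private node with 16 GB of GPU RAM per GPU, the same pattern as above suggests requesting around 18G of CPU RAM. The partition and QoS below are placeholders and may differ for private nodes; substitute the values appropriate for your node.

# NB assumes a private GPU node with 16 GB of RAM per GPU;
# replace <private-partition> and <private-qos> with your node's values
srun --partition=<private-partition> --qos=<private-qos> --nodes=1 --gpus-per-node=1 --mem=18G --pty bash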

Submitting batch GPU jobs

Note

See submitting jobs on slurm if you’re not already familiar with the concept.

To run batch jobs on GPU nodes, ensure your job submission script includes a request for GPUs, e.g. --nodes=1 --gpus-per-node=1 for a single GPU:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1

#Your code below...
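
As a fuller sketch, the script below requests one GPU, enough CPU RAM to stage data for a 32 GB V100, and prints the GPU(s) allocated to the job using nvidia-smi. The memory request and output file name are illustrative and should be adapted to your workflow:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --mem=34G                # 2 GB more CPU RAM than the 32 GB of GPU RAM on a V100
#SBATCH --output=gpu_job.%j.out  # Illustrative output file name (%j expands to the job ID)

# Confirm which GPU(s) the scheduler has allocated to this job
nvidia-smi

# Your GPU application goes here (e.g. module loads and the commands to run your code)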

Requesting GPUs and multiple CPU cores from the scheduler

There are two ways of requesting multiple CPUs in conjunction with GPU requests.

  • To request multiple CPUs independently of the number of GPUs requested, the -c (--cpus-per-task) option is used:

    #!/bin/bash
    #SBATCH --partition=gpu
    #SBATCH --qos=gpu
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=2  # Requests 2 GPUs
    #SBATCH -c 2               # Requests 2 CPUs
    

    The script above requests 2 CPUs and 2 GPUs.

  • To request multiple CPUs based on the number of GPUs requested, the --cpus-per-gpu option is used:

    #!/bin/bash
    #SBATCH --partition=gpu
    #SBATCH --qos=gpu
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=2  # Requests 2 GPUs
    #SBATCH --cpus-per-gpu=2   # Requests 2 CPUs per GPU requested
    

    The script above requests 2 GPUs and 2 CPUs per GPU for a total of 4 CPUs.
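
With either approach, it can be useful to confirm from inside the job which CPUs and GPUs were actually granted. A minimal sketch, assuming GPUs are requested with --gpus-per-node as above (the exact Slurm environment variables that get set can vary with the Slurm version and with how resources are requested):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=2
#SBATCH --cpus-per-gpu=2

# Report the CPUs and GPUs the scheduler has actually allocated
echo "CPUs allocated on this node: ${SLURM_CPUS_ON_NODE}"
echo "GPUs visible to CUDA:        ${CUDA_VISIBLE_DEVICES}"
nvidia-smi --list-gpus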

Bessemer GPU Resources

GPU-enabled Software

Temporary NVIDIA A100 GPU nodes

Prior to May 2022, the Bessemer cluster featured only one ‘public’ GPU node, containing NVIDIA V100 GPUs (although private GPU nodes could be accessed using preemptable jobs by those whose workflows could tolerate preemption).

Between May and December 2022, 16 additional GPU nodes were (temporarily) available to all users of Bessemer. These featured NVIDIA A100 GPUs, which are substantially faster than the older-generation V100 GPUs in Bessemer.

Note

These nodes were removed from Bessemer in December 2022 and will be available in the University’s new HPC cluster, Stanage, early in 2023.

Specifications per A100 node:

  • Chassis: Dell XE8545

  • CPUs: 48 CPU cores from 2x AMD EPYC 7413 CPUs (AMD Milan aka AMD Zen 3 microarchitecture; 2.65 GHz; 128MB L3 cache per CPU)

  • RAM: 512 GB (3200 MT/s)

  • Local storage: 460 GB boot device (SSD) plus 2.88 TB ‘/scratch’ temporary storage (RAID 0 on SSDs)

  • GPUs: 4x NVIDIA Tesla A100, each with

    • High-bandwidth, low-latency NVLink GPU interconnects

    • 80GB memory (HBM2e)

Training materials