CUDA
CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows software developers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing, an approach known as General Purpose GPU (GPGPU) computing.
Usage
You need to first request one or more GPUs within an interactive session or batch job on a worker node.
To request three of any available public GPUs for a batch job, include all of the following in the header of your submission script:
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:3
Note
See Using GPUs on Stanage for more information on how to request a GPU-enabled node for an interactive session or job submission.
You then need to ensure a version of the CUDA library (and compiler) is loaded. As with much software installed on the cluster, versions of CUDA are activated via the ‘module load’ command.
To load one of the currently available CUDA versions you can run one of the following commands:
module load CUDA/12.4.0
module load CUDA/12.1.1
module load CUDA/12.0.0
module load CUDA/11.8.0
module load CUDA/11.7.0
module load CUDA/11.1.1-GCC-10.2.0
module load CUDA/10.2.89-GCC-8.3.0
module load CUDA/10.1.243-GCC-8.3.0
module load CUDA/10.1.243
module load CUDA/10.1.105-GCC-8.2.0-2.31.1
module load CUDA/10.0.130
Note that older versions of CUDA may implicitly load the GCC compiler; for newer versions you will also need to explicitly load a compiler, e.g. GCC.
Confirm which version of CUDA you are using via nvcc --version, e.g.:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Mon_Oct_24_19:12:58_PDT_2022
Cuda compilation tools, release 12.0, V12.0.76
Build cuda_12.0.r12.0/compiler.31968024_0
Compiling a simple CUDA program
An example of the use of nvcc (the CUDA compiler):
nvcc filename.cu
This will compile the CUDA program contained in the file filename.cu.
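For reference, a minimal CUDA program suitable for such a test compilation might look like the following. This is an illustrative sketch (the kernel and variable names are not taken from the NVIDIA samples); note that while it can be compiled on any node, running it requires a GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified (managed) memory keeps the example short;
    // explicit cudaMemcpy to/from device buffers also works.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();  // wait for the kernel to finish

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Save this as filename.cu and compile it with nvcc filename.cu as above.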
Compiling the sample programs
You do not need to be using a GPU-enabled node to compile the sample programs but you do need at least one GPU to run them.
In this demonstration, we create a batch job that:
Requests one GPU, a single CPU core and 8GB RAM
Loads a module to provide CUDA 12.0
Downloads compatible NVIDIA CUDA sample programs
Compiles and runs an example that performs a matrix multiplication
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --gres=gpu:1 # Number of GPUs
#SBATCH --mem=8G
#SBATCH --time=0-00:05 # time (DD-HH:MM)
#SBATCH --job-name=gputest
module load CUDA/12.0.0
mkdir -p $HOME/examples
cd $HOME/examples
if ! [[ -d cuda-samples/.git ]]; then
git clone https://github.com/NVIDIA/cuda-samples.git cuda-samples
fi
cd cuda-samples
git checkout tags/v12.0 # use sample programs compatible with CUDA 12.0
cd Samples/0_Introduction/matrixMul/
make SMS="80" # build for compute capability 8.0 (A100)
./matrixMul
GPU Code Generation Options
To achieve the best possible performance whilst remaining portable, GPU code should be generated for the architecture(s) it will be executed upon.
This is controlled by specifying -gencode arguments to NVCC which, unlike the -arch and -code arguments, allow for 'fatbinary' executables that are optimised for multiple device architectures.
Each -gencode argument requires two values, the virtual architecture and the real architecture, for use in NVCC's two-stage compilation. For example, -gencode=arch=compute_70,code=sm_70 specifies a virtual architecture of compute_70 and a real architecture of sm_70.
To support future hardware of higher compute capability, an additional -gencode argument can be used to enable Just in Time (JIT) compilation of embedded intermediate PTX code. This argument should use the highest virtual architecture specified in the other -gencode arguments for both the arch and code values, i.e. -gencode=arch=compute_80,code=compute_80.
The minimum specified virtual architecture must be less than or equal to the Compute Capability of the GPU used to execute the code.
At present, Stanage contains NVIDIA A100 GPUs and NVIDIA H100 GPUs, which are compute capability 8.0 (sm_80) and 9.0 (sm_90) respectively.
Note
CUDA 10.x is not aware of compute capability 80, and CUDA < 11.8 is not aware of compute capability 90. PTX for an older architecture should be embedded instead to avoid compilation errors.
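If you are unsure of the compute capability of the GPU(s) you have been allocated, it can be queried at runtime. The following sketch uses the CUDA runtime API's cudaGetDeviceProperties (run it on a GPU node):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // e.g. an A100 reports compute capability 8.0, an H100 reports 9.0
        printf("Device %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```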
To build a CUDA application which targets both H100 and A100 GPUs, use the following -gencode arguments:
nvcc filename.cu \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_90,code=sm_90 \
-gencode=arch=compute_90,code=compute_90
To target only the A100 GPUs directly, while embedding PTX for forward compatibility with newer architectures via JIT compilation:
nvcc filename.cu \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_80,code=compute_80
To embed only PTX for virtual architecture compute_70, which will be JIT-compiled at runtime on any GPU of compute capability 7.0 or newer:
nvcc filename.cu \
-gencode=arch=compute_70,code=compute_70
Further details of these compiler flags can be found in the NVCC Documentation, along with details of the supported virtual architectures and real architectures.
Documentation
Nsight Systems
Nsight Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms and identify the largest opportunities to optimize. It supports Pascal (SM 60) and newer GPUs.
A common use-case for Nsight Systems is to generate application timelines via the command line,
which can later be visualised on a local computer using the GUI component.
The Nsight Systems executable, nsys, is provided by loading a version of CUDA using a module file.
Note
You should use a version of nsys that is at least as new as the CUDA toolkit used to compile your application (if appropriate).
However, be aware that the nsys provided with CUDA >= 11.8 (and <= 12.0) is buggy and should not be used.
To generate an application timeline with Nsight Systems CLI (nsys):
nsys profile -o timeline ./myapplication <arguments>
Nsight Systems can trace multiple APIs, such as CUDA and OpenACC. The --trace argument specifies which APIs should be traced.
See the nsys profiling command switch options for further information.
nsys profile -o timeline --trace cuda,nvtx,osrt,openacc ./myapplication <arguments>
Once the timeline.qdrep file has been downloaded to your local machine, it can be opened in nsys-ui/nsight-sys via File > Open > timeline.qdrep
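The nvtx trace option above is most useful if your application has been annotated with NVTX ranges, which appear as named regions on the nsys timeline. A minimal sketch using the NVTX v3 C API shipped with the CUDA toolkit (the function being annotated is a hypothetical example):

```cuda
#include <nvtx3/nvToolsExt.h>  // NVTX v3 header bundled with the CUDA toolkit

// A hypothetical stage of your application worth highlighting on the timeline.
void expensive_step() {
    // ... kernels, memory transfers, host work ...
}

int main() {
    nvtxRangePushA("expensive_step");  // open a named range
    expensive_step();
    nvtxRangePop();                    // close the most recently opened range
    return 0;
}
```

With the v3 header-only API no additional library should be needed when compiling with nvcc; older code using the legacy nvToolsExt.h header may also need to link against -lnvToolsExt.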
Nsight Compute
Nsight Compute is a kernel profiler for CUDA applications, which can also be used for API debugging. It supports Volta (SM 70) and newer GPUs.
A common use-case for using Nsight Compute is to capture all available profiling metrics via the command line, which can later be analysed on a local computer using the GUI component.
Note
If you want to perform CUDA kernel profiling on this cluster you need to explicitly request (via research-it@sheffield.ac.uk) that it be enabled for you for a certain number of GPUs over a certain time period; otherwise attempts to use tools like Nsight Compute will result in permissions errors (ERR_NVGPUCTRPERM).
Nsight Compute, ncu, is provided by loading a version of CUDA using a module file.
You should use a version of ncu that is at least as new as the CUDA toolkit used to compile your application.
To generate the default set of profile metrics with Nsight Compute CLI (ncu):
ncu -o metrics ./myapplication <arguments>
Nsight Compute can capture many different metrics, which are used to generate the different sections of the profiling report. The --set argument can be used to control which set of metrics and sections are captured. See the Nsight Compute CLI Command Line Options for further information.
ncu -o metrics --set full ./myapplication <arguments>
Once this file has been downloaded to your local machine, it can be opened in ncu-ui/nv-nsight-cu via File > Open File > metrics.ncu-rep
Profiling using nvprof
nvprof is not supported on the GPUs in Stanage (it does not support NVIDIA architectures >= SM80); please use Nsight Systems and Nsight Compute instead.
CUDA Training
The Research Software Engineering team have developed an undergraduate teaching module on CUDA; lecture notes and lecture recordings for that module are accessible here for anyone with a University account.
Determining the NVIDIA Driver version
Run the command:
cat /proc/driver/nvidia/version
Example output is:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 525.105.17 Tue Mar 28 18:02:59 UTC 2023
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)
Installation notes
These are primarily for system administrators.
CUDA 12.0.0
Installed as a dependency of the cuDNN-8.8.0.121-CUDA-12.0.0.eb easyconfig.
Single GPU and compiler testing was conducted as above in the matrixMul batch job.
CUDA 11.8.0
Installed as a dependency of the cuDNN-8.6.0.163-CUDA-11.8.0.eb easyconfig.
Single GPU and compiler testing was conducted as above in the matrixMul batch job.
CUDA 11.7.0
Installed as a dependency of the cuDNN-8.4.1.50-CUDA-11.7.0.eb easyconfig.
Single GPU and compiler testing was conducted as above in the matrixMul batch job.
Inter-GPU performance was tested on all 4x A100 devices in gpu01 using the NCCL all_reduce_perf benchmark test (provided by the NCCL-tests/2.13.6-GCC-11.3.0-CUDA-11.7.0 module), which was run using:
all_reduce_perf -b 8 -e 128M -f 2 -g 4
Results:
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 29697 on gpu01 device 0 [0x01] NVIDIA A100-SXM4-80GB
# Rank 1 Group 0 Pid 29697 on gpu01 device 1 [0x41] NVIDIA A100-SXM4-80GB
# Rank 2 Group 0 Pid 29697 on gpu01 device 2 [0x81] NVIDIA A100-SXM4-80GB
# Rank 3 Group 0 Pid 29697 on gpu01 device 3 [0xc1] NVIDIA A100-SXM4-80GB
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 15.29 0.00 0.00 0 14.64 0.00 0.00 0
16 4 float sum -1 14.72 0.00 0.00 0 14.96 0.00 0.00 0
32 8 float sum -1 14.48 0.00 0.00 0 14.67 0.00 0.00 0
64 16 float sum -1 15.52 0.00 0.01 0 14.51 0.00 0.01 0
128 32 float sum -1 14.73 0.01 0.01 0 14.81 0.01 0.01 0
256 64 float sum -1 14.85 0.02 0.03 0 14.20 0.02 0.03 0
512 128 float sum -1 14.89 0.03 0.05 0 14.91 0.03 0.05 0
1024 256 float sum -1 14.50 0.07 0.11 0 14.58 0.07 0.11 0
2048 512 float sum -1 15.01 0.14 0.20 0 14.43 0.14 0.21 0
4096 1024 float sum -1 14.75 0.28 0.42 0 15.19 0.27 0.40 0
8192 2048 float sum -1 14.93 0.55 0.82 0 14.81 0.55 0.83 0
16384 4096 float sum -1 16.29 1.01 1.51 0 15.35 1.07 1.60 0
32768 8192 float sum -1 19.80 1.66 2.48 0 19.43 1.69 2.53 0
65536 16384 float sum -1 21.48 3.05 4.58 0 20.99 3.12 4.68 0
131072 32768 float sum -1 25.64 5.11 7.67 0 25.36 5.17 7.75 0
262144 65536 float sum -1 35.04 7.48 11.22 0 34.06 7.70 11.55 0
524288 131072 float sum -1 44.89 11.68 17.52 0 44.45 11.80 17.69 0
1048576 262144 float sum -1 63.16 16.60 24.90 0 63.08 16.62 24.94 0
2097152 524288 float sum -1 69.09 30.35 45.53 0 69.25 30.28 45.42 0
4194304 1048576 float sum -1 86.22 48.65 72.97 0 86.73 48.36 72.54 0
8388608 2097152 float sum -1 132.7 63.21 94.81 0 130.3 64.40 96.60 0
16777216 4194304 float sum -1 188.7 88.91 133.36 0 187.8 89.35 134.02 0
33554432 8388608 float sum -1 284.9 117.76 176.64 0 282.3 118.85 178.28 0
67108864 16777216 float sum -1 537.4 124.88 187.32 0 538.8 124.56 186.83 0
134217728 33554432 float sum -1 974.9 137.67 206.51 0 962.4 139.46 209.20 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 39.6794