CUDA

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows software developers to use a CUDA-enabled graphics processing unit (GPU) for general-purpose processing, an approach known as General Purpose GPU (GPGPU) computing.


Usage

You need to first request one or more GPUs within an interactive session or batch job on a worker node.

To request, say, three unspecified public GPUs for a batch job, include all of the following in the header of your submission script:

#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=3

Note

See Using GPUs on Bessemer for more information on how to request a GPU-enabled node for an interactive session or job submission.

You then need to ensure a version of the CUDA library (and compiler) is loaded. As with much software installed on the cluster, versions of CUDA are activated via the ‘module load’ command.

To load one of the currently available CUDA versions you can run one of the following commands:

module load CUDA/10.0.130
module load CUDA/10.1.105-GCC-8.2.0-2.31.1
module load CUDA/10.1.243
module load CUDA/10.1.243-GCC-8.3.0
module load CUDA/10.2.89-GCC-8.3.0
module load CUDAcore/11.0.2
module load CUDAcore/11.1.1
module load CUDA/11.8.0
module load CUDA/12.4.0

Warning

Please take care when loading these modules, as some of them also load further software, libraries or toolchains. There are also fosscuda toolchain modules, which are detailed below.

To load just CUDA 10.2 plus the GCC 8.3 compiler:

module load CUDA/10.2.89-GCC-8.3.0

To load CUDA 10.1 plus the GCC 8.x compiler, OpenMPI, OpenBLAS, ScaLAPACK and FFTW, run one of:

module load fosscuda/2019b  # includes GCC 8.3
module load fosscuda/2019a   # includes GCC 8.2

To load just CUDA 10.1 and GCC 8.x, run one of:

module load CUDA/10.1.243-GCC-8.3.0  # subset of the fosscuda-2019b toolchain
module load CUDA/10.1.105-GCC-8.2.0-2.31.1  # subset of the fosscuda-2019a toolchain

To load just CUDA 10.0:

module load CUDA/10.0.130

Confirm which version of CUDA you are using with nvcc --version, e.g.:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105
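
If a job script needs to check the toolkit version programmatically, the release number can be extracted from the nvcc --version text. The following is a minimal sketch, assuming the output format shown above; the function name cuda_release is hypothetical and is not provided by CUDA or by the cluster.

```shell
# Hypothetical helper: print the CUDA release (e.g. "10.1") found in
# `nvcc --version` style text read from standard input.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Sample line copied from the example output above, so the helper can be
# tried without nvcc being available.
sample='Cuda compilation tools, release 10.1, V10.1.105'
printf '%s\n' "$sample" | cuda_release

# On a node where a CUDA module is loaded, query the real compiler instead.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version | cuda_release
fi
```

The result could then be compared against the version a build expects before compilation starts.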

Compiling a simple CUDA program

An example of the use of nvcc (the CUDA compiler):

nvcc filename.cu

will compile the CUDA program contained in the file filename.cu.
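
For anyone without a source file to hand, the sketch below writes a small CUDA program to filename.cu and compiles it if nvcc is available. It is illustrative only: the WORKDIR variable, the kernel name fill_index and the output name are assumptions made for this example, not cluster conventions.

```shell
# Assumption: WORKDIR is a scratch directory of your choosing; a temporary
# directory is created if it is not set.
workdir=${WORKDIR:-$(mktemp -d)}

# Write a small CUDA source file. The kernel name fill_index is illustrative.
cat > "$workdir/filename.cu" <<'EOF'
#include <cstdio>

// Each thread stores its own global index in out[].
__global__ void fill_index(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = i;
    }
}

int main()
{
    const int n = 8;
    int host[n];
    int *dev = NULL;

    if (cudaMalloc((void **)&dev, n * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed (is a GPU available?)\n");
        return 1;
    }
    fill_index<<<1, n>>>(dev, n);   // one block of n threads
    cudaError_t err = cudaMemcpy(host, dev, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        printf("%d ", host[i]);
    }
    printf("\n");
    return 0;
}
EOF

# Compile only if nvcc is on the PATH (i.e. a CUDA module is loaded).
if command -v nvcc >/dev/null 2>&1; then
    nvcc -o "$workdir/fill_index" "$workdir/filename.cu"
else
    echo "nvcc not found: load a CUDA module first" >&2
fi
```

The resulting executable needs a GPU to run, so it should be launched from within an interactive session or batch job that has requested one.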


Compiling the sample programs

You do not need to be using a GPU-enabled node to compile the sample programs but you do need at least one GPU to run them.

In this demonstration, we create a batch job that

  1. Requests two GPUs, a single CPU core and 8GB RAM

  2. Loads a module to provide CUDA 10.1

  3. Downloads compatible NVIDIA CUDA sample programs

  4. Compiles and runs an example that performs a matrix multiplication

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=2     # Number of GPUs (per node)
#SBATCH --mem=8G
#SBATCH --time=0-00:05        # time (DD-HH:MM)
#SBATCH --job-name=gputest

module load fosscuda/2019a  # provides CUDA 10.1

mkdir -p $HOME/examples
cd $HOME/examples
if ! [[ -d cuda-samples/.git ]]; then  # .git is a directory, so test with -d
    git clone https://github.com/NVIDIA/cuda-samples.git cuda-samples
fi
cd cuda-samples
git checkout tags/10.1.1  # use sample programs compatible with CUDA 10.1
cd Samples/matrixMul
make
./matrixMul

GPU Code Generation Options

To achieve the best possible performance whilst being portable, GPU code should be generated for the architecture(s) it will be executed upon.

This is controlled by passing -gencode arguments to NVCC which, unlike the -arch and -code arguments, allow ‘fatbinary’ executables to be built that are optimised for multiple device architectures.

Each -gencode argument requires two values, the virtual architecture and real architecture, for use in NVCC’s two-stage compilation. For example, -gencode=arch=compute_70,code=sm_70 specifies a virtual architecture of compute_70 and real architecture sm_70.

To support future hardware of higher compute capability, an additional -gencode argument can be used to enable Just in Time (JIT) compilation of embedded intermediate PTX code. This argument should use the highest virtual architecture specified in the other -gencode arguments for both arch and code, i.e. -gencode=arch=compute_70,code=compute_70.

The minimum specified virtual architecture must be less than or equal to the Compute Capability of the GPU used to execute the code.

Most public and private GPU nodes in Bessemer contain Tesla V100 GPUs, which have Compute Capability 7.0 (written as 70 in compiler flags). To build a CUDA application which targets just the public GPU nodes, use the following -gencode arguments:

nvcc filename.cu \
   -gencode=arch=compute_70,code=sm_70 \
   -gencode=arch=compute_70,code=compute_70

To build a CUDA application which instead targets GPUs of Compute Capability 8.0 (sm_80, for example the NVIDIA A100), you need CUDA >= 11 and the following -gencode arguments:

nvcc filename.cu \
   -gencode=arch=compute_80,code=sm_80 \
   -gencode=arch=compute_80,code=compute_80

Further details of these compiler flags can be found in the NVCC Documentation, along with details of the supported virtual architectures and real architectures.
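
The pattern described above (one -gencode argument per real architecture, plus a PTX entry for the highest virtual architecture) can be generated rather than typed by hand. The following is a minimal sketch; gencode_flags is a hypothetical helper written for this page, not an NVCC feature.

```shell
# Hypothetical helper: print -gencode arguments for the given compute
# capabilities (written as in sm_XX, e.g. 70 80), followed by a PTX entry
# for the highest one so that newer GPUs can JIT-compile the code.
gencode_flags() {
    [ "$#" -gt 0 ] || return 1
    local cc highest=0 flags=""
    for cc in "$@"; do
        flags="$flags -gencode=arch=compute_${cc},code=sm_${cc}"
        if [ "$cc" -gt "$highest" ]; then
            highest=$cc
        fi
    done
    flags="$flags -gencode=arch=compute_${highest},code=compute_${highest}"
    printf '%s\n' "${flags# }"
}

gencode_flags 70      # V100 only
gencode_flags 70 80   # fatbinary for both architectures (needs CUDA >= 11)
```

The output can be passed to the compiler with unquoted command substitution so that it is split into separate arguments, e.g. nvcc filename.cu $(gencode_flags 70).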


Documentation


Nsight Systems

Nsight Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms and identify the largest opportunities to optimize. It supports Pascal (SM 60) and newer GPUs.

A common use-case for Nsight Systems is to generate application timelines via the command line, which can later be visualised on a local computer using the GUI component. The Nsight Systems CLI, nsys, is provided by the following modules:

module load CUDAcore/11.0.2
module load CUDAcore/11.1.1
module load cuDNN/8.0.4.30-CUDA-11.0.2
module load cuDNN/8.0.4.30-CUDA-11.1.1

You should use a version of nsys that is at least as new as the CUDA toolkit used to compile your application (if appropriate).

To generate an application timeline with Nsight Systems CLI (nsys):

nsys profile -o timeline ./myapplication <arguments>

Nsight Systems can trace multiple APIs, such as CUDA and OpenACC. Use the --trace argument to specify which APIs should be traced. See the nsys profiling command switch options for further information.

nsys profile -o timeline --trace cuda,nvtx,osrt,openacc ./myapplication <arguments>

Once the resulting timeline.qdrep file has been downloaded to your local machine, it can be opened in the Nsight Systems GUI (nsys-ui or nsight-sys) via File > Open.

Nsight Compute

Nsight Compute is a kernel profiler for CUDA applications, which can also be used for API debugging. It supports Volta (SM 70) and newer GPUs.

A common use-case for Nsight Compute is to capture all available profiling metrics via the command line, which can later be analysed on a local computer using the GUI component. The Nsight Compute CLI, ncu, is provided by the following modules:

module load CUDAcore/11.0.2
module load CUDAcore/11.1.1
module load cuDNN/8.0.4.30-CUDA-11.0.2
module load cuDNN/8.0.4.30-CUDA-11.1.1

You should use a version of ncu that is at least as new as the CUDA toolkit used to compile your application.

To generate the default set of profile metrics with Nsight Compute CLI (ncu):

ncu -o metrics ./myapplication <arguments>

Nsight Compute can capture many different metrics, which are used to generate the different sections of the profiling report. The --set argument controls which set of metrics and sections is captured. See the Nsight Compute CLI Command Line Options for further information.

ncu -o metrics --set full ./myapplication <arguments>

Once the resulting metrics.ncu-rep file has been downloaded to your local machine, it can be opened in the Nsight Compute GUI (ncu-ui or nv-nsight-cu) via File > Open File.

Profiling using nvprof

Prior to September 2020 nvprof, NVIDIA’s CUDA profiler, could not write its SQLite database outputs to the /fastdata filesystem. This was because SQLite requires a filesystem that supports file locking but file locking was not previously enabled on the (Lustre) filesystem mounted on /fastdata.

nvprof can now write output data to any user-accessible filesystem including /fastdata.


CUDA Training

The Research Software Engineering team have developed an undergraduate teaching module on CUDA; lecture notes and lecture recordings for that module are accessible here for anyone with a University account.


Determining the NVIDIA Driver version

Run the command:

cat /proc/driver/nvidia/version

Example output is:

NVRM version: NVIDIA UNIX x86_64 Kernel Module  418.67  Sat Apr  6 03:07:24 CDT 2019
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC)
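
To use the driver version in a script, the number can be picked out of the NVRM line. This is a minimal sketch, assuming the line layout shown in the example output; nvidia_driver_version is a hypothetical helper name.

```shell
# Hypothetical helper: print the driver version found on the "NVRM version"
# line of /proc/driver/nvidia/version style text read from standard input.
nvidia_driver_version() {
    awk '/^NVRM version:/ {
        for (i = 1; i <= NF; i++) {
            # The first field that looks like 418.67 or 470.57.02 is taken
            # to be the driver version.
            if ($i ~ /^[0-9]+\.[0-9]+(\.[0-9]+)?$/) { print $i; exit }
        }
    }'
}

# Sample line copied from the example output above.
sample='NVRM version: NVIDIA UNIX x86_64 Kernel Module  418.67  Sat Apr  6 03:07:24 CDT 2019'
printf '%s\n' "$sample" | nvidia_driver_version

# On a GPU node, read the real file instead.
if [ -r /proc/driver/nvidia/version ]; then
    nvidia_driver_version < /proc/driver/nvidia/version
fi
```

For the sample line this prints 418.67.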

Installation notes

These are primarily for system administrators.

Device driver

The NVIDIA device driver is installed and configured using the gpu-nvidia-driver systemd service (managed by puppet). This service runs /usr/local/scripts/gpu-nvidia-driver.sh at boot time to:

  • Check the device driver version and, if required, uninstall it and install the target version;

  • Load the nvidia kernel module;

  • Create several device nodes in /dev/.


CUDA 11.1.1

Installed as a dependency of the cuDNN-8.0.4.30-CUDA-11.1.1 easyconfig.

Single GPU and compiler testing was conducted as above in the matrixMul batch job.

Inter-GPU performance was tested on all 4x V100 devices in bessemer-node026 (no NVLINK) using nccl-tests and NCCL/2.8.3-CUDA-11.1.1. nccl-tests was run using ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4

Results:

# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid 201685 on bessemer-node026 device  0 [0x3d] Tesla V100-PCIE-32GB
#   Rank  1 Pid 201685 on bessemer-node026 device  1 [0x3e] Tesla V100-PCIE-32GB
#   Rank  2 Pid 201685 on bessemer-node026 device  2 [0x3f] Tesla V100-PCIE-32GB
#   Rank  3 Pid 201685 on bessemer-node026 device  3 [0x40] Tesla V100-PCIE-32GB
#
#                                                       out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
          8             2     float     sum    13.37    0.00    0.00  1e-07    14.59    0.00    0.00  0e+00
         16             4     float     sum    13.58    0.00    0.00  3e-08    13.35    0.00    0.00  3e-08
         32             8     float     sum    13.82    0.00    0.00  3e-08    13.46    0.00    0.00  3e-08
         64            16     float     sum    13.42    0.00    0.01  3e-08    13.45    0.00    0.01  3e-08
        128            32     float     sum    13.81    0.01    0.01  3e-08    13.21    0.01    0.01  3e-08
        256            64     float     sum    13.96    0.02    0.03  3e-08    13.63    0.02    0.03  3e-08
        512           128     float     sum    13.86    0.04    0.06  3e-08    13.56    0.04    0.06  1e-08
       1024           256     float     sum    13.77    0.07    0.11  1e-07    13.67    0.07    0.11  1e-07
       2048           512     float     sum    13.85    0.15    0.22  1e-07    13.92    0.15    0.22  1e-07
       4096          1024     float     sum    14.24    0.29    0.43  2e-07    13.75    0.30    0.45  2e-07
       8192          2048     float     sum    15.92    0.51    0.77  2e-07    15.23    0.54    0.81  2e-07
      16384          4096     float     sum    19.15    0.86    1.28  2e-07    18.81    0.87    1.31  2e-07
      32768          8192     float     sum    22.07    1.48    2.23  2e-07    21.74    1.51    2.26  2e-07
      65536         16384     float     sum    30.05    2.18    3.27  2e-07    29.71    2.21    3.31  2e-07
     131072         32768     float     sum    47.07    2.78    4.18  2e-07    46.60    2.81    4.22  2e-07
     262144         65536     float     sum    64.61    4.06    6.09  2e-07    63.70    4.12    6.17  2e-07
     524288        131072     float     sum    84.66    6.19    9.29  2e-07    85.23    6.15    9.23  2e-07
    1048576        262144     float     sum    156.5    6.70   10.05  2e-07    155.0    6.77   10.15  2e-07
    2097152        524288     float     sum    299.0    7.01   10.52  2e-07    299.0    7.01   10.52  2e-07
    4194304       1048576     float     sum    657.1    6.38    9.57  2e-07    651.5    6.44    9.66  2e-07
    8388608       2097152     float     sum   1313.2    6.39    9.58  2e-07   1308.3    6.41    9.62  2e-07
   16777216       4194304     float     sum   2671.5    6.28    9.42  2e-07   2671.4    6.28    9.42  2e-07
   33554432       8388608     float     sum   5349.2    6.27    9.41  2e-07   5351.0    6.27    9.41  2e-07
   67108864      16777216     float     sum    10712    6.26    9.40  2e-07    10711    6.27    9.40  2e-07
  134217728      33554432     float     sum    21410    6.27    9.40  2e-07    21407    6.27    9.40  2e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 4.22207
#
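
The algbw and busbw columns can be reproduced from the size and time columns. As described in the nccl-tests performance notes, algorithm bandwidth is message size divided by time, and for an all-reduce the bus bandwidth is algbw multiplied by 2(n-1)/n for n ranks, which is a factor of 1.5 for four GPUs. The sketch below applies this to the last out-of-place row of the table; allreduce_bandwidth is a hypothetical helper name.

```shell
# Hypothetical helper: given message size (bytes), time (microseconds) and
# the number of GPUs, print algbw and busbw in GB/s for an all-reduce.
allreduce_bandwidth() {
    awk -v size="$1" -v time_us="$2" -v n="$3" 'BEGIN {
        algbw = size / time_us / 1000      # bytes per microsecond -> GB/s
        busbw = algbw * 2 * (n - 1) / n    # all-reduce correction factor
        printf "%.2f %.2f\n", algbw, busbw
    }'
}

# Last out-of-place row above: 134217728 bytes in 21410 us on 4 GPUs.
allreduce_bandwidth 134217728 21410 4
```

This prints 6.27 9.40, matching the algbw and busbw values reported in that row.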

CUDA 11.0.2

Installed as a dependency of the cuDNN-8.0.4.30-CUDA-11.0.2 easyconfig.

Single GPU and compiler testing was conducted as above in the matrixMul batch job.

Inter-GPU performance was tested on all 4x V100 devices in bessemer-node026 (no NVLINK) using nccl-tests and NCCL/2.8.3-CUDA-11.0.2. nccl-tests was run using ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4

Results:

# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid 200999 on bessemer-node026 device  0 [0x3d] Tesla V100-PCIE-32GB
#   Rank  1 Pid 200999 on bessemer-node026 device  1 [0x3e] Tesla V100-PCIE-32GB
#   Rank  2 Pid 200999 on bessemer-node026 device  2 [0x3f] Tesla V100-PCIE-32GB
#   Rank  3 Pid 200999 on bessemer-node026 device  3 [0x40] Tesla V100-PCIE-32GB
#
#                                                       out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
          8             2     float     sum    13.23    0.00    0.00  1e-07    13.39    0.00    0.00  0e+00
         16             4     float     sum    13.31    0.00    0.00  3e-08    13.33    0.00    0.00  3e-08
         32             8     float     sum    13.55    0.00    0.00  3e-08    13.45    0.00    0.00  3e-08
         64            16     float     sum    13.40    0.00    0.01  3e-08    13.27    0.00    0.01  3e-08
        128            32     float     sum    13.51    0.01    0.01  3e-08    13.26    0.01    0.01  3e-08
        256            64     float     sum    13.68    0.02    0.03  3e-08    13.20    0.02    0.03  3e-08
        512           128     float     sum    13.69    0.04    0.06  3e-08    13.32    0.04    0.06  1e-08
       1024           256     float     sum    13.40    0.08    0.11  1e-07    13.15    0.08    0.12  1e-07
       2048           512     float     sum    14.14    0.14    0.22  1e-07    13.56    0.15    0.23  1e-07
       4096          1024     float     sum    14.45    0.28    0.43  2e-07    13.95    0.29    0.44  2e-07
       8192          2048     float     sum    16.36    0.50    0.75  2e-07    15.91    0.51    0.77  2e-07
      16384          4096     float     sum    19.80    0.83    1.24  2e-07    19.44    0.84    1.26  2e-07
      32768          8192     float     sum    23.24    1.41    2.11  2e-07    22.48    1.46    2.19  2e-07
      65536         16384     float     sum    31.39    2.09    3.13  2e-07    30.96    2.12    3.18  2e-07
     131072         32768     float     sum    50.30    2.61    3.91  2e-07    49.39    2.65    3.98  2e-07
     262144         65536     float     sum    69.78    3.76    5.64  2e-07    68.22    3.84    5.76  2e-07
     524288        131072     float     sum    86.08    6.09    9.14  2e-07    86.15    6.09    9.13  2e-07
    1048576        262144     float     sum    155.5    6.74   10.11  2e-07    156.3    6.71   10.06  2e-07
    2097152        524288     float     sum    298.7    7.02   10.53  2e-07    295.2    7.10   10.65  2e-07
    4194304       1048576     float     sum    646.5    6.49    9.73  2e-07    647.9    6.47    9.71  2e-07
    8388608       2097152     float     sum   1310.7    6.40    9.60  2e-07   1307.6    6.42    9.62  2e-07
   16777216       4194304     float     sum   2665.6    6.29    9.44  2e-07   2660.4    6.31    9.46  2e-07
   33554432       8388608     float     sum   5324.7    6.30    9.45  2e-07   5324.3    6.30    9.45  2e-07
   67108864      16777216     float     sum    10678    6.28    9.43  2e-07    10667    6.29    9.44  2e-07
  134217728      33554432     float     sum    21423    6.26    9.40  2e-07    21352    6.29    9.43  2e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 4.18969
#

CUDA 10.1

Installed as a dependency of the fosscuda-2019a easyconfig.

Inter-GPU performance was tested on all 4x V100 devices in bessemer-node026 (no NVLINK) using nccl-tests and NCCL/2.4.2-gcccuda-2019a. nccl-tests was run using ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4

Results:

# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid  31823 on bessemer-node026 device  0 [0x3d] Tesla V100-PCIE-32GB
#   Rank  1 Pid  31823 on bessemer-node026 device  1 [0x3e] Tesla V100-PCIE-32GB
#   Rank  2 Pid  31823 on bessemer-node026 device  2 [0x3f] Tesla V100-PCIE-32GB
#   Rank  3 Pid  31823 on bessemer-node026 device  3 [0x40] Tesla V100-PCIE-32GB
#
#                                                     out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2   float     sum    16.36    0.00    0.00  1e-07    15.99    0.00    0.00  0e+00
          16             4   float     sum    183.5    0.00    0.00  3e-08    16.04    0.00    0.00  3e-08
          32             8   float     sum    15.99    0.00    0.00  3e-08    15.93    0.00    0.00  3e-08
          64            16   float     sum    16.13    0.00    0.01  3e-08    16.12    0.00    0.01  3e-08
         128            32   float     sum    255.5    0.00    0.00  3e-08    16.10    0.01    0.01  3e-08
         256            64   float     sum    16.23    0.02    0.02  3e-08    16.15    0.02    0.02  3e-08
         512           128   float     sum    16.13    0.03    0.05  3e-08    16.08    0.03    0.05  1e-08
        1024           256   float     sum    16.08    0.06    0.10  1e-07    16.28    0.06    0.09  1e-07
        2048           512   float     sum    16.44    0.12    0.19  1e-07    16.15    0.13    0.19  1e-07
        4096          1024   float     sum    16.41    0.25    0.37  2e-07    16.38    0.25    0.37  2e-07
        8192          2048   float     sum    16.56    0.49    0.74  2e-07    16.22    0.51    0.76  2e-07
       16384          4096   float     sum    19.62    0.84    1.25  2e-07    18.78    0.87    1.31  2e-07
       32768          8192   float     sum    29.21    1.12    1.68  2e-07    27.23    1.20    1.80  2e-07
       65536         16384   float     sum    46.77    1.40    2.10  2e-07    43.66    1.50    2.25  2e-07
      131072         32768   float     sum    51.53    2.54    3.82  2e-07    50.77    2.58    3.87  2e-07
      262144         65536   float     sum    67.61    3.88    5.82  2e-07    67.61    3.88    5.82  2e-07
      524288        131072   float     sum    100.3    5.23    7.84  2e-07    100.3    5.23    7.84  2e-07
     1048576        262144   float     sum    165.5    6.33    9.50  2e-07    165.1    6.35    9.52  2e-07
     2097152        524288   float     sum    301.1    6.96   10.45  2e-07    299.6    7.00   10.50  2e-07
     4194304       1048576   float     sum    588.3    7.13   10.69  2e-07    583.7    7.19   10.78  2e-07
     8388608       2097152   float     sum   1141.4    7.35   11.02  2e-07   1133.3    7.40   11.10  2e-07
    16777216       4194304   float     sum   2269.2    7.39   11.09  2e-07   2256.6    7.43   11.15  2e-07
    33554432       8388608   float     sum   4510.3    7.44   11.16  2e-07   4497.0    7.46   11.19  2e-07
    67108864      16777216   float     sum   9013.1    7.45   11.17  2e-07   8998.9    7.46   11.19  2e-07
   134217728      33554432   float     sum    18003    7.46   11.18  2e-07    17974    7.47   11.20  2e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 4.42606
#

CUDA 10.0

Explicitly installed via the EasyBuild-provided CUDA/10.0.130 easyconfig.