Fork me on GitHub

cuDNN

The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers. cuDNN is part of the NVIDIA Deep Learning SDK.

Usage

Only GPU-enabled nodes are able to run the library. See Using GPUs on ShARC for more information on how to request a GPU-enabled node for an interactive session or job submission.

Load the appropriate cuDNN version (and implicitly load a specific CUDA version) with one of the following commands:

module load libs/cudnn/7.6.5.32/binary-cuda-10.2.89
module load libs/cudnn/7.6.5.32/binary-cuda-10.1.243
module load libs/cudnn/7.6.5.32/binary-cuda-10.0.130
module load libs/cudnn/7.6.5.32/binary-cuda-9.0.176
module load libs/cudnn/7.5.0.56/binary-cuda-10.0.130
module load libs/cudnn/7.5.0.56/binary-cuda-9.0.176
module load libs/cudnn/7.3.1.20/binary-cuda-9.0.176
module load libs/cudnn/7.0/binary-cuda-9.1.85
module load libs/cudnn/7.0/binary-cuda-8.0.44
module load libs/cudnn/6.0/binary-cuda-8.0.44
module load libs/cudnn/5.1/binary-cuda-8.0.44
module load libs/cudnn/5.1/binary-cuda-7.5.18
module load libs/cudnn/4.0/binary-cuda-7.5.18

Note that for libs/cudnn/4.0/binary-cuda-7.5.18 the mnistCUDNN test program succeeds on K80 GPUs but fails on P100 and V100 GPUs.

Examples

Examples are provided for the following versions

  • libs/cudnn/7.6.5.32/binary-cuda-10.2.89

  • libs/cudnn/7.6.5.32/binary-cuda-10.1.243

  • libs/cudnn/7.6.5.32/binary-cuda-10.0.130

  • libs/cudnn/7.6.5.32/binary-cuda-9.0.176

  • libs/cudnn/7.5.0.56/binary-cuda-10.0.130

  • libs/cudnn/7.5.0.56/binary-cuda-9.0.176

  • libs/cudnn/7.3.1.20/binary-cuda-9.0.176

  • libs/cudnn/4.0/binary-cuda-7.5.18

Usage with libs/cudnn/7.3.1.20/binary-cuda-9.0.176:

# start an interactive session on a GPU node
qrshx -l gpu=1
# Copy the cuDNN 7 examples to a temporary directory
cp -r "$CUDNN_HOME/src/cudnn_samples_v7" "${TMPDIR-/tmp}"
# Compile an example
cd "${TMPDIR-/tmp}/cudnn_samples_v7/mnistCUDNN"
make
# Run the example
./mnistCUDNN

Installation notes

This section is primarily for administrators of the system.

The cuDNN library is only available to download through the developer portal. Installation .tgz files are located in /usr/local/media/protected/cudnn.

Version 7.6.5.32

  • Install script: install_cudnn7.6.5.32.sh

  • Module file for CUDA 10.2

  • Module file for CUDA 10.1

  • Module file for CUDA 10.0

  • Module file for CUDA 9.0

  • Testing: ran the mnistCUDNN example (see Examples above) with CUDA 10.0 on a V100 GPU; results:

    [te1st mnistCUDNN]$ ./mnistCUDNN
    cudnnGetVersion() : 7605 , CUDNN_VERSION from cudnn.h : 7605 (7.6.5)
    Host compiler version : GCC 4.8.5
    There are 1 CUDA capable devices on your machine :
    device 0 : sms 80  Capabilities 7.0, SmClock 1380.0 Mhz, MemSize (Mb) 16130, MemClock 877.0 Mhz, Ecc=1, boardGroupID=0
    Using device 0
    
    Testing single precision
    Loading image data/one_28x28.pgm
    Performing forward propagation ...
    Testing cudnnGetConvolutionForwardAlgorithm ...
    Fastest algorithm is Algo 0
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.030688 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.128000 time requiring 2057744 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.148448 time requiring 57600 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.196640 time requiring 3464 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.231456 time requiring 203008 memory
    Resulting weights from Softmax:
    0.0000000 0.9999399 0.0000000 0.0000000 0.0000561 0.0000000 0.0000012 0.0000017 0.0000010 0.0000000
    Loading image data/three_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000000 0.0000000 0.9999288 0.0000000 0.0000711 0.0000000 0.0000000 0.0000000 0.0000000
    Loading image data/five_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 0.9999820 0.0000154 0.0000000 0.0000012 0.0000006
    
    Result of classification: 1 3 5
    
    Test passed!
    
    Testing half precision (math in single precision)
    Loading image data/one_28x28.pgm
    Performing forward propagation ...
    Testing cudnnGetConvolutionForwardAlgorithm ...
    Fastest algorithm is Algo 0
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.016384 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.051200 time requiring 28800 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.055328 time requiring 2057744 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.065536 time requiring 3464 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.079904 time requiring 203008 memory
    Resulting weights from Softmax:
    0.0000001 1.0000000 0.0000001 0.0000000 0.0000563 0.0000001 0.0000012 0.0000017 0.0000010 0.0000001
    Loading image data/three_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000720 0.0000000 0.0000000 0.0000000 0.0000000
    Loading image data/five_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006
    
    Result of classification: 1 3 5
    
    Test passed!
    

Version 7.5.0.56

  • Install script: install_cudnn7.5.0.56_for_cuda_10.0_and_9.0.sh

  • Module file for CUDA 10.0

  • Module file for CUDA 9.0

  • Testing: ran the mnistCUDNN example (see Examples above) with CUDA 10.0 on a V100 GPU; results:

    [te1st@sharc-node168 mnistCUDNN]$ ./mnistCUDNN
    cudnnGetVersion() : 7500 , CUDNN_VERSION from cudnn.h : 7500 (7.5.0)
    Host compiler version : GCC 4.8.5
    There are 1 CUDA capable devices on your machine :
    device 0 : sms 80  Capabilities 7.0, SmClock 1380.0 Mhz, MemSize (Mb) 16130, MemClock 877.0 Mhz, Ecc=1, boardGroupID=0
    Using device 0
    
    Testing single precision
    Loading image data/one_28x28.pgm
    Performing forward propagation ...
    Testing cudnnGetConvolutionForwardAlgorithm ...
    Fastest algorithm is Algo 0
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.019424 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.053248 time requiring 57600 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.078848 time requiring 3464 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.086016 time requiring 2057744 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.094208 time requiring 203008 memory
    Resulting weights from Softmax:
    0.0000000 0.9999399 0.0000000 0.0000000 0.0000561 0.0000000 0.0000012 0.0000017 0.0000010 0.0000000
    Loading image data/three_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000000 0.0000000 0.9999288 0.0000000 0.0000711 0.0000000 0.0000000 0.0000000 0.0000000
    Loading image data/five_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 0.9999820 0.0000154 0.0000000 0.0000012 0.0000006
    
    Result of classification: 1 3 5
    
    Test passed!
    
    Testing half precision (math in single precision)
    Loading image data/one_28x28.pgm
    Performing forward propagation ...
    Testing cudnnGetConvolutionForwardAlgorithm ...
    Fastest algorithm is Algo 0
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.016384 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.051200 time requiring 28800 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.052224 time requiring 3464 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.065568 time requiring 2057744 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.068608 time requiring 207360 memory
    Resulting weights from Softmax:
    0.0000001 1.0000000 0.0000001 0.0000000 0.0000563 0.0000001 0.0000012 0.0000017 0.0000010 0.0000001
    Loading image data/three_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000720 0.0000000 0.0000000 0.0000000 0.0000000
    Loading image data/five_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006
    
    Result of classification: 1 3 5
    
    Test passed!
    

Version 7.3.1.20

  • Install script: install_cudnn7.3.1.20_for_cuda_9.0.sh

  • Module file for CUDA 9.0

  • Testing: ran the mnistCUDNN example (see Examples above) with CUDA 9.0 on a V100 GPU; results:

    [te1st@sharc-node168 mnistCUDNN]$ ./mnistCUDNN
    cudnnGetVersion() : 7301 , CUDNN_VERSION from cudnn.h : 7301 (7.3.1)
    Host compiler version : GCC 4.8.5
    There are 2 CUDA capable devices on your machine :
    device 0 : sms 80  Capabilities 7.0, SmClock 1380.0 Mhz, MemSize (Mb) 16160, MemClock 877.0 Mhz, Ecc=1, boardGroupID=0
    device 1 : sms 80  Capabilities 7.0, SmClock 1380.0 Mhz, MemSize (Mb) 16160, MemClock 877.0 Mhz, Ecc=1, boardGroupID=1
    Using device 0
    
    Testing single precision
    Loading image data/one_28x28.pgm
    Performing forward propagation ...
    Testing cudnnGetConvolutionForwardAlgorithm ...
    Fastest algorithm is Algo 0
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.109600 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.161792 time requiring 57600 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.183296 time requiring 3464 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.198656 time requiring 203008 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.200704 time requiring 2057744 memory
    Resulting weights from Softmax:
    0.0000000 0.9999399 0.0000000 0.0000000 0.0000561 0.0000000 0.0000012 0.0000017 0.0000010 0.0000000
    Loading image data/three_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000000 0.0000000 0.9999288 0.0000000 0.0000711 0.0000000 0.0000000 0.0000000 0.0000000
    Loading image data/five_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 0.9999820 0.0000154 0.0000000 0.0000012 0.0000006
    
    Result of classification: 1 3 5
    
    Test passed!
    
    Testing half precision (math in single precision)
    Loading image data/one_28x28.pgm
    Performing forward propagation ...
    Testing cudnnGetConvolutionForwardAlgorithm ...
    Fastest algorithm is Algo 0
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.048128 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.089088 time requiring 3464 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.097280 time requiring 2057744 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.098272 time requiring 28800 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.132096 time requiring 207360 memory
    Resulting weights from Softmax:
    0.0000001 1.0000000 0.0000001 0.0000000 0.0000563 0.0000001 0.0000012 0.0000017 0.0000010 0.0000001
    Loading image data/three_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000720 0.0000000 0.0000000 0.0000000 0.0000000
    Loading image data/five_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006
    
    Result of classification: 1 3 5
    
    Test passed!
    

Version 6.0

Version 4.0

  • Install script: install_4.0_for_cuda_7.0.sh

  • Module file for CUDA 7.5 (this cuDNN was built for CUDA 7.0 but should be compatible with CUDA 7.5)

  • Testing: ran the mnistCUDNN example (see Examples above) with CUDA 7.5 on a K80 GPU (NB tests failed on P100 and V100 GPUs):

    $ make
    /usr/local/packages/libs/CUDA/7.5.18/binary/cuda/bin/nvcc -ccbin g++ -I/usr/local/packages/libs/CUDA/7.5.18/binary/cuda/include -IFreeImage/include -IUtilNPP  -m64    -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52 -o fp16_dev.o -c fp16_dev.cu
    g++ -I/usr/local/packages/libs/CUDA/7.5.18/binary/cuda/include -IFreeImage/include -IUtilNPP   -o fp16_emu.o -c fp16_emu.cpp
    g++ -I/usr/local/packages/libs/CUDA/7.5.18/binary/cuda/include -IFreeImage/include -IUtilNPP   -o mnistCUDNN.o -c mnistCUDNN.cpp
    /usr/local/packages/libs/CUDA/7.5.18/binary/cuda/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52 -o mnistCUDNN fp16_dev.o fp16_emu.o mnistCUDNN.o  -LFreeImage/lib/linux/x86_64 -LFreeImage/lib/linux -lcudart -lnppi -lnppc -lcublas -lcudnn -lfreeimage -lstdc++ -lm
    $ ./mnistCUDNN
    cudnnGetVersion() : 4007 , CUDNN_VERSION from cudnn.h : 4007 (4.0.7)
    Host compiler version : GCC 4.8.5
    There are 8 CUDA capable devices on your machine :
    device 0 : sms 13  Capabilities 3.7, SmClock 823.5 Mhz, MemSize (Mb) 11441, MemClock 2505.0 Mhz, Ecc=1, boardGroupID=0
    device 1 : sms 13  Capabilities 3.7, SmClock 823.5 Mhz, MemSize (Mb) 11441, MemClock 2505.0 Mhz, Ecc=1, boardGroupID=0
    device 2 : sms 13  Capabilities 3.7, SmClock 823.5 Mhz, MemSize (Mb) 11441, MemClock 2505.0 Mhz, Ecc=1, boardGroupID=2
    device 3 : sms 13  Capabilities 3.7, SmClock 823.5 Mhz, MemSize (Mb) 11441, MemClock 2505.0 Mhz, Ecc=1, boardGroupID=2
    device 4 : sms 13  Capabilities 3.7, SmClock 823.5 Mhz, MemSize (Mb) 11441, MemClock 2505.0 Mhz, Ecc=1, boardGroupID=4
    device 5 : sms 13  Capabilities 3.7, SmClock 823.5 Mhz, MemSize (Mb) 11441, MemClock 2505.0 Mhz, Ecc=1, boardGroupID=4
    device 6 : sms 13  Capabilities 3.7, SmClock 823.5 Mhz, MemSize (Mb) 11441, MemClock 2505.0 Mhz, Ecc=1, boardGroupID=6
    device 7 : sms 13  Capabilities 3.7, SmClock 823.5 Mhz, MemSize (Mb) 11441, MemClock 2505.0 Mhz, Ecc=1, boardGroupID=6
    Using device 0
    
    Testing single precision
    Loading image data/one_28x28.pgm
    Performing forward propagation ...
    Testing cudnnGetConvolutionForwardAlgorithm ...
    Fastest algorithm is Algo 1
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.024928 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.033504 time requiring 100 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.046816 time requiring 57600 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.128416 time requiring 207360 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.143424 time requiring 209360 memory
    Resulting weights from Softmax:
    0.0000000 0.9999399 0.0000000 0.0000000 0.0000561 0.0000000 0.0000012 0.0000017 0.0000010 0.0000000
    Loading image data/three_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000000 0.0000000 0.9999288 0.0000000 0.0000711 0.0000000 0.0000000 0.0000000 0.0000000
    Loading image data/five_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 0.9999820 0.0000154 0.0000000 0.0000012 0.0000006
    
    Result of classification: 1 3 5
    
    Test passed!
    
    Testing half precision (math in single precision)
    Loading image data/one_28x28.pgm
    Performing forward propagation ...
    Testing cudnnGetConvolutionForwardAlgorithm ...
    Fastest algorithm is Algo 1
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.026144 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.033696 time requiring 100 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.047136 time requiring 28800 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.133760 time requiring 207360 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.144096 time requiring 209360 memory
    Resulting weights from Softmax:
    0.0000001 1.0000000 0.0000001 0.0000000 0.0000563 0.0000001 0.0000012 0.0000017 0.0000010 0.0000001
    Loading image data/three_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000714 0.0000000 0.0000000 0.0000000 0.0000000
    Loading image data/five_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006
    
    Result of classification: 1 3 5
    
    Test passed!