TensorFlow

TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

About TensorFlow on Stanage

Note

A GPU-enabled worker node must be requested in order to use the GPU version of this software. See Using GPUs on Stanage for more information.
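For example, an interactive session on a GPU node can typically be requested with srun; the exact partition, QoS, and account arguments depend on your project, so treat the values below as illustrative and check the Stanage GPU documentation:

```shell
# Request an interactive session with one GPU
# (partition/QoS names are illustrative -- confirm them against
# the "Using GPUs on Stanage" documentation for your project)
srun --partition=gpu --qos=gpu --gres=gpu:1 --pty bash
```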

As TensorFlow and all its dependencies are written in Python, it can be installed locally in your home directory. Using Anaconda (Python) is recommended, as it can create a virtual environment in your home directory, allowing new Python packages to be installed without needing admin permissions.

Note

If you want to use TensorFlow with GPUs on Stanage:

Be aware that official TensorFlow releases don’t (yet) ship CUDA code compiled to target the sm_90 NVIDIA Compute Capability (required by the H100 GPUs - see Stanage specs). When TensorFlow is run on an H100 node, the GPU-architecture-independent ‘PTX’ code bundled with TensorFlow is instead compiled on the fly to target sm_90. Be warned that this process can take up to ~30 minutes.

Warning

During the two-week introduction phase of the H100 GPUs to the Stanage cluster, usage of the H100 GPUs requires the --partition=gpu-h100 and --gres=gpu:1 arguments to be set in your submission scripts. This ensures usage is “opt in”, as the slightly different architecture of these GPUs compared to the existing A100 GPUs may necessitate changes to batch submission scripts and selected software versions.

Eventually the H100 GPUs will be brought into the general GPU partition, at which point the --partition=gpu argument will be required to access H100s (or any GPUs). At that stage, any submission using the general --gres=gpu:1 argument will be scheduled on the first available GPU of any type. Requesting a specific type of GPU will then require the --gres=gpu:h100:1 or --gres=gpu:a100:1 argument.

When these latter changes are made, we will give advance notice via email and by amendments made within this documentation.
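As a sketch, a submission script opting in to the H100 nodes during the introduction phase might begin as follows (the script body and walltime are illustrative):

```shell
#!/bin/bash
#SBATCH --partition=gpu-h100   # opt in to the H100 nodes
#SBATCH --gres=gpu:1           # request a single H100 GPU
#SBATCH --time=01:00:00        # illustrative walltime

# ... the rest of your job script goes here
```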

Installation in Home Directory - CPU Version

To install to your home directory, Conda is used to create a virtual Python environment into which your local version of TensorFlow is installed.

First request an interactive session, e.g. with Interactive Jobs.

Then TensorFlow can be installed by the following:

# Load the conda module
module load Anaconda3/2022.10

# Create a conda virtual environment called 'tensorflow'
conda create -n tensorflow python=3.10

# Activate the 'tensorflow' environment
source activate tensorflow

pip install tensorflow

Every Session Afterwards and in Your Job Scripts

In every new session, and within your job scripts, the modules must be loaded and Conda must be activated again. Use the following commands to activate the Conda environment with TensorFlow installed:

module load Anaconda3/2022.10
source activate tensorflow
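For example, a minimal batch script using this environment might look like the following (the script name, walltime, and Python file name are illustrative):

```shell
#!/bin/bash
#SBATCH --time=00:30:00   # illustrative walltime

# Load Anaconda and activate the 'tensorflow' environment
module load Anaconda3/2022.10
source activate tensorflow

# Run your TensorFlow script (replace with your own file)
python my_tensorflow_script.py
```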

Installation in Home Directory - GPU Version

The GPU version of TensorFlow is a distinct Pip package and is also dependent on CUDA and cuDNN libraries, making the installation procedure slightly different.

Warning

You will need to ensure you load CUDA and cuDNN modules which are compatible with the version of TensorFlow used (see table).

First request an interactive session, e.g. see Interactive use of the GPUs.

Then the GPU version of TensorFlow can be installed by the following:

# Load the conda module
module load Anaconda3/2022.10

# Load the CUDA and cuDNN module
module load cuDNN/8.7.0.84-CUDA-11.8.0

# Create a conda virtual environment called 'tensorflow-gpu'
conda create -n tensorflow-gpu python=3.10

# Activate the 'tensorflow-gpu' environment
source activate tensorflow-gpu

# Install GPU version of TensorFlow
pip install tensorflow==2.12.0

To install a different version of TensorFlow, specify the version number when running pip install, i.e.

pip install tensorflow==<version_number>

Every Session Afterwards and in Your Job Scripts

In every new session, and within your job scripts, the modules must be loaded and Conda must be activated again. Use the following commands to activate the Conda environment with TensorFlow installed:

module load Anaconda3/2022.10
module load cuDNN/8.7.0.84-CUDA-11.8.0
source activate tensorflow-gpu
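A minimal GPU batch script using this environment might look like the following (the partition/QoS values and Python file name are illustrative; check the Stanage GPU documentation for your project):

```shell
#!/bin/bash
#SBATCH --partition=gpu   # partition/QoS values are illustrative
#SBATCH --qos=gpu
#SBATCH --gres=gpu:1      # request a single GPU

# Load Anaconda plus the matching CUDA/cuDNN module,
# then activate the environment
module load Anaconda3/2022.10
module load cuDNN/8.7.0.84-CUDA-11.8.0
source activate tensorflow-gpu

# Run your TensorFlow script (replace with your own file)
python my_tensorflow_script.py
```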

Testing your TensorFlow installation

You can test that TensorFlow is running on the GPU with the following Python code (requires TensorFlow >= 2):

import tensorflow as tf

tf.debugging.set_log_device_placement(True)

# Creates a graph
# (ensure tensors placed on the GPU)
with tf.device('/device:GPU:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

# Print the result of the matrix multiplication
print(c)

Which when run should log the device placement of each operation, followed by:

tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
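A quicker check of whether TensorFlow can see a GPU at all is to list the physical GPU devices; an empty list means no GPU was detected (for example, because the CUDA/cuDNN modules are not loaded or you are on a CPU-only node):

```python
import tensorflow as tf

# List the GPUs visible to TensorFlow (TensorFlow >= 2.1);
# an empty list means no GPU was detected
gpus = tf.config.list_physical_devices('GPU')
print(gpus)
```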

CUDA and cuDNN Import Errors

TensorFlow releases depend on specific versions of both CUDA and cuDNN. If the wrong cuDNN module is loaded, you may receive ImportError runtime errors such as:

ImportError: libcublas.so.10.0: cannot open shared object file: No such file or directory

This indicates that TensorFlow was expecting to find CUDA 10.0 (and an appropriate version of cuDNN) but was unable to do so.
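If you are unsure which CUDA and cuDNN versions your installed TensorFlow build expects, you can inspect its build information (available in TensorFlow >= 2.3; the cuda_version and cudnn_version keys are present in Linux GPU builds, so .get() is used in case they are missing):

```python
import tensorflow as tf

# Report the CUDA and cuDNN versions this TensorFlow build was
# compiled against, so matching modules can be loaded
info = tf.sysconfig.get_build_info()
print(info.get('cuda_version'), info.get('cudnn_version'))
```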

The following table shows which module to load for the various versions of TensorFlow, based on the tested build configurations.

TensorFlow | CUDA | cuDNN  | cuDNN module to load
---------- | ---- | ------ | --------------------
2.12.0     | 11.8 | >= 8.6 | cuDNN/8.7.0.84-CUDA-11.8.0 (inc. CUDA 11.8.0) recommended
2.4.0      | 11.0 | >= 8.0 | cuDNN/8.0.4.30-CUDA-11.0.2 (inc. CUDA 11.0.2) recommended
2.3.0      | 10.1 | >= 7.6 | cuDNN/7.6.4.38-gcccuda-2019b (inc. CUDA 10.1.243)
2.1.0      | 10.1 | >= 7.6 | cuDNN/7.6.4.38-gcccuda-2019b (inc. CUDA 10.1.243)
2.0.0      | 10.0 | >= 7.4 | cuDNN/7.4.2.24-CUDA-10.0.130
1.14.0     | 10.0 | >= 7.4 | cuDNN/7.4.2.24-CUDA-10.0.130
1.13.1     | 10.0 | >= 7.4 | cuDNN/7.4.2.24-CUDA-10.0.130
>= 1.5.0   | 9.0  | 7      | N/A
>= 1.3.0   | 8.0  | 6      | N/A
>= 1.0.0   | 8.0  | 5.1    | N/A

Training

The Research Software Engineering team has an introductory workshop on deep learning with the TensorFlow Keras framework: https://rses-dl-course.github.io/