Bede (Tier 2 GPU cluster)

Bede is a new EPSRC-funded ‘Tier 2’ (regional) HPC cluster. It is currently being configured and tested, and should be available in the summer of 2020 to researchers from N8 Research Partnership institutions (Durham, Lancaster, Leeds, Liverpool, Manchester, Newcastle, Sheffield and York).

This system will be particularly well suited to supporting:

  • Jobs that require multiple GPUs and possibly multiple nodes

  • Jobs that require frequent movement of data between CPU and GPU memory

NB the system was previously known as NICE-19.

Hardware, OS and scheduler

  • Main GPU nodes (32x) - each (IBM AC922) node has

    • 2x IBM POWER9 CPUs (and two NUMA nodes), with

    • 2x NVIDIA V100 GPUs per CPU

    • Each CPU is connected to its two GPUs via high-bandwidth, low-latency NVLink interconnects (helps if you need to move lots of data to/from GPU memory)

  • Inference GPU nodes (6x IBM IC922)

    • 4x nodes have NVIDIA T4 inference GPUs

    • 2x nodes have FPGAs instead (BittWare 250-SoC, a Xilinx Zynq UltraScale+ FPGA with on-package ARM cores)

  • Networking

    • 100 Gb/s EDR InfiniBand (high bandwidth and low latency to support multi-node jobs)

    • 10 Gb Ethernet as a backup

  • Storage: Lustre parallel file system (available over InfiniBand and Ethernet network interfaces)

  • Scheduler: Slurm
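
Multi-GPU batch jobs on a Slurm system are typically submitted with sbatch. The sketch below shows the general shape of a two-node, eight-GPU job script; note that the partition name, GPU resource string and application name are placeholders, as Bede's actual Slurm configuration has not yet been published:

```shell
#!/bin/bash
#SBATCH --job-name=multi-gpu-job
#SBATCH --partition=gpu          # placeholder: actual partition names TBC
#SBATCH --nodes=2                # two AC922 nodes
#SBATCH --gres=gpu:4             # all four V100 GPUs on each node
#SBATCH --ntasks-per-node=4      # one task per GPU
#SBATCH --time=01:00:00

# Launch one task per allocated GPU across both nodes
srun ./my_gpu_application        # hypothetical application
```

Scripts like this are submitted with `sbatch myscript.sh` and can be monitored with `squeue`.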

Research software

The set of research software available on the cluster is yet to be finalised but is likely to include the following (subject to change):

  • IBM Watson Machine Learning Community Edition

    • IBM Distributed Deep Learning (DDL)

      • Efficiently scale popular machine learning frameworks over multiple CPUs/GPUs/nodes

      • Works with TensorFlow, IBMCaffe and PyTorch

    • IBM Large Model Support

      • Allows you to work with models too large to fit into the memory of a single GPU by transparently moving data between CPU and GPU memory as required.

      • Works with BVLC Caffe, IBMCaffe, TensorFlow, TensorFlow-Keras, PyTorch

    • IBM Snap ML

      • A library for training generalized linear models

      • Supports GPU acceleration, distributed training and sparse data structures

      • Can integrate with Scikit-Learn and Apache Spark

    • IBM provides this software, including its custom builds of PyTorch, TensorFlow etc., via conda channels
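
  As an illustration, installing WML CE typically involves adding IBM's conda channel and then creating an environment; the channel URL and package names below follow IBM's WML CE documentation at the time of writing and may change between releases:

  ```shell
  # Add IBM's WML CE conda channel (URL may change between releases)
  conda config --prepend channels \
      https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/

  # Create an environment containing the full WML CE stack,
  # then activate it before running DDL/LMS/Snap ML workloads
  conda create --name wmlce python=3.7 powerai
  conda activate wmlce
  ```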

  • Profiling and debugging

    • Standard GNU toolchain via the IBM Advance Toolchain for Linux

      • Provides IBM-optimised GNU compilers, BLAS/LAPACK, glibc, gdb, valgrind, itrace, Boost, Python, Go and more

    • NVIDIA tools

      • nvprof

      • nsight-systems and nsight-compute

      • cuda-gdb
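
Typical invocations of these NVIDIA tools look like the following; `./my_app` is a hypothetical CUDA application built with debug symbols:

```shell
# Legacy profiler: print a summary of GPU kernel and memcpy activity
nvprof --print-gpu-summary ./my_app

# Nsight Systems: capture a system-wide timeline trace to report.*
nsys profile -o report ./my_app

# Nsight Compute: collect detailed per-kernel metrics
ncu -o kernel_report ./my_app

# Step through host and device code interactively
cuda-gdb ./my_app
```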