GPU nodes for specific Computer Science academics

Four academics in the Department of Computer Science (DCS) share two GPU nodes in Bessemer.

Academic         Slurm Account name
Loic Barrault    dcs-acad1
Chenghua Lin     dcs-acad2
Aditya Gilra     dcs-acad3
Po Yang          dcs-acad4

Hardware specifications

bessemer-node041 and bessemer-node042 each have:

  • Processors: 2x Intel Xeon Gold 6138 (2.00 GHz; 20 cores per CPU, i.e. 40 cores per node)

  • RAM: 192 GB (DDR4 @ 2666 MHz)

  • NUMA nodes: 2x

  • GPUs: 4x NVIDIA Tesla V100 SXM2 (16 GB RAM each; NVLink interconnects between GPUs)

  • Networking: 25 Gbps Ethernet

  • Local storage: 140 GB of temporary storage under /scratch (2x SSD in RAID1)

Note

Most other GPU nodes in Bessemer have 32GB of GPU memory per GPU.

Requesting access

Users other than the four academics listed above should contact one of those academics if they want access to these nodes.

That academic can then grant users access to the relevant Slurm Account (e.g. dcs-acad1) via this web interface.

Using the nodes

There are several ways to access these nodes. The type of access granted for a job depends on which Slurm Account and Partition are requested at job submission time. Only certain users have access to a given Account.

1. Non-pre-emptable access to half a node

Each of the four academics (plus their collaborators) has ring-fenced, on-demand access to the resources of half a node.

To submit a job via this route, you need to specify a Partition and Account when submitting a batch job or starting an interactive session (an illustrative batch script follows the resource limits below):

  • Partition: dcs-acad

  • Account: dcs-acadX (where X is 1, 2, 3 or 4 and varies between the academics).

  • QoS: do not specify one, i.e. do not use the --qos parameter.

Resource limits per job:

  • Default run-time: 8 hours

  • Maximum run-time: 7 days

  • CPU cores: 20

  • GPUs: 2

  • Memory: 96 GB

2. Pre-emptable access to both nodes

If any of the academics (or their collaborators) want to run a larger job that requires up to all of the resources available in one of these two nodes, they can specify a different Partition when submitting a batch job or starting an interactive session:

  • Partition: dcs-acad-pre

  • Account: dcs-acadX (where X is 1, 2, 3 or 4 and varies between the academics).

  • QoS: do not specify one, i.e. do not use the --qos parameter.

However, to facilitate fair sharing of these GPU nodes, jobs submitted via this route are pre-emptable: they will be stopped mid-execution if a job submitted to the dcs-acad partition (see above) requires those resources.

When a job submitted via this route is pre-empted by another job, the pre-empted job is terminated and re-queued.

Resource limits per job: