GPU nodes for specific Computer Science academics

Four academics in the Department of Computer Science (DCS) share two NVIDIA V100 GPU nodes in Bessemer:

Academic         | Node             | Slurm Account name | Slurm Partition name
Carolina Scarton | bessemer-node041 | dcs-acad1          | (see notes below)
Chenghua Lin     | bessemer-node041 | dcs-acad2          | (see notes below)
Matt Ellis       | bessemer-node042 | dcs-acad3          | (see notes below)
Po Yang          | bessemer-node042 | dcs-acad4          | (see notes below)

Other academics in the department have temporary access to some NVIDIA A100 GPU nodes in Bessemer:

Academic(s)                    | Node        | Slurm Account name | Slurm Partition name
Heidi Christensen / Jon Barker | gpu-node017 | dcs-acad5          | dcs-acad5
Nafise Sadat Moosavi           | gpu-node018 | dcs-acad6          | dcs-acad6

Hardware specifications

bessemer-node041 and bessemer-node042

Processors    | 2x Intel Xeon Gold 6138 (2.00 GHz; 20 cores per CPU)
RAM           | 192 GB (DDR4 @ 2666 MHz)
NUMA nodes    | 2x
GPUs          | 4x NVIDIA Tesla V100 SXM2 (16 GB RAM each; NVLink interconnects between GPUs)
Networking    | 25 Gbps Ethernet
Local storage | 140 GB of temporary storage under /scratch (2x SSD RAID1)

gpu-node017 and gpu-node018

See the specifications of the NVIDIA A100 nodes listed here.

Requesting access

Users other than the academics listed above who want access to these nodes should contact one of those academics.

That academic can then grant users access to the relevant Slurm Account (e.g. dcs-acad1) via this web interface.

Using the nodes

There are several ways to access these nodes. The type of access granted for a job depends on which Slurm Account and Partition are requested at job submission time. Only certain users have access to a given Account.

bessemer-node041 and bessemer-node042: non-preemptable access to half a node

Each of the four academics (plus their collaborators) has ring-fenced, on-demand access to the resources of half a node.

To submit a job via this route, you need to specify a Partition and an Account when submitting a batch job or starting an interactive session:

  • Partition: dcs-acad

  • Account: dcs-acadX (where X is 1, 2, 3 or 4, depending on the academic).

  • QoS: do not specify one, i.e. do not use the --qos parameter.
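Submission via this route can be sketched as a batch script like the following (a non-authoritative example: the Account dcs-acad1 and the resource amounts are placeholders; substitute the Account you were granted and the resources your job actually needs):

```shell
#!/bin/bash
# Sketch of a batch job for the non-preemptable half-node route.
#SBATCH --partition=dcs-acad
#SBATCH --account=dcs-acad1   # placeholder: use the dcs-acadX Account you were granted
#SBATCH --gres=gpu:2          # up to 2 of the node's V100 GPUs
#SBATCH --cpus-per-task=20    # up to 20 CPU cores
#SBATCH --mem=96G             # up to 96 GB of RAM
#SBATCH --time=1-00:00:00     # 1 day (default is 8 hours; maximum 7 days)
# Note: no "#SBATCH --qos" line for this route.

nvidia-smi                    # replace with your GPU application
```

An interactive session can be requested similarly, e.g. srun --partition=dcs-acad --account=dcs-acad1 --gres=gpu:1 --pty bash -i.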

Resource limits per job:

  • Default run-time: 8 hours

  • Maximum run-time: 7 days

  • CPU cores: 20

  • GPUs: 2

  • Memory: 96 GB

bessemer-node041 and bessemer-node042: preemptable access to both nodes

If any of the four academics (or their collaborators) want to run a larger job that requires up to all the resources of one of these two nodes, they can specify a different Partition when submitting a batch job or starting an interactive session:

  • Partition: dcs-acad-pre

  • Account: dcs-acadX (where X is 1, 2, 3 or 4, depending on the academic).

  • QoS: do not specify one, i.e. do not use the --qos parameter.

However, to facilitate fair sharing of these GPU nodes, jobs submitted via this route are preemptable: they will be stopped mid-execution if a job submitted to the dcs-acad partition (see above) requires those resources.

When a job submitted via this route is preempted by another job, the preempted job is terminated and re-queued.
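A whole-node preemptable job can be sketched as follows (a hedged example: the Account, the application command and the checkpointing approach are placeholders; the whole-node figures are taken from the hardware specifications above):

```shell
#!/bin/bash
# Sketch of a batch job for the preemptable route, requesting a whole node.
#SBATCH --partition=dcs-acad-pre
#SBATCH --account=dcs-acad1   # placeholder: use the dcs-acadX Account you were granted
#SBATCH --gres=gpu:4          # all 4 V100 GPUs in the node
#SBATCH --cpus-per-task=40    # all 40 CPU cores
#SBATCH --mem=160G            # leave headroom below the node's 192 GB
#SBATCH --requeue             # allow Slurm to re-queue this job if it is preempted

# Because the job may be terminated and restarted at any time,
# checkpoint regularly and resume from the latest checkpoint on start-up:
./my_app --resume-from-latest-checkpoint   # hypothetical application command
```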

Resource limits per job:

  • Default run-time: 8 hours

  • Maximum run-time: 7 days

  • CPU cores: 40

  • GPUs: 4

  • Memory: 192 GB

gpu-node017 and gpu-node018: access to a node

Two sets of academics (plus their collaborators) each have access to one node.

To submit a job via this route, you need to specify a Partition and an Account when submitting a batch job or starting an interactive session:

  • Partition: dcs-acadX (where X is 5 or 6, depending on the academic).

  • Account: dcs-acadX (where again X is 5 or 6).

  • QoS: do not specify one, i.e. do not use the --qos parameter.
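As with the other routes, submission can be sketched as a batch script (dcs-acad5 is shown; swap in dcs-acad6 where appropriate, and treat the resource amounts as placeholders to adjust for your job):

```shell
#!/bin/bash
# Sketch of a batch job for one of the A100 nodes, where the
# Partition and Account names are identical.
#SBATCH --partition=dcs-acad5
#SBATCH --account=dcs-acad5
#SBATCH --gres=gpu:1          # up to 4 A100 GPUs per job
#SBATCH --cpus-per-task=12    # up to 48 CPU cores per job
#SBATCH --mem=64G             # up to 512 GB of RAM per job
# Note: no "#SBATCH --qos" line for this route.

nvidia-smi                    # replace with your GPU application
```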

Resource limits per job:

  • Default run-time: 8 hours

  • Maximum run-time: 7 days

  • CPU cores: 48

  • GPUs: 4

  • Memory: 512 GB