GPU nodes (Computer Science)

These GPU nodes were purchased for Bessemer by the Department of Computer Science (DCS) for use by DCS research staff, their collaborators and their research students.

Hardware specifications

Eight nodes (bessemer-node030 to bessemer-node037) each have:

  • Processors: 2x Intel Xeon Gold 6138 (2.00 GHz; 20 cores per CPU)

  • RAM: 192 GB (DDR4 @ 2666 MHz)

  • NUMA nodes: 2x

  • GPUs: 4x NVIDIA Tesla V100 SXM2 (32 GB RAM each; NVLink interconnects between GPUs)

  • Networking: 25 Gbps Ethernet

  • Local storage: 140 GB of temporary storage under /scratch (2x SSD in RAID 1)
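Once you have a job running on one of these nodes you can confirm the GPU specification above with nvidia-smi, for example:

nvidia-smi           # lists the V100 GPUs visible to your job, their memory and current utilisation
nvidia-smi topo -m   # prints the GPU interconnect matrix, including the NVLink links between GPUs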

Requesting Access

Access to these nodes is managed by the RSE team. Access policy:

  • PhD students, researchers and staff in Computer Science can all request access to the nodes.

  • Access for others collaborating on projects with some Computer Science / RSE involvement can be granted on a case-by-case basis.

  • Access for Computer Science MSc and BSc students can be granted on a case-by-case basis.

A number of other users were granted access before this policy was developed.

To request access, complete this Google Form; someone from the RSE team will then respond with further information.

Using the nodes

There are several ways to access these nodes. The type of access granted for a job depends on which Slurm Account and Partition are requested at job submission time.

1. DCS test/debugging access

This route is intended for short test batch jobs and for interactive debugging.

To submit a job via this route, you need to specify a Partition and Account when submitting a batch job or starting an interactive session:

  • Partition: dcs-gpu-test

  • Account: dcs-res (members of DCS) or dcs-collab (collaborators of DCS)

  • QoS: do not specify one, i.e. do not use the --qos parameter.

Resource limits per job:

Each user can run a maximum of two of these jobs concurrently.
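For example, an interactive debugging session with a single GPU can be started with srun along the following lines (a sketch: the single GPU request is illustrative, and collaborators should use dcs-collab instead of dcs-res):

srun --partition=dcs-gpu-test --account=dcs-res --gres=gpu:1 --pty bash -i

A short test batch job is submitted in the same way, passing the same --partition and --account options to sbatch or setting them via #SBATCH lines in the job script.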

2. DCS access for larger jobs

If you want to run a longer batch job which uses up to all the resources available in one of these nodes, or a longer interactive job which uses up to 2 GPUs, then you can specify a different Partition when submitting a batch job or starting an interactive session:

  • Partition: dcs-gpu

  • Account: dcs-res (members of DCS) or dcs-collab (collaborators of DCS)

  • QoS: do not specify one, i.e. do not use the --qos parameter.

Please only submit batch jobs via this route: long-running interactive sessions with large resource requests are often an inefficient use of cluster resources.
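As a sketch, a batch script submitted via this route might look like the following; the GPU, CPU, memory and time requests are illustrative assumptions, so adjust them to your job and to the per-job limits:

#!/bin/bash
#SBATCH --partition=dcs-gpu
#SBATCH --account=dcs-res      # or dcs-collab for collaborators of DCS
#SBATCH --nodes=1
#SBATCH --gres=gpu:4           # up to the four V100s in a node
#SBATCH --cpus-per-task=8      # illustrative CPU request
#SBATCH --mem=64G              # illustrative memory request
#SBATCH --time=04:00:00        # illustrative run time

# Load any modules your software needs, then run it.
# nvidia-smi is used here only as a placeholder workload that reports the allocated GPUs.
nvidia-smi

Submit the script with sbatch, e.g. sbatch my_gpu_job.sh (where my_gpu_job.sh is whatever you have named the file).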

Resource limits per job:

Checking Queue and Node Status

The Slurm commands squeue and sinfo can be used to query the status of these nodes. Knowing how many jobs are queued for these nodes, and the state of the nodes themselves, can help you estimate when your jobs will run.

squeue can be used to view running and queued jobs for specific partitions using -p <partition_list>. Requesting non-default output fields with -o or -O, such as each job's time limit, can help you estimate when your jobs may begin to run.

squeue -p dcs-gpu,dcs-gpu-test -o "%.18i %.12j %.12u %.12b %.2t %.10M %.10l %R"

This will produce output similar to:

  JOBID         NAME         USER TRES_PER_NOD ST       TIME TIME_LIMIT NODELIST(REASON)
XXXXXXX     job_name     USERNAME   gres:gpu:1 PD       0:00    1:00:00 (Resources)
YYYYYYY     job_name     USERNAME   gres:gpu:1  R   12:34:56 7-00:00:00 bessemer-nodeNNN
...
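If your job is already queued, squeue can also report Slurm's estimated start time for it; this is only an estimate and will change as other jobs finish or are submitted:

squeue --start -u $USER -p dcs-gpu,dcs-gpu-test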

sinfo can be used to query the status of nodes within a partition. For GPU nodes it is useful to also request Gres and GresUsed:

sinfo -p dcs-gpu -N -O "NodeList,Available,Gres,GresUsed,CPUsState"

When all GPUs in the partition are in use, the output will be similar to the following (the CPUS(A/I/O/T) column shows allocated/idle/other/total CPU counts for each node):

NODELIST            AVAIL               GRES                GRES_USED           CPUS(A/I/O/T)
bessemer-nodeNNN    up                  gpu:v100:4(S:0)     gpu:v100:4(IDX:0-3) 16/24/0/40
...