Four academics in the Department of Computer Science (DCS) share two GPU nodes in Bessemer.
| Academic | Slurm Account name |
|---|---|
|  | dcs-acad1 |
|  | dcs-acad2 |
|  | dcs-acad3 |
|  | dcs-acad4 |
bessemer-node041 and bessemer-node042 each have:
- Processors: 2x Intel Xeon Gold 6138 (2.00 GHz; 20 cores per CPU)
- RAM: 192 GB (DDR4 @ 2666 MHz)
- NUMA nodes: 2
- GPUs: 4x NVIDIA Tesla V100 SXM2 (16 GB RAM each; NVLink interconnects between GPUs)
- Networking: 25 Gbps Ethernet
- Local storage: 140 GB of temporary storage under
**Note:** Most other GPU nodes in Bessemer have 32 GB of GPU memory per GPU.
Users other than the four listed academics who want access to these nodes should contact one of those academics. That academic can then grant users access to the relevant SLURM Account (e.g. dcs-acad1) via this web interface.
There are several ways to access these nodes. The type of access granted for a job depends on which SLURM Account and Partition are requested at job submission time. Only certain users have access to a given Account.
Each of the four academics (plus their collaborators) has ring-fenced, on-demand access to the resources of half a node.
To submit a job via this route, you need to specify a *Partition* and *Account* when submitting a batch job or starting an interactive session:
- Partition: `dcs-acad`
- Account: `dcs-acadX`, where `X` is 1, 2, 3 or 4 (varies between the academics)
- QoS: do not specify one, i.e. do not use the `--qos` parameter
Resource limits per job:

- Default run-time: 8 hours
- Maximum run-time: 7 days
- CPU cores: 20
- GPUs: 2
- Memory: 96 GB
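As a sketch, a batch job submitted via this route might look like the following. The resource requests and the payload command are illustrative, and `dcs-acad1` is used only as an example Account: substitute your own `dcs-acadX` Account.

```shell
#!/bin/bash
# Illustrative batch script for the ring-fenced, on-demand route.
#SBATCH --partition=dcs-acad
#SBATCH --account=dcs-acad1   # replace with your own dcs-acadX Account
#SBATCH --time=08:00:00       # default run-time; can request up to 7-00:00:00
#SBATCH --cpus-per-task=10    # up to 20 cores per job
#SBATCH --gres=gpu:2          # up to 2 GPUs per job
#SBATCH --mem=48G             # up to 96 GB per job
# Note: no --qos parameter is specified for this route.

nvidia-smi   # example payload: report the GPUs allocated to the job
```

An interactive session could similarly be started with, e.g., `srun --partition=dcs-acad --account=dcs-acad1 --gres=gpu:1 --pty bash -i` (again assuming the `dcs-acad1` Account).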
If any of the academics (or their collaborators) want to run a larger job requiring up to all the resources available in one of these two nodes, they can specify a different Partition when submitting a batch job or starting an interactive session:
- Partition: `dcs-acad-pre`
- Account: `dcs-acadX`, where `X` is 1, 2, 3 or 4 (varies between the academics)
- QoS: do not specify one, i.e. do not use the `--qos` parameter
However, to facilitate fair sharing of these GPU nodes, jobs submitted via this route are pre-emptable: they will be stopped mid-execution if a job submitted to the dcs-acad partition (see above) requires those resources. When a job submitted via this route is pre-empted, it is terminated and re-queued.
Resource limits per job:

- CPU cores, RAM and GPUs: up to those of a single node, i.e. multi-node jobs are not permitted
- Run-time: same default and maximum as above
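A whole-node job via this pre-emptable route might be sketched as follows. The Account, resource requests and payload are illustrative (`my_gpu_program` is a hypothetical application); since pre-empted jobs are terminated and re-queued, long-running work should checkpoint periodically so a restarted job can resume.

```shell
#!/bin/bash
# Illustrative batch script for the pre-emptable whole-node route.
#SBATCH --partition=dcs-acad-pre
#SBATCH --account=dcs-acad1   # replace with your own dcs-acadX Account
#SBATCH --time=7-00:00:00     # maximum run-time (7 days)
#SBATCH --nodes=1             # multi-node jobs are not permitted
#SBATCH --exclusive           # request all resources in the node
#SBATCH --gres=gpu:4          # all four V100 GPUs in the node
#SBATCH --requeue             # allow Slurm to re-queue the job if pre-empted
# Note: no --qos parameter is specified for this route.

# Hypothetical application that saves and resumes from checkpoints,
# so that progress survives pre-emption and re-queueing.
./my_gpu_program --resume-from-checkpoint
```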