NVIDIA DGX-1 (Computer Science)
NVIDIA’s blurb when the DGX-1 was launched:
The NVIDIA DGX-1 is the world’s first Deep Learning supercomputer. It is equipped with 8 Tesla P100 GPUs connected together with NVLink technology to provide super fast inter-GPU communication. Capable of performing 170 TeraFLOPs of computation, it can provide up to 75 times speed up on training deep neural networks compared to the latest Intel Xeon CPU.
2x 20-core Intel Xeon E5-2698 v4 2.2 GHz
512 GB system RAM
8x Tesla P100 GPUs (16GB RAM each)
1 Gbps Ethernet
SSD storage array: provides 6.5TB under
One GPU is faulty; only 7 GPUs are currently usable.
All other nodes in ShARC are connected to the cluster’s 100 Gbps Omni-Path networking fabric, but this node is not, as it cannot accommodate an Omni-Path adaptor card.
The storage array is configured for performance (RAID 0) rather than resilience, so it should be treated as a cache and should not be used to store the only copy of important data.
Same as per the DCS big memory nodes.
Starting an interactive session
Once you have been granted access to the DCS nodes in ShARC, to request an interactive session (interactive job) on the DGX-1 node with a single GPU, type:
qrshx -l gpu=1 -P rse -q rse-interactive.q
-l gpu=1 denotes the number of GPUs that will be used in the job (maximum of 7). This is required; otherwise the job will be placed on one of the DCS team’s CPU-only nodes.
-P rse -q rse-interactive.q denotes that the job will be submitted under the rse Project and that you want it to run only in the rse-interactive.q job queue, which is for interactive jobs that can run for up to 8 hours.
You are limited to at most 1 GPU over all your jobs on the DGX-1 that do not run in the
rse.q (batch-job-only) job queue.
This is to encourage users to prefer batch jobs over interactive sessions on the DGX-1,
as batch jobs make more efficient use of resources.
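Once an interactive session has started, it can be worth confirming that a GPU was actually allocated before launching any work. The following is a minimal sketch, not a documented ShARC procedure. It assumes that the NVIDIA driver utility nvidia-smi is on the PATH of the DGX-1 node and that the scheduler exposes the allocated device through the CUDA_VISIBLE_DEVICES environment variable; neither is stated above, and check_gpu is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: report which GPUs (if any) this session can see.
check_gpu() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # List index, model and total memory of each visible GPU as CSV.
        nvidia-smi --query-gpu=index,name,memory.total --format=csv
    else
        echo "nvidia-smi not found: this session is probably on a CPU-only node"
    fi
    # Assumption: the scheduler sets this variable for GPU jobs.
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
}

check_gpu
```

If the session landed on a CPU-only node (for example because -l gpu=1 was omitted), the helper reports that nvidia-smi is missing instead of listing a Tesla P100.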
Submitting batch jobs
Batch jobs can be submitted to the DGX-1 by adding lines containing -P rse and -q rse.q to your job submission script (note that the argument to -q differs from the one used for interactive sessions).
For example, create a job script named my_job_script.sh with the contents:
#!/bin/bash
#$ -l gpu=1
#$ -P rse
#$ -q rse.q
echo "Hello world"
You can add additional lines beneath #$ -q rse.q to request additional resources, e.g. #$ -l rmem=10G to request 10GB RAM per CPU core rather than the default.
Run your script with the qsub command.
You can use the qstat command to check the status of your current job.
An output file is created in your home directory that captures your script’s outputs.
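The batch workflow described in this section can be sketched end to end as follows. This is an illustrative sketch under stated assumptions rather than an official recipe: it writes the example job script (with the optional rmem request added), and submits it only if the Sun Grid Engine client qsub is found, so it is harmless to run on a machine outside ShARC.

```shell
#!/bin/bash
# Write the example job script; the quoted 'EOF' stops the shell
# from expanding anything inside the here-document.
cat > my_job_script.sh <<'EOF'
#!/bin/bash
#$ -l gpu=1
#$ -P rse
#$ -q rse.q
#$ -l rmem=10G
echo "Hello world"
EOF

# Submit only when the scheduler client is available (i.e. on ShARC).
if command -v qsub >/dev/null 2>&1; then
    qsub my_job_script.sh
    # Show the state of this user's queued and running jobs.
    qstat -u "${USER:-$(whoami)}"
else
    echo "qsub not found: run this on ShARC to submit the job"
fi
```

The lines beginning with #$ are ordinary comments to bash, so the script also runs unchanged outside the scheduler, which is a convenient way to check it for typos before submitting.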
Note that the maximum run-time for jobs submitted to the (batch job only)
rse.q is four days,
as is standard for batch jobs on ShARC.
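If you want to request a specific run-time up to that limit, Sun Grid Engine expresses it through the h_rt resource in hh:mm:ss form. The sketch below converts a whole number of days into that form; it assumes h_rt is the resource used for run-time requests on this queue (it is not mentioned above), and days_to_h_rt is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: convert whole days to the hh:mm:ss form used by h_rt.
days_to_h_rt() {
    printf '%d:00:00\n' "$(( $1 * 24 ))"
}

# Print the directive line for the four-day maximum.
echo "#$ -l h_rt=$(days_to_h_rt 4)"
```

For the four-day maximum this prints the line #$ -l h_rt=96:00:00, which could be placed in a job script alongside the other #$ lines.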
See Starting interactive jobs and submitting batch jobs for more information on job submission and the Sun Grid Engine scheduler.
Deep Learning on the DGX-1
Many popular Deep Learning packages are available to use on the DGX-1 and the ShARC cluster. Please see Deep Learning on ShARC for more information.