Advanced Job Profiling and Analysis
Tip
For majority of the users the Job Submission and Control page and the output log files you produce should suffice in analysing the perfomance of a batch job. This section is for users who want an even more granular view of the perfomance of their batch jobs by observing what is happening on the node their job is running.
Warning
Since your job has a fixed set of memory and CPU resources, carrying out resource hungry operations might lead to the scheduler killing your batch job due to errors like out of memory
.
Abuse of this feature to carry out tasks that are not profiling and perfomance analysis of your running batch job might lead to your account being suspended.
Accessing a Running Single-Node Slurm Batch job
In some cases, you might want to interact with a batch job in the RUNNING
state (e.g. for fault-finding, debugging or profiling purposes). You can start an interactive session within the resource allocation (memory and CPU cores on particular nodes) associated with the job with:
srun --jobid=<JOBID> --pty /usr/bin/bash
The command creates a new Job Step in the batch job with ID <JOBID>
and starts an interactive bash shell session within that Job Step, allowing you to interact with the resources allocated to that job.
If all the allocated CPU resources are already used, srun
will prohibit the new Job Step access to those resources. However, the argument --overlap
can be passed to srun
to allow Job Steps to share access to those resources.
srun --jobid=<JOBID> --overlap --pty /usr/bin/bash
Once you are in the interactive session you can see the process IDs associated with your job by typing:
scontrol listpids |grep <JOBID>
or
ps -u $USER
Start profiling and analysing the perfomance of the node and the job by using commands such as:
ps
nvidia-smi
top
lsof
Accessing a Running Multi-Node Slurm Batch job
In the scenario you are running a multi-node Slurm job you can use squeue
to see the nodes your job is using:
squeue --me
Example output:
squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
860638 sheffield job.sh user123 R 1:28:01 1 node301
830209 sheffield job.sh user123 R 2-18:45:36 1 node087
831510 sheffield job.sh user123 R 2-02:08:04 4 node[075-078]
Once you have the list of nodes you can specify the nodes you want the interactive session to launch on by using --nodelist=<NODELIST>
.
srun --jobid=<JOBID> --nodelist=<Node Name> --overlap --pty /usr/bin/bash