Attention

The ShARC HPC cluster was decommissioned on the 30th of November 2023 at 17:00. It is no longer possible for users to access that cluster.

SGE Scheduler Job Submission info

Job Submission / Control on ShARC

Interactive Jobs

There are three commands for requesting an interactive shell using SGE:

  • qrsh - No support for graphical applications. Standard SGE command.

  • qsh - Supports graphical applications. Standard SGE command.

  • qrshx - Supports graphical applications. Superior to qsh and is unique to Sheffield’s clusters.

Usage of these commands is as follows:

$ qrshx

You can configure the resources available to the interactive session by adding command line options. For example to start an interactive session with access to 16 GB of RAM:

$ qrshx -l rmem=16G

To start a session with access to 2 cores in the SMP parallel environment:

$ qrshx -pe smp 2

Common interactive job options are listed below; any of these can be combined to request more resources (see the example after the list).

  • -l h_rt=hh:mm:ss - Specify the total maximum wall clock execution time for the job. The upper limit is 08:00:00. Note: these limits may differ for reservations / projects.

  • -l rmem=xxG - Specify the maximum amount (xx) of real memory requested per CPU core. If the real memory usage of your job exceeds this value multiplied by the number of cores / nodes you requested, your job will be killed.

  • -pe env nn - Specify a parallel environment, env, and a number of processor cores, nn, e.g. -pe smp 4 for SMP jobs or -pe mpi 4 for MPI jobs. Note that ShARC has multiple parallel environments; the current list can be found on the ShARC Parallel Environments page.
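
For example, the options above can be combined to request an interactive session with 2 cores in the SMP parallel environment, 8 GB of real memory per core and a four hour run time limit (illustrative values; adjust them to your needs):

$ qrshx -pe smp 2 -l rmem=8G -l h_rt=04:00:00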

Batch Jobs

Tip

Batch jobs have larger resource limits than interactive jobs! For guidance on what these limits are and how best to select resources please see our Choosing appropriate compute resources page.

There is a single command to submit jobs via SGE:

  • qsub - Standard SGE command with no support for interactivity or graphical applications.

Batch submission scripts are submitted to the scheduler as follows:

$ qsub submission.sh

Make a note of the job ID that is reported on submission. For example:

Your job 12345678 ("submission.sh") has been submitted

Once the job has finished you can check its output and error log files, which are named after the submission script and the job ID:

$ cat submission.sh.o12345678
$ cat submission.sh.e12345678

There are numerous further options you can specify in your batch submission files; the most common are detailed below:

Pass through current shell environment (sometimes important):

#$ -V

Name your job submission:

#$ -N test_job

Specify a parallel environment for SMP jobs where N is a number of cores:

#$ -pe smp N

Specify a parallel environment for MPI jobs where N is a number of cores:

#$ -pe mpi N

Request a specific amount of memory where N is a number of gigabytes per core:

#$ -l rmem=NG

Request a specific amount of time in hours, minutes and seconds:

#$ -l h_rt=hh:mm:ss

Request email notifications on start, end and abort:

#$ -M me@somedomain.com
#$ -m abe

For the full list of the available options please visit the SGE manual webpage for qsub here: http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html

Here is an example SGE batch submission script that runs a fictitious program called foo:

#!/bin/bash
# Request 5 gigabytes of real memory (rmem)
#$ -l rmem=5G

# load the module for the program we want to run
module load apps/gcc/foo

# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res

Some things to note:

  • The first line always needs to be #!/bin/bash (to tell the scheduler that this is a bash batch script).

  • Comments start with a #.

  • It is always best to fully specify your job’s resources in your submission script.

  • All SGE Scheduler options, such as the amount of memory requested, start with #$

  • You will often require one or more module commands in your submission file. These make programs and libraries available to your scripts. Many applications and libraries are available as modules on ShARC.
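
You can discover which modules are available and which you currently have loaded with the standard environment modules commands (apps/gcc/foo is the fictitious example module used on this page):

$ module avail              # list all available modules
$ module load apps/gcc/foo  # make the fictitious foo application available
$ module list               # show the modules currently loaded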

Here is a more complex example that requests more resources:

#!/bin/bash
# Request 16 gigabytes of real memory in total (4 cores * 4 GB per core)
#$ -l rmem=4G
# Request 4 cores in an OpenMP environment
#$ -pe openmp 4
# Email notifications to me@somedomain.com
#$ -M me@somedomain.com
# Email notifications if the job aborts
#$ -m a
# Name the job
#$ -N my_job
# Request 24 hours of time
#$ -l h_rt=24:00:00

# Load the modules required by our program
module load compilers/gcc/5.2
module load apps/gcc/foo

# Set the OMP_NUM_THREADS environment variable to the number of requested cores ($NSLOTS)
# This is needed to ensure efficient core usage.
export OMP_NUM_THREADS=$NSLOTS

# Run the program foo with input foo.dat
# and output foo.res
foo foo.dat foo.res

Monitoring running Jobs

Running and queued jobs can be monitored using the qstat command:

Display your own jobs queued on the system

$ qstat

Display the details of a specific running or queued job:

$ qstat -j jobid

Display all jobs queued on the system

$ qstat -u "*"

Display all jobs queued by the username foo1bar

$ qstat -u foo1bar

Display all jobs in the openmp parallel environment

$ qstat -pe openmp

Display all jobs in the queue named foobar

$ qstat -q foobar.q

Example output:

$ qstat -u "*"
job-ID  prior   name       user          state submit/start at     queue                              slots   ja-task-ID
------------------------------------------------------------------------------------------------------------------------
1234567 0.00000 INTERACTIV foo1bar       dr    12/24/2021 07:13:20 interactive.q@sharc-node004.sh     1
1234568 0.00000 job.sh     foo1bar       r     01/22/2022 05:37:31 all.q@sharc-node019.shef.ac.uk     16
1234569 0.00000 job.sh     foo1bar       r     01/23/2022 07:41:18 all.q@sharc-node084.shef.ac.uk     16
1234570 0.00000 job.sh     foo1bar       Rr    01/23/2022 08:03:22 all.q@sharc-node068.shef.ac.uk     16
1234571 0.00076 job.sh     foo1bar       qw    01/23/2022 07:06:18                                    1
1234572 0.00067 job.sh     foo1bar       hqw   01/23/2022 07:06:18                                    1
1234573 0.00000 job.sh     foo1bar       Eqw   01/21/2022 13:50:55                                    1
1234574 0.00000 job.sh     foo1bar       t     01/24/2022 13:04:25 all.q@sharc-node159.shef.ac.uk     1        22964

SGE Job States:

State     Explanation                          SGE state letter code(s)
Pending   pending, queued                      qw
Pending   pending, user and/or system hold     hqw
Running   running                              r
Error     all pending states with error        Eqw, Ehqw, EhRqw

Key: q: queueing, r: running, w: waiting, h: on hold, E: error, R: re-run, s: job suspended, S: queue suspended, t: transferring, d: deletion.

Note

A full list of SGE and DRMAA states can be found here
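
qstat can also filter the display by state using the standard SGE -s option, for example (see the qstat man page for the full list of state selectors):

$ qstat -s r    # show only running jobs
$ qstat -s p    # show only pending (queued) jobs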

Stopping or cancelling Jobs

Jobs can be stopped or cancelled using the qdel command:

A job can be cancelled using the qdel command as shown below, substituting 123456 with your own job ID:

$ qdel 123456
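
To cancel all of your own queued and running jobs at once, qdel also accepts a username via the standard SGE -u option (use with care):

$ qdel -u $USER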

Investigating finished Jobs

Jobs which have already finished can be investigated using the qacct command:

The qacct command can be used to display status information about a user’s historical jobs.

Running the qacct command alone will provide a summary of used resources from the current month for the user running the command.

The command can be used with a job’s ID to get job-specific information:

$ qacct -j job-id

Or to view information about all of a specific user’s jobs:

$ qacct -j -u $USER

By default the qacct command will only bring up summary info about the user’s jobs from the current accounting file (which rotates monthly). Further detail about the output metrics and how to query jobs older than a month can be found on the dedicated qacct page.
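
As an illustration only: qacct can read an alternative (rotated) accounting file via its standard -f option; the path below is a placeholder, and the actual location of the rotated files is described on the qacct page:

$ qacct -j 12345678 -f /path/to/rotated/accounting/file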

Debugging failed Jobs

Note

One common form of job failure on ShARC is caused by Windows style line endings. If you see an error reported by qacct of the form:

failed searching requested shell because:

Or by qstat of the form:

failed: No such file or directory

You must replace these line endings as detailed in the FAQ.
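
One way to strip these carriage returns is with the dos2unix utility or GNU sed (a sketch; the FAQ describes the recommended approach and dos2unix may not be available everywhere):

$ dos2unix submission.sh
$ sed -i 's/\r$//' submission.sh    # equivalent alternative using GNU sed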

If one of your jobs has failed and you need to debug why this has occurred, you should consult the job records held by the scheduler using the qacct command referenced above, as well as the generated job logs.

These output and error log files will be generated in the job working directory with the structure $JOBNAME.o$JOBID and $JOBNAME.e$JOBID where $JOBNAME is the user chosen name of the job and $JOBID is the scheduler provided job id. Looking at these logs should indicate the source of any issues.
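
For example, for a failed job named my_job with ID 12345678 (illustrative values), you might inspect:

$ cat my_job.o12345678    # standard output log
$ cat my_job.e12345678    # standard error log
$ qacct -j 12345678       # scheduler accounting record for the job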

The qacct info will contain two important metrics, the exit code and failure code.

The exit code is the return value of the exiting program / script. It can be a user-defined value if the job finishes with a call to exit(number). For abnormally terminated jobs it is the signal number + 128.

As an example: an exit code of 137 gives 137 - 128 = 9, i.e. signal 9 (SIGKILL); the job was sent the KILL signal and killed, most likely by the scheduler.
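
You can decode such an exit code directly in the shell using standard bash arithmetic and the kill utility:

$ echo $((137 - 128))    # prints 9
$ kill -l 9              # prints KILL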

The failure code indicates why a job was abnormally terminated (by the scheduler). An incomplete table of common failure codes is shown below:

Code   Meaning
100    failure after job
37     qmaster enforced h_rt limit (job ran out of time)
30     rescheduling on application error
28     no current working dir
27     no shell
26     failure opening output
21     failure in recognizing job
19     no exit status
8      failure in prolog
1      failure before job (execd)