Attention
Advance Notice: Bessemer will be retired at the end of the day on Friday 31st October 2025.
MPI for Multi-node and Parallel Jobs
Overview
Confirm that your application is built to support MPI.
Compile your code using MPI-aware compilers. Be sure to load the same modules in your job script.
To assign multiple tasks on a single machine, use --nodes=1 and --ntasks=n.
Launch the program using srun if you’re relying on system-installed MPI modules, or use mpirun for custom-built MPI.
To distribute tasks evenly across multiple machines, combine --nodes=N with --ntasks-per-node=n to yield a total of N x n tasks (a short sketch of both request patterns appears at the end of this overview).
Always check resource usage with tools like seff JOBID to ensure efficient use of allocations.
If you are uncertain about scaling up, contact the IT Services’ Research and Innovation team at an early stage.
MPI allows a program to run concurrently across many cluster nodes, although it typically requires more specialised programming.
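As a compact illustration of the request patterns listed in the overview above, a program could be launched interactively with srun as follows (a sketch; the task counts and the program name ./my_program are placeholders):
# 4 MPI tasks on a single node:
srun --nodes=1 --ntasks=4 ./my_program
# 2 nodes with 4 tasks on each (8 tasks in total):
srun --nodes=2 --ntasks-per-node=4 ./my_program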
What is MPI?
The Message Passing Interface is a standard for passing data and other messages between running processes which may or may not be on a single computer. It is commonly used on computer clusters as a means by which a set of related processes can work together in parallel on one or more tasks. These processes must therefore communicate data and other information by passing messages between each other.
MPI is used on systems ranging from a few interconnected Raspberry Pis through to the UK’s national supercomputer, ARCHER.
MPI Implementations
The Message Passing Interface (MPI) itself is just a specification for a message passing library.
There are multiple implementations of this specification, each produced by a different organisation, including OpenMPI and Intel MPI. This documentation includes information on the MPI implementations available on Stanage and Bessemer. On the Stanage cluster these implementations have been compiled in a way that allows them to make optimal use of the high-speed network infrastructure (OmniPath). If you are not sure which implementation to use then try the latest available version of OpenMPI.
Batch MPI
To use MPI you need to use module load to activate a particular MPI implementation (or module load an application that itself loads an MPI implementation behind the scenes).
Here is an example that requests 4 CPU cores and 8GB of RAM on a single node, then runs a program called executable in the current directory using the OpenMPI library (version 4.1.4, built using version 12.2.0 of the GCC compiler).
It is assumed that executable was previously compiled using that exact same MPI library.
#!/bin/bash
# Request one node
#SBATCH --nodes=1
# Request 4 cores per node
#SBATCH --ntasks=4
# Request 8GB of RAM per node
#SBATCH --mem=8G
# Load an MPI library
module load OpenMPI/4.1.4-GCC-12.2.0
# Run a program previously compiled using that specific MPI library
srun --export=ALL ./executable
Unlike shared memory models, MPI requires programs to explicitly send and receive data between tasks. Most applications must be written with MPI support from the outset, so standard serial code won’t benefit from MPI unless rewritten accordingly.

MPI programs usually follow this pattern:
The same executable is launched in several separate processes.
All processes connect to one another through MPI.
Each process is assigned a unique identifier called a “rank”.
Each rank performs a specific portion of the work. Rank 0 often handles I/O and status messages.
The MPI environment is closed after execution ends.
If you’re using MPI modules provided by the system, Slurm communicates rank and task details using a library called PMIx. This may not work seamlessly with external MPI builds.
Building and Executing MPI Programs
Compiling with MPI
Choose an MPI implementation for building your application. Several are available, all conforming to the MPI standard. We suggest using the latest version of OpenMPI for compatibility with the cluster environment. For information on other installed versions, see Parallel Systems on Stanage.
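In practice this means loading an MPI module and compiling with the MPI wrapper compiler, for example (a sketch; my_program.c is a placeholder source file):
module load OpenMPI
mpicc -o my_program my_program.c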
Requesting MPI Resources in Slurm
To allocate resources for an MPI job, the typical format is --nodes=1 --ntasks=n. This ensures all MPI tasks run on a single machine, which is ideal for communication-intensive programs.
When scaling to multiple nodes, use --nodes=N --ntasks-per-node=n. This launches N x n tasks, balancing them across machines. Each task gets 1 CPU by default. To increase this (if your program supports it), see the section on hybrid parallel models.
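A multi-node request in a batch script might look like the following sketch (2 nodes with 4 tasks each and the program name ./my_program are placeholders; load whichever MPI module your program was built with):
#!/bin/bash
# Request 2 nodes with 4 MPI tasks on each (8 tasks in total)
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
# Request 8GB of RAM per node
#SBATCH --mem=8G
# Load the MPI library the program was compiled with
module load OpenMPI/4.1.4-GCC-12.2.0
# Launch one copy of the program per task
srun --export=ALL ./my_program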
Example: MPI Program to Estimate Pi
We’ll use the pi-mpi.c example, which estimates π using a Monte Carlo method and supports multiple MPI tasks.
Start by compiling with the MPI module:
module load hpc-examples
module load OpenMPI
mpicc -o pi-mpi ${HPC_EXAMPLES}/slurm/pi-mpi.c
Run interactively:
srun --nodes=1 --ntasks=2 --time=00:10:00 --mem=500M ./pi-mpi 1000000
Or via a Slurm script (pi-mpi.sh):
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=500M
#SBATCH --output=pi-mpi.out
#SBATCH --nodes=1
#SBATCH --ntasks=2
module load OpenMPI
srun --export=ALL ./pi-mpi 1000000
Submit with:
sbatch pi-mpi.sh
You can inspect the output file using cat:
$ cat pi-mpi.out
node032.pri.stanage.alces.network: This is rank 1 doing 500000 trials
Calculating pi using 1000000 stochastic trials
node032.pri.stanage.alces.network: This is rank 0 doing 500000 trials
Throws: 785491 / 1000000 Pi: 3.141964
Important
Here we didn’t specify an OpenMPI version, so the system default was used. However, for reproducibility and to avoid runtime errors, always load the same version of OpenMPI as you used when compiling the program. Mismatched major versions (e.g. 3.x vs 4.x) can cause MPI initialisation errors or crashes, and even minor differences can affect runtime behaviour.
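For example, if the program was built with the versioned module used earlier on this page, load exactly the same module in the job script (a sketch):
# At compile time:
module load hpc-examples
module load OpenMPI/4.1.4-GCC-12.2.0
mpicc -o pi-mpi ${HPC_EXAMPLES}/slurm/pi-mpi.c
# In the job script, before launching:
module load OpenMPI/4.1.4-GCC-12.2.0
srun --export=ALL ./pi-mpi 1000000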
Special Cases and Tips
Ranks Not Detected
When using MPI libraries outside the system default, ranks may not be recognised automatically. Add the following to your job script if needed:
export SLURM_MPI_TYPE=pmix_v2
Note: We suggest pmix_v2 here for broad compatibility with different MPI builds, but newer MPI libraries may also support pmix_v4.
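In a job script this goes before the srun line, for example (a sketch reusing the placeholder program name from the batch example above):
# Tell Slurm which PMI library to use for rank and task information
export SLURM_MPI_TYPE=pmix_v2
srun --export=ALL ./executable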
Performance Monitoring
The seff script can be used as follows with the job’s ID to give a summary of important job information:
$ seff job-id
For example, on the Stanage cluster:
$ seff 64626
Job ID: 64626
Cluster: stanage.alces.network
User/Group: a_user/clusterusers
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 1
CPU Utilized: 00:02:37
CPU Efficiency: 35.68% of 00:07:20 core-walltime
Job Wall-clock time: 00:03:40
Memory Utilized: 137.64 MB (estimated maximum)
Memory Efficiency: 1.71% of 7.84 GB (3.92 GB/core)
You can also monitor individual job steps by calling seff with the syntax seff job-id.job-step.
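For instance, to inspect the first step of the example job above (assuming it has a step numbered 0):
$ seff 64626.0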
If your CPU usage is consistently low, your code may not be making effective use of the available resources — this could be due to inefficient code or a lack of parallelisation.
If memory usage is far below or above the requested amount, consider adjusting your allocation. It’s generally best to request slightly more RAM than your code typically uses, to avoid job failures while minimising waste.
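For the example job above, which used roughly 138 MB of the 3.92 GB available per core, a much smaller request would still leave comfortable headroom (a sketch; choose a value safely above your own job’s measured usage):
#SBATCH --mem=1G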
Exercises
Exercise 3: Can Your Code Use MPI?
Look at your code’s documentation or output. Keywords that hint at MPI support include:
MPI
mpirun
mpiexec
distributed
rank
MPI Training
Training courses from the national supercomputing centre are available here.
What’s Next?
The next guide introduces GPU parallelism and how to use GPUs on the cluster.
This page contains material adapted from Aalto Scientific Computing Documentation, licensed under CC BY 4.0. Changes were made to the original content.