Attention

The ShARC HPC cluster was decommissioned on the 30th of November 2023 at 17:00. It is no longer possible for users to access that cluster.

Spark

Apache Spark is a fast and general engine for large-scale data processing.

Interactive Usage

After connecting to ShARC (see Establishing a SSH connection), start an interactive session with the qrsh or qrshx command.

Before using Spark, you will need to load a version of Java. For example:

module load apps/java/jdk1.8.0_102/binary

To make a version of Spark available, use one of the following commands:

module load apps/spark/2.3.0/jdk-1.8.0_102
module load apps/spark/2.1.0/gcc-4.8.5

You can now start a Spark shell session with

spark-shell

SparkR

To use SparkR, you will additionally need to load a version of R, e.g.:

module load apps/java/jdk1.8.0_102/binary
module load apps/spark/2.3.0/jdk-1.8.0_102
module load apps/R/3.3.2/gcc-4.8.5

Now you can start a SparkR session by running:

sparkR

Setting the number of cores

The installation of Spark on ShARC is limited to jobs that make use of one node. As such, the maximum number of CPU cores you can request for a Spark job is (typically) 16.

First, you must request cores from the scheduler. For example, to request 4 cores, add the following line to your submission script:

#$ -pe smp 4

You must also tell Spark to use only 4 cores by setting the MASTER environment variable:

export MASTER=local[4]
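
If you request cores with -pe smp, the scheduler also sets the NSLOTS environment variable to the number of cores granted (the same variable used in the pyspark example below), so the core count need not be hard-coded. A minimal sketch:

export MASTER="local[$NSLOTS]"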

A full example using Python is given here.
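
For illustration only, a complete non-interactive submission script might look like the sketch below. The application name pi_estimate.py is a placeholder (not part of the documented example), and spark-submit is the standard launcher shipped with Spark; adapt both to your own job:

#!/bin/bash
#$ -pe smp 4    # request 4 CPU cores on a single node

# Load Java and Spark
module load apps/java/jdk1.8.0_102/binary
module load apps/spark/2.3.0/jdk-1.8.0_102

# Tell Spark to use only the cores granted by the scheduler
export MASTER="local[$NSLOTS]"

# Run a (placeholder) Python Spark application;
# --master is passed explicitly as well, in case MASTER is not read
spark-submit --master "local[$NSLOTS]" pi_estimate.py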

Using pyspark in JupyterHub sessions

Alternative setup instructions are required when using pyspark with conda and Jupyter on ShARC:

First, ensure you have access to a conda environment containing the ipykernel and pyspark conda packages (see Jupyter on ShARC: preparing your environment).

Next, add the following to a cell at the top of the notebook in which you want to use pyspark:

import os

# Java required by Spark - ensure a version is available:
if 'JAVA_HOME' not in os.environ:
    os.environ['JAVA_HOME'] = "/usr/local/packages/apps/java/jdk1.8.0_102/binary"

# Tell Spark to save temporary files to a sensible place:
if 'TMPDIR' in os.environ:
    os.environ['SPARK_LOCAL_DIRS'] = os.environ['TMPDIR']

from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()
conf.setAppName('conda-pyspark')

# Create as many Spark processes as allocated CPU cores
# (assuming all cores allocated on one node):
if 'NSLOTS' in os.environ:
    conf.setMaster("local[{}]".format(os.environ['NSLOTS']))

# Finally, create our Spark context
sc = SparkContext(conf=conf)

# Verify how many processes Spark will create/use
print(sc.defaultParallelism)

It may be possible to install and use Java via conda, but this has not been tested.
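
Once the SparkContext exists, a quick way to check that Spark is working is to run a small computation in the next cell, for example:

# Sum the integers 0..999 using the local Spark processes
rdd = sc.parallelize(range(1000))
print(rdd.sum())  # expected output: 499500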

Installation notes

These notes are primarily for administrators of the system.

Spark 2.3.0

  • Install script: install.sh

  • Module file apps/spark/2.3.0/jdk-1.8.0_102, which

    • sets SPARK_HOME

    • prepends the Spark bin directory to the PATH

    • sets MASTER to local[1] (i.e. Spark will default to using 1 core)

Spark 2.1

Spark 2.1.0 was built from source in an interactive session with increased memory:

qrsh -l rmem=10G

module load apps/java/jdk1.8.0_102/binary
tar -xvzf ./spark-2.1.0.tgz
cd spark-2.1.0
./build/mvn -DskipTests clean package

mkdir -p /usr/local/packages/apps/spark/2.1
cd ..
mv spark-2.1.0 /usr/local/packages/apps/spark/2.1

The default install of Spark is extremely verbose: even a ‘Hello World’ program produces many lines of [INFO] output. To make it a little quieter, the default log4j logging level has been reduced from INFO to WARN:

cd /usr/local/packages/apps/spark/2.1/spark-2.1.0/conf/
cp log4j.properties.template log4j.properties

The file log4j.properties was then edited so that the line beginning log4j.rootCategory reads:

log4j.rootCategory=WARN, console

Module file apps/spark/2.1.0/gcc-4.8.5, which

  • sets SPARK_HOME

  • prepends the Spark bin directory to the PATH

  • sets MASTER to local[1] (i.e. Spark will default to using 1 core)