Attention
The ShARC HPC cluster was decommissioned on the 30th of November 2023 at 17:00. It is no longer possible for users to access that cluster.
Spark
Apache Spark is a fast and general engine for large-scale data processing.
Interactive Usage
After connecting to ShARC (see Establishing a SSH connection), start an interactive session with the qrsh or qrshx command.
Before using Spark, you will need to load a version of Java. For example:
module load apps/java/jdk1.8.0_102/binary
To make a version of Spark available, use one of the following commands:
module load apps/spark/2.3.0/jdk-1.8.0_102
module load apps/spark/2.1.0/gcc-4.8.5
You can now start a Spark shell session with
spark-shell
SparkR
To use SparkR, you will additionally need to load a version of R, e.g.:
module load apps/java/jdk1.8.0_102/binary
module load apps/spark/2.3.0/jdk-1.8.0_102
module load apps/R/3.3.2/gcc-4.8.5
Now you can start a SparkR session by running:
sparkR
Setting the number of cores
The installation of Spark on ShARC is limited to jobs that make use of one node. As such, the maximum number of CPU cores you can request for a Spark job is (typically) 16.
First, you must request cores from the scheduler by adding the following line to your batch submission script (here requesting 4 cores):
#$ -pe smp 4
You must also tell Spark to use only 4 cores by setting the MASTER environment variable:
export MASTER=local[4]
A full example using Python is given here.
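For illustration, here is a minimal sketch of such a Python script (the filename my_spark_job.py, the application name and the example computation are all hypothetical). It assumes the job is launched from the submission script with spark-submit my_spark_job.py after the module loads and the export MASTER=local[4] line above, so that the master URL is picked up from the MASTER environment variable:

# my_spark_job.py - hypothetical minimal example
from pyspark import SparkContext

# No master URL is hard-coded here; when launched via spark-submit it
# should be taken from the MASTER environment variable set in the
# submission script (e.g. local[4]).
sc = SparkContext(appName='core-count-example')

print(sc.defaultParallelism)              # should match the number of cores requested
print(sc.parallelize(range(1000)).sum())  # simple check: prints 499500

sc.stop()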
Using pyspark in JupyterHub sessions
Alternative setup instructions are required when using pyspark with conda and Jupyter on ShARC:
A version of Java needs to be loaded
Pyspark needs to be told to write temporary files to a sensible location
Pyspark needs to be told to create an appropriate number of worker processes given the number of CPU cores allocated to the job by the scheduler.
First, ensure you have access to a conda environment containing the ipykernel and pyspark conda packages (see Jupyter on SHARC: preparing your environment).
Next, add the following to a cell at the top of the Notebook you want to use pyspark with:
import os

# Java required by Spark - ensure a version is available:
if 'JAVA_HOME' not in os.environ:
    os.environ['JAVA_HOME'] = "/usr/local/packages/apps/java/jdk1.8.0_102/binary"

# Tell Spark to save temporary files to a sensible place:
if 'TMPDIR' in os.environ:
    os.environ['SPARK_LOCAL_DIRS'] = os.environ['TMPDIR']

from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf()
conf.setAppName('conda-pyspark')

# Create as many Spark processes as allocated CPU cores
# (assuming all cores allocated on one node):
if 'NSLOTS' in os.environ:
    conf.setMaster("local[{}]".format(os.environ['NSLOTS']))

# Finally, create our Spark context
sc = SparkContext(conf=conf)

# Verify how many processes Spark will create/use
print(sc.defaultParallelism)
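As an optional sanity check (an extra step, not required by the setup above), you could run a small computation on the new context in the next cell; it should print 499500:

# Hypothetical quick check: sum the integers 0..999 across the local workers
rdd = sc.parallelize(range(1000))
print(rdd.sum())   # expected output: 499500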
It may be possible to install/use Java using conda but this has not been tested.
Installation notes
These notes are primarily for administrators of the system.
Spark 2.3.0
Install script:
install.sh
Module file apps/spark/2.3.0/jdk-1.8.0_102, which:
sets SPARK_HOME
prepends the Spark bin directory to the PATH
sets MASTER to local[1] (i.e. Spark will default to using 1 core)
Spark 2.1
qrsh -l rmem=10G
module load apps/java/jdk1.8.0_102/binary
tar -xvzf ./spark-2.1.0.tgz
cd spark-2.1.0
./build/mvn -DskipTests clean package
mkdir -p /usr/local/packages/apps/spark/2.1
cd ..
mv spark-2.1.0 /usr/local/packages/apps/spark/2.1
The default install of Spark is incredibly verbose: even a ‘Hello World’ program results in many lines of [INFO] output. To make it a little quieter, the default log4j log level has been reduced from INFO to WARN:
cd /usr/local/packages/apps/spark/2.1/spark-2.1.0/conf/
cp log4j.properties.template log4j.properties
The file log4j.properties was then edited so that the line beginning log4j.rootCategory reads:
log4j.rootCategory=WARN, console
Module file apps/spark/2.1/gcc-4.8.5, which:
sets SPARK_HOME
prepends the Spark bin directory to the PATH
sets MASTER to local[1] (i.e. Spark will default to using 1 core)