Install Spark on the cluster
Install Spark
Go to Apache Spark Downloads. Download and unpack a recent Spark binary. For example, on each instance:
cd ~/volume
wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
tar xzvf spark-3.3.1-bin-hadoop3.tgz
rm spark-3.3.1-bin-hadoop3.tgz
ln -fns spark-3.3.1-bin-hadoop3 spark
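If you want to verify the archive before deleting it, Apache publishes a SHA-512 checksum next to each release. A minimal check, assuming the checksum file follows the standard Apache download layout:

# Run this before the rm step above, while the .tgz is still present
wget https://downloads.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz.sha512
cat spark-3.3.1-bin-hadoop3.tgz.sha512
sha512sum spark-3.3.1-bin-hadoop3.tgz

Compare the two checksums by eye (the published file's format varies between releases, so automatic checking with sha512sum -c may or may not work).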
Set $SPARK_HOME and add the Spark binaries and scripts to your PATH:
export SPARK_HOME=/home/ubuntu/volume/spark
export PATH=${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin
cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
echo "export SPARK_HOME=/home/ubuntu/volume/spark" >> ~/.bashrc
echo "export PATH=\${PATH}:\${SPARK_HOME}/bin:\${SPARK_HOME}/sbin" >> ~/.bashrc
Configure Spark
On your local machine, create the file spark-defaults.conf:
spark.master yarn
spark.driver.memory 512m
spark.yarn.am.memory 512m
spark.executor.memory 512m
From the local machine, upload spark-defaults.conf to each instance:
scp spark-defaults.conf spark-driver:volume/spark/conf
scp spark-defaults.conf spark-worker-1:volume/spark/conf
...
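With several workers, a small loop saves typing. The host list below is hypothetical and follows the naming scheme above; adjust it to your cluster:

# Hypothetical host list: replace with your actual instance names
for host in spark-driver spark-worker-1 spark-worker-2; do
    scp spark-defaults.conf "$host":volume/spark/conf
done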
Test run Spark
On spark-driver, make sure HDFS and YARN are running. To be certain you start the right services, use the full path $HADOOP_HOME/sbin/start-all.sh, because Spark also ships a start-all.sh script of its own. Then start Spark with:
SPARK_PUBLIC_DNS=$MASTER_NODE pyspark
SPARK_PUBLIC_DNS is the public address that Spark advertises for its web UI, which listens on port 4040. To set it permanently:
export SPARK_PUBLIC_DNS=$MASTER_NODE
echo "export SPARK_PUBLIC_DNS=$MASTER_NODE" >> ~/.bashrc
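If pyspark hangs or fails to connect, confirm that the Hadoop daemons are actually up before debugging Spark itself:

jps              # on spark-driver: should list NameNode and ResourceManager, among others
yarn node -list  # should list one RUNNING NodeManager per worker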
Create a test program, e.g., spark-test.py:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .getOrCreate()

print('*** The result row is',
      spark.range(5000).where("id > 500").selectExpr("sum(id)").collect(),
      '***')
Test run with:
spark-submit spark-test.py
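If everything works, the program prints a single row containing the sum of the ids 501 through 4999, i.e. 12372250, somewhere in between Spark's log output.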
In the program, change the line
.master("local") \
to
.master("yarn") \
to test run on the Hadoop cluster.
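Editing the program for every run gets tedious. If you instead remove the .master(...) line from the builder chain, you can pick the master on the command line (note that a master hard-coded in the program takes precedence over this flag):

spark-submit --master local spark-test.py
spark-submit --master yarn spark-test.py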
Task 1: Run exercise1.py from Exercise 1, for example with spark-submit exercise1.py. A few tips:
- Use the large dataset in tweets_id_text_100000.jl (i.e., small_dataset = False).
- Because SPARK_HOME is set, you do not need findspark.
- When you run on top of YARN, Spark reads its input files from HDFS, not from the regular file system, so you must upload the dataset first (see the sketch after this list).
- Use the HDFS and YARN web UIs to check what is going on.
- The default replication factors and other settings in YARN and Spark are not well tuned for a small cluster like this. Don't worry about that now: they are enough to get you started.
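For the HDFS point above, the dataset must be uploaded before Spark on YARN can read it. A minimal sketch, assuming the conventional HDFS home directory /user/ubuntu for the ubuntu login:

hdfs dfs -mkdir -p /user/ubuntu
hdfs dfs -put tweets_id_text_100000.jl /user/ubuntu/
hdfs dfs -ls /user/ubuntu

In the program, refer to the file as hdfs:///user/ubuntu/tweets_id_text_100000.jl, or use just the file name, which resolves relative to your HDFS home directory.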
Web UIs
While Spark is running, you can attempt to access Spark's web UI at http://158.39.201.197:4040 (assuming that 158.39.201.197 is the IPv4 address of spark-driver). However, when you run on top of YARN, this address just redirects to YARN's web UI at http://158.39.201.197:8088, which you have accessed already.
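If the cluster's firewall blocks these ports, an SSH tunnel is one way to reach the UIs from your local machine (assuming the ubuntu login used above):

# Forward local port 4040 to the Spark UI on spark-driver
ssh -N -L 4040:localhost:4040 ubuntu@158.39.201.197

After that, the Spark UI is available at http://localhost:4040 on your local machine.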
