Revision as of 11:38, 17 October 2022
= Install Spark on the cluster =

== Install Spark ==
Go to [https://spark.apache.org/downloads.html Apache Spark Downloads]. Download and unpack a recent Spark binary. For example, on each instance:
 cd ~/volume
 wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
 tar xzvf spark-3.3.0-bin-hadoop3.tgz
 rm spark-3.3.0-bin-hadoop3.tgz
 ln -fns spark-3.3.0-bin-hadoop3 spark
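The final ''ln -fns'' gives the install a stable path that survives upgrades: the ''spark'' symlink can later be repointed at a newer unpacked release without changing $SPARK_HOME. A minimal sketch of the pattern using throwaway directories (the 3.4.0 name is only illustrative):

```shell
# Demonstrate the ln -fns upgrade pattern with empty directories
# (spark-3.4.0-bin-hadoop3 is a hypothetical future release).
cd "$(mktemp -d)"
mkdir spark-3.3.0-bin-hadoop3 spark-3.4.0-bin-hadoop3
ln -fns spark-3.3.0-bin-hadoop3 spark   # 'spark' -> 3.3.0
ln -fns spark-3.4.0-bin-hadoop3 spark   # repointed; no paths elsewhere change
readlink spark                          # prints spark-3.4.0-bin-hadoop3
```

The ''-n'' flag is what makes the repointing work: without it, the second ''ln'' would create a link ''inside'' the directory the old symlink points to.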
Set $SPARK_HOME and add the Spark binaries to the PATH:
 export SPARK_HOME=/home/ubuntu/volume/spark
 export PATH=${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin
To make the settings permanent, back up ''~/.bashrc'' and append them:
 cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
 echo "export SPARK_HOME=/home/ubuntu/volume/spark" >> ~/.bashrc
 echo "export PATH=\${PATH}:\${SPARK_HOME}/bin:\${SPARK_HOME}/sbin" >> ~/.bashrc
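To confirm the variables resolved as intended in the current shell, you can list the PATH entries and filter for the Spark directories (paths as set above):

```shell
# Check that both Spark directories ended up on the PATH.
export SPARK_HOME=/home/ubuntu/volume/spark
export PATH=${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin
echo "$PATH" | tr ':' '\n' | grep "^${SPARK_HOME}"
```

This should print both ''…/spark/bin'' and ''…/spark/sbin''; if it prints nothing, the exports did not take effect.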
On your local machine, create the file ''spark-defaults.conf'':
 spark.master           yarn
 spark.driver.memory    512m
 spark.yarn.am.memory   512m
 spark.executor.memory  512m
From the local machine, upload ''spark-defaults.conf'' to each instance:
 scp spark-defaults.conf spark-driver:volume/spark/conf
 scp spark-defaults.conf spark-worker-1:volume/spark/conf
 ...
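With several workers, the uploads are easier to script. A sketch that only prints the commands (the hostnames beyond ''spark-driver'' and ''spark-worker-1'' are assumptions; drop the ''echo'' to actually copy):

```shell
# Dry run: print one scp command per instance.
# Hostnames are assumed to match entries in your ~/.ssh/config.
for host in spark-driver spark-worker-1 spark-worker-2; do
  echo scp spark-defaults.conf "$host":volume/spark/conf
done
```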
== Test run Spark ==
On ''spark-driver'', start Spark with:
 SPARK_PUBLIC_DNS=$MASTER_NODE pyspark
SPARK_PUBLIC_DNS is the public address that Spark advertises for its web UI, served on port 4040. To set it permanently:
 export SPARK_PUBLIC_DNS=$MASTER_NODE
 echo "export SPARK_PUBLIC_DNS=$MASTER_NODE" >> ~/.bashrc
Create a test program, e.g., ''spark-test.py'':
 from pyspark.sql import SparkSession
 # Note: .master("local") overrides "spark.master yarn" from spark-defaults.conf,
 # so this first test runs locally. Remove that line to run on YARN.
 spark = SparkSession.builder \
     .master("local") \
     .appName("Word Count") \
     .config("spark.some.config.option", "some-value") \
     .getOrCreate()
 print('*** The result row is',
       spark.range(5000).where("id > 500").selectExpr("sum(id)").collect(),
       '***')
Test run with:
 spark-submit spark-test.py
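The job sums the ids 501 through 4999, so the number it prints can be sanity-checked with plain shell arithmetic:

```shell
# Expected sum(id) for id in 501..4999:
# sum(0..4999) - sum(0..500) = 4999*5000/2 - 500*501/2
echo $(( 4999*5000/2 - 500*501/2 ))   # 12372250
```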
== Web UIs ==
While Spark is running, you can attempt to access Spark's web UI at http://158.39.201.197:4040 (assuming that 158.39.201.197 is the public IPv4 address of ''spark-driver''). When running on top of YARN, however, that page simply redirects to YARN's web UI at http://158.39.201.197:8088 , which you have already accessed.
