Revision as of 11:38, 17 October 2022

= Install Spark on the cluster =

== Install Spark ==

Go to [https://spark.apache.org/downloads.html Apache Spark Downloads]. Download and unpack a recent Spark binary. For example, on each instance:

 cd ~/volume
 wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
 tar xzvf spark-3.3.0-bin-hadoop3.tgz
 rm spark-3.3.0-bin-hadoop3.tgz
 ln -fns spark-3.3.0-bin-hadoop3 spark
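As a sketch, the version number can be factored into one variable, so a future upgrade only changes one line while the ''spark'' symlink keeps paths stable (this just reassembles the URL from the wget step above):

```shell
# Parameterize the Spark version; the symlink above keeps $SPARK_HOME stable
# across upgrades.
SPARK_VERSION=3.3.0
SPARK_PKG="spark-${SPARK_VERSION}-bin-hadoop3"
SPARK_URL="https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/${SPARK_PKG}.tgz"
echo "$SPARK_URL"
```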

Set $SPARK_HOME and add the Spark binaries to the path:

 export SPARK_HOME=/home/ubuntu/volume/spark
 export PATH=${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin
 cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
 echo "export SPARK_HOME=/home/ubuntu/volume/spark" >>  ~/.bashrc
 echo "export PATH=\${PATH}:\${SPARK_HOME}/bin:\${SPARK_HOME}/sbin" >>  ~/.bashrc
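Note that re-running the echo lines above appends duplicate entries to ''~/.bashrc''. A minimal alternative sketch is to append each line only when it is missing:

```shell
# Append each export to ~/.bashrc only if it is not already there,
# so re-running the setup stays idempotent.
BASHRC="$HOME/.bashrc"
add_line() {
  grep -qxF "$1" "$BASHRC" 2>/dev/null || echo "$1" >> "$BASHRC"
}
add_line 'export SPARK_HOME=/home/ubuntu/volume/spark'
add_line 'export PATH=${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin'
```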

On your local machine, create the file ''spark-defaults.conf'':

 spark.master yarn
 spark.driver.memory 512m
 spark.yarn.am.memory 512m
 spark.executor.memory 512m
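The file can also be created in one step, for example with a heredoc (run this in whatever local directory you will upload from):

```shell
# Write the four Spark defaults in one go; quoting 'EOF' prevents
# shell expansion inside the file.
cat > spark-defaults.conf <<'EOF'
spark.master yarn
spark.driver.memory 512m
spark.yarn.am.memory 512m
spark.executor.memory 512m
EOF
```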

From the local machine, upload ''spark-defaults.conf'' to each instance:

 scp spark-defaults.conf spark-driver:volume/spark/conf
 scp spark-defaults.conf spark-worker-1:volume/spark/conf
 ...
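Instead of repeating the scp line per instance, a loop can generate the commands. The host list below is only an example (the full worker list above is elided); drop the echo to actually run the copies:

```shell
# Print one scp command per instance; the host names are hypothetical
# examples -- substitute your own driver and worker names.
for host in spark-driver spark-worker-1 spark-worker-2; do
  echo scp spark-defaults.conf "${host}:volume/spark/conf"
done
```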

== Test run Spark ==

On ''spark-driver'', start Spark with:

 SPARK_PUBLIC_DNS=$MASTER_NODE pyspark

SPARK_PUBLIC_DNS is the public address that Spark advertises for its web UI on port 4040. To set it permanently:

 export SPARK_PUBLIC_DNS=$MASTER_NODE
 echo "export SPARK_PUBLIC_DNS=$MASTER_NODE" >> ~/.bashrc

Create a test program, e.g., ''spark-test.py'':

 from pyspark.sql import SparkSession
 # Note: .master("local") overrides the "spark.master yarn" setting from
 # spark-defaults.conf; remove that line to run the test on YARN instead.
 # "spark.some.config.option" is just a placeholder for further settings.
 spark = SparkSession.builder \
             .master("local") \
             .appName("Word Count") \
             .config("spark.some.config.option", "some-value") \
             .getOrCreate()
 print('*** The result row is',
         spark.range(5000).where("id > 500").selectExpr("sum(id)").collect(),
         '***')
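The job sums the ids 501 through 4999, so the value it should print can be sanity-checked with plain Python arithmetic before involving Spark at all:

```shell
# Same sum as the Spark job above, computed locally.
python3 -c 'print(sum(i for i in range(5000) if i > 500))'
# prints 12372250
```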

Test run with:

 spark-submit spark-test.py

== Web UIs ==

While Spark is running, you can try to access Spark's web UI at http://158.39.201.197:4040 (assuming that 158.39.201.197 is the IPv4 address of ''spark-driver''). But when you run on top of YARN, this address just redirects to YARN's web UI at http://158.39.201.197:8088, which you have already accessed.