Install Spark on the cluster

Install Spark

Go to Apache Spark Downloads (https://spark.apache.org/downloads.html). Download and unpack a recent Spark binary. For example, on each instance:

cd ~/volume
wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
tar xzvf spark-3.3.1-bin-hadoop3.tgz
rm spark-3.3.1-bin-hadoop3.tgz
ln -fns spark-3.3.1-bin-hadoop3 spark
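
To check that the unpacked distribution works, you can ask it for its version (this uses the symlink created above):

~/volume/spark/bin/spark-submit --version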

Set $SPARK_HOME and add the Spark binaries and scripts to your path:

export SPARK_HOME=/home/ubuntu/volume/spark
export PATH=${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin
cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
echo "export SPARK_HOME=/home/ubuntu/volume/spark" >>  ~/.bashrc
echo "export PATH=\${PATH}:\${SPARK_HOME}/bin:\${SPARK_HOME}/sbin" >>  ~/.bashrc

Configure Spark

On your local machine, create the file spark-defaults.conf:

spark.master yarn
spark.driver.memory 512m
spark.yarn.am.memory 512m
spark.executor.memory 512m
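
The three memory settings keep the driver, the YARN application master, and the executors small enough for modest instances. Once a session is running, you can confirm which values were actually picked up; a minimal check from inside the pyspark shell (where spark is the predefined session object):

# List the memory-related settings the running session resolved.
[kv for kv in spark.sparkContext.getConf().getAll() if "memory" in kv[0]]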

From the local machine, upload spark-defaults.conf to each instance:

scp spark-defaults.conf spark-driver:volume/spark/conf
scp spark-defaults.conf spark-worker-1:volume/spark/conf
...
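
With several workers, a small loop saves typing; a sketch assuming the aliases spark-driver, spark-worker-1 and spark-worker-2 are defined in your ~/.ssh/config (adjust the list to your cluster):

# Copy the configuration file to every instance in one go.
for host in spark-driver spark-worker-1 spark-worker-2; do
    scp spark-defaults.conf $host:volume/spark/conf
done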

Test run Spark

On spark-driver, make sure HDFS and YARN are running (to be certain, use the full path $HADOOP_HOME/sbin/start-all.sh, because Spark also defines a start-all.sh script of its own), and start Spark with:

SPARK_PUBLIC_DNS=$MASTER_NODE pyspark

SPARK_PUBLIC_DNS is the public IP address on which Spark's web UI listens at port 4040. To set it permanently:

export SPARK_PUBLIC_DNS=$MASTER_NODE
echo "export SPARK_PUBLIC_DNS=$MASTER_NODE" >> ~/.bashrc

Create a test program, e.g., spark-test.py:

from pyspark.sql import SparkSession
# Build a local session first; switching to YARN is shown below.
spark = SparkSession.builder \
            .master("local") \
            .appName("Word Count") \
            .getOrCreate()
# Sum the ids greater than 500 in the range 0..4999 and print the result row.
print('*** The result row is',
        spark.range(5000).where("id > 500").selectExpr("sum(id)").collect(),
        '***')

Test run with:

spark-submit spark-test.py

In the program, change the line

            .master("local") \

to

            .master("yarn") \

to test run on the Hadoop cluster.
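
Since spark-defaults.conf already sets spark.master yarn, another option is to drop the .master(...) call entirely: a master set in code overrides the configuration file, so omitting it lets the file decide. A minimal variant:

from pyspark.sql import SparkSession
# No .master(...): the value from spark-defaults.conf ("yarn") applies.
spark = SparkSession.builder \
            .appName("Word Count") \
            .getOrCreate()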


Task 1: Run exercise1.py from Exercise 1, for example with spark-submit exercise1.py. A few tips:

  • Use the large dataset in tweets_id_text_100000.jl (i.e., small_dataset = False).
  • Because SPARK_HOME is set, you do not need findspark.
  • When you run on top of YARN, Spark expects its input files in HDFS, not in the regular file system (see the upload sketch after this list).
  • Use the HDFS and YARN web UIs to check what is going on.
  • The default replication factors and other settings in YARN and Spark are not ideal. Don't worry about that for now: they are enough to get you started.
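
As noted in the tips, input files must be in HDFS when running on top of YARN. A sketch of uploading the dataset first (the /user/ubuntu home directory is an assumption; adjust to your HDFS layout):

hdfs dfs -mkdir -p /user/ubuntu
hdfs dfs -put tweets_id_text_100000.jl /user/ubuntu/
hdfs dfs -ls /user/ubuntu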

Web UIs

While Spark is running, you can try to access Spark's web UI at http://158.39.201.197:4040 (assuming that 158.39.201.197 is the IPv4 address of spark-driver). But when you run on top of YARN, it just redirects to YARN's web UI at http://158.39.201.197:8088, which you have accessed already.
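
If port 4040 is not open in your security group, an SSH tunnel is one way to reach the UI; a sketch, assuming you can ssh to the instance as ubuntu:

ssh -L 4040:localhost:4040 ubuntu@158.39.201.197

Then browse to http://localhost:4040 on your local machine.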