Latest revision as of 12:17, 31 October 2022

Install Spark on the cluster

Install Spark

Go to Apache Spark Downloads (https://spark.apache.org/downloads.html). Download and unpack a recent Spark binary. For example, on each instance:

cd ~/volume
wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
tar xzvf spark-3.3.1-bin-hadoop3.tgz
rm spark-3.3.1-bin-hadoop3.tgz
ln -fns spark-3.3.1-bin-hadoop3 spark

Set $SPARK_HOME and add the Spark binaries and scripts to your path:

export SPARK_HOME=/home/ubuntu/volume/spark
export PATH=${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin
cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
echo "export SPARK_HOME=/home/ubuntu/volume/spark" >>  ~/.bashrc
echo "export PATH=\${PATH}:\${SPARK_HOME}/bin:\${SPARK_HOME}/sbin" >>  ~/.bashrc
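
To confirm that the settings took effect in a new shell, a small check along the following lines may help. This is only a sketch: the function name check_spark_env is made up for illustration, and it only inspects the environment variables set above.

```python
import os
import shutil


def check_spark_env(env=None):
    """Return a list of problems found with the Spark environment settings."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]
    problems = []
    path_entries = env.get("PATH", "").split(os.pathsep)
    # Both the bin and the sbin directory were appended to PATH above.
    for sub in ("bin", "sbin"):
        wanted = os.path.join(spark_home, sub)
        if wanted not in path_entries:
            problems.append(f"{wanted} is not on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_spark_env():
        print("WARNING:", problem)
    # shutil.which returns None if spark-submit cannot be found on PATH.
    print("spark-submit found at:", shutil.which("spark-submit"))
```

If nothing is reported and spark-submit is found, the path setup is most likely fine.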

Configure Spark

On your local machine, create the file spark-defaults.conf:

spark.master yarn
spark.driver.memory 512m
spark.yarn.am.memory 512m
spark.executor.memory 512m
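
Each line of spark-defaults.conf is a property name followed by whitespace and a value. To double-check the file before uploading it, you could parse it with a few lines of Python. This is a minimal sketch: parse_spark_defaults is a hypothetical helper, not part of Spark, and EXAMPLE simply repeats the four lines above.

```python
def parse_spark_defaults(text):
    """Parse the text of a spark-defaults.conf file into a dict."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # property name, then value
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


EXAMPLE = """\
spark.master yarn
spark.driver.memory 512m
spark.yarn.am.memory 512m
spark.executor.memory 512m
"""

if __name__ == "__main__":
    for key, value in parse_spark_defaults(EXAMPLE).items():
        print(f"{key} = {value}")
```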

From the local machine, upload spark-defaults.conf to each instance:

scp spark-defaults.conf spark-driver:volume/spark/conf
scp spark-defaults.conf spark-worker-1:volume/spark/conf
...
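
With many workers, the upload can be scripted. The sketch below only prints the scp commands unless it is started with --run. The HOSTS list contains just the two instance names shown above and is an assumption about your setup: extend it with your own worker names.

```python
import subprocess
import sys

# Only the two hosts named above; add your remaining workers here.
HOSTS = ["spark-driver", "spark-worker-1"]


def scp_commands(hosts, conf_file="spark-defaults.conf",
                 remote_dir="volume/spark/conf"):
    """Build one scp command (as an argument list) per host."""
    return [["scp", conf_file, f"{host}:{remote_dir}"] for host in hosts]


if __name__ == "__main__":
    run = "--run" in sys.argv[1:]
    for cmd in scp_commands(HOSTS):
        print(" ".join(cmd))
        if run:
            # check=True stops at the first failed upload.
            subprocess.run(cmd, check=True)
```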

Test run Spark

On spark-driver, make sure HDFS and YARN are running. To be certain that you start the Hadoop services, use the full path $HADOOP_HOME/sbin/start-all.sh, because Spark also defines a start-all.sh script, which is now on your path. Then start Spark with:

SPARK_PUBLIC_DNS=$MASTER_NODE pyspark

SPARK_PUBLIC_DNS is the public hostname or IP address that Spark advertises for its web UI, which listens on port 4040. To set it permanently:

export SPARK_PUBLIC_DNS=$MASTER_NODE
echo "export SPARK_PUBLIC_DNS=$MASTER_NODE" >> ~/.bashrc

Create a test program, e.g., spark-test.py:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
            .master("local") \
            .appName("Word Count") \
            .getOrCreate()
print('*** The result row is', 
        spark.range(5000).where("id > 500").selectExpr("sum(id)").collect(),
        '***')

Test run with:

spark-submit spark-test.py
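
If the run succeeds, the output between the *** markers should be a single row holding the sum of the ids 501 to 4999. As a cross-check that does not need Spark, the same number can be computed in plain Python (the exact formatting of the row printed by Spark may differ):

```python
# spark.range(5000) yields the ids 0..4999, and the filter keeps ids above 500.
expected_sum = sum(i for i in range(5000) if i > 500)
print(expected_sum)  # 12372250
```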

In the program, change the line

            .master("local") \

to

            .master("yarn") \

to test run on the Hadoop cluster.

Task 1: Run exercise1.py from Exercise 1, for example with spark-submit exercise1.py. A few tips:

Use the large dataset in tweets_id_text_100000.jl (i.e., small_dataset = False).
Because SPARK_HOME is set, you do not need findspark.
When you run on top of YARN, Spark expects to read input files from HDFS, not from the regular file system.
Use the HDFS and YARN web UIs to check what is going on.
The default replication factors and other settings in YARN and Spark are not well tuned. Don't worry about that now: they are enough to get you started.
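
Since Spark on YARN reads its input from HDFS, the dataset has to be copied there first. The sketch below builds the two hdfs dfs commands for that and only prints them unless started with --run. The target directory /user/ubuntu/exercise1 is a made-up example, not a path from the exercise, and exercise1.py would then have to read the file from that HDFS location.

```python
import subprocess
import sys

LOCAL_FILE = "tweets_id_text_100000.jl"
HDFS_DIR = "/user/ubuntu/exercise1"  # hypothetical target directory in HDFS


def hdfs_upload_commands(local_file, hdfs_dir):
    """Build the commands that create the HDFS directory and copy the file."""
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        ["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir],  # -f overwrites
    ]


if __name__ == "__main__":
    run = "--run" in sys.argv[1:]
    for cmd in hdfs_upload_commands(LOCAL_FILE, HDFS_DIR):
        print(" ".join(cmd))
        if run:
            subprocess.run(cmd, check=True)
```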

Web UIs

While Spark is running, you can attempt to access Spark's web UI at http://158.39.201.197:4040 (assuming that 158.39.201.197 is the IPv4 address of spark-driver). But when you run on top of YARN, it just attempts to redirect to YARN's web UI at http://158.39.201.197:8088, which you have accessed already.
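
If a web UI does not load in the browser, it can help to test whether the port is reachable at all. The following is a rough sketch using only the Python standard library. The helper name port_is_open is made up, and the host is taken from the command line (defaulting to localhost), so pass the address of spark-driver when running it from your local machine.

```python
import socket
import sys


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    for name, port in (("Spark", 4040), ("YARN", 8088)):
        state = "open" if port_is_open(host, port) else "not reachable"
        print(f"{name} web UI on {host}:{port} is {state}")
```

A port that is not reachable from outside, but open when checked on spark-driver itself, usually points to a firewall or security group rule rather than to Spark or YARN.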

Install Spark on the cluster: Difference between revisions

@@ Line 3: / Line 3: @@
 Go to [https://spark.apache.org/downloads.html Apache Spark Downloads]. Download and unpack a recent Spark binary. For example, on each instance:
   cd ~/volume
-  wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
+  wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
-  tar xzvf spark-3.3.0-bin-hadoop3.tgz
+  tar xzvf spark-3.3.1-bin-hadoop3.tgz
-  rm spark-3.3.0-bin-hadoop3.tgz
+  rm spark-3.3.1-bin-hadoop3.tgz
-  ln -fns spark-3.3.0-bin-hadoop3 spark
+  ln -fns spark-3.3.1-bin-hadoop3 spark
-Set $SPARK_HOME and add Spark binaries to path:
+Set $SPARK_HOME and add the Spark binaries and scripts to your path:
   export SPARK_HOME=/home/ubuntu/volume/spark
   export PATH=${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin
@@ Line 15: / Line 15: @@
   echo "export PATH=\${PATH}:\${SPARK_HOME}/bin:\${SPARK_HOME}/sbin" >>  ~/.bashrc
+== Configure Spark ==
 On your local machine, create the file ''spark-defaults.conf'':
   spark.master yarn
@@ Line 27: / Line 28: @@
 == Test run Spark ==
-On ''spark-driver'', start Spark with:
+On ''spark-driver'', make sure HDFS/YARN are running (to be certain, you can use''$HADOOP_HOME/sbin/start-all.sh'', because Spark also defines a ''start-all.sh'' script...), and start Spark with:
   SPARK_PUBLIC_DNS=$MASTER_NODE pyspark
 SPARK_PUBLIC_DNS is the public IP address that Spark's web UI listens to at port 4040. To set it permanently:
@@ Line 38: / Line 39: @@
               .master("local") \
               .appName("Word Count") \
-             .config("spark.some.config.option", "some-value") \
               .getOrCreate()
   print('*** The result row is',
@@ Line 46: / Line 46: @@
 Test run with:
   spark-submit spark-test.py
+In the program, change the line
+             .master("local") \
+to
+             .master("yarn") \
+to test run on the Hadoop cluster.
 <!--
@@ Line 53: / Line 59: @@
   mkdir -p ~/volume/python
   cd ~/volume/python
- sudo apt install emacs
- # edit exercise1.py, first to read form big file, then from hdfs
   sudo apt install python3-pip python3-dev python3-venv
   python3 -m venv venv
   . venv/bin/activate
   python3 -m pip install --upgrade pip
-  pip install findspark
+  pip install tweepy  # kafka-python afinn
+ # pip install findspark
   # because SPARK_HOME is set, findspark is not needed
-->>
+ sudo apt install emacs
+ # edit exercise1.py, first to read form big file, then from hdfs
+-->
-'''Task:''' Run ''exercise1.py'' from Exercise 1, for example with '''spark-submit exercise1.py''. A few tips:
+'''Task 1:''' Run ''exercise1.py'' from Exercise 1, for example with ''spark-submit exercise1.py''. A few tips:
 * Use the large dataset in ''tweets_id_text_100000.jl'' (i.e., '''small_dataset = False''').
 * Because SPARK_HOME is set, you do not need ''findspark''.
-* WHen you run on top of YARN, Spark expects to find files in HDFS, not in the regular file system.
+* When you run on top of YARN, Spark expects to input files from HDFS, not from the regular file system.
-* Use the HDFS and YARN web UIs to look a little at what is going on.
+* Use the HDFS and YARN web UIs to check what is going on.
-* The default replication factors and other settings are not clever. Don't worry about that now: they are enough to get you started.
+* The default replication factors and other settings in YARN and Spark are not clever. Don't worry about that now: they are enough to get you started.
-'''Task:''' Run the full Twitter pipeline from Exercise 3.
 == Web UIs ==
 While spark is running, you can attempt to access Spark's web UI at http://158.39.201.197:4040 (assuming that 158.39.201.197 is the IPv4 address of ''spark-driver''). But when you run on top of YARN it just attempts to redirect to YARN's web UI at http://158.39.201.197:8088 , which you have accessed already.
