Latest revision as of 12:17, 31 October 2022
Install Spark on the cluster
Install Spark
Go to Apache Spark Downloads. Download and unpack a recent Spark binary. For example, on each instance:
cd ~/volume
wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
tar xzvf spark-3.3.1-bin-hadoop3.tgz
rm spark-3.3.1-bin-hadoop3.tgz
ln -fns spark-3.3.1-bin-hadoop3 spark
Set $SPARK_HOME and add the Spark binaries and scripts to your path:
export SPARK_HOME=/home/ubuntu/volume/spark
export PATH=${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin
cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
echo "export SPARK_HOME=/home/ubuntu/volume/spark" >> ~/.bashrc
echo "export PATH=\${PATH}:\${SPARK_HOME}/bin:\${SPARK_HOME}/sbin" >> ~/.bashrc
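After opening a new shell, it can be worth confirming that the variables took effect. The following is a minimal sketch of such a check, not part of the original setup; the function name check_spark_env is made up for illustration, and it assumes a Linux-style PATH.

```python
import os
from pathlib import Path


def check_spark_env(environ=os.environ):
    """Return a list of problems; an empty list means the setup looks fine."""
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]
    problems = []
    path_dirs = environ.get("PATH", "").split(os.pathsep)
    # The instructions above add both bin/ (pyspark, spark-submit) and sbin/ (scripts).
    for sub in ("bin", "sbin"):
        wanted = str(Path(spark_home) / sub)
        if wanted not in path_dirs:
            problems.append(f"{wanted} is not on PATH")
    return problems


if __name__ == "__main__":
    for line in check_spark_env() or ["SPARK_HOME and PATH look fine"]:
        print(line)
```

Run it on each instance; any line other than the final "look fine" message points at the variable that still needs fixing.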
Configure Spark
On your local machine, create the file spark-defaults.conf:
spark.master yarn
spark.driver.memory 512m
spark.yarn.am.memory 512m
spark.executor.memory 512m
From the local machine, upload spark-defaults.conf to each instance:
scp spark-defaults.conf spark-driver:volume/spark/conf
scp spark-defaults.conf spark-worker-1:volume/spark/conf
...
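If you expect to adjust the memory settings later, it may be convenient to generate the file from one place instead of editing it by hand. The sketch below is only an illustration: the helper render_spark_defaults is hypothetical, and the values are the ones listed above. It relies on the fact that spark-defaults.conf separates each key from its value by whitespace.

```python
# Settings copied from the configuration shown above.
SETTINGS = {
    "spark.master": "yarn",
    "spark.driver.memory": "512m",
    "spark.yarn.am.memory": "512m",
    "spark.executor.memory": "512m",
}


def render_spark_defaults(settings):
    """Render one 'key value' line per setting, with the values aligned."""
    width = max(len(key) for key in settings)
    return "".join(f"{key.ljust(width)}  {value}\n" for key, value in settings.items())


if __name__ == "__main__":
    # Print to stdout; redirect into spark-defaults.conf to create the file.
    print(render_spark_defaults(SETTINGS), end="")
```

Saved under a name of your choice, for example make_conf.py, it could be used as python3 make_conf.py > spark-defaults.conf before uploading the file.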
Test run Spark
On spark-driver, make sure HDFS and YARN are running. To be certain that you start the Hadoop daemons, use the full path $HADOOP_HOME/sbin/start-all.sh, because Spark also defines a start-all.sh script and the bare name is ambiguous. Then start Spark with:
SPARK_PUBLIC_DNS=$MASTER_NODE pyspark
SPARK_PUBLIC_DNS is the public address (here, the IP address of the master node) that Spark advertises for its web UI, which listens on port 4040. To set it permanently:
export SPARK_PUBLIC_DNS=$MASTER_NODE
echo "export SPARK_PUBLIC_DNS=$MASTER_NODE" >> ~/.bashrc
Create a test program, e.g., spark-test.py:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master("local") \
.appName("Word Count") \
.getOrCreate()
print('*** The result row is',
spark.range(5000).where("id > 500").selectExpr("sum(id)").collect(),
'***')
Test run with:
spark-submit spark-test.py
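To judge whether the test run worked, it helps to know which number to look for. The test program sums all ids from 0 to 4999 that are greater than 500, and the same sum can be cross-checked in plain Python without Spark:

```python
# Same computation as spark.range(5000).where("id > 500").selectExpr("sum(id)"),
# done locally: add up every id in 0..4999 that is greater than 500.
expected = sum(i for i in range(5000) if i > 500)
print(expected)  # 12372250
```

The row printed between the *** markers should contain this value.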
In the program, change the line
.master("local") \
to
.master("yarn") \
to test run on the Hadoop cluster.
Task 1: Run exercise1.py from Exercise 1, for example with spark-submit exercise1.py. A few tips:
- Use the large dataset in tweets_id_text_100000.jl (i.e., small_dataset = False).
- Because SPARK_HOME is set, you do not need findspark.
- When you run on top of YARN, Spark expects input files to be in HDFS, not on the regular file system.
- Use the HDFS and YARN web UIs to check what is going on.
- The default replication factors and other settings in YARN and Spark are far from optimal. Don't worry about that now: they are enough to get you started.
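Because the input has to be in HDFS when running on YARN, the dataset must be uploaded first. The following is a minimal sketch of that step, assuming the hdfs command is on your path. The target directory /user/ubuntu/data is a hypothetical choice, and hdfs_upload_commands is an illustrative helper, not part of the exercise.

```python
import shlex
import subprocess
import sys


def hdfs_upload_commands(local_file, hdfs_dir):
    """Build the two hdfs commands: create the directory, then copy the file."""
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        ["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir],  # -f overwrites an old copy
    ]


if __name__ == "__main__":
    # Hypothetical HDFS directory; adjust it to your own layout.
    commands = hdfs_upload_commands("tweets_id_text_100000.jl", "/user/ubuntu/data")
    for command in commands:
        print(shlex.join(command))
        if "--run" in sys.argv:  # only execute when explicitly asked to
            subprocess.run(command, check=True)
```

Without arguments the script only prints the commands; with --run it executes them. The file name in exercise1.py would then need to refer to the HDFS location rather than to the local file.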
Web UIs
While Spark is running, you can attempt to access Spark's web UI at http://158.39.201.197:4040 (assuming that 158.39.201.197 is the IPv4 address of spark-driver). But when you run on top of YARN, it just attempts to redirect to YARN's web UI at http://158.39.201.197:8088, which you have accessed already.
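If a web UI does not load in the browser, a quick way to tell a closed port from a browser or redirect problem is to test the TCP connection directly. This is a minimal sketch using only the standard library; port_is_open is an illustrative name, and the host value is a placeholder to replace with the address of spark-driver.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or host not resolvable
        return False


if __name__ == "__main__":
    host = "localhost"  # placeholder: replace with the address of spark-driver
    for name, port in (("Spark web UI", 4040), ("YARN web UI", 8088)):
        state = "open" if port_is_open(host, port) else "not reachable"
        print(f"{name} on {host}:{port} is {state}")
```

Port 4040 is only expected to be open while a Spark application is running.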
