Revision as of 17:42, 17 October 2022

Install Spark on the cluster

Install Spark

Go to Apache Spark Downloads. Download and unpack a recent Spark binary. For example, on each instance:

cd ~/volume
wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
tar xzvf spark-3.3.0-bin-hadoop3.tgz
rm spark-3.3.0-bin-hadoop3.tgz
ln -fns spark-3.3.0-bin-hadoop3 spark

Set $SPARK_HOME and add Spark binaries to path:

export SPARK_HOME=/home/ubuntu/volume/spark
export PATH=${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin
cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
echo "export SPARK_HOME=/home/ubuntu/volume/spark" >>  ~/.bashrc
echo "export PATH=\${PATH}:\${SPARK_HOME}/bin:\${SPARK_HOME}/sbin" >>  ~/.bashrc

On your local machine, create the file spark-defaults.conf:

spark.master yarn
spark.driver.memory 512m
spark.yarn.am.memory 512m
spark.executor.memory 512m

From the local machine, upload spark-defaults.conf to each instance:

scp spark-defaults.conf spark-driver:volume/spark/conf
scp spark-defaults.conf spark-worker-1:volume/spark/conf
...

Test run Spark

On spark-driver, start Spark with:

SPARK_PUBLIC_DNS=$MASTER_NODE pyspark

SPARK_PUBLIC_DNS is the public IP address that Spark's web UI listens to at port 4040. To set it permanently:

export SPARK_PUBLIC_DNS=$MASTER_NODE
echo "export SPARK_PUBLIC_DNS=$MASTER_NODE" >> ~/.bashrc

Create a test program, e.g., spark-test.py:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
            .master("local") \
            .appName("Word Count") \
            .config("spark.some.config.option", "some-value") \
            .getOrCreate()
print('*** The result row is', 
        spark.range(5000).where("id > 500").selectExpr("sum(id)").collect(),
        '***')

Test run with:

spark-submit spark-test.py

Task: Run exercise1.py from Exercise 1, for example with spark-submit exercise1.py. A few tips:

Use the large dataset in tweets_id_text_100000.jl (i.e., small_dataset = False).
Because SPARK_HOME is set, you do not need findspark.
When you run on top of YARN, Spark expects to input files from HDFS, not from the regular file system.
Use the HDFS and YARN web UIs to check what is going on.
The default replication factors and other settings are not clever. Don't worry about that now: they are enough to get you started.

Task: Run the full Twitter pipeline from Exercise 3. Tips:

You need to create virtual environment to download packages, specifically tweepy, kafka-python, and afinn.
You need to create keys/bearer_token with restricted access.
With the default settings, it can take time before the final join is written to console. Be patient...
Start-up sequence from spark-driver:

# assuming zookeeper and kafka run already
# create kafka topics
kafka-topics.sh --create --topic tweets --bootstrap-server localhost:9092
kafka-topics.sh --create --topic media --bootstrap-server localhost:9092
kafka-topics.sh --create --topic photos --bootstrap-server localhost:9092
# optional monitoring
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic tweets --from-beginning &
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic media --from-beginning &
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic photos --from-beginning &
# main programs
python twitterpipe_tweet_harvester.py  # run this repeatedly or modify to reconnect after sleeping
python twitterpipe_media_harvester.py &  # this one can run in background
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 twitterpipe_tweet_pipeline.py &

Because you no longer use findspark, you need to use the --packages' option with spark-submit. Make sure you use exactly the same package as you did with findspark.add_packages(...).

Web UIs

While spark is running, you can attempt to access Spark's web UI at http://158.39.201.197:4040 (assuming that 158.39.201.197 is the IPv4 address of spark-driver). But when you run on top of YARN it just attempts to redirect to YARN's web UI at http://158.39.201.197:8088 , which you have accessed already.

@@ Line 53: / Line 53: @@
   mkdir -p ~/volume/python
   cd ~/volume/python
- sudo apt install emacs
- # edit exercise1.py, first to read form big file, then from hdfs
   sudo apt install python3-pip python3-dev python3-venv
   python3 -m venv venv
   . venv/bin/activate
   python3 -m pip install --upgrade pip
-  pip install findspark
+  pip install tweepy  # kafka-python afinn
+ # pip install findspark
   # because SPARK_HOME is set, findspark is not needed
+ sudo apt install emacs
+ # edit exercise1.py, first to read form big file, then from hdfs
 -->
-'''Task:''' Run ''exercise1.py'' from Exercise 1, for example with '''spark-submit exercise1.py''. A few tips:
+'''Task:''' Run ''exercise1.py'' from Exercise 1, for example with ''spark-submit exercise1.py''. A few tips:
 * Use the large dataset in ''tweets_id_text_100000.jl'' (i.e., '''small_dataset = False''').
 * Because SPARK_HOME is set, you do not need ''findspark''.
-* WHen you run on top of YARN, Spark expects to find files in HDFS, not in the regular file system.
+* When you run on top of YARN, Spark expects to input files from HDFS, not from the regular file system.
-* Use the HDFS and YARN web UIs to look a little at what is going on.
+* Use the HDFS and YARN web UIs to check what is going on.
 * The default replication factors and other settings are not clever. Don't worry about that now: they are enough to get you started.
-'''Task:''' Run the full Twitter pipeline from Exercise 3.
+'''Task:''' Run the full Twitter pipeline from Exercise 3. Tips:
+* You need to create virtual environment to download packages, specifically ''tweepy'', ''kafka-python'', and ''afinn''.
+* You need to create ''keys/bearer_token'' with restricted access.
+* With the default settings, it can take time before the final join is written to console. Be patient...
+* Start-up sequence from ''spark-driver'':
+ # assuming zookeeper and kafka run already
+ # create kafka topics
+ kafka-topics.sh --create --topic tweets --bootstrap-server localhost:9092
+ kafka-topics.sh --create --topic media --bootstrap-server localhost:9092
+ kafka-topics.sh --create --topic photos --bootstrap-server localhost:9092
+ # optional monitoring
+ kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic tweets --from-beginning &
+ kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic media --from-beginning &
+ kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic photos --from-beginning &
+ # main programs
+ python twitterpipe_tweet_harvester.py  # run this repeatedly or modify to reconnect after sleeping
+ python twitterpipe_media_harvester.py &  # this one can run in background
+ spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 twitterpipe_tweet_pipeline.py &
+Because you no longer use ''findspark'', you need to use the '''--packages''' option with '''spark-submit'''. Make sure you use exactly the same package as you did with ''findspark.add_packages(...)'''.
 == Web UIs ==
 While spark is running, you can attempt to access Spark's web UI at http://158.39.201.197:4040 (assuming that 158.39.201.197 is the IPv4 address of ''spark-driver''). But when you run on top of YARN it just attempts to redirect to YARN's web UI at http://158.39.201.197:8088 , which you have accessed already.

Anonymous

Search

Install Spark on the cluster: Difference between revisions

Namespaces

More

Page actions

Revision as of 17:42, 17 October 2022

Contents

Install Spark on the cluster

Install Spark

Test run Spark

Web UIs

Navigation

Pages

Navigation

Wiki tools

Wiki tools

Anonymous

Search

Install Spark on the cluster: Difference between revisions

Revision as of 17:42, 17 October 2022

Install Spark on the cluster

Install Spark

Test run Spark

Web UIs

Navigation

Wiki tools

Page tools