Revision as of 15:46, 25 October 2022

Install Spark on the cluster

Install Spark

Go to Apache Spark Downloads. Download and unpack a recent Spark binary. For example, on each instance:

cd ~/volume
wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
tar xzvf spark-3.3.0-bin-hadoop3.tgz
rm spark-3.3.0-bin-hadoop3.tgz
ln -fns spark-3.3.0-bin-hadoop3 spark

Set $SPARK_HOME and add Spark binaries to path:

export SPARK_HOME=/home/ubuntu/volume/spark
export PATH=${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin
cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
echo "export SPARK_HOME=/home/ubuntu/volume/spark" >>  ~/.bashrc
echo "export PATH=\${PATH}:\${SPARK_HOME}/bin:\${SPARK_HOME}/sbin" >>  ~/.bashrc

On your local machine, create the file spark-defaults.conf:

spark.master yarn
spark.driver.memory 512m
spark.yarn.am.memory 512m
spark.executor.memory 512m

From the local machine, upload spark-defaults.conf to each instance:

scp spark-defaults.conf spark-driver:volume/spark/conf
scp spark-defaults.conf spark-worker-1:volume/spark/conf
...

Test run Spark

On spark-driver, make sure HDFS/YARN are running (start-all.sh), and start Spark with:

SPARK_PUBLIC_DNS=$MASTER_NODE pyspark

SPARK_PUBLIC_DNS is the public IP address that Spark's web UI listens to at port 4040. To set it permanently:

export SPARK_PUBLIC_DNS=$MASTER_NODE
echo "export SPARK_PUBLIC_DNS=$MASTER_NODE" >> ~/.bashrc

Create a test program, e.g., spark-test.py:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
            .master("local") \
            .appName("Word Count") \
            .config("spark.some.config.option", "some-value") \
            .getOrCreate()
print('*** The result row is', 
        spark.range(5000).where("id > 500").selectExpr("sum(id)").collect(),
        '***')

Test run with:

spark-submit spark-test.py

Task 1: Run exercise1.py from Exercise 1, for example with spark-submit exercise1.py. A few tips:

Use the large dataset in tweets_id_text_100000.jl (i.e., small_dataset = False).
Because SPARK_HOME is set, you do not need findspark.
When you run on top of YARN, Spark expects to input files from HDFS, not from the regular file system.
Use the HDFS and YARN web UIs to check what is going on.
The default replication factors and other settings are not clever. Don't worry about that now: they are enough to get you started.

Task 2: Run the full Twitter pipeline from Exercise 3. Tips:

You need to create virtual environment to download packages, specifically tweepy, kafka-python, and afinn.
You need to create keys/bearer_token with restricted access.
With the default settings, it can take time before the final join is written to console. Be patient...
Start-up sequence from spark-driver:

# assuming zookeeper and kafka run already
# create kafka topics
kafka-topics.sh --create --topic tweets --bootstrap-server localhost:9092
kafka-topics.sh --create --topic media --bootstrap-server localhost:9092
kafka-topics.sh --create --topic photos --bootstrap-server localhost:9092
# optional monitoring
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic tweets --from-beginning &
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic media --from-beginning &
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic photos --from-beginning &
# main programs
python twitterpipe_tweet_harvester.py  # run this repeatedly or modify to reconnect after sleeping
python twitterpipe_media_harvester.py &  # this one can run in background
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 twitterpipe_tweet_pipeline.py &

Because you no longer use findspark, you need to use the --packages option with spark-submit. Make sure you use exactly the same package as you did with findspark.add_packages(...).

Web UIs

While spark is running, you can attempt to access Spark's web UI at http://158.39.201.197:4040 (assuming that 158.39.201.197 is the IPv4 address of spark-driver). But when you run on top of YARN it just attempts to redirect to YARN's web UI at http://158.39.201.197:8088 , which you have accessed already.

@@ Line 27: / Line 27: @@
 == Test run Spark ==
-On ''spark-driver'', start Spark with:
+On ''spark-driver'', make sure HDFS/YARN are running ('''start-all.sh'''), and start Spark with:
   SPARK_PUBLIC_DNS=$MASTER_NODE pyspark
 SPARK_PUBLIC_DNS is the public IP address that Spark's web UI listens to at port 4040. To set it permanently:

Anonymous

Search

Install Spark on the cluster: Difference between revisions

Namespaces

More

Page actions

Sinoa (talk | contribs)

Revision as of 15:46, 25 October 2022

Contents

Install Spark on the cluster

Install Spark

Test run Spark

Web UIs

Navigation

Pages

Navigation

Wiki tools

Wiki tools

Anonymous

Search

Install Spark on the cluster: Difference between revisions

Sinoa (talk | contribs)

Revision as of 15:46, 25 October 2022

Install Spark on the cluster

Install Spark

Test run Spark

Web UIs

Navigation

Wiki tools

Page tools