Install Spark on the cluster: Difference between revisions

From info319
Line 53: Line 53:
  mkdir -p ~/volume/python
  mkdir -p ~/volume/python
  cd ~/volume/python
  cd ~/volume/python
sudo apt install emacs
# edit exercise1.py, first to read form big file, then from hdfs
  sudo apt install python3-pip python3-dev python3-venv
  sudo apt install python3-pip python3-dev python3-venv
  python3 -m venv venv
  python3 -m venv venv
  . venv/bin/activate
  . venv/bin/activate
  python3 -m pip install --upgrade pip
  python3 -m pip install --upgrade pip
  pip install findspark
  pip install tweepy  # kafka-python afinn
# pip install findspark
  # because SPARK_HOME is set, findspark is not needed
  # because SPARK_HOME is set, findspark is not needed
sudo apt install emacs
# edit exercise1.py, first to read form big file, then from hdfs
-->
-->


'''Task:''' Run ''exercise1.py'' from Exercise 1, for example with '''spark-submit exercise1.py''. A few tips:
'''Task:''' Run ''exercise1.py'' from Exercise 1, for example with ''spark-submit exercise1.py''. A few tips:
* Use the large dataset in ''tweets_id_text_100000.jl'' (i.e., '''small_dataset = False''').
* Use the large dataset in ''tweets_id_text_100000.jl'' (i.e., '''small_dataset = False''').
* Because SPARK_HOME is set, you do not need ''findspark''.
* Because SPARK_HOME is set, you do not need ''findspark''.
* WHen you run on top of YARN, Spark expects to find files in HDFS, not in the regular file system.
* When you run on top of YARN, Spark expects to input files from HDFS, not from the regular file system.
* Use the HDFS and YARN web UIs to look a little at what is going on.
* Use the HDFS and YARN web UIs to check what is going on.
* The default replication factors and other settings are not clever. Don't worry about that now: they are enough to get you started.
* The default replication factors and other settings are not clever. Don't worry about that now: they are enough to get you started.


'''Task:''' Run the full Twitter pipeline from Exercise 3.
'''Task:''' Run the full Twitter pipeline from Exercise 3. Tips:
* You need to create virtual environment to download packages, specifically ''tweepy'', ''kafka-python'', and ''afinn''.
* You need to create ''keys/bearer_token'' with restricted access.
* With the default settings, it can take time before the final join is written to console. Be patient...
* Start-up sequence from ''spark-driver'':
# assuming zookeeper and kafka run already
# create kafka topics
kafka-topics.sh --create --topic tweets --bootstrap-server localhost:9092
kafka-topics.sh --create --topic media --bootstrap-server localhost:9092
kafka-topics.sh --create --topic photos --bootstrap-server localhost:9092
# optional monitoring
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic tweets --from-beginning &
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic media --from-beginning &
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic photos --from-beginning &
# main programs
python twitterpipe_tweet_harvester.py  # run this repeatedly or modify to reconnect after sleeping
python twitterpipe_media_harvester.py &  # this one can run in background
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 twitterpipe_tweet_pipeline.py &
Because you no longer use ''findspark'', you need to use the '''--packages''' option with '''spark-submit'''. Make sure you use exactly the same package as you did with ''findspark.add_packages(...)'''.


== Web UIs ==
== Web UIs ==
While spark is running, you can attempt to access Spark's web UI at http://158.39.201.197:4040 (assuming that 158.39.201.197 is the IPv4 address of ''spark-driver''). But when you run on top of YARN it just attempts to redirect to YARN's web UI at http://158.39.201.197:8088 , which you have accessed already.
While spark is running, you can attempt to access Spark's web UI at http://158.39.201.197:4040 (assuming that 158.39.201.197 is the IPv4 address of ''spark-driver''). But when you run on top of YARN it just attempts to redirect to YARN's web UI at http://158.39.201.197:8088 , which you have accessed already.

Revision as of 17:42, 17 October 2022

Install Spark on the cluster

Install Spark

Go to Apache Spark Downloads. Download and unpack a recent Spark binary. For example, on each instance:

cd ~/volume
wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
tar xzvf spark-3.3.0-bin-hadoop3.tgz
rm spark-3.3.0-bin-hadoop3.tgz
ln -fns spark-3.3.0-bin-hadoop3 spark

Set $SPARK_HOME and add Spark binaries to path:

export SPARK_HOME=/home/ubuntu/volume/spark
export PATH=${PATH}:${SPARK_HOME}/bin:${SPARK_HOME}/sbin
cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
echo "export SPARK_HOME=/home/ubuntu/volume/spark" >>  ~/.bashrc
echo "export PATH=\${PATH}:\${SPARK_HOME}/bin:\${SPARK_HOME}/sbin" >>  ~/.bashrc

On your local machine, create the file spark-defaults.conf:

spark.master yarn
spark.driver.memory 512m
spark.yarn.am.memory 512m
spark.executor.memory 512m

From the local machine, upload spark-defaults.conf to each instance:

scp spark-defaults.conf spark-driver:volume/spark/conf
scp spark-defaults.conf spark-worker-1:volume/spark/conf
...

Test run Spark

On spark-driver, start Spark with:

SPARK_PUBLIC_DNS=$MASTER_NODE pyspark

SPARK_PUBLIC_DNS is the public IP address that Spark's web UI listens to at port 4040. To set it permanently:

export SPARK_PUBLIC_DNS=$MASTER_NODE
echo "export SPARK_PUBLIC_DNS=$MASTER_NODE" >> ~/.bashrc

Create a test program, e.g., spark-test.py:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
            .master("local") \
            .appName("Word Count") \
            .config("spark.some.config.option", "some-value") \
            .getOrCreate()
print('*** The result row is', 
        spark.range(5000).where("id > 500").selectExpr("sum(id)").collect(),
        '***')

Test run with:

spark-submit spark-test.py


Task: Run exercise1.py from Exercise 1, for example with spark-submit exercise1.py. A few tips:

  • Use the large dataset in tweets_id_text_100000.jl (i.e., small_dataset = False).
  • Because SPARK_HOME is set, you do not need findspark.
  • When you run on top of YARN, Spark expects to input files from HDFS, not from the regular file system.
  • Use the HDFS and YARN web UIs to check what is going on.
  • The default replication factors and other settings are not clever. Don't worry about that now: they are enough to get you started.

Task: Run the full Twitter pipeline from Exercise 3. Tips:

  • You need to create virtual environment to download packages, specifically tweepy, kafka-python, and afinn.
  • You need to create keys/bearer_token with restricted access.
  • With the default settings, it can take time before the final join is written to console. Be patient...
  • Start-up sequence from spark-driver:
# assuming zookeeper and kafka run already
# create kafka topics
kafka-topics.sh --create --topic tweets --bootstrap-server localhost:9092
kafka-topics.sh --create --topic media --bootstrap-server localhost:9092
kafka-topics.sh --create --topic photos --bootstrap-server localhost:9092
# optional monitoring
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic tweets --from-beginning &
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic media --from-beginning &
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic photos --from-beginning &
# main programs
python twitterpipe_tweet_harvester.py  # run this repeatedly or modify to reconnect after sleeping
python twitterpipe_media_harvester.py &  # this one can run in background
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 twitterpipe_tweet_pipeline.py &

Because you no longer use findspark, you need to use the --packages' option with spark-submit. Make sure you use exactly the same package as you did with findspark.add_packages(...).

Web UIs

While spark is running, you can attempt to access Spark's web UI at http://158.39.201.197:4040 (assuming that 158.39.201.197 is the IPv4 address of spark-driver). But when you run on top of YARN it just attempts to redirect to YARN's web UI at http://158.39.201.197:8088 , which you have accessed already.