Getting started with Apache Spark
Purpose
- Getting up and running with Apache Spark
- Getting experience with non-trivial Linux installation
- Using VS Code (or another IDE of your choice)
- Writing and running your own first Spark program
For a general introduction, see the slides for Session 1 on Apache Spark. There is also a useful Apache Spark tutorial at TutorialsPoint.
Preparations
In the first exercise, you will run Spark standalone on your own computer, both in a console/terminal window and in your favourite IDE (Integrated Development Environment). VS Code (Visual Studio Code) is recommended and will be used in these instructions.
Run the pyspark shell from a console
Open a console (or terminal) window. I will use a Linux console in the examples. (If you are on a Windows computer, this is a good time to install WSL2 (Windows Subsystem for Linux) and Windows Terminal, and to use WSL as the terminal in your IDE, but it is not a priority right now.)
Create a Python environment using venv with pip, pipenv, or Conda. I will use venv with pip in the examples, since it is simple and transparent.
You need to have python3 and pip on your machine. (In addition, Spark needs a Java Runtime Environment somewhere on your PATH.)
sudo apt install python3 python3-dev python3-pip
mkdir info319-exercises
cd info319-exercises
which python3
python3 --version
python3 -m venv venv
I have used both Python 3.8 and 3.10, but other recent versions should be fine. (The example output below comes from both.)
Activate the environment:
. venv/bin/activate
which python
- this should return something like .../info319-exercises/venv/bin/python
python --version
pip --version
Upgrade pip if necessary and install pyspark:
python3 -m pip install --upgrade pip
pip --version
pip install pyspark
Check that pyspark was installed in the right place:
ls venv/lib/python3.10/site-packages/pyspark
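As a quick sanity check, you can also ask Python directly for the installed pyspark version (it should print a version number such as 3.3.0):

python -c "import pyspark; print(pyspark.__version__)"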
Start pyspark:
pyspark
Python 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
...don't panic if you get a few warnings here...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/

Using Python version 3.8.10 (default, Jun 22 2022 20:18:18)
Spark context Web UI available at http://172.23.240.233:4040
Spark context available as 'sc' (master = local[*], app id = local-1661179410845).
SparkSession available as 'spark'.
>>>
- the >>> prompt means you are ready to go, but first exit to download some data:
>>> exit()
Processing tweets
Download this file: Remove the '.txt' suffix and unpack it in your exercise folder.
ls tweet-id-text-345
Start pyspark again:
(venv) $ pyspark
...
>>> folder = 'tweet-id-text-345'
>>> tweets = spark.read.text(folder)
>>> type(tweets)
<class 'pyspark.sql.dataframe.DataFrame'>
- DataFrame is a very central data structure in Spark.
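To get a first look at the DataFrame, you can for example print its schema and the first few rows (spark.read.text puts each line of text into a single column called 'value'):

>>> tweets.printSchema()
>>> tweets.show(3, truncate=False)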
>>> tweets.count()
>>> tweet_list = tweets.collect()
>>> type(tweet_list)
- We are back in Python, but not completely:
>>> tweet_list[13]
>>> type(tweet_list[13])
<class 'pyspark.sql.types.Row'>
- DataFrame Rows are another central Spark data structure.
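A Row can be indexed by position or by column name, so both of these should return the raw tweet line:

>>> tweet_list[13][0]
>>> tweet_list[13]['value']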
- Can you get the rows out as Python dicts?
>>> python_list = [tweet.asDict() for tweet in tweet_list]
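A quick check that the conversion worked; each dict should have the single key 'value' that spark.read.text produced:

>>> type(python_list[13])
<class 'dict'>
>>> python_list[13].keys()
dict_keys(['value'])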
In Session 1 we will look at more things to do with Spark DataFrames. Here are some things to do in Exercise 1 (this list is not final):
- Load the tweets as json objects.
- Collect only the texts from the tweets.
- Split the texts into words and select all the hashtags.
- Build a graph of retweets.
- Split the tweets into two sets of 80% and 20% size.
- Find URLs in the texts and download a few image files.
- Work on a folder with more tweets.
- Open the Spark context Web UI (see pyspark's start-up banner).
- Experiment with different numbers of partitions and executors.
Of course, we will do these things _in Spark_, without going via plain Python.
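As a starting point, here is a minimal sketch of a few of the tasks in pyspark. It assumes the files can be read as JSON and that each tweet has a 'text' field; the exact field names depend on the tweet format, and you may need the multiLine=True option of spark.read.json if each file contains a single JSON object spread over several lines.

>>> from pyspark.sql.functions import col, explode, split

>>> json_tweets = spark.read.json(folder)    # load the tweets as JSON objects
>>> texts = json_tweets.select('text')       # collect only the texts
>>> words = texts.select(explode(split(col('text'), '\\s+')).alias('word'))
>>> hashtags = words.filter(col('word').startswith('#'))
>>> hashtags.show(10, truncate=False)
>>> train, test = json_tweets.randomSplit([0.8, 0.2], seed=42)   # 80%/20% split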
Set up git (optional)
Log in to a git hosting service, such as github.com or UiB's own GitLab at git.app.uib.no. (Setting up private and public SSH keys can be a bit of work, but you will need it later in the course anyway.)
Create a new project 'info319-exercises' (it is practical to use the same name as your folder). Copy the SSH address, such as 'git@git.app.uib.no:yourname/info319-exercises.git'.
Go back to your exercises folder. Create a file '.gitignore' with at least this line: '/venv/'.
echo "/venv/" > .gitignore
You can now push your project to the git repository:
cd info319-exercises
git init
git remote add origin git@git.app.uib.no:yourname/info319-exercises.git
git add .
git commit -m "First commit"
git branch -M main
git push -uf origin main
(There will not be much to push yet, since we haven't written any Spark _program_.)
Running Spark in VS Code
TBD