Getting started with Apache Spark
Purpose
- Getting up and running with Apache Spark
- Getting experience with a non-trivial Linux installation
- Using VS Code (or another IDE of your choice)
- Writing and running your own first Spark program
For a general introduction, see the slides for Session 1 on Apache Spark. There is also a useful tutorial at TutorialsPoint.
Preparations
In the first exercise, you will run Spark standalone on your own computer, both in a console/terminal window and in your favourite IDE (Integrated Development Environment). VS Code (Visual Studio Code) is recommended and will be used in these instructions.
Run the pyspark shell from a console
Open a console (or terminal) window. I will use a Linux console in the examples.
(If you are on a Windows computer, it is a very good idea to install WSL2 - Windows Subsystem for Linux - and use it as your console/terminal window. But it is not a priority right now.)
You need python3 and pip on your machine. On Debian-based Linux (such as Ubuntu):
sudo apt install python3 python3-dev python3-pip
(In addition, Spark needs a Java Runtime Environment - a JRE or JDK - somewhere on your PATH.)
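You can check whether Java is already available and, on a Debian-based system, install a JDK if it is not:
$ java -version
$ sudo apt install default-jdk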
Create a folder for the exercises and enter it:
$ mkdir info319-exercises
$ cd info319-exercises
Create a Python virtual environment using venv, pipenv or Conda. I will use venv with pip in the examples. It is simple and transparent.
$ which python3
$ python3 --version
$ python3 -m venv venv
I have used both Python 3.8 and 3.10, but other recent versions should be fine. (Adjust the version numbers in the paths and outputs below to match your own.)
Activate the environment:
$ . venv/bin/activate
(venv) $ which python
This should return something like .../info319-exercises/venv/bin/python
(venv) $ python --version
(venv) $ pip --version
Upgrade pip if necessary and install pyspark:
(venv) $ python3 -m pip install --upgrade pip
(venv) $ pip --version
(venv) $ pip install pyspark
Check that pyspark was installed in the right place:
(venv) $ ls venv/lib/python3.10/site-packages
You should now see the pyspark folder.
Start pyspark:
(venv) $ pyspark
Python 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Don't panic if you get a few warnings here...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/
Using Python version 3.8.10 (default, Jun 22 2022 20:18:18)
Spark context Web UI available at http://172.23.240.233:4040
Spark context available as 'sc' (master = local[*], app id = local-1661179410845).
SparkSession available as 'spark'.
>>>
The >>> prompt means you are ready to go, but first exit to download some data:
>>> exit()
Processing tweets
Download this archive: [file:tweet-id-text-345.tar.bz2.txt]. Remove the '.txt' suffix and unpack it in your exercise folder.
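The renaming can be done on the command line, for example (assuming the file was downloaded into the exercise folder):
(venv) $ mv tweet-id-text-345.tar.bz2.txt tweet-id-text-345.tar.bz2
Then unpack the archive and list the result: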
(venv) $ tar xjf tweet-id-text-345.tar.bz2
(venv) $ ls tweet-id-text-345
The folder should contain 345 small text files, each representing a tweet.
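A quick way to verify the number of files:
(venv) $ ls tweet-id-text-345 | wc -l
345
Then start pyspark again: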
(venv) $ pyspark
...
>>> folder = 'tweet-id-text-345'
>>> tweets = spark.read.text(folder)
>>> type(tweets)
<class 'pyspark.sql.dataframe.DataFrame'>
The DataFrame is a central data structure in Spark.
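To get a first look at it, try a couple of standard DataFrame methods (spark.read.text produces a single string column named 'value'):
>>> tweets.printSchema()
root
 |-- value: string (nullable = true)
>>> tweets.show(3, truncate=False)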
>>> tweets.count()
>>> tweet_list = tweets.collect()
>>> type(tweet_list)
We are back in Python, but not completely:
>>> tweet_list[13]
>>> type(tweet_list[13])
<class 'pyspark.sql.types.Row'>
DataFrame Rows are another central Spark data structure. Can you get the rows out as Python dicts?
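One way (a hint, not the only one) is the Row method asDict():
>>> tweet_list[13].asDict()
{'value': '...'}
>>> tweet_dicts = [row.asDict() for row in tweet_list]
>>> type(tweet_dicts[13])
<class 'dict'>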
Exploring tweets in Spark
In Session 1 we will look at more things to do with Spark DataFrames. Here are some possible things to do in Exercise 1 (this is not final):
- Load the tweets as json objects (a first sketch follows after this list).
- Collect only the texts from the tweets.
- Split the texts into words and select all the hashtags.
- Build a graph of retweets.
- Split the tweets into two sets of 80% and 20% size.
- Find URLs in the texts and download a few image files.
- Work on a folder with more tweets.
- Open the Spark context Web UI (see pyspark's start-up banner).
- Experiment with different numbers of partitions and executors.
Of course, we will do these things _in Spark_, without going via plain Python.
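To get you started, here is a minimal sketch of a few of the items above. It assumes each file contains one JSON object on a single line (otherwise, add the multiLine option to the reader) and that the tweet JSON has a 'text' field; check the output of printSchema first. Everything else uses standard functions from pyspark.sql.functions:
>>> from pyspark.sql.functions import explode, split, col
>>> tweets_json = spark.read.json(folder)
>>> tweets_json.printSchema()
>>> words = tweets_json.select(explode(split(col('text'), '\\s+')).alias('word'))
>>> hashtags = words.filter(col('word').startswith('#'))
>>> hashtags.show(10, truncate=False)
>>> train, test = tweets_json.randomSplit([0.8, 0.2], seed=17)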
Set up git (optional)
Log in to a git hosting service, such as github.com or UiB's own GitLab, git.app.uib.no. (This can be tricky to set up with private and public SSH keys, but you will need it later in the course anyway.)
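If you have not set up SSH keys before, a minimal sketch (the e-mail address is just a placeholder, and the exact upload step depends on the service's web interface):
$ ssh-keygen -t ed25519 -C "your.name@example.com"
$ cat ~/.ssh/id_ed25519.pub
Paste the printed public key into the SSH-keys page of your account settings.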
Create a new project 'info319-exercises' (it is practical to use the same name as your folder). Copy the SSH address, such as 'git@git.app.uib.no:yourname/info319-exercises.git'.
Go back to your exercises folder. Create a file '.gitignore' with at least this line: '/venv/'.
$ echo "/venv/" > .gitignore
You can now push your project to the git repository:
$ cd info319-exercises
$ git init
$ git add .
$ git commit -m "First commit"
$ git remote add origin git@git.app.uib.no:yourname/info319-exercises.git
$ git branch -M main
$ git push -uf origin main
(The push will be a small one, since we haven't written any Spark _program_ yet.)
Running Spark in VS Code
TBD