Getting started with Apache Spark


Purpose

  • Getting up and running with Apache Spark
  • Getting experience with non-trivial Linux installation
  • Using VS Code (or another IDE of your choice)
  • Writing and running your own first Spark program

For a general introduction, see the slides for Session 1 on Apache Spark. There is also a useful tutorial at TutorialsPoint.

Preparations

In the first exercise, you will run Spark standalone on your own computer, both in a console/terminal window and in your favourite IDE (Integrated Development Environment). VS Code (Visual Studio Code) is recommended and will be used in these instructions.

Run pyspark-shell from console

Open a console (or terminal) window. I will use a Linux console in the examples.

(If you are on a Windows computer, it is a very good idea to install WSL2 - Windows Subsystem for Linux - and use it as your console/terminal window. But it is not a priority right now.)

You need to have python3 and pip on your machine. On Linux:

sudo apt install python3 python3-dev python3-pip

(In addition, Spark needs a Java Runtime Environment - a JRE or JDK - somewhere on your PATH. If java -version works in your console, you should be fine.)

Create a folder for the exercises and change into it:

mkdir info319-exercises
cd info319-exercises

Create a Python virtual environment using venv, pipenv or Conda. I will use venv with pip in the examples, since it is simple and transparent.

$ which python3
$ python3 --version
$ python3 -m venv venv

I have used both Python 3.8 and 3.10, but other recent versions should be fine. (The example outputs below come from both.)

Activate the environment:

$ . venv/bin/activate
(venv) $ which python

This should return something like .../info319-exercises/venv/bin/python

(venv) $ python --version
(venv) $ pip --version

Upgrade pip if necessary and install pyspark:

(venv) $ python3 -m pip install --upgrade pip
(venv) $ pip --version
(venv) $ pip install pyspark

Check that pyspark was installed in the right place (replace python3.10 with your own Python version if it differs):

(venv) $ ls venv/lib/python3.10/site-packages

You should now see the pyspark folder.

Start pyspark:

(venv) $ pyspark
Python 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information. 

Don't panic if you get a few warnings here...

Welcome to
     ____              __
    / __/__  ___ _____/ /__
   _\ \/ _ \/ _ `/ __/  '_/
  /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
     /_/
Using Python version 3.8.10 (default, Jun 22 2022 20:18:18)
Spark context Web UI available at http://172.23.240.233:4040
Spark context available as 'sc' (master = local[*], app id = local-1661179410845).
SparkSession available as 'spark'.
>>>

The >>> prompt means you are ready to go, but first exit to download some data:

>>> exit() 

Processing tweets

Download this archive of random tweets and unpack it in your exercises folder:

(venv) $ tar xjf tweet-id-text-345.tar.bz2
(venv) $ ls tweet-id-text-345

The folder should contain 345 small text files, each representing a tweet.

(venv) $ pyspark

...

>>> folder = 'tweet-id-text-345'
>>> tweets = spark.read.text(folder)
>>> type(tweets)
<class 'pyspark.sql.dataframe.DataFrame'>

The DataFrame is a central data structure in Spark.
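
To get a feel for the DataFrame, try a couple of its standard methods directly in the shell. Since spark.read.text reads the files line by line, the schema is a single string column called 'value':

>>> tweets.printSchema()
root
 |-- value: string (nullable = true)
>>> tweets.show(3, truncate=False)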

>>> tweets.count()
>>> tweet_list = tweets.collect()
>>> type(tweet_list)

We are back in Python, but not completely:

>>> tweet_list[13]
>>> type(tweet_list[13])
<class 'pyspark.sql.types.Row'>

DataFrame Rows are another central Spark data structure. Can you get the rows out as Python dicts?
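
One possibility (a hint, not the whole exercise): Row objects have a standard asDict() method.

>>> tweet_list[13].asDict()
>>> dicts = [row.asDict() for row in tweet_list]
>>> dicts[13]['value']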

Exploring tweets in Spark

In Session 1 we will look at more things you can do with Spark DataFrames. Here are some possible tasks for Exercise 1 (the list is not final):

  • Load the tweets as json objects.
  • Collect only the texts from the tweets.
  • Split the texts into words and select all the hashtags.
  • Build a graph of retweets.
  • Split the tweets into two sets of 80% and 20% size.
  • Find URLs in the texts and download a few image files.
  • Work on a folder with more tweets.
  • Open the Spark context Web UI (see pyspark's start-up banner).
  • Experiment with different numbers of partitions and executors.

Of course, we will do these things _in Spark_, without going via plain Python.
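
As a starting point, here is a minimal sketch of a few of these items for the pyspark shell. It assumes the tweet files can be parsed as JSON objects with a 'text' field; if they turn out to be plain text only, you can do the same word-splitting on the 'value' column of the DataFrame from spark.read.text instead.

>>> from pyspark.sql.functions import col, explode, split
>>> tweets_json = spark.read.json('tweet-id-text-345')  # one JSON object per file (assumption)
>>> tweets_json.printSchema()                           # check which fields are actually there
>>> texts = tweets_json.select('text')                  # assumes a 'text' field
>>> words = texts.select(explode(split(col('text'), r'\s+')).alias('word'))
>>> hashtags = words.filter(col('word').startswith('#'))
>>> hashtags.show(10, truncate=False)
>>> train, test = tweets_json.randomSplit([0.8, 0.2], seed=42)  # roughly 80%/20%

Note that randomSplit only gives approximately 80/20, since each row is sampled independently.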

Set up git (optional)

Log in to a git hosting service, such as github.com or UiB's own GitLab at git.app.uib.no. (This can be hard to set up with private and public SSH keys, but you will need it later in the course anyway.)

Create a new project 'info319-exercises' (it is practical to use the same name as your folder). Copy the SSH address, such as 'git@git.app.uib.no:yourname/info319-exercises.git'.

Go back to your exercises folder. Create a file '.gitignore' with at least this line: '/venv/'.

$ echo "/venv/" > .gitignore

You can now push your project to the git repository:

$ cd info319-exercises
$ git init
$ git add .
$ git commit -m "First commit"
$ git remote add origin git@git.app.uib.no:yourname/info319-exercises.git
$ git branch -M main
$ git push -uf origin main

(The push will be sparse since we haven't written any Spark _program_ yet.)

Running Spark in Jupyter and VS Code

You can also run pyspark in a Jupyter notebook, either standalone or inside an IDE. The example below is for Jupyter inside VS Code, but should work in other Jupyter setups too.

Install and start VS Code. Open your 'info319-exercises' folder from before. Create a new file with the extension .ipynb, for example exercise1.ipynb. Double-click the new file, which should open as a Jupyter notebook inside VS Code.

Click the 'Python' button near the upper right corner of the Python notebook. A window should pop up and ask you to Change kernel for 'exercise1.ipynb'. Type in or select ./venv/bin/python.

In the first cell:

!pip install findspark

In the next:

import findspark
findspark.init()

In the third:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession for this notebook
spark = SparkSession \
        .builder \
        .appName("Jupyter Spark shell") \
        .getOrCreate()
sc = spark.sparkContext

You can now use the notebook as a pyspark shell like before. For example:

folder = 'tweet-id-text-345'
tweets = spark.read.text(folder)
tweets.count()
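
When you are done experimenting, it is good practice to stop the Spark session from the notebook so its resources are released:

spark.stop()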