Getting started with Apache Spark


Purpose

  • Getting up and running with Apache Spark
  • Getting experience with non-trivial Linux installation
  • Using VS Code (or another IDE of your choice)
  • Writing and running your own first Spark program

For a general introduction, see the slides for Session 1 on Apache Spark. There is also a useful tutorial at TutorialsPoint.

Preparations

In the first exercise, you will run Spark standalone on your own computer, both in a console/terminal window and in your favourite IDE (Integrated Development Environment). VS Code (Visual Studio Code) is recommended and will be used in these instructions.


Run the pyspark shell from a console

Open a console (or terminal) window. I will use a Linux console in the examples. (If you are on a Windows computer, this is a good time to install WSL2 - Windows Subsystem for Linux - and Windows Terminal, and to use WSL as your IDE's terminal or console, but it is not a priority right now.)


Create a Python environment using venv/pip, pipenv, or Conda. I will use venv with pip in the examples, since it is simple and transparent.

You need to have python3 and pip on your machine. (In addition, Spark needs a Java Runtime Environment somewhere on your PATH.)

sudo apt install python3 python3-dev python3-pip

mkdir info319-exercises
cd info319-exercises

which python3
python3 --version
python3 -m venv venv

I have used both Python 3.8 and 3.10, but other recent versions should be fine. (The examples will use 3.10.)

Activate the environment:

. venv/bin/activate
which python
# this should return something like .../info319-exercises/venv/bin/python

python --version
pip --version

Upgrade pip if necessary and install pyspark:

python3 -m pip install --upgrade pip
pip --version
pip install pyspark

Check that pyspark was installed in the right place:

ls venv/lib/python3.10/site-packages/pyspark
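
You can also check the installed version from Python itself (just a sanity check):

python -c "import pyspark; print(pyspark.__version__)"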

Start pyspark:

pyspark
Python 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

...don't panic if you get a few warnings here...

Welcome to

     ____              __
    / __/__  ___ _____/ /__
   _\ \/ _ \/ _ `/ __/  '_/
  /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
     /_/

Using Python version 3.8.10 (default, Jun 22 2022 20:18:18)
Spark context Web UI available at http://172.23.240.233:4040
Spark context available as 'sc' (master = local[*], app id = local-1661179410845).
SparkSession available as 'spark'.
>>>

The >>> prompt means you are ready to go, but first exit to download some data:

>>> exit()

Processing tweets

Download this file: remove the '.txt' suffix and unpack it in your exercise folder.

ls tweet-id-text-345

Start pyspark again:

(venv) $ pyspark
...

>>> folder = 'tweet-id-text-345'
>>> tweets = spark.read.text(folder)
>>> type(tweets)
<class 'pyspark.sql.dataframe.DataFrame'>

The DataFrame is a central data structure in Spark.
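
Since spark.read.text reads the files as plain text, this DataFrame has a single string column named 'value', with one row per line. You can check that with printSchema, which should print something like:

>>> tweets.printSchema()
root
 |-- value: string (nullable = true)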

>>> tweets.count()
>>> tweet_list = tweets.collect()
>>> type(tweet_list)
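
Note that collect() pulls every row back to the driver as a plain Python list, which is fine for a few hundred tweets but not for large datasets; show() and take() let you inspect just a few rows without collecting everything (a small aside, not part of the exercise):

>>> tweets.show(3, truncate=60)
>>> tweets.take(3)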

We are back in Python, but not completely:

>>> tweet_list[13]
>>> type(tweet_list[13])
<class 'pyspark.sql.types.Row'>

DataFrame Rows are another central Spark data structure. Can you get the rows out as Python dicts?

>>> python_list = [tweet.asDict() for tweet in tweet_list]
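
Each of these dicts has a single 'value' key holding one line of text, so you should see something roughly like this (the text below is just a placeholder, not an actual tweet):

>>> python_list[13]
{'value': '...'}
>>> tweet_list[13].value   # Rows also support direct field access
'...'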


In Session 1 we will look at more things to do with Spark DataFrames. Here are some things to do in exercise 1 (this list is not final):

  • Load the tweets as json objects.
  • Collect only the texts from the tweets.
  • Split the texts into words and select all the hashtags.
  • Build a graph of retweets.
  • Split the tweets into two sets of 80% and 20% size.
  • Find URLs in the texts and download a few image files.
  • Work on a folder with more tweets.
  • Open the Spark context Web UI (see pyspark's start-up banner).
  • Experiment with different numbers of partitions and executors.

Of course, we will do these things _in Spark_, without going via plain Python.
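
For example, here is a rough (and non-final) sketch of a few of the tasks above using the DataFrame API. It assumes each downloaded file contains one tweet as a JSON object with a 'text' field, so check the actual files and tweets.printSchema() first:

>>> tweets = spark.read.json('tweet-id-text-345')           # load the tweets as json objects
>>> texts = tweets.select('text')                           # collect only the texts
>>> from pyspark.sql.functions import explode, split, col
>>> words = texts.select(explode(split(col('text'), r'\s+')).alias('word'))
>>> hashtags = words.filter(col('word').startswith('#'))    # select all the hashtags
>>> train, test = tweets.randomSplit([0.8, 0.2], seed=42)   # 80%/20% split

The retweet graph and the URL/image tasks need more of the tweets' JSON structure, so they are easier once you have explored the schema.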

Set up git (optional)

Log in to a git repository host, such as github.com or UiB's own GitLab at git.app.uib.no. (Setting this up with private and public SSH keys can be hard, but you will need it later in the course anyway.)

Create a new project 'info319-exercises' (it is practical to use the same name as your folder). Copy the SSH address, such as 'git@git.app.uib.no:yourname/info319-exercises.git'.

Go back to your exercises folder. Create a file '.gitignore' with at least this line: '/venv/'.

echo "/venv/" > .gitignore

You can now push your project to the git repository:

cd info319-exercises
git init
git add .
git commit -m "Initial commit"
git remote add origin git@git.app.uib.no:yourname/info319-exercises.git
git branch -M main
git push -uf origin main

(The push will be sparse since we haven't written any Spark _program_ yet.)

Running Spark in VS Code

TBD