Getting started with Apache Spark

Purpose

  • Getting up and running with Apache Spark
  • Using VS Code (or your IDE of choice) with Spark
  • Running your own first Spark instructions

Preparations

In the first exercise, you will run Spark standalone on your own computer, first in a console (or terminal window) and then in your favourite IDE (Integrated Development Environment).

Run PySpark from console

Open a console (or terminal window) on your computer. I will use a Linux console in the examples. If you are on a Windows computer, it is a very good idea to install WSL2 - Windows Subsystem for Linux version 2 (https://docs.microsoft.com/en-us/windows/wsl/about). Choose the Ubuntu 20.04 or 22.04 flavour. You can also install Windows Terminal (https://docs.microsoft.com/en-us/windows/terminal/install), which is much more user-friendly and flexible than the default console ('cmd.exe').

You need to have python3 and pip on your machine. On Ubuntu Linux (including the WSL2 Ubuntu flavour):

$ sudo apt install python3 python3-dev python3-pip

I have used both Python 3.8 and 3.10, but other recent versions should be fine.

In addition, Spark will need a Java Runtime Environment - a JRE or JDK - somewhere on your PATH. For example:

$ sudo apt install openjdk-17-jdk-headless

Then add this line to your '~/.bashrc' file (check the JDK location first; it can vary between systems):

export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64

(We may use packages later in the course that have problems with the newest Java versions, but let us start with 17.)
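
You can quickly check that Java is found before moving on (the exact version string will differ on your system):

$ source ~/.bashrc
$ echo $JAVA_HOME
$ java -version

This should print the JDK path you just exported and an OpenJDK 17 version string.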

Create a new folder for running Spark:

$ mkdir info319-exercises
$ cd info319-exercises

Create a Python environment using pip, pipenv or Conda. I will use pip in the examples. It is simple and transparent.

$ which python3
$ python3 --version
$ python3 -m venv venv

Activate the new environment:

$ . venv/bin/activate
(venv) $ which python

This should return something like .../info319-exercises/venv/bin/python

(venv) $ python --version
(venv) $ pip --version

Upgrade pip if necessary and install PySpark:

(venv) $ python3 -m pip install --upgrade pip
(venv) $ pip --version
(venv) $ pip install pyspark

Check that PySpark was installed in the right place:

(venv) $ ls venv/lib/python3.10/site-packages/pyspark

This should list the contents of the pyspark package folder. (Adjust 'python3.10' in the path to match your Python version.)
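
As an extra check, you can verify that the new environment can import PySpark and print its version (a quick sanity test):

(venv) $ python -c "import pyspark; print(pyspark.__version__)"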

Start PySpark:

(venv) $ pyspark
Python 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information. 

(Don't panic if you see a few warnings here...)

Welcome to
     ____              __
    / __/__  ___ _____/ /__
   _\ \/ _ \/ _ `/ __/  '_/
  /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
     /_/

Using Python version 3.8.10 (default, Jun 22 2022 20:18:18)
Spark context Web UI available at http://172.23.240.233:4040
Spark context available as 'sc' (master = local[*], app id = local-1661179410845).
SparkSession available as 'spark'.
>>>

The >>> prompt means you are ready to go, but first you must exit to download some data:

>>> exit() 

Processing tweets

Download this archive of random tweets (available from Files -> Datafiles in https://mitt.uib.no, or directly from https://mitt.uib.no/courses/37204/files/4406276/download?download_frd=1), and unpack it in your exercise folder:

(venv) $ tar xjf tweet-id-text-345.tar.bz2
(venv) $ ls tweet-id-text-345

The folder should contain 345 small text files, each representing a tweet.
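
If you want to double-check the unpacked data, counting the files is a one-liner:

(venv) $ ls tweet-id-text-345 | wc -l

This should print 345.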

(venv) $ pyspark

...

>>> folder = 'tweet-id-text-345'
>>> tweets = spark.read.text(folder)
>>> type(tweets)
<class 'pyspark.sql.dataframe.DataFrame'>

DataFrame is a central data structure in Spark; see the Spark SQL, DataFrames and Datasets Guide (https://spark.apache.org/docs/latest/sql-programming-guide.html) and the PySpark DataFrame quickstart (https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html) for more examples.
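
Before counting and collecting, it can be useful to peek at the DataFrame; for example (the exact output depends on the tweets you downloaded):

>>> tweets.printSchema()
>>> tweets.show(3, truncate=40)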

>>> tweets.count()
>>> tweet_list = tweets.collect()
>>> type(tweet_list)

We are back in Python, but not completely:

>>> tweet_list[13]
>>> type(tweet_list[13])
<class 'pyspark.sql.types.Row'>

DataFrame Rows are another central Spark data structure. Can you get the rows out as Python dicts?
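
One possible answer, using the Row.asDict() method:

>>> python_list = [tweet.asDict() for tweet in tweet_list]
>>> type(python_list[13])
<class 'dict'>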

Set up git (optional)

Log in to a git hosting service, such as github.com or UiB's own GitLab (https://git.app.uib.no). Setting up private and public SSH keys can take some work, but you will need them later in the course anyway.

Create a new project 'info319-exercises' (it is practical to use the same name as your folder). Copy the SSH address, such as 'git@git.app.uib.no:yourname/info319-exercises.git'.

Go back to your exercises folder. Create a file '.gitignore' with at least this line: '/venv/'.

$ echo "/venv/" > .gitignore

You can now push your project to the git repository:

$ cd info319-exercises
$ git remote add origin git@git.app.uib.no:yourname/info319-exercises.git
$ git branch -M main
$ git push -uf origin main

(The push will not contain many files since we haven't written any Spark program yet.)

Running Spark in Jupyter and VS Code

You can also run PySpark in a Jupyter notebook, either standalone or inside an Integrated Development Environment (IDE). On Windows, you may also need to install the Visual Studio Code Remote - WSL plugin.

The example below runs PySpark in Jupyter inside VS Code (Visual Studio Code). See Microsoft's own guide to Getting started with Visual Studio Code. See this guide for running Jupyter notebooks in VS Code.

The example code below should work on other Jupyters too, such as JupyterLab.

Install and start VS Code. Install the Python and Jupyter extensions (and, if you work in WSL2, the Remote - WSL extension mentioned above).

Select File -> Open folder... and open the 'info319-exercises' folder from before. Create a new file with extension .ipynb, for example exercise1.ipynb. Double-click the new file, which should open it as a Jupyter notebook inside VS Code.

Activate the new exercise1.ipynb tab and then click the 'Python' button near the upper right corner of the notebook. A window should pop up and ask you to Change kernel for 'exercise1.ipynb'. Type in or select ./venv/bin/python. Now you are ready to use Jupyter inside VS Code.

In the first cell, enter (and run it with Shift-Return):

!pip install findspark

In the next:

import findspark
findspark.init()
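
(findspark simply locates a Spark installation and puts it on sys.path; since pyspark was pip-installed into the venv that the notebook kernel uses, this step may be redundant, but it does no harm.)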

In the third:

from pyspark.sql import SparkSession
spark = SparkSession \
        .builder \
        .appName("Jupyter Spark shell") \
        .getOrCreate()
sc = spark.sparkContext
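
(The SparkContext object sc gives access to the lower-level RDD API, while spark is the entry point for DataFrames and SQL.)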

You can now use Jupyter to run PySpark like before. For example:

folder = 'tweet-id-text-345'
tweets = spark.read.text(folder)
tweets.count()

Spark is lazily evaluated, so if you enter transformations, you will normally not get any output before you combine them with actions (such as count(), collect(), or show()). But you can instruct Spark to evaluate eagerly instead:

spark.conf.set('spark.sql.repl.eagerEval.enabled', True)
spark.conf.set('spark.sql.repl.eagerEval.maxNumRows', 5)
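
With eager evaluation switched on, just evaluating a DataFrame expression at the end of a cell will render up to five rows, for example (assuming the tweets DataFrame from above):

tweets.limit(5)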

An important advantage over running PySpark directly in a console/terminal window is that autocompletion and other help functions now work. You can also easily save and load notebooks, and benefit from other useful IDE functions.