Getting started with Apache Spark: Difference between revisions
No edit summary |
No edit summary |
||
Line 10: | Line 10: | ||
=== Run PySpark from console === | === Run PySpark from console === | ||
Open a console (or terminal window). I will use a Linux console in the examples. | Open a console (or terminal window) on your computer. I will use a Linux console in the examples. If you are on a Windows computer, it is a very good idea to install [https://docs.microsoft.com/en-us/windows/wsl/about WSL2 - Windows Subsystem for Linux version 2]. Choose the Ubuntu 20.4 or 22.04 flavour. You can also install [https://docs.microsoft.com/en-us/windows/terminal/install Windows Terminal], which is much more user-friendly and flexible than the default console ('cmd.exe'). | ||
( | You need to have python3 and pip on your machine. On Ubuntu Linux (including on WSL2 Ubuntu flavour): | ||
$ sudo apt install python3 python3-dev python3-pip | |||
I have used both Python 3.8 and 3.10, but other recent versions should be fine. | |||
In addition, Spark will need a Java Runtime Environment - a JRE or JDK - somewhere on your PATH. For example: | |||
$ sudo apt install openjdk-8-jdk-headless/ | |||
And in your '~/.bashrc' file add the line (check the location first, it could depend on your system): | |||
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 | |||
( | (This is an old Java version, but we may use packages later in the course that have problems with never Javas.) | ||
mkdir info319-exercises | Create a new folder for running Spark: | ||
cd info319-exercises | $ mkdir info319-exercises | ||
$ cd info319-exercises | |||
Create a Python environment using pip, pipenv or Conda. I will use pip in the examples. It is simple and transparent. | Create a Python environment using pip, pipenv or Conda. I will use pip in the examples. It is simple and transparent. | ||
$ which python3 | $ which python3 | ||
$ python3 --version | $ python3 --version | ||
$ python3 -m venv venv | $ python3 -m venv venv | ||
Activate the new environment: | |||
Activate the environment: | |||
$ . venv/bin/activate | $ . venv/bin/activate | ||
(venv) $ which python | (venv) $ which python | ||
This should return something like .../info319-exercises/venv/bin/python | This should return something like .../info319-exercises/venv/bin/python | ||
Line 42: | Line 42: | ||
Upgrade pip if necessary and install PySpark: | Upgrade pip if necessary and install PySpark: | ||
(venv) $ python3 -m pip install --upgrade pip | (venv) $ python3 -m pip install --upgrade pip | ||
(venv) $ pip --version | (venv) $ pip --version | ||
Line 48: | Line 47: | ||
Check that PySpark was installed in the right place: | Check that PySpark was installed in the right place: | ||
(venv) $ ls venv/lib/python3.10/site-packages | (venv) $ ls venv/lib/python3.10/site-packages | ||
This should list the contents of the PySpark folder. | |||
Start PySpark: | Start PySpark: | ||
Line 58: | Line 55: | ||
[GCC 9.4.0] on linux | [GCC 9.4.0] on linux | ||
Type "help", "copyright", "credits" or "license" for more information. | Type "help", "copyright", "credits" or "license" for more information. | ||
(Don't panic if you see a few warnings here...) | (Don't panic if you see a few warnings here...) | ||
Welcome to | Welcome to | ||
____ __ | ____ __ | ||
Line 74: | Line 71: | ||
>>> | >>> | ||
The >>> prompt means you are ready to go, but first exit to download some data: | The >>> prompt means you are ready to go, but first you must exit to download some data: | ||
>>> exit() | >>> exit() | ||
=== Processing tweets === | === Processing tweets === | ||
Download [https://drive.google.com/file/d/175MWQp_D_eFqP3DcRhY9AwrHOvHMP0vT/view?usp=sharing this archive of random tweets], and unpack in your exercise folder: | Download [https://drive.google.com/file/d/175MWQp_D_eFqP3DcRhY9AwrHOvHMP0vT/view?usp=sharing this archive of random tweets], and unpack in your exercise folder: | ||
(venv) $ tar xjf tweet-id-text-345.tar.bz2 | (venv) $ tar xjf tweet-id-text-345.tar.bz2 | ||
(venv) $ ls tweet-id-text-345 | (venv) $ ls tweet-id-text-345 | ||
The folder should contain 345 small text files, each representing a tweet. | The folder should contain 345 small text files, each representing a tweet. | ||
(venv) $ pyspark | (venv) $ pyspark | ||
... | ... | ||
>>> folder = 'tweet-id-text-345' | >>> folder = 'tweet-id-text-345' | ||
>>> tweets = spark.read.text(folder) | >>> tweets = spark.read.text(folder) | ||
>>> type(tweets) | >>> type(tweets) | ||
<class 'pyspark.sql.dataframe.DataFrame'> | <class 'pyspark.sql.dataframe.DataFrame'> | ||
DataFrame is a very central data structure in Spark. | DataFrame is a very central data structure in Spark. | ||
Line 103: | Line 95: | ||
We are back in Python, but not completely: | We are back in Python, but not completely: | ||
>>> tweet_list[13] | >>> tweet_list[13] | ||
>>> type(tweet_list[13]) | >>> type(tweet_list[13]) | ||
<class 'pyspark.sql.types.Row'> | <class 'pyspark.sql.types.Row'> | ||
DataFrame Rows are another central Spark data structure. Can you get the rows out as Python dicts? | DataFrame Rows are another central Spark data structure. Can you get the rows out as Python dicts? | ||
=== Exploring tweets in Spark === | === Exploring tweets in Spark === | ||
In Session 1 we will look at more things to do with Spark DataFrames. Here are some possible things to do in Exercise 1 (this is not final): | In Session 1 we will look at more things to do with Spark DataFrames. Here are some possible things to do in Exercise 1 (this is not final): | ||
* Load the tweets as JSON objects. | |||
* Load the tweets as | |||
* Collect only the texts from the tweets. | * Collect only the texts from the tweets. | ||
* Split the texts into words and select all the hashtags. | * Split the texts into words and select all the hashtags. | ||
Line 121: | Line 110: | ||
* Work on a folder with more tweets. | * Work on a folder with more tweets. | ||
* Open the Spark context Web UI (see PySpark's start-up banner) | * Open the Spark context Web UI (see PySpark's start-up banner) | ||
* Experiment with different numbers of | * Experiment with different numbers of partitions and executors. | ||
Of course, we will do these things _in Spark_, without going via plain Python. | Of course, we will do these things _in Spark_, without going via plain Python. | ||
=== Set up git (optional) === | === Set up git (optional) === | ||
Log in to a git repository, such as github.com or | Log in to a git repository, such as github.com or [https://git.app.uib.no UiB's own GitLab]. It can be hard to set up with private and public SSH keys, but you will need it later in the course anyway. | ||
Create a new project 'info319-exercises' (it is practical to use same name as your folder). Copy the SSH address, such as 'git@git.app.uib.no:yourname/info319-exercises.git'. | Create a new project 'info319-exercises' (it is practical to use same name as your folder). Copy the SSH address, such as 'git@git.app.uib.no:yourname/info319-exercises.git'. | ||
Go back to your exercises folder. Create a file '.gitignore' with at least this line: '/venv/'. | Go back to your exercises folder. Create a file '.gitignore' with at least this line: '/venv/'. | ||
$ echo "/venv/" > .gitignore | $ echo "/venv/" > .gitignore | ||
You can now push your project to the git repository: | You can now push your project to the git repository: | ||
$ cd info319_exercises | $ cd info319_exercises | ||
$ git remote add origin https://git.app.uib.no/yourname/info319-exercises.git | $ git remote add origin https://git.app.uib.no/yourname/info319-exercises.git | ||
$ git branch -M main | $ git branch -M main | ||
$ git push -uf origin main | $ git push -uf origin main | ||
(The push will not contain many files since we haven't written any Spark _program_ yet.) | |||
(The push will | |||
=== Running Spark in Jupyter and VS Code === | === Running Spark in Jupyter and VS Code === |
Revision as of 10:49, 24 August 2022
Getting started with Apache Spark
Purpose
- Getting up and running with Apache Spark
- Using VS Code (or your IDE of choice) with Spark
- Writing and running your own first Spark instructions
For a general introduction, see the slides to Session 1 on Apache Spark. There is a useful tutorial at TutorialsPoint.
Preparations
In the first exercise, you will run Spark standalone on your own computer, first in a console (or terminal window) and then in your favourite IDE (Integrated Development Environment).
Run PySpark from console
Open a console (or terminal window) on your computer. I will use a Linux console in the examples. If you are on a Windows computer, it is a very good idea to install WSL2 - Windows Subsystem for Linux version 2. Choose the Ubuntu 20.4 or 22.04 flavour. You can also install Windows Terminal, which is much more user-friendly and flexible than the default console ('cmd.exe').
You need to have python3 and pip on your machine. On Ubuntu Linux (including on WSL2 Ubuntu flavour):
$ sudo apt install python3 python3-dev python3-pip
I have used both Python 3.8 and 3.10, but other recent versions should be fine.
In addition, Spark will need a Java Runtime Environment - a JRE or JDK - somewhere on your PATH. For example:
$ sudo apt install openjdk-8-jdk-headless/
And in your '~/.bashrc' file add the line (check the location first, it could depend on your system):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
(This is an old Java version, but we may use packages later in the course that have problems with never Javas.)
Create a new folder for running Spark:
$ mkdir info319-exercises $ cd info319-exercises
Create a Python environment using pip, pipenv or Conda. I will use pip in the examples. It is simple and transparent.
$ which python3 $ python3 --version $ python3 -m venv venv
Activate the new environment:
$ . venv/bin/activate (venv) $ which python
This should return something like .../info319-exercises/venv/bin/python
(venv) $ python --version (venv) $ pip --version
Upgrade pip if necessary and install PySpark:
(venv) $ python3 -m pip install --upgrade pip (venv) $ pip --version (venv) $ pip install pyspark
Check that PySpark was installed in the right place:
(venv) $ ls venv/lib/python3.10/site-packages
This should list the contents of the PySpark folder.
Start PySpark:
(venv) $ pyspark Python 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0] on linux Type "help", "copyright", "credits" or "license" for more information. (Don't panic if you see a few warnings here...) Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 3.3.0 /_/
Using Python version 3.8.10 (default, Jun 22 2022 20:18:18) Spark context Web UI available at http://172.23.240.233:4040 Spark context available as 'sc' (master = local[*], app id = local-1661179410845). SparkSession available as 'spark'. >>>
The >>> prompt means you are ready to go, but first you must exit to download some data:
>>> exit()
Processing tweets
Download this archive of random tweets, and unpack in your exercise folder:
(venv) $ tar xjf tweet-id-text-345.tar.bz2 (venv) $ ls tweet-id-text-345
The folder should contain 345 small text files, each representing a tweet.
(venv) $ pyspark ... >>> folder = 'tweet-id-text-345' >>> tweets = spark.read.text(folder) >>> type(tweets) <class 'pyspark.sql.dataframe.DataFrame'>
DataFrame is a very central data structure in Spark.
>>> tweets.count() >>> tweet_list = tweets.collect() >>> type(tweet_list)
We are back in Python, but not completely:
>>> tweet_list[13] >>> type(tweet_list[13]) <class 'pyspark.sql.types.Row'>
DataFrame Rows are another central Spark data structure. Can you get the rows out as Python dicts?
Exploring tweets in Spark
In Session 1 we will look at more things to do with Spark DataFrames. Here are some possible things to do in Exercise 1 (this is not final):
- Load the tweets as JSON objects.
- Collect only the texts from the tweets.
- Split the texts into words and select all the hashtags.
- Build a graph of retweets
- Split the tweets into two sets of 80% and 20% size.
- Find URLs in the texts and download a few image files.
- Work on a folder with more tweets.
- Open the Spark context Web UI (see PySpark's start-up banner)
- Experiment with different numbers of partitions and executors.
Of course, we will do these things _in Spark_, without going via plain Python.
Set up git (optional)
Log in to a git repository, such as github.com or UiB's own GitLab. It can be hard to set up with private and public SSH keys, but you will need it later in the course anyway.
Create a new project 'info319-exercises' (it is practical to use same name as your folder). Copy the SSH address, such as 'git@git.app.uib.no:yourname/info319-exercises.git'.
Go back to your exercises folder. Create a file '.gitignore' with at least this line: '/venv/'.
$ echo "/venv/" > .gitignore
You can now push your project to the git repository:
$ cd info319_exercises $ git remote add origin https://git.app.uib.no/yourname/info319-exercises.git $ git branch -M main $ git push -uf origin main
(The push will not contain many files since we haven't written any Spark _program_ yet.)
Running Spark in Jupyter and VS Code
You can also run PySpark in a Jupyter notebook, either standalone or inside an Integrated Development Environment (IDE).
The example below runs PySpark in Jupyter inside VS Code (Visual Studio Code). See Microsoft's own guide to Getting started with Visual Studio Code. See this guide for running Jupyter notebooks in VS Code.
The example code below should work on other Jupyters too, such as JupyterLab.
Install and start VS Code. Select File -> Open folder... and open 'info319-exercise' from before. Create a new file with extension .ipynb, for example exercise1.ipynb. Doubleclick the new file, which should open it in a Jupyter notebook inside VS Code.
Activate the new exercise1.ipynb tab and then click the 'Python' button near the upper right corner of the Python notebook. A window should pop up and ask you to Change kernel for 'exercise1.ipynb'. Type in or select ./venv/bin/python. Now you are ready to use Jupter inside VS Code.
In the first cell, enter (and run it with Shift-Return):
!pip install findspark
In the next:
import findspark findspark.init()
In the third:
from pyspark.sql import SQLContext, SparkSession spark = SparkSession \ .builder \ .appName("Jupyter Spark shell") \ .getOrCreate() sc = spark.sparkContext
You can now use Jupyter to run PySpark like before. For example:
folder = 'tweet-id-text-345' tweets = spark.read.text(folder) tweets.count()
An important advantage over running PySpark directly in a console/terminal window is that autocompletion and other help functions now work. You can also easily save and load notebooks, and benefit from other useful IDE functions.