Getting started with Apache Spark
Purpose
- Getting up and running with Apache Spark
- Getting experience with non-trivial Linux installation
- Using VS Code (or another IDE of your choice)
- Writing and running your own first Spark program
For a general introduction, see the slides for Session 1 on Apache Spark. There is a useful tutorial at TutorialsPoint.
Preparations
In the first exercise, you will run Spark standalone on your own computer, both in a console/terminal window and in your favourite IDE (Integrated Development Environment). VS Code (Visual Studio Code) is recommended and will be used in these instructions.
Run pyspark-shell from console
Open a console (or terminal) window. I will use a Linux console in the examples.
(If you are on a Windows computer, it is a very good idea to install WSL2 - Windows Subsystem for Linux - and use it as your console/terminal window. But it is not a priority right now.)
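(If you do decide to set up WSL2 later, recent Windows versions can usually install it with a single command from an administrator PowerShell, although the exact requirements depend on your Windows build:)

PS> wsl --install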
You need to have python3 and pip on your machine. On Debian/Ubuntu-based Linux:

$ sudo apt install python3 python3-dev python3-pip
(In addition, Spark needs a Java Runtime Environment - a JRE or JDK - somewhere on your PATH.)
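Since Spark will not start without Java, it can be worth checking this right away. A quick check, plus an install command that should work on Debian/Ubuntu-style systems (package names may differ on other distributions):

$ java -version
$ sudo apt install default-jre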
Create a folder for the exercises and move into it:

$ mkdir info319-exercises
$ cd info319-exercises
Create a Python environment using pip, pipenv or Conda. I will use pip in the examples. It is simple and transparent.
$ which python3
$ python3 --version
$ python3 -m venv venv
I have used both Python 3.8 and 3.10, but other recent versions should be fine. (The examples will use 3.10.)
Activate the environment:
$ . venv/bin/activate
(venv) $ which python
This should return something like .../info319-exercises/venv/bin/python
(venv) $ python --version
(venv) $ pip --version
Upgrade pip if necessary and install pyspark:
(venv) $ python3 -m pip install --upgrade pip
(venv) $ pip --version
(venv) $ pip install pyspark
Check that pyspark was installed in the right place:
(venv) $ ls venv/lib/python3.10/site-packages
You should now see the pyspark folder.
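You can also ask pip where it installed the package (this works with any standard pip installation):

(venv) $ pip show pyspark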
Start pyspark:
(venv) $ pyspark
Python 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Don't panic if you get a few warnings here...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/
Using Python version 3.8.10 (default, Jun 22 2022 20:18:18)
Spark context Web UI available at http://172.23.240.233:4040
Spark context available as 'sc' (master = local[*], app id = local-1661179410845).
SparkSession available as 'spark'.
>>>
The >>> prompt means you are ready to go, but first exit to download some data:
>>> exit()
Processing tweets
Download this archive: [file:tweet-id-text-345.tar.bz2.txt]. Remove the '.txt' suffix and unpack in your exercise folder:
(venv) $ tar xjf tweet-id-text-345.tar.bz2
(venv) $ ls tweet-id-text-345
The folder should contain 345 small text files, each representing a tweet.
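If you want to confirm the number of files, plain shell tools are enough (nothing Spark-specific here):

(venv) $ ls tweet-id-text-345 | wc -l

Then start pyspark again: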
(venv) $ pyspark
...
>>> folder = 'tweet-id-text-345'
>>> tweets = spark.read.text(folder)
>>> type(tweets)
<class 'pyspark.sql.dataframe.DataFrame'>
The DataFrame is one of Spark's central data structures.
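To get a first impression of a DataFrame, you can for example print its schema and look at a few rows (a minimal illustration; the exact output depends on the data):

>>> tweets.printSchema()
>>> tweets.show(5, truncate=False)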
>>> tweets.count()
>>> tweet_list = tweets.collect()
>>> type(tweet_list)
collect() returns a plain Python list, so we are back in Python, but not completely:
>>> tweet_list[13]
>>> type(tweet_list[13])
<class 'pyspark.sql.types.Row'>
DataFrame Rows are another central Spark data structure. Can you get the rows out as Python dicts?
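One possible way (just a hint; asDict() is a method on pyspark Row objects):

>>> tweet_list[13].asDict()
>>> dicts = [row.asDict() for row in tweet_list]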
Exploring tweets in Spark
In Session 1 we will look at more things to do with Spark DataFrames. Here are some possible things to do in Exercise 1 (this is not final):
- Load the tweets as json objects.
- Collect only the texts from the tweets.
- Split the texts into words and select all the hashtags.
- Build a graph of retweets.
- Split the tweets into two sets of 80% and 20% size.
- Find URLs in the texts and download a few image files.
- Work on a folder with more tweets.
- Open the Spark context Web UI (see pyspark's start-up banner).
- Experiment with different numbers of partitions and executors.
Of course, we will do these things _in Spark_, without going via plain Python.
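As a small taste of what that can look like, here is a sketch of the first few items. It assumes the files parse as Twitter JSON with a 'text' field, which you should verify against the actual data first:

>>> tweets_json = spark.read.json(folder)
>>> texts = tweets_json.select('text')
>>> from pyspark.sql.functions import explode, split
>>> words = texts.select(explode(split('text', r'\s+')).alias('word'))
>>> hashtags = words.filter(words.word.startswith('#'))
>>> hashtags.show(10, truncate=False)
>>> train, test = tweets_json.randomSplit([0.8, 0.2], seed=42)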
Set up git (optional)
Log in to a git repository, such as github.com or UiB's own GitLab git.app.uib.no . (This can be hard to set up with private and public SSH keys, but you will need it later in the course anyway.)
Create a new project 'info319-exercises' (it is practical to use the same name as your folder). Copy the SSH address, such as 'git@git.app.uib.no:yourname/info319-exercises.git'.
Go back to your exercises folder. Create a file '.gitignore' with at least this line: '/venv/'.
$ echo "/venv/" > .gitignore
You can now push your project to the git repository:
$ cd info319-exercises
$ git remote add origin https://git.app.uib.no/yourname/info319-exercises.git
$ git branch -M main
$ git push -uf origin main
(The push will be sparse since we haven't written any Spark _program_ yet.)
Running Spark in VS Code
TBD