Getting started with Apache Spark

Purpose

  • Getting up and running with Apache Spark
  • Getting experience with non-trivial Linux installation
  • Using VS Code (or another IDE of your choice)
  • Writing and running your own first Spark program

For a general introduction, see the slides to Session 1 on Apache Spark. There is a useful tutorial at TutorialsPoint.

Preparations

In the first exercise, you will run Spark standalone on your own computer, both in a console/terminal window and in your favourite IDE (Integrated Development Environment). VS Code (Visual Studio Code) is recommended and will be used in these instructions.

Run pyspark-shell from console

Open a console (or terminal) window. I will use a Linux console in the examples.

(If you are on a Windows computer, it is a very good idea to install WSL2 - Windows Subsystem for Linux - and use it as your console/terminal window. But it is not a priority right now.)

You need to have python3 and pip on your machine. On Linux:

sudo apt install python3 python3-dev python3-pip

(In addition, Spark needs a Java Runtime Environment - a JRE or JDK - somewhere on your PATH.)
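
To check whether a Java runtime is already available, and to install one on Ubuntu/Debian (including WSL2) if it is not, something like this should work (the exact package name and version may differ on your system):

$ java -version
$ sudo apt install default-jre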

mkdir info319-exercises
cd info319-exercises

Create a Python virtual environment using venv, pipenv or Conda. I will use venv with pip in the examples. It is simple and transparent.

$ which python3
$ python3 --version
$ python3 -m venv venv

I have used both Python 3.8 and 3.10, but other recent versions should be fine. (The examples will use 3.10.)

Activate the environment:

$ . venv/bin/activate
(venv) $ which python

This should return something like .../info319-exercises/venv/bin/python

(venv) $ python --version
(venv) $ pip --version

Upgrade pip if necessary and install pyspark:

(venv) $ python3 -m pip install --upgrade pip
(venv) $ pip --version
(venv) $ pip install pyspark

Check that pyspark was installed in the right place:

(venv) $ ls venv/lib/python3.10/site-packages

You should now see the pyspark folder.
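
You can also check the installation from Python itself (a quick sanity check; the version printed depends on what pip installed):

(venv) $ python -c "import pyspark; print(pyspark.__version__)"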

Start pyspark:

(venv) $ pyspark
Python 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information. 

Don't panic if you get a few warnings here...

Welcome to
     ____              __
    / __/__  ___ _____/ /__
   _\ \/ _ \/ _ `/ __/  '_/
  /__ / .__/\_,_/_/ /_/\_\   version 3.3.0
     /_/
Using Python version 3.8.10 (default, Jun 22 2022 20:18:18)
Spark context Web UI available at http://172.23.240.233:4040
Spark context available as 'sc' (master = local[*], app id = local-1661179410845).
SparkSession available as 'spark'.
>>>

The >>> prompt means you are ready to go, but first exit to download some data:

>>> exit() 

Processing tweets

Download this archive: [file:tweet-id-text-345.tar.bz2.txt]. Remove the '.txt' suffix and unpack in your exercise folder:

(venv) $ tar xjf tweet-id-text-345.tar.bz2
(venv) $ ls tweet-id-text-345

The folder should contain 345 small text files, each representing a tweet.
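
A quick way to verify the count from the shell (assuming a standard Linux userland):

(venv) $ ls tweet-id-text-345 | wc -l
345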

(venv) $ pyspark

...

>>> folder = 'tweet-id-text-345'
>>> tweets = spark.read.text(folder)
>>> type(tweets)
<class 'pyspark.sql.dataframe.DataFrame'>

DataFrame is a very central data structure in Spark.
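
Two handy ways to inspect a DataFrame are printSchema() and show(). Since spark.read.text puts each line of text into a single string column named 'value', you should see something like:

>>> tweets.printSchema()
root
 |-- value: string (nullable = true)
>>> tweets.show(3, truncate=False)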

>>> tweets.count()
>>> tweet_list = tweets.collect()
>>> type(tweet_list)

We are back in Python, but not completely:

>>> tweet_list[13]
>>> type(tweet_list[13])
<class 'pyspark.sql.types.Row'>

DataFrame Rows are another central Spark data structure. Can you get the rows out as Python dicts?
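
One answer is Row.asDict(), which converts a single Row into an ordinary Python dict (a small sketch; python_list is just an illustrative name):

>>> python_list = [tweet.asDict() for tweet in tweet_list]
>>> type(python_list[13])
<class 'dict'>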

Exploring tweets in Spark

In Session 1 we will look at more things to do with Spark DataFrames. Here are some possible things to do in Exercise 1 (this is not final):

  • Load the tweets as json objects.
  • Collect only the texts from the tweets.
  • Split the texts into words and select all the hashtags.
  • Build a graph of retweets.
  • Split the tweets into two sets of 80% and 20% size.
  • Find URLs in the texts and download a few image files.
  • Work on a folder with more tweets.
  • Open the Spark context Web UI (see pyspark's start-up banner).
  • Experiment with different numbers of partitions and executors.

Of course, we will do these things in Spark, without going via plain Python. (A rough sketch of the first few items is shown below.)
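
As a starting point, here is a minimal sketch of the first few items. It assumes the tweets are available as JSON objects, one per line, with a 'text' field; check whether the 345 sample files are plain text or JSON first, and add multiLine=True to read.json if the objects span several lines:

>>> from pyspark.sql import functions as F
>>> tweets = spark.read.json('tweet-id-text-345')            # load the tweets as json objects
>>> texts = tweets.select('text')                            # only the texts
>>> words = texts.select(F.explode(F.split('text', r'\s+')).alias('word'))
>>> hashtags = words.where(F.col('word').startswith('#'))    # all the hashtags
>>> hashtags.show(5)
>>> train, test = tweets.randomSplit([0.8, 0.2], seed=42)    # 80%/20% split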

Set up git (optional)

Log in to a git hosting service, such as github.com or UiB's own GitLab at git.app.uib.no. (This can be a hassle to set up with private and public SSH keys, but you will need it later in the course anyway.)
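
If you have not used SSH keys with a git server before, a typical setup looks something like this (an illustrative sketch; the email address is just an example, and the public key goes into the SSH-keys page of your GitLab/GitHub account settings):

$ ssh-keygen -t ed25519 -C "your.name@uib.no"
$ cat ~/.ssh/id_ed25519.pub      # copy this into the git server's SSH key settings
$ ssh -T git@git.app.uib.no      # test the connection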

Create a new project 'info319-exercises' (it is practical to use the same name as your folder). Copy the SSH address, such as 'git@git.app.uib.no:yourname/info319-exercises.git'.

Go back to your exercises folder. Create a file '.gitignore' with at least this line: '/venv/'.

$ echo "/venv/" > .gitignore

You can now push your project to the git repository:

$ cd info319-exercises
$ git remote add origin git@git.app.uib.no:yourname/info319-exercises.git
$ git branch -M main
$ git push -uf origin main

(The push will be sparse since we haven't written any Spark program yet.)

Running Spark in VS Code

TBD