Latest revision as of 15:35, 22 August 2022
Apache Spark
Purpose
- Getting up and running with Apache Spark
- Getting experience with non-trivial Linux installation
- Using VS Code (or another IDE of your choice)
- Writing and running your own first Spark program
For a general introduction, see the slides to Session 1 on Apache Spark. There is a useful tutorial at TutorialsPoint.
Preparations
In the first exercise, you will run Spark standalone on your own computers, both in a console/terminal window and in your favourite IDE (Integrated Development Environment). VS Code (Visual Studio Code) is recommended and will be used in these instructions.
The official guide to running Spark standalone on a cluster is at http://spark.apache.org/docs/latest/spark-standalone.html
Follow these preparations to install Spark on your Linux or Windows machine. If you are on MacOS, which runs BSD Unix under the hood, most Linux commands should work in a Terminal window on your Mac too.
Spark Preparations
Downloading
Create a Spark folder on your computer, preferably next to your Hadoop folder, if you have one.
- Linux: Anywhere should do. I have created a root folder called /opt and given myself full permission:
sudo mkdir /opt
sudo chmod u+rwx /opt
- Windows has limits on file path lengths and some Linux programs do not like spaces in paths. I created a root folder called C:\Programs and gave myself full rights to it (which must be done as Administrator).
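The Linux folder-creation steps above can be sketched end to end; in this sketch a scratch directory stands in for /opt, so the commands run without sudo:

```shell
# Sketch of the folder-creation steps, using a scratch directory as a
# stand-in for /opt so that no sudo is needed.
SPARK_ROOT="$(mktemp -d)/opt"   # in real use: /opt
mkdir -p "$SPARK_ROOT"          # corresponds to: sudo mkdir /opt
chmod u+rwx "$SPARK_ROOT"       # corresponds to: sudo chmod u+rwx /opt
[ -w "$SPARK_ROOT" ] && echo "writable: $SPARK_ROOT"
```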
Download an Apache Spark archive from:
https://spark.apache.org/downloads.html
for example this one:
https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
We will not need any source code archive.
Unpacking
Unpack the archive into your Spark installation folder, which should be a sub-folder of the one you just created:
- Windows: I unpacked the archive into C:\Programs\spark-2.2.0-bin-hadoop2.7.
- Linux: Copy the spark-2.2.0-bin-hadoop2.7.tgz archive into your new folder (e.g., /opt) and unpack it there, into, e.g., /opt/spark-2.2.0-bin-hadoop2.7:
cd /opt
tar zxf spark-2.2.0-bin-hadoop2.7.tgz
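The unpacking step can be tried out safely with a dummy archive standing in for the real download (the folder name spark-2.2.0-bin-hadoop2.7 mirrors the one used here; in real use you would first fetch the .tgz with wget or curl):

```shell
# Simulate the download-and-unpack flow in a temporary directory.
work=$(mktemp -d) && cd "$work"

# Build a dummy archive standing in for the downloaded .tgz file.
mkdir -p spark-2.2.0-bin-hadoop2.7/bin
touch spark-2.2.0-bin-hadoop2.7/bin/spark-shell
tar zcf spark-2.2.0-bin-hadoop2.7.tgz spark-2.2.0-bin-hadoop2.7
rm -r spark-2.2.0-bin-hadoop2.7

# The actual unpacking step from the text:
tar zxf spark-2.2.0-bin-hadoop2.7.tgz
ls spark-2.2.0-bin-hadoop2.7/bin
```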
On Windows you may need two additional executable files: hadoop.dll and winutils.exe (for an explanation see https://wiki.apache.org/hadoop/WindowsProblems). Maybe they are already on your PATH because you installed them with Hadoop earlier.
Otherwise, you need to download them. Downloading executables is always risky, so continue at your own peril. I downloaded them from here: https://github.com/steveloughran/winutils/tree/master/hadoop-2.8.1 and put them in the .../bin subfolder of my Spark installation folder (i.e., under C:\Programs\spark-2.2.0-bin-hadoop2.7\bin).
(To be checked: I am not sure Spark still needs hadoop.dll . Also, there are both 32- and 64-bit versions of winutils.exe, according to https://hernandezpaul.wordpress.com/2016/01/24/apache-spark-installation-on-windows-10/ .)
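If you do place winutils.exe under your Spark folder, Hadoop code locates it via the HADOOP_HOME environment variable (see the WindowsProblems page linked above). A sketch in Git Bash syntax, assuming the installation folder used here; adjust the path to yours, or set the variable via System Properties or setx in a Command Prompt instead:

```shell
# Hypothetical environment setup for winutils.exe (Git Bash syntax).
# Hadoop looks for winutils.exe under %HADOOP_HOME%\bin.
export HADOOP_HOME="/c/Programs/spark-2.2.0-bin-hadoop2.7"
export PATH="$HADOOP_HOME/bin:$PATH"
echo "HADOOP_HOME=$HADOOP_HOME"
```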
Two guides for installing Spark on Mac
https://medium.freecodecamp.org/installing-scala-and-apache-spark-on-mac-os-837ae57d283f
https://medium.com/luckspark/installing-spark-2-3-0-on-macos-high-sierra-276a127b8b85
The second tutorial guides you through the essential installation steps of Apache Spark 2.3.0 on macOS High Sierra (March 2018).
Java
You need Java and a Java SDK (Software Development Kit). I have used a recent version of Java 8. To check if you have a Java SDK and which version it is, do:
- Linux:
which javac
javac -version
- Windows: In a Command Prompt window, do
javac -version
To install a recent Java 8:
- Linux:
sudo apt install openjdk-8-jdk
- Windows: Download an installer from http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html . It is best to install Java too into a folder with no space in its name, like C:\Programs\Java\jdk1.8.0_121.
- MacOS: Use an online tutorial for this. (The above link has installers for Windows too.)
In a console (or command prompt, or terminal) window, check that it works:
javac -version
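If you want to check the version programmatically rather than by eye, the javac -version output can be parsed in the shell. A sketch (a sample string is used here so it runs even without a JDK installed; note the regex assumes the old 1.x numbering scheme that Java 8 uses):

```shell
# Parse the major version from `javac -version`-style output.
# In real use: ver_line=$(javac -version 2>&1)   # javac prints to stderr
ver_line="javac 1.8.0_292"                       # sample output for Java 8
major=$(echo "$ver_line" | sed -E 's/^javac 1\.([0-9]+).*/\1/')
echo "Java major version: $major"                # prints 8 for this sample
```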
Scala
Scala is another programming language that runs on the Java Virtual Machine (and thus can build on many of Java's APIs). It adds functional programming on top of a Java-like syntax. (Java 8 has since added functional programming too, but spark-shell, which we will use later, remains Scala-based.)
To check if you have Scala and which version it is, do:
- Linux:
which scala
scala -version
- Windows: In a Command Prompt window, do
scala -version
To install a recent Scala:
- Linux:
sudo apt install scala
- Windows: Download an installer from http://www.scala-lang.org/download/ , but skip point 2 and go down to Other ways to download Scala. I used this link:
https://downloads.lightbend.com/scala/2.12.3/scala-2.12.3.tgz
Again, it is best to install Scala into a folder with no space in its name, like C:\Programs\scala-2.12.3.
In a console (or command prompt, or terminal) window, check that it works:
scala -version
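With Java and Scala in place, a typical last preparation is to put Spark itself on the PATH so that spark-shell (mentioned above) can be started from anywhere. A sketch, assuming the Linux installation folder used earlier; adjust SPARK_HOME to your own folder:

```shell
# Hypothetical environment setup; adjust SPARK_HOME to your installation.
export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7
export PATH="$SPARK_HOME/bin:$PATH"
# Afterwards the Scala-based Spark shell starts with:
#   spark-shell
echo "SPARK_HOME=$SPARK_HOME"
```

To make this permanent, the two export lines usually go into ~/.bashrc or ~/.profile.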