Apache Spark: Difference between revisions

From info319
Revision as of 12:24, 31 August 2018

Apache Spark

Purpose

  • Getting up and running with Apache Spark
  • Getting experience with non-trivial installation
  • Using IntelliJ IDEA.
  • Writing and running your own first Spark program

For a general introduction, see the slides for Session 2 on Apache Spark. Here is a useful tutorial: https://www.tutorialspoint.com/spark_sql/spark_introduction.htm . Configuring the Spark dependency in IntelliJ IDEA: http://spark.apache.org/docs/latest/rdd-programming-guide.html

Preparations

As with Hadoop, you will run Spark standalone on your own computer (and independently of your previous Hadoop installation, to keep things simple). Running Spark on a cluster of many computers is harder to set up (and you need a cluster of computers), but once that is done, writing and running code is the same. Installing Spark standalone on a cluster: http://spark.apache.org/docs/latest/spark-standalone.html

Follow these preparations to install Spark on your Linux or Windows machine. MacOS runs BSD Unix under the hood, so if you are on a Mac, most Linux commands should work in a Terminal window too.

Downloading

Create a Spark folder on your computer, preferably next to your Hadoop folder, if you have one.

  • Linux: Anywhere should do. I have created a root folder called /opt and made myself its owner, which gives me full permission:
   sudo mkdir -p /opt
   sudo chown "$USER" /opt
  • Windows has limits on file path lengths and some Linux programs do not like spaces in paths. I created a root folder called C:\Programs and gave myself full rights to it (which must be done as Administrator).

Download an Apache Spark archive from:

   https://spark.apache.org/downloads.html

for example this one:

   https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz

We will not need any source code archive.

Unpacking

Unpack the archive into your Spark installation folder, which should be a sub-folder of the one you just created:

  • Windows: I unpacked the archive into C:\Programs\spark-2.2.0-bin-hadoop2.7.
  • Linux: Copy the spark-2.2.0-bin-hadoop2.7.tgz file into your new folder (e.g., /opt), and unpack it there, which creates, e.g., /opt/spark-2.2.0-bin-hadoop2.7:
   cd /opt
   tar zxf spark-2.2.0-bin-hadoop2.7.tgz
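
The Linux unpacking steps above can be collected into a small shell function. This is only a sketch: the function name unpack_archive and its two arguments are my own invention, not part of Spark, and it assumes a gzip-compressed tar archive such as the one downloaded above.

```shell
# unpack_archive ARCHIVE DEST
# Hypothetical helper (not part of Spark): unpacks a gzip-compressed
# tar archive into DEST, creating DEST first if it does not exist.
unpack_archive() {
    archive=$1
    dest=$2
    if [ ! -f "$archive" ]; then
        echo "No such archive: $archive" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    # -C makes tar change into DEST before extracting
    tar -xzf "$archive" -C "$dest"
}
```

For example, if the archive was saved in ~/Downloads (an assumption about your setup), unpack_archive ~/Downloads/spark-2.2.0-bin-hadoop2.7.tgz /opt would leave the installation in /opt/spark-2.2.0-bin-hadoop2.7.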

On Windows you may need two additional executable files: hadoop.dll and winutils.exe (for an explanation see https://wiki.apache.org/hadoop/WindowsProblems). Maybe they are already on your PATH because you installed them with Hadoop earlier.

Otherwise, you need to download them. Downloading executables is always risky, so continue at your own peril. I downloaded them from here: https://github.com/steveloughran/winutils/tree/master/hadoop-2.8.1 and put them in the .../bin subfolder of my Spark installation folder (i.e., under C:\Programs\spark-2.2.0-bin-hadoop2.7\bin).

(To be checked: I am not sure Spark still needs hadoop.dll. Also, there are both 32- and 64-bit versions of winutils.exe, according to https://hernandezpaul.wordpress.com/2016/01/24/apache-spark-installation-on-windows-10/ .)

Java

You need Java and a Java Development Kit (JDK). I have used a recent version of Java 8. To check whether you have a JDK and which version it is, do:

  • Linux:
   which javac
   javac -version
  • Windows: In a Command Prompt window, do
   javac -version

To install a recent Java 8:

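  • Linux:
   sudo apt install openjdk-8-jdk
  • Windows: Download an installer from http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html . It is best to install Java, too, into a folder with no space in its name, like C:\Programs\Java\jdk1.8.0_121.
  • MacOS: Use an online tutorial for this. (The above link has installers for MacOS too.)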

In a console (or command prompt, or terminal) window, check that it works:

   javac -version
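
On Linux and MacOS, the "is it installed?" check can be wrapped in a small function that works for any of the tools on this page. This is a sketch only: check_tool is a made-up name, and it relies on the standard shell builtin command -v, which does much the same job as which.

```shell
# check_tool NAME
# Hypothetical helper: reports whether NAME can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}
```

Typical use would be check_tool javac && javac -version, and likewise for scala and spark-shell later on.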

Scala

Scala is another programming language that runs on Java Virtual Machines (and thus can build on many of Java's APIs). It adds functional programming on top of a Java-like syntax. (Version 8 of Java has since added functional programming too, but spark-shell, which we will use later, remains Scala-based.)

To check if you have Scala and which version it is, do:

  • Linux:
   which scala 
   scala -version
  • Windows: In a Command Prompt window, do
   scala -version

To install a recent Scala:

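  • Linux:
   sudo apt install scala
  • Windows: Download Scala from http://www.scala-lang.org/download/ , but skip point 2 and go down to Other ways to download Scala. I used this link:
   https://downloads.lightbend.com/scala/2.12.3/scala-2.12.3.tgz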

Again, it is best to install Scala into a folder with no space in its name, like C:\Programs\scala-2.12.3.

In a console (or command prompt, or terminal) window, check that it works:

   scala -version

Environment variables

You need to add the Scala binaries folder to your PATH. The nicest way is to go via a SCALA_HOME environment variable. To see if SCALA_HOME is set:

  • Linux: echo $SCALA_HOME
  • Windows: echo %SCALA_HOME%

If it is set correctly, the SCALA_HOME folder will have a bin/ subfolder containing files called scala, scalac, and so on. If it is not set, you need to find out where Scala has been installed to:

  • Linux: Check /usr/share/scala.
  • Windows: Check C:\Programs or C:\Program Files for a folder called scala-something....

To set SCALA_HOME:

  • Linux: Add this line to your ~/.bashrc file:
   export SCALA_HOME=/path/to/your/scala/installation/folder
  • Windows: Here it is hidden away. On Windows 10, in the Start menu, open Settings (the cog wheel), go to System -> About -> System info -> Advanced system settings -> Environment Variables. Here you can add and edit environment variables.

You need to do the same thing for SPARK_HOME. It is good practice to always set environment variables like JAVA_HOME, SCALA_HOME, HADOOP_HOME, SPARK_HOME, etc. even when you do not need them immediately: other well-behaved packages you install later may be able to use them if they are set, thus saving you time and avoiding errors. Each such variable should point to an installation folder with a bin folder inside it, but not to the inner bin-folder itself.
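
On Linux and MacOS, the rule that each such variable must point to the installation folder, and not to its inner bin folder, can be checked with a few lines of shell. This is a minimal sketch: check_home is a hypothetical name of my own, and the only test it makes is that a bin subfolder exists.

```shell
# check_home VAR
# Hypothetical helper: succeeds if the environment variable named VAR
# is set and points to a folder that has a bin subfolder.
check_home() {
    name=$1
    # Look up the variable whose name is stored in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "$name is not set" >&2
        return 1
    fi
    if [ ! -d "$value/bin" ]; then
        echo "$name=$value has no bin subfolder" >&2
        return 1
    fi
    echo "$name=$value looks fine"
}
```

Running check_home SPARK_HOME fails both when the variable is unset and when it mistakenly points to the bin folder itself, since that folder normally has no bin folder of its own.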

On Windows, remember that some Linux programs do not like spaces in paths. See the Hadoop preparations for a way around this problem if you run into it.

Modifying your PATH

You need to change PATH to include SCALA_HOME/bin and SPARK_HOME/bin.

  • Linux: Add this line to the end of ~/.bashrc:
   export PATH=$SCALA_HOME/bin:$SPARK_HOME/bin:$PATH
  • Windows: You must go into the Environment variables tool again and edit PATH. You can use variable expressions such as %SCALA_HOME%\bin and %SPARK_HOME%\bin as the new PATH entries.
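
One drawback of the Linux export PATH line is that it prepends the same folders again every time ~/.bashrc is sourced. A possible way around this is sketched here, assuming a helper called path_prepend (my own name, not a standard command) that only adds a folder if PATH does not already contain it.

```shell
# path_prepend DIR
# Hypothetical helper: puts DIR at the front of PATH unless PATH
# already contains it, so re-sourcing ~/.bashrc adds no duplicates.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present: do nothing
        *) PATH="$1:$PATH" ;;      # otherwise put DIR first
    esac
}
```

In ~/.bashrc you could then write path_prepend "$SPARK_HOME/bin", path_prepend "$SCALA_HOME/bin" and finally export PATH. The folder added last ends up first in PATH.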

To put the new environment variables into effect:

  • Linux:
   source ~/.bashrc
  • Windows: Close the Command Prompt window and open a new one.

Finally, on Windows you now need to run these commands:

   winutils chmod 777 /tmp
   winutils chmod 777 /tmp/hive

(You may have to run the Command Prompt window as Administrator to do this.)

Running the Spark shell

Go to your home folder (you do not need to run Spark from its installation folder) and check that it works. On Windows, use cd %USERPROFILE% instead of cd ~:

   cd ~
   spark-shell

You will get a lot of warnings, because we have not tailored Spark properly, but we will ignore them for now. In the end you should see a Welcome to Spark banner with some version information and a spark-shell command prompt:

   scala>

Type :quit or use Ctrl-D to terminate the spark-shell (the latter is the standard way to kill a Linux shell).

You are now ready to get started with Apache Spark.


Tasks