Hadoop preparations

From info319

Downloading

Create a Hadoop folder on your computer.

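  • Linux: Anywhere should do, but keep it out of the folders Linux already uses for special purposes. I have created a root folder called /opt and made myself its owner, which gives me full permission (chmod u+rwx alone would only change the permissions of the owner, which is root after sudo mkdir):
   sudo mkdir -p /opt
   sudo chown $USER /opt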
  • Windows: Windows has limits on file path lengths and Hadoop does not like spaces in paths, so C:\Program Files will not work. I created a root folder called C:\Programs and gave myself full rights to it (this must be done as Administrator: open File Explorer, go to the root folder C:\, right-click the Programs folder, choose Properties, and give your user Full control on the Security tab).

Download an Apache Hadoop archive from:

   http://hadoop.apache.org/releases.html

for example this one (I tried a 3.0.x beta version first, but I ran into trouble):

   http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz

We will not need the source code archive, but it can be useful to look at some of the code examples, so download it if you want.
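
Apache publishes checksums alongside its release archives, so it can be worth checking the download before unpacking it. The following is a minimal sketch for Linux, assuming that the sha256sum tool from GNU coreutils is installed and that a SHA-256 checksum is available for your release. verify_sha256 is a hypothetical helper name, and the expected value has to be copied from the checksum file published for the archive you downloaded:

```shell
# Hypothetical helper: succeed only if the file's SHA-256 digest matches the expected value.
# Usage: verify_sha256 <file> <expected-hex-digest>
verify_sha256() {
    file=$1
    # Published digests are sometimes written in upper case; compare in lower case.
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    # sha256sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(sha256sum "$file" | cut -d' ' -f1)
    if [ -n "$actual" ] && [ "$actual" = "$expected" ]; then
        echo "Checksum OK: $file"
    else
        echo "Checksum MISMATCH: $file" >&2
        return 1
    fi
}
```

You would call it as verify_sha256 hadoop-2.8.1.tar.gz followed by the published digest, and only unpack the archive if it reports OK.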

Unpacking

Unpack the archive into your Hadoop installation folder, which should be a sub-folder of the one you just created:

  • Windows: I unpacked the archive into C:\Programs\hadoop-2.8.1.
  • Linux: Copy the hadoop-2.8.1.tar.gz file into your new folder (e.g., /opt), and unpack it there. This creates, e.g., /opt/hadoop-2.8.1:
   cd /opt
   tar zxf hadoop-2.8.1.tar.gz

On Windows you will need two additional executable files: hadoop.dll and winutils.exe (for an explanation see https://wiki.apache.org/hadoop/WindowsProblems). Downloading executables is always risky, so continue at your own peril. I downloaded them from here: https://github.com/steveloughran/winutils/tree/master/hadoop-2.8.1 and put them in the bin subfolder of my Hadoop installation folder (i.e., in C:\Programs\hadoop-2.8.1\bin).

(Apparently, there are both 32- and 64-bit versions of winutils.exe, according to https://hernandezpaul.wordpress.com/2016/01/24/apache-spark-installation-on-windows-10/ . The file I have linked above works on x64.)

Linux tip: Over time, you may make many installations into your /opt folder, sometimes installing different versions of the same package (hadoop-2.8.1, hadoop-2.8.2, hadoop-2.8.3, etc.). To make this easier to manage, you can create an /opt/latest folder with links to the most recent version of each package:

   mkdir /opt/latest
   ln -s /opt/hadoop-2.8.3 /opt/latest/hadoop

Now you can use /opt/latest/hadoop in all your scripts and environment variables, so that upgrading to new versions becomes faster (and the names become shorter too).
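
When you later upgrade, the link has to be moved to the new version. A plain ln -s fails if the link already exists, so a small helper can be handy. This is only a sketch: link_latest is a hypothetical name, and it relies on the -n option of ln, which both GNU and BSD ln provide:

```shell
# Hypothetical helper: point a "latest" link at a given versioned installation folder.
# Usage: link_latest <versioned-folder> <link-path>
link_latest() {
    target=$1
    link=$2
    if [ ! -d "$target" ]; then
        echo "No such folder: $target" >&2
        return 1
    fi
    # Make sure the folder that holds the link exists (e.g., /opt/latest).
    mkdir -p "$(dirname "$link")"
    # -s: symbolic link, -f: replace an existing link,
    # -n: treat an existing link to a folder as a link, not as a folder to put the new link into
    ln -sfn "$target" "$link"
}
```

For example, link_latest /opt/hadoop-2.8.3 /opt/latest/hadoop moves the link to version 2.8.3 regardless of where it pointed before.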

Java

You need Java and a Java Development Kit (JDK). I have used a recent version of Java 8. To check if you have a JDK and which version it is, do:

  • Linux:
   which javac
   javac -version
  • Windows: In a Command Prompt window, do
   javac -version

To install a recent Java 8 Development Kit:

  • Linux:
   sudo apt install openjdk-8-jdk

In a console (or command prompt, or terminal) window, check that it works:

   javac -version
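
If you want a script to check the version for you, the output of javac -version can be reduced to a major version number. The sketch below assumes output of the form "javac 1.8.0_121" (Java 8 and earlier) or "javac 11.0.2" (later versions); java_major is a hypothetical helper name:

```shell
# Hypothetical helper: extract the major Java version from a "javac -version" line.
# Usage: java_major "javac 1.8.0_121"
java_major() {
    v=${1#javac }                  # drop the leading "javac "
    case "$v" in
        1.*) v=${v#1.} ;;          # old numbering: 1.8.0_121 means Java 8
    esac
    echo "${v%%.*}"                # keep everything before the first dot
}
```

On Java 8, javac -version writes to the error stream rather than to standard output, so a call would look like this:

   java_major "$(javac -version 2>&1)"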

Environment variables

You need to have three environment variables correctly set: JAVA_HOME, PATH, and HADOOP_CLASSPATH. To see if JAVA_HOME is set:

  • Linux: echo $JAVA_HOME
  • Windows: echo %JAVA_HOME%

If it is set correctly, the JAVA_HOME folder will have a lib/ subfolder that contains a file called tools.jar. The JAVA_HOME path must not contain the substring /jre anywhere (although the JAVA_HOME folder itself contains a subfolder called jre). If JAVA_HOME is not set, you need to find out where the JDK has been installed:

  • Linux: Check /usr/java/default or /usr/lib/jvm.
  • Windows: Check C:\Programs-or-Program Files\jdk-or-java-something....
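
On Linux, if javac is already on your PATH, you can also follow its symbolic links back to the JDK folder. This is a sketch that assumes GNU readlink (for the -f option) and the usual layout where javac lives in the bin subfolder of the JDK; jdk_home_of is a hypothetical helper name:

```shell
# Hypothetical helper: print the JDK folder that a given javac executable belongs to.
# Usage: jdk_home_of <path-to-javac>
jdk_home_of() {
    # Resolve chains of links such as /usr/bin/javac -> /etc/alternatives/javac -> ...
    real=$(readlink -f "$1") || return 1
    # The real file is <jdk>/bin/javac, so go two levels up.
    dirname "$(dirname "$real")"
}
```

A typical call is:

   jdk_home_of "$(command -v javac)"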

To set JAVA_HOME:

  • Linux: Add this line to your ~/.bashrc file:
   export JAVA_HOME=/path/to/your/jdk/installation/folder
  • Windows: Here it is hidden away. On Windows 10, in the Start menu, open Settings (the cog wheel), go to System -> About -> System info -> Advanced system settings -> Environment Variables. Here you can add and edit environment variables.
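
On Linux, the two conditions for a correct JAVA_HOME (a lib/tools.jar file inside it, and no /jre in the path) can be checked with a few lines of shell. This is a minimal sketch for a Java 8 JDK; check_java_home is a hypothetical helper name:

```shell
# Hypothetical helper: check that a folder looks like a usable JAVA_HOME for a Java 8 JDK.
# Usage: check_java_home "$JAVA_HOME"
check_java_home() {
    home=$1
    if [ -z "$home" ]; then
        echo "JAVA_HOME is empty" >&2
        return 1
    fi
    case "$home" in
        */jre*)
            echo "JAVA_HOME must not contain /jre: $home" >&2
            return 1 ;;
    esac
    if [ ! -f "$home/lib/tools.jar" ]; then
        echo "No lib/tools.jar in $home" >&2
        return 1
    fi
    echo "JAVA_HOME looks fine: $home"
}
```

Run check_java_home "$JAVA_HOME" in a new terminal window after editing ~/.bashrc.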

On Windows there is one more problem: Because Hadoop does not like spaces in paths, you should not set JAVA_HOME to, for example, C:\Program Files\jdk1.8.0_121. Instead, Windows offers an alternative way to write paths like this: C:\Progra~1\jdk1.8.0_121. Here, Progra is the first six characters of the name of your folder, and the 1 is used to distinguish between them if you have several folders in C:\ that start with Progra.

Although you do not strictly need it, you may also want to set HADOOP_HOME in the same way, pointing to your Hadoop installation folder. It is good practice to always set environment variables like JAVA_HOME, HADOOP_HOME, etc., even when you do not need them immediately: other well-behaved packages you install later may be able to use them if they are set, thus saving you time and avoiding errors. Each such variable should point to an installation folder with a bin folder inside it, not to the inner bin folder itself.

Modifying your PATH

You now need to change PATH to include JAVA_HOME/bin, and possibly also HADOOP_HOME/bin.

  • Linux: Add this line to the end of ~/.bashrc:
   export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
  • Windows: You must go into the Environment variables tool again and edit PATH.
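
One side effect of the Linux line above is that every time ~/.bashrc is read again in the same shell, the same folders are prepended to PATH once more. That is harmless, but PATH grows. If it bothers you, a guarded version is possible. This is only a sketch, and path_prepend is a hypothetical helper name:

```shell
# Hypothetical helper: put a folder first in PATH unless PATH already lists it.
# Usage: path_prepend <folder>
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already listed, leave PATH alone
        *) PATH="$1:$PATH" ;;
    esac
}
```

In ~/.bashrc you could then write the following instead of the single export line:

   path_prepend "$HADOOP_HOME/bin"
   path_prepend "$JAVA_HOME/bin"
   export PATH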

Finally, you need to set HADOOP_CLASSPATH:

  • Linux: Add this line to the end of ~/.bashrc:
   export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
  • Windows: You must go into the Environment variables tool and create a new HADOOP_CLASSPATH variable.

To put the new environment variables into effect:

  • Linux:
   source ~/.bashrc
  • Windows: Close the Command Prompt window and open a new one.

If things are right, the first of these two commands should print the path to tools.jar, and the second should list that same file without errors:

  • Linux:
   echo $HADOOP_CLASSPATH
   ls $HADOOP_CLASSPATH
  • Windows:
   echo %HADOOP_CLASSPATH%
   dir %HADOOP_CLASSPATH%
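
If you later add more entries to HADOOP_CLASSPATH on Linux (entries are separated by colons there), ls will no longer work on the whole value. The sketch below checks each entry separately; check_classpath is a hypothetical helper name, and wildcard entries such as somefolder/* are not handled:

```shell
# Hypothetical helper: check that every colon-separated classpath entry exists.
# Usage: check_classpath "$HADOOP_CLASSPATH"
# The body runs in a subshell, so the changes to IFS and globbing do not leak out.
check_classpath() (
    [ -n "$1" ] || { echo "Classpath is empty" >&2; exit 1; }
    IFS=:        # split the argument on colons
    set -f       # do not expand wildcard characters while splitting
    status=0
    for entry in $1; do
        if [ ! -e "$entry" ]; then
            echo "Missing classpath entry: $entry" >&2
            status=1
        fi
    done
    exit $status
)
```

check_classpath "$HADOOP_CLASSPATH" prints nothing and succeeds when every entry exists.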

Running Hadoop

Go to the Hadoop installation folder and check that it works:

  • Linux:
   cd /opt/hadoop-2.8.1
   bin/hadoop
  • Windows:
   cd C:\Programs\hadoop-2.8.1
   bin\hadoop

These commands should give you a list of possible uses of Hadoop.

(If you added HADOOP_HOME/bin to your PATH you can run hadoop and hdfs from anywhere on your computer, without the bin/.)
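
To confirm on Linux that the commands are really found through PATH, you can ask the shell where it finds them. This is a small sketch; check_command is a hypothetical helper name:

```shell
# Hypothetical helper: report where the shell finds a command, or fail if it does not.
# Usage: check_command <command-name>
check_command() {
    if location=$(command -v "$1" 2>/dev/null); then
        echo "$1 found at $location"
    else
        echo "$1 is not on PATH" >&2
        return 1
    fi
}
```

For example:

   check_command hadoop
   check_command hdfs
   hadoop version

The last command prints the version of the Hadoop installation that was found.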