Install HDFS and YARN on the cluster
Install Java
You need Java on each instance. We will use an older, stable version in case some of the tools have not been upgraded to work with more recent Java versions:
sudo apt update    # ('sudo apt upgrade' is also a good idea on a new instance)
sudo apt install -y openjdk-8-jdk-headless
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc
Explanation:
- export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64: sets an environment variable temporarily
- cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc): creates a time-stamped backup of the initialisation file ~/.bashrc
- echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc: adds the export command to the initialisation so the environment variable becomes permanent
Install Hadoop
On each instance, go to http://hadoop.apache.org and download a recent binary distribution to ~/volume. For example:
cd ~/volume
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
tar xzvf hadoop-3.3.4.tar.gz
rm hadoop-3.3.4.tar.gz
ln -fns hadoop-3.3.4 hadoop
On each instance, set the environment variable HADOOP_HOME and add Hadoop's bin and sbin folders to your PATH:
export HADOOP_HOME=/home/ubuntu/volume/hadoop
export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
echo "export HADOOP_HOME=/home/ubuntu/volume/hadoop" >> ~/.bashrc
echo "export PATH=\${PATH}:\${HADOOP_HOME}/bin:\${HADOOP_HOME}/sbin" >> ~/.bashrc
If something goes wrong with a file update, install a text editor to fix it, for example with sudo apt install -y emacs or sudo apt install -y nano.
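If an edit leaves ~/.bashrc broken, you can also restore one of the time-stamped backups created above; for example (the timestamp below is only an illustration, use one of your own backups):
ls ~/.bashrc-bkp-*                                    # list available backups
cp ~/.bashrc-bkp-2024-01-01T00:00+00:00 ~/.bashrc     # example timestamp only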
On each instance, set additional environment variables for Hadoop:
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF=${HADOOP_HOME}/etc/hadoop
export HADOOP_CONF_DIR=${HADOOP_CONF}
export LD_LIBRARY_PATH=${HADOOP_HOME}/lib/native:${LD_LIBRARY_PATH}
export HDFS_DATA=${HADOOP_HOME}/hdfs/data
cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
echo "export HADOOP_MAPRED_HOME=\${HADOOP_HOME}" >> ~/.bashrc
echo "export HADOOP_COMMON_HOME=\${HADOOP_HOME}" >> ~/.bashrc
echo "export HADOOP_HDFS_HOME=\${HADOOP_HOME}" >> ~/.bashrc
echo "export YARN_HOME=\${HADOOP_HOME}" >> ~/.bashrc
echo "export HADOOP_CONF=\${HADOOP_HOME}/etc/hadoop" >> ~/.bashrc
echo "export HADOOP_CONF_DIR=\${HADOOP_CONF}" >> ~/.bashrc
echo "export LD_LIBRARY_PATH=\${HADOOP_HOME}/lib/native:\${LD_LIBRARY_PATH}" >> ~/.bashrc
echo "export HDFS_DATA=\${HADOOP_HOME}/hdfs/data" >> ~/.bashrc
mkdir -p ${HDFS_DATA}    # this is where HDFS will store its data
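As a quick check that Hadoop itself is in place and on the PATH, you can run (optional):
hadoop version           # should report Hadoop 3.3.4
ls ${HADOOP_CONF_DIR}    # should list core-site.xml, hdfs-site.xml, yarn-site.xml and friends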
Configure Hadoop
On each instance, extract further environment variables from /etc/hosts:
export MASTER_NODE=$(grep spark-driver /etc/hosts | cut -d" " -f1)
export HADOOP_NAMENODE=${MASTER_NODE}
export NUM_WORKERS=$(grep spark- /etc/hosts | wc -l | cut -d" " -f1)
cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
echo "export MASTER_NODE=\$(grep spark-driver /etc/hosts | cut -d\" \" -f1)" >> ~/.bashrc
echo "export HADOOP_NAMENODE=\${MASTER_NODE}" >> ~/.bashrc
echo "export NUM_WORKERS=\$(grep spark- /etc/hosts | wc -l | cut -d\" \" -f1)" >> ~/.bashrc
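It is worth checking that the values were extracted as expected, for example:
echo ${MASTER_NODE}     # should print the internal IPv4 address of spark-driver
echo ${NUM_WORKERS}     # should print the number of spark-* entries in /etc/hosts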
Create these four files on your local computer:
File "core-site.xml": <configuration> <property> <name>fs.defaultFS</name> <value>hdfs://${HADOOP_NAMENODE}:9000</value> </property> </configuration> File "hdfs-site.xml": <configuration> <property> <name>dfs.replication</name> <value>${NUM_WORKERS}</value> </property> <property> <name>dfs.namenode.name.dir</name> <value>file://${HDFS_DATA}</value> </property> <property> <name>dfs.datanode.data.dir</name> <value>file://${HDFS_DATA}</value> </property> </configuration> File "yarn-site.xml": <configuration> <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property> <property> <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name> <value>org.apache.hadoop.mapred.ShuffleHandler</value> </property> <property> <name>yarn.resourcemanager.hostname</name> <value>${HADOOP_NAMENODE}</value> </property> </configuration> File mapred-site.xml: <configuration> <property> <name>mapreduce.jobtracker.address</name> <value>${HADOOP_NAMENODE}</value> </property> <property> <name>mapreduce.framework.name</name> <value>yarn</value> </property> </configuration>
On your local machine:
scp {core,hdfs,yarn}-site.xml spark-driver:
scp {core,hdfs,yarn}-site.xml spark-worker-1:
...
scp mapred-site.xml spark-driver:    # mapred-site.xml is only needed on the driver
On each instance, use the envsubst command to substitute variable names like ${NUM_WORKERS} in the XML files with their values from the local environment:
for name in core hdfs yarn mapred; do cp ${HADOOP_CONF}/${name}-site.xml ${HADOOP_CONF}/${name}-site.xml.original; done
cd ~
for name in *-site.xml; do envsubst < ${name} > ${HADOOP_CONF}/${name}; done
rm ~/*-site.xml
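To verify that the substitution worked, you can grep for one of the substituted properties; for example:
grep -A1 fs.defaultFS ${HADOOP_CONF}/core-site.xml    # the <value> line should now contain the namenode's IP address, not ${HADOOP_NAMENODE}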
On each instance, define the master and worker IPv4 addresses, and do a few other preparations:
grep spark-driver /etc/hosts | cut -d" " -f2 > ${HADOOP_CONF}/masters
grep spark- /etc/hosts | cut -d" " -f2 > ${HADOOP_CONF}/workers
echo "JAVA_HOME=$JAVA_HOME" >> $HADOOP_CONF/hadoop-env.sh
mkdir -p -m 0777 $HADOOP_HOME/logs
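You can inspect the generated files if you like:
cat ${HADOOP_CONF}/masters    # should contain spark-driver
cat ${HADOOP_CONF}/workers    # should contain all spark-* hostnames, one per line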
Run HDFS on the cluster
On spark-driver:
hdfs namenode -format    # answer 'Y' if you are asked about reformatting
start-dfs.sh
If you run into trouble, you can look in ~/volume/hadoop/logs/ for clues. The tail command is your friend. The ls -lt command lists files sorted by modification time.
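For example, to see the most recently written log files and inspect the end of one of them (the exact file names depend on your user and host names):
cd ${HADOOP_HOME}/logs
ls -lt | head                              # most recently modified files first
tail -n 50 hadoop-ubuntu-namenode-*.log    # last 50 lines of the namenode log; adjust the name to what ls shows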
On any instance, test that HDFS is working with
hdfs dfs -mkdir -p /user/ubuntu/raw-data
hdfs dfs -ls /user/ubuntu/    # on any machine in the spark-* cluster
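You can also ask the namenode for a cluster report to see that the datanodes have registered (the exact output format may vary between Hadoop versions):
hdfs dfsadmin -report | grep -i "live datanodes"    # should report one live datanode per worker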
From your local machine (this file is available at mitt.uib.no):
scp head-100000-latest-all.json spark-ANY-INSTANCE:
On spark-ANY-INSTANCE:
hdfs dfs -put ~/head-100000-latest-all.json raw-data
From another instance, use hdfs dfs -ls ... to check that the file is there.
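For example, assuming the default HDFS home directory /user/ubuntu:
hdfs dfs -ls raw-data    # should list head-100000-latest-all.json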
On spark-driver, stop HDFS again:
stop-dfs.sh
Run HDFS and YARN on the cluster
You are now ready to start YARN and HDFS together. On spark-driver, run:
start-all.sh
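To see which daemons are running on a node, you can use the jps tool that ships with the JDK (a rough check; the exact list depends on the node's role):
jps    # spark-driver should show at least NameNode and ResourceManager; worker nodes should show DataNode and NodeManager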
When you want to stop YARN and HDFS, use:
stop-all.sh
Web UIs
If you have opened the relevant ports (9870 and 8088) on spark-driver, you can now access
- HDFS' web UI at http://158.39.77.125:9870
- YARN's web UI at http://158.39.77.125:8088
(assuming that 158.39.77.125 is the IPv4 address of spark-driver).