Install HDFS and YARN on the cluster
Install Java
You need Java on each instance. We will use an older, stable version because some of the tools may not yet have been upgraded to work with more recent Java versions:
sudo apt install -y openjdk-8-jdk-headless
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc
Explanation:
- export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64: sets an environment variable temporarily
- cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc): creates a time-stamped backup of the initialisation file ~/.bashrc
- echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc: adds the export command to the initialisaiton so the environment variable becomes permanent
Install Hadoop
On each instance, go to http://hadoop.apache.org and download a recent binary distribution to ~/volume. For example:
cd ~/volume
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
tar xzvf hadoop-3.3.4.tar.gz
rm hadoop-3.3.4.tar.gz
ln -fns hadoop-3.3.4 hadoop
On each instance, set the environment variable HADOOP_HOME and add Hadoop's bin folder to your PATH:
export HADOOP_HOME=/home/ubuntu/volume/hadoop-3.3.4
export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
echo "export HADOOP_HOME=/home/ubuntu/volume/hadoop-3.3.4" >> ~/.bashrc
echo "export PATH=\${PATH}:\${HADOOP_HOME}/bin:\${HADOOP_HOME}/sbin" >> ~/.bashrc
If something goes wrong with a file update, install a text editor to fix it, for example with sudo apt install emacs or sudo apt install nano.
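A quick way to check that the download and the PATH update worked is to ask Hadoop for its version, for example:
source ~/.bashrc    # pick up the new HADOOP_HOME and PATH
hadoop version      # should report Hadoop 3.3.4
which hdfs          # should resolve to a path under ~/volume/hadoop-3.3.4/bin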
On each instance, set additional environment variables for Hadoop:
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF=${HADOOP_HOME}/etc/hadoop
export HADOOP_CONF_DIR=${HADOOP_CONF}
export LD_LIBRARY_PATH=${HADOOP_HOME}/lib/native:${LD_LIBRARY_PATH}
export HDFS_DATA=${HADOOP_HOME}/hdfs/data
cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
echo "export HADOOP_MAPRED_HOME=\${HADOOP_HOME}" >> ~/.bashrc
echo "export HADOOP_COMMON_HOME=\${HADOOP_HOME}" >> ~/.bashrc
echo "export HADOOP_HDFS_HOME=\${HADOOP_HOME}" >> ~/.bashrc
echo "export YARN_HOME=\${HADOOP_HOME}" >> ~/.bashrc
echo "export HADOOP_CONF=\${HADOOP_HOME}/etc/hadoop" >> ~/.bashrc
echo "export HADOOP_CONF_DIR=\${HADOOP_CONF}" >> ~/.bashrc
echo "export LD_LIBRARY_PATH=\${HADOOP_HOME}/lib/native:\${LD_LIBRARY_PATH}" >> ~/.bashrc
echo "export HDFS_DATA=\${HADOOP_HOME}/hdfs/data" >> ~/.bashrc
mkdir -p ${HDFS_DATA} # this is where HDFS will store its data
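Optionally, you can confirm that the new variables point where you expect and that the data directory exists, for example:
echo ${HADOOP_CONF_DIR}    # should print .../hadoop-3.3.4/etc/hadoop
ls ${HADOOP_CONF_DIR}      # should list core-site.xml, hdfs-site.xml, yarn-site.xml, ...
ls -ld ${HDFS_DATA}        # the HDFS data directory created above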
On each instance, extract further environment variables from /etc/hosts:
export MASTER_NODE=$(grep spark-driver /etc/hosts | cut -d" " -f1)
export HADOOP_NAMENODE=${MASTER_NODE}
export NUM_WORKERS=$(grep spark- /etc/hosts | wc -l | cut -d" " -f1)
cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
echo "export MASTER_NODE=\$(grep spark-driver /etc/hosts | cut -d\" \" -f1)" >> ~/.bashrc
echo "export HADOOP_NAMENODE=\${MASTER_NODE}" >> ~/.bashrc
echo "export NUM_WORKERS=\$(grep spark- /etc/hosts | wc -l | cut -d\" \" -f1)" >> ~/.bashrc
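It is worth checking that the values extracted from /etc/hosts look sensible before they are baked into the configuration files, for example:
echo ${MASTER_NODE}       # the IPv4 address of spark-driver
echo ${HADOOP_NAMENODE}   # should be identical to MASTER_NODE
echo ${NUM_WORKERS}       # the number of spark-* entries in /etc/hosts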
Create these four files on your local computer:
File "core-site.xml": <configuration> <property> <name>fs.defaultFS</name> <value>hdfs://${HADOOP_NAMENODE}:9000</value> </property> </configuration> File "hdfs-site.xml": <configuration> <property> <name>dfs.replication</name> <value>${NUM_WORKERS}</value> </property> <property> <name>dfs.namenode.name.dir</name> <value>file://${HDFS_DATA}</value> </property> <property> <name>dfs.datanode.data.dir</name> <value>file://${HDFS_DATA}</value> </property> </configuration> File "yarn-site.xml": <configuration> <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property> <property> <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name> <value>org.apache.hadoop.mapred.ShuffleHandler</value> </property> <property> <name>yarn.resourcemanager.hostname</name> <value>${HADOOP_NAMENODE}</value> </property> </configuration> File mapred-site.xml: <configuration> <property> <name>mapreduce.jobtracker.address</name> <value>${HADOOP_NAMENODE}</value> </property> <property> <name>mapreduce.framework.name</name> <value>yarn</value> </property> </configuration>
On your local machine:
scp {core,hdfs,yarn}-site.xml spark-driver:
scp {core,hdfs,yarn}-site.xml spark-worker-1:
...
scp mapred-site.xml spark-driver:   # it is only needed on master
On each instance:
for name in core hdfs yarn mapred; do
    cp ${HADOOP_CONF}/${name}-site.xml ${HADOOP_CONF}/${name}-site.xml.original
done
cd ~
for name in *-site.xml; do
    envsubst < ${name} > ${HADOOP_CONF}/${name}
done
rm ~/*-site.xml
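envsubst replaces the ${...} placeholders in the uploaded files with the values of the corresponding environment variables, so after this step the copies under ${HADOOP_CONF} should contain concrete addresses and paths rather than placeholders. You can spot-check one of them, for example:
grep -A1 fs.defaultFS ${HADOOP_CONF}/core-site.xml   # the <value> line should show hdfs://<namenode-ip>:9000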
On each instance, define the master and worker IPv4 addresses, and do a few other preparations:
grep spark-driver /etc/hosts | cut -d" " -f2 > ${HADOOP_CONF}/masters
grep spark- /etc/hosts | cut -d" " -f2 > ${HADOOP_CONF}/workers
echo "JAVA_HOME=$JAVA_HOME" >> $HADOOP_CONF/hadoop-env.sh
mkdir -p -m 0777 $HADOOP_HOME/logs
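To double-check the cluster membership files, you can print them, for example:
cat ${HADOOP_CONF}/masters   # should contain only spark-driver
cat ${HADOOP_CONF}/workers   # should list every spark-* host, one per line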
Run HDFS on the cluster
On spark-driver:
hdfs namenode -format   # answer 'Y' to reformatting
start-dfs.sh
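To confirm that HDFS came up, you can list the running Java daemons and ask the namenode for a cluster report, for example:
jps                     # on spark-driver: NameNode, SecondaryNameNode and DataNode (spark-driver is also in the workers file); on the other instances: DataNode
hdfs dfsadmin -report   # should show one live datanode per entry in the workers file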
On any instance, test it with
hdfs dfs -mkdir -p /user/ubuntu/raw-data
hdfs dfs -ls /user/ubuntu/   # on any machine in the spark-* cluster
From your local machine (this file is available at mitt.uib.no):
scp head-100000-latest-all.json spark-driver:
On any instance:
hdfs dfs -put ~/head-100000-latest-all.json raw-data
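You can confirm that the upload succeeded, for example:
hdfs dfs -ls raw-data   # should list head-100000-latest-all.json with a non-zero size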
Stop HDFS again:
stop-dfs.sh
Run YARN and HDFS on the cluster
You are now ready to start YARN and HDFS together. On spark-driver run:
start-all.sh
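To check that YARN started as well, you can list the node managers that registered with the resource manager, for example:
yarn node -list   # should show one RUNNING node manager per entry in the workers file
jps               # on spark-driver: ResourceManager and NodeManager in addition to the HDFS daemons
The ResourceManager also serves a web UI, by default on port 8088, if the instance's firewall rules allow access to it.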
When you want to stop YARN and HDFS, use:
stop-all.sh