Install HDFS and YARN on the cluster
Install Java
You need Java on each instance. We will use an older, stable version, since some of the tools may not yet have been upgraded to work with more recent Java versions:
 sudo apt update                              # ('sudo apt upgrade' is also a good idea on a new instance)
 sudo apt install -y openjdk-8-jdk-headless
 export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
 cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
 echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc
Explanation:
- export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64: sets the environment variable in the current shell session only
- cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc): creates a time-stamped backup of the initialisation file ~/.bashrc
- echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc: appends the export command to the initialisation file so the variable is also set in every new shell
Install Hadoop
On each instance, go to http://hadoop.apache.org and download a recent binary distribution to ~/volume. For example:
 cd ~/volume
 wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
 tar xzvf hadoop-3.3.4.tar.gz
 rm hadoop-3.3.4.tar.gz
 ln -fns hadoop-3.3.4 hadoop
On each instance, set the environment variable HADOOP_HOME and add Hadoop's bin folder to your PATH:
 export HADOOP_HOME=/home/ubuntu/volume/hadoop
 export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
 cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
 echo "export HADOOP_HOME=/home/ubuntu/volume/hadoop" >> ~/.bashrc
 echo "export PATH=\${PATH}:\${HADOOP_HOME}/bin:\${HADOOP_HOME}/sbin" >> ~/.bashrc
If something goes wrong with a file update, install a text editor to fix it, for example with sudo apt install emacs or sudo apt install nano.
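Before moving on, it is worth checking that the Hadoop commands are found on the PATH; for example:
 hadoop version   # the first line should mention Hadoop 3.3.4
 which hdfs       # should resolve to /home/ubuntu/volume/hadoop/bin/hdfs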
On each instance, set additional environment variables for Hadoop:
 export HADOOP_MAPRED_HOME=${HADOOP_HOME}
 export HADOOP_COMMON_HOME=${HADOOP_HOME}
 export HADOOP_HDFS_HOME=${HADOOP_HOME}
 export YARN_HOME=${HADOOP_HOME}
 export HADOOP_CONF=${HADOOP_HOME}/etc/hadoop
 export HADOOP_CONF_DIR=${HADOOP_CONF}
 export LD_LIBRARY_PATH=${HADOOP_HOME}/lib/native:${LD_LIBRARY_PATH}
 export HDFS_DATA=${HADOOP_HOME}/hdfs/data
 cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
 echo "export HADOOP_MAPRED_HOME=\${HADOOP_HOME}" >> ~/.bashrc
 echo "export HADOOP_COMMON_HOME=\${HADOOP_HOME}" >> ~/.bashrc
 echo "export HADOOP_HDFS_HOME=\${HADOOP_HOME}" >> ~/.bashrc
 echo "export YARN_HOME=\${HADOOP_HOME}" >> ~/.bashrc
 echo "export HADOOP_CONF=\${HADOOP_HOME}/etc/hadoop" >> ~/.bashrc
 echo "export HADOOP_CONF_DIR=\${HADOOP_CONF}" >> ~/.bashrc
 echo "export LD_LIBRARY_PATH=\${HADOOP_HOME}/lib/native:\${LD_LIBRARY_PATH}" >> ~/.bashrc
 echo "export HDFS_DATA=\${HADOOP_HOME}/hdfs/data" >> ~/.bashrc
 mkdir -p ${HDFS_DATA}   # this is where HDFS will store its data
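As a sanity check that the variables will also be picked up by future shells, you can source the initialisation file and list them:
 source ~/.bashrc
 env | grep HADOOP    # HADOOP_HOME, HADOOP_CONF, HADOOP_CONF_DIR, ... should all be set
 ls -ld ${HDFS_DATA}  # the data directory should now exist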
Configure Hadoop
On each instance, extract further environment variables from /etc/hosts:
 export MASTER_NODE=$(grep spark-driver /etc/hosts | cut -d" " -f1)
 export HADOOP_NAMENODE=${MASTER_NODE}
 export NUM_WORKERS=$(grep spark- /etc/hosts | wc -l | cut -d" " -f1)
 cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
 echo "export MASTER_NODE=\$(grep spark-driver /etc/hosts | cut -d\" \" -f1)" >> ~/.bashrc
 echo "export HADOOP_NAMENODE=\${MASTER_NODE}" >> ~/.bashrc
 echo "export NUM_WORKERS=\$(grep spark- /etc/hosts | wc -l | cut -d\" \" -f1)" >> ~/.bashrc
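To see what these commands pick up, suppose (purely as an illustration; your addresses and number of workers will differ) that /etc/hosts contains
 10.0.0.10 spark-driver
 10.0.0.11 spark-worker-1
 10.0.0.12 spark-worker-2
Then MASTER_NODE and HADOOP_NAMENODE both become 10.0.0.10, and NUM_WORKERS becomes 3, since grep spark- also matches the spark-driver line.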
Create these four files on your local computer:
File "core-site.xml": <configuration> <property> <name>fs.defaultFS</name> <value>hdfs://${HADOOP_NAMENODE}:9000</value> </property> </configuration> File "hdfs-site.xml": <configuration> <property> <name>dfs.replication</name> <value>${NUM_WORKERS}</value> </property> <property> <name>dfs.namenode.name.dir</name> <value>file://${HDFS_DATA}</value> </property> <property> <name>dfs.datanode.data.dir</name> <value>file://${HDFS_DATA}</value> </property> </configuration> File "yarn-site.xml": <configuration> <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property> <property> <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name> <value>org.apache.hadoop.mapred.ShuffleHandler</value> </property> <property> <name>yarn.resourcemanager.hostname</name> <value>${HADOOP_NAMENODE}</value> </property> </configuration> File mapred-site.xml: <configuration> <property> <name>mapreduce.jobtracker.address</name> <value>${HADOOP_NAMENODE}</value> </property> <property> <name>mapreduce.framework.name</name> <value>yarn</value> </property> </configuration>
On your local machine:
 scp {core,hdfs,yarn}-site.xml spark-driver:
 scp {core,hdfs,yarn}-site.xml spark-worker-1:
 ...
 scp mapred-site.xml spark-driver:   # mapred-site.xml is only needed on the driver
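If you have many workers, a small loop saves typing; this sketch assumes the instances are reachable under the names spark-driver, spark-worker-1, spark-worker-2, ... (adjust the list to your cluster):
 for host in spark-driver spark-worker-1 spark-worker-2; do
   scp {core,hdfs,yarn}-site.xml ${host}:;
 done
 scp mapred-site.xml spark-driver:   # still only needed on the driver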
On each instance, use the envsubst command to substitute variable names like ${NUM_WORKERS} in the XML files with their values from the local environment:
 for name in core hdfs yarn mapred; do
   cp ${HADOOP_CONF}/${name}-site.xml ${HADOOP_CONF}/${name}-site.xml.original;
 done
 cd ~
 for name in *-site.xml; do
   envsubst < ${name} > ${HADOOP_CONF}/${name};
 done
 rm ~/*-site.xml
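You can spot-check that the substitution worked, for example by looking at the replication factor, which should now be a concrete number instead of ${NUM_WORKERS}:
 grep -A 1 dfs.replication ${HADOOP_CONF}/hdfs-site.xml                    # the <value> line should show a number, e.g. 3
 diff ${HADOOP_CONF}/core-site.xml ${HADOOP_CONF}/core-site.xml.original   # compare against the stock file you backed up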
On each instance, record the master and worker host names (taken from /etc/hosts) in Hadoop's masters and workers files, and make a few other preparations:
 grep spark-driver /etc/hosts | cut -d" " -f2 > ${HADOOP_CONF}/masters
 grep spark- /etc/hosts | cut -d" " -f2 > ${HADOOP_CONF}/workers
 echo "JAVA_HOME=$JAVA_HOME" >> $HADOOP_CONF/hadoop-env.sh
 mkdir -p -m 0777 $HADOOP_HOME/logs
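A quick way to confirm the result is to print the two files; the exact contents depend on your /etc/hosts, but with a driver and two workers they would be:
 cat ${HADOOP_CONF}/masters   # spark-driver
 cat ${HADOOP_CONF}/workers   # spark-driver, spark-worker-1 and spark-worker-2, one name per line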
Run HDFS on the cluster
On spark-driver:
 hdfs namenode -format   # answer 'Y' if you are asked about reformatting
 start-dfs.sh
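Once start-dfs.sh returns, you can check that the daemons came up, for example with jps (which ships with the JDK) and the HDFS status report:
 jps                     # expect NameNode and SecondaryNameNode on spark-driver, and a DataNode on every host listed in workers
 hdfs dfsadmin -report   # should list one live datanode per DataNode process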
If you run into trouble, you can look in volume/hadoop/logs/ for clues. The tail command is your friend. The ls -lt command lists files sorted by modification time.
On any instance, test that HDFS is working with
 hdfs dfs -mkdir -p /user/ubuntu/raw-data
 hdfs dfs -ls /user/ubuntu/   # on any machine in the spark-* cluster
From your local machine (this file is available at mitt.uib.no):
 scp head-100000-latest-all.json spark-ANY-INSTANCE:
On spark-ANY-INSTANCE:
 hdfs dfs -put ~/head-100000-latest-all.json raw-data
From another instance, use hdfs dfs -ls ... to check that the file is there.
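For example (the exact sizes and timestamps will differ):
 hdfs dfs -ls raw-data                # should list head-100000-latest-all.json
 hdfs dfs -ls /user/ubuntu/raw-data   # same listing; the relative path above resolves to this absolute one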
On spark-driver, stop HDFS again:
 stop-dfs.sh
Run HDFS and YARN on the cluster
You are now ready to start YARN and HDFS together. On spark-driver, run:
 start-all.sh
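As before, jps gives a quick overview of the running daemons, and YARN keeps its own list of registered nodes:
 jps               # spark-driver should now also show a ResourceManager, and every host in workers a NodeManager
 yarn node -list   # should report one RUNNING node per NodeManager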
When you want to stop YARN and HDFS, use:
 stop-all.sh
Web UIs
If you have opened the ports on spark-driver, you can now access
- HDFS' web UI at http://158.39.77.103:9870
- YARN's web UI at http://158.39.77.103:8088
(assuming that 158.39.77.103 is the IPv4 address of spark-driver).
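If the pages do not load in a browser, you can first check from spark-driver itself that the services answer locally (this only tests the services, not whether the ports are open to the outside); assuming curl is installed:
 curl -sI http://localhost:9870 | head -n 1   # an HTTP status line means the HDFS UI is up
 curl -sI http://localhost:8088 | head -n 1   # an HTTP status line means the YARN UI is up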