Install Kafka on the cluster
Install Zookeeper
On each instance, go to https://zookeeper.apache.org/releases.html#download . Download and unpack a recent binary distribution to ~/volume. For example:
cd ~/volume
wget https://dlcdn.apache.org/zookeeper/zookeeper-3.8.0/apache-zookeeper-3.8.0-bin.tar.gz
tar zxvf apache-zookeeper-3.8.0-bin.tar.gz
rm apache-zookeeper-3.8.0-bin.tar.gz
ln -fns apache-zookeeper-3.8.0-bin zookeeper
On your local computer, create the file zookeeper.properties:
dataDir=/tmp/zookeeper
clientPort=2181
maxClientCnxns=200
tickTime=2000
initLimit=20
syncLimit=10
From your local computer, upload the file to the three ZooKeeper hosts:
scp zookeeper.properties spark-driver:
scp zookeeper.properties spark-worker-1:
scp zookeeper.properties spark-worker-2:
# we are not going to run zookeeper on spark-worker-3
# (an even number of hosts adds no fault tolerance)
# you can also run this as a bash loop:
# for host in spark-driver spark-worker-1 spark-worker-2 ; do
#   scp zookeeper.properties $host: ;
# done
# in any case, you need to run this loop, which sets the "id" of each zookeeper node
i=1
for host in spark-driver spark-worker-1 spark-worker-2 ; do
  echo $i > myid
  scp myid $host:
  ((i++))
done
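If you want to check which id each host will receive before copying anything, the loop can be dry-run locally (this just prints the mapping, it does not touch the hosts):

```shell
# Dry run of the id assignment: print the myid value each host would get
i=1
for host in spark-driver spark-worker-1 spark-worker-2; do
  echo "$host -> myid=$i"
  i=$((i+1))
done
```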
On each instance except spark-worker-3, add addresses from /etc/hosts to zookeeper.properties:
mv zookeeper.properties ~/volume/apache-zookeeper-3.8.0-bin/conf/
i=1
for line in $(grep spark- /etc/hosts | cut -d" " -f1 | head -3); do
  echo "server.$i=$line:2888:3888"
  ((i++))
done >> ~/volume/apache-zookeeper-3.8.0-bin/conf/zookeeper.properties
mkdir -p /tmp/zookeeper
mv myid /tmp/zookeeper
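To see what that loop appends, you can run it against sample input instead of your real /etc/hosts (the IPs below are made up for illustration):

```shell
# Sample IPs standing in for the first three spark- entries in /etc/hosts
ips="192.168.0.11
192.168.0.12
192.168.0.13"
i=1
for line in $ips; do
  echo "server.$i=$line:2888:3888"
  i=$((i+1))
done
```

Each host gets one `server.N=ip:2888:3888` line; N must match the number in that host's /tmp/zookeeper/myid file.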
Set ZOOKEEPER_HOME and add bin/ to PATH:
export ZOOKEEPER_HOME=/home/ubuntu/volume/apache-zookeeper-3.8.0-bin
export PATH=${PATH}:${ZOOKEEPER_HOME}/bin
cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
echo "export ZOOKEEPER_HOME=/home/ubuntu/volume/apache-zookeeper-3.8.0-bin" >> ~/.bashrc
echo "export PATH=\${PATH}:\${ZOOKEEPER_HOME}/bin" >> ~/.bashrc
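Note the backslashes in the last echo: they keep the variable references literal, so .bashrc re-expands them at every login instead of freezing today's PATH into the file. A quick comparison (the toy value is just for demonstration):

```shell
# Escaped: the literal string ${PATH} is what lands in .bashrc
echo "export PATH=\${PATH}:\${ZOOKEEPER_HOME}/bin"
# Unescaped: the variable is expanded right now (toy value for illustration)
ZOOKEEPER_HOME=/example
echo "export DEMO=${ZOOKEEPER_HOME}/bin"
```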
Test run zookeeper
On spark-driver and -worker-1 and -2:
zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties
zkServer.sh status ${ZOOKEEPER_HOME}/conf/zookeeper.properties
In case of trouble, check ${ZOOKEEPER_HOME}/logs/
Install Kafka
On each instance (including spark-worker-3), go to http://kafka.apache.org/downloads . Download and unpack a recent binary distribution to ~/volume. It is practical, but not critical, to install a Kafka built for the same Scala version as your Spark. To find the Scala version of your installed Spark, check the name of this jar:
ls $SPARK_HOME/jars/spark-sql*
For example, .../spark-sql_2.12-3.3.0.jar means that you run Scala 2.12 and Spark 3.3.0.
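If you want to pull both version numbers out of the jar name programmatically, a small sketch (using the example filename from the text; in practice the name comes from the `ls` above):

```shell
# Example jar name; the real one comes from ls $SPARK_HOME/jars/spark-sql*
jar="spark-sql_2.12-3.3.0.jar"
# The part after "spark-sql_" up to the dash is the Scala version,
# the part after that dash (minus ".jar") is the Spark version
scala_version=$(echo "$jar" | sed 's/^spark-sql_\([0-9.]*\)-.*$/\1/')
spark_version=$(echo "$jar" | sed 's/^spark-sql_[0-9.]*-\(.*\)\.jar$/\1/')
echo "Scala $scala_version, Spark $spark_version"
```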
On each instance:
cd ~/volume
wget https://downloads.apache.org/kafka/3.3.1/kafka_2.12-3.3.1.tgz
tar xzvf kafka_2.12-3.3.1.tgz
rm kafka_2.12-3.3.1.tgz
ln -fns kafka_2.12-3.3.1 kafka
On each instance, configure Kafka:
cp volume/kafka_2.12-3.3.1/config/server.properties volume/kafka_2.12-3.3.1/config/server.properties.original
sed -i "s/broker.id=0/broker.id=$(cat /tmp/zookeeper/myid)/g" volume/kafka_2.12-3.3.1/config/server.properties
export ZOOKEEPER_HOSTLIST=$(grep spark- /etc/hosts | cut -d" " -f1 | head -3 | paste -sd,)
echo "export ZOOKEEPER_HOSTLIST=\$(grep spark- /etc/hosts | cut -d\" \" -f1 | head -3 | paste -sd,)" >> ~/.bashrc
sed -i "s/zookeeper.connect=localhost:2181/zookeeper.connect=$ZOOKEEPER_HOSTLIST/g" volume/kafka_2.12-3.3.1/config/server.properties
local_ip=$(ip -4 address | grep -o "^ *inet \(.\+\)\/.\+global.*$" | grep -o "[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+" | head -1)
# advertised.host.name was removed in Kafka 3.x; use advertised.listeners instead
echo "advertised.listeners=PLAINTEXT://$local_ip:9092" >> volume/kafka_2.12-3.3.1/config/server.properties
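The ZOOKEEPER_HOSTLIST pipeline simply comma-joins the first three spark- IPs from /etc/hosts. With simulated input (the IPs are placeholders) it behaves like this:

```shell
# Simulated /etc/hosts lines; real ones come from grep spark- /etc/hosts
hosts="192.168.0.11 spark-driver
192.168.0.12 spark-worker-1
192.168.0.13 spark-worker-2
192.168.0.14 spark-worker-3"
# keep spark- lines, take the IP column, drop spark-worker-3, join with commas
echo "$hosts" | grep spark- | cut -d" " -f1 | head -3 | paste -sd,
```

The result (e.g. `192.168.0.11,192.168.0.12,192.168.0.13`) is what ends up in zookeeper.connect.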
Set KAFKA_HOME and add bin/ to PATH:
export KAFKA_HOME=/home/ubuntu/volume/kafka_2.12-3.3.1
export PATH=${PATH}:${KAFKA_HOME}/bin
cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)
echo "export KAFKA_HOME=/home/ubuntu/volume/kafka_2.12-3.3.1" >> ~/.bashrc
echo "export PATH=\${PATH}:\${KAFKA_HOME}/bin" >> ~/.bashrc
Test run Kafka
Ensure Zookeeper is still running. On each node:
kafka-server-start.sh -daemon ${KAFKA_HOME}/config/server.properties
Test commands to run on different instances:
kafka-topics.sh --create --topic test --replication-factor 2 --partitions 3 --bootstrap-server 158.39.77.227:9092
kafka-topics.sh --list --bootstrap-server 158.39.77.227:9092
kafka-console-producer.sh --topic test --bootstrap-server 158.39.77.227:9092
kafka-console-consumer.sh --topic test --bootstrap-server 158.39.77.227:9092
# ... etc ...
Run these two commands on different instances:
kafka-console-producer.sh --topic test --bootstrap-server 158.39.77.227:9092
kafka-console-consumer.sh --topic test --bootstrap-server 158.39.77.227:9092 --from-beginning
Web UIs
You can access Spark's web UI at http://158.39.77.227:4040 . When you run on top of YARN, however, it attempts to redirect to YARN's web UI at http://158.39.77.227:8088 .
You can access HDFS' web UI at http://158.39.77.227:9870 and YARN's web UI at http://158.39.77.227:8088 .
You can access zookeeper's very simple web UI at http://158.39.77.227:8080/commands/stat .
Kafka has no built-in web UI, but third-party UIs are available. If you have Docker on your local machine, you can run:
docker run -it -p 9000:9000 \
  -e KAFKA_BROKERCONNECT=158.39.77.227:9092 \
  obsidiandynamics/kafdrop

Kafdrop's UI is then available at http://localhost:9000 .