Create Spark cluster
Create SSH key pair
You need a new SSH key pair for your Spark cluster. Do not use a key pair you are using elsewhere, because you will need to upload the private key to the the virtual machines ("instances") in the cluster, and the may not always be secure. On your local computer, in a terminal window that runs Ubuntu Linux and bash shell:
cd ~/.ssh ssh-keygen -b 4096 -f info319-spark-cluster # leave password empty ls -ld ~/.ssh # mode should be drwx------ ls -l ~/.ssh/info319-spark-cluster # mode should be -rw------- ls -l ~/.ssh/info319-spark-cluster.pu # mode can be -rw-r--r--
Select project
Sign in to the Norwegian Research and Education Cloud (NREC). In the Overview screen, choose the info319 project (for example uib-info319-abc * bgo'. You can also play around with your DEMO project a bit before your proceed.
Create security group
In the NREC Overview window, go to Network -> Security groups. Create a new security group info319-spark-cluster. Add rule as follows (accept defaults for the rest):
- name: info319-spark-cluster
- rule: SSH
- direction: Ingress
- port: 22
- CIDR: 2001:700:200:13::204/64
Here, 2001:700:200:13::204/64 is a range of IPv6 UiB-addresses that includes login.uib.no. We will use IPv6 in the course because it is future-leaning and many organisations are quickly running out of IPv4 addresses. Because your machine may not run IPv6 yet, this exercise uses login.uib.no as a JumpHost to log into the Spark cluster. To log in directly from other IPv6-machines, or using other protocols, you can add more rules to the security group.
Create the Spark driver machine (instance)
A virtual machine in the NREC cloud is called an instance (or sometimes a node or host). In the NREC Overview window, under Compute -> Instances, Launch instance as follows (accept defaults for the rest):
- name: spark-driver (in Ubuntu/Linux, valid hostnames are between 2 and 64 characters in length. They can contain only letters, numbers, periods, and hyphens, but must begin and end with letters and numbers only)
- boot source: GOLD Ubuntu 22.04 LTS
- flavor: m1.large (to get started)
- networks: DualStack
- key pair: import the public SSH key you just created from ~/.ssh/info319-spark-cluster.pub
- security groups: in addition to the default, add the info319-spark-cluster group
Log in to spark-driver
In the NREC Overview window, under Compute -> Instances, get the IPv6 address for spark-driver. It may look like this: 2001:700:2:8301::1111. In a local terminal window, login to spark-driver with something like:
ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8301::1111
Here, the -J YOUR_USERNAME@login.uib.no option means you are using login.uib.no as a "jumphost" to connect to spark-driver. You need to do this because you are (most likely) on a IPv4 machine that cannot connect directly to IPv6. You also need to be on the UiB network (at least through VPN). Now you will be asked for your UiB-password, possibly triggering two-factor authentication. Accept the new fingerprint if asked. Check out the new instance and use exit to log out again.
To simplify future login, in a text editor, open the file ~/.ssh/config on your local computer and add lines like these:
Host spark-driver Hostname 2001:700:2:8301::1111 # the new IPv6 address here User ubuntu IdentityFile ~/.ssh/info319-spark-cluster ProxyJump YOUR_USERNAME@login.uib.no StrictHostKeyChecking no UserKnownHostsFile /dev/null
The two last lines relax security a little, but they are ok for a test instance with open data on it. In general, IPv6 is pretty secure.
You can now login to the server with the simple command:
ssh spark-driver
The NREC documentation suggests a final improvement. On your local Linux-machine, do:
mkdir -m 0700 ~/.ssh/controlmasters
And add lines like these to ~/.ssh/config:
Host login.uib.no User YOUR_USERNAME ControlPath /THE_PATH_YOU_JUST_USED/controlmasters/%r@%h:%p ControlMaster auto ControlPersist 10m
If you get socket error messages when you try to use SSH, you can try these directories instead of ~/.ssh/config: /dev/shm/controlmasters or /var/shm/controlmasters. However, these alternatives are temporary directories, so you need to recreate them whenever you log into your local machine, for example:
echo "mkdir -p -m /dev/shm/controlmasters" >> ~/.profile # ...or maybe you use ~/.bash_profile
Create the Spark worker machines (instances)
To create Spark worker instances, repeat the steps for the Spark driver with these modifications:
- name: spark-worker
- count: 3 # the number of workers
- networks: IPv6 (instead of DualStack)
As before:
- boot source: GOLD Ubuntu 22.04 LTS
- flavor: m1.large
- key pair: import the public SSH key from ~/.ssh/info319-spark-cluster.pub
- security groups: in addition to the default, add the info319-spark-cluster group
Do not create more workers already now. The reason is that, in this exercise, you will install all the software by hand... In Exercise 5, we will do it with scripts instead.
Login to one or more of the new instances like before, where 2001:700:2:8301::2222 is one of the new Ipv6 addresses:
ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8301::2222
You can now name the instances and simplify login using ~/.ssh/config as before. To avoid duplicate lines, you can use wildcards like this:
Host spark-* User ubuntu Port 22 IdentityFile ~/.ssh/info319-spark-cluster ProxyJump sinoa@login.uib.no StrictHostKeyChecking no UserKnownHostsFile=/dev/null Host spark-driver Hostname 2001:700:2:8301::1111 Host spark-worker-1 Hostname 2001:700:2:8301::2222
Add corresponding lines for the other Spark workers.
Create and mount virtual disks
In the NREC Overview window, go to Volumes -> Volumes, Create volumes called spark-driver-volume and spark-worker-volume-1, spark-worker-volume-2, and spark-worker-volume-3 as follows:
- volume source: image
- use image as a source: GOLD Ubuntu 22.04 LTS
- use a small disk size, like 100Gb (This is just a test cluster, so we leave some room for a bigger cluster later.)
Still in the NREC Overview window, go to Compute -> Instances, use the Actions drop-down menus to attach each new volume to its respective instance.
Back in Volumes -> Volumes, check the path each volume is Attached To. Here, we will assume that the new virtual disks are all attached at /dev/sdb'.
Log into each each instance (spark-driver, spark-worker-1, ...) in a separate terminal window. In each window, run the following commands in each to mount and format the disks (we follow this guide, which works well on 22.04 too):
sudo gdisk /dev/sdb
You have to enter this chain of responses:
n 1 [Return] [Return] [Return] w Y
Back at the Linux prompt:
sudo mkfs.ext4 /dev/sdb1 # the '1' is because you chose partition number '1' a few lines up cd ~ mkdir volume # or another location than ~/volume, like /opt sudo mount /dev/sdb1 volume sudo chown -R ubuntu:ubuntu volume
/home/ubuntu/volume is now available on each instance as a regular folder. But you will have to remount it every time to restart (reboot) the instance. To mount the volume permanently, you must add a line to /etc/fstab:
echo $(sudo blkid /dev/sdb1 -s UUID | cut -d" " -f2) /home/ubuntu/volume ext4 defaults 0 0 | \ sudo tee -a /etc/fstab sudo cat /etc/fstab
To check that the volume is permanently mounted, do
sudo reboot now
Back in your local terminal window, after 20-30 seconds, do:
ssh spark-INSTANCE-NAME
Check that the volume is still there and owned by user ubuntu.
ls -ld volume
Set up SSH on the cluster
The virtual machines in the cluster also need to have SSH connections between each other. Internally in the cluster, we will use IPv4 addresses, because some of the tools we will use do not yet support IPv6.
On your local computer, create a file hosts with the IPv4 addresses and names of all the instances. The spark-driver must be first:
158.39.77.227 spark-driver 10.1.0.146 spark-worker-1 10.1.1.208 spark-worker-2 10.1.2.50 spark-worker-3
Copy the hosts file into each instance:
scp hosts spark-driver: spark-worker-1: spark-worker-2: spark-worker-3:
On each instance, extend the file /etc/hosts with the new instances as follows:
cat hosts /etc/hosts | sudo tee /etc/hosts rm hosts
The rest of this exercise assumes that spark-driver is always first, and that no other lines in this file contain the string "spark-".
From your local machine, upload the SSH private keys to each instance:
scp ~/.ssh/info319-spark-cluster spark-driver:.ssh scp ~/.ssh/info319-spark-cluster spark-worker-1:.ssh ...
On your local machine, create the file config.stub:
Host spark-* User ubuntu IdentityFile ~/.ssh/info319-spark-cluster StrictHostKeyChecking no UserKnownHostsFile=/dev/null
This is the start your local .ssh/config file, but with the JumpHost stuff removed. Upload it to each instance:
scp config.stub spark-driver:.ssh/config scp config.stub spark-worker-1:.ssh/config ...
You can also use a for loop to do multiple uploads:
for dest in spark-driver spark-worker-1 spark-worker-2 spark-worker-3: scp config.stub dest:.ssh/config
Or even more compactly:
for dest in spark-{driver,worker-{1,2,3}}: scp config.stub dest:.ssh/config
Now, you can connect from any of your instances to any other. For example, on spark-worker-2
ssh spark-worker-3
Security group for spark-driver
It is not mandatory, but practical to open a few more ports on spark-driver to be able to access the web UIs for HDFS, YARN, Spark, Zookeeper, and Kafka later.
In the Overview window, go to Network -> Security groups. Create a new security group info319-spark-driver. Add rules that open the following TCP ports for Ingress:
- rule: TCP
- direction: Ingress
- ports: 4040, 8042, 8080, 8088, 9092, 9870
- CIDR: 129.177.146.20/20 (this covers most UiB addresses, add more rules to access from outside UiB/VPN)
Still in the Overview window, go to Compute -> Instances, use the Actions drop-down menus to add the new security group info319-spark-driver to the spark-driver instance.