Create Spark cluster

From info319

Latest revision as of 11:52, 31 October 2022

Create SSH key pair

You need a new SSH key pair for your Spark cluster. Do not use a key pair you are using elsewhere, because you will need to upload the private key to the virtual machines ("instances") in the cluster, and they may not always be secure. On your local computer, in a terminal window running a bash shell on Ubuntu Linux:

cd ~/.ssh
ssh-keygen -b 4096 -f info319-spark-cluster

Leave the passphrase empty when ssh-keygen asks for it.

ls -ld ~/.ssh                                # mode should be drwx------
ls -l ~/.ssh/info319-spark-cluster           # mode should be -rw-------
ls -l ~/.ssh/info319-spark-cluster.pub       # mode can be -rw-r--r--
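
The mode checks above can also be scripted. The following is a minimal sketch, assuming GNU stat (as shipped with Ubuntu). The helper name check_mode is hypothetical and not part of the exercise:

```shell
# check_mode PATH EXPECTED_OCTAL_MODE
# Prints "ok" or "wrong mode" for PATH; relies on GNU stat's -c option.
check_mode() {
    actual=$(stat -c '%a' "$1") || return 2
    if [ "$actual" = "$2" ]; then
        echo "ok: $1 has mode $2"
    else
        echo "wrong mode: $1 has $actual, expected $2"
        return 1
    fi
}

# Example calls (commented out so that nothing runs by accident):
# check_mode ~/.ssh 700
# check_mode ~/.ssh/info319-spark-cluster 600
# check_mode ~/.ssh/info319-spark-cluster.pub 644
```

If a mode is wrong, chmod 700 ~/.ssh or chmod 600 on the private key should fix it.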

Select project

Sign in to the Norwegian Research and Education Cloud (NREC, https://nrec.no/). In the Overview screen, choose the info319 project (for example uib-info319-abc * bgo). You can also play around with your DEMO project a bit before you proceed.

Create security group

In the NREC Overview window, go to Network -> Security groups. Create Security Group info319-spark-cluster. Add rule as follows (accept defaults for the rest):

  • name: info319-spark-cluster
  • rule: SSH
  • CIDR: 2001:700:200:13::204/64

Here, 2001:700:200:13::204/64 is a range of IPv6 UiB-addresses that includes login.uib.no. We will use IPv6 in the course because it is forward-looking and many organisations (including UiB) are quickly running out of IPv4 addresses. Because your machine may not run IPv6 yet, this exercise uses login.uib.no as a JumpHost to log into the Spark cluster. This means that SSH connects to your cluster machines via login.uib.no, which converts from IPv4 to IPv6 (and back). To log in directly from other IPv6-machines, or using other protocols, you can add more rules to the security group.

Create the Spark driver machine (instance)

A virtual machine in the NREC cloud is called an instance (or sometimes a node or host). In the NREC Overview window, under Compute -> Instances, Launch instance as follows (accept defaults for the rest):

  • name: spark-driver (in Ubuntu/Linux, valid host names are between 2 and 64 characters in length. They can contain only letters, numbers, periods, and hyphens, but must begin and end with letters and numbers only)
  • boot source: GOLD Ubuntu 22.04 LTS
  • flavor: m1.large (to get started)
  • networks: DualStack (most of your instances will be IPv6 only, but the driver will need IPv4 too later)
  • security groups: in addition to the default, add the info319-spark-cluster group
  • key pair: import the public SSH key you just created from ~/.ssh/info319-spark-cluster.pub

Log in to spark-driver

In the NREC Overview window, under Compute -> Instances, get the IPv6 address for spark-driver. It may look like this: 2001:700:2:8301::1111. In a local terminal window, log in to spark-driver with something like:

ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8301::1111

Here, the -J YOUR_USERNAME@login.uib.no option means you are using login.uib.no as a "jumphost" to connect to spark-driver. You need to do this because you are (most likely) on an IPv4 machine that cannot connect directly to IPv6. You also need to be on the UiB network (at least through VPN). Now you will be asked for your UiB password, possibly triggering two-factor authentication. Accept the new fingerprint if asked. Check out the new instance and use exit to log out again.

Note: If this does not work, check if you can ssh directly to login.uib.no. It is sometimes unavailable, but usually comes up again in a few minutes. There is a 10-minute block if you do something suspicious, like getting a user name wrong.

To simplify future login, in a text editor, open the file ~/.ssh/config on your local computer and add lines like these:

Host spark-driver
     Hostname 2001:700:2:8301::1111  # the new IPv6 address here
     User ubuntu
     IdentityFile ~/.ssh/info319-spark-cluster
     ProxyJump YOUR_USERNAME@login.uib.no
     StrictHostKeyChecking no
     UserKnownHostsFile /dev/null

The last two lines relax security a little, but they are OK for a test instance with open data on it. In general, IPv6 is pretty secure.

You can now log in to the server with the simple command:

ssh spark-driver

The NREC documentation suggests a final improvement. On your local Linux-machine, do:

mkdir -m 0700 ~/.ssh/controlmasters

And add lines like these to ~/.ssh/config:

Host login.uib.no
     User YOUR_USERNAME
     ControlPath /THE_PATH_YOU_JUST_USED/controlmasters/%r@%h:%p
     ControlMaster auto
     ControlPersist 10m

If you get socket error messages when you try to use SSH, you can try these directories instead of ~/.ssh/controlmasters: /dev/shm/controlmasters or /var/shm/controlmasters. However, these alternatives are temporary directories, so you need to recreate them whenever you log into your local machine, for example:

echo "mkdir -p -m 0700 /dev/shm/controlmasters" >> ~/.profile  # ...or maybe you use  ~/.bash_profile
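
Appending with >> adds the line again each time the command is repeated. A minimal sketch of an idempotent variant follows. The helper name append_once is hypothetical, and the mode 0700 in the example call is an assumption that mirrors the mkdir -m 0700 used for ~/.ssh/controlmasters:

```shell
# append_once LINE FILE
# Appends LINE to FILE unless an identical line is already present.
append_once() {
    touch "$2"
    grep -qxF -- "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Example call (commented out so that your profile is not changed by accident):
# append_once "mkdir -p -m 0700 /dev/shm/controlmasters" ~/.profile
```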

Create the Spark worker machines (instances)

To create Spark worker instances, repeat the steps for the Spark driver with these modifications:

  • name: spark-worker
  • count: 3 # the number of workers
  • networks: IPv6 (instead of DualStack)

As before:

  • boot source: GOLD Ubuntu 22.04 LTS
  • flavor: m1.large
  • security groups: in addition to the default, add the info319-spark-cluster group
  • key pair: use the same SSH key you imported from ~/.ssh/info319-spark-cluster.pub

Do not create more workers yet. The reason is that, in this exercise, you will install all the software by hand. In Exercise 5, we will do it with scripts instead.

Log in to one or more of the new instances like before, where 2001:700:2:8301::2222 is one of the new IPv6 addresses:

ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8301::2222

You can now name the instances and simplify login using ~/.ssh/config as before. To avoid duplicate lines, you can use a wildcard like this:

Host spark-*
     User ubuntu
     Port 22
     IdentityFile ~/.ssh/info319-spark-cluster
     ProxyJump YOUR_USERNAME@login.uib.no
     StrictHostKeyChecking no
     UserKnownHostsFile=/dev/null

Host spark-driver
     Hostname 2001:700:2:8301::1111

Host spark-worker-1
     Hostname 2001:700:2:8301::2222

Add corresponding lines for the other Spark workers.
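
The per-worker stanzas can also be generated instead of typed. This is a minimal sketch: the helper name host_entry is hypothetical, and the second and third addresses are made-up placeholders that you must replace with the IPv6 addresses NREC shows for your workers:

```shell
# host_entry NAME ADDRESS
# Prints one ~/.ssh/config stanza in the format used above.
host_entry() {
    printf 'Host %s\n     Hostname %s\n\n' "$1" "$2"
}

# Placeholder addresses: replace them with your own.
n=1
for addr in 2001:700:2:8301::2222 2001:700:2:8301::3333 2001:700:2:8301::4444; do
    host_entry "spark-worker-$n" "$addr"
    n=$((n + 1))
done
```

The stanzas are only printed, so you can review them before pasting them into ~/.ssh/config.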

Create and mount virtual disks

In the NREC Overview window, go to Volumes -> Volumes, Create volumes called spark-driver-volume and spark-worker-volume-1, spark-worker-volume-2, and spark-worker-volume-3 as follows:

  • volume source: image
  • use image as a source: GOLD Ubuntu 22.04 LTS
  • use a small disk size, like 100 GB (this is just a test cluster, so we leave some room for a bigger cluster later)

Use the Actions drop-down menus to Manage Attachments of each new volume to its respective instance. For Device Name you can try /dev/sdb, or leave it open.

Check which path each volume has become Attached To. Here, we will assume that the new virtual disks are all attached at /dev/sdb (this is what usually happens).

Log into each instance (spark-driver, spark-worker-1, ...) in a separate terminal window. In each window, run the following commands to partition, format and mount the disks (we follow the guide at https://techguides.yt/guides/how-to-partition-format-and-auto-mount-disk-on-ubuntu-20-04/, which works well on 22.04 too).

Partition the disk:

sudo gdisk /dev/sdb

You have to enter this chain of responses:

n
1
[Return]
[Return]
[Return]
w
Y 
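
If you prefer not to type the responses in every window, they can be piped into gdisk. This is only a sketch: the helper name gdisk_answers is hypothetical, and it assumes that gdisk asks exactly the questions listed above, which may not hold if the disk already has a partition table.

```shell
# gdisk_answers: prints the seven responses listed above, one per line.
# Empty lines stand for [Return] (accept the default).
gdisk_answers() {
    printf '%s\n' n 1 '' '' '' w Y
}

# Usage on an instance (destructive, so double-check the device name first):
# gdisk_answers | sudo gdisk /dev/sdb
```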

Back at the Linux prompt on each instance, format and mount the new partition:

sudo mkfs.ext4 /dev/sdb1  # the '1' is because you chose partition number '1' a few lines up
mkdir ~/volume
sudo mount /dev/sdb1 ~/volume
sudo chown -R ubuntu:ubuntu ~/volume

/home/ubuntu/volume is now available on each instance as a regular folder. But you will have to remount it every time you restart (reboot) the instance. To mount the volume permanently, you must add a line to /etc/fstab:

echo $(sudo blkid /dev/sdb1 -s UUID | cut -d" " -f2) /home/ubuntu/volume ext4 defaults 0 0 | \
    sudo tee -a /etc/fstab
sudo cat /etc/fstab
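
To see what the pipeline above writes, the sketch below applies the same cut step to a sample blkid output line. The helper name fstab_line is hypothetical and the UUID is a made-up example value:

```shell
# fstab_line BLKID_OUTPUT MOUNT_POINT
# Builds the /etc/fstab line from a string such as: /dev/sdb1: UUID="..."
fstab_line() {
    uuid_field=$(printf '%s\n' "$1" | cut -d" " -f2)
    echo "$uuid_field $2 ext4 defaults 0 0"
}

# The UUID below is a made-up example value.
fstab_line '/dev/sdb1: UUID="0a1b2c3d-1111-2222-3333-444455556666"' /home/ubuntu/volume
# prints: UUID="0a1b2c3d-1111-2222-3333-444455556666" /home/ubuntu/volume ext4 defaults 0 0
```

Referring to the partition by UUID rather than by /dev/sdb1 keeps the entry valid even if the device name changes after a reboot.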

To check that the volume is permanently mounted, do

sudo reboot now

Back in your local terminal window, after 20-30 seconds, do:

ssh spark-INSTANCE-NAME

Check that the volume is still there and owned by user ubuntu.

ls -ld ~/volume  # mode and ownership should be drwxr-xr-x 3 ubuntu ubuntu
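
The ls output shows the folder, but not whether the disk is actually mounted on it. A minimal sketch of an extra check, assuming the mountpoint tool from util-linux is installed (it normally is on Ubuntu). The helper name check_volume is hypothetical:

```shell
# check_volume DIR: reports whether DIR is currently a mount point.
check_volume() {
    if mountpoint -q "$1"; then
        echo "$1 is mounted"
    else
        echo "$1 is NOT mounted"
        return 1
    fi
}

# On an instance you would run:
# check_volume /home/ubuntu/volume
```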

Set up SSH on the cluster

The virtual machines in the cluster also need to have SSH connections between each other. Internally in the cluster, we will use IPv4 addresses, because some of the software we will run does not yet support IPv6.

On your local computer, create an exercise-4 folder, for example a subfolder of info319-exercises, and cd into it.

On your local computer, create a file hosts with the IPv4 addresses and names of all the instances. The spark-driver must be first, for example:

158.39.77.125 spark-driver
10.1.0.143 spark-worker-1
10.1.1.202 spark-worker-2
10.1.2.49 spark-worker-3

(spark-driver has a different-looking IPv4 address because it is global (DualStack), whereas the workers have local IPv4 addresses that are only valid inside the cluster.)

Copy the hosts file into each instance:

scp hosts spark-driver:
scp hosts spark-worker-1:
scp hosts spark-worker-2:
scp hosts spark-worker-3:

On each instance, extend the file /etc/hosts with the new instances as follows:

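cat hosts /etc/hosts > hosts.new   # build the combined file first: writing to /etc/hosts while it is still being read can empty it
sudo cp hosts.new /etc/hosts
rm hosts hosts.new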

The rest of this exercise assumes that spark-driver is always listed first in /etc/hosts, and that no other lines in this file contain the string "spark-".
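
Running the merge step twice would duplicate the entries. The sketch below makes it repeatable by dropping old "spark-" lines first, relying on the assumption above that no other line contains "spark-". The helper name merge_hosts is hypothetical:

```shell
# merge_hosts NEW_ENTRIES HOSTS_FILE
# Prints NEW_ENTRIES followed by HOSTS_FILE without its old "spark-" lines.
merge_hosts() {
    cat "$1"
    grep -v 'spark-' "$2" || true   # grep exits 1 when no lines remain; that is fine here
}

# On an instance (write to a temporary file first, then replace /etc/hosts):
# merge_hosts hosts /etc/hosts > hosts.new && sudo cp hosts.new /etc/hosts
```

The spark-driver line still comes first, because the new entries are printed before the rest of the file.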

From your local machine, upload the SSH private key to each instance:

scp ~/.ssh/info319-spark-cluster spark-driver:.ssh
scp ~/.ssh/info319-spark-cluster spark-worker-1:.ssh
...

Check that the copied files have the right permissions (mode): -rw-------

On your local machine, create the file config.stub:

Host spark-*
     User ubuntu
     IdentityFile ~/.ssh/info319-spark-cluster
     StrictHostKeyChecking no
     UserKnownHostsFile=/dev/null

This is the start of your local .ssh/config file, but with the JumpHost stuff removed. Upload it to each instance:

scp config.stub spark-driver:.ssh/config
scp config.stub spark-worker-1:.ssh/config
...

You can also use a for loop to do multiple uploads:

for dest in spark-driver spark-worker-1 spark-worker-2 spark-worker-3 ; do
    scp config.stub $dest:.ssh/config ;
done

Or even more compactly:

for dest in spark-{driver,worker-{1,2,3}} ; do
    scp config.stub $dest:.ssh/config ;
done

Now, you can connect from any of your instances directly to any other. For example, on spark-worker-2:

ssh spark-worker-3

On one of the instances, also try:

ssh localhost

If it does not work, you can change the first line of config.stub above:

Host spark-* localhost

Security group for spark-driver

It is useful to open a few more ports on spark-driver to be able to access the web UIs for HDFS, YARN, Spark, Zookeeper, and Kafka later.

In the NREC Overview window, go to Network -> Security groups. Create a new security group info319-spark-driver. Add rules that open the following TCP ports for incoming connections:

  • rule: Custom TCP Rule
  • direction: Ingress
  • ports: 4040, 8042, 8080, 8088, 9092, 9870 (you need to create one rule for each)
  • CIDR: 129.177.146.20/20 (this covers most UiB addresses - you can add more rules to access from outside UiB/VPN)
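
Because each port needs its own rule, a loop can save some clicking. The sketch below only prints the commands. It assumes that you have the OpenStack command-line client installed and configured for your NREC project, which this exercise does not otherwise require. The helper name print_rule_commands is hypothetical, and the CIDR is passed through exactly as given in the list above:

```shell
# print_rule_commands GROUP CIDR PORT...
# Prints (does not run) one OpenStack CLI command per port.
print_rule_commands() {
    group=$1
    cidr=$2
    shift 2
    for port in "$@"; do
        echo "openstack security group rule create --ingress --protocol tcp --dst-port $port --remote-ip $cidr $group"
    done
}

print_rule_commands info319-spark-driver 129.177.146.20/20 4040 8042 8080 8088 9092 9870
```

Review the printed commands before running any of them. The web interface described above gives the same result.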

Still in the Overview window, go to Compute -> Instances, use the Actions drop-down menus (Edit Security Groups) to add the new security group info319-spark-driver to the spark-driver instance.