Configure Spark cluster using Ansible

Install and configure Ansible

On your local host, install Ansible, for example:

sudo apt install ansible

To prepare, back up the default inventory file and make it writable:

sudo cp /etc/ansible/hosts /etc/ansible/hosts.original
sudo chmod 666 /etc/ansible/hosts

Ansible needs to know the names of your cluster machines. Change info319-cluster.tf so it also writes a file like this to /etc/ansible/hosts:

terraform-driver
terraform-worker-0
...
terraform-worker-5

Finally, Ansible must be installed on all the hosts too. Add the line

- ansible

to the packages: section of user-data.cfg, and re-run terraform apply.
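
For reference, a minimal sketch of what the relevant part of user-data.cfg might look like afterwards, assuming it is a standard cloud-init file (keep whatever other packages your Exercise 4 setup already lists):

#cloud-config
packages:
  # ...your existing packages from Exercise 4...
  - ansible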

Test run Ansible

Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:

ansible all -m ping

On your local machine, create the file info319-cluster.yaml with a simple task:

 - name: Prepare .bashrc
   hosts: all
   tasks:
     - name: Save original .bashrc
       ansible.builtin.copy:
         src: /home/ubuntu/.bashrc
         dest: /home/ubuntu/.bashrc.original
         remote_src: yes

Run Ansible:

ansible-playbook info319-cluster.yaml 

Create Ansible playbook

Extend the playbook file info319-cluster.yaml to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom in the order of the tasks.

Preparing .bashrc

In Exercise 4 we made a lot of modifications to ~/.bashrc. In some cases it is more practical to have the cluster configuration in a separate file, for example ~/.info319.

  • Add a task that uses the ansible.builtin.file module to create ("touch") /home/ubuntu/.info319 on all the hosts.
  • Add a task that uses the ansible.builtin.lineinfile module to add this line to /home/ubuntu/.bashrc on all the hosts (see the sketch below):
source .info319

See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.
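
A minimal sketch of the two tasks; the task names are only suggestions, and the paths are the ones used above:

   - name: Create .info319
     ansible.builtin.file:
       path: /home/ubuntu/.info319
       state: touch

   - name: Source .info319 from .bashrc
     ansible.builtin.lineinfile:
       path: /home/ubuntu/.bashrc
       line: source .info319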

Configure SSH

You can use the blockinfile module to add your local ipv4-hosts to /etc/hosts on each node:

   - name: Copy IPv4 addresses to /etc/hosts
     ansible.builtin.blockinfile:
       path: /etc/hosts
       block: "{{ lookup('file', 'ipv4-hosts') }}"
     become: yes

On your local machine, create the file config in your exercise-5 folder:

Host terraform-* localhost
     User ubuntu
     IdentityFile ~/.ssh/info319-spark-cluster
     StrictHostKeyChecking no
     UserKnownHostsFile /dev/null

Include ~/.ssh/config.terraform-hosts

(This is the config.stub file from Exercise 4, with the Include line added. Also, localhost has been added to the first line to allow the nodes to SSH into themselves.)

Use the copy module to upload this file, along with ~/.ssh/config.terraform-hosts and ~/.ssh/info319-spark-cluster, to all hosts.
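
One possible way is a single copy task with a loop; the destination paths below are assumptions (the files go into /home/ubuntu/.ssh on each host, with restrictive permissions for the private key):

   - name: Upload SSH config files and cluster key
     ansible.builtin.copy:
       src: "{{ item }}"
       dest: "/home/ubuntu/.ssh/{{ item | basename }}"
       mode: "0600"
     loop:
       - config
       # adjust YOUR_USERNAME as elsewhere in this exercise
       - /home/YOUR_USERNAME/.ssh/config.terraform-hosts
       - /home/YOUR_USERNAME/.ssh/info319-spark-cluster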

In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this:

    - name: Authorise public cluster key
      ansible.posix.authorized_key:
        key: "{{ lookup('file', '/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub') }}"
        user: ubuntu

Tip: ansible-playbook has a --start-at-task "Task name" option to avoid re-running all the earlier tasks. You can also use --step to have Ansible ask before each task whether to run it, skip it, or continue without asking.

Install Java

Use Ansible's ansible.builtin.apt module to install an old and stable Java version, for example openjdk-8-jdk-headless.
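
For example, a task along these lines (package name as suggested above):

   - name: Install Java
     ansible.builtin.apt:
       name: openjdk-8-jdk-headless
       state: present
       update_cache: yes
     become: yes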

Mount volumes

Using the community.general.parted, community.general.filesystem, and ansible.posix.mount modules may require installing their collections on your local machine first:

ansible-galaxy collection install community.general
ansible-galaxy collection install ansible.posix
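
A sketch of how the three modules might fit together, assuming the extra volume shows up as /dev/vdb and should be mounted on /data (check the device name and mount point against your Exercise 4 setup, for example with lsblk):

   - name: Partition the extra volume
     community.general.parted:
       device: /dev/vdb       # assumed device name
       number: 1
       state: present
     become: yes

   - name: Create an ext4 filesystem on the partition
     community.general.filesystem:
       fstype: ext4
       dev: /dev/vdb1
     become: yes

   - name: Mount the volume
     ansible.posix.mount:
       path: /data            # assumed mount point
       src: /dev/vdb1
       fstype: ext4
       state: mounted
     become: yes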

Install HDFS and YARN

You can download the Hadoop and, later, Spark archives directly to each cluster host. But if you re-run your script many times, this takes time and can trigger rate limiting on the download servers. If so, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.
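
A sketch of this pattern for Hadoop, assuming version 3.3.4 and the Apache archive URL (adjust both to whatever you used in Exercise 4); the same idea works for the Spark archive later:

   - name: Download the Hadoop archive once, on the local machine
     ansible.builtin.get_url:
       # assumed version and mirror
       url: https://archive.apache.org/dist/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
       dest: /tmp/hadoop-3.3.4.tar.gz
     delegate_to: localhost
     run_once: true

   - name: Copy and unpack the archive on every host
     ansible.builtin.unarchive:
       src: /tmp/hadoop-3.3.4.tar.gz
       dest: /home/ubuntu

Since remote_src is not set, the unarchive task copies the local archive to each host before unpacking it there.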

Change the variable syntax in the files {core,hdfs,mapred,yarn}-site.xml from Exercise 4 from shell (envsubst) style to Ansible (Jinja2) style. For example:

  • from core-site.xml:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://${HADOOP_NAMENODE}:9000</value>
    </property>
</configuration>
  • to core-site.xml.j2:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://{{ hadoop_namenode }}:9000</value>
    </property>
</configuration>

You can now use the ansible.builtin.template module instead of the envsubst command.
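
For example, a templating task might look like this; the destination path and the hadoop_namenode value are assumptions based on Exercise 4 (the variable could also be set under vars: at play level or in the inventory):

   - name: Generate core-site.xml from the Jinja2 template
     ansible.builtin.template:
       src: core-site.xml.j2
       dest: /home/ubuntu/hadoop-3.3.4/etc/hadoop/core-site.xml   # assumed Hadoop install path
     vars:
       hadoop_namenode: terraform-driver   # assumed namenode host

The same task can be repeated, or looped, for hdfs-site.xml.j2, mapred-site.xml.j2 and yarn-site.xml.j2.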

hdfs namenode -format has a -nonInteractive option, which avoids re-formatting an already formatted namenode.

Now you can log into terraform-driver and test run HDFS on the cluster as in Exercise 4.

Install Spark on the cluster

Install Kafka on the cluster

Run Spark pipeline