Configure Spark cluster using Ansible
Latest revision as of 13:55, 31 October 2022
Ansible
On your local host, install Ansible, for example:
sudo apt install ansible
Configure Ansible
To prepare:
    sudo cp /etc/ansible/hosts /etc/ansible/hosts.original
    sudo chmod 666 /etc/ansible/hosts
Ansible needs to know the names of your cluster machines. Change info319-cluster.tf so it also writes a file like this to /etc/ansible/hosts:
    terraform-driver
    terraform-worker-0
    ...
    terraform-worker-5
Finally, Ansible must be installed on all the hosts too. Add the line
- ansible
to the packages: section of user-data.cfg, and re-run terraform apply.
Test run Ansible
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:
ansible all -m ping
On your local machine, create the file info319-cluster.yaml with a simple task that backs up ~/.bashrc:
    - name: Prepare .bashrc
      hosts: all
      tasks:
        - name: Save original .bashrc
          ansible.builtin.copy:
            src: /home/ubuntu/.bashrc
            dest: /home/ubuntu/.bashrc.original
            remote_src: yes
Run Ansible:
ansible-playbook info319-cluster.yaml
Ansible playbook
Extend the playbook file info319-cluster.yaml to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.
Preparing .bashrc
In Exercise 4 we made a lot of modifications to ~/.bashrc. In some cases it is more practical to have the cluster configuration in a separate file, for example ~/.info319.
- Add a task that uses the ansible.builtin.file module to create ("touch") /home/ubuntu/.info319 on all the hosts.
- Add a task that uses the ansible.builtin.lineinfile module to add this line to the end of /home/ubuntu/.bashrc on all the hosts:
source .info319
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.
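As a starting point, the two tasks described above could look roughly like the sketch below. The task names and the file mode are assumptions for illustration; they are not prescribed by the exercise.

```yaml
# Sketch only: task names and mode are illustrative choices.
- name: Create ~/.info319
  ansible.builtin.file:
    path: /home/ubuntu/.info319
    state: touch
    mode: "0644"
    # Keep the timestamps so that re-running the playbook reports no change
    modification_time: preserve
    access_time: preserve

- name: Source ~/.info319 from ~/.bashrc
  ansible.builtin.lineinfile:
    path: /home/ubuntu/.bashrc
    line: source .info319
    insertafter: EOF   # append at the end of the file (also the default)
```

Because lineinfile only adds the line when it is missing, the playbook can be re-run without duplicating it.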
Configure SSH
You can use the blockinfile module to add your local ipv4-hosts to /etc/hosts on each node:
    - name: Copy IPv4 addresses to /etc/hosts
      ansible.builtin.blockinfile:
        path: /etc/hosts
        block: "{{ lookup('file', 'ipv4-hosts') }}"
      become: yes  # because you need root privilege (sudo) to update /etc/hosts
Note: The expression must end with two adjacent curly braces, with no space between them. Some renderings of this page show a space there only because WikiText would otherwise misinterpret the braces as a template marker.
On your local machine, create the file config in your exercise-5 folder:
    Host terraform-* localhost
        User ubuntu
        IdentityFile ~/.ssh/info319-spark-cluster
        StrictHostKeyChecking no
        UserKnownHostsFile /dev/null
        Include ~/.ssh/config.terraform-hosts
(This is the config.stub file from Exercise 4, with the Include line added. Also, localhost has been added to the first line so that each node can ssh to itself.)
Use the copy module to upload this file, along with ~/.ssh/config.terraform-hosts and ~/.ssh/info319-spark-cluster to all hosts.
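A minimal sketch of such an upload is shown below. It assumes that the playbook is run from the exercise-5 folder (so the relative path config resolves), that the files should end up in /home/ubuntu/.ssh/ on each host, and that mode 0600 is acceptable for all three; none of these details are stated in the exercise.

```yaml
# Sketch only: destination folder, owner and mode are assumptions.
- name: Upload SSH config, host list and private cluster key
  ansible.builtin.copy:
    src: "{{ item.src }}"
    dest: "/home/ubuntu/.ssh/{{ item.name }}"
    owner: ubuntu
    mode: "0600"   # ssh refuses private keys that are readable by others
  loop:
    - { src: "config", name: "config" }
    - { src: "{{ lookup('env', 'HOME') }}/.ssh/config.terraform-hosts", name: "config.terraform-hosts" }
    - { src: "{{ lookup('env', 'HOME') }}/.ssh/info319-spark-cluster", name: "info319-spark-cluster" }
```

The env lookup is evaluated on your local machine, so the HOME it returns is your own home folder, not the one on the cluster hosts.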
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this:
    - name: Authorise public cluster key
      ansible.posix.authorized_key:
        key: "{{ lookup('file', '/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub') }}"
        user: ubuntu
Tip: ansible-playbook has a --start-at-task "Task name" option you can use to avoid repeating all earlier plays (blocks) and stages. You can also use --step to have Ansible ask before each step whether to execute or skip it.
Install Java
Use Ansible's ansible.builtin.apt module and install an old and stable Java version, for example openjdk-8-jdk-headless.
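For illustration, such a task might look like the following sketch. The task name and the choice to refresh the package cache first are assumptions.

```yaml
# Sketch only: installs the Java package named above on every host.
- name: Install Java
  ansible.builtin.apt:
    name: openjdk-8-jdk-headless
    state: present
    update_cache: yes   # run "apt update" first so the package can be found
  become: yes           # installing packages needs root privilege
```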
Mount volumes
Use the community.general.parted, community.general.filesystem and ansible.posix.mount modules to do this. They may require installation on your local machine:
    ansible-galaxy collection install community.general
    ansible-galaxy collection install ansible.posix
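The three modules can be combined roughly as in the sketch below. The device name /dev/vdb, the ext4 filesystem and the mount point /home/ubuntu/volume are assumptions; check with lsblk on a cluster host which device your volume actually uses, and follow the layout from Exercise 4.

```yaml
# Sketch only: /dev/vdb, ext4 and the mount point are assumed values.
- name: Create a partition on the extra volume
  community.general.parted:
    device: /dev/vdb
    number: 1
    state: present
  become: yes

- name: Create a filesystem on the partition
  community.general.filesystem:
    dev: /dev/vdb1
    fstype: ext4
  become: yes

- name: Mount the partition (also adds it to /etc/fstab)
  ansible.posix.mount:
    path: /home/ubuntu/volume
    src: /dev/vdb1
    fstype: ext4
    state: mounted
  become: yes

- name: Let the ubuntu user own the mounted volume
  ansible.builtin.file:
    path: /home/ubuntu/volume
    state: directory
    owner: ubuntu
    group: ubuntu
  become: yes
```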
Install HDFS and YARN
To install HDFS and YARN you need the master_node and num_workers available as Ansible variables (facts). You can use the ansible.builtin.shell and ansible.builtin.set_fact modules to do this, for example at the start of a new Ansible play:
    - name: Install HDFS and YARN
      hosts: all
      tasks:
        - name: Register master_node expression
          ansible.builtin.shell: grep tf-driver /etc/hosts | cut -d' ' -f1
          register: master_node_expr
        - name: Set master_node fact
          ansible.builtin.set_fact:
            master_node: "{{ master_node_expr.stdout }}"
Write two corresponding tasks for num_workers.
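As an alternative to a shell command, the number of workers can also be derived from the Ansible inventory. The sketch below is one possible approach, not the one the exercise asks for; it assumes that the worker hosts are listed in /etc/ansible/hosts with names starting with terraform-worker.

```yaml
# Sketch only: counts inventory hosts whose name starts with "terraform-worker".
- name: Set num_workers fact from the inventory
  ansible.builtin.set_fact:
    num_workers: "{{ groups['all'] | select('match', 'terraform-worker') | list | length }}"
```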
Use the ansible.builtin.get_url module to download the Hadoop (and other) archives directly to each cluster host. If you re-run your playbook many times, however, the repeated downloads take time and can trigger rate limiting on the download servers. In that case it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.
Use the ansible.builtin.unarchive module to unpack the archives. Use the apt module to install gzip if you need it. Use ansible.builtin.file to create symbolic links as in Exercise 4.
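The download-once variant could be sketched as follows. The Hadoop version, the download URL, the local /tmp path and the target folder /home/ubuntu/volume are assumptions; use the version and layout from Exercise 4.

```yaml
# Sketch only: version, URL and paths are assumed values.
- name: Choose Hadoop version
  ansible.builtin.set_fact:
    hadoop_version: "3.3.4"

- name: Download the Hadoop archive once, to the local machine
  ansible.builtin.get_url:
    url: "https://archive.apache.org/dist/hadoop/common/hadoop-{{ hadoop_version }}/hadoop-{{ hadoop_version }}.tar.gz"
    dest: "/tmp/hadoop-{{ hadoop_version }}.tar.gz"
  delegate_to: localhost
  run_once: true

- name: Copy the archive to each host and unpack it there
  ansible.builtin.unarchive:
    src: "/tmp/hadoop-{{ hadoop_version }}.tar.gz"   # local file, because remote_src defaults to no
    dest: /home/ubuntu/volume
    creates: "/home/ubuntu/volume/hadoop-{{ hadoop_version }}"   # skip if already unpacked

- name: Create a version-independent symbolic link
  ansible.builtin.file:
    src: "/home/ubuntu/volume/hadoop-{{ hadoop_version }}"
    dest: /home/ubuntu/volume/hadoop
    state: link
```

With delegate_to and run_once, the download runs a single time on your local machine regardless of how many hosts are in the play.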
Configure HDFS and YARN
Use ansible.builtin.lineinfile to define environment variables by adding them to ~/.info319 (instead of ~/.bashrc).
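A loop keeps this compact, as in the sketch below. The variable values shown are assumptions (a typical Ubuntu path for OpenJDK 8 and the Hadoop folder used elsewhere on this page); use the values from Exercise 4.

```yaml
# Sketch only: the exported values are assumptions.
- name: Define Hadoop environment variables in ~/.info319
  ansible.builtin.lineinfile:
    path: /home/ubuntu/.info319
    line: "{{ item }}"
  loop:
    - 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64'
    - 'export HADOOP_HOME=/home/ubuntu/volume/hadoop'
    - 'export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin'
```

Each line is added only if an identical line is not already present, so re-running the playbook does not duplicate the exports.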
Change the variable syntax in the files {core,hdfs,mapred,yarn}-site.xml from Exercise 4 from Linux to Ansible. For example:
- from core-site.xml:
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://${HADOOP_NAMENODE}:9000</value>
      </property>
    </configuration>
- to core-site.xml.j2:
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://{{ hadoop_namenode }}:9000</value>
      </property>
    </configuration>
You can now use the ansible.builtin.template module (instead of Linux's envsubst command) to configure the files. Note that the files use the {{ hadoop_namenode }} variable. It is not defined yet, but should have the same value as {{ master_node }}.
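The two steps can be sketched like this. It assumes that the .j2 templates lie next to the playbook (or in a templates folder) and that Hadoop's configuration folder is /home/ubuntu/volume/hadoop/etc/hadoop; adjust both to your own layout.

```yaml
# Sketch only: template location and destination folder are assumptions.
- name: Set hadoop_namenode fact
  ansible.builtin.set_fact:
    hadoop_namenode: "{{ master_node }}"

- name: Generate the Hadoop site files from the templates
  ansible.builtin.template:
    src: "{{ item }}.j2"
    dest: "/home/ubuntu/volume/hadoop/etc/hadoop/{{ item }}"
  loop:
    - core-site.xml
    - hdfs-site.xml
    - mapred-site.xml
    - yarn-site.xml
```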
Use ansible.builtin.shell to create Hadoop's worker and master files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the terraform-driver host.
Note that hdfs namenode -format has a -nonInteractive option that does not re-format an already formatted namenode. Use failed_when to make Ansible ignore exit code 1 from hdfs in such cases:
    - name: Format HDFS namenode
      ansible.builtin.command:  # argv is an option of the command module, not of shell
        argv: ["/home/ubuntu/volume/hadoop/bin/hdfs", "namenode", "-format", "-nonInteractive"]
      register: result
      failed_when: result.rc not in [0, 1]
Now you can log into terraform-driver and test run HDFS on the cluster as in Exercise 4.
Finally, you can start HDFS and YARN from terraform-driver. Note that the ansible.builtin.command and ansible.builtin.shell modules will normally not run ~/.bashrc. The reason is that ~/.bashrc is intended for interactive shells running, for example, in a terminal window. ~/.bashrc is not needed for many simpler commands, but more complex programs and scripts like Hadoop's start-all.sh need many environment variables. Therefore, you must start your own /usr/bin/bash, source ~/.info319 in it, and then run start-all.sh inside it:
    - name: Start HDFS and YARN
      ansible.builtin.command:
        # bash reads --rcfile only in interactive shells, so source the file explicitly
        argv: ["/usr/bin/bash", "-c", "source /home/ubuntu/.info319 && start-all.sh"]
Install Spark on the cluster
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into terraform-driver and test run Spark on the cluster as in Exercise 4.
Install Zookeeper on the cluster
There are two challenges with Zookeeper:
- it cannot always run on all the machines in the cluster, because the number of Zookeeper nodes must be odd
- each Zookeeper node needs to know its myid number
As for the first point, an easy solution is to run Zookeeper on all hosts if the number is odd, but only on the workers if the total number is even.
As for the second point, Exercise 4 has already suggested shell commands you can use to manage zookeeper ids. But Ansible also has powerful loop constructs you can explore.
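One way to handle both points without shell commands is sketched below. It is only an illustration: the fact name zookeeper_hosts is hypothetical, worker names are assumed to start with terraform-worker, and each id is simply the host's position in the list.

```yaml
# Sketch only: zookeeper_hosts is a hypothetical fact name.
- name: Select an odd number of Zookeeper hosts
  ansible.builtin.set_fact:
    # All hosts if their number is odd, otherwise only the workers
    zookeeper_hosts: "{{ groups['all'] if (groups['all'] | length) is odd else (groups['all'] | select('match', 'terraform-worker') | list) }}"

- name: Write myid on the selected hosts
  ansible.builtin.copy:
    content: "{{ zookeeper_hosts.index(inventory_hostname) + 1 }}\n"
    dest: /home/ubuntu/volume/zookeeper/data/myid
  when: inventory_hostname in zookeeper_hosts
```

The same when condition can be added to the other Zookeeper tasks so that they run only on the selected hosts.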
In the end, this task will start Zookeeper on the selected nodes:
    - name: Start Zookeeper
      ansible.builtin.command:
        # bash reads --rcfile only in interactive shells, so source the file explicitly
        argv: ["/usr/bin/bash", "-c", "source /home/ubuntu/.info319 && zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties"]
Install Kafka on the cluster
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its local_ip, which you can set like this:
    - name: Register local_ip expression
      shell: ip -4 address | grep -o "^ *inet \(.\+\)\/.\+global.*$" | grep -o "[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+" | head -1
      register: local_ip_expr
    - name: Set local_ip fact
      set_fact:
        local_ip: "{{ local_ip_expr.stdout }}"
It also needs to know its id, which was written earlier to the file /home/ubuntu/volume/zookeeper/data/myid (a better location than /tmp/zookeeper/myid, which was suggested previously).
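The id can be read back into an Ansible fact with the ansible.builtin.slurp module, for example as in this sketch. The fact name kafka_id is hypothetical; the tasks should run only on hosts where the myid file exists.

```yaml
# Sketch only: kafka_id is a hypothetical fact name.
- name: Read the Zookeeper id file
  ansible.builtin.slurp:
    src: /home/ubuntu/volume/zookeeper/data/myid
  register: myid_file

- name: Set kafka_id fact
  ansible.builtin.set_fact:
    # slurp returns the file content base64-encoded
    kafka_id: "{{ myid_file.content | b64decode | trim }}"
```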
Finally, you can log into terraform-driver and test run Spark on top of Kafka as in Exercise 4. Congratulations!