<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://info319.wiki.uib.no/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Sinoa</id>
	<title>info319 - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://info319.wiki.uib.no/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Sinoa"/>
	<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/Special:Contributions/Sinoa"/>
	<updated>2026-04-25T14:02:24Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.44.2</generator>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Readings&amp;diff=1245</id>
		<title>Readings</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Readings&amp;diff=1245"/>
		<updated>2022-12-02T13:04:35Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Papers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Books ==&lt;br /&gt;
Text books:&lt;br /&gt;
* Rob Kitchin. &#039;&#039;The Data Revolution - A Critical Analysis of Big Data, Open Data and Data Infrastructures&#039;&#039;, 2nd Edition. Sage, 2021.&lt;br /&gt;
** chapters 1, 3-5, 13-14, 17-19 are mandatory (12 and 15-16 are supplementary)&lt;br /&gt;
&lt;br /&gt;
* Bill Chambers and Matei Zaharia: &#039;&#039;Spark: The Definitive Guide - Big Data Processing Made Simple&#039;&#039;. O&#039;Reilly, 2018. [[File:Spark-TheDefinitiveGuide.pdf]]&lt;br /&gt;
** chapters 1-9, 12, 15, 20-21 are mandatory (chapter 10 on SQL is also highly relevant)&lt;br /&gt;
** [https://github.com/databricks/Spark-The-Definitive-Guide GitHub repository with code and data examples]&lt;br /&gt;
&lt;br /&gt;
== Papers ==&lt;br /&gt;
Selected papers will become available here, including:&lt;br /&gt;
* [https://arxiv.org/pdf/2012.09109 Section 1] in Opdahl, A. L., &amp;amp; Nunavath, V. (2020). Big Data. Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data, 15-29. Book chapter&lt;br /&gt;
* Gallofré, M., Opdahl, A. L., Stoppel, S., Tessem, B., &amp;amp; Veres, C. (2021). The News Angler Project: Exploring the Next Generation of Journalistic Knowledge Platforms. In Proceedings of Norsk IKT-konferanse for forskning og utdanning. [https://ojs.bibsys.no/index.php/NIK/article/view/939/792 Short Paper] and poster: [[File:A1-Poster-NIKT2021.pdf]]&lt;br /&gt;
&amp;lt;!-- Architecture stuff:&lt;br /&gt;
* Lambda: Introduced in Nathan Marz and James Warren (2013). Big Data: Principles and Best Practices of Scalable Real-Time Data Systems. Slides 14-27 in [http://2014.berlinbuzzwords.de/sites/2014.berlinbuzzwords.de/files/media/documents/michael_hausenblas_-_lambda_architecture.pdf this presentation] give an overview of the idea! &lt;br /&gt;
* Kappa: Kreps, J.: Questioning the lambda architecture (2014). [https://www.oreilly.com/radar/questioning-the-lambda-architecture/ White paper]&lt;br /&gt;
* Liquid: Fernandez, Raul Castro, Peter R. Pietzuch, Jay Kreps, Neha Narkhede, Jun Rao, Joel Koshy, Dong Lin, Chris Riccomini, and Guozhang Wang. &amp;quot;Liquid: Unifying nearline and offline big data integration.&amp;quot; In CIDR. 2015. [https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1088.2602&amp;amp;rep=rep1&amp;amp;type=pdf Paper]&lt;br /&gt;
* Sigma: Cassavia, N., &amp;amp; Masciari, E. (2021, March). Sigma: a scalable high performance big data architecture. In 2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP) (pp. 236-239). IEEE. [https://bibsys-almaprimo.hosted.exlibrisgroup.com/primo-explore/openurl?sid=google&amp;amp;auinit=N&amp;amp;aulast=Cassavia&amp;amp;atitle=Sigma:%20a%20scalable%20high%20performance%20big%20data%20architecture&amp;amp;id=doi:10.1109%2FPDP52278.2021.00044&amp;amp;vid=UBB&amp;amp;institution=UBB&amp;amp;url_ctx_val=&amp;amp;url_ctx_fmt=null&amp;amp;isSerivcesPage=true Paper]&lt;br /&gt;
* Maamouri, A., Sfaxi, L., &amp;amp; Robbana, R. (2021, December). Phi: A Generic Microservices-Based Big Data Architecture. In European, Mediterranean, and Middle Eastern Conference on Information Systems (pp. 3-16). Springer, Cham. [https://link.springer.com/chapter/10.1007/978-3-030-95947-0_1 Paper]&lt;br /&gt;
&lt;br /&gt;
Marc:&lt;br /&gt;
You found the other Phi architecture. 😃 The one I meant was: https://ieeexplore.ieee.org/abstract/document/8712381 But both have interesting contributions. The one you found considers the training part, which is not instantiated in the others.&lt;br /&gt;
&lt;br /&gt;
This is the &amp;quot;original publication&amp;quot; of Lambda, a blog entry: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html .&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Michael Armbrust, Armando Fox, Rean Griffith, Anthony D Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, Matei Zaharia (2010). A view of cloud computing. Communications of the ACM 53 (4), 50-58. [https://dl.acm.org/doi/fullHtml/10.1145/1721654.1721672 Paper]&lt;br /&gt;
* M Zaharia, M Chowdhury, MJ Franklin, S Shenker, I Stoica (2010). Spark: Cluster computing with working sets. 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 10). [https://www.usenix.org/event/hotcloud10/tech/full_papers/Zaharia.pdf Paper]&lt;br /&gt;
* Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, Ion Stoica (2012). Resilient distributed datasets: A Fault-Tolerant abstraction for In-Memory cluster computing. In Proc. 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp. 15-28. [https://scholar.google.com/citations?view_op=view_citation&amp;amp;hl=en&amp;amp;user=I1EvjZsAAAAJ&amp;amp;citation_for_view=I1EvjZsAAAAJ:Tyk-4Ss8FVUC Paper]&lt;br /&gt;
* Karun, A. K., &amp;amp; Chitharanjan, K. (2013, April). A review on hadoop—HDFS infrastructure extensions. In 2013 IEEE conference on information &amp;amp; communication technologies (pp. 132-137). IEEE. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:GIm8aG-ScOsJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;scillfp=6854624816870725192&amp;amp;oi=lle Paper]&lt;br /&gt;
* Kafka?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Opdahl, A. L., &amp;amp; Tessem, B. (2021). Ontologies for finding journalistic angles. Software and Systems Modeling, 20(1), 71-87. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:pKELE6iBzpAJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;as_ylo=2021&amp;amp;scillfp=4299025271368542631&amp;amp;oi=lle Paper]&lt;br /&gt;
* Berven, A., Christensen, O. A., Moldeklev, S., Opdahl, A. L., &amp;amp; Villanger, K. J. (2020). A knowledge-graph platform for newsrooms. Computers in Industry, 123, 103321. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:0K5dB1_9nusJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;as_ylo=2018&amp;amp;scillfp=11776208952974186557&amp;amp;oi=lle Paper]&lt;br /&gt;
* [https://www.jstor.org/stable/25148625#metadata_info_tab_contents Design Science in Information Systems Research] by Alan R. Hevner, Salvatore T. March, Jinsoo Park and Sudha Ram. MIS Quarterly 28(1):75-105, March 2004. &#039;&#039;(You need to be on UiB&#039;s network to access the link - I have uploaded it under Files in mitt.uib.no, but it may soon be deleted from there...)&#039;&#039;&lt;br /&gt;
* Hevner, A. R. (2007). A three cycle view of design science research. Scandinavian journal of information systems, 19(2), 4. [[File:Hevner2007-ThreeCycleView-SJIS.pdf]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- Architectures: kappa, lambda, phi, Liquid --&amp;gt;&lt;br /&gt;
&amp;lt;!-- Classic papers: HDFS, Spark, RDDs --&amp;gt;&lt;br /&gt;
&amp;lt;!-- Privacy? --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Technical introductions ==&lt;br /&gt;
Selected web pages will become available here, including:&lt;br /&gt;
* [https://kafka.apache.org/intro Kafka Introduction]&lt;br /&gt;
* [https://docs.nrec.no/index.html NREC and OpenStack], the following sections/pages: Introduction, Project application, Logging in, The dashboard, Create a Linux virtual machine (skip: Windows), Using SSH, Working with Security Groups, Create and manage volumes, Create and manage snapshots (skip: images), Instance console&lt;br /&gt;
* [https://docs.nrec.no/terraform-part1.html TerraForm and NREC part I], [https://docs.nrec.no/terraform-part2.html part II], and [https://docs.nrec.no/terraform-part3.html part III]&lt;br /&gt;
* [https://www.ansible.com/overview/how-ansible-works How Ansible Works] and [https://docs.ansible.com/ansible_community.html the Ansible Community portal]&lt;br /&gt;
* Docker Docs: [https://docs.docker.com/get-started/overview/ Docker overview] and [https://docs.docker.com/get-started/ Get started]&lt;br /&gt;
* [https://kubernetes.io/docs/tutorials/kubernetes-basics/ Learn Kubernetes basics], modules 1-6&lt;br /&gt;
* [https://gdpr.eu/what-is-gdpr/ What is GDPR, the EU’s new data protection law?]&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Spark 3.3.0 [https://spark.apache.org/docs/latest/index.html Overview] and [https://spark.apache.org/docs/latest/quick-start.html Quick Start (with Python examples)]&lt;br /&gt;
* [https://developer.twitter.com/en/docs/twitter-api Twitter API v2]&lt;br /&gt;
* [https://github.com/tweepy/tweepy Tweepy: Twitter for Python]&lt;br /&gt;
* [https://docs.tweepy.org/en/latest/ Tweepy Documentation]&lt;br /&gt;
* [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Structured Streaming Spark Programming Guide]&lt;br /&gt;
* Apache Spark [https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/index.html Structured Streaming API]&lt;br /&gt;
* [https://kafka-python.readthedocs.io/en/master/ kafka-python API]&lt;br /&gt;
* EU&#039;s [https://gdpr-info.eu/ General Data Protection Regulation (GDPR)] - the official legal text&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- GDELT --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Lecture slides==&lt;br /&gt;
See the [[Sessions|Session page]] for lecture slides after each session.&lt;br /&gt;
&lt;br /&gt;
==Readings for each session==&lt;br /&gt;
The [[Sessions|Sessions page]] will suggest specific readings for each session and its associated exercise.&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Install_Kafka_on_the_cluster&amp;diff=1244</id>
		<title>Install Kafka on the cluster</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Install_Kafka_on_the_cluster&amp;diff=1244"/>
		<updated>2022-12-01T11:15:11Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Test run Kafka */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Install Zookeeper ==&lt;br /&gt;
On each instance, go to https://zookeeper.apache.org/releases.html#download . Download and unpack a recent binary distribution to &#039;&#039;~/volume&#039;&#039;. For example:&lt;br /&gt;
 cd ~/volume&lt;br /&gt;
 wget https://dlcdn.apache.org/zookeeper/zookeeper-3.8.0/apache-zookeeper-3.8.0-bin.tar.gz&lt;br /&gt;
 tar zxvf apache-zookeeper-3.8.0-bin.tar.gz&lt;br /&gt;
 rm apache-zookeeper-3.8.0-bin.tar.gz&lt;br /&gt;
 ln -fns apache-zookeeper-3.8.0-bin zookeeper&lt;br /&gt;
&lt;br /&gt;
== Configure Zookeeper ==&lt;br /&gt;
On your local computer, create the file &#039;&#039;zookeeper.properties&#039;&#039;:&lt;br /&gt;
 dataDir=/home/ubuntu/volume/zookeeper/data&lt;br /&gt;
 clientPort=2181&lt;br /&gt;
 maxClientCnxns=200&lt;br /&gt;
 tickTime=2000&lt;br /&gt;
 initLimit=20&lt;br /&gt;
 syncLimit=10&lt;br /&gt;
&lt;br /&gt;
From your local computer, upload:&lt;br /&gt;
 scp zookeeper.properties spark-driver:&lt;br /&gt;
 scp zookeeper.properties spark-worker-1:&lt;br /&gt;
 scp zookeeper.properties spark-worker-2:&lt;br /&gt;
We will not run zookeeper on spark-worker-3, because the number of hosts must be odd to allow majority voting.&lt;br /&gt;
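The quorum arithmetic behind this can be sketched in plain bash (an illustration only, not part of the installation):&lt;br /&gt;

```shell
# Majority voting: with n ZooKeeper servers, a write needs floor(n/2) + 1
# acknowledgements. An even fourth server raises the quorum size without
# raising the number of tolerated failures, so three hosts are preferred.
for n in 3 4 5; do
    quorum=$(( n / 2 + 1 ))
    echo "$n servers: quorum $quorum, tolerates $(( n - quorum )) failure(s)"
done
```

Note that 4 servers still only tolerate 1 failure, the same as 3.&lt;br /&gt;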
 &lt;br /&gt;
You can also run this as a bash-loop:&lt;br /&gt;
 for host in spark-{driver,worker-{1,2}}: ; do&lt;br /&gt;
     scp zookeeper.properties $host: ;&lt;br /&gt;
 done&lt;br /&gt;
&lt;br /&gt;
In any case, you need to run the following loop on your local computer to set the &amp;quot;id&amp;quot; of each zookeeper (and later kafka) node:&lt;br /&gt;
 i=1&lt;br /&gt;
 for host in spark-{driver,worker-{1,2,3}}: ; do&lt;br /&gt;
     echo $i &amp;gt; myid; scp myid $host; ((i++));&lt;br /&gt;
 done&lt;br /&gt;
&lt;br /&gt;
On each instance &#039;&#039;except spark-worker-3&#039;&#039;, add addresses from &#039;&#039;/etc/hosts&#039;&#039; to &#039;&#039;zookeeper.properties&#039;&#039;:&lt;br /&gt;
 mv ~/zookeeper.properties ~/volume/zookeeper/conf/&lt;br /&gt;
 i=1&lt;br /&gt;
 for line in $(grep spark- /etc/hosts | cut -d&amp;quot; &amp;quot; -f1 | head -3); do&lt;br /&gt;
     echo &amp;quot;server.$i=$line:2888:3888&amp;quot;; ((i++));&lt;br /&gt;
 done &amp;gt;&amp;gt; ~/volume/zookeeper/conf/zookeeper.properties&lt;br /&gt;
 mkdir -p /home/ubuntu/volume/zookeeper/data&lt;br /&gt;
 mv ~/myid /home/ubuntu/volume/zookeeper/data&lt;br /&gt;
&lt;br /&gt;
Set ZOOKEEPER_HOME and add bin/ to PATH:&lt;br /&gt;
 export ZOOKEEPER_HOME=/home/ubuntu/volume/zookeeper&lt;br /&gt;
 export PATH=${PATH}:${ZOOKEEPER_HOME}/bin&lt;br /&gt;
 cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)&lt;br /&gt;
 echo &amp;quot;export ZOOKEEPER_HOME=/home/ubuntu/volume/zookeeper&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
 echo &amp;quot;export PATH=\${PATH}:\${ZOOKEEPER_HOME}/bin&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
== Test run zookeeper ==&lt;br /&gt;
On &#039;&#039;spark-driver&#039;&#039; and &#039;&#039;-worker-1&#039;&#039; and &#039;&#039;-2&#039;&#039;:&lt;br /&gt;
 zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&lt;br /&gt;
This will start zookeeper in the background on each host. Afterwards, check its status:&lt;br /&gt;
 zkServer.sh status ${ZOOKEEPER_HOME}/conf/zookeeper.properties&lt;br /&gt;
One of the zookeepers is the leader, the two others are followers. In case of trouble, check ${ZOOKEEPER_HOME}/logs/ .&lt;br /&gt;
&lt;br /&gt;
To stop zookeeper, do&lt;br /&gt;
 zkServer.sh stop ${ZOOKEEPER_HOME}/conf/zookeeper.properties&lt;br /&gt;
on each host.&lt;br /&gt;
&lt;br /&gt;
== Install Kafka ==&lt;br /&gt;
On each instance (including &#039;&#039;spark-worker-3&#039;&#039;), go to http://kafka.apache.org/downloads . Download and unpack a recent binary distribution to &#039;&#039;~/volume&#039;&#039;. It is practical, but not critical, to install a Kafka that runs on the same Scala version as your Spark. If you want to know the Scala version of your installed Spark, you can check this file:&lt;br /&gt;
 ls $SPARK_HOME/jars/spark-sql*&lt;br /&gt;
For example, &#039;&#039;.../spark-sql_2.12-3.3.0.jar&#039;&#039; means that you run Scala 2.12 and Spark 3.3.0.&lt;br /&gt;
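The two version numbers can also be cut out of the jar file name in the shell. A small sketch, where the jar name below is an illustrative example of what &#039;&#039;ls&#039;&#039; might print:&lt;br /&gt;

```shell
# Parse the Scala and Spark versions out of a spark-sql jar name.
# Substitute the actual output of: ls $SPARK_HOME/jars/spark-sql*
jar=spark-sql_2.12-3.3.0.jar
base=${jar%.jar}                        # spark-sql_2.12-3.3.0
spark_version=${base##*-}               # text after the last "-": 3.3.0
scala_version=${base%-$spark_version}   # spark-sql_2.12
scala_version=${scala_version#*_}       # text after the "_": 2.12
echo "Scala $scala_version, Spark $spark_version"
```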
&lt;br /&gt;
On each instance:&lt;br /&gt;
 cd ~/volume&lt;br /&gt;
 wget https://downloads.apache.org/kafka/3.3.1/kafka_2.12-3.3.1.tgz&lt;br /&gt;
 tar xzvf kafka_2.12-3.3.1.tgz&lt;br /&gt;
 rm kafka_2.12-3.3.1.tgz&lt;br /&gt;
 ln -fns kafka_2.12-3.3.1 kafka&lt;br /&gt;
&lt;br /&gt;
== Configure Kafka ==&lt;br /&gt;
On each instance, configure Kafka:&lt;br /&gt;
 cp ~/volume/kafka/config/server.properties ~/volume/kafka/config/server.properties.original&lt;br /&gt;
 sed -i &amp;quot;s/broker.id=/broker.id=$(cat /tmp/zookeeper/myid)/g&amp;quot; ~/volume/kafka/config/server.properties&lt;br /&gt;
 export ZOOKEEPER_HOSTLIST=$(grep spark- /etc/hosts | cut -d&amp;quot; &amp;quot; -f1 | head -3 | paste -sd,)&lt;br /&gt;
 echo &amp;quot;export ZOOKEEPER_HOSTLIST=\$(grep spark- /etc/hosts | cut -d\&amp;quot; \&amp;quot; -f1 | head -3 | paste -sd,)&amp;quot; &amp;gt;&amp;gt; ~/.bashrc&lt;br /&gt;
 sed -i &amp;quot;s/zookeeper.connect=localhost:2181/zookeeper.connect=$ZOOKEEPER_HOSTLIST/g&amp;quot; ~/volume/kafka/config/server.properties&lt;br /&gt;
 local_ip=$(ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1) ; \&lt;br /&gt;
 echo &amp;quot;advertised.host.name=$local_ip&amp;quot; &amp;gt;&amp;gt; ~/volume/kafka/config/server.properties&lt;br /&gt;
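After these commands, the changed and added lines in &#039;&#039;server.properties&#039;&#039; should look something like the following (the broker id and the IPv4 addresses are illustrative and will differ between instances):&lt;br /&gt;

```
broker.id=1
zookeeper.connect=10.1.0.11,10.1.0.12,10.1.0.13
advertised.host.name=10.1.0.11
```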
&lt;br /&gt;
Set KAFKA_HOME and add bin/ to PATH:&lt;br /&gt;
 export KAFKA_HOME=/home/ubuntu/volume/kafka&lt;br /&gt;
 export PATH=${PATH}:${KAFKA_HOME}/bin&lt;br /&gt;
 cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)&lt;br /&gt;
 echo &amp;quot;export KAFKA_HOME=/home/ubuntu/volume/kafka&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
 echo &amp;quot;export PATH=\${PATH}:\${KAFKA_HOME}/bin&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
== Test run Kafka ==&lt;br /&gt;
Ensure Zookeeper is still running (or restart it) on the three nodes. On all four nodes:&lt;br /&gt;
 kafka-server-start.sh -daemon ${KAFKA_HOME}/config/server.properties&lt;br /&gt;
(It is a good sign if nothing happens.)&lt;br /&gt;
&lt;br /&gt;
Here are some test commands you can run on the different instances:&lt;br /&gt;
 kafka-topics.sh --create --topic test --replication-factor 2 --partitions 3 --bootstrap-server ${MASTER_NODE}:9092&lt;br /&gt;
 kafka-topics.sh --list --bootstrap-server ${MASTER_NODE}:9092&lt;br /&gt;
&lt;br /&gt;
Run these two commands on different instances to see that lines you type into the &#039;&#039;producer&#039;&#039; console show up in the &#039;&#039;consumer&#039;&#039;:&lt;br /&gt;
 kafka-console-producer.sh --topic test --bootstrap-server ${MASTER_NODE}:9092&lt;br /&gt;
 kafka-console-consumer.sh --topic test --bootstrap-server ${MASTER_NODE}:9092 --from-beginning&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task 2:&#039;&#039;&#039; Run the full Twitter pipeline from Exercise 3. Tips:&lt;br /&gt;
* You may want to create a virtual environment to install Python packages such as &#039;&#039;tweepy&#039;&#039;, &#039;&#039;kafka-python&#039;&#039;, and &#039;&#039;afinn&#039;&#039;.&lt;br /&gt;
* You need to create &#039;&#039;keys/bearer_token&#039;&#039; with restricted access.&lt;br /&gt;
* With the default settings, it can take time before the final join is written to console. Be patient...&lt;br /&gt;
* Start-up sequence from &#039;&#039;spark-driver&#039;&#039;:&lt;br /&gt;
 # assuming zookeeper and kafka run already&lt;br /&gt;
 # create kafka topics&lt;br /&gt;
 kafka-topics.sh --create --topic tweets --bootstrap-server localhost:9092&lt;br /&gt;
 kafka-topics.sh --create --topic media --bootstrap-server localhost:9092&lt;br /&gt;
 kafka-topics.sh --create --topic photos --bootstrap-server localhost:9092&lt;br /&gt;
 # optional monitoring&lt;br /&gt;
 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic tweets --from-beginning &amp;amp;&lt;br /&gt;
 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic media --from-beginning &amp;amp;&lt;br /&gt;
 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic photos --from-beginning &amp;amp;&lt;br /&gt;
 # main programs&lt;br /&gt;
 python twitterpipe_tweet_harvester.py  # run this repeatedly or modify to reconnect after sleeping&lt;br /&gt;
 python twitterpipe_media_harvester.py &amp;amp;  # this one can run in background&lt;br /&gt;
 spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 twitterpipe_tweet_pipeline.py &amp;amp;&lt;br /&gt;
Because you no longer use &#039;&#039;findspark&#039;&#039;, you need to use the &#039;&#039;&#039;--packages&#039;&#039;&#039; option with &#039;&#039;&#039;spark-submit&#039;&#039;&#039;. Make sure you use exactly the same package as you did with &#039;&#039;findspark.add_packages(...)&#039;&#039;.&lt;br /&gt;
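One low-tech way to keep the two in sync is to put the coordinate in a single variable (a sketch; the coordinate shown matches the Spark 3.3.0 / Scala 2.12 setup used above):&lt;br /&gt;

```shell
# Keep the Kafka connector coordinate in one place, so spark-submit gets
# exactly the package previously passed to findspark.add_packages(...).
pkg="org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0"
scala=$(echo "$pkg" | cut -d: -f2 | cut -d_ -f2)   # Scala version: 2.12
spark=$(echo "$pkg" | cut -d: -f3)                 # Spark version: 3.3.0
echo "connector for Scala $scala, Spark $spark"
# spark-submit --packages "$pkg" twitterpipe_tweet_pipeline.py
```

Check that the two versions match the ones found from &#039;&#039;$SPARK_HOME/jars/spark-sql*&#039;&#039; earlier.&lt;br /&gt;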
&lt;br /&gt;
== Web UIs ==&lt;br /&gt;
Assuming that 158.39.77.227 is the IPv4 address of &#039;&#039;spark-driver&#039;&#039;, you can access Zookeeper&#039;s very simple web UI at http://158.39.77.227:8080/commands/stat .&lt;br /&gt;
&lt;br /&gt;
Kafka has no built-in web UI, but third-party UIs are available. &amp;lt;!-- If you [https://docs.docker.com/engine/install/ install Docker] on &#039;&#039;spark-driver&#039;&#039;, you can run something like:&lt;br /&gt;
 docker run -it -p 9001:9000 \&lt;br /&gt;
     -e KAFKA_BROKERCONNECT=${MASTER_NODE}:9092 \&lt;br /&gt;
     obsidiandynamics/kafdrop&lt;br /&gt;
When the docker runs, you can access the Kafka UI through a web browser at http://158.39.77.227:9001 (as usual assuming that 158.39.77.227 is the IPv4 address of &#039;&#039;spark-driver&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
You can install docker on &#039;&#039;spark-driver&#039;&#039; like this:&lt;br /&gt;
 sudo apt install docker.io&lt;br /&gt;
 sudo groupadd docker&lt;br /&gt;
 sudo usermod -aG docker ${USER}&lt;br /&gt;
&#039;&#039;Log out (&#039;exit&#039;) and then log in again (&#039;ssh spark-driver&#039;).&#039;&#039;--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Install_Kafka_on_the_cluster&amp;diff=1243</id>
		<title>Install Kafka on the cluster</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Install_Kafka_on_the_cluster&amp;diff=1243"/>
		<updated>2022-12-01T11:13:29Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Configure Kafka */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Install Zookeeper ==&lt;br /&gt;
On each instance, go to https://zookeeper.apache.org/releases.html#download . Download and unpack a recent binary distribution to &#039;&#039;~/volume&#039;&#039;. For example:&lt;br /&gt;
 cd ~/volume&lt;br /&gt;
 wget https://dlcdn.apache.org/zookeeper/zookeeper-3.8.0/apache-zookeeper-3.8.0-bin.tar.gz&lt;br /&gt;
 tar zxvf apache-zookeeper-3.8.0-bin.tar.gz&lt;br /&gt;
 rm apache-zookeeper-3.8.0-bin.tar.gz&lt;br /&gt;
 ln -fns apache-zookeeper-3.8.0-bin zookeeper&lt;br /&gt;
&lt;br /&gt;
== Configure Zookeeper ==&lt;br /&gt;
On your local computer, create the file &#039;&#039;zookeeper.properties&#039;&#039;:&lt;br /&gt;
 dataDir=/home/ubuntu/volume/zookeeper/data&lt;br /&gt;
 clientPort=2181&lt;br /&gt;
 maxClientCnxns=200&lt;br /&gt;
 tickTime=2000&lt;br /&gt;
 initLimit=20&lt;br /&gt;
 syncLimit=10&lt;br /&gt;
&lt;br /&gt;
From your local computer, upload:&lt;br /&gt;
 scp zookeeper.properties spark-driver:&lt;br /&gt;
 scp zookeeper.properties spark-worker-1:&lt;br /&gt;
 scp zookeeper.properties spark-worker-2:&lt;br /&gt;
We will not run zookeeper on spark-worker-3, because the number of hosts must be odd to allow majority voting.&lt;br /&gt;
 &lt;br /&gt;
You can also run this as a bash-loop:&lt;br /&gt;
 for host in spark-{driver,worker-{1,2}}: ; do&lt;br /&gt;
     scp zookeeper.properties $host: ;&lt;br /&gt;
 done&lt;br /&gt;
&lt;br /&gt;
In any case, you need to run the following loop on your local computer to set the &amp;quot;id&amp;quot; of each zookeeper (and later kafka) node:&lt;br /&gt;
 i=1&lt;br /&gt;
 for host in spark-{driver,worker-{1,2,3}}: ; do&lt;br /&gt;
     echo $i &amp;gt; myid; scp myid $host; ((i++));&lt;br /&gt;
 done&lt;br /&gt;
&lt;br /&gt;
On each instance &#039;&#039;except spark-worker-3&#039;&#039;, add addresses from &#039;&#039;/etc/hosts&#039;&#039; to &#039;&#039;zookeeper.properties&#039;&#039;:&lt;br /&gt;
 mv ~/zookeeper.properties ~/volume/zookeeper/conf/&lt;br /&gt;
 i=1&lt;br /&gt;
 for line in $(grep spark- /etc/hosts | cut -d&amp;quot; &amp;quot; -f1 | head -3); do&lt;br /&gt;
     echo &amp;quot;server.$i=$line:2888:3888&amp;quot;; ((i++));&lt;br /&gt;
 done &amp;gt;&amp;gt; ~/volume/zookeeper/conf/zookeeper.properties&lt;br /&gt;
 mkdir -p /home/ubuntu/volume/zookeeper/data&lt;br /&gt;
 mv ~/myid /home/ubuntu/volume/zookeeper/data&lt;br /&gt;
&lt;br /&gt;
Set ZOOKEEPER_HOME and add bin/ to PATH:&lt;br /&gt;
 export ZOOKEEPER_HOME=/home/ubuntu/volume/zookeeper&lt;br /&gt;
 export PATH=${PATH}:${ZOOKEEPER_HOME}/bin&lt;br /&gt;
 cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)&lt;br /&gt;
 echo &amp;quot;export ZOOKEEPER_HOME=/home/ubuntu/volume/zookeeper&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
 echo &amp;quot;export PATH=\${PATH}:\${ZOOKEEPER_HOME}/bin&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
== Test run zookeeper ==&lt;br /&gt;
On &#039;&#039;spark-driver&#039;&#039; and &#039;&#039;-worker-1&#039;&#039; and &#039;&#039;-2&#039;&#039;:&lt;br /&gt;
 zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&lt;br /&gt;
This will start zookeeper in the background on each host. Afterwards, check its status:&lt;br /&gt;
 zkServer.sh status ${ZOOKEEPER_HOME}/conf/zookeeper.properties&lt;br /&gt;
One of the zookeepers is the leader, the two others are followers. In case of trouble, check ${ZOOKEEPER_HOME}/logs/ .&lt;br /&gt;
&lt;br /&gt;
To stop zookeeper, do&lt;br /&gt;
 zkServer.sh stop ${ZOOKEEPER_HOME}/conf/zookeeper.properties&lt;br /&gt;
on each host.&lt;br /&gt;
&lt;br /&gt;
== Install Kafka ==&lt;br /&gt;
On each instance (including &#039;&#039;spark-worker-3&#039;&#039;), go to http://kafka.apache.org/downloads . Download and unpack a recent binary distribution to &#039;&#039;~/volume&#039;&#039;. It is practical, but not critical, to install a Kafka that runs on the same Scala version as your Spark. If you want to know the Scala version of your installed Spark, you can check this file:&lt;br /&gt;
 ls $SPARK_HOME/jars/spark-sql*&lt;br /&gt;
For example, &#039;&#039;.../spark-sql_2.12-3.3.0.jar&#039;&#039; means that you run Scala 2.12 and Spark 3.3.0.&lt;br /&gt;
&lt;br /&gt;
On each instance:&lt;br /&gt;
 cd ~/volume&lt;br /&gt;
 wget https://downloads.apache.org/kafka/3.3.1/kafka_2.12-3.3.1.tgz&lt;br /&gt;
 tar xzvf kafka_2.12-3.3.1.tgz&lt;br /&gt;
 rm kafka_2.12-3.3.1.tgz&lt;br /&gt;
 ln -fns kafka_2.12-3.3.1 kafka&lt;br /&gt;
&lt;br /&gt;
== Configure Kafka ==&lt;br /&gt;
On each instance, configure Kafka:&lt;br /&gt;
 cp ~/volume/kafka/config/server.properties ~/volume/kafka/config/server.properties.original&lt;br /&gt;
 sed -i &amp;quot;s/broker.id=/broker.id=$(cat /tmp/zookeeper/myid)/g&amp;quot; ~/volume/kafka/config/server.properties&lt;br /&gt;
 export ZOOKEEPER_HOSTLIST=$(grep spark- /etc/hosts | cut -d&amp;quot; &amp;quot; -f1 | head -3 | paste -sd,)&lt;br /&gt;
 echo &amp;quot;export ZOOKEEPER_HOSTLIST=\$(grep spark- /etc/hosts | cut -d\&amp;quot; \&amp;quot; -f1 | head -3 | paste -sd,)&amp;quot; &amp;gt;&amp;gt; ~/.bashrc&lt;br /&gt;
 sed -i &amp;quot;s/zookeeper.connect=localhost:2181/zookeeper.connect=$ZOOKEEPER_HOSTLIST/g&amp;quot; ~/volume/kafka/config/server.properties&lt;br /&gt;
 local_ip=$(ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1) ; \&lt;br /&gt;
 echo &amp;quot;advertised.host.name=$local_ip&amp;quot; &amp;gt;&amp;gt; ~/volume/kafka/config/server.properties&lt;br /&gt;
&lt;br /&gt;
Set KAFKA_HOME and add bin/ to PATH:&lt;br /&gt;
 export KAFKA_HOME=/home/ubuntu/volume/kafka&lt;br /&gt;
 export PATH=${PATH}:${KAFKA_HOME}/bin&lt;br /&gt;
 cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)&lt;br /&gt;
 echo &amp;quot;export KAFKA_HOME=/home/ubuntu/volume/kafka&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
 echo &amp;quot;export PATH=\${PATH}:\${KAFKA_HOME}/bin&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
== Test run Kafka ==&lt;br /&gt;
Ensure Zookeeper is still running (or restart it) on the three nodes. On all four nodes:&lt;br /&gt;
 kafka-server-start.sh -daemon ${KAFKA_HOME}/config/server.properties&lt;br /&gt;
(It is a good sign if nothing happens.)&lt;br /&gt;
&lt;br /&gt;
Here are some test commands you can run on the different instances:&lt;br /&gt;
 kafka-topics.sh --create --topic test --replication-factor 2 --partitions 3 --bootstrap-server ${MASTER_NODE}:9092&lt;br /&gt;
 kafka-topics.sh --list --bootstrap-server ${MASTER_NODE}:9092&lt;br /&gt;
&lt;br /&gt;
Run these two commands on different instances to see that lines you type into the &#039;&#039;producer&#039;&#039; console show up in the &#039;&#039;consumer&#039;&#039;:&lt;br /&gt;
 bin/kafka-console-producer.sh --topic test --bootstrap-server ${MASTER_NODE}:9092&lt;br /&gt;
 bin/kafka-console-consumer.sh --topic test --bootstrap-server ${MASTER_NODE}:9092 --from-beginning&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task 2:&#039;&#039;&#039; Run the full Twitter pipeline from Exercise 3. Tips:&lt;br /&gt;
* You may want to create a virtual environment to install Python packages such as &#039;&#039;tweepy&#039;&#039;, &#039;&#039;kafka-python&#039;&#039;, and &#039;&#039;afinn&#039;&#039;.&lt;br /&gt;
* You need to create &#039;&#039;keys/bearer_token&#039;&#039; with restricted access.&lt;br /&gt;
* With the default settings, it can take time before the final join is written to console. Be patient...&lt;br /&gt;
* Start-up sequence from &#039;&#039;spark-driver&#039;&#039;:&lt;br /&gt;
 # assuming zookeeper and kafka run already&lt;br /&gt;
 # create kafka topics&lt;br /&gt;
 kafka-topics.sh --create --topic tweets --bootstrap-server localhost:9092&lt;br /&gt;
 kafka-topics.sh --create --topic media --bootstrap-server localhost:9092&lt;br /&gt;
 kafka-topics.sh --create --topic photos --bootstrap-server localhost:9092&lt;br /&gt;
 # optional monitoring&lt;br /&gt;
 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic tweets --from-beginning &amp;amp;&lt;br /&gt;
 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic media --from-beginning &amp;amp;&lt;br /&gt;
 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic photos --from-beginning &amp;amp;&lt;br /&gt;
 # main programs&lt;br /&gt;
 python twitterpipe_tweet_harvester.py  # run this repeatedly or modify to reconnect after sleeping&lt;br /&gt;
 python twitterpipe_media_harvester.py &amp;amp;  # this one can run in background&lt;br /&gt;
 spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 twitterpipe_tweet_pipeline.py &amp;amp;&lt;br /&gt;
Because you no longer use &#039;&#039;findspark&#039;&#039;, you need to use the &#039;&#039;&#039;--packages&#039;&#039;&#039; option with &#039;&#039;&#039;spark-submit&#039;&#039;&#039;. Make sure you use exactly the same package as you did with &#039;&#039;findspark.add_packages(...)&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Web UIs ==&lt;br /&gt;
Assuming that 158.39.77.227 is the IPv4 address of &#039;&#039;spark-driver&#039;&#039;, you can access Zookeeper&#039;s very simple web UI at http://158.39.77.227:8080/commands/stat .&lt;br /&gt;
&lt;br /&gt;
Kafka has no built-in web UI, but third-party UIs are available. &amp;lt;!-- If you [https://docs.docker.com/engine/install/ install Docker] on &#039;&#039;spark-driver&#039;&#039;, you can run something like:&lt;br /&gt;
 docker run -it -p 9001:9000 \&lt;br /&gt;
     -e KAFKA_BROKERCONNECT=${MASTER_NODE}:9092 \&lt;br /&gt;
     obsidiandynamics/kafdrop&lt;br /&gt;
When the docker runs, you can access the Kafka UI through a web browser at http://158.39.77.227:9001 (as usual assuming that 158.39.77.227 is the IPv4 address of &#039;&#039;spark-driver&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
You can install docker on &#039;&#039;spark-driver&#039;&#039; like this:&lt;br /&gt;
 sudo apt install docker.io&lt;br /&gt;
 sudo groupadd docker&lt;br /&gt;
 sudo usermod -aG docker ${USER}&lt;br /&gt;
&#039;&#039;Log out (&#039;exit&#039;) and then log in again (&#039;ssh spark-driver&#039;).&#039;&#039;--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Install_Kafka_on_the_cluster&amp;diff=1242</id>
		<title>Install Kafka on the cluster</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Install_Kafka_on_the_cluster&amp;diff=1242"/>
		<updated>2022-12-01T11:09:03Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Test run zookeeper */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Install Zookeeper ==&lt;br /&gt;
On each instance, go to https://zookeeper.apache.org/releases.html#download . Download and unpack a recent binary distribution to &#039;&#039;~/volume&#039;&#039;. For example:&lt;br /&gt;
 cd ~/volume&lt;br /&gt;
 wget https://dlcdn.apache.org/zookeeper/zookeeper-3.8.0/apache-zookeeper-3.8.0-bin.tar.gz&lt;br /&gt;
 tar zxvf apache-zookeeper-3.8.0-bin.tar.gz&lt;br /&gt;
 rm apache-zookeeper-3.8.0-bin.tar.gz&lt;br /&gt;
 ln -fns apache-zookeeper-3.8.0-bin zookeeper&lt;br /&gt;
&lt;br /&gt;
== Configure Zookeeper ==&lt;br /&gt;
On your local computer, create the file &#039;&#039;zookeeper.properties&#039;&#039;:&lt;br /&gt;
 dataDir=/home/ubuntu/volume/zookeeper/data&lt;br /&gt;
 clientPort=2181&lt;br /&gt;
 maxClientCnxns=200&lt;br /&gt;
 tickTime=2000&lt;br /&gt;
 initLimit=20&lt;br /&gt;
 syncLimit=10&lt;br /&gt;
&lt;br /&gt;
From your local computer, upload:&lt;br /&gt;
 scp zookeeper.properties spark-driver:&lt;br /&gt;
 scp zookeeper.properties spark-worker-1:&lt;br /&gt;
 scp zookeeper.properties spark-worker-2:&lt;br /&gt;
We will not run zookeeper on spark-worker-3, because the number of hosts must be odd to allow majority voting.&lt;br /&gt;
 &lt;br /&gt;
You can also run this as a bash-loop:&lt;br /&gt;
 for host in spark-{driver,worker-{1,2}}: ; do&lt;br /&gt;
     scp zookeeper.properties $host ;&lt;br /&gt;
 done&lt;br /&gt;
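As a side note, the brace expansion in the loop already appends the colon that &#039;&#039;scp&#039;&#039; expects after a host name, so no extra colon is needed in the &#039;&#039;scp&#039;&#039; command itself. A quick local check (plain bash; the host names are just strings here, no remote hosts involved):&lt;br /&gt;

```shell
# Expand the same pattern locally; each resulting word already ends in the ':' that scp needs
hosts=$(echo spark-{driver,worker-{1,2}}:)
echo "$hosts"   # spark-driver: spark-worker-1: spark-worker-2:
```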
&lt;br /&gt;
In any case, you need to run this loop on your local computer, to set the &amp;quot;id&amp;quot; of each zookeeper (and later kafka) node:&lt;br /&gt;
 i=1&lt;br /&gt;
 for host in spark-{driver,worker-{1,2,3}}: ; &lt;br /&gt;
     do echo $i &amp;gt; myid; scp myid $host; ((i++));&lt;br /&gt;
 done&lt;br /&gt;
&lt;br /&gt;
On each instance &#039;&#039;except spark-worker-3&#039;&#039;, add addresses from &#039;&#039;/etc/hosts&#039;&#039; to &#039;&#039;zookeeper.properties&#039;&#039;:&lt;br /&gt;
 mv ~/zookeeper.properties ~/volume/zookeeper/conf/&lt;br /&gt;
 i=1&lt;br /&gt;
 for line in $(grep spark- /etc/hosts | cut -d&amp;quot; &amp;quot; -f1 | head -3); do&lt;br /&gt;
     echo &amp;quot;server.$i=$line:2888:3888&amp;quot;; ((i++));&lt;br /&gt;
 done &amp;gt;&amp;gt; ~/volume/zookeeper/conf/zookeeper.properties&lt;br /&gt;
 mkdir -p /home/ubuntu/volume/zookeeper/data&lt;br /&gt;
 mv ~/myid /home/ubuntu/volume/zookeeper/data&lt;br /&gt;
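If you want to sanity-check the server-entry loop before touching the real configuration, you can dry-run it against a mock hosts file on your local computer (the 10.0.0.x addresses below are made up for the demonstration):&lt;br /&gt;

```shell
# Mock /etc/hosts with four instances (made-up private addresses)
printf '10.0.0.1 spark-driver\n10.0.0.2 spark-worker-1\n10.0.0.3 spark-worker-2\n10.0.0.4 spark-worker-3\n' > hosts.demo
# Same loop as above, reading the mock file instead of /etc/hosts
i=1
for line in $(grep spark- hosts.demo | cut -d" " -f1 | head -3); do
    echo "server.$i=$line:2888:3888"; i=$((i+1));
done > entries.demo
cat entries.demo
```

The three lines written to &#039;&#039;entries.demo&#039;&#039; match what the real loop appends to &#039;&#039;zookeeper.properties&#039;&#039;: one server.N entry per voting host, and none for &#039;&#039;spark-worker-3&#039;&#039;.&lt;br /&gt;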
&lt;br /&gt;
Set ZOOKEEPER_HOME and add bin/ to PATH:&lt;br /&gt;
 export ZOOKEEPER_HOME=/home/ubuntu/volume/zookeeper&lt;br /&gt;
 export PATH=${PATH}:${ZOOKEEPER_HOME}/bin&lt;br /&gt;
 cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)&lt;br /&gt;
 echo &amp;quot;export ZOOKEEPER_HOME=/home/ubuntu/volume/zookeeper&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
 echo &amp;quot;export PATH=\${PATH}:\${ZOOKEEPER_HOME}/bin&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
== Test run zookeeper ==&lt;br /&gt;
On &#039;&#039;spark-driver&#039;&#039;, &#039;&#039;spark-worker-1&#039;&#039;, and &#039;&#039;spark-worker-2&#039;&#039;:&lt;br /&gt;
 zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&lt;br /&gt;
This will start zookeeper in the background on each host. Afterwards, check its status:&lt;br /&gt;
 zkServer.sh status ${ZOOKEEPER_HOME}/conf/zookeeper.properties&lt;br /&gt;
One of the zookeepers is the leader, the two others are followers. In case of trouble, check ${ZOOKEEPER_HOME}/logs/ .&lt;br /&gt;
&lt;br /&gt;
To stop zookeeper, do&lt;br /&gt;
 zkServer.sh stop ${ZOOKEEPER_HOME}/conf/zookeeper.properties&lt;br /&gt;
on each host.&lt;br /&gt;
&lt;br /&gt;
== Install Kafka ==&lt;br /&gt;
On each instance (including &#039;&#039;spark-worker-3&#039;&#039;), go to http://kafka.apache.org/downloads . Download and unpack a recent binary distribution to &#039;&#039;~/volume&#039;&#039;. It is practical, but not critical, to install a Kafka that runs on the same Scala version as your Spark. If you want to know the Scala version of your installed Spark, you can check the name of the spark-sql jar:&lt;br /&gt;
 ls $SPARK_HOME/jars/spark-sql*&lt;br /&gt;
For example, &#039;&#039;.../spark-sql_2.12-3.3.0.jar&#039;&#039; means that you run Scala 2.12 and Spark 3.3.0.&lt;br /&gt;
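If you prefer, the two version numbers can be pulled out of the jar name with plain shell parameter expansion (the jar name below is the example above, hard-coded for illustration):&lt;br /&gt;

```shell
# Example jar name, as in the text above
jar=spark-sql_2.12-3.3.0.jar
versions=${jar%.jar}             # strip the .jar suffix
versions=${versions#spark-sql_}  # strip the spark-sql_ prefix -> 2.12-3.3.0
scala=${versions%%-*}            # Scala version: 2.12
spark=${versions#*-}             # Spark version: 3.3.0
echo "Scala $scala, Spark $spark"
```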
&lt;br /&gt;
On each instance:&lt;br /&gt;
 cd ~/volume&lt;br /&gt;
 wget https://downloads.apache.org/kafka/3.3.1/kafka_2.12-3.3.1.tgz&lt;br /&gt;
 tar xzvf kafka_2.12-3.3.1.tgz&lt;br /&gt;
 rm kafka_2.12-3.3.1.tgz&lt;br /&gt;
 ln -fns kafka_2.12-3.3.1 kafka&lt;br /&gt;
&lt;br /&gt;
== Configure Kafka ==&lt;br /&gt;
On each instance, configure Kafka:&lt;br /&gt;
 cp ~/volume/kafka/config/server.properties ~/volume/kafka/config/server.properties.original&lt;br /&gt;
 sed -i &amp;quot;s/broker.id=0/broker.id=$(cat ~/volume/zookeeper/data/myid 2&amp;gt;/dev/null || cat ~/myid)/g&amp;quot; ~/volume/kafka/config/server.properties&lt;br /&gt;
 export ZOOKEEPER_HOSTLIST=$(grep spark- /etc/hosts | cut -d&amp;quot; &amp;quot; -f1 | head -3 | paste -sd,)&lt;br /&gt;
 echo &amp;quot;export ZOOKEEPER_HOSTLIST=\$(grep spark- /etc/hosts | cut -d\&amp;quot; \&amp;quot; -f1 | head -3 | paste -sd,)&amp;quot; &amp;gt;&amp;gt; ~/.bashrc&lt;br /&gt;
 sed -i &amp;quot;s/zookeeper.connect=localhost:2181/zookeeper.connect=$ZOOKEEPER_HOSTLIST/g&amp;quot; ~/volume/kafka/config/server.properties&lt;br /&gt;
 local_ip=$(ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1) ; \&lt;br /&gt;
 echo &amp;quot;advertised.listeners=PLAINTEXT://$local_ip:9092&amp;quot; &amp;gt;&amp;gt; ~/volume/kafka/config/server.properties&lt;br /&gt;
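These &#039;&#039;sed&#039;&#039; edits fail silently if the pattern does not match, so it can be worth rehearsing the broker.id substitution on a throwaway copy first (no broker or myid file needed; the id 2 below is made up):&lt;br /&gt;

```shell
# Throwaway stand-in for server.properties with the stock broker.id line
printf 'broker.id=0\nzookeeper.connect=localhost:2181\n' > server.properties.demo
myid=2   # on a real instance this would come from the myid file
sed -i "s/broker.id=0/broker.id=$myid/g" server.properties.demo
grep '^broker.id' server.properties.demo   # broker.id=2
```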
&lt;br /&gt;
Set KAFKA_HOME and add bin/ to PATH:&lt;br /&gt;
 export KAFKA_HOME=/home/ubuntu/volume/kafka&lt;br /&gt;
 export PATH=${PATH}:${KAFKA_HOME}/bin&lt;br /&gt;
 cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)&lt;br /&gt;
 echo &amp;quot;export KAFKA_HOME=/home/ubuntu/volume/kafka&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
 echo &amp;quot;export PATH=\${PATH}:\${KAFKA_HOME}/bin&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
== Test run Kafka ==&lt;br /&gt;
Ensure Zookeeper is still running (or restart it) on the three nodes. On all four nodes:&lt;br /&gt;
 kafka-server-start.sh -daemon ${KAFKA_HOME}/config/server.properties&lt;br /&gt;
(It is a good sign if nothing happens.)&lt;br /&gt;
&lt;br /&gt;
Here are some test commands you can run on the different instances:&lt;br /&gt;
 kafka-topics.sh --create --topic test --replication-factor 2 --partitions 3 --bootstrap-server ${MASTER_NODE}:9092&lt;br /&gt;
 kafka-topics.sh --list --bootstrap-server ${MASTER_NODE}:9092&lt;br /&gt;
&lt;br /&gt;
Run these two commands on different instances to see that lines you type into the &#039;&#039;producer&#039;&#039; console show up in the &#039;&#039;consumer&#039;&#039;:&lt;br /&gt;
 kafka-console-producer.sh --topic test --bootstrap-server ${MASTER_NODE}:9092&lt;br /&gt;
 kafka-console-consumer.sh --topic test --bootstrap-server ${MASTER_NODE}:9092 --from-beginning&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task 2:&#039;&#039;&#039; Run the full Twitter pipeline from Exercise 3. Tips:&lt;br /&gt;
* You may want to create a virtual environment to install Python packages such as &#039;&#039;tweepy&#039;&#039;, &#039;&#039;kafka-python&#039;&#039;, and &#039;&#039;afinn&#039;&#039;.&lt;br /&gt;
* You need to create &#039;&#039;keys/bearer_token&#039;&#039; with restricted access.&lt;br /&gt;
* With the default settings, it can take time before the final join is written to console. Be patient...&lt;br /&gt;
* Start-up sequence from &#039;&#039;spark-driver&#039;&#039;:&lt;br /&gt;
 # assuming zookeeper and kafka run already&lt;br /&gt;
 # create kafka topics&lt;br /&gt;
 kafka-topics.sh --create --topic tweets --bootstrap-server localhost:9092&lt;br /&gt;
 kafka-topics.sh --create --topic media --bootstrap-server localhost:9092&lt;br /&gt;
 kafka-topics.sh --create --topic photos --bootstrap-server localhost:9092&lt;br /&gt;
 # optional monitoring&lt;br /&gt;
 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic tweets --from-beginning &amp;amp;&lt;br /&gt;
 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic media --from-beginning &amp;amp;&lt;br /&gt;
 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic photos --from-beginning &amp;amp;&lt;br /&gt;
 # main programs&lt;br /&gt;
 python twitterpipe_tweet_harvester.py  # run this repeatedly or modify to reconnect after sleeping&lt;br /&gt;
 python twitterpipe_media_harvester.py &amp;amp;  # this one can run in background&lt;br /&gt;
 spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 twitterpipe_tweet_pipeline.py &amp;amp;&lt;br /&gt;
Because you no longer use &#039;&#039;findspark&#039;&#039;, you need to use the &#039;&#039;&#039;--packages&#039;&#039;&#039; option with &#039;&#039;&#039;spark-submit&#039;&#039;&#039;. Make sure you use exactly the same package as you did with &#039;&#039;findspark.add_packages(...)&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Web UIs ==&lt;br /&gt;
Assuming that 158.39.77.227 is the IPv4 address of &#039;&#039;spark-driver&#039;&#039;, you can access Zookeeper&#039;s very simple web UI at http://158.39.77.227:8080/commands/stat .&lt;br /&gt;
&lt;br /&gt;
Kafka has no built-in web UI, but third-party UIs are available. &amp;lt;!-- If you [https://docs.docker.com/engine/install/ install Docker] on &#039;&#039;spark-driver&#039;&#039;, you can run something like:&lt;br /&gt;
 docker run -it -p 9001:9000 \&lt;br /&gt;
     -e KAFKA_BROKERCONNECT=${MASTER_NODE}:9092 \&lt;br /&gt;
     obsidiandynamics/kafdrop&lt;br /&gt;
When the docker runs, you can access the Kafka UI through a web browser at http://158.39.77.227:9001 (as usual assuming that 158.39.77.227 is the IPv4 address of &#039;&#039;spark-driver&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
You can install docker on &#039;&#039;spark-driver&#039;&#039; like this:&lt;br /&gt;
 sudo apt install docker.io&lt;br /&gt;
 sudo groupadd docker&lt;br /&gt;
 sudo usermod -aG docker ${USER}&lt;br /&gt;
&#039;&#039;Log out (&#039;exit&#039;) and then login in again (&#039;ssh spark-driver&#039;).&#039;&#039;--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1241</id>
		<title>Essay</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1241"/>
		<updated>2022-11-15T16:54:06Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Essay presentations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The essay shall present and discuss selected theory, technology and tools related to big data.&lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite essays that are related to your group assignments, for example discussing privacy aspects of your project (if any).&lt;br /&gt;
&lt;br /&gt;
== Theme ==&lt;br /&gt;
Your mandatory essay is supposed to be &amp;quot;an individual, theoretical essay on chosen topic&amp;quot;. In practice, it is you who propose a theme, and then I either accept it as is or guide you towards a more suitable theme.&lt;br /&gt;
&lt;br /&gt;
I invite essays that are not purely theoretical, but that present and reflect on your work with the group assignment. For example, the essay can discuss privacy-related aspects of your project. But in any case, your essay should demonstrate &amp;quot;thoughtful research and discussion&amp;quot;. It should be well backed by scholarly literature. &lt;br /&gt;
&lt;br /&gt;
== Finding a theme ==&lt;br /&gt;
It is a good idea to start thinking about possible essay themes early in the semester, at least to get the process started! Send me your ideas in an email, and I will provide early feedback. All I need is a few keywords or a sentence about one or more ideas you are considering. Later we can talk face to face.&lt;br /&gt;
&lt;br /&gt;
* Here is an example of an initial idea at the keyword stage: &amp;quot;Ethics and Privacy in using social media for news production.&amp;quot;&lt;br /&gt;
* A step further: &amp;quot;I want to look into how open and non-sensitive information collected from independent sources about the same individuals can become sensitive when recombined.&amp;quot;&lt;br /&gt;
* Here is an even more developed idea: &amp;quot;How can potentially sensitive information be identified in big-data streams for news production? And how can guards be put in place to ensure that sensitive information is not created as part of the news production stream?&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme ==&lt;br /&gt;
Everyone who intends to take the course must send an essay proposal by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no]. The subject line must contain the string &amp;quot;INFO319 Essay Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Proposal deadline:&#039;&#039;&#039; Wednesday October 12th 1500.&lt;br /&gt;
&lt;br /&gt;
== Essay presentations ==&lt;br /&gt;
&#039;&#039;&#039;Final essay presentations:&#039;&#039;&#039; Thursday November 24th 1015&lt;br /&gt;
&lt;br /&gt;
10 minutes and around 5-8 slides are an appropriate length for your presentations. Then we will have around 5 minutes for questions and comments.&lt;br /&gt;
&lt;br /&gt;
Examples of things you might want to touch in your presentation are: &lt;br /&gt;
* your problem and why it is important&lt;br /&gt;
* the most central theory/literature you have found for your work&lt;br /&gt;
* the question(s) you want to answer/topic(s) you want to illuminate&lt;br /&gt;
* your choice of method if you want to do something empirical&lt;br /&gt;
* what you have done so far&lt;br /&gt;
* what you hope to achieve&lt;br /&gt;
* what could be the most important outcomes of your essay (and are there pitfalls?)&lt;br /&gt;
&lt;br /&gt;
The 10 minutes include getting your slides up and running. This means that you need to be really well prepared. A short presentation is much harder than a long one: &amp;quot;Lincoln [... w]hen asked to appear upon some important occasion and deliver a five-minute speech, he said that he had no time to prepare five-minute speeches, but that he could go and speak an hour at any time.&amp;quot; (H. H. Markham, Governor of California, 1893)&lt;br /&gt;
&lt;br /&gt;
== Essay submission ==&lt;br /&gt;
&#039;&#039;&#039;Final essay submission:&#039;&#039;&#039; December 12th.&lt;br /&gt;
&lt;br /&gt;
Submit your essay through Inspera as a single PDF file. The version of your essay that you submit should be anonymous.&lt;br /&gt;
&lt;br /&gt;
Since the essay is graded, this is an official deadline. Submission is through Inspera, and if you do not submit on time, you will not be allowed to take the course exam a week later.&lt;br /&gt;
&lt;br /&gt;
Suggested length is 3000-4000 words. But effort and quality are much more important than length.&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1240</id>
		<title>Programming project</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1240"/>
		<updated>2022-11-15T16:53:04Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Project presentations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The project shall develop an application that uses big data technologies on social-media and/or other open data. At least a part of the project shall use Spark and run in the NREC cloud. The project should be carried out in groups of three, and never more. Working individually or in pairs is possible, but not recommended. &lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite projects that use &#039;&#039;big data for the news&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
More information about possible projects, deadlines, and other requirements will appear here soon.&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme: deadline ==&lt;br /&gt;
Everyone who intends to take the course must be included in a project proposal sent by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no] and with all the group members on Cc. The subject line must contain the string &amp;quot;INFO319 Project Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Proposal deadline:&#039;&#039;&#039; Wednesday October 12th 1500&lt;br /&gt;
&amp;lt;!-- The proposal does not have to be long, but the following points must be made clear:&lt;br /&gt;
*    What you are planning to make using big data and big data technologies.&lt;br /&gt;
*    Why it is a good idea to use big data and big data technologies for this purpose.&lt;br /&gt;
*    Exactly what you think is new with your idea.&lt;br /&gt;
*    What you have done to ensure that something very similar has not been done before.&lt;br /&gt;
*    Which datasets you are planning to use.&lt;br /&gt;
*    What technologies (programming language, libraries, development and collaboration tools) you are planning to use. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Project presentations ==&lt;br /&gt;
&#039;&#039;&#039;Final project presentations:&#039;&#039;&#039; Thursday December 8th 1015&lt;br /&gt;
&lt;br /&gt;
Depending a little on the number of project groups, each presentation will be brief: 15 minutes for each group + 5 minutes for questions and comments.&lt;br /&gt;
&lt;br /&gt;
You may demonstrate your project live (most convincing), or you may replay a recorded demonstration (which is good to have as a backup in any case). In addition, I expect each presentation to address/answer at least these points:&lt;br /&gt;
*    what have you made? - or: what is your application doing?&lt;br /&gt;
*    which technologies have you used (languages, APIs, other software etc.)&lt;br /&gt;
*    where did you get your data from? - and/or: which datasets have you used?&lt;br /&gt;
*    why is it a good idea to do this using big data and big data technologies? - or: what does your system do that was not possible (or at least not easy) to do before?&lt;br /&gt;
*    exactly what have you done and programmed so far?&lt;br /&gt;
*    what are you planning to do in the final few days?&lt;br /&gt;
*    have you got any particular problems you need to address?&lt;br /&gt;
&lt;br /&gt;
== Project submission ==&lt;br /&gt;
&#039;&#039;&#039;Final project submission:&#039;&#039;&#039; December 12th. &lt;br /&gt;
&lt;br /&gt;
Submit your project through Inspera as a single ZIP archive. The version of your project that you submit should be anonymous.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Since the project is graded, this is an official deadline. If you do not submit on time, you will not be allowed to take the course exam a week later.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Provide a short video (max 5 minutes) that shows your system running, with voice comments.&lt;br /&gt;
&lt;br /&gt;
Comment your code sparingly and in-line. You do not need additional documentation, but you should provide a precise description of how to run your system. For example, explain:&lt;br /&gt;
* which additional packages need to be installed&lt;br /&gt;
* which datasets need to be downloaded&lt;br /&gt;
** do not include large datasets (&amp;gt;10 MB) in your ZIP file&lt;br /&gt;
** but it is fine to include smaller test datasets&lt;br /&gt;
* if credentials (like a Twitter token) are needed to run the code, explain where they must be added&lt;br /&gt;
* which other systems must be running first (e.g., Kafka, HDFS, YARN)&lt;br /&gt;
* how to start your system (in particular if it consists of several programs)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
The end result of the project should be submitted as a ZIP archive through Inspera:&lt;br /&gt;
*     Just one person in the group shall deliver the group assignment (ZIP file) in Inspera.&lt;br /&gt;
*      The groups must be created by the candidates. To do that you must create an ID for your group. The ID must be four digits and all the members must know it, and use it.&lt;br /&gt;
*      When you log in to Inspera you get two options: &amp;quot;Join existing group&amp;quot; or &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      The first member in a group logging in to Inspera must choose: &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      All the other members in the same group must choose: &amp;quot;Join existing group&amp;quot;.&lt;br /&gt;
*      All members in the group must log in to Inspera Assessment (and join the Group), BEFORE one person delivers the group assignment. (Each group can only deliver once)&lt;br /&gt;
&lt;br /&gt;
The submitted ZIP archive should contain your complete project in a single directory. Include a file README.TXT in the root of the project directory to let us know if you have used a particular development environment (like Eclipse), what is needed before your code can be run, and how to run it, and if there are other things to do.&lt;br /&gt;
&lt;br /&gt;
The file name of your archive should contain the student numbers of everyone in the group. (Not that your student number is different from your student card number...). &lt;br /&gt;
&lt;br /&gt;
Your ZIP archive should contain a 2-page project description. Put this description in the root folder of your project directory before you ZIP it. The project description file should be anonymous, and contain the exam numbers of all group members, BOTH on the first page and in the file name (e.g., ProjectDescription_102_113.pdf .)&lt;br /&gt;
&lt;br /&gt;
The length of the project description is max 2 A4 pages with 11pt font and 2.5 cm margins. This is a HARD limit. You can have appendices, though, and any figures or tables come in addition to the two pages. The quality of your code is more important than the quality of the 2 page description. You receive a grade on the project, not on the report.&lt;br /&gt;
&lt;br /&gt;
* You should briefly explain the purpose of your system. Why have you made this? Why is it a good idea to do this using big data technologies? What can you do now that wasn&#039;t possible before?&lt;br /&gt;
* You should probably list the technologies/tools/standards/ datasets you have used and explain briefly why you chose each of them. Did you consider alternatives? Why were the ones you chose better?&lt;br /&gt;
* If you are reading/converting/lifting data from multiple sources and/or using existing tools in addition to your own program, you should probably include a flow chart or architecture sketch (which is different from a class diagram).&lt;br /&gt;
* You should probably include a class diagram and/or data flow diagram of your system.&lt;br /&gt;
* You should mention any particular problems you have had and/or things you want to do differently next time.&lt;br /&gt;
* If you want to briefly describe how to run the code you have submitted, you can do that separately in a README.TXT file.&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1239</id>
		<title>Essay</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1239"/>
		<updated>2022-11-15T16:52:40Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The essay shall present and discuss selected theory, technology and tools related to big data.&lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite essays that are related to your group assignments, for example discussing privacy aspects of your project (if any).&lt;br /&gt;
&lt;br /&gt;
== Theme ==&lt;br /&gt;
Your mandatory essay is supposed to be &amp;quot;an individual, theoretical essay on chosen topic&amp;quot;. In practice, it is you who propose a theme, and then I either accept it as is or guide you towards a more suitable theme.&lt;br /&gt;
&lt;br /&gt;
I invite essays that are not purely theoretical, but that present and reflect on your work with the group assignment. For example, the essay can discuss privacy-related aspects of your project. But in any case, your essay should demonstrate &amp;quot;thoughtful research and discussion&amp;quot;. It should be well backed by scholarly literature. &lt;br /&gt;
&lt;br /&gt;
== Finding a theme ==&lt;br /&gt;
It is a good idea to start thinking about possible essay themes early in the semester, at least to get the process started! Send me your ideas in an email, and I will provide early feedback. All I need is a few keywords or a sentence about one or more ideas you are considering. Later we can talk face to face.&lt;br /&gt;
&lt;br /&gt;
* Here is an example of an initial idea at the keyword stage: &amp;quot;Ethics and Privacy in using social media for news production.&amp;quot;&lt;br /&gt;
* A step further: &amp;quot;I want to look into how open and non-sensitive information collected from independent sources about the same individuals can become sensitive when recombined.&amp;quot;&lt;br /&gt;
* Here is an even more developed idea: &amp;quot;How can potentially sensitive information be identified in big-data streams for news production? And how can guards be put in place to ensure that sensitive information is not created as part of the news production stream?&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme ==&lt;br /&gt;
Everyone who intends to take the course must send an essay proposal by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no]. The subject line must contain the string &amp;quot;INFO319 Essay Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Proposal deadline:&#039;&#039;&#039; Wednesday October 12th 1500.&lt;br /&gt;
&lt;br /&gt;
== Essay presentations ==&lt;br /&gt;
The session on Thursday, November 24th will focus on essay presentations. 10 minutes and around 5-8 slides are an appropriate length. Then we have around 5 minutes for questions and comments.&lt;br /&gt;
&lt;br /&gt;
The 10 minutes include getting your slides up and running. This means that you need to be really well prepared. A short presentation is much harder than a long one: &amp;quot;Lincoln [... w]hen asked to appear upon some important occasion and deliver a five-minute speech, he said that he had no time to prepare five-minute speeches, but that he could go and speak an hour at any time.&amp;quot; (H. H. Markham, Governor of California, 1893)&lt;br /&gt;
&lt;br /&gt;
Examples of things you might want to touch in your presentation are: &lt;br /&gt;
* your problem and why it is important&lt;br /&gt;
* the most central theory/literature you have found for your work&lt;br /&gt;
* the question(s) you want to answer/topic(s) you want to illuminate&lt;br /&gt;
* your choice of method if you want to do something empirical&lt;br /&gt;
* what you have done so far&lt;br /&gt;
* what you hope to achieve&lt;br /&gt;
* what could be the most important outcomes of your essay (and are there pitfalls?)&lt;br /&gt;
&lt;br /&gt;
== Essay submission ==&lt;br /&gt;
&#039;&#039;&#039;Final essay submission:&#039;&#039;&#039; December 12th.&lt;br /&gt;
&lt;br /&gt;
Submit your essay through Inspera as a single PDF file. The version of your essay that you submit should be anonymous.&lt;br /&gt;
&lt;br /&gt;
Since the essay is graded, this is an official deadline. Submission is through Inspera, and if you do not submit on time, you will not be allowed to take the course exam a week later.&lt;br /&gt;
&lt;br /&gt;
Suggested length is 3000-4000 words. But effort and quality are much more important than length.&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1238</id>
		<title>Programming project</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1238"/>
		<updated>2022-11-15T16:44:43Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Proposing a theme: deadline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The project shall develop an application that uses big data technologies on social-media and/or other open data. At least a part of the project shall use Spark and run in the NREC cloud. The project should be carried out in groups of three, and never more. Working individually or in pairs is possible, but not recommended. &lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite projects that use &#039;&#039;big data for the news&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
More information about possible projects, deadlines, and other requirements will appear here soon.&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme: deadline ==&lt;br /&gt;
Everyone who intends to take the course must be included in a project proposal sent by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no] and with all the group members on Cc. The subject line must contain the string &amp;quot;INFO319 Project Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Proposal deadline:&#039;&#039;&#039; Wednesday October 12th 1500&lt;br /&gt;
&amp;lt;!-- The proposal does not have to be long, but the following points must be made clear:&lt;br /&gt;
*    What you are planning to make using big data and big data technologies.&lt;br /&gt;
*    Why it is a good idea to use big data and big data technologies for this purpose.&lt;br /&gt;
*    Exactly what you think is new with your idea.&lt;br /&gt;
*    What you have done to ensure that something very similar has not been done before.&lt;br /&gt;
*    Which datasets you are planning to use.&lt;br /&gt;
*    What technologies (programming language, libraries, development and collaboration tools) you are planning to use. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Project presentations ==&lt;br /&gt;
&#039;&#039;&#039;The final project presentations:&#039;&#039;&#039; Thursday December 8th 1015&lt;br /&gt;
&lt;br /&gt;
Depending a little on the number of project groups, each presentation will be brief: 15 minutes for each group + 5 minutes for questions and comments.&lt;br /&gt;
&lt;br /&gt;
You may demonstrate your project live (most convincing), or you may replay a recorded demonstration (which is good to have as a backup in any case). In addition, I expect each presentation to address/answer at least these points:&lt;br /&gt;
*    what have you made? - or: what is your application doing?&lt;br /&gt;
*    which technologies have you used (languages, APIs, other software etc.)&lt;br /&gt;
*    where did you get your data from? - and/or: which datasets have you used?&lt;br /&gt;
*    why is it a good idea to do this using big data and big data technologies? - or: what does your system do that was not possible (or at least not easy) to do before?&lt;br /&gt;
*    exactly what have you done and programmed so far?&lt;br /&gt;
*    what are you planning to do in the final few days?&lt;br /&gt;
*    have you got any particular problems you need to address?&lt;br /&gt;
&lt;br /&gt;
== Project submission ==&lt;br /&gt;
&#039;&#039;&#039;Final project submission:&#039;&#039;&#039; December 12th. &lt;br /&gt;
&lt;br /&gt;
Submit your project through Inspera as a single ZIP archive. The version of your project that you submit should be anonymous.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Since the project is graded, this is an official deadline. If you do not submit on time, you will not be allowed to take the course exam a week later.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Provide a short video (max 5 minutes) that shows your system running, with voice comments.&lt;br /&gt;
&lt;br /&gt;
Comment your code sparingly and in-line. You do not need additional documentation, but you should provide a precise description of how to run your system. For example, explain:&lt;br /&gt;
* which additional packages need to be installed&lt;br /&gt;
* which datasets need to be downloaded&lt;br /&gt;
** do not include large datasets (&amp;gt;10 MB) in your ZIP file&lt;br /&gt;
** but it is fine to include smaller test datasets&lt;br /&gt;
* if credentials (like a Twitter token) are needed to run the code, explain where they must be added&lt;br /&gt;
* which other systems must be running first (e.g., Kafka, HDFS, YARN)&lt;br /&gt;
* how to start your system (in particular if it consists of several programs)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
The end result of the project should be submitted as a ZIP archive through Inspera:&lt;br /&gt;
*     Just one person in the group shall deliver the group assignment (ZIP file) in Inspera.&lt;br /&gt;
*      The groups must be created by the candidates. To do that you must create an ID for your group. The ID must be four digits and all the members must know it, and use it.&lt;br /&gt;
*      When you log in to Inspera you get two options: &amp;quot;Join existing group&amp;quot; or &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      The first member in a group logging in to Inspera must choose: &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      All the other members in the same group must choose: &amp;quot;Join existing group&amp;quot;.&lt;br /&gt;
*      All members in the group must log in to Inspera Assessment (and join the Group), BEFORE one person delivers the group assignment. (Each group can only deliver once)&lt;br /&gt;
&lt;br /&gt;
The submitted ZIP archive should contain your complete project in a single directory. Include a file README.TXT in the root of the project directory to let us know if you have used a particular development environment (like Eclipse), what is needed before your code can be run, and how to run it, and if there are other things to do.&lt;br /&gt;
&lt;br /&gt;
The file name of your archive should contain the student numbers of everyone in the group. (Note that your student number is different from your student card number.)&lt;br /&gt;
&lt;br /&gt;
Your ZIP archive should contain a 2-page project description. Put this description in the root folder of your project directory before you ZIP it. The project description file should be anonymous, and contain the exam numbers of all group members, BOTH on the first page and in the file name (e.g., ProjectDescription_102_113.pdf).&lt;br /&gt;
&lt;br /&gt;
The length of the project description is max 2 A4 pages with 11pt font and 2.5 cm margins. This is a HARD limit. You can have appendices, though, and any figures or tables come in addition to the two pages. The quality of your code is more important than the quality of the 2 page description. You receive a grade on the project, not on the report.&lt;br /&gt;
&lt;br /&gt;
* You should briefly explain the purpose of your system. Why have you made this? Why is it a good idea to do this using big data technologies? What can you do now that wasn&#039;t possible before?&lt;br /&gt;
* You should probably list the technologies/tools/standards/datasets you have used and explain briefly why you chose each of them. Did you consider alternatives? Why were the ones you chose better?&lt;br /&gt;
* If you are reading/converting/lifting data from multiple sources and/or using existing tools in addition to your own program, you should probably include a flow chart or architecture sketch (which is different from a class diagram).&lt;br /&gt;
* You should probably include a class diagram and/or data flow diagram of your system.&lt;br /&gt;
* You should mention any particular problems you have had and/or things you want to do differently next time.&lt;br /&gt;
* If you want to briefly describe how to run the code you have submitted, you can do that separately in a README.TXT file.&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1237</id>
		<title>Programming project</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1237"/>
		<updated>2022-11-15T16:44:24Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Project submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The project shall develop an application that uses big data technologies on social-media and/or other open data. At least a part of the project shall use Spark and run in the NREC cloud. The project should be carried out in groups of three, and never more. Working individually or in pairs is possible, but not recommended.&lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite projects that use &#039;&#039;big data for the news&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
More information about possible projects, deadlines, and other requirements will appear here soon.&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme: deadline ==&lt;br /&gt;
Everyone who intends to take the course must be included in a project proposal sent by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no] and with all the group members on Cc. The subject line must contain the string &amp;quot;INFO319 Project Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Proposal deadline:&#039;&#039;&#039; Wednesday October 12th 1500&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- The proposal does not have to be long, but the following points must be made clear:&lt;br /&gt;
*    What you are planning to make using big data and big data technologies.&lt;br /&gt;
*    Why it is a good idea to use big data and big data technologies for this purpose.&lt;br /&gt;
*    Exactly what you think is new with your idea.&lt;br /&gt;
*    What you have done to ensure that something very similar has not been done before.&lt;br /&gt;
*    Which datasets you are planning to use.&lt;br /&gt;
*    What technologies (programming language, libraries, development and collaboration tools) you are planning to use. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project presentations ==&lt;br /&gt;
&#039;&#039;&#039;The final project presentations:&#039;&#039;&#039; Thursday December 8th 1015&lt;br /&gt;
&lt;br /&gt;
Depending a little on the number of project groups, each presentation will be brief: 15 minutes for each group + 5 minutes for questions and comments.&lt;br /&gt;
&lt;br /&gt;
You may demonstrate your project live (most convincing), or you may replay a recorded demonstration (which is good to have as a backup in any case). In addition, I expect each presentation to address/answer at least these points:&lt;br /&gt;
*    what have you made? - or: what is your application doing?&lt;br /&gt;
*    which technologies have you used (languages, APIs, other software etc.)&lt;br /&gt;
*    where did you get your data from? - and/or: which datasets have you used?&lt;br /&gt;
*    why is it a good idea to do this using big data and big data technologies? - or: what does your system do that was not possible (or at least not easy) to do before?&lt;br /&gt;
*    exactly what have you done and programmed so far?&lt;br /&gt;
*    what are you planning to do in the final few days?&lt;br /&gt;
*    have you got any particular problems you need to address?&lt;br /&gt;
&lt;br /&gt;
== Project submission ==&lt;br /&gt;
&#039;&#039;&#039;Final project submission:&#039;&#039;&#039; December 12th. &lt;br /&gt;
&lt;br /&gt;
Submit your project through Inspera as a single ZIP archive. The version of your project that you submit should be anonymous.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Since the project is graded, this is an official deadline. If you do not submit on time, you will not be allowed to take the course exam a week later.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Provide a short video (max 5 minutes) that shows your system running, with voice comments.&lt;br /&gt;
&lt;br /&gt;
Comment your code sparsely and in-line. You do not need additional documentation, but you should provide a precise description of how to run your system. For example, explain:&lt;br /&gt;
* which additional packages need to be installed&lt;br /&gt;
* which datasets need to be downloaded&lt;br /&gt;
** do not include large datasets (&amp;gt;10 MB) in your ZIP file&lt;br /&gt;
** but it is fine to include smaller test datasets&lt;br /&gt;
* if credentials (like a Twitter token) are needed to run the code, explain where they must be added&lt;br /&gt;
* which other systems must be running first (e.g., Kafka, HDFS, YARN)&lt;br /&gt;
* how to start your system (in particular if it consists of several programs)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
The end result of the project should be submitted as a ZIP archive through Inspera:&lt;br /&gt;
*     Just one person in the group shall deliver the group assignment (ZIP file) in Inspera.&lt;br /&gt;
*      The groups must be created by the candidates. To do that you must create an ID for your group. The ID must be four digits and all the members must know it, and use it.&lt;br /&gt;
*      When you log in to Inspera you get two options: &amp;quot;Join existing group&amp;quot; or &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      The first member in a group logging in to Inspera must choose: &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      All the other members in the same group must choose: &amp;quot;Join existing group&amp;quot;.&lt;br /&gt;
*      All members in the group must log in to Inspera Assessment (and join the Group), BEFORE one person delivers the group assignment. (Each group can only deliver once)&lt;br /&gt;
&lt;br /&gt;
The submitted ZIP archive should contain your complete project in a single directory. Include a file README.TXT in the root of the project directory to let us know if you have used a particular development environment (like Eclipse), what is needed before your code can be run, and how to run it, and if there are other things to do.&lt;br /&gt;
&lt;br /&gt;
The file name of your archive should contain the student numbers of everyone in the group. (Note that your student number is different from your student card number.)&lt;br /&gt;
&lt;br /&gt;
Your ZIP archive should contain a 2-page project description. Put this description in the root folder of your project directory before you ZIP it. The project description file should be anonymous, and contain the exam numbers of all group members, BOTH on the first page and in the file name (e.g., ProjectDescription_102_113.pdf).&lt;br /&gt;
&lt;br /&gt;
The length of the project description is max 2 A4 pages with 11pt font and 2.5 cm margins. This is a HARD limit. You can have appendices, though, and any figures or tables come in addition to the two pages. The quality of your code is more important than the quality of the 2 page description. You receive a grade on the project, not on the report.&lt;br /&gt;
&lt;br /&gt;
* You should briefly explain the purpose of your system. Why have you made this? Why is it a good idea to do this using big data technologies? What can you do now that wasn&#039;t possible before?&lt;br /&gt;
* You should probably list the technologies/tools/standards/datasets you have used and explain briefly why you chose each of them. Did you consider alternatives? Why were the ones you chose better?&lt;br /&gt;
* If you are reading/converting/lifting data from multiple sources and/or using existing tools in addition to your own program, you should probably include a flow chart or architecture sketch (which is different from a class diagram).&lt;br /&gt;
* You should probably include a class diagram and/or data flow diagram of your system.&lt;br /&gt;
* You should mention any particular problems you have had and/or things you want to do differently next time.&lt;br /&gt;
* If you want to briefly describe how to run the code you have submitted, you can do that separately in a README.TXT file.&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1236</id>
		<title>Programming project</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1236"/>
		<updated>2022-11-15T16:43:58Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Project submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The project shall develop an application that uses big data technologies on social-media and/or other open data. At least a part of the project shall use Spark and run in the NREC cloud. The project should be carried out in groups of three, and never more. Working individually or in pairs is possible, but not recommended.&lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite projects that use &#039;&#039;big data for the news&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
More information about possible projects, deadlines, and other requirements will appear here soon.&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme: deadline ==&lt;br /&gt;
Everyone who intends to take the course must be included in a project proposal sent by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no] and with all the group members on Cc. The subject line must contain the string &amp;quot;INFO319 Project Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Proposal deadline:&#039;&#039;&#039; Wednesday October 12th 1500&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- The proposal does not have to be long, but the following points must be made clear:&lt;br /&gt;
*    What you are planning to make using big data and big data technologies.&lt;br /&gt;
*    Why it is a good idea to use big data and big data technologies for this purpose.&lt;br /&gt;
*    Exactly what you think is new with your idea.&lt;br /&gt;
*    What you have done to ensure that something very similar has not been done before.&lt;br /&gt;
*    Which datasets you are planning to use.&lt;br /&gt;
*    What technologies (programming language, libraries, development and collaboration tools) you are planning to use. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project presentations ==&lt;br /&gt;
&#039;&#039;&#039;The final project presentations:&#039;&#039;&#039; Thursday December 8th 1015&lt;br /&gt;
&lt;br /&gt;
Depending a little on the number of project groups, each presentation will be brief: 15 minutes for each group + 5 minutes for questions and comments.&lt;br /&gt;
&lt;br /&gt;
You may demonstrate your project live (most convincing), or you may replay a recorded demonstration (which is good to have as a backup in any case). In addition, I expect each presentation to address/answer at least these points:&lt;br /&gt;
*    what have you made? - or: what is your application doing?&lt;br /&gt;
*    which technologies have you used (languages, APIs, other software etc.)&lt;br /&gt;
*    where did you get your data from? - and/or: which datasets have you used?&lt;br /&gt;
*    why is it a good idea to do this using big data and big data technologies? - or: what does your system do that was not possible (or at least not easy) to do before?&lt;br /&gt;
*    exactly what have you done and programmed so far?&lt;br /&gt;
*    what are you planning to do in the final few days?&lt;br /&gt;
*    have you got any particular problems you need to address?&lt;br /&gt;
&lt;br /&gt;
== Project submission ==&lt;br /&gt;
Submit your project through Inspera as a single ZIP archive. The version of your project that you submit should be anonymous.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Final project submission:&#039;&#039;&#039; December 12th. Since the project is graded, this is an official deadline. If you do not submit on time, you will not be allowed to take the course exam a week later.&lt;br /&gt;
&lt;br /&gt;
Provide a short video (max 5 minutes) that shows your system running, with voice comments.&lt;br /&gt;
&lt;br /&gt;
Comment your code sparsely and in-line. You do not need additional documentation, but you should provide a precise description of how to run your system. For example, explain:&lt;br /&gt;
* which additional packages need to be installed&lt;br /&gt;
* which datasets need to be downloaded&lt;br /&gt;
** do not include large datasets (&amp;gt;10 MB) in your ZIP file&lt;br /&gt;
** but it is fine to include smaller test datasets&lt;br /&gt;
* if credentials (like a Twitter token) are needed to run the code, explain where they must be added&lt;br /&gt;
* which other systems must be running first (e.g., Kafka, HDFS, YARN)&lt;br /&gt;
* how to start your system (in particular if it consists of several programs)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
The end result of the project should be submitted as a ZIP archive through Inspera:&lt;br /&gt;
*     Just one person in the group shall deliver the group assignment (ZIP file) in Inspera.&lt;br /&gt;
*      The groups must be created by the candidates. To do that you must create an ID for your group. The ID must be four digits and all the members must know it, and use it.&lt;br /&gt;
*      When you log in to Inspera you get two options: &amp;quot;Join existing group&amp;quot; or &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      The first member in a group logging in to Inspera must choose: &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      All the other members in the same group must choose: &amp;quot;Join existing group&amp;quot;.&lt;br /&gt;
*      All members in the group must log in to Inspera Assessment (and join the Group), BEFORE one person delivers the group assignment. (Each group can only deliver once)&lt;br /&gt;
&lt;br /&gt;
The submitted ZIP archive should contain your complete project in a single directory. Include a file README.TXT in the root of the project directory to let us know if you have used a particular development environment (like Eclipse), what is needed before your code can be run, and how to run it, and if there are other things to do.&lt;br /&gt;
&lt;br /&gt;
The file name of your archive should contain the student numbers of everyone in the group. (Note that your student number is different from your student card number.)&lt;br /&gt;
&lt;br /&gt;
Your ZIP archive should contain a 2-page project description. Put this description in the root folder of your project directory before you ZIP it. The project description file should be anonymous, and contain the exam numbers of all group members, BOTH on the first page and in the file name (e.g., ProjectDescription_102_113.pdf).&lt;br /&gt;
&lt;br /&gt;
The length of the project description is max 2 A4 pages with 11pt font and 2.5 cm margins. This is a HARD limit. You can have appendices, though, and any figures or tables come in addition to the two pages. The quality of your code is more important than the quality of the 2 page description. You receive a grade on the project, not on the report.&lt;br /&gt;
&lt;br /&gt;
* You should briefly explain the purpose of your system. Why have you made this? Why is it a good idea to do this using big data technologies? What can you do now that wasn&#039;t possible before?&lt;br /&gt;
* You should probably list the technologies/tools/standards/datasets you have used and explain briefly why you chose each of them. Did you consider alternatives? Why were the ones you chose better?&lt;br /&gt;
* If you are reading/converting/lifting data from multiple sources and/or using existing tools in addition to your own program, you should probably include a flow chart or architecture sketch (which is different from a class diagram).&lt;br /&gt;
* You should probably include a class diagram and/or data flow diagram of your system.&lt;br /&gt;
* You should mention any particular problems you have had and/or things you want to do differently next time.&lt;br /&gt;
* If you want to briefly describe how to run the code you have submitted, you can do that separately in a README.TXT file.&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1235</id>
		<title>Programming project</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1235"/>
		<updated>2022-11-15T16:43:43Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The project shall develop an application that uses big data technologies on social-media and/or other open data. At least a part of the project shall use Spark and run in the NREC cloud. The project should be carried out in groups of three, and never more. Working individually or in pairs is possible, but not recommended.&lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite projects that use &#039;&#039;big data for the news&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
More information about possible projects, deadlines, and other requirements will appear here soon.&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme: deadline ==&lt;br /&gt;
Everyone who intends to take the course must be included in a project proposal sent by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no] and with all the group members on Cc. The subject line must contain the string &amp;quot;INFO319 Project Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Proposal deadline:&#039;&#039;&#039; Wednesday October 12th 1500&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- The proposal does not have to be long, but the following points must be made clear:&lt;br /&gt;
*    What you are planning to make using big data and big data technologies.&lt;br /&gt;
*    Why it is a good idea to use big data and big data technologies for this purpose.&lt;br /&gt;
*    Exactly what you think is new with your idea.&lt;br /&gt;
*    What you have done to ensure that something very similar has not been done before.&lt;br /&gt;
*    Which datasets you are planning to use.&lt;br /&gt;
*    What technologies (programming language, libraries, development and collaboration tools) you are planning to use. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project presentations ==&lt;br /&gt;
&#039;&#039;&#039;The final project presentations:&#039;&#039;&#039; Thursday December 8th 1015&lt;br /&gt;
&lt;br /&gt;
Depending a little on the number of project groups, each presentation will be brief: 15 minutes for each group + 5 minutes for questions and comments.&lt;br /&gt;
&lt;br /&gt;
You may demonstrate your project live (most convincing), or you may replay a recorded demonstration (which is good to have as a backup in any case). In addition, I expect each presentation to address/answer at least these points:&lt;br /&gt;
*    what have you made? - or: what is your application doing?&lt;br /&gt;
*    which technologies have you used (languages, APIs, other software etc.)&lt;br /&gt;
*    where did you get your data from? - and/or: which datasets have you used?&lt;br /&gt;
*    why is it a good idea to do this using big data and big data technologies? - or: what does your system do that was not possible (or at least not easy) to do before?&lt;br /&gt;
*    exactly what have you done and programmed so far?&lt;br /&gt;
*    what are you planning to do in the final few days?&lt;br /&gt;
*    have you got any particular problems you need to address?&lt;br /&gt;
&lt;br /&gt;
== Project submission ==&lt;br /&gt;
Submit your project through Inspera as a single ZIP archive. The version of your project that you submit should be anonymous.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Final project submission:&#039;&#039;&#039; December 12th. Since the project is graded, this is an official deadline. If you do not submit on time, you will not be allowed to take the course exam a week later.&lt;br /&gt;
&lt;br /&gt;
Provide a short video (max 5 minutes) that shows your system running, with voice comments.&lt;br /&gt;
&lt;br /&gt;
Comment your code sparsely and in-line. You do not need additional documentation, but you should provide a precise description of how to run your system. For example, explain:&lt;br /&gt;
* which additional packages need to be installed&lt;br /&gt;
* which datasets need to be downloaded&lt;br /&gt;
** do not include large datasets (&amp;gt;10 MB) in your ZIP file&lt;br /&gt;
** but it is fine to include smaller test datasets&lt;br /&gt;
* if credentials (like a Twitter token) are needed to run the code, explain where they must be added&lt;br /&gt;
* which other systems must be running first (e.g., Kafka, HDFS, YARN)&lt;br /&gt;
* how to start your system (in particular if it consists of several programs)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
The end result of the project should be submitted as a ZIP archive through Inspera:&lt;br /&gt;
*     Just one person in the group shall deliver the group assignment (ZIP file) in Inspera.&lt;br /&gt;
*      The groups must be created by the candidates. To do that you must create an ID for your group. The ID must be four digits and all the members must know it, and use it.&lt;br /&gt;
*      When you log in to Inspera you get two options: &amp;quot;Join existing group&amp;quot; or &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      The first member in a group logging in to Inspera must choose: &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      All the other members in the same group must choose: &amp;quot;Join existing group&amp;quot;.&lt;br /&gt;
*      All members in the group must log in to Inspera Assessment (and join the Group), BEFORE one person delivers the group assignment. (Each group can only deliver once)&lt;br /&gt;
&lt;br /&gt;
The submitted ZIP archive should contain your complete project in a single directory. Include a file README.TXT in the root of the project directory to let us know if you have used a particular development environment (like Eclipse), what is needed before your code can be run, and how to run it, and if there are other things to do.&lt;br /&gt;
&lt;br /&gt;
The file name of your archive should contain the student numbers of everyone in the group. (Note that your student number is different from your student card number.)&lt;br /&gt;
&lt;br /&gt;
Your ZIP archive should contain a 2-page project description. Put this description in the root folder of your project directory before you ZIP it. The project description file should be anonymous, and contain the exam numbers of all group members, BOTH on the first page and in the file name (e.g., ProjectDescription_102_113.pdf).&lt;br /&gt;
&lt;br /&gt;
The length of the project description is max 2 A4 pages with 11pt font and 2.5 cm margins. This is a HARD limit. You can have appendices, though, and any figures or tables come in addition to the two pages. The quality of your code is more important than the quality of the 2 page description. You receive a grade on the project, not on the report.&lt;br /&gt;
&lt;br /&gt;
* You should briefly explain the purpose of your system. Why have you made this? Why is it a good idea to do this using big data technologies? What can you do now that wasn&#039;t possible before?&lt;br /&gt;
* You should probably list the technologies/tools/standards/datasets you have used and explain briefly why you chose each of them. Did you consider alternatives? Why were the ones you chose better?&lt;br /&gt;
* If you are reading/converting/lifting data from multiple sources and/or using existing tools in addition to your own program, you should probably include a flow chart or architecture sketch (which is different from a class diagram).&lt;br /&gt;
* You should probably include a class diagram and/or data flow diagram of your system.&lt;br /&gt;
* You should mention any particular problems you have had and/or things you want to do differently next time.&lt;br /&gt;
* If you want to briefly describe how to run the code you have submitted, you can do that separately in a README.TXT file.&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1234</id>
		<title>Essay</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1234"/>
		<updated>2022-11-15T16:30:20Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Essay submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The essay shall present and discuss selected theory, technology and tools related to big data.&lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite essays that are related to your group assignments, for example discussing privacy aspects of your project (if any).&lt;br /&gt;
&lt;br /&gt;
== Theme ==&lt;br /&gt;
Your mandatory essay is supposed to be &amp;quot;an individual, theoretical essay on chosen topic&amp;quot;. In practice, it is you who propose a theme, and then I either accept it as is or guide you towards a more suitable theme.&lt;br /&gt;
&lt;br /&gt;
I invite essays that are not purely theoretical, but that present and reflect over your work with the group assignment. For example, the essay can discuss privacy-related aspects of your project. But in any case, your essay should demonstrate &amp;quot;thoughtful research and discussion&amp;quot;. It should be well backed by scholarly literature. &lt;br /&gt;
&lt;br /&gt;
== Finding a theme ==&lt;br /&gt;
It is a good idea to start thinking about possible essay themes early in the semester, at least to get the process started! Send me your ideas in an email, and I will provide early feedback. All I need is a few keywords or a sentence about one or more ideas you are considering. Later we can talk face to face.&lt;br /&gt;
&lt;br /&gt;
* Here is an example of an initial idea on the keyword stage: &amp;quot;Ethics and Privacy in using social media for news production.&amp;quot;&lt;br /&gt;
* A step further: &amp;quot;I want to look into how open and non-sensitive information collected from independent sources about the same individuals can become sensitive when recombined.&amp;quot;&lt;br /&gt;
* Here is an even more developed idea: &amp;quot;How can potentially sensitive information be identified in big-data streams for news production? And how can guards be put in place to ensure that sensitive information is not created as part of the news production stream?&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme: deadline ==&lt;br /&gt;
Everyone who intends to take the course must send an essay proposal by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no]. The subject line must contain the string &amp;quot;INFO319 Essay Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Deadline:&#039;&#039;&#039; Wednesday October 12th 1500.&lt;br /&gt;
&lt;br /&gt;
== Length ==&lt;br /&gt;
Suggested length is 3000-4000 words. But effort and quality are much more important than length.&lt;br /&gt;
&lt;br /&gt;
== Essay submission ==&lt;br /&gt;
Submit your file through Inspera as a single PDF file. The version of your essay that you submit should be anonymous.&lt;br /&gt;
&lt;br /&gt;
Deadline for final essay submission is December 12th. Since the essay is graded, this is an official deadline. Submission is through Inspera and if you do not submit on time, you will not be allowed to take the course exam a week later.&lt;br /&gt;
&lt;br /&gt;
== Essay presentations ==&lt;br /&gt;
The session on Thursday, November 24th will focus on essay presentations. 10 minutes and around 5-8 slides are an appropriate length.&lt;br /&gt;
&lt;br /&gt;
Depending on the number of people who take the course, we may not have much time per essay. This includes: getting your slides up and running, presenting the actual essay, presenting critique, and posing/answering questions. In addition to presenting your own essay, you are supposed to offer comments on at least one other essay.&lt;br /&gt;
&lt;br /&gt;
This means that you need to be really well prepared. A short presentation is much harder than a long one: &amp;quot;Lincoln [... w]hen asked to appear upon some important occasion and deliver a five-minute speech, he said that he had no time to prepare five-minute speeches, but that he could go and speak an hour at any time.&amp;quot; (H. H. Markham, Governor of California, 1893)&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1233</id>
		<title>Essay</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1233"/>
		<updated>2022-11-15T16:13:36Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Proposing a theme: deadline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The essay shall present and discuss selected theory, technology and tools related to big data.&lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite essays that are related to your group assignments, for example discussing privacy aspects of your project (if any).&lt;br /&gt;
&lt;br /&gt;
== Theme ==&lt;br /&gt;
Your mandatory essay is supposed to be &amp;quot;an individual, theoretical essay on a chosen topic&amp;quot;. In practice, you propose a theme, and then I either accept it as is or guide you towards a more suitable one.&lt;br /&gt;
&lt;br /&gt;
I invite essays that are not purely theoretical, but that present and reflect on your work on the group assignment. For example, the essay can discuss privacy-related aspects of your project. But in any case, your essay should demonstrate &amp;quot;thoughtful research and discussion&amp;quot;. It should be well backed by scholarly literature.&lt;br /&gt;
&lt;br /&gt;
== Finding a theme ==&lt;br /&gt;
It is a good idea to start thinking about possible essay themes early in the semester, at least to get the process started! Send me your ideas in an email, and I will provide early feedback. All I need is a few keywords or a sentence about one or more ideas you are considering. Later we can talk face to face.&lt;br /&gt;
&lt;br /&gt;
* Here is an example of an initial idea at the keyword stage: &amp;quot;Ethics and Privacy in using social media for news production.&amp;quot;&lt;br /&gt;
* A step further: &amp;quot;I want to look into how open and non-sensitive information collected from independent sources about the same individuals can become sensitive when recombined.&amp;quot;&lt;br /&gt;
* Here is an even more developed idea: &amp;quot;How can potentially sensitive information be identified in big-data streams for news production? And how can guards be put in place to ensure that sensitive information is not created as part of the news production stream?&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme: deadline ==&lt;br /&gt;
Everyone who intends to take the course must send an essay proposal by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no]. The subject line must contain the string &amp;quot;INFO319 Essay Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Deadline:&#039;&#039;&#039; Wednesday October 12th 1500.&lt;br /&gt;
&lt;br /&gt;
== Length ==&lt;br /&gt;
Suggested length is 3000-4000 words. But effort and quality are much more important than length.&lt;br /&gt;
&lt;br /&gt;
== Essay submission ==&lt;br /&gt;
TBD&lt;br /&gt;
&lt;br /&gt;
== Essay presentations ==&lt;br /&gt;
The session on Thursday, November 24th will focus on essay presentations. 10 minutes and around 5-8 slides are an appropriate length.&lt;br /&gt;
&lt;br /&gt;
Depending on the number of people who take the course, we may not have much time per essay. That time includes getting your slides up and running, presenting the essay itself, presenting critique, and posing/answering questions. In addition to presenting your own essay, you are supposed to offer comments on at least one other essay.&lt;br /&gt;
&lt;br /&gt;
This means that you need to be really well prepared. A short presentation is much harder than a long one: &amp;quot;Lincoln [... w]hen asked to appear upon some important occasion and deliver a five-minute speech, he said that he had no time to prepare five-minute speeches, but that he could go and speak an hour at any time.&amp;quot; (H. H. Markham, Governor of California, 1893)&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1232</id>
		<title>Essay</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1232"/>
		<updated>2022-11-15T16:12:46Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Length */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The essay shall present and discuss selected theory, technology and tools related to big data.&lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite essays that are related to your group assignments, for example discussing privacy aspects of your project (if any).&lt;br /&gt;
&lt;br /&gt;
== Theme ==&lt;br /&gt;
Your mandatory essay is supposed to be &amp;quot;an individual, theoretical essay on a chosen topic&amp;quot;. In practice, you propose a theme, and then I either accept it as is or guide you towards a more suitable one.&lt;br /&gt;
&lt;br /&gt;
I invite essays that are not purely theoretical, but that present and reflect on your work on the group assignment. For example, the essay can discuss privacy-related aspects of your project. But in any case, your essay should demonstrate &amp;quot;thoughtful research and discussion&amp;quot;. It should be well backed by scholarly literature.&lt;br /&gt;
&lt;br /&gt;
== Finding a theme ==&lt;br /&gt;
It is a good idea to start thinking about possible essay themes early in the semester, at least to get the process started! Send me your ideas in an email, and I will provide early feedback. All I need is a few keywords or a sentence about one or more ideas you are considering. Later we can talk face to face.&lt;br /&gt;
&lt;br /&gt;
* Here is an example of an initial idea at the keyword stage: &amp;quot;Ethics and Privacy in using social media for news production.&amp;quot;&lt;br /&gt;
* A step further: &amp;quot;I want to look into how open and non-sensitive information collected from independent sources about the same individuals can become sensitive when recombined.&amp;quot;&lt;br /&gt;
* Here is an even more developed idea: &amp;quot;How can potentially sensitive information be identified in big-data streams for news production? And how can guards be put in place to ensure that sensitive information is not created as part of the news production stream?&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme: deadline ==&lt;br /&gt;
Everyone who intends to take the course must send an essay proposal by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no]. The subject line must contain the string &amp;quot;INFO319 Essay Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Deadline:&#039;&#039;&#039; Wednesday October 12th 1500&lt;br /&gt;
&lt;br /&gt;
But it is a good idea to contact me about this earlier :-)&lt;br /&gt;
&lt;br /&gt;
== Length ==&lt;br /&gt;
Suggested length is 3000-4000 words. But effort and quality are much more important than length.&lt;br /&gt;
&lt;br /&gt;
== Essay submission ==&lt;br /&gt;
TBD&lt;br /&gt;
&lt;br /&gt;
== Essay presentations ==&lt;br /&gt;
The session on Thursday, November 24th will focus on essay presentations. 10 minutes and around 5-8 slides are an appropriate length.&lt;br /&gt;
&lt;br /&gt;
Depending on the number of people who take the course, we may not have much time per essay. That time includes getting your slides up and running, presenting the essay itself, presenting critique, and posing/answering questions. In addition to presenting your own essay, you are supposed to offer comments on at least one other essay.&lt;br /&gt;
&lt;br /&gt;
This means that you need to be really well prepared. A short presentation is much harder than a long one: &amp;quot;Lincoln [... w]hen asked to appear upon some important occasion and deliver a five-minute speech, he said that he had no time to prepare five-minute speeches, but that he could go and speak an hour at any time.&amp;quot; (H. H. Markham, Governor of California, 1893)&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Course_wiki_for_INFO319&amp;diff=1231</id>
		<title>Course wiki for INFO319</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Course_wiki_for_INFO319&amp;diff=1231"/>
		<updated>2022-11-15T16:11:31Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This wiki (under development) contains practical information about INFO319 - Big Data in the autumn of 2022, including readings and exercises.&lt;br /&gt;
&lt;br /&gt;
* [[Readings]]: Mandatory and supplementary readings for the course. In addition, the sessions page proposes specific readings for each session.&lt;br /&gt;
* [[Sessions]]: The course comprises 8 full-day seminars. The sessions involve lectures, demos, student presentations, and practical work.  &lt;br /&gt;
* [[Exercises]]: Exercises related to the sessions. Solutions to the exercises can form the backbone of the mandatory group assignment.&lt;br /&gt;
* [https://www.uib.no/emne/INFO319?sem=2022h Assessment]: The course assessment has three parts:&lt;br /&gt;
** Portfolio evaluation of the essay and group assignment (55%)&lt;br /&gt;
** Oral presentations of essay and group assignment (15%)&lt;br /&gt;
** Written exam (3 hours) (30%)&lt;br /&gt;
* [[Essay]]: One part of the portfolio evaluation is &amp;quot;an individual, theoretical essay with thoughtful research and discussion of an assigned topic&amp;quot;.&lt;br /&gt;
* [[Programming project|Project]]: Another part of the portfolio evaluation is a &amp;quot;practical assignment in groups&amp;quot; that has the form of a group programming project. Solutions to the suggested exercises can form the backbone of your project.&lt;br /&gt;
* Participation: Participation in 80% of the course seminars is &#039;&#039;mandatory&#039;&#039;. Participation in &#039;&#039;the last two sessions is also mandatory&#039;&#039; (because your presentations there are part of the course assessment).&lt;br /&gt;
&amp;lt;!-- * [[Datasets]] : Available datasets that can be used for data analysis. --&amp;gt; &lt;br /&gt;
* [https://mitt.uib.no/courses/37204/pages/info319-big-data-h22 Administration]: For formal and administrative information, see [https://mitt.uib.no/courses/37204/pages/info319-big-data-h22 UiB&#039;s Study Portal].&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Contact:&#039;&#039;&#039; [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;For questions that are not strictly personal, always use the [https://mitt.uib.no/courses/37204/discussion_topics/321322 discussion forum (requires login)] at [https://mitt.uib.no/courses/37204/pages/info319-big-data-h22 mitt.uib.no]. If I receive general questions about INFO319 by email, I will answer them in the forum anyway, so it is fastest to post them directly there.&#039;&#039;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Sessions&amp;diff=1230</id>
		<title>Sessions</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Sessions&amp;diff=1230"/>
		<updated>2022-11-10T21:16:06Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Session 6 - Societal issues. Privacy. GDPR */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Tentative themes for each session ==&lt;br /&gt;
* Thursday August 18th: Introduction meeting [[File:IntroductionMeeting.pdf]]&lt;br /&gt;
* Thursday September 1st: Session 1 - Introduction to big data. Big-data processing. Spark&lt;br /&gt;
* Thursday September 15th: Session 2 - More about Spark. Data sources. Twitter&lt;br /&gt;
* Thursday September 29th: Session 3 - Streaming Spark. Big-data architectures. Kafka&lt;br /&gt;
* Thursday October 13th: Session 4 - Cloud computing. NREC and Openstack&lt;br /&gt;
* Thursday October 27th: Session 5 - Cloud management. Terraform and Ansible. Docker and Kubernetes&lt;br /&gt;
* Thursday November 10th: Session 6 - Societal issues. Privacy. GDPR&lt;br /&gt;
* Thursday November 24th: Session 7 - Essay presentations&lt;br /&gt;
* Thursday December 8th: Session 8 - Project demonstrations&lt;br /&gt;
&lt;br /&gt;
== Session 1 - Introduction to big data. Big-data processing. Spark ==&lt;br /&gt;
* Kitchin, chapters 1, 4-5&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 1-3, 12, 15&lt;br /&gt;
* Slides: [[File:S01-BigData-published.pdf]] [[File:S01-Spark-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Section 1 in Opdahl, A. L., &amp;amp; Nunavath, V. (2020). Big Data. Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data, 15-29. [https://link.springer.com/chapter/10.1007/978-3-030-48099-8_2 Paper]&lt;br /&gt;
* Spark 3.3.0  [https://spark.apache.org/docs/latest/overview.html Overview] and [https://spark.apache.org/docs/latest/quick-start.html Quick Start (with Python examples)]&lt;br /&gt;
&lt;br /&gt;
== Session 2 - More about Spark. Data sources. Twitter ==&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 4-9 (chapter 10 on SQL is also very relevant)&lt;br /&gt;
* Kitchin, chapter 3&lt;br /&gt;
* Slides: [[File:S02-OrganisationINFO319-published.pdf]] [[File:S02-DataSources-published.pdf]] [[File:DanielRosnes-Introduction-to-Tweepy-and-Twitter-API-2.0.pdf]] [[File:S02-MoreSpark-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Daniel Rosnes on using Twitter data for the news: Introduction to Twitter API v2 and Tweepy&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapter 10 &#039;&#039;(perhaps mandatory too)&#039;&#039;&lt;br /&gt;
* [https://developer.twitter.com/en/docs/twitter-api Twitter API v2]&lt;br /&gt;
* [https://github.com/tweepy/tweepy Tweepy: Twitter for Python]&lt;br /&gt;
* [https://docs.tweepy.org/en/latest/ Tweepy Documentation]&lt;br /&gt;
&lt;br /&gt;
== Session 3 - Streaming Spark. Big-data architectures. Kafka ==&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 20-21&lt;br /&gt;
* Gallofré, M., Opdahl, A. L., Stoppel, S., Tessem, B., &amp;amp; Veres, C. (2021). The News Angler Project: Exploring the Next Generation of Journalistic Knowledge Platforms. In Proceedings of Norsk IKT-konferanse for forskning og utdanning. [https://ojs.bibsys.no/index.php/NIK/article/view/939/792 Short Paper] and poster: [[file:A1-Poster-NIKT2021.pdf]]&lt;br /&gt;
* [https://kafka.apache.org/intro Kafka Introduction]&lt;br /&gt;
* Slides: [[file:S03-StreamingSpark-published.pdf]] [[file:S03-MoreSpark-published.pdf]] [[file:S03-Kafka-published.pdf]] [[file:S03-ResearchMethod-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;The Guest Talk on architectures and the News Hunter platform is postponed to a later session.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* [https://kafka-python.readthedocs.io/en/master/ kafka-python API]&lt;br /&gt;
* [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Structured Streaming Spark Programming Guide]&lt;br /&gt;
* News Hunter:&lt;br /&gt;
** Berven, A., Christensen, O. A., Moldeklev, S., Opdahl, A. L., &amp;amp; Villanger, K. J. (2020). A knowledge-graph platform for newsrooms. Computers in Industry, 123, 103321. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:0K5dB1_9nusJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;as_ylo=2018&amp;amp;scillfp=11776208952974186557&amp;amp;oi=lle Paper]&lt;br /&gt;
** Opdahl, A. L., &amp;amp; Tessem, B. (2021). Ontologies for finding journalistic angles. Software and Systems Modeling, 20(1), 71-87. [https://link.springer.com/article/10.1007/s10270-020-00801-w Paper]&lt;br /&gt;
* Design science research method:&lt;br /&gt;
** [https://www.jstor.org/stable/25148625#metadata_info_tab_contents Design Science in Information Systems Research] by Alan R. Hevner, Salvatore T. March, Jinsoo Park and Sudha Ram. MIS Quarterly 28(1):75-105, March 2004. &#039;&#039;(You need to be on UiB&#039;s network to access the link - I have uploaded it under Files in mitt.uib.no, but it may soon be deleted from there...)&#039;&#039;&lt;br /&gt;
** Hevner, A. R. (2007). A three cycle view of design science research. Scandinavian journal of information systems, 19(2), 4. [[File:Hevner2007-ThreeCycleView-SJIS.pdf]]&lt;br /&gt;
&lt;br /&gt;
== Session 4 - Cloud computing. NREC and Openstack ==&lt;br /&gt;
* [https://docs.nrec.no/index.html NREC and OpenStack], the following sections/pages: Introduction, Project application, Logging in, The dashboard, Create a Linux virtual machine (skip: Windows), Using SSH, Working with Security Groups, Create and manage volumes, Create and manage snapshots (skip: images), Instance console&lt;br /&gt;
* Slides: [[file:S04-OpenStack-published.pdf]] [[file:S04-UbuntuLinux-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Sohail Khan on computer vision and deep networks for image analysis. His slides and demo code are uploaded to mitt.uib.no under Files (due to size and file-type limitations).&lt;br /&gt;
&lt;br /&gt;
There are fewer readings for this session, because this is where we start running Spark in a cluster, so the practical work will take some time. Computer vision and image analysis are not a mandatory part of the course, but something you may want to use in your projects. Sohail&#039;s presentation will include suggestions for further reading.&lt;br /&gt;
&lt;br /&gt;
== Session 5 - Cloud management. Terraform and Ansible. &amp;lt;!-- Docker and Kubernetes --&amp;gt; ==&lt;br /&gt;
* [https://docs.nrec.no/terraform-part1.html Terraform and NREC part I], [https://docs.nrec.no/terraform-part2.html part II], and [https://docs.nrec.no/terraform-part3.html part III]&lt;br /&gt;
* [https://www.ansible.com/overview/how-ansible-works How Ansible Works] and [https://docs.ansible.com/ansible_community.html the Ansible Community portal]&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Docker Docs: [https://docs.docker.com/get-started/overview/ Docker overview] and [https://docs.docker.com/get-started/overview/ Get started]&lt;br /&gt;
* [https://kubernetes.io/docs/tutorials/kubernetes-basics/ Learn Kubernetes basics], modules 1-6&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
* Slides: [[File:S05-Terraform-Ansible-published.pdf]] [[File:S05-NewsAngler-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Marc Gallofré Ocaña on the News Hunter platform and its big-data ready architecture. Slides: [[File:MarcGallofre-BigDataArchitecture.pdf]] &lt;br /&gt;
&lt;br /&gt;
Comment: Hopefully, we can introduce Docker and Kubernetes in later sessions.&lt;br /&gt;
&lt;br /&gt;
== Session 6 - Societal issues. Privacy. GDPR ==&lt;br /&gt;
* Kitchin, chapters 13-14 and 17-19&lt;br /&gt;
* [https://gdpr.eu/what-is-gdpr/ What is GDPR, the EU’s new data protection law?]&lt;br /&gt;
* Slides: [[File:S06-Privacy.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Ghazaal Sheiki on fact checking. Slides:  [[File:GhazaalSheiki-AutomatedFactChecking.pdf]]&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Kitchin, chapters 12 and 15-16 are also recommended reading&lt;br /&gt;
* EU&#039;s [https://gdpr-info.eu/ General Data Protection Regulation (GDPR)] - the official legal text&lt;br /&gt;
&lt;br /&gt;
== Session 7 - Essay presentations ==&lt;br /&gt;
&lt;br /&gt;
== Session 8 - Project demonstrations ==&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Sessions&amp;diff=1229</id>
		<title>Sessions</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Sessions&amp;diff=1229"/>
		<updated>2022-11-10T21:13:31Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Session 6 - Societal issues. Privacy. GDPR */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Tentative themes for each session ==&lt;br /&gt;
* Thursday August 18th: Introduction meeting [[File:IntroductionMeeting.pdf]]&lt;br /&gt;
* Thursday September 1st: Session 1 - Introduction to big data. Big-data processing. Spark&lt;br /&gt;
* Thursday September 15th: Session 2 - More about Spark. Data sources. Twitter&lt;br /&gt;
* Thursday September 29th: Session 3 - Streaming Spark. Big-data architectures. Kafka&lt;br /&gt;
* Thursday October 13th: Session 4 - Cloud computing. NREC and Openstack&lt;br /&gt;
* Thursday October 27th: Session 5 - Cloud management. Terraform and Ansible. Docker and Kubernetes&lt;br /&gt;
* Thursday November 10th: Session 6 - Societal issues. Privacy. GDPR&lt;br /&gt;
* Thursday November 24th: Session 7 - Essay presentations&lt;br /&gt;
* Thursday December 8th: Session 8 - Project demonstrations&lt;br /&gt;
&lt;br /&gt;
== Session 1 - Introduction to big data. Big-data processing. Spark ==&lt;br /&gt;
* Kitchin, chapters 1, 4-5&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 1-3, 12, 15&lt;br /&gt;
* Slides: [[File:S01-BigData-published.pdf]] [[File:S01-Spark-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Section 1 in Opdahl, A. L., &amp;amp; Nunavath, V. (2020). Big Data. Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data, 15-29. [https://link.springer.com/chapter/10.1007/978-3-030-48099-8_2 Paper]&lt;br /&gt;
* Spark 3.3.0  [https://spark.apache.org/docs/latest/overview.html Overview] and [https://spark.apache.org/docs/latest/quick-start.html Quick Start (with Python examples)]&lt;br /&gt;
&lt;br /&gt;
== Session 2 - More about Spark. Data sources. Twitter ==&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 4-9 (chapter 10 on SQL is also very relevant)&lt;br /&gt;
* Kitchin, chapter 3&lt;br /&gt;
* Slides: [[File:S02-OrganisationINFO319-published.pdf]] [[File:S02-DataSources-published.pdf]] [[File:DanielRosnes-Introduction-to-Tweepy-and-Twitter-API-2.0.pdf]] [[File:S02-MoreSpark-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Daniel Rosnes on using Twitter data for the news: Introduction to Twitter API v2 and Tweepy&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapter 10 &#039;&#039;(perhaps mandatory too)&#039;&#039;&lt;br /&gt;
* [https://developer.twitter.com/en/docs/twitter-api Twitter API v2]&lt;br /&gt;
* [https://github.com/tweepy/tweepy Tweepy: Twitter for Python]&lt;br /&gt;
* [https://docs.tweepy.org/en/latest/ Tweepy Documentation]&lt;br /&gt;
&lt;br /&gt;
== Session 3 - Streaming Spark. Big-data architectures. Kafka ==&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 20-21&lt;br /&gt;
* Gallofré, M., Opdahl, A. L., Stoppel, S., Tessem, B., &amp;amp; Veres, C. (2021). The News Angler Project: Exploring the Next Generation of Journalistic Knowledge Platforms. In Proceedings of Norsk IKT-konferanse for forskning og utdanning. [https://ojs.bibsys.no/index.php/NIK/article/view/939/792 Short Paper] and poster: [[file:A1-Poster-NIKT2021.pdf]]&lt;br /&gt;
* [https://kafka.apache.org/intro Kafka Introduction]&lt;br /&gt;
* Slides: [[file:S03-StreamingSpark-published.pdf]] [[file:S03-MoreSpark-published.pdf]] [[file:S03-Kafka-published.pdf]] [[file:S03-ResearchMethod-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;The Guest Talk on architectures and the News Hunter platform is postponed to a later session.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* [https://kafka-python.readthedocs.io/en/master/ kafka-python API]&lt;br /&gt;
* [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Structured Streaming Spark Programming Guide]&lt;br /&gt;
* News Hunter:&lt;br /&gt;
** Berven, A., Christensen, O. A., Moldeklev, S., Opdahl, A. L., &amp;amp; Villanger, K. J. (2020). A knowledge-graph platform for newsrooms. Computers in Industry, 123, 103321. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:0K5dB1_9nusJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;as_ylo=2018&amp;amp;scillfp=11776208952974186557&amp;amp;oi=lle Paper]&lt;br /&gt;
** Opdahl, A. L., &amp;amp; Tessem, B. (2021). Ontologies for finding journalistic angles. Software and Systems Modeling, 20(1), 71-87. [https://link.springer.com/article/10.1007/s10270-020-00801-w Paper]&lt;br /&gt;
* Design science research method:&lt;br /&gt;
** [https://www.jstor.org/stable/25148625#metadata_info_tab_contents Design Science in Information Systems Research] by Alan R. Hevner, Salvatore T. March, Jinsoo Park and Sudha Ram. MIS Quarterly 28(1):75-105, March 2004. &#039;&#039;(You need to be on UiB&#039;s network to access the link - I have uploaded it under Files in mitt.uib.no, but it may soon be deleted from there...)&#039;&#039;&lt;br /&gt;
** Hevner, A. R. (2007). A three cycle view of design science research. Scandinavian journal of information systems, 19(2), 4. [[File:Hevner2007-ThreeCycleView-SJIS.pdf]]&lt;br /&gt;
&lt;br /&gt;
== Session 4 - Cloud computing. NREC and Openstack ==&lt;br /&gt;
* [https://docs.nrec.no/index.html NREC and OpenStack], the following sections/pages: Introduction, Project application, Logging in, The dashboard, Create a Linux virtual machine (skip: Windows), Using SSH, Working with Security Groups, Create and manage volumes, Create and manage snapshots (skip: images), Instance console&lt;br /&gt;
* Slides: [[file:S04-OpenStack-published.pdf]] [[file:S04-UbuntuLinux-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Sohail Khan on computer vision and deep networks for image analysis. His slides and demo code are uploaded to mitt.uib.no under Files (due to size and file-type limitations).&lt;br /&gt;
&lt;br /&gt;
There are fewer readings for this session, because this is where we start running Spark in a cluster, so the practical work will take some time. Computer vision and image analysis are not a mandatory part of the course, but something you may want to use in your projects. Sohail&#039;s presentation will include suggestions for further reading.&lt;br /&gt;
&lt;br /&gt;
== Session 5 - Cloud management. Terraform and Ansible. &amp;lt;!-- Docker and Kubernetes --&amp;gt; ==&lt;br /&gt;
* [https://docs.nrec.no/terraform-part1.html Terraform and NREC part I], [https://docs.nrec.no/terraform-part2.html part II], and [https://docs.nrec.no/terraform-part3.html part III]&lt;br /&gt;
* [https://www.ansible.com/overview/how-ansible-works How Ansible Works] and [https://docs.ansible.com/ansible_community.html the Ansible Community portal]&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Docker Docs: [https://docs.docker.com/get-started/overview/ Docker overview] and [https://docs.docker.com/get-started/overview/ Get started]&lt;br /&gt;
* [https://kubernetes.io/docs/tutorials/kubernetes-basics/ Learn Kubernetes basics], modules 1-6&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
* Slides: [[File:S05-Terraform-Ansible-published.pdf]] [[File:S05-NewsAngler-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Marc Gallofré Ocaña on the News Hunter platform and its big-data ready architecture. Slides: [[File:MarcGallofre-BigDataArchitecture.pdf]] &lt;br /&gt;
&lt;br /&gt;
Comment: Hopefully, we can introduce Docker and Kubernetes in later sessions.&lt;br /&gt;
&lt;br /&gt;
== Session 6 - Societal issues. Privacy. GDPR ==&lt;br /&gt;
* Kitchin, chapters 13-14 and 17-19&lt;br /&gt;
* [https://gdpr.eu/what-is-gdpr/ What is GDPR, the EU’s new data protection law?]&lt;br /&gt;
* Slides: [[File:S06-Privacy.pdf]] [[File:GhazaalSheiki-AutomatedFactChecking.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Ghazaal Sheiki on fact checking&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Kitchin, chapters 12 and 15-16 are also recommended reading&lt;br /&gt;
* EU&#039;s [https://gdpr-info.eu/ General Data Protection Regulation (GDPR)] - the official legal text&lt;br /&gt;
&lt;br /&gt;
== Session 7 - Essay presentations ==&lt;br /&gt;
&lt;br /&gt;
== Session 8 - Project demonstrations ==&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=File:GhazaalSheiki-AutomatedFactChecking.pdf&amp;diff=1228</id>
		<title>File:GhazaalSheiki-AutomatedFactChecking.pdf</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=File:GhazaalSheiki-AutomatedFactChecking.pdf&amp;diff=1228"/>
		<updated>2022-11-10T21:12:11Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=File:S06-Privacy.pdf&amp;diff=1227</id>
		<title>File:S06-Privacy.pdf</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=File:S06-Privacy.pdf&amp;diff=1227"/>
		<updated>2022-11-10T21:11:22Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Course_wiki_for_INFO319&amp;diff=1226</id>
		<title>Course wiki for INFO319</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Course_wiki_for_INFO319&amp;diff=1226"/>
		<updated>2022-11-10T10:35:14Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This wiki (under development) contains practical information about INFO319 - Big Data in the autumn of 2022, including readings and exercises.&lt;br /&gt;
&lt;br /&gt;
https://miro.com/welcomeonboard/UlBicXNHQXFtZDVTV2lDZzdrbzdoYVBFdXdoUzdkTXBNR0x0MEZQVDljYk1qSGxYd1FWMVRXUEM5OHFYUlh6RnwzMDc0NDU3MzY4MjYxOTAxNTIzfDI=?share_link_id=781099561364&lt;br /&gt;
&lt;br /&gt;
* [[Readings]]: Mandatory and supplementary readings for the course. In addition, the sessions page proposes specific readings for each session.&lt;br /&gt;
* [[Sessions]]: The course comprises 8 full-day seminars. The sessions involve lectures, demos, student presentations, and practical work.  &lt;br /&gt;
* [[Exercises]]: Exercises related to the sessions. Solutions to the exercises can form the backbone of the mandatory group assignment.&lt;br /&gt;
* [https://www.uib.no/emne/INFO319?sem=2022h Assessment]: The course assessment has three parts:&lt;br /&gt;
** Portfolio evaluation of the essay and group assignment (55%)&lt;br /&gt;
** Oral presentations of essay and group assignment (15%)&lt;br /&gt;
** Written exam (3 hours) (30%)&lt;br /&gt;
* [[Essay]]: One part of the portfolio evaluation is &amp;quot;an individual, theoretical essay with thoughtful research and discussion of an assigned topic&amp;quot;.&lt;br /&gt;
* [[Programming project|Project]]: Another part of the portfolio evaluation is a &amp;quot;practical assignment in groups&amp;quot; that has the form of a group programming project. Solutions to the suggested exercises can form the backbone of your project.&lt;br /&gt;
* Participation: Participation in 80% of the course seminars is &#039;&#039;mandatory&#039;&#039;. Participation in &#039;&#039;the last two sessions is also mandatory&#039;&#039; (because your presentations there are part of the course assessment).&lt;br /&gt;
&amp;lt;!-- * [[Datasets]] : Available datasets that can be used for data analysis. --&amp;gt; &lt;br /&gt;
* [https://mitt.uib.no/courses/37204/pages/info319-big-data-h22 Administration]: For formal and administrative information, see [https://mitt.uib.no/courses/37204/pages/info319-big-data-h22 UiB&#039;s Study Portal].&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Contact:&#039;&#039;&#039; [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;For questions that are not strictly personal, always use the [https://mitt.uib.no/courses/37204/discussion_topics/321322 discussion forum (requires login)] at [https://mitt.uib.no/courses/37204/pages/info319-big-data-h22 mitt.uib.no]. If I receive general questions about INFO319 by email, I will answer them in the forum anyway, so it is fastest to post them directly there.&#039;&#039;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Sessions&amp;diff=1225</id>
		<title>Sessions</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Sessions&amp;diff=1225"/>
		<updated>2022-11-09T12:53:17Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Session 6 - Societal issues. Privacy. GDPR */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Tentative themes for each session ==&lt;br /&gt;
* Thursday August 18th: Introduction meeting [[File:IntroductionMeeting.pdf]]&lt;br /&gt;
* Thursday September 1st: Session 1 - Introduction to big data. Big-data processing. Spark&lt;br /&gt;
* Thursday September 15th: Session 2 - More about Spark. Data sources. Twitter&lt;br /&gt;
* Thursday September 29th: Session 3 - Streaming Spark. Big-data architectures. Kafka&lt;br /&gt;
* Thursday October 13th: Session 4 - Cloud computing. NREC and Openstack&lt;br /&gt;
* Thursday October 27th: Session 5 - Cloud management. Terraform and Ansible. Docker and Kubernetes&lt;br /&gt;
* Thursday November 10th: Session 6 - Societal issues. Privacy. GDPR&lt;br /&gt;
* Thursday November 24th: Session 7 - Essay presentations&lt;br /&gt;
* Thursday December 8th: Session 8 - Project demonstrations&lt;br /&gt;
&lt;br /&gt;
== Session 1 - Introduction to big data. Big-data processing. Spark ==&lt;br /&gt;
* Kitchin, chapters 1, 4-5&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 1-3, 12, 15&lt;br /&gt;
* Slides: [[File:S01-BigData-published.pdf]] [[File:S01-Spark-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Section 1 in Opdahl, A. L., &amp;amp; Nunavath, V. (2020). Big Data. Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data, 15-29. [https://link.springer.com/chapter/10.1007/978-3-030-48099-8_2 Paper]&lt;br /&gt;
* Spark 3.3.0  [https://spark.apache.org/docs/latest/overview.html Overview] and [https://spark.apache.org/docs/latest/quick-start.html Quick Start (with Python examples)]&lt;br /&gt;
&lt;br /&gt;
== Session 2 - More about Spark. Data sources. Twitter ==&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 4-9 (chapter 10 on SQL is also very relevant)&lt;br /&gt;
* Kitchin, chapter 3&lt;br /&gt;
* Slides: [[File:S02-OrganisationINFO319-published.pdf]] [[File:S02-DataSources-published.pdf]] [[File:DanielRosnes-Introduction-to-Tweepy-and-Twitter-API-2.0.pdf]] [[File:S02-MoreSpark-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Daniel Rosnes on using Twitter data for the news: Introduction to Twitter API v2 and Tweepy&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapter 10 &#039;&#039;(perhaps mandatory too)&#039;&#039;&lt;br /&gt;
* [https://developer.twitter.com/en/docs/twitter-api Twitter API v2]&lt;br /&gt;
* [https://github.com/tweepy/tweepy Tweepy: Twitter for Python]&lt;br /&gt;
* [https://docs.tweepy.org/en/latest/ Tweepy Documentation]&lt;br /&gt;
&lt;br /&gt;
== Session 3 - Streaming Spark. Big-data architectures. Kafka ==&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 20-21&lt;br /&gt;
* Gallofré, M., Opdahl, A. L., Stoppel, S., Tessem, B., &amp;amp; Veres, C. (2021). The News Angler Project: Exploring the Next Generation of Journalistic Knowledge Platforms. In Proceedings of Norsk IKT-konferanse for forskning og utdanning. [https://ojs.bibsys.no/index.php/NIK/article/view/939/792 Short Paper] and poster: [[file:A1-Poster-NIKT2021.pdf]]&lt;br /&gt;
* [https://kafka.apache.org/intro Kafka Introduction]&lt;br /&gt;
* Slides: [[file:S03-StreamingSpark-published.pdf]] [[file:S03-MoreSpark-published.pdf]] [[file:S03-Kafka-published.pdf]] [[file:S03-ResearchMethod-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;The Guest Talk on architectures and the News Hunter platform is postponed to a later session.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* [https://kafka-python.readthedocs.io/en/master/ kafka-python API]&lt;br /&gt;
* [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Structured Streaming Spark Programming Guide]&lt;br /&gt;
* News Hunter:&lt;br /&gt;
** Berven, A., Christensen, O. A., Moldeklev, S., Opdahl, A. L., &amp;amp; Villanger, K. J. (2020). A knowledge-graph platform for newsrooms. Computers in Industry, 123, 103321. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:0K5dB1_9nusJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;as_ylo=2018&amp;amp;scillfp=11776208952974186557&amp;amp;oi=lle Paper]&lt;br /&gt;
** Opdahl, A. L., &amp;amp; Tessem, B. (2021). Ontologies for finding journalistic angles. Software and Systems Modeling, 20(1), 71-87. [https://link.springer.com/article/10.1007/s10270-020-00801-w Paper]&lt;br /&gt;
* Design science research method:&lt;br /&gt;
** [https://www.jstor.org/stable/25148625#metadata_info_tab_contents Design Science in Information Systems Research] by Alan R. Hevner, Salvatore T. March, Jinsoo Park and Sudha Ram. MIS Quarterly 28(1):75-105, March 2004. &#039;&#039;(You need to be on UiB&#039;s network to access the link - I have uploaded it under Files in mitt.uib.no, but it may soon be deleted from there...)&#039;&#039;&lt;br /&gt;
** Hevner, A. R. (2007). A three cycle view of design science research. Scandinavian journal of information systems, 19(2), 4. [[File:Hevner2007-ThreeCycleView-SJIS.pdf]]&lt;br /&gt;
&lt;br /&gt;
== Session 4 - Cloud computing. NREC and Openstack ==&lt;br /&gt;
* [https://docs.nrec.no/index.html NREC and OpenStack], the following sections/pages: Introduction, Project application, Logging in, The dashboard, Create a Linux virtual machine (skip: Windows), Using SSH, Working with Security Groups, Create and manage volumes, Create and manage snapshots (skip: images), Instance console&lt;br /&gt;
* Slides: [[file:S04-OpenStack-published.pdf]] [[file:S04-UbuntuLinux-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Sohail Khan on computer vision and deep networks for image analysis. His slides and demo code are uploaded to mitt.uib.no under Files (size and file-type limitations).&lt;br /&gt;
&lt;br /&gt;
There are fewer readings for this session, because this is where we start running Spark in a cluster, so there will be practical work that takes some time. Computer vision and image analysis are not a mandatory part of the course, but something you may want to use in your projects. Sohail&#039;s presentation will include suggestions for further reading.&lt;br /&gt;
&lt;br /&gt;
== Session 5 - Cloud management. Terraform and Ansible. &amp;lt;!-- Docker and Kubernetes --&amp;gt; ==&lt;br /&gt;
* [https://docs.nrec.no/terraform-part1.html Terraform and NREC part I], [https://docs.nrec.no/terraform-part2.html part II], and [https://docs.nrec.no/terraform-part3.html part III]&lt;br /&gt;
* [https://www.ansible.com/overview/how-ansible-works How Ansible Works] and [https://docs.ansible.com/ansible_community.html the Ansible Community portal]&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Docker Docs: [https://docs.docker.com/get-started/overview/ Docker overview] and [https://docs.docker.com/get-started/ Get started]&lt;br /&gt;
* [https://kubernetes.io/docs/tutorials/kubernetes-basics/ Learn Kubernetes basics], modules 1-6&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
* Slides: [[File:S05-Terraform-Ansible-published.pdf]] [[File:S05-NewsAngler-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Marc Gallofré Ocaña on the News Hunter platform and its big-data ready architecture. Slides: [[File:MarcGallofre-BigDataArchitecture.pdf]] &lt;br /&gt;
&lt;br /&gt;
Comment: Hopefully, we can introduce Docker and Kubernetes in later sessions.&lt;br /&gt;
&lt;br /&gt;
== Session 6 - Societal issues. Privacy. GDPR ==&lt;br /&gt;
* Kitchin, chapters 13-14 and 17-19&lt;br /&gt;
* [https://gdpr.eu/what-is-gdpr/ What is GDPR, the EU’s new data protection law?]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Ghazaal Sheiki on fact checking&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Kitchin, chapters 12 and 15-16 are also recommended reading&lt;br /&gt;
* EU&#039;s [https://gdpr-info.eu/ General Data Protection Regulation (GDPR)] - the official legal text&lt;br /&gt;
&lt;br /&gt;
== Session 7 - Essay presentations ==&lt;br /&gt;
&lt;br /&gt;
== Session 8 - Project demonstrations ==&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Sessions&amp;diff=1224</id>
		<title>Sessions</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Sessions&amp;diff=1224"/>
		<updated>2022-11-01T11:21:33Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Session 6 - Societal issues. Privacy. GDPR */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Tentative themes for each session ==&lt;br /&gt;
* Thursday August 18th: Introduction meeting [[File:IntroductionMeeting.pdf]]&lt;br /&gt;
* Thursday September 1st: Session 1 - Introduction to big data. Big-data processing. Spark&lt;br /&gt;
* Thursday September 15th: Session 2 - More about Spark. Data sources. Twitter&lt;br /&gt;
* Thursday September 29th: Session 3 - Streaming Spark. Big-data architectures. Kafka&lt;br /&gt;
* Thursday October 13th: Session 4 - Cloud computing. NREC and Openstack&lt;br /&gt;
* Thursday October 27th: Session 5 - Cloud management. Terraform and Ansible. Docker and Kubernetes&lt;br /&gt;
* Thursday November 10th: Session 6 - Societal issues. Privacy. GDPR&lt;br /&gt;
* Thursday November 24th: Session 7 - Essay presentations&lt;br /&gt;
* Thursday December 8th: Session 8 - Project demonstrations&lt;br /&gt;
&lt;br /&gt;
== Session 1 - Introduction to big data. Big-data processing. Spark ==&lt;br /&gt;
* Kitchin, chapters 1, 4-5&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 1-3, 12, 15&lt;br /&gt;
* Slides: [[File:S01-BigData-published.pdf]] [[File:S01-Spark-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Section 1 in Opdahl, A. L., &amp;amp; Nunavath, V. (2020). Big Data. Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data, 15-29. [https://link.springer.com/chapter/10.1007/978-3-030-48099-8_2 Paper]&lt;br /&gt;
* Spark 3.3.0  [https://spark.apache.org/docs/latest/overview.html Overview] and [https://spark.apache.org/docs/latest/quick-start.html Quick Start (with Python examples)]&lt;br /&gt;
&lt;br /&gt;
== Session 2 - More about Spark. Data sources. Twitter ==&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 4-9 (chapter 10 on SQL is also very relevant)&lt;br /&gt;
* Kitchin, chapter 3&lt;br /&gt;
* Slides: [[File:S02-OrganisationINFO319-published.pdf]] [[File:S02-DataSources-published.pdf]] [[File:DanielRosnes-Introduction-to-Tweepy-and-Twitter-API-2.0.pdf]] [[File:S02-MoreSpark-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Daniel Rosnes on using Twitter data for the news: Introduction to Twitter API v2 and Tweepy&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapter 10 &#039;&#039;(perhaps mandatory too)&#039;&#039;&lt;br /&gt;
* [https://developer.twitter.com/en/docs/twitter-api Twitter API v2]&lt;br /&gt;
* [https://github.com/tweepy/tweepy Tweepy: Twitter for Python]&lt;br /&gt;
* [https://docs.tweepy.org/en/latest/ Tweepy Documentation]&lt;br /&gt;
&lt;br /&gt;
== Session 3 - Streaming Spark. Big-data architectures. Kafka ==&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 20-21&lt;br /&gt;
* Gallofré, M., Opdahl, A. L., Stoppel, S., Tessem, B., &amp;amp; Veres, C. (2021). The News Angler Project: Exploring the Next Generation of Journalistic Knowledge Platforms. In Proceedings of Norsk IKT-konferanse for forskning og utdanning. [https://ojs.bibsys.no/index.php/NIK/article/view/939/792 Short Paper] and poster: [[file:A1-Poster-NIKT2021.pdf]]&lt;br /&gt;
* [https://kafka.apache.org/intro Kafka Introduction]&lt;br /&gt;
* Slides: [[file:S03-StreamingSpark-published.pdf]] [[file:S03-MoreSpark-published.pdf]] [[file:S03-Kafka-published.pdf]] [[file:S03-ResearchMethod-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;The Guest Talk on architectures and the News Hunter platform is postponed to a later session.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* [https://kafka-python.readthedocs.io/en/master/ kafka-python API]&lt;br /&gt;
* [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Structured Streaming Spark Programming Guide]&lt;br /&gt;
* News Hunter:&lt;br /&gt;
** Berven, A., Christensen, O. A., Moldeklev, S., Opdahl, A. L., &amp;amp; Villanger, K. J. (2020). A knowledge-graph platform for newsrooms. Computers in Industry, 123, 103321. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:0K5dB1_9nusJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;as_ylo=2018&amp;amp;scillfp=11776208952974186557&amp;amp;oi=lle Paper]&lt;br /&gt;
** Opdahl, A. L., &amp;amp; Tessem, B. (2021). Ontologies for finding journalistic angles. Software and Systems Modeling, 20(1), 71-87. [https://link.springer.com/article/10.1007/s10270-020-00801-w Paper]&lt;br /&gt;
* Design science research method:&lt;br /&gt;
** [https://www.jstor.org/stable/25148625#metadata_info_tab_contents Design Science in Information Systems Research] by Alan R. Hevner, Salvatore T. March, Jinsoo Park and Sudha Ram. MIS Quarterly 28(1):75-105, March 2004. &#039;&#039;(You need to be on UiB&#039;s network to access the link - I have uploaded it under Files in mitt.uib.no, but it may soon be deleted from there...)&#039;&#039;&lt;br /&gt;
** Hevner, A. R. (2007). A three cycle view of design science research. Scandinavian journal of information systems, 19(2), 4. [[File:Hevner2007-ThreeCycleView-SJIS.pdf]]&lt;br /&gt;
&lt;br /&gt;
== Session 4 - Cloud computing. NREC and Openstack ==&lt;br /&gt;
* [https://docs.nrec.no/index.html NREC and OpenStack], the following sections/pages: Introduction, Project application, Logging in, The dashboard, Create a Linux virtual machine (skip: Windows), Using SSH, Working with Security Groups, Create and manage volumes, Create and manage snapshots (skip: images), Instance console&lt;br /&gt;
* Slides: [[file:S04-OpenStack-published.pdf]] [[file:S04-UbuntuLinux-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Sohail Khan on computer vision and deep networks for image analysis. His slides and demo code are uploaded to mitt.uib.no under Files (size and file-type limitations).&lt;br /&gt;
&lt;br /&gt;
There are fewer readings for this session, because this is where we start running Spark in a cluster, so there will be practical work that takes some time. Computer vision and image analysis are not a mandatory part of the course, but something you may want to use in your projects. Sohail&#039;s presentation will include suggestions for further reading.&lt;br /&gt;
&lt;br /&gt;
== Session 5 - Cloud management. Terraform and Ansible. &amp;lt;!-- Docker and Kubernetes --&amp;gt; ==&lt;br /&gt;
* [https://docs.nrec.no/terraform-part1.html Terraform and NREC part I], [https://docs.nrec.no/terraform-part2.html part II], and [https://docs.nrec.no/terraform-part3.html part III]&lt;br /&gt;
* [https://www.ansible.com/overview/how-ansible-works How Ansible Works] and [https://docs.ansible.com/ansible_community.html the Ansible Community portal]&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Docker Docs: [https://docs.docker.com/get-started/overview/ Docker overview] and [https://docs.docker.com/get-started/ Get started]&lt;br /&gt;
* [https://kubernetes.io/docs/tutorials/kubernetes-basics/ Learn Kubernetes basics], modules 1-6&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
* Slides: [[File:S05-Terraform-Ansible-published.pdf]] [[File:S05-NewsAngler-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Marc Gallofré Ocaña on the News Hunter platform and its big-data ready architecture. Slides: [[File:MarcGallofre-BigDataArchitecture.pdf]] &lt;br /&gt;
&lt;br /&gt;
Comment: Hopefully, we can introduce Docker and Kubernetes in later sessions.&lt;br /&gt;
&lt;br /&gt;
== Session 6 - Societal issues. Privacy. GDPR ==&lt;br /&gt;
* Kitchin, chapters 13-14 and 17-19&lt;br /&gt;
* [https://gdpr.eu/what-is-gdpr/ What is GDPR, the EU’s new data protection law?]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Laurence Dierickx on aspects of big-data quality&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Kitchin, chapters 12 and 15-16 are also recommended reading&lt;br /&gt;
* EU&#039;s [https://gdpr-info.eu/ General Data Protection Regulation (GDPR)] - the official legal text&lt;br /&gt;
&lt;br /&gt;
== Session 7 - Essay presentations ==&lt;br /&gt;
&lt;br /&gt;
== Session 8 - Project demonstrations ==&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Readings&amp;diff=1223</id>
		<title>Readings</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Readings&amp;diff=1223"/>
		<updated>2022-11-01T11:19:59Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Books */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Books ==&lt;br /&gt;
Text books:&lt;br /&gt;
* Rob Kitchin. &#039;&#039;The Data Revolution - A Critical Analysis of Big Data, Open Data and Data Infrastructures&#039;&#039;, 2nd Edition. Sage, 2021.&lt;br /&gt;
** chapters 1, 3-5, 13-14, 17-19 are mandatory (12 and 15-16 are supplementary)&lt;br /&gt;
&lt;br /&gt;
* Bill Chambers and Matei Zaharia: &#039;&#039;Spark: The Definitive Guide - Big Data Processing Made Simple&#039;&#039;. O&#039;Reilly, 2018. [[File:Spark-TheDefinitiveGuide.pdf]]&lt;br /&gt;
** chapters 1-9, 12, 15, 20-21 are mandatory (chapter 10 on SQL is also highly relevant)&lt;br /&gt;
** [https://github.com/databricks/Spark-The-Definitive-Guide GitHub repository with code and data examples]&lt;br /&gt;
&lt;br /&gt;
== Papers ==&lt;br /&gt;
Selected papers will become available here, including:&lt;br /&gt;
* [https://arxiv.org/pdf/2012.09109 Section 1] in Opdahl, A. L., &amp;amp; Nunavath, V. (2020). Big Data. Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data, 15-29. Book chapter&lt;br /&gt;
* Gallofré, M., Opdahl, A. L., Stoppel, S., Tessem, B., &amp;amp; Veres, C. (2021). The News Angler Project: Exploring the Next Generation of Journalistic Knowledge Platforms. In Proceedings of Norsk IKT-konferanse for forskning og utdanning. [https://ojs.bibsys.no/index.php/NIK/article/view/939/792 Short Paper] and poster: [[File:A1-Poster-NIKT2021.pdf]]&lt;br /&gt;
&amp;lt;!-- Architecture stuff:&lt;br /&gt;
* Lambda: Introduced in Nathan Marz and James Warren (2013). Big Data Principles and Best Practices of Scalable Real-Time Data Systems. Slides 14-27 in [http://2014.berlinbuzzwords.de/sites/2014.berlinbuzzwords.de/files/media/documents/michael_hausenblas_-_lambda_architecture.pdf this presentation] give an overview of the idea! &lt;br /&gt;
* Kappa: Kreps, J.: Questioning the lambda architecture (2014). [https://www.oreilly.com/radar/questioning-the-lambda-architecture/ White paper]&lt;br /&gt;
* Liquid: Fernandez, Raul Castro, Peter R. Pietzuch, Jay Kreps, Neha Narkhede, Jun Rao, Joel Koshy, Dong Lin, Chris Riccomini, and Guozhang Wang. &amp;quot;Liquid: Unifying nearline and offline big data integration.&amp;quot; In CIDR. 2015. [https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1088.2602&amp;amp;rep=rep1&amp;amp;type=pdf Paper]&lt;br /&gt;
* Sigma: Cassavia, N., &amp;amp; Masciari, E. (2021, March). Sigma: a scalable high performance big data architecture. In 2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP) (pp. 236-239). IEEE. [https://bibsys-almaprimo.hosted.exlibrisgroup.com/primo-explore/openurl?sid=google&amp;amp;auinit=N&amp;amp;aulast=Cassavia&amp;amp;atitle=Sigma:%20a%20scalable%20high%20performance%20big%20data%20architecture&amp;amp;id=doi:10.1109%2FPDP52278.2021.00044&amp;amp;vid=UBB&amp;amp;institution=UBB&amp;amp;url_ctx_val=&amp;amp;url_ctx_fmt=null&amp;amp;isSerivcesPage=true Paper]&lt;br /&gt;
* Maamouri, A., Sfaxi, L., &amp;amp; Robbana, R. (2021, December). Phi: A Generic Microservices-Based Big Data Architecture. In European, Mediterranean, and Middle Eastern Conference on Information Systems (pp. 3-16). Springer, Cham. [https://link.springer.com/chapter/10.1007/978-3-030-95947-0_1 Paper]&lt;br /&gt;
&lt;br /&gt;
Marc:&lt;br /&gt;
You found the other Phi architecture. 😃 The one I meant was: https://ieeexplore.ieee.org/abstract/document/8712381 But both have interesting contributions. The one you found considers the training part, which is not instantiated in the others.&lt;br /&gt;
&lt;br /&gt;
This is the &amp;quot;original publication&amp;quot; of Lambda: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html , it is a blog entry.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Michael Armbrust, Armando Fox, Rean Griffith, Anthony D Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, Matei Zaharia (2010). A view of cloud computing. Communications of the ACM 53 (4), 50-58. [https://dl.acm.org/doi/fullHtml/10.1145/1721654.1721672 Paper]&lt;br /&gt;
* M Zaharia, M Chowdhury, MJ Franklin, S Shenker, I Stoica (2010). Spark: Cluster computing with working sets. 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 10). [https://www.usenix.org/event/hotcloud10/tech/full_papers/Zaharia.pdf Paper]&lt;br /&gt;
* Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, Ion Stoica (2012). Resilient distributed datasets: A Fault-Tolerant abstraction for In-Memory cluster computing. In Proc. 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp. 15-28. [https://scholar.google.com/citations?view_op=view_citation&amp;amp;hl=en&amp;amp;user=I1EvjZsAAAAJ&amp;amp;citation_for_view=I1EvjZsAAAAJ:Tyk-4Ss8FVUC Paper]&lt;br /&gt;
* Karun, A. K., &amp;amp; Chitharanjan, K. (2013, April). A review on hadoop—HDFS infrastructure extensions. In 2013 IEEE conference on information &amp;amp; communication technologies (pp. 132-137). IEEE. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:GIm8aG-ScOsJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;scillfp=6854624816870725192&amp;amp;oi=lle Paper]&lt;br /&gt;
* Kafka?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Opdahl, A. L., &amp;amp; Tessem, B. (2021). Ontologies for finding journalistic angles. Software and Systems Modeling, 20(1), 71-87. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:pKELE6iBzpAJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;as_ylo=2021&amp;amp;scillfp=4299025271368542631&amp;amp;oi=lle Paper]&lt;br /&gt;
* Berven, A., Christensen, O. A., Moldeklev, S., Opdahl, A. L., &amp;amp; Villanger, K. J. (2020). A knowledge-graph platform for newsrooms. Computers in Industry, 123, 103321. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:0K5dB1_9nusJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;as_ylo=2018&amp;amp;scillfp=11776208952974186557&amp;amp;oi=lle Paper]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- Architectures: kappa, lambda, phi, Liquid --&amp;gt;&lt;br /&gt;
&amp;lt;!-- Classic papers: HDFS, Spark, RDDs --&amp;gt;&lt;br /&gt;
&amp;lt;!-- Privacy? --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Technical introductions ==&lt;br /&gt;
Selected web pages will become available here, including:&lt;br /&gt;
* [https://kafka.apache.org/intro Kafka Introduction]&lt;br /&gt;
* [https://docs.nrec.no/index.html NREC and OpenStack], the following sections/pages: Introduction, Project application, Logging in, The dashboard, Create a Linux virtual machine (skip: Windows), Using SSH, Working with Security Groups, Create and manage volumes, Create and manage snapshots (skip: images), Instance console&lt;br /&gt;
* [https://docs.nrec.no/terraform-part1.html Terraform and NREC part I], [https://docs.nrec.no/terraform-part2.html part II], and [https://docs.nrec.no/terraform-part3.html part III]&lt;br /&gt;
* [https://www.ansible.com/overview/how-ansible-works How Ansible Works] and [https://docs.ansible.com/ansible_community.html the Ansible Community portal]&lt;br /&gt;
* Docker Docs: [https://docs.docker.com/get-started/overview/ Docker overview] and [https://docs.docker.com/get-started/ Get started]&lt;br /&gt;
* [https://kubernetes.io/docs/tutorials/kubernetes-basics/ Learn Kubernetes basics], modules 1-6&lt;br /&gt;
* [https://gdpr.eu/what-is-gdpr/ What is GDPR, the EU’s new data protection law?]&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Spark 3.3.0 [https://spark.apache.org/docs/latest/index.html Overview] and [https://spark.apache.org/docs/latest/quick-start.html Quick Start (with Python examples)]&lt;br /&gt;
* [https://developer.twitter.com/en/docs/twitter-api Twitter API v2]&lt;br /&gt;
* [https://github.com/tweepy/tweepy Tweepy: Twitter for Python]&lt;br /&gt;
* [https://docs.tweepy.org/en/latest/ Tweepy Documentation]&lt;br /&gt;
* [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Structured Streaming Spark Programming Guide]&lt;br /&gt;
* Apache Spark [https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/index.html Structured Streaming API]&lt;br /&gt;
* [https://kafka-python.readthedocs.io/en/master/ kafka-python API]&lt;br /&gt;
* EU&#039;s [https://gdpr-info.eu/ General Data Protection Regulation (GDPR)] - the official legal text&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- GDELT --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Lecture slides==&lt;br /&gt;
See the [[Sessions|Session page]] for lecture slides after each session.&lt;br /&gt;
&lt;br /&gt;
==Readings for each session==&lt;br /&gt;
The [[Sessions|Sessions page]] will suggest specific readings for each session and its associated exercise.&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Exercises&amp;diff=1222</id>
		<title>Exercises</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Exercises&amp;diff=1222"/>
		<updated>2022-10-31T13:56:30Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Outline of the exercises. Because the exercises are new this year, it is hard to plan exactly, so this is likely to change a bit!&lt;br /&gt;
* Exercise 1: [[Getting started with Apache Spark]] and [[Processing tweets with Spark]].&lt;br /&gt;
* Exercise 2: [[Streaming tweets with Twitter API]]&lt;br /&gt;
* Exercise 3: [[Streaming tweets with Kafka and Spark]]&lt;br /&gt;
* Exercise 4:&lt;br /&gt;
** [[Create Spark cluster]]&lt;br /&gt;
** [[Install HDFS and YARN on the cluster]]&lt;br /&gt;
** [[Install Spark on the cluster]]&lt;br /&gt;
** [[Install Kafka on the cluster]]&lt;br /&gt;
* Exercise 5: &lt;br /&gt;
** [[Create Spark cluster using Terraform]]&lt;br /&gt;
** [[Configure Spark cluster using Ansible]]&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1221</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1221"/>
		<updated>2022-10-31T13:55:23Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Install Zookeeper on the cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
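One way is to add a &#039;&#039;local_file&#039;&#039; resource (from Terraform&#039;s &#039;&#039;hashicorp/local&#039;&#039; provider) to &#039;&#039;info319-cluster.tf&#039;&#039;. This is only a sketch: the instance resource names &#039;&#039;driver&#039;&#039; and &#039;&#039;worker&#039;&#039; are assumptions and must match those in your own configuration (the &#039;&#039;chmod&#039;&#039; above makes &#039;&#039;/etc/ansible/hosts&#039;&#039; writable):&lt;br /&gt;
 # Sketch - assumes instances named &amp;quot;driver&amp;quot; and &amp;quot;worker&amp;quot; (with count set)&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ansible_hosts&amp;quot; {&lt;br /&gt;
   filename = &amp;quot;/etc/ansible/hosts&amp;quot;&lt;br /&gt;
   content  = join(&amp;quot;\n&amp;quot;, concat(&lt;br /&gt;
     [openstack_compute_instance_v2.driver.name],&lt;br /&gt;
     openstack_compute_instance_v2.worker[*].name))&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;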
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task that backs up &#039;&#039;~/.bashrc&#039;&#039;:&lt;br /&gt;
 - name: Prepare .bashrc&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Save original .bashrc&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: /home/ubuntu/.bashrc&lt;br /&gt;
         dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to the end of &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
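&lt;br /&gt;
A sketch of the two tasks, using module parameters from the Ansible documentation (the paths are the ones above):&lt;br /&gt;
     - name: Create empty .info319&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         path: /home/ubuntu/.info319&lt;br /&gt;
         state: touch&lt;br /&gt;
 &lt;br /&gt;
     - name: Source .info319 from .bashrc&lt;br /&gt;
       ansible.builtin.lineinfile:&lt;br /&gt;
         path: /home/ubuntu/.bashrc&lt;br /&gt;
         line: source .info319&lt;br /&gt;
(&#039;&#039;lineinfile&#039;&#039; appends at the end of the file by default, and only if the line is not already there.)&lt;br /&gt;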
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes  # because you need root privilege (sudo) to update /etc/hosts&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There should not be a space between the two closing curly braces at the end of the &#039;&#039;block:&#039;&#039; line. But without the space, WikiText misinterprets them as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line so that nodes can &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
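For example, the &#039;&#039;config&#039;&#039; file could be uploaded like this (the other two files follow the same pattern; &#039;&#039;mode: &#039;0600&#039;&#039;&#039; keeps the files private, which &#039;&#039;ssh&#039;&#039; requires for keys):&lt;br /&gt;
     - name: Upload SSH config&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: config&lt;br /&gt;
         dest: /home/ubuntu/.ssh/config&lt;br /&gt;
         mode: &#039;0600&#039;&lt;br /&gt;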
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option you can use to avoid repeating all earlier plays (blocks) and stages. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each step whether to execute or skip it.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
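A minimal sketch of such a task (&#039;&#039;become: yes&#039;&#039; is needed because &#039;&#039;apt&#039;&#039; requires root):&lt;br /&gt;
     - name: Install Java&lt;br /&gt;
       ansible.builtin.apt:&lt;br /&gt;
         name: openjdk-8-jdk-headless&lt;br /&gt;
         update_cache: yes&lt;br /&gt;
       become: yes&lt;br /&gt;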
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Use the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules to do this. They may require installation on your local machine:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
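A sketch of the three tasks, assuming the volume appears as &#039;&#039;/dev/sdb&#039;&#039; on your nodes (check the actual device name first, for example with &#039;&#039;lsblk&#039;&#039;):&lt;br /&gt;
     - name: Partition the volume&lt;br /&gt;
       community.general.parted:&lt;br /&gt;
         device: /dev/sdb&lt;br /&gt;
         number: 1&lt;br /&gt;
         state: present&lt;br /&gt;
       become: yes&lt;br /&gt;
  &lt;br /&gt;
     - name: Create a filesystem&lt;br /&gt;
       community.general.filesystem:&lt;br /&gt;
         fstype: ext4&lt;br /&gt;
         dev: /dev/sdb1&lt;br /&gt;
       become: yes&lt;br /&gt;
  &lt;br /&gt;
     - name: Mount the volume&lt;br /&gt;
       ansible.posix.mount:&lt;br /&gt;
         path: /home/ubuntu/volume&lt;br /&gt;
         src: /dev/sdb1&lt;br /&gt;
         fstype: ext4&lt;br /&gt;
         state: mounted&lt;br /&gt;
       become: yes&lt;br /&gt;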
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN you need &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;ansible.builtin.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep tf-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop (and other) archives directly to each cluster host. If you re-run your playbook many times, however, this takes time and can trigger rate limiting on the download servers. In that case, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.&lt;br /&gt;
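A sketch of a direct download task (the Hadoop version and mirror URL are only examples; use the archive you chose in Exercise 4):&lt;br /&gt;
     - name: Download Hadoop archive&lt;br /&gt;
       ansible.builtin.get_url:&lt;br /&gt;
         url: https://archive.apache.org/dist/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz&lt;br /&gt;
         dest: /home/ubuntu/hadoop-3.3.4.tar.gz&lt;br /&gt;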
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
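For example, assuming the Hadoop archive has been downloaded to &#039;&#039;/home/ubuntu/hadoop-3.3.4.tar.gz&#039;&#039; (the version number is only an example):&lt;br /&gt;
     - name: Unpack Hadoop&lt;br /&gt;
       ansible.builtin.unarchive:&lt;br /&gt;
         src: /home/ubuntu/hadoop-3.3.4.tar.gz&lt;br /&gt;
         dest: /home/ubuntu/volume&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
  &lt;br /&gt;
     - name: Create hadoop symlink&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         src: /home/ubuntu/volume/hadoop-3.3.4&lt;br /&gt;
         dest: /home/ubuntu/volume/hadoop&lt;br /&gt;
         state: link&lt;br /&gt;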
&lt;br /&gt;
=== Configure HDFS and YARN ===&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
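For example, for &#039;&#039;HADOOP_HOME&#039;&#039; (repeat for the other variables from Exercise 4; the path assumes Hadoop is symlinked under &#039;&#039;/home/ubuntu/volume&#039;&#039;):&lt;br /&gt;
     - name: Set HADOOP_HOME&lt;br /&gt;
       ansible.builtin.lineinfile:&lt;br /&gt;
         path: /home/ubuntu/.info319&lt;br /&gt;
         line: export HADOOP_HOME=/home/ubuntu/volume/hadoop&lt;br /&gt;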
&lt;br /&gt;
Change the variable syntax in the &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; files from Exercise 4, replacing shell-style variables with Ansible&#039;s Jinja2 style. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module (instead of the Linux &#039;&#039;envsubst&#039;&#039; command) to configure the files. Note that the files use the {{ hadoop_namenode } } variable. It is not defined yet, but should have the same value as {{ master_node } }.&lt;br /&gt;
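A sketch of the fact and template tasks (the destination path assumes Hadoop is symlinked under &#039;&#039;/home/ubuntu/volume/hadoop&#039;&#039;; remember that your own playbook must not have a space between the closing curly braces):&lt;br /&gt;
     - name: Set hadoop_namenode fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         hadoop_namenode: &amp;quot;{{ master_node } }&amp;quot;&lt;br /&gt;
  &lt;br /&gt;
     - name: Configure core-site.xml&lt;br /&gt;
       ansible.builtin.template:&lt;br /&gt;
         src: core-site.xml.j2&lt;br /&gt;
         dest: /home/ubuntu/volume/hadoop/etc/hadoop/core-site.xml&lt;br /&gt;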
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that refuses to re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible ignore exit code 1 from &#039;&#039;hdfs&#039;&#039; in that case:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it should run on an odd number of machines, so it may not run on all the machines in the cluster&lt;br /&gt;
# each zookeeper needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the number is odd, but only on the workers if the total number is even.&lt;br /&gt;
&lt;br /&gt;
As for the second point, Exercise 4 has already suggested shell commands you can use to manage zookeeper ids. But Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can explore.&lt;br /&gt;
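One loop-free alternative is to derive each host&#039;s id from its position in the inventory (this sketch assumes the &#039;&#039;myid&#039;&#039; file lives in &#039;&#039;/home/ubuntu/volume/zookeeper/data&#039;&#039; and that the data folder already exists):&lt;br /&gt;
     - name: Write Zookeeper myid&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         dest: /home/ubuntu/volume/zookeeper/data/myid&lt;br /&gt;
         content: &amp;quot;{{ groups[&#039;all&#039;].index(inventory_hostname) + 1 } }&amp;quot;&lt;br /&gt;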
&lt;br /&gt;
In the end, this task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; (a better location than the previously suggested &#039;&#039;/tmp/zookeeper/myid&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1220</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1220"/>
		<updated>2022-10-31T13:52:21Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Install HDFS and YARN */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task that backs up &#039;&#039;~/.bashrc&#039;&#039;:&lt;br /&gt;
 - name: Prepare .bashrc&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Save original .bashrc&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: /home/ubuntu/.bashrc&lt;br /&gt;
         dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to the end of &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes  # because you need root privilege (sudo) to update /etc/hosts&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; In your playbook there must be no space between the two closing curly braces at the end of the &#039;&#039;block:&#039;&#039; line (and in the similar snippets below). The space is only shown here because WikiText would otherwise misinterpret the braces as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line so that nodes can &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option you can use to avoid repeating all earlier plays (blocks) and stages. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each step whether to execute or skip it.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Use the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules to do this. They may require installation on your local machine:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN you need &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;ansible.builtin.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep tf-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop (and other) archives directly to each cluster host. If you re-run your playbook many times, however, this takes time and can trigger rate limiting on the download servers. In that case, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
&lt;br /&gt;
=== Configure HDFS and YARN ===&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
&lt;br /&gt;
Change the variable syntax in the &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; files from Exercise 4, replacing shell-style variables with Ansible&#039;s Jinja2 style. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module (instead of the Linux &#039;&#039;envsubst&#039;&#039; command) to configure the files. Note that the files use the {{ hadoop_namenode } } variable. It is not defined yet, but should have the same value as {{ master_node } }.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that refuses to re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible ignore exit code 1 from &#039;&#039;hdfs&#039;&#039; in that case:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it should run on an odd number of machines, so it may not run on all the machines in the cluster&lt;br /&gt;
# each zookeeper needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the number is odd, but only on the workers if the total number is even.&lt;br /&gt;
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage zookeeper ids. But Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use.&lt;br /&gt;
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; (a better location than the previously suggested &#039;&#039;/tmp/zookeeper/myid&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1219</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1219"/>
		<updated>2022-10-31T13:51:39Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Install HDFS and YARN */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task that backs up &#039;&#039;~/.bashrc&#039;&#039;:&lt;br /&gt;
 - name: Prepare .bashrc&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Save original .bashrc&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: /home/ubuntu/.bashrc&lt;br /&gt;
         dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to the end of &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes  # because you need root privilege (sudo) to update /etc/hosts&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; In your playbook there must be no space between the two closing curly braces at the end of the &#039;&#039;block:&#039;&#039; line (and in the similar snippets below). The space is only shown here because WikiText would otherwise misinterpret the braces as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line so that nodes can &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option you can use to avoid repeating all earlier plays (blocks) and stages. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each step whether to execute or skip it.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Use the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules to do this. They may require installation on your local machine:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
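&lt;br /&gt;
A hedged sketch of the three tasks, assuming the extra volume appears as &#039;&#039;/dev/sdb&#039;&#039; (check the actual device name with &#039;&#039;&#039;lsblk&#039;&#039;&#039; first):&lt;br /&gt;
     - name: Partition the volume&lt;br /&gt;
       community.general.parted:&lt;br /&gt;
         device: /dev/sdb&lt;br /&gt;
         number: 1&lt;br /&gt;
         state: present&lt;br /&gt;
       become: yes&lt;br /&gt;
 &lt;br /&gt;
     - name: Create a filesystem&lt;br /&gt;
       community.general.filesystem:&lt;br /&gt;
         fstype: ext4&lt;br /&gt;
         dev: /dev/sdb1&lt;br /&gt;
       become: yes&lt;br /&gt;
 &lt;br /&gt;
     - name: Mount the volume&lt;br /&gt;
       ansible.posix.mount:&lt;br /&gt;
         path: /home/ubuntu/volume&lt;br /&gt;
         src: /dev/sdb1&lt;br /&gt;
         fstype: ext4&lt;br /&gt;
         state: mounted&lt;br /&gt;
       become: yes&lt;br /&gt;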
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN you need the &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep tf-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop (and other) archives directly to each cluster host. But if you re-run your playbook many times, the repeated downloads take time and can trigger rate limiting on the download servers. If so, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.&lt;br /&gt;
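&lt;br /&gt;
For the direct-download variant, a task might look like this (the mirror URL and Hadoop version are examples only):&lt;br /&gt;
     - name: Download Hadoop archive&lt;br /&gt;
       ansible.builtin.get_url:&lt;br /&gt;
         url: https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz&lt;br /&gt;
         dest: /home/ubuntu/hadoop-3.3.4.tar.gz&lt;br /&gt;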
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
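&lt;br /&gt;
For example (the archive name and target folder are assumptions that must match your download task):&lt;br /&gt;
     - name: Unpack Hadoop archive&lt;br /&gt;
       ansible.builtin.unarchive:&lt;br /&gt;
         src: /home/ubuntu/hadoop-3.3.4.tar.gz&lt;br /&gt;
         dest: /home/ubuntu/volume&lt;br /&gt;
         remote_src: yes  # the archive is already on the remote host&lt;br /&gt;
 &lt;br /&gt;
     - name: Create hadoop symlink&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         src: /home/ubuntu/volume/hadoop-3.3.4&lt;br /&gt;
         dest: /home/ubuntu/volume/hadoop&lt;br /&gt;
         state: link&lt;br /&gt;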
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
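&lt;br /&gt;
For example, assuming the symlinked Hadoop folder (&#039;&#039;lineinfile&#039;&#039; only adds the line if it is not already present):&lt;br /&gt;
     - name: Set HADOOP_HOME&lt;br /&gt;
       ansible.builtin.lineinfile:&lt;br /&gt;
         path: /home/ubuntu/.info319&lt;br /&gt;
         line: export HADOOP_HOME=/home/ubuntu/volume/hadoop&lt;br /&gt;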
&lt;br /&gt;
Change the variable syntax in the files &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; from Exercise 4 from Linux to Ansible. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module (instead of the Linux &#039;&#039;envsubst&#039;&#039; command) to configure the files. Note that the templates use the {{ hadoop_namenode } } variable. It is not defined yet, but it should have the same value as {{ master_node } }.&lt;br /&gt;
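&lt;br /&gt;
A sketch of the two tasks (the local &#039;&#039;templates/&#039;&#039; folder is an assumption; use wherever you keep your &#039;&#039;.j2&#039;&#039; files):&lt;br /&gt;
     - name: Set hadoop_namenode fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         hadoop_namenode: &amp;quot;{{ master_node } }&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
     - name: Configure core-site.xml&lt;br /&gt;
       ansible.builtin.template:&lt;br /&gt;
         src: templates/core-site.xml.j2&lt;br /&gt;
         dest: /home/ubuntu/volume/hadoop/etc/hadoop/core-site.xml&lt;br /&gt;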
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that refuses to re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible ignore exit code 1 from &#039;&#039;hdfs&#039;&#039; in such cases:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it may not run on all the machines in the cluster (the number of Zookeeper nodes must be odd)&lt;br /&gt;
# each zookeeper needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the total number of hosts is odd, but only on the workers if it is even.&lt;br /&gt;
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage zookeeper ids. But Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use.&lt;br /&gt;
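&lt;br /&gt;
One possible sketch, which derives each host&#039;s id from its position in the inventory instead of using an explicit loop (the data folder must match your Zookeeper configuration):&lt;br /&gt;
     - name: Write Zookeeper myid&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         content: &amp;quot;{{ groups[&#039;all&#039;].index(inventory_hostname) + 1 } }&amp;quot;&lt;br /&gt;
         dest: /home/ubuntu/volume/zookeeper/data/myid&lt;br /&gt;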
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; in the Zookeeper step (a more durable location than the previously suggested &#039;&#039;/tmp/zookeeper/myid&#039;&#039;).&lt;br /&gt;
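&lt;br /&gt;
A hedged way to turn that file into a fact (the variable name &#039;&#039;broker_id&#039;&#039; is just an example):&lt;br /&gt;
     - name: Register broker id expression&lt;br /&gt;
       shell: cat /home/ubuntu/volume/zookeeper/data/myid&lt;br /&gt;
       register: broker_id_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set broker_id fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         broker_id: &amp;quot;{{ broker_id_expr.stdout } }&amp;quot;&lt;br /&gt;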
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1218</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1218"/>
		<updated>2022-10-31T13:49:33Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Mount volumes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task that backs up &#039;&#039;~/.bashrc&#039;&#039;:&lt;br /&gt;
 - name: Prepare .bashrc&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Save original .bashrc&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: /home/ubuntu/.bashrc&lt;br /&gt;
         dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to the end of &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes  # because you need root privilege (sudo) to update /etc/hosts&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There should not be a space between the two closing curly braces at the end of the &#039;&#039;block:&#039;&#039; line (nor in the other &#039;&#039;lookup&#039;&#039; expressions on this page). Without the space, however, WikiText misinterprets them as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line so that nodes can &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option you can use to avoid re-running all earlier plays and tasks. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each task whether to execute or skip it.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Use the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules to do this. They may require installation on your local machine:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN you need the &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep tf-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop and later Spark archives directly to each cluster host. But if you re-run your playbook many times, the repeated downloads take time and can trigger rate limiting on the download servers. If so, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
&lt;br /&gt;
Change the variable syntax in the files &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; from Exercise 4 from Linux to Ansible. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module instead of the Linux &#039;&#039;envsubst&#039;&#039; command to configure the files. Note that the templates use the {{ hadoop_namenode } } variable. It is not defined yet, but it should have the same value as {{ master_node } }.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that refuses to re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible ignore exit code 1 from &#039;&#039;hdfs&#039;&#039; in such cases:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it may not run on all the machines in the cluster (the number of Zookeeper nodes must be odd)&lt;br /&gt;
# each zookeeper needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the total number of hosts is odd, but only on the workers if it is even.&lt;br /&gt;
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage zookeeper ids. But Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use.&lt;br /&gt;
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; in the Zookeeper step (a more durable location than the previously suggested &#039;&#039;/tmp/zookeeper/myid&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1217</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1217"/>
		<updated>2022-10-31T13:47:34Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Configure SSH */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task that backs up &#039;&#039;~/.bashrc&#039;&#039;:&lt;br /&gt;
 - name: Prepare .bashrc&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Save original .bashrc&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: /home/ubuntu/.bashrc&lt;br /&gt;
         dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to the end of &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes  # because you need root privilege (sudo) to update /etc/hosts&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There should not be a space between the two closing curly braces at the end of the &#039;&#039;block:&#039;&#039; line (nor in the other &#039;&#039;lookup&#039;&#039; expressions on this page). Without the space, however, WikiText misinterprets them as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line so that nodes can &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option you can use to avoid re-running all earlier plays and tasks. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each task whether to execute or skip it.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Use the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules to do this. They may first need to be installed on your local machine:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN you need the &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep tf-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop and later Spark archives directly to each cluster host. But if you re-run your playbook many times, the repeated downloads take time and can trigger rate limiting on the download servers. If so, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
&lt;br /&gt;
Change the variable syntax in the files &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; from Exercise 4 from Linux to Ansible. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module instead of the Linux &#039;&#039;envsubst&#039;&#039; command to configure the files. Note that the templates use the {{ hadoop_namenode } } variable. It is not defined yet, but it should have the same value as {{ master_node } }.&lt;br /&gt;
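&lt;br /&gt;
For example, the following sketch defines &#039;&#039;hadoop_namenode&#039;&#039; and instantiates the template; the destination path is an assumption based on the folder layout used earlier:&lt;br /&gt;
     - name: Set hadoop_namenode fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         hadoop_namenode: &amp;quot;{{ master_node } }&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
     - name: Configure core-site.xml from template&lt;br /&gt;
       ansible.builtin.template:&lt;br /&gt;
         src: core-site.xml.j2&lt;br /&gt;
         dest: /home/ubuntu/volume/hadoop/etc/hadoop/core-site.xml&lt;br /&gt;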
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that refuses to re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible ignore exit code 1 from hdfs in such cases:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it should run on an odd number of machines, so it may not run on all the machines in the cluster&lt;br /&gt;
# each zookeeper needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the total number of hosts is odd, but only on the workers if it is even.&lt;br /&gt;
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage zookeeper ids. But Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use.&lt;br /&gt;
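&lt;br /&gt;
For example, the following sketch derives each host&#039;s &#039;&#039;myid&#039;&#039; from its position in the inventory and skips the driver when the total number of hosts is even; the data folder and driver host name are assumptions based on the rest of this exercise:&lt;br /&gt;
     - name: Write Zookeeper myid&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         content: &amp;quot;{{ groups[&#039;all&#039;].index(inventory_hostname) + 1 } }&amp;quot;&lt;br /&gt;
         dest: /home/ubuntu/volume/zookeeper/data/myid&lt;br /&gt;
       when: (groups[&#039;all&#039;] | length) % 2 == 1 or inventory_hostname != &#039;terraform-driver&#039;&lt;br /&gt;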
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was earlier written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; (a better location than the previously suggested &#039;&#039;/tmp/zookeeper/myid&#039;&#039;, since &#039;&#039;/tmp&#039;&#039; is cleared on reboot).&lt;br /&gt;
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1216</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1216"/>
		<updated>2022-10-31T13:45:05Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Preparing .bashrc */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task that backs up &#039;&#039;~/.bashrc&#039;&#039;:&lt;br /&gt;
 - name: Prepare .bashrc&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Save original .bashrc&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: /home/ubuntu/.bashrc&lt;br /&gt;
         dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to the end of &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
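&lt;br /&gt;
A sketch of the two tasks; the &#039;&#039;source&#039;&#039; line goes into &#039;&#039;~/.bashrc&#039;&#039; so that interactive shells pick up &#039;&#039;~/.info319&#039;&#039;:&lt;br /&gt;
     - name: Create .info319&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         path: /home/ubuntu/.info319&lt;br /&gt;
         state: touch&lt;br /&gt;
 &lt;br /&gt;
     - name: Source .info319 from .bashrc&lt;br /&gt;
       ansible.builtin.lineinfile:&lt;br /&gt;
         path: /home/ubuntu/.bashrc&lt;br /&gt;
         line: source .info319&lt;br /&gt;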
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There should not be a space between the two closing curly braces at the end of the &#039;&#039;block:&#039;&#039; line (the same goes for similar lines below). Without the space, however, WikiText would misinterpret them as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line to allow nodes to &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
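&lt;br /&gt;
For example, a sketch that loops over the three files; the destination folder and the (restrictive) file mode are assumptions:&lt;br /&gt;
     - name: Upload SSH config and keys&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: &amp;quot;{{ item } }&amp;quot;&lt;br /&gt;
         dest: /home/ubuntu/.ssh/&lt;br /&gt;
         mode: &#039;0600&#039;&lt;br /&gt;
       loop:&lt;br /&gt;
         - config&lt;br /&gt;
         - ~/.ssh/config.terraform-hosts&lt;br /&gt;
         - ~/.ssh/info319-spark-cluster&lt;br /&gt;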
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option to avoid repeating all earlier blocks and stages. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each step whether to execute, skip, or finish.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Using the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules from your local machine may require installation:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN, you need the &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; values available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;ansible.builtin.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep tf-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
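&lt;br /&gt;
One possible sketch of those two tasks, assuming the worker hosts appear as &#039;&#039;tf-worker&#039;&#039; lines in &#039;&#039;/etc/hosts&#039;&#039; (as elsewhere on this page, there is an extra space between the closing curly braces to placate WikiText):&lt;br /&gt;
     - name: Register num_workers expression&lt;br /&gt;
       shell: grep -c tf-worker /etc/hosts&lt;br /&gt;
       register: num_workers_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set num_workers fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         num_workers: &amp;quot;{{ num_workers_expr.stdout } }&amp;quot;&lt;br /&gt;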
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop and, later, Spark archives directly to each cluster host. However, if you re-run your playbook many times, this takes time and can trigger rate limiting on the download servers. In that case, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
&lt;br /&gt;
Change the variable syntax in the files &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; from Exercise 4 from Linux shell syntax to Ansible (Jinja2) syntax. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module instead of the Linux &#039;&#039;envsubst&#039;&#039; command to configure the files. Note that the templates use the {{ hadoop_namenode } } variable. It is not defined yet, but it should have the same value as {{ master_node } }.&lt;br /&gt;
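&lt;br /&gt;
For example, the following sketch defines &#039;&#039;hadoop_namenode&#039;&#039; and instantiates the template; the destination path is an assumption based on the folder layout used earlier:&lt;br /&gt;
     - name: Set hadoop_namenode fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         hadoop_namenode: &amp;quot;{{ master_node } }&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
     - name: Configure core-site.xml from template&lt;br /&gt;
       ansible.builtin.template:&lt;br /&gt;
         src: core-site.xml.j2&lt;br /&gt;
         dest: /home/ubuntu/volume/hadoop/etc/hadoop/core-site.xml&lt;br /&gt;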
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that refuses to re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible ignore exit code 1 from hdfs in such cases:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it should run on an odd number of machines, so it may not run on all the machines in the cluster&lt;br /&gt;
# each zookeeper needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the total number of hosts is odd, but only on the workers if it is even.&lt;br /&gt;
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage zookeeper ids. But Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use.&lt;br /&gt;
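&lt;br /&gt;
For example, the following sketch derives each host&#039;s &#039;&#039;myid&#039;&#039; from its position in the inventory and skips the driver when the total number of hosts is even; the data folder and driver host name are assumptions based on the rest of this exercise:&lt;br /&gt;
     - name: Write Zookeeper myid&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         content: &amp;quot;{{ groups[&#039;all&#039;].index(inventory_hostname) + 1 } }&amp;quot;&lt;br /&gt;
         dest: /home/ubuntu/volume/zookeeper/data/myid&lt;br /&gt;
       when: (groups[&#039;all&#039;] | length) % 2 == 1 or inventory_hostname != &#039;terraform-driver&#039;&lt;br /&gt;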
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was earlier written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; (a better location than the previously suggested &#039;&#039;/tmp/zookeeper/myid&#039;&#039;, since &#039;&#039;/tmp&#039;&#039; is cleared on reboot).&lt;br /&gt;
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1215</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1215"/>
		<updated>2022-10-31T13:44:17Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Create Ansible playbook */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task that backs up &#039;&#039;~/.bashrc&#039;&#039;:&lt;br /&gt;
 - name: Prepare .bashrc&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Save original .bashrc&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: /home/ubuntu/.bashrc&lt;br /&gt;
         dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to the end of &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
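&lt;br /&gt;
A sketch of the two tasks; the &#039;&#039;source&#039;&#039; line goes into &#039;&#039;~/.bashrc&#039;&#039; so that interactive shells pick up &#039;&#039;~/.info319&#039;&#039;:&lt;br /&gt;
     - name: Create .info319&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         path: /home/ubuntu/.info319&lt;br /&gt;
         state: touch&lt;br /&gt;
 &lt;br /&gt;
     - name: Source .info319 from .bashrc&lt;br /&gt;
       ansible.builtin.lineinfile:&lt;br /&gt;
         path: /home/ubuntu/.bashrc&lt;br /&gt;
         line: source .info319&lt;br /&gt;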
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There should not be a space between the two closing curly braces at the end of the &#039;&#039;block:&#039;&#039; line (the same goes for similar lines below). Without the space, however, WikiText would misinterpret them as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line to allow nodes to &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
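&lt;br /&gt;
For example, a sketch that loops over the three files; the destination folder and the (restrictive) file mode are assumptions:&lt;br /&gt;
     - name: Upload SSH config and keys&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: &amp;quot;{{ item } }&amp;quot;&lt;br /&gt;
         dest: /home/ubuntu/.ssh/&lt;br /&gt;
         mode: &#039;0600&#039;&lt;br /&gt;
       loop:&lt;br /&gt;
         - config&lt;br /&gt;
         - ~/.ssh/config.terraform-hosts&lt;br /&gt;
         - ~/.ssh/info319-spark-cluster&lt;br /&gt;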
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option to avoid repeating all earlier blocks and stages. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each step whether to execute, skip, or finish.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Using the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules from your local machine may require installation:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN, you need the &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; values available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;ansible.builtin.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep tf-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
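&lt;br /&gt;
One possible sketch of those two tasks, assuming the worker hosts appear as &#039;&#039;tf-worker&#039;&#039; lines in &#039;&#039;/etc/hosts&#039;&#039; (as elsewhere on this page, there is an extra space between the closing curly braces to placate WikiText):&lt;br /&gt;
     - name: Register num_workers expression&lt;br /&gt;
       shell: grep -c tf-worker /etc/hosts&lt;br /&gt;
       register: num_workers_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set num_workers fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         num_workers: &amp;quot;{{ num_workers_expr.stdout } }&amp;quot;&lt;br /&gt;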
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop and, later, Spark archives directly to each cluster host. However, if you re-run your playbook many times, this takes time and can trigger rate limiting on the download servers. In that case, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
&lt;br /&gt;
Change the variable syntax in the files &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; from Exercise 4 from Linux shell syntax to Ansible (Jinja2) syntax. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module instead of the Linux &#039;&#039;envsubst&#039;&#039; command to configure the files. Note that the templates use the {{ hadoop_namenode } } variable. It is not defined yet, but it should have the same value as {{ master_node } }.&lt;br /&gt;
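&lt;br /&gt;
For example, the following sketch defines &#039;&#039;hadoop_namenode&#039;&#039; and instantiates the template; the destination path is an assumption based on the folder layout used earlier:&lt;br /&gt;
     - name: Set hadoop_namenode fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         hadoop_namenode: &amp;quot;{{ master_node } }&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
     - name: Configure core-site.xml from template&lt;br /&gt;
       ansible.builtin.template:&lt;br /&gt;
         src: core-site.xml.j2&lt;br /&gt;
         dest: /home/ubuntu/volume/hadoop/etc/hadoop/core-site.xml&lt;br /&gt;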
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that refuses to re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible ignore exit code 1 from hdfs in such cases:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it should not necessarily run on all the machines in the cluster (the number of Zookeeper nodes must be odd)&lt;br /&gt;
# each Zookeeper node needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the number is odd, but only on the workers if the total number is even.&lt;br /&gt;
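&lt;br /&gt;
One way to express this selection is a &#039;&#039;when:&#039;&#039; condition on each Zookeeper task (a sketch; it assumes a &#039;&#039;num_workers&#039;&#039; fact was set earlier and that worker hostnames contain &#039;&#039;worker&#039;&#039;):&lt;br /&gt;
     - name: Create Zookeeper data folder&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         path: /home/ubuntu/volume/zookeeper/data&lt;br /&gt;
         state: directory&lt;br /&gt;
       when: (num_workers | int) % 2 == 0 or &#039;worker&#039; in inventory_hostname&lt;br /&gt;
The total number of hosts is &#039;&#039;num_workers&#039;&#039; plus one, so the condition holds on every host when &#039;&#039;num_workers&#039;&#039; is even, and only on the workers when it is odd.&lt;br /&gt;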
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage Zookeeper ids. Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use instead.&lt;br /&gt;
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was earlier written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; (a better location than the previously suggested &#039;&#039;/tmp/zookeeper/myid&#039;&#039;).&lt;br /&gt;
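&lt;br /&gt;
A sketch of reading that file into a fact (the &#039;&#039;broker_id&#039;&#039; names here are just examples):&lt;br /&gt;
     - name: Register broker id expression&lt;br /&gt;
       shell: cat /home/ubuntu/volume/zookeeper/data/myid&lt;br /&gt;
       register: broker_id_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set broker_id fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         broker_id: &amp;quot;{{ broker_id_expr.stdout } }&amp;quot;&lt;br /&gt;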
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1214</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1214"/>
		<updated>2022-10-31T13:43:23Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Test run Ansible */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task that backs up &#039;&#039;~/.bashrc&#039;&#039;:&lt;br /&gt;
 - name: Prepare .bashrc&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Save original .bashrc&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: /home/ubuntu/.bashrc&lt;br /&gt;
         dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Create Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
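&lt;br /&gt;
These two tasks might look like this (a minimal sketch):&lt;br /&gt;
     - name: Create .info319&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         path: /home/ubuntu/.info319&lt;br /&gt;
         state: touch&lt;br /&gt;
 &lt;br /&gt;
     - name: Source .info319 from .bashrc&lt;br /&gt;
       ansible.builtin.lineinfile:&lt;br /&gt;
         path: /home/ubuntu/.bashrc&lt;br /&gt;
         line: source .info319&lt;br /&gt;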
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There should not be a space between the two closing curly braces in these examples. But without the space, WikiText would misinterpret them as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line so that nodes can &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option to avoid repeating all earlier blocks and stages. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each step whether to execute, skip, or finish.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Using the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules from your local machine may require installation:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
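&lt;br /&gt;
A minimal sketch of the three tasks (the device name &#039;&#039;/dev/vdb&#039;&#039; and the mount point are assumptions; check your own cluster with &#039;&#039;&#039;lsblk&#039;&#039;&#039;):&lt;br /&gt;
     - name: Partition the volume&lt;br /&gt;
       community.general.parted:&lt;br /&gt;
         device: /dev/vdb&lt;br /&gt;
         number: 1&lt;br /&gt;
         state: present&lt;br /&gt;
       become: yes&lt;br /&gt;
 &lt;br /&gt;
     - name: Create a filesystem&lt;br /&gt;
       community.general.filesystem:&lt;br /&gt;
         fstype: ext4&lt;br /&gt;
         dev: /dev/vdb1&lt;br /&gt;
       become: yes&lt;br /&gt;
 &lt;br /&gt;
     - name: Mount the volume&lt;br /&gt;
       ansible.posix.mount:&lt;br /&gt;
         path: /home/ubuntu/volume&lt;br /&gt;
         src: /dev/vdb1&lt;br /&gt;
         fstype: ext4&lt;br /&gt;
         state: mounted&lt;br /&gt;
       become: yes&lt;br /&gt;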
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN you need the &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; values available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;ansible.builtin.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep terraform-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
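&lt;br /&gt;
For example (a sketch; the &#039;&#039;grep -c&#039;&#039; pattern assumes the worker hosts appear as &#039;&#039;terraform-worker-*&#039;&#039; lines in &#039;&#039;/etc/hosts&#039;&#039;):&lt;br /&gt;
     - name: Register num_workers expression&lt;br /&gt;
       shell: grep -c terraform-worker /etc/hosts&lt;br /&gt;
       register: num_workers_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set num_workers fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         num_workers: &amp;quot;{{ num_workers_expr.stdout } }&amp;quot;&lt;br /&gt;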
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop and later Spark archives directly to each cluster host. But if you re-run your playbook many times, this takes time and can trigger rate limiting on the download servers. If so, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
&lt;br /&gt;
Change the variable syntax in the files &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; from Exercise 4 from Linux to Ansible. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module instead of Linux&#039;s &#039;&#039;envsubst&#039;&#039; command to configure the files. Note that the files use the {{ hadoop_namenode } } variable. It is not defined yet, but should be given the same value as {{ master_node } }.&lt;br /&gt;
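&lt;br /&gt;
For example (a sketch; the destination path assumes Hadoop was unpacked under &#039;&#039;/home/ubuntu/volume/hadoop&#039;&#039;):&lt;br /&gt;
     - name: Set hadoop_namenode fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         hadoop_namenode: &amp;quot;{{ master_node } }&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
     - name: Configure core-site.xml&lt;br /&gt;
       ansible.builtin.template:&lt;br /&gt;
         src: core-site.xml.j2&lt;br /&gt;
         dest: /home/ubuntu/volume/hadoop/etc/hadoop/core-site.xml&lt;br /&gt;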
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that refuses to re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible accept exit code 1 from &#039;&#039;hdfs&#039;&#039; in such cases:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it should not necessarily run on all the machines in the cluster (the number of Zookeeper nodes must be odd)&lt;br /&gt;
# each Zookeeper node needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the number is odd, but only on the workers if the total number is even.&lt;br /&gt;
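&lt;br /&gt;
One way to express this selection is a &#039;&#039;when:&#039;&#039; condition on each Zookeeper task (a sketch; it assumes a &#039;&#039;num_workers&#039;&#039; fact was set earlier and that worker hostnames contain &#039;&#039;worker&#039;&#039;):&lt;br /&gt;
     - name: Create Zookeeper data folder&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         path: /home/ubuntu/volume/zookeeper/data&lt;br /&gt;
         state: directory&lt;br /&gt;
       when: (num_workers | int) % 2 == 0 or &#039;worker&#039; in inventory_hostname&lt;br /&gt;
The total number of hosts is &#039;&#039;num_workers&#039;&#039; plus one, so the condition holds on every host when &#039;&#039;num_workers&#039;&#039; is even, and only on the workers when it is odd.&lt;br /&gt;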
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage Zookeeper ids. Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use instead.&lt;br /&gt;
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was earlier written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; (a better location than the previously suggested &#039;&#039;/tmp/zookeeper/myid&#039;&#039;).&lt;br /&gt;
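&lt;br /&gt;
A sketch of reading that file into a fact (the &#039;&#039;broker_id&#039;&#039; names here are just examples):&lt;br /&gt;
     - name: Register broker id expression&lt;br /&gt;
       shell: cat /home/ubuntu/volume/zookeeper/data/myid&lt;br /&gt;
       register: broker_id_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set broker_id fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         broker_id: &amp;quot;{{ broker_id_expr.stdout } }&amp;quot;&lt;br /&gt;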
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1213</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1213"/>
		<updated>2022-10-31T13:42:46Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Test run Ansible */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task:&lt;br /&gt;
 - name: Prepare .bashrc&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Save original .bashrc&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: /home/ubuntu/.bashrc&lt;br /&gt;
         dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Create Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
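&lt;br /&gt;
These two tasks might look like this (a minimal sketch):&lt;br /&gt;
     - name: Create .info319&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         path: /home/ubuntu/.info319&lt;br /&gt;
         state: touch&lt;br /&gt;
 &lt;br /&gt;
     - name: Source .info319 from .bashrc&lt;br /&gt;
       ansible.builtin.lineinfile:&lt;br /&gt;
         path: /home/ubuntu/.bashrc&lt;br /&gt;
         line: source .info319&lt;br /&gt;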
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There should not be a space between the two closing curly braces in these examples. But without the space, WikiText would misinterpret them as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line so that nodes can &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option to avoid repeating all earlier blocks and stages. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each step whether to execute, skip, or finish.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Using the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules from your local machine may require installation:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
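&lt;br /&gt;
A minimal sketch of the three tasks (the device name &#039;&#039;/dev/vdb&#039;&#039; and the mount point are assumptions; check your own cluster with &#039;&#039;&#039;lsblk&#039;&#039;&#039;):&lt;br /&gt;
     - name: Partition the volume&lt;br /&gt;
       community.general.parted:&lt;br /&gt;
         device: /dev/vdb&lt;br /&gt;
         number: 1&lt;br /&gt;
         state: present&lt;br /&gt;
       become: yes&lt;br /&gt;
 &lt;br /&gt;
     - name: Create a filesystem&lt;br /&gt;
       community.general.filesystem:&lt;br /&gt;
         fstype: ext4&lt;br /&gt;
         dev: /dev/vdb1&lt;br /&gt;
       become: yes&lt;br /&gt;
 &lt;br /&gt;
     - name: Mount the volume&lt;br /&gt;
       ansible.posix.mount:&lt;br /&gt;
         path: /home/ubuntu/volume&lt;br /&gt;
         src: /dev/vdb1&lt;br /&gt;
         fstype: ext4&lt;br /&gt;
         state: mounted&lt;br /&gt;
       become: yes&lt;br /&gt;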
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN you need the &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; values available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;ansible.builtin.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep terraform-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
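&lt;br /&gt;
For example (a sketch; the &#039;&#039;grep -c&#039;&#039; pattern assumes the worker hosts appear as &#039;&#039;terraform-worker-*&#039;&#039; lines in &#039;&#039;/etc/hosts&#039;&#039;):&lt;br /&gt;
     - name: Register num_workers expression&lt;br /&gt;
       shell: grep -c terraform-worker /etc/hosts&lt;br /&gt;
       register: num_workers_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set num_workers fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         num_workers: &amp;quot;{{ num_workers_expr.stdout } }&amp;quot;&lt;br /&gt;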
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop and later Spark archives directly to each cluster host. But if you re-run your playbook many times, this takes time and can trigger rate limiting on the download servers. If so, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
&lt;br /&gt;
Change the variable syntax in the files &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; from Exercise 4 from Linux to Ansible. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module instead of Linux&#039;s &#039;&#039;envsubst&#039;&#039; command to configure the files. Note that the files use the {{ hadoop_namenode } } variable. It is not defined yet, but should be given the same value as {{ master_node } }.&lt;br /&gt;
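&lt;br /&gt;
For example (a sketch; the destination path assumes Hadoop was unpacked under &#039;&#039;/home/ubuntu/volume/hadoop&#039;&#039;):&lt;br /&gt;
     - name: Set hadoop_namenode fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         hadoop_namenode: &amp;quot;{{ master_node } }&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
     - name: Configure core-site.xml&lt;br /&gt;
       ansible.builtin.template:&lt;br /&gt;
         src: core-site.xml.j2&lt;br /&gt;
         dest: /home/ubuntu/volume/hadoop/etc/hadoop/core-site.xml&lt;br /&gt;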
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that refuses to re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible accept exit code 1 from &#039;&#039;hdfs&#039;&#039; in such cases:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it should not necessarily run on all the machines in the cluster (the number of Zookeeper nodes must be odd)&lt;br /&gt;
# each Zookeeper node needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the number is odd, but only on the workers if the total number is even.&lt;br /&gt;
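&lt;br /&gt;
One way to express this selection is a &#039;&#039;when:&#039;&#039; condition on each Zookeeper task (a sketch; it assumes a &#039;&#039;num_workers&#039;&#039; fact was set earlier and that worker hostnames contain &#039;&#039;worker&#039;&#039;):&lt;br /&gt;
     - name: Create Zookeeper data folder&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         path: /home/ubuntu/volume/zookeeper/data&lt;br /&gt;
         state: directory&lt;br /&gt;
       when: (num_workers | int) % 2 == 0 or &#039;worker&#039; in inventory_hostname&lt;br /&gt;
The total number of hosts is &#039;&#039;num_workers&#039;&#039; plus one, so the condition holds on every host when &#039;&#039;num_workers&#039;&#039; is even, and only on the workers when it is odd.&lt;br /&gt;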
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage Zookeeper ids. Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use instead.&lt;br /&gt;
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was earlier written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; (a better location than the previously suggested &#039;&#039;/tmp/zookeeper/myid&#039;&#039;).&lt;br /&gt;
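&lt;br /&gt;
A sketch of reading that file into a fact (the &#039;&#039;broker_id&#039;&#039; names here are just examples):&lt;br /&gt;
     - name: Register broker id expression&lt;br /&gt;
       shell: cat /home/ubuntu/volume/zookeeper/data/myid&lt;br /&gt;
       register: broker_id_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set broker_id fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         broker_id: &amp;quot;{{ broker_id_expr.stdout } }&amp;quot;&lt;br /&gt;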
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1212</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1212"/>
		<updated>2022-10-31T13:42:18Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Test run Ansible */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task:&lt;br /&gt;
- name: Prepare .bashrc&lt;br /&gt;
  hosts: all&lt;br /&gt;
  tasks:&lt;br /&gt;
&lt;br /&gt;
    - name: Save original .bashrc&lt;br /&gt;
      ansible.builtin.copy:&lt;br /&gt;
        src: /home/ubuntu/.bashrc&lt;br /&gt;
        dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
        remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Create Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
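The two bullet points above might be sketched like this (a sketch only; module parameters as in the Ansible docs, paths from this page):&lt;br /&gt;

```yaml
# Sketch: create ~/.info319 and make interactive shells source it
- name: Create ~/.info319 on all hosts
  ansible.builtin.file:
    path: /home/ubuntu/.info319
    state: touch

- name: Source ~/.info319 from ~/.bashrc
  ansible.builtin.lineinfile:
    path: /home/ubuntu/.bashrc
    line: source .info319
```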
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There should be no space between the two closing curly braces at the end of the &#039;&#039;block:&#039;&#039; line; remove it in your own file. The space is only shown here because WikiText would otherwise misinterpret the braces as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line so that nodes can &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
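For example, the three uploads might be combined into one &#039;&#039;copy&#039;&#039; task with a loop (a sketch; the mode keeps the private key acceptable to &#039;&#039;ssh&#039;&#039;, and the space before the closing curly braces is again only for WikiText):&lt;br /&gt;

```yaml
# Sketch: upload SSH config and key files into ~/.ssh on every host
- name: Upload SSH config and cluster keys
  ansible.builtin.copy:
    src: "{{ item } }"
    dest: /home/ubuntu/.ssh/
    mode: "0600"
  loop:
    - config
    - ~/.ssh/config.terraform-hosts
    - ~/.ssh/info319-spark-cluster
```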
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option to avoid repeating all earlier blocks and stages. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each step whether to execute, skip, or finish.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Using the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules from your local machine may require installation:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN you need the &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep tf-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
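By analogy with the &#039;&#039;master_node&#039;&#039; tasks above, the &#039;&#039;num_workers&#039;&#039; pair might look like this (a sketch; it assumes one &#039;&#039;terraform-worker&#039;&#039; line per worker in &#039;&#039;/etc/hosts&#039;&#039;, and the space before the closing curly braces is again only for WikiText):&lt;br /&gt;

```yaml
- name: Register num_workers expression
  shell: grep -c terraform-worker /etc/hosts
  register: num_workers_expr

- name: Set num_workers fact
  set_fact:
    num_workers: "{{ num_workers_expr.stdout } }"
```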
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop and later the Spark archives directly to each cluster host. But if you re-run your script many times, this takes time and can trigger rate limiting on the download servers. In that case, it is better to let Ansible download each archive once to your local machine and then copy it to the hosts as needed.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
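The download-once-then-copy approach might be sketched like this (the Hadoop version and mirror URL below are examples only; use the archive from Exercise 4):&lt;br /&gt;

```yaml
# Sketch: fetch the archive once locally, then copy and unpack it on each host
# NOTE: the version and URL are examples, not the ones from Exercise 4
- name: Download Hadoop once to the local machine
  ansible.builtin.get_url:
    url: https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
    dest: ./hadoop-3.3.4.tar.gz
  delegate_to: localhost
  run_once: true

- name: Copy and unpack the archive on each host
  ansible.builtin.unarchive:
    src: ./hadoop-3.3.4.tar.gz
    dest: /home/ubuntu/volume
```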
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
&lt;br /&gt;
Change the variable syntax in the files &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; from Exercise 4 from shell-style to Ansible&#039;s Jinja2 template style. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module instead of the Linux &#039;&#039;envsubst&#039;&#039; command to configure the files. Note that the templates use the {{ hadoop_namenode } } variable. It is not defined yet, but it should have the same value as {{ master_node } }.&lt;br /&gt;
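For example (a sketch; the destination path is an example, and the space before the closing curly braces is again only for WikiText):&lt;br /&gt;

```yaml
- name: Set hadoop_namenode fact
  set_fact:
    hadoop_namenode: "{{ master_node } }"

# NOTE: the dest path is an example; use your Hadoop config folder
- name: Configure core-site.xml from its template
  ansible.builtin.template:
    src: core-site.xml.j2
    dest: /home/ubuntu/volume/hadoop/etc/hadoop/core-site.xml
```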
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that does not re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible ignore &#039;&#039;Exit code 1&#039;&#039; from hdfs in such cases:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it should not run on all the machines in the cluster (the number of Zookeeper nodes must be odd)&lt;br /&gt;
# each zookeeper needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the number is odd, but only on the workers if the total number is even.&lt;br /&gt;
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage zookeeper ids. But Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use.&lt;br /&gt;
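For example, instead of an explicit loop, each host&#039;s position among the play hosts can serve as its id (a sketch; it assumes the Zookeeper data folder used on this page, and the space before the closing curly braces is only for WikiText):&lt;br /&gt;

```yaml
- name: Write Zookeeper myid from the host's position in the play
  ansible.builtin.copy:
    dest: /home/ubuntu/volume/zookeeper/data/myid
    content: "{{ ansible_play_hosts.index(inventory_hostname) + 1 } }"
```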
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
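To see what the pipeline extracts, you can feed it canned &#039;&#039;ip -4 address&#039;&#039; output (the addresses below are made up):&lt;br /&gt;

```shell
# Two sample lines in the format printed by `ip -4 address` (hypothetical addresses)
sample='    inet 127.0.0.1/8 scope host lo
    inet 10.1.2.3/24 brd 10.1.2.255 scope global ens3'

# Keep only the line with scope "global", then take the first dotted quad on it
local_ip=$(echo "$sample" \
  | grep -o "^ *inet \(.\+\)\/.\+global.*$" \
  | grep -o "[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+" \
  | head -1)
echo "$local_ip"   # prints 10.1.2.3
```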
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; earlier (a better location than the &#039;&#039;/tmp/zookeeper/myid&#039;&#039; suggested earlier).&lt;br /&gt;
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1211</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1211"/>
		<updated>2022-10-31T13:41:42Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Install and configure Ansible */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task:&lt;br /&gt;
- name: Prepare .bashrc&lt;br /&gt;
  hosts: all&lt;br /&gt;
  tasks:&lt;br /&gt;
&lt;br /&gt;
    - name: Save original .bashrc&lt;br /&gt;
      ansible.builtin.copy:&lt;br /&gt;
        src: /home/ubuntu/.bashrc&lt;br /&gt;
        dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
        remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Create Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There should be no space between the two closing curly braces at the end of the &#039;&#039;block:&#039;&#039; line; remove it in your own file. The space is only shown here because WikiText would otherwise misinterpret the braces as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line so that nodes can &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option to avoid repeating all earlier blocks and stages. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each step whether to execute, skip, or finish.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Using the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules from your local machine may require installation:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN you need the &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep tf-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop and later the Spark archives directly to each cluster host. But if you re-run your script many times, this takes time and can trigger rate limiting on the download servers. In that case, it is better to let Ansible download each archive once to your local machine and then copy it to the hosts as needed.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
&lt;br /&gt;
Change the variable syntax in the files &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; from Exercise 4 from shell-style to Ansible&#039;s Jinja2 template style. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module instead of the Linux &#039;&#039;envsubst&#039;&#039; command to configure the files. Note that the templates use the {{ hadoop_namenode } } variable. It is not defined yet, but it should have the same value as {{ master_node } }.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that does not re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible ignore &#039;&#039;Exit code 1&#039;&#039; from hdfs in such cases:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it should not run on all the machines in the cluster (the number of Zookeeper nodes must be odd)&lt;br /&gt;
# each zookeeper needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the number is odd, but only on the workers if the total number is even.&lt;br /&gt;
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage zookeeper ids. But Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use.&lt;br /&gt;
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; earlier (a better location than the &#039;&#039;/tmp/zookeeper/myid&#039;&#039; suggested earlier).&lt;br /&gt;
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1210</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1210"/>
		<updated>2022-10-31T13:38:50Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Configure SSH */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
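For example (a sketch; the image, flavor, and network names below match the ones used later in this exercise, but check &#039;&#039;&#039;openstack image list&#039;&#039;&#039;, &#039;&#039;&#039;flavor list&#039;&#039;&#039;, and &#039;&#039;&#039;network list&#039;&#039;&#039; for your own project):&lt;br /&gt;
 openstack server create --image &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot; --flavor m1.small --network dualStack openstack-test&lt;br /&gt;
 openstack server delete openstack-test&lt;br /&gt;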
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This can be needed, for example, if you run Terraform in the same shared folder from different local machines; it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. Check this list carefully so you do not permanently delete something critical, such as disks/volumes with important data on them.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out tfplan&#039;&#039;&#039; to save the plan to a file and then apply exactly that plan with &#039;&#039;&#039;terraform apply tfplan&#039;&#039;&#039;. Avoid the &#039;&#039;.tf&#039;&#039; extension for the plan file, since Terraform would otherwise try to parse it as configuration.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to do a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and then restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine. You do not need to run the &#039;&#039;keystonerc.sh&#039;&#039; script, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are now two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] lists all the available resources, but it is easier to list the most relevant ones with &#039;&#039;&#039;openstack &#039;&#039;resource_type&#039;&#039; list&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
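A sketch of such a keypair resource (the resource and key names below are examples; see the linked documentation for the full set of arguments):&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
   name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   public_key = file(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;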
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039;, &#039;&#039;&#039;terraform apply&#039;&#039;&#039;, and sometimes &#039;&#039;&#039;terraform destroy&#039;&#039;&#039; continuously as you build the cluster to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list  # or OS_CLOUD=info319-cluster openstack ...&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.165.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::21f4&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver and workers ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, by adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
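Put together, a worker resource might look like this (a sketch based on the &#039;&#039;terraform-test&#039;&#039; resource above; adjust names, image, and flavor to your project):&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
   count           = 6&lt;br /&gt;
   name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;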
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 jump host.&lt;br /&gt;
&lt;br /&gt;
=== Local Terraform variables ===&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to a variable as &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Log in with SSH and do &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (they are not yet partitioned, formatted, or mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out a file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.59 terraform-driver&lt;br /&gt;
 10.1.2.233 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::27c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::13a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::110d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example (inside the console):&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions (you may not need all of them) are:&lt;br /&gt;
     length(string)&lt;br /&gt;
     join(string, list)&lt;br /&gt;
     concat([element], list)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
         for idx in range(number) :&lt;br /&gt;
             list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
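The &#039;&#039;local.ipv4-hosts-string&#039;&#039; used above is not predefined; one way to build it (a sketch, assuming the worker resources and the locals defined earlier in this exercise) is:&lt;br /&gt;
 locals {&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, [&lt;br /&gt;
         for idx in range(local.num_workers) :&lt;br /&gt;
             &amp;quot;${openstack_compute_instance_v2.terraform-workers[idx].access_ip_v4} ${local.worker_prefix}${idx}&amp;quot;])&lt;br /&gt;
 }&lt;br /&gt;
A line for the driver can be prepended in the same way.&lt;br /&gt;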
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::27c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::13a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::110d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a wildcard entry similar to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (the &#039;&#039;Include&#039;&#039; directive requires a recent version of OpenSSH).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1209</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1209"/>
		<updated>2022-10-31T13:36:52Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Create hosts files */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
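For example (a sketch; the image, flavor, and network names below match the ones used later in this exercise, but check &#039;&#039;&#039;openstack image list&#039;&#039;&#039;, &#039;&#039;&#039;flavor list&#039;&#039;&#039;, and &#039;&#039;&#039;network list&#039;&#039;&#039; for your own project):&lt;br /&gt;
 openstack server create --image &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot; --flavor m1.small --network dualStack openstack-test&lt;br /&gt;
 openstack server delete openstack-test&lt;br /&gt;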
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This can be needed, for example, if you run Terraform in the same shared folder from different local machines; it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. Check this list carefully so you do not permanently delete something critical, such as disks/volumes with important data on them.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out tfplan&#039;&#039;&#039; to save the plan to a file and then apply exactly that plan with &#039;&#039;&#039;terraform apply tfplan&#039;&#039;&#039;. Avoid the &#039;&#039;.tf&#039;&#039; extension for the plan file, since Terraform would otherwise try to parse it as configuration.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to do a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and then restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine. You do not need to run the &#039;&#039;keystonerc.sh&#039;&#039; script, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are now two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] lists all the available resources, but it is easier to list the most relevant ones with &#039;&#039;&#039;openstack &#039;&#039;resource_type&#039;&#039; list&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
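A sketch of such a keypair resource (the resource and key names below are examples; see the linked documentation for the full set of arguments):&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
   name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   public_key = file(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;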
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039;, &#039;&#039;&#039;terraform apply&#039;&#039;&#039;, and sometimes &#039;&#039;&#039;terraform destroy&#039;&#039;&#039; continuously as you build the cluster to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list  # or OS_CLOUD=info319-cluster openstack ...&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.165.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::21f4&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver and workers ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, by adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
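Put together, a worker resource might look like this (a sketch based on the &#039;&#039;terraform-test&#039;&#039; resource above; adjust names, image, and flavor to your project):&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
   count           = 6&lt;br /&gt;
   name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;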
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 jump host.&lt;br /&gt;
&lt;br /&gt;
=== Local Terraform variables ===&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to a variable as &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Log in with SSH and do &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (they are not yet partitioned, formatted, or mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out a file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.59 terraform-driver&lt;br /&gt;
 10.1.2.233 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::27c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::13a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::110d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example (inside the console):&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions (you may not need all of them) are:&lt;br /&gt;
     length(string)&lt;br /&gt;
     join(string, list)&lt;br /&gt;
     concat([element], list)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
         for idx in range(number) :&lt;br /&gt;
             list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
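* putting these tips together, a &#039;&#039;locals&#039;&#039; block along these lines could build the string to write out (a sketch; the names &#039;&#039;worker-ipv4-lines&#039;&#039; and &#039;&#039;ipv4-hosts-string&#039;&#039; are only suggestions, and the resource names assume the driver and worker resources defined earlier):&lt;br /&gt;
 locals {&lt;br /&gt;
     worker-ipv4-lines = [&lt;br /&gt;
         for idx in range(local.num_workers):&lt;br /&gt;
             &amp;quot;${openstack_compute_instance_v2.terraform-workers[idx].access_ip_v4} ${local.worker_prefix}${idx}&amp;quot;&lt;br /&gt;
     ]&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, concat(&lt;br /&gt;
         [&amp;quot;${openstack_compute_instance_v2.terraform-driver.access_ip_v4} ${local.driver_name}&amp;quot;],&lt;br /&gt;
         local.worker-ipv4-lines))&lt;br /&gt;
 }&lt;br /&gt;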
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1208</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1208"/>
		<updated>2022-10-31T13:32:02Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Create and attach volumes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This can be needed if you run Terraform in the same shared folder from different local machines, and it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. Check this list carefully, so you do not permanently delete something critical, such as disks/volumes with important data on them.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out plan.tf&#039;&#039;&#039; to save the plan to a file and then run it faster with &#039;&#039;&#039;terraform apply plan.tf&#039;&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to do a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file. &lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine. You do not need to run the &#039;&#039;keystonerc.sh&#039;&#039; script, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are now two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] lists all the available resources, but it is easier to list the most relevant ones with &#039;&#039;&#039;openstack &#039;&#039;resource_type&#039;&#039; list&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
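Following that documentation, the imported keypair might look roughly like this (a sketch; note that Terraform&#039;s &#039;&#039;&#039;file()&#039;&#039;&#039; does not expand &#039;&#039;~&#039;&#039;, so &#039;&#039;&#039;pathexpand()&#039;&#039;&#039; is used):&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
   name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   public_key = file(pathexpand(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;))&lt;br /&gt;
 }&lt;br /&gt;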
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039;, &#039;&#039;&#039;terraform apply&#039;&#039;&#039;, and sometimes &#039;&#039;&#039;terraform destroy&#039;&#039;&#039; continuously as you build the cluster to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list  # or OS_CLOUD=info319-cluster openstack ...&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.165.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::21f4&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver and workers ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, by adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
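Put together, a worker resource might look roughly like this (a sketch; the image, flavour, and key pair are just carried over from the test instance and can be changed):&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
   count           = 6&lt;br /&gt;
   name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;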
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 JumpHost.&lt;br /&gt;
&lt;br /&gt;
=== Local Terraform variables ===&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to a variable as &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
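For example:&lt;br /&gt;
   count = local.num_workers                         # outside a string&lt;br /&gt;
   name  = &amp;quot;${local.worker_prefix}${count.index}&amp;quot;   # inside a string&lt;br /&gt;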
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
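The worker volumes themselves can be created with &#039;&#039;count&#039;&#039; in the same way, for example (a sketch; the 20 GB size is only an example):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
   count = local.num_workers&lt;br /&gt;
   name  = &amp;quot;terraform-worker-volume-${count.index}&amp;quot;&lt;br /&gt;
   size  = 20  # GB&lt;br /&gt;
 }&lt;br /&gt;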
&lt;br /&gt;
Log in with SSH and do &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but they are not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out the file &#039;&#039;ipv4-hosts&#039;&#039;, looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
         for idx in range(number):&lt;br /&gt;
             list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
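* putting these tips together, a &#039;&#039;locals&#039;&#039; block along these lines could build the string to write out (a sketch; the names &#039;&#039;worker-ipv4-lines&#039;&#039; and &#039;&#039;ipv4-hosts-string&#039;&#039; are only suggestions, and the resource names assume the driver and worker resources defined earlier):&lt;br /&gt;
 locals {&lt;br /&gt;
     worker-ipv4-lines = [&lt;br /&gt;
         for idx in range(local.num_workers):&lt;br /&gt;
             &amp;quot;${openstack_compute_instance_v2.terraform-workers[idx].access_ip_v4} ${local.worker_prefix}${idx}&amp;quot;&lt;br /&gt;
     ]&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, concat(&lt;br /&gt;
         [&amp;quot;${openstack_compute_instance_v2.terraform-driver.access_ip_v4} ${local.driver_name}&amp;quot;],&lt;br /&gt;
         local.worker-ipv4-lines))&lt;br /&gt;
 }&lt;br /&gt;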
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1207</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1207"/>
		<updated>2022-10-31T13:31:15Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Spark cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This can be needed if you run Terraform in the same shared folder from different local machines, and it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. Check this list carefully, so you do not permanently delete something critical, such as disks/volumes with important data on them.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out plan.tf&#039;&#039;&#039; to save the plan to a file and then run it faster with &#039;&#039;&#039;terraform apply plan.tf&#039;&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to do a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file. &lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine. You do not need to run the &#039;&#039;keystonerc.sh&#039;&#039; script, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are now two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] lists all the available resources, but it is easier to list the most relevant ones with &#039;&#039;&#039;openstack &#039;&#039;resource_type&#039;&#039; list&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039;, &#039;&#039;&#039;terraform apply&#039;&#039;&#039;, and sometimes &#039;&#039;&#039;terraform destroy&#039;&#039;&#039; continuously as you build the cluster to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list  # or OS_CLOUD=info319-cluster openstack ...&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.165.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::21f4&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver and workers ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, by adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 JumpHost.&lt;br /&gt;
&lt;br /&gt;
=== Local Terraform variables ===&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to a variable as &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
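&lt;br /&gt;
The worker volumes referenced above can be created with a matching resource, for example (a sketch; the &#039;&#039;terraform-worker-volumes&#039;&#039; name and the 20 GB size are only illustrations):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
   count = local.num_workers&lt;br /&gt;
   name  = &amp;quot;${local.worker_prefix}volume-${count.index}&amp;quot;&lt;br /&gt;
   size  = 20  # volume size in GB, adjust to your needs&lt;br /&gt;
 }&lt;br /&gt;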
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but they are not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make OpenStack write out the file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
     	for idx in range(number):&lt;br /&gt;
 	    list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
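* the &#039;&#039;ipv4-hosts-string&#039;&#039; local used above is not defined in this guide; one possible way to build it (a sketch, assuming the workers are created with &#039;&#039;count&#039;&#039; as described earlier):&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, concat(&lt;br /&gt;
         [&amp;quot;${openstack_compute_instance_v2.terraform-driver.access_ip_v4} ${local.driver_name}&amp;quot;],&lt;br /&gt;
         [for idx in range(local.num_workers) :&lt;br /&gt;
             &amp;quot;${openstack_compute_instance_v2.terraform-workers[idx].access_ip_v4} ${local.worker_names[idx]}&amp;quot;]))&lt;br /&gt;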
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
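&lt;br /&gt;
The &#039;&#039;config.terraform-hosts&#039;&#039; file itself can be written by Terraform with another &#039;&#039;local_file&#039;&#039; resource, for example (a sketch; &#039;&#039;ssh-config-string&#039;&#039; is a hypothetical local variable that you must assemble yourself, in the same way as the hosts strings):&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ssh-config-file&amp;quot; {&lt;br /&gt;
     content  = &amp;quot;${local.ssh-config-string}\n&amp;quot;&lt;br /&gt;
     filename = pathexpand(&amp;quot;~/.ssh/config.terraform-hosts&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;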
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1206</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1206"/>
		<updated>2022-10-31T13:23:00Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Test login */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
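&lt;br /&gt;
For example (a sketch; the image, flavor, and network names must match what the list commands above show for your project, and the keypair from Exercise 4 must already be imported):&lt;br /&gt;
 openstack server create --image &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot; \&lt;br /&gt;
     --flavor m1.large --network dualStack \&lt;br /&gt;
     --key-name info319-spark-cluster my-test-instance&lt;br /&gt;
 openstack server delete my-test-instance&lt;br /&gt;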
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This is typically needed when you run Terraform in the same shared folder from different local machines, and it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. Check this list carefully so you do not permanently delete something critical, such as disks/volumes with important data on them.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out plan.tf&#039;&#039;&#039; to save the plan to a file and then run it faster with &#039;&#039;&#039;terraform apply plan.tf&#039;&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to do a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) permissions, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine. You do not need to run the &#039;&#039;keystonerc.sh&#039;&#039; script, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are now two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] lists all the available resources, but it is easier to list the most relevant ones with &#039;&#039;&#039;openstack &#039;&#039;resource_type&#039;&#039; list&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
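&lt;br /&gt;
Following that documentation, the keypair resource could look something like this (a sketch; note that Terraform&#039;s &#039;&#039;file()&#039;&#039; does not expand &#039;&#039;~&#039;&#039;, hence the &#039;&#039;pathexpand()&#039;&#039; call):&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
   name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   public_key = file(pathexpand(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;))&lt;br /&gt;
 }&lt;br /&gt;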
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039;, &#039;&#039;&#039;terraform apply&#039;&#039;&#039;, and sometimes &#039;&#039;&#039;terraform destroy&#039;&#039;&#039; repeatedly as you build the cluster to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list  # or OS_CLOUD=info319-cluster openstack ...&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.165.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::21f4&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, but adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
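&lt;br /&gt;
A worker resource could then look something like this (a sketch; the image and flavor names are carried over from the &#039;&#039;terraform-test&#039;&#039; example and may differ in your project):&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
   count           = 6&lt;br /&gt;
   name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;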
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 JumpHost.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
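&lt;br /&gt;
The worker volumes referenced above can be created with a matching resource, for example (a sketch; the &#039;&#039;terraform-worker-volumes&#039;&#039; name and the 20 GB size are only illustrations):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
   count = local.num_workers&lt;br /&gt;
   name  = &amp;quot;${local.worker_prefix}volume-${count.index}&amp;quot;&lt;br /&gt;
   size  = 20  # volume size in GB, adjust to your needs&lt;br /&gt;
 }&lt;br /&gt;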
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make OpenStack write out the file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
     	for idx in range(number):&lt;br /&gt;
 	    list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
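* the &#039;&#039;ipv4-hosts-string&#039;&#039; local used above is not defined in this guide; one possible way to build it (a sketch, assuming the workers are created with &#039;&#039;count&#039;&#039; as described earlier):&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, concat(&lt;br /&gt;
         [&amp;quot;${openstack_compute_instance_v2.terraform-driver.access_ip_v4} ${local.driver_name}&amp;quot;],&lt;br /&gt;
         [for idx in range(local.num_workers) :&lt;br /&gt;
             &amp;quot;${openstack_compute_instance_v2.terraform-workers[idx].access_ip_v4} ${local.worker_names[idx]}&amp;quot;]))&lt;br /&gt;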
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
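&lt;br /&gt;
The &#039;&#039;config.terraform-hosts&#039;&#039; file itself can be written by Terraform with another &#039;&#039;local_file&#039;&#039; resource, for example (a sketch; &#039;&#039;ssh-config-string&#039;&#039; is a hypothetical local variable that you must assemble yourself, in the same way as the hosts strings):&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ssh-config-file&amp;quot; {&lt;br /&gt;
     content  = &amp;quot;${local.ssh-config-string}\n&amp;quot;&lt;br /&gt;
     filename = pathexpand(&amp;quot;~/.ssh/config.terraform-hosts&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;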
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1205</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1205"/>
		<updated>2022-10-31T13:22:26Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Test login */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
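&lt;br /&gt;
For example (a sketch; the image, flavor, and network names must match what the list commands above show for your project, and the keypair from Exercise 4 must already be imported):&lt;br /&gt;
 openstack server create --image &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot; \&lt;br /&gt;
     --flavor m1.large --network dualStack \&lt;br /&gt;
     --key-name info319-spark-cluster my-test-instance&lt;br /&gt;
 openstack server delete my-test-instance&lt;br /&gt;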
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This is typically needed when you run Terraform in the same shared folder from different local machines, and it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. Check this list carefully so you do not permanently delete something critical, such as disks/volumes with important data on them.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out plan.tf&#039;&#039;&#039; to save the plan to a file and then run it faster with &#039;&#039;&#039;terraform apply plan.tf&#039;&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to do a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) permissions, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine. You do not need to run the &#039;&#039;keystonerc.sh&#039;&#039; script, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are now two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] lists all the available resources, but it is easier to list the most relevant ones with &#039;&#039;&#039;openstack &#039;&#039;resource_type&#039;&#039; list&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
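&lt;br /&gt;
As a sketch (the resource and key names here simply mirror the key pair from Exercise 4), the imported keypair resource could look like this:&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
   name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   public_key = file(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;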
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039; and &#039;&#039;&#039;terraform apply&#039;&#039;&#039; (and sometimes &#039;&#039;&#039;terraform destroy&#039;&#039;&#039;) continuously as you build the cluster to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list  # or OS_CLOUD=info319-cluster openstack ...&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, but adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
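&lt;br /&gt;
As a sketch, a worker resource could look like this (the image, flavor, and key pair names simply repeat those used for &#039;&#039;terraform-test&#039;&#039;; adjust them to your project):&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
   count           = 6&lt;br /&gt;
   name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;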
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 JumpHost.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, a resource block like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
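&lt;br /&gt;
The worker volumes referred to above can be created with a similar counted resource; one possible sketch (the volume size here is only an illustration):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
   count = local.num_workers&lt;br /&gt;
   name  = &amp;quot;${local.worker_prefix}volume-${count.index}&amp;quot;&lt;br /&gt;
   size  = 40  # size in GB&lt;br /&gt;
 }&lt;br /&gt;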
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to check that the volumes are attached where they should be (but not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out the file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
         for idx in range(number):&lt;br /&gt;
             list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
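&lt;br /&gt;
The &#039;&#039;local.ipv4-hosts-string&#039;&#039; used above is not defined in this example; one possible sketch, building on the &#039;&#039;locals&#039;&#039; block from earlier, is:&lt;br /&gt;
 locals {&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, concat(&lt;br /&gt;
         [&amp;quot;${openstack_compute_instance_v2.terraform-driver.access_ip_v4} ${local.driver_name}&amp;quot;],&lt;br /&gt;
         [for idx in range(local.num_workers) :&lt;br /&gt;
             &amp;quot;${openstack_compute_instance_v2.terraform-workers[idx].access_ip_v4} ${local.worker_names[idx]}&amp;quot;]))&lt;br /&gt;
 }&lt;br /&gt;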
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
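&lt;br /&gt;
One way to generate this file is with another &#039;&#039;local_file&#039;&#039; resource. The sketch below assumes you have defined a local list &#039;&#039;all_ipv6s&#039;&#039; holding the driver and worker IPv6 addresses in the same order as &#039;&#039;local.all_names&#039;&#039;:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ssh-hosts-file&amp;quot; {&lt;br /&gt;
     content  = join(&amp;quot;\n&amp;quot;, [for name, addr in zipmap(local.all_names, local.all_ipv6s) :&lt;br /&gt;
         &amp;quot;Host ${name}\n    Hostname ${addr}\n&amp;quot;])&lt;br /&gt;
     filename = pathexpand(&amp;quot;~/.ssh/config.terraform-hosts&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;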
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1204</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1204"/>
		<updated>2022-10-31T13:21:15Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* ~/.config/openstack/clouds.yaml (optional) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
Needing to re-initialise can happen if you run Terraform in the same shared folder from different local machines, and it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. This list is important to check so you do not permanently delete something critical, like disks/volumes with important data on them...&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out tfplan&#039;&#039;&#039; to save the plan to a file and then apply exactly that plan, without re-planning, with &#039;&#039;&#039;terraform apply tfplan&#039;&#039;&#039;. Avoid the &#039;&#039;.tf&#039;&#039; extension for the plan file, since Terraform would otherwise try to parse it as configuration.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to do a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine. You do not need to run the &#039;&#039;keystonerc.sh&#039;&#039; script, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are now two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] lists all the available resources, but it is easier to list the most relevant ones with &#039;&#039;&#039;openstack &#039;&#039;resource_type&#039;&#039; list&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039; and &#039;&#039;&#039;terraform apply&#039;&#039;&#039; (and sometimes &#039;&#039;&#039;terraform destroy&#039;&#039;&#039;) continuously as you build the cluster to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, but adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 JumpHost.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, a resource block like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to check that the volumes are attached where they should be (but not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out the file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
         for idx in range(number):&lt;br /&gt;
             list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1203</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1203"/>
		<updated>2022-10-31T13:20:49Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Spark cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
Needing to re-initialise can happen if you run Terraform in the same shared folder from different local machines, and it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. This list is important to check so you do not permanently delete something critical, like disks/volumes with important data on them...&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out tfplan&#039;&#039;&#039; to save the plan to a file and then apply it faster with &#039;&#039;&#039;terraform apply tfplan&#039;&#039;&#039;. Avoid a &#039;&#039;.tf&#039;&#039; extension for the plan file, since Terraform would otherwise try to parse it as configuration.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to perform a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file. &lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine. You do not need to run the &#039;&#039;keystonerc.sh&#039;&#039; script, and you avoid problems with changed environment variables.&lt;br /&gt;
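&lt;br /&gt;
With a &#039;&#039;clouds.yaml&#039;&#039; in place, you can also point the &#039;&#039;&#039;openstack&#039;&#039;&#039; command itself at it by setting the &#039;&#039;OS_CLOUD&#039;&#039; environment variable, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;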
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] lists all the available resources, but it is easier to list the most relevant ones with &#039;&#039;&#039;openstack &#039;&#039;resource_type&#039;&#039; list&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
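&lt;br /&gt;
A minimal keypair resource might look like this (the resource and key names are only suggestions; adjust the path to where your key actually lives — note that &#039;&#039;file()&#039;&#039; does not expand &#039;&#039;~&#039;&#039;, hence &#039;&#039;pathexpand()&#039;&#039;):&lt;br /&gt;
 # import an existing public SSH key&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
   name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   public_key = file(pathexpand(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;))&lt;br /&gt;
 }&lt;br /&gt;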
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039;, &#039;&#039;&#039;terraform apply&#039;&#039;&#039;, and sometimes &#039;&#039;&#039;terraform destroy&#039;&#039;&#039; continuously as you build the cluster to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, but adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
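&lt;br /&gt;
For example, a worker resource could be sketched like this (the image, flavor, and security groups are simply copied from the test instance above; adapt them as needed):&lt;br /&gt;
 # worker instances&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
   count           = 6&lt;br /&gt;
   name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;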
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 jump host.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
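&lt;br /&gt;
The worker volumes referenced above might themselves be created with a similar &#039;&#039;count&#039;&#039;-based resource, for example like this (the 20 GB size is only an example):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
   count = local.num_workers&lt;br /&gt;
   name  = &amp;quot;terraform-worker-volume-${count.index}&amp;quot;&lt;br /&gt;
   size  = 20  # in GB&lt;br /&gt;
 }&lt;br /&gt;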
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out the file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
     	for idx in range(number):&lt;br /&gt;
 	    list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
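&lt;br /&gt;
Putting these tips together, the &#039;&#039;ipv4-hosts-string&#039;&#039; local used above might be built along these lines (an untested sketch; the resource and attribute names must match your own configuration):&lt;br /&gt;
 locals {&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, concat(&lt;br /&gt;
         [&amp;quot;${openstack_compute_instance_v2.terraform-driver.access_ip_v4} ${local.driver_name}&amp;quot;],&lt;br /&gt;
         [for idx in range(local.num_workers) :&lt;br /&gt;
             &amp;quot;${openstack_compute_instance_v2.terraform-workers[idx].access_ip_v4} ${local.worker_names[idx]}&amp;quot;]))&lt;br /&gt;
 }&lt;br /&gt;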
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1202</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1202"/>
		<updated>2022-10-31T13:10:05Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* ~/.config/openstack/clouds.yaml (optional) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
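&lt;br /&gt;
For example (the image, flavor, and network names come from the list commands above; the instance name is arbitrary):&lt;br /&gt;
 openstack server create --image &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot; \&lt;br /&gt;
     --flavor m1.large --network dualStack my-test-instance&lt;br /&gt;
 openstack server delete my-test-instance&lt;br /&gt;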
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
Re-initialising can be necessary if, for example, you run Terraform in the same shared folder from different local machines; it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. Check this list carefully so you do not permanently delete something critical, such as disks/volumes with important data on them.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out tfplan&#039;&#039;&#039; to save the plan to a file and then apply it faster with &#039;&#039;&#039;terraform apply tfplan&#039;&#039;&#039;. Avoid a &#039;&#039;.tf&#039;&#039; extension for the plan file, since Terraform would otherwise try to parse it as configuration.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to perform a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file. &lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine. You do not need to run the &#039;&#039;keystonerc.sh&#039;&#039; script, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] gives you details.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
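&lt;br /&gt;
A minimal keypair resource might look like this (the resource and key names are only suggestions; adjust the path to where your key actually lives — note that &#039;&#039;file()&#039;&#039; does not expand &#039;&#039;~&#039;&#039;, hence &#039;&#039;pathexpand()&#039;&#039;):&lt;br /&gt;
 # import an existing public SSH key&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
   name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   public_key = file(pathexpand(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;))&lt;br /&gt;
 }&lt;br /&gt;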
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039; and &#039;&#039;&#039;terraform apply&#039;&#039;&#039; continuously to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, but adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
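&lt;br /&gt;
For example, a worker resource could be sketched like this (the image, flavor, and security groups are simply copied from the test instance above; adapt them as needed):&lt;br /&gt;
 # worker instances&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
   count           = 6&lt;br /&gt;
   name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;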
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 jump host.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
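&lt;br /&gt;
The worker volumes referenced above might themselves be created with a similar &#039;&#039;count&#039;&#039;-based resource, for example like this (the 20 GB size is only an example):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
   count = local.num_workers&lt;br /&gt;
   name  = &amp;quot;terraform-worker-volume-${count.index}&amp;quot;&lt;br /&gt;
   size  = 20  # in GB&lt;br /&gt;
 }&lt;br /&gt;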
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out the file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
     	for idx in range(number):&lt;br /&gt;
 	    list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
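&lt;br /&gt;
Putting these tips together, the &#039;&#039;ipv4-hosts-string&#039;&#039; local used above might be built along these lines (an untested sketch; the resource and attribute names must match your own configuration):&lt;br /&gt;
 locals {&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, concat(&lt;br /&gt;
         [&amp;quot;${openstack_compute_instance_v2.terraform-driver.access_ip_v4} ${local.driver_name}&amp;quot;],&lt;br /&gt;
         [for idx in range(local.num_workers) :&lt;br /&gt;
             &amp;quot;${openstack_compute_instance_v2.terraform-workers[idx].access_ip_v4} ${local.worker_names[idx]}&amp;quot;]))&lt;br /&gt;
 }&lt;br /&gt;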
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1201</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1201"/>
		<updated>2022-10-31T13:02:19Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Test run Terraform */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
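One possible shape for this task (the image, flavor and network names below are examples only; pick ones that appear in your own &#039;&#039;&#039;openstack image list&#039;&#039;&#039;, &#039;&#039;&#039;flavor list&#039;&#039;&#039; and &#039;&#039;&#039;network list&#039;&#039;&#039; output):&lt;br /&gt;

```shell
# Example only: create a small test instance, then delete it again.
# Requires the OS_* variables from keystonerc.sh to be set.
openstack server create \
  --image "GOLD Ubuntu 22.04 LTS" \
  --flavor m1.small \
  --network dualStack \
  cli-test

# After checking Compute -> Instances in the NREC Overview:
openstack server delete cli-test
```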
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This can be necessary if, for example, you run Terraform in the same shared folder from different local machines; it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. It is important to check this list so you do not permanently delete something critical, like disks/volumes with important data on them.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out plan.tf&#039;&#039;&#039; to save the plan to a file and then run it faster with &#039;&#039;&#039;terraform apply plan.tf&#039;&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to do a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] gives you details.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039; and &#039;&#039;&#039;terraform apply&#039;&#039;&#039; after each change to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, by adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 JumpHost.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
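The naming scheme can be sanity-checked outside Terraform, too. The following bash sketch (not Terraform code) produces the same names as the &#039;&#039;locals&#039;&#039; block above:&lt;br /&gt;

```shell
# Bash analogue of the Terraform locals: one driver name plus
# worker names terraform-worker-0 .. terraform-worker-5.
cluster_prefix="terraform-"
num_workers=6
driver_name="${cluster_prefix}driver"

worker_names=()
for ((idx = 0; idx < num_workers; idx++)); do
  worker_names+=("${cluster_prefix}worker-${idx}")
done

all_names=("$driver_name" "${worker_names[@]}")
printf '%s\n' "${all_names[@]}"
```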
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Log in with SSH and do &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out the file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
     	for idx in range(number):&lt;br /&gt;
 	    list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
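To preview the expected file format before wiring everything together in Terraform, you can mock it up in plain shell (the names and addresses below are made up):&lt;br /&gt;

```shell
# Made-up "name address" pairs standing in for the Terraform outputs.
cat > /tmp/servers.txt <<'EOF'
terraform-driver 158.37.65.58
terraform-worker-0 10.1.2.234
terraform-worker-1 10.1.2.63
EOF

# hosts files use "address name" order, so swap the two columns.
awk '{ print $2, $1 }' /tmp/servers.txt > /tmp/ipv4-hosts
cat /tmp/ipv4-hosts
```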
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1200</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1200"/>
		<updated>2022-10-31T12:58:05Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* user-data.cfg */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This can be necessary if, for example, you run Terraform in the same shared folder from different local machines; it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. This list is important to check so you do not permanently delete something critical, like disks/volumes.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to do a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] gives you details.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039; and &#039;&#039;&#039;terraform apply&#039;&#039;&#039; after each change to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, by adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 JumpHost.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Log in with SSH and do &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out the file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
     	for idx in range(number):&lt;br /&gt;
 	    list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1199</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1199"/>
		<updated>2022-10-31T12:54:38Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Configure Terraform */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This can happen if you run Terraform in the same shared folder from different local machines; it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. This list is important to check so you do not permanently delete something critical, like disks/volumes.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is useful to do a few initialisation steps already when an instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
Add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] gives you details.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
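&lt;br /&gt;
A hedged sketch of the imported keypair (the resource and key names are assumptions; adjust them to your own key):&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
     name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
     public_key = file(pathexpand(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;))&lt;br /&gt;
 }&lt;br /&gt;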
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039; and &#039;&#039;&#039;terraform apply&#039;&#039;&#039; as you go to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way by adding a line&lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
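&lt;br /&gt;
A minimal sketch of such a worker resource, assuming the same image and flavour as the test instance (adjust names and sizes to your project):&lt;br /&gt;
 # worker instances, named terraform-worker-0 ... terraform-worker-5&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
     count           = 6&lt;br /&gt;
     name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
     image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
     flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
     key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
     security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
     user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
     network {&lt;br /&gt;
         name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;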
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 jump host.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
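&lt;br /&gt;
For example, the worker volumes could be created like this (a sketch; the resource name matches the attachment example, while the volume size in GB is an assumption):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
     count = local.num_workers&lt;br /&gt;
     name  = &amp;quot;${local.worker_prefix}volume-${count.index}&amp;quot;&lt;br /&gt;
     size  = 20&lt;br /&gt;
 }&lt;br /&gt;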
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but not yet partitioned, formatted, or mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out a file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
     	for idx in range(number):&lt;br /&gt;
 	    list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
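&lt;br /&gt;
Putting the tips together, a hedged sketch of how &#039;&#039;local.ipv4-hosts-string&#039;&#039; could be built (the &#039;&#039;all_ipv4&#039;&#039; local and the resource names are assumptions; adjust them to your own configuration):&lt;br /&gt;
 locals {&lt;br /&gt;
     all_ipv4 = concat(&lt;br /&gt;
         [openstack_compute_instance_v2.terraform-driver.access_ip_v4],&lt;br /&gt;
         openstack_compute_instance_v2.terraform-workers[*].access_ip_v4)&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, [&lt;br /&gt;
         for idx in range(length(local.all_names)) :&lt;br /&gt;
             &amp;quot;${local.all_ipv4[idx]} ${local.all_names[idx]}&amp;quot;])&lt;br /&gt;
 }&lt;br /&gt;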
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
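&lt;br /&gt;
This can be done with another &#039;&#039;local_file&#039;&#039; resource; a hedged sketch, assuming a &#039;&#039;local.all_ipv6&#039;&#039; list of addresses with the driver first (&#039;&#039;pathexpand&#039;&#039; expands the &#039;&#039;~&#039;&#039;):&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ssh-config-file&amp;quot; {&lt;br /&gt;
     content  = join(&amp;quot;\n&amp;quot;, [&lt;br /&gt;
         for idx in range(length(local.all_names)) :&lt;br /&gt;
             &amp;quot;Host ${local.all_names[idx]}\n    Hostname ${local.all_ipv6[idx]}\n&amp;quot;])&lt;br /&gt;
     filename = pathexpand(&amp;quot;~/.ssh/config.terraform-hosts&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;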
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This wildcard entry is similar to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version for the &#039;&#039;Include&#039;&#039; directive).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1198</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1198"/>
		<updated>2022-10-31T12:54:12Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Configure Terraform */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This can happen if you run Terraform in the same shared folder from different local machines; it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. This list is important to check so you do not permanently delete something critical, like disks/volumes.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is useful to do a few initialisation steps already when an instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
Add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] gives you details.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
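&lt;br /&gt;
A hedged sketch of the imported keypair (the resource and key names are assumptions; adjust them to your own key):&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
     name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
     public_key = file(pathexpand(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;))&lt;br /&gt;
 }&lt;br /&gt;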
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039; and &#039;&#039;&#039;terraform apply&#039;&#039;&#039; as you go to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way by adding a line&lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
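&lt;br /&gt;
A minimal sketch of such a worker resource, assuming the same image and flavour as the test instance (adjust names and sizes to your project):&lt;br /&gt;
 # worker instances, named terraform-worker-0 ... terraform-worker-5&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
     count           = 6&lt;br /&gt;
     name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
     image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
     flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
     key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
     security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
     user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
     network {&lt;br /&gt;
         name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;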
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 jump host.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
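&lt;br /&gt;
For example, the worker volumes could be created like this (a sketch; the resource name matches the attachment example, while the volume size in GB is an assumption):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
     count = local.num_workers&lt;br /&gt;
     name  = &amp;quot;${local.worker_prefix}volume-${count.index}&amp;quot;&lt;br /&gt;
     size  = 20&lt;br /&gt;
 }&lt;br /&gt;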
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but not yet partitioned, formatted, or mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out a file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
         for idx in range(number) :&lt;br /&gt;
             list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
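&lt;br /&gt;
The &#039;&#039;local.ipv4-hosts-string&#039;&#039; used above could be built along these lines (a sketch; &#039;&#039;local.all_ipv4s&#039;&#039; is a hypothetical list of addresses in the same order as &#039;&#039;local.all_names&#039;&#039;):&lt;br /&gt;
 locals {&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, [&lt;br /&gt;
         for name, ip in zipmap(local.all_names, local.all_ipv4s) :&lt;br /&gt;
             &amp;quot;${ip} ${name}&amp;quot;])&lt;br /&gt;
 }&lt;br /&gt;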
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
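&lt;br /&gt;
This file can be generated the same way as the hosts files. A minimal sketch, assuming a hypothetical &#039;&#039;local.terraform-hosts-config-string&#039;&#039; that already holds the finished &#039;&#039;Host&#039;&#039;/&#039;&#039;Hostname&#039;&#039; blocks:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;terraform-hosts-config-file&amp;quot; {&lt;br /&gt;
     content  = &amp;quot;${local.terraform-hosts-config-string}\n&amp;quot;&lt;br /&gt;
     filename = pathexpand(&amp;quot;~/.ssh/config.terraform-hosts&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;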
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1197</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1197"/>
		<updated>2022-10-31T12:52:47Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Install Terraform */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
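&lt;br /&gt;
For example, a command along these lines should work (check &#039;&#039;&#039;openstack server create --help&#039;&#039;&#039; for the exact options; the image, flavor and instance names here are only examples):&lt;br /&gt;
 openstack server create --image &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot; \&lt;br /&gt;
     --flavor m1.small --network dualStack cli-test&lt;br /&gt;
 openstack server delete cli-test&lt;br /&gt;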
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. It is important to check this list so that you do not permanently delete something critical, like disks/volumes.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is useful to do a few initialisation steps already when an instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
Add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has &#039;&#039;emacs&#039;&#039; installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] gives you details.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
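&lt;br /&gt;
A sketch of such a keypair resource (the Terraform resource name is just an example):&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
   name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   public_key = file(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;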
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039; and &#039;&#039;&#039;terraform apply&#039;&#039;&#039; as you go to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way by adding a line&lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
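&lt;br /&gt;
Put together, a worker resource might look roughly like this (a sketch; reuse whatever image and flavour you chose for the driver):&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
   count           = 6&lt;br /&gt;
   name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;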
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 JumpHost.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
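&lt;br /&gt;
The worker volumes themselves can be created with &#039;&#039;count&#039;&#039; in the same way. A minimal sketch (the 10 GB size and the volume name are only examples; pick whatever your project quota allows):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
   count = local.num_workers&lt;br /&gt;
   name  = &amp;quot;${local.worker_prefix}volume-${count.index}&amp;quot;&lt;br /&gt;
   size  = 10&lt;br /&gt;
 }&lt;br /&gt;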
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but not yet partitioned, formatted, or mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out the file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
         for idx in range(number) :&lt;br /&gt;
             list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
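&lt;br /&gt;
The &#039;&#039;local.ipv4-hosts-string&#039;&#039; used above could be built along these lines (a sketch; &#039;&#039;local.all_ipv4s&#039;&#039; is a hypothetical list of addresses in the same order as &#039;&#039;local.all_names&#039;&#039;):&lt;br /&gt;
 locals {&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, [&lt;br /&gt;
         for name, ip in zipmap(local.all_names, local.all_ipv4s) :&lt;br /&gt;
             &amp;quot;${ip} ${name}&amp;quot;])&lt;br /&gt;
 }&lt;br /&gt;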
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
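&lt;br /&gt;
This file can be generated the same way as the hosts files. A minimal sketch, assuming a hypothetical &#039;&#039;local.terraform-hosts-config-string&#039;&#039; that already holds the finished &#039;&#039;Host&#039;&#039;/&#039;&#039;Hostname&#039;&#039; blocks:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;terraform-hosts-config-file&amp;quot; {&lt;br /&gt;
     content  = &amp;quot;${local.terraform-hosts-config-string}\n&amp;quot;&lt;br /&gt;
     filename = pathexpand(&amp;quot;~/.ssh/config.terraform-hosts&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;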
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1196</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1196"/>
		<updated>2022-10-31T12:47:38Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Configure OpenStack for command line */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
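&lt;br /&gt;
For example, a command along these lines should work (check &#039;&#039;&#039;openstack server create --help&#039;&#039;&#039; for the exact options; the image, flavor and instance names here are only examples):&lt;br /&gt;
 openstack server create --image &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot; \&lt;br /&gt;
     --flavor m1.small --network dualStack cli-test&lt;br /&gt;
 openstack server delete cli-test&lt;br /&gt;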
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide].&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. It is important to check this list so that you do not permanently delete something critical, like disks/volumes.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is useful to do a few initialisation steps already when an instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
Add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has &#039;&#039;emacs&#039;&#039; installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] gives you details.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
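&lt;br /&gt;
A sketch of such a keypair resource (the Terraform resource name is just an example):&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
   name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   public_key = file(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;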
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039; and &#039;&#039;&#039;terraform apply&#039;&#039;&#039; as you go to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way by adding a line&lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
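&lt;br /&gt;
Put together, a worker resource might look roughly like this (a sketch; reuse whatever image and flavour you chose for the driver):&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
   count           = 6&lt;br /&gt;
   name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;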
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 JumpHost.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
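&lt;br /&gt;
The worker volumes themselves can be created the same way; a sketch (the volume &#039;&#039;size&#039;&#039; in GB is just an assumption, pick one that fits your needs):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
   count = local.num_workers&lt;br /&gt;
   name  = &amp;quot;${local.worker_prefix}volume-${count.index}&amp;quot;&lt;br /&gt;
   size  = 10&lt;br /&gt;
 }&lt;br /&gt;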
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out a file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from Terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
         for idx in range(number) :&lt;br /&gt;
             list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
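* for example, &#039;&#039;local.ipv4-hosts-string&#039;&#039; could be built like this (assuming &#039;&#039;local.all_ipv4s&#039;&#039; is a list of IPv4 addresses in the same order as &#039;&#039;local.all_names&#039;&#039;; note that iterating over a map sorts the host names alphabetically):&lt;br /&gt;
 locals {&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, [&lt;br /&gt;
         for name, ip in zipmap(local.all_names, local.all_ipv4s) :&lt;br /&gt;
             &amp;quot;${ip} ${name}&amp;quot;&lt;br /&gt;
     ])&lt;br /&gt;
 }&lt;br /&gt;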
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8310::120d&lt;br /&gt;
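&lt;br /&gt;
One way to sketch this is another &#039;&#039;local_file&#039;&#039; resource, analogous to the hosts files above (assuming &#039;&#039;local.all_ipv6s&#039;&#039; is a list of IPv6 addresses ordered like &#039;&#039;local.all_names&#039;&#039;):&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ssh-config-file&amp;quot; {&lt;br /&gt;
     content  = join(&amp;quot;\n\n&amp;quot;, [&lt;br /&gt;
         for name, ip in zipmap(local.all_names, local.all_ipv6s) :&lt;br /&gt;
             &amp;quot;Host ${name}\n    Hostname ${ip}&amp;quot;&lt;br /&gt;
     ])&lt;br /&gt;
     filename = pathexpand(&amp;quot;~/.ssh/config.terraform-hosts&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;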
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This wildcard entry is similar to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (the &#039;&#039;Include&#039;&#039; directive requires OpenSSH 7.3 or newer).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
</feed>