<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://info319.wiki.uib.no/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Sinoa</id>
	<title>info319 - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://info319.wiki.uib.no/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Sinoa"/>
	<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/Special:Contributions/Sinoa"/>
	<updated>2026-04-25T14:02:24Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.44.2</generator>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Readings&amp;diff=1245</id>
		<title>Readings</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Readings&amp;diff=1245"/>
		<updated>2022-12-02T13:04:35Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Papers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Books ==&lt;br /&gt;
Text books:&lt;br /&gt;
* Rob Kitchin. &#039;&#039;The Data Revolution - A Critical Analysis of Big Data, Open Data and Data Infrastructures&#039;&#039;, 2nd Edition. Sage, 2021.&lt;br /&gt;
** chapters 1, 3-5, 13-14, 17-19 are mandatory (12 and 15-16 are supplementary)&lt;br /&gt;
&lt;br /&gt;
* Bill Chambers and Matei Zaharia: &#039;&#039;Spark: The Definitive Guide - Big Data Processing Made Simple&#039;&#039;. O&#039;Reilly, 2018. [[File:Spark-TheDefinitiveGuide.pdf]]&lt;br /&gt;
** chapters 1-9, 12, 15, 20-21 are mandatory (chapter 10 on SQL is also highly relevant)&lt;br /&gt;
** [https://github.com/databricks/Spark-The-Definitive-Guide GitHub repository with code and data examples]&lt;br /&gt;
&lt;br /&gt;
== Papers ==&lt;br /&gt;
Selected papers will become available here, including:&lt;br /&gt;
* [https://arxiv.org/pdf/2012.09109 Section 1] in Opdahl, A. L., &amp;amp; Nunavath, V. (2020). Big Data. Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data, 15-29. Book chapter&lt;br /&gt;
* Gallofré, M., Opdahl, A. L., Stoppel, S., Tessem, B., &amp;amp; Veres, C. (2021). The News Angler Project: Exploring the Next Generation of Journalistic Knowledge Platforms. In Proceedings of Norsk IKT-konferanse for forskning og utdanning. [https://ojs.bibsys.no/index.php/NIK/article/view/939/792 Short Paper] and poster: [[File:A1-Poster-NIKT2021.pdf]]&lt;br /&gt;
&amp;lt;!-- Architecture stuff:&lt;br /&gt;
* Lambda: Introduced in Nathan Marz and James Warren (2013). Big Data: Principles and Best Practices of Scalable Real-Time Data Systems. Slides 14-27 in [http://2014.berlinbuzzwords.de/sites/2014.berlinbuzzwords.de/files/media/documents/michael_hausenblas_-_lambda_architecture.pdf this presentation] give an overview of the idea! &lt;br /&gt;
* Kappa: Kreps, J.: Questioning the lambda architecture (2014). [https://www.oreilly.com/radar/questioning-the-lambda-architecture/ White paper]&lt;br /&gt;
* Liquid: Fernandez, Raul Castro, Peter R. Pietzuch, Jay Kreps, Neha Narkhede, Jun Rao, Joel Koshy, Dong Lin, Chris Riccomini, and Guozhang Wang. &amp;quot;Liquid: Unifying nearline and offline big data integration.&amp;quot; In CIDR. 2015. [https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1088.2602&amp;amp;rep=rep1&amp;amp;type=pdf Paper]&lt;br /&gt;
* Sigma: Cassavia, N., &amp;amp; Masciari, E. (2021, March). Sigma: a scalable high performance big data architecture. In 2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP) (pp. 236-239). IEEE. [https://bibsys-almaprimo.hosted.exlibrisgroup.com/primo-explore/openurl?sid=google&amp;amp;auinit=N&amp;amp;aulast=Cassavia&amp;amp;atitle=Sigma:%20a%20scalable%20high%20performance%20big%20data%20architecture&amp;amp;id=doi:10.1109%2FPDP52278.2021.00044&amp;amp;vid=UBB&amp;amp;institution=UBB&amp;amp;url_ctx_val=&amp;amp;url_ctx_fmt=null&amp;amp;isSerivcesPage=true Paper]&lt;br /&gt;
* Maamouri, A., Sfaxi, L., &amp;amp; Robbana, R. (2021, December). Phi: A Generic Microservices-Based Big Data Architecture. In European, Mediterranean, and Middle Eastern Conference on Information Systems (pp. 3-16). Springer, Cham. [https://link.springer.com/chapter/10.1007/978-3-030-95947-0_1 Paper]&lt;br /&gt;
&lt;br /&gt;
Marc:&lt;br /&gt;
You found the other Phi architecture. 😃 The one I meant was: https://ieeexplore.ieee.org/abstract/document/8712381 But both have interesting contributions. The one you found considers the training part, which is not instantiated in the others.&lt;br /&gt;
&lt;br /&gt;
This is the &amp;quot;original publication&amp;quot; of Lambda, a blog entry: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html .&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Michael Armbrust, Armando Fox, Rean Griffith, Anthony D Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, Matei Zaharia (2010). A view of cloud computing. Communications of the ACM 53 (4), 50-58. [https://dl.acm.org/doi/fullHtml/10.1145/1721654.1721672 Paper]&lt;br /&gt;
* M Zaharia, M Chowdhury, MJ Franklin, S Shenker, I Stoica (2010). Spark: Cluster computing with working sets. 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 10). [https://www.usenix.org/event/hotcloud10/tech/full_papers/Zaharia.pdf Paper]&lt;br /&gt;
* Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, Ion Stoica (2012). Resilient distributed datasets: A Fault-Tolerant abstraction for In-Memory cluster computing. In Proc. 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp. 15-28. [https://scholar.google.com/citations?view_op=view_citation&amp;amp;hl=en&amp;amp;user=I1EvjZsAAAAJ&amp;amp;citation_for_view=I1EvjZsAAAAJ:Tyk-4Ss8FVUC Paper]&lt;br /&gt;
* Karun, A. K., &amp;amp; Chitharanjan, K. (2013, April). A review on hadoop—HDFS infrastructure extensions. In 2013 IEEE conference on information &amp;amp; communication technologies (pp. 132-137). IEEE. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:GIm8aG-ScOsJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;scillfp=6854624816870725192&amp;amp;oi=lle Paper]&lt;br /&gt;
* Kafka?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Opdahl, A. L., &amp;amp; Tessem, B. (2021). Ontologies for finding journalistic angles. Software and Systems Modeling, 20(1), 71-87. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:pKELE6iBzpAJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;as_ylo=2021&amp;amp;scillfp=4299025271368542631&amp;amp;oi=lle Paper]&lt;br /&gt;
* Berven, A., Christensen, O. A., Moldeklev, S., Opdahl, A. L., &amp;amp; Villanger, K. J. (2020). A knowledge-graph platform for newsrooms. Computers in Industry, 123, 103321. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:0K5dB1_9nusJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;as_ylo=2018&amp;amp;scillfp=11776208952974186557&amp;amp;oi=lle Paper]&lt;br /&gt;
* [https://www.jstor.org/stable/25148625#metadata_info_tab_contents Design Science in Information Systems Research] by Alan R. Hevner, Salvatore T. March, Jinsoo Park and Sudha Ram. MIS Quarterly 28(1):75-105, March 2004. &#039;&#039;(You need to be on UiB&#039;s network to access the link - I have uploaded it under Files in mitt.uib.no, but it may soon be deleted from there...)&#039;&#039;&lt;br /&gt;
* Hevner, A. R. (2007). A three cycle view of design science research. Scandinavian journal of information systems, 19(2), 4. [[File:Hevner2007-ThreeCycleView-SJIS.pdf]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- Architectures: kappa, lambda, phi, Liquid --&amp;gt;&lt;br /&gt;
&amp;lt;!-- Classic papers: HDFS, Spark, RDDs --&amp;gt;&lt;br /&gt;
&amp;lt;!-- Privacy? --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Technical introductions ==&lt;br /&gt;
Selected web pages will become available here, including:&lt;br /&gt;
* [https://kafka.apache.org/intro Kafka Introduction]&lt;br /&gt;
* [https://docs.nrec.no/index.html NREC and OpenStack], the following sections/pages: Introduction, Project application, Logging in, The dashboard, Create a Linux virtual machine (skip: Windows), Using SSH, Working with Security Groups, Create and manage volumes, Create and manage snapshots (skip: images), Instance console&lt;br /&gt;
* [https://docs.nrec.no/terraform-part1.html TerraForm and NREC part I], [https://docs.nrec.no/terraform-part2.html part II], and [https://docs.nrec.no/terraform-part3.html part III]&lt;br /&gt;
* [https://www.ansible.com/overview/how-ansible-works How Ansible Works] and [https://docs.ansible.com/ansible_community.html the Ansible Community portal]&lt;br /&gt;
* Docker Docs: [https://docs.docker.com/get-started/overview/ Docker overview] and [https://docs.docker.com/get-started/ Get started]&lt;br /&gt;
* [https://kubernetes.io/docs/tutorials/kubernetes-basics/ Learn Kubernetes basics], modules 1-6&lt;br /&gt;
* [https://gdpr.eu/what-is-gdpr/ What is GDPR, the EU’s new data protection law?]&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Spark 3.3.0 [https://spark.apache.org/docs/latest/index.html Overview] and [https://spark.apache.org/docs/latest/quick-start.html Quick Start (with Python examples)]&lt;br /&gt;
* [https://developer.twitter.com/en/docs/twitter-api Twitter API v2]&lt;br /&gt;
* [https://github.com/tweepy/tweepy Tweepy: Twitter for Python]&lt;br /&gt;
* [https://docs.tweepy.org/en/latest/ Tweepy Documentation]&lt;br /&gt;
* [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Structured Streaming Spark Programming Guide]&lt;br /&gt;
* Apache Spark [https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/index.html Structured Streaming API]&lt;br /&gt;
* [https://kafka-python.readthedocs.io/en/master/ kafka-python API]&lt;br /&gt;
* EU&#039;s [https://gdpr-info.eu/ General Data Protection Regulation (GDPR)] - the official legal text&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- GDELT --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Lecture slides==&lt;br /&gt;
See the [[Sessions|Session page]] for lecture slides after each session.&lt;br /&gt;
&lt;br /&gt;
==Readings for each session==&lt;br /&gt;
The [[Sessions|Sessions page]] will suggest specific readings for each session and its associated exercise.&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Install_Kafka_on_the_cluster&amp;diff=1244</id>
		<title>Install Kafka on the cluster</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Install_Kafka_on_the_cluster&amp;diff=1244"/>
		<updated>2022-12-01T11:15:11Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Test run Kafka */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Install Zookeeper ==&lt;br /&gt;
On each instance, go to https://zookeeper.apache.org/releases.html#download . Download and unpack a recent binary distribution to &#039;&#039;~/volume&#039;&#039;. For example:&lt;br /&gt;
 cd ~/volume&lt;br /&gt;
 wget https://dlcdn.apache.org/zookeeper/zookeeper-3.8.0/apache-zookeeper-3.8.0-bin.tar.gz&lt;br /&gt;
 tar zxvf apache-zookeeper-3.8.0-bin.tar.gz&lt;br /&gt;
 rm apache-zookeeper-3.8.0-bin.tar.gz&lt;br /&gt;
 ln -fns apache-zookeeper-3.8.0-bin zookeeper&lt;br /&gt;
&lt;br /&gt;
== Configure Zookeeper ==&lt;br /&gt;
On your local computer, create the file &#039;&#039;zookeeper.properties&#039;&#039;:&lt;br /&gt;
 dataDir=/home/ubuntu/volume/zookeeper/data&lt;br /&gt;
 clientPort=2181&lt;br /&gt;
 maxClientCnxns=200&lt;br /&gt;
 tickTime=2000&lt;br /&gt;
 initLimit=20&lt;br /&gt;
 syncLimit=10&lt;br /&gt;
&lt;br /&gt;
From your local computer, upload:&lt;br /&gt;
 scp zookeeper.properties spark-driver:&lt;br /&gt;
 scp zookeeper.properties spark-worker-1:&lt;br /&gt;
 scp zookeeper.properties spark-worker-2:&lt;br /&gt;
We will not run zookeeper on spark-worker-3, because the number of hosts must be odd to allow majority voting.&lt;br /&gt;
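The quorum arithmetic behind this can be sketched in plain bash (an illustration only, not part of the installation):&lt;br /&gt;

```shell
# Majority voting: with n ZooKeeper servers, a write needs floor(n/2) + 1
# acknowledgements. An even fourth server raises the quorum size without
# raising the number of tolerated failures, so three hosts are preferred.
for n in 3 4 5; do
    quorum=$(( n / 2 + 1 ))
    echo "$n servers: quorum $quorum, tolerates $(( n - quorum )) failure(s)"
done
```

Note that 4 servers still only tolerate 1 failure, the same as 3.&lt;br /&gt;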
 &lt;br /&gt;
You can also run this as a bash-loop:&lt;br /&gt;
 for host in spark-{driver,worker-{1,2}}: ; do&lt;br /&gt;
     scp zookeeper.properties $host: ;&lt;br /&gt;
 done&lt;br /&gt;
&lt;br /&gt;
In any case, you need to run the following loop on your local computer to set the &amp;quot;id&amp;quot; of each zookeeper (and later kafka) node:&lt;br /&gt;
 i=1&lt;br /&gt;
 for host in spark-{driver,worker-{1,2,3}}: ; do&lt;br /&gt;
     echo $i &amp;gt; myid; scp myid $host; ((i++));&lt;br /&gt;
 done&lt;br /&gt;
&lt;br /&gt;
On each instance &#039;&#039;except spark-worker-3&#039;&#039;, add addresses from &#039;&#039;/etc/hosts&#039;&#039; to &#039;&#039;zookeeper.properties&#039;&#039;:&lt;br /&gt;
 mv ~/zookeeper.properties ~/volume/zookeeper/conf/&lt;br /&gt;
 i=1&lt;br /&gt;
 for line in $(grep spark- /etc/hosts | cut -d&amp;quot; &amp;quot; -f1 | head -3); do&lt;br /&gt;
     echo &amp;quot;server.$i=$line:2888:3888&amp;quot;; ((i++));&lt;br /&gt;
 done &amp;gt;&amp;gt; ~/volume/zookeeper/conf/zookeeper.properties&lt;br /&gt;
 mkdir -p /home/ubuntu/volume/zookeeper/data&lt;br /&gt;
 mv ~/myid /home/ubuntu/volume/zookeeper/data&lt;br /&gt;
&lt;br /&gt;
Set ZOOKEEPER_HOME and add bin/ to PATH:&lt;br /&gt;
 export ZOOKEEPER_HOME=/home/ubuntu/volume/zookeeper&lt;br /&gt;
 export PATH=${PATH}:${ZOOKEEPER_HOME}/bin&lt;br /&gt;
 cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)&lt;br /&gt;
 echo &amp;quot;export ZOOKEEPER_HOME=/home/ubuntu/volume/zookeeper&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
 echo &amp;quot;export PATH=\${PATH}:\${ZOOKEEPER_HOME}/bin&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
== Test run zookeeper ==&lt;br /&gt;
On &#039;&#039;spark-driver&#039;&#039; and &#039;&#039;-worker-1&#039;&#039; and &#039;&#039;-2&#039;&#039;:&lt;br /&gt;
 zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&lt;br /&gt;
This will start zookeeper in the background on each host. Afterwards, check its status:&lt;br /&gt;
 zkServer.sh status ${ZOOKEEPER_HOME}/conf/zookeeper.properties&lt;br /&gt;
One of the zookeepers is the leader, the two others are followers. In case of trouble, check ${ZOOKEEPER_HOME}/logs/ .&lt;br /&gt;
&lt;br /&gt;
To stop zookeeper, do&lt;br /&gt;
 zkServer.sh stop ${ZOOKEEPER_HOME}/conf/zookeeper.properties&lt;br /&gt;
on each host.&lt;br /&gt;
&lt;br /&gt;
== Install Kafka ==&lt;br /&gt;
On each instance (including &#039;&#039;spark-worker-3&#039;&#039;), go to http://kafka.apache.org/downloads . Download and unpack a recent binary distribution to &#039;&#039;~/volume&#039;&#039;. It is practical, but not critical, to install a Kafka that runs on the same Scala version as your Spark. If you want to know the Scala version of your installed Spark, you can check this file:&lt;br /&gt;
 ls $SPARK_HOME/jars/spark-sql*&lt;br /&gt;
For example, &#039;&#039;.../spark-sql_2.12-3.3.0.jar&#039;&#039; means that you run Scala 2.12 and Spark 3.3.0.&lt;br /&gt;
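The two version numbers can also be cut out of the jar file name in the shell. A small sketch, where the jar name below is an illustrative example of what &#039;&#039;ls&#039;&#039; might print:&lt;br /&gt;

```shell
# Parse the Scala and Spark versions out of a spark-sql jar name.
# Substitute the actual output of: ls $SPARK_HOME/jars/spark-sql*
jar=spark-sql_2.12-3.3.0.jar
base=${jar%.jar}                        # spark-sql_2.12-3.3.0
spark_version=${base##*-}               # text after the last "-": 3.3.0
scala_version=${base%-$spark_version}   # spark-sql_2.12
scala_version=${scala_version#*_}       # text after the "_": 2.12
echo "Scala $scala_version, Spark $spark_version"
```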
&lt;br /&gt;
On each instance:&lt;br /&gt;
 cd ~/volume&lt;br /&gt;
 wget https://downloads.apache.org/kafka/3.3.1/kafka_2.12-3.3.1.tgz&lt;br /&gt;
 tar xzvf kafka_2.12-3.3.1.tgz&lt;br /&gt;
 rm kafka_2.12-3.3.1.tgz&lt;br /&gt;
 ln -fns kafka_2.12-3.3.1 kafka&lt;br /&gt;
&lt;br /&gt;
== Configure Kafka ==&lt;br /&gt;
On each instance, configure Kafka:&lt;br /&gt;
 cp ~/volume/kafka/config/server.properties ~/volume/kafka/config/server.properties.original&lt;br /&gt;
 sed -i &amp;quot;s/broker.id=/broker.id=$(cat /tmp/zookeeper/myid)/g&amp;quot; ~/volume/kafka/config/server.properties&lt;br /&gt;
 export ZOOKEEPER_HOSTLIST=$(grep spark- /etc/hosts | cut -d&amp;quot; &amp;quot; -f1 | head -3 | paste -sd,)&lt;br /&gt;
 echo &amp;quot;export ZOOKEEPER_HOSTLIST=\$(grep spark- /etc/hosts | cut -d\&amp;quot; \&amp;quot; -f1 | head -3 | paste -sd,)&amp;quot; &amp;gt;&amp;gt; ~/.bashrc&lt;br /&gt;
 sed -i &amp;quot;s/zookeeper.connect=localhost:2181/zookeeper.connect=$ZOOKEEPER_HOSTLIST/g&amp;quot; ~/volume/kafka/config/server.properties&lt;br /&gt;
 local_ip=$(ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1) ; \&lt;br /&gt;
 echo &amp;quot;advertised.host.name=$local_ip&amp;quot; &amp;gt;&amp;gt; ~/volume/kafka/config/server.properties&lt;br /&gt;
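After these commands, the changed and added lines in &#039;&#039;server.properties&#039;&#039; should look something like the following (the broker id and the IPv4 addresses are illustrative and will differ between instances):&lt;br /&gt;

```
broker.id=1
zookeeper.connect=10.1.0.11,10.1.0.12,10.1.0.13
advertised.host.name=10.1.0.11
```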
&lt;br /&gt;
Set KAFKA_HOME and add bin/ to PATH:&lt;br /&gt;
 export KAFKA_HOME=/home/ubuntu/volume/kafka&lt;br /&gt;
 export PATH=${PATH}:${KAFKA_HOME}/bin&lt;br /&gt;
 cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)&lt;br /&gt;
 echo &amp;quot;export KAFKA_HOME=/home/ubuntu/volume/kafka&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
 echo &amp;quot;export PATH=\${PATH}:\${KAFKA_HOME}/bin&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
== Test run Kafka ==&lt;br /&gt;
Ensure Zookeeper is still running (or restart it) on the three nodes. On all four nodes:&lt;br /&gt;
 kafka-server-start.sh -daemon ${KAFKA_HOME}/config/server.properties&lt;br /&gt;
(It is a good sign if nothing happens.)&lt;br /&gt;
&lt;br /&gt;
Here are some test commands you can run on the different instances:&lt;br /&gt;
 kafka-topics.sh --create --topic test --replication-factor 2 --partitions 3 --bootstrap-server ${MASTER_NODE}:9092&lt;br /&gt;
 kafka-topics.sh --list --bootstrap-server ${MASTER_NODE}:9092&lt;br /&gt;
&lt;br /&gt;
Run these two commands on different instances to see that lines you type into the &#039;&#039;producer&#039;&#039; console show up in the &#039;&#039;consumer&#039;&#039;:&lt;br /&gt;
 kafka-console-producer.sh --topic test --bootstrap-server ${MASTER_NODE}:9092&lt;br /&gt;
 kafka-console-consumer.sh --topic test --bootstrap-server ${MASTER_NODE}:9092 --from-beginning&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task 2:&#039;&#039;&#039; Run the full Twitter pipeline from Exercise 3. Tips:&lt;br /&gt;
* You may want to create a virtual environment to install Python packages such as &#039;&#039;tweepy&#039;&#039;, &#039;&#039;kafka-python&#039;&#039;, and &#039;&#039;afinn&#039;&#039;.&lt;br /&gt;
* You need to create &#039;&#039;keys/bearer_token&#039;&#039; with restricted access.&lt;br /&gt;
* With the default settings, it can take time before the final join is written to console. Be patient...&lt;br /&gt;
* Start-up sequence from &#039;&#039;spark-driver&#039;&#039;:&lt;br /&gt;
 # assuming zookeeper and kafka run already&lt;br /&gt;
 # create kafka topics&lt;br /&gt;
 kafka-topics.sh --create --topic tweets --bootstrap-server localhost:9092&lt;br /&gt;
 kafka-topics.sh --create --topic media --bootstrap-server localhost:9092&lt;br /&gt;
 kafka-topics.sh --create --topic photos --bootstrap-server localhost:9092&lt;br /&gt;
 # optional monitoring&lt;br /&gt;
 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic tweets --from-beginning &amp;amp;&lt;br /&gt;
 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic media --from-beginning &amp;amp;&lt;br /&gt;
 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic photos --from-beginning &amp;amp;&lt;br /&gt;
 # main programs&lt;br /&gt;
 python twitterpipe_tweet_harvester.py  # run this repeatedly or modify to reconnect after sleeping&lt;br /&gt;
 python twitterpipe_media_harvester.py &amp;amp;  # this one can run in background&lt;br /&gt;
 spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 twitterpipe_tweet_pipeline.py &amp;amp;&lt;br /&gt;
Because you no longer use &#039;&#039;findspark&#039;&#039;, you need to use the &#039;&#039;&#039;--packages&#039;&#039;&#039; option with &#039;&#039;&#039;spark-submit&#039;&#039;&#039;. Make sure you use exactly the same package as you did with &#039;&#039;findspark.add_packages(...)&#039;&#039;.&lt;br /&gt;
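One low-tech way to keep the two in sync is to put the coordinate in a single variable (a sketch; the coordinate shown matches the Spark 3.3.0 / Scala 2.12 setup used above):&lt;br /&gt;

```shell
# Keep the Kafka connector coordinate in one place, so spark-submit gets
# exactly the package previously passed to findspark.add_packages(...).
pkg="org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0"
scala=$(echo "$pkg" | cut -d: -f2 | cut -d_ -f2)   # Scala version: 2.12
spark=$(echo "$pkg" | cut -d: -f3)                 # Spark version: 3.3.0
echo "connector for Scala $scala, Spark $spark"
# spark-submit --packages "$pkg" twitterpipe_tweet_pipeline.py
```

Check that the two versions match the ones found from &#039;&#039;$SPARK_HOME/jars/spark-sql*&#039;&#039; earlier.&lt;br /&gt;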
&lt;br /&gt;
== Web UIs ==&lt;br /&gt;
Assuming that 158.39.77.227 is the IPv4 address of &#039;&#039;spark-driver&#039;&#039;, you can access Zookeeper&#039;s very simple web UI at http://158.39.77.227:8080/commands/stat .&lt;br /&gt;
&lt;br /&gt;
Kafka has no built-in web UI, but third-party UIs are available. &amp;lt;!-- If you [https://docs.docker.com/engine/install/ install Docker] on &#039;&#039;spark-driver&#039;&#039;, you can run something like:&lt;br /&gt;
 docker run -it -p 9001:9000 \&lt;br /&gt;
     -e KAFKA_BROKERCONNECT=${MASTER_NODE}:9092 \&lt;br /&gt;
     obsidiandynamics/kafdrop&lt;br /&gt;
When the docker runs, you can access the Kafka UI through a web browser at http://158.39.77.227:9001 (as usual assuming that 158.39.77.227 is the IPv4 address of &#039;&#039;spark-driver&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
You can install docker on &#039;&#039;spark-driver&#039;&#039; like this:&lt;br /&gt;
 sudo apt install docker.io&lt;br /&gt;
 sudo groupadd docker&lt;br /&gt;
 sudo usermod -aG docker ${USER}&lt;br /&gt;
&#039;&#039;Log out (&#039;exit&#039;) and then log in again (&#039;ssh spark-driver&#039;).&#039;&#039;--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Install_Kafka_on_the_cluster&amp;diff=1243</id>
		<title>Install Kafka on the cluster</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Install_Kafka_on_the_cluster&amp;diff=1243"/>
		<updated>2022-12-01T11:13:29Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Configure Kafka */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Install Zookeeper ==&lt;br /&gt;
On each instance, go to https://zookeeper.apache.org/releases.html#download . Download and unpack a recent binary distribution to &#039;&#039;~/volume&#039;&#039;. For example:&lt;br /&gt;
 cd ~/volume&lt;br /&gt;
 wget https://dlcdn.apache.org/zookeeper/zookeeper-3.8.0/apache-zookeeper-3.8.0-bin.tar.gz&lt;br /&gt;
 tar zxvf apache-zookeeper-3.8.0-bin.tar.gz&lt;br /&gt;
 rm apache-zookeeper-3.8.0-bin.tar.gz&lt;br /&gt;
 ln -fns apache-zookeeper-3.8.0-bin zookeeper&lt;br /&gt;
&lt;br /&gt;
== Configure Zookeeper ==&lt;br /&gt;
On your local computer, create the file &#039;&#039;zookeeper.properties&#039;&#039;:&lt;br /&gt;
 dataDir=/home/ubuntu/volume/zookeeper/data&lt;br /&gt;
 clientPort=2181&lt;br /&gt;
 maxClientCnxns=200&lt;br /&gt;
 tickTime=2000&lt;br /&gt;
 initLimit=20&lt;br /&gt;
 syncLimit=10&lt;br /&gt;
&lt;br /&gt;
From your local computer, upload:&lt;br /&gt;
 scp zookeeper.properties spark-driver:&lt;br /&gt;
 scp zookeeper.properties spark-worker-1:&lt;br /&gt;
 scp zookeeper.properties spark-worker-2:&lt;br /&gt;
We will not run zookeeper on spark-worker-3, because the number of hosts must be odd to allow majority voting.&lt;br /&gt;
 &lt;br /&gt;
You can also run this as a bash-loop:&lt;br /&gt;
 for host in spark-{driver,worker-{1,2}}: ; do&lt;br /&gt;
     scp zookeeper.properties $host: ;&lt;br /&gt;
 done&lt;br /&gt;
&lt;br /&gt;
In any case, you need to run the following loop on your local computer to set the &amp;quot;id&amp;quot; of each zookeeper (and later kafka) node:&lt;br /&gt;
 i=1&lt;br /&gt;
 for host in spark-{driver,worker-{1,2,3}}: ; do&lt;br /&gt;
     echo $i &amp;gt; myid; scp myid $host; ((i++));&lt;br /&gt;
 done&lt;br /&gt;
&lt;br /&gt;
On each instance &#039;&#039;except spark-worker-3&#039;&#039;, add addresses from &#039;&#039;/etc/hosts&#039;&#039; to &#039;&#039;zookeeper.properties&#039;&#039;:&lt;br /&gt;
 mv ~/zookeeper.properties ~/volume/zookeeper/conf/&lt;br /&gt;
 i=1&lt;br /&gt;
 for line in $(grep spark- /etc/hosts | cut -d&amp;quot; &amp;quot; -f1 | head -3); do&lt;br /&gt;
     echo &amp;quot;server.$i=$line:2888:3888&amp;quot;; ((i++));&lt;br /&gt;
 done &amp;gt;&amp;gt; ~/volume/zookeeper/conf/zookeeper.properties&lt;br /&gt;
 mkdir -p /home/ubuntu/volume/zookeeper/data&lt;br /&gt;
 mv ~/myid /home/ubuntu/volume/zookeeper/data&lt;br /&gt;
&lt;br /&gt;
Set ZOOKEEPER_HOME and add bin/ to PATH:&lt;br /&gt;
 export ZOOKEEPER_HOME=/home/ubuntu/volume/zookeeper&lt;br /&gt;
 export PATH=${PATH}:${ZOOKEEPER_HOME}/bin&lt;br /&gt;
 cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)&lt;br /&gt;
 echo &amp;quot;export ZOOKEEPER_HOME=/home/ubuntu/volume/zookeeper&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
 echo &amp;quot;export PATH=\${PATH}:\${ZOOKEEPER_HOME}/bin&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
== Test run zookeeper ==&lt;br /&gt;
On &#039;&#039;spark-driver&#039;&#039; and &#039;&#039;-worker-1&#039;&#039; and &#039;&#039;-2&#039;&#039;:&lt;br /&gt;
 zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&lt;br /&gt;
This will start zookeeper in the background on each host. Afterwards, check its status:&lt;br /&gt;
 zkServer.sh status ${ZOOKEEPER_HOME}/conf/zookeeper.properties&lt;br /&gt;
One of the zookeepers is the leader, the two others are followers. In case of trouble, check ${ZOOKEEPER_HOME}/logs/ .&lt;br /&gt;
&lt;br /&gt;
To stop zookeeper, do&lt;br /&gt;
 zkServer.sh stop ${ZOOKEEPER_HOME}/conf/zookeeper.properties&lt;br /&gt;
on each host.&lt;br /&gt;
&lt;br /&gt;
== Install Kafka ==&lt;br /&gt;
On each instance (including &#039;&#039;spark-worker-3&#039;&#039;), go to http://kafka.apache.org/downloads . Download and unpack a recent binary distribution to &#039;&#039;~/volume&#039;&#039;. It is practical, but not critical, to install a Kafka that runs on the same Scala version as your Spark. If you want to know the Scala version of your installed Spark, you can check this file:&lt;br /&gt;
 ls $SPARK_HOME/jars/spark-sql*&lt;br /&gt;
For example, &#039;&#039;.../spark-sql_2.12-3.3.0.jar&#039;&#039; means that you run Scala 2.12 and Spark 3.3.0.&lt;br /&gt;
&lt;br /&gt;
On each instance:&lt;br /&gt;
 cd ~/volume&lt;br /&gt;
 wget https://downloads.apache.org/kafka/3.3.1/kafka_2.12-3.3.1.tgz&lt;br /&gt;
 tar xzvf kafka_2.12-3.3.1.tgz&lt;br /&gt;
 rm kafka_2.12-3.3.1.tgz&lt;br /&gt;
 ln -fns kafka_2.12-3.3.1 kafka&lt;br /&gt;
&lt;br /&gt;
== Configure Kafka ==&lt;br /&gt;
On each instance, configure Kafka:&lt;br /&gt;
 cp ~/volume/kafka/config/server.properties ~/volume/kafka/config/server.properties.original&lt;br /&gt;
 sed -i &amp;quot;s/broker.id=/broker.id=$(cat /tmp/zookeeper/myid)/g&amp;quot; ~/volume/kafka/config/server.properties&lt;br /&gt;
 export ZOOKEEPER_HOSTLIST=$(grep spark- /etc/hosts | cut -d&amp;quot; &amp;quot; -f1 | head -3 | paste -sd,)&lt;br /&gt;
 echo &amp;quot;export ZOOKEEPER_HOSTLIST=\$(grep spark- /etc/hosts | cut -d\&amp;quot; \&amp;quot; -f1 | head -3 | paste -sd,)&amp;quot; &amp;gt;&amp;gt; ~/.bashrc&lt;br /&gt;
 sed -i &amp;quot;s/zookeeper.connect=localhost:2181/zookeeper.connect=$ZOOKEEPER_HOSTLIST/g&amp;quot; ~/volume/kafka/config/server.properties&lt;br /&gt;
 local_ip=$(ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1) ; \&lt;br /&gt;
 echo &amp;quot;advertised.host.name=$local_ip&amp;quot; &amp;gt;&amp;gt; ~/volume/kafka/config/server.properties&lt;br /&gt;
&lt;br /&gt;
Set KAFKA_HOME and add bin/ to PATH:&lt;br /&gt;
 export KAFKA_HOME=/home/ubuntu/volume/kafka&lt;br /&gt;
 export PATH=${PATH}:${KAFKA_HOME}/bin&lt;br /&gt;
 cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)&lt;br /&gt;
 echo &amp;quot;export KAFKA_HOME=/home/ubuntu/volume/kafka&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
 echo &amp;quot;export PATH=\${PATH}:\${KAFKA_HOME}/bin&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
== Test run Kafka ==&lt;br /&gt;
Ensure Zookeeper is still running (or restart it) on the three nodes. On all four nodes:&lt;br /&gt;
 kafka-server-start.sh -daemon ${KAFKA_HOME}/config/server.properties&lt;br /&gt;
(It is a good sign if nothing happens.)&lt;br /&gt;
&lt;br /&gt;
Here are some test commands you can run on the different instances:&lt;br /&gt;
 kafka-topics.sh --create --topic test --replication-factor 2 --partitions 3 --bootstrap-server ${MASTER_NODE}:9092&lt;br /&gt;
 kafka-topics.sh --list --bootstrap-server ${MASTER_NODE}:9092&lt;br /&gt;
&lt;br /&gt;
Run these two commands on different instances to see that lines you type into the &#039;&#039;producer&#039;&#039; console show up in the &#039;&#039;consumer&#039;&#039;:&lt;br /&gt;
 bin/kafka-console-producer.sh --topic test --bootstrap-server ${MASTER_NODE}:9092&lt;br /&gt;
 bin/kafka-console-consumer.sh --topic test --bootstrap-server ${MASTER_NODE}:9092 --from-beginning&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task 2:&#039;&#039;&#039; Run the full Twitter pipeline from Exercise 3. Tips:&lt;br /&gt;
* You may want to create a virtual environment to install Python packages such as &#039;&#039;tweepy&#039;&#039;, &#039;&#039;kafka-python&#039;&#039;, and &#039;&#039;afinn&#039;&#039;.&lt;br /&gt;
* You need to create &#039;&#039;keys/bearer_token&#039;&#039; with restricted access.&lt;br /&gt;
* With the default settings, it can take time before the final join is written to console. Be patient...&lt;br /&gt;
* Start-up sequence from &#039;&#039;spark-driver&#039;&#039;:&lt;br /&gt;
 # assuming zookeeper and kafka run already&lt;br /&gt;
 # create kafka topics&lt;br /&gt;
 kafka-topics.sh --create --topic tweets --bootstrap-server localhost:9092&lt;br /&gt;
 kafka-topics.sh --create --topic media --bootstrap-server localhost:9092&lt;br /&gt;
 kafka-topics.sh --create --topic photos --bootstrap-server localhost:9092&lt;br /&gt;
 # optional monitoring&lt;br /&gt;
 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic tweets --from-beginning &amp;amp;&lt;br /&gt;
 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic media --from-beginning &amp;amp;&lt;br /&gt;
 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic photos --from-beginning &amp;amp;&lt;br /&gt;
 # main programs&lt;br /&gt;
 python twitterpipe_tweet_harvester.py  # run this repeatedly or modify to reconnect after sleeping&lt;br /&gt;
 python twitterpipe_media_harvester.py &amp;amp;  # this one can run in background&lt;br /&gt;
 spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 twitterpipe_tweet_pipeline.py &amp;amp;&lt;br /&gt;
Because you no longer use &#039;&#039;findspark&#039;&#039;, you need to use the &#039;&#039;&#039;--packages&#039;&#039;&#039; option with &#039;&#039;&#039;spark-submit&#039;&#039;&#039;. Make sure you use exactly the same package as you did with &#039;&#039;findspark.add_packages(...)&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Web UIs ==&lt;br /&gt;
Assuming that 158.39.77.227 is the IPv4 address of &#039;&#039;spark-driver&#039;&#039;, you can access Zookeeper&#039;s very simple web UI at http://158.39.77.227:8080/commands/stat .&lt;br /&gt;
&lt;br /&gt;
Kafka has no built-in web UI, but third-party UIs are available. &amp;lt;!-- If you [https://docs.docker.com/engine/install/ install Docker] on &#039;&#039;spark-driver&#039;&#039;, you can run something like:&lt;br /&gt;
 docker run -it -p 9001:9000 \&lt;br /&gt;
     -e KAFKA_BROKERCONNECT=${MASTER_NODE}:9092 \&lt;br /&gt;
     obsidiandynamics/kafdrop&lt;br /&gt;
When the docker runs, you can access the Kafka UI through a web browser at http://158.39.77.227:9001 (as usual assuming that 158.39.77.227 is the IPv4 address of &#039;&#039;spark-driver&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
You can install docker on &#039;&#039;spark-driver&#039;&#039; like this:&lt;br /&gt;
 sudo apt install docker.io&lt;br /&gt;
 sudo groupadd docker&lt;br /&gt;
 sudo usermod -aG docker ${USER}&lt;br /&gt;
&#039;&#039;Log out (&#039;exit&#039;) and then log in again (&#039;ssh spark-driver&#039;).&#039;&#039;--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Install_Kafka_on_the_cluster&amp;diff=1242</id>
		<title>Install Kafka on the cluster</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Install_Kafka_on_the_cluster&amp;diff=1242"/>
		<updated>2022-12-01T11:09:03Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Test run zookeeper */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Install Zookeeper ==&lt;br /&gt;
On each instance, go to https://zookeeper.apache.org/releases.html#download . Download and unpack a recent binary distribution to &#039;&#039;~/volume&#039;&#039;. For example:&lt;br /&gt;
 cd ~/volume&lt;br /&gt;
 wget https://dlcdn.apache.org/zookeeper/zookeeper-3.8.0/apache-zookeeper-3.8.0-bin.tar.gz&lt;br /&gt;
 tar zxvf apache-zookeeper-3.8.0-bin.tar.gz&lt;br /&gt;
 rm apache-zookeeper-3.8.0-bin.tar.gz&lt;br /&gt;
 ln -fns apache-zookeeper-3.8.0-bin zookeeper&lt;br /&gt;
&lt;br /&gt;
== Configure Zookeeper ==&lt;br /&gt;
On your local computer, create the file &#039;&#039;zookeeper.properties&#039;&#039;:&lt;br /&gt;
 dataDir=/home/ubuntu/volume/zookeeper/data&lt;br /&gt;
 clientPort=2181&lt;br /&gt;
 maxClientCnxns=200&lt;br /&gt;
 tickTime=2000&lt;br /&gt;
 initLimit=20&lt;br /&gt;
 syncLimit=10&lt;br /&gt;
&lt;br /&gt;
From your local computer, upload:&lt;br /&gt;
 scp zookeeper.properties spark-driver:&lt;br /&gt;
 scp zookeeper.properties spark-worker-1:&lt;br /&gt;
 scp zookeeper.properties spark-worker-2:&lt;br /&gt;
We will not run zookeeper on spark-worker-3, because the number of hosts must be odd to allow majority voting.&lt;br /&gt;
 &lt;br /&gt;
You can also run this as a bash-loop:&lt;br /&gt;
 for host in spark-{driver,worker-{1,2}}: ; do&lt;br /&gt;
     scp zookeeper.properties $host ;&lt;br /&gt;
 done&lt;br /&gt;
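As a side note, the brace expansion in the loop already appends the colon that &#039;&#039;scp&#039;&#039; expects after a host name, so no extra colon is needed in the &#039;&#039;scp&#039;&#039; command itself. A quick local check (plain bash; the host names are just strings here, no remote hosts involved):&lt;br /&gt;

```shell
# Expand the same pattern locally; each resulting word already ends in the ':' that scp needs
hosts=$(echo spark-{driver,worker-{1,2}}:)
echo "$hosts"   # spark-driver: spark-worker-1: spark-worker-2:
```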
&lt;br /&gt;
In any case, you need to run this loop on your local computer, to set the &amp;quot;id&amp;quot; of each zookeeper (and later kafka) node:&lt;br /&gt;
 i=1&lt;br /&gt;
 for host in spark-{driver,worker-{1,2,3}}: ; &lt;br /&gt;
     do echo $i &amp;gt; myid; scp myid $host; ((i++));&lt;br /&gt;
 done&lt;br /&gt;
&lt;br /&gt;
On each instance &#039;&#039;except spark-worker-3&#039;&#039;, add addresses from &#039;&#039;/etc/hosts&#039;&#039; to &#039;&#039;zookeeper.properties&#039;&#039;:&lt;br /&gt;
 mv ~/zookeeper.properties ~/volume/zookeeper/conf/&lt;br /&gt;
 i=1&lt;br /&gt;
 for line in $(grep spark- /etc/hosts | cut -d&amp;quot; &amp;quot; -f1 | head -3); do&lt;br /&gt;
     echo &amp;quot;server.$i=$line:2888:3888&amp;quot;; ((i++));&lt;br /&gt;
 done &amp;gt;&amp;gt; ~/volume/zookeeper/conf/zookeeper.properties&lt;br /&gt;
 mkdir -p /home/ubuntu/volume/zookeeper/data&lt;br /&gt;
 mv ~/myid /home/ubuntu/volume/zookeeper/data&lt;br /&gt;
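If you want to sanity-check the server-entry loop before touching the real configuration, you can dry-run it against a mock hosts file on your local computer (the 10.0.0.x addresses below are made up for the demonstration):&lt;br /&gt;

```shell
# Mock /etc/hosts with four instances (made-up private addresses)
printf '10.0.0.1 spark-driver\n10.0.0.2 spark-worker-1\n10.0.0.3 spark-worker-2\n10.0.0.4 spark-worker-3\n' > hosts.demo
# Same loop as above, reading the mock file instead of /etc/hosts
i=1
for line in $(grep spark- hosts.demo | cut -d" " -f1 | head -3); do
    echo "server.$i=$line:2888:3888"; i=$((i+1));
done > entries.demo
cat entries.demo
```

The three lines written to &#039;&#039;entries.demo&#039;&#039; match what the real loop appends to &#039;&#039;zookeeper.properties&#039;&#039;: one server.N entry per voting host, and none for &#039;&#039;spark-worker-3&#039;&#039;.&lt;br /&gt;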
&lt;br /&gt;
Set ZOOKEEPER_HOME and add bin/ to PATH:&lt;br /&gt;
 export ZOOKEEPER_HOME=/home/ubuntu/volume/zookeeper&lt;br /&gt;
 export PATH=${PATH}:${ZOOKEEPER_HOME}/bin&lt;br /&gt;
 cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)&lt;br /&gt;
 echo &amp;quot;export ZOOKEEPER_HOME=/home/ubuntu/volume/zookeeper&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
 echo &amp;quot;export PATH=\${PATH}:\${ZOOKEEPER_HOME}/bin&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
== Test run zookeeper ==&lt;br /&gt;
On &#039;&#039;spark-driver&#039;&#039;, &#039;&#039;spark-worker-1&#039;&#039;, and &#039;&#039;spark-worker-2&#039;&#039;:&lt;br /&gt;
 zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&lt;br /&gt;
This will start zookeeper in the background on each host. Afterwards, check its status:&lt;br /&gt;
 zkServer.sh status ${ZOOKEEPER_HOME}/conf/zookeeper.properties&lt;br /&gt;
One of the zookeepers is the leader, the two others are followers. In case of trouble, check ${ZOOKEEPER_HOME}/logs/ .&lt;br /&gt;
&lt;br /&gt;
To stop zookeeper, do&lt;br /&gt;
 zkServer.sh stop ${ZOOKEEPER_HOME}/conf/zookeeper.properties&lt;br /&gt;
on each host.&lt;br /&gt;
&lt;br /&gt;
== Install Kafka ==&lt;br /&gt;
On each instance (including &#039;&#039;spark-worker-3&#039;&#039;), go to http://kafka.apache.org/downloads . Download and unpack a recent binary distribution to &#039;&#039;~/volume&#039;&#039;. It is practical, but not critical, to install a Kafka that runs on the same Scala version as your Spark. If you want to know the Scala version of your installed Spark, you can check the name of the spark-sql jar:&lt;br /&gt;
 ls $SPARK_HOME/jars/spark-sql*&lt;br /&gt;
For example, &#039;&#039;.../spark-sql_2.12-3.3.0.jar&#039;&#039; means that you run Scala 2.12 and Spark 3.3.0.&lt;br /&gt;
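If you prefer, the two version numbers can be pulled out of the jar name with plain shell parameter expansion (the jar name below is the example above, hard-coded for illustration):&lt;br /&gt;

```shell
# Example jar name, as in the text above
jar=spark-sql_2.12-3.3.0.jar
versions=${jar%.jar}             # strip the .jar suffix
versions=${versions#spark-sql_}  # strip the spark-sql_ prefix -> 2.12-3.3.0
scala=${versions%%-*}            # Scala version: 2.12
spark=${versions#*-}             # Spark version: 3.3.0
echo "Scala $scala, Spark $spark"
```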
&lt;br /&gt;
On each instance:&lt;br /&gt;
 cd ~/volume&lt;br /&gt;
 wget https://downloads.apache.org/kafka/3.3.1/kafka_2.12-3.3.1.tgz&lt;br /&gt;
 tar xzvf kafka_2.12-3.3.1.tgz&lt;br /&gt;
 rm kafka_2.12-3.3.1.tgz&lt;br /&gt;
 ln -fns kafka_2.12-3.3.1 kafka&lt;br /&gt;
&lt;br /&gt;
== Configure Kafka ==&lt;br /&gt;
On each instance, configure Kafka:&lt;br /&gt;
 cp ~/volume/kafka/config/server.properties ~/volume/kafka/config/server.properties.original&lt;br /&gt;
 sed -i &amp;quot;s/broker.id=0/broker.id=$(cat ~/volume/zookeeper/data/myid 2&amp;gt;/dev/null || cat ~/myid)/g&amp;quot; ~/volume/kafka/config/server.properties&lt;br /&gt;
 export ZOOKEEPER_HOSTLIST=$(grep spark- /etc/hosts | cut -d&amp;quot; &amp;quot; -f1 | head -3 | paste -sd,)&lt;br /&gt;
 echo &amp;quot;export ZOOKEEPER_HOSTLIST=\$(grep spark- /etc/hosts | cut -d\&amp;quot; \&amp;quot; -f1 | head -3 | paste -sd,)&amp;quot; &amp;gt;&amp;gt; ~/.bashrc&lt;br /&gt;
 sed -i &amp;quot;s/zookeeper.connect=localhost:2181/zookeeper.connect=$ZOOKEEPER_HOSTLIST/g&amp;quot; ~/volume/kafka/config/server.properties&lt;br /&gt;
 local_ip=$(ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1) ; \&lt;br /&gt;
 echo &amp;quot;advertised.listeners=PLAINTEXT://$local_ip:9092&amp;quot; &amp;gt;&amp;gt; ~/volume/kafka/config/server.properties&lt;br /&gt;
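These &#039;&#039;sed&#039;&#039; edits fail silently if the pattern does not match, so it can be worth rehearsing the broker.id substitution on a throwaway copy first (no broker or myid file needed; the id 2 below is made up):&lt;br /&gt;

```shell
# Throwaway stand-in for server.properties with the stock broker.id line
printf 'broker.id=0\nzookeeper.connect=localhost:2181\n' > server.properties.demo
myid=2   # on a real instance this would come from the myid file
sed -i "s/broker.id=0/broker.id=$myid/g" server.properties.demo
grep '^broker.id' server.properties.demo   # broker.id=2
```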
&lt;br /&gt;
Set KAFKA_HOME and add bin/ to PATH:&lt;br /&gt;
 export KAFKA_HOME=/home/ubuntu/volume/kafka&lt;br /&gt;
 export PATH=${PATH}:${KAFKA_HOME}/bin&lt;br /&gt;
 cp ~/.bashrc ~/.bashrc-bkp-$(date --iso-8601=minutes --utc)&lt;br /&gt;
 echo &amp;quot;export KAFKA_HOME=/home/ubuntu/volume/kafka&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
 echo &amp;quot;export PATH=\${PATH}:\${KAFKA_HOME}/bin&amp;quot; &amp;gt;&amp;gt;  ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
== Test run Kafka ==&lt;br /&gt;
Ensure Zookeeper is still running (or restart it) on the three nodes. On all four nodes:&lt;br /&gt;
 kafka-server-start.sh -daemon ${KAFKA_HOME}/config/server.properties&lt;br /&gt;
(It is a good sign if nothing happens.)&lt;br /&gt;
&lt;br /&gt;
Here are some test commands you can run on the different instances:&lt;br /&gt;
 kafka-topics.sh --create --topic test --replication-factor 2 --partitions 3 --bootstrap-server ${MASTER_NODE}:9092&lt;br /&gt;
 kafka-topics.sh --list --bootstrap-server ${MASTER_NODE}:9092&lt;br /&gt;
&lt;br /&gt;
Run these two commands on different instances to see that lines you type into the &#039;&#039;producer&#039;&#039; console show up in the &#039;&#039;consumer&#039;&#039;:&lt;br /&gt;
 kafka-console-producer.sh --topic test --bootstrap-server ${MASTER_NODE}:9092&lt;br /&gt;
 kafka-console-consumer.sh --topic test --bootstrap-server ${MASTER_NODE}:9092 --from-beginning&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task 2:&#039;&#039;&#039; Run the full Twitter pipeline from Exercise 3. Tips:&lt;br /&gt;
* You may want to create a virtual environment to install Python packages such as &#039;&#039;tweepy&#039;&#039;, &#039;&#039;kafka-python&#039;&#039;, and &#039;&#039;afinn&#039;&#039;.&lt;br /&gt;
* You need to create &#039;&#039;keys/bearer_token&#039;&#039; with restricted access.&lt;br /&gt;
* With the default settings, it can take time before the final join is written to console. Be patient...&lt;br /&gt;
* Start-up sequence from &#039;&#039;spark-driver&#039;&#039;:&lt;br /&gt;
 # assuming zookeeper and kafka run already&lt;br /&gt;
 # create kafka topics&lt;br /&gt;
 kafka-topics.sh --create --topic tweets --bootstrap-server localhost:9092&lt;br /&gt;
 kafka-topics.sh --create --topic media --bootstrap-server localhost:9092&lt;br /&gt;
 kafka-topics.sh --create --topic photos --bootstrap-server localhost:9092&lt;br /&gt;
 # optional monitoring&lt;br /&gt;
 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic tweets --from-beginning &amp;amp;&lt;br /&gt;
 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic media --from-beginning &amp;amp;&lt;br /&gt;
 kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic photos --from-beginning &amp;amp;&lt;br /&gt;
 # main programs&lt;br /&gt;
 python twitterpipe_tweet_harvester.py  # run this repeatedly or modify to reconnect after sleeping&lt;br /&gt;
 python twitterpipe_media_harvester.py &amp;amp;  # this one can run in background&lt;br /&gt;
 spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 twitterpipe_tweet_pipeline.py &amp;amp;&lt;br /&gt;
Because you no longer use &#039;&#039;findspark&#039;&#039;, you need to use the &#039;&#039;&#039;--packages&#039;&#039;&#039; option with &#039;&#039;&#039;spark-submit&#039;&#039;&#039;. Make sure you use exactly the same package as you did with &#039;&#039;findspark.add_packages(...)&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Web UIs ==&lt;br /&gt;
Assuming that 158.39.77.227 is the IPv4 address of &#039;&#039;spark-driver&#039;&#039;, you can access Zookeeper&#039;s very simple web UI at http://158.39.77.227:8080/commands/stat .&lt;br /&gt;
&lt;br /&gt;
Kafka has no built-in web UI, but third-party UIs are available. &amp;lt;!-- If you [https://docs.docker.com/engine/install/ install Docker] on &#039;&#039;spark-driver&#039;&#039;, you can run something like:&lt;br /&gt;
 docker run -it -p 9001:9000 \&lt;br /&gt;
     -e KAFKA_BROKERCONNECT=${MASTER_NODE}:9092 \&lt;br /&gt;
     obsidiandynamics/kafdrop&lt;br /&gt;
When the docker runs, you can access the Kafka UI through a web browser at http://158.39.77.227:9001 (as usual assuming that 158.39.77.227 is the IPv4 address of &#039;&#039;spark-driver&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
You can install docker on &#039;&#039;spark-driver&#039;&#039; like this:&lt;br /&gt;
 sudo apt install docker.io&lt;br /&gt;
 sudo groupadd docker&lt;br /&gt;
 sudo usermod -aG docker ${USER}&lt;br /&gt;
&#039;&#039;Log out (&#039;exit&#039;) and then login in again (&#039;ssh spark-driver&#039;).&#039;&#039;--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1241</id>
		<title>Essay</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1241"/>
		<updated>2022-11-15T16:54:06Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Essay presentations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The essay shall present and discuss selected theory, technology and tools related to big data.&lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite essays that are related to your group assignments, for example discussing privacy aspects of your project (if any).&lt;br /&gt;
&lt;br /&gt;
== Theme ==&lt;br /&gt;
Your mandatory essay is supposed to be &amp;quot;an individual, theoretical essay on chosen topic&amp;quot;. In practice, it is you who propose a theme, and then I either accept it as is or guide you towards a more suitable theme.&lt;br /&gt;
&lt;br /&gt;
I invite essays that are not purely theoretical, but that present and reflect on your work with the group assignment. For example, the essay can discuss privacy-related aspects of your project. But in any case, your essay should demonstrate &amp;quot;thoughtful research and discussion&amp;quot;. It should be well backed by scholarly literature. &lt;br /&gt;
&lt;br /&gt;
== Finding a theme ==&lt;br /&gt;
It is a good idea to start thinking about possible essay themes early in the semester, at least to get the process started! Send me your ideas in an email, and I will provide early feedback. All I need is a few keywords or a sentence about one or more ideas you are considering. Later we can talk face to face.&lt;br /&gt;
&lt;br /&gt;
* Here is an example of an initial idea at the keyword stage: &amp;quot;Ethics and Privacy in using social media for news production.&amp;quot;&lt;br /&gt;
* A step further: &amp;quot;I want to look into how open and non-sensitive information collected from independent sources about the same individuals can become sensitive when recombined.&amp;quot;&lt;br /&gt;
* Here is an even more developed idea: &amp;quot;How can potentially sensitive information be identified in big-data streams for news production? And how can guards be put in place to ensure that sensitive information is not created as part of the news production stream?&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme ==&lt;br /&gt;
Everyone who intends to take the course must send an essay proposal by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no]. The subject line must contain the string &amp;quot;INFO319 Essay Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Proposal deadline:&#039;&#039;&#039; Wednesday October 12th 1500.&lt;br /&gt;
&lt;br /&gt;
== Essay presentations ==&lt;br /&gt;
&#039;&#039;&#039;Final essay presentations:&#039;&#039;&#039; Thursday November 24th 1015&lt;br /&gt;
&lt;br /&gt;
10 minutes and around 5-8 slides are an appropriate length for your presentations. Then we will have around 5 minutes for questions and comments.&lt;br /&gt;
&lt;br /&gt;
Examples of things you might want to touch in your presentation are: &lt;br /&gt;
* your problem and why it is important&lt;br /&gt;
* the most central theory/literature you have found for your work&lt;br /&gt;
* the question(s) you want to answer/topic(s) you want to illuminate&lt;br /&gt;
* your choice of method if you want to do something empirical&lt;br /&gt;
* what you have done so far&lt;br /&gt;
* what you hope to achieve&lt;br /&gt;
* what could be the most important outcomes of your essay (and are there pitfalls?)&lt;br /&gt;
&lt;br /&gt;
The 10 minutes include getting your slides up and running. This means that you need to be really well prepared. A short presentation is much harder than a long one: &amp;quot;Lincoln [... w]hen asked to appear upon some important occasion and deliver a five-minute speech, he said that he had no time to prepare five-minute speeches, but that he could go and speak an hour at any time.&amp;quot; (H. H. Markham, Governor of California, 1893)&lt;br /&gt;
&lt;br /&gt;
== Essay submission ==&lt;br /&gt;
&#039;&#039;&#039;Final essay submission:&#039;&#039;&#039; December 12th.&lt;br /&gt;
&lt;br /&gt;
Submit your essay through Inspera as a single PDF file. The version of your essay that you submit should be anonymous.&lt;br /&gt;
&lt;br /&gt;
Since the essay is graded, this is an official deadline. Submission is through Inspera, and if you do not submit on time, you will not be allowed to take the course exam a week later.&lt;br /&gt;
&lt;br /&gt;
Suggested length is 3000-4000 words. But effort and quality are much more important than length.&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1240</id>
		<title>Programming project</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1240"/>
		<updated>2022-11-15T16:53:04Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Project presentations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The project shall develop an application that uses big data technologies on social-media and/or other open data. At least a part of the project shall use Spark and run in the NREC cloud. The project should be carried out in groups of three, and never more. Working individually or in pairs is possible, but not recommended. &lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite projects that use &#039;&#039;big data for the news&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
More information about possible projects, deadlines, and other requirements will appear here soon.&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme: deadline ==&lt;br /&gt;
Everyone who intends to take the course must be included in a project proposal sent by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no] and with all the group members on Cc. The subject line must contain the string &amp;quot;INFO319 Project Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Proposal deadline:&#039;&#039;&#039; Wednesday October 12th 1500&lt;br /&gt;
&amp;lt;!-- The proposal does not have to be long, but the following points must be made clear:&lt;br /&gt;
*    What you are planning to make using big data and big data technologies.&lt;br /&gt;
*    Why it is a good idea to use big data and big data technologies for this purpose.&lt;br /&gt;
*    Exactly what you think is new with your idea.&lt;br /&gt;
*    What you have done to ensure that something very similar has not been done before.&lt;br /&gt;
*    Which datasets you are planning to use.&lt;br /&gt;
*    What technologies (programming language, libraries, development and collaboration tools) you are planning to use. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Project presentations ==&lt;br /&gt;
&#039;&#039;&#039;Final project presentations:&#039;&#039;&#039; Thursday December 8th 1015&lt;br /&gt;
&lt;br /&gt;
Depending a little on the number of project groups, each presentation will be brief: 15 minutes for each group + 5 minutes for questions and comments.&lt;br /&gt;
&lt;br /&gt;
You may demonstrate your project live (most convincing), or you may replay a recorded demonstration (which is good to have as a backup in any case). In addition, I expect each presentation to address/answer at least these points:&lt;br /&gt;
*    what have you made? - or: what is your application doing?&lt;br /&gt;
*    which technologies have you used (languages, APIs, other software etc.)&lt;br /&gt;
*    where did you get your data from? - and/or: which datasets have you used?&lt;br /&gt;
*    why is it a good idea to do this using big data and big data technologies? - or: what does your system do that was not possible (or at least not easy) to do before?&lt;br /&gt;
*    exactly what have you done and programmed so far?&lt;br /&gt;
*    what are you planning to do in the final few days?&lt;br /&gt;
*    have you got any particular problems you need to address?&lt;br /&gt;
&lt;br /&gt;
== Project submission ==&lt;br /&gt;
&#039;&#039;&#039;Final project submission:&#039;&#039;&#039; December 12th. &lt;br /&gt;
&lt;br /&gt;
Submit your project through Inspera as a single ZIP archive. The version of your project that you submit should be anonymous.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Since the project is graded, this is an official deadline. If you do not submit on time, you will not be allowed to take the course exam a week later.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Provide a short video (max 5 minutes) that shows your system running, with voice comments.&lt;br /&gt;
&lt;br /&gt;
Comment your code sparingly and in-line. You do not need additional documentation, but you should provide a precise description of how to run your system. For example, explain:&lt;br /&gt;
* which additional packages need to be installed&lt;br /&gt;
* which datasets need to be downloaded&lt;br /&gt;
** do not include large datasets (&amp;gt;10 MB) in your ZIP file&lt;br /&gt;
** but it is fine to include smaller test datasets&lt;br /&gt;
* if credentials (like a Twitter token) are needed to run the code, explain where they must be added&lt;br /&gt;
* which other systems must be running first (e.g., Kafka, HDFS, YARN)&lt;br /&gt;
* how to start your system (in particular if it consists of several programs)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
The end result of the project should be submitted as a ZIP archive through Inspera:&lt;br /&gt;
*     Just one person in the group shall deliver the group assignment (ZIP file) in Inspera.&lt;br /&gt;
*      The groups must be created by the candidates. To do that you must create an ID for your group. The ID must be four digits and all the members must know it, and use it.&lt;br /&gt;
*      When you log in to Inspera you get two options: &amp;quot;Join existing group&amp;quot; or &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      The first member in a group logging in to Inspera must choose: &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      All the other members in the same group must choose: &amp;quot;Join existing group&amp;quot;.&lt;br /&gt;
*      All members in the group must log in to Inspera Assessment (and join the Group), BEFORE one person delivers the group assignment. (Each group can only deliver once)&lt;br /&gt;
&lt;br /&gt;
The submitted ZIP archive should contain your complete project in a single directory. Include a file README.TXT in the root of the project directory to let us know if you have used a particular development environment (like Eclipse), what is needed before your code can be run, and how to run it, and if there are other things to do.&lt;br /&gt;
&lt;br /&gt;
The file name of your archive should contain the student numbers of everyone in the group. (Not that your student number is different from your student card number...). &lt;br /&gt;
&lt;br /&gt;
Your ZIP archive should contain a 2-page project description. Put this description in the root folder of your project directory before you ZIP it. The project description file should be anonymous, and contain the exam numbers of all group members, BOTH on the first page and in the file name (e.g., ProjectDescription_102_113.pdf .)&lt;br /&gt;
&lt;br /&gt;
The length of the project description is max 2 A4 pages with 11pt font and 2.5 cm margins. This is a HARD limit. You can have appendices, though, and any figures or tables come in addition to the two pages. The quality of your code is more important than the quality of the 2 page description. You receive a grade on the project, not on the report.&lt;br /&gt;
&lt;br /&gt;
* You should briefly explain the purpose of your system. Why have you made this? Why is it a good idea to do this using big data technologies? What can you do now that wasn&#039;t possible before?&lt;br /&gt;
* You should probably list the technologies/tools/standards/ datasets you have used and explain briefly why you chose each of them. Did you consider alternatives? Why were the ones you chose better?&lt;br /&gt;
* If you are reading/converting/lifting data from multiple sources and/or using existing tools in addition to your own program, you should probably include a flow chart or architecture sketch (which is different from a class diagram).&lt;br /&gt;
* You should probably include a class diagram and/or data flow diagram of your system.&lt;br /&gt;
* You should mention any particular problems you have had and/or things you want to do differently next time.&lt;br /&gt;
* If you want to briefly describe how to run the code you have submitted, you can do that separately in a README.TXT file.&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1239</id>
		<title>Essay</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1239"/>
		<updated>2022-11-15T16:52:40Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The essay shall present and discuss selected theory, technology and tools related to big data.&lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite essays that are related to your group assignments, for example discussing privacy aspects of your project (if any).&lt;br /&gt;
&lt;br /&gt;
== Theme ==&lt;br /&gt;
Your mandatory essay is supposed to be &amp;quot;an individual, theoretical essay on chosen topic&amp;quot;. In practice, it is you who propose a theme, and then I either accept it as is or guide you towards a more suitable theme.&lt;br /&gt;
&lt;br /&gt;
I invite essays that are not purely theoretical, but that present and reflect on your work with the group assignment. For example, the essay can discuss privacy-related aspects of your project. But in any case, your essay should demonstrate &amp;quot;thoughtful research and discussion&amp;quot;. It should be well backed by scholarly literature. &lt;br /&gt;
&lt;br /&gt;
== Finding a theme ==&lt;br /&gt;
It is a good idea to start thinking about possible essay themes early in the semester, at least to get the process started! Send me your ideas in an email, and I will provide early feedback. All I need is a few keywords or a sentence about one or more ideas you are considering. Later we can talk face to face.&lt;br /&gt;
&lt;br /&gt;
* Here is an example of an initial idea at the keyword stage: &amp;quot;Ethics and Privacy in using social media for news production.&amp;quot;&lt;br /&gt;
* A step further: &amp;quot;I want to look into how open and non-sensitive information collected from independent sources about the same individuals can become sensitive when recombined.&amp;quot;&lt;br /&gt;
* Here is an even more developed idea: &amp;quot;How can potentially sensitive information be identified in big-data streams for news production? And how can guards be put in place to ensure that sensitive information is not created as part of the news production stream?&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme ==&lt;br /&gt;
Everyone who intends to take the course must send an essay proposal by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no]. The subject line must contain the string &amp;quot;INFO319 Essay Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Proposal deadline:&#039;&#039;&#039; Wednesday October 12th 1500.&lt;br /&gt;
&lt;br /&gt;
== Essay presentations ==&lt;br /&gt;
The session on Thursday, November 24th will focus on essay presentations. 10 minutes and around 5-8 slides are an appropriate length. Then we have around 5 minutes for questions and comments.&lt;br /&gt;
&lt;br /&gt;
The 10 minutes include getting your slides up and running. This means that you need to be really well prepared. A short presentation is much harder than a long one: &amp;quot;Lincoln [... w]hen asked to appear upon some important occasion and deliver a five-minute speech, he said that he had no time to prepare five-minute speeches, but that he could go and speak an hour at any time.&amp;quot; (H. H. Markham, Governor of California, 1893)&lt;br /&gt;
&lt;br /&gt;
Examples of things you might want to touch in your presentation are: &lt;br /&gt;
* your problem and why it is important&lt;br /&gt;
* the most central theory/literature you have found for your work&lt;br /&gt;
* the question(s) you want to answer/topic(s) you want to illuminate&lt;br /&gt;
* your choice of method if you want to do something empirical&lt;br /&gt;
* what you have done so far&lt;br /&gt;
* what you hope to achieve&lt;br /&gt;
* what could be the most important outcomes of your essay (and are there pitfalls?)&lt;br /&gt;
&lt;br /&gt;
== Essay submission ==&lt;br /&gt;
&#039;&#039;&#039;Final essay submission:&#039;&#039;&#039; December 12th.&lt;br /&gt;
&lt;br /&gt;
Submit your essay through Inspera as a single PDF file. The version of your essay that you submit should be anonymous.&lt;br /&gt;
&lt;br /&gt;
Since the essay is graded, this is an official deadline. Submission is through Inspera, and if you do not submit on time, you will not be allowed to take the course exam a week later.&lt;br /&gt;
&lt;br /&gt;
Suggested length is 3000-4000 words. But effort and quality are much more important than length.&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1238</id>
		<title>Programming project</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1238"/>
		<updated>2022-11-15T16:44:43Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Proposing a theme: deadline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The project shall develop an application that uses big data technologies on social-media and/or other open data. At least a part of the project shall use Spark and run in the NREC cloud. The project should be carried out in groups of three, and never more. Working individually or in pairs is possible, but not recommended. &lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite projects that use &#039;&#039;big data for the news&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
More information about possible projects, deadlines, and other requirements will appear here soon.&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme: deadline ==&lt;br /&gt;
Everyone who intends to take the course must be included in a project proposal sent by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no] and with all the group members on Cc. The subject line must contain the string &amp;quot;INFO319 Project Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Proposal deadline:&#039;&#039;&#039; Wednesday October 12th 1500&lt;br /&gt;
&amp;lt;!-- The proposal does not have to be long, but the following points must be made clear:&lt;br /&gt;
*    What you are planning to make using big data and big data technologies.&lt;br /&gt;
*    Why it is a good idea to use big data and big data technologies for this purpose.&lt;br /&gt;
*    Exactly what you think is new with your idea.&lt;br /&gt;
*    What you have done to ensure that something very similar has not been done before.&lt;br /&gt;
*    Which datasets you are planning to use.&lt;br /&gt;
*    What technologies (programming language, libraries, development and collaboration tools) you are planning to use. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Project presentations ==&lt;br /&gt;
&#039;&#039;&#039;The final project presentations:&#039;&#039;&#039; Thursday December 8th 1015&lt;br /&gt;
&lt;br /&gt;
Depending a little on the number of project groups, each presentation will be brief: 15 minutes for each group + 5 minutes for questions and comments.&lt;br /&gt;
&lt;br /&gt;
You may demonstrate your project live (most convincing), or you may replay a recorded demonstration (which is good to have as a backup in any case). In addition, I expect each presentation to address/answer at least these points:&lt;br /&gt;
*    what have you made? - or: what is your application doing?&lt;br /&gt;
*    which technologies have you used (languages, APIs, other software etc.)&lt;br /&gt;
*    where did you get your data from? - and/or: which datasets have you used?&lt;br /&gt;
*    why is it a good idea to do this using big data and big data technologies? - or: what does your system do that was not possible (or at least not easy) to do before?&lt;br /&gt;
*    exactly what have you done and programmed so far?&lt;br /&gt;
*    what are you planning to do in the final few days?&lt;br /&gt;
*    have you got any particular problems you need to address?&lt;br /&gt;
&lt;br /&gt;
== Project submission ==&lt;br /&gt;
&#039;&#039;&#039;Final project submission:&#039;&#039;&#039; December 12th. &lt;br /&gt;
&lt;br /&gt;
Submit your project through Inspera as a single ZIP archive. The version of your project that you submit should be anonymous.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Since the project is graded, this is an official deadline. If you do not submit on time, you will not be allowed to take the course exam a week later.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Provide a short video (max 5 minutes) that shows your system running, with voice comments.&lt;br /&gt;
&lt;br /&gt;
Comment your code sparingly and in-line. You do not need additional documentation, but you should provide a precise description of how to run your system. For example, explain:&lt;br /&gt;
* which additional packages need to be installed&lt;br /&gt;
* which datasets need to be downloaded&lt;br /&gt;
** do not include large datasets (&amp;gt;10 MB) in your ZIP file&lt;br /&gt;
** but it is fine to include smaller test datasets&lt;br /&gt;
* if credentials (like a Twitter token) are needed to run the code, explain where they must be added&lt;br /&gt;
* which other systems must be running first (e.g., Kafka, HDFS, YARN)&lt;br /&gt;
* how to start your system (in particular if it consists of several programs)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
The end result of the project should be submitted as a ZIP archive through Inspera:&lt;br /&gt;
*     Just one person in the group shall deliver the group assignment (ZIP file) in Inspera.&lt;br /&gt;
*      The groups must be created by the candidates. To do that you must create an ID for your group. The ID must be four digits and all the members must know it, and use it.&lt;br /&gt;
*      When you log in to Inspera you get two options: &amp;quot;Join existing group&amp;quot; or &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      The first member in a group logging in to Inspera must choose: &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      All the other members in the same group must choose: &amp;quot;Join existing group&amp;quot;.&lt;br /&gt;
*      All members in the group must log in to Inspera Assessment (and join the Group), BEFORE one person delivers the group assignment. (Each group can only deliver once)&lt;br /&gt;
&lt;br /&gt;
The submitted ZIP archive should contain your complete project in a single directory. Include a file README.TXT in the root of the project directory to let us know if you have used a particular development environment (like Eclipse), what is needed before your code can be run, and how to run it, and if there are other things to do.&lt;br /&gt;
&lt;br /&gt;
The file name of your archive should contain the student numbers of everyone in the group. (Note that your student number is different from your student card number.)&lt;br /&gt;
&lt;br /&gt;
Your ZIP archive should contain a 2-page project description. Put this description in the root folder of your project directory before you ZIP it. The project description file should be anonymous, and contain the exam numbers of all group members, BOTH on the first page and in the file name (e.g., ProjectDescription_102_113.pdf).&lt;br /&gt;
&lt;br /&gt;
The length of the project description is max 2 A4 pages with 11pt font and 2.5 cm margins. This is a HARD limit. You can have appendices, though, and any figures or tables come in addition to the two pages. The quality of your code is more important than the quality of the 2 page description. You receive a grade on the project, not on the report.&lt;br /&gt;
&lt;br /&gt;
* You should briefly explain the purpose of your system. Why have you made this? Why is it a good idea to do this using big data technologies? What can you do now that wasn&#039;t possible before?&lt;br /&gt;
* You should probably list the technologies/tools/standards/datasets you have used and explain briefly why you chose each of them. Did you consider alternatives? Why were the ones you chose better?&lt;br /&gt;
* If you are reading/converting/lifting data from multiple sources and/or using existing tools in addition to your own program, you should probably include a flow chart or architecture sketch (which is different from a class diagram).&lt;br /&gt;
* You should probably include a class diagram and/or data flow diagram of your system.&lt;br /&gt;
* You should mention any particular problems you have had and/or things you want to do differently next time.&lt;br /&gt;
* If you want to briefly describe how to run the code you have submitted, you can do that separately in a README.TXT file.&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1237</id>
		<title>Programming project</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1237"/>
		<updated>2022-11-15T16:44:24Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Project submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The project shall develop an application that uses big data technologies on social-media and/or other open data. At least a part of the project shall use Spark and run in the NREC cloud. The project should be carried out in groups of three, and never more. Working individually or in pairs is possible, but not recommended.&lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite projects that use &#039;&#039;big data for the news&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
More information about possible projects, deadlines, and other requirements will appear here soon.&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme: deadline ==&lt;br /&gt;
Everyone who intends to take the course must be included in a project proposal sent by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no] and with all the group members on Cc. The subject line must contain the string &amp;quot;INFO319 Project Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Proposal deadline:&#039;&#039;&#039; Wednesday October 12th 1500&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- The proposal does not have to be long, but the following points must be made clear:&lt;br /&gt;
*    What you are planning to make using big data and big data technologies.&lt;br /&gt;
*    Why it is a good idea to use big data and big data technologies for this purpose.&lt;br /&gt;
*    Exactly what you think is new with your idea.&lt;br /&gt;
*    What you have done to ensure that something very similar has not been done before.&lt;br /&gt;
*    Which datasets you are planning to use.&lt;br /&gt;
*    What technologies (programming language, libraries, development and collaboration tools) you are planning to use. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project presentations ==&lt;br /&gt;
&#039;&#039;&#039;The final project presentations:&#039;&#039;&#039; Thursday December 8th 1015&lt;br /&gt;
&lt;br /&gt;
Depending a little on the number of project groups, each presentation will be brief: 15 minutes for each group + 5 minutes for questions and comments.&lt;br /&gt;
&lt;br /&gt;
You may demonstrate your project live (most convincing), or you may replay a recorded demonstration (which is good to have as a backup in any case). In addition, I expect each presentation to address/answer at least these points:&lt;br /&gt;
*    what have you made? - or: what is your application doing?&lt;br /&gt;
*    which technologies have you used (languages, APIs, other software etc.)&lt;br /&gt;
*    where did you get your data from? - and/or: which datasets have you used?&lt;br /&gt;
*    why is it a good idea to do this using big data and big data technologies? - or: what does your system do that was not possible (or at least not easy) to do before?&lt;br /&gt;
*    exactly what have you done and programmed so far?&lt;br /&gt;
*    what are you planning to do in the final few days?&lt;br /&gt;
*    have you got any particular problems you need to address?&lt;br /&gt;
&lt;br /&gt;
== Project submission ==&lt;br /&gt;
&#039;&#039;&#039;Final project submission:&#039;&#039;&#039; December 12th. &lt;br /&gt;
&lt;br /&gt;
Submit your project through Inspera as a single ZIP archive. The version of your project that you submit should be anonymous.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Since the project is graded, this is an official deadline. If you do not submit on time, you will not be allowed to take the course exam a week later.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Provide a short video (max 5 minutes) that shows your system running, with voice comments.&lt;br /&gt;
&lt;br /&gt;
Comment your code sparsely and in-line. You do not need additional documentation, but you should provide a precise description of how to run your system. For example, explain:&lt;br /&gt;
* which additional packages need to be installed&lt;br /&gt;
* which datasets need to be downloaded&lt;br /&gt;
** do not include large datasets (&amp;gt;10 MB) in your ZIP file&lt;br /&gt;
** but it is fine to include smaller test datasets&lt;br /&gt;
* if credentials (like a Twitter token) are needed to run the code, explain where they must be added&lt;br /&gt;
* which other systems must be running first (e.g., Kafka, HDFS, YARN)&lt;br /&gt;
* how to start your system (in particular if it consists of several programs)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
The end result of the project should be submitted as a ZIP archive through Inspera:&lt;br /&gt;
*     Just one person in the group shall deliver the group assignment (ZIP file) in Inspera.&lt;br /&gt;
*      The groups must be created by the candidates. To do that you must create an ID for your group. The ID must be four digits and all the members must know it, and use it.&lt;br /&gt;
*      When you log in to Inspera you get two options: &amp;quot;Join existing group&amp;quot; or &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      The first member in a group logging in to Inspera must choose: &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      All the other members in the same group must choose: &amp;quot;Join existing group&amp;quot;.&lt;br /&gt;
*      All members in the group must log in to Inspera Assessment (and join the Group), BEFORE one person delivers the group assignment. (Each group can only deliver once)&lt;br /&gt;
&lt;br /&gt;
The submitted ZIP archive should contain your complete project in a single directory. Include a file README.TXT in the root of the project directory to let us know if you have used a particular development environment (like Eclipse), what is needed before your code can be run, and how to run it, and if there are other things to do.&lt;br /&gt;
&lt;br /&gt;
The file name of your archive should contain the student numbers of everyone in the group. (Note that your student number is different from your student card number.)&lt;br /&gt;
&lt;br /&gt;
Your ZIP archive should contain a 2-page project description. Put this description in the root folder of your project directory before you ZIP it. The project description file should be anonymous, and contain the exam numbers of all group members, BOTH on the first page and in the file name (e.g., ProjectDescription_102_113.pdf).&lt;br /&gt;
&lt;br /&gt;
The length of the project description is max 2 A4 pages with 11pt font and 2.5 cm margins. This is a HARD limit. You can have appendices, though, and any figures or tables come in addition to the two pages. The quality of your code is more important than the quality of the 2 page description. You receive a grade on the project, not on the report.&lt;br /&gt;
&lt;br /&gt;
* You should briefly explain the purpose of your system. Why have you made this? Why is it a good idea to do this using big data technologies? What can you do now that wasn&#039;t possible before?&lt;br /&gt;
* You should probably list the technologies/tools/standards/datasets you have used and explain briefly why you chose each of them. Did you consider alternatives? Why were the ones you chose better?&lt;br /&gt;
* If you are reading/converting/lifting data from multiple sources and/or using existing tools in addition to your own program, you should probably include a flow chart or architecture sketch (which is different from a class diagram).&lt;br /&gt;
* You should probably include a class diagram and/or data flow diagram of your system.&lt;br /&gt;
* You should mention any particular problems you have had and/or things you want to do differently next time.&lt;br /&gt;
* If you want to briefly describe how to run the code you have submitted, you can do that separately in a README.TXT file.&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1236</id>
		<title>Programming project</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1236"/>
		<updated>2022-11-15T16:43:58Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Project submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The project shall develop an application that uses big data technologies on social-media and/or other open data. At least a part of the project shall use Spark and run in the NREC cloud. The project should be carried out in groups of three, and never more. Working individually or in pairs is possible, but not recommended.&lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite projects that use &#039;&#039;big data for the news&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
More information about possible projects, deadlines, and other requirements will appear here soon.&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme: deadline ==&lt;br /&gt;
Everyone who intends to take the course must be included in a project proposal sent by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no] and with all the group members on Cc. The subject line must contain the string &amp;quot;INFO319 Project Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Proposal deadline:&#039;&#039;&#039; Wednesday October 12th 1500&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- The proposal does not have to be long, but the following points must be made clear:&lt;br /&gt;
*    What you are planning to make using big data and big data technologies.&lt;br /&gt;
*    Why it is a good idea to use big data and big data technologies for this purpose.&lt;br /&gt;
*    Exactly what you think is new with your idea.&lt;br /&gt;
*    What you have done to ensure that something very similar has not been done before.&lt;br /&gt;
*    Which datasets you are planning to use.&lt;br /&gt;
*    What technologies (programming language, libraries, development and collaboration tools) you are planning to use. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project presentations ==&lt;br /&gt;
&#039;&#039;&#039;The final project presentations:&#039;&#039;&#039; Thursday December 8th 1015&lt;br /&gt;
&lt;br /&gt;
Depending a little on the number of project groups, each presentation will be brief: 15 minutes for each group + 5 minutes for questions and comments.&lt;br /&gt;
&lt;br /&gt;
You may demonstrate your project live (most convincing), or you may replay a recorded demonstration (which is good to have as a backup in any case). In addition, I expect each presentation to address/answer at least these points:&lt;br /&gt;
*    what have you made? - or: what is your application doing?&lt;br /&gt;
*    which technologies have you used (languages, APIs, other software etc.)&lt;br /&gt;
*    where did you get your data from? - and/or: which datasets have you used?&lt;br /&gt;
*    why is it a good idea to do this using big data and big data technologies? - or: what does your system do that was not possible (or at least not easy) to do before?&lt;br /&gt;
*    exactly what have you done and programmed so far?&lt;br /&gt;
*    what are you planning to do in the final few days?&lt;br /&gt;
*    have you got any particular problems you need to address?&lt;br /&gt;
&lt;br /&gt;
== Project submission ==&lt;br /&gt;
Submit your project through Inspera as a single ZIP archive. The version of your project that you submit should be anonymous.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Final project submission:&#039;&#039;&#039; December 12th. Since the project is graded, this is an official deadline. If you do not submit on time, you will not be allowed to take the course exam a week later.&lt;br /&gt;
&lt;br /&gt;
Provide a short video (max 5 minutes) that shows your system running, with voice comments.&lt;br /&gt;
&lt;br /&gt;
Comment your code sparsely and in-line. You do not need additional documentation, but you should provide a precise description of how to run your system. For example, explain:&lt;br /&gt;
* which additional packages need to be installed&lt;br /&gt;
* which datasets need to be downloaded&lt;br /&gt;
** do not include large datasets (&amp;gt;10 MB) in your ZIP file&lt;br /&gt;
** but it is fine to include smaller test datasets&lt;br /&gt;
* if credentials (like a Twitter token) are needed to run the code, explain where they must be added&lt;br /&gt;
* which other systems must be running first (e.g., Kafka, HDFS, YARN)&lt;br /&gt;
* how to start your system (in particular if it consists of several programs)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
The end result of the project should be submitted as a ZIP archive through Inspera:&lt;br /&gt;
*     Just one person in the group shall deliver the group assignment (ZIP file) in Inspera.&lt;br /&gt;
*      The groups must be created by the candidates. To do that you must create an ID for your group. The ID must be four digits and all the members must know it, and use it.&lt;br /&gt;
*      When you log in to Inspera you get two options: &amp;quot;Join existing group&amp;quot; or &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      The first member in a group logging in to Inspera must choose: &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      All the other members in the same group must choose: &amp;quot;Join existing group&amp;quot;.&lt;br /&gt;
*      All members in the group must log in to Inspera Assessment (and join the Group), BEFORE one person delivers the group assignment. (Each group can only deliver once)&lt;br /&gt;
&lt;br /&gt;
The submitted ZIP archive should contain your complete project in a single directory. Include a file README.TXT in the root of the project directory to let us know if you have used a particular development environment (like Eclipse), what is needed before your code can be run, and how to run it, and if there are other things to do.&lt;br /&gt;
&lt;br /&gt;
The file name of your archive should contain the student numbers of everyone in the group. (Note that your student number is different from your student card number.)&lt;br /&gt;
&lt;br /&gt;
Your ZIP archive should contain a 2-page project description. Put this description in the root folder of your project directory before you ZIP it. The project description file should be anonymous, and contain the exam numbers of all group members, BOTH on the first page and in the file name (e.g., ProjectDescription_102_113.pdf).&lt;br /&gt;
&lt;br /&gt;
The length of the project description is max 2 A4 pages with 11pt font and 2.5 cm margins. This is a HARD limit. You can have appendices, though, and any figures or tables come in addition to the two pages. The quality of your code is more important than the quality of the 2 page description. You receive a grade on the project, not on the report.&lt;br /&gt;
&lt;br /&gt;
* You should briefly explain the purpose of your system. Why have you made this? Why is it a good idea to do this using big data technologies? What can you do now that wasn&#039;t possible before?&lt;br /&gt;
* You should probably list the technologies/tools/standards/datasets you have used and explain briefly why you chose each of them. Did you consider alternatives? Why were the ones you chose better?&lt;br /&gt;
* If you are reading/converting/lifting data from multiple sources and/or using existing tools in addition to your own program, you should probably include a flow chart or architecture sketch (which is different from a class diagram).&lt;br /&gt;
* You should probably include a class diagram and/or data flow diagram of your system.&lt;br /&gt;
* You should mention any particular problems you have had and/or things you want to do differently next time.&lt;br /&gt;
* If you want to briefly describe how to run the code you have submitted, you can do that separately in a README.TXT file.&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1235</id>
		<title>Programming project</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Programming_project&amp;diff=1235"/>
		<updated>2022-11-15T16:43:43Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The project shall develop an application that uses big data technologies on social-media and/or other open data. At least a part of the project shall use Spark and run in the NREC cloud. The project should be carried out in groups of three, and never more. Working individually or in pairs is possible, but not recommended.&lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite projects that use &#039;&#039;big data for the news&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
More information about possible projects, deadlines, and other requirements will appear here soon.&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme: deadline ==&lt;br /&gt;
Everyone who intends to take the course must be included in a project proposal sent by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no] and with all the group members on Cc. The subject line must contain the string &amp;quot;INFO319 Project Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Proposal deadline:&#039;&#039;&#039; Wednesday October 12th 1500&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- The proposal does not have to be long, but the following points must be made clear:&lt;br /&gt;
*    What you are planning to make using big data and big data technologies.&lt;br /&gt;
*    Why it is a good idea to use big data and big data technologies for this purpose.&lt;br /&gt;
*    Exactly what you think is new with your idea.&lt;br /&gt;
*    What you have done to ensure that something very similar has not been done before.&lt;br /&gt;
*    Which datasets you are planning to use.&lt;br /&gt;
*    What technologies (programming language, libraries, development and collaboration tools) you are planning to use. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project presentations ==&lt;br /&gt;
&#039;&#039;&#039;The final project presentations:&#039;&#039;&#039; Thursday December 8th 1015&lt;br /&gt;
&lt;br /&gt;
Depending a little on the number of project groups, each presentation will be brief: 15 minutes for each group + 5 minutes for questions and comments.&lt;br /&gt;
&lt;br /&gt;
You may demonstrate your project live (most convincing), or you may replay a recorded demonstration (which is good to have as a backup in any case). In addition, I expect each presentation to address/answer at least these points:&lt;br /&gt;
*    what have you made? - or: what is your application doing?&lt;br /&gt;
*    which technologies have you used (languages, APIs, other software etc.)&lt;br /&gt;
*    where did you get your data from? - and/or: which datasets have you used?&lt;br /&gt;
*    why is it a good idea to do this using big data and big data technologies? - or: what does your system do that was not possible (or at least not easy) to do before?&lt;br /&gt;
*    exactly what have you done and programmed so far?&lt;br /&gt;
*    what are you planning to do in the final few days?&lt;br /&gt;
*    have you got any particular problems you need to address?&lt;br /&gt;
&lt;br /&gt;
== Project submission ==&lt;br /&gt;
Submit your project through Inspera as a single ZIP archive. The version of your project that you submit should be anonymous.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Final project submission:&#039;&#039;&#039; December 12th. Since the project is graded, this is an official deadline. If you do not submit on time, you will not be allowed to take the course exam a week later.&lt;br /&gt;
&lt;br /&gt;
Provide a short video (max 5 minutes) that shows your system running, with voice comments.&lt;br /&gt;
&lt;br /&gt;
Comment your code sparsely and in-line. You do not need additional documentation, but you should provide a precise description of how to run your system. For example, explain:&lt;br /&gt;
* which additional packages need to be installed&lt;br /&gt;
* which datasets need to be downloaded&lt;br /&gt;
** do not include large datasets (&amp;gt;10 MB) in your ZIP file&lt;br /&gt;
** but it is fine to include smaller test datasets&lt;br /&gt;
* if credentials (like a Twitter token) are needed to run the code, explain where they must be added&lt;br /&gt;
* which other systems must be running first (e.g., Kafka, HDFS, YARN)&lt;br /&gt;
* how to start your system (in particular if it consists of several programs)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
The end result of the project should be submitted as a ZIP archive through Inspera:&lt;br /&gt;
*     Just one person in the group shall deliver the group assignment (ZIP file) in Inspera.&lt;br /&gt;
*      The groups must be created by the candidates. To do that you must create an ID for your group. The ID must be four digits and all the members must know it, and use it.&lt;br /&gt;
*      When you log in to Inspera you get two options: &amp;quot;Join existing group&amp;quot; or &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      The first member in a group logging in to Inspera must choose: &amp;quot;Create new group&amp;quot;.&lt;br /&gt;
*      All the other members in the same group must choose: &amp;quot;Join existing group&amp;quot;.&lt;br /&gt;
*      All members in the group must log in to Inspera Assessment (and join the Group), BEFORE one person delivers the group assignment. (Each group can only deliver once)&lt;br /&gt;
&lt;br /&gt;
The submitted ZIP archive should contain your complete project in a single directory. Include a file README.TXT in the root of the project directory to let us know if you have used a particular development environment (like Eclipse), what is needed before your code can be run, and how to run it, and if there are other things to do.&lt;br /&gt;
&lt;br /&gt;
The file name of your archive should contain the student numbers of everyone in the group. (Note that your student number is different from your student card number.)&lt;br /&gt;
&lt;br /&gt;
Your ZIP archive should contain a 2-page project description. Put this description in the root folder of your project directory before you ZIP it. The project description file should be anonymous, and contain the exam numbers of all group members, BOTH on the first page and in the file name (e.g., ProjectDescription_102_113.pdf).&lt;br /&gt;
&lt;br /&gt;
The length of the project description is max 2 A4 pages with 11pt font and 2.5 cm margins. This is a HARD limit. You can have appendices, though, and any figures or tables come in addition to the two pages. The quality of your code is more important than the quality of the 2 page description. You receive a grade on the project, not on the report.&lt;br /&gt;
&lt;br /&gt;
* You should briefly explain the purpose of your system. Why have you made this? Why is it a good idea to do this using big data technologies? What can you do now that wasn&#039;t possible before?&lt;br /&gt;
* You should probably list the technologies/tools/standards/datasets you have used and explain briefly why you chose each of them. Did you consider alternatives? Why were the ones you chose better?&lt;br /&gt;
* If you are reading/converting/lifting data from multiple sources and/or using existing tools in addition to your own program, you should probably include a flow chart or architecture sketch (which is different from a class diagram).&lt;br /&gt;
* You should probably include a class diagram and/or data flow diagram of your system.&lt;br /&gt;
* You should mention any particular problems you have had and/or things you want to do differently next time.&lt;br /&gt;
* If you want to briefly describe how to run the code you have submitted, you can do that separately in a README.TXT file.&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1234</id>
		<title>Essay</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1234"/>
		<updated>2022-11-15T16:30:20Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Essay submission */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The essay shall present and discuss selected theory, technology and tools related to big data.&lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite essays that are related to your group assignments, for example discussing privacy aspects of your project (if any).&lt;br /&gt;
&lt;br /&gt;
== Theme ==&lt;br /&gt;
Your mandatory essay is supposed to be &amp;quot;an individual, theoretical essay on chosen topic&amp;quot;. In practice, it is you who propose a theme, and then I either accept it as is or guide you towards a more suitable theme.&lt;br /&gt;
&lt;br /&gt;
I invite essays that are not purely theoretical, but that present and reflect over your work with the group assignment. For example, the essay can discuss privacy-related aspects of your project. But in any case, your essay should demonstrate &amp;quot;thoughtful research and discussion&amp;quot;. It should be well backed by scholarly literature. &lt;br /&gt;
&lt;br /&gt;
== Finding a theme ==&lt;br /&gt;
It is a good idea to start thinking about possible essay themes early in the semester, at least to get the process started! Send me your ideas in an email, and I will provide early feedback. All I need is a few keywords or a sentence about one or more ideas you are considering. Later we can talk face to face.&lt;br /&gt;
&lt;br /&gt;
* Here is an example of an initial idea on the keyword stage: &amp;quot;Ethics and Privacy in using social media for news production.&amp;quot;&lt;br /&gt;
* A step further: &amp;quot;I want to look into how open and non-sensitive information collected from independent sources about the same individuals can become sensitive when recombined.&amp;quot;&lt;br /&gt;
* Here is an even more developed idea: &amp;quot;How can potentially sensitive information be identified in big-data streams for news production? And how can guards be put in place to ensure that sensitive information is not created as part of the news production stream?&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme: deadline ==&lt;br /&gt;
Everyone who intends to take the course must send an essay proposal by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no]. The subject line must contain the string &amp;quot;INFO319 Essay Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Deadline:&#039;&#039;&#039; Wednesday October 12th 1500.&lt;br /&gt;
&lt;br /&gt;
== Length ==&lt;br /&gt;
Suggested length is 3000-4000 words. But effort and quality are much more important than length.&lt;br /&gt;
&lt;br /&gt;
== Essay submission ==&lt;br /&gt;
Submit your file through Inspera as a single PDF file. The version of your essay that you submit should be anonymous.&lt;br /&gt;
&lt;br /&gt;
Deadline for final essay submission is December 12th. Since the essay is graded, this is an official deadline. Submission is through Inspera and if you do not submit on time, you will not be allowed to take the course exam a week later.&lt;br /&gt;
&lt;br /&gt;
== Essay presentations ==&lt;br /&gt;
The session on Thursday, November 24th will focus on essay presentations. 10 minutes and around 5-8 slides are an appropriate length.&lt;br /&gt;
&lt;br /&gt;
Depending on the number of people who take the course, we may not have much time per essay. This includes: getting your slides up and running, presenting the actual essay, presenting critique, and posing/answering questions. In addition to presenting your own essay, you are supposed to offer comments on at least one other essay.&lt;br /&gt;
&lt;br /&gt;
This means that you need to be really well prepared. A short presentation is much harder than a long one: &amp;quot;Lincoln [... w]hen asked to appear upon some important occasion and deliver a five-minute speech, he said that he had no time to prepare five-minute speeches, but that he could go and speak an hour at any time.&amp;quot; (H. H. Markham, Governor of California, 1893)&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1233</id>
		<title>Essay</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1233"/>
		<updated>2022-11-15T16:13:36Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Proposing a theme: deadline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The essay shall present and discuss selected theory, technology and tools related to big data.&lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite essays that are related to your group assignments, for example discussing privacy aspects of your project (if any).&lt;br /&gt;
&lt;br /&gt;
== Theme ==&lt;br /&gt;
Your mandatory essay is supposed to be &amp;quot;an individual, theoretical essay on a chosen topic&amp;quot;. In practice, you propose a theme, and then I either accept it as is or guide you towards a more suitable one.&lt;br /&gt;
&lt;br /&gt;
I invite essays that are not purely theoretical, but that present and reflect on your work on the group assignment. For example, the essay can discuss privacy-related aspects of your project. But in any case, your essay should demonstrate &amp;quot;thoughtful research and discussion&amp;quot;. It should be well backed by scholarly literature.&lt;br /&gt;
&lt;br /&gt;
== Finding a theme ==&lt;br /&gt;
It is a good idea to start thinking about possible essay themes early in the semester, at least to get the process started! Send me your ideas in an email, and I will provide early feedback. All I need is a few keywords or a sentence about one or more ideas you are considering. Later we can talk face to face.&lt;br /&gt;
&lt;br /&gt;
* Here is an example of an initial idea at the keyword stage: &amp;quot;Ethics and Privacy in using social media for news production.&amp;quot;&lt;br /&gt;
* A step further: &amp;quot;I want to look into how open and non-sensitive information collected from independent sources about the same individuals can become sensitive when recombined.&amp;quot;&lt;br /&gt;
* Here is an even more developed idea: &amp;quot;How can potentially sensitive information be identified in big-data streams for news production? And how can guards be put in place to ensure that sensitive information is not created as part of the news production stream?&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme: deadline ==&lt;br /&gt;
Everyone who intends to take the course must send an essay proposal by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no]. The subject line must contain the string &amp;quot;INFO319 Essay Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Deadline:&#039;&#039;&#039; Wednesday October 12th 1500.&lt;br /&gt;
&lt;br /&gt;
== Length ==&lt;br /&gt;
Suggested length is 3000-4000 words. But effort and quality are much more important than length.&lt;br /&gt;
&lt;br /&gt;
== Essay submission ==&lt;br /&gt;
TBD&lt;br /&gt;
&lt;br /&gt;
== Essay presentations ==&lt;br /&gt;
The session on Thursday, November 24th will focus on essay presentations. 10 minutes and around 5-8 slides are an appropriate length.&lt;br /&gt;
&lt;br /&gt;
Depending on the number of people who take the course, we may not have much time per essay. That time includes getting your slides up and running, presenting the essay itself, presenting critique, and posing/answering questions. In addition to presenting your own essay, you are supposed to offer comments on at least one other essay.&lt;br /&gt;
&lt;br /&gt;
This means that you need to be really well prepared. A short presentation is much harder than a long one: &amp;quot;Lincoln [... w]hen asked to appear upon some important occasion and deliver a five-minute speech, he said that he had no time to prepare five-minute speeches, but that he could go and speak an hour at any time.&amp;quot; (H. H. Markham, Governor of California, 1893)&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1232</id>
		<title>Essay</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Essay&amp;diff=1232"/>
		<updated>2022-11-15T16:12:46Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Length */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The essay shall present and discuss selected theory, technology and tools related to big data.&lt;br /&gt;
&lt;br /&gt;
This autumn, we specifically invite essays that are related to your group assignments, for example discussing privacy aspects of your project (if any).&lt;br /&gt;
&lt;br /&gt;
== Theme ==&lt;br /&gt;
Your mandatory essay is supposed to be &amp;quot;an individual, theoretical essay on a chosen topic&amp;quot;. In practice, you propose a theme, and then I either accept it as is or guide you towards a more suitable one.&lt;br /&gt;
&lt;br /&gt;
I invite essays that are not purely theoretical, but that present and reflect on your work on the group assignment. For example, the essay can discuss privacy-related aspects of your project. But in any case, your essay should demonstrate &amp;quot;thoughtful research and discussion&amp;quot;. It should be well backed by scholarly literature.&lt;br /&gt;
&lt;br /&gt;
== Finding a theme ==&lt;br /&gt;
It is a good idea to start thinking about possible essay themes early in the semester, at least to get the process started! Send me your ideas in an email, and I will provide early feedback. All I need is a few keywords or a sentence about one or more ideas you are considering. Later we can talk face to face.&lt;br /&gt;
&lt;br /&gt;
* Here is an example of an initial idea at the keyword stage: &amp;quot;Ethics and Privacy in using social media for news production.&amp;quot;&lt;br /&gt;
* A step further: &amp;quot;I want to look into how open and non-sensitive information collected from independent sources about the same individuals can become sensitive when recombined.&amp;quot;&lt;br /&gt;
* Here is an even more developed idea: &amp;quot;How can potentially sensitive information be identified in big-data streams for news production? And how can guards be put in place to ensure that sensitive information is not created as part of the news production stream?&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Proposing a theme: deadline ==&lt;br /&gt;
Everyone who intends to take the course must send an essay proposal by email to [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no]. The subject line must contain the string &amp;quot;INFO319 Essay Proposal&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Deadline:&#039;&#039;&#039; Wednesday October 12th 1500&lt;br /&gt;
&lt;br /&gt;
But it is a good idea to contact me about this earlier :-)&lt;br /&gt;
&lt;br /&gt;
== Length ==&lt;br /&gt;
Suggested length is 3000-4000 words. But effort and quality are much more important than length.&lt;br /&gt;
&lt;br /&gt;
== Essay submission ==&lt;br /&gt;
TBD&lt;br /&gt;
&lt;br /&gt;
== Essay presentations ==&lt;br /&gt;
The session on Thursday, November 24th will focus on essay presentations. 10 minutes and around 5-8 slides are an appropriate length.&lt;br /&gt;
&lt;br /&gt;
Depending on the number of people who take the course, we may not have much time per essay. That time includes getting your slides up and running, presenting the essay itself, presenting critique, and posing/answering questions. In addition to presenting your own essay, you are supposed to offer comments on at least one other essay.&lt;br /&gt;
&lt;br /&gt;
This means that you need to be really well prepared. A short presentation is much harder than a long one: &amp;quot;Lincoln [... w]hen asked to appear upon some important occasion and deliver a five-minute speech, he said that he had no time to prepare five-minute speeches, but that he could go and speak an hour at any time.&amp;quot; (H. H. Markham, Governor of California, 1893)&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Course_wiki_for_INFO319&amp;diff=1231</id>
		<title>Course wiki for INFO319</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Course_wiki_for_INFO319&amp;diff=1231"/>
		<updated>2022-11-15T16:11:31Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This wiki (under development) contains practical information about INFO319 - Big Data in the autumn of 2022, including readings and exercises.&lt;br /&gt;
&lt;br /&gt;
* [[Readings]]: Mandatory and supplementary readings for the course. In addition, the sessions page proposes specific readings for each session.&lt;br /&gt;
* [[Sessions]]: The course comprises 8 full-day seminars. The sessions involve lectures, demos, student presentations, and practical work.  &lt;br /&gt;
* [[Exercises]]: Exercises related to the sessions. Solutions to the exercises can form the backbone of the mandatory group assignment.&lt;br /&gt;
* [https://www.uib.no/emne/INFO319?sem=2022h Assessment]: The course assessment has three parts:&lt;br /&gt;
** Portfolio evaluation of the essay and group assignment (55%)&lt;br /&gt;
** Oral presentations of essay and group assignment (15%)&lt;br /&gt;
** Written exam (3 hours) (30%)&lt;br /&gt;
* [[Essay]]: One part of the portfolio evaluation is &amp;quot;an individual, theoretical essay with thoughtful research and discussion of an assigned topic&amp;quot;.&lt;br /&gt;
* [[Programming project|Project]]: Another part of the portfolio evaluation is a &amp;quot;practical assignment in groups&amp;quot; that has the form of a group programming project. Solutions to the suggested exercises can form the backbone of your project.&lt;br /&gt;
* Participation: Participation in 80% of the course seminars is &#039;&#039;mandatory&#039;&#039;. Participation in &#039;&#039;the last two sessions is also mandatory&#039;&#039; (because your presentations there are part of the course assessment).&lt;br /&gt;
&amp;lt;!-- * [[Datasets]] : Available datasets that can be used for data analysis. --&amp;gt; &lt;br /&gt;
* [https://mitt.uib.no/courses/37204/pages/info319-big-data-h22 Administration]: For formal and administrative information, see [https://mitt.uib.no/courses/37204/pages/info319-big-data-h22 UiB&#039;s Study Portal].&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Contact:&#039;&#039;&#039; [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;For questions that are not strictly personal, always use the [https://mitt.uib.no/courses/37204/discussion_topics/321322 discussion forum (requires login)] at [https://mitt.uib.no/courses/37204/pages/info319-big-data-h22 mitt.uib.no]. If I receive general questions about INFO319 by email, I will answer them in the forum anyway, so it is fastest to post them directly there.&#039;&#039;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Sessions&amp;diff=1230</id>
		<title>Sessions</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Sessions&amp;diff=1230"/>
		<updated>2022-11-10T21:16:06Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Session 6 - Societal issues. Privacy. GDPR */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Tentative themes for each session ==&lt;br /&gt;
* Thursday August 18th: Introduction meeting [[File:IntroductionMeeting.pdf]]&lt;br /&gt;
* Thursday September 1st: Session 1 - Introduction to big data. Big-data processing. Spark&lt;br /&gt;
* Thursday September 15th: Session 2 - More about Spark. Data sources. Twitter&lt;br /&gt;
* Thursday September 29th: Session 3 - Streaming Spark. Big-data architectures. Kafka&lt;br /&gt;
* Thursday October 13th: Session 4 - Cloud computing. NREC and Openstack&lt;br /&gt;
* Thursday October 27th: Session 5 - Cloud management. Terraform and Ansible. Docker and Kubernetes&lt;br /&gt;
* Thursday November 10th: Session 6 - Societal issues. Privacy. GDPR&lt;br /&gt;
* Thursday November 24th: Session 7 - Essay presentations&lt;br /&gt;
* Thursday December 8th: Session 8 - Project demonstrations&lt;br /&gt;
&lt;br /&gt;
== Session 1 - Introduction to big data. Big-data processing. Spark ==&lt;br /&gt;
* Kitchin, chapters 1, 4-5&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 1-3, 12, 15&lt;br /&gt;
* Slides: [[File:S01-BigData-published.pdf]] [[File:S01-Spark-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Section 1 in Opdahl, A. L., &amp;amp; Nunavath, V. (2020). Big Data. Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data, 15-29. [https://link.springer.com/chapter/10.1007/978-3-030-48099-8_2 Paper]&lt;br /&gt;
* Spark 3.3.0  [https://spark.apache.org/docs/latest/overview.html Overview] and [https://spark.apache.org/docs/latest/quick-start.html Quick Start (with Python examples)]&lt;br /&gt;
&lt;br /&gt;
== Session 2 - More about Spark. Data sources. Twitter ==&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 4-9 (chapter 10 on SQL is also very relevant)&lt;br /&gt;
* Kitchin, chapter 3&lt;br /&gt;
* Slides: [[File:S02-OrganisationINFO319-published.pdf]] [[File:S02-DataSources-published.pdf]] [[File:DanielRosnes-Introduction-to-Tweepy-and-Twitter-API-2.0.pdf]] [[File:S02-MoreSpark-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Daniel Rosnes on using Twitter data for the news: Introduction to Twitter API v2 and Tweepy&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapter 10 &#039;&#039;(perhaps mandatory too)&#039;&#039;&lt;br /&gt;
* [https://developer.twitter.com/en/docs/twitter-api Twitter API v2]&lt;br /&gt;
* [https://github.com/tweepy/tweepy Tweepy: Twitter for Python]&lt;br /&gt;
* [https://docs.tweepy.org/en/latest/ Tweepy Documentation]&lt;br /&gt;
&lt;br /&gt;
== Session 3 - Streaming Spark. Big-data architectures. Kafka ==&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 20-21&lt;br /&gt;
* Gallofré, M., Opdahl, A. L., Stoppel, S., Tessem, B., &amp;amp; Veres, C. (2021). The News Angler Project: Exploring the Next Generation of Journalistic Knowledge Platforms. In Proceedings of Norsk IKT-konferanse for forskning og utdanning. [https://ojs.bibsys.no/index.php/NIK/article/view/939/792 Short Paper] and poster: [[file:A1-Poster-NIKT2021.pdf]]&lt;br /&gt;
* [https://kafka.apache.org/intro Kafka Introduction]&lt;br /&gt;
* Slides: [[file:S03-StreamingSpark-published.pdf]] [[file:S03-MoreSpark-published.pdf]] [[file:S03-Kafka-published.pdf]] [[file:S03-ResearchMethod-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;The Guest Talk on architectures and the News Hunter platform is postponed to a later session.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* [https://kafka-python.readthedocs.io/en/master/ kafka-python API]&lt;br /&gt;
* [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Structured Streaming Spark Programming Guide]&lt;br /&gt;
* News Hunter:&lt;br /&gt;
** Berven, A., Christensen, O. A., Moldeklev, S., Opdahl, A. L., &amp;amp; Villanger, K. J. (2020). A knowledge-graph platform for newsrooms. Computers in Industry, 123, 103321. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:0K5dB1_9nusJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;as_ylo=2018&amp;amp;scillfp=11776208952974186557&amp;amp;oi=lle Paper]&lt;br /&gt;
** Opdahl, A. L., &amp;amp; Tessem, B. (2021). Ontologies for finding journalistic angles. Software and Systems Modeling, 20(1), 71-87. [https://link.springer.com/article/10.1007/s10270-020-00801-w Paper]&lt;br /&gt;
* Design science research method:&lt;br /&gt;
** [https://www.jstor.org/stable/25148625#metadata_info_tab_contents Design Science in Information Systems Research] by Alan R. Hevner, Salvatore T. March, Jinsoo Park and Sudha Ram. MIS Quarterly 28(1):75-105, March 2004. &#039;&#039;(You need to be on UiB&#039;s network to access the link - I have uploaded it under Files in mitt.uib.no, but it may soon be deleted from there...)&#039;&#039;&lt;br /&gt;
** Hevner, A. R. (2007). A three cycle view of design science research. Scandinavian journal of information systems, 19(2), 4. [[File:Hevner2007-ThreeCycleView-SJIS.pdf]]&lt;br /&gt;
&lt;br /&gt;
== Session 4 - Cloud computing. NREC and Openstack ==&lt;br /&gt;
* [https://docs.nrec.no/index.html NREC and OpenStack], the following sections/pages: Introduction, Project application, Logging in, The dashboard, Create a Linux virtual machine (skip: Windows), Using SSH, Working with Security Groups, Create and manage volumes, Create and manage snapshots (skip: images), Instance console&lt;br /&gt;
* Slides: [[file:S04-OpenStack-published.pdf]] [[file:S04-UbuntuLinux-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Sohail Khan on computer vision and deep networks for image analysis. His slides and demo code are uploaded to mitt.uib.no under Files (due to size and file-type limitations).&lt;br /&gt;
&lt;br /&gt;
There are fewer readings for this session, because this is where we start running Spark in a cluster, so the practical work will take some time. Computer vision and image analysis are not a mandatory part of the course, but something you may want to use in your projects. Sohail&#039;s presentation will include suggestions for further reading.&lt;br /&gt;
&lt;br /&gt;
== Session 5 - Cloud management. Terraform and Ansible. &amp;lt;!-- Docker and Kubernetes --&amp;gt; ==&lt;br /&gt;
* [https://docs.nrec.no/terraform-part1.html Terraform and NREC part I], [https://docs.nrec.no/terraform-part2.html part II], and [https://docs.nrec.no/terraform-part3.html part III]&lt;br /&gt;
* [https://www.ansible.com/overview/how-ansible-works How Ansible Works] and [https://docs.ansible.com/ansible_community.html the Ansible Community portal]&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Docker Docs: [https://docs.docker.com/get-started/overview/ Docker overview] and [https://docs.docker.com/get-started/overview/ Get started]&lt;br /&gt;
* [https://kubernetes.io/docs/tutorials/kubernetes-basics/ Learn Kubernetes basics], modules 1-6&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
* Slides: [[File:S05-Terraform-Ansible-published.pdf]] [[File:S05-NewsAngler-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Marc Gallofré Ocaña on the News Hunter platform and its big-data ready architecture. Slides: [[File:MarcGallofre-BigDataArchitecture.pdf]] &lt;br /&gt;
&lt;br /&gt;
Comment: Hopefully, we can introduce Docker and Kubernetes in later sessions.&lt;br /&gt;
&lt;br /&gt;
== Session 6 - Societal issues. Privacy. GDPR ==&lt;br /&gt;
* Kitchin, chapters 13-14 and 17-19&lt;br /&gt;
* [https://gdpr.eu/what-is-gdpr/ What is GDPR, the EU’s new data protection law?]&lt;br /&gt;
* Slides: [[File:S06-Privacy.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Ghazaal Sheiki on fact checking. Slides:  [[File:GhazaalSheiki-AutomatedFactChecking.pdf]]&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Kitchin, chapters 12 and 15-16 are also recommended reading&lt;br /&gt;
* EU&#039;s [https://gdpr-info.eu/ General Data Protection Regulation (GDPR)] - the official legal text&lt;br /&gt;
&lt;br /&gt;
== Session 7 - Essay presentations ==&lt;br /&gt;
&lt;br /&gt;
== Session 8 - Project demonstrations ==&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Sessions&amp;diff=1229</id>
		<title>Sessions</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Sessions&amp;diff=1229"/>
		<updated>2022-11-10T21:13:31Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Session 6 - Societal issues. Privacy. GDPR */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Tentative themes for each session ==&lt;br /&gt;
* Thursday August 18th: Introduction meeting [[File:IntroductionMeeting.pdf]]&lt;br /&gt;
* Thursday September 1st: Session 1 - Introduction to big data. Big-data processing. Spark&lt;br /&gt;
* Thursday September 15th: Session 2 - More about Spark. Data sources. Twitter&lt;br /&gt;
* Thursday September 29th: Session 3 - Streaming Spark. Big-data architectures. Kafka&lt;br /&gt;
* Thursday October 13th: Session 4 - Cloud computing. NREC and Openstack&lt;br /&gt;
* Thursday October 27th: Session 5 - Cloud management. Terraform and Ansible. Docker and Kubernetes&lt;br /&gt;
* Thursday November 10th: Session 6 - Societal issues. Privacy. GDPR&lt;br /&gt;
* Thursday November 24th: Session 7 - Essay presentations&lt;br /&gt;
* Thursday December 8th: Session 8 - Project demonstrations&lt;br /&gt;
&lt;br /&gt;
== Session 1 - Introduction to big data. Big-data processing. Spark ==&lt;br /&gt;
* Kitchin, chapters 1, 4-5&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 1-3, 12, 15&lt;br /&gt;
* Slides: [[File:S01-BigData-published.pdf]] [[File:S01-Spark-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Section 1 in Opdahl, A. L., &amp;amp; Nunavath, V. (2020). Big Data. Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data, 15-29. [https://link.springer.com/chapter/10.1007/978-3-030-48099-8_2 Paper]&lt;br /&gt;
* Spark 3.3.0  [https://spark.apache.org/docs/latest/overview.html Overview] and [https://spark.apache.org/docs/latest/quick-start.html Quick Start (with Python examples)]&lt;br /&gt;
&lt;br /&gt;
== Session 2 - More about Spark. Data sources. Twitter ==&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 4-9 (chapter 10 on SQL is also very relevant)&lt;br /&gt;
* Kitchin, chapter 3&lt;br /&gt;
* Slides: [[File:S02-OrganisationINFO319-published.pdf]] [[File:S02-DataSources-published.pdf]] [[File:DanielRosnes-Introduction-to-Tweepy-and-Twitter-API-2.0.pdf]] [[File:S02-MoreSpark-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Daniel Rosnes on using Twitter data for the news: Introduction to Twitter API v2 and Tweepy&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapter 10 &#039;&#039;(perhaps mandatory too)&#039;&#039;&lt;br /&gt;
* [https://developer.twitter.com/en/docs/twitter-api Twitter API v2]&lt;br /&gt;
* [https://github.com/tweepy/tweepy Tweepy: Twitter for Python]&lt;br /&gt;
* [https://docs.tweepy.org/en/latest/ Tweepy Documentation]&lt;br /&gt;
&lt;br /&gt;
== Session 3 - Streaming Spark. Big-data architectures. Kafka ==&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 20-21&lt;br /&gt;
* Gallofré, M., Opdahl, A. L., Stoppel, S., Tessem, B., &amp;amp; Veres, C. (2021). The News Angler Project: Exploring the Next Generation of Journalistic Knowledge Platforms. In Proceedings of Norsk IKT-konferanse for forskning og utdanning. [https://ojs.bibsys.no/index.php/NIK/article/view/939/792 Short Paper] and poster: [[file:A1-Poster-NIKT2021.pdf]]&lt;br /&gt;
* [https://kafka.apache.org/intro Kafka Introduction]&lt;br /&gt;
* Slides: [[file:S03-StreamingSpark-published.pdf]] [[file:S03-MoreSpark-published.pdf]] [[file:S03-Kafka-published.pdf]] [[file:S03-ResearchMethod-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;The Guest Talk on architectures and the News Hunter platform is postponed to a later session.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* [https://kafka-python.readthedocs.io/en/master/ kafka-python API]&lt;br /&gt;
* [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Structured Streaming Spark Programming Guide]&lt;br /&gt;
* News Hunter:&lt;br /&gt;
** Berven, A., Christensen, O. A., Moldeklev, S., Opdahl, A. L., &amp;amp; Villanger, K. J. (2020). A knowledge-graph platform for newsrooms. Computers in Industry, 123, 103321. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:0K5dB1_9nusJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;as_ylo=2018&amp;amp;scillfp=11776208952974186557&amp;amp;oi=lle Paper]&lt;br /&gt;
** Opdahl, A. L., &amp;amp; Tessem, B. (2021). Ontologies for finding journalistic angles. Software and Systems Modeling, 20(1), 71-87. [https://link.springer.com/article/10.1007/s10270-020-00801-w Paper]&lt;br /&gt;
* Design science research method:&lt;br /&gt;
** [https://www.jstor.org/stable/25148625#metadata_info_tab_contents Design Science in Information Systems Research] by Alan R. Hevner, Salvatore T. March, Jinsoo Park and Sudha Ram. MIS Quarterly 28(1):75-105, March 2004. &#039;&#039;(You need to be on UiB&#039;s network to access the link - I have uploaded it under Files in mitt.uib.no, but it may soon be deleted from there...)&#039;&#039;&lt;br /&gt;
** Hevner, A. R. (2007). A three cycle view of design science research. Scandinavian journal of information systems, 19(2), 4. [[File:Hevner2007-ThreeCycleView-SJIS.pdf]]&lt;br /&gt;
&lt;br /&gt;
== Session 4 - Cloud computing. NREC and Openstack ==&lt;br /&gt;
* [https://docs.nrec.no/index.html NREC and OpenStack], the following sections/pages: Introduction, Project application, Logging in, The dashboard, Create a Linux virtual machine (skip: Windows), Using SSH, Working with Security Groups, Create and manage volumes, Create and manage snapshots (skip: images), Instance console&lt;br /&gt;
* Slides: [[file:S04-OpenStack-published.pdf]] [[file:S04-UbuntuLinux-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Sohail Khan on computer vision and deep networks for image analysis. His slides and demo code are uploaded to mitt.uib.no under Files (due to size and file-type limitations).&lt;br /&gt;
&lt;br /&gt;
There are fewer readings for this session, because this is where we start running Spark in a cluster, so the practical work will take some time. Computer vision and image analysis are not a mandatory part of the course, but something you may want to use in your projects. Sohail&#039;s presentation will include suggestions for further reading.&lt;br /&gt;
&lt;br /&gt;
== Session 5 - Cloud management. Terraform and Ansible. &amp;lt;!-- Docker and Kubernetes --&amp;gt; ==&lt;br /&gt;
* [https://docs.nrec.no/terraform-part1.html Terraform and NREC part I], [https://docs.nrec.no/terraform-part2.html part II], and [https://docs.nrec.no/terraform-part3.html part III]&lt;br /&gt;
* [https://www.ansible.com/overview/how-ansible-works How Ansible Works] and [https://docs.ansible.com/ansible_community.html the Ansible Community portal]&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Docker Docs: [https://docs.docker.com/get-started/overview/ Docker overview] and [https://docs.docker.com/get-started/overview/ Get started]&lt;br /&gt;
* [https://kubernetes.io/docs/tutorials/kubernetes-basics/ Learn Kubernetes basics], modules 1-6&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
* Slides: [[File:S05-Terraform-Ansible-published.pdf]] [[File:S05-NewsAngler-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Marc Gallofré Ocaña on the News Hunter platform and its big-data ready architecture. Slides: [[File:MarcGallofre-BigDataArchitecture.pdf]] &lt;br /&gt;
&lt;br /&gt;
Comment: Hopefully, we can introduce Docker and Kubernetes in later sessions.&lt;br /&gt;
&lt;br /&gt;
== Session 6 - Societal issues. Privacy. GDPR ==&lt;br /&gt;
* Kitchin, chapters 13-14 and 17-19&lt;br /&gt;
* [https://gdpr.eu/what-is-gdpr/ What is GDPR, the EU’s new data protection law?]&lt;br /&gt;
* Slides: [[File:S06-Privacy.pdf]] [[File:GhazaalSheiki-AutomatedFactChecking.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Ghazaal Sheiki on fact checking&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Kitchin, chapters 12 and 15-16 are also recommended reading&lt;br /&gt;
* EU&#039;s [https://gdpr-info.eu/ General Data Protection Regulation (GDPR)] - the official legal text&lt;br /&gt;
&lt;br /&gt;
== Session 7 - Essay presentations ==&lt;br /&gt;
&lt;br /&gt;
== Session 8 - Project demonstrations ==&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=File:GhazaalSheiki-AutomatedFactChecking.pdf&amp;diff=1228</id>
		<title>File:GhazaalSheiki-AutomatedFactChecking.pdf</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=File:GhazaalSheiki-AutomatedFactChecking.pdf&amp;diff=1228"/>
		<updated>2022-11-10T21:12:11Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=File:S06-Privacy.pdf&amp;diff=1227</id>
		<title>File:S06-Privacy.pdf</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=File:S06-Privacy.pdf&amp;diff=1227"/>
		<updated>2022-11-10T21:11:22Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Course_wiki_for_INFO319&amp;diff=1226</id>
		<title>Course wiki for INFO319</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Course_wiki_for_INFO319&amp;diff=1226"/>
		<updated>2022-11-10T10:35:14Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This wiki (under development) contains practical information about INFO319 - Big Data in the autumn of 2022, including readings and exercises.&lt;br /&gt;
&lt;br /&gt;
https://miro.com/welcomeonboard/UlBicXNHQXFtZDVTV2lDZzdrbzdoYVBFdXdoUzdkTXBNR0x0MEZQVDljYk1qSGxYd1FWMVRXUEM5OHFYUlh6RnwzMDc0NDU3MzY4MjYxOTAxNTIzfDI=?share_link_id=781099561364&lt;br /&gt;
&lt;br /&gt;
* [[Readings]]: Mandatory and supplementary readings for the course. In addition, the sessions page proposes specific readings for each session.&lt;br /&gt;
* [[Sessions]]: The course comprises 8 full-day seminars. The sessions involve lectures, demos, student presentations, and practical work.  &lt;br /&gt;
* [[Exercises]]: Exercises related to the sessions. Solutions to the exercises can form the backbone of the mandatory group assignment.&lt;br /&gt;
* [https://www.uib.no/emne/INFO319?sem=2022h Assessment]: The course assessment has three parts:&lt;br /&gt;
** Portfolio evaluation of the essay and group assignment (55%)&lt;br /&gt;
** Oral presentations of essay and group assignment (15%)&lt;br /&gt;
** Written exam (3 hours) (30%)&lt;br /&gt;
* [[Essay]]: One part of the portfolio evaluation is &amp;quot;an individual, theoretical essay with thoughtful research and discussion of an assigned topic&amp;quot;.&lt;br /&gt;
* [[Programming project|Project]]: Another part of the portfolio evaluation is a &amp;quot;practical assignment in groups&amp;quot; that has the form of a group programming project. Solutions to the suggested exercises can form the backbone of your project.&lt;br /&gt;
* Participation: Participation in 80% of the course seminars is &#039;&#039;mandatory&#039;&#039;. Participation in &#039;&#039;the last two sessions is also mandatory&#039;&#039; (because your presentations there are part of the course assessment).&lt;br /&gt;
&amp;lt;!-- * [[Datasets]] : Available datasets that can be used for data analysis. --&amp;gt; &lt;br /&gt;
* [https://mitt.uib.no/courses/37204/pages/info319-big-data-h22 Administration]: For formal and administrative information, see [https://mitt.uib.no/courses/37204/pages/info319-big-data-h22 UiB&#039;s Study Portal].&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Contact:&#039;&#039;&#039; [mailto:Andreas.Opdahl@uib.no Andreas.Opdahl@uib.no]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;For questions that are not strictly personal, always use the [https://mitt.uib.no/courses/37204/discussion_topics/321322 discussion forum (requires login)] at [https://mitt.uib.no/courses/37204/pages/info319-big-data-h22 mitt.uib.no]. If I receive general questions about INFO319 by email, I will answer them in the forum anyway, so it is fastest to post them directly there.&#039;&#039;&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Sessions&amp;diff=1225</id>
		<title>Sessions</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Sessions&amp;diff=1225"/>
		<updated>2022-11-09T12:53:17Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Session 6 - Societal issues. Privacy. GDPR */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Tentative themes for each session ==&lt;br /&gt;
* Thursday August 18th: Introduction meeting [[File:IntroductionMeeting.pdf]]&lt;br /&gt;
* Thursday September 1st: Session 1 - Introduction to big data. Big-data processing. Spark&lt;br /&gt;
* Thursday September 15th: Session 2 - More about Spark. Data sources. Twitter&lt;br /&gt;
* Thursday September 29th: Session 3 - Streaming Spark. Big-data architectures. Kafka&lt;br /&gt;
* Thursday October 13th: Session 4 - Cloud computing. NREC and Openstack&lt;br /&gt;
* Thursday October 27th: Session 5 - Cloud management. Terraform and Ansible. Docker and Kubernetes&lt;br /&gt;
* Thursday November 10th: Session 6 - Societal issues. Privacy. GDPR&lt;br /&gt;
* Thursday November 24th: Session 7 - Essay presentations&lt;br /&gt;
* Thursday December 8th: Session 8 - Project demonstrations&lt;br /&gt;
&lt;br /&gt;
== Session 1 - Introduction to big data. Big-data processing. Spark ==&lt;br /&gt;
* Kitchin, chapters 1, 4-5&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 1-3, 12, 15&lt;br /&gt;
* Slides: [[File:S01-BigData-published.pdf]] [[File:S01-Spark-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Section 1 in Opdahl, A. L., &amp;amp; Nunavath, V. (2020). Big Data. Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data, 15-29. [https://link.springer.com/chapter/10.1007/978-3-030-48099-8_2 Paper]&lt;br /&gt;
* Spark 3.3.0  [https://spark.apache.org/docs/latest/overview.html Overview] and [https://spark.apache.org/docs/latest/quick-start.html Quick Start (with Python examples)]&lt;br /&gt;
&lt;br /&gt;
== Session 2 - More about Spark. Data sources. Twitter ==&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 4-9 (chapter 10 on SQL is also very relevant)&lt;br /&gt;
* Kitchin, chapter 3&lt;br /&gt;
* Slides: [[File:S02-OrganisationINFO319-published.pdf]] [[File:S02-DataSources-published.pdf]] [[File:DanielRosnes-Introduction-to-Tweepy-and-Twitter-API-2.0.pdf]] [[File:S02-MoreSpark-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Daniel Rosnes on using Twitter data for the news: Introduction to Twitter API v2 and Tweepy&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapter 10 &#039;&#039;(perhaps mandatory too)&#039;&#039;&lt;br /&gt;
* [https://developer.twitter.com/en/docs/twitter-api Twitter API v2]&lt;br /&gt;
* [https://github.com/tweepy/tweepy Tweepy: Twitter for Python]&lt;br /&gt;
* [https://docs.tweepy.org/en/latest/ Tweepy Documentation]&lt;br /&gt;
&lt;br /&gt;
== Session 3 - Streaming Spark. Big-data architectures. Kafka ==&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 20-21&lt;br /&gt;
* Gallofré, M., Opdahl, A. L., Stoppel, S., Tessem, B., &amp;amp; Veres, C. (2021). The News Angler Project: Exploring the Next Generation of Journalistic Knowledge Platforms. In Proceedings of Norsk IKT-konferanse for forskning og utdanning. [https://ojs.bibsys.no/index.php/NIK/article/view/939/792 Short Paper] and poster: [[file:A1-Poster-NIKT2021.pdf]]&lt;br /&gt;
* [https://kafka.apache.org/intro Kafka Introduction]&lt;br /&gt;
* Slides: [[file:S03-StreamingSpark-published.pdf]] [[file:S03-MoreSpark-published.pdf]] [[file:S03-Kafka-published.pdf]] [[file:S03-ResearchMethod-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;The Guest Talk on architectures and the News Hunter platform is postponed to a later session.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* [https://kafka-python.readthedocs.io/en/master/ kafka-python API]&lt;br /&gt;
* [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Structured Streaming Spark Programming Guide]&lt;br /&gt;
* News Hunter:&lt;br /&gt;
** Berven, A., Christensen, O. A., Moldeklev, S., Opdahl, A. L., &amp;amp; Villanger, K. J. (2020). A knowledge-graph platform for newsrooms. Computers in Industry, 123, 103321. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:0K5dB1_9nusJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;as_ylo=2018&amp;amp;scillfp=11776208952974186557&amp;amp;oi=lle Paper]&lt;br /&gt;
** Opdahl, A. L., &amp;amp; Tessem, B. (2021). Ontologies for finding journalistic angles. Software and Systems Modeling, 20(1), 71-87. [https://link.springer.com/article/10.1007/s10270-020-00801-w Paper]&lt;br /&gt;
* Design science research method:&lt;br /&gt;
** [https://www.jstor.org/stable/25148625#metadata_info_tab_contents Design Science in Information Systems Research] by Alan R. Hevner, Salvatore T. March, Jinsoo Park and Sudha Ram. MIS Quarterly 28(1):75-105, March 2004. &#039;&#039;(You need to be on UiB&#039;s network to access the link - I have uploaded it under Files in mitt.uib.no, but it may soon be deleted from there...)&#039;&#039;&lt;br /&gt;
** Hevner, A. R. (2007). A three cycle view of design science research. Scandinavian journal of information systems, 19(2), 4. [[File:Hevner2007-ThreeCycleView-SJIS.pdf]]&lt;br /&gt;
&lt;br /&gt;
== Session 4 - Cloud computing. NREC and Openstack ==&lt;br /&gt;
* [https://docs.nrec.no/index.html NREC and OpenStack], the following sections/pages: Introduction, Project application, Logging in, The dashboard, Create a Linux virtual machine (skip: Windows), Using SSH, Working with Security Groups, Create and manage volumes, Create and manage snapshots (skip: images), Instance console&lt;br /&gt;
* Slides: [[file:S04-OpenStack-published.pdf]] [[file:S04-UbuntuLinux-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Sohail Khan on computer vision and deep networks for image analysis. His slides and demo code are uploaded to mitt.uib.no under Files (size and file-type limitations).&lt;br /&gt;
&lt;br /&gt;
There are fewer readings for this session, because this is where we start running Spark in a cluster, so there will be practical work that takes some time. Computer vision and image analysis are not a mandatory part of the course, but something you may want to use in your projects. Sohail&#039;s presentation will include suggestions for further reading.&lt;br /&gt;
&lt;br /&gt;
== Session 5 - Cloud management. Terraform and Ansible. &amp;lt;!-- Docker and Kubernetes --&amp;gt; ==&lt;br /&gt;
* [https://docs.nrec.no/terraform-part1.html Terraform and NREC part I], [https://docs.nrec.no/terraform-part2.html part II], and [https://docs.nrec.no/terraform-part3.html part III]&lt;br /&gt;
* [https://www.ansible.com/overview/how-ansible-works How Ansible Works] and [https://docs.ansible.com/ansible_community.html the Ansible Community portal]&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Docker Docs: [https://docs.docker.com/get-started/overview/ Docker overview] and [https://docs.docker.com/get-started/ Get started]&lt;br /&gt;
* [https://kubernetes.io/docs/tutorials/kubernetes-basics/ Learn Kubernetes basics], modules 1-6&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
* Slides: [[File:S05-Terraform-Ansible-published.pdf]] [[File:S05-NewsAngler-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Marc Gallofré Ocaña on the News Hunter platform and its big-data ready architecture. Slides: [[File:MarcGallofre-BigDataArchitecture.pdf]] &lt;br /&gt;
&lt;br /&gt;
Comment: Hopefully, we can introduce Docker and Kubernetes in later sessions.&lt;br /&gt;
&lt;br /&gt;
== Session 6 - Societal issues. Privacy. GDPR ==&lt;br /&gt;
* Kitchin, chapters 13-14 and 17-19&lt;br /&gt;
* [https://gdpr.eu/what-is-gdpr/ What is GDPR, the EU’s new data protection law?]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Ghazaal Sheiki on fact checking&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Kitchin, chapters 12 and 15-16 are also recommended reading&lt;br /&gt;
* EU&#039;s [https://gdpr-info.eu/ General Data Protection Regulation (GDPR)] - the official legal text&lt;br /&gt;
&lt;br /&gt;
== Session 7 - Essay presentations ==&lt;br /&gt;
&lt;br /&gt;
== Session 8 - Project demonstrations ==&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Sessions&amp;diff=1224</id>
		<title>Sessions</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Sessions&amp;diff=1224"/>
		<updated>2022-11-01T11:21:33Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Session 6 - Societal issues. Privacy. GDPR */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Tentative themes for each session ==&lt;br /&gt;
* Thursday August 18th: Introduction meeting [[File:IntroductionMeeting.pdf]]&lt;br /&gt;
* Thursday September 1st: Session 1 - Introduction to big data. Big-data processing. Spark&lt;br /&gt;
* Thursday September 15th: Session 2 - More about Spark. Data sources. Twitter&lt;br /&gt;
* Thursday September 29th: Session 3 - Streaming Spark. Big-data architectures. Kafka&lt;br /&gt;
* Thursday October 13th: Session 4 - Cloud computing. NREC and Openstack&lt;br /&gt;
* Thursday October 27th: Session 5 - Cloud management. Terraform and Ansible. Docker and Kubernetes&lt;br /&gt;
* Thursday November 10th: Session 6 - Societal issues. Privacy. GDPR&lt;br /&gt;
* Thursday November 24th: Session 7 - Essay presentations&lt;br /&gt;
* Thursday December 8th: Session 8 - Project demonstrations&lt;br /&gt;
&lt;br /&gt;
== Session 1 - Introduction to big data. Big-data processing. Spark ==&lt;br /&gt;
* Kitchin, chapters 1, 4-5&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 1-3, 12, 15&lt;br /&gt;
* Slides: [[File:S01-BigData-published.pdf]] [[File:S01-Spark-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Section 1 in Opdahl, A. L., &amp;amp; Nunavath, V. (2020). Big Data. Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data, 15-29. [https://link.springer.com/chapter/10.1007/978-3-030-48099-8_2 Paper]&lt;br /&gt;
* Spark 3.3.0  [https://spark.apache.org/docs/latest/overview.html Overview] and [https://spark.apache.org/docs/latest/quick-start.html Quick Start (with Python examples)]&lt;br /&gt;
&lt;br /&gt;
== Session 2 - More about Spark. Data sources. Twitter ==&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 4-9 (chapter 10 on SQL is also very relevant)&lt;br /&gt;
* Kitchin, chapter 3&lt;br /&gt;
* Slides: [[File:S02-OrganisationINFO319-published.pdf]] [[File:S02-DataSources-published.pdf]] [[File:DanielRosnes-Introduction-to-Tweepy-and-Twitter-API-2.0.pdf]] [[File:S02-MoreSpark-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Daniel Rosnes on using Twitter data for the news: Introduction to Twitter API v2 and Tweepy&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapter 10 &#039;&#039;(perhaps mandatory too)&#039;&#039;&lt;br /&gt;
* [https://developer.twitter.com/en/docs/twitter-api Twitter API v2]&lt;br /&gt;
* [https://github.com/tweepy/tweepy Tweepy: Twitter for Python]&lt;br /&gt;
* [https://docs.tweepy.org/en/latest/ Tweepy Documentation]&lt;br /&gt;
&lt;br /&gt;
== Session 3 - Streaming Spark. Big-data architectures. Kafka ==&lt;br /&gt;
* Chambers &amp;amp; Zaharia, chapters 20-21&lt;br /&gt;
* Gallofré, M., Opdahl, A. L., Stoppel, S., Tessem, B., &amp;amp; Veres, C. (2021). The News Angler Project: Exploring the Next Generation of Journalistic Knowledge Platforms. In Proceedings of Norsk IKT-konferanse for forskning og utdanning. [https://ojs.bibsys.no/index.php/NIK/article/view/939/792 Short Paper] and poster: [[file:A1-Poster-NIKT2021.pdf]]&lt;br /&gt;
* [https://kafka.apache.org/intro Kafka Introduction]&lt;br /&gt;
* Slides: [[file:S03-StreamingSpark-published.pdf]] [[file:S03-MoreSpark-published.pdf]] [[file:S03-Kafka-published.pdf]] [[file:S03-ResearchMethod-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;The Guest Talk on architectures and the News Hunter platform is postponed to a later session.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* [https://kafka-python.readthedocs.io/en/master/ kafka-python API]&lt;br /&gt;
* [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Structured Streaming Spark Programming Guide]&lt;br /&gt;
* News Hunter:&lt;br /&gt;
** Berven, A., Christensen, O. A., Moldeklev, S., Opdahl, A. L., &amp;amp; Villanger, K. J. (2020). A knowledge-graph platform for newsrooms. Computers in Industry, 123, 103321. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:0K5dB1_9nusJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;as_ylo=2018&amp;amp;scillfp=11776208952974186557&amp;amp;oi=lle Paper]&lt;br /&gt;
** Opdahl, A. L., &amp;amp; Tessem, B. (2021). Ontologies for finding journalistic angles. Software and Systems Modeling, 20(1), 71-87. [https://link.springer.com/article/10.1007/s10270-020-00801-w Paper]&lt;br /&gt;
* Design science research method:&lt;br /&gt;
** [https://www.jstor.org/stable/25148625#metadata_info_tab_contents Design Science in Information Systems Research] by Alan R. Hevner, Salvatore T. March, Jinsoo Park and Sudha Ram. MIS Quarterly 28(1):75-105, March 2004. &#039;&#039;(You need to be on UiB&#039;s network to access the link - I have uploaded it under Files in mitt.uib.no, but it may soon be deleted from there...)&#039;&#039;&lt;br /&gt;
** Hevner, A. R. (2007). A three cycle view of design science research. Scandinavian journal of information systems, 19(2), 4. [[File:Hevner2007-ThreeCycleView-SJIS.pdf]]&lt;br /&gt;
&lt;br /&gt;
== Session 4 - Cloud computing. NREC and Openstack ==&lt;br /&gt;
* [https://docs.nrec.no/index.html NREC and OpenStack], the following sections/pages: Introduction, Project application, Logging in, The dashboard, Create a Linux virtual machine (skip: Windows), Using SSH, Working with Security Groups, Create and manage volumes, Create and manage snapshots (skip: images), Instance console&lt;br /&gt;
* Slides: [[file:S04-OpenStack-published.pdf]] [[file:S04-UbuntuLinux-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Sohail Khan on computer vision and deep networks for image analysis. His slides and demo code are uploaded to mitt.uib.no under Files (size and file-type limitations).&lt;br /&gt;
&lt;br /&gt;
There are fewer readings for this session, because this is where we start running Spark in a cluster, so there will be practical work that takes some time. Computer vision and image analysis are not a mandatory part of the course, but something you may want to use in your projects. Sohail&#039;s presentation will include suggestions for further reading.&lt;br /&gt;
&lt;br /&gt;
== Session 5 - Cloud management. Terraform and Ansible. &amp;lt;!-- Docker and Kubernetes --&amp;gt; ==&lt;br /&gt;
* [https://docs.nrec.no/terraform-part1.html Terraform and NREC part I], [https://docs.nrec.no/terraform-part2.html part II], and [https://docs.nrec.no/terraform-part3.html part III]&lt;br /&gt;
* [https://www.ansible.com/overview/how-ansible-works How Ansible Works] and [https://docs.ansible.com/ansible_community.html the Ansible Community portal]&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Docker Docs: [https://docs.docker.com/get-started/overview/ Docker overview] and [https://docs.docker.com/get-started/ Get started]&lt;br /&gt;
* [https://kubernetes.io/docs/tutorials/kubernetes-basics/ Learn Kubernetes basics], modules 1-6&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
* Slides: [[File:S05-Terraform-Ansible-published.pdf]] [[File:S05-NewsAngler-published.pdf]]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Marc Gallofré Ocaña on the News Hunter platform and its big-data ready architecture. Slides: [[File:MarcGallofre-BigDataArchitecture.pdf]] &lt;br /&gt;
&lt;br /&gt;
Comment: Hopefully, we can introduce Docker and Kubernetes in later sessions.&lt;br /&gt;
&lt;br /&gt;
== Session 6 - Societal issues. Privacy. GDPR ==&lt;br /&gt;
* Kitchin, chapters 13-14 and 17-19&lt;br /&gt;
* [https://gdpr.eu/what-is-gdpr/ What is GDPR, the EU’s new data protection law?]&lt;br /&gt;
&lt;br /&gt;
Guest presentation: Laurence Dierickx on aspects of big-data quality&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Kitchin, chapters 12 and 15-16 are also recommended reading&lt;br /&gt;
* EU&#039;s [https://gdpr-info.eu/ General Data Protection Regulation (GDPR)] - the official legal text&lt;br /&gt;
&lt;br /&gt;
== Session 7 - Essay presentations ==&lt;br /&gt;
&lt;br /&gt;
== Session 8 - Project demonstrations ==&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Readings&amp;diff=1223</id>
		<title>Readings</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Readings&amp;diff=1223"/>
		<updated>2022-11-01T11:19:59Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Books */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Books ==&lt;br /&gt;
Text books:&lt;br /&gt;
* Rob Kitchin. &#039;&#039;The Data Revolution - A Critical Analysis of Big Data, Open Data and Data Infrastructures&#039;&#039;, 2nd Edition. Sage, 2021.&lt;br /&gt;
** chapters 1, 3-5, 13-14, 17-19 are mandatory (12 and 15-16 are supplementary)&lt;br /&gt;
&lt;br /&gt;
* Bill Chambers and Matei Zaharia: &#039;&#039;Spark: The Definitive Guide - Big Data Processing Made Simple&#039;&#039;. O&#039;Reilly, 2018. [[File:Spark-TheDefinitiveGuide.pdf]]&lt;br /&gt;
** chapters 1-9, 12, 15, 20-21 are mandatory (chapter 10 on SQL is also highly relevant)&lt;br /&gt;
** [https://github.com/databricks/Spark-The-Definitive-Guide GitHub repository with code and data examples]&lt;br /&gt;
&lt;br /&gt;
== Papers ==&lt;br /&gt;
Selected papers will become available here, including:&lt;br /&gt;
* [https://arxiv.org/pdf/2012.09109 Section 1] in Opdahl, A. L., &amp;amp; Nunavath, V. (2020). Big Data. Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data, 15-29. Book chapter&lt;br /&gt;
* Gallofré, M., Opdahl, A. L., Stoppel, S., Tessem, B., &amp;amp; Veres, C. (2021). The News Angler Project: Exploring the Next Generation of Journalistic Knowledge Platforms. In Proceedings of Norsk IKT-konferanse for forskning og utdanning. [https://ojs.bibsys.no/index.php/NIK/article/view/939/792 Short Paper] and poster: [[File:A1-Poster-NIKT2021.pdf]]&lt;br /&gt;
&amp;lt;!-- Architecture stuff:&lt;br /&gt;
* Lambda: Introduced in Nathan Marz and James Warren (2013). Big Data Principles and Best Practices of Scalable Real-Time Data Systems. Slides 14-27 in [http://2014.berlinbuzzwords.de/sites/2014.berlinbuzzwords.de/files/media/documents/michael_hausenblas_-_lambda_architecture.pdf this presentation] give an overview of the idea! &lt;br /&gt;
* Kappa: Kreps, J.: Questioning the lambda architecture (2014). [https://www.oreilly.com/radar/questioning-the-lambda-architecture/ White paper]&lt;br /&gt;
* Liquid: Fernandez, Raul Castro, Peter R. Pietzuch, Jay Kreps, Neha Narkhede, Jun Rao, Joel Koshy, Dong Lin, Chris Riccomini, and Guozhang Wang. &amp;quot;Liquid: Unifying nearline and offline big data integration.&amp;quot; In CIDR. 2015. [https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1088.2602&amp;amp;rep=rep1&amp;amp;type=pdf Paper]&lt;br /&gt;
* Sigma: Cassavia, N., &amp;amp; Masciari, E. (2021, March). Sigma: a scalable high performance big data architecture. In 2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP) (pp. 236-239). IEEE. [https://bibsys-almaprimo.hosted.exlibrisgroup.com/primo-explore/openurl?sid=google&amp;amp;auinit=N&amp;amp;aulast=Cassavia&amp;amp;atitle=Sigma:%20a%20scalable%20high%20performance%20big%20data%20architecture&amp;amp;id=doi:10.1109%2FPDP52278.2021.00044&amp;amp;vid=UBB&amp;amp;institution=UBB&amp;amp;url_ctx_val=&amp;amp;url_ctx_fmt=null&amp;amp;isSerivcesPage=true Paper]&lt;br /&gt;
* Maamouri, A., Sfaxi, L., &amp;amp; Robbana, R. (2021, December). Phi: A Generic Microservices-Based Big Data Architecture. In European, Mediterranean, and Middle Eastern Conference on Information Systems (pp. 3-16). Springer, Cham. [https://link.springer.com/chapter/10.1007/978-3-030-95947-0_1 Paper]&lt;br /&gt;
&lt;br /&gt;
Marc:&lt;br /&gt;
You found the other Phi architecture. 😃 The one I meant was: https://ieeexplore.ieee.org/abstract/document/8712381 But both have interesting contributions. The one you found considers the training part, which is not instantiated in the others.&lt;br /&gt;
&lt;br /&gt;
This is the &amp;quot;original publication&amp;quot; of Lambda: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html , it is a blog entry.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Michael Armbrust, Armando Fox, Rean Griffith, Anthony D Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, Matei Zaharia (2010). A view of cloud computing. Communications of the ACM 53 (4), 50-58. [https://dl.acm.org/doi/fullHtml/10.1145/1721654.1721672 Paper]&lt;br /&gt;
* M Zaharia, M Chowdhury, MJ Franklin, S Shenker, I Stoica (2010). Spark: Cluster computing with working sets. 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 10). [https://www.usenix.org/event/hotcloud10/tech/full_papers/Zaharia.pdf Paper]&lt;br /&gt;
* Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, Ion Stoica (2012). Resilient distributed datasets: A Fault-Tolerant abstraction for In-Memory cluster computing. In Proc. 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp. 15-28. [https://scholar.google.com/citations?view_op=view_citation&amp;amp;hl=en&amp;amp;user=I1EvjZsAAAAJ&amp;amp;citation_for_view=I1EvjZsAAAAJ:Tyk-4Ss8FVUC Paper]&lt;br /&gt;
* Karun, A. K., &amp;amp; Chitharanjan, K. (2013, April). A review on hadoop—HDFS infrastructure extensions. In 2013 IEEE conference on information &amp;amp; communication technologies (pp. 132-137). IEEE. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:GIm8aG-ScOsJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;scillfp=6854624816870725192&amp;amp;oi=lle Paper]&lt;br /&gt;
* Kafka?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Opdahl, A. L., &amp;amp; Tessem, B. (2021). Ontologies for finding journalistic angles. Software and Systems Modeling, 20(1), 71-87. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:pKELE6iBzpAJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;as_ylo=2021&amp;amp;scillfp=4299025271368542631&amp;amp;oi=lle Paper]&lt;br /&gt;
* Berven, A., Christensen, O. A., Moldeklev, S., Opdahl, A. L., &amp;amp; Villanger, K. J. (2020). A knowledge-graph platform for newsrooms. Computers in Industry, 123, 103321. [https://scholar.google.com/scholar?output=instlink&amp;amp;q=info:0K5dB1_9nusJ:scholar.google.com/&amp;amp;hl=en&amp;amp;as_sdt=0,5&amp;amp;as_ylo=2018&amp;amp;scillfp=11776208952974186557&amp;amp;oi=lle Paper]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- Architectures: kappa, lambda, phi, Liquid --&amp;gt;&lt;br /&gt;
&amp;lt;!-- Classic papers: HDFS, Spark, RDDs --&amp;gt;&lt;br /&gt;
&amp;lt;!-- Privacy? --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Technical introductions ==&lt;br /&gt;
Selected web pages will become available here, including:&lt;br /&gt;
* [https://kafka.apache.org/intro Kafka Introduction]&lt;br /&gt;
* [https://docs.nrec.no/index.html NREC and OpenStack], the following sections/pages: Introduction, Project application, Logging in, The dashboard, Create a Linux virtual machine (skip: Windows), Using SSH, Working with Security Groups, Create and manage volumes, Create and manage snapshots (skip: images), Instance console&lt;br /&gt;
* [https://docs.nrec.no/terraform-part1.html Terraform and NREC part I], [https://docs.nrec.no/terraform-part2.html part II], and [https://docs.nrec.no/terraform-part3.html part III]&lt;br /&gt;
* [https://www.ansible.com/overview/how-ansible-works How Ansible Works] and [https://docs.ansible.com/ansible_community.html the Ansible Community portal]&lt;br /&gt;
* Docker Docs: [https://docs.docker.com/get-started/overview/ Docker overview] and [https://docs.docker.com/get-started/ Get started]&lt;br /&gt;
* [https://kubernetes.io/docs/tutorials/kubernetes-basics/ Learn Kubernetes basics], modules 1-6&lt;br /&gt;
* [https://gdpr.eu/what-is-gdpr/ What is GDPR, the EU’s new data protection law?]&lt;br /&gt;
&lt;br /&gt;
Supplementary:&lt;br /&gt;
* Spark 3.3.0 [https://spark.apache.org/docs/latest/index.html Overview] and [https://spark.apache.org/docs/latest/quick-start.html Quick Start (with Python examples)]&lt;br /&gt;
* [https://developer.twitter.com/en/docs/twitter-api Twitter API v2]&lt;br /&gt;
* [https://github.com/tweepy/tweepy Tweepy: Twitter for Python]&lt;br /&gt;
* [https://docs.tweepy.org/en/latest/ Tweepy Documentation]&lt;br /&gt;
* [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Structured Streaming Spark Programming Guide]&lt;br /&gt;
* Apache Spark [https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/index.html Structured Streaming API]&lt;br /&gt;
* [https://kafka-python.readthedocs.io/en/master/ kafka-python API]&lt;br /&gt;
* EU&#039;s [https://gdpr-info.eu/ General Data Protection Regulation (GDPR)] - the official legal text&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- GDELT --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Lecture slides==&lt;br /&gt;
See the [[Sessions|Session page]] for lecture slides after each session.&lt;br /&gt;
&lt;br /&gt;
==Readings for each session==&lt;br /&gt;
The [[Sessions|Sessions page]] will suggest specific readings for each session and its associated exercise.&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Exercises&amp;diff=1222</id>
		<title>Exercises</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Exercises&amp;diff=1222"/>
		<updated>2022-10-31T13:56:30Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Outline of the exercises. Because the exercises are new this year, it is hard to plan exactly, so this is likely to change a bit!&lt;br /&gt;
* Exercise 1: [[Getting started with Apache Spark]] and [[Processing tweets with Spark]].&lt;br /&gt;
* Exercise 2: [[Streaming tweets with Twitter API]]&lt;br /&gt;
* Exercise 3: [[Streaming tweets with Kafka and Spark]]&lt;br /&gt;
* Exercise 4:&lt;br /&gt;
** [[Create Spark cluster]]&lt;br /&gt;
** [[Install HDFS and YARN on the cluster]]&lt;br /&gt;
** [[Install Spark on the cluster]]&lt;br /&gt;
** [[Install Kafka on the cluster]]&lt;br /&gt;
* Exercise 5: &lt;br /&gt;
** [[Create Spark cluster using Terraform]]&lt;br /&gt;
** [[Configure Spark cluster using Ansible]]&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1221</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1221"/>
		<updated>2022-10-31T13:55:23Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Install Zookeeper on the cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
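One way is to add a &#039;&#039;local_file&#039;&#039; resource (from Terraform&#039;s &#039;&#039;hashicorp/local&#039;&#039; provider) to &#039;&#039;info319-cluster.tf&#039;&#039;. This is only a sketch: the instance resource names &#039;&#039;driver&#039;&#039; and &#039;&#039;worker&#039;&#039; are assumptions and must match those in your own configuration (the &#039;&#039;chmod&#039;&#039; above makes &#039;&#039;/etc/ansible/hosts&#039;&#039; writable):&lt;br /&gt;
 # Sketch - assumes instances named &amp;quot;driver&amp;quot; and &amp;quot;worker&amp;quot; (with count set)&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ansible_hosts&amp;quot; {&lt;br /&gt;
   filename = &amp;quot;/etc/ansible/hosts&amp;quot;&lt;br /&gt;
   content  = join(&amp;quot;\n&amp;quot;, concat(&lt;br /&gt;
     [openstack_compute_instance_v2.driver.name],&lt;br /&gt;
     openstack_compute_instance_v2.worker[*].name))&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;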
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task that backs up &#039;&#039;~/.bashrc&#039;&#039;:&lt;br /&gt;
 - name: Prepare .bashrc&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Save original .bashrc&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: /home/ubuntu/.bashrc&lt;br /&gt;
         dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to the end of &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
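&lt;br /&gt;
A sketch of the two tasks, using module parameters from the Ansible documentation (the paths are the ones above):&lt;br /&gt;
     - name: Create empty .info319&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         path: /home/ubuntu/.info319&lt;br /&gt;
         state: touch&lt;br /&gt;
 &lt;br /&gt;
     - name: Source .info319 from .bashrc&lt;br /&gt;
       ansible.builtin.lineinfile:&lt;br /&gt;
         path: /home/ubuntu/.bashrc&lt;br /&gt;
         line: source .info319&lt;br /&gt;
(&#039;&#039;lineinfile&#039;&#039; appends at the end of the file by default, and only if the line is not already there.)&lt;br /&gt;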
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes  # because you need root privilege (sudo) to update /etc/hosts&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There should not be a space between the two closing curly braces at the end of the &#039;&#039;block:&#039;&#039; line. But without the space, WikiText misinterprets them as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line so that nodes can &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
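For example, the &#039;&#039;config&#039;&#039; file could be uploaded like this (the other two files follow the same pattern; &#039;&#039;mode: &#039;0600&#039;&#039;&#039; keeps the files private, which &#039;&#039;ssh&#039;&#039; requires for keys):&lt;br /&gt;
     - name: Upload SSH config&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: config&lt;br /&gt;
         dest: /home/ubuntu/.ssh/config&lt;br /&gt;
         mode: &#039;0600&#039;&lt;br /&gt;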
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option you can use to avoid repeating all earlier plays (blocks) and stages. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each step whether to execute or skip it.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
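A minimal sketch of such a task (&#039;&#039;become: yes&#039;&#039; is needed because &#039;&#039;apt&#039;&#039; requires root):&lt;br /&gt;
     - name: Install Java&lt;br /&gt;
       ansible.builtin.apt:&lt;br /&gt;
         name: openjdk-8-jdk-headless&lt;br /&gt;
         update_cache: yes&lt;br /&gt;
       become: yes&lt;br /&gt;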
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Use the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules to do this. They may require installation on your local machine:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
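A sketch of the three tasks, assuming the volume appears as &#039;&#039;/dev/sdb&#039;&#039; on your nodes (check the actual device name first, for example with &#039;&#039;lsblk&#039;&#039;):&lt;br /&gt;
     - name: Partition the volume&lt;br /&gt;
       community.general.parted:&lt;br /&gt;
         device: /dev/sdb&lt;br /&gt;
         number: 1&lt;br /&gt;
         state: present&lt;br /&gt;
       become: yes&lt;br /&gt;
  &lt;br /&gt;
     - name: Create a filesystem&lt;br /&gt;
       community.general.filesystem:&lt;br /&gt;
         fstype: ext4&lt;br /&gt;
         dev: /dev/sdb1&lt;br /&gt;
       become: yes&lt;br /&gt;
  &lt;br /&gt;
     - name: Mount the volume&lt;br /&gt;
       ansible.posix.mount:&lt;br /&gt;
         path: /home/ubuntu/volume&lt;br /&gt;
         src: /dev/sdb1&lt;br /&gt;
         fstype: ext4&lt;br /&gt;
         state: mounted&lt;br /&gt;
       become: yes&lt;br /&gt;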
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN you need &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;ansible.builtin.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep tf-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop (and other) archives directly to each cluster host. If you re-run your playbook many times, however, this takes time and can trigger rate limiting on the download servers. In that case, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.&lt;br /&gt;
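A sketch of a direct download task (the Hadoop version and mirror URL are only examples; use the archive you chose in Exercise 4):&lt;br /&gt;
     - name: Download Hadoop archive&lt;br /&gt;
       ansible.builtin.get_url:&lt;br /&gt;
         url: https://archive.apache.org/dist/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz&lt;br /&gt;
         dest: /home/ubuntu/hadoop-3.3.4.tar.gz&lt;br /&gt;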
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
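For example, assuming the Hadoop archive has been downloaded to &#039;&#039;/home/ubuntu/hadoop-3.3.4.tar.gz&#039;&#039; (the version number is only an example):&lt;br /&gt;
     - name: Unpack Hadoop&lt;br /&gt;
       ansible.builtin.unarchive:&lt;br /&gt;
         src: /home/ubuntu/hadoop-3.3.4.tar.gz&lt;br /&gt;
         dest: /home/ubuntu/volume&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
  &lt;br /&gt;
     - name: Create hadoop symlink&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         src: /home/ubuntu/volume/hadoop-3.3.4&lt;br /&gt;
         dest: /home/ubuntu/volume/hadoop&lt;br /&gt;
         state: link&lt;br /&gt;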
&lt;br /&gt;
=== Configure HDFS and YARN ===&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
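For example, for &#039;&#039;HADOOP_HOME&#039;&#039; (repeat for the other variables from Exercise 4; the path assumes Hadoop is symlinked under &#039;&#039;/home/ubuntu/volume&#039;&#039;):&lt;br /&gt;
     - name: Set HADOOP_HOME&lt;br /&gt;
       ansible.builtin.lineinfile:&lt;br /&gt;
         path: /home/ubuntu/.info319&lt;br /&gt;
         line: export HADOOP_HOME=/home/ubuntu/volume/hadoop&lt;br /&gt;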
&lt;br /&gt;
Change the variable syntax in the &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; files from Exercise 4, replacing shell-style variables with Ansible&#039;s Jinja2 style. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module (instead of the Linux &#039;&#039;envsubst&#039;&#039; command) to configure the files. Note that the files use the {{ hadoop_namenode } } variable. It is not defined yet, but should have the same value as {{ master_node } }.&lt;br /&gt;
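A sketch of the fact and template tasks (the destination path assumes Hadoop is symlinked under &#039;&#039;/home/ubuntu/volume/hadoop&#039;&#039;; remember that your own playbook must not have a space between the closing curly braces):&lt;br /&gt;
     - name: Set hadoop_namenode fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         hadoop_namenode: &amp;quot;{{ master_node } }&amp;quot;&lt;br /&gt;
  &lt;br /&gt;
     - name: Configure core-site.xml&lt;br /&gt;
       ansible.builtin.template:&lt;br /&gt;
         src: core-site.xml.j2&lt;br /&gt;
         dest: /home/ubuntu/volume/hadoop/etc/hadoop/core-site.xml&lt;br /&gt;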
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that refuses to re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible ignore exit code 1 from &#039;&#039;hdfs&#039;&#039; in that case:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it should run on an odd number of machines, so it may not run on all the machines in the cluster&lt;br /&gt;
# each zookeeper needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the number is odd, but only on the workers if the total number is even.&lt;br /&gt;
&lt;br /&gt;
As for the second point, Exercise 4 has already suggested shell commands you can use to manage zookeeper ids. But Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can explore.&lt;br /&gt;
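One loop-free alternative is to derive each host&#039;s id from its position in the inventory (this sketch assumes the &#039;&#039;myid&#039;&#039; file lives in &#039;&#039;/home/ubuntu/volume/zookeeper/data&#039;&#039; and that the data folder already exists):&lt;br /&gt;
     - name: Write Zookeeper myid&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         dest: /home/ubuntu/volume/zookeeper/data/myid&lt;br /&gt;
         content: &amp;quot;{{ groups[&#039;all&#039;].index(inventory_hostname) + 1 } }&amp;quot;&lt;br /&gt;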
&lt;br /&gt;
In the end, this task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; (a better location than the previously suggested &#039;&#039;/tmp/zookeeper/myid&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1220</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1220"/>
		<updated>2022-10-31T13:52:21Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Install HDFS and YARN */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task that backs up &#039;&#039;~/.bashrc&#039;&#039;:&lt;br /&gt;
 - name: Prepare .bashrc&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Save original .bashrc&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: /home/ubuntu/.bashrc&lt;br /&gt;
         dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to the end of &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes  # because you need root privilege (sudo) to update /etc/hosts&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; In your playbook there must be no space between the two closing curly braces at the end of the &#039;&#039;block:&#039;&#039; line (and in the similar snippets below). The space is only shown here because WikiText would otherwise misinterpret the braces as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line so that nodes can &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option you can use to avoid repeating all earlier plays (blocks) and stages. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each step whether to execute or skip it.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Use the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules to do this. They may require installation on your local machine:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN you need &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;ansible.builtin.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep tf-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop (and other) archives directly to each cluster host. If you re-run your playbook many times, however, this takes time and can trigger rate limiting on the download servers. In that case, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
&lt;br /&gt;
=== Configure HDFS and YARN ===&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
&lt;br /&gt;
Change the variable syntax in the &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; files from Exercise 4, replacing shell-style variables with Ansible&#039;s Jinja2 style. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module (instead of the Linux &#039;&#039;envsubst&#039;&#039; command) to configure the files. Note that the files use the {{ hadoop_namenode } } variable. It is not defined yet, but should have the same value as {{ master_node } }.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that refuses to re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible ignore exit code 1 from &#039;&#039;hdfs&#039;&#039; in that case:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it should run on an odd number of machines, so it may not run on all the machines in the cluster&lt;br /&gt;
# each zookeeper needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the number is odd, but only on the workers if the total number is even.&lt;br /&gt;
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage zookeeper ids. But Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use.&lt;br /&gt;
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; (a better location than the previously suggested &#039;&#039;/tmp/zookeeper/myid&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1219</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1219"/>
		<updated>2022-10-31T13:51:39Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Install HDFS and YARN */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task that backs up &#039;&#039;~/.bashrc&#039;&#039;:&lt;br /&gt;
 - name: Prepare .bashrc&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Save original .bashrc&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: /home/ubuntu/.bashrc&lt;br /&gt;
         dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to the end of &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes  # because you need root privilege (sudo) to update /etc/hosts&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; In your playbook there must be no space between the two closing curly braces at the end of the &#039;&#039;block:&#039;&#039; line (and in the similar snippets below). The space is only shown here because WikiText would otherwise misinterpret the braces as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line so that nodes can &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option you can use to avoid repeating all earlier plays (blocks) and stages. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each step whether to execute or skip it.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Use the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules to do this. They may require installation on your local machine:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
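&lt;br /&gt;
A hedged sketch of the three tasks, assuming the extra volume appears as &#039;&#039;/dev/sdb&#039;&#039; (check the actual device name with &#039;&#039;&#039;lsblk&#039;&#039;&#039; first):&lt;br /&gt;
     - name: Partition the volume&lt;br /&gt;
       community.general.parted:&lt;br /&gt;
         device: /dev/sdb&lt;br /&gt;
         number: 1&lt;br /&gt;
         state: present&lt;br /&gt;
       become: yes&lt;br /&gt;
 &lt;br /&gt;
     - name: Create a filesystem&lt;br /&gt;
       community.general.filesystem:&lt;br /&gt;
         fstype: ext4&lt;br /&gt;
         dev: /dev/sdb1&lt;br /&gt;
       become: yes&lt;br /&gt;
 &lt;br /&gt;
     - name: Mount the volume&lt;br /&gt;
       ansible.posix.mount:&lt;br /&gt;
         path: /home/ubuntu/volume&lt;br /&gt;
         src: /dev/sdb1&lt;br /&gt;
         fstype: ext4&lt;br /&gt;
         state: mounted&lt;br /&gt;
       become: yes&lt;br /&gt;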
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN you need the &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep tf-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop (and other) archives directly to each cluster host. But if you re-run your playbook many times, the repeated downloads take time and can trigger rate limiting on the download servers. If so, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.&lt;br /&gt;
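&lt;br /&gt;
For the direct-download variant, a task might look like this (the mirror URL and Hadoop version are examples only):&lt;br /&gt;
     - name: Download Hadoop archive&lt;br /&gt;
       ansible.builtin.get_url:&lt;br /&gt;
         url: https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz&lt;br /&gt;
         dest: /home/ubuntu/hadoop-3.3.4.tar.gz&lt;br /&gt;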
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
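&lt;br /&gt;
For example (the archive name and target folder are assumptions that must match your download task):&lt;br /&gt;
     - name: Unpack Hadoop archive&lt;br /&gt;
       ansible.builtin.unarchive:&lt;br /&gt;
         src: /home/ubuntu/hadoop-3.3.4.tar.gz&lt;br /&gt;
         dest: /home/ubuntu/volume&lt;br /&gt;
         remote_src: yes  # the archive is already on the remote host&lt;br /&gt;
 &lt;br /&gt;
     - name: Create hadoop symlink&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         src: /home/ubuntu/volume/hadoop-3.3.4&lt;br /&gt;
         dest: /home/ubuntu/volume/hadoop&lt;br /&gt;
         state: link&lt;br /&gt;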
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
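&lt;br /&gt;
For example, assuming the symlinked Hadoop folder (&#039;&#039;lineinfile&#039;&#039; only adds the line if it is not already present):&lt;br /&gt;
     - name: Set HADOOP_HOME&lt;br /&gt;
       ansible.builtin.lineinfile:&lt;br /&gt;
         path: /home/ubuntu/.info319&lt;br /&gt;
         line: export HADOOP_HOME=/home/ubuntu/volume/hadoop&lt;br /&gt;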
&lt;br /&gt;
Change the variable syntax in the files &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; from Exercise 4 from Linux to Ansible. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module (instead of the Linux &#039;&#039;envsubst&#039;&#039; command) to configure the files. Note that the templates use the {{ hadoop_namenode } } variable. It is not defined yet, but it should have the same value as {{ master_node } }.&lt;br /&gt;
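&lt;br /&gt;
A sketch of the two tasks (the local &#039;&#039;templates/&#039;&#039; folder is an assumption; use wherever you keep your &#039;&#039;.j2&#039;&#039; files):&lt;br /&gt;
     - name: Set hadoop_namenode fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         hadoop_namenode: &amp;quot;{{ master_node } }&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
     - name: Configure core-site.xml&lt;br /&gt;
       ansible.builtin.template:&lt;br /&gt;
         src: templates/core-site.xml.j2&lt;br /&gt;
         dest: /home/ubuntu/volume/hadoop/etc/hadoop/core-site.xml&lt;br /&gt;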
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that refuses to re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible ignore exit code 1 from &#039;&#039;hdfs&#039;&#039; in such cases:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it may not run on all the machines in the cluster (the number of Zookeeper nodes must be odd)&lt;br /&gt;
# each zookeeper needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the total number of hosts is odd, but only on the workers if it is even.&lt;br /&gt;
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage zookeeper ids. But Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use.&lt;br /&gt;
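&lt;br /&gt;
One possible sketch, which derives each host&#039;s id from its position in the inventory instead of using an explicit loop (the data folder must match your Zookeeper configuration):&lt;br /&gt;
     - name: Write Zookeeper myid&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         content: &amp;quot;{{ groups[&#039;all&#039;].index(inventory_hostname) + 1 } }&amp;quot;&lt;br /&gt;
         dest: /home/ubuntu/volume/zookeeper/data/myid&lt;br /&gt;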
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; in the Zookeeper step (a more durable location than the previously suggested &#039;&#039;/tmp/zookeeper/myid&#039;&#039;).&lt;br /&gt;
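&lt;br /&gt;
A hedged way to turn that file into a fact (the variable name &#039;&#039;broker_id&#039;&#039; is just an example):&lt;br /&gt;
     - name: Register broker id expression&lt;br /&gt;
       shell: cat /home/ubuntu/volume/zookeeper/data/myid&lt;br /&gt;
       register: broker_id_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set broker_id fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         broker_id: &amp;quot;{{ broker_id_expr.stdout } }&amp;quot;&lt;br /&gt;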
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1218</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1218"/>
		<updated>2022-10-31T13:49:33Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Mount volumes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task that backs up &#039;&#039;~/.bashrc&#039;&#039;:&lt;br /&gt;
 - name: Prepare .bashrc&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Save original .bashrc&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: /home/ubuntu/.bashrc&lt;br /&gt;
         dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to the end of &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes  # because you need root privilege (sudo) to update /etc/hosts&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There should not be a space between the two closing curly braces at the end of the &#039;&#039;block:&#039;&#039; line (nor in the other &#039;&#039;lookup&#039;&#039; expressions on this page). Without the space, however, WikiText misinterprets them as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line so that nodes can &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option you can use to avoid re-running all earlier plays and tasks. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each task whether to execute or skip it.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Use the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules to do this. They may require installation on your local machine:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN you need the &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep tf-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop and later Spark archives directly to each cluster host. But if you re-run your playbook many times, the repeated downloads take time and can trigger rate limiting on the download servers. If so, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
&lt;br /&gt;
Change the variable syntax in the files &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; from Exercise 4 from Linux to Ansible. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module instead of the Linux &#039;&#039;envsubst&#039;&#039; command to configure the files. Note that the templates use the {{ hadoop_namenode } } variable. It is not defined yet, but it should have the same value as {{ master_node } }.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that refuses to re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible ignore exit code 1 from &#039;&#039;hdfs&#039;&#039; in such cases:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it may not run on all the machines in the cluster (the number of Zookeeper nodes must be odd)&lt;br /&gt;
# each zookeeper needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the total number of hosts is odd, but only on the workers if it is even.&lt;br /&gt;
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage zookeeper ids. But Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use.&lt;br /&gt;
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; in the Zookeeper step (a more durable location than the previously suggested &#039;&#039;/tmp/zookeeper/myid&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1217</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1217"/>
		<updated>2022-10-31T13:47:34Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Configure SSH */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task that backs up &#039;&#039;~/.bashrc&#039;&#039;:&lt;br /&gt;
 - name: Prepare .bashrc&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Save original .bashrc&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: /home/ubuntu/.bashrc&lt;br /&gt;
         dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to the end of &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes  # because you need root privilege (sudo) to update /etc/hosts&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There should not be a space between the two closing curly braces at the end of the &#039;&#039;block:&#039;&#039; line (nor in the other &#039;&#039;lookup&#039;&#039; expressions on this page). Without the space, however, WikiText misinterprets them as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line so that nodes can &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option you can use to avoid re-running all earlier plays and tasks. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each task whether to execute or skip it.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Use the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules to do this. They may first need to be installed on your local machine:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN you need the &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep tf-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop and later Spark archives directly to each cluster host. But if you re-run your playbook many times, the repeated downloads take time and can trigger rate limiting on the download servers. If so, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
&lt;br /&gt;
Change the variable syntax in the files &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; from Exercise 4 from Linux to Ansible. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module instead of the Linux &#039;&#039;envsubst&#039;&#039; command to configure the files. Note that the templates use the {{ hadoop_namenode } } variable. It is not defined yet, but it should have the same value as {{ master_node } }.&lt;br /&gt;
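&lt;br /&gt;
For example, the following sketch defines &#039;&#039;hadoop_namenode&#039;&#039; and instantiates the template; the destination path is an assumption based on the folder layout used earlier:&lt;br /&gt;
     - name: Set hadoop_namenode fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         hadoop_namenode: &amp;quot;{{ master_node } }&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
     - name: Configure core-site.xml from template&lt;br /&gt;
       ansible.builtin.template:&lt;br /&gt;
         src: core-site.xml.j2&lt;br /&gt;
         dest: /home/ubuntu/volume/hadoop/etc/hadoop/core-site.xml&lt;br /&gt;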
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that refuses to re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible ignore exit code 1 from hdfs in such cases:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it should run on an odd number of machines, so it may not run on all the machines in the cluster&lt;br /&gt;
# each zookeeper needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the total number of hosts is odd, but only on the workers if it is even.&lt;br /&gt;
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage zookeeper ids. But Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use.&lt;br /&gt;
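&lt;br /&gt;
For example, the following sketch derives each host&#039;s &#039;&#039;myid&#039;&#039; from its position in the inventory and skips the driver when the total number of hosts is even; the data folder and driver host name are assumptions based on the rest of this exercise:&lt;br /&gt;
     - name: Write Zookeeper myid&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         content: &amp;quot;{{ groups[&#039;all&#039;].index(inventory_hostname) + 1 } }&amp;quot;&lt;br /&gt;
         dest: /home/ubuntu/volume/zookeeper/data/myid&lt;br /&gt;
       when: (groups[&#039;all&#039;] | length) % 2 == 1 or inventory_hostname != &#039;terraform-driver&#039;&lt;br /&gt;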
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was earlier written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; (a better location than the previously suggested &#039;&#039;/tmp/zookeeper/myid&#039;&#039;, since &#039;&#039;/tmp&#039;&#039; is cleared on reboot).&lt;br /&gt;
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1216</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1216"/>
		<updated>2022-10-31T13:45:05Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Preparing .bashrc */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task that backs up &#039;&#039;~/.bashrc&#039;&#039;:&lt;br /&gt;
 - name: Prepare .bashrc&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Save original .bashrc&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: /home/ubuntu/.bashrc&lt;br /&gt;
         dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to the end of &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
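&lt;br /&gt;
A sketch of the two tasks; the &#039;&#039;source&#039;&#039; line goes into &#039;&#039;~/.bashrc&#039;&#039; so that interactive shells pick up &#039;&#039;~/.info319&#039;&#039;:&lt;br /&gt;
     - name: Create .info319&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         path: /home/ubuntu/.info319&lt;br /&gt;
         state: touch&lt;br /&gt;
 &lt;br /&gt;
     - name: Source .info319 from .bashrc&lt;br /&gt;
       ansible.builtin.lineinfile:&lt;br /&gt;
         path: /home/ubuntu/.bashrc&lt;br /&gt;
         line: source .info319&lt;br /&gt;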
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There should not be a space between the two closing curly braces at the end of the &#039;&#039;block:&#039;&#039; line (the same goes for similar lines below). Without the space, however, WikiText would misinterpret them as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line to allow nodes to &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
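&lt;br /&gt;
For example, a sketch that loops over the three files; the destination folder and the (restrictive) file mode are assumptions:&lt;br /&gt;
     - name: Upload SSH config and keys&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: &amp;quot;{{ item } }&amp;quot;&lt;br /&gt;
         dest: /home/ubuntu/.ssh/&lt;br /&gt;
         mode: &#039;0600&#039;&lt;br /&gt;
       loop:&lt;br /&gt;
         - config&lt;br /&gt;
         - ~/.ssh/config.terraform-hosts&lt;br /&gt;
         - ~/.ssh/info319-spark-cluster&lt;br /&gt;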
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option to avoid repeating all earlier blocks and stages. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each step whether to execute, skip, or finish.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Using the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules from your local machine may require installation:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN, you need the &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; values available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;ansible.builtin.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep tf-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
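&lt;br /&gt;
One possible sketch of those two tasks, assuming the worker hosts appear as &#039;&#039;tf-worker&#039;&#039; lines in &#039;&#039;/etc/hosts&#039;&#039; (as elsewhere on this page, there is an extra space between the closing curly braces to placate WikiText):&lt;br /&gt;
     - name: Register num_workers expression&lt;br /&gt;
       shell: grep -c tf-worker /etc/hosts&lt;br /&gt;
       register: num_workers_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set num_workers fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         num_workers: &amp;quot;{{ num_workers_expr.stdout } }&amp;quot;&lt;br /&gt;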
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop and, later, Spark archives directly to each cluster host. However, if you re-run your playbook many times, this takes time and can trigger rate limiting on the download servers. In that case, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
&lt;br /&gt;
Change the variable syntax in the files &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; from Exercise 4 from Linux shell syntax to Ansible (Jinja2) syntax. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module instead of the Linux &#039;&#039;envsubst&#039;&#039; command to configure the files. Note that the templates use the {{ hadoop_namenode } } variable. It is not defined yet, but it should have the same value as {{ master_node } }.&lt;br /&gt;
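&lt;br /&gt;
For example, the following sketch defines &#039;&#039;hadoop_namenode&#039;&#039; and instantiates the template; the destination path is an assumption based on the folder layout used earlier:&lt;br /&gt;
     - name: Set hadoop_namenode fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         hadoop_namenode: &amp;quot;{{ master_node } }&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
     - name: Configure core-site.xml from template&lt;br /&gt;
       ansible.builtin.template:&lt;br /&gt;
         src: core-site.xml.j2&lt;br /&gt;
         dest: /home/ubuntu/volume/hadoop/etc/hadoop/core-site.xml&lt;br /&gt;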
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that refuses to re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible ignore exit code 1 from hdfs in such cases:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it should run on an odd number of machines, so it may not run on all the machines in the cluster&lt;br /&gt;
# each zookeeper needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the total number of hosts is odd, but only on the workers if it is even.&lt;br /&gt;
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage zookeeper ids. But Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use.&lt;br /&gt;
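&lt;br /&gt;
For example, the following sketch derives each host&#039;s &#039;&#039;myid&#039;&#039; from its position in the inventory and skips the driver when the total number of hosts is even; the data folder and driver host name are assumptions based on the rest of this exercise:&lt;br /&gt;
     - name: Write Zookeeper myid&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         content: &amp;quot;{{ groups[&#039;all&#039;].index(inventory_hostname) + 1 } }&amp;quot;&lt;br /&gt;
         dest: /home/ubuntu/volume/zookeeper/data/myid&lt;br /&gt;
       when: (groups[&#039;all&#039;] | length) % 2 == 1 or inventory_hostname != &#039;terraform-driver&#039;&lt;br /&gt;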
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was earlier written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; (a better location than the previously suggested &#039;&#039;/tmp/zookeeper/myid&#039;&#039;, since &#039;&#039;/tmp&#039;&#039; is cleared on reboot).&lt;br /&gt;
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1215</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1215"/>
		<updated>2022-10-31T13:44:17Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Create Ansible playbook */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task that backs up &#039;&#039;~/.bashrc&#039;&#039;:&lt;br /&gt;
 - name: Prepare .bashrc&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Save original .bashrc&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: /home/ubuntu/.bashrc&lt;br /&gt;
         dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to the end of &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
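&lt;br /&gt;
A sketch of the two tasks; the &#039;&#039;source&#039;&#039; line goes into &#039;&#039;~/.bashrc&#039;&#039; so that interactive shells pick up &#039;&#039;~/.info319&#039;&#039;:&lt;br /&gt;
     - name: Create .info319&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         path: /home/ubuntu/.info319&lt;br /&gt;
         state: touch&lt;br /&gt;
 &lt;br /&gt;
     - name: Source .info319 from .bashrc&lt;br /&gt;
       ansible.builtin.lineinfile:&lt;br /&gt;
         path: /home/ubuntu/.bashrc&lt;br /&gt;
         line: source .info319&lt;br /&gt;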
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There should not be a space between the two closing curly braces at the end of the &#039;&#039;block:&#039;&#039; line (the same goes for similar lines below). Without the space, however, WikiText would misinterpret them as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line to allow nodes to &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
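&lt;br /&gt;
For example, a sketch that loops over the three files; the destination folder and the (restrictive) file mode are assumptions:&lt;br /&gt;
     - name: Upload SSH config and keys&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: &amp;quot;{{ item } }&amp;quot;&lt;br /&gt;
         dest: /home/ubuntu/.ssh/&lt;br /&gt;
         mode: &#039;0600&#039;&lt;br /&gt;
       loop:&lt;br /&gt;
         - config&lt;br /&gt;
         - ~/.ssh/config.terraform-hosts&lt;br /&gt;
         - ~/.ssh/info319-spark-cluster&lt;br /&gt;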
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option to avoid repeating all earlier blocks and stages. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each step whether to execute, skip, or finish.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Using the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules from your local machine may require installation:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN, you need the &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; values available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;ansible.builtin.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep tf-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
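&lt;br /&gt;
One possible sketch of those two tasks, assuming the worker hosts appear as &#039;&#039;tf-worker&#039;&#039; lines in &#039;&#039;/etc/hosts&#039;&#039; (as elsewhere on this page, there is an extra space between the closing curly braces to placate WikiText):&lt;br /&gt;
     - name: Register num_workers expression&lt;br /&gt;
       shell: grep -c tf-worker /etc/hosts&lt;br /&gt;
       register: num_workers_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set num_workers fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         num_workers: &amp;quot;{{ num_workers_expr.stdout } }&amp;quot;&lt;br /&gt;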
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop and, later, Spark archives directly to each cluster host. However, if you re-run your playbook many times, this takes time and can trigger rate limiting on the download servers. In that case, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
&lt;br /&gt;
Change the variable syntax in the files &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; from Exercise 4 from Linux shell syntax to Ansible (Jinja2) syntax. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module instead of the Linux &#039;&#039;envsubst&#039;&#039; command to configure the files. Note that the templates use the {{ hadoop_namenode } } variable. It is not defined yet, but it should have the same value as {{ master_node } }.&lt;br /&gt;
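&lt;br /&gt;
For example, the following sketch defines &#039;&#039;hadoop_namenode&#039;&#039; and instantiates the template; the destination path is an assumption based on the folder layout used earlier:&lt;br /&gt;
     - name: Set hadoop_namenode fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         hadoop_namenode: &amp;quot;{{ master_node } }&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
     - name: Configure core-site.xml from template&lt;br /&gt;
       ansible.builtin.template:&lt;br /&gt;
         src: core-site.xml.j2&lt;br /&gt;
         dest: /home/ubuntu/volume/hadoop/etc/hadoop/core-site.xml&lt;br /&gt;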
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that refuses to re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible ignore exit code 1 from hdfs in such cases:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it should not necessarily run on all the machines in the cluster (the number of Zookeeper nodes must be odd)&lt;br /&gt;
# each Zookeeper node needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the number is odd, but only on the workers if the total number is even.&lt;br /&gt;
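&lt;br /&gt;
One way to express this selection is a &#039;&#039;when:&#039;&#039; condition on each Zookeeper task (a sketch; it assumes a &#039;&#039;num_workers&#039;&#039; fact was set earlier and that worker hostnames contain &#039;&#039;worker&#039;&#039;):&lt;br /&gt;
     - name: Create Zookeeper data folder&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         path: /home/ubuntu/volume/zookeeper/data&lt;br /&gt;
         state: directory&lt;br /&gt;
       when: (num_workers | int) % 2 == 0 or &#039;worker&#039; in inventory_hostname&lt;br /&gt;
The total number of hosts is &#039;&#039;num_workers&#039;&#039; plus one, so the condition holds on every host when &#039;&#039;num_workers&#039;&#039; is even, and only on the workers when it is odd.&lt;br /&gt;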
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage Zookeeper ids. Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use instead.&lt;br /&gt;
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was earlier written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; (a better location than the previously suggested &#039;&#039;/tmp/zookeeper/myid&#039;&#039;).&lt;br /&gt;
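&lt;br /&gt;
A sketch of reading that file into a fact (the &#039;&#039;broker_id&#039;&#039; names here are just examples):&lt;br /&gt;
     - name: Register broker id expression&lt;br /&gt;
       shell: cat /home/ubuntu/volume/zookeeper/data/myid&lt;br /&gt;
       register: broker_id_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set broker_id fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         broker_id: &amp;quot;{{ broker_id_expr.stdout } }&amp;quot;&lt;br /&gt;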
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1214</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1214"/>
		<updated>2022-10-31T13:43:23Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Test run Ansible */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task that backs up &#039;&#039;~/.bashrc&#039;&#039;:&lt;br /&gt;
 - name: Prepare .bashrc&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Save original .bashrc&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: /home/ubuntu/.bashrc&lt;br /&gt;
         dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Create Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
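&lt;br /&gt;
These two tasks might look like this (a minimal sketch):&lt;br /&gt;
     - name: Create .info319&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         path: /home/ubuntu/.info319&lt;br /&gt;
         state: touch&lt;br /&gt;
 &lt;br /&gt;
     - name: Source .info319 from .bashrc&lt;br /&gt;
       ansible.builtin.lineinfile:&lt;br /&gt;
         path: /home/ubuntu/.bashrc&lt;br /&gt;
         line: source .info319&lt;br /&gt;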
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There should not be a space between the two closing curly braces in these examples. But without the space, WikiText would misinterpret them as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line so that nodes can &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option to avoid repeating all earlier blocks and stages. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each step whether to execute, skip, or finish.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Using the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules from your local machine may require installation:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
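&lt;br /&gt;
A minimal sketch of the three tasks (the device name &#039;&#039;/dev/vdb&#039;&#039; and the mount point are assumptions; check your own cluster with &#039;&#039;&#039;lsblk&#039;&#039;&#039;):&lt;br /&gt;
     - name: Partition the volume&lt;br /&gt;
       community.general.parted:&lt;br /&gt;
         device: /dev/vdb&lt;br /&gt;
         number: 1&lt;br /&gt;
         state: present&lt;br /&gt;
       become: yes&lt;br /&gt;
 &lt;br /&gt;
     - name: Create a filesystem&lt;br /&gt;
       community.general.filesystem:&lt;br /&gt;
         fstype: ext4&lt;br /&gt;
         dev: /dev/vdb1&lt;br /&gt;
       become: yes&lt;br /&gt;
 &lt;br /&gt;
     - name: Mount the volume&lt;br /&gt;
       ansible.posix.mount:&lt;br /&gt;
         path: /home/ubuntu/volume&lt;br /&gt;
         src: /dev/vdb1&lt;br /&gt;
         fstype: ext4&lt;br /&gt;
         state: mounted&lt;br /&gt;
       become: yes&lt;br /&gt;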
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN you need the &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; values available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;ansible.builtin.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep terraform-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
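&lt;br /&gt;
For example (a sketch; the &#039;&#039;grep -c&#039;&#039; pattern assumes the worker hosts appear as &#039;&#039;terraform-worker-*&#039;&#039; lines in &#039;&#039;/etc/hosts&#039;&#039;):&lt;br /&gt;
     - name: Register num_workers expression&lt;br /&gt;
       shell: grep -c terraform-worker /etc/hosts&lt;br /&gt;
       register: num_workers_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set num_workers fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         num_workers: &amp;quot;{{ num_workers_expr.stdout } }&amp;quot;&lt;br /&gt;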
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop and later Spark archives directly to each cluster host. But if you re-run your playbook many times, this takes time and can trigger rate limiting on the download servers. If so, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
&lt;br /&gt;
Change the variable syntax in the files &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; from Exercise 4 from Linux to Ansible. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module instead of Linux&#039;s &#039;&#039;envsubst&#039;&#039; command to configure the files. Note that the files use the {{ hadoop_namenode } } variable. It is not defined yet, but should be given the same value as {{ master_node } }.&lt;br /&gt;
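&lt;br /&gt;
For example (a sketch; the destination path assumes Hadoop was unpacked under &#039;&#039;/home/ubuntu/volume/hadoop&#039;&#039;):&lt;br /&gt;
     - name: Set hadoop_namenode fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         hadoop_namenode: &amp;quot;{{ master_node } }&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
     - name: Configure core-site.xml&lt;br /&gt;
       ansible.builtin.template:&lt;br /&gt;
         src: core-site.xml.j2&lt;br /&gt;
         dest: /home/ubuntu/volume/hadoop/etc/hadoop/core-site.xml&lt;br /&gt;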
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that refuses to re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible accept exit code 1 from &#039;&#039;hdfs&#039;&#039; in such cases:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it should not necessarily run on all the machines in the cluster (the number of Zookeeper nodes must be odd)&lt;br /&gt;
# each Zookeeper node needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the number is odd, but only on the workers if the total number is even.&lt;br /&gt;
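&lt;br /&gt;
One way to express this selection is a &#039;&#039;when:&#039;&#039; condition on each Zookeeper task (a sketch; it assumes a &#039;&#039;num_workers&#039;&#039; fact was set earlier and that worker hostnames contain &#039;&#039;worker&#039;&#039;):&lt;br /&gt;
     - name: Create Zookeeper data folder&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         path: /home/ubuntu/volume/zookeeper/data&lt;br /&gt;
         state: directory&lt;br /&gt;
       when: (num_workers | int) % 2 == 0 or &#039;worker&#039; in inventory_hostname&lt;br /&gt;
The total number of hosts is &#039;&#039;num_workers&#039;&#039; plus one, so the condition holds on every host when &#039;&#039;num_workers&#039;&#039; is even, and only on the workers when it is odd.&lt;br /&gt;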
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage Zookeeper ids. Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use instead.&lt;br /&gt;
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was earlier written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; (a better location than the previously suggested &#039;&#039;/tmp/zookeeper/myid&#039;&#039;).&lt;br /&gt;
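&lt;br /&gt;
A sketch of reading that file into a fact (the &#039;&#039;broker_id&#039;&#039; names here are just examples):&lt;br /&gt;
     - name: Register broker id expression&lt;br /&gt;
       shell: cat /home/ubuntu/volume/zookeeper/data/myid&lt;br /&gt;
       register: broker_id_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set broker_id fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         broker_id: &amp;quot;{{ broker_id_expr.stdout } }&amp;quot;&lt;br /&gt;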
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1213</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1213"/>
		<updated>2022-10-31T13:42:46Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Test run Ansible */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task:&lt;br /&gt;
 - name: Prepare .bashrc&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Save original .bashrc&lt;br /&gt;
       ansible.builtin.copy:&lt;br /&gt;
         src: /home/ubuntu/.bashrc&lt;br /&gt;
         dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
         remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Create Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
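&lt;br /&gt;
These two tasks might look like this (a minimal sketch):&lt;br /&gt;
     - name: Create .info319&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         path: /home/ubuntu/.info319&lt;br /&gt;
         state: touch&lt;br /&gt;
 &lt;br /&gt;
     - name: Source .info319 from .bashrc&lt;br /&gt;
       ansible.builtin.lineinfile:&lt;br /&gt;
         path: /home/ubuntu/.bashrc&lt;br /&gt;
         line: source .info319&lt;br /&gt;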
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There should not be a space between the two closing curly braces in these examples. But without the space, WikiText would misinterpret them as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line so that nodes can &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option to avoid repeating all earlier blocks and stages. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each step whether to execute, skip, or finish.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Using the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules from your local machine may require installation:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
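&lt;br /&gt;
A minimal sketch of the three tasks (the device name &#039;&#039;/dev/vdb&#039;&#039; and the mount point are assumptions; check your own cluster with &#039;&#039;&#039;lsblk&#039;&#039;&#039;):&lt;br /&gt;
     - name: Partition the volume&lt;br /&gt;
       community.general.parted:&lt;br /&gt;
         device: /dev/vdb&lt;br /&gt;
         number: 1&lt;br /&gt;
         state: present&lt;br /&gt;
       become: yes&lt;br /&gt;
 &lt;br /&gt;
     - name: Create a filesystem&lt;br /&gt;
       community.general.filesystem:&lt;br /&gt;
         fstype: ext4&lt;br /&gt;
         dev: /dev/vdb1&lt;br /&gt;
       become: yes&lt;br /&gt;
 &lt;br /&gt;
     - name: Mount the volume&lt;br /&gt;
       ansible.posix.mount:&lt;br /&gt;
         path: /home/ubuntu/volume&lt;br /&gt;
         src: /dev/vdb1&lt;br /&gt;
         fstype: ext4&lt;br /&gt;
         state: mounted&lt;br /&gt;
       become: yes&lt;br /&gt;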
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN you need the &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; values available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;ansible.builtin.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep terraform-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
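&lt;br /&gt;
For example (a sketch; the &#039;&#039;grep -c&#039;&#039; pattern assumes the worker hosts appear as &#039;&#039;terraform-worker-*&#039;&#039; lines in &#039;&#039;/etc/hosts&#039;&#039;):&lt;br /&gt;
     - name: Register num_workers expression&lt;br /&gt;
       shell: grep -c terraform-worker /etc/hosts&lt;br /&gt;
       register: num_workers_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set num_workers fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         num_workers: &amp;quot;{{ num_workers_expr.stdout } }&amp;quot;&lt;br /&gt;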
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop and later Spark archives directly to each cluster host. But if you re-run your playbook many times, this takes time and can trigger rate limiting on the download servers. If so, it is better to let Ansible download each archive once to your local machine and then copy it onto the hosts as needed.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
&lt;br /&gt;
Change the variable syntax in the files &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; from Exercise 4 from Linux to Ansible. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module instead of Linux&#039;s &#039;&#039;envsubst&#039;&#039; command to configure the files. Note that the files use the {{ hadoop_namenode } } variable. It is not defined yet, but should be given the same value as {{ master_node } }.&lt;br /&gt;
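&lt;br /&gt;
For example (a sketch; the destination path assumes Hadoop was unpacked under &#039;&#039;/home/ubuntu/volume/hadoop&#039;&#039;):&lt;br /&gt;
     - name: Set hadoop_namenode fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         hadoop_namenode: &amp;quot;{{ master_node } }&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
     - name: Configure core-site.xml&lt;br /&gt;
       ansible.builtin.template:&lt;br /&gt;
         src: core-site.xml.j2&lt;br /&gt;
         dest: /home/ubuntu/volume/hadoop/etc/hadoop/core-site.xml&lt;br /&gt;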
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that refuses to re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible accept exit code 1 from &#039;&#039;hdfs&#039;&#039; in such cases:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it should not necessarily run on all the machines in the cluster (the number of Zookeeper nodes must be odd)&lt;br /&gt;
# each Zookeeper node needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the number is odd, but only on the workers if the total number is even.&lt;br /&gt;
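&lt;br /&gt;
One way to express this selection is a &#039;&#039;when:&#039;&#039; condition on each Zookeeper task (a sketch; it assumes a &#039;&#039;num_workers&#039;&#039; fact was set earlier and that worker hostnames contain &#039;&#039;worker&#039;&#039;):&lt;br /&gt;
     - name: Create Zookeeper data folder&lt;br /&gt;
       ansible.builtin.file:&lt;br /&gt;
         path: /home/ubuntu/volume/zookeeper/data&lt;br /&gt;
         state: directory&lt;br /&gt;
       when: (num_workers | int) % 2 == 0 or &#039;worker&#039; in inventory_hostname&lt;br /&gt;
The total number of hosts is &#039;&#039;num_workers&#039;&#039; plus one, so the condition holds on every host when &#039;&#039;num_workers&#039;&#039; is even, and only on the workers when it is odd.&lt;br /&gt;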
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage Zookeeper ids. Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use instead.&lt;br /&gt;
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was earlier written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; (a better location than the previously suggested &#039;&#039;/tmp/zookeeper/myid&#039;&#039;).&lt;br /&gt;
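&lt;br /&gt;
A sketch of reading that file into a fact (the &#039;&#039;broker_id&#039;&#039; names here are just examples):&lt;br /&gt;
     - name: Register broker id expression&lt;br /&gt;
       shell: cat /home/ubuntu/volume/zookeeper/data/myid&lt;br /&gt;
       register: broker_id_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set broker_id fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         broker_id: &amp;quot;{{ broker_id_expr.stdout } }&amp;quot;&lt;br /&gt;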
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1212</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1212"/>
		<updated>2022-10-31T13:42:18Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Test run Ansible */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task:&lt;br /&gt;
- name: Prepare .bashrc&lt;br /&gt;
  hosts: all&lt;br /&gt;
  tasks:&lt;br /&gt;
&lt;br /&gt;
    - name: Save original .bashrc&lt;br /&gt;
      ansible.builtin.copy:&lt;br /&gt;
        src: /home/ubuntu/.bashrc&lt;br /&gt;
        dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
        remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Create Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
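The two bullet points above might be sketched like this (a sketch only; module parameters as in the Ansible docs, paths from this page):&lt;br /&gt;

```yaml
# Sketch: create ~/.info319 and make interactive shells source it
- name: Create ~/.info319 on all hosts
  ansible.builtin.file:
    path: /home/ubuntu/.info319
    state: touch

- name: Source ~/.info319 from ~/.bashrc
  ansible.builtin.lineinfile:
    path: /home/ubuntu/.bashrc
    line: source .info319
```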
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There should be no space between the two closing curly braces at the end of the &#039;&#039;block:&#039;&#039; line; remove it in your own file. The space is only shown here because WikiText would otherwise misinterpret the braces as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line so that nodes can &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
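For example, the three uploads might be combined into one &#039;&#039;copy&#039;&#039; task with a loop (a sketch; the mode keeps the private key acceptable to &#039;&#039;ssh&#039;&#039;, and the space before the closing curly braces is again only for WikiText):&lt;br /&gt;

```yaml
# Sketch: upload SSH config and key files into ~/.ssh on every host
- name: Upload SSH config and cluster keys
  ansible.builtin.copy:
    src: "{{ item } }"
    dest: /home/ubuntu/.ssh/
    mode: "0600"
  loop:
    - config
    - ~/.ssh/config.terraform-hosts
    - ~/.ssh/info319-spark-cluster
```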
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option to avoid repeating all earlier blocks and stages. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each step whether to execute, skip, or finish.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Using the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules from your local machine may require installation:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN you need the &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep tf-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
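By analogy with the &#039;&#039;master_node&#039;&#039; tasks above, the &#039;&#039;num_workers&#039;&#039; pair might look like this (a sketch; it assumes one &#039;&#039;terraform-worker&#039;&#039; line per worker in &#039;&#039;/etc/hosts&#039;&#039;, and the space before the closing curly braces is again only for WikiText):&lt;br /&gt;

```yaml
- name: Register num_workers expression
  shell: grep -c terraform-worker /etc/hosts
  register: num_workers_expr

- name: Set num_workers fact
  set_fact:
    num_workers: "{{ num_workers_expr.stdout } }"
```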
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop and later the Spark archives directly to each cluster host. But if you re-run your script many times, this takes time and can trigger rate limiting on the download servers. In that case, it is better to let Ansible download each archive once to your local machine and then copy it to the hosts as needed.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
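The download-once-then-copy approach might be sketched like this (the Hadoop version and mirror URL below are examples only; use the archive from Exercise 4):&lt;br /&gt;

```yaml
# Sketch: fetch the archive once locally, then copy and unpack it on each host
# NOTE: the version and URL are examples, not the ones from Exercise 4
- name: Download Hadoop once to the local machine
  ansible.builtin.get_url:
    url: https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
    dest: ./hadoop-3.3.4.tar.gz
  delegate_to: localhost
  run_once: true

- name: Copy and unpack the archive on each host
  ansible.builtin.unarchive:
    src: ./hadoop-3.3.4.tar.gz
    dest: /home/ubuntu/volume
```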
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
&lt;br /&gt;
Change the variable syntax in the files &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; from Exercise 4 from shell-style to Ansible&#039;s Jinja2 template style. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module instead of the Linux &#039;&#039;envsubst&#039;&#039; command to configure the files. Note that the templates use the {{ hadoop_namenode } } variable. It is not defined yet, but it should have the same value as {{ master_node } }.&lt;br /&gt;
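For example (a sketch; the destination path is an example, and the space before the closing curly braces is again only for WikiText):&lt;br /&gt;

```yaml
- name: Set hadoop_namenode fact
  set_fact:
    hadoop_namenode: "{{ master_node } }"

# NOTE: the dest path is an example; use your Hadoop config folder
- name: Configure core-site.xml from its template
  ansible.builtin.template:
    src: core-site.xml.j2
    dest: /home/ubuntu/volume/hadoop/etc/hadoop/core-site.xml
```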
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that does not re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible ignore &#039;&#039;Exit code 1&#039;&#039; from hdfs in such cases:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it should not run on all the machines in the cluster (the number of Zookeeper nodes must be odd)&lt;br /&gt;
# each zookeeper needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the number is odd, but only on the workers if the total number is even.&lt;br /&gt;
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage zookeeper ids. But Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use.&lt;br /&gt;
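For example, instead of an explicit loop, each host&#039;s position among the play hosts can serve as its id (a sketch; it assumes the Zookeeper data folder used on this page, and the space before the closing curly braces is only for WikiText):&lt;br /&gt;

```yaml
- name: Write Zookeeper myid from the host's position in the play
  ansible.builtin.copy:
    dest: /home/ubuntu/volume/zookeeper/data/myid
    content: "{{ ansible_play_hosts.index(inventory_hostname) + 1 } }"
```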
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
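To see what the pipeline extracts, you can feed it canned &#039;&#039;ip -4 address&#039;&#039; output (the addresses below are made up):&lt;br /&gt;

```shell
# Two sample lines in the format printed by `ip -4 address` (hypothetical addresses)
sample='    inet 127.0.0.1/8 scope host lo
    inet 10.1.2.3/24 brd 10.1.2.255 scope global ens3'

# Keep only the line with scope "global", then take the first dotted quad on it
local_ip=$(echo "$sample" \
  | grep -o "^ *inet \(.\+\)\/.\+global.*$" \
  | grep -o "[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+" \
  | head -1)
echo "$local_ip"   # prints 10.1.2.3
```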
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; earlier (a better location than the &#039;&#039;/tmp/zookeeper/myid&#039;&#039; suggested earlier).&lt;br /&gt;
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1211</id>
		<title>Configure Spark cluster using Ansible</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Configure_Spark_cluster_using_Ansible&amp;diff=1211"/>
		<updated>2022-10-31T13:41:42Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Install and configure Ansible */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Ansible ==&lt;br /&gt;
On your local host, install [https://docs.ansible.com/ansible/latest/index.html Ansible], for example:&lt;br /&gt;
 sudo apt install ansible&lt;br /&gt;
&lt;br /&gt;
=== Configure Ansible ===&lt;br /&gt;
To prepare:&lt;br /&gt;
 sudo cp /etc/ansible/hosts /etc/ansible/hosts.original&lt;br /&gt;
 sudo chmod 666 /etc/ansible/hosts&lt;br /&gt;
&lt;br /&gt;
Ansible needs to know the names of your cluster machines. Change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;/etc/ansible/hosts&#039;&#039;:&lt;br /&gt;
 terraform-driver&lt;br /&gt;
 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 terraform-worker-5&lt;br /&gt;
&lt;br /&gt;
Finally, Ansible must be installed on all the hosts too. Add the line&lt;br /&gt;
 - ansible&lt;br /&gt;
to the &#039;&#039;packages:&#039;&#039; section of &#039;&#039;user-data.cfg&#039;&#039;, and re-run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Test run Ansible ===&lt;br /&gt;
Make sure you can log into your cluster machines without a password. Test your Ansible setup from your local machine:&lt;br /&gt;
 ansible all -m ping&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;info319-cluster.yaml&#039;&#039; with a simple task:&lt;br /&gt;
- name: Prepare .bashrc&lt;br /&gt;
  hosts: all&lt;br /&gt;
  tasks:&lt;br /&gt;
&lt;br /&gt;
    - name: Save original .bashrc&lt;br /&gt;
      ansible.builtin.copy:&lt;br /&gt;
        src: /home/ubuntu/.bashrc&lt;br /&gt;
        dest: /home/ubuntu/.bashrc.original&lt;br /&gt;
        remote_src: yes&lt;br /&gt;
&lt;br /&gt;
Run Ansible:&lt;br /&gt;
 ansible-playbook info319-cluster.yaml&lt;br /&gt;
&lt;br /&gt;
== Create Ansible playbook ==&lt;br /&gt;
Extend the playbook file &#039;&#039;info319-cluster.yaml&#039;&#039; to re-create the setup from Exercise 4 on the new Terraform cluster. You have some freedom with respect to the order of things.&lt;br /&gt;
&lt;br /&gt;
=== Preparing .bashrc ===&lt;br /&gt;
In Exercise 4 we made a lot of modifications to &#039;&#039;~/.bashrc&#039;&#039;. In some cases it is more practical to have the cluster configuration in a separate file, for example &#039;&#039;~/.info319&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.file&#039;&#039; module to create (&amp;quot;touch&amp;quot;) &#039;&#039;/home/ubuntu/.info319&#039;&#039; on all the hosts.&lt;br /&gt;
* Add a task that uses the &#039;&#039;ansible.builtin.lineinfile&#039;&#039; module to add this line to &#039;&#039;/home/ubuntu/.bashrc&#039;&#039; on all the hosts:&lt;br /&gt;
 source .info319&lt;br /&gt;
&lt;br /&gt;
See the documentation here: https://docs.ansible.com/ansible/latest/collections/index.html, for example &lt;br /&gt;
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html.&lt;br /&gt;
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
You can use the &#039;&#039;blockinfile&#039;&#039; module to add your local &#039;&#039;ipv4-hosts&#039;&#039; to &#039;&#039;/etc/hosts&#039;&#039; on each node:&lt;br /&gt;
    - name: Copy IPv4 addresses to /etc/hosts&lt;br /&gt;
      ansible.builtin.blockinfile:&lt;br /&gt;
        path: /etc/hosts&lt;br /&gt;
        block: &amp;quot;{{ lookup(&#039;file&#039;, &#039;ipv4-hosts&#039;) } }&amp;quot;&lt;br /&gt;
      become: yes&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There should be no space between the two closing curly braces at the end of the &#039;&#039;block:&#039;&#039; line; remove it in your own file. The space is only shown here because WikiText would otherwise misinterpret the braces as a template marker.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create the file &#039;&#039;config&#039;&#039; in your &#039;&#039;exercise-5&#039;&#039; folder:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
 &lt;br /&gt;
 Include ~/.ssh/config.terraform-hosts&lt;br /&gt;
(This is the &#039;&#039;config.stub&#039;&#039; file from Exercise 4, with the &#039;&#039;Include&#039;&#039; line added. Also, &#039;&#039;localhost&#039;&#039; has been added to the first line so that nodes can &#039;&#039;ssh&#039;&#039; into themselves.)&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;copy&#039;&#039; module to upload this file, along with &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039; and &#039;&#039;~/.ssh/info319-spark-cluster&#039;&#039; to all hosts.&lt;br /&gt;
&lt;br /&gt;
In addition to uploading the config and private key files, you must also authorise the public info319-spark-cluster.pub key, like this: &lt;br /&gt;
     - name: Authorise public cluster key&lt;br /&gt;
       ansible.posix.authorized_key:&lt;br /&gt;
         key: &amp;quot;{{ lookup(&#039;file&#039;, &#039;/home/YOUR_USERNAME/.ssh/info319-spark-cluster.pub&#039;) } }&amp;quot;&lt;br /&gt;
         user: ubuntu&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Tip:&#039;&#039; &#039;&#039;&#039;ansible-playbook&#039;&#039;&#039; has a &#039;&#039;&#039;--start-at-task &amp;quot;Task name&amp;quot;&#039;&#039;&#039; option to avoid repeating all earlier blocks and stages. You can also use &#039;&#039;&#039;--step&#039;&#039;&#039; to have Ansible ask before each step whether to execute, skip, or finish.&lt;br /&gt;
&lt;br /&gt;
=== Install Java ===&lt;br /&gt;
Use Ansible&#039;s &#039;&#039;ansible.builtin.apt&#039;&#039; module and install an old and stable Java version, for example &#039;&#039;openjdk-8-jdk-headless&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Mount volumes ===&lt;br /&gt;
Using the &#039;&#039;community.general.parted&#039;&#039;, &#039;&#039;community.general.filesystem&#039;&#039; and &#039;&#039;ansible.posix.mount&#039;&#039; modules from your local machine may require installation:&lt;br /&gt;
 ansible-galaxy collection install community.general&lt;br /&gt;
 ansible-galaxy collection install ansible.posix&lt;br /&gt;
&lt;br /&gt;
=== Install HDFS and YARN ===&lt;br /&gt;
To install HDFS and YARN you need the &#039;&#039;master_node&#039;&#039; and &#039;&#039;num_workers&#039;&#039; available as Ansible variables (facts). You can use the &#039;&#039;ansible.builtin.shell&#039;&#039; and &#039;&#039;.set_fact&#039;&#039; modules to do this, for example at the start of a new Ansible play:&lt;br /&gt;
 - name: Install HDFS and YARN&lt;br /&gt;
   hosts: all&lt;br /&gt;
   tasks:&lt;br /&gt;
 &lt;br /&gt;
     - name: Register master_node expression&lt;br /&gt;
       shell: grep tf-driver /etc/hosts | cut -d&#039; &#039; -f1&lt;br /&gt;
       register: master_node_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set master_node fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         master_node: &amp;quot;{{ master_node_expr.stdout } }&amp;quot;&lt;br /&gt;
Write two corresponding tasks for &#039;&#039;num_workers&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.get_url&#039;&#039; module to download the Hadoop and later the Spark archives directly to each cluster host. But if you re-run your script many times, this takes time and can trigger rate limiting on the download servers. In that case, it is better to let Ansible download each archive once to your local machine and then copy it to the hosts as needed.&lt;br /&gt;
&lt;br /&gt;
Use the &#039;&#039;ansible.builtin.unarchive&#039;&#039; module to unpack the archives. Use the &#039;&#039;apt&#039;&#039; module to install &#039;&#039;gzip&#039;&#039; if you need it. Use &#039;&#039;ansible.builtin.file&#039;&#039; to create symbolic links as in Exercise 4.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.lineinfile&#039;&#039; to define environment variables by adding them to &#039;&#039;~/.info319&#039;&#039; (instead of  &#039;&#039;~/.bashrc&#039;&#039;). &lt;br /&gt;
&lt;br /&gt;
Change the variable syntax in the files &#039;&#039;{core,hdfs,mapred,yarn}-site.xml&#039;&#039; from Exercise 4 from shell-style to Ansible&#039;s Jinja2 template style. For example&lt;br /&gt;
* from &#039;&#039;core-site.xml&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://${HADOOP_NAMENODE}:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
* to &#039;&#039;core-site.xml.j2&#039;&#039;:&lt;br /&gt;
 &amp;lt;configuration&amp;gt;&lt;br /&gt;
     &amp;lt;property&amp;gt;&lt;br /&gt;
         &amp;lt;name&amp;gt;fs.defaultFS&amp;lt;/name&amp;gt;&lt;br /&gt;
         &amp;lt;value&amp;gt;hdfs://{{ hadoop_namenode } }:9000&amp;lt;/value&amp;gt;&lt;br /&gt;
     &amp;lt;/property&amp;gt;&lt;br /&gt;
 &amp;lt;/configuration&amp;gt;&lt;br /&gt;
You can now use the &#039;&#039;ansible.builtin.template&#039;&#039; module instead of the Linux &#039;&#039;envsubst&#039;&#039; command to configure the files. Note that the templates use the {{ hadoop_namenode } } variable. It is not defined yet, but it should have the same value as {{ master_node } }.&lt;br /&gt;
&lt;br /&gt;
Use &#039;&#039;ansible.builtin.shell&#039;&#039; to create Hadoop&#039;s &#039;&#039;worker&#039;&#039; and &#039;&#039;master&#039;&#039; files as in Exercise 4, and to format the HDFS namenode and create the HDFS root folder on the &#039;&#039;terraform-driver&#039;&#039; host.&lt;br /&gt;
&lt;br /&gt;
Note that &#039;&#039;&#039;hdfs namenode -format&#039;&#039;&#039; has a &#039;&#039;&#039;-nonInteractive&#039;&#039;&#039; option that does not re-format an already formatted namenode. Use &#039;&#039;failed_when&#039;&#039; to make Ansible ignore &#039;&#039;Exit code 1&#039;&#039; from hdfs in such cases:&lt;br /&gt;
     - name: Format HDFS namenode&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/home/ubuntu/volume/hadoop/bin/hdfs&amp;quot;, &amp;quot;namenode&amp;quot;, &amp;quot;-format&amp;quot;, &amp;quot;-nonInteractive&amp;quot;]&lt;br /&gt;
       register: result&lt;br /&gt;
       failed_when: result.rc not in [0, 1]&lt;br /&gt;
&lt;br /&gt;
Now you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_HDFS_and_YARN_on_the_cluster#Run_HDFS_on_the_cluster | test run HDFS on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
Finally, you can start HDFS and YARN from &#039;&#039;terraform-driver&#039;&#039;. Note that the &#039;&#039;ansible.builtin.copy&#039;&#039; and &#039;&#039;ansible.builtin.shell&#039;&#039; modules will normally &#039;&#039;not&#039;&#039; run &#039;&#039;~/.bashrc&#039;&#039;. The reason is that &#039;&#039;~/.bashrc&#039;&#039; is intended for interactive shells running, for example, in a terminal window. &#039;&#039;~/.bashrc&#039;&#039; is not needed for many simpler commands, but more complex programs and scripts like Hadoop&#039;s &#039;&#039;start-all.sh&#039;&#039; need many environment variables. Therefore, you must start your own &#039;&#039;/usr/bin/bash&#039;&#039;, initialise it with &#039;&#039;~/.info319&#039;&#039;, and then run &#039;&#039;start-all.sh&#039;&#039; inside it:&lt;br /&gt;
     - name: Start HDFS and YARN&lt;br /&gt;
       ansible.builtin.shell:&lt;br /&gt;
         argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;start-all.sh&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Spark on the cluster ===&lt;br /&gt;
Proceed as in Exercise 4. You already know the Ansible modules that are needed. Afterwards you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install_Spark_on_the_cluster#Test_run_Spark | test run Spark on the cluster as in Exercise 4]].&lt;br /&gt;
&lt;br /&gt;
=== Install Zookeeper on the cluster ===&lt;br /&gt;
There are two challenges with Zookeeper:&lt;br /&gt;
# it should not run on all the machines in the cluster (the number of Zookeeper nodes must be odd)&lt;br /&gt;
# each zookeeper needs to know its &#039;&#039;myid&#039;&#039; number&lt;br /&gt;
&lt;br /&gt;
As for the first point, an easy solution is to run Zookeeper on all hosts if the number is odd, but only on the workers if the total number is even.&lt;br /&gt;
&lt;br /&gt;
As for the second point, Exercise 4 suggested shell commands you can use to manage zookeeper ids. But Ansible also has powerful [https://docs.ansible.com/ansible/latest/user_guide/playbooks_loops.html loop constructs] you can use.&lt;br /&gt;
&lt;br /&gt;
This task will start Zookeeper on the selected nodes:&lt;br /&gt;
    - name: Start Zookeeper&lt;br /&gt;
      ansible.builtin.shell: &lt;br /&gt;
        argv: [&amp;quot;/usr/bin/bash&amp;quot;, &amp;quot;--rcfile&amp;quot;, &amp;quot;/home/ubuntu/.info319&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;zkServer.sh start ${ZOOKEEPER_HOME}/conf/zookeeper.properties&amp;quot;]&lt;br /&gt;
&lt;br /&gt;
=== Install Kafka on the cluster ===&lt;br /&gt;
Again, you already know the Ansible modules that are needed to proceed as in Exercise 4. Each Kafka node needs to know its &#039;&#039;local_ip&#039;&#039;, which you can set like this:&lt;br /&gt;
     - name: Register local_ip expression&lt;br /&gt;
       shell: ip -4 address | grep -o &amp;quot;^ *inet \(.\+\)\/.\+global.*$&amp;quot; | grep -o &amp;quot;[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+&amp;quot; | head -1&lt;br /&gt;
       register: local_ip_expr&lt;br /&gt;
 &lt;br /&gt;
     - name: Set local_ip fact&lt;br /&gt;
       set_fact:&lt;br /&gt;
         local_ip: &amp;quot;{{ local_ip_expr.stdout } }&amp;quot;&lt;br /&gt;
&lt;br /&gt;
It also needs to know its &#039;&#039;id&#039;&#039;, which was written to the file &#039;&#039;/home/ubuntu/volume/zookeeper/data/myid&#039;&#039; earlier (a better location than the &#039;&#039;/tmp/zookeeper/myid&#039;&#039; suggested earlier).&lt;br /&gt;
&lt;br /&gt;
Finally, you can log into &#039;&#039;terraform-driver&#039;&#039; and [[Install Kafka on the cluster#Test run Kafka | test run Spark on top of Kafka as in Exercise 4]]. Congratulations!&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1210</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1210"/>
		<updated>2022-10-31T13:38:50Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Configure SSH */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
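For example (a sketch; the image, flavor, and network names below match the ones used later in this exercise, but check &#039;&#039;&#039;openstack image list&#039;&#039;&#039;, &#039;&#039;&#039;flavor list&#039;&#039;&#039;, and &#039;&#039;&#039;network list&#039;&#039;&#039; for your own project):&lt;br /&gt;
 openstack server create --image &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot; --flavor m1.small --network dualStack openstack-test&lt;br /&gt;
 openstack server delete openstack-test&lt;br /&gt;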
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This can be needed, for example, if you run Terraform in the same shared folder from different local machines; it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. Check this list carefully so you do not permanently delete something critical, such as disks/volumes with important data on them.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out tfplan&#039;&#039;&#039; to save the plan to a file and then apply exactly that plan with &#039;&#039;&#039;terraform apply tfplan&#039;&#039;&#039;. Avoid the &#039;&#039;.tf&#039;&#039; extension for the plan file, since Terraform would otherwise try to parse it as configuration.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to do a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and then restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine. You do not need to run the &#039;&#039;keystonerc.sh&#039;&#039; script, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are now two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] lists all the available resources, but it is easier to list the most relevant ones with &#039;&#039;&#039;openstack &#039;&#039;resource_type&#039;&#039; list&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
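A sketch of such a keypair resource (the resource and key names below are examples; see the linked documentation for the full set of arguments):&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
   name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   public_key = file(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;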
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039;, &#039;&#039;&#039;terraform apply&#039;&#039;&#039;, and sometimes &#039;&#039;&#039;terraform destroy&#039;&#039;&#039; continuously as you build the cluster to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list  # or OS_CLOUD=info319-cluster openstack ...&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.165.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::21f4&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver and workers ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, by adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
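Put together, a worker resource might look like this (a sketch based on the &#039;&#039;terraform-test&#039;&#039; resource above; adjust names, image, and flavor to your project):&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
   count           = 6&lt;br /&gt;
   name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;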
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 jump host.&lt;br /&gt;
&lt;br /&gt;
=== Local Terraform variables ===&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to a variable as &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Log in with SSH and do &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (they are not yet partitioned, formatted, or mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out a file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.59 terraform-driver&lt;br /&gt;
 10.1.2.233 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::27c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::13a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::110d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example (inside the console):&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions (you may not need all of them) are:&lt;br /&gt;
     length(string)&lt;br /&gt;
     join(string, list)&lt;br /&gt;
     concat([element], list)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
         for idx in range(number) :&lt;br /&gt;
             list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
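The &#039;&#039;local.ipv4-hosts-string&#039;&#039; used above is not predefined; one way to build it (a sketch, assuming the worker resources and the locals defined earlier in this exercise) is:&lt;br /&gt;
 locals {&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, [&lt;br /&gt;
         for idx in range(local.num_workers) :&lt;br /&gt;
             &amp;quot;${openstack_compute_instance_v2.terraform-workers[idx].access_ip_v4} ${local.worker_prefix}${idx}&amp;quot;])&lt;br /&gt;
 }&lt;br /&gt;
A line for the driver can be prepended in the same way.&lt;br /&gt;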
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::27c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::13a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::110d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-* localhost&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a wildcard entry similar to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (the &#039;&#039;Include&#039;&#039; directive requires a recent version of OpenSSH).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1209</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1209"/>
		<updated>2022-10-31T13:36:52Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Create hosts files */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
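For example (a sketch; the image, flavor, and network names below match the ones used later in this exercise, but check &#039;&#039;&#039;openstack image list&#039;&#039;&#039;, &#039;&#039;&#039;flavor list&#039;&#039;&#039;, and &#039;&#039;&#039;network list&#039;&#039;&#039; for your own project):&lt;br /&gt;
 openstack server create --image &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot; --flavor m1.small --network dualStack openstack-test&lt;br /&gt;
 openstack server delete openstack-test&lt;br /&gt;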
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This can be needed, for example, if you run Terraform in the same shared folder from different local machines; it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. Check this list carefully so you do not permanently delete something critical, such as disks/volumes with important data on them.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out tfplan&#039;&#039;&#039; to save the plan to a file and then apply exactly that plan with &#039;&#039;&#039;terraform apply tfplan&#039;&#039;&#039;. Avoid the &#039;&#039;.tf&#039;&#039; extension for the plan file, since Terraform would otherwise try to parse it as configuration.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to do a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and then restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine. You do not need to run the &#039;&#039;keystonerc.sh&#039;&#039; script, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are now two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] lists all the available resources, but it is easier to list the most relevant ones with &#039;&#039;&#039;openstack &#039;&#039;resource_type&#039;&#039; list&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
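A sketch of such a keypair resource (the resource and key names below are examples; see the linked documentation for the full set of arguments):&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
   name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   public_key = file(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;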
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039;, &#039;&#039;&#039;terraform apply&#039;&#039;&#039;, and sometimes &#039;&#039;&#039;terraform destroy&#039;&#039;&#039; continuously as you build the cluster to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list  # or OS_CLOUD=info319-cluster openstack ...&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.165.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::21f4&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver and workers ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, by adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
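Put together, a worker resource might look like this (a sketch based on the &#039;&#039;terraform-test&#039;&#039; resource above; adjust names, image, and flavor to your project):&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
   count           = 6&lt;br /&gt;
   name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;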
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 jump host.&lt;br /&gt;
&lt;br /&gt;
=== Local Terraform variables ===&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to a variable as &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Log in with SSH and do &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (they are not yet partitioned, formatted, or mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out a file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.59 terraform-driver&lt;br /&gt;
 10.1.2.233 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::27c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::13a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::110d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example (inside the console):&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions (you may not need all of them) are:&lt;br /&gt;
     length(string)&lt;br /&gt;
     join(string, list)&lt;br /&gt;
     concat([element], list)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
         for idx in range(number) :&lt;br /&gt;
             list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
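* putting these tips together, a &#039;&#039;locals&#039;&#039; block along these lines could build the string to write out (a sketch; the names &#039;&#039;worker-ipv4-lines&#039;&#039; and &#039;&#039;ipv4-hosts-string&#039;&#039; are only suggestions, and the resource names assume the driver and worker resources defined earlier):&lt;br /&gt;
 locals {&lt;br /&gt;
     worker-ipv4-lines = [&lt;br /&gt;
         for idx in range(local.num_workers):&lt;br /&gt;
             &amp;quot;${openstack_compute_instance_v2.terraform-workers[idx].access_ip_v4} ${local.worker_prefix}${idx}&amp;quot;&lt;br /&gt;
     ]&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, concat(&lt;br /&gt;
         [&amp;quot;${openstack_compute_instance_v2.terraform-driver.access_ip_v4} ${local.driver_name}&amp;quot;],&lt;br /&gt;
         local.worker-ipv4-lines))&lt;br /&gt;
 }&lt;br /&gt;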
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1208</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1208"/>
		<updated>2022-10-31T13:32:02Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Create and attach volumes */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This can be needed if you run Terraform in the same shared folder from different local machines, and it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. Check this list carefully, so you do not permanently delete something critical, such as disks/volumes with important data on them.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out plan.tf&#039;&#039;&#039; to save the plan to a file and then run it faster with &#039;&#039;&#039;terraform apply plan.tf&#039;&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to do a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file. &lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine. You do not need to run the &#039;&#039;keystonerc.sh&#039;&#039; script, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are now two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] lists all the available resources, but it is easier to list the most relevant ones with &#039;&#039;&#039;openstack &#039;&#039;resource_type&#039;&#039; list&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
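Following that documentation, the imported keypair might look roughly like this (a sketch; note that Terraform&#039;s &#039;&#039;&#039;file()&#039;&#039;&#039; does not expand &#039;&#039;~&#039;&#039;, so &#039;&#039;&#039;pathexpand()&#039;&#039;&#039; is used):&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
   name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   public_key = file(pathexpand(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;))&lt;br /&gt;
 }&lt;br /&gt;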
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039;, &#039;&#039;&#039;terraform apply&#039;&#039;&#039;, and sometimes &#039;&#039;&#039;terraform destroy&#039;&#039;&#039; continuously as you build the cluster to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list  # or OS_CLOUD=info319-cluster openstack ...&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.165.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::21f4&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver and workers ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, by adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
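Put together, a worker resource might look roughly like this (a sketch; the image, flavour, and key pair are just carried over from the test instance and can be changed):&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
   count           = 6&lt;br /&gt;
   name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;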
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 JumpHost.&lt;br /&gt;
&lt;br /&gt;
=== Local Terraform variables ===&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to a variable as &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
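For example:&lt;br /&gt;
   count = local.num_workers                         # outside a string&lt;br /&gt;
   name  = &amp;quot;${local.worker_prefix}${count.index}&amp;quot;   # inside a string&lt;br /&gt;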
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
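The worker volumes themselves can be created with &#039;&#039;count&#039;&#039; in the same way, for example (a sketch; the 20 GB size is only an example):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
   count = local.num_workers&lt;br /&gt;
   name  = &amp;quot;terraform-worker-volume-${count.index}&amp;quot;&lt;br /&gt;
   size  = 20  # GB&lt;br /&gt;
 }&lt;br /&gt;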
&lt;br /&gt;
Log in with SSH and do &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but they are not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out the file &#039;&#039;ipv4-hosts&#039;&#039;, looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
         for idx in range(number):&lt;br /&gt;
             list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
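* putting these tips together, a &#039;&#039;locals&#039;&#039; block along these lines could build the string to write out (a sketch; the names &#039;&#039;worker-ipv4-lines&#039;&#039; and &#039;&#039;ipv4-hosts-string&#039;&#039; are only suggestions, and the resource names assume the driver and worker resources defined earlier):&lt;br /&gt;
 locals {&lt;br /&gt;
     worker-ipv4-lines = [&lt;br /&gt;
         for idx in range(local.num_workers):&lt;br /&gt;
             &amp;quot;${openstack_compute_instance_v2.terraform-workers[idx].access_ip_v4} ${local.worker_prefix}${idx}&amp;quot;&lt;br /&gt;
     ]&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, concat(&lt;br /&gt;
         [&amp;quot;${openstack_compute_instance_v2.terraform-driver.access_ip_v4} ${local.driver_name}&amp;quot;],&lt;br /&gt;
         local.worker-ipv4-lines))&lt;br /&gt;
 }&lt;br /&gt;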
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1207</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1207"/>
		<updated>2022-10-31T13:31:15Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Spark cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This can be needed if you run Terraform in the same shared folder from different local machines, and it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. Check this list carefully, so you do not permanently delete something critical, such as disks/volumes with important data on them.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out plan.tf&#039;&#039;&#039; to save the plan to a file and then run it faster with &#039;&#039;&#039;terraform apply plan.tf&#039;&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to do a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file. &lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine. You do not need to run the &#039;&#039;keystonerc.sh&#039;&#039; script, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are now two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] lists all the available resources, but it is easier to list the most relevant ones with &#039;&#039;&#039;openstack &#039;&#039;resource_type&#039;&#039; list&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039;, &#039;&#039;&#039;terraform apply&#039;&#039;&#039;, and sometimes &#039;&#039;&#039;terraform destroy&#039;&#039;&#039; continuously as you build the cluster to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list  # or OS_CLOUD=info319-cluster openstack ...&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.165.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::21f4&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver and workers ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, by adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 JumpHost.&lt;br /&gt;
&lt;br /&gt;
=== Local Terraform variables ===&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to a variable as &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
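&lt;br /&gt;
The worker volumes referenced above can be created with a matching resource, for example (a sketch; the &#039;&#039;terraform-worker-volumes&#039;&#039; name and the 20 GB size are only illustrations):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
   count = local.num_workers&lt;br /&gt;
   name  = &amp;quot;${local.worker_prefix}volume-${count.index}&amp;quot;&lt;br /&gt;
   size  = 20  # volume size in GB, adjust to your needs&lt;br /&gt;
 }&lt;br /&gt;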
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but they are not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make OpenStack write out the file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
     	for idx in range(number):&lt;br /&gt;
 	    list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
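* the &#039;&#039;ipv4-hosts-string&#039;&#039; local used above is not defined in this guide; one possible way to build it (a sketch, assuming the workers are created with &#039;&#039;count&#039;&#039; as described earlier):&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, concat(&lt;br /&gt;
         [&amp;quot;${openstack_compute_instance_v2.terraform-driver.access_ip_v4} ${local.driver_name}&amp;quot;],&lt;br /&gt;
         [for idx in range(local.num_workers) :&lt;br /&gt;
             &amp;quot;${openstack_compute_instance_v2.terraform-workers[idx].access_ip_v4} ${local.worker_names[idx]}&amp;quot;]))&lt;br /&gt;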
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
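&lt;br /&gt;
The &#039;&#039;config.terraform-hosts&#039;&#039; file itself can be written by Terraform with another &#039;&#039;local_file&#039;&#039; resource, for example (a sketch; &#039;&#039;ssh-config-string&#039;&#039; is a hypothetical local variable that you must assemble yourself, in the same way as the hosts strings):&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ssh-config-file&amp;quot; {&lt;br /&gt;
     content  = &amp;quot;${local.ssh-config-string}\n&amp;quot;&lt;br /&gt;
     filename = pathexpand(&amp;quot;~/.ssh/config.terraform-hosts&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;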
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1206</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1206"/>
		<updated>2022-10-31T13:23:00Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Test login */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
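&lt;br /&gt;
For example (a sketch; the image, flavor, and network names must match what the list commands above show for your project, and the keypair from Exercise 4 must already be imported):&lt;br /&gt;
 openstack server create --image &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot; \&lt;br /&gt;
     --flavor m1.large --network dualStack \&lt;br /&gt;
     --key-name info319-spark-cluster my-test-instance&lt;br /&gt;
 openstack server delete my-test-instance&lt;br /&gt;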
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This is typically needed when you run Terraform in the same shared folder from different local machines, and it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. Check this list carefully so you do not permanently delete something critical, such as disks/volumes with important data on them.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out plan.tf&#039;&#039;&#039; to save the plan to a file and then run it faster with &#039;&#039;&#039;terraform apply plan.tf&#039;&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to do a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) permissions, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine. You do not need to run the &#039;&#039;keystonerc.sh&#039;&#039; script, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are now two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] lists all the available resources, but it is easier to list the most relevant ones with &#039;&#039;&#039;openstack &#039;&#039;resource_type&#039;&#039; list&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
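&lt;br /&gt;
Following that documentation, the keypair resource could look something like this (a sketch; note that Terraform&#039;s &#039;&#039;file()&#039;&#039; does not expand &#039;&#039;~&#039;&#039;, hence the &#039;&#039;pathexpand()&#039;&#039; call):&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
   name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   public_key = file(pathexpand(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;))&lt;br /&gt;
 }&lt;br /&gt;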
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039;, &#039;&#039;&#039;terraform apply&#039;&#039;&#039;, and sometimes &#039;&#039;&#039;terraform destroy&#039;&#039;&#039; repeatedly as you build the cluster to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list  # or OS_CLOUD=info319-cluster openstack ...&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.165.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::21f4&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, but adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
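&lt;br /&gt;
A worker resource could then look something like this (a sketch; the image and flavor names are carried over from the &#039;&#039;terraform-test&#039;&#039; example and may differ in your project):&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
   count           = 6&lt;br /&gt;
   name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;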
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 JumpHost.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
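&lt;br /&gt;
The worker volumes referenced above can be created with a matching resource, for example (a sketch; the &#039;&#039;terraform-worker-volumes&#039;&#039; name and the 20 GB size are only illustrations):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
   count = local.num_workers&lt;br /&gt;
   name  = &amp;quot;${local.worker_prefix}volume-${count.index}&amp;quot;&lt;br /&gt;
   size  = 20  # volume size in GB, adjust to your needs&lt;br /&gt;
 }&lt;br /&gt;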
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make OpenStack write out the file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
     	for idx in range(number):&lt;br /&gt;
 	    list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
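* the &#039;&#039;ipv4-hosts-string&#039;&#039; local used above is not defined in this guide; one possible way to build it (a sketch, assuming the workers are created with &#039;&#039;count&#039;&#039; as described earlier):&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, concat(&lt;br /&gt;
         [&amp;quot;${openstack_compute_instance_v2.terraform-driver.access_ip_v4} ${local.driver_name}&amp;quot;],&lt;br /&gt;
         [for idx in range(local.num_workers) :&lt;br /&gt;
             &amp;quot;${openstack_compute_instance_v2.terraform-workers[idx].access_ip_v4} ${local.worker_names[idx]}&amp;quot;]))&lt;br /&gt;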
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
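&lt;br /&gt;
The &#039;&#039;config.terraform-hosts&#039;&#039; file itself can be written by Terraform with another &#039;&#039;local_file&#039;&#039; resource, for example (a sketch; &#039;&#039;ssh-config-string&#039;&#039; is a hypothetical local variable that you must assemble yourself, in the same way as the hosts strings):&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ssh-config-file&amp;quot; {&lt;br /&gt;
     content  = &amp;quot;${local.ssh-config-string}\n&amp;quot;&lt;br /&gt;
     filename = pathexpand(&amp;quot;~/.ssh/config.terraform-hosts&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;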
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1205</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1205"/>
		<updated>2022-10-31T13:22:26Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Test login */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
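&lt;br /&gt;
For example (a sketch; the image, flavor, and network names must match what the list commands above show for your project, and the keypair from Exercise 4 must already be imported):&lt;br /&gt;
 openstack server create --image &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot; \&lt;br /&gt;
     --flavor m1.large --network dualStack \&lt;br /&gt;
     --key-name info319-spark-cluster my-test-instance&lt;br /&gt;
 openstack server delete my-test-instance&lt;br /&gt;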
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This is typically needed when you run Terraform in the same shared folder from different local machines, and it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. Check this list carefully so you do not permanently delete something critical, such as disks/volumes with important data on them.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out plan.tf&#039;&#039;&#039; to save the plan to a file and then run it faster with &#039;&#039;&#039;terraform apply plan.tf&#039;&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to do a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) permissions, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine. You do not need to run the &#039;&#039;keystonerc.sh&#039;&#039; script, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are now two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] lists all the available resources, but it is easier to list the most relevant ones with &#039;&#039;&#039;openstack &#039;&#039;resource_type&#039;&#039; list&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
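&lt;br /&gt;
As a sketch (the resource and key names here simply mirror the key pair from Exercise 4), the imported keypair resource could look like this:&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
   name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   public_key = file(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;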
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039; and &#039;&#039;&#039;terraform apply&#039;&#039;&#039; (and sometimes &#039;&#039;&#039;terraform destroy&#039;&#039;&#039;) continuously as you build the cluster to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list  # or OS_CLOUD=info319-cluster openstack ...&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, but adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
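&lt;br /&gt;
As a sketch, a worker resource could look like this (the image, flavor, and key pair names simply repeat those used for &#039;&#039;terraform-test&#039;&#039;; adjust them to your project):&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
   count           = 6&lt;br /&gt;
   name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;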
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 JumpHost.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, a resource block like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
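&lt;br /&gt;
The worker volumes referred to above can be created with a similar counted resource; one possible sketch (the volume size here is only an illustration):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
   count = local.num_workers&lt;br /&gt;
   name  = &amp;quot;${local.worker_prefix}volume-${count.index}&amp;quot;&lt;br /&gt;
   size  = 40  # size in GB&lt;br /&gt;
 }&lt;br /&gt;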
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to check that the volumes are attached where they should be (but not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out the file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
         for idx in range(number):&lt;br /&gt;
             list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
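&lt;br /&gt;
The &#039;&#039;local.ipv4-hosts-string&#039;&#039; used above is not defined in this example; one possible sketch, building on the &#039;&#039;locals&#039;&#039; block from earlier, is:&lt;br /&gt;
 locals {&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, concat(&lt;br /&gt;
         [&amp;quot;${openstack_compute_instance_v2.terraform-driver.access_ip_v4} ${local.driver_name}&amp;quot;],&lt;br /&gt;
         [for idx in range(local.num_workers) :&lt;br /&gt;
             &amp;quot;${openstack_compute_instance_v2.terraform-workers[idx].access_ip_v4} ${local.worker_names[idx]}&amp;quot;]))&lt;br /&gt;
 }&lt;br /&gt;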
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
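&lt;br /&gt;
One way to generate this file is with another &#039;&#039;local_file&#039;&#039; resource. The sketch below assumes you have defined a local list &#039;&#039;all_ipv6s&#039;&#039; holding the driver and worker IPv6 addresses in the same order as &#039;&#039;local.all_names&#039;&#039;:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ssh-hosts-file&amp;quot; {&lt;br /&gt;
     content  = join(&amp;quot;\n&amp;quot;, [for name, addr in zipmap(local.all_names, local.all_ipv6s) :&lt;br /&gt;
         &amp;quot;Host ${name}\n    Hostname ${addr}\n&amp;quot;])&lt;br /&gt;
     filename = pathexpand(&amp;quot;~/.ssh/config.terraform-hosts&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;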
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1204</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1204"/>
		<updated>2022-10-31T13:21:15Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* ~/.config/openstack/clouds.yaml (optional) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
Needing to re-initialise can happen if you run Terraform in the same shared folder from different local machines, and it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. This list is important to check so you do not permanently delete something critical, like disks/volumes with important data on them...&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out tfplan&#039;&#039;&#039; to save the plan to a file and then apply exactly that plan, without re-planning, with &#039;&#039;&#039;terraform apply tfplan&#039;&#039;&#039;. Avoid the &#039;&#039;.tf&#039;&#039; extension for the plan file, since Terraform would otherwise try to parse it as configuration.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to do a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine. You do not need to run the &#039;&#039;keystonerc.sh&#039;&#039; script, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are now two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] lists all the available resources, but it is easier to list the most relevant ones with &#039;&#039;&#039;openstack &#039;&#039;resource_type&#039;&#039; list&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039; and &#039;&#039;&#039;terraform apply&#039;&#039;&#039; (and sometimes &#039;&#039;&#039;terraform destroy&#039;&#039;&#039;) continuously as you build the cluster to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, but adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 JumpHost.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, a resource block like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to check that the volumes are attached where they should be (but not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out the file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
         for idx in range(number):&lt;br /&gt;
             list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1203</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1203"/>
		<updated>2022-10-31T13:20:49Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Spark cluster */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
Needing to re-initialise can happen if you run Terraform in the same shared folder from different local machines, and it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. This list is important to check so you do not permanently delete something critical, like disks/volumes with important data on them...&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out tfplan&#039;&#039;&#039; to save the plan to a file and then apply it faster with &#039;&#039;&#039;terraform apply tfplan&#039;&#039;&#039;. Avoid a &#039;&#039;.tf&#039;&#039; extension for the plan file, since Terraform would otherwise try to parse it as configuration.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to perform a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file. &lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine. You do not need to run the &#039;&#039;keystonerc.sh&#039;&#039; script, and you avoid problems with changed environment variables.&lt;br /&gt;
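&lt;br /&gt;
With a &#039;&#039;clouds.yaml&#039;&#039; in place, you can also point the &#039;&#039;&#039;openstack&#039;&#039;&#039; command itself at it by setting the &#039;&#039;OS_CLOUD&#039;&#039; environment variable, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;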
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] lists all the available resources, but it is easier to list the most relevant ones with &#039;&#039;&#039;openstack &#039;&#039;resource_type&#039;&#039; list&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
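&lt;br /&gt;
A minimal keypair resource might look like this (the resource and key names are only suggestions; adjust the path to where your key actually lives — note that &#039;&#039;file()&#039;&#039; does not expand &#039;&#039;~&#039;&#039;, hence &#039;&#039;pathexpand()&#039;&#039;):&lt;br /&gt;
 # import an existing public SSH key&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
   name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   public_key = file(pathexpand(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;))&lt;br /&gt;
 }&lt;br /&gt;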
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039;, &#039;&#039;&#039;terraform apply&#039;&#039;&#039;, and sometimes &#039;&#039;&#039;terraform destroy&#039;&#039;&#039; continuously as you build the cluster to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, but adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
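&lt;br /&gt;
For example, a worker resource could be sketched like this (the image, flavor, and security groups are simply copied from the test instance above; adapt them as needed):&lt;br /&gt;
 # worker instances&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
   count           = 6&lt;br /&gt;
   name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;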
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 jump host.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
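&lt;br /&gt;
The worker volumes referenced above might themselves be created with a similar &#039;&#039;count&#039;&#039;-based resource, for example like this (the 20 GB size is only an example):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
   count = local.num_workers&lt;br /&gt;
   name  = &amp;quot;terraform-worker-volume-${count.index}&amp;quot;&lt;br /&gt;
   size  = 20  # in GB&lt;br /&gt;
 }&lt;br /&gt;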
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out the file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
     	for idx in range(number):&lt;br /&gt;
 	    list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
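&lt;br /&gt;
Putting these tips together, the &#039;&#039;ipv4-hosts-string&#039;&#039; local used above might be built along these lines (an untested sketch; the resource and attribute names must match your own configuration):&lt;br /&gt;
 locals {&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, concat(&lt;br /&gt;
         [&amp;quot;${openstack_compute_instance_v2.terraform-driver.access_ip_v4} ${local.driver_name}&amp;quot;],&lt;br /&gt;
         [for idx in range(local.num_workers) :&lt;br /&gt;
             &amp;quot;${openstack_compute_instance_v2.terraform-workers[idx].access_ip_v4} ${local.worker_names[idx]}&amp;quot;]))&lt;br /&gt;
 }&lt;br /&gt;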
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1202</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1202"/>
		<updated>2022-10-31T13:10:05Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* ~/.config/openstack/clouds.yaml (optional) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
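&lt;br /&gt;
For example (the image, flavor, and network names come from the list commands above; the instance name is arbitrary):&lt;br /&gt;
 openstack server create --image &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot; \&lt;br /&gt;
     --flavor m1.large --network dualStack my-test-instance&lt;br /&gt;
 openstack server delete my-test-instance&lt;br /&gt;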
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
Re-initialising can be necessary if, for example, you run Terraform in the same shared folder from different local machines; it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. Check this list carefully so you do not permanently delete something critical, such as disks/volumes with important data on them.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out tfplan&#039;&#039;&#039; to save the plan to a file and then apply it faster with &#039;&#039;&#039;terraform apply tfplan&#039;&#039;&#039;. Avoid a &#039;&#039;.tf&#039;&#039; extension for the plan file, since Terraform would otherwise try to parse it as configuration.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to perform a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file. &lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine. You do not need to run the &#039;&#039;keystonerc.sh&#039;&#039; script, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] gives you details.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
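&lt;br /&gt;
A minimal keypair resource might look like this (the resource and key names are only suggestions; adjust the path to where your key actually lives — note that &#039;&#039;file()&#039;&#039; does not expand &#039;&#039;~&#039;&#039;, hence &#039;&#039;pathexpand()&#039;&#039;):&lt;br /&gt;
 # import an existing public SSH key&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
   name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   public_key = file(pathexpand(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;))&lt;br /&gt;
 }&lt;br /&gt;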
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039; and &#039;&#039;&#039;terraform apply&#039;&#039;&#039; continuously to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, but adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
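&lt;br /&gt;
For example, a worker resource could be sketched like this (the image, flavor, and security groups are simply copied from the test instance above; adapt them as needed):&lt;br /&gt;
 # worker instances&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
   count           = 6&lt;br /&gt;
   name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;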
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 jump host.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
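&lt;br /&gt;
The worker volumes referenced above might themselves be created with a similar &#039;&#039;count&#039;&#039;-based resource, for example like this (the 20 GB size is only an example):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
   count = local.num_workers&lt;br /&gt;
   name  = &amp;quot;terraform-worker-volume-${count.index}&amp;quot;&lt;br /&gt;
   size  = 20  # in GB&lt;br /&gt;
 }&lt;br /&gt;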
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out the file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
     	for idx in range(number):&lt;br /&gt;
 	    list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
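&lt;br /&gt;
Putting these tips together, the &#039;&#039;ipv4-hosts-string&#039;&#039; local used above might be built along these lines (an untested sketch; the resource and attribute names must match your own configuration):&lt;br /&gt;
 locals {&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, concat(&lt;br /&gt;
         [&amp;quot;${openstack_compute_instance_v2.terraform-driver.access_ip_v4} ${local.driver_name}&amp;quot;],&lt;br /&gt;
         [for idx in range(local.num_workers) :&lt;br /&gt;
             &amp;quot;${openstack_compute_instance_v2.terraform-workers[idx].access_ip_v4} ${local.worker_names[idx]}&amp;quot;]))&lt;br /&gt;
 }&lt;br /&gt;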
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1201</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1201"/>
		<updated>2022-10-31T13:02:19Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Test run Terraform */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
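One possible shape for this task (the image, flavor and network names below are examples only; pick ones that appear in your own &#039;&#039;&#039;openstack image list&#039;&#039;&#039;, &#039;&#039;&#039;flavor list&#039;&#039;&#039; and &#039;&#039;&#039;network list&#039;&#039;&#039; output):&lt;br /&gt;

```shell
# Example only: create a small test instance, then delete it again.
# Requires the OS_* variables from keystonerc.sh to be set.
openstack server create \
  --image "GOLD Ubuntu 22.04 LTS" \
  --flavor m1.small \
  --network dualStack \
  cli-test

# After checking Compute -> Instances in the NREC Overview:
openstack server delete cli-test
```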
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This can be necessary if, for example, you run Terraform in the same shared folder from different local machines; it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. It is important to check this list so you do not permanently delete something critical, like disks/volumes with important data on them.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
(You can also run &#039;&#039;&#039;terraform plan -out plan.tf&#039;&#039;&#039; to save the plan to a file and then run it faster with &#039;&#039;&#039;terraform apply plan.tf&#039;&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to do a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] gives you details.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039; and &#039;&#039;&#039;terraform apply&#039;&#039;&#039; after each change to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, by adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 JumpHost.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
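The naming scheme can be sanity-checked outside Terraform, too. The following bash sketch (not Terraform code) produces the same names as the &#039;&#039;locals&#039;&#039; block above:&lt;br /&gt;

```shell
# Bash analogue of the Terraform locals: one driver name plus
# worker names terraform-worker-0 .. terraform-worker-5.
cluster_prefix="terraform-"
num_workers=6
driver_name="${cluster_prefix}driver"

worker_names=()
for ((idx = 0; idx < num_workers; idx++)); do
  worker_names+=("${cluster_prefix}worker-${idx}")
done

all_names=("$driver_name" "${worker_names[@]}")
printf '%s\n' "${all_names[@]}"
```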
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Log in with SSH and do &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out the file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
     	for idx in range(number):&lt;br /&gt;
 	    list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
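To preview the expected file format before wiring everything together in Terraform, you can mock it up in plain shell (the names and addresses below are made up):&lt;br /&gt;

```shell
# Made-up "name address" pairs standing in for the Terraform outputs.
cat > /tmp/servers.txt <<'EOF'
terraform-driver 158.37.65.58
terraform-worker-0 10.1.2.234
terraform-worker-1 10.1.2.63
EOF

# hosts files use "address name" order, so swap the two columns.
awk '{ print $2, $1 }' /tmp/servers.txt > /tmp/ipv4-hosts
cat /tmp/ipv4-hosts
```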
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1200</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1200"/>
		<updated>2022-10-31T12:58:05Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* user-data.cfg */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This can be necessary if, for example, you run Terraform in the same shared folder from different local machines; it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. This list is important to check so you do not permanently delete something critical, like disks/volumes.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is sometimes useful to do a few basic initialisation steps already when the instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;, for example:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
To run this script when a new instance is created, add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] gives you details.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039; and &#039;&#039;&#039;terraform apply&#039;&#039;&#039; after each change to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way, by adding a line &lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 JumpHost.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Log in with SSH and do &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out the file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
     	for idx in range(number):&lt;br /&gt;
 	    list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1199</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1199"/>
		<updated>2022-10-31T12:54:38Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Configure Terraform */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This can happen if you run Terraform in the same shared folder from different local machines; it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. This list is important to check so you do not permanently delete something critical, like disks/volumes.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is useful to do a few initialisation steps already when an instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
Add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] gives you details.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
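&lt;br /&gt;
A hedged sketch of the imported keypair (the resource and key names are assumptions; adjust them to your own key):&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
     name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
     public_key = file(pathexpand(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;))&lt;br /&gt;
 }&lt;br /&gt;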
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039; and &#039;&#039;&#039;terraform apply&#039;&#039;&#039; as you go to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way by adding a line&lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
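&lt;br /&gt;
A minimal sketch of such a worker resource, assuming the same image and flavour as the test instance (adjust names and sizes to your project):&lt;br /&gt;
 # worker instances, named terraform-worker-0 ... terraform-worker-5&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
     count           = 6&lt;br /&gt;
     name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
     image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
     flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
     key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
     security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
     user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
     network {&lt;br /&gt;
         name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;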
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 jump host.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
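&lt;br /&gt;
For example, the worker volumes could be created like this (a sketch; the resource name matches the attachment example, while the volume size in GB is an assumption):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
     count = local.num_workers&lt;br /&gt;
     name  = &amp;quot;${local.worker_prefix}volume-${count.index}&amp;quot;&lt;br /&gt;
     size  = 20&lt;br /&gt;
 }&lt;br /&gt;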
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but not yet partitioned, formatted, or mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out a file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
     	for idx in range(number):&lt;br /&gt;
 	    list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
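&lt;br /&gt;
Putting the tips together, a hedged sketch of how &#039;&#039;local.ipv4-hosts-string&#039;&#039; could be built (the &#039;&#039;all_ipv4&#039;&#039; local and the resource names are assumptions; adjust them to your own configuration):&lt;br /&gt;
 locals {&lt;br /&gt;
     all_ipv4 = concat(&lt;br /&gt;
         [openstack_compute_instance_v2.terraform-driver.access_ip_v4],&lt;br /&gt;
         openstack_compute_instance_v2.terraform-workers[*].access_ip_v4)&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, [&lt;br /&gt;
         for idx in range(length(local.all_names)) :&lt;br /&gt;
             &amp;quot;${local.all_ipv4[idx]} ${local.all_names[idx]}&amp;quot;])&lt;br /&gt;
 }&lt;br /&gt;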
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
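&lt;br /&gt;
This can be done with another &#039;&#039;local_file&#039;&#039; resource; a hedged sketch, assuming a &#039;&#039;local.all_ipv6&#039;&#039; list of addresses with the driver first (&#039;&#039;pathexpand&#039;&#039; expands the &#039;&#039;~&#039;&#039;):&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ssh-config-file&amp;quot; {&lt;br /&gt;
     content  = join(&amp;quot;\n&amp;quot;, [&lt;br /&gt;
         for idx in range(length(local.all_names)) :&lt;br /&gt;
             &amp;quot;Host ${local.all_names[idx]}\n    Hostname ${local.all_ipv6[idx]}\n&amp;quot;])&lt;br /&gt;
     filename = pathexpand(&amp;quot;~/.ssh/config.terraform-hosts&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;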
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This wildcard entry is similar to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version for the &#039;&#039;Include&#039;&#039; directive).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1198</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1198"/>
		<updated>2022-10-31T12:54:12Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Configure Terraform */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This can happen if you run Terraform in the same shared folder from different local machines; it is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. This list is important to check so you do not permanently delete something critical, like disks/volumes.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is useful to do a few initialisation steps already when an instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
Add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has emacs installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] gives you details.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
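&lt;br /&gt;
A hedged sketch of the imported keypair (the resource and key names are assumptions; adjust them to your own key):&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
     name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
     public_key = file(pathexpand(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;))&lt;br /&gt;
 }&lt;br /&gt;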
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039; and &#039;&#039;&#039;terraform apply&#039;&#039;&#039; as you go to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way by adding a line&lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
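&lt;br /&gt;
A minimal sketch of such a worker resource, assuming the same image and flavour as the test instance (adjust names and sizes to your project):&lt;br /&gt;
 # worker instances, named terraform-worker-0 ... terraform-worker-5&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
     count           = 6&lt;br /&gt;
     name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
     image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
     flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
     key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
     security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
     user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
     network {&lt;br /&gt;
         name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;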
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 jump host.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
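&lt;br /&gt;
For example, the worker volumes could be created like this (a sketch; the resource name matches the attachment example, while the volume size in GB is an assumption):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
     count = local.num_workers&lt;br /&gt;
     name  = &amp;quot;${local.worker_prefix}volume-${count.index}&amp;quot;&lt;br /&gt;
     size  = 20&lt;br /&gt;
 }&lt;br /&gt;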
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but not yet partitioned, formatted, or mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out a file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
         for idx in range(number) :&lt;br /&gt;
             list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
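&lt;br /&gt;
The &#039;&#039;local.ipv4-hosts-string&#039;&#039; used above could be built along these lines (a sketch; &#039;&#039;local.all_ipv4s&#039;&#039; is a hypothetical list of addresses in the same order as &#039;&#039;local.all_names&#039;&#039;):&lt;br /&gt;
 locals {&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, [&lt;br /&gt;
         for name, ip in zipmap(local.all_names, local.all_ipv4s) :&lt;br /&gt;
             &amp;quot;${ip} ${name}&amp;quot;])&lt;br /&gt;
 }&lt;br /&gt;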
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
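&lt;br /&gt;
This file can be generated the same way as the hosts files. A minimal sketch, assuming a hypothetical &#039;&#039;local.terraform-hosts-config-string&#039;&#039; that already holds the finished &#039;&#039;Host&#039;&#039;/&#039;&#039;Hostname&#039;&#039; blocks:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;terraform-hosts-config-file&amp;quot; {&lt;br /&gt;
     content  = &amp;quot;${local.terraform-hosts-config-string}\n&amp;quot;&lt;br /&gt;
     filename = pathexpand(&amp;quot;~/.ssh/config.terraform-hosts&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;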
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1197</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1197"/>
		<updated>2022-10-31T12:52:47Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Install Terraform */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
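&lt;br /&gt;
For example, a command along these lines should work (check &#039;&#039;&#039;openstack server create --help&#039;&#039;&#039; for the exact options; the image, flavor and instance names here are only examples):&lt;br /&gt;
 openstack server create --image &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot; \&lt;br /&gt;
     --flavor m1.small --network dualStack cli-test&lt;br /&gt;
 openstack server delete cli-test&lt;br /&gt;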
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide]:&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install  software-properties-common gnupg2 curl&lt;br /&gt;
 curl https://apt.releases.hashicorp.com/gpg | gpg --dearmor &amp;gt; hashicorp.gpg&lt;br /&gt;
 sudo install -o root -g root -m 644 hashicorp.gpg /etc/apt/trusted.gpg.d/&lt;br /&gt;
 sudo apt-add-repository &amp;quot;deb [arch=$(dpkg --print-architecture)] https://apt.releases.hashicorp.com focal main&amp;quot;&lt;br /&gt;
 sudo apt update&lt;br /&gt;
 sudo apt install terraform&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. It is important to check this list so that you do not permanently delete something critical, like disks/volumes.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is useful to do a few initialisation steps already when an instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
Add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has &#039;&#039;emacs&#039;&#039; installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] gives you details.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
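&lt;br /&gt;
A sketch of such a keypair resource (the Terraform resource name is just an example):&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
   name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   public_key = file(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;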
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039; and &#039;&#039;&#039;terraform apply&#039;&#039;&#039; as you go to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way by adding a line&lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
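&lt;br /&gt;
Put together, a worker resource might look roughly like this (a sketch; reuse whatever image and flavour you chose for the driver):&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
   count           = 6&lt;br /&gt;
   name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;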
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 JumpHost.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
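&lt;br /&gt;
The worker volumes themselves can be created with &#039;&#039;count&#039;&#039; in the same way. A minimal sketch (the 10 GB size and the volume name are only examples; pick whatever your project quota allows):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
   count = local.num_workers&lt;br /&gt;
   name  = &amp;quot;${local.worker_prefix}volume-${count.index}&amp;quot;&lt;br /&gt;
   size  = 10&lt;br /&gt;
 }&lt;br /&gt;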
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but not yet partitioned, formatted, or mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out the file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
         for idx in range(number) :&lt;br /&gt;
             list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
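&lt;br /&gt;
The &#039;&#039;local.ipv4-hosts-string&#039;&#039; used above could be built along these lines (a sketch; &#039;&#039;local.all_ipv4s&#039;&#039; is a hypothetical list of addresses in the same order as &#039;&#039;local.all_names&#039;&#039;):&lt;br /&gt;
 locals {&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, [&lt;br /&gt;
         for name, ip in zipmap(local.all_names, local.all_ipv4s) :&lt;br /&gt;
             &amp;quot;${ip} ${name}&amp;quot;])&lt;br /&gt;
 }&lt;br /&gt;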
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8301::120d&lt;br /&gt;
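&lt;br /&gt;
This file can be generated the same way as the hosts files. A minimal sketch, assuming a hypothetical &#039;&#039;local.terraform-hosts-config-string&#039;&#039; that already holds the finished &#039;&#039;Host&#039;&#039;/&#039;&#039;Hostname&#039;&#039; blocks:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;terraform-hosts-config-file&amp;quot; {&lt;br /&gt;
     content  = &amp;quot;${local.terraform-hosts-config-string}\n&amp;quot;&lt;br /&gt;
     filename = pathexpand(&amp;quot;~/.ssh/config.terraform-hosts&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;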
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This is a similar wildcard entry to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (you need a recent SSH version to do this).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
	<entry>
		<id>http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1196</id>
		<title>Create Spark cluster using Terraform</title>
		<link rel="alternate" type="text/html" href="http://info319.wiki.uib.no/index.php?title=Create_Spark_cluster_using_Terraform&amp;diff=1196"/>
		<updated>2022-10-31T12:47:38Z</updated>

		<summary type="html">&lt;p&gt;Sinoa: /* Configure OpenStack for command line */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== OpenStack ==&lt;br /&gt;
=== Install OpenStack ===&lt;br /&gt;
On your local machine, create an &#039;&#039;exercise-5&#039;&#039; folder, for example a subfolder of &#039;&#039;info319-exercises&#039;&#039;, and &#039;&#039;&#039;cd&#039;&#039;&#039; into it.&lt;br /&gt;
&lt;br /&gt;
Install &#039;&#039;openstackclient&#039;&#039;. There are several ways to do this. On Ubuntu Linux, you can do:&lt;br /&gt;
 sudo apt install python3-openstackclient&lt;br /&gt;
(You may have to reinstall &#039;&#039;python3-six&#039;&#039; and &#039;&#039;python3-urllib3&#039;&#039;. You may also need &#039;&#039;python3-dev&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Other guides suggest you install it as a Python package (using a virtual environment if you want):&lt;br /&gt;
 pip install python-openstackclient&lt;br /&gt;
&lt;br /&gt;
=== Configure OpenStack for command line ===&lt;br /&gt;
If you want to run OpenStack from the command line, create the file &#039;&#039;keystonerc.sh&#039;&#039; on your local machine:&lt;br /&gt;
 touch keystonerc.sh&lt;br /&gt;
 chmod 0600 keystonerc.sh&lt;br /&gt;
&lt;br /&gt;
Use this as a template:&lt;br /&gt;
 export OS_USERNAME=YOUR_USER_NAME@uib.no&lt;br /&gt;
 export OS_PROJECT_NAME=uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
 export OS_PASSWORD=g3...YOUR_PASSWORD...Qb&lt;br /&gt;
 export OS_AUTH_URL=https://api.nrec.no:5000/v3&lt;br /&gt;
 export OS_IDENTITY_API_VERSION=3&lt;br /&gt;
 export OS_USER_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_PROJECT_DOMAIN_NAME=dataporten&lt;br /&gt;
 export OS_REGION_NAME=YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
 export OS_INTERFACE=public&lt;br /&gt;
 export OS_NO_CACHE=1&lt;br /&gt;
 export OS_TENANT_NAME=$OS_PROJECT_NAME&lt;br /&gt;
&lt;br /&gt;
=== Test run OpenStack from command line ===&lt;br /&gt;
Test with:&lt;br /&gt;
 . keystonerc.sh&lt;br /&gt;
 openstack server list&lt;br /&gt;
(It can be quite slow.)&lt;br /&gt;
&lt;br /&gt;
Other test commands:&lt;br /&gt;
 openstack image list&lt;br /&gt;
 openstack flavor list&lt;br /&gt;
 openstack network list&lt;br /&gt;
 openstack keypair list&lt;br /&gt;
 openstack security group list&lt;br /&gt;
&#039;&#039;&#039;openstack --help&#039;&#039;&#039; lists all the possible commands. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039; Try to create an instance with &#039;&#039;&#039;openstack server create ...&#039;&#039;&#039;, then delete it. You can use the NREC Overview to see the results.&lt;br /&gt;
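&lt;br /&gt;
For example, a command along these lines should work (check &#039;&#039;&#039;openstack server create --help&#039;&#039;&#039; for the exact options; the image, flavor and instance names here are only examples):&lt;br /&gt;
 openstack server create --image &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot; \&lt;br /&gt;
     --flavor m1.small --network dualStack cli-test&lt;br /&gt;
 openstack server delete cli-test&lt;br /&gt;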
&lt;br /&gt;
== Terraform ==&lt;br /&gt;
=== Install Terraform ===&lt;br /&gt;
Install Terraform along the lines of [https://computingforgeeks.com/how-to-install-terraform-on-ubuntu/ this guide].&lt;br /&gt;
&lt;br /&gt;
=== Configure Terraform ===&lt;br /&gt;
Create a configuration file &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # configure the OpenStack provider&lt;br /&gt;
 terraform {&lt;br /&gt;
   required_providers {&lt;br /&gt;
     openstack = {&lt;br /&gt;
       source = &amp;quot;terraform-provider-openstack/openstack&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Initialise Terraform with:&lt;br /&gt;
 terraform init&lt;br /&gt;
This should install the provider that connects Terraform with OpenStack. You may sometimes need to re-initialise Terraform, for example with:&lt;br /&gt;
 terraform init -upgrade&lt;br /&gt;
This is usually unproblematic.&lt;br /&gt;
&lt;br /&gt;
=== Test run Terraform ===&lt;br /&gt;
Append something like this to &#039;&#039;info319-cluster.tf&#039;&#039;:&lt;br /&gt;
 # test instance&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-test&amp;quot; {&lt;br /&gt;
   name            = &amp;quot;terraform-test&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;dualStack&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Try to run Terraform with:&lt;br /&gt;
 terraform plan&lt;br /&gt;
&#039;&#039;&#039;terraform plan&#039;&#039;&#039; lists everything &#039;&#039;terraform&#039;&#039; will do if you &#039;&#039;apply&#039;&#039; it. It is important to check this list so that you do not permanently delete something critical, like disks/volumes.&lt;br /&gt;
&lt;br /&gt;
To create the new test instance:&lt;br /&gt;
 terraform apply  # answer &#039;yes&#039; to approve&lt;br /&gt;
Check &#039;&#039;Compute -&amp;gt; Instances&#039;&#039; in the &#039;&#039;NREC Overview&#039;&#039; to see the new instance appear.&lt;br /&gt;
&lt;br /&gt;
=== user-data.cfg ===&lt;br /&gt;
Later, we will use a tool called Ansible to install software and otherwise configure the new instances. But it is useful to do a few initialisation steps already when an instance is created. To do this, you can create an initialisation script called something like &#039;&#039;user-data.cfg&#039;&#039;:&lt;br /&gt;
 #cloud-config&lt;br /&gt;
 &lt;br /&gt;
 apt_upgrade: true&lt;br /&gt;
 &lt;br /&gt;
 packages:&lt;br /&gt;
 - emacs&lt;br /&gt;
 &lt;br /&gt;
 power_state:&lt;br /&gt;
   delay: &amp;quot;+3&amp;quot;&lt;br /&gt;
   mode: reboot&lt;br /&gt;
This script upgrades Ubuntu, waits 3 minutes to give the upgrade time to finish, and restarts the new instance. (It also installs the &#039;&#039;emacs&#039;&#039; text editor.)&lt;br /&gt;
&lt;br /&gt;
Add the line&lt;br /&gt;
   user_data       = file(&amp;quot;user-data.cfg&amp;quot;)&lt;br /&gt;
to &#039;&#039;info319-cluster.tf&#039;&#039; and run &#039;&#039;&#039;terraform apply&#039;&#039;&#039;. After a few minutes, log in to check that the new instance now has &#039;&#039;emacs&#039;&#039; installed. (If you try to connect too early, you will receive a &#039;&#039;Connection closed by UNKNOWN port 65535&#039;&#039; message.)&lt;br /&gt;
&lt;br /&gt;
=== ~/.config/openstack/clouds.yaml (optional) ===&lt;br /&gt;
Instead of keeping the OpenStack configuration (which Terraform also needs) as environment variables (&#039;&#039;OS_*&#039;&#039;) defined in &#039;&#039;keystonerc.sh&#039;&#039;, you can keep it in a file.&lt;br /&gt;
&lt;br /&gt;
On your local machine, create ~/.config/openstack/clouds.yaml with &#039;&#039;0600&#039;&#039; (or &#039;&#039;-rw-------&#039;&#039;) access, for example:&lt;br /&gt;
 clouds:&lt;br /&gt;
   info319-cluster:&lt;br /&gt;
     auth:&lt;br /&gt;
       auth_url: https://api.nrec.no:5000/v3&lt;br /&gt;
       project_name: uib-info-YOUR_NREC_PROJECT&lt;br /&gt;
       username: YOUR_USER_NAME@uib.no&lt;br /&gt;
       password: g3...YOUR_PASSWORD...Qb&lt;br /&gt;
       user_domain_name: dataporten&lt;br /&gt;
       project_domain_name: dataporten&lt;br /&gt;
     identity_api_version: 3&lt;br /&gt;
     region_name: YOUR_REGION_EITHER_bgo_OR_osl&lt;br /&gt;
     interface: public&lt;br /&gt;
     operation_log:&lt;br /&gt;
       logging: TRUE&lt;br /&gt;
       file: openstackclient_admin.log&lt;br /&gt;
       level: info&lt;br /&gt;
&lt;br /&gt;
You need to change your &#039;&#039;info319-cluster.tf&#039;&#039; file from&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
 }&lt;br /&gt;
to&lt;br /&gt;
 provider &amp;quot;openstack&amp;quot; {&lt;br /&gt;
   cloud  = &amp;quot;info319-cluster&amp;quot;  # defined in a clouds.yaml file&lt;br /&gt;
 }&lt;br /&gt;
The advantage of this setup is that it is easier to manage multiple clusters from the same local machine, and you avoid problems with changed environment variables.&lt;br /&gt;
&lt;br /&gt;
== Spark cluster ==&lt;br /&gt;
Use Terraform to create a Spark cluster, similar to the one in Exercise 4. The &#039;&#039;Resources&#039;&#039; menu in the left pane of [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs this page] gives you details.&lt;br /&gt;
&lt;br /&gt;
=== Create or import a key pair ===&lt;br /&gt;
Your &#039;&#039;terraform-test&#039;&#039; instance still has no keypair. To add a keypair to &#039;&#039;info319-cluster.tf&#039;&#039;, use the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_keypair_v2 documentation here] to import the public &#039;&#039;~/.ssh/info319-spark-cluster.pub&#039;&#039; SSH key from before. &lt;br /&gt;
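&lt;br /&gt;
A sketch of such a keypair resource (the Terraform resource name is just an example):&lt;br /&gt;
 resource &amp;quot;openstack_compute_keypair_v2&amp;quot; &amp;quot;info319-spark-cluster&amp;quot; {&lt;br /&gt;
   name       = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   public_key = file(&amp;quot;~/.ssh/info319-spark-cluster.pub&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;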
&lt;br /&gt;
Add the keypair resource to the &#039;&#039;terraform-test&#039;&#039; resource with a line like this:&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Rerun &#039;&#039;&#039;terraform plan&#039;&#039;&#039; and &#039;&#039;&#039;terraform apply&#039;&#039;&#039; as you go to check that things work.&lt;br /&gt;
&lt;br /&gt;
=== Test login ===&lt;br /&gt;
Use&lt;br /&gt;
 openstack server list&lt;br /&gt;
to find the IPv4 and IPv6 addresses of &#039;&#039;terraform-test&#039;&#039;. Log in with for example:&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster ubuntu@158.37.65.222&lt;br /&gt;
and&lt;br /&gt;
 ssh -i ~/.ssh/info319-spark-cluster -J YOUR_USERNAME@login.uib.no ubuntu@2001:700:2:8300::22f4&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; There are two ways to run &#039;&#039;&#039;openstack&#039;&#039;&#039;. One assumes that all the &#039;&#039;OS_*&#039;&#039; environment variables defined in &#039;&#039;keystonerc.sh&#039;&#039; are set. The other assumes that there is a &#039;&#039;~/.config/openstack/clouds.yaml&#039;&#039; file and that &#039;&#039;OS_CLOUD&#039;&#039; is set, for example:&lt;br /&gt;
 OS_CLOUD=info319-cluster openstack server list&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark driver ===&lt;br /&gt;
The &#039;&#039;terraform-test&#039;&#039; resource can be renamed to for example &#039;&#039;terraform-driver&#039;&#039; and modified as you like according to the [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2 documentation here].&lt;br /&gt;
&lt;br /&gt;
=== Create the Spark workers ===&lt;br /&gt;
You can create multiple &#039;&#039;terraform-worker-&#039;&#039; resources in a similar way by adding a line&lt;br /&gt;
 count            = 6&lt;br /&gt;
and using &#039;&#039;&#039;${count.index}&#039;&#039;&#039; as a variable inside the worker &#039;&#039;name&#039;&#039;. Use &#039;&#039;IPv6&#039;&#039; instead of &#039;&#039;dualStack&#039;&#039;.&lt;br /&gt;
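&lt;br /&gt;
Put together, a worker resource might look roughly like this (a sketch; reuse whatever image and flavour you chose for the driver):&lt;br /&gt;
 resource &amp;quot;openstack_compute_instance_v2&amp;quot; &amp;quot;terraform-workers&amp;quot; {&lt;br /&gt;
   count           = 6&lt;br /&gt;
   name            = &amp;quot;terraform-worker-${count.index}&amp;quot;&lt;br /&gt;
   image_name      = &amp;quot;GOLD Ubuntu 22.04 LTS&amp;quot;&lt;br /&gt;
   flavor_name     = &amp;quot;m1.large&amp;quot;&lt;br /&gt;
   key_pair        = &amp;quot;info319-spark-cluster&amp;quot;&lt;br /&gt;
   security_groups = [&amp;quot;default&amp;quot;, &amp;quot;info319-spark-cluster&amp;quot;]&lt;br /&gt;
   network {&lt;br /&gt;
     name = &amp;quot;IPv6&amp;quot;&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;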
&lt;br /&gt;
Check that you can log in to the new instances using &#039;&#039;&#039;ssh&#039;&#039;&#039; and an IPv6 JumpHost.&lt;br /&gt;
&lt;br /&gt;
As &#039;&#039;info319-cluster.tf&#039;&#039; grows larger, you can ensure consistency by defining local variables in the beginning of the file, for example:&lt;br /&gt;
 locals {&lt;br /&gt;
     cluster_prefix = &amp;quot;terraform-&amp;quot;&lt;br /&gt;
     num_workers = 6&lt;br /&gt;
     driver_name = &amp;quot;${local.cluster_prefix}driver&amp;quot;&lt;br /&gt;
     worker_prefix = &amp;quot;${local.cluster_prefix}worker-&amp;quot;&lt;br /&gt;
     worker_names = [&lt;br /&gt;
         for idx in range(local.num_workers) : &lt;br /&gt;
             &amp;quot;${local.worker_prefix}${idx}&amp;quot;]&lt;br /&gt;
     all_names = concat([local.driver_name], local.worker_names)&lt;br /&gt;
 }&lt;br /&gt;
Inside strings, you can refer to the variables using &#039;&#039;${local.var_name}&#039;&#039;. Outside of strings, you can write just &#039;&#039;local.var_name&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Create and attach volumes ===&lt;br /&gt;
Follow the guide here to [https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/compute_instance_v2#instance-with-attached-volume attach volumes to the instances]. &lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;local.num_workers&#039;&#039; and &#039;&#039;count=&#039;&#039; both to create multiple volumes and to attach them. For example, an expression like this can be used to connect all the worker volumes (after you have created them) to their respective workers:&lt;br /&gt;
 resource &amp;quot;openstack_compute_volume_attach_v2&amp;quot; &amp;quot;attached-to-workers&amp;quot; {&lt;br /&gt;
   count       = local.num_workers&lt;br /&gt;
   instance_id = &amp;quot;${openstack_compute_instance_v2.terraform-workers[count.index].id}&amp;quot;&lt;br /&gt;
   volume_id   = &amp;quot;${openstack_blockstorage_volume_v2.terraform-worker-volumes[count.index].id}&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
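&lt;br /&gt;
The worker volumes themselves can be created the same way; a sketch (the volume &#039;&#039;size&#039;&#039; in GB is just an assumption, pick one that fits your needs):&lt;br /&gt;
 resource &amp;quot;openstack_blockstorage_volume_v2&amp;quot; &amp;quot;terraform-worker-volumes&amp;quot; {&lt;br /&gt;
   count = local.num_workers&lt;br /&gt;
   name  = &amp;quot;${local.worker_prefix}volume-${count.index}&amp;quot;&lt;br /&gt;
   size  = 10&lt;br /&gt;
 }&lt;br /&gt;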
&lt;br /&gt;
Log in with SSH and run &#039;&#039;&#039;ls /dev&#039;&#039;&#039; to see that the volumes are attached where they should be (but not yet partitioned, formatted, and mounted).&lt;br /&gt;
&lt;br /&gt;
=== Create &#039;&#039;hosts&#039;&#039; files ===&lt;br /&gt;
Make Terraform write out a file &#039;&#039;ipv4-hosts&#039;&#039; looking like this:&lt;br /&gt;
 158.37.65.58 terraform-driver&lt;br /&gt;
 10.1.2.234 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 10.1.2.63 terraform-worker-5&lt;br /&gt;
and &#039;&#039;ipv6-hosts&#039;&#039; looking like this:&lt;br /&gt;
 2001:700:2:8300::28c4 terraform-driver&lt;br /&gt;
 2001:700:2:8301::14a1 terraform-worker-0&lt;br /&gt;
 ...&lt;br /&gt;
 2001:700:2:8310::120d terraform-worker-5&lt;br /&gt;
 &lt;br /&gt;
Tips:&lt;br /&gt;
* you can define outputs from Terraform like this:&lt;br /&gt;
 output &amp;quot;terraform-driver-ipv4&amp;quot; {&lt;br /&gt;
     value = openstack_compute_instance_v2.terraform-driver.access_ip_v4&lt;br /&gt;
 }&lt;br /&gt;
* you can use the &#039;&#039;&#039;[https://www.terraform.io/cli/commands/console terraform console]&#039;&#039;&#039; to explore expressions you can use to define outputs and locals, for example:&lt;br /&gt;
 openstack_compute_instance_v2.terraform-workers[5]&lt;br /&gt;
* when an output works as expected, you can redefine it as a local variable and use it to define more complex outputs&lt;br /&gt;
* a few useful functions are:&lt;br /&gt;
     concat([local.element], local.list)&lt;br /&gt;
     length(string)&lt;br /&gt;
     substr(string, offset, length)&lt;br /&gt;
     zipmap(list1, list2)&lt;br /&gt;
* you can create lists using this syntax:&lt;br /&gt;
     terraform-workers-ipv6 = [&lt;br /&gt;
         for idx in range(number) :&lt;br /&gt;
             list[idx].attribute&lt;br /&gt;
     ]&lt;br /&gt;
* you can output to a file like this:&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ipv4-hosts-file&amp;quot; {&lt;br /&gt;
     content = &amp;quot;${local.ipv4-hosts-string}\n&amp;quot;&lt;br /&gt;
     filename = &amp;quot;ipv4-hosts&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
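* for example, &#039;&#039;local.ipv4-hosts-string&#039;&#039; could be built like this (assuming &#039;&#039;local.all_ipv4s&#039;&#039; is a list of IPv4 addresses in the same order as &#039;&#039;local.all_names&#039;&#039;; note that iterating over a map sorts the host names alphabetically):&lt;br /&gt;
 locals {&lt;br /&gt;
     ipv4-hosts-string = join(&amp;quot;\n&amp;quot;, [&lt;br /&gt;
         for name, ip in zipmap(local.all_names, local.all_ipv4s) :&lt;br /&gt;
             &amp;quot;${ip} ${name}&amp;quot;&lt;br /&gt;
     ])&lt;br /&gt;
 }&lt;br /&gt;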
&lt;br /&gt;
=== Configure SSH ===&lt;br /&gt;
On your local host, change &#039;&#039;info319-cluster.tf&#039;&#039; so it also writes a file like this to &#039;&#039;~/.ssh/config.terraform-hosts&#039;&#039;:&lt;br /&gt;
 Host terraform-driver&lt;br /&gt;
     Hostname 2001:700:2:8300::28c4&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-0&lt;br /&gt;
     Hostname 2001:700:2:8301::14a1&lt;br /&gt;
 &lt;br /&gt;
 ...&lt;br /&gt;
 &lt;br /&gt;
 Host terraform-worker-5&lt;br /&gt;
     Hostname 2001:700:2:8310::120d&lt;br /&gt;
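&lt;br /&gt;
One way to sketch this is another &#039;&#039;local_file&#039;&#039; resource, analogous to the hosts files above (assuming &#039;&#039;local.all_ipv6s&#039;&#039; is a list of IPv6 addresses ordered like &#039;&#039;local.all_names&#039;&#039;):&lt;br /&gt;
 resource &amp;quot;local_file&amp;quot; &amp;quot;ssh-config-file&amp;quot; {&lt;br /&gt;
     content  = join(&amp;quot;\n\n&amp;quot;, [&lt;br /&gt;
         for name, ip in zipmap(local.all_names, local.all_ipv6s) :&lt;br /&gt;
             &amp;quot;Host ${name}\n    Hostname ${ip}&amp;quot;&lt;br /&gt;
     ])&lt;br /&gt;
     filename = pathexpand(&amp;quot;~/.ssh/config.terraform-hosts&amp;quot;)&lt;br /&gt;
 }&lt;br /&gt;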
&lt;br /&gt;
On your local host, add these lines to &#039;&#039;~/.ssh/config&#039;&#039;:&lt;br /&gt;
 Host terraform-*&lt;br /&gt;
      User ubuntu&lt;br /&gt;
      IdentityFile ~/.ssh/info319-spark-cluster&lt;br /&gt;
      ProxyJump sinoa@login.uib.no&lt;br /&gt;
      StrictHostKeyChecking no&lt;br /&gt;
      UserKnownHostsFile /dev/null&lt;br /&gt;
&lt;br /&gt;
 Include ./config.terraform-hosts&lt;br /&gt;
This wildcard entry is similar to the one in Exercise 4, but this time the host names and IPv6 addresses are included from an external file (the &#039;&#039;Include&#039;&#039; directive requires OpenSSH 7.3 or newer).&lt;br /&gt;
&lt;br /&gt;
You can now log into all the cluster machines by name, even after you increase the number of worker nodes:&lt;br /&gt;
 ssh terraform-worker-4&lt;/div&gt;</summary>
		<author><name>Sinoa</name></author>
	</entry>
</feed>