Sessions: Difference between revisions

Latest revision as of 21:16, 10 November 2022

Tentative themes for each session

Thursday August 18th: Introduction meeting File:IntroductionMeeting.pdf
Thursday September 1st: Session 1 - Introduction to big data. Big-data processing. Spark
Thursday September 15th: Session 2 - More about Spark. Data sources. Twitter
Thursday September 29th: Session 3 - Streaming Spark. Big-data architectures. Kafka
Thursday October 13th: Session 4 - Cloud computing. NREC and OpenStack
Thursday October 27th: Session 5 - Cloud management. Terraform and Ansible. Docker and Kubernetes
Thursday November 10th: Session 6 - Societal issues. Privacy. GDPR
Thursday November 24th: Session 7 - Essay presentations
Thursday December 8th: Session 8 - Project demonstrations

Session 1 - Introduction to big data. Big-data processing. Spark

Kitchin, chapters 1, 4-5
Chambers & Zaharia, chapters 1-3, 12, 15
Slides: File:S01-BigData-published.pdf File:S01-Spark-published.pdf

Supplementary:

Section 1 in Opdahl, A. L., & Nunavath, V. (2020). Big Data. Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data, 15-29. Paper
Spark 3.3.0 Overview and Quick Start (with Python examples)

Session 2 - More about Spark. Data sources. Twitter

Chambers & Zaharia, chapters 4-9 (chapter 10 on SQL is also very relevant)
Kitchin, chapter 3
Slides: File:S02-OrganisationINFO319-published.pdf File:S02-DataSources-published.pdf File:DanielRosnes-Introduction-to-Tweepy-and-Twitter-API-2.0.pdf File:S02-MoreSpark-published.pdf

Guest presentation: Daniel Rosnes on using Twitter data for the news: Introduction to Twitter API v2 and Tweepy

Supplementary:

Chambers & Zaharia, chapter 10 (perhaps mandatory too)
Twitter API v2
Tweepy: Twitter for Python
Tweepy Documentation

Session 3 - Streaming Spark. Big-data architectures. Kafka

Chambers & Zaharia, chapters 20-21
Gallofré, M., Opdahl, A. L., Stoppel, S., Tessem, B., & Veres, C. (2021). The News Angler Project: Exploring the Next Generation of Journalistic Knowledge Platforms. In Proceedings of Norsk IKT-konferanse for forskning og utdanning. Short Paper and poster: File:A1-Poster-NIKT2021.pdf
Kafka Introduction
Slides: File:S03-StreamingSpark-published.pdf File:S03-MoreSpark-published.pdf File:S03-Kafka-published.pdf File:S03-ResearchMethod-published.pdf

The Guest Talk on architectures and the News Hunter platform is postponed to a later session.

Supplementary:

kafka-python API
Structured Streaming Spark Programming Guide
News Hunter:
- Berven, A., Christensen, O. A., Moldeklev, S., Opdahl, A. L., & Villanger, K. J. (2020). A knowledge-graph platform for newsrooms. Computers in Industry, 123, 103321. Paper
- Opdahl, A. L., & Tessem, B. (2021). Ontologies for finding journalistic angles. Software and Systems Modeling, 20(1), 71-87. Paper
Design science research method:
- Design Science in Information Systems Research by Alan R. Hevner, Salvatore T. March, Jinsoo Park and Sudha Ram. MIS Quarterly 28(1):75-105, March 2004. (You need to be on UiB's network to access the link - I have uploaded it under Files in mitt.uib.no, but it may soon be deleted from there...)
- Hevner, A. R. (2007). A three cycle view of design science research. Scandinavian journal of information systems, 19(2), 4. File:Hevner2007-ThreeCycleView-SJIS.pdf

Session 4 - Cloud computing. NREC and OpenStack

NREC and OpenStack, the following sections/pages: Introduction, Project application, Logging in, The dashboard, Create a Linux virtual machine (skip: Windows), Using SSH, Working with Security Groups, Create and manage volumes, Create and manage snapshots (skip: images), Instance console
Slides: File:S04-OpenStack-published.pdf File:S04-UbuntuLinux-published.pdf

Guest presentation: Sohail Khan on computer vision and deep networks for image analysis. His slides and demo code are uploaded to mitt.uib.no under Files (because of size and file-type limitations).

There are few readings for this session because this is where we start running Spark in a cluster, and the practical work will take some time. Computer vision and deep networks for image analysis are not a mandatory part of the course, but you may want to use them in your projects. Sohail's presentation will include suggestions for further reading.

Session 5 - Cloud management. Terraform and Ansible.

TerraForm and NREC part I, part II, and part III
How Ansible Works and the Ansible Community portal
Slides: File:S05-Terraform-Ansible-published.pdf File:S05-NewsAngler-published.pdf

Guest presentation: Marc Gallofré Ocaña on the News Hunter platform and its big-data ready architecture. Slides: File:MarcGallofre-BigDataArchitecture.pdf

Comment: Hopefully, we can introduce Docker and Kubernetes in later sessions.

Session 6 - Societal issues. Privacy. GDPR

Kitchin, chapters 13-14 and 17-19
What is GDPR, the EU’s new data protection law?
Slides: File:S06-Privacy.pdf

Guest presentation: Ghazaal Sheiki on fact checking. Slides: File:GhazaalSheiki-AutomatedFactChecking.pdf

Supplementary:

Kitchin, chapters 12 and 15-16 are also recommended reading
EU's General Data Protection Regulation (GDPR) - the official legal text

@@ Line 1: / Line 1: @@
-This is the schedule for INFO319 in the autumn of 2018. The dates should be fixed at this stage, but the themes may change a little.
+== Tentative themes for each session ==
+* Thursday August 18th: Introduction meeting [[File:IntroductionMeeting.pdf]]
+* Thursday September 1st: Session 1 - Introduction to big data. Big-data processing. Spark
+* Thursday September 15th: Session 2 - More about Spark. Data sources. Twitter
+* Thursday September 29th: Session 3 - Streaming Spark. Big-data architectures. Kafka
+* Thursday October 13th: Session 4 - Cloud computing. NREC and OpenStack
+* Thursday October 27th: Session 5 - Cloud management. Terraform and Ansible. Docker and Kubernetes
+* Thursday November 10th: Session 6 - Societal issues. Privacy. GDPR
+* Thursday November 24th: Session 7 - Essay presentations
+* Thursday December 8th: Session 8 - Project demonstrations
-== Dates and Tentative Themes ==
+== Session 1 - Introduction to big data. Big-data processing. Spark ==
-* Monday 2019-08-19 1015: [[Information meeting]]
+* Kitchin, chapters 1, 4-5
-* Tuesday 2019-08-20 1015: [[Session 1 - Introduction to INFO319 and Emergency Management]]
+* Chambers & Zaharia, chapters 1-3, 12, 15
-* Tuesday 2019-08-20 1400: [[Practical session, trying emergency response tool (Ushahidi)]]
+* Slides: [[File:S01-BigData-published.pdf]] [[File:S01-Spark-published.pdf]]
-* Wednesday 2019-08-21 1015: [[Session 2 - Big Data for emergency management]]
-* Wednesday 2019-08-21 1400: [[Practical session, introduction to Spark]],[[Essay | Essay optional proposal]] (email to [mailto:vimala.nunavath@uia.no vimala.nunavath@uia.no])
-* Monday 2019-09-09 1015: [[Essay | essay topic selection deadline and topic presentation]] (email to [mailto:vimala.nunavath@uia.no vimala.nunavath@uia.no]) and [[Student group programming project | Optional project proposal]] (email to [mailto:vimala.nunavath@uia.no vimala.nunavath@uia.no])
-* Thursday 2019-09-19 1015: [[Session 3 - Emergency data sources]]
-* Thursday 2019-09-19 1400: [[Practical session, using Spark for emergency datasources]]
-* Friday 2019-09-20 1015: [[Session 4 - Sensors/IoT for Emergency Management]]
-* Thursday 2019-10-03 1400:[[Programming project | Project proposal deadline]] (email to [mailto:vimala.nunavath@uia.no vimala.nunavath@uia.no])
-* Thursday 2019-10-03 1015: [[Session 5 - Social Media for Emergency Management]]
-* Thursday 2019-10-03 1400: [[Practical session, Spark streaming for Twitter data analysis]]
-* Friday 2019-10-04 1015: [[Session 6 - Machine learning and NLP for EM]]
-* Friday 2019-10-04 1400: [[Practical session, Spark streaming for sentiment analysis]]
-* Thursday 2019-10-24 1015: [[Session 7 - Visualization/Dashboards for Emergency Management (Guest lecture)]]
-* Friday 2019-10-25 1015: [[Session 8 - Visualization/Dashboards for Emergency Management (Guest lecture)]]
-* Thursday 2019-12-05 1015: [[Session 9 - Essay presentations (mandatory)]]
-* Friday 2019-12-06 1015: [[Session 10 - Student programming project presentations (mandatory)]]
-* Wednesday 2019-12-04 1400: [[Essay | Essay deadline]] (submit PDF-file through Inspera)
-* Tuesday 2019-12-13 1400: [[Programming project | Project deadline]] (submit ZIP-file through Inspera)
-[https://tp.uio.no/uib/timeplan/timeplan.php?id=INFO319&type=course&sort=week&sem=18h&lang=en.]
+Supplementary:
+* Section 1 in Opdahl, A. L., & Nunavath, V. (2020). Big Data. Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data, 15-29. [https://link.springer.com/chapter/10.1007/978-3-030-48099-8_2 Paper]
+* Spark 3.3.0  [https://spark.apache.org/docs/latest/overview.html Overview] and [https://spark.apache.org/docs/latest/quick-start.html Quick Start (with Python examples)]
-== Locations ==
+== Session 2 - More about Spark. Data sources. Twitter ==
-* '''Information meeting:''' Lauritz Meltzers hus (the Social Science / SV building), seminar room 548 (6th floor)
+* Chambers & Zaharia, chapters 4-9 (chapter 10 on SQL is also very relevant)
-* '''Regular sessions:''' Lauritz Meltzers hus (the Social Science / SV building), seminar room 548 (6th floor)
+* Kitchin, chapter 3
+* Slides: [[File:S02-OrganisationINFO319-published.pdf]] [[File:S02-DataSources-published.pdf]] [[File:DanielRosnes-Introduction-to-Tweepy-and-Twitter-API-2.0.pdf]] [[File:S02-MoreSpark-published.pdf]]
-== Format ==
+Guest presentation: Daniel Rosnes on using Twitter data for the news: Introduction to Twitter API v2 and Tweepy
-The regular sessions will be a combination of lectures, student presentations and discussions. You will all be expected to present 2-3 papers/chapters as part of the course. I will try to balance workload evenly (so if some people get two papers and other three, the two papers will be longer and the three papers shorter).
-We will try to finish each session by 1600.
+Supplementary:
+* Chambers & Zaharia, chapter 10 ''(perhaps mandatory too)''
+* [https://developer.twitter.com/en/docs/twitter-api Twitter API v2]
+* [https://github.com/tweepy/tweepy Tweepy: Twitter for Python]
+* [https://docs.tweepy.org/en/latest/ Tweepy Documentation]
-== Readings ==
+== Session 3 - Streaming Spark. Big-data architectures. Kafka ==
-The detailed readings for each session will be made available on this page in due time.
+* Chambers & Zaharia, chapters 20-21
+* Gallofré, M., Opdahl, A. L., Stoppel, S., Tessem, B., & Veres, C. (2021). The News Angler Project: Exploring the Next Generation of Journalistic Knowledge Platforms. In Proceedings of Norsk IKT-konferanse for forskning og utdanning. [https://ojs.bibsys.no/index.php/NIK/article/view/939/792 Short Paper] and poster: [[file:A1-Poster-NIKT2021.pdf]]
+* [https://kafka.apache.org/intro Kafka Introduction]
+* Slides: [[file:S03-StreamingSpark-published.pdf]] [[file:S03-MoreSpark-published.pdf]] [[file:S03-Kafka-published.pdf]] [[file:S03-ResearchMethod-published.pdf]]
-== Presenting a paper ==
+''The Guest Talk on architectures and the News Hunter platform is postponed to a later session.''
-Here are a few points about the paper presentations and preparations:
-*    Make sure you start reading at least a week before, not in the last 2-3 days. You need time to let the paper sink in a bit before you start preparing the presentation. That way it is easier to see and present the big picture.
-*    This may be the first time you read a research paper. I have tried to choose papers that are rather short and simple but, nevertheless, some parts of almost every paper will be hard for you to understand. If you come across difficult details, try to focus on the purpose of what they are doing. When they mention, for example, statistical techniques, I do not expect that you read up on statistics. But explain why they need statistics and tell us the names of the techniques they use and on what data.
-*    Plan each presentation for about 20 minutes. We will set off 5-10 more minutes for discussion and comments.
-*    Prepare slides. For a 20 minute presentation, 10 slides is the maximum.
-*    Your presentation should try to answer at least the following: What is the problem the paper addresses? Why is this an important problem? Are the authors targetting a particular usage domain? What solutions do they propose? How does the solution work? Have they evaluated the solution? If so how? If not yet, how are they planning to evaluate it - or how do you think they should evaluate it? What are the limitations of the proposal? Do you see problems with what they are doing?
-*    These questions are not all suitable for all papers, so you must make a pick! Maybe there are other things you should say about the paper too. Some of the papers mostly describe a problem or a case study, for example, so the presentations will be quite different.
-*    Rehearse a few times beforehand. Talk through the presentation out loud for yourself (not just "inside your head").
-*    Share your slides by uploading them to the file section here in the portal.
-*    Some papers are longer and some shorter, some easier and some harder. This is how it has to be, but I will try to balance it out so that the workload on each of you is as equal as possible.
-== Uploading your presentation ==
+Supplementary:
-Sharing your presentation slides is a mandatory part of the presentation. You can upload your slides through Inspera in this group
+* [https://kafka-python.readthedocs.io/en/master/ kafka-python API]
-[https://mitt.uib.no/groups/? mitt.uib.no/groups/?]. If you are not already a member, you can register yourself (the group is open).
+* [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Structured Streaming Spark Programming Guide]
+* News Hunter:
+** Berven, A., Christensen, O. A., Moldeklev, S., Opdahl, A. L., & Villanger, K. J. (2020). A knowledge-graph platform for newsrooms. Computers in Industry, 123, 103321. [https://scholar.google.com/scholar?output=instlink&q=info:0K5dB1_9nusJ:scholar.google.com/&hl=en&as_sdt=0,5&as_ylo=2018&scillfp=11776208952974186557&oi=lle Paper]
+** Opdahl, A. L., & Tessem, B. (2021). Ontologies for finding journalistic angles. Software and Systems Modeling, 20(1), 71-87. [https://link.springer.com/article/10.1007/s10270-020-00801-w Paper]
+* Design science research method:
+** [https://www.jstor.org/stable/25148625#metadata_info_tab_contents Design Science in Information Systems Research] by Alan R. Hevner, Salvatore T. March, Jinsoo Park and Sudha Ram. MIS Quarterly 28(1):75-105, March 2004. ''(You need to be on UiB's network to access the link - I have uploaded it under Files in mitt.uib.no, but it may soon be deleted from there...)''
+** Hevner, A. R. (2007). A three cycle view of design science research. Scandinavian journal of information systems, 19(2), 4. [[File:Hevner2007-ThreeCycleView-SJIS.pdf]]
-Please use file names like this: "Session2-Pathologies-ALO.pdf", so that "Session2" is the session, "Pathologies" is a central term in the paper title and "ALO" are your initials.
+== Session 4 - Cloud computing. NREC and Openstack ==
+* [https://docs.nrec.no/index.html NREC and OpenStack], the following sections/pages: Introduction, Project application, Logging in, The dashboard, Create a Linux virtual machine (skip: Windows), Using SSH, Working with Security Groups, Create and manage volumes, Create and manage snapshots (skip: images), Instance console
+* Slides: [[file:S04-OpenStack-published.pdf]] [[file:S04-UbuntuLinux-published.pdf]]
+Guest presentation: Sohail Khan on computer vision and deep networks for image analysis. His slides and demo code are uploaded to mitt.uib.no under Files (size and file-type limitations).
+There are not so many readings for this session, because it is where we will start running Spark in a cluster, so there will be practical work that takes some time. Computer networks and image analysis is not a mandatory part of the course, but something you may want to use in your projects. Sohail's presentation will include suggestions for further reading.
+== Session 5 - Cloud management. Terraform and Ansible. <!-- Docker and Kubernetes --> ==
+* [https://docs.nrec.no/terraform-part1.html TerraForm and NREC part I], [https://docs.nrec.no/terraform-part2.html part II], and [https://docs.nrec.no/terraform-part3.html part III]
+* [https://www.ansible.com/overview/how-ansible-works How Ansible Works] and [https://docs.ansible.com/ansible_community.html the Ansible Community portal]
+<!--
+* Docker Docs: [https://docs.docker.com/get-started/overview/ Docker overview] and [https://docs.docker.com/get-started/overview/ Get started]
+* [https://kubernetes.io/docs/tutorials/kubernetes-basics/ Learn Kubernetes basics], modules 1-6
+-->
+* Slides: [[File:S05-Terraform-Ansible-published.pdf]] [[File:S05-NewsAngler-published.pdf]]
+Guest presentation: Marc Gallofré Ocaña on the News Hunter platform and its big-data ready architecture. Slides: [[File:MarcGallofre-BigDataArchitecture.pdf]]
+Comment: Hopefully, we can introduce Docker and Kubernetes in later sessions.
+== Session 6 - Societal issues. Privacy. GDPR ==
+* Kitchin, chapters 13-14 and 17-19
+* [https://gdpr.eu/what-is-gdpr/ What is GDPR, the EU’s new data protection law?]
+* Slides: [[File:S06-Privacy.pdf]]
+Guest presentation: Ghazaal Sheiki on fact checking. Slides:  [[File:GhazaalSheiki-AutomatedFactChecking.pdf]]
+Supplementary:
+* Kitchin, chapters 12 and 15-16 are also recommended reading
+* EU's [https://gdpr-info.eu/ General Data Protection Regulation (GDPR)] - the official legal text
+== Session 7 - Essay presentations ==
+== Session 8 - Project demonstrations ==
