Streaming tweets with Twitter API: Difference between revisions

From info319
Revision as of 06:54, 7 September 2022

Streaming tweets with Twitter API

Get Twitter API account

This was a suggested preparation before the course started, but as a reminder:

...

The simple XXX account is a fine start and offers a lot of opportunities. When you have defined a more concrete project, you may apply for a XXX account. Then it is up to Twitter whether they will grant you access.

Usually, your account will be created in much less than a day. Twitter offers several ways to authenticate yourself, but the easiest way is to use the XXX and XXX. In your info319-exercises//2 folder, create a keys/bearer_token file that only you can read:

(venv) $ mkdir keys
(venv) $ chmod 700 keys
(venv) $ touch keys/bearer_token
(venv) $ chmod 600 keys/bearer_token
(venv) $ cat > keys/bearer_token
XXX
XXX
<Ctrl-D>

The cat > keys/bearer_token command lets you type content directly into the file, overwriting existing content (using >> instead of > appends). Press Control-D on an empty line to close the file.

Saving tweets to file

(venv) $ pip install tweepy

Here is the minimal boilerplate code to connect to Twitter's API and start downloading a stream of tweets (the code takes a few seconds to connect):

import json
from time import sleep
import threading
import tweepy
 
class TweetHarvester: 
    def __init__(self, bearer_token, duration):
        """Creates the tweepy client and starts the streaming thread."""
        self.streaming_client = tweepy.StreamingClient(bearer_token)
        self.thread = threading.Thread(
            target=self.run,
            args=(duration,))        
        self.thread.start()
    def run(self, duration):
        """The streaming tweepy client runs this function."""
        self.streaming_client.on_data = self.on_data
        self.streaming_client.sample(threaded=True)
        sleep(duration)
        self.streaming_client.disconnect()
    def on_data(self, json_data):
        """Tweepy calls this when it receives data from Twitter"""
        id_text = json.loads(json_data.decode())
        print('Received tweet:', id_text)

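# strip() removes the trailing newline that cat leaves in the file
with open('./keys/bearer_token') as token_file:
    bearer_token = token_file.read().strip()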
harvester = TweetHarvester(bearer_token, 8)

Note how, every time a new tweet (or batch of tweets) is received, tweepy calls the on_data() function, which prints the received tweet to the screen.

Create code that saves the tweets to a file (for example as JSON or CSV). Create a subfolder for tweet files to keep your info319-exercise2 folder tidy.
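One possible starting point for the file-saving task is sketched below. It appends each tweet as one JSON object per line (the JSON Lines convention), which is only one of several reasonable formats. The folder name tweets, the file name sample.jsonl and the function name save_tweet are hypothetical, chosen for illustration.

```python
import json
from pathlib import Path

# Hypothetical folder and file names, chosen only for this illustration.
TWEET_DIR = Path('tweets')
TWEET_FILE = TWEET_DIR / 'sample.jsonl'


def save_tweet(json_data, path=TWEET_FILE):
    """Append one raw payload (bytes, as passed to on_data) as a JSON line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the subfolder if needed
    tweet = json.loads(json_data.decode())
    with open(path, 'a', encoding='utf-8') as tweet_file:
        tweet_file.write(json.dumps(tweet) + '\n')
    return tweet
```

Calling save_tweet(json_data) from on_data() would then store every received tweet. Opening the file once per tweet is simple but slow; it is probably adequate for a sampled stream.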

Change the code so you download and save additional information about each tweet, for example the handle (user name) of the tweeter, the tweet id, whether it is a retweet of another id, and perhaps the date and time.
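A minimal sketch of how the extra fields might be pulled out of a decoded payload follows. It assumes the Twitter API v2 layout, with the tweet under data and expanded user objects under includes and users; the function name extract_fields is hypothetical. To actually receive these fields you would most likely have to request them in the call to sample(), for example through its expansions, tweet_fields and user_fields arguments. Check the tweepy documentation for the exact names your version accepts.

```python
def extract_fields(payload):
    """Flatten one decoded stream payload into a small dict.

    Assumption: the tweet sits under 'data' and expanded user objects
    under 'includes' -> 'users', as in the Twitter API v2 layout.
    """
    tweet = payload.get('data', {})
    users = {u['id']: u for u in payload.get('includes', {}).get('users', [])}
    author = users.get(tweet.get('author_id'), {})

    # A retweet refers to the original tweet with type 'retweeted'.
    retweet_of = None
    for ref in tweet.get('referenced_tweets', []):
        if ref.get('type') == 'retweeted':
            retweet_of = ref.get('id')

    return {
        'id': tweet.get('id'),
        'text': tweet.get('text'),
        'username': author.get('username'),
        'created_at': tweet.get('created_at'),
        'retweet_of': retweet_of,
    }
```

Fields that were not requested or are not present simply come back as None, so the function also works on the default payload that contains only id and text.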

Remove these lines from the run() function (along with threaded=True in the call to sample() and everything related to the duration variable):

        sleep(duration)
        self.streaming_client.disconnect()

Instead, call disconnect() from on_data() when a given number of tweets (say 100 to start) have been received.
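The counting logic could look roughly like the sketch below. The class name CountingHandler is hypothetical, and the client argument stands for any object with a disconnect() method, such as the tweepy.StreamingClient from the boilerplate code.

```python
import json


class CountingHandler:
    """Collects tweets and asks the client to disconnect after max_tweets."""

    def __init__(self, client, max_tweets=100):
        self.client = client          # any object with a disconnect() method
        self.max_tweets = max_tweets
        self.tweets = []

    def on_data(self, json_data):
        """Store one payload; disconnect once the limit is reached."""
        if len(self.tweets) >= self.max_tweets:
            return  # ignore data that still arrives while disconnecting
        self.tweets.append(json.loads(json_data.decode()))
        if len(self.tweets) == self.max_tweets:
            self.client.disconnect()
```

As in the boilerplate, the handler would be attached with streaming_client.on_data = handler.on_data before calling sample(). The early return guards against payloads that arrive between the call to disconnect() and the stream actually closing.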

Change the code so it instead writes the tweets to a file (which it closes properly on disconnect).

Streaming tweets to a socket

The following changes to the boilerplate code instead write the stream of tweets to a socket. Think of a socket as an internal internet connection from your machine back to itself. Different programs on your computer can use this socket to communicate using the regular internet APIs.

XXX

Change the code so it sends the stream of tweets to a socket (for example PORT 65000) instead of to a file. On termination, the program must close the socket properly using XXX.
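The sketch below shows one way the socket side could be organised with Python's standard socket module. It is not the code from the exercise: the class name TweetSocketServer and its method names are hypothetical, and sending one JSON object per line is an assumption about a convenient wire format.

```python
import json
import socket

HOST = 'localhost'
PORT = 65000  # example port from the exercise text


class TweetSocketServer:
    """Accepts one local client and forwards each tweet to it as a JSON line."""

    def __init__(self, host=HOST, port=PORT):
        self.server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # SO_REUSEADDR usually allows rebinding the port soon after a restart.
        self.server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        self.server.bind((host, port))
        self.server.listen(1)
        self.port = self.server.getsockname()[1]  # actual port (useful if port=0)
        self.conn = None

    def wait_for_client(self):
        """Block until a client (for example nc) connects."""
        self.conn, _ = self.server.accept()

    def send_tweet(self, tweet):
        """Send one tweet (a dict) as a newline-terminated JSON string."""
        line = json.dumps(tweet) + '\n'
        self.conn.sendall(line.encode())

    def close(self):
        """Close the client connection and the listening socket."""
        if self.conn is not None:
            self.conn.close()
        self.server.close()
```

In this design the program calls wait_for_client() before starting the stream, calls send_tweet() from on_data(), and calls close() when it disconnects from Twitter.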

You can use the nc utility to receive data from the socket. From another console window (inside or outside VS Code, a virtual environment is not needed):

$ sudo apt install netcat-openbsd  # provides the nc command on Debian/Ubuntu
$ nc localhost 65000  # you must run this after you start StreamingClient elsewhere
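If nc is not available on your system, a few lines of Python can play the same role. This is a minimal sketch that assumes the sending program writes newline-terminated text; the function name receive_lines and the max_lines parameter are hypothetical.

```python
import socket


def receive_lines(host='localhost', port=65000, max_lines=5):
    """Connect to the tweet socket and return up to max_lines text lines."""
    lines = []
    with socket.create_connection((host, port)) as conn:
        with conn.makefile('r', encoding='utf-8') as reader:
            for line in reader:
                lines.append(line.rstrip('\n'))
                if len(lines) >= max_lines:
                    break
    return lines
```

As with nc, this must be started after the program that opens the socket, otherwise the connection is refused.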

IMPORTANT: When you debug, the socket will often remain open for a while after your program has crashed (there is a timeout). So you may often have to change the PORT number, both in the StreamingClient program and in the nc test command (65001, 65002, ...).

Stream to Spark

Spark Streaming is the topic of Session 3, but in this exercise we will write a simple Spark stream that receives and prints the stream of tweets. It is based on this example: https://sparkbyexamples.com/spark/spark-streaming-from-tcp-socket/