Processing tweets with Spark
From info319
Continuing the examples from Getting started with Apache Spark:
- load the tweets in ‘tweet-id-text-345/’ as JSON objects
- collect only the texts from the tweets
- split the texts into words and select all the hashtags
- the step where you go from a column of lists-of-words to a column of words is a little harder
- using explode (from pyspark.sql.functions import explode) is the simplest way to do this
- split the tweets into two sets of 80% and 20% size
- find URLs in the texts and download a few image files
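
The steps above could be sketched roughly as follows. This is only one possible approach, not a reference solution. It assumes the files in ‘tweet-id-text-345/’ hold one JSON object per line and that the tweet text is stored in a field called "text". The function names (extract_urls, is_image_url, process_tweets, download_images), the regular expression, the seed and the download limit are all made up for illustration.

```python
import os
import re
import urllib.request

# Simple URL pattern (an assumption; real tweets may need something stricter).
URL_RE = re.compile(r"https?://\S+")
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif")


def extract_urls(text):
    """Return all http(s) URLs found in a tweet text."""
    return URL_RE.findall(text)


def is_image_url(url):
    """Guess from the file extension whether a URL points to an image."""
    return url.lower().endswith(IMAGE_SUFFIXES)


def process_tweets(path="tweet-id-text-345/"):
    # Imported here so the helpers above can be used without pyspark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, split

    spark = SparkSession.builder.appName("tweets").getOrCreate()

    # Load the tweets as JSON objects.
    tweets = spark.read.json(path)

    # Collect only the texts (the field name "text" is an assumption).
    texts = tweets.select("text")

    # Split each text into a list of words, then explode the column of
    # lists-of-words into a column with one word per row.
    words = texts.select(explode(split(col("text"), r"\s+")).alias("word"))

    # Select the words that start with '#'.
    hashtags = words.filter(col("word").startswith("#"))

    # Random split into roughly 80% and 20% (sizes are approximate).
    large, small = texts.randomSplit([0.8, 0.2], seed=42)

    # Find the URLs in the texts; missing texts are treated as empty.
    urls = texts.rdd.flatMap(lambda row: extract_urls(row["text"] or ""))

    return hashtags, large, small, urls


def download_images(urls, limit=3, dest="."):
    """Download at most `limit` URLs that look like image files."""
    saved = []
    for url in urls:
        if len(saved) >= limit:
            break
        if is_image_url(url):
            filename = os.path.join(dest, os.path.basename(url))
            urllib.request.urlretrieve(url, filename)
            saved.append(filename)
    return saved
```

To try it, call hashtags, large, small, urls = process_tweets(), look at hashtags.show(10), and then pass a small sample such as urls.take(100) to download_images. Links in tweets are often shortened, so the extension check may find nothing until the links have been resolved to their final addresses.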
