Processing tweets with Spark

Continuing the examples from Getting started with Apache Spark:

load the tweets in ‘tweet-id-text-345/’ as JSON objects
collect only the texts from the tweets
split the texts into words and select all the hashtags
the step where you go from a column of lists-of-words to a column of words is harder in plain Spark
- using the explode() function (from pyspark.sql.functions import explode) seems to be the easiest way
split the tweets into two sets of 80% and 20% size
find URLs in the texts and download a few image files
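
The steps above could be sketched roughly as follows in PySpark. This is only one possible approach, not a reference solution. It assumes that each tweet is a JSON object with a 'text' field, that a hashtag is any word starting with '#', and that image URLs end in a common image extension. The function names (find_urls, is_image_url, download_images, process_tweets) are made up for this illustration.

```python
import os
import re
import urllib.request

# Assumption: a URL is anything from http:// or https:// up to the next whitespace.
URL_PATTERN = r"https?://\S+"
# Assumption: image files can be recognised by their file extension.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def find_urls(text):
    """Return all URLs found in a tweet text (plain Python, no Spark needed)."""
    return re.findall(URL_PATTERN, text)


def is_image_url(url):
    """Guess from the extension (ignoring any query string) whether a URL is an image."""
    return url.lower().split("?")[0].endswith(IMAGE_EXTENSIONS)


def download_images(urls, target_dir="images", limit=3):
    """Download at most `limit` image files on the driver; return the saved paths."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for url in urls:
        if len(saved) >= limit:
            break
        if not is_image_url(url):
            continue
        filename = os.path.join(target_dir, os.path.basename(url.split("?")[0]))
        urllib.request.urlretrieve(url, filename)
        saved.append(filename)
    return saved


def process_tweets(spark, path="tweet-id-text-345/"):
    """Run the steps on an existing SparkSession; returns lazy DataFrames and an RDD."""
    # Imported here so the plain Python helpers above also work without Spark.
    from pyspark.sql.functions import col, explode, split

    # Load the tweets as JSON objects (one JSON object per line).
    tweets = spark.read.json(path)

    # Collect only the texts; the field name 'text' is an assumption.
    texts = tweets.select("text").where(col("text").isNotNull())

    # Split each text into a list of words, then use explode() to turn the
    # column of lists-of-words into a column of single words.
    words = texts.select(explode(split(col("text"), r"\s+")).alias("word"))

    # Select all the hashtags (assumed: words that start with '#').
    hashtags = words.where(col("word").startswith("#"))

    # Split the tweets into two sets of roughly 80% and 20% size.
    # randomSplit samples randomly, so the sizes are only approximate.
    large, small = tweets.randomSplit([0.8, 0.2], seed=42)

    # Find the URLs in the texts, one URL per RDD element.
    urls = texts.rdd.flatMap(lambda row: find_urls(row["text"]))

    return hashtags, large, small, urls
```

With a SparkSession named spark, something like hashtags, large, small, urls = process_tweets(spark) followed by download_images(urls.take(20)) would fetch a few images on the driver. Many URLs in tweets are shortened links rather than direct image links, so the extension check may filter most of them out.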
