Processing tweets with Spark

Continuing the examples from Getting started with Apache Spark:

load the tweets in ‘tweet-id-text-345/’ as JSON objects
collect only the texts from the tweets
split the texts into words and select all the hashtags
the step where you go from a column of lists of words to a column of words is a little harder:

 * you can write this step in Python (using collect() and then creating a new DataFrame)
 * you can also go via a pandas DataFrame but, like the Python solution, this breaks the parallel processing
 * you can also use the file all-tweet-words.txt on mitt.uib.no to skip this step
 * two other solutions, which we will present later, are: 
   * going via an RDD and using a flatMap()
   * writing a user-defined Spark function (UDF)
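The steps so far, including the RDD route with flatMap(), could be sketched roughly as follows. This is a minimal sketch and not the course's official solution. It assumes that each tweet is a JSON object with a field named 'text', that `spark` is an existing SparkSession as in the earlier examples, and that a hashtag is simply a whitespace-separated word starting with '#'. The function names are hypothetical.

```python
def extract_hashtags(text):
    """Split one tweet text into words and keep the ones that look like hashtags.

    Assumption: a hashtag is any whitespace-separated word that starts with '#'
    and has at least one more character. Trailing punctuation is not stripped.
    """
    words = (text or "").split()
    return [word for word in words if word.startswith("#") and len(word) > 1]


def hashtag_frame(spark, path="tweet-id-text-345/"):
    """Load the tweets and return a DataFrame with one hashtag per row.

    `spark` is an existing SparkSession. The 'text' field name is an assumption
    about the JSON objects in the folder.
    """
    tweets = spark.read.json(path)      # load the tweets as JSON objects
    texts = tweets.select("text")       # collect only the texts
    # flatMap() flattens each tweet's list of hashtags into separate elements,
    # so the work stays distributed (no collect() on the driver).
    tags = texts.rdd.flatMap(lambda row: extract_hashtags(row["text"]))
    # Wrap each string in a 1-tuple so that it can become a DataFrame row.
    # An explicit schema avoids type inference on a possibly empty RDD.
    return spark.createDataFrame(tags.map(lambda tag: (tag,)), "hashtag string")
```

With a running session, `hashtag_frame(spark).show(5)` would display the first few hashtags. Because the flattening happens inside flatMap(), nothing has to be collected to the driver, unlike the pure Python and pandas routes.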

split the tweets into two sets of 80% and 20% size
find URLs in the texts and download a few image files
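The last two steps could be sketched along these lines. Again this is only a sketch under stated assumptions: `tweets` is the DataFrame loaded earlier, a URL is anything that starts with http:// or https:// and runs to the next whitespace, and an image is recognised by its file suffix (shortened links such as t.co will not match this check). All function names are hypothetical.

```python
import re
import urllib.request

# Assumption: a URL starts with http:// or https:// and ends at whitespace.
URL_PATTERN = re.compile(r"https?://\S+")

# Assumption: image files are recognised by these suffixes only.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif")


def split_tweets(tweets, seed=42):
    """Split a DataFrame into two parts of roughly 80% and 20%.

    randomSplit() assigns rows at random, so the sizes are approximate.
    The seed makes the split repeatable.
    """
    large, small = tweets.randomSplit([0.8, 0.2], seed=seed)
    return large, small


def find_urls(text):
    """Return all URLs found in one tweet text."""
    return URL_PATTERN.findall(text or "")


def is_image_url(url):
    """Guess from the suffix whether a URL points to an image file."""
    return url.lower().endswith(IMAGE_SUFFIXES)


def download_images(urls, limit=3):
    """Download at most `limit` image URLs into the current directory."""
    saved = []
    for url in urls:
        if len(saved) >= limit:
            break
        if not is_image_url(url):
            continue
        filename = url.rsplit("/", 1)[-1]   # last path segment as file name
        urllib.request.urlretrieve(url, filename)
        saved.append(filename)
    return saved
```

Since only a few files are needed, one option is to bring a small sample of texts to the driver, for example with `tweets.select("text").limit(100).collect()`, run `find_urls()` on each text, and pass the resulting URLs to `download_images()`.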
