Processing tweets with Spark

From info319

Revision as of 14:14, 1 September 2022 by Sinoa (talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Processing tweets with Spark

Continuing the examples from Getting started with Apache Spark:

load the tweets in ‘tweet-id-text-345/’ as JSON objects
collect only the texts from the tweets
split the texts into words and select all the hashtags
the step where you go from a column of lists-of-words to a columns of words is a little harder
- you can write this step in Python (using collect() and then creating a new DataFrame)
- you can also go via a Pandas frame, but like the Python solution, this breaks the parallel processing
- you can also use the file all-tweet-words.txt in mitt.uib.no to skip this step
- two other solutions, which we will present later, are:
- going via an RDD and using a flatMap()
- writing a user-defined Spark function (UDF)
split the tweets into two sets of 80% and 20% size
find URLs in the texts and download a few image files

Retrieved from "http://info319.wiki.uib.no/index.php?title=Processing_tweets_with_Spark&oldid=911"