Processing tweets with Spark: Difference between revisions

Latest revision as of 09:04, 3 September 2022

load the tweets in ‘tweet-id-text-345/’ as JSON objects
collect only the texts from the tweets
split the texts into words and select all the hashtags
the step where you go from a column of lists-of-words to a column of words is harder in plain Spark
- using the explode() function (from pyspark.sql.functions import explode) seems to be the easiest way
split the tweets into two sets of 80% and 20% size
find URLs in the texts and download a few image files

@@ Line 5: / Line 5: @@
 * collect only the texts from the tweets
 * split the texts into words and select all the hashtags
-* the step where you go from a column of lists-of-words to a columns of words is a little harder
+* the step where you go from a column of lists-of-words to a column of words is harder in plain Spark
-  * you can write this step in Python (using collect() and then creating a new DataFrame)
+** using the ''explode()'' function (''from pyspark.sql.functions import explode'') seems to be the easiest way
-  * you can also go via a Pandas frame, but like the Python solution, this breaks the parallel processing
-  * you can also use the file all-tweet-words.txt in mitt.uib.no to skip this step
-  * two other solutions, which we will present later, are:
-    * going via an RDD and using a flatMap()
-    * writing a user-defined Spark function (UDF)
 * split the tweets into two sets of 80% and 20% size
 * find URLs in the texts and download a few image files
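
For the last step, finding URLs and downloading a few image files, a plain-Python sketch could look like the one below. It uses only the standard library and assumes the tweet texts have already been collected to the driver as ordinary strings. The regular expression, the list of image suffixes and the names find_urls, is_image_url and download_images are illustrative assumptions. Links in tweets are often shortened, so filtering on the file suffix may miss images hidden behind redirects.

```python
import re
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# Assumption: a URL is anything that starts with http:// or https://
# and runs until the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif")


def find_urls(text):
    """Return all http(s) URLs in a text, with trailing punctuation removed."""
    return [url.rstrip(".,;:!?)\"'") for url in URL_PATTERN.findall(text)]


def is_image_url(url):
    """Guess from the path suffix whether a URL points to an image file."""
    return urlparse(url).path.lower().endswith(IMAGE_SUFFIXES)


def download_images(urls, dest_dir="images", limit=5):
    """Download at most `limit` image files into dest_dir and return their paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        if len(saved) >= limit:
            break
        if not is_image_url(url):
            continue
        target = dest / Path(urlparse(url).path).name
        urllib.request.urlretrieve(url, str(target))
        saved.append(target)
    return saved
```

For example, find_urls() can be applied to each collected text, and the combined list of URLs passed to download_images() with a small limit.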