Processing tweets with Spark: Difference between revisions

Revision as of 14:14, 1 September 2022

@@ Line 6: / Line 6: @@
 * split the texts into words and select all the hashtags
 * the step where you go from a column of lists-of-words to a column of words is a little harder
-  * you can write this step in Python (using collect() and then creating a new DataFrame)
+** you can write this step in Python (using collect() and then creating a new DataFrame)
-  * you can also go via a Pandas frame, but like the Python solution, this breaks the parallel processing
+** you can also go via a Pandas frame, but like the Python solution, this breaks the parallel processing
-  * you can also use the file all-tweet-words.txt in mitt.uib.no to skip this step
+** you can also use the file all-tweet-words.txt in mitt.uib.no to skip this step
-  * two other solutions, which we will present later, are:
+** two other solutions, which we will present later, are:
-    * going via an RDD and using a flatMap()
+*** going via an RDD and using a flatMap()
-    * writing a user-defined Spark function (UDF)
+*** writing a user-defined Spark function (UDF)
 * split the tweets into two sets of 80% and 20% size
 * find URLs in the texts and download a few image files
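
The "little harder" step in the list above, going from a column of lists-of-words to a column of words, can be sketched in plain Python. This is only a rough illustration of the collect()-based route, not the course solution: `collected_rows` is hypothetical sample data standing in for what collect() might return, and `flatten_words` and `select_hashtags` are made-up helper names. As the list notes, working on collected data like this gives up Spark's parallel processing; the RDD route would apply the same flattening with flatMap() instead.

```python
import re

# Hypothetical sample data: one list of words per tweet,
# standing in for the rows returned by collect().
collected_rows = [
    ["loving", "#spark", "and", "#python"],
    ["no", "tags", "here"],
    ["#Spark", "again"],
]


def flatten_words(rows):
    """Turn a list of word lists into one flat list of words.

    This is the same reshaping that flatMap() performs on an RDD,
    done here locally on already-collected data.
    """
    return [word for words in rows for word in words]


def select_hashtags(words):
    """Keep only words that look like hashtags: '#' followed by word characters."""
    return [w for w in words if re.fullmatch(r"#\w+", w)]


all_words = flatten_words(collected_rows)
hashtags = select_hashtags(all_words)
print(hashtags)
```

Each element of `all_words` could then become one row of a new single-column DataFrame, which is the "creating a new DataFrame" part of the Python solution.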