Processing tweets with Spark: Difference between revisions

From info319
No edit summary
Line 6: Line 6:
* split the texts into words and select all the hashtags
* split the texts into words and select all the hashtags
* the step where you go from a column of lists-of-words to a columns of words is a little harder
* the step where you go from a column of lists-of-words to a columns of words is a little harder
  * you can write this step in Python (using collect() and then creating a new DataFrame)
** you can write this step in Python (using collect() and then creating a new DataFrame)
  * you can also go via a Pandas frame, but like the Python solution, this breaks the parallel processing
** you can also go via a Pandas frame, but like the Python solution, this breaks the parallel processing
  * you can also use the file all-tweet-words.txt in mitt.uib.no to skip this step
** you can also use the file all-tweet-words.txt in mitt.uib.no to skip this step
  * two other solutions, which we will present later, are:  
** two other solutions, which we will present later, are:  
    * going via an RDD and using a flatMap()
** going via an RDD and using a flatMap()
    * writing a user-defined Spark function (UDF)
** writing a user-defined Spark function (UDF)
* split the tweets into two sets of 80% and 20% size
* split the tweets into two sets of 80% and 20% size
* find URLs in the texts and download a few image files
* find URLs in the texts and download a few image files

Revision as of 14:14, 1 September 2022

Processing tweets with Spark

Continuing the examples from Getting started with Apache Spark:

  • load the tweets in ‘tweet-id-text-345/’ as JSON objects
  • collect only the texts from the tweets
  • split the texts into words and select all the hashtags
  • the step where you go from a column of lists-of-words to a columns of words is a little harder
    • you can write this step in Python (using collect() and then creating a new DataFrame)
    • you can also go via a Pandas frame, but like the Python solution, this breaks the parallel processing
    • you can also use the file all-tweet-words.txt in mitt.uib.no to skip this step
    • two other solutions, which we will present later, are:
    • going via an RDD and using a flatMap()
    • writing a user-defined Spark function (UDF)
  • split the tweets into two sets of 80% and 20% size
  • find URLs in the texts and download a few image files