Revision as of 14:14, 1 September 2022
Processing tweets with Spark
Continuing the examples from Getting started with Apache Spark:
- load the tweets in ‘tweet-id-text-345/’ as JSON objects
- collect only the texts from the tweets
- split the texts into words and select all the hashtags
- the step where you go from a column of lists-of-words to a column of words is a little harder
  - you can write this step in Python (using collect() and then creating a new DataFrame)
  - you can also go via a Pandas frame, but like the Python solution, this breaks the parallel processing
  - you can also use the file all-tweet-words.txt on mitt.uib.no to skip this step
  - two other solutions, which we will present later, are:
    - going via an RDD and using a flatMap()
    - writing a user-defined Spark function (UDF)
- split the tweets into two random subsets of roughly 80% and 20% of the rows
- find URLs in the texts and download a few image files
