Processing tweets with Spark: Difference between revisions
From info319
Latest revision as of 09:04, 3 September 2022
Processing tweets with Spark
Continuing the examples from Getting started with Apache Spark:
- load the tweets in ‘tweet-id-text-345/’ as JSON objects
- collect only the texts from the tweets
- split the texts into words and select all the hashtags
- the step where you go from a column of lists-of-words to a column of words is harder in plain Spark
  - using the explode() function (from pyspark.sql.functions import explode) seems to be the easiest way
- split the tweets into two sets of 80% and 20% size
- find URLs in the texts and download a few image files
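
The steps above could be sketched roughly as follows. This is only one possible outline, not a reference solution: it assumes each JSON object has a 'text' field, that a SparkSession is passed in as spark, and that a simple regular expression is good enough to spot URLs. The names urls, URL_RE and process_tweets are made up for this illustration.

```python
import re

# Assumed pattern: anything starting with http:// or https:// up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")


def urls(text):
    """Return the http(s) URLs found in one tweet text."""
    return URL_RE.findall(text)


def process_tweets(spark, path="tweet-id-text-345/"):
    """Hypothetical helper that walks through the exercise steps with DataFrames."""
    # Imported here so that urls() above can be used without Spark installed.
    from pyspark.sql.functions import col, explode, split

    # Load the tweets as JSON objects (one object per line in the files).
    tweets = spark.read.json(path)

    # Keep only the texts; the 'text' field name is an assumption.
    texts = tweets.select("text")

    # split() gives a column of lists of words, explode() turns it into one word per row.
    words = texts.select(explode(split(col("text"), r"\s+")).alias("word"))

    # Select the hashtags among the words.
    tags = words.filter(col("word").startswith("#"))

    # Split the tweets into two sets of roughly 80% and 20%.
    large, small = tweets.randomSplit([0.8, 0.2], seed=42)

    # Find URLs in the texts; missing texts are treated as empty strings.
    links = texts.rdd.flatMap(lambda row: urls(row["text"] or ""))

    return tags, large, small, links
```

Note that randomSplit() only approximates the requested proportions, so the two sets will not be exactly 80% and 20%. For the last step, a handful of links taken with links.take(5) could then be fetched with something like urllib.request.urlretrieve(), keeping in mind that links in tweets are often shortened and may not point directly at image files.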
