Processing tweets with Spark: Difference between revisions

Revision as of 14:13, 1 September 2022

Processing tweets with Spark

Continuing the examples from Getting started with Apache Spark:

load the tweets in ‘tweet-id-text-345/’ as JSON objects
collect only the texts from the tweets
split the texts into words and select all the hashtags
the step where you go from a column of lists-of-words to a column of words is a little harder:

 * you can write this step in Python (using collect() and then creating a new DataFrame)
 * you can also go via a Pandas frame, but like the Python solution, this breaks the parallel processing
 * you can also use the file all-tweet-words.txt on mitt.uib.no to skip this step
 * two other solutions, which we will present later, are: 
   * going via an RDD and using a flatMap()
   * writing a user-defined Spark function (UDF)
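
As a rough preview of the RDD route, the sketch below shows one way flatMap() could flatten the per-tweet lists into one hashtag per row. It assumes the tweets DataFrame has a string column named 'text' (a guess based on the folder name); hashtags_in and hashtag_frame are hypothetical helper names, and the regular expression is only a simple approximation of what counts as a hashtag.

```python
import re

# Simple approximation: a '#' followed by letters, digits or underscores.
HASHTAG = re.compile(r"#\w+")


def hashtags_in(text):
    """Return all hashtags in one tweet text, lower-cased."""
    return [tag.lower() for tag in HASHTAG.findall(text or "")]


def hashtag_frame(spark, texts_df):
    """Turn a DataFrame with a 'text' column into one hashtag per row.

    'spark' is an existing SparkSession; 'text' is an assumed column name.
    """
    # flatMap() applies hashtags_in to every row and flattens the
    # resulting lists, so the work stays distributed.
    tags_rdd = texts_df.rdd.flatMap(lambda row: hashtags_in(row["text"]))
    # Wrap each string in a 1-tuple so that createDataFrame sees
    # a single column called 'hashtag'.
    return spark.createDataFrame(tags_rdd.map(lambda tag: (tag,)), ["hashtag"])
```

Because hashtags_in is plain Python, it can be tried on a single string before running it on the cluster. The same function could probably also be wrapped as a UDF instead of going via the RDD.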

split the tweets into two sets of 80% and 20% size
find URLs in the texts and download a few image files
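
The remaining steps could be sketched roughly as follows. This is not a reference solution: it assumes an existing SparkSession is passed in as spark, that each JSON object has a 'text' field, and that a simple regular expression is good enough to spot URLs. All function names here are hypothetical.

```python
import os
import re
import urllib.parse
import urllib.request

# Simple approximation: everything from http(s):// up to the next whitespace.
URL = re.compile(r"https?://\S+")


def find_urls(text):
    """Return all URLs found in one tweet text."""
    return URL.findall(text or "")


def load_texts(spark, path="tweet-id-text-345/"):
    """Load the tweets as JSON objects and keep only the (assumed) 'text' column."""
    tweets = spark.read.json(path)
    return tweets.select("text")


def split_80_20(df, seed=42):
    """Split a DataFrame into roughly 80% and 20% of the rows."""
    # randomSplit() samples rows, so the sizes are approximate.
    return df.randomSplit([0.8, 0.2], seed=seed)


def first_urls(texts_df, limit=100):
    """Collect URLs from the first 'limit' tweets on the driver."""
    rows = texts_df.limit(limit).collect()
    return [url for row in rows for url in find_urls(row["text"])]


def download_image(url, folder="images"):
    """Save 'url' into 'folder' if the server reports an image type.

    Returns the local path, or None when the response is not an image.
    """
    os.makedirs(folder, exist_ok=True)
    with urllib.request.urlopen(url, timeout=10) as response:
        if not response.headers.get_content_type().startswith("image/"):
            return None
        # Use the last path segment of the final (redirected) URL as file name.
        final_path = urllib.parse.urlparse(response.geturl()).path
        name = os.path.basename(final_path) or "image"
        target = os.path.join(folder, name)
        with open(target, "wb") as out:
            out.write(response.read())
    return target
```

Note that randomSplit() gives only approximately 80% and 20%, and that many links in tweets are shortened or point to web pages rather than image files, which is why download_image checks the reported content type before saving anything.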

