tweets

HelloWorld Spark? Smart (selective) wordcount Scala example!

In the previous post I showed how to build a Spark Scala jar and submit a job using spark-submit. Now let’s customize our main Scala Spark object a little. You can find the project for the following example here on GitHub.

Let’s imagine we’ve collected a series of messages about football (tweets or whatever) and we want to count words: not every word, only those that are of interest. Say we have a “dictionary” of football players’ names, and we want to see which of them appear most often in those messages.

Example time!

Imagine we have a file (called names) with a list of names (one per line):

Kane
Sirigu
Neymar
Totti

And in another file (called messages) we have a list of messages:

I’m obviously with Harry Kane (Hurricane) today… Let’s go Tottenham!!!
Kid Kane with Arsenal jersey, lol
Another top save from Sirigu to keep Toulouse out. PSG seconds away from top spot.
Why Sirigu and Verratti are laughing so much?
Francesco Totti Scores With Flying Kung Fu Kick, Celebrates With Selfie
What would Roma do without Totti? See his great goal & celebration HERE! #Totti #Goals
Wenger wasn’t the Arsenal coach when Totti started to play in Serie A

The result of analyzing these lines should be (Kane, 2), (Sirigu, 2), (Totti, 3). To achieve this, …continue reading →
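As a rough illustration of the idea, here is a minimal sketch in plain Scala collections. It is an assumption about one way to do the selective count, not necessarily the code from the full post. The names `SelectiveWordCount`, `normalize`, `countNames`, `sampleNames` and `sampleMessages` are hypothetical. The tokenization rule (split on whitespace, strip trailing punctuation only) is also a choice made here: it leaves "#Totti" and "(Hurricane)" unmatched, which is what reproduces the counts above.

```scala
object SelectiveWordCount {
  // Sample data copied from the example above (hypothetical value names).
  val sampleNames: Seq[String] = Seq("Kane", "Sirigu", "Neymar", "Totti")

  val sampleMessages: Seq[String] = Seq(
    "I’m obviously with Harry Kane (Hurricane) today… Let’s go Tottenham!!!",
    "Kid Kane with Arsenal jersey, lol",
    "Another top save from Sirigu to keep Toulouse out. PSG seconds away from top spot.",
    "Why Sirigu and Verratti are laughing so much?",
    "Francesco Totti Scores With Flying Kung Fu Kick, Celebrates With Selfie",
    "What would Roma do without Totti? See his great goal & celebration HERE! #Totti #Goals",
    "Wenger wasn’t the Arsenal coach when Totti started to play in Serie A"
  )

  // Assumption: strip non-alphanumeric characters from the END of a token only,
  // so "Totti?" becomes "Totti" while "#Totti" stays as it is and is not counted.
  def normalize(token: String): String =
    token.replaceAll("[^\\p{L}\\p{N}]+$", "")

  // Count only the words that appear in the dictionary of names.
  def countNames(names: Seq[String], messages: Seq[String]): Map[String, Int] = {
    // Ignore blank lines in the names file and use a Set for fast lookups.
    val dictionary = names.map(_.trim).filter(_.nonEmpty).toSet
    messages
      .flatMap(_.split("\\s+").toSeq)   // one token per word
      .map(normalize)                   // drop trailing punctuation
      .filter(dictionary.contains)      // keep only the names of interest
      .groupBy(identity)                // group equal names together
      .map { case (name, hits) => name -> hits.size }
  }

  def main(args: Array[String]): Unit =
    // Prints (Kane,2), (Sirigu,2), (Totti,3), one per line.
    countNames(sampleNames, sampleMessages).toSeq.sortBy(_._1).foreach(println)
}
```

On a Spark RDD the same shape would probably be `flatMap` and `filter` followed by a `map` to `(name, 1)` pairs and `reduceByKey(_ + _)`, with the messages read through `sc.textFile`. Neymar never appears in the messages, so it has no entry in the result.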
