In this article we review the classic Hadoop word count example and customize it a little. As usual, I suggest using Eclipse with Maven to create a project that can be modified, compiled, and easily executed on the cluster. First of all, download the Maven boilerplate project from here: https://github.com/H4ml3t/maven-hadoop-java-wordcount-template
Let’s imagine we’ve collected a series of messages about football (tweets or whatever) and we want to count words, but not every word: only those that are of interest. Say we have a “dictionary” of football players’ names, and we want to see which of them appears the most in those messages.
Imagine we have a file (called names) with a list of names (one per line):
And in another file (called messages) we have a list of messages:
I’m obviously with Harry Kane (Hurricane) today… Let’s go Tottenham!!!
Kid Kane with Arsenal jersey, lol
Another top save from Sirigu to keep Toulouse out. PSG seconds away from top spot.
Why Sirigu and Verratti are laughing so much?
Francesco Totti Scores With Flying Kung Fu Kick, Celebrates With Selfie
What would Roma do without Totti? See his great goal & celebration HERE! #Totti #Goals
Wenger wasn’t the Arsenal coach when Totti started to play in Serie A
The result of analyzing these lines has to be (Kane, 2), (Sirigu, 2), (Totti, 3). To achieve this, …
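As a rough illustration of the filtering idea, here is a minimal plain-Java sketch that runs without Hadoop. It is not the Hadoop job from the template project: the class name DictionaryWordCount and the three-name dictionary are assumptions made for this example, and so is the rule that hashtags such as #Totti are skipped, which was chosen only because it reproduces the totals quoted above. Typographic quotes in the messages are replaced with ASCII ones.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper class: simulates "count only dictionary words" in memory.
final class DictionaryWordCount {

    // Returns how often each dictionary name occurs in the messages.
    static Map<String, Integer> count(Set<String> dictionary, List<String> messages) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by name
        for (String message : messages) {
            // "Map" step: split on whitespace and look at each token.
            for (String token : message.split("\\s+")) {
                if (token.startsWith("#")) {
                    continue; // assumption: hashtags are not counted as names
                }
                // Drop everything that is not a letter, so "Totti?" becomes "Totti".
                String word = token.replaceAll("[^\\p{L}]", "");
                if (dictionary.contains(word)) {
                    // "Reduce" step: add 1 to the running total for this name.
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed dictionary: only the names that appear in the expected result.
        Set<String> names = Set.of("Kane", "Sirigu", "Totti");
        List<String> messages = List.of(
            "I'm obviously with Harry Kane (Hurricane) today... Let's go Tottenham!!!",
            "Kid Kane with Arsenal jersey, lol",
            "Another top save from Sirigu to keep Toulouse out. PSG seconds away from top spot.",
            "Why Sirigu and Verratti are laughing so much?",
            "Francesco Totti Scores With Flying Kung Fu Kick, Celebrates With Selfie",
            "What would Roma do without Totti? See his great goal & celebration HERE! #Totti #Goals",
            "Wenger wasn't the Arsenal coach when Totti started to play in Serie A");
        System.out.println(count(names, messages)); // prints {Kane=2, Sirigu=2, Totti=3}
    }
}
```

In an actual Hadoop job the inner loop would presumably live in a Mapper that emits (name, 1) pairs, and the summing would move to a Reducer; the sketch keeps both in one method only to show the logic.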