Working with JSON in Scala using the Json4s library (part two)

In this second part (this is the link to the first part) we are going to delve deeper into the functions offered by the Json4s library. Let me start by saying that I’m not a developer or a contributor of Json4s; I’m just a user who wanted to write about this library in an article, in order to share my experience with it. As a consequence, I may not be using the library in the right way. If so, I apologize and kindly ask you to comment under this post, in the comment area.

In the previous article we saw simple operations; in this one we will see just a few of the more advanced features available:

  1. how the selection works with nested objects
  2. how to properly retrieve and print a value (a String in particular)
  3. how to filter fields and return a “pruned” JSON
  4. merging two JSONs
  5. the diff function

As in the previous article, let’s create a Json4sTest.scala file with a JSON variable:

package com.nosqlnocry.test

import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

object Json4sTest {
  def main(args: Array[String]) {
    val JSONString = """
      {
        "id": "1q2w3e4r5t",
        "age": 26,
        "loginTimeStamps": [1434904257,1400689856,1396629056],
        "messages": [
          {"id":1,"content":"Please like this post!"},
          {"id":2,"content":"Forza Roma!"}
        ],
        "profile": { "id":"my-nickname", "score":123, "avatar":"path.jpg" }
      }
    """
    val JSON = parse(JSONString)
    // ...
  }
}

The JSON contains a few elements: simple key-values, an array with numeric values (time-stamps as longs), an array of objects, and a nested object.

(1) Now let’s investigate how the commands …Continue reading →
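To anticipate where we are heading, here is a minimal sketch of the five operations above, continuing inside the main defined before (assuming json4s 3.x; otherJSON and its content are invented for the example):

implicit val formats = DefaultFormats

// (1) selecting a nested field with the \ operator
val avatar = JSON \ "profile" \ "avatar"                 // JString("path.jpg")

// (2) extracting a typed value instead of printing the raw AST node
val nickname = (JSON \ "profile" \ "id").extract[String] // "my-nickname"

// (3) one way to get a "pruned" JSON: drop every "avatar" field
val pruned = JSON removeField {
  case JField("avatar", _) => true
  case _                   => false
}

// (4) merging with a second JSON
val otherJSON = parse("""{ "age": 27, "city": "Roma" }""")
val merged = JSON merge otherJSON

// (5) diff tells what changed, what was added and what was deleted
val Diff(changed, added, deleted) = JSON diff merged     // changed: age, added: city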


Working with JSON in Scala using the Json4s library (part one)

In this very brilliant article, you can find a comparison between Scala libraries in terms of parsing speed. One of the best results was given by the json4s library. In the first part I will describe the library and its main functions, while in the second part I’ll go deeper, showing some more detailed examples. As usual, let’s create a Maven Scala project with Eclipse, adding the following dependency to the Maven pom.xml file:
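The dependency should look roughly like this (using json4s-jackson, matching the imports below; 3.2.10 being the “previous” version mentioned underneath):

<dependency>
  <groupId>org.json4s</groupId>
  <artifactId>json4s-jackson_${scala.version}</artifactId>
  <version>3.2.10</version>
</dependency>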


Substitute ${scala.version} with your version of Scala (2.10, for example). If you don’t know how to create a Maven project with Scala in Eclipse, follow this article (just the first part, in which it is shown how to set up Eclipse with the Scala plugin). At the time of writing, I’ve found some problems with version 3.2.11 (which is the latest one), but the previous one was working smoothly. Now let’s create a Scala object with a main function to run:

package com.nosqlnocry.test

import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

object Json4sTest {
  def main(args: Array[String]) {
    // ...
  }
}

Before starting, we have to take a look at how the json4s library models JSONs. Looking at the box below, we can see that it uses an AST (Abstract Syntax Tree). …Continue reading →
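For reference, the “box” mentioned above is the json4s AST, which is defined approximately as follows:

sealed abstract class JValue
case object JNothing extends JValue // 'zero' for JValue
case object JNull extends JValue
case class JString(s: String) extends JValue
case class JDouble(num: Double) extends JValue
case class JDecimal(num: BigDecimal) extends JValue
case class JInt(num: BigInt) extends JValue
case class JBool(value: Boolean) extends JValue
case class JObject(obj: List[JField]) extends JValue
case class JArray(arr: List[JValue]) extends JValue

type JField = (String, JValue)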

Spark examples: how to work with CSV / TSV files (performing selection and projection operations)

One of the simplest formats your files may have, in order to start playing with Spark, is CSV (comma-separated values) or TSV (tab-separated values). Let’s see how to perform some operations over a set of these files. As usual, I suggest you create a Scala Maven project in Eclipse, compile a jar, and execute it on the cluster with the spark-submit command.

See this previous article for detailed instructions about how to set up Eclipse for developing in Spark Scala, and this other article to see how to build a Spark fat jar and submit a job.

Say we have a folder in HDFS that contains the partial result of a previous computation.

[meniluca@hadoopnode: ~ ] $ hadoop fs -ls /user/meniluca/wireshark-csv-files/
-rw-r--r--   3 lmeniche supergroup          0 2015-03-13 23:00 _SUCCESS
-rw-r--r--   3 lmeniche supergroup          0 2015-03-13 22:57 part-00000
-rw-r--r--   3 lmeniche supergroup          0 2015-03-13 22:58 part-00001
-rw-r--r--   3 lmeniche supergroup          0 2015-03-13 22:57 part-00002
-rw-r--r--   3 lmeniche supergroup          0 2015-03-13 22:58 part-00003

[meniluca@hadoopnode: ~ ] $ hadoop fs -text /user/meniluca/wireshark-csv-files/part-00000 | head

Let’s imagine our directory “wireshark-csv-files” contains CSV files coming from some sort of processing of Wireshark data with …continue reading →
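To give a taste of what follows, here is a minimal sketch of a selection and a projection over those files (the column layout is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

object CsvSelectionProjection {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("csv-selection-projection"))
    // each line is a CSV record; split it into columns
    val rows = sc.textFile("/user/meniluca/wireshark-csv-files").map(_.split(","))
    // selection: keep only the rows whose first column (say, the protocol) is "TCP"
    val selected = rows.filter(row => row(0) == "TCP")
    // projection: keep only two columns (say, source and destination address)
    val projected = selected.map(row => (row(1), row(2)))
    projected.take(10).foreach(println)
  }
}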

Hadoop MapReduce wordcount example in Java. An introduction to Hadoop jobs.

In this article we are going to review the classic Hadoop word count example, customizing it a little bit. As usual, I suggest using Eclipse with Maven in order to create a project that can be modified, compiled, and easily executed on the cluster. First of all, download the Maven boilerplate project from here:

$ git clone

If you want to compile it directly, then you can:

$ cd maven-hadoop-java-wordcount-template
$ mvn package

The resulting fat jar will be found in the target folder with the name "maven-hadoop-java-wordcount-template-0.0.1-SNAPSHOT-jar-with-dependencies.jar".

Alternatively, if you want to modify the code (like we are about to do now), open Eclipse and go to [File] -> [Import] -> [Existing Maven Projects] -> Browse for the directory …Continue reading →
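Either way, once the fat jar is ready you can run it on the cluster with the standard hadoop jar command (the main class and HDFS paths below are hypothetical):

$ hadoop jar target/maven-hadoop-java-wordcount-template-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
    com.example.WordCount /user/meniluca/input /user/meniluca/output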

Setup Eclipse to start developing in Spark Scala and build a fat jar

I suggest two ways to get started developing Spark in Scala, both with Eclipse: one is to download from the Scala IDE site the full pre-configured Eclipse, which already includes the Scala IDE; the other consists in updating your existing Eclipse by adding the Scala plugin (detailed instructions below). This will basically allow you to start Scala projects and run them locally. In either case, at the end of the procedure, in order to start developing in Spark, you have to import inside Eclipse as an “existing Maven project” a project template (which you can find linked at the bottom of this article).

Now I’ll illustrate how to integrate the Scala plugin into your existing Eclipse installation. In this example I used an Eclipse Kepler EE. From the site, copy the latest link version for Kepler or, if not present, follow the “Older versions” link on the page and choose the right Scala version for you. I copied the link for an older stable version for Scala 2.10.4 (which is the version available in the cluster I’m using at the moment), precisely this:


Make sure you have Java JDK 1.7 installed and that Eclipse is pointing at it. Click on [Window] -> [Preferences] -> (in the left menu) [Java] -> (click on) [Installed JREs] and check whether a JDK 1.7 installation is selected. If not, use …Continue reading →

HelloWorld Spark? Smart (selective) wordcount Scala example!

In the previous post I showed how to build a Spark Scala jar and submit a job using spark-submit; now let’s customize our main Scala Spark object a little bit. You can find the project for the following example here on GitHub.

Let’s imagine we’ve collected a series of messages about football (tweets or whatever) and we want to count words: not simply every word, but only those of interest. Say we have a “dictionary” of football players’ names, and we want to see which of them appear the most in those messages.

Example time!

Imagine we have a file (called names) with a list of names (one per line); given the expected result below, something like:

Kane
Sirigu
Totti

And in another file (called messages) we have a list of messages:

I’m obviously with Harry Kane (Hurricane) today… Let’s go Tottenham!!!
Kid Kane with Arsenal jersey, lol
Another top save from Sirigu to keep Toulouse out. PSG seconds away from top spot.
Why Sirigu and Verratti are laughing so much?
Francesco Totti Scores With Flying Kung Fu Kick, Celebrates With Selfie
What would Roma do without Totti? See his great goal & celebration HERE! #Totti #Goals
Wenger wasn’t the Arsenal coach when Totti started to play in Serie A

The result of analyzing these lines has to be (Kane, 2), (Sirigu, 2), (Totti, 3). To achieve this, …continue reading →
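Here is a minimal sketch of the idea (file paths are hypothetical; the full project linked above differs in the details):

import org.apache.spark.{SparkConf, SparkContext}

object SelectiveWordcount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("selective-wordcount"))
    // load the dictionary of names and ship it to every executor
    val names = sc.broadcast(sc.textFile("names").map(_.trim).collect().toSet)
    // classic wordcount, keeping only the words that appear in the dictionary
    val counts = sc.textFile("messages")
      .flatMap(_.split("\\W+"))
      .filter(word => names.value.contains(word))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println) // prints each name with its count
  }
}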

How to build a Spark fat jar in Scala and submit a job

Are you looking for a ready-to-use solution to submit a job in Spark? These are short instructions about how to start creating a Spark Scala project, in order to build a fat jar that can be executed in a Spark environment. I assume you already have installed Maven (and Java JDK) and Spark (locally or in a real cluster); you can either compile the project from your shell (like I’ll show here) or “import an existing Maven project” with Eclipse and build it from there (read this other article to see how).

Requirements: Maven installation, Spark installation.

Simply download the following Maven project from GitHub:

If you have git installed, you can clone the repository:

git clone
cd spark-scala-maven-boilerplate-project

or, without git, you can download the zip from here: (to open it, use: unzip …)

Here is the pom.xml Maven file: …continue reading →
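Once built, the fat jar is submitted along these lines (a sketch: the main class, master and jar name are hypothetical):

$ spark-submit \
    --class com.example.MainApp \
    --master local[2] \
    target/spark-scala-maven-boilerplate-project-0.0.1-SNAPSHOT-jar-with-dependencies.jar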

o/ …hey!

Titling the first post “Hello World” is too predictable and boring, I suppose. Well, maybe I should have named it so, since the whole blog will be predictable and boring (I’ll try to put some effort into being original :P). Then, let me first apologize to you, both for the predictable-boring issue and for my non-mother-tongue English writing skills. What is this blog all about? Yet another programming blog in which I will drop some snippets of code, projects, tests, experiences, ideas… together with everything else you don’t care about, like stuff concerning my professional life and thoughts (I’ll introduce myself in the “About” page).

Leaving the latter aside, the first part, the one you do care about (or at least surely much more than the second), will involve big-data topics. Now, I hate two things in the world:

  1. anchovies on pizza
  2. the word big-data

Number 1 because (OK, if not from the horrible English, you’ll now understand from this that I’m from Italy) when you eat pizza with anchovies on it, you really can’t tell what flavor you have in your mouth. It is so strong and misleading that you could be chewing a sock with an anchovy on top and you wouldn’t feel the difference. “Big-data” for computer science is …continue reading →