
Spark examples: how to work with CSV / TSV files (performing selection and projection operations)

One of the simplest formats your files can have when you start playing with Spark is CSV (comma-separated values) or TSV (tab-separated values). Let’s see how to perform some operations over a set of these files. As usual, I suggest you create a Scala Maven project in Eclipse, compile a jar, and execute it on the cluster with the spark-submit command.

See this previous article for detailed instructions on how to set up Eclipse for developing Spark applications in Scala, and this other article to see how to build a Spark fat jar and submit a job.

Say we have a folder in HDFS that contains the partial result of a previous computation.

[meniluca@hadoopnode: ~ ] $ hadoop fs -ls /user/meniluca/wireshark-csv-files/
-rw-r--r--   3 lmeniche supergroup          0 2015-03-13 23:00 _SUCCESS
-rw-r--r--   3 lmeniche supergroup          0 2015-03-13 22:57 part-00000
-rw-r--r--   3 lmeniche supergroup          0 2015-03-13 22:58 part-00001
-rw-r--r--   3 lmeniche supergroup          0 2015-03-13 22:57 part-00002
-rw-r--r--   3 lmeniche supergroup          0 2015-03-13 22:58 part-00003
...

[meniluca@hadoopnode: ~ ] $ hadoop fs -text /user/meniluca/wireshark-csv-files/part-00000 | head
132,0,localhost,192.168.0.100,80,8234
0,3420,192.168.1.24,192.168.0.100,8888,9000
8538,0,192.168.0.100,nosqlnocry.com,80,45600
...
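Each line in these files has six comma-separated fields. As a first step, here is a minimal Scala sketch for parsing a line into a case class; the field names used here (bytes in/out, source and destination host, ports) are an assumption for illustration, since the real column meanings depend on how the files were produced.

```scala
// Hypothetical row layout for the sample data above: the column
// names are assumed, not taken from the original article.
case class Packet(bytesIn: Long, bytesOut: Long,
                  src: String, dst: String,
                  srcPort: Int, dstPort: Int)

object CsvParser {
  // Split a CSV line on commas and build a Packet from the six fields.
  def parse(line: String): Packet = {
    val f = line.split(",")
    Packet(f(0).toLong, f(1).toLong, f(2), f(3), f(4).toInt, f(5).toInt)
  }
}
```

With a parser like this in place, an RDD of raw lines can be turned into an RDD of structured rows with a single `map(CsvParser.parse)`.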

Let’s imagine our directory “wireshark-csv-files” contains CSV files coming from some sort of processing of Wireshark capture data.
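Selection (keeping only the rows that satisfy a predicate) and projection (keeping only some of the columns) map directly onto Spark’s `filter` and `map`. Here is a minimal sketch over plain split fields, assuming the same six-column layout as the sample output above; on an RDD obtained with `sc.textFile` the calls look the same, since RDDs expose the same `map`/`filter` API.

```scala
object CsvOps {
  // Selection: keep only the lines whose third field (assumed to be
  // the source host) matches the given value.
  def selectBySrc(lines: Seq[String], host: String): Seq[Array[String]] =
    lines.map(_.split(",")).filter(f => f(2) == host)

  // Projection: keep only the source and destination host columns.
  def projectHosts(rows: Seq[Array[String]]): Seq[(String, String)] =
    rows.map(f => (f(2), f(3)))
}
```

On the cluster the same chain would read, for example, `sc.textFile("/user/meniluca/wireshark-csv-files").map(_.split(",")).filter(f => f(2) == "localhost").map(f => (f(2), f(3)))` (the predicate and the projected columns here are illustrative choices).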
