How to build a Spark fat jar in Scala and submit a job

Are you looking for a ready-to-use solution to submit a job in Spark? These are short instructions on how to start a Spark Scala project and build a fat jar that can be executed in a Spark environment. I assume you have already installed Maven (and a Java JDK) and Spark (locally or on a real cluster); you can either compile the project from your shell (as I’ll show here) or “import an existing Maven project” into Eclipse and build it from there (read this other article to see how).

Requirements: Maven installation, Spark installation.

Simply download the following Maven project from GitHub: https://github.com/H4ml3t/spark-scala-maven-boilerplate-project

If you have git installed, you can clone the repository:

git clone git@github.com:H4ml3t/spark-scala-maven-boilerplate-project.git
cd spark-scala-maven-boilerplate-project

or, if you don’t have git, download the zip from https://github.com/H4ml3t/spark-scala-maven-boilerplate-project/archive/master.zip and extract it with “unzip master.zip”.

Here is the pom.xml Maven file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>spark-scala-maven-project</groupId>
  <artifactId>spark-scala-maven-project</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>${project.artifactId}</name>
  <description>This is a boilerplate maven project to start using Spark in Scala</description>
  <inceptionYear>2010</inceptionYear>

  <properties>
    <maven.compiler.source>1.6</maven.compiler.source>
    <maven.compiler.target>1.6</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.tools.version>2.10</scala.tools.version>
    <!-- Put the Scala version of the cluster -->
    <scala.version>2.10.4</scala.version>
  </properties>
  
  <!-- repository to add org.apache.spark -->
  <repositories>
    <repository>
      <id>cloudera-repo-releases</id>
      <url>https://repository.cloudera.com/artifactory/repo/</url>
    </repository>
  </repositories>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <!-- see http://davidb.github.com/scala-maven-plugin -->
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.1.3</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
            <configuration>
              <args>
                <arg>-make:transitive</arg>
                <arg>-dependencyfile</arg>
                <arg>${project.build.directory}/.scala_dependencies</arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>

      <!-- "package" command plugin -->
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.4.1</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

  <dependencies>
    <!-- Scala and Spark dependencies -->
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.2.0-cdh5.3.1</version>
    </dependency>
  </dependencies>
</project>

Change the version of the “spark-core_2.10” dependency to match your Spark installation, together with the “scala.version” property (it should match the Scala version your cluster uses).

Then, in the folder src/main/scala/com/examples, you will find a Scala object to modify with your own Spark Scala code. The current object takes all lines from the input path and replaces spaces with commas.

package com.examples

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.log4j.Logger

object MainExample {

  def main(arg: Array[String]) {
    val logger = Logger.getLogger(this.getClass)

    if (arg.length < 2) {
      logger.error("=> wrong parameters number")
      System.err.println("Usage: MainExample <pathToFiles> <outputPath>")
      System.exit(1)
    }

    val jobName = "MainExample"
    val conf = new SparkConf().setAppName(jobName)
    val sc = new SparkContext(conf)

    val pathToFiles = arg(0)
    val outputPath = arg(1)

    logger.info("=> jobName \"" + jobName + "\"")
    logger.info("=> pathToFiles \"" + pathToFiles + "\"")

    // read all lines from the input path as an RDD[String]
    val files = sc.textFile(pathToFiles)

    // do your work here: replace every space with a comma
    val rowsWithoutSpaces = files.map(_.replaceAll(" ", ","))

    // and save the result
    rowsWithoutSpaces.saveAsTextFile(outputPath)
  }
}
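
As an illustration of where your own logic goes, here is a hedged sketch (not part of the boilerplate repository) showing how the lines between the “do your work here” comment and the save could be swapped for a classic word count, using the same RDD API available in Spark 1.x:

// hypothetical replacement for the "do your work here" section
val files = sc.textFile(pathToFiles)

val wordCounts = files
  .flatMap(_.split("\\s+"))   // split each line into words
  .filter(_.nonEmpty)         // drop empty tokens
  .map(word => (word, 1))     // pair each word with a count of 1
  .reduceByKey(_ + _)         // sum the counts per word

wordCounts.saveAsTextFile(outputPath)

The reduceByKey call works thanks to the import org.apache.spark.SparkContext._ already present at the top of the file, which brings the pair-RDD functions into scope in Spark 1.x.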

When you’ve finished, compile and package the project with the Maven command:

mvn clean package

If the output is like this …

[...]
[INFO]
[INFO] --- maven-assembly-plugin:2.4.1:single (make-assembly) @ spark-scala-maven-project ---
[INFO] Building jar: .../spark-scala-maven-boilerplate-project/target/spark-scala-maven-project-0.0.1-SNAPSHOT-jar-with-dependencies.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 30.241 s
[INFO] Finished at: 2015-03-01T11:29:19+01:00
[INFO] Final Memory: 44M/766M
[INFO] ------------------------------------------------------------------------

… you will find the fat jar, called spark-scala-maven-project-0.0.1-SNAPSHOT-jar-with-dependencies.jar, inside the target folder.

Read this article to learn how to set up Eclipse for developing Spark Scala projects and how to build the jar within Eclipse with Maven.

Copy this jar onto a machine where Spark is available (if you are developing outside the cluster, you can use the “scp” or “rsync” command to move the jar into it) and submit your job with the spark-submit command:

spark-submit --class com.examples.MainExample \
    --master yarn-cluster \
    spark-scala-maven-project-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
    inputhdfspath \
    outputhdfspath

The MainExample Scala object takes two arguments: the input path and the output path. Most of the time you don’t have to specify the “hdfs://” prefix before the paths, since the job runs inside the Hadoop environment, whose default file system is HDFS. You can use wildcards to select all the content of a folder as input, for example “/user/me/input-folder/*”, while the output path must be a folder that does not exist yet.
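
If you want to test the job locally before moving the jar to the cluster, one option (a hedged sketch, not something the boilerplate does by default) is to hard-code a local master when building the SparkConf; remember to remove it before packaging for the cluster, otherwise it takes precedence over the --master option passed to spark-submit:

// local-test variant of the SparkConf setup in MainExample
// (assumption: you run it on your own machine, not on YARN)
val conf = new SparkConf()
  .setAppName("MainExample")
  .setMaster("local[*]") // use all available local cores
val sc = new SparkContext(conf)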

After the execution of your job, if everything went fine, you can download the result from HDFS to your local file system using the “-getmerge” operation:

hadoop fs -getmerge outputhdfspath resultRetrievedLocally
cat resultRetrievedLocally

I hope this was helpful; please write a comment if you run into any problem or error, or if you have a suggestion. Please share/like the article. Cheers,

meniluca 😀

