Running Apache Spark on HashiCorp Nomad
Apache Spark is a popular data processing engine and framework that was architected to work with third-party cluster managers and schedulers. The schedulers available to date, however, carry a level of operational complexity that can be undesirable for many potential Spark users. To help fill this gap, we are pleased to announce that the HashiCorp Nomad ecosystem now includes a version of Apache Spark that natively integrates Nomad as a Spark cluster manager and scheduler.
» Why Spark on Nomad?
Nomad's design (inspired by Google's Borg and Omega) has enabled a set of features that make it well-suited to run analytical applications. Particularly relevant is its native support for batch workloads and parallelized, high throughput scheduling (more on Nomad’s scheduler internals here). Nomad is also easy to set up and use, which has the potential to ease the learning curve and operational burden for Spark users. Key ease-of-use related features include:
- Single binary deployment and no external dependencies
- A simple and intuitive data model
- A declarative job specification
- Support for high availability and multi-datacenter federation out-of-the-box
Nomad also integrates seamlessly with HashiCorp Consul and HashiCorp Vault for service discovery, runtime configuration, and secrets management.
» How it Works
When running on Nomad, the Spark executors that run tasks for your application, and (optionally) the application driver itself, run as Nomad tasks in a Nomad job.
A user can submit a Spark application in the usual way. In this example, the spark-submit command is used to run the SparkPi sample application against Nomad in cluster mode:
$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master nomad \
    --deploy-mode cluster \
    --conf spark.nomad.sparkDistribution=http://example.com/spark.tgz \
    http://example.com/spark-examples.jar 100
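In cluster mode, the application driver itself also runs as a Nomad task. For comparison, the following sketch submits the same application in client mode, where the driver runs on the submitting machine and only the executors are scheduled by Nomad. The executor count is requested with the standard spark.executor.instances property, and the URLs are the same illustrative placeholders as above:

# Client mode: the driver stays local; executors run as Nomad tasks.
$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master nomad \
    --deploy-mode client \
    --conf spark.executor.instances=4 \
    --conf spark.nomad.sparkDistribution=http://example.com/spark.tgz \
    http://example.com/spark-examples.jar 100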
A user can customize the Nomad job that Spark creates by explicitly setting configuration properties (as with spark.nomad.sparkDistribution above) or by using a custom job template as a starting point:
job "template" {
meta {
"foo" = "bar"
}
group "executor-group-name" {
task "executor-task-name" {
meta {
"spark.nomad.role" = "executor"
}
env {
"BAZ" = "something"
}
}
}
}
Job templates can be used to add metadata or constraints, set environment variables, add sidecar tasks, and make use of the Consul and Vault integrations.
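As a rough sketch, a saved template could then be referenced at submission time through a configuration property. The spark.nomad.job.template property name and the local file path below are assumptions used for illustration; see the integration guide for the authoritative property names:

# Point spark-submit at a custom Nomad job template (property name assumed).
$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master nomad \
    --deploy-mode cluster \
    --conf spark.nomad.job.template=/path/to/template.nomad \
    --conf spark.nomad.sparkDistribution=http://example.com/spark.tgz \
    http://example.com/spark-examples.jar 100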
The Nomad/Spark integration also supports fine-grained resource allocation, HDFS, and continuous monitoring of application output.
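For example, executor resources can be requested with standard Spark properties, which the integration is expected to translate into the resources of the generated Nomad tasks (the exact mapping is assumed here; consult the integration guide for specifics):

# Standard Spark resource properties; Nomad task resources are derived from these.
$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master nomad \
    --deploy-mode cluster \
    --conf spark.executor.memory=2g \
    --conf spark.executor.cores=2 \
    --conf spark.nomad.sparkDistribution=http://example.com/spark.tgz \
    http://example.com/spark-examples.jar 100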
» Getting Started
Our official Apache Spark Integration Guide is the best way to get started. You can also use Nomad's example Terraform configuration and embedded Spark quickstart to give the integration a test drive on AWS. Nomad-enabled builds are currently available for Spark 2.1.0 and 2.1.1.