
Singularity and HashiCorp Nomad: A Perfect Fit for Enterprise High Performance Computing

Eduardo Arango is a software engineer at Sylabs Inc., the company behind the Singularity open source project, where he works on quality assurance and test infrastructure for the Singularity project and on the Nomad integration with the Singularity runtime; he is also a Singularity OSS maintainer. He is currently a PhD student in computer science, focusing on cloud computing architecture, with research interests in high performance computing, Linux containers, distributed systems, and cloud computing. LinkedIn: https://www.linkedin.com/in/eduardo-arango

Containers are changing the software packaging and distribution paradigm. Singularity takes this to the next level by offering a simple platform designed around container mobility, reproducibility, security, and performance. Singularity was designed to solve problems associated with root-owned daemons and root-level privileges within containers in multi-tenant environments. Before Singularity, these concerns prevented the system administrators and architects charged with building trusted HPC environments for scientists from installing container platforms.

Singularity blocks privilege escalation within the container; if a user wants to be root inside the container, they must be root outside the container. This usage paradigm mitigates many of the security concerns that exist with containers on multi-tenant shared resources. You can directly call programs inside the container from outside the container, fully incorporating pipes, standard IO, filesystem access, X11, and MPI. The Singularity runtime facilitates cohesion between applications that require direct integration with the host operating system (e.g., GPUs, InfiniBand, other specialized interconnects) and services that require isolation when making use of host resources (e.g., network or CPU utilization).
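
As a minimal sketch of that workflow (the image and script names here are illustrative, not taken from any example repo), a program inside a Singularity image can be dropped into an ordinary shell pipeline:

# Host stdin, pipes, and the filesystem flow straight through the container boundary
$ cat input.csv | singularity exec analysis.sif python3 /opt/process.py > output.csv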

Our goal at Sylabs is to extend the reach of Singularity by providing access to services that can handle more demanding artificial intelligence, machine learning, and other advanced analytic workloads. With Singularity, enterprise users have direct access to an entire solutions ecosystem that simplifies the process of moving applications, workloads, and computing environments across a single infrastructure or across hybrid environments. Singularity harnesses the power of AI, machine learning, and deep learning to deliver on our goal and provide unique enterprise-grade services.

Nomad’s performance characteristics and scalability make it well-suited to orchestrate high performance analytical workloads. Nomad’s task driver subsystem allows users to leverage these characteristics for both Docker-based and legacy/non-containerized workloads. The task driver subsystem was refactored in the 0.9 release to enable users to contribute new task drivers as external plugins. The Singularity task driver plugin for Nomad is the first such community contribution! This integration enables data scientists and other users to run analytical workloads that combine the benefits of the two systems. Released under the MPL-2.0 open source license, the v1.0-alpha release of the Singularity plugin for Nomad is now available. We look forward to community feedback.

»Getting Started

To compile the task driver, run make build after cloning the repo. This builds the binary for the Nomad task driver plugin. After the build step, copy the task driver binary to the Nomad plugin directory, which by default is the plugins/ subdirectory of the Nomad data directory. See the Nomad -data-dir and -plugin-dir flags for more information.
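
Assuming the repository has already been cloned and the Nomad data directory is /opt/nomad/data (both paths are illustrative, and the built binary name may differ depending on the Makefile), the steps look roughly like this:

$ cd nomad-driver-singularity
$ make build                                     # build the task driver plugin binary
$ mkdir -p /opt/nomad/data/plugins
$ cp nomad-driver-singularity /opt/nomad/data/plugins/
$ nomad agent -dev -plugin-dir=/opt/nomad/data/plugins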

After starting the Nomad agent, we can check the Singularity task driver status:

$ nomad-driver-singularity> nomad node status -self
ID            = 27dc426e
Name          = linux-345w
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 17h8m0s
Driver Status = exec,java,mock_driver,qemu,raw_exec,singularity

Node Events
Time                        Subsystem     Message
2019-04-05T10:17:48-05:00   Cluster       Node registered

Allocated Resources
CPU              Memory       Disk
0/33600 MHz      0 B/16 GiB   0 B/369 GiB

Allocation Resource Utilization
CPU              Memory
0/33600 MHz      0 B/16 GiB

Host Resource Utilization
CPU               Memory            Disk
1539/33600 MHz    7.2 GiB/16 GiB    10 GiB/380 GiB

Device Resource Utilization
nvidia/gpu/Quadro M620[GPU-0173e955-9436-1b06-0e11-4b0134af1e92] 574 / 1999 MiB

Allocations
No allocations placed

Notice that the “singularity” driver appears in the Driver Status field of the output above.

Since the task driver is healthy, we can start planning our first job:

$ nomad-driver-singularity> nomad plan examples/example.hcl
+ Job: "example1"
+ Task Group: "wild-cow" (1 create)
+ Task: "mooo" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 0

To submit the job with version verification run:

nomad job run -check-index 0 examples/example.hcl

Running the job with the check-index flag ensures that it will only be executed if the server-side version matches the job modify index returned. If the index has changed, another user has modified the job and the plan's results are potentially invalid.

In the project repo, we can find a working example under examples/example.hcl:

job "example1" {
  
	datacenters = ["dc1"]
	type = "batch"
	group "wild-cow" {
	
		count = 1
	
		task "mooo" {
	
			driver = "singularity"
	
			// You can pass env vars to the runtime
			env {
				SINGULARITYENV_FOO = "var"  
			}
	
			config {

				// For this example we are enabling debug and verbose
				// options to retrieve logs via alloc logs
				debug = true
				verbose = true

				// This example runs an image from the sylabs container
				// library with the canonical example of lolcow
				image = "library://sylabsed/examples/lolcow:latest"

				// command can be run, exec or test
				command = "run"  
			}
		}
	}
}
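
The env block above relies on Singularity's SINGULARITYENV_ convention: variables with that prefix are injected into the container with the prefix stripped, so the task's process sees FOO=var. A rough command line equivalent (the image filename is illustrative):

$ export SINGULARITYENV_FOO=var
$ singularity exec lolcow_latest.sif env | grep '^FOO='
FOO=var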

Let's run the example job to make that cow mooooo:

$ nomad-driver-singularity> nomad run examples/example.hcl

==> Monitoring evaluation "f237d718"
		Evaluation triggered by job "example1"
		Allocation "4dca7d9d" created: node "27dc426e", group "wild-cow"
		Evaluation status changed: "pending" -> "complete"
==> Evaluation "f237d718" finished with status "complete"

We can check how everything is going with the nomad job status command:

$ nomad-driver-singularity> nomad job status
ID        Type   Priority Status   Submit Date
example1  batch  50       running  2019-04-05T10:24:09-05:00
$ nomad-driver-singularity> nomad job status example1
ID            = example1
Name          = example1
Submit Date   = 2019-04-05T10:24:09-05:00
Type          = batch
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
wild-cow    0       1         0        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
4dca7d9d  27dc426e  wild-cow    0        run      pending  1m13s ago  1m13s ago

And the end result - a happy and opinionated cow!

$ nomad-driver-singularity> nomad logs 4dca7d9d mooo

 ^__^
 (oo)\_______
 (__)\       )\/\
     ||----w |
     ||     ||

The example above demonstrates the Singularity task driver plugin for HashiCorp Nomad with the canonical lolcow example. See the official documentation on the Nomad website for additional details.

Singularity continues to experience widespread support from a growing community of users. We are proud to say that the Singularity container runtime and image format are trusted to run over 1 million jobs each day by users in academia, government, and a rapidly growing enterprise segment.
