Singularity and HashiCorp Nomad: A Perfect Fit for Enterprise High Performance Computing
Eduardo Arango is a software engineer at Sylabs Inc. and a PhD student in computer science focusing on cloud computing architecture. His research areas are high performance computing, Linux containers, distributed systems, and cloud computing. At Sylabs, the company behind the Singularity OSS project, he works on quality assurance and the test infrastructure for the Singularity project as well as the Nomad integration with the Singularity runtime, and he is a Singularity OSS maintainer. LinkedIn: https://www.linkedin.com/in/eduardo-arango
Containers are changing the software packaging and distribution paradigm. Singularity takes this to the next level by offering a simple platform designed around container mobility, reproducibility, security, and performance. Singularity was designed to solve problems associated with root-owned daemons and root-level privileges within containers in multi-tenant environments. Before Singularity, these concerns prevented the system administrators and architects charged with building trusted HPC environments for scientists from installing container platforms.
Singularity blocks privilege escalation within the container; if a user wants to be root inside the container, they must be root outside the container. This usage paradigm mitigates many of the security concerns that exist with containers on multi-tenant shared resources. You can directly call programs inside the container from outside the container, fully incorporating pipes, standard I/O, filesystem access, X11, and MPI. The Singularity runtime facilitates cohesion between applications that require direct integration with the host operating system (e.g., GPUs, InfiniBand, other specialized interconnects) and the services that require isolation when making use of host resources (e.g., network or CPU utilization).
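For instance, a program inside an image can be invoked straight from the host shell and composed with ordinary pipes and redirects. Here is a minimal sketch, assuming an image file named mycontainer.sif is available on the host (the file name is illustrative, not part of the project):
$ # Run a command from inside the container and pipe its output to a host tool
$ singularity exec mycontainer.sif cat /etc/os-release | grep PRETTY_NAME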
Our goal at Sylabs is to extend the reach of Singularity by providing access to services that can handle more demanding artificial intelligence, machine learning, and other advanced analytic workloads. With Singularity, enterprise users have direct access to an entire solutions ecosystem that simplifies the process of moving applications, workloads, and computing environments across a single infrastructure or across hybrid environments. Singularity harnesses the power of AI, machine learning, and deep learning to deliver on that goal and provide unique enterprise-grade services.
Nomad’s performance characteristics and scalability make it well-suited to orchestrate high performance analytical workloads. Nomad’s task driver subsystem allows users to leverage these characteristics for both Docker-based and legacy/non-containerized workloads. The task driver subsystem was refactored in the 0.9 release to enable users to contribute new task drivers as external plugins. The Singularity task driver plugin for Nomad is the first such community contribution! This integration enables data scientists and other users to run analytical workloads that combine the benefits of the two systems. Released under the MPL 2.0 open source license, the v1.0-alpha release of the Singularity plugin for Nomad is now available. We look forward to community feedback.
» Getting Started
To compile the task driver, run make build after cloning the repo. This will build the binary for the Nomad task driver plugin. After the build step, copy the task driver binary to the Nomad plugin dir, which by default is located under <data_dir>/plugins; see the -data-dir and -plugin-dir agent flags for more information.
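As a quick sketch of those steps, assuming the compiled plugin binary is named nomad-driver-singularity and /opt/nomad/plugins is used as the plugin directory (both names are illustrative, not prescribed by the project), the workflow looks roughly like this:
$ make build                                        # build the task driver plugin
$ mkdir -p /opt/nomad/plugins
$ cp nomad-driver-singularity /opt/nomad/plugins/   # install it into the plugin dir
$ nomad agent -dev -plugin-dir=/opt/nomad/plugins   # start a dev agent that loads it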
After starting the Nomad agent, we can check the status of the Singularity task driver:
$ nomad-driver-singularity> nomad node status -self
ID = 27dc426e
Name = linux-345w
Class = <none>
DC = dc1
Drain = false
Eligibility = eligible
Status = ready
Uptime = 17h8m0s
Driver Status = exec,java,mock_driver,qemu,raw_exec,singularity
Node Events
Time Subsystem Message
2019-04-05T10:17:48-05:00 Cluster Node registered
Allocated Resources
CPU Memory Disk
0/33600 MHz 0 B/16 GiB 0 B/369 GiB
Allocation Resource Utilization
CPU Memory
0/33600 MHz 0 B/16 GiB
Host Resource Utilization
CPU Memory Disk
1539/33600 MHz 7.2 GiB/16 GiB 10 GiB/380 GiB
Device Resource Utilization
nvidia/gpu/Quadro M620[GPU-0173e955-9436-1b06-0e11-4b0134af1e92] 574 / 1999 MiB
Allocations
No allocations placed
Notice that the “singularity” driver appears in the Driver Status field in the output above.
Since the task driver is healthy, we can start planning our first job:
$ nomad-driver-singularity> nomad plan examples/example.hcl
+ Job: "example1"
+ Task Group: "wild-cow" (1 create)
+ Task: "mooo" (forces create)
Scheduler dry-run:
- All tasks successfully allocated.
Job Modify Index: 0
To submit the job with version verification run:
nomad job run -check-index 0 examples/example.hcl
Running the job with the check-index flag ensures that it will only be executed if the server-side version matches the job modify index returned. If the index has changed, another user has modified the job and the plan's results are potentially invalid.
In the project repo, we can find a working example under examples/example.hcl:
job "example1" {
datacenters = ["dc1"]
type = "batch"
group "wild-cow" {
count = 1
task "mooo" {
driver = "singularity"
// You can pass env vars to the runtime
env {
SINGULARITYENV_FOO = "var"
}
config {
// For this example we are enabling debug and verbose
// options to retrieve logs via alloc logs
debug = true
verbose = true
// This example runs an image from the sylabs container
// library with the canonical example of lolcow
image = "library://sylabsed/examples/lolcow:latest"
// command can be run, exec or test
command = "run"
}
}
}
}
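For reference, the config stanza above maps roughly onto a direct Singularity CLI invocation, and the SINGULARITYENV_ prefix is how Singularity injects environment variables into the container. So the task is roughly equivalent to running:
$ SINGULARITYENV_FOO=var singularity run library://sylabsed/examples/lolcow:latest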
Let's run the example job to make that cow mooooo:
$ nomad-driver-singularity> nomad run examples/example.hcl
==> Monitoring evaluation "f237d718"
Evaluation triggered by job "example1"
Allocation "4dca7d9d" created: node "27dc426e", group "wild-cow"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "f237d718" finished with status "complete"
We can check how everything is going with the nomad job status
command:
$ nomad-driver-singularity> nomad job status
ID Type Priority Status Submit Date
example1 batch 50 running 2019-04-05T10:24:09-05:00
$ nomad-driver-singularity> nomad job status example1
ID = example1
Name = example1
Submit Date = 2019-04-05T10:24:09-05:00
Type = batch
Priority = 50
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
wild-cow 0 1 0 0 0 0
Allocations
ID Node ID Task Group Version Desired Status Created Modified
4dca7d9d 27dc426e wild-cow 0 run pending 1m13s ago 1m13s ago
And the end result: a happy and opinionated cow!
$ nomad-driver-singularity> nomad logs 4dca7d9d mooo
^__^
(oo)\_______
(__)\       )\/\
    ||----w |
    ||     ||
The example above demonstrates the Singularity task driver plugin for HashiCorp Nomad with the canonical lolcow example. See the official documentation on the Nomad website for additional details.
Singularity continues to experience widespread support from a growing community of users. We are proud to say that the Singularity container runtime and image format is trusted to run over 1 million jobs each day by users in academia, government, and a rapidly growing enterprise segment.