
Nomad Bench: Load testing and benchmarking for Nomad

Nomad Bench provides reusable infrastructure tooling, so you can load test and experiment with Nomad clusters running at scale.

HashiCorp Nomad is simple to deploy and highly scalable, but exactly how scalable? Production clusters reach 10,000 clients and beyond, but reproducing bugs or testing load characteristics at this scale is challenging and costly. The one million and two million container challenges provide an impressive baseline for scheduling performance, but they were not intended to create a realistic scenario for constant experimentation.

The nomad-bench project set out to create reusable infrastructure automation that runs test scenarios and collects metrics and data from Nomad clusters running at scale. The core goal of this effort is to create reproducible, large-scale test scenarios so users can better understand how Nomad scales and uncover problems detected only in large cluster deployments.

»nomad-bench components

The nomad-bench infrastructure consists of two main components. The core cluster is a long-lived, production-ready Nomad cluster used to run base services and drive test cases. One important service running in the core cluster is an InfluxDB instance that collects real-time data from test runs.

The test cluster is a short-lived ephemeral cluster running Nomad servers on Amazon EC2 instances and Nomad clients using nomad-nodesim, allowing clusters to scale to tens of thousands of nodes. Each test cluster can have a different number of servers, EC2 instance type, disk performance, and operating system. Test clusters may also be configured with a custom binary to easily test and compare code changes.

[Diagram: test clusters]
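
As an illustration of how those knobs might be expressed, the sketch below shows a hypothetical Terraform-style definition for a single test cluster. The module path and variable names are assumptions made for this example, not the actual nomad-bench module interface.

# Hypothetical test cluster definition; names and values are illustrative.
module "test_cluster_1" {
  source = "./modules/test-cluster"

  server_count         = 3                      # number of Nomad servers
  server_instance_type = "m5.large"             # EC2 instance type
  server_volume_iops   = 3000                   # disk performance
  server_ami_id        = "ami-0abc123example"   # operating system image
  nomad_local_binary   = "./build/nomad"        # optional custom binary under test
}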

»Data collection

To collect and analyze data from tests, each cluster has an associated InfluxDB bucket to isolate its data. InfluxDB also allows for real-time data analysis to monitor test progress via dashboards. Data is collected using Telegraf daemons deployed on all test cluster servers. In addition to Nomad metrics and logs, these daemons collect system metrics, such as CPU, memory, and disk IO.
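
As a rough sketch of how the pieces fit together, the Nomad agents in a test cluster can publish their metrics to the local Telegraf daemon through Nomad's telemetry block; the statsd address and interval below are assumptions for this example rather than the exact nomad-bench configuration.

# Nomad agent configuration fragment (illustrative values).
# A Telegraf daemon on the same host listens for statsd metrics and
# forwards them, along with system metrics, to the cluster's InfluxDB bucket.
telemetry {
  collection_interval        = "1s"
  disable_hostname           = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
  statsd_address             = "127.0.0.1:8125"
}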

We chose InfluxDB over Prometheus, Grafana, and other tools because it can easily load existing data via the plain-text line protocol format, isolate data in buckets, and deploy as a single binary.

»nomad-nodesim

nomad-nodesim is a lightweight, simulated Nomad client wrapper that can run hundreds of Nomad client nodes per application instance. This helps simulate and register tens of thousands of Nomad clients in a single test cluster, without having to stand up tens of thousands of real hosts. The clients can have different configurations, such as being partitioned into different datacenters or node pools, or holding different metadata values.

These nomad-nodesim processes are deployed to the core Nomad cluster so they run within the same private network. Each test cluster has its own nomad-nodesim job that can be customized for the scenario being tested.
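
A minimal sketch of such a job is shown below. The flag names, server address, and counts are hypothetical placeholders used only to illustrate how a test cluster's simulated clients can be parameterized; they are not the actual nomad-nodesim interface or the real nomad-bench job specification.

# Hypothetical nomad-nodesim job running on the core cluster.
job "nodesim-test-cluster-1" {
  group "nodesim" {
    # Each allocation runs one nomad-nodesim process that registers many
    # simulated client nodes against the test cluster's servers.
    count = 10

    task "nodesim" {
      driver = "exec"

      config {
        command = "local/nomad-nodesim"
        # Assumed flags: target server address, number of simulated nodes,
        # and the node pool the simulated clients should join.
        args = [
          "-server-addr=10.0.0.10:4647",
          "-node-num=1000",
          "-node-pool=bench",
        ]
      }

      resources {
        cpu    = 500
        memory = 512
      }
    }
  }
}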

The application simplifies deployment and eases financial headaches when running Nomad clients at large scale. The calculations below roughly estimate the cost of running 3 Nomad servers and 10,000 Nomad clients for a month.

  • Without nomad-nodesim: 3 x t3.medium and 10,000 x t3.nano = $25,617 (EC2 only)
  • With nomad-nodesim: 13 x t3.medium = $268 (EC2 only)

»Validation

To have confidence in running Nomad server load and stress tests using nomad-nodesim, the team needed to validate that nomad-nodesim closely mimics “real” clients. To do this, we ran two experiment variations: one with real Nomad clients running on dedicated hosts, the other with nomad-nodesim clients. Both ran three dedicated hosts for the Nomad server process and five clients.

Each experiment ran through a simple set of steps:

  • Register a job using the mock driver with a single task group whose count is 100
  • Update the task group resources to force a destructive update
  • Deregister the job

The job specification that was initially registered is detailed below. It used an HCL variable to control the task memory resource assignment, which made scripting the updates easier, as no manipulation of the specification file was required.

variable "resource_memory" { default = 10 }
 
job "mock" {
  update {
    max_parallel = 25
  }
  group "mock" {
    count = 100
    task "mock" {
      driver = "mock_driver"
      config {
        run_for = "24h"
      }
      resources {
        cpu    = 1
        memory = var.resource_memory
      }
    }
  }
}

The script below is used to run the experiment in a controlled and repeatable manner, pausing after each step to allow system stabilization.

#!/usr/bin/env bash

# Exit on the first error and echo each command as it runs.
set -e -x

# Pause before and after each step to let the cluster stabilize.
sleep 90

# Step 1: register the job.
nomad run mock.nomad.hcl
sleep 90

# Step 2: bump the memory resource to force a destructive update.
nomad run -var='resource_memory=11' mock.nomad.hcl
sleep 90

# Step 3: deregister the job and garbage collect cluster state.
nomad stop mock
sleep 90
nomad system gc

Here’s the resulting count(nomad.client.update_status) chart:

[Chart: count(nomad.client.update_status)]

And the resulting count(nomad.client.update_alloc) chart:

[Chart: count(nomad.client.update_alloc)]

And the count(nomad.client.get_client_allocs) chart:

[Chart: count(nomad.client.get_client_allocs)]

»nomad-bench results and next steps

With this validation experiment, we confirmed that the nomad-nodesim application performs similarly to real Nomad clients. It notifies Nomad servers of allocation updates slightly faster because it does not have a task-runner or driver implementation. In some cases, this may also cause updates to be batched slightly more efficiently than with real clients.

In order to account for this minor difference and to allow for more flexible testing, we added configuration functionality within PR #24 to allow running nomad-nodesim with either a simulated or real allocation runner implementation.

»Try it yourself

The Nomad benchmarking infrastructure and nomad-nodesim application provide an excellent base for running repeatable, large-scale tests. They allow engineers and users to test Nomad at scale and iterate on changes to identify throughput improvements. The Nomad engineering team uses this setup to run a persistent cluster for soak testing and short-lived clusters to test code changes.

If you want to check out and run the nomad-bench infrastructure suite, you can do so using the publicly available repository. Instructions on how to get started are included, and all feedback is welcome.
