New approaches to measuring Nomad performance
See how the HashiCorp Nomad team re-examined how to measure performance for a workload orchestrator, resulting in new metrics that better capture Nomad’s performance.
During the work on Nomad’s 1.8 release, the engineering team spent some time analyzing HashiCorp Nomad’s performance, trying to identify bottlenecks and come up with guidelines for our users. In order to do that, we first asked ourselves: In what ways can we measure the performance of a workload orchestrator?
Traditionally in engineering, we consider three elements of systems performance:
- Throughput: The number of operations performed in a given time frame, commonly used in telecommunications when measuring data rate (bytes per second)
- Response time: The time it takes for a service to respond to a user's input
- Latency: The time a unit of work needs to wait before it gets processed by a service
Even though Nomad comes with a comprehensive set of metrics, most of them are low-level and using them to understand server performance can be non-trivial for cluster operators.
At a high level, Nomad servers operate like a queue. Every time a user submits a job, Nomad creates an evaluation — a unit of work — and this evaluation ends up in a queue for processing. This queue is called the evaluation broker. All evaluations initially land there, and that’s where scheduler workers pick them up to create deployment plans, put those plans into another queue — the plan queue — and create allocations. Allocations declare which tasks in a given job should run on a given node. The diagram below illustrates this process at a high level:
Even though the diagram above illustrates two queues — the eval broker and the plan queue — the eval broker alone is enough to capture throughput, response time, and latency. Because scheduler workers report evaluation status back to the eval broker, the time spent in the plan queue is accounted for as well. To capture these three properties, Nomad 1.8 added three new metrics to the evaluation broker: wait time, process time, and response time, each described below.
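To make the queueing model concrete, here is a minimal Go sketch of the broker/worker pattern described above. It is an illustration only, not Nomad’s implementation: the channel-based queues, worker count, and evaluation type are all invented for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Evaluation is a simplified stand-in for Nomad's unit of work.
type Evaluation struct {
	ID       string
	Enqueued time.Time
}

func main() {
	evalBroker := make(chan Evaluation, 100) // evaluations wait here (wait_time)
	planQueue := make(chan string, 100)      // plans wait here for the plan applier

	var wg sync.WaitGroup
	// Scheduler workers dequeue evaluations and produce deployment plans.
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for eval := range evalBroker {
				waited := time.Since(eval.Enqueued) // analogous to broker wait_time
				planQueue <- fmt.Sprintf("plan for %s (waited %v)", eval.ID, waited)
			}
		}()
	}

	// A job submission creates an evaluation and enqueues it on the broker.
	for i := 0; i < 8; i++ {
		evalBroker <- Evaluation{ID: fmt.Sprintf("eval-%d", i), Enqueued: time.Now()}
	}
	close(evalBroker)
	wg.Wait()
	close(planQueue)

	// The plan applier would dequeue plans and create allocations here.
	for plan := range planQueue {
		fmt.Println(plan)
	}
}
```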
» Wait time
The `wait_time` metric measures the time an evaluation spent in the evaluation broker queue before it was picked up by a scheduler worker.

| Metric | Description | Unit | Type | Labels |
|---|---|---|---|---|
| `nomad.nomad.broker.wait_time` | Time elapsed while the evaluation was ready to be processed and waiting to be dequeued. | ms | Timer | job, namespace, type, triggered_by, host |
» Process time
The `process_time` metric measures how long the scheduler worker took to process the evaluation, i.e. from the time it left the evaluation broker queue until a response was returned.

| Metric | Description | Unit | Type | Labels |
|---|---|---|---|---|
| `nomad.nomad.broker.process_time` | Time elapsed from when the evaluation was dequeued to when it finished processing. | ms | Timer | job, namespace, type, triggered_by, host |
» Response time
The `response_time` metric measures the period from when the evaluation first enters the broker queue until you hear back from the scheduler worker. The result of the evaluation is irrelevant here; only the time spent processing it matters.

| Metric | Description | Unit | Type | Labels |
|---|---|---|---|---|
| `nomad.nomad.broker.response_time` | Time elapsed from when the evaluation was last enqueued to when it finished processing. | ms | Timer | job, namespace, type, triggered_by, host |
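All three timers surface through the Nomad agent’s metrics API at `/v1/metrics`. The sketch below polls that endpoint and prints the broker timers. It is a sketch only, assuming a local agent on the default port with no TLS or ACLs; the JSON field names follow the go-metrics structure that Nomad’s in-memory telemetry sink emits.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// sample mirrors the relevant fields of a go-metrics timer sample
// as returned by the agent's /v1/metrics endpoint.
type sample struct {
	Name  string  `json:"Name"`
	Count int     `json:"Count"`
	Mean  float64 `json:"Mean"` // milliseconds for timer metrics
}

type summary struct {
	Samples []sample `json:"Samples"`
}

func main() {
	resp, err := http.Get("http://localhost:4646/v1/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var s summary
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		panic(err)
	}

	// Print the broker timers (wait_time, process_time, response_time)
	// observed during the current aggregation interval.
	for _, m := range s.Samples {
		if strings.Contains(m.Name, "broker.wait_time") ||
			strings.Contains(m.Name, "broker.process_time") ||
			strings.Contains(m.Name, "broker.response_time") {
			fmt.Printf("%s: mean=%.2f ms over %d samples\n", m.Name, m.Mean, m.Count)
		}
	}
}
```

If Prometheus metrics are enabled in the agent’s telemetry block, the same endpoint can instead be scraped with `?format=prometheus`.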
» What about throughput?
A Nomad server’s throughput can be easily captured with existing metrics. If you define throughput as “the number of successful work units completed in a given time,” then you can use the `nomad.nomad.eval.ack` metric. It measures how long it takes for the `Eval.Ack` RPC to complete, and if you’re interested only in how many times the RPC endpoint was called, you can simply count the number of calls and divide by the interval you’re interested in:
Throughput = count(EvalAck)/IntervalTime
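As a quick sketch of that calculation in Go, with hypothetical counts read from two samples of `nomad.nomad.eval.ack`:

```go
package main

import (
	"fmt"
	"time"
)

// throughput implements Throughput = count(EvalAck) / IntervalTime:
// completed evaluations per second, given two readings of the
// nomad.nomad.eval.ack call count taken one interval apart.
func throughput(startCount, endCount int, interval time.Duration) float64 {
	return float64(endCount-startCount) / interval.Seconds()
}

func main() {
	// Hypothetical readings: 1,200 Eval.Ack calls observed over 60 seconds.
	fmt.Printf("throughput: %.1f evals/sec\n", throughput(0, 1200, 60*time.Second))
}
```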
The new Nomad metrics also make it possible to estimate the mean utilization of scheduler workers via the Utilization Law:
MeanUtilization = ProcessTime * MeanThroughput
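Plugging in hypothetical numbers shows how the two quantities combine: a 5 ms mean process time at 100 evaluations per second implies the schedulers are busy half the time.

```go
package main

import "fmt"

func main() {
	// Utilization Law: MeanUtilization = ProcessTime * MeanThroughput.
	// Hypothetical values; note process_time is reported in ms, so
	// convert to seconds before multiplying by per-second throughput.
	processTime := 0.005    // 5 ms mean broker process_time, in seconds
	meanThroughput := 100.0 // evaluations per second
	fmt.Printf("mean scheduler utilization: %.0f%%\n", processTime*meanThroughput*100)
}
```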
» Final notes and further research
These new metrics introduced in Nomad 1.8 are designed to make it easier for operators to understand their Nomad servers’ performance. Note that these metrics only capture the performance of servers because the team’s primary objective was to create testing scenarios that are as reproducible as possible. There are also many hardware and networking factors that determine overall client performance, but we wanted to separate those factors from server-focused analysis. In the future we hope to revisit Nomad client performance, and of course we’re eager to hear any suggestions on how to improve Nomad metrics via the project's issues page.
If you’re interested in trying Nomad, you can download it from the project’s website. It’s easy to install and we offer many tutorials. To learn more about observability in Nomad, read Monitoring Nomad.