
New approaches to measuring Nomad performance

See how the HashiCorp Nomad team re-examined how to measure the performance of a workload orchestrator, resulting in new metrics that better capture Nomad's performance.

During the work on Nomad’s 1.8 release, the engineering team spent some time analyzing HashiCorp Nomad’s performance, trying to identify bottlenecks and come up with guidelines for our users. In order to do that, we first asked ourselves: In what ways can we measure the performance of a workload orchestrator?

Traditionally in engineering, we consider three elements of systems performance:

  1. Throughput: The number of operations performed in a given time frame, commonly used in telecommunications when measuring data rate (bytes per second)
  2. Response time: The time it takes for a service to respond to a user's input
  3. Latency: The time a unit of work needs to wait before it gets processed by a service

Even though Nomad comes with a comprehensive set of metrics, most of them are low-level, and using them to understand server performance can be non-trivial for cluster operators.

At a high level, Nomad servers operate like a queue. Every time a user submits a job, Nomad creates an evaluation, a unit of work, which lands in a queue for processing called the evaluation broker. All evaluations start there; scheduler workers pick them up, create deployment plans, submit those plans to another queue (the plan queue), and create allocations. An allocation declares which tasks of a given job should run on a given node. The diagram below illustrates this process at a high level:

Nomad queue-like operations
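
To make this flow more concrete, here is a minimal Go sketch of a broker-style pipeline. This is not Nomad's actual implementation; the Evaluation and Plan types and the channel-based queues are assumptions used purely to illustrate the enqueue, dequeue, and plan-apply steps described above.

```go
// Minimal, illustrative sketch of the queue-like flow described above.
// Not Nomad's actual code: the types and channel-based queues are assumptions.
package main

import (
	"fmt"
	"sync"
	"time"
)

type Evaluation struct {
	JobID    string
	Enqueued time.Time
}

type Plan struct {
	JobID string
}

func main() {
	evalBroker := make(chan Evaluation, 16) // evaluations queue up here
	planQueue := make(chan Plan, 16)        // scheduler workers submit plans here

	var workers sync.WaitGroup

	// Scheduler workers dequeue evaluations and produce plans.
	for i := 0; i < 2; i++ {
		workers.Add(1)
		go func(id int) {
			defer workers.Done()
			for eval := range evalBroker {
				wait := time.Since(eval.Enqueued) // roughly what wait_time captures
				fmt.Printf("worker %d dequeued eval for %s after %v\n", id, eval.JobID, wait)
				planQueue <- Plan{JobID: eval.JobID}
			}
		}(i)
	}

	// The plan applier drains the plan queue and would create allocations.
	applierDone := make(chan struct{})
	go func() {
		defer close(applierDone)
		for plan := range planQueue {
			fmt.Printf("plan applied for %s: allocations created\n", plan.JobID)
		}
	}()

	// Submitting a job creates an evaluation that enters the broker queue.
	evalBroker <- Evaluation{JobID: "example-job", Enqueued: time.Now()}

	close(evalBroker)
	workers.Wait()
	close(planQueue)
	<-applierDone
}
```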

The diagram above shows two queues, the eval broker and the plan queue, but as of Nomad 1.8 the eval broker is where throughput, response time, and latency are captured. Time spent in the plan queue is also accounted for, because scheduler workers report status back to the eval broker. To capture these three elements, Nomad 1.8 adds three new metrics to the evaluation broker: wait time, process time, and response time.

»Wait time

The wait_time metric measures the time an evaluation spent in the evaluation broker queue before it was picked up by a scheduler worker.

Metric: nomad.nomad.broker.wait_time
Description: Time elapsed while the evaluation was ready to be processed and waiting to be dequeued.
Unit: ms
Type: Timer
Labels: job, namespace, type, triggered_by, host

»Process time

The process_time metric measures how long the scheduler worker took to process the evaluation, i.e. from the time it left the evaluation broker queue until processing finished.

Metric: nomad.nomad.broker.process_time
Description: Time elapsed while the evaluation was dequeued and finished processing.
Unit: ms
Type: Timer
Labels: job, namespace, type, triggered_by, host

»Response time

The response_time metric measures the time from when the evaluation enters the broker queue until the scheduler worker reports back. The outcome of the evaluation is irrelevant here; only the elapsed time matters.

Metric: nomad.nomad.broker.response_time
Description: Time elapsed from when the evaluation was last enqueued to when it finished processing.
Unit: ms
Type: Timer
Labels: job, namespace, type, triggered_by, host
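
Taken together, the three metrics can be thought of as differences between three timestamps in an evaluation's lifecycle: when it is enqueued, when a scheduler worker dequeues it, and when the worker finishes and acknowledges it. The sketch below is illustrative only (it is not Nomad's code) and assumes exactly those three timestamps to show how the metrics relate.

```go
// Illustrative sketch only: not Nomad's implementation. It assumes three
// timestamps per evaluation (enqueued, dequeued, acked) and derives the three
// broker metrics described above from them.
package main

import (
	"fmt"
	"time"
)

type evalTimestamps struct {
	enqueued time.Time // evaluation entered the broker queue
	dequeued time.Time // a scheduler worker picked it up
	acked    time.Time // the worker finished processing and acknowledged it
}

func main() {
	now := time.Now()
	ts := evalTimestamps{
		enqueued: now,
		dequeued: now.Add(15 * time.Millisecond), // waited 15 ms in the broker
		acked:    now.Add(35 * time.Millisecond), // processed for another 20 ms
	}

	waitTime := ts.dequeued.Sub(ts.enqueued)  // ~ nomad.nomad.broker.wait_time
	processTime := ts.acked.Sub(ts.dequeued)  // ~ nomad.nomad.broker.process_time
	responseTime := ts.acked.Sub(ts.enqueued) // ~ nomad.nomad.broker.response_time

	fmt.Printf("wait=%v process=%v response=%v\n", waitTime, processTime, responseTime)
	// In this simplified model, response time equals wait time plus process time.
}
```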

»What about throughput?

A Nomad server’s throughput can be easily captured with existing metrics. If you define throughput as “the number of successful work units completed in a given time,” then you can use the nomad.nomad.eval.ack metric. It measures how long it takes for the Eval.Ack RPC to complete, and if you’re interested only in how many times the RPC endpoint was called, you can just count the number of calls and divide by the interval time you’re interested in.

Throughput = count(EvalAck)/IntervalTime
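
For example, with purely illustrative numbers, if the Eval.Ack endpoint was called 600 times over a 60-second window, the throughput works out to:

Throughput = 600 / 60s = 10 evaluations per second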

The new metrics also make it possible to estimate the mean utilization of scheduler workers via the Utilization Law.

MeanUtilization = ProcessTime * MeanThroughput
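
As an illustrative example, if the mean process time is 20 ms (0.02 s) per evaluation and the mean throughput is 10 evaluations per second, then:

MeanUtilization = 0.02s * 10/s = 0.2

Under those assumed numbers, the scheduler workers are busy roughly 20% of the time.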

Prometheus showing the new broker response_time metric described above.

»Final notes and further research

These new metrics introduced in Nomad 1.8 are designed to make it easier for operators to understand their Nomad servers’ performance. Note that these metrics only capture the performance of servers because the team’s primary objective was to create testing scenarios that are as reproducible as possible. There are also many hardware and networking factors that determine overall client performance, but we wanted to separate those factors from server-focused analysis. In the future we hope to revisit Nomad client performance, and of course we’re eager to hear any suggestions on how to improve Nomad metrics via the project's issues page.

If you’re interested in trying Nomad, you can download it from the project’s website. It’s easy to install and we offer many tutorials. To learn more about observability in Nomad, read Monitoring Nomad.
