Nomad’s internal garbage collection and an optimization discovered during the Nomad Bench project
A look into Nomad’s internal garbage collection process and the optimization discovered during the Nomad Bench project.
During the work on Nomad’s 1.8 LTS release, the team spent some time building benchmarking infrastructure for Nomad, running performance tests, and looking for places where we could improve Nomad’s efficiency. This short article describes how we found a problem within Nomad’s garbage collection mechanism and the optimization we made as a result.
» What’s Nomad’s garbage collection and how does it work?
Much like Go, the programming language in which it’s written, the Nomad workload orchestrator supports garbage collection. It can be triggered manually with the CLI or an API call, but some users may not know how Nomad servers handle garbage collection implicitly. This article will go into detail about that implicit side of Nomad garbage collection.
Nomad garbage collection is not the same as garbage collection in a programming language, but the motivation behind its design is similar: it’s there to free up memory allocated for objects that are no longer being referenced or needed by the scheduler. Nomad garbage collection applies to evaluations, nodes, jobs, deployments, and plugins (consult the Nomad scheduling overview article to understand these concepts better), and can be configured by users.
When one of the Nomad servers becomes a leader, it starts periodic garbage collection “tickers” that clean objects marked for garbage collection from memory. Some of these objects, like evaluations, are marked for GC automatically, and some, like jobs, can get marked for GC by RPC calls initiated by users.
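How often those tickers fire, and how old an object must be before it becomes eligible for collection, is controlled from the server block of the agent configuration. Here is a minimal sketch of the relevant options with illustrative values; check the Nomad agent configuration docs for the exact names and defaults:

server {
  enabled = true

  # How often the leader's job GC ticker runs.
  job_gc_interval = "5m"

  # How long a job must be terminal before it is eligible for collection.
  job_gc_threshold = "4h"

  # Equivalent thresholds for other object types.
  eval_gc_threshold       = "1h"
  deployment_gc_threshold = "1h"
  node_gc_threshold       = "24h"
}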
An interesting example is job deregistration. Whenever users issue a job stop command or API call, Nomad stops the job but doesn’t remove information about it from memory. You can still see information about a stopped job, such as its deployments or allocations, for example:
$ nomad job status example
ID            = example
Name          = example
Submit Date   = 2024-05-17T13:49:28+02:00
Type          = service
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = dead (stopped)
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
echo        0       0         0        0       1         0     0

Latest Deployment
ID          = 82460a04
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
echo        1        1       1        0          2024-05-17T13:59:11+02:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
daa3693d  459490f1  echo        0        stop     complete  30s ago  3s ago
This default behavior can be overridden with the -purge flag, which forces garbage collection of the given job and the objects that depend on it; by default, jobs won’t be garbage collected until the job_gc_interval time passes.
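For example, stopping the job above with the flag removes it and its dependent objects right away:

$ nomad job stop -purge example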
Nomad’s basic unit of work is an evaluation. Every evaluation is essentially a “work order” for the Nomad scheduler: something that needs to be done. A new job creates an evaluation (or multiple evaluations), and so does updating or stopping a job. The “tickers” mentioned earlier also create evaluations, which end up in the “core” internal scheduler. Since Nomad is a distributed system that coordinates its actions using Raft, most new evaluations are created by RPC calls that can then be replicated using Raft transactions.
» Discovery
While running an experiment that measured how many job dispatch requests Nomad could handle per second, we noticed periodic spikes in the number of “nomad.nomad.eval.ack” data points being emitted, among other evaluation-related metrics. Each emission of this metric indicates that an evaluation has been successfully processed and has reached a completed state.
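To collect metrics like these, the servers’ telemetry stanza needs to expose them to your metrics pipeline. A minimal sketch assuming Prometheus as the sink (values are illustrative, not necessarily what the bench cluster used):

telemetry {
  collection_interval        = "1s"
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}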
Why, then, was Nomad periodically creating evaluations in large numbers, well beyond what the job dispatch requests themselves would account for?
» Job garbage collection
As previously discussed, Nomad runs an internal garbage collection process to remove old and obsolete state objects. This process can be triggered manually via the “/v1/system/gc” API endpoint, and we were using it within the experiment to control memory growth. Without it, the Nomad servers running within the experiment would fail with out-of-memory (OOM) errors, because the load test was creating approximately 2,400 job objects each minute, and each job moved to a completed state after roughly three seconds.
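Triggering a collection manually looks like this, either from the CLI or directly against the HTTP API (the server address is illustrative):

$ nomad system gc
$ curl -X PUT http://localhost:4646/v1/system/gc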
» Job batch deregister RPC
When Nomad determines a job can be garbage collected, it performs an RPC call to the “Job.BatchDeregister” endpoint. As the name implies, Nomad can provide an array of jobs to deregister, which are then deleted in a single Raft transaction for write efficiency.
Inside the RPC handler, we can see a section of code that loops through the job array and creates an evaluation per job. Evaluations do need to be created when a user submits a job deregistration, since Nomad has to work out which allocations should be stopped. However, the batch deregister endpoint is not exposed via the HTTP API: it is only called by the internal garbage collector, and by that point the jobs being collected are already terminal, so those evaluations have no scheduling work left to trigger.
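To make the shape of that handler concrete, here is a minimal, self-contained sketch of the pre-change behavior. The types and field names are simplified stand-ins, not Nomad’s real structs, and the actual handler is more involved:

package main

import "fmt"

// Illustrative stand-ins for Nomad's internal types; the real structs carry
// many more fields (status, priority, eval IDs, and so on).
type JobKey struct {
	Namespace string
	ID        string
}

type Evaluation struct {
	JobID       string
	Namespace   string
	TriggeredBy string
}

// batchDeregisterEvals mimics the pre-change logic: one deregistration
// evaluation per job in the batch, all committed alongside the job deletions
// in a single Raft transaction. The optimization removed this loop for
// garbage collection, since GC only touches jobs that are already terminal.
func batchDeregisterEvals(jobs []JobKey) []*Evaluation {
	evals := make([]*Evaluation, 0, len(jobs))
	for _, j := range jobs {
		evals = append(evals, &Evaluation{
			JobID:       j.ID,
			Namespace:   j.Namespace,
			TriggeredBy: "job-deregister",
		})
	}
	return evals
}

func main() {
	jobs := []JobKey{{"default", "example-1"}, {"default", "example-2"}}
	fmt.Printf("a batch of %d jobs would have produced %d extra evaluations\n",
		len(jobs), len(batchDeregisterEvals(jobs)))
}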
» Optimization
We double- and triple-checked that garbage collection was the only caller of the batch deregister RPC, added some additional tests, and removed the loop that creates evaluations from the handler in #20510. This change means Nomad no longer creates an evaluation when garbage collecting a job, reducing load on both Raft and the eval broker.
Re-running the experiment in which the evaluation spikes were first noticed, now with the modified code, we no longer see spikes in “nomad.nomad.eval.ack”.
Comparing the two experiments, we also found that Nomad subsystems such as Raft and the eval broker were under comparatively less load with the new changes. This means we can now achieve higher throughput and greater stability, along with minor improvements in CPU and memory consumption.
» Final notes
We hope this article shed some light on Nomad internals, and in particular on how its garbage collection works. If you’re interested in more Nomad deep dives, have a look at a recent article on Nomad’s eval broker, and of course feel free to try Nomad for yourself. You can download it from the project’s website, and we have lots of documentation to get you started.