Nomad’s internal garbage collection and an optimization discovered during the Nomad Bench project
A look into Nomad’s internal garbage collection process and the optimization discovered during the Nomad Bench project.
During the work on Nomad’s 1.8 LTS release, the team spent some time building benchmarking infrastructure for Nomad, running performance tests, and looking for places where we could improve Nomad’s efficiency. This short article describes how we found a problem within Nomad’s garbage collection mechanism and the optimization we made as a result.
» What’s Nomad’s garbage collection and how does it work?
Much like Go, the programming language in which it’s written, the Nomad workload orchestrator supports garbage collection. It can be triggered manually with the CLI or an API call, but some users may not know how Nomad servers handle garbage collection implicitly. This article will go into detail about that implicit side of Nomad garbage collection.
Nomad garbage collection is not the same as garbage collection in a programming language, but the motivation behind its design is similar: it’s there to free up memory allocated for objects that are no longer being referenced or needed by the scheduler. Nomad garbage collection applies to evaluations, nodes, jobs, deployments, and plugins (consult the Nomad scheduling overview article to understand these concepts better), and can be configured by users.
When one of the Nomad servers becomes a leader, it starts periodic garbage collection “tickers” that clean objects marked for garbage collection from memory. Some of these objects, like evaluations, are marked for GC automatically, and some, like jobs, can get marked for GC by RPC calls initiated by users.
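How often those tickers fire, and how old an object must be before it becomes eligible for collection, is controlled from the server block of the agent configuration. Here is a minimal sketch of the relevant options with illustrative values; check the Nomad agent configuration docs for the exact names and defaults:

server {
  enabled = true

  # How often the leader's job GC ticker runs.
  job_gc_interval = "5m"

  # How long a job must be terminal before it is eligible for collection.
  job_gc_threshold = "4h"

  # Equivalent thresholds for other object types.
  eval_gc_threshold       = "1h"
  deployment_gc_threshold = "1h"
  node_gc_threshold       = "24h"
}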
An interesting example is job deregistration. Whenever users issue a job stop command or API call, Nomad stops the job but doesn’t remove information about it from memory. You can still see information about a stopped job, such as its deployments or allocations, for example:
$ nomad job status example
ID            = example
Name          = example
Submit Date   = 2024-05-17T13:49:28+02:00
Type          = service
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = dead (stopped)
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
echo        0       0         0        0       1         0     0

Latest Deployment
ID          = 82460a04
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
echo        1        1       1        0          2024-05-17T13:59:11+02:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
daa3693d  459490f1  echo        0        stop     complete  30s ago  3s ago
This default behavior can be overridden with the -purge flag, which forces garbage collection of the given job and the objects that depend on it; by default, jobs won’t be garbage collected until the job_gc_interval time passes.
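For example, stopping the job above with the flag removes it and its dependent objects right away:

$ nomad job stop -purge example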
Nomad’s basic unit of work is an evaluation. Every evaluation is essentially a “work order” for the Nomad scheduler: something that needs to be done. A new job creates an evaluation (or multiple evaluations), and so does updating or stopping a job. The “tickers” mentioned earlier also create evaluations, which end up in the “core” internal scheduler. Since Nomad is a distributed system that coordinates its actions using Raft, most new evaluations are created by RPC calls that can then be replicated using Raft transactions.
» Discovery
While running an experiment that measured how many job dispatch requests Nomad could handle per second, we noticed periodic spikes in the number of “nomad.nomad.eval.ack” data points being emitted, among other evaluation-related metrics. Each emission of this metric indicates that an evaluation has been successfully processed and has reached a completed state.
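To collect metrics like these, the servers’ telemetry stanza needs to expose them to your metrics pipeline. A minimal sketch assuming Prometheus as the sink (values are illustrative, not necessarily what the bench cluster used):

telemetry {
  collection_interval        = "1s"
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}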
Why, then, was Nomad periodically creating evaluations in large numbers, well beyond what the job dispatch requests themselves would account for?
» Job garbage collection
As previously discussed, Nomad runs an internal garbage collection process to remove old and obsolete state objects. This process can be triggered manually via the “/v1/system/gc” API endpoint, and we were using it within the experiment to control memory growth. Without it, the Nomad servers running within the experiment would fail with out-of-memory (OOM) errors, because the load test was creating approximately 2,400 job objects each minute, and each job moved to a completed state after roughly three seconds.
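Triggering a collection manually looks like this, either from the CLI or directly against the HTTP API (the server address is illustrative):

$ nomad system gc
$ curl -X PUT http://localhost:4646/v1/system/gc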
» Job batch deregister RPC
When Nomad determines a job can be garbage collected, it performs an RPC call to the “Job.BatchDeregister” endpoint. As the name implies, Nomad can provide an array of jobs to deregister, which are then deleted in a single Raft transaction for write efficiency.
Inside the RPC handler, we can see a section of code that loops through the job array and creates an evaluation per job. Evaluations do need to be created when a user submits a job deregistration, since Nomad has to work out which allocations should be stopped. However, the batch deregister endpoint is not exposed via the HTTP API: it is only called by the internal garbage collector, and by that point the jobs being collected are already terminal, so those evaluations have no scheduling work left to trigger.
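To make the shape of that handler concrete, here is a minimal, self-contained sketch of the pre-change behavior. The types and field names are simplified stand-ins, not Nomad’s real structs, and the actual handler is more involved:

package main

import "fmt"

// Illustrative stand-ins for Nomad's internal types; the real structs carry
// many more fields (status, priority, eval IDs, and so on).
type JobKey struct {
	Namespace string
	ID        string
}

type Evaluation struct {
	JobID       string
	Namespace   string
	TriggeredBy string
}

// batchDeregisterEvals mimics the pre-change logic: one deregistration
// evaluation per job in the batch, all committed alongside the job deletions
// in a single Raft transaction. The optimization removed this loop for
// garbage collection, since GC only touches jobs that are already terminal.
func batchDeregisterEvals(jobs []JobKey) []*Evaluation {
	evals := make([]*Evaluation, 0, len(jobs))
	for _, j := range jobs {
		evals = append(evals, &Evaluation{
			JobID:       j.ID,
			Namespace:   j.Namespace,
			TriggeredBy: "job-deregister",
		})
	}
	return evals
}

func main() {
	jobs := []JobKey{{"default", "example-1"}, {"default", "example-2"}}
	fmt.Printf("a batch of %d jobs would have produced %d extra evaluations\n",
		len(jobs), len(batchDeregisterEvals(jobs)))
}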
» Optimization
We double- and triple-checked that garbage collection was the only caller of the batch deregister RPC, added some additional tests, and removed the loop that creates evaluations from the handler in #20510. This change means Nomad no longer creates an evaluation when garbage collecting a job, reducing load on both Raft and the eval broker.
Re-running the experiment in which the evaluation spikes were first noticed, now with the modified code, we no longer see spikes in “nomad.nomad.eval.ack”.
Comparing the two experiments, we also found that Nomad subsystems such as Raft and the eval broker were under comparatively less load with the new changes. This means we can now achieve higher throughput and greater stability, along with minor improvements in CPU and memory consumption.
» Final notes
We hope this article shed some light on Nomad internals, and in particular on how its garbage collection works. If you’re interested in more Nomad deep dives, have a look at a recent article on Nomad’s eval broker, and of course feel free to try Nomad for yourself. You can download it from the project’s website, and we have lots of documentation to get you started.