Advanced Node Draining in HashiCorp Nomad
HashiCorp Nomad 0.8 introduces advanced node draining to simplify cluster-wide upgrades of Nomad client nodes. This post explores how HashiCorp Nomad's improved draining features can be used to drain an existing workload from one set of nodes to a new set of nodes without downtime.
Traditionally, upgrading a production cluster managed by a scheduler can be challenging for operators, since the cluster may be running live workloads that serve customers and cannot be disrupted. A further difficulty is that cluster operators may not be the service owners, and therefore may not know the requirements of every production service.
Core to the goals of HashiCorp Nomad is making cluster management painless for operators and minimizing service downtime. With Nomad 0.8's advanced node draining we work toward both of these goals by giving operators and developers control over how migrations occur in a cluster-wide, coordinated fashion.
» New Node Drainer
Nomad 0.8 introduces a new node drainer to safely drain jobs running on Nomad client nodes. The node drainer inspects all draining nodes and detects which jobs are affected by the drain operation. It then inspects the migrate stanzas of the affected jobs and reduces service downtime by honoring the max_parallel value defined in each stanza. The max_parallel parameter limits the number of allocations that can be migrating at any given time, and Nomad ensures parallel migrations never exceed this value. Since applications are not immediately ready to serve traffic, Nomad also waits for the replacement allocations to become healthy before migrating further allocations of the job, which helps reduce service downtime. The new node drainer gives Nomad a cluster-wide view while draining nodes and lets service owners define all the parameters required to migrate jobs between nodes using the migrate stanza.
» Migrate Stanza
With Nomad 0.8, the draining behavior has been improved by introducing a new migrate stanza at the task group level, which allows developers to define the draining behavior for their jobs. Below is an example of the migrate stanza for a my-api job that tells Nomad to migrate the job one allocation at a time and requires each new allocation to be healthy for at least 10 seconds before moving on to the next one.
job "my-api" {
datacenters = ["dc1"]
type = "service"
group "my-api" {
count = 2
migrate {
# Perform one parallel migration at a time.
max_parallel = 1
# Ensure that the newly placed allocations are healthy for at least 10
# seconds before moving on with the migration process.
min_healthy_time = "10s"
# Give the allocation at most 3 minutes to be marked healthy before
# other migrations can continue.
healthy_deadline = "3m"
}
restart {
.....
When draining a node, Nomad will use the group's migrate stanza to create new allocations on other nodes in the cluster. In the above example, if a node drain command is issued on a node running the my-api job, Nomad would migrate the job one allocation at a time by creating one new allocation with the same version of the my-api service on another node in the cluster. The newly created allocation must pass its health checks within 3 minutes before Nomad can continue migrating the remaining allocations of the job. This lets the developers responsible for the my-api service use the parameters in the migrate stanza to define how the job is migrated in the event of a node drain, while operators do not need to know these granular settings and can instead focus on upgrading the nodes in the cluster.
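For instance, an operator could begin draining a node running my-api with the node drain command shown below; the node ID is illustrative, not taken from the post.
# Begin draining this node. Allocations of my-api are migrated one at a
# time, as dictated by the job's migrate stanza.
nomad node drain -enable fb2170a8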
» Node Drain and Eligibility Command
The new node drain command introduces the concept of node eligibility. Each Nomad client node can either be eligible or ineligible for scheduling. When draining a node, Nomad will mark the node as ineligible for new placements.
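Eligibility can also be toggled directly with the node eligibility command, which takes a node out of (or returns it to) the scheduling pool without migrating its existing allocations. A brief sketch, with an illustrative node ID:
# Mark the local node as ineligible for new placements.
nomad node eligibility -disable -self

# Re-enable scheduling on a specific node after maintenance.
nomad node eligibility -enable fb2170a8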
Nomad 0.8 also allows operators to set a deadline when draining a node. When set, Nomad will wait until the deadline for all allocations to be moved off the node; otherwise they are forcibly removed from the node. Setting a deadline gives operators a definite time at which they can tear down resources, while still allowing sufficient time for the jobs to migrate to other nodes. Further, Nomad allows batch jobs to continue running on a draining node until the deadline. This allows nearly completed batch jobs to finish, helping reduce the cost of re-running batch work. Below is an example of using the -deadline command line flag with the node drain command.
nomad node drain -enable -self -deadline 30m
In the above example, Nomad will wait up to 30 minutes for allocations to migrate before forcing the remaining jobs off the node.
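The deadline behavior can also be adjusted with related node drain flags. The following is a sketch assuming the flags available in Nomad 0.8's CLI:
# Drain with no deadline: wait as long as it takes for allocations to migrate.
nomad node drain -enable -self -no-deadline

# Force an immediate drain, stopping all allocations on the node right away.
nomad node drain -enable -self -force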
System jobs in Nomad allow services such as log shippers and metrics collectors to run on all Nomad client nodes. When a client node is drained, its system jobs are the last to be stopped, which allows all metrics and logs to be shipped before those jobs go away.
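Operators who need system jobs such as log shippers to keep running for the entire drain can exclude them from the drain. A sketch assuming the -ignore-system flag of Nomad 0.8's node drain command:
# Drain the local node but leave system job allocations running.
nomad node drain -enable -self -ignore-system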
» Improvements Over Previous Drain Command
Prior to Nomad 0.8, when a node drain command was issued, Nomad would mark the node with Drain=true, which prevented any new jobs from being scheduled on that node. Nomad would then create new evaluations for all jobs that were running on the node and reschedule them onto other nodes in the cluster. In some cases this behavior resulted in service downtime. A few other problems with the previous node draining behavior were as follows:
- When doing rolling drains and restarts of clients, jobs could repeatedly be moved between nodes and could be placed on a node that was about to be drained.
- If all allocations of a particular service were running on one node, draining that node would stop all of them simultaneously, causing an outage for that service.
- Draining one node at a time and waiting for new job placements was tedious and error-prone.
- Draining multiple nodes at once could cause an outage due to the lack of coordination between the draining nodes.
- Draining nodes that were running batch jobs could stop those jobs just before they finished their work, requiring them to be restarted and redo the work.
- Draining a node would immediately stop the system jobs running on it.
In summary, prior to Nomad 0.8, orchestrating a zero-downtime drain required a large amount of manual supervision by operators, since they did not have control over how individual jobs would be drained.
» Difference from Update Stanza
It is important to understand the differences between the update and migrate stanzas, since they have similar parameters but are meant for different use cases in Nomad.
The update stanza handles the transition between job versions. It is meant to help orchestrate things like rolling deployments and canary deployments. It is typically of more interest to developers, since they want to control how a job is upgraded from one version to the next.
The migrate stanza defines how the scheduler should behave for existing jobs when the cluster changes. Since these jobs are already running in the cluster, operators can use the migrate stanza to define how they should be rescheduled when draining nodes in a way that does not affect their quality of service. When a node drain is issued, the same version of a job is migrated between nodes, hence parameters such as auto_revert, canary, and stagger are not offered by the migrate stanza.
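To illustrate the distinction, a task group can carry both stanzas: the update stanza governs deployments of new job versions, while the migrate stanza governs rescheduling of the current version during node drains. A sketch with illustrative values, not taken from the post:
group "my-api" {
  count = 2

  # Governs transitions between job versions (rolling and canary deployments).
  update {
    max_parallel = 1
    canary       = 1
    auto_revert  = true
  }

  # Governs how allocations of the current version move during node drains.
  migrate {
    max_parallel     = 1
    min_healthy_time = "10s"
    healthy_deadline = "3m"
  }
}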
» Conclusion
With Nomad 0.8, we released advanced node draining that helps both operators and developers by giving them control of how migrations occur in a cluster-wide, coordinated fashion. This post showed how advanced node draining pushes the smarts needed to carry out service migrations safely into Nomad, allowing developers to focus on building their services and operators to focus on keeping the infrastructure that runs those services stable.