Autopilot: Simplifying the Integrated Storage Experience with HashiCorp Vault

Autopilot for Integrated Storage is taking the operator experience to the next level.

Aarti Iyengar

Apr 12, 2021

We introduced official support for Integrated Storage in HashiCorp Vault 1.4. Integrated Storage lets Vault admins store Vault’s persistent data in an internal storage option built on the Raft consensus protocol, rather than in an external storage backend. With each subsequent Vault release, we have continued to improve the operational experience, and we are pleased to announce a highly requested feature in Vault 1.7: Autopilot.

Integrated Storage eliminates much of the operational overhead of managing a separate storage backend and avoids the additional networking imposed by these separate systems. This, in turn, reduces the complexity of the Vault cluster deployment and also makes the diagnosis and troubleshooting of issues easier, reducing mean-time-to-detect for issues and mean-time-to-restore for customers.

»The Integrated Storage Journey

The Vault team has added several enhancements over the last few releases to improve deployment of Integrated Storage clusters.

- High Availability (HA) coordination: Introduced in Vault 1.5. Lets you use Integrated Storage for HA coordination when your data must live in a storage backend that does not support HA, such as Amazon S3, Cassandra, or MSSQL.
- Cloud Auto-join: Introduced in Vault 1.6. Enables auto-discovery of Integrated Storage peers in a cloud environment, so that new Vault nodes can join a Vault cluster automatically.
- Automated Snapshots: Introduced in Vault 1.6. Lets customers take automated, scheduled data snapshots of Vault Integrated Storage clusters at different points in time.

These features made Integrated Storage easier to use with cloud environments and snapshots. However, operators still had to use manual methods to monitor the health of a cluster, ensure cluster stability when nodes are added (or removed), and clean up failed nodes in a cluster.

»Improving the Operator Experience with Autopilot

Vault 1.7 was made publicly available on March 25, 2021. This release introduced support for the Autopilot features in Vault open source. Autopilot, as the name itself suggests, will help automate and simplify Vault operator and admin workflows for monitoring and operating Vault Integrated Storage clusters.

$ vault operator raft autopilot get-config

Key                                   Value
---                                   -----
Cleanup Dead Servers                  false
Last Contact Threshold                10s
Dead Server Last Contact Threshold    24h0m0s
Server Stabilization Time             10s
Min Quorum                            0
Max Trailing Logs                     1000

Autopilot is enabled by default on Integrated Storage clusters running Vault 1.7. Note, though, that dead server cleanup is not enabled by default; it must be explicitly enabled. The primary pain points that Autopilot helps alleviate for operators are described below.

Autopilot provides improved insight into cluster and node state via a new State API. The API reports node and overall cluster health, the list of nodes in a cluster (by node ID and IP address), the node type (leader, voter, non-voter), and the failure tolerance of the cluster. The health of a node is determined by:
- “Last Contact Threshold,” which specifies the maximum amount of time a server can go without contact with the leader node before being deemed unhealthy.
- “Max Trailing Logs,” which specifies the maximum number of log entries in the Raft log that a server can trail the leader by before being considered unhealthy.
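
The two health criteria above can be modeled in a few lines. This is a minimal sketch, not Vault's implementation: the `ServerStats` type and `is_healthy` function are hypothetical names, the defaults mirror the `get-config` output shown earlier, and treating a value exactly at the threshold as healthy is an assumption.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ServerStats:
    """Hypothetical view of one server, for illustration only."""
    last_contact: timedelta  # time since the server last had contact with the leader
    last_index: int          # index of the server's most recent Raft log entry


def is_healthy(
    stats: ServerStats,
    leader_last_index: int,
    last_contact_threshold: timedelta = timedelta(seconds=10),
    max_trailing_logs: int = 1000,
) -> bool:
    """Apply the two checks described in the text.

    Assumption: a server exactly at a threshold still counts as healthy.
    """
    # Check 1: the server has gone too long without contact with the leader.
    if stats.last_contact > last_contact_threshold:
        return False
    # Check 2: the server trails the leader by too many log entries.
    if leader_last_index - stats.last_index > max_trailing_logs:
        return False
    return True
```

For instance, a server last contacted 11 seconds ago fails the first check under the default 10s threshold, and a server 1,500 entries behind the leader fails the second.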

$ vault operator raft autopilot state

Healthy:                  true
Failure Tolerance:        1
Leader:                   raft1
Voters:
  raft1
  raft2
  raft3
Servers:
  raft1
    Name:             raft1
    Address:          127.0.0.1:8201
    Status:           leader
    Node Status:      alive
    Healthy:          true
    Last Contact:     0s
    Last Term:        3
    Last Index:       38
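
The "Failure Tolerance: 1" value in the output above is consistent with ordinary Raft majority arithmetic for three voters. The sketch below shows that arithmetic under the assumption that failure tolerance is the number of healthy voters beyond the quorum size; the function names are hypothetical, and Vault's exact computation may differ.

```python
def quorum_size(total_voters: int) -> int:
    """Smallest majority of the voter set (standard Raft arithmetic)."""
    return total_voters // 2 + 1


def failure_tolerance(healthy_voters: int, total_voters: int) -> int:
    """Healthy voters that could be lost while a majority remains.

    Assumption: tolerance is never reported as negative.
    """
    return max(0, healthy_voters - quorum_size(total_voters))


if __name__ == "__main__":
    # Three healthy voters, as in the sample state output above.
    print(failure_tolerance(healthy_voters=3, total_voters=3))  # prints 1
```

Under this model, a five-voter cluster tolerates two failures, and a three-voter cluster with one voter already unhealthy tolerates none.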

Autopilot ensures cluster stability when new nodes join a cluster. A node that joins as a prospective voter is initially added to the cluster as a “non-voter,” and its state is monitored for the configured “server stabilization time” period. If the node stays healthy for that period, it is promoted to “voter” status. This prevents an unstable new node from disrupting the entire cluster, without requiring operator intervention.
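
The promotion rule can be sketched as a small state tracker. This is an illustrative model only: `Node` and `observe` are hypothetical names, the 10-second default mirrors the "Server Stabilization Time" shown in the configuration output, and resetting the timer whenever the node is seen unhealthy is an assumption about what "stays healthy for that period" means.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical record of a newly joined node, for illustration only."""
    name: str
    voter: bool = False                    # new nodes start as non-voters
    healthy_since: Optional[float] = None  # time the current healthy streak began


def observe(node: Node, healthy: bool, now: float,
            stabilization_time: float = 10.0) -> bool:
    """Record one health observation and return whether the node is a voter.

    Assumption: any unhealthy observation restarts the stabilization window.
    This sketch never demotes an existing voter.
    """
    if not healthy:
        node.healthy_since = None  # streak broken; start over
        return node.voter
    if node.healthy_since is None:
        node.healthy_since = now   # a new healthy streak begins
    if not node.voter and now - node.healthy_since >= stabilization_time:
        node.voter = True          # stayed healthy for the full window
    return node.voter
```

In this model, a node that flaps between healthy and unhealthy never accumulates a full stabilization window, so it remains a non-voter and cannot affect quorum.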

Autopilot relieves Vault operators of the burden of monitoring and cleaning up failed servers. Dead server cleanup, which must be explicitly enabled via the API, periodically scans the cluster and automatically removes failed servers. The “Dead Server Last Contact Threshold” setting controls how long to wait before declaring a lost node “failed” and removing it from the cluster configuration. When dead server cleanup is enabled, a min-quorum value must also be provided; it sets the minimum number of servers to retain in the cluster regardless of cleanup. This is essential so that cleanup does not disrupt quorum and undermine cluster stability.
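
The interaction between the dead-server threshold and min-quorum can be modeled as follows. This is a sketch under stated assumptions, not Vault's algorithm: `servers_to_remove` is a hypothetical function, the 24-hour default mirrors the configuration output shown earlier, and removing the longest-silent servers first is an illustrative choice.

```python
from datetime import timedelta
from typing import Dict, List


def servers_to_remove(
    last_contact: Dict[str, timedelta],
    min_quorum: int,
    dead_threshold: timedelta = timedelta(hours=24),
    cleanup_dead_servers: bool = True,
) -> List[str]:
    """Pick failed servers to remove without dropping below min_quorum servers.

    last_contact maps each server name to the time since it was last heard from.
    """
    if not cleanup_dead_servers:
        return []  # cleanup is opt-in, as the text notes
    # A server is considered failed once it has been silent past the threshold.
    dead = sorted(
        (name for name, silent in last_contact.items() if silent > dead_threshold),
        key=lambda name: last_contact[name],
        reverse=True,  # assumption: longest-silent servers are removed first
    )
    # Never remove so many servers that fewer than min_quorum remain.
    removable = max(0, len(last_contact) - min_quorum)
    return dead[:removable]
```

With five servers, two of them silent for more than 24 hours, a min-quorum of 3 lets both be removed, while a min-quorum of 4 lets only one go.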

»Summary and Next Steps

With the various features now supported for managing Integrated Storage, operators have access to simple, automated workflows to manage and operate their Vault Integrated Storage clusters. HCP Vault, which is HashiCorp’s managed cloud service for Vault, also uses Integrated Storage for these reasons. Users of Integrated Storage will be able to benefit from the wide variety of workflows that have been tested for the customer-managed and HashiCorp-managed Vault products.

To get started exploring and using Integrated Storage, please refer to the HashiCorp Learn guides and the reference architecture documents as well as the documentation. For more information on Vault, please visit the Vault project website. As always, we are very interested in hearing about your experiences with the product, so please share your feedback so that we can continue to improve our products.