Autopilot: Simplifying the Integrated Storage Experience with HashiCorp Vault
Autopilot for Integrated Storage is taking the operator experience to the next level.
We introduced official support for Integrated Storage in HashiCorp Vault 1.4. Integrated Storage lets Vault admins store Vault’s persistent data internally, using the Raft consensus protocol, rather than in an external storage backend. With each subsequent Vault release, we have continued to improve the operational experience, and we are pleased to announce a highly requested feature in Vault 1.7: Autopilot.
Integrated Storage eliminates much of the operational overhead of managing a separate storage backend and avoids the additional networking imposed by these separate systems. This, in turn, reduces the complexity of the Vault cluster deployment and makes issues easier to diagnose and troubleshoot, reducing both mean-time-to-detect and mean-time-to-restore for customers.
» The Integrated Storage Journey
The Vault team has added several enhancements over the last few releases to improve deployment of Integrated Storage clusters.
- High Availability (HA) coordination: Introduced in Vault 1.5. Provides the ability to use Integrated Storage for HA when you need to use a storage backend that does not support HA, such as Amazon S3, Cassandra, and MSSQL.
- Cloud Auto-join: Introduced in Vault 1.6. Enables auto-discovery of Integrated Storage peers when working in a cloud environment. Auto-join allows new Vault nodes to automatically join a Vault cluster.
- Automated Snapshots: Introduced in Vault 1.6. Lets customers take automated, scheduled snapshots of Vault Integrated Storage clusters, capturing their data at different points in time.
These features made Integrated Storage easier to use with cloud environments and snapshots. However, operators still had to use manual methods to monitor the health of a cluster, ensure cluster stability when nodes are added (or removed), and clean up failed nodes in a cluster.
» Improving the Operator Experience with Autopilot
Vault 1.7 was made publicly available on March 25, 2021. This release introduced support for the Autopilot features in Vault open source. Autopilot, as the name suggests, helps automate and simplify Vault operator and admin workflows for monitoring and operating Vault Integrated Storage clusters.
$ vault operator raft autopilot get-config
Key                                   Value
---                                   -----
Cleanup Dead Servers                  false
Last Contact Threshold                10s
Dead Server Last Contact Threshold    24h0m0s
Server Stabilization Time             10s
Min Quorum                            0
Max Trailing Logs                     1000
Autopilot is enabled by default on Integrated Storage clusters running Vault 1.7. Note, though, that dead server cleanup is not enabled by default; it must be explicitly enabled. The primary pain points that Autopilot helps alleviate for operators are described below.
- Autopilot provides improved insight into cluster and node state via a new State API. It reports node and overall cluster health, the list of nodes in the cluster (by node ID and IP address), each node’s type (leader, voter, or non-voter), and the failure tolerance of the cluster. The health of a node is determined by:
- “Last Contact Threshold,” which specifies the maximum amount of time a server can go without contact with the leader node before being deemed unhealthy.
- “Max Trailing Logs,” which specifies the maximum number of log entries in the Raft log that a server can trail the leader by before being considered unhealthy.
$ vault operator raft autopilot state
Healthy:              true
Failure Tolerance:    1
Leader:               raft1
Voters:
   raft1
   raft2
   raft3
Servers:
   raft1
      Name:            raft1
      Address:         127.0.0.1:8201
      Status:          leader
      Node Status:     alive
      Healthy:         true
      Last Contact:    0s
      Last Term:       3
      Last Index:      38
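The two health criteria and the failure tolerance figure can be illustrated with a short Python sketch. This is not Vault’s implementation: the `ServerStats` type, both function names, and the choice to treat a value exactly at a limit as healthy are assumptions made for illustration. The default limits are the ones shown in the `get-config` output earlier (10s and 1000).

```python
from dataclasses import dataclass


@dataclass
class ServerStats:
    """Hypothetical per-server view; field names are illustrative only."""
    name: str
    last_contact_s: float  # seconds since the server last heard from the leader
    last_index: int        # index of the last Raft log entry on this server


def is_healthy(server, leader_last_index,
               last_contact_threshold_s=10.0, max_trailing_logs=1000):
    """Apply the two health criteria described above.

    Assumption: a value exactly equal to a limit still counts as healthy.
    """
    in_contact = server.last_contact_s <= last_contact_threshold_s
    trailing = leader_last_index - server.last_index
    return in_contact and trailing <= max_trailing_logs


def failure_tolerance(total_voters, healthy_voters):
    """Voters that can still be lost before a majority quorum is gone."""
    quorum = total_voters // 2 + 1
    return max(healthy_voters - quorum, 0)


if __name__ == "__main__":
    raft1 = ServerStats("raft1", last_contact_s=0.0, last_index=38)
    print(is_healthy(raft1, leader_last_index=38))
    print(failure_tolerance(total_voters=3, healthy_voters=3))
```

With three healthy voters, the majority quorum is two, so this sketch yields a failure tolerance of 1, which is consistent with the `Failure Tolerance: 1` line in the state output.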
- Autopilot ensures cluster stability when new nodes join a cluster. A newly joined voter node is initially added to a cluster as a “non-voter,” and its state is monitored for the configured “server stabilization time” period. If the node stays healthy for that period, the node is promoted to “voter” status. This ensures that an unstable new node does not disrupt the entire cluster, and is handled without requiring operator intervention.
- Autopilot relieves Vault operators of the burden of monitoring and cleaning up failed servers. Dead server cleanup, which must be explicitly enabled via the API, periodically scans the cluster and automatically removes failed servers. The “Dead Server Last Contact Threshold” setting controls how long Autopilot waits before declaring a lost node “failed” and removing it from the configuration. When dead server cleanup is enabled, a min-quorum value must also be provided; it sets the minimum number of servers to retain in the cluster regardless of cleanup. This is essential so that cleanup does not disrupt quorum and destabilize the cluster.
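As a rough illustration of enabling dead server cleanup through the API, the Python sketch below builds a request for Vault’s `sys/storage/raft/autopilot/configuration` endpoint. It only constructs the request and does not send it. The helper name, the address, the token, and the min-quorum value of 3 are placeholders chosen for this example; check the API documentation for how fields omitted from the payload are treated before applying it to a real cluster.

```python
import json
import urllib.request


def build_autopilot_config_request(vault_addr, token, min_quorum,
                                   dead_server_threshold="24h"):
    """Build (but do not send) a request that enables dead server cleanup.

    Hypothetical helper for illustration; only the endpoint path and the
    payload field names come from Vault's HTTP API.
    """
    payload = {
        "cleanup_dead_servers": True,
        "dead_server_last_contact_threshold": dead_server_threshold,
        "min_quorum": min_quorum,
    }
    base = vault_addr.rstrip("/")
    return urllib.request.Request(
        url=f"{base}/v1/sys/storage/raft/autopilot/configuration",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder address and token; substitute real values for your cluster.
    req = build_autopilot_config_request(
        "http://127.0.0.1:8200", "example-token", min_quorum=3
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To apply the change: urllib.request.urlopen(req)
```

Keeping the request construction separate from the network call makes it easy to inspect the payload before sending it.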
» Summary and Next Steps
With the various features now supported for managing Integrated Storage, operators have access to simple, automated workflows to manage and operate their Vault Integrated Storage clusters. HCP Vault, which is HashiCorp’s managed cloud service for Vault, also uses Integrated Storage for these reasons. Users of Integrated Storage will be able to benefit from the wide variety of workflows that have been tested for the customer-managed and HashiCorp-managed Vault products.
To get started exploring and using Integrated Storage, please refer to the HashiCorp Learn guides and the reference architecture documents as well as the documentation. For more information on Vault, please visit the Vault project website. As always, we are very interested in hearing about your experiences with the product, so please share your feedback so that we can continue to improve our products.