All the 9s: Keeping Vault resilient and reliable at Indeed.com
Learn how SRE and platform engineering came together at Indeed to scale Vault into a resilient, reliable platform that delivers millions of secrets every day to globally distributed workloads.
» Transcript
Howdy. My name is Mark. I'm a staff site reliability engineer at Indeed. Thank you all for taking the time to come to my talk. I'm super excited to share all we've learned over the last eight-plus years of running Vault.
If you're not familiar with Indeed, we're the world's number one job site. Everything we do is to help job seekers find their next big opportunity. More specifically, the team I lead owns Consul and Vault, as well as the platforms and tooling that we've built up around those two tools. Indeed helps people get jobs, but we make sure that their information is secure while they're doing that.
We're going to talk about three main topics.
How we built up a resilient Vault
How we keep an eye on it, including how we handle on-call alerting and general observability
Operational excellence and how we iterate and improve through operational reviews
» How Indeed works under the hood
Indeed is made up of thousands of microservices and scheduled jobs spread globally across six cloud regions. We use Vault performance replication to ensure the services in each of those regions have lightning-fast access to the data they need.
Our main golden path for applications involves using Vault Agent within Kubernetes to deliver secrets to applications before they start up. This is important to this talk because if Vault goes down, we can't start new workloads. So, if we're trying to roll out a bug fix or fix production and Vault's still down, we can't.
There are also many other tools that read from and write to Vault. Users might be leveraging Terraform to spin up cloud infrastructure, writing those secrets to Vault so that they can be used from GitLab CI jobs or their daemons in Kubernetes. Or maybe a workload in AWS Lambda is generating credentials to talk to Consul to do service discovery for some sister service in another platform. The list goes on and on.
» Building a resilient Vault
When Vault breaks, that all comes to a halt, and people get really mad. So how do we make sure that Vault doesn't go down? How do we keep our users happy?
» Quotas and rate limits
Our first line of defense is making sure no one can unintentionally DDoS our clusters. We have some legacy patterns at Indeed — as I'm sure everybody else does — that have forced us to support applications requesting far more secrets than they need at startup. This means large shifts in our Kubernetes clusters, or a daemon with hundreds of instances restarting, can easily cause a thundering herd on Vault.
We've built two lines of defense for this. First, our load balancers have request rate limits set. We do performance testing in lower environments to understand exactly how much traffic a Vault cluster can handle before it tips over, and we make sure that it can never actually get there. Beyond that, we have a lot more fine-grained control in Vault to limit specific namespaces or clients that we might know are troublesome or we just don't like. That way, we have really fine-grained control over who is using Vault and at what pace.
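To make the Vault-side control concrete, here is a minimal sketch of setting a rate limit quota through Vault's sys/quotas API. The quota name, path, and rate are illustrative values, and the address and token are assumed to come from the environment; in practice, the numbers come out of that performance testing.

```python
# A minimal sketch (not our exact tooling): creating a Vault rate limit quota
# over the HTTP API. The quota name, path, and rate below are illustrative.
import os
import requests

VAULT_ADDR = os.environ["VAULT_ADDR"]     # e.g. https://vault.example.com:8200
VAULT_TOKEN = os.environ["VAULT_TOKEN"]   # a token with permission on sys/quotas

def set_rate_limit_quota(name: str, path: str, rate: float) -> None:
    """Create or update a rate limit quota scoped to a namespace or mount path."""
    resp = requests.post(
        f"{VAULT_ADDR}/v1/sys/quotas/rate-limit/{name}",
        headers={"X-Vault-Token": VAULT_TOKEN},
        json={
            "path": path,   # "" for a global quota, "my-namespace/" or "secret/" to scope it
            "rate": rate,   # requests per second allowed before clients start getting HTTP 429s
        },
        timeout=5,
    )
    resp.raise_for_status()

# Example: clamp a namespace we know requests far more secrets than it needs at startup.
set_rate_limit_quota("noisy-namespace", "noisy-namespace/", rate=100.0)
```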
» Vault benchmark
If you're planning on stress testing your clusters, you should check out Vault benchmark. It's a cool tool that HashiCorp released in May that makes codifying your regular traffic and benchmarking clusters really easy. Just like you define your infrastructure as code, you define your traffic patterns and stress tests as code.
The tool already supports a comprehensive set of auth backends and secrets engines, so you can easily map your standard client interactions and reproduce them in lower environments. This is an invaluable tool to understand how your current configuration and, more importantly, how any future configuration changes could impact cluster performance.
Since this talk is about reliability and it's called All the 9s, I felt it was tactful to include this slide. Please do not — do not — stress test your production clusters. That's not the point. Don't do it. The whole point of this exercise is to tip over your Vault clusters, and you can very easily do it, so don't do it to production. Scale a QA cluster up, have fun with it, and then take what you learn back to production.
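vault-benchmark itself is configured with HCL test definitions, so the sketch below is not that tool; it's just a crude stand-in showing the idea of codified load you could point at a QA cluster. The KV path, worker count, and duration are made-up values.

```python
# Not vault-benchmark itself (that tool takes HCL test definitions); just a crude,
# illustrative load generator you might point at a QA cluster to see where it tips over.
# VAULT_ADDR/VAULT_TOKEN and the KV v2 path are assumptions for the sketch.
import os
import time
import concurrent.futures
import requests

VAULT_ADDR = os.environ["VAULT_ADDR"]
VAULT_TOKEN = os.environ["VAULT_TOKEN"]
SECRET_PATH = "secret/data/loadtest/example"   # a KV v2 secret seeded ahead of time

def read_secret(_: int) -> int:
    resp = requests.get(
        f"{VAULT_ADDR}/v1/{SECRET_PATH}",
        headers={"X-Vault-Token": VAULT_TOKEN},
        timeout=5,
    )
    return resp.status_code

def run(duration_s: int = 60, workers: int = 50) -> None:
    deadline = time.time() + duration_s
    statuses: dict[int, int] = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        while time.time() < deadline:
            for code in pool.map(read_secret, range(workers)):
                statuses[code] = statuses.get(code, 0) + 1
    print(statuses)   # watch for 429s (rate limited) and 5xxs (cluster in trouble)

if __name__ == "__main__":
    run()
```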
» Immutable infrastructure
Talking about our infrastructure, it's all immutable. Each server is an ephemeral EC2 instance that's booted from a custom AMI that just has Vault, Datadog, Filebeat, etc. — the basics. These AMIs are fully tested using a suite of Terratest functions that spin up, bootstrap, and verify a cluster before we promote that AMI for use in a higher region.
On first boot, Ansible uses EC2 metadata as well as some secrets from Secrets Manager to template out the host configuration. This means we can build a single generic golden AMI and then ship it as-is to all of our disparate cloud regions, relying on Ansible in the surrounding environment to provide the last-mile configuration. As an example, we've got a snippet of a Jinja template, and then after running Ansible, we can see that that configuration was templated out.
Ansible can then start Vault as well as its dependencies through systemd. At this point — because of the Raft auto-join and go-discover configuration that we just rendered — Vault reaches back out to the EC2 API. It reads the metadata for all the other servers in this cloud region, finds the right peers to join, and forms a cluster.
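Our templating happens inside Ansible roles, but here is a rough sketch of the same last-mile idea in plain Python: pull the region from EC2 instance metadata and render a Raft auto-join stanza with Jinja. The tag key/value and data path are illustrative, and the snippet assumes IMDSv1 is reachable (IMDSv2 needs a session token first).

```python
# A sketch of the last-mile templating idea outside Ansible: render a raft auto-join
# stanza from EC2 instance metadata with Jinja. Tag key/value are illustrative.
import requests
from jinja2 import Template

IMDS = "http://169.254.169.254/latest/meta-data"   # assumes IMDSv1 is enabled

def instance_region() -> str:
    az = requests.get(f"{IMDS}/placement/availability-zone", timeout=2).text
    return az[:-1]   # "us-east-1a" -> "us-east-1"

STORAGE_TMPL = Template("""
storage "raft" {
  path = "/opt/vault/data"
  retry_join {
    auto_join = "provider=aws region={{ region }} tag_key=vault-cluster tag_value={{ cluster }}"
  }
}
""")

print(STORAGE_TMPL.render(region=instance_region(), cluster="vault-prod"))
```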
Then we're done. No one touches these servers. Humans do have access, but if something's not on fire and I catch you SSMed into one of these EC2 instances, it's going to be a bad day. Normal configuration changes go through the same process that normal code would. They're checked in to Git, they're reviewed. This makes the infrastructure reviewable and observable because you can check the version control. So if Vault's broken, you can check who broke it, which is pretty cool.
» Auto unseal
Once that node has found its peers and replicated its data, we use AWS Key Management Service to auto-unseal these nodes. If you aren't familiar, Vault stores an encrypted copy of its root key on disk. It's this key that encrypts child keys that protect your data. Instead of relying on Shamir's Secret Sharing algorithm and human key holders to decrypt this key, we delegate encryption and decryption to AWS KMS.
Vault uses the instance profile of the EC2 instance that it's running on to reach out to KMS and unseal itself. Both AWS IAM policies directly on that role, as well as policies on the key itself in KMS, make sure that only Vault servers can do this — keeping the key secure.
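On the server side this is just a seal "awskms" stanza pointing at the key. Here is a small sketch of the kind of check you can run afterward, hitting the unauthenticated seal-status endpoint to confirm a node actually came up unsealed via KMS; VAULT_ADDR is assumed to be in the environment.

```python
# A small sketch: confirm a node came up unsealed via KMS by checking the
# unauthenticated seal-status endpoint. VAULT_ADDR is an assumption.
import os
import requests

VAULT_ADDR = os.environ["VAULT_ADDR"]

status = requests.get(f"{VAULT_ADDR}/v1/sys/seal-status", timeout=5).json()
assert status["type"] == "awskms", f"unexpected seal type: {status['type']}"
assert status["sealed"] is False, "node is still sealed; check KMS and IAM permissions"
print("auto-unseal looks healthy:", status["type"])
```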
» Health checking and automation
By default, AWS considers a node that it can ping healthy. Without additional checks, we'd start sending traffic as soon as an EC2 instance can ping. That's a big problem because we just talked about how Ansible needs to run, and how the node needs to replicate data and unseal — all before it can handle its first request.
Vault provides an API call you can make that models the basic health status of the service. But Vault's really complex. I don't know if you've ever run it. It's got a lot of bells, it's got a lot of whistles. So it's hard to model its status in a single HTTP status code.
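For the simple cases, though, the sys/health endpoint's status codes go a long way. Here is a sketch of a load-balancer-style check; the codes are Vault's documented defaults, but deciding which of them count as "able to take traffic" (for example, whether performance standbys should) is a local choice.

```python
# A sketch of a load-balancer-style check against /v1/sys/health. The default
# status codes are Vault's documented ones; which of them you treat as "healthy"
# is a local decision. VAULT_ADDR is an assumption.
import os
import requests

VAULT_ADDR = os.environ["VAULT_ADDR"]

# 200 = initialized, unsealed, active; 429 = unsealed standby;
# 473 = performance standby; 501 = not initialized; 503 = sealed
SERVABLE = {200, 429, 473}

def node_can_take_traffic() -> bool:
    resp = requests.get(f"{VAULT_ADDR}/v1/sys/health", timeout=2)
    return resp.status_code in SERVABLE

print("healthy" if node_can_take_traffic() else "unhealthy")
```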
There are also operations we need to do both to verify the state as well as automate as we scale up and down. For these instances, we have EC2 lifecycle events that reach out and trigger Lambdas. These Lambdas have the ability to log into Vault using AWS IAM roles and either observe or even mutate the state of the cluster. This makes sure that we can easily treat servers as cattle, not pets, by automating away manual verification and cluster maintenance. Want to replace a Vault server? Ax it. We don't care. The infrastructure around it will clean up after you.
You terminate an instance indiscriminately, you pick one, you kill it. I do it all the time. It's really fun. That goes into the terminating state, which triggers a Lambda, which logs into Vault, cleanly removes that cluster peer from the Raft cluster, exits zero, and then tells EC2 to continue with the termination. A task that would otherwise mean logging into a node, finding its name, and clicking a bunch of buttons — automatic.
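A minimal sketch of that terminate-lifecycle Lambda might look like the following, under a few assumptions that aren't from the talk: the hook event arrives via EventBridge, the function has already logged into Vault with the AWS auth method and holds a token in VAULT_TOKEN, and each server's Raft node_id matches its EC2 instance ID.

```python
# A minimal sketch of a terminate-lifecycle Lambda: remove the peer from Raft,
# then let the auto-scaling group finish terminating the instance.
import os
import boto3
import requests

VAULT_ADDR = os.environ["VAULT_ADDR"]
VAULT_TOKEN = os.environ["VAULT_TOKEN"]   # assumed: obtained earlier via the AWS auth method

autoscaling = boto3.client("autoscaling")

def handler(event, context):
    detail = event["detail"]              # EventBridge "Instance-terminate Lifecycle Action"
    instance_id = detail["EC2InstanceId"]

    # Cleanly remove the terminating server from the Raft cluster.
    resp = requests.post(
        f"{VAULT_ADDR}/v1/sys/storage/raft/remove-peer",
        headers={"X-Vault-Token": VAULT_TOKEN},
        json={"server_id": instance_id},  # assumes node_id == EC2 instance ID
        timeout=10,
    )
    resp.raise_for_status()

    # Tell the auto-scaling group it's safe to continue with the termination.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionToken=detail["LifecycleActionToken"],
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE",
    )
```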
Take a second and think about all the steps you do when you're spinning up and down a Vault server. If Vault doesn't have an integrated facility for automating that, that's where things like lifecycle events become really handy.
» Redundancy zones
Finally, we deploy Vault in auto-scaling groups across three availability zones and leverage Vault Raft autopilot redundancy zones. This gives autopilot context into exactly where it's running and allows Vault to make some really important resiliency decisions all on its own.
» Resilient clusters
In our base configuration with three zones and six servers, this gives us three voting peers and three non-voting peers that serve as performance standbys. This ensures we have read scalability with the performance standbys and redundancy with a boatload of servers.
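If you want to see what autopilot has actually settled on, the Raft autopilot state endpoint shows the voter/non-voter split per server. This is a quick sketch; field names beyond healthy, voters, and servers vary a bit between versions, so treat it as illustrative.

```python
# A quick sketch for inspecting the voter / non-voter split autopilot has settled on.
# VAULT_ADDR/VAULT_TOKEN come from the environment; fields are illustrative.
import os
import requests

VAULT_ADDR = os.environ["VAULT_ADDR"]
VAULT_TOKEN = os.environ["VAULT_TOKEN"]

state = requests.get(
    f"{VAULT_ADDR}/v1/sys/storage/raft/autopilot/state",
    headers={"X-Vault-Token": VAULT_TOKEN},
    timeout=5,
).json()["data"]

print("cluster healthy:", state["healthy"])
print("voters:", state["voters"])
for name, server in state["servers"].items():
    print(f"{name}: node_type={server['node_type']} status={server['status']}")
```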
All of this together gives us a lot of room for many different failure scenarios. Let's say we lose a single node. This could be a hypervisor failure. This could be something on the host blowing up. Anything. Autopilot immediately recognizes we lost a cluster member and promotes the other server in that redundancy zone to be a voter. Meanwhile, our load balancer notices it's failing its health checks and removes that node from the target pool, draining traffic so that we don't send clients to that server.
The auto-scaling group also sees that failing health check and terminates the node, which triggers the Lambda, which removes it from the cluster. Once that's completed, the autoscaler replaces that node with a brand-new node. That instance bootstraps itself, joins the cluster, unseals itself. You getting why all that was important before? As soon as it becomes healthy, the load balancer starts sending traffic to it.
What happens if we lose an entire redundancy zone, though? Autopilot picks another redundancy zone and promotes a non-voting peer to avoid breaking quorum. Depending on the nature of the loss, there are a lot of things that could make a redundancy zone go down. Normally, it's somebody with a shovel hitting a fiber line. Sometimes, it's rain coming through the roof. Who knows. But depending on AWS's own ability to reach that datacenter, we might also scale up the other two zones to avoid being under-scaled and failing that way.
» Monitoring and observability
The important part is that no human lifted a finger. I didn't press a single button. For all I know, this could be happening in my infrastructure right now. I don't care. Vault and the infrastructure around it self-healed. But I should probably know if that's happening in my environment right now. That's probably important. How do we keep an eye on Vault? How do we make sure that our users are happy?
Because I'm an SRE, when we start talking about monitoring and observability at a high level, I'm going to talk about service level objectives. They literally pay me to talk about SLOs. Fundamentally, if you're not familiar, an SLO is a way of tracking whether our users are happy with the service. So, it tends to make sense to start with a high-level user story.
A simple example is: as a Vault user, I need to request credentials for my database. From that story, we derive a metric, say the percentage of credential requests that succeed within our latency target. Once we have that metric, we measure it over time and keep track of what percentage of the time our users were happy. This is where you start to see things like 99.9% uptime.
It's incredibly important to note, though, that despite the name of this talk, the goal isn't always all the nines. The goal is proper expectation setting and observability. We always need to make sure that our users are happy, but setting an overly tight SLO can be expensive, and it can prevent engineers from doing their real job, which is reliability. But it's fine, whatever. They think it's something else.
» Our SLOs
We have three overarching SLOs defined around our Vault clusters. Vault must be available 99.9% of the time. This means it needs to be unsealed, have an elected leader in quorum, and be successfully serving client requests. This SLO is set because Indeed has a target SLO on our client-facing apps of 99.5%. And because they rely on us, we have to have a tighter SLO.
Vault must respond to read and write requests within 500 milliseconds. The agility to stop and start our workloads is incredibly important at a platform level. We've built a bunch of assumptions, and other teams have SLOs around startup time, so it's important for Vault to be fast. If it's not fast and it gets too slow, we can delay important deployments, we can prolong production outages, etc.
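As a back-of-the-envelope sketch, here is what the availability target translates to in error-budget terms over a 30-day window; the observed numbers below are made up for illustration.

```python
# A back-of-the-envelope sketch of what a 99.9% target means in error-budget terms.
# The observed numbers are made up for illustration.
WINDOW_DAYS = 30
AVAILABILITY_SLO = 0.999            # 99.9% of minutes must be good

budget_minutes = WINDOW_DAYS * 24 * 60 * (1 - AVAILABILITY_SLO)
print(f"30-day availability budget: {budget_minutes:.1f} bad minutes")   # ~43.2

observed_bad_minutes = 12           # e.g. minutes where Vault was sealed or leaderless
remaining = budget_minutes - observed_bad_minutes
print(f"remaining budget: {remaining:.1f} minutes "
      f"({remaining / budget_minutes:.0%} of the window's budget left)")
```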
» What happens if we break our SLOs?
We track these on our dashboards with seven and 30-day rolling windows so that we always have a snapshot of our remaining error budget and our past performance. But we're not perfect, so what happens when we start to burn our error budget? Instead of talking about how we've built our monitoring and observability, I thought it'd be much more fun to do this in the context of an actual incident because our favorite thing is to be under pressure.
So, without further ado, we've got our first page. Cue the sirens. Indeed employs the concept of dev first responders, where the teams who build the systems are the ones on call for those services. Within our organization, we spread DFR responsibilities across multiple teams. So, the person responding to this page very likely isn't a Vault expert.
They might be somebody who works on our service mesh, somebody who owns our Kubernetes clusters. We'll call them Vault adjacent. They know what Vault is, they know what it does, and they probably use it every day. It's a thing: They come in, they get their coffee, they update a secret. But they certainly don't know how to fix it off the top of their head.
Let's dive into the details and processes we've put in place to make this possible and, more importantly, sustainable. All our monitors are split into consistent, informative sections. First, the title tells us where the SLO is set, so 500 milliseconds. Then, we give information about where we're burning error budget and what the current value is. So, you can see which datacenter, and it's taken a lot longer than 500 milliseconds. The impact and action sections are there to help the responder gauge the importance of the page and give them a springboard to jump off of in solving the problem. Then finally, we require links to the service's runbook and any relevant dashboards.
» What's a runbook?
The goal of the runbook is to make you think critically about the monitors you're creating and what could make them alert. From there, we have to think about the important metrics that help responders confirm those issues and remediate them.
Common phrasal patterns start looking like, "If metric A is doing something, do X, scale to Y," etc. Each monitor, when it's created, must, must, must have an entry in the runbook because if it alerts and it doesn't go to me — which it very likely won't — somebody has to know how to fix it.
Let's look at an example from our runbook. You can see it details how this specific alert is measured and then notes common issues that could impact the metric. It gives details about specific relevant parts of Vault and explains how external factors could impact the issue. It even references specific metrics to check to help you make better decisions. I think we've got most of what we need from this: if you start to see a bunch of traffic, look at tuning quotas and rate limits.
Let's actually go to a dashboard. There was one linked. Oh God, it looks like somebody just slapped all of Vault's 700-plus metrics on a single page. That's not great. Throughout my career, I've landed on so many of these dashboards. A team will make their system emit metrics. They will slap them on a dashboard, link that dashboard in the alert, and then check out. The check still clears.
This is not observability, and it's not observable. As a responder, I have no idea what I'm looking at. And even if I do manage to find a pattern, what does it mean? What's causing it, and how do I fix it? Data without context is almost useless.
If you're going to have a metric graphed, it needs to be clear what it is and why it's important. We've set guidelines for our dashboards that require metrics to be grouped into subsystems with easy-to-consume blurbs about that specific system. Each metric that's graphed must also include an adjacent note that explains why the metric is important.
Now, if a responder finds an anomaly, they have the context about not only the individual metric but the subsystem that the metric is a part of. They haven't left our dashboard, but they have everything they need to make an informed decision or know where else to look.
» Important metrics
I don't want to spend a huge amount of time on this slide; it's very text-dense, but it serves as a good marker to come back and look at the recording. Over the years, we've spent an enormous amount of time discovering and distilling the most important metrics for us to understand overall system health. We've broken them down into four sections. This includes system metrics. Vault can only perform within the bounds of the system that it's running on. So, things like CPU, memory, disk, and network I/O are all very important.
Second, whatever storage backend you're using, it's important to understand how it's working. For Raft, this includes things like transaction count, operation timing, and lots of stuff about Merkle trees that you probably never need to understand.
You also need to have a solid understanding of how your clients are interacting with Vault. Barrier metrics serve as a great facet for this because the barrier is the outermost layer of Vault, so you get the best picture of client traffic. Finally, if you're using Enterprise performance replication, understanding the write-ahead log and its replication state is important to understanding how your secondaries are performing.
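As a rough, illustrative starting point (not an exhaustive or exact list: names depend on your Vault version and telemetry sink, and the replication metrics are Enterprise-only), the kind of grouping we mean looks something like this:

```python
# Illustrative only: Vault telemetry and Datadog agent metric names grouped into
# the four buckets from the slide. Check your own telemetry sink for exact names.
KEY_METRICS = {
    "system": [
        "system.cpu.iowait", "system.load.1", "system.mem.usable", "system.net.bytes_sent",
    ],
    "storage (raft)": [
        "vault.raft.apply", "vault.raft.commitTime", "vault.raft.leader.lastContact",
    ],
    "client traffic (barrier)": [
        "vault.core.handle_request", "vault.barrier.get", "vault.barrier.put", "vault.barrier.list",
    ],
    "replication (enterprise)": [
        "vault.wal.persistWALs", "vault.replication.wal.last_wal",
    ],
}

for group, metrics in KEY_METRICS.items():
    print(f"{group}: {', '.join(metrics)}")
```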
Putting our incident responder hat back on, let's go back to our actual dashboards. We can see a huge influx of requests that is most certainly causing the issues we're seeing. System load is also skyrocketing because of I/O wait and CPU utilization. You can even see a Vault server consume so much memory that it kills Vault and the Datadog agent on the host. If you're not familiar with Datadog, if you ever see a straight flat line, you're in a bad spot.
Our runbook told us that we could tighten our rate limiting to weather storms like this. If we think about that last slide, that gave us a rough metric of how much traffic we were serving before we hit this issue. So, now our first responder has almost everything they need to solve the problem.
» Responder access
The final key to this puzzle is access. At this point in our incident, the responder has identified the issue, found the right instructions in the runbook. They just need to tune the rate limit. All responders are in an LDAP group that's mapped to an external group in Vault with the right access to follow any instructions in our runbook.
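A sketch of that wiring through the identity API might look like the following. The group and policy names are illustrative, the LDAP auth method is assumed to be mounted at auth/ldap, and the address and token come from the environment.

```python
# A sketch of mapping an LDAP group to a Vault external identity group that carries
# the responder policy. Names are illustrative assumptions.
import os
import requests

VAULT_ADDR = os.environ["VAULT_ADDR"]
VAULT_TOKEN = os.environ["VAULT_TOKEN"]
HEADERS = {"X-Vault-Token": VAULT_TOKEN}

# Find the accessor of the LDAP auth mount (assumed to be mounted at auth/ldap).
resp = requests.get(f"{VAULT_ADDR}/v1/sys/auth", headers=HEADERS, timeout=5).json()
auth_mounts = resp.get("data", resp)   # newer Vault nests the mount map under "data"
ldap_accessor = auth_mounts["ldap/"]["accessor"]

# Create an external group that grants the runbook policy.
group = requests.post(
    f"{VAULT_ADDR}/v1/identity/group",
    headers=HEADERS,
    json={"name": "dfr-responders", "type": "external", "policies": ["dfr-runbook"]},
    timeout=5,
).json()["data"]

# Alias it to the LDAP group name so membership flows in automatically at login.
requests.post(
    f"{VAULT_ADDR}/v1/identity/group-alias",
    headers=HEADERS,
    json={
        "name": "dev-first-responders",   # the LDAP group's name, illustrative
        "mount_accessor": ldap_accessor,
        "canonical_id": group["id"],
    },
    timeout=5,
)
```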
We also include some common troubleshooting endpoints in case responders end up following relevant public HashiCorp documentation. But you might be thinking this feels like a lot of access to give to somebody who isn't a Vault expert. Keep in mind that users know they're only supposed to use these endpoints within the context of the runbook. We give them very clear "If you see X, do Y" instructions; otherwise, they know it's time to escalate.
» Operational reviews
But that still feels like a bandaid. We didn't actually make anything better. Yes, we weathered a storm. We are back within our SLO, but that doesn't feel like operational excellence. That feels like a bandaid. So, how do we approach operational excellence? We review and iterate in operational reviews. This is where we review the pages we've sent and make sure they were actionable — that they were actioned on — and ensure we aren't abusing our responders.
It's also impractical to alert on every small blip or anomaly. So, the operational review serves as an opportunity to look at that fancy dashboard you just built. Go back two weeks: Does something look weird? Cut a ticket.
This meeting is also a great time to celebrate recent wins, review failures, or demo promising new technologies or concepts. Did we automate a toilsome process? Was there a production incident that involved Vault that we need to follow up on? Did we learn about a new feature in Datadog that might help us better monitor our systems?
» Action items
The most important outcome of these meetings, though, is the forced cataloging of operational tasks. This ensures we are vigilant and constantly iterating on our platform's reliability, resiliency, and documentation. We can catch potential issues before they happen and document patterns we're seeing in tagged tickets. As we plan sprints, we can filter for these tickets and make sure that we're always dedicating time to progressing operational excellence.
» Summary and conclusion
And that's it. We built resilient self-healing Vault clusters. We made sure we can understand how they're performing. We enabled a team of Vault adjacent engineers to respond to common failures, and we created a feedback loop to make sure that we were always improving and iterating on our patterns.
Thank you for taking the time to come to this talk. I hope you got a lot out of it. If you want to dive into any of these concepts, I'll be in the hallway and the hub the rest of the day. But thank you.