Vault at scale: BlackRock's journey
Hear how BlackRock studied their Vault scaling limits as they onboarded 200 Vault namespaces with many more planned.
» Transcript
I'm Joe Pettinicchi. It's really hard to pronounce this name when you're this side of the Mediterranean. I'm Vice President and Lead Engineer at BlackRock. I've been working on a product that we call application identity and secrets management for the last three years, and we've been configuring Vault Enterprise to get ready for large scale.
I'm here to talk about some of the things we've learned about scaling up Vault. Before we do that, I'm going to talk a bit about BlackRock so you can understand what our scale means to us.
BlackRock is an asset manager. Our core business is managing money on behalf of clients across the globe. We're talking about retirement funds for firefighters, nurses, teachers, and technology professionals. We probably are managing investment plans for a lot of different people. We also advise global institutions, so we're talking central banks, sovereign wealth funds. And we're across the world in a hundred countries, so very broad reach.
» Introducing Aladdin
We also happened to build this product called Aladdin. It's an investment and risk platform. Risk management, really. It's a comprehensive, proprietary operating system for investment management.
BlackRock happens to be the largest user of Aladdin, but we also sell this software as a service to our clients. This provides very sophisticated analytics on all securities and portfolios across several asset classes. And it's a technology product used by hundreds of client firms to solve their asset management needs. Very key differentiator for BlackRock.
» Our tech ecosystem
We are using Vault Enterprise with a single logical instance for production. We're also using Terraform Community to manage our Vault configuration. Vault allows us to support a mix of both host-based and cloud-native applications across multiple languages. Host-based may or may not be on-prem; it could be a cloud-provided VM.
We leverage the Vault Agent and the CSI driver for our cloud-native workloads and use in-house libraries for our host-based applications and workloads. Vault namespaces in Vault Enterprise enable isolation for our client environments. Our Vault namespaces are integrated with similarly client-specific Kubernetes clusters. This keeps each client's data isolated from the others. They pay us very good money so we don't share their data with other clients, especially when there are billions of dollars at stake.
We also have another problem, though. Now that we have isolation, we sometimes need to share a secret. Maybe there's a shared resource, like a database that has public market data. In our whole ecosystem, we have about 200 Vault namespaces in production and anticipate supporting about 800 Kubernetes clusters. We are still in the early stages, with fewer than 100 applications onboarded. So, we're a few years into a very long journey.
» Challenges
Let's talk about some challenges we have trying to build out a platform that's going to support all of this. Here's a quick overview of what we're going to cover: scaling Terraform; Vault limits and maximums, and learning about walls before you hit them in a production environment; some challenges we've had with certificate management; and some interesting things about ephemeral credentials, where, even for people who are very familiar with them, there were a lot of a-ha moments once we started learning about them. Finally, we'll touch a little bit on shared secrets.
» Challenges when scaling Terraform
We have hundreds of namespaces, each with different content, and we may support varying combinations of applications. All of our namespaces have a set of common resources, which can lead to repetitive code. Keeping the code DRY is very challenging in that area.
We also have namespaces with additional integrations for ephemeral credentials, like database connections, and varying sets of applications. We have consistently structured roles and policies, so they usually vary only by application name. This also makes DRY code a challenge, because Terraform doesn't usually allow variables in the one place where we want them, like the resource names for the application-specific policy. There are a lot of places where we require flexibility.
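As a hedged illustration of the kind of repetition involved (not our exact code; ours is generated with Python templating), a per-application read policy in plain Terraform might look like this, where the application set, namespace, and KV paths are all hypothetical:

```hcl
# Illustrative sketch only: one read policy per application, where only the
# application name varies. Variable names and paths are hypothetical.
variable "client_namespace" {
  type = string # e.g. "client-a"
}

variable "applications" {
  type    = set(string)
  default = ["pricing-service", "research-batch"]
}

resource "vault_policy" "app_read" {
  for_each  = var.applications
  namespace = var.client_namespace
  name      = "${each.key}-read"

  policy = <<-EOT
    path "kv/data/${each.key}/*" {
      capabilities = ["read", "list"]
    }
  EOT
}
```

for_each gets you part of the way, but as soon as the shape of a namespace differs, with extra integrations or extra engines, you're back to generating code.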
We also need a lot of agility. We must be able to set up or tear down a namespace very easily. We have a new client coming on board. We have three of them this weekend. Well, here we go. New namespaces. Oh, we're going to take this one down. We need to remove that. We need to quickly add and remove Kubernetes clusters. We need to add integrations. And it needs to be done very quickly. It needs to be done sometimes piecemeal. I want to do this one little bit. But we want to be able to do it and keep all the Terraform state files and everything in order.
Another thing is last-mile features. The Vault Terraform provider, as good as it's gotten lately, still doesn't cover 100% of the Vault API. And if you've ever used Consul, it likes to have its own PKI engine and manage its own CAs. If you've plumbed that through with Terraform and hydrated it, you've got conflicts: Consul changes it, and Terraform wants it put back. They don't play nicely sometimes.
» Solutions for scaling Terraform
We use homegrown templates that we've templatized with Python to generate repetitive and dynamic Terraform code. Some people have asked me, why don't we use Terragrunt? Good idea. By the time I inherited the project, we had so much Python code that it would've been a real pain to rip all that out.
Also, early application onboarding was configured with a combination of Python and Terraform modules. For your application teams, that's a real pain: everybody has to learn Python, everybody has to learn Terraform, and they have to understand our very bespoke setup for how we manage and define our secrets. It was just too much of a pain. We replaced most of that with YAML configuration files and are generating the Terraform dynamically from them.
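To give a feel for the YAML-driven approach, here is a simplified, hypothetical sketch. Our real pipeline generates the Terraform with Python, but a similar effect can be had natively with Terraform's yamldecode; the apps.yaml file and its schema here are made up for illustration:

```hcl
# Hypothetical onboarding file, apps.yaml:
#   applications:
#     - name: pricing-service
#       kv_path: pricing
#     - name: research-batch
#       kv_path: research
locals {
  onboarding = yamldecode(file("${path.module}/apps.yaml"))
}

# One KV v2 mount per onboarded application, driven entirely by the YAML.
resource "vault_mount" "app_kv" {
  for_each = { for app in local.onboarding.applications : app.name => app }

  path = "kv-${each.value.kv_path}"
  type = "kv-v2"
}
```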
Essentially, we now have one Terraform apply per namespace with separate state files. This was recommended by the team. It provides agility but also creates challenging situations: for example, terraform import with namespaces doesn't quite work. It's a bit hit-and-miss.
We also use Python in our CI/CD pipelines to instrument the terraform applies. So, we have a multi-threaded, multi-processing environment; we call this process hydration. We run parallel Terraform applies, which require dependency management, and we've developed a DAG to manage those dependencies across the namespaces. We also use Python code to fill in the gaps and handle the last-mile situations.
» Vault limits and maximums
Now that we're done with Terraform, let's talk real quick about Vault limits and maximums. This is something you really need to know. You need a solid understanding of these, and you have to keep scale top of mind when working with any of the resources subject to these limits.
Let's look at some quick examples. Vault stores its data as storage entries, and they have a limited size. If you're using Consul as a backing store, you've got 512 kilobytes. Not a whole lot; seems a little small. If you're using integrated storage (Raft), you've got a megabyte. That's a little better. Can you increase the size? Sure, you can do that if you want to risk your Raft cluster going a little weird on you.
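For reference, the knob in question on integrated storage is max_entry_size in the server's raft storage stanza. A minimal sketch, with the data path and node ID as placeholders:

```hcl
# Hedged sketch: max_entry_size is in bytes and defaults to 1 MiB. Raising it
# is possible, but you accept the Raft stability risk described above.
storage "raft" {
  path           = "/opt/vault/data"
  node_id        = "vault-1"
  max_entry_size = 1048576
}
```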
So, here are some quick examples of some of the things that can be limited by this storage:
» Mount points
Certain things are grouped within a single storage entry. Mount points: there are about 14,000 each of auth mounts, secret mounts, and local-only secret mounts. That's a lot of mount points, and I don't think that number gets affected by anything else, so there's your number.
» The Vault namespace limit
You get about 7,000 of them. But if you add a single secret engine to every single one of your namespaces, suddenly you've lost 34% of that capacity, and you go down to about 4.6K — so something to consider if you want to start putting tons of secret engines in all of your namespaces.
» Entities
Entities are impacted by metadata size. You can have one and a quarter million entities, but if you start adding metadata to them, 500 bytes per entity takes it down to 480K. The max metadata is 41,000 bytes; if you really want that much metadata about every entity in your system, then you can only have a couple thousand. Consider being light with your metadata and offloading it to some other system, like your IDP.
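A minimal sketch of that takeaway, with the entity name and metadata key being hypothetical:

```hcl
# Keep entity metadata to a pointer, not a profile: every byte of metadata
# eats into the entity ceiling described above.
resource "vault_identity_entity" "app" {
  name = "pricing-service"

  metadata = {
    owner = "platform-team" # look the rest up in the IDP, not in Vault
  }
}
```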
» Groups
You can get about 480,000 groups if you have ten entities per group. We're looking more at maybe a couple hundred entities per group, so you can get about 50,000 groups with 100 entities each. You can also have up to 23,000 members in a single group, but I don't know why you would do something like that; at that point, why not just add whatever you need into your default policy?
» Policies
And speaking of policies, the max size of a policy is one megabyte, and each token, entity, or group can have 28,000 policies. So, you have to make trade-offs: do I have one complex policy where I start hitting that size limit, or do I have too many small policies?
» Certificate management
We have host-based workloads, and we're using an in-house orchestrator with a very simplistic model. We didn't have a lot of time to set this up. It basically issues certificates on startup; there's no renewal. The cert lifetime has to cover your bounce cycle. Whenever that cert's going to expire, you've got to bounce that app, so you need to plan this very carefully.
This is just based on what some people have mentioned: there's ultra-legacy code that has these issues, and it's hard to change. This creates a lot of unused certs, and if you're in a crash loop creating a 15-day cert or a one-year cert on every restart, we're going to talk to you real soon. We need to find those quickly.
Also, the longer-lived certs that we have — we have stuff that needs to last up to a year — well, you want to store those in Vault because you want to be able to revoke them. You don't want something sitting out there with that kind of risk profile. We thought about how to manage all these certs. Should we just revoke them when people are done with them? We quickly decided no on that, because, in addition to people having to read large CRLs or manage all of that, building a CRL for your certificate authority requires every revoked cert to be in memory at once. This scales order N, as in no.
That's one of the problems. There are Delta CRLs to solve this problem. There's also OCSP, the Online Certificate Status Protocol. I really like this thing; it's a great option, and we're exploring it. One problem is the availability requirement: we need to get a lot of nines out of it, so can we handle all of those OCSP hits?
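On newer Vault versions (1.12 and later), both options can be turned on through a PKI mount's CRL configuration. A hedged sketch follows; the mount path and intervals are placeholders, and attribute availability depends on your Vault and provider versions:

```hcl
resource "vault_mount" "pki" {
  path = "pki"
  type = "pki"
}

# Auto-rebuilt CRLs plus delta CRLs keep full rebuilds rare; OCSP stays on
# so clients can check a single cert without pulling the whole list.
resource "vault_pki_secret_backend_crl_config" "this" {
  backend                = vault_mount.pki.path
  expiry                 = "72h"
  auto_rebuild           = true
  enable_delta           = true
  delta_rebuild_interval = "15m"
  ocsp_disable           = false
}
```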
» Leases on certificates
That's another interesting one. We had people putting leases on certs. They're not helping either, because they only revoke a cert when the lease expires, so you've still got your CRL problem. A large number of leases also impacts your Vault startup time. You don't want to add a lease to something that's already got a termination date on it.
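In Terraform terms, that means leaving lease generation off on the PKI role. A minimal sketch, with the role name, domains, and TTLs being illustrative:

```hcl
resource "vault_pki_secret_backend_role" "workload" {
  backend          = "pki" # assumes an existing PKI mount at this path
  name             = "workload"
  allowed_domains  = ["example.internal"]
  allow_subdomains = true
  max_ttl          = "360h" # 15 days: the cert's own expiry is the lifetime

  # Don't also attach a lease to something that already has an end date.
  generate_lease = false
}
```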
» DR events
Right now, we're not replicating the private keys for our workloads. So, in a DR event, we have to create what one of our success engineers called the thundering herd: we have to issue tens of thousands of certs to bring up all of our workloads in the DR site.
Thinking about that, generating RSA keys for these certificates is massively CPU-intensive. It's really going to bring things to a halt. A key management system or something like it would be very useful for us in the future.
» Runtime renewal
This is really where we've got to go. Best practices, like if you're using Cert Manager or any of these tools, have certs rotate while you're running. But now you've got to make sure your application can handle rotated certs: can I check, oh geez, my cert's about to expire, and reload it from disk, Vault, or wherever I need to?
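For workloads that can run the Vault Agent, the agent can do the re-issuing for you. A hedged sketch of an agent configuration follows; the role names, paths, and reload command are all placeholders, and the application still has to tolerate the files changing underneath it:

```hcl
auto_auth {
  method "kubernetes" {
    mount_path = "auth/kubernetes"
    config = {
      role = "pricing-service"
    }
  }

  sink "file" {
    config = {
      path = "/var/run/secrets/vault-token"
    }
  }
}

# Re-render the certificate as its lease nears expiry and nudge the app
# to reload it.
template {
  destination = "/etc/app/tls.pem"
  command     = "systemctl reload pricing-service"
  contents    = <<-EOT
    {{ with secret "pki/issue/workload" "common_name=pricing.example.internal" }}
    {{ .Data.certificate }}
    {{ .Data.private_key }}
    {{ end }}
  EOT
}
```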
» Ephemeral credentials
These are some of my favorites. Ephemeral credentials are nice. You can use Vault secrets engines to create things like Snowflake database connections. We're using Azure Secrets and those kinds of things. Snowflake is pretty good about creating a login.
Doing a completely dynamic service principal can take a while. Also, some people are getting worried about how many privileges Vault has in Azure AD to create service principals with who knows what privileges. There can also be eventual-consistency issues: the credentials may come back right away from the API call, but for two minutes they're not good, because there's a huge cluster of endpoints out there. You have to figure out when your service principal or dynamic credential is actually going to work.
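To ground the dynamic pattern, here is a hedged sketch of a Snowflake database secrets setup; the connection string, role names, and Snowflake statements are placeholders, not our production configuration:

```hcl
variable "snowflake_admin_user" { type = string }

variable "snowflake_admin_password" {
  type      = string
  sensitive = true
}

resource "vault_mount" "db" {
  path = "database"
  type = "database"
}

resource "vault_database_secret_backend_connection" "snowflake" {
  backend       = vault_mount.db.path
  name          = "snowflake-shared"
  allowed_roles = ["analytics-*"]

  snowflake {
    connection_url = "{{username}}:{{password}}@myaccount.us-east-1/ANALYTICS_DB"
    username       = var.snowflake_admin_user
    password       = var.snowflake_admin_password
  }
}

# Each read of database/creds/analytics-read mints a short-lived Snowflake user.
resource "vault_database_secret_backend_role" "analytics" {
  backend     = vault_mount.db.path
  name        = "analytics-read"
  db_name     = vault_database_secret_backend_connection.snowflake.name
  default_ttl = 3600
  max_ttl     = 14400

  creation_statements = [
    "CREATE USER {{name}} PASSWORD = '{{password}}' DEFAULT_ROLE = ANALYTICS_RO DAYS_TO_EXPIRY = {{expiration}};",
    "GRANT ROLE ANALYTICS_RO TO USER {{name}};",
  ]
}
```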
We've looked at static. That helps a bit with the setup time, and they're easier to manage, but there can be limits. I think with Azure AD – don't quote me on it – if you have a static service principal with somewhere around 645 secrets, it stops responding. Interesting number. If there's a limit, we're going to find it.
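The static variant hands rotation of one pre-created account to Vault instead of minting users. Again a hedged sketch, assuming the database mount sketched above and a hypothetical Snowflake user:

```hcl
resource "vault_database_secret_backend_static_role" "analytics_static" {
  backend         = "database"         # the database mount from the previous sketch
  name            = "analytics-static"
  db_name         = "snowflake-shared" # the connection name
  username        = "REPORTING_SVC"    # pre-created user whose password Vault manages
  rotation_period = 86400              # rotate daily (seconds)
}
```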
Another thing about static roles and principals: we have to ensure our workloads support credential changes at runtime. There are new features in Vault now where I can say I'm not going to rotate a credential out from under anybody until the weekend or until we have an outage window. But ideally, if you have some kind of breach and need to rotate a credential now, it'd be really nice if your applications would self-heal and pick up the new cred.
» Shared secrets
How many people out here have to deal with shared secrets? I see a few hands. That's good.
» Existing challenges
One way to access shared secrets is to authenticate at a higher-level namespace. I'm sorry, I don't have a picture here, but imagine everybody's down here and they want to get to something, so they authenticate at a higher-level namespace.
Well, now suddenly, if you've done that, it can open up access to multiple child namespaces. And this is going to violate our isolation rules. Remember, our clients pay us big money so that client A can't authenticate at a higher level and suddenly see client B's data. We have the same application running for different clients with different data. We need to make sure it's properly isolated.
We also have multiple environments providing these shared resources. We may have them in different regions, and there can also be many domain relationships that are not possible to model in a namespace hierarchy. We may have to refactor shared resources and environments at any time, which will break the relationships.
» Solutions and in-house workarounds
We tried maintaining multiple copies of KV secrets in each namespace, and this kind of works. But you've got to have good tools around it to make sure that it's not a pain for your developers and that it’s always consistent — and your rotations happen in a reasonable amount of time.
You can configure multiple secrets engines against external resources. So, I'm also going to have the common Snowflake engine in every one of my child namespaces. Now we're going to have collisions. And I also have to have multiple root credentials. Do you want a hundred root credentials for the same Snowflake account, with a Snowflake account being equivalent to a DBMS? That's a lot to manage and a lot of rotations to worry about.
You also have to worry about naming conventions so your dynamic credentials don't clash, and suddenly, you've tried to recreate a user that's already there. Even though you have a certain amount of randomness, it may not always work.
The other thing we looked at is performing an additional auth in the shared namespace. That works: you can jump out to the shared namespace through a second authentication once you're in Vault. But somewhere you have to maintain that second set of credentials, and it's bad enough maintaining one for every service. So, we're helping to build a better future in this space.
» Vault 1.13
Vault 1.13 introduced a new solution allowing entities to access shared resources outside the namespace where they authenticate. This has been a game-changer. You create groups in the shared namespaces, give access to the shared resources through the policies, and then add entity members from other namespaces.
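A hedged sketch of that shape in Terraform follows; the namespace path, group name, and policy are illustrative, and the entity IDs would come from whatever created the client-side entities:

```hcl
variable "client_entity_ids" {
  type        = list(string)
  description = "Entity IDs of workloads created in the client namespaces"
}

# The group lives in the shared namespace, grants the shared policy, and its
# members are entities that authenticate in other (client) namespaces.
resource "vault_identity_group" "marketdata_readers" {
  namespace = "shared"
  name      = "marketdata-readers"
  type      = "internal"
  policies  = ["marketdata-read"]

  member_entity_ids = var.client_entity_ids
}
```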
Simple. Well, it simplifies your authentication management. It also simplifies ephemeral credentials because you just have to have the shared resources set up in one shared namespace. But it comes at a price: We had to overcome a couple of hurdles with respect to Terraform.
First, the groups and policies in the shared namespaces have to be created first, because if you run a data lookup to get those things before you add entities and they don't exist, everything falls over. That was easy for us to solve with the DAG in our hydration process. We could still get some cycles, but now, at least, we'll create the shared resources before trying to add entities to them.
We also have to create the static entities and aliases in Terraform. If you're like me, things already exist in Vault. You've already got certificates, you've already got Kubernetes auth methods. They're already creating these entity aliases. So, they're already there. And now you want Terraform to take these over. This took a little bit of effort for us.
First, we have to identify and resolve all the aliases that are already in Vault that Terraform wants to create: they're in the Terraform config but not in the Terraform state, and Terraform wants to go after them. Terraform import would've worked, except it's a little hit-and-miss with namespaces, and you'd have to go in and do some tricks. Depending on your Terraform version, it may or may not work, and you'll still have to edit the state file to fix it.
So we switched to just deleting the aliases for now during hydration. Terraform creates them anew, and we can get things plumbed. We aren't quite live with this, but it's coming soon — a pretty exciting way to automate this solution.
There are also some scale considerations here. Remember I talked about groups and members? Well, in our case, we sometimes have lots of Kubernetes endpoints: you have your east-west, you have your north-south. Each one of those will have to map to a separate entity within your namespace, and each of those entities needs to be a member of the shared group. So, groups with hundreds of entity members significantly limit the total number of groups.
Also, if you look at how a Kubernetes role is set up, you have a list of bound K8s namespaces and a list of bound service account names. If you combine those to create the namespace/SA-name pairs for your aliases, it can be: oh, I've got four SAs and four namespaces. Well, what's the Cartesian product? Oh, bad, stop. No Cartesian products. Don't do it. Someone has to explicitly take these two lists and tell you exactly which combinations they want to use for sharing. Otherwise, you end up with 16 of these per role, times 200 namespaces, and you've got a mess on your hands.
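A minimal sketch of the fix: enumerate the exact K8s namespace and service account pairs that should be shared and create one narrowly bound role per pair, rather than handing two lists to one role. The pair values, mount path, and policy here are hypothetical:

```hcl
locals {
  shared_pairs = [
    { k8s_namespace = "pricing", service_account = "pricing-api" },
    { k8s_namespace = "research", service_account = "research-batch" },
  ]
}

# One role per explicit pair: one alias each, no Cartesian product.
resource "vault_kubernetes_auth_backend_role" "shared" {
  for_each = { for p in local.shared_pairs : "${p.k8s_namespace}-${p.service_account}" => p }

  backend                          = "kubernetes" # assumed auth mount path
  role_name                        = each.key
  bound_service_account_names      = [each.value.service_account]
  bound_service_account_namespaces = [each.value.k8s_namespace]
  token_policies                   = ["marketdata-read"]
  token_ttl                        = 3600
}
```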
» Key insights from scaling Vault
We absolutely could not live without Terraform and Python to manage and apply our infrastructure as code. We learned how to read and understand Terraform state files and manipulate them.
Terraform import has been hit-and-miss in our complex federated codebase with multiple namespaces; sometimes it doesn't get it right. Also, not all resources support Terraform import at this time. So, we've had to do a little bit of work there.
One interesting case: Vault 1.11 introduced the multi-issuer PKI feature. This was really awesome, except the Terraform Vault provider and the Python HVAC library we use got left behind, so we had to fill in the gaps internally. The Terraform provider is actually catching up now. For HVAC, we are considering open-source community contributions; maybe we'll do that ourselves.
Another key insight: Stay on top of version upgrades. We got behind on the provider version, and it cost us a lot to make some of the leaps where we had to jump and do multi-step upgrades without totally trashing our Terraform state.
» Gotchas
» Unbounded configuration growth
Early on, somebody decided in Terraform, oh, let's create a Cartesian product that we'll use to generate our K8s endpoints. Because we have east-west clusters; sometimes we have common clusters; we have web platforms and research clusters. So, they created a quick little Cartesian product and applied it across every namespace.
Once this number hit 79, you couldn't create a namespace anymore. And the next thing we knew, one of our operators said, your Raft storage unit's at 48%, and we've only got a dozen namespaces. What's up with that? We had to refactor our Terraform and set up sparse mapping, so each namespace has its own distinct set of mappings for K8s endpoints.
» Leases on ephemeral credentials
Our common operating model obtains secrets at startup and lets tokens expire. Well, that doesn't work for ephemeral credentials, because the lease on the ephemeral credential is a child of the token that created it. So, when your token's gone, there go those ephemeral credentials. You've got to maintain that token.
We could use the Vault Agent sidecar, but that only works for our K8s workloads. We really didn't want to take the extra cost of having an agent run everywhere and double the number of processes on host-based workloads. So, we ended up setting long TTLs on tokens to support ephemeral credentials. A little bit of a pain, but as long as everything gets shut down when your workload shuts down, things work OK.
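For a host-based workload authenticating with TLS cert auth, that ends up looking like a role whose token TTL outlives the database lease it parents. A hedged sketch with hypothetical names, paths, and TTLs:

```hcl
resource "vault_cert_auth_backend_role" "batch_host" {
  backend     = "cert"
  name        = "batch-host"
  certificate = file("${path.module}/ca.pem") # trusted CA for the workload certs

  token_policies = ["db-creds"]
  token_ttl      = 86400  # 24h: must outlive the ephemeral DB credential's lease
  token_max_ttl  = 259200 # hard cap at 3 days
}
```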
» Load testing stored certs
I mentioned DR events before. We had our host-based orchestrator run a load test against our longest-lived certs, the ones being stored. This ended up creating 250,000+ certs across several environments in pre-prod. They're all stored, they all live for a year, and we couldn't revoke them. Our support team came to the rescue.
Go in, edit sys/raw, delete these things, get them out of Raft. Good thing that happened in pre-prod. If you're going to do load testing, pick a cluster that you really don't care about.
» AZURE_* environment variables
This was a neat trick. We started building performance replicas on Azure and using Azure Key Vault for the unsealer, so we configured the environment variables. These AZURE_* environment variables are the same ones used by the Azure secrets engine: it's the same Microsoft library. Now, suddenly, whatever subscription and whatever SP you're using for your Azure Key Vault, you've got to use for all your clients.
No, you can't. We have 100 clients. They've all got different subscriptions. They've all got different SPs. That's why we couldn't get this thing to work. So, be aware of those environment variables. They get shared across the library. It's not any fault of HashiCorp. I mean, you got to use the Microsoft SDK. It is what it is. But I got to learn how to read Golang.
» Require namespace attributes when using Terraform
Now that we're on a more modern version of the Terraform provider, we require the namespace attribute when using Terraform. Otherwise, what happens? You're pointing at root. Well, that's bad news for you.
All of a sudden, where'd my resource go? Oh, you created it in root. Or how come this data element didn't work? Well, you were in root. So, over a weekend, I quickly learned how to use tflint and the experimental OPA plugin — and learned enough Rego without using ChatGPT.
It was wonderful. So, now we've got our own little linter, and we can start adding all kinds of things. It's in our CI/CD. It's got tests; it tests everything. When we generate our Terraform, we lint it. We're starting to mature in that way. I really do like it.
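A small sketch of the rule the linter enforces: every Vault resource carries an explicit namespace, so nothing silently lands in root. Names here are illustrative:

```hcl
resource "vault_namespace" "client_a" {
  path = "client-a"
}

resource "vault_mount" "client_a_kv" {
  namespace = vault_namespace.client_a.path # never omit this, or the mount lands in root
  path      = "kv"
  type      = "kv-v2"
}
```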
» Summary
Know the Vault limits. That page I showed you: know it, learn it, live it. Keep scale top of mind. Understand what's going to get in the way before it does, because when it bites you in production, you're going to hear about it, especially when you've got a lot of very angry clients facing full site outages during high trade volume.
Measure and identify scale issues early. I didn't even talk about performance; that's another talk for another time. But think about where this is going to go as you try to scale up: where are you going to hit a wall? And automate everything that you can. You can't automate enough.
» Connect with us
Here's a quick little slide. I'll give you a chance to take a picture. Follow us on these places. I'm sorry we don't have a QR code. I am going to get back to my communications management team on that.
Finally, because we are a big public company, we have to have all the important notices. And I can't take all the credit, so I'd like to give some special thanks while I'm here. I'd like to thank Steven Zamborsky, our customer support engineer, and Matt Greco and Ken Westland, who are here; they helped promote my talk. Also, all the conference and speaker coordinators — awesome job — and everyone who worked behind the scenes to make this an awesome conference.