How Weyerhaeuser automates secrets with Vault and Terraform
Learn how to fully automate HashiCorp Vault just-in-time secrets provisioning with Terraform using Weyerhaeuser's multi-cloud landing zone processes.
Just-in-time access and zero trust are at the forefront of security talks all through the industry. Ensuring everyone has just enough access, and only when they need it, is the challenge everyone is facing.
With the help of automation tools like Terraform, DevOps practices, and HashiCorp Vault, Weyerhaeuser ensures they won't have long-lived users, secrets, access keys, and tokens on their next security vulnerability evaluation. Watch this session and demo to see how you can implement a setup like theirs and make your security team love you.
» Transcript
Thanks, guys, for coming to my talk today. As you can tell, we're going to be talking about how we can get our IaC (infrastructure as code) pipelines and our app devs access to the tokens they need, when they need them. But before we do that, I want to look at the HashiCorp State of Cloud Strategy survey and see what it had to say around some of the areas I'm going to talk about today.
» HashiCorp State of The Cloud
What we quickly see is that the theme was multi-cloud, obviously, but that automation and security were very common themes across a lot of organizations. Many of you probably responded to the survey as well. And it's been mentioned, but 90% of the survey respondents said that multi-cloud was working for them, and that's pretty awesome. What they didn't mention was that that's up from 53% last year, and that's even cooler.
89% said security is a key driver or indicator of success in a multi-cloud strategy. So that's pretty cool. 86% said that for their multi-cloud strategy, they depend upon a cloud platform team, which many of you are probably working on. 99% — so I'll assume somebody didn't understand the question quite right — said that they depend upon automation as a key contributor to the success of their multi-cloud and cloud adoption. So somebody just misunderstood the question and we're actually at 100% there.
The next number that I found a little interesting was that 29%, only 29%, said they're realizing the benefits of a centralized secrets manager, such as HashiCorp Vault. But what's reassuring is that 35% said they would realize its benefit in the next 12 months. So that's really good for the program managers of Vault and HashiCorp Boundary.
Now, before we get too far, who am I? I'm Jeremy Myers, and as we've cleared up, it is Weyerhaeuser, and that's how it's pronounced. Certified with HashiCorp Terraform and Vault. Now that I have Brian's book and he's signed it personally, I'm sure I'll be able to pass the test next week when I take it. And I’ve been doing this for a while.
» Cloud Trends
Let's get back to word clouds because I love word clouds, but let's talk about a multi-cloud strategy with a security focus and some of the things you're going to encounter. If you're doing AWS, then you've got IAM users, roles, policies, access keys, secrets, many of which probably haven't been rotated in 300+ days. If you're doing Azure subscriptions, you've got enterprise apps, app registrations, service principals, client IDs and secrets — most of those secrets are probably sitting there expired and we haven't cleaned them up — and data. It's everywhere.
So we've got databases, data warehouses, data blocks, data grids, data factories and data encryption. Whether in transit or at rest or in memory, it's all got to be encrypted. So we're seeing an increased usage of certificates, and we need those to be everywhere and rotated. Lastly, the open source platform is exploding in our businesses today, and so we're dealing with most of those being run on Linux, with SSH credentials to log into those machines because people still believe they need to log into the machine to actually do the work. With all that complexity, it’s up to the cloud engineers and the cloud platform teams to remove that complexity and give our app teams access to the resources they need access to, obviously at the right time. What does it look like to have a good security model for your cloud platform team and how should you enable that?
» 4 Keys to Cloud Platform Security
Scoped access. We're building accounts, we're building subscriptions and maybe just resource groups based on line of business or application teams. So we should be scoping our access to that resource specifically, not the entire non-prod stack, not the entire non-prod and prod stack, heaven forbid. And when we give someone access, it should be to an exact environment for that account, for that subscription, maybe all the way down to the resource group level I've seen in some cases. Next is the time-to-live. How long should that credential and that access last? I don't really give out temporary passwords anymore on notepads because they tend to stick them on the front of their monitors for safekeeping. When we give them a credential, we put a short time-to-live on it: they've got one job to do, go do it, and it expires after that. So consider that in your pipelines: whenever you're generating tokens for your pipelines, generate them only long enough for the pipeline to execute and complete the process it needs to achieve.
Next is renewable and non-renewable. How many of us have resources that took a little longer to deploy than we expected? It was supposed to be 30 minutes and here we are at the four-hour mark, and it's still spinning up the resources in the cloud. Consider renewable and non-renewable tokens, and set the renewal limit to something reasonable, two or three renewals, to allow it to accomplish the process and the task that it's running. I submitted this talk because we all really enjoy a good password rotation conversation, so here we are.
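As a rough illustration of that idea, a short-lived, renewable pipeline token could be minted with the Vault Terraform provider's `vault_token` resource. This is a minimal sketch, assuming the Vault provider is already configured; the policy name and TTL values are placeholders, not the actual values from this talk.

```hcl
# Sketch only: a short-lived, renewable pipeline token.
# Policy name and TTLs are hypothetical examples.
resource "vault_token" "pipeline" {
  policies         = ["azure-read-dev"] # scoped to one environment
  ttl              = "30m"              # long enough for one pipeline run
  renewable        = true               # allow renewal for slow deployments
  explicit_max_ttl = "2h"               # hard cap, however often it's renewed
}
```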
Lastly, centralized logging and auditing. Some of you may have built your own logging strategies, and that's pretty awesome; some of you are using the native cloud logging strategies; and some of you are sort of like me. I struggle to even find the logs when compliance says, "Are you rotating the tokens?"
“Uh-huh, I am.” “Show me.” “I'll get it back to you, don't worry about it.”
» Team Onboarding
Looking at this from a human perspective (and still some technical here), with self-service IaC, the DevOps movement, and the shift-left movement that we're in, the operations teams aren't really involved in the day-to-day build and release processes anymore. The app teams own and manage those build and release processes largely on their own. So whatever we choose for a centralized secrets manager needs to be enabled for them quickly and easily.
They need to be able to onboard it very easily or they'll just straight up go around us, will they not? So the next piece is, those pipelines in ADO (Azure DevOps) or Jenkins are fully codified, YAML or Groovy, so whatever solution we choose needs to be fully codified as well; repeatable, reusable, and adoptable on a moment's notice across many teams in the organization.
Next to last was the many workspaces. With accounts and environments being built kind of one-to-one, we see many workspaces matching that alignment as well — four or five workspaces maybe per application to cover all your environments across hundreds or thousands of applications. I talked to a few folks who were in the thousand range this week. So this isn't a handful of tokens we're talking about, we're talking about a lot of tokens here. Again, repeatable, reusable, scalable is what we're thinking about here. Lastly, certainly not least, as we're all readily aware, it's really hard to find cloud engineers and security engineers. I'm lucky to be in a room full of them now. Whatever we decide on here, we have to support. We have to own this. So we need this process to be fully automated so that it doesn't create any undue overhead for these teams.
» How We Got Here
A little bit about our journey here and how we landed on the solution that we did. We were already using Terraform in a very niche area of the business, and they said, "We really like this. This is cool, this is good. Make everybody use it, roll this out, modernize it." And so that's what we began to do, and we're in Terraform Cloud and we've fully rolled that out now. We've automated the landing zones both for Azure and AWS, fully using Terraform to deploy those resources for the app team. We'll get into that more in just a minute. But it made a lot of sense when it came to the secrets management aspect of it to also use Vault. It's tightly integrated with Terraform, and we've built it in such a way that the app teams really don't even know that we are using it. We've abstracted that piece of it away from them, so they don't need to know it. It just works. They have the access they need when they need it without needing to know about it.
Why Vault? Well, they just mentioned in the keynote before: many providers — the ecosystem is alive and well for both Terraform and Vault. So the capabilities and the requirements we had were easily met by these two solutions. Next is the shared ownership, if you will, within namespaces. Create a namespace for the folks who own that secret and need to manage it. It's not a one size fits all.
Security approved our usage of Vault, but they also quickly said, “We want our own namespace to manage our stuff.” It's like, “Sure, that's yours. You can do whatever you want with it. We'll have our cloud namespace, you have your security namespace, and separate that, and we don't have access to each other's secrets at that point.” Built-in logging and auditing: as I mentioned, as a Splunk user, we flip the switch, we stream all the logs to Splunk, and now we can have the Splunk team set up dashboards. It gives our security and auditing folks access to all the logs that they need to see to prove that we are actually rotating these like I told them I was. Next is ease of use. We're all familiar and more comfortable with one of the API endpoints, or the CLI, or sometimes we like to go and play clicky in the GUI kind of thing. With all those available to us, it meets your needs one way or the other. It's also allowed us to fully automate our management and configuration of Vault and Terraform along the way as part of our landing zone process. So that's really cool.
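For reference, that per-team separation can be declared with the Vault provider's `vault_namespace` resource (a Vault Enterprise / HCP Vault feature). This is only a sketch; the namespace paths are illustrative.

```hcl
# Sketch: one namespace per owning team; paths are examples only.
resource "vault_namespace" "cloud" {
  path = "cloud"
}

resource "vault_namespace" "security" {
  path = "security"
}
```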
Lastly, minimal operational tasks. This is a SaaS offering, obviously, in HCP (HashiCorp Cloud Platform), so I don't need to worry about patching the OS or updating the software. As they mentioned, they're emailing us now before they upgrade the version, and we're readily aware. We can set patch times and things like that, so that's really cool. Less overhead for my team, and we can get on to doing other cool things like reading Brian's book or listening to one of Ned's podcasts. (He wasn't here. I'm not sure if he is.)
» Day 0
Let's look at Day 0. We're probably all familiar with Day 0, if you will — the landing zone deployment. We get some information from the business — app name, environments, cloud providers, some default tags, that sort of thing. We make a call to the respective cloud provider to create that account, right, or a subscription. We bring back that ID and some other information, and we call Vault now. In Vault, we create an AppRole, we create a policy, we create a secrets engine backend path for that account or subscription, and that is put in place. That'll be used later. We'll talk about that more in step four and later on in the slides.
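A hedged sketch of what that Day 0 Vault configuration can look like in Terraform follows; the mount path, role names, TTLs, and policy body are hypothetical stand-ins rather than the exact setup described here.

```hcl
# Day 0 sketch: mount an Azure secrets engine, a read policy, and an AppRole
# for one new subscription. All names and paths are placeholders.
variable "app_name"        { type = string }
variable "subscription_id" { type = string }
variable "tenant_id"       { type = string }

# Azure secrets engine mounted per landing zone
resource "vault_azure_secret_backend" "this" {
  path            = "azure-${var.app_name}"
  subscription_id = var.subscription_id
  tenant_id       = var.tenant_id
  # client_id / client_secret for Vault's own service principal would go here
}

# Policy that lets the workspace token read credentials for this subscription only
resource "vault_policy" "azure_read" {
  name   = "azure-read-${var.subscription_id}"
  policy = <<-EOT
    path "azure-${var.app_name}/creds/contributor" {
      capabilities = ["read"]
    }
  EOT
}

# AppRole used later by the token-refresh pipeline
resource "vault_approle_auth_backend_role" "token_refresh" {
  backend        = "approle"
  role_name      = "token-refresh-${var.app_name}"
  token_policies = [vault_policy.azure_read.name]
  token_ttl      = 1200 # 20 minutes
}
```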
Next we call ADO, we create a repository, and we inject an infrastructure as code template that already has a Vault provider module call in it for them. They don't have to build that themselves. We create some environments for approval steps — stage gates, if you will. Then we create the actual YAML pipeline for them, the IaC deployment pipeline. So they don't need to create anything. On the day that they get that, they hit go, it's ready to go and it works.
Lastly, we make the call to Terraform, and we create the workspace that the YAML pipeline is attached to. We create a team to give the team access to Terraform Cloud. They shouldn't need it, but if they did, they have access. And lastly, we update the variables on the workspace to have a token which allows it to call Vault, and the subscription ID or account ID. We also attach the Sentinel policies, which just got a whole lot easier as of this morning with the release of OPA, so that's really cool. I'm excited to go back and get to work on that.
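Something along these lines, using the tfe provider, covers the workspace, team, and variable pieces of that last step. The organization name, tags, access level, and variable names here are placeholder assumptions.

```hcl
# Sketch using the tfe provider; org, names, and initial token source are placeholders.
variable "initial_vault_token" {
  type      = string
  sensitive = true
}

resource "tfe_workspace" "app_dev" {
  name         = "myapp-dev"
  organization = "example-org"       # placeholder org name
  tag_names    = ["azure", "dev"]    # tags drive the token-refresh pipeline later
}

resource "tfe_team" "app" {
  name         = "myapp-team"
  organization = "example-org"
}

resource "tfe_team_access" "app_dev" {
  team_id      = tfe_team.app.id
  workspace_id = tfe_workspace.app_dev.id
  access       = "write"
}

# Vault token injected as a sensitive workspace variable, rotated later
resource "tfe_variable" "vault_token" {
  key          = "vault_token"
  value        = var.initial_vault_token
  category     = "terraform"
  sensitive    = true
  workspace_id = tfe_workspace.app_dev.id
}
```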
» Day 1
Once the landing zone's done, Day 0 is complete, everything's successful. We hand this to the app team. Day 1, here we are. And this is how they'll interact with Vault while they don't really know it in the background. Don't need to know it, I should say; they know it but don't need to know it. They'll trigger their pipeline. Their IaC code has a Vault provider module call in it that calls the Terraform registry to pull that module down, passes in the token and a subscription ID in this case, and calls Vault at a credentials path. (We're going to dive into this, so don't get too caught up on catching it all yet.) It calls a credentials path and generates those temporary credentials in Azure AD.
The client ID and secret come back to the Vault, come back to Terraform, come back to the module, get passed into the Azure RM provider block, at which time they can authenticate into Azure. No need for them to see the client ID. No need for them to know the secret. No need to store it anywhere locally. Just use the system that we've built.
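Inside a module like that, the flow roughly looks like the sketch below: the `vault_azure_access_credentials` data source generates the temporary service principal, and its outputs feed the azurerm provider. The backend path and role name are assumed values that match the earlier Day 0 sketch.

```hcl
# Sketch of the module's internal flow; path and role are placeholders.
variable "subscription_id" { type = string }
variable "tenant_id"       { type = string }

data "vault_azure_access_credentials" "creds" {
  backend = "azure-myapp"  # the subscription-specific secrets engine path
  role    = "contributor"
}

# The generated client ID and secret are passed straight into the provider,
# so the app team never sees or stores them.
provider "azurerm" {
  features {}
  subscription_id = var.subscription_id
  tenant_id       = var.tenant_id
  client_id       = data.vault_azure_access_credentials.creds.client_id
  client_secret   = data.vault_azure_access_credentials.creds.client_secret
}
```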
Now, that token that's on the workspace is still a secret, and we should treat it as such. We should rotate it frequently. Caveat: I left it in clear text in the workspace in the demo on purpose so you can see it, but we shouldn't be doing that. It should be sensitive.
The pipeline to do this triggers on a scheduler, cron or manual or however you want to do it. I said 30 days — might be three days, might be three hours, whatever you choose that works for you. That pipeline, though, will call the Terraform API, get a list of the workspaces based on the tagging. (Please use tags.) Based on the tagging, it separates the list into AWS and Azure and pulls subscription IDs and account IDs. Makes a call to Vault using an AppRole that it has that gives it just enough access to generate new tokens — that's it. It doesn’t create anything else. It gets that token and brings it back to the pipeline, which then makes a call back to the Terraform API to update the workspace with that token that it'll use later — again, sensitive. All this is logged to Splunk, and we can tell our monitoring and compliance folks that we are in fact doing this and we can prove it.
These tokens are per environment. It's not one token for all the workspaces or all of one stack within an application. Dev is different than QA which is different than prod. So we get real separation of duties there and scope.
Roles and responsibilities — Day 0, Day 1, Day 2 stuff. The cloud ops folks, they're building things with landing zone automation. Their token has access to create — not delete, not update, create only. Very scoped here. Create AppRoles, create policies, create credentials backends within Azure and AWS, that sort of thing.
App team. They don't really need a token. Their workspace needs a token. So their token has access to read those credentials that we just put in the Vault backend. That's it. Read — not update, not create, not delete, read only.
Security ops. I hope they own the token maintenance process for you. Their token needs access to update — not create, not read, not delete, update only. So we give them a token.
And our folks in security, governance, and compliance: we haven't forgotten about them. No token for them. They don't need access to Vault. We're logging this to Splunk. Send them to the Splunk team, get access to Splunk, look at the dashboards, there's the logs. Don't give folks access to Vault that shouldn't have access to Vault, right? These are your secrets.
If we did this right, we've automated the entire configuration of Vault on Day 0. We've generated a provider module that allows the app teams to onboard using Vault very easily. We injected into the repo. They don't even have to do anything, they just use it. We're renewing and refreshing those tokens on the Vault at a cadence that makes sense to us. And we have good README documentation in that module so they can onboard and see it and use it if they need it. They shouldn't really need to, but they could. All this has reduced operational overhead to our ops teams, hopefully, and we can get to doing other cool things.
» Demo
So, live-ish demo. It's all recorded. What we see here is the Day 0 AppRole that allows the pipeline to create credentials. A lot of this is going to be focused on Azure — that's kind of my area of expertise — and some AWS as well. But Azure's primarily where I've been working, so these are all based on Azure.
» Building the Token Policy
The key here is the subscription ID that's in the token policy and some of the rule names and things like that, because this is very specific to that subscription for that environment. This allows it to create in that environment and update as it goes along. They kind of work hand in hand.
Next is the Day 1 role and policy that the token on the workspace gets. Note: it has read access to a very specific subscription ID path in the credentials path. And we happen to see that it probably is getting contributor-level access. We'll see that in just a moment. The secrets backend role that we created for the Azure subscription and for the team has a TTL (time to live). If you're familiar, this is about four hours with a max of eight hours, so you could potentially renew it up to eight hours.
We're giving it contributor-level on that subscription, and we're also giving it Key Vault Secrets Officer — we'll see why in just a minute. You could drive this down to the resource group if you want to. I've seen that done. It's pretty common. On AWS, it's usually just at the account level, and you can set different roles for it to assume within that. You define the role and the policy attached to it to determine what they can do within that. But this is kind of a generic setup that you might start with.
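A sketch of such a subscription-scoped role with the Vault provider follows; the mount path, subscription ID, and TTL values are placeholders rather than the exact configuration shown in the demo.

```hcl
# Sketch: Azure secrets engine role scoped to a single subscription.
resource "vault_azure_secret_backend_role" "contributor" {
  backend = "azure-myapp"
  role    = "contributor"
  ttl     = "4h" # default lease
  max_ttl = "8h" # renewable up to this cap

  azure_roles {
    role_name = "Contributor"
    scope     = "/subscriptions/00000000-0000-0000-0000-000000000000"
  }

  azure_roles {
    role_name = "Key Vault Secrets Officer"
    scope     = "/subscriptions/00000000-0000-0000-0000-000000000000"
  }
}
```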
Within the Vault GUI itself, we'll look around. I mentioned separation of duties, separation of ownership, and that's what we have with namespaces. So we got the cloud folks, we got the corporate folks, we got database teams, DBAs. HR wants their stuff because it's super special. And then security team wanted their own, let's say.
» Secrets Engines
Only enable the secrets engines that you're actually going to use. Decrease your attack vector, your tech coverage, by only enabling the engines that you're actually going to use. In this case, Cubbyhole is automatically enabled. We enabled Azure, AWS, and SQL for some database stuff and KV2 for static secrets. And that's a good starting point for a lot of folks.
The key thing on the ACL (access control list) policies: very scoped, very limited in usage for a particular use case. As you all know, you can stack those policies on a token as deep as you want. But the policies are free, so keep them small, keep them use case-related. In this case, on an Azure subscription, I can create credentials that allow me to create the secrets backend path for them. Or, I can read that same backend with a different policy. So keep that limited. There's also a token refresh policy that allows me to do a little bit more work there.
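For illustration, the create-side and read-side policies might look roughly like this; the mount path and role name are placeholders, and in practice these would live in two separate, separately attached policies.

```hcl
# Day 0 automation policy: can create and update roles under the Azure mount
path "azure-myapp/roles/*" {
  capabilities = ["create", "update"]
}

# Day 1 workspace policy: can only read generated credentials
path "azure-myapp/creds/contributor" {
  capabilities = ["read"]
}
```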
» App Team Vault Provider Module
This is the actual Vault provider module that we built for the app teams. This is owned by the cloud platform team. When we built this, we really wanted to kind of pull away the work that was necessary to call Vault for the app teams and just make it ready for them to run. This module is super simple, not hard, and I'm sure most of you have already built something like it.
Just the key things here: make sure you're marking your outputs sensitive so you're not passing back the client ID and secret access key in clear text in the logs. This sensitive marking came out a long time ago, so make sure you're doing that. We still see that occasionally.
The variables. I've set most of them to have a default, so the app team doesn't need to pass in much, but give them the ability to send in other parameters if necessary.
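A minimal sketch of that module interface is below, with assumed names: defaults keep the app team's required inputs small, and sensitive outputs keep the generated credentials out of plan and apply logs. The data source is repeated here only to keep the snippet self-contained.

```hcl
# Sketch of the module interface; names and defaults are hypothetical.
variable "backend" {
  type    = string
  default = "azure-myapp" # sensible default so app teams pass in very little
}

variable "role" {
  type    = string
  default = "contributor"
}

data "vault_azure_access_credentials" "creds" {
  backend = var.backend
  role    = var.role
}

output "client_id" {
  value     = data.vault_azure_access_credentials.creds.client_id
  sensitive = true # keep it out of logs
}

output "client_secret" {
  value     = data.vault_azure_access_credentials.creds.client_secret
  sensitive = true
}
```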
You might have more than one Vault provider module for different cloud providers like AWS. You can log in with the AWS secret and access key, or you can assume a role. Maybe you'd have different modules for each of those techniques, or the same module with a dynamic block in it that decides based on what they passed in. Then that's what you do. That would be cool, the simplicity of that. I think that's it there.
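For the AWS flavor, a hedged sketch using the `vault_aws_access_credentials` data source is below; the backend path, role name, and region are assumptions, and the `type` argument switches between access-key-style and assumed-role (STS) credentials.

```hcl
# Sketch: AWS credentials from Vault, either static-style or assumed role.
data "vault_aws_access_credentials" "sts" {
  backend = "aws-myapp"
  role    = "deploy"
  type    = "sts" # short-lived assumed-role credentials
  # type = "creds"  # or IAM-user style access key / secret key
}

provider "aws" {
  region     = "us-west-2"
  access_key = data.vault_aws_access_credentials.sts.access_key
  secret_key = data.vault_aws_access_credentials.sts.secret_key
  token      = data.vault_aws_access_credentials.sts.security_token
}
```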
Let's look at the registry real fast. Nothing shocking here. We just imported the module to the [private Terraform] registry. Just note that as part of our process, you don't get a module to the registry without approval from the cloud platform team. We want to know that it's been well documented, that it's meeting our security standards and policies, and that it is a functioning, working module. So we have a PR process, and once we approve that, then the pipeline kicks in to publish it to the registry just to make sure we got good code in our environment.
This is the provider authentication that's in the IaC template that the app team gets on Day 1. Pretty straightforward again — it just makes a module call, passes in a token and subscription ID, and validates the credentials. We are in fact using the output in the Azure provider. This is ready to run, there's nothing else needed. And in this case it allows for minor updates on the version if we decide to make a change.
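A sketch of what that call might look like from the app team's side follows; the registry address, module name, input names, and output names are hypothetical, not the actual Weyerhaeuser module.

```hcl
# Sketch of the module call shipped in the injected IaC template.
variable "vault_token" {
  type      = string
  sensitive = true # set on the workspace by the refresh pipeline
}

variable "subscription_id" { type = string }
variable "tenant_id"       { type = string }

module "vault_azure_creds" {
  source  = "app.terraform.io/example-org/vault-azure-creds/vault"
  version = "~> 1.0" # allow minor updates without editing the template

  vault_token     = var.vault_token
  subscription_id = var.subscription_id
}

# The module's outputs feed the azurerm provider directly.
provider "azurerm" {
  features {}
  subscription_id = var.subscription_id
  tenant_id       = var.tenant_id
  client_id       = module.vault_azure_creds.client_id
  client_secret   = module.vault_azure_creds.client_secret
}
```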
» How the Entire Access Process Works
Here it is in practice. We actually have the module, we actually run it here and we'll see that it will go get those credentials. Now if you're not familiar with this, what it's actually doing behind the scenes is:
Calling Azure AD
Spinning up an app registration that's titled “Vault-something” if you don't change the naming standard
Creating that client ID and secret
Waiting for it to propagate through the AD system (which does take a minute or so)
Testing those credentials a number of times
I have commented out on 10 and 11: tested it three times, 20 seconds between each test. That way when it comes back and goes to the provider, those credentials have been propagated across the system.
This is a bug. They're working on it, and have been working on it. So you may have to play with that just a little bit to see what works best for you. But we're going to get there.
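Those propagation retries map to validation arguments on the credentials data source; the values below mirror the three-tries, 20-seconds-apart behavior mentioned above, but the exact numbers are tunable assumptions for your environment.

```hcl
# Sketch: credential validation settings on the Azure credentials data source.
data "vault_azure_access_credentials" "creds" {
  backend                     = "azure-myapp"
  role                        = "contributor"
  validate_creds              = true # log in with the new SP before returning it
  num_sequential_successes    = 3    # require three successful test logins
  num_seconds_between_tests   = 20   # wait 20 seconds between attempts
  max_cred_validation_seconds = 300  # give up after five minutes
}
```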
Then the token refresh pipeline. This is going to show you the token refresh process that we go through. It's an ADO pipeline, and what I really want to focus on here for just a second is that the token is in clear text, and that is terrible and you should never ever do that. Do as I say, not as I do. This is for a demo purpose, okay?
We'll let this run. We'll talk about it a little bit. What happens here is we make a call to the Terraform API, get a list of the workspaces, iterate over those, and begin to make the calls to Vault to get a new token. Now, this is where it's important to know that we're using response wrapping.
This solves a little bit of the secret zero issue. If you're not familiar, just Google it. There's plenty of documents and videos from different customers that I watched to learn more about what it is and how to solve it. I did a lot of that in here to ensure that we aren't passing that secret around in clear text and that it's stored securely.
Essentially, there's an AppRole in there, and we saw that for the token refresh. I make a call to that AppRole, and I tell it to wrap a token and put it in the Cubbyhole. When you wrap it, that's where it puts it, in the Cubbyhole. I put a time-to-live of 20-ish minutes on there, so if this pipeline doesn't complete, that thing goes away and it's gone and I don't have to worry about cleaning anything up. I then get the role ID. I then get the secret ID out of the Cubbyhole, unwrap it, use the role ID and secret ID to call the AppRole, which gives me a token. I use that token to log back into Vault.
I'm now under new credentials in Vault that have access to generate policies or a particular policy we'll see in a minute. And with that, I can get a new token to bring back and actually attach to the workspace.
So lots of tokens, lots of login credentials there, but it ensures that I have the right access all through the process here. You can see that we actually did fetch it, get it, unwrap it, attach it to the workspace, and that kind of stuff.
» Code Review
Now, my favorite part of this talk is, we're going to do a quick code review, over 300 lines of code. So here we go, let's do it. It did in fact rotate the token — I think it cycled through a few times, you probably saw it. Now I'll take the first pass, and I removed all my logging, monitoring, error handling, for_each loops, and else's and everything else. We're down to 32 lines of code, and we will just step through that real fast here.
I'm running on Windows here in this case, so I do a choco install of Vault. I set or I call the API in Terraform, as I mentioned. I set some environment variables. I get the secure secret ID and wrap it, and it lands in the Cubbyhole. I string it down to get just the token name that I needed, and I get the role ID. Next, I unwrap that secret ID out of the Cubbyhole. Now that I've got it, it was one-time use, it's gone, right? It's not there anymore in the backend.

I then get my token based off a write to the AppRole, and I get my token back. I log into Vault. I now have access to actually create another token using that policy, and I do exactly that. I just create a Vault token, and I'm using a specific policy that says “Azure read” for that particular subscription ID that I'm iterating over — remember, I'm in the middle of the for_each here.
And lastly, I just make the Terraform API call back to the workspace to inject that new Vault token, which is stored as sensitive, not as clear text. So that was fun.
Response wrapping and secret zero. If you're not familiar, we'll quickly cover what that is. It solves the secret-zero or day-zero kind of scenario. How do you start that? You have to start this process with a secret, and that needs to be secure. That's what response wrapping helps solve. It's in the Cubbyhole. It's single use. No secret is sent or stored as clear text in the process because we're using the response wrapping. It's very tightly controlled by the AppRole that it's scoped to, and the access. We get a limited time-to-live on both the token and the Cubbyhole, which was like 20 minutes. Also, the token that's on the workspace is limited, so that's good.
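To make that concrete, here is a hedged Terraform sketch of an AppRole set up for this pattern; the role name, policy, TTLs, and the wrapping TTL are assumptions about such a setup rather than the configuration from the demo.

```hcl
# Sketch: a token-refresh AppRole with single-use, response-wrapped secret IDs.
resource "vault_approle_auth_backend_role" "token_refresh" {
  backend            = "approle"
  role_name          = "token-refresh"
  token_policies     = ["token-refresh"] # just enough to mint new workspace tokens
  token_ttl          = 1200              # 20 minutes
  secret_id_num_uses = 1                 # each secret ID is single use
}

# The pipeline pulls a secret ID delivered response-wrapped: the real value
# sits in the Cubbyhole until it's unwrapped, and expires if unused.
resource "vault_approle_auth_backend_role_secret_id" "wrapped" {
  backend      = "approle"
  role_name    = vault_approle_auth_backend_role.token_refresh.role_name
  wrapping_ttl = "20m"
}
```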
» Enabling Devs
What's next? Now that we've got all this in place, we can start to enable a lot of different opportunities and features for the app devs. So they probably are using Azure Key Vault or AWS Secrets Manager to store credentials or secrets for their application, which they pull at runtime. And so we're able to inject those services with secrets from the KV store or other areas along the way as part of their pipeline.
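As a rough example of that injection, a KV v2 secret could be copied into the app team's Key Vault during a pipeline run; the mount, secret name, key, and variable are placeholders.

```hcl
# Sketch: pull a static secret from Vault KV v2 and land it in Azure Key Vault.
variable "key_vault_id" { type = string } # the app team's Key Vault

data "vault_kv_secret_v2" "app" {
  mount = "kv"
  name  = "myapp/dev"
}

resource "azurerm_key_vault_secret" "db_password" {
  name         = "db-password"
  value        = data.vault_kv_secret_v2.app.data["db_password"]
  key_vault_id = var.key_vault_id
}
```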
We can also get dynamic database credentials using, in our case, the MSSQL backend provider. One of the folks talked about that in their keynote as well. We've historically had these long-lived database credentials and connection strings just sitting around, and they aren't really needed anymore, right? This allows the app teams to get dynamic credentials to the databases they need, when they need them. And then once they're done, the TTL expires and we can get rid of them.
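A hedged sketch of what an MSSQL dynamic-credentials setup can look like follows; the connection details, role name, TTLs, and creation statements are illustrative only.

```hcl
# Sketch: SQL Server dynamic credentials via the database secrets engine.
variable "sql_admin_password" {
  type      = string
  sensitive = true
}

resource "vault_database_secret_backend_connection" "mssql" {
  backend       = "database"
  name          = "myapp-sql"
  allowed_roles = ["app-readwrite"]

  mssql {
    connection_url = "sqlserver://{{username}}:{{password}}@sql.example.com:1433"
    username       = "vault-admin"
    password       = var.sql_admin_password
  }
}

resource "vault_database_secret_backend_role" "app" {
  backend     = "database"
  name        = "app-readwrite"
  db_name     = vault_database_secret_backend_connection.mssql.name
  default_ttl = 3600  # one hour
  max_ttl     = 14400 # four hours
  creation_statements = [
    "CREATE LOGIN [{{name}}] WITH PASSWORD = '{{password}}';",
    "CREATE USER [{{name}}] FOR LOGIN [{{name}}];",
  ]
}
```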
SSH. Again with Linux, open source, inner source, we need one-time passwords instead of folks storing SSH credentials locally on their machine or passing them around or whatever they're doing.
Lastly, Active Directory service accounts. I've seen Excel sheets of service accounts, more than one tab long, full of service accounts and their associated passwords, and it's just not needed anymore. Vault has the ability to do service account check-outs, where you check out the service account you need, it gives you a password, and you go use it in that process. When you check it back in, it recycles that password in the backend so the next person in gets a different password. We've eliminated the possibility that somebody could reuse that password at that point. So that's really cool and we're going to use those. And I assume service accounts are going to live on our networks and systems for some use case for a long time.
Thanks for coming to my talk. I hope you picked up something along the way here. If you have any good information about how to do this better or want more information on how we did it, please come talk to me. I'd love to hear about it. It's all about learning and growing and that's why you're here. So definitely come and say something. If you want to keep in contact, that's my LinkedIn and you can scan that, it'll hit it. But I'll probably be publishing some of this stuff after the conference via the LinkedIn portal. So thanks.