Case Study

Dev ready in minutes, not days: Schuberg Philis and HCP Terraform

Schuberg Philis uses HCP Terraform and their account vending modules to help companies quickly achieve an internal developer platform experience.

See Schuberg Philis’ account vending machine Terraform module here on GitHub.

»Transcript

My name is Stephen. I am a Mission Critical Engineer at Schuberg Philis. People often ask, what does a Mission Critical Engineer do? Schuberg Philis is a niche company where we manage the critical IT in some of the key industries in the Netherlands. We focus a lot on finance, payments, logistics, banking, that sort of thing.

But I want to talk more about how we engage with these customers and how we have a single KPI. This KPI is 100% functional uptime—and what it takes for us to provide this KPI and this way of working.

Typically, we're engaged with the customer. We run the whole infrastructure stack, so we can provide this KPI. But more recently, customers are coming to us. They're asking us to help them, either by being in their teams or running platforms for their teams—rather than running their actual applications.

»How do we engage with a customer or a prospect? 

We build a team. It's one team that stays with the customer through the whole lifecycle journey. So, you see the same people throughout the initial process and the proposals, through to the initial plans, the build phase, and the run phase.

These teams are self-steering. They're autonomous, and they have all the power to do what they need to deliver this 100% KPI and deliver on this promise. All choices are theirs.

You can imagine this model is fantastic for innovation. You have all these teams who are solving customer problems. I think we build cool stuff. But we have a problem with standardization, and there can be lots of reinvention and unnecessary work.

Today, I'm going to talk more about how we use Terraform and Terraform Cloud (now HCP Terraform) to bring some consistency and to reduce the time it takes to be dev-ready. 

»Planning a cloud journey

I lifted this from one of our sales decks, but I think this probably resonates with anyone who's on a cloud journey. You have your, "How do I use the cloud to get value back quickly?" You want to be safe and compliant, and you want to be able to enable your teams. 

When we chat with customers, they say it shouldn't take months to stand up an environment. They want to have a safety net around their workloads. We often hear things from developers who are new to cloud that, "I'm scared to create an instance. I'm not too sure what I'm doing. How can you help protect me?" 

Then, finally, to get the most value out of the cloud, they want teams to have autonomy. If they have the ownership, then there's more success. There's no throwing over the fence.

»Enablement Platform explained

We have embraced the concept of enablement engineering to help with the Enablement Platform. And it's two parts: the foundation and the services we use. But at the end of the day, it's all about the workloads. 

From a platform point of view, we don't mind what is in the workload. We're more about giving the teams the freedom and flexibility to do what they need to do—having the ownership and autonomy and that safety net that we'll get to in a bit. We started out in this journey by encouraging teams to do everything themselves, so really own the end-to-end. But then, as we saw more common patterns across different teams, we started to pull those out into a service layer. 

We're talking about things like vulnerability management, out-of-the-box monitoring, backups that should just be a matter of setting a tag, maybe even provisioning your own workloads—all that sort of thing.

Finally, we get to the foundation, and the foundation is made up of two parts. It's the organization management, so your whole environment. Then we talked about the account vending machine, and that's what I want to focus on today: How we do the repeatable deployments of our accounts and pipelines and associated services.

»What is dev-ready? 

For us, this is the moment a team can start being productive. The environment is ready; they can log into it. Any services they need are then available. They can get to work and start delivering value and building some stuff. I work more with AWS teams, so I would like to share an AWS example. 

»An AWS example

If you look at the most basic version of what dev-ready could be for someone, we're talking about the account being ready. A Terraform Cloud (now HCP Terraform) workspace is created, we have our code repository, and any necessary services are configured to use. That's accounts and services.

Let's take you through what it looks like. We have a Terraform plan. It's going to run in HCP Terraform, it's going to reach out to AWS and talk to Control Tower via the service catalog, and it's going to provision an account. 

Once the account is created and it's been enrolled—and all the policies are done—then the resource in Terraform says it's complete. Then, we assume a role in that account and create an OIDC provider and a few IAM roles to use.

The next step is to then talk to HCP Terraform. We have an HCP Terraform run talking back to the HCP Terraform API. That's to create a workspace that's going to consume that role with the OIDC provider. So, no static credentials. It's all dynamic. We have a baseline workspace that we also hook up, which I'll get to in a bit. 
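To make that flow concrete, here is a minimal HCL sketch of the same idea. It is not Schuberg Philis' actual module: the organization, account, and role names are hypothetical, and the Service Catalog resource from the public AWS provider stands in for the custom provider the talk mentions later.

```hcl
# Hypothetical sketch of the vending flow: Control Tower account via
# Service Catalog, OIDC trust for HCP Terraform, and a workspace that
# uses dynamic credentials instead of static keys.

# 1. Ask Control Tower's Account Factory (a Service Catalog product) to
#    provision and enroll a new account.
resource "aws_servicecatalog_provisioned_product" "account" {
  name                       = "workload-dev"
  product_name               = "AWS Control Tower Account Factory"
  provisioning_artifact_name = "AWS Control Tower Account Factory"

  provisioning_parameters {
    key   = "AccountName"
    value = "workload-dev"
  }
  # ...plus AccountEmail, ManagedOrganizationalUnit, SSO user details, etc.
}

# 2. Assume a role in the new account (provider alias configuration elided)
#    and create an OIDC provider plus a role that HCP Terraform runs can assume.
resource "aws_iam_openid_connect_provider" "hcp_terraform" {
  url            = "https://app.terraform.io"
  client_id_list = ["aws.workload.identity"]
  # thumbprint_list may also be required, depending on the AWS provider version
}

resource "aws_iam_role" "hcp_terraform_run" {
  name = "hcp-terraform-run-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRoleWithWebIdentity"
      Principal = { Federated = aws_iam_openid_connect_provider.hcp_terraform.arn }
      Condition = {
        StringEquals = { "app.terraform.io:aud" = "aws.workload.identity" }
        StringLike   = { "app.terraform.io:sub" = "organization:example-org:project:*:workspace:workload-dev:run_phase:*" }
      }
    }]
  })
}

# 3. Create the workspace in HCP Terraform and point it at that role, so no
#    static credentials ever exist.
resource "tfe_workspace" "workload" {
  name         = "workload-dev"
  organization = "example-org"
}

resource "tfe_variable" "enable_dynamic_credentials" {
  workspace_id = tfe_workspace.workload.id
  category     = "env"
  key          = "TFC_AWS_PROVIDER_AUTH"
  value        = "true"
}

resource "tfe_variable" "run_role_arn" {
  workspace_id = tfe_workspace.workload.id
  category     = "env"
  key          = "TFC_AWS_RUN_ROLE_ARN"
  value        = aws_iam_role.hcp_terraform_run.arn
}
```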

Next, we create a repository inside GitHub, which will have all the files needed to use the workspace. For example, for any backend config, any variables you need—that sort of thing—and we hook up our baseline repo to our workspace, which I’ll get to in a bit.

Finally, we are using Okta in this example for our IdP. We create a group in Okta. We're going to create some permission sets for the different roles inside AWS, a GitHub team, and an HCP Terraform team.

Then, we merge all the RBAC (role-based access control) and push rules in Okta. Basically, all the logic is there, and a user is dropped into that group. If all goes well, they're logged into the Okta dashboard, and they're greeted with these three buttons.
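Here too is a rough sketch of that access wiring, assuming the Okta, GitHub, and tfe providers and purely illustrative names:

```hcl
# Hypothetical sketch: one group in the IdP, a permission set for AWS access,
# and matching teams in GitHub and HCP Terraform. Names are illustrative.
resource "okta_group" "workload_developers" {
  name        = "workload-dev-developers"
  description = "Developers of the workload-dev account"
}

data "aws_ssoadmin_instances" "this" {}

resource "aws_ssoadmin_permission_set" "developer" {
  name         = "workload-dev-developer"
  instance_arn = tolist(data.aws_ssoadmin_instances.this.arns)[0]
}

resource "github_team" "workload" {
  name    = "workload-dev"
  privacy = "closed"
}

resource "tfe_team" "workload" {
  name         = "workload-dev"
  organization = "example-org"
}
```

Group membership then flows from Okta: drop a user into the group and the push rules propagate them to the right places.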

»How do we do this?

When I do this sort of thing, I like to show you something more concrete. These are the modules we're using. On the left side, you have the actual module configuration via your locals, and we set some variables there that will be reused across different deployments.

On the right-hand side, we call our AVM module—our account vending machine. Then, we specify things like the name and some tags. We take some default values and merge in things specific to that workload.

I want to draw your attention to the workspaces variable because once we've deployed this one time, we don't touch anything on the right-hand side. It's all happening on the left with the workspace configuration. 
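A minimal sketch of that shape, with a hypothetical registry path and variable names (the real module is linked at the top of this page):

```hcl
# Shared configuration on the "left-hand side": locals reused across deployments.
locals {
  default_tags = {
    owner       = "platform-team"
    cost_center = "1234"
  }

  # The only part that changes after the first deployment.
  workspaces = {
    main = {
      auto_apply = false
    }
  }
}

# The AVM module call on the "right-hand side": written once, then left alone.
module "workload_account" {
  source = "app.terraform.io/example-org/avm/aws" # illustrative source

  name       = "workload-dev"
  email      = "aws+workload-dev@example.com"
  ou         = "Workloads"
  tags       = merge(local.default_tags, { workload = "payments" })
  workspaces = local.workspaces
}
```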

»A more logical overview

We start like this: We have GitHub on the left-hand side. The users can push their code into GitHub. This is going to trigger a plan inside Terraform Cloud (now HCP Terraform) and then push those changes through to AWS.

The baseline now comes into play. With the baseline, it's a single repository that we use across every environment. This is so we can do things that cannot be done from a cloud environment control level. Maybe every account should have a VPC. Maybe every account should have an ATM-SD. Maybe we have policies like you can't create volumes without KMS. 

These are things or resources we can address in the baseline and roll out everywhere. Just to show you how this looks: the developer teams get to work on the top, and they have no idea what we do at the bottom. It's fully under our control.
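As an illustration of what such baseline resources could look like (not the actual baseline; the CIDR and the policy details are made up):

```hcl
# Hypothetical baseline rolled out to every account from the shared
# baseline workspace.
resource "aws_vpc" "baseline" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name      = "baseline"
    ManagedBy = "platform-baseline"
  }
}

# Illustrative guardrail in the spirit of "no volumes without KMS":
# deny creating unencrypted EBS volumes.
resource "aws_iam_policy" "deny_unencrypted_volumes" {
  name = "deny-unencrypted-ebs-volumes"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Deny"
      Action   = "ec2:CreateVolume"
      Resource = "*"
      Condition = {
        Bool = { "ec2:Encrypted" = "false" }
      }
    }]
  })
}
```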

When we talk about this way of working, it's probably quite a familiar image. We get the app teams or dev teams to push their code into the GitHub repository, and when they create a pull request, that will create a job inside Terraform Cloud (now HCP Terraform)—and that's going to run any checks, any plans. They can keep pushing into that branch until they're ready to merge it. 

Once that's been merged into the main branch, they can either have it auto-applied or applied, depending on their risk appetite. From there, it gets deployed out into whatever resources they want to manage.

A different view on this is how it's actually managed. As the platform engineering team, we sit on top, and we manage all the modules inside the private registry. From a security perspective, it's either a security team—or actually us working on behalf of a security team—that is creating Sentinel policies.

These Sentinel policies contain anything from business requirements—like certain platform-required tags or allowed regions—to naming conventions and that sort of thing. We tried with Sentinel policies to replicate everything we have as a constraint in the cloud, which was quite a bit of work.

There's nothing worse than having your plan be all green, you merge the code, and then when you deploy it, you find you're not actually allowed to deploy to that region or you're missing a tag on a resource. 

Then, you have the situation where you basically have bad code merged into your main branch—and you have to go back and fix it. So, when the developers are pushing code to their repository and a run starts in the workspace, they know they can keep pushing to that branch until it goes green, and they have a higher success rate once they have a green light from HCP Terraform.
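For reference, attaching a set of Sentinel policies to workspaces can itself be managed with the tfe provider. This is a hedged sketch with hypothetical repository and organization names, not the team's actual setup:

```hcl
variable "vcs_oauth_token_id" {
  type        = string
  description = "OAuth token ID of the VCS connection in HCP Terraform"
}

# Hypothetical policy set: Sentinel policies live in their own repository and
# apply to every workspace in the organization.
resource "tfe_policy_set" "platform_guardrails" {
  name          = "platform-guardrails"
  organization  = "example-org"
  kind          = "sentinel"
  policies_path = "policies"
  global        = true

  vcs_repo {
    identifier     = "example-org/sentinel-policies"
    branch         = "main"
    oauth_token_id = var.vcs_oauth_token_id
  }
}
```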

»Our workspaces module

I can imagine that when you look at this, you think it's great to have one workspace, but in reality, this is not how it goes. As developer teams get more mature, we encourage them to split up their state and workspaces so they have a smaller blast radius and can have different resources for different risks.

If I think about something like an RDS instance or an S3 bucket where I've put deletion protection on, then that's a great candidate to put into its own workspace and have it on auto-apply. Because if you do something wrong, the actual resource and the provider will stop you from making the error.

But maybe some other resources don't have this protection, and then you want to have the manual check. That's really up to you.

When this module runs, the same thing happens. It's going to generate the directory, generate all the configuration, and add another workspace, and then you're able to move your code around and start. You still have the same account and the same baseline, but you're able to cut up your resources as you desire.
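Continuing the earlier sketch, splitting state is then just a matter of growing the hypothetical workspaces map; the key names and flags are illustrative:

```hcl
locals {
  workspaces = {
    # Protected, slow-moving resources: fine on auto-apply, because the
    # resources themselves (e.g. RDS deletion protection) are the backstop.
    data = {
      auto_apply = true
    }

    # Everything else keeps a manual apply step.
    app = {
      auto_apply = false
    }
  }
}
```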

»Why Terraform? 

To start, Schuberg Philis has a long history with Terraform. We started to use it in 2014. When I joined the company in 2015, we were already using Terraform to provision internal cloud resources. 

I think the concept of a state machine appeals to us. The fact that if it has an API, we can manage it: we can import the state, we can inspect it, we can update it. I think the sheer number of providers that we have makes it a no-brainer. If I come back to those teams who want autonomy, then they're able to use Terraform and choose different services as they want.

When we started on our journey, this is what it typically looked like. We had Terraform with AWS, used GitHub as our code repository and Okta as our IdP. Then, quite often, we had Datadog as our monitoring platform. That would introduce things like team access, automatically scraping resources as they were deployed, and that sort of thing.

But to give a more concrete example, you can just as easily swap out Datadog for something like Sumo Logic. Most recently, maybe a year ago, we added support for GitLab, so that swaps out GitHub.

»Why Okta? 

Maybe we already have Azure AD or Microsoft Entra ID, and we don't want two IdPs, so we're going to swap that out as well. At the moment, we're also working to make Amazon a replaceable unit, so we can even put Azure there. If we wanted to go further, we could change GitLab to Azure DevOps. If we really wanted to go all the way, we could swap out Sumo Logic and use Azure metrics.

I think this pattern of having a code repository backing a Terraform Cloud (now HCP Terraform) workspace is really powerful, whether it's the VCS mode, where it's all user-friendly and easy for you, or the API mode, where you have a bit more control and can build pipelines to roll things forward and back.
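The difference between the two modes boils down to whether the workspace has a VCS connection. A small sketch with hypothetical names:

```hcl
variable "vcs_oauth_token_id" {
  type = string
}

# VCS-driven: pushes to the linked repository trigger plans automatically.
resource "tfe_workspace" "vcs_driven" {
  name         = "workload-dev-app"
  organization = "example-org"

  vcs_repo {
    identifier     = "example-org/workload-dev"
    branch         = "main"
    oauth_token_id = var.vcs_oauth_token_id
  }
}

# API/CLI-driven: no vcs_repo block, so a pipeline decides when to queue runs,
# which gives you more control over rolling forward and back.
resource "tfe_workspace" "api_driven" {
  name         = "workload-dev-pipeline"
  organization = "example-org"
}
```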

»Why HCP Terraform? 

I often get asked, "Why do you need this? We've done this for years without it. We don't need it. We can build this stuff ourselves." And, to a degree, I think we could. There are some smart people out there. But I think the mindset shift is that we are not here to build and deploy services. We want to be integrators.

»Access management 

There are a few key points I want to talk about, and the first is access management. Another very difficult problem people talk about is the state: how can this state file hold all these credentials? With Terraform Cloud (now HCP Terraform), it's easy. You push a button, and then you can't read the state. You can read the outputs but nothing else.

We're able to have teams limited to different workspaces, so I can have users moved around in Okta and propagated to different teams in HCP Terraform. We like to drive a least-privilege model, so no one has an overarching permission set. Everyone is stripped down to what they need to do.
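In Terraform terms, that least-privilege model can be expressed with team access resources; a sketch under the same hypothetical names used above:

```hcl
# Grant plan/apply on this one workspace only; no organization-wide permissions.
resource "tfe_team_access" "workload_team_access" {
  team_id      = tfe_team.workload.id      # team from the earlier access sketch
  workspace_id = tfe_workspace.workload.id # workspace from the vending sketch
  access       = "write"
}
```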

»Job queuing

Job queuing may sound easy. But to implement this yourself is extremely difficult. You can imagine you have riskier environments, or maybe risk-averse environments, and you merge something—and then wait to do the manual apply. Then, someone else has a quick fix. They merge that, and they push it through. And hey presto, your change goes out for free along with theirs.

Terraform Cloud (now HCP Terraform) allows you to run these jobs sequentially. If you want to do a manual inspection, you have the choice to opt out and do a conscious discard of the run. Additionally, for gated deployments: having this manual gate to say yes or no is really powerful.

»Policy as code

I spoke a bit about the sort of customers we engage with and how important it is to codify policies. Using Sentinel took a lot of work, but we were able to translate everything into a policy and guide people to the best way to do something.

They get feedback from the machine, which I think is the best way. If you have someone do a code review and do manual comments, it's not as good as the machine telling you, "This is the wrong setting. Please do a fix."

»The private registry 

Some customers say they want to have an approved list of modules. How do we tackle this? We publish modules into the private registry and then have a policy that says: you can only pull modules from this registry.
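Consuming an approved module then looks like any other module call, except the source points at the organization's private registry (the organization and module names here are hypothetical):

```hcl
module "artifact_bucket" {
  source  = "app.terraform.io/example-org/s3-bucket/aws" # private registry source
  version = "~> 1.0"

  name = "workload-dev-artifacts"
}
```

A Sentinel policy can then reject any plan whose modules come from a different source.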

»Challenges

»No programmatic way to create AWS accounts 

This has not been without challenges. This journey started about four, five years ago, maybe a bit longer. At that point in time, there was no way to create AWS accounts programmatically with Terraform using Control Tower and such.

You had the option to create the accounts using AWS Organizations, but then there was a manual enrollment step. There was always a manual tweak. We created our own provider. The provider reaches out to the Service Catalog API. From there, we can poll the API until the account is created. When we get a response back, the resource is marked as completed, and we move on.

I think we like seeing the feedback in Terraform. And at that point in time, the suggested way was to trigger a Lambda, but the Lambda, we felt, was fire-and-forget. You have to wire up additional ways to get notifications—keeping it native is best.

»Could not template backend Terraform files

Another pain point we had was the backend.tf—so your cloud configuration. That was a real pain point: you create the workspace, you create the code repository. These days you can turn off the workspace's initial run by default, but when you created a workspace back then, it always tried to do a run.

The configuration was never there, and so the workspace always failed. Every workspace had a failed job to start with, which doesn't look good if you are rolling this out to developers and teams.

So, we created the GitHub repository file resource. We use this to template the file with all the data we have in the workspace and from our AVM module I mentioned earlier, so it works out of the box. It has any variables, any backend config, maybe even some data resources. Anything to get them fast-tracked. The whole goal is to get these people productive sooner.
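A hedged sketch of that templating, assuming the github and tfe providers; the repository, branch, and workspace names are placeholders:

```hcl
resource "github_repository" "workload" {
  name      = "workload-dev"
  auto_init = true
}

# Drop a ready-made cloud/backend configuration into the new repository so the
# first run in the workspace has something to work with.
resource "github_repository_file" "backend" {
  repository          = github_repository.workload.name
  branch              = "main"
  file                = "backend.tf"
  overwrite_on_create = true

  content = <<-EOT
    terraform {
      cloud {
        organization = "example-org"
        workspaces {
          name = "workload-dev" # normally filled in from the created workspace
        }
      }
    }
  EOT
}
```

With this in place, the workspace's first run already finds a valid cloud block instead of failing.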

»Could not use for_each due to nested provider 

Finally, or most recently, we've just overcome the repetition of the AVM module. For those who are more familiar with Terraform and using nested providers, you'll know that you can't use a for_each on a module if the module has the provider. 

We have one provider that is creating the accounts via the service catalog. Then we instantiate another provider that's using the account that was created, so we can't loop over it. For this, we introduced CDKTF.

To illustrate this point, this is what it looked like. Until the last few weeks, we were deploying these different workloads. On one side, you see dev. On the other side, you see prod. But if we look at it, the only difference is the names, maybe some email addresses and environments, and some OUs.

It was really frustrating that every time we had to provision a new account—because, of course, the account and the workspaces are going to that same account—you copy and paste this block and then change a few fields. It just didn't feel good to do this sort of manual labor to provision a new environment, and so we introduced CDKTF.
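Before getting to CDKTF, here is roughly what that duplication and limitation look like in plain HCL. The module path and workload names are made up; the point is that a module which configures its own provider cannot be called with for_each:

```hcl
# Each workload ends up as its own copy-pasted block, differing only in a
# few fields such as name, email, and OU.
module "payments_dev" {
  source = "./modules/avm" # illustrative; configures its own aws provider inside
  name   = "payments-dev"
  email  = "aws+payments-dev@example.com"
  ou     = "Workloads"
}

module "payments_prod" {
  source = "./modules/avm"
  name   = "payments-prod"
  email  = "aws+payments-prod@example.com"
  ou     = "Workloads"
}

# What you would like to write instead fails validation, because a module that
# contains a nested provider configuration cannot be used with for_each:
#
#   module "workloads" {
#     source   = "./modules/avm"
#     for_each = var.workloads
#     name     = each.key
#     email    = each.value.email
#     ou       = each.value.ou
#   }
```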

»Some background on CDKTF

For those who don't know about CDKTF, it gives you a more programmatic way to create Terraform resources. There are a few supported languages, like TypeScript, Python, and Java, and it will generate Terraform code for you. Then, we can follow the usual method of pushing to Terraform to roll out our resources.

It's nice to share how we tackled this. The locals you saw before are how we did it with HCL. Then, we ported that to YAML. It could be JSON; it could be anything you want, because you can import data however you like. What's nice is that the account defaults, the baseline workspace, and the workloads are now distinct keys.

If I show you how we use this, having a programming language and now generating the code, we can source this configuration from different files. For a platform team, we can read a YAML file and say these are our baseline defaults. 

And for an application team, give them their own configuration file and let them configure their workspaces and environments as they see fit. At the end of the day, we can enforce our own defaults and standards.

We give them this very thin API of a workloads key. For this example, we're talking about that YAML file that sits in their environment. They have no access to set any of the other fields, so we have our account defaults that they can't touch. We can make sure that we always have, for example, a boundary policy. They have no way to alter our baseline workspace, so this gives them what they need to do their job and lets us keep the control to do ours.

When we run this code (the CDKTF init would happen beforehand), we run a synth, and the synth creates a bunch of JSON files. If you want to debug something, you can cd into the directory and run the Terraform console. That's normal Terraform, just in JSON format.

Once you run CDKTF, it packages up this code and maps out dependencies between different stacks or workspaces. For example, provisioning the teams before the RBAC but after the accounts—something like that. Then, we follow our usual development workflow.

We have the same Sentinel policies, the same checks, the same manual gates if you don't want to auto-apply, but it fits into the same ecosystem. It's nice to mention that we now have this structure in place.

The example I gave was using a YAML file. We could just as easily have something like a DynamoDB table. You can front that with an API and have a way to post data to it. It ends up in a DynamoDB table, and then that's consumed by a Terraform Cloud (now HCP Terraform) run.

You even have a way to programmatically manage your environments through an API now—or do integrations like some of our customers have done with Jira and ServiceNow. For them, that's their approval flow—so we try to fit into that ecosystem.

»Takeaways

When it comes to takeaways: map out your journey of what dev-ready means. For this presentation, I had a very basic version just to illustrate the points and how we map things out. But it could be as big or as small as you need for your environment.

Don't be shy about including manual steps. Remember that this is going to be an iterative process. This was a four-year journey: from nothing, to writing our own resources and providers, to now having a module where we just push a button.

When building this sort of thing, I think it's worthwhile to look and see what is out there rather than building it yourself. We have a strong culture of reuse before build. Or at least that's what we are moving more towards with these sorts of patterns.

Finally, please check out our AVM module. It is one of a few, but this one illustrates the pattern of having a workspace backed by a VCS.

»How long does this take? 

I went a little bit too quick, but we're talking about 14 minutes. That's the time, from when we merge a pull request, for the Terraform Cloud (now HCP Terraform) agent to become aware of the job, run the plan, pass all the Sentinel checks, and do the auto-apply.

And, in 14 minutes, we can stand up this environment with RBAC and everything, for someone to log into the application with Okta. Which means that if I had been a little bit slower, we could've had two different deployments done in the time this talk has taken.

Thank you so much. It's been great to talk to you and to share what we are doing. If you want to get in touch or learn more about Schuberg Philis, please check out the Linktree or the QR code. We also have links to our modules and to our providers. Thank you.
