How Deutsche Bank onboarded to Google Cloud w/ Terraform
Deutsche Bank shares the story of how Terraform Enterprise was fundamental for its teams to safely land in GCP.
» Transcript
Jeremy:
Well, it's wonderful to be here today. It's great to be back doing these events again, after so much time without the travel — really excited to be here. Also, hearing some fantastic news already with the keynotes from this morning.
I think what Dave spoke about in terms of Cloud 1.0 and then Cloud 2.0 was interesting, and it made us reflect back on our journey at Deutsche Bank and consider where we think we are on that journey. Perhaps you can decide after you've heard our story, but I think we've done a reasonable job so far.
Today, we'll be sharing some insight into that journey. How we started with very little at all and had a very constrained target and timeline to bring our application teams on board — and get them into Google Cloud as quickly as possible.
My name is Jeremy Crawford. I'm the head of Cloud Product at Deutsche Bank.
Thomas:
Hey, morning, everyone. My name's Thomas Chalmers. I come at this more from an engineering perspective, and I've helped to engineer the Terraform Enterprise (TFE) platform we've got within Deutsche Bank. Cool. Let's start with a bit of history — a bit of context as to where we are.
» Starting Our Cloud Journey
Our team formed back in April 2020. This is the height of the pandemic, probably a time not many of you want to remember. But it was an interesting challenge because we were scattered around the world, none of us had met in person — we didn't have that psychological trust that you might have in a pre-existing team.
Jeremy and I came from a previous platform team. We had some of the transferable skills — or so we thought — to go into the cloud, but none of us had a deep understanding of any cloud technologies. Neither of us really knew Terraform either.
We had this announcement that the bank was going to use Google Cloud — as in GCP — quite heavily. That was announced, but it raised the question for us — well, what do we do next? How can we leverage GCP? We knew this was the end state, that a lot of people needed to use GCP, but what was that middle ground? What framework, what tools were out there that we could maybe leverage to allow teams to get on there safely? That was the starting point for our journey.
Jeremy:
Reflecting back on that time, it was quite a daunting position to be in. As Tom said, height of the pandemic, we formed this new team, with a mix of skills — some of us had platform experience, some of us had a little bit of cloud before, a bit of Azure, a bit of AWS.
The announcement was made quite soon after that formation of the team, but for the first few weeks, we still didn't know if we were going to be going with Google or Azure. We did know that as soon as that announcement was made, the hordes were waiting at the gate — they wanted to get in as quickly as possible.
So, we were under quite a lot of pressure, and as an anecdote about where we started from, we really did start from zero. As soon as the decision was made — we're going with Google — one of the first tasks was to grab the domain db.com. Very important to the bank to have that.
Turns out that ten years prior, Deutsche had already done a POC with Gmail, and all of the people who had been involved with that — there were some super admins — had gone AWOL. Well, most of them had left, to be clear.
There was one engineer remaining who happened to still be in the bank, standing between this ten-year partnership and us unleashing db.com. So we were trying to explain to him how he needed to get his password reset and make me the super admin — and then I'd go and open this thing up and away we go. That's literally where we were at.
» Safely Scaling GCP Cloud Consumption
The hordes were waiting at the gate. We knew we had to open that gate, but we had to be able to do it safely, and we didn't have much time. We had some idea about the kinds of concepts that we needed to address — principles around automation, blueprints, identity and access management, networking, security, all the standard stuff.
We had some loose concepts, but we were still really unsure, somewhat naive at this point — we didn't know how we were going to bring this all together. We did have a team that had already been doing some of this with what looked like a standardized approach: GitLab, open source Terraform, and controlling and carving up different aspects of the cloud and managing those with different state files, etc. But other than that, we were unsure.
» Our Core Principles
Thomas:
With that in mind, we wanted to frame our journey with a couple of core principles from the start. These are what we established, and most of these headings probably won't be news to any of you in the audience — but these are things that we wanted in place from the beginning, and they have proved true throughout the journey.
» Everything as Code
Let's talk through some of these. In the top left there, everything as code. We wanted to emphasize the everything part of that. You'll be familiar with Terraform as infrastructure as code. We then discovered and started to love Sentinel — and that's policy as code for us. Then also, documentation as code. Obviously, I don't need to labor the point of why that's important, but for us, it tied into automation, which you see on the right.
» Automation
Only with the everything-as-code principle could we get the value of automation. Through automation, we could then scale, and we didn't have the problem of toil, with engineers manually deploying things that weren't in an appropriate codebase.
» Governance
Everything as code and automation are very tightly coupled. Then obviously, we're a bank, a financial institution, with lots of regulation we need to adhere to — so we have this governance aspect. That overlapped with Sentinel — we saw Sentinel prove its worth in trying to do that. But we weren't sure how we could achieve that at scale.
» Federation
There were those three principles, the fourth being Federation. Throughout all of this, we were quite a small team. The skills gap is something you are all aware of — so we wanted to do this in quite a pragmatic way.
We chose to federate our Terraform modules to allow us to scale and to reuse solutions — problems solved once, then reused many times across the rest of the org. Further into the journey, the federation piece also became about getting out of the way.
We didn't want application teams to feel as if we were a bottleneck for them, or that there was a centralized infrastructure pool. We wanted to disperse this to all app teams across the bank, so they could deploy infrastructure how they see fit. I think this was a big paradigm shift compared to the on-premises world, where teams waited a long time for specific infrastructure. They now have the ability to scale this as they see fit — so, a huge paradigm shift.
Jeremy:
Maybe what else was shifting there was what we were going to hand over in terms of building this platform. Maybe this is more of a Cloud 1.0 piece, but I'd previously seen other enterprises go through this journey and, at the far end of the scale, serially make individual services available one by one, using central infrastructure teams where the actual consumers were kept away from the provisioning of that infrastructure.
And per these principles, we knew that wasn't something we wanted to offer. If the platform was Google Cloud Platform, that's what we wanted to make available to our consumers. We didn't want to be in the way. We wanted to provide the thinnest wrapper around that and adhere to those principles. The app teams would have to be educated; they'd have to learn all about AppSRE, looking after their infrastructure, looking after the full stack. I think that part was clear.
We already had a team. I think this was one of the pioneers that was scheduled to go first into the cloud on our retail banking side. They'd already done somewhat of a POC using GitLab, open source Terraform, and what have you. So, we knew that we needed to use some tooling. We obviously looked at the options that were out there, including a couple of open source products, and started to crystallize these principles.
But there were still open questions. I think we knew that if we don't find the right tool, we're still going to be building a lot of this ourselves, and that's going to push the timelines out. Time that we didn't have. We had to find something that's going to help us accelerate. But, the idea was becoming concrete. We now had a real concept on which to build.
» Landing Zone Concept
We saw some tooling out there, some open source tooling. A lot of these tools and ways of working made reference to this concept called a landing zone. I still get questions today: what is a landing zone? And it's not always the easiest thing to articulate. So I'm actually going to go back to the military origin of that term, which is all about securing terrain — often hostile terrain — and being able to make rapid, safe landings into that terrain.
The steep approach that is the tagline of this talk is a way of getting into a landing zone where helicopters approach over a lot of high obstacles, getting into confined areas, maybe the center of a city: very dangerous, often in hostile areas, very difficult.
Just relating that to the challenge at DB — anyone who works in a financial institution will know there are many obstacles you're going to have to overcome. Just the sheer weight of the program, the timescales, the expectations. We were going to need help to get in there. I think we had some fundamentals established.
Thomas:
We did — yeah! With that framed, as you can see illustrated on the slide, we thought a good concept of a landing zone would start with a collection of Git repositories. That was going to be the base level, somewhere to store the infrastructure as code — and then we'd layer some execution of Terraform on top of that.
For us, that ended up being Terraform Enterprise workspaces. But at this point in time, we weren't too sure what that would look like. We knew we needed code, we needed somewhere to execute it, and that there was going to be an output.
The output in the truest sense — at the beginning — was a Google Cloud folder, which is the biggest area that teams can have out of the box, and that's what we gave teams the ability to use. Coupled with a couple of service accounts, this was the vanilla landing zone out of the box, and teams were then free to go and use it as they see fit.
With those components in mind, we thought this provided a fairly secure manner of getting to the cloud, one where we could move with velocity and speed. This provided the path into the cloud for a lot of the early teams, and then we iterated on it as we went.
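To make that concrete, here is a minimal sketch of what such a vanilla landing zone could look like in Terraform, assuming the Google provider. The folder name, organization ID, seed project, and service account are hypothetical placeholders, not DB's actual configuration.

```hcl
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

# The team's folder under the bank's GCP organization (IDs are placeholders).
resource "google_folder" "landing_zone" {
  display_name = "app-team-alpha"          # hypothetical team name
  parent       = "organizations/123456789" # placeholder organization ID
}

# Service account the team's Terraform workspace runs as.
resource "google_service_account" "deployer" {
  project      = "app-team-alpha-seed"     # placeholder seed project
  account_id   = "lz-deployer"
  display_name = "Landing zone deployer"
}

# Let the deployer create projects inside the team's folder.
resource "google_folder_iam_member" "deployer_project_creator" {
  folder = google_folder.landing_zone.name
  role   = "roles/resourcemanager.projectCreator"
  member = "serviceAccount:${google_service_account.deployer.email}"
}
```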
Jeremy:
This was the concept, and there were these tools out there; Terraform Enterprise appeared to be able to help. But obviously, I think we needed to dig further into that.
Thomas:
And that raised the question: there might be some toolsets out there that could achieve this for us. Was there the option of doing this in open source, and what trade-offs would we be making between those decisions?
» Terraform Usage
That ties into this slide, and many of you in the audience might be familiar with this picture, where on the left you've got an organization, at quite a scale, with a lot of sporadic use of open source Terraform or any other language. Various flavors, various sizes, scattered across the organization.
We weren't in the position where we had that, because we were new to the cloud and, as a result, fairly new to Terraform as well. But that was a bit of a luxury for us. We didn't have this problem of technical debt and teams having to look after state files in various different places. We did see that with the enterprise solutions there was much more standardized control — not only of the state file but of how you can scale as a team and collaborate on a wider level.
With no scaled OSS usage, we were then presented with the question: do we build or buy? Do we provide a wrapper around Terraform as an open source product, or do we start to evaluate the enterprise solutions in a bit more depth?
We were seeing the value that we might get from some of those enterprise solutions, whether that be a strategic place to execute Terraform, managing the state files, or Sentinel policy, which is really quite unique to TFE. It wasn't a feature you could get in the open source world that easily. Sporadic versus standardized: that was the point we were at.
Jeremy:
Fundamentally, we wanted to provide a single gate through which everything could flow. I think, with the build option, it opens up many questions: where is that gate, and is there even one? Yes, you can have your Git repos, but how do you police where they are? What controls do you have over wiring them into your workflow engine? How do you control where people might source their modules?
All of that was quite scary. We needed somewhere we could herd the cats and have them all flow through this gate. Once they're through the gate, then we have control, and we want to give them as much autonomy as possible once they're inside. Some of those key features:
» Sentinel Policy as Code
You're going through the gate, but we're checkpointing you every time, and we can federate that — we can build communities around policy authoring. It doesn't have to be the central security organization that's writing those policies.
That means we get to inspect everything. Clearly, there are other products out there that we looked at, but this being a HashiCorp product, you're going to stay ahead of the curve. As new resources and new providers become defined and available, Sentinel policy will be able to cover what we might be allowing, permitting, and creating with Terraform.
» The Private Module Registry
I think there were still probably a couple of gaps. Similarly, with the modules, we had to have a single central place, and only from that location could modules be consumed. The private module registry does exactly that. Coupled with Sentinel, we can have policies that say, first of all, if you're creating any resource, you can only use modules from our private module registry. Then, if you're using a certain type of module specifically, you can only use this module.
So, for the project resource (which, if you use Google, is your basic container for resources), we've already created a mandatory policy that forces you to use our project module. That way, we get to define all the metadata and labels. You've got the control, but you're also giving autonomy; you're also federating this out.
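As an illustration of the consumer side of that policy, a team pulling the project module from a private module registry might write something like the following. The registry hostname, module namespace, and inputs are made up for the example.

```hcl
# Hypothetical consumption of a mandated project module from the private
# module registry (hostname, namespace, and inputs are illustrative).
module "my_app_project" {
  source  = "tfe.example-bank.com/platform/project/google" # placeholder PMR address
  version = "~> 2.0"

  name      = "my-app-dev"
  folder_id = "folders/123456789" # the team's landing zone folder

  # The module, not the consumer, enforces the bank-wide metadata and labels.
  labels = {
    cost_centre = "cc-0001"
    environment = "dev"
  }
}
```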
» Module Authoring Framework
There were a couple of pieces missing from the product, and hearing Armon talk about some of the new features that are coming, it's great to see them being addressed. But at the time, at the back end of 2020, we were thinking about how to plug those gaps.
We came up with the concept of a module authoring framework: providing pipelines for creating modules and ensuring that standard templates are used — ensuring that kitchen tests are authored — and then running that pipeline: creating the module, running all the tests against it, does it do what you expected it to do? Great. Then you could go and publish it to the private module registry (PMR).
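To give a flavor of what such a standard template might enforce, here is an illustrative skeleton; the variable names and version constraints are assumptions for the example, not the bank's actual template.

```hcl
# versions.tf: pin the Terraform and provider versions the pipeline tests against.
terraform {
  required_version = ">= 1.0"
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 4.0"
    }
  }
}

# variables.tf: inputs every module in the registry is expected to expose.
variable "name" {
  description = "Name of the resource this module creates."
  type        = string
}

variable "labels" {
  description = "Bank-wide labels applied to every resource the module creates."
  type        = map(string)
  default     = {}
}
```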
Likewise, with policy, we had to create a policy authoring framework. Same principle: author the policy, create some resources, run Sentinel against them, run the positive case, run the negative case. Then, if that all works, great — the policy gets published.
And again, federating that. It doesn't have to be run and delivered by a central team. If a team wants to come on board and make their new service available — and they're the pioneer — then let's have them involved in writing that module, writing those policies.
» Journey With Terraform Cloud
I think quite clearly there were these couple of gaps, but the vast majority was being fulfilled. So we made the decision, we made the call, and we got in touch with our account team in the UK and said, look, here's the deal — we've got limited time, how can you help us? A Terraform Cloud pilot was proposed, and we approached it in such a way that if it all worked — and we were hoping, fingers crossed, it would — then we'd evolve it into a production solution.
We needed a rapid answer. We had the hordes at the gate. We had to turn this around quickly, and — credit to the account team that worked on that — all of it happened rapidly, with some fantastic technical support and guidance as we progressed.
Thomas:
The TFC pilot was born. For those unfamiliar, TFC versus TFE is essentially a SaaS versus self-hosted question. We started with the SaaS offering, and that was great because we got to very easily understand the features and benefits of the solution without the toil of trying to stand it up ourselves.
» Workspaces
We had the TFC pilot and immediately saw the value of workspaces. Teams can segment their Terraform resources and deployments in a nice, controlled way. They don't have to worry about their state files, as we mentioned. That was meeting the needs of identity and security to some extent.
» Central Private Module Registry
Then, layered below that, we had this central PMR. As Jeremy alluded to, this is a great place to store blueprints for teams to move forward with. Not only was it the modules, but also the providers, the Google one being the main provider we were using to deploy GCP resources. The tight coupling between HashiCorp and Google was helpful because every time we went looking for an API, there was a Terraform resource for it. It's a bad place to be if you discover there's an API but no Terraform resource available for it just yet.
» VCS Tracking
There's also the concept of VCS tracking. I don't know how familiar you are with this, but basically, you've got your Git repo, and there's a nice, easy, and elegant way to link it to automation in a Terraform workspace. Every time you push a commit to your Git repo, you trigger a run within TFC. That was great to see, and that answered the question of automation.
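For reference, that linkage can itself be expressed in Terraform via the tfe provider. The sketch below is illustrative only, with a placeholder organization name, repository identifier, and OAuth token variable.

```hcl
terraform {
  required_providers {
    tfe = {
      source = "hashicorp/tfe"
    }
  }
}

variable "vcs_oauth_token_id" {
  description = "OAuth token ID that links TFE/TFC to the Git provider."
  type        = string
}

# VCS-backed workspace: every push to the linked repository queues a run.
# Organization and repository identifier below are placeholders.
resource "tfe_workspace" "app_team_alpha" {
  name         = "app-team-alpha-dev"
  organization = "example-bank"

  vcs_repo {
    identifier     = "example-bank/app-team-alpha-infra"
    branch         = "main"
    oauth_token_id = var.vcs_oauth_token_id
  }
}
```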
» Our Infrastructure Landing Zone
Going back to some of the core principles, what was left? Well, that was security and guardrails. And that was covered off by Sentinel, which was a great feature to have and allowed us to build out a lot of guardrails very early on. So, we started with the SaaS offering, and then we moved forward from there.
Jeremy:
All of those principles — going back to automation, blueprints, security, identity, and access — the one area, and I'll be clear on this, that we needed to improve was the networking side. I think the way we had to solve that, and in fact the way we solved some of these other challenges, was to use the same principles: infrastructure as code, etc.
This is the one area where we were unable to provide full autonomy to the app teams. We had to work with what we'd already devised, the fundamentals we'd built, and we came up with the concept of an infrastructure landing zone: one that would exist across three GCP organizations and promote across those organizations, but allow the network team to use those same principles — the same policies, guardrails, etc. — to create the networking resources.
We went with a shared VPC model, which, being honest, I think created some difficulties. But we were using one of the newer cloud providers, and there were some limitations in GCP. Nevertheless, this allowed us to move forward and solve the networking question — albeit using this centralized infrastructure landing zone approach. The network teams were then able to create those shared VPCs and get them interconnected back to our datacenters.
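In GCP terms, the centrally managed shared VPC pattern looks roughly like the sketch below: a host project owns the network and subnets, and an application project attaches to it as a service project. The project IDs, names, region, and CIDR range are placeholders, not DB's actual network layout.

```hcl
# The network team's host project owns the shared VPC (IDs are placeholders).
resource "google_compute_shared_vpc_host_project" "host" {
  project = "network-host-prod"
}

resource "google_compute_network" "shared" {
  project                 = google_compute_shared_vpc_host_project.host.project
  name                    = "shared-vpc-prod"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "app_subnet" {
  project       = google_compute_shared_vpc_host_project.host.project
  name          = "app-team-alpha-subnet"
  region        = "europe-west3"
  network       = google_compute_network.shared.id
  ip_cidr_range = "10.10.0.0/24"
}

# An application team's project consumes the network as a service project.
resource "google_compute_shared_vpc_service_project" "app" {
  host_project    = google_compute_shared_vpc_host_project.host.project
  service_project = "app-team-alpha-dev" # placeholder application project
}
```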
» Terraform Enterprise Adoption Timeline
We were doing all of this right at the back end of Q4 2020. By that point, we had already established all of those fundamentals. We had actual resources being created in our production GCP organization within six months of starting this journey. I think at that point it was quite clear that we'd made the decision to use Terraform Enterprise.
Thomas:
And as this slide illustrates with the dates, the timings were quite rapid. We had started on this journey at the beginning of 2020 — by the tail end, we were doing an MVP with TFC. Then, shortly after, we made the decision, like many other large organizations, that perhaps the SaaS offering wasn't best suited to our needs.
So, we made the decision to bring TFE internal — if you will — and started on the journey to build our own. Even within the platform team, we didn't have a great deal of experience with GCP or Terraform to start with; we didn't have those skill sets. So, as we were building out TFE as the front door for everyone to come and join us on the journey to Terraform, we were exploring that ourselves. It was a good challenge to even deploy TFE, which involves various different resources, and to get a flavor of consuming these products ourselves.
The engineering for TFE began towards the tail end of 2020. We had it up, and then the following year we migrated the workspaces we did have off TFC. From there, various iterations of TFE have come out, one being Active/Active. We're no longer dependent on a single VM, and you've got some high availability for what is a critical platform for us. And the end state today — we've obviously got multiple production environments that teams can deploy to.
» The Growth of Terraform Enterprise
Following on from that, what does it look like now — what's the state?
We're not quite at the end; we're still very much getting going. But this is to reflect on how far we've come in a short period. Looking on the right, you've got the count of runs within TFE. Runs — for those who aren't familiar — are a measure of how much we're sweating the workspaces. It's great having lots of areas carved out for teams, but how well are they actually using them? How many runs are they doing a day? That was a metric we thought was useful.
We've hit over 250,000 Terraform runs now. Whether that be a plan or an apply, the vast majority are plans because teams are iterating, reviewing the plan — is this what they want — and then obviously moving forward to do the apply. Coupled with that, there are 350 Sentinel policies. Every time those plans are running, there are 350 guardrails in place to check that teams are deploying things in a manner we see fit.
We started, obviously, at the beginning with this MVP, and in dev the Sentinel policies were in this advisory mode, as they call it. As we progressed, these gradually hardened to become hard-mandatory in production, as you might expect. We give teams a bit of a flavor of what it's going to be like when they promote their code up to prod — so they can adjust accordingly.
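As a hedged illustration of that progression, the enforcement level is declared per policy in a policy set's sentinel.hcl; the policy names and source files below are invented for the example.

```hcl
# sentinel.hcl: the same guardrails start as "advisory" in dev and are
# promoted to "hard-mandatory" for production. Policy names are illustrative.
policy "enforce-mandatory-labels" {
  source            = "./enforce-mandatory-labels.sentinel"
  enforcement_level = "advisory"       # dev: warn only, let teams adjust
}

policy "restrict-modules-to-pmr" {
  source            = "./restrict-modules-to-pmr.sentinel"
  enforcement_level = "hard-mandatory" # prod: a failing check blocks the run
}
```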
Hopefully, all this growth is interesting from a developer perspective — it certainly was within the organization — because it allowed teams to be productive. They didn't have to worry about how they could deploy the infrastructure. Yes, they had to get to grips with Terraform, but we wanted to abstract away as much as possible while, at the same time, allowing them to use GCP to its full potential.
» Summing Up
Jeremy:
As we come to the end of this, let's revisit a little of what we did. The message that I'd like you to take away is just how much Terraform Enterprise enabled and accelerated our journey.
We went from the formation of a team to conducting a pilot in summer 2020. Within six months, we had created our first infrastructure landing zone with minimum guardrails in place, with a module factory and a policy factory. Three months after that, we had the door, we opened the gate, and the hordes we'd been keeping back came through, all the way through 2021, as we started to onboard more and more app teams onto the landing zone.
Very key to this is the Sentinel aspect — giving our security organization the confidence that we can unleash the cloud and get that acceleration in place, because we can start with a minimum set of guardrails. We can add to those guardrails as we go through dev, we can add more services frequently, we can federate, and we can have application teams open source — or rather inner-source — that within the company to help with the module authoring and the policy authoring. We can then gradually tighten the screw as we move towards production, as we did throughout 2021.
That gave our security team confidence that they ultimately have control — they can turn off or enable, at their discretion, the hard enforcement coming into production on all of these policies — with the app teams having to adjust to that, but understanding the reasons why and being happy to take it on board themselves.
We got into production at the back end of 2021 and went into 2022 with another wave of onboarding and more landing zones being onboarded. There are 200+ landing zones now operational: 200 app teams landing safely in GCP. The steep approach rescued the last known survivor of the db.com organization, unleashed the cloud for DB, and avoided a lot of the common obstacles and pitfalls faced by many large organizations when they typically do their v1.
That's pretty much it from us. I hope you enjoyed the talk and got something out of it. We're able to take questions one on one — Tom and I will be around for the duration, and you can come up to us. We'll be happy to answer any questions you might have. We look forward to meeting you.
Thank you very much.