Case Study

How Deutsche Bank adopted and standardized on HashiCorp Consul

Learn why Deutsche Bank uses HashiCorp Consul instead of Anthos for its hybrid cloud, microservices architecture, an architecture that is critical to the bank's ability to quickly deploy new applications that handle constantly shifting regulations.

»Transcript

I'm Oliver Buckley-Salmon. I work at Deutsche Bank, and I'm here today to talk about our adoption of Consul. At the moment, we're still at a departmental level, but we're looking at how we roll it out bank-wide.

»What are we going to talk about? 

Firstly, our problem statement: we're going to talk about some of the issues we had as a bank, the challenges with our legacy estate, and how we are moving to a more modern architecture. And, as we adopt public cloud, how we face some of the challenges imposed by regulation and how we move to a hybrid solution as well.

We're going to talk about why we use Consul for our cloud. We have GCP (Google Cloud); we're unicloud at the moment. And the obvious question might be, why don't you just use Anthos? But as we'll talk about, there are lots of things Consul can do which Anthos doesn't, which really means Consul was the obvious solution for us. There's a bit about our partnership with HashiCorp, because we work with HashiCorp beyond Consul; we also use other HashiCorp products.

A bit about our current state as a department, what our future state will be as a department and potentially future plans for the bank-wide stuff. And then some of the challenges we faced—not necessarily just with HashiCorp, but with re-architecting our systems. And then, obviously, a quick summary at the end.

This is me. I'm sorry, that's a bad photo, but it was taken from my laptop. Not very photogenic. But as you can see, I work in an investment bank, so I have many job titles. I think these days I'm called the core CTO of Risk, Finance and Treasury, because many people are called CTOs. I'm also a director and a distinguished engineer in the bank. I've got other job titles too, but I won't bore you with them here. If you want to contact me on social media, I'm on Twitter and LinkedIn. I don't post anything, though. I might look occasionally if you want to send me something.

»What's our problem statement? 

»Monolithic systems

As a bank, we have a lot of legacy systems. Big, monolithic systems. Over the years, they've been built up and patched about a million times. We've run 1,000 projects, which we've just crammed into a system rather than re-architecting. Those systems are large and fragile. 

Pre-GFC, we were making money hand over fist, and those things were brushed to the side because we were making so much money. Who cares? But obviously, post-GFC, we are desperately not making money—well, we are making some—but we are very focused on cost control, which is not something we had before.

»Change volume increase

Those monolithic systems were slow to change. They were complex partly because of the spaghettification of the code over many years of patching. Testing is obviously a big part for the bank because we can't be down when the market's operating. We need to make sure our systems are up and functioning correctly. So we have extensive testing. And certainly with our legacy systems, that was extraordinarily expensive and also high risk. 

When we put those changes live, we mostly did big bang releases. We'd have many downstream systems, which all had to change on exactly the same day, exactly the same time—because we changed our interfaces. Well, I say interfaces. Probably, many of those were file transfers. But those file formats might have changed. 

»Risk exposure and regulatory compliance

So there were a lot of challenges, and those risks are enormous for the bank. There are financial risks: if we can't trade, then obviously we're exposed against the market. But it's not just that the trading systems might not work. If our risk systems don't work, then we are legally not allowed to trade. If we can't settle trades, then we're going to annoy our clients and our counterparties. There's a huge number of financial risks we can incur.

On top of that, we are very heavily regulated. We're a German bank, so our primary regulator is BaFin. But we operate globally. So, we have the ECB monitoring us as well. We've got the Fed, the CFTC, and the SEC in the US. Globally, we've got regulators like MAS in Singapore, which, as anyone here who works in banking will know, is a particularly exciting regulator to work with.

Those regulators are more than happy to fine us for any kind of issues that we have. It's not just a case of something going wrong in London and being fined by regulators in London. That could potentially incur fines from the US, Germany, and all sorts of places. So, it is critical for us to have reliable systems that we can change without damaging ourselves.

Finally, reputational risk. Banking is a reputation-based business. So, if we don't have a good reputation, if we are not supplying our clients with the services they want—and I don't mean necessarily retail clients. If you bank in the UK—I'm sorry, I don't want to disparage any banks in the UK who deal with retail customers. But they normally have to treat you quite badly before you change your bank account. Whereas in the global markets, the hedge funds, asset managers, etc., who are our clients are more than happy to change at the drop of a hat. 

With all those risks that come with trying to change our systems, since the financial crisis we've had a continuous stream of regulatory change. Some of the ones you might have heard of are there: MiFID II, EMIR, Basel III. Obviously, Brexit was a special gift for us, and FRTB, the Fundamental Review of the Trading Book.

But there is a long, long list of regulations that we've had to implement since the financial crisis and are obviously still ongoing. One challenge with regulatory projects is the regulators don’t really care how much it will cost you or how hard it is to implement. They come up with a timeline and say everyone must be compliant at this time.

That means we have a huge number of projects in the bank which are just regulatory change. All that regulatory change tends to swamp out business change.

»How do we do business change as well as regulatory change? 

When we've got our rigid monolithic systems, that's very difficult to achieve. We need to think about how we can re-architect our systems so they're more flexible and can isolate changes to smaller components.

How can we make sure we can run multiple versions of our applications, or our services, to allow people to migrate on and off as quickly, or as slowly, as they need without forcing changes in a big bang manner? Well, obviously we've looked at moving to microservice architectures.

And another driver for microservices was our public cloud adoption. We're using GCP. Why do we want to use it? Well, we want faster time to market. Elasticity is a big thing for us. Lots of banking applications are variable in their nature during the day. There's still a lot of batch operation. So, we need systems which can grow and shrink their footprint to save costs on the cloud.

We need to be able to scale out horizontally and shrink back in again. An example of that is where I work, in Risk, Finance and Treasury: One of our finance systems does the P&L calculations and accounting. They're effectively doing a big end-of-day run to calculate all the P&Ls across the bank.

They have a huge number of services that need to burst out: from the two or three instances running at one time, scaling up to maybe 100 instances of that service, and then back in again. That burst out, scale back in is the kind of thing we need to ensure we can facilitate. Obviously, that wasn't possible with our legacy systems.

»On-demand infrastructure 

That's been a game-changer for us as well. Traditionally, it probably took us more than six months to get a server from purchase into the datacenter. By the time we've been through all the necessary engineering, getting it into the datacenter, installed, and everything else, the lead time is huge. So, the fact that we can spin up new infrastructure on demand is really important for us.

But we do have challenges with the cloud, and they've all come out of the woodwork as we've tried to move on to GCP. There are things people might be familiar with, like the data residency rules: Indian payments data has to stay in India. And we've got a growing list of countries, starting with China, Saudi Arabia, and the UAE, where no public cloud usage is allowed.

Swiss data has to stay in Switzerland. Data from Luxembourg has to be in Luxembourg. So, there are a lot of challenges for us around, A, moving stuff to the cloud, and B, how we're going to manage our hybrid estate: the stuff we're going to keep on-prem. That hybrid estate isn't only driven by regulators banning public cloud usage. It's also just the complexity of regional cloud services. If we need databases or other applications spread across multiple regions on GCP, that is, one, expensive and, two, complicated. So, sometimes it's cheaper to keep it on-prem.

»Why did we choose HashiCorp Consul? 

»Required hybrid solutions

As a department, we realized this quite early because, unlike lots of other parts of the bank, we're back office-y and we've got lots of global systems. We've got one big system that calculates market risk for the entire bank and one that calculates credit risk for the entire bank. Same for the finance stuff as well. When I say one big system, I mean we've got lots of systems, but effectively one big application.

As those systems are global, we're impacted by all these regulations about data residency and public cloud usage. Because we spent years putting all of this into global applications, we don't necessarily want to pull them all to pieces and deploy them in lots of different places. But we do need hybrid, so we will have part of our applications on-prem and part in the cloud.

»Regulators mandate cloud exit plans

The regulators, well the ECB in particular, say we have to have a cloud exit strategy for any cloud we use. Because we are unicloud, which is a bank policy, our only option is to bring that back on-prem. That makes cloud adoption difficult for us: if we go too heavily into "we're just going to use everything on GCP, everything's all wonderful," then when we want to exit GCP, how will we provide an equivalent infrastructure on-prem? It is much easier if we have something that can run both on-prem and on GCP.

»We need a global solution but have regional regulatory requirements 

I've talked already about China and Saudi Arabia, and, like I say, the UAE has been added since I wrote this slide. For the banking industry, there is actually a rule about how much business we can do in China before we have to start locating systems in China.

At the moment, we're under that level. But the bank strategy is to get beyond that level, because there's a lot of business going on in China that we're interested in. To do that, we'd have to deploy stuff into China, and Google has a policy that no Google products can be used in China.

»Legacy considerations

Using Anthos would complicate it because we'd have mixed service meshes all over the place. We also need to be cognizant that—while we are looking at re-architecting into microservices and making everything super shiny, all wonderful and modern—the reality is we have a large legacy estate. We have SaaS and vendor applications. Those things might be exposing APIs, but they are not running in a service mesh. We've got applications in finance in SAP. SAP is exposing services through NetWeaver. Those aren't running in a service mesh. They're all part of the SAP infrastructure. 

How are those services being discovered? How are those services interacting with each other? How are they discovering services in the mesh? How are services in the mesh discovering them? How are they discovering each other when they're all outside the mesh? It's Consul's support for an integrated service registry that has been critically important for us in that regard.

»Zero trust security and mTLS encryption 

Obviously, in public cloud, one big thing for the bank has been about zero trust security. The difference for us between being on-prem and being on the cloud is we don't own that network on the cloud. Someone else controls it. So, we want to make sure we secure everything through mTLS. mTLS, if we were doing that without a service mesh, would be a painful operation. We'd have to install and rotate certificates everywhere, which would be a nightmare. 

But that's one of the great things about running Consul: It's managing all of that for us. Our services are unaware that they're talking through mTLS. That's all handled through the sidecars. It means that all the code a developer would normally have to write to implement that certificate management and security in their service is taken away. They can just focus on writing business logic.
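
To make that concrete, here is a minimal sketch of a VM-style Consul service definition with a Connect sidecar; on Kubernetes the equivalent is typically done with sidecar injection annotations, and the service and upstream names here are hypothetical, not ones we actually run:

```hcl
# Hypothetical service definition. The application itself only speaks plain
# HTTP on localhost; the sidecar proxy that Consul manages handles the mTLS.
service {
  name = "pnl-calculator"   # hypothetical service name
  port = 8080

  connect {
    sidecar_service {
      proxy {
        upstreams = [
          {
            # The app calls http://localhost:9191; the sidecar encrypts the
            # traffic with mTLS and routes it to a healthy "ledger-api" instance.
            destination_name = "ledger-api"   # hypothetical upstream
            local_bind_port  = 9191
          }
        ]
      }
    }
  }
}
```

A definition like that is registered with `consul services register`, and certificate issuance and rotation happen behind the scenes, which is exactly the work we don't want developers doing by hand.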

We need on-prem and on-cloud deployment, and Consul's great for that. On-prem, we run OpenShift. So we need a service mesh that can run on OpenShift, but also run on GKE in the cloud. Then we can have a consistent platform across both on-prem and GCP.

I know Istio used to support a plugin for Consul as a service registry—and even Eureka at one time—but a few years ago, they decided to take that out. Obviously, they were feeling the threat from Consul as a service mesh, so they removed that ability. That ruled them out of managing our service infrastructure.

»External service management

One thing we like about Consul is the registry. It's not just that it's telling us where services are. It's also health checking those services for us. Whether those services are in the mesh or outside, Consul can health check them, and external service management effectively allows Consul to do that. The idea there is that the registry will never supply the URI of a service when that service isn't running.
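
As a rough illustration of that, an out-of-mesh endpoint can be registered with a health check so the registry only hands out its address while the check passes. The name, host, and health endpoint below are hypothetical, and in practice a truly agentless external service would go through the catalog API and be monitored by Consul ESM; this is just a minimal sketch of the idea:

```hcl
# Hypothetical registration for a service that lives outside the mesh,
# for example a vendor or SAP-exposed API. Consul health checks it and
# stops returning it from the registry while the check is failing.
service {
  name    = "sap-finance-gateway"          # hypothetical name
  address = "sapgw.corp.example"           # hypothetical host
  port    = 8443

  check {
    name     = "sap-gateway-https"
    http     = "https://sapgw.corp.example:8443/health"   # hypothetical endpoint
    interval = "30s"
    timeout  = "5s"
  }
}
```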

»We want to be somewhat cloud-independent 

I will just say this is not necessarily bank policy. This is more a departmental policy: we want to be cloud independent if we can for critical bits of infrastructure. The bank is still very much full steam ahead on GCP.

»How have we developed our partnership with HashiCorp? 

»Consul

Well, I was talking to Caroline from HashiCorp earlier, and it must have been about five years ago when we started with Consul, just Consul as a service registry on OpenShift. That was probably the start of our microservices journey. Maybe it was a bit longer than that, maybe about seven years. Getting OpenShift was the first time we had containerization, and we could start thinking about re-architecting into microservices.

Of course, when you build lots of microservices, you start getting to the problem of managing those microservices. How do I discover them? How do they discover each other? Consul as a service registry was a critical part of how we put that forward. 

I should say, we did start with Eureka. We started with the Netflix stack, but we quickly hit the limitations of that and moved on to Consul as our service registry instead. But one nice thing about Consul is that it is widely supported by open source frameworks. So, when we were using things like Spring Cloud Gateway as our API gateway or Spotify's Backstage as our design-time store, those were simple to integrate directly with Consul.

So, rather than manually configuring those, we used Consul metadata to drive which services are published on our API gateway. It was entirely dynamic, just through Consul metadata. The same goes for the services published into our API store, which was Backstage.
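
As a sketch of that idea, a service can carry whatever metadata the platform tooling needs in its registration, and the gateway and the Backstage sync then read those values from the registry rather than from hand-maintained configuration. The keys and names below are hypothetical, since the talk doesn't spell out our actual schema:

```hcl
# Hypothetical registration showing service metadata that downstream tooling
# (an API gateway, a Backstage catalog sync) could read from Consul.
service {
  name = "trade-capture-api"   # hypothetical service name
  port = 8080

  meta = {
    "expose-via-gateway" = "true"           # publish this service on the gateway
    "api-spec-path"      = "/v3/api-docs"   # where the OpenAPI document is served
    "owning-team"        = "risk-finance-treasury"
  }

  tags = ["rest", "internal"]
}
```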

We are now running Consul as a service mesh on GCP, on GKE. Unfortunately for us, the people who ran OpenShift on-prem were very reluctant to install the service mesh on OpenShift, mostly because they were worried it would damage the cluster and then it would all be their fault. They were very nervous about a service mesh installed on their OpenShift cluster causing problems.

That was a very long-running argument with them. It’s still ongoing, really. I'll talk a bit about it in a minute. But when we got to GCP, it meant we had GKE. We could deploy Consul as a service mesh on there. That was our start with Consul.

»Terraform and Packer

As the bank moved on to GCP, we got more seriously into things like Terraform and Packer. So, infrastructure as code, absolutely. When the development teams first knew we were going to GCP, everyone was super excited. It was an escape from the tyranny of the infrastructure teams: we don't need to speak to them anymore, we can do it all ourselves with the press of a button. But of course, that is not a controlled environment for a highly regulated organization.

We need repeatable infrastructure. We provide Terraform modules for standard bits of infrastructure across the bank, and people use those modules to install stuff. We also run Sentinel policies against our Terraform. That's all integrated into our GitHub Actions pipelines as well. As for Packer, if we're using infrastructure as code, then why not VM images as code as well? Again, we can integrate that with Terraform.
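
As a hedged sketch of that pattern, a team would consume a pre-approved module rather than hand-writing raw resources, and Sentinel policies then run against the resulting plan in the pipeline. The module source and inputs here are made up for illustration, not the bank's real modules:

```hcl
# Hypothetical use of an internally published Terraform module.
module "gke_cluster" {
  source = "git::https://git.example.internal/platform/terraform-google-gke.git?ref=v1.4.0"

  project_id   = var.project_id
  region       = "europe-west3"
  cluster_name = "consul-mesh-demo"
  node_count   = 3
}
```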

One nice side effect of using Terraform is we are actually documenting our entire architecture across the bank. If you've worked in a large organization like ours, you'll know we have solution design documents written for one project. Five years later, they're completely out of date. No one updates them. 

But if we are using Terraform, then we are self-documenting exactly what our architecture is at any time. And even though I've forgotten the name of it, there is a nice diagramming solution you can plug into Terraform, which will generate a dependency graph across all your Terraform stuff.

»What's our current state? 

»On-Premises 

We have Consul running as a service registry on OpenShift. We are currently in the process of migrating from OCP3 to OCP4. Yes, that is painful and has certainly been very painful for our dev teams. But it's just running as a service registry. And I should say that as a service registry, it's been pretty much bulletproof. I don't think it has ever been down, so that's been good for us. I am not saying OpenShift hasn't had some problems, but Consul itself hasn't had any problems while we've been running it.

We've got about 400 services registered on-prem. That might not sound like very many, but it's just at a departmental level, and we still have a huge legacy estate. So, this is more about the journey we're going on to get more services registered as we start re-engineering our systems. 

One challenge with re-engineering into microservices is how do we get that funded? If we just say to the business it's a new architectural style. It's going to be great. Let's do it. Then they'll say where's the business benefit? And the business benefits can be difficult to quantify.

Anyway, it's a slow journey. But some applications are friendlier towards moving to microservices than others. It's integrated with the rest of our service infrastructure, so integrated with Spring Cloud Gateway and Backstage as well. The fact that we could integrate that nice and easily through microservice frameworks and things like Spring Cloud has meant we've been able to swap out components without changing our data source. The Consul service registry is effectively our source of truth about what services we have and where they are.

When we started off, we were using the Netflix OSS stack. We had Zuul as our gateway, we were also using Zuul as a primitive API store, and we had Hystrix for circuit breaking. Over the years, we've changed all that, partly because the Netflix stack gave way to Spring Cloud, so we moved our stuff onto Spring Cloud. But the fact that our source of truth stayed the same meant we could effectively be pulling the same data at all times.

»On Google Cloud

We've currently got 12 applications using Consul service mesh. Every application is running its own cluster or clusters. So, we've had to install Consul service mesh on every single cluster. But while that might've been very Anthos-like to start with—because it would've been isolated—we've taken advantage of Consul's peering to peer those clusters together to effectively form a single logical mesh across all of those clusters. 

I don't know if people are familiar with Consul's peering. It's effectively connecting two clusters so services can call each other. Peering itself doesn't necessarily allow any service to call any service because Consul enforces service-to-service connectivity through intentions. So, we need intentions to effectively allow the services to call each other. 

It has a thing called service exports as well: service export and import. So if I want someone on another cluster to be able to call my services, I can export my services to their cluster, set up an intention for them, and then they can call them. Obviously, the reverse is true. I can import services that are exported from another cluster and then also set up intentions for those.
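
To make the mechanics concrete, here's a minimal sketch of the two config entries involved, written on the cluster that owns the service; the service, caller, and peer names are hypothetical:

```hcl
# exported-services.hcl (hypothetical names): make "market-risk-api" visible
# to the cluster we've peered with.
Kind = "exported-services"
Name = "default"
Services = [
  {
    Name = "market-risk-api"
    Consumers = [
      { Peer = "finance-gke-cluster" }
    ]
  }
]
```

```hcl
# intentions.hcl: then explicitly allow the peered caller. Without this
# intention, the exported service still can't be reached.
Kind = "service-intentions"
Name = "market-risk-api"
Sources = [
  {
    Name   = "pnl-calculator"
    Peer   = "finance-gke-cluster"
    Action = "allow"
  }
]
```

Each entry would then be applied with `consul config write`, or as the equivalent Kubernetes custom resources, since these clusters run on GKE.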

The cluster peering: we've only got five peered together so far. That might not seem very much. But one cultural thing we've battled against with installing Consul is that people are very secretive about their applications. They don't want people seeing the insides of what's going on.

And obviously, Consul gives us full visibility of everyone's application, all their services, and when those services are up and down. We can see the latency of calls through their applications using the metrics UI inside Consul. People are very nervous about that kind of visibility being given to their application. But we are enforcing it. All those 12 will be done this year, so it's an ongoing process.

I should say the peering is very simple. It's just a case of setting up some firewall rules, exchanging a couple of tokens, and then, OK, job's done. We've got 14 more going in the pipeline this year. We're getting close to 30 applications on GCP. To give you an idea of scale, we've got about 4,000 developers and about 200 applications just in our department. We are trying to pull that 200 down to 14 strategic applications. There's a lot of re-engineering work that has to go on.

»How do we support Consul? 

We support Consul with a team of three. So, the guys who work for me. In fact, in reality, there's only one guy who knows Consul super deeply. That's just because he spent the last five years trying to solve problems and looking into the source code and things to try to see how it works—what we're doing wrong.

But no, the reality is it's very simple to operate. Once you've got it installed and up and running, the day-to-day operation is very low. We have had some challenges with application teams because we've told them you're using Consul. And then obviously, if they have any problems with their system at all—and they did—it's Consul's fault. 

So, we have to spend a lot of time proving to people that this isn't Consul's fault. This is the way you've set up your GKE cluster, or your firewalls aren't configured correctly, or some other thing. With experience, 99.9% of all issues raised as tickets against us have been something to do with the application or the way they've set up Kubernetes—so, not Consul—which has been really good news for us. And, on the way, we've learned a lot of lessons about how not to use Kubernetes as a dev team.

»What's our future state going to be? 

»On-premises 

We are looking to install Consul service mesh on OCP4, which is our latest version of OpenShift. We are currently running a test: as part of our ECB exit strategy testing, one of our applications needed to test whether they could come back on-prem. As part of that, we installed Consul service mesh onto OpenShift 4.

That test is going on at the moment, but it's been successful. I think it has shown the OpenShift team, the team that manages those Kubernetes clusters, that Consul will not destroy their cluster. So, we're building confidence that this will be a viable solution. One nice thing about Consul and OpenShift is you can actually choose which OpenShift projects, or namespaces, are included inside the Consul service mesh.

That can be done through labels on the projects themselves. We are trying to convince the people who run OpenShift for us that if we deploy it for our department, it won't affect anyone else using the shared cluster unless they opt to come into it.

»On GCP

We want to have our 27 apps using Consul service mesh. All peered together, all calling services cross-cluster with our imports, exports and intentions. 

Then there's the stuff that might happen, and I'll talk about that. There's still a lot of work to go, partly to move to Consul Enterprise. We want to do that just within our department. Everything we've done so far has been done with Consul open source. We are now at the point where we've proved Consul works and does what we want, so now is the point of getting the budget to buy Consul Enterprise licenses.

Unfortunately, the way the bank works is that budgeting is done on a yearly basis. We are waiting for our next cycle before we can put in the request for it. But that's something we're planning. Really, that's about getting support. There are also lots of features in Consul Enterprise that we'd like: the automated backups and upgrades, and the fact that we can use OIDC for user access into the mesh. Those kinds of things are important for us.

I personally am also working with the bank-wide CTA team. They're looking at building a global multi-tenant service mesh for the entire bank. As people have talked about before, one problem with going to the cloud is suddenly your dev teams are in charge of all their own infrastructure. And quite frankly, they don't like it and don't understand it.

That's why we get these misconfigured Kubernetes clusters. I mean, the service mesh is a non-trivial system. So, you have to understand how to use it properly. This isn't just "how do I install it"; that's Day 0. This is effectively Day 1 to Day N, when I'm running my service mesh. How do I manage deploying my applications? How do I manage my application lifecycle within the mesh? How do I enable my connectivity and enforce my security? All that kind of stuff. It's a lot for dev teams to manage.

So, the common service mesh project is about building a global service mesh with Consul Enterprise. That will be effectively a completely bank-wide one. Well, it won't be one Kubernetes cluster, but it'll be one Consul cluster across many Kubernetes clusters across the globe. That's the aim there, and that will be with tens of thousands of services running globally. 

I should say, there's still a little battle to go there. We're still at the RFI, RFP stage, but Consul is the leading one. Well, according to the requirements we've written, Consul is really the only one that can satisfy those. The rest of them are variations on Istio.

»What challenges did we face? 

»Winning the argument for doing microservices, while still on-prem. 

Lots of people were asking, what's the point of this? It's a waste of time. Our application works fine. What business benefits are we getting?

Those arguments took a long time to win, and the pushback even came from dev teams. No one likes being told, no, we're going to re-architect your entire application; what you've done before is wrong, and we're going to do it in an entirely new way, and it'll all be wonderful. That doesn't really please people.

»Getting buy-in on Consul from all the other teams 

Even though I am the CTO of our department, Deutsche Bank is a consensus-led organization. I can't just dictate to people and say this is what everyone has to use. I have to convince them.  There were a lot of questions on why we weren't using Anthos. That kind of campaign had to be run.   

»Metadata standards in the mesh

It's all very well deploying a service in the mesh, but that doesn't really tell us much about it. It tells us where it is and how many instances we've got. But to understand it, we need to be able to register metadata against those services. We had standards for that, but you can write a document telling people how to do it and that doesn't mean they're going to do it. So, enforcing that is something we've had to drive.

One reason we needed to drive that was that we have KPIs, for better or worse, on the number of services we have in the mesh and which services are calling each other. And the system which measures that KPI has to pull that data from Consul. So, unless we can put that in the metadata, we can't do it.

»Each team has their own cluster

It has been a challenge with every team having their own cluster, because we have to enforce upgrades, which people don't necessarily like. We have to support all those teams with their own clusters, and, it's a normal developer thing, they think they know better than you. So, they want to do something slightly different from what you've told them because they think it's a better way. And then they're moaning that it's broken. So that was a challenge as well.

I've talked about the knowledge needed to use a service mesh, and it is complicated. People don't necessarily know how to run their own, or don't really get the best out of it.

Arguing for a service mesh with an external registry was another one. That was initially dismissed as not a service mesh concern, but it is a service concern. It doesn't matter whether your services are in or out of the mesh; in reality, not every service lives in the mesh, and not every service is micro. So, we need to be able to support everything.

And obviously, there's the argument going on at the moment around the global multi-tenant mesh I've talked about. That's something which is still being debated in the bank.

»Summary 

Super quickly because I see I've run over time. We started on Netflix, went to Consul's registry, heading for a departmental multi-cluster mesh, potentially going for a bank-wide multi-cluster mesh. Lots of the challenges we faced: The argument just for doing microservices, battling against Istio and Anthos—and just supporting an open-source product. 

Okay, thank you very much.
