Terraform and regulated financial services at the German stock exchange
See how Deutsche Börse Group — also known as the German Stock Exchange — uses HashiCorp Terraform, Consul, and Packer for financial services applications.
In this session, Aymon Delbridge talks about how his team at Deutsche Börse Group built the foundational cloud infrastructure in AWS, Azure, and GCP with an extremely small team of eight people by relying heavily on infrastructure as code and automation. Aymon also sheds light on what's next: moving to Terraform Enterprise to overcome some of their current limitations.
Transcript
Hello and welcome to HashiConf Digital 2020 and the session — Terraform in Regulated Financial Services. My name is Aymon Delbridge. I'm the head of Cloud Platform Engineering at Deutsche Börse Group — also known as the German Stock Exchange.
My team is responsible for planning, building, securing, and operating the foundational public cloud infrastructure for the group on AWS, Azure, and Google Cloud Platform. We do this in a regulatory-compliant way with an extremely small engineering team — currently eight people — relying heavily on infrastructure as code and automation.
Across the three clouds, my team is managing a bit more than 40,000 foundational infrastructure resources. Our product groups then deploy and operate their applications, services, and workloads on top of this foundational infrastructure. During this session, I'm extremely happy to share with you some details about how we're using Terraform and Packer in regulated financial services.
But first, I would like to give you a short overview of the Deutsche Börse Group, an introduction to the regulatory landscape, and a bit about how we define controls to meet ISO standards. We'll also take a look at the high-level architecture of our three CSPs, how we structure our Terraform code base, how we're using Packer — and finally, where we're going in the near future.
Deutsche Börse Group: An Overview
We're a so-called market infrastructure provider. As such, technology is really in our DNA. Our first fully automated trading system was launched in the 1990s. We operate extremely low-latency networks and the latest technology for high-speed trading. Together with our co-location partners, we operate one of the largest European financial datacenters. All of our core business applications are self-developed and self-maintained. And we accomplish this with about 2,000 people in IT, providing services across the entire trading value chain.
Our core is trading and clearing applications, which we use for securities trading, derivatives, currencies, and commodities like energy. Deutsche Börse Group also offers pre-trade and market intelligence services. We calculate real-time indexes — for example, DAX and STOXX — and operate ultra-low-latency data feeds for our customers.
Our post-trading area covers settlement and custody. Then we have value-add services like investment fund services and collateral management. As the German Stock Exchange, we're a core element of the European and global capital markets. As such, we're subject to the highest level of regulatory supervision — not only by the German regulator but also by regulators in Luxembourg, Switzerland, and the UK.
This is important because our IT is a product-based organization that aligns with this trading value chain. And this organizational structure drives many of our decisions when architecting our foundational cloud infrastructure. Terraform and Packer are key components that help us standardize this foundational infrastructure and, additionally, many of the applications and services running on top.
We're using Terraform, Packer, and Consul — mostly with public cloud. Like many financial services institutions, Deutsche Börse Group uses public cloud as one of the main technologies fueling the foundations for other technologies, such as blockchain-based services, cognitive technologies, and further big data, automation, and collaboration services.
Shared Responsibility Model
The shared responsibility model is very important to us. This well-known model divides security responsibilities into the security of the cloud — taken care of by the cloud service provider — and security in the cloud — taken care of by the customer. Of course, the certifications, badges, and reports that the CSPs can show — indicating their compliance with global standards — are important. They're the starting point of the conversations we have with regulators like Germany's BaFin or Luxembourg's CSSF.
Since cloud is a form of outsourcing, many of these controls are non-technical. Regulators want to know about the contractual conditions — especially in relation to material workloads — risk management, how we implement controls, the processes surrounding controls, and how we can demonstrate that the controls are implemented.
A supervised company should develop and document a process covering all steps of relevance for outsourcing to the cloud service provider — from the strategy and the migration to the cloud right through to the exit strategy. There are several processes, both technical and non-technical — like audit reports and risk assessments — that we can go through to ensure the security of the cloud. And for security in the cloud, we can start building in the technical controls for networks, server-specific controls, and cross-service controls.
Security in the Cloud
Terraform and Packer help us with security in the cloud. They give us a consistent, repeatable, idempotent, and auditable way to meet the controls required for security in the cloud — for example, being able to create modules that enable secure access, define roles and role assignments, ensure data protection with encryption services, enable logging and monitoring services by default, and maintain and apply security policy. Terraform helps us model and modularize these technical controls. Packer helps us generate and publish secure and compliant operating system images, which give our developers a head start in their efforts.
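As a rough illustration of this kind of module — a minimal sketch with hypothetical names and variables, not DBG's actual code — a "secure by default" Key Vault module in the azurerm provider might enforce purge protection and wire audit logs to a central workspace for every instance:

```hcl
# Minimal sketch: a "secure by default" Key Vault module.
# All names and the central workspace variable are hypothetical.
variable "name" { type = string }
variable "location" { type = string }
variable "resource_group_name" { type = string }
variable "tenant_id" { type = string }
variable "central_log_workspace_id" { type = string }

resource "azurerm_key_vault" "this" {
  name                = var.name
  location            = var.location
  resource_group_name = var.resource_group_name
  tenant_id           = var.tenant_id
  sku_name            = "standard"

  # Data protection: keys and secrets cannot be purged by accident.
  purge_protection_enabled = true
}

# Logging and monitoring enabled by default for every instance of the module.
resource "azurerm_monitor_diagnostic_setting" "audit" {
  name                       = "central-audit"
  target_resource_id         = azurerm_key_vault.this.id
  log_analytics_workspace_id = var.central_log_workspace_id

  log {
    category = "AuditEvent"
    enabled  = true
  }
}
```

Because consumers only instantiate the module, the encryption and logging controls travel with every deployment rather than depending on each team remembering them.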
Cloud Controls
What are the controls? We're an enterprise — we have a significant on-premise footprint. And as we're planning for our secure cloud landing zones, we've also been considering the infrastructure services, processes, and controls that we already have in place.
If we look a bit more into our specific requirements, we could look at standards like ITIL, SoGP, or COBIT, but today we'll take a high-level look at the ISO 27001 controls. We've put some time and effort into identifying the ones that are most relevant for cloud workloads. We can then map all the required ISO controls — aligning existing on-premise services and new cloud-native services, which may give us a productivity, cost, or usability improvement.
Asset Management
With asset management, we can ensure that assets are correctly identified, tagged, and classified.
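A hypothetical sketch of how this can be enforced in Terraform — mandatory tag keys are declared once and merged with whatever a product team supplies, so every asset carries an identification and classification (all values here are illustrative):

```hcl
# Sketch: mandatory tags merged with caller-supplied tags (values hypothetical).
variable "mandatory_tags" {
  type = map(string)
  default = {
    "cost-center"    = "unassigned"
    "classification" = "internal"
    "owner"          = "unassigned"
  }
}

variable "extra_tags" {
  type    = map(string)
  default = {}
}

locals {
  # Callers may refine the mandatory keys but can never drop them entirely,
  # since the mandatory map always provides a value for each key.
  tags = merge(var.mandatory_tags, var.extra_tags)
}

resource "azurerm_resource_group" "product" {
  name     = "rg-product1-dev"
  location = "westeurope"
  tags     = local.tags
}
```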
Access Control
Access control ensures that we maintain centralized control over authentication and authorization in the cloud environments — and that secrets are properly managed.
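For illustration, a minimal role assignment of the kind such a control might translate into — the group object ID variable is hypothetical, standing in for a group from the central directory:

```hcl
# Sketch: RBAC assignment scoped to the product landing zone, backed by a
# group from the centralized directory (the variable is hypothetical).
variable "product_group_object_id" { type = string }

data "azurerm_subscription" "current" {}

resource "azurerm_role_assignment" "product_reader" {
  scope                = data.azurerm_subscription.current.id
  role_definition_name = "Reader"
  principal_id         = var.product_group_object_id
}
```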
Cryptography
Cryptography lets us answer questions like: how do we remove any possibility of access to data by other jurisdictions? How do we ensure the security of the data and systems within the entire outsourcing chain?
Compliance
Of course, we need to ensure the integrity of our running virtual machine and container workloads.
Operations & Communications Security
And we need to ensure we have the correct controls in place for operations security. Do we have the alerting, monitoring, scanning, and detection in place? And how do we help the IT product groups meet these controls?
Information Security Incident Management
We also need to ensure that we have segregation in our networks, and that we can, by default, provide platform integration with our security information and event management (SIEM) services. We can — and we do — use the native cloud security services. But then, how can we also provide this centrally, with a single pane of glass for our security teams?
Terraform Strategy
We have our controls, we know our requirements, and we have a good understanding of the high-level architecture. Now we can start planning some of these reusable modules and resources in Terraform. This is over-simplified, but let's go through it. We follow a hub-and-spoke architecture for all of the public clouds, which is best practice in enterprise hybrid setups.
Foundational Hub Infrastructure
All of our connectivity between our data centers and the clouds uses regional and state-redundant leased lines — for example, AWS Direct Connect, Azure ExpressRoute, and Google Cloud Interconnect. We have multiple hubs — these are region- and environment-based — and the hubs hold the shared infrastructure for all the tiered spokes.
Our hub module in Terraform includes consistent, repeatable infrastructure, such as VPN termination, shared security services, central audit — or SIEM — folders, published hardened base images, and central firewalls. This core infrastructure also contains policy assignments at a root or organizational level, which then propagate through the peered accounts.
Policy exceptions can be made throughout this stack if required. A typical example: for cost reasons, we may restrict the VM sizes that users can deploy. But for special workloads that require GPU instances, we can apply exceptions at the product-group level to enable our data scientists to use GPU instances for machine learning.
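A sketch of what such a restriction plus exception could look like, using the classic azurerm_policy_assignment resource from the 2.x-era provider and Azure's built-in "Allowed virtual machine size SKUs" policy; the scope variables and SKU list are hypothetical:

```hcl
# Sketch: restrict VM sizes for cost reasons, carving out an exception for a
# GPU-enabled product group. Scope variables and SKUs are hypothetical.
variable "root_management_group_id" { type = string }
variable "gpu_product_scope_id" { type = string }

resource "azurerm_policy_assignment" "allowed_vm_sizes" {
  name  = "allowed-vm-sizes"
  scope = var.root_management_group_id

  # Built-in "Allowed virtual machine size SKUs" policy definition.
  policy_definition_id = "/providers/Microsoft.Authorization/policyDefinitions/cccc23c7-8427-4f53-ad12-b6a63eb452b3"

  # The data-science landing zone is excluded from the restriction,
  # so GPU instance types remain deployable there.
  not_scopes = [var.gpu_product_scope_id]

  parameters = jsonencode({
    listOfAllowedSKUs = {
      value = ["Standard_D2s_v3", "Standard_D4s_v3"]
    }
  })
}
```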
Foundational Spoke (Products) Infrastructure
Peered to the hubs are the spokes, which in our case are the product landing zones. This is where the product teams do the work to develop, build, and run their applications and services. Our engineers are able to deploy their virtual machines, databases, and messaging services on top of this foundational infrastructure. We want to enable our engineers as much as possible without compromising on security. We can give the engineers a bit of freedom because the product modules have several guardrails built in, including those policy assignments for the resource types, VM SKUs, regional enforcements, and default SIEM integration.
They also include the role assignments for the landing zones. Everything we do is role-based access control, backed by a centralized Active Directory. The integration services, rules for asset management, and mandatory security services are implemented in the product modules. Then there are additional services that can be toggled on.
With Terraform supporting conditional modules, our product groups can pick and choose the optional services they would like to have deployed — which are then maintained and operated by our team. These can include things like automated start/stop robots, transparent proxies for software that doesn't support an enterprise proxy, and cross-regional peering for products that require geo-redundancy.
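A minimal sketch of the toggle pattern, assuming Terraform 0.13+ (which added count on module blocks); the module path and variables are hypothetical:

```hcl
# Sketch: an optional service enabled per product via a conditional module.
variable "enable_transparent_proxy" {
  type    = bool
  default = false
}

variable "proxy_subnet_id" {
  type    = string
  default = null
}

module "transparent_proxy" {
  source = "../modules/transparent-proxy" # illustrative path
  count  = var.enable_transparent_proxy ? 1 : 0

  subnet_id = var.proxy_subnet_id
}
```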
Isolated Sandboxes
We have sandbox modules that deploy specific configurations for internet-facing accounts. And these are areas for experimentation. They have a reduced security posture — and almost all services that are provided by the CSPs are enabled here. These are extremely valuable for trying things out and having an understanding of a service before enabling it in one of the connected spokes. They're also important in that somebody with an idea can get started right away with very little risk to our on-premise infrastructure.
Creating this modular approach in Terraform to define the hubs, the spokes, and the sandboxes, allows us to highly automate the creation of new environments, ready for different legal entities, the product groups, and the different application teams throughout the company.
Sample Code Structure — Repository Root
What does it look like in code? One of my favorite things about Terraform is that there are many ways to structure your approach to the workspaces, states, modules, and code — from a simple VM with the necessary surrounding services to entire stacks with interdependencies.
We've chosen a structure that works well for us. It gives us the ability to provide a cookie-cutter approach to the landing zones, with additional flexibility for making customizations that are consistent across the stages from development to production. This ensures the developers are able to work in environments that look and behave very similarly to production.
It's also a structure that has worked for us across our collection of CSPs. In each cloud repository, we have an administration folder for the helper scripts, code snippets, and other individual data-processing scripts. We have a community folder for contributions from internal colleagues, which includes code samples and learning modules. We have the Terraform folder, which is the main code base. And we have documentation that helps developers get started in this repository.
Sample Code Structure — Terraform
Here we can see our foundational Terraform code base. The main modules are for the core infrastructure, the environment infrastructure — which is expanded here — and the sandboxes.
Sample Code Structure — Environment
If we jump into the environment configuration — in this case, development — we have in our top-level main.tf a definition of a new product called product1-dev. You can see the product module source here — I will come back to this in a minute — but this is a wrapper around what we're deploying for each product. We pass in an aliased provider configuration. What's important in this case is that the Azure service principal — the API client — has permissions only on the product environment that it is managing, and additionally has access to the shared hub services so we can perform some network peering and role assignments for the central services.
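A hypothetical reconstruction of what such an environment main.tf could look like — names, variables, and the module path are illustrative, not DBG's actual code:

```hcl
# Sketch of an environment-level main.tf: one aliased provider per product,
# scoped via its own service principal, plus a shared hub provider.
variable "tenant_id" { type = string }
variable "hub_subscription_id" { type = string }
variable "product1_dev_subscription_id" { type = string }
variable "product1_dev_client_id" { type = string }
variable "product1_dev_client_secret" { type = string }

provider "azurerm" {
  alias           = "hub"
  subscription_id = var.hub_subscription_id
  features {}
}

provider "azurerm" {
  alias = "product1_dev"
  # Service principal with permissions only on this product environment.
  subscription_id = var.product1_dev_subscription_id
  client_id       = var.product1_dev_client_id
  client_secret   = var.product1_dev_client_secret
  tenant_id       = var.tenant_id
  features {}
}

module "product1-dev" {
  source = "./products/product1" # wrapper around the common product module

  providers = {
    azurerm     = azurerm.product1_dev
    azurerm.hub = azurerm.hub # shared hub for peering and role assignments
  }

  product_config = {
    tags      = { "cost-center" = "1234", "classification" = "internal" }
    vnet_cidr = "10.10.0.0/22"
    services  = ["monitoring", "transparent-proxy"]
  }
}
```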
Sample Code Structure — Product Core Module
Finally, we pass in the product configuration, which is a map object that may contain details like default values for mandatory tags, service or network configurations, and the custom services that should be enabled. If we want to add a new product, it's just a duplication of what you see here, with the reference to the core product module, the new provider connections, and a product configuration.
Our product core module references the common DBG product module. This DBG product module provides the landing zones for all of the connected products. We need to explicitly provide the aliased provider configuration for the hub, and we pass it on along with the product configuration. And that's it; that's how we add — with a few lines of code — a new product stack in the development environment.
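The thin wrapper might look something like this sketch — it declares the hub provider alias it expects and forwards everything to the common module (paths and names are hypothetical; configuration_aliases is the Terraform 0.15+ syntax):

```hcl
# Sketch of a product wrapper module: declare the expected hub alias,
# then hand the configuration through to the common DBG product module.
terraform {
  required_providers {
    azurerm = {
      source                = "hashicorp/azurerm"
      configuration_aliases = [azurerm.hub] # Terraform 0.15+ syntax
    }
  }
}

variable "product_config" {
  type = any
}

module "dbg_product" {
  source = "../../modules/dbg-product" # illustrative path to the common module

  providers = {
    azurerm     = azurerm
    azurerm.hub = azurerm.hub
  }

  product_config = var.product_config
}
```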
Of course, the complexity is underneath, but we can then scale this out very quickly. Remember, we have quite strict segregation of duties requirements. We don't give developers the ability to independently create their networks or expose services externally in the connected product accounts.
Sample Code Structure — Product Customization
What happens if a product needs some additional centrally managed infrastructure that should be common across all stages, like new subnets, web application firewalls, or technical accounts? This is where customization comes in. Each product may have unique requirements, which are represented here in a customization.tf file.
This contains a reference to the module that — in this example — would create the custom networking infrastructure specific to one service or one application, which then must be common to the product across all stages up to production. That means that in the production environment, the only differences in the configuration should be the provider credentials and the IP addresses. Here again, we provide the aliased provider, the reference to the module source, and some input variables to be used.
We can also see here the two virtual networks that should be peered to the hub. In Azure terminology, we're providing the IDs of two virtual networks in West Europe and North Europe. In this case, we're also deploying some underlying transparent proxy infrastructure, required for software that doesn't support the enterprise proxy. Then we merge in some additional tags, which gives us the reference back to that centralized asset-management service, associating the application with the infrastructure.
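A hypothetical reconstruction of such a customization.tf entry — the module path, variables, and application ID are illustrative:

```hcl
# Sketch of a customization.tf entry (all identifiers hypothetical).
variable "vnet_id_westeurope" { type = string }
variable "vnet_id_northeurope" { type = string }
variable "mandatory_tags" { type = map(string) }

module "product1_custom_network" {
  source = "../../modules/customizations/product1-network"

  providers = {
    azurerm.hub = azurerm.hub
  }

  # The two virtual networks to be peered to the hub (Azure resource IDs).
  peered_vnet_ids = [
    var.vnet_id_westeurope,
    var.vnet_id_northeurope,
  ]

  # Transparent proxy for software without enterprise-proxy support.
  deploy_transparent_proxy = true

  # Extra tags merged over the mandatory set, linking the infrastructure
  # back to the application in the central asset-management service.
  tags = merge(var.mandatory_tags, {
    "application-id" = "APP-001122" # hypothetical asset reference
  })
}
```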
If we dive deeper into the modules, we can see a standard module for the hub, the product, the sandboxes, the services, and the repeatable customization modules for the product groups. We've generally tried to provide these modules around a specific function. For example, providing a monitoring module, which may be made up of a collection of different types of resources. Or a policy module, which then lets us configure policies that may or may not be deployed to the product. But we know that the monitoring or the network modules deploy a series of services, components, and policies that will then make a consistent and compliant product landing zone.
Hub and Spoke — Foundational Infrastructure
Now that we've gone through the stack — the hubs, the product cores, and the customizations — this is a high-level overview of what we've accomplished. We have two hubs for productive and non-productive workloads in different regions, and we've peered them to the spokes corresponding to those regions with the network module. We've also deployed role-based access control, customizations, and monitoring. If we needed to scale this for 20, 50, or 100 product or application teams, it's a matter of replicating that core module — and we have a fully connected, secure, compliant product landing zone.
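As a side note on scaling, a for_each sketch of the same pattern is shown below (Terraform 0.13+). Because provider configurations cannot vary per module instance, this fits best where one provider covers all products; otherwise, one module block per product — as described above — is the practical choice. All names are hypothetical:

```hcl
# Sketch: stamping out many product landing zones from one map.
variable "products" {
  type = map(object({
    vnet_cidr = string
    services  = list(string)
  }))
}

module "product" {
  source   = "../../modules/dbg-product" # illustrative path
  for_each = var.products

  product_config = {
    name      = each.key
    vnet_cidr = each.value.vnet_cidr
    services  = each.value.services
  }
}
```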
Now that the product groups are able to consume their landing zones and start deploying the application services on top, how are we rolling this out and then maintaining these environments?
Terraform Deployments
Of course, we are an engineering team, and we need to implement the essentials of secure software development. That means ensuring we have a defined process, that security checks are integrated, and that we have a lifecycle for our software.
Of course, an enterprise source code repository fits this purpose perfectly. We also need documentation for all code changes, with a link to a ticket in some kind of backlog. Making fully automated software updates is difficult for foundational infrastructure. Hence, as a regulated entity with segregation-of-duties requirements, we need an approval process before rolling out anything to production.
We must have code reviews and ensure code quality. As part of our pull requests, we have integrated several actions that do Terraform validation, formatting checks, and documentation generation. After these actions are successful, each PR alerts the configured code owners. The output of a terraform plan is also posted to the PR, and colleagues within the team are then able to approve the code changes.
Approved pull requests are then applied to a test environment, where some automated tests are run for the activity and service onboarding. If successful, a canary is rolled out into the simulation and production environments. If the canary is good, the rest of the rollout proceeds, and a release notification is sent to the cloud community on our internal communications tools.
Packer — Image Bakery
For Packer, we have similar requirements. Our operating system baselines are defined by our internal group security. We develop against this baseline following our secure software development processes. We then have several build triggers — a weekly schedule, a new merge to the main branch, or a manual trigger, which is useful in the case of a new CVE or a new kernel patch.
We also try to keep each cloud independent from the others. We use the cloud-native CI tooling to control the build process: on AWS, it's CodePipeline; on Azure, Azure DevOps; and on GCP, Cloud Build. All of that code is stored on GitHub, and all releases come from the main branch.
Packer controls image creation: it creates the virtual machines, calls the provisioner, and then controls the publishing of the image. Ansible configures the machine, with roles and tasks aligned to each deliverable from those baseline definitions.
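A minimal Packer sketch of that flow, in HCL2 (supported since Packer 1.5); the base AMI variable and playbook path are hypothetical:

```hcl
# Sketch of the bakery flow: build a VM from a patched base image,
# hand it to Ansible for baseline hardening, then publish the result.
variable "upstream_base_ami" {
  type = string
}

locals {
  # AMI names must be unique, so derive a suffix from the build time.
  build_time = regex_replace(timestamp(), "[- TZ:]", "")
}

source "amazon-ebs" "hardened_linux" {
  region        = "eu-central-1"
  instance_type = "t3.small"
  source_ami    = var.upstream_base_ami
  ssh_username  = "ec2-user"
  ami_name      = "hardened-linux-${local.build_time}"
}

build {
  sources = ["source.amazon-ebs.hardened_linux"]

  # Roles and tasks in the playbook map to the security-baseline deliverables.
  provisioner "ansible" {
    playbook_file = "./baseline.yml" # illustrative playbook
  }
}
```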
We can then generate a compliance score for the image using Rapid7 and InSpec. These perform checks against our internal baselines and the CIS benchmarks. If the image meets a certain threshold, it is published on the cloud provider: on AWS, it's the Marketplace; on GCP, we have a dedicated project; and in Azure, we have the Shared Image Gallery. And again, these build notifications are sent to our internal communications tools.
These base images are then used by the product groups as the foundations for their products and services. The code is open internally — we have several contributors from different teams across the organization, which is a great example of internal developer collaboration.
What's Next?
We'd like to implement an on-demand project factory. We're already seeing improvements in our delivery speed in those teams that are using Terraform and Packer. We're looking to expand the usage even further — to further accelerate our developers, testers, and infrastructure experts.
We're also starting the onboarding of Terraform Enterprise. This will fill some gaps around single sign-on and role-based access control for deployments, and it will improve the control and auditability of the deployment cycle. It'll broaden — I think — infrastructure-as-code adoption outside the cloud team and provide approved, compliant resource modules in the private module registry. It also enables us to do pre-deployment policy checks.
Thank you very much for your attention, and for letting me show you how we're using Terraform and Packer in regulated financial services.
If you have any questions or if you think that we can do things a little bit better, please don't hesitate to reach out.