Case Study

Automatic multi-cloud landing zones via HCP Terraform at Helvetia Insurance

Learn how Helvetia Insurance structures their landing zones for deployment in a multi-cloud environment.

»Transcript

I will talk to you about automating deployment of landing zones in a multi-cloud environment. The first talk from Lufthansa is a perfect introduction because we are using the same kind of technologies.

First, some words about myself. I'm Matthias. I've been working for Helvetia Insurance for about two years. I'm working in the cloud enablement team. This means we provide landing zones, cloud accounts, and subscriptions to our internal teams that need to deploy applications in the cloud. We try to do this in the most automated way possible to avoid losing too much time. I will present the work we did on this.

»About Helvetia 

It's a Swiss insurance group founded in 1858 in St. Gallen, Switzerland. We are one of the leading insurance companies in Switzerland, but we also have some branches in Europe—in France, Spain, Germany, Austria, and Italy. We also have a small specialty market in art and transport around the world. Some facts and figures—if you are interested, you can check on the internet.

I will talk a bit about how a company like ours can go to the cloud, what the requirements are, how we tried to go there, and finally, how we went there nicely. 

»Journey to the cloud: The beginning

I will not talk about our journey to the cloud exactly, but more about how a company like ours, with regulations like ours, gets there and what challenges we may face when we start to use the cloud.

Generally, when we started, we had an on-prem datacenter. And some teams interested in machine learning and AI had heard that some cloud providers offer tools that are not available on-prem and that they might want to use.

So, they decided, let's try. They’re just tests. We can try different clouds, Azure, Amazon, Google. We have a company credit card, so let's put it there. We open an account. Everything is good. We can do some tests. We are just a small team, so we create an IAM user—or local user—share the credentials, and let's go. It's very easy and very quick. We can start straight away and have our tests running directly.

»Journey to the cloud: Challenges

Unfortunately, it does not work like that, because once we have finished our tests and decide to go to production, we may have some issues. We are a financial company. FINMA is the regulator of the financial market in Switzerland. So, if we just start doing whatever we want, they will not be very happy with what we are doing.

We have to know, for audit purposes, for example, who did what and where—and did we follow the regulations? Also, we currently have some tests running in sandbox systems isolated in the cloud. Then, you ask your firewall team: could you please whitelist all of Amazon's IP addresses for our datacenter so we can use them? Of course, the security guys will say no.

We also created the user locally. If someone left the company and decided to take the credentials with them—and we forgot to revoke these credentials in the local system—we may have a problem. Someone could do some bad things.

Last but not least, the cloud incurs costs. Depending on what you're doing—especially if you are doing machine learning or AI—it can be very expensive. Who takes responsibility for this cost? And especially, is what we are doing good? Are our resources underutilized? Are they well monitored? And so on. None of these things are managed or well-defined with this approach.

»Our cloud requirements

For companies, moving to the cloud looks easy, but in the end it isn't. We need a well-defined architecture and processes to be sure we are doing it right. Of course, the most important are user management and connectivity.

We have to define these requirements because we cannot just say we need this, we need that. So, we need to have some discussions with all the teams involved to see what we need to do. The main requirements for us, for regulatory purposes, are:

»Build on multi-cloud

We needed to be multi-cloud to have workloads on at least two clouds. For us, it's AWS and Azure. 

»Integrate with current services

We need to integrate with our on-prem services—at least in the beginning. We are also using some SaaS services, and we need to integrate with them too.

»Team responsibilities

As a cloud enablement team, we are just a few people, so we cannot take responsibility for all the applications that are running. We have something like 200 of them—we cannot manage them all. So, we decided that each team is responsible for their own cloud account or subscription and that we just give them access to it.

»Self-service

As there are not enough of us to do everything, it should be done in a self-service mode. This means if someone needs an account or wants to deploy a new application, it should be easy for them to get one. Those are the main requirements.

»Our network requirements 

We are working with a hub-and-spoke approach. This means we have some on-prem datacenters. In each cloud, we have hub networks that are deployed and that we manage.

Each of our—I call them customers, but each of our internal teams—has their own VPC or virtual network that is connected to the hub network. These hub networks are connected to on-prem with Direct Connect or ExpressRoute, depending on the cloud.

Of course, we have a security team, so we have firewalls everywhere. And we should be sure that applications that need to talk to each other can talk to each other, but no one else can.

»Our governance requirements 

»Single sign-on

We need to be sure that our employees can log into our platforms with one set of credentials—their company credentials—and that those credentials are invalidated when they leave the company.

We also want to apply the principle of least privilege. So, we have role-based access controls that give people exactly the access they need—not more, not less.

»Centrally saved audit logs

Everything that happens on the platform should be logged and sent to a central location where our security team can check whether something is wrong. In case of an audit, we can also explain what happened and when. And they can aggregate the data to detect anomalies and react quickly in case of a breach.

»Data location

We cannot put our data in any fancy or exotic region. We are regulated, so we need to know where our data is. And we should be able to limit where our teams or projects are deploying their assets. 

»Required policies

All disks should be encrypted: there is no exception to that. Internal servers should never have a public IP address. And even if by design they don't have one, it should not be possible for someone to add one. And, of course, we don't want a crazy cloud bill, so we want to forbid services that are too expensive.
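To make this concrete, here is a minimal sketch of how guardrails like these could be expressed as an AWS service control policy managed from Terraform. The policy statements, the allowed regions, and the `workload_ou_id` variable are illustrative assumptions, not our actual policies.

```hcl
# Hypothetical example: a service control policy (SCP) enforcing two of the
# guardrails above, attached to the organizational unit holding workload accounts.
variable "workload_ou_id" {
  type        = string
  description = "ID of the organizational unit the SCP is attached to (assumed)"
}

resource "aws_organizations_policy" "baseline_guardrails" {
  name = "baseline-guardrails" # placeholder name
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Disks must be encrypted: deny creation of unencrypted EBS volumes.
        Sid       = "DenyUnencryptedVolumes"
        Effect    = "Deny"
        Action    = ["ec2:CreateVolume"]
        Resource  = "*"
        Condition = { Bool = { "ec2:Encrypted" = "false" } }
      },
      {
        # Data location: deny activity outside the approved regions.
        Sid       = "DenyOtherRegions"
        Effect    = "Deny"
        NotAction = ["iam:*", "organizations:*", "sts:*"]
        Resource  = "*"
        Condition = {
          StringNotEquals = { "aws:RequestedRegion" = ["eu-central-1", "eu-central-2"] }
        }
      }
    ]
  })
}

resource "aws_organizations_policy_attachment" "workloads" {
  policy_id = aws_organizations_policy.baseline_guardrails.id
  target_id = var.workload_ou_id
}
```

On the Azure side, the equivalent would be Azure Policy assignments on the management groups.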

»Rough setup

Here you have a diagram of how it is done, more or less. In the middle, you have our hub infrastructure with our networking assets. For each of our customers—for each project, application—we create one account or subscription. This means we have 200 AWS accounts and 100 Azure subscriptions deployed like that. 

In each account or subscription, we have a network that is connected to our central VPC or central network. We have routing tables to route all the traffic through the firewalls. And we have log forwarders. The policies are managed with AWS Organizations for AWS and management groups for Azure.

We have Infoblox as our DNS system; it's also our IPAM system—so, IP address management. For our IDP—so, identity provider—our corporate solution is ForgeRock. We have to be sure the created accounts are linked with ForgeRock as our IDP.

You don't need to know all the details, but you can see we need to deploy many assets. In each subscription, we have a lot of assets, and everything should be configured and done for every application. 

This has to be done many times—we have 200, 300 of them. And if there is an update—for example, the security team or networking team decides to change a policy or a route in the routing tables—it has to be applied everywhere, as easily as possible.
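As a rough illustration of what gets deployed per account, here is a sketch of the spoke network on the AWS side, assuming the hub is an existing Transit Gateway; the resource names, CIDRs, and the `transit_gateway_id` variable are placeholders, not our real configuration.

```hcl
# Hypothetical sketch of the per-account "spoke" network assets on AWS.
variable "spoke_cidr" {
  type    = string
  default = "10.42.0.0/24" # placeholder CIDR handed out by the IPAM
}

variable "transit_gateway_id" {
  type        = string
  description = "ID of the central hub Transit Gateway (assumed)"
}

resource "aws_vpc" "spoke" {
  cidr_block = var.spoke_cidr
  tags       = { Name = "landing-zone-spoke" }
}

resource "aws_subnet" "workload" {
  vpc_id     = aws_vpc.spoke.id
  cidr_block = var.spoke_cidr
}

# Attach the spoke VPC to the central hub network.
resource "aws_ec2_transit_gateway_vpc_attachment" "hub" {
  transit_gateway_id = var.transit_gateway_id
  vpc_id             = aws_vpc.spoke.id
  subnet_ids         = [aws_subnet.workload.id]
}

# Send all non-local traffic through the hub, where the firewalls sit.
resource "aws_route_table" "spoke" {
  vpc_id = aws_vpc.spoke.id
}

resource "aws_route" "to_hub" {
  route_table_id         = aws_route_table.spoke.id
  destination_cidr_block = "0.0.0.0/0"
  transit_gateway_id     = var.transit_gateway_id
}

resource "aws_route_table_association" "workload" {
  subnet_id      = aws_subnet.workload.id
  route_table_id = aws_route_table.spoke.id
}
```

Because code like this lives in one shared module, a change to a route or policy is made once and then rolled out by re-running the per-account workspaces.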

»What did we do and how did we get there? 

Doing things wrong is productive in its own way, because you always learn from your mistakes. We made some mistakes, but not everything was bad.

»ClickOps

As my colleagues explained before, this means going to the portal and clicking around. You create a new network, create a route table, enter the addresses by hand, create the log forwarders—and so on. Each time you get a new request, you have to do that, and we have more or less 200 accounts. And if there is a new application, we have to do it all again.

We evaluated the advantages and disadvantages of this method. We didn't find any advantages. As for the disadvantages: everything, of course. Human error: if engineers are clicking around, they can make mistakes. It's not reproducible, so we don't know who did what, and when.

We have to do it many times—and sometimes you have to do something quickly, so you click quickly in the portal. You do some temporary stuff. Then you get interrupted because your manager asks you to do something else. Then you forget that you did this stuff, and you have an open port in the firewall running in production. And you discover it two months later when there is a newspaper article about a big data leak from your company. Of course, this is not the best solution we can have.

»Native cloud provider tools

The second idea, which we partially implemented in the beginning, is using the cloud providers' native tools. For AWS, that would be CloudFormation templates, Step Functions state machines, Parameter Store, and Secrets Manager. For Azure, ARM templates or Bicep, Blueprints, and Policy. Again, we can evaluate what is good and bad. And here, it's not all good and not all bad.

There are some advantages: well-supported tools. These tools are provided by the cloud providers, which means they want us to use them. So, they are well-supported, well integrated into the platform, well documented, and work well with that platform. And they take advantage of the platform's specificities because they were developed by the provider.

But what's not so good is that it's not a uniform tool. We need to know a lot of tools to have a multi-cloud strategy. We need a lot of knowledge. You need to know how to work with both, all the specificities of these tools, and so on.

And no one can manage both platforms, so we needed two teams—one for Azure, one for AWS—to develop with these tools. We also like to use external tools—managing GitHub with code, for example, could be good. But with CloudFormation and ARM templates, that is possible with custom resources, but it's not optimal. This means we needed to find a tool that could manage both platforms.

»Which tool did we use? 

We're here at a HashiCorp event, so I guess you all have an idea. It is Terraform, of course. Terraform does not solve everything. It's not a magical product where you deploy Terraform, everything is done, thank you, goodbye. No. Terraform helps us deploy the resources we need and manage the SaaS assets that we have. But we need to define the workflows, the processes, and how we store the customer data. For this, we cannot rely on Terraform alone.

»Request workflow

In our company, there is a rule that the single point of entry for all requests—not only IT requests but all requests—should go through ServiceNow. We should create a ServiceNow request for everything. It's like that. We have to live with it. 

So, we defined a request where a user fills in a form to order an account. Here, the user enters the name of the project, who is responsible for it, and who will take accountability for its cost. Is it a prod or non-prod account? Do we need a network? What is the network size? And all the other information we need to deploy this cloud account. Then, we get this information from them. How do we store it?

»A single source of truth

We decided to use a YAML file as a single source of truth. Everything we get from the customer and our services is stored in this YAML file, which is then used to create and deploy the assets.

Why YAML? Because it's easy for humans to read and understand, and it's really easy for a machine to parse and write. We have this rule that everything should be done with SNOW, so ServiceNow. But sometimes there are errors, and we have to make manual changes, so we should also be able to edit the file ourselves. Then we can create a pull request in Git to modify it, or let SNOW create this file for us.
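As an illustration of how such a file can feed Terraform, here is a minimal sketch; the file layout and field names are assumptions about what the form could capture, not our actual schema.

```hcl
# Hypothetical account definition as it might land in the repository,
# e.g. accounts/payments-prod.yaml:
#
#   name: payments
#   environment: prod
#   cost_center: "4711"
#   owner: jane.doe@example.com
#   network:
#     enabled: true
#     cidr: 10.42.0.0/24
#
# Reading every YAML "source of truth" file into Terraform:
locals {
  account_files = fileset("${path.module}/accounts", "*.yaml")

  accounts = {
    for f in local.account_files :
    trimsuffix(f, ".yaml") => yamldecode(file("${path.module}/accounts/${f}"))
  }
}

output "account_names" {
  value = keys(local.accounts)
}
```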

»The full diagram explained 

We have this YAML file, which is stored in a GitHub repository. In the same repository, along with these YAML files, we have some Terraform code which will deploy the various assets.

We are using HCP Terraform with several workspaces, each of which has its own responsibility. The first one is what we call the account factory. It is responsible for creating an AWS account or an Azure subscription directly—and assigning it to the correct AWS organization or Azure management group. It will also create the other HCP Terraform workspaces, and those will then deploy the resources inside the accounts.

So, we have one workspace which creates the accounts. And then, one workspace per account, which takes what is in the landing zone file in GitHub—the account's YAML file—and deploys the assets based on the configuration there.
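Continuing the sketch above, this is roughly what an account factory could look like with the AWS and `tfe` providers; the organization name, repository, e-mail convention, and variables are placeholders.

```hcl
# Hypothetical account-factory sketch: for each entry in local.accounts
# (see the YAML example above), create an AWS account and a dedicated
# HCP Terraform workspace that will deploy the landing zone into it.
terraform {
  required_providers {
    aws = { source = "hashicorp/aws" }
    tfe = { source = "hashicorp/tfe" }
  }
}

variable "workload_ou_id" {
  type = string # organizational unit for new accounts (assumed)
}

variable "vcs_oauth_token_id" {
  type = string # OAuth token ID connecting HCP Terraform to GitHub (assumed)
}

resource "aws_organizations_account" "this" {
  for_each  = local.accounts
  name      = each.key
  email     = "aws-${each.key}@example.com" # assumed naming convention
  parent_id = var.workload_ou_id
}

resource "tfe_workspace" "landing_zone" {
  for_each          = local.accounts
  name              = "lz-${each.key}"
  organization      = "example-org" # placeholder HCP Terraform organization
  working_directory = "landing-zone"

  vcs_repo {
    identifier     = "example-org/landing-zones" # placeholder repository
    oauth_token_id = var.vcs_oauth_token_id
  }
}

# Tell each landing-zone workspace which account it is responsible for.
resource "tfe_variable" "account_id" {
  for_each     = local.accounts
  workspace_id = tfe_workspace.landing_zone[each.key].id
  key          = "account_id"
  value        = aws_organizations_account.this[each.key].id
  category     = "terraform"
}
```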

Then, we are also using Prisma Cloud to configure the security. We are configuring Prisma Cloud, which is our CSPM tool, to be sure that once the subscription is created, our teams are not deploying some stupid stuff—and that they get alerted if they are using, for example, an unencrypted disk. So, if they are creating a machine with an unencrypted disk, Prisma will directly yell and say: no, you don't do that.

Last but not least, we also offer our customers the option to use Terraform themselves to deploy the applications on top of the landing zone that we created for them. And for that, we are creating a GitHub repository directly with Terraform. 

So, it has branch protection rules and permissions already assigned, and CI/CD workflows already set up to ensure that the code is good. We also create the HCP Terraform workspaces, so they can directly write their code, commit it to the Git repository, and then it is deployed.
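A minimal sketch of what provisioning such a repository could look like with the GitHub provider; the repository name, team, and protection rules are placeholders, not our actual setup.

```hcl
# Hypothetical sketch: an application repository for a customer team,
# created by Terraform with branch protection already in place.
terraform {
  required_providers {
    github = { source = "integrations/github" }
  }
}

variable "team_id" {
  type = string # the customer's GitHub team (assumed)
}

resource "github_repository" "app" {
  name       = "app-payments-infra" # placeholder name
  visibility = "private"
  auto_init  = true
}

resource "github_branch_protection" "main" {
  repository_id = github_repository.app.node_id
  pattern       = "main"

  required_pull_request_reviews {
    required_approving_review_count = 1
  }

  required_status_checks {
    strict = true
  }
}

# The owning team gets push access; admin rights stay with the platform team.
resource "github_team_repository" "owners" {
  team_id    = var.team_id
  repository = github_repository.app.name
  permission = "push"
}
```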

We would like to prevent them from using the portal to deploy things. We are not there yet, but this is our goal. To avoid sharing credentials with them, we are also using trust relationships and OIDC to authenticate the Terraform workspaces with the cloud accounts.

This means no one can, at any time, see any credentials. If we remove those authorizations from portal users, they won't be able to do anything there and will have to use Terraform. We would like to get there in the future.
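Here is a minimal sketch of that trust relationship on the AWS side, using HCP Terraform's dynamic provider credentials; the role name and the organization and workspace names in the subject claim are placeholders.

```hcl
# Hypothetical sketch of the OIDC trust that lets an HCP Terraform workspace
# assume an AWS role without any long-lived credentials.
data "tls_certificate" "tfc" {
  url = "https://app.terraform.io"
}

resource "aws_iam_openid_connect_provider" "tfc" {
  url             = "https://app.terraform.io"
  client_id_list  = ["aws.workload.identity"]
  thumbprint_list = [data.tls_certificate.tfc.certificates[0].sha1_fingerprint]
}

resource "aws_iam_role" "tfc_workspace" {
  name = "tfc-landing-zone" # placeholder role name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.tfc.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "app.terraform.io:aud" = "aws.workload.identity"
        }
        StringLike = {
          # Only runs of this organization and workspace may assume the role.
          "app.terraform.io:sub" = "organization:example-org:project:*:workspace:lz-payments:run_phase:*"
        }
      }
    }]
  })
}
```

The workspace then only needs the `TFC_AWS_PROVIDER_AUTH` and `TFC_AWS_RUN_ROLE_ARN` environment variables pointing at the role; no secret is ever stored.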

»What are our processes? 

People make a request in SNOW. Then, from ServiceNow, we have a webhook that triggers a Lambda function. This Lambda function creates the user permissions in our identity governance tool—for us, that's IBM IGI. Then, it requests an IP range from our IPAM tool, Infoblox.

With all this information that it has, it will make a commit into the GitHub repository to create this YAML file. Once this YAML file is created, the Lambda function is finished, but GitHub will automatically trigger our Terraform workspace, which will deploy everything that I showed before.
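The entry point for that flow could be wired roughly like this: a sketch assuming a Lambda function URL as the webhook endpoint, with the function code, names, and packaging left as placeholders.

```hcl
# Hypothetical sketch of the ServiceNow-to-Lambda webhook wiring. The function
# code (not shown) requests the entitlements, reserves an IP range in Infoblox,
# and commits the account YAML file to the GitHub repository.
resource "aws_iam_role" "webhook" {
  name = "snow-webhook-lambda" # placeholder
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_lambda_function" "snow_webhook" {
  function_name = "snow-account-request" # placeholder
  role          = aws_iam_role.webhook.arn
  runtime       = "python3.12"
  handler       = "handler.main"
  filename      = "build/snow_webhook.zip" # assumed build artifact
}

# HTTPS endpoint for ServiceNow to call; protect it with IAM auth or a signed secret.
resource "aws_lambda_function_url" "snow_webhook" {
  function_name      = aws_lambda_function.snow_webhook.function_name
  authorization_type = "AWS_IAM"
}
```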

»What challenges did we face? 

First, we are using HCP Terraform. This means the workspaces run in the order they receive the work—or not. It depends. They are queued and then run in a somewhat arbitrary order. Sometimes, it can happen that one workspace needs a resource that is created by another workspace.

We try to keep the workspaces quite small to avoid runs taking too much time. So what did we do here? We wrote the Terraform code in such a way that it checks whether the needed resources exist. And if not, it just stops quietly and waits for the next run.

Then, when the needed resources have been created by the other workspace, we use an API call, triggered from that workspace, to run the first workspace again. So, for one deployment, each workspace runs at least twice: first, when maybe not all the resources are there yet, and again once the resources are there.
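Here is a sketch of both halves of that pattern, assuming the shared resource is a hub Transit Gateway and using the HCP Terraform runs API to queue the follow-up run; the tags, workspace ID, and curl details are placeholders.

```hcl
# Hypothetical sketch of the "skip now, run again later" pattern.

# --- In the spoke workspace: only create the attachment once the hub exists ---
variable "spoke_vpc_id" { type = string }
variable "spoke_subnet_ids" { type = list(string) }

data "aws_ec2_transit_gateways" "hub" {
  filter {
    name   = "tag:Name"
    values = ["hub-tgw"] # placeholder tag
  }
}

resource "aws_ec2_transit_gateway_vpc_attachment" "spoke" {
  # If the hub is not there yet, count is 0 and the run still succeeds.
  count              = length(data.aws_ec2_transit_gateways.hub.ids) > 0 ? 1 : 0
  transit_gateway_id = data.aws_ec2_transit_gateways.hub.ids[0]
  vpc_id             = var.spoke_vpc_id
  subnet_ids         = var.spoke_subnet_ids
}

# --- In the hub workspace: once the hub exists, queue a new run of the spoke
#     workspace through the HCP Terraform runs API ---
resource "aws_ec2_transit_gateway" "hub" {
  tags = { Name = "hub-tgw" }
}

resource "terraform_data" "retrigger_spoke" {
  triggers_replace = [aws_ec2_transit_gateway.hub.id]

  provisioner "local-exec" {
    # TFE_TOKEN and the spoke workspace ID ("ws-...") must be supplied; placeholders here.
    command = <<-EOT
      curl --silent --request POST \
        --header "Authorization: Bearer $TFE_TOKEN" \
        --header "Content-Type: application/vnd.api+json" \
        --data '{"data":{"type":"runs","attributes":{"message":"hub ready"},"relationships":{"workspace":{"data":{"type":"workspaces","id":"ws-REPLACE_ME"}}}}}' \
        https://app.terraform.io/api/v2/runs
    EOT
  }
}
```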

Then, of course, sometimes something can fail because that happens. What should we do about that? First, write the correct code and make bugless software. Of course, always.

We monitor each failure carefully. Our team is always informed if there is a failure, and then we correct it. In the beginning, we had a lot of failures. Now, we are getting closer and closer to perfect software—which does not exist, of course. But we have fewer and fewer failures.

Last but not least, I said we can modify the infrastructure and apply it automatically to all our accounts. This can trigger 200 runs. If you are using HCP Terraform, you know this can be a problem. It may block the queue for one or two hours.

So, when we make changes like that, we have to plan them and communicate upfront to our customers that Terraform won't be available for the next hour—sorry about that—or do it overnight when no one is using it. Here, it's common sense, I would say.

»What do we want to do in the future? 

We want to make our deployments faster. For now, creating a landing zone takes about half an hour, which is quite fast. But when we make changes to an existing landing zone, it can take hours because of these run queues. So, we need to improve this.

We want everyone to use Terraform, of course, so we try to push this—but here it's more political than technical, so I am trying to convince people of the advantages. Last year at HashiConf, HashiCorp announced Terraform Stacks. I think this could solve our ordering issues and make the process much better, much more fail-safe, and so on. So, we're really looking forward to this new feature.

That's it from my side. Thank you for your attention, and enjoy.
