Building a migration factory with Terraform Enterprise at AXA Group
For AXA Group, migrating 6,000-8,000 applications to public cloud meant that a self-service platform and process was non-negotiable. See how they built it with Terraform Enterprise.
» Transcript
My name's Kelly Monteith. I work for AXA Group operations, and I'm a bit embarrassed to be here working for a French company but having to do this presentation in English. Very sorry about that, but honestly, you would not like to hear me try to speak in French.
» Introducing AXA
AXA is a very large financial company. I have to read this from the slides. This is the only part I'm going to read from the slides properly. We operate in 51 countries around the world. We've got 145,000 employees and distributors, and we're serving 93 million clients. I had to look that up yesterday, the same as anyone else.
We're in the finance industry. We're a highly regulated industry, and we've got a lot of the same challenges that I heard Samya talk about earlier on. I understood enough to understand we've got the same challenges, and it's good for me to understand we've even got some of the same approaches to addressing some of those challenges.
» AXA Group operations
We're present in 16 of those countries. We operate in Europe, Asia, the US, and Mexico, for example. We have over 6,000 employees, so quite a large organization. We're the infrastructure provider and the core IT application provider for the AXA Group. We have a very diverse set of people working in AXA Group operations, from traditional infrastructure managers to operations, but we're trying to move towards a product-driven organization.
Everything we do now, we're trying to build as products, and we make those products available to our customers. Our customers are, therefore, the security teams and the DevOps teams that are working to develop those IT applications AXA uses to run their business.
We support nearly all of the AXA entities, and more recently, we've become an internal cloud provider for the AXA Group. We provide some products — we take some cloud services from the main cloud service providers we want to work with. Those are AWS, Microsoft Azure, and GCP. We build some products that sit on top of those cloud service provider platforms and we also act as a cloud broker. We've got a lot of teams that are co-located with the business. They sit in 16 of those countries, but they help integrate those cloud products and enable those cloud services into the AXA IT teams.
I lead the multi-cloud team in AXA. My role is to make it as easy as possible for our application teams to consume those cloud services in a safe way — so, in a secure and compliant way. But make it so it’s easy for them to consume those services — to move their applications into the cloud.
» Our cloud strategy
We've got a huge program at the moment. It's called the Atlas program. But before I talk about that Atlas program, I want to talk about our cloud strategy. Our cloud strategy has changed. About five years ago, we defined a cloud-first strategy. All new business applications should target public cloud services. We had agile development teams, feature teams, agile tribes, etc., and we were trying to modernize all our business applications.
We wanted the operating companies to modernize their business applications: to re-architect them, re-platform them where possible, and move them to managed platform-type services. We wanted to move away from the old way of delivering applications on server VMs, infrastructure services, etc., and we didn't see any benefit of moving those infrastructure services as is to the public cloud.
We wanted to focus on functions as a service, database as a service, and all the serverless applications you can get from those cloud service providers. We built our own container platform, so we could containerize our applications and run them in the public cloud. That was about five years ago.
Then we built a private cloud infrastructure running in our core datacenters. Then we came to the next refresh cycle. We had a big decision to make. Do we refresh that private cloud service that we’d developed? Do we refresh all of our core IT services? Do we continue with the same datacenter approach? Or do we now make the jump and move everything towards the public cloud? That's what the Atlas program is doing. The cloud strategy has changed. We're looking to move everything to the public cloud.
That gives us a lot more challenges than relying on the cloud service providers for those modern managed cloud services. We had to replicate a lot of the infrastructure management services also in the public cloud. To do that lift-and-shift migration, we've had to bring a lot of those management services — those operational capabilities — into the public cloud, and that's what the Atlas program is doing.
» The Atlas program
The Atlas program's a global cloud migration program — objective to exit our private IaaS and to do our core IT refresh in the cloud environment. The good news for me, it's still a multi-cloud strategy. We're continuing with that multi-cloud strategy. Again, we did say do we want to partner with one strategic cloud provider or do we want to continue our approach to use AWS, Microsoft Azure, or GCP? At the end of the day, we're an insurance company. We're hedging our bets a little bit, and we wanted to spread the risk. We're also quite regulated, so spreading that risk is quite important.
We still have that multi-cloud strategy. That's not changed, but we're not going to develop any cloud portals. We won't develop something that allows you to log onto a console and build your infrastructure in the public cloud. We're not trying to replicate what's already been done by the cloud service providers. The key principles — everything is code and everything automated. So, the Atlas program, to migrate these applications, everything must be fully automated. It must be fully infrastructure as code.
I'm sure you can imagine that — for those working in finance companies — we've got a lot of old and legacy applications. We don't have the source code for all of that. They were all configured over many years. I see a lot of people nodding in agreement here. We don't have the infrastructure as code deployment, so we had to develop that. We have to build those automation tools to enable that migration to the public cloud.
» Moving forward with Atlas
The Atlas program is more than the migration. It's not the lift and shift. We had to look at the whole IT landscape. We had to look at all the management services. We can't bring everything and we can't do everything at the same time — so we’ve got some priorities.
We want to exit our private IaaS end of this year/middle of next year. It means we cannot wait for everything. We can't move everything to the public cloud and then use that to do the migration. We have to be a bit pragmatic.
The Atlas program itself, some stats: 40 entities around the world are involved in this. 600 people working on those cloud solutions, those migration factories. We had to take a migration factory approach because there are only so many highly skilled technical people that can help migrate business applications.
We want to migrate 6,000 business applications. Some said between 6,000 and 8,000; I've gone on the conservative side with 6,000. But that's also 13,000 virtual machines and 5,000 technical servers. I was quite amazed when I saw that: the number of technical services we have to run our business applications is huge. We couldn't do that just by a simple lift and shift. We had to take a factory approach and build all the automation tools to enable that to happen.
» The start of our journey and our roadmap
About 5-6 years ago, we started to enable AWS, Microsoft Azure, and then a little bit later, GCP. GCP for some very specific use cases — the big data, the business intelligence-type stuff.
But we enabled AWS and Azure and focused on building those foundations. We did a lot of work to build the foundations. Whenever you create them — and we've got something like 1,200 AWS accounts, 500 Azure subscriptions. The foundations, the landing zones — I think Sammy was referring to them — we had to fully automate that. We did that about 4-5 years ago, and we used Terraform for that, but it was Terraform community edition.
We used Terraform, not for everything, but in some cases we used Terraform, and we built those landing zones, that foundation that could support our new business applications. We built this container platform based on IBM or Red Hat OpenShift. That container platform runs on Azure and on AWS, and we were using that for modernizing and re-platforming a lot of our business applications.
Then, in 2020, one of our more mature operating companies, AXA Germany, started an infrastructure automation project. When we started the DevOps teams, they were all using different tools. They were using Azure DevOps, CodePipeline, CloudFormation — all using different CI/CD tooling.
Germany's got a history of industrialization, really industrializing the operations, and building those standard infrastructures. So, AXA Germany's CIO came to us and said we want to look at building a standard infrastructure as code. They were using AWS and Azure, so they wanted something which went across both clouds. They selected Terraform, but they wanted us to build that industrialized implementation. That's when we first delivered Terraform Enterprise.
» Terraform Enterprise — Where we started
Our German team, Group Operations Germany, built this Terraform Enterprise platform. Within about nine months, I think, we'd migrated all of those DevOps pipelines to Terraform Enterprise, either directly so they're using Terraform Enterprise and nothing else with a VCS-driven workflow, or they integrated their existing pipeline tools, and we started to build it. Then within nine months, the diktat in Germany was that everything now must use Terraform Enterprise.
It was a project that went to a product that was only specific for Germany. Then when our strategy changed — when we decided now we're going to shift everything towards the public cloud — we had to build something.
As I said, we wanted to do it automated, everything as code, so I was asked to look at how we could support this migration. How can we support the entity teams to take their applications that exist today and move them into the public cloud? I didn't want to reinvent anything, so I grabbed those people from AXA Germany and said, "Let's make this a global product."
» Where we are today
A couple of years ago, we did exactly that. We looked at what was done in Germany and how they'd implemented Terraform Enterprise in a private network (it's actually hosted in AWS), and we took that and made it into a global product. It's a multi-tenant implementation of Terraform Enterprise. I think we've onboarded something like 35 entities, and we're now using that for all the migrations and new business applications that are being deployed into that cloud environment.
We've enabled that infrastructure automation. Every application that we build today uses that Terraform Enterprise platform. Every migration from the on-premise, private cloud IaaS infrastructure — all the servers that are being migrated — are migrated using that Terraform Enterprise platform.
We've tried to fully integrate that into all the other AXA landscapes — the monitoring tools, the configuration management databases, all of those other things, the networking, etc. We've tried to build all of those features, some of which I saw we've built — and I saw announced earlier — are being replicated in the next versions of Terraform Cloud or, hopefully, Terraform Enterprise. So it's good news for us, but we've already done some of the work.
» Where do we want to go?
We've seen developers are a funny bunch. If you give them the tools just to use to consume cloud services, if it works, they will use it. If you don't give them the tools, they're engineers; they're going to invent something. They're going to write it themselves. They're going to build it themselves. They probably won't consider all of our security or compliance needs that we have, so our approach is not to mandate the use of something, it's try to provide the services that those developers are asking for.
We've got a lot of community events. We had a software engineering summit in Cologne two weeks ago, and our partners from HashiCorp joined us at that event. But we're really trying to listen to those developers because if the developers are asking for something, and if we are not delivering it — if we're not making it easy for them to consume — I know they're going to build it themselves because they're engineers.
So, our approach is to try and build the services they're requesting — or even if they're not requesting — try and anticipate what they need and try to build those services and make it as easy as possible for them to consume secure, compliant, safe services. I didn't understand everything in the previous presentation, but I think it was a very similar approach.
We want to increase automation. We're looking at GitHub Actions and Terraform run tasks, a lot of things that we're going to improve. We're constantly evolving that platform.
We've built those foundations, so we've got a team of cloud brokers today who will create landing zones. They will onboard an entity. Now they've done that integration, we want to make it simple for those developers to spin up new environments to build a developer landing zone. So, they don't need to know how to create an AWS account or request an Azure subscription or resource group — whatever it is.
We want to make it easy for them to have a module that they consume, and that bootstraps the whole DevOps environment for their application. That would include all the networking, the monitoring tools, and the integration to the identity management systems, etc. To really make it as easy as possible for developers to spin up new environments, develop their applications, and onboard their Terraform workspaces.
Ephemeral workspaces: Been talking about that for a long time. We've actually developed something of our own, I’ll maybe show that later. It's good to see that HashiCorp's direction is going in the same direction as we want to go.
» No-code provisioning
A bit of a buzzword, maybe — I don't know. But if a developer wants an object store just to store some data, they'd be quite happy to consume a module which creates an object store for them. And if they just put in some business attributes like the confidentiality level of their data, then we can build the storage bucket with the right encryption, exposure, etc. The developers are quite happy, I think, to consume those services if we make it really easy for them to consume.
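As a rough illustration of that idea, the sketch below derives storage settings from a single business attribute. The confidentiality levels, the setting names, and the `bucket_settings` function are all hypothetical; the talk does not describe the actual module inputs or the values AXA uses.

```python
from dataclasses import dataclass

# Hypothetical confidentiality levels, ordered from least to most sensitive.
LEVELS = ("public", "internal", "confidential", "secret")


@dataclass(frozen=True)
class BucketSettings:
    encryption: str      # "provider-managed" or "customer-managed" key
    public_access: bool  # whether the bucket may be exposed publicly
    versioning: bool     # whether object versioning is switched on


def bucket_settings(confidentiality: str) -> BucketSettings:
    """Derive storage settings from one business attribute (assumed mapping)."""
    if confidentiality not in LEVELS:
        raise ValueError(f"unknown confidentiality level: {confidentiality!r}")
    rank = LEVELS.index(confidentiality)
    return BucketSettings(
        encryption="customer-managed" if rank >= 2 else "provider-managed",
        public_access=(rank == 0),
        versioning=(rank >= 1),
    )
```

The point of the design is that the developer only states what the data is, and the platform decides how the bucket has to be configured.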
The no-code provisioning approach, for us, is working today. For the migration, for the server infrastructure, it's working. We're just saying it's this flavor. It's Windows 2019 running SQL Server, so we build those standard patterns. Then they're happy to take that, build their applications, and use it. We want to do the same. We want to expand that. So, glad to see HashiCorp going in the same direction.
» SaaS model
Then we want to look at the SaaS model because I'm running a platform team today. I don't want to do operations. I don't want to necessarily operate our Terraform infrastructure platform. Today, we're running some VMs in AWS, which we have to maintain. We have to upgrade every month because HashiCorp releases a new version with nice new features every month, and I don't want to do that. So, I’m interested to see where the SaaS model will go and whether that will be appropriate for AXA. I'm hoping it will be.
» Enabling group-wide cloud adoption and migration
I've got an architecture background, so I had to put a framework up here. You can see here that Terraform sits in the cloud provisioning stuff. But let me focus on this small piece down here first.
You've got those foundations. We're using cloud services, AWS, Azure, GCP. We've built that layer of foundation. Now it's not a big layer that sits on top. It's really just the things you need to do whenever you create a new account or subscription. It's the integration with IAM, networking, and security services that must be configured. So we configure all of those in exactly the same way. Then we've built what we call MPI as a product. MPI is a managed public IaaS. That's just a set of patterns that allow you to deploy secure and compliant virtual machines.
We've got this open PaaS. It's an OpenShift-based container platform. We're running something like 20 OpenShift clusters worldwide, 27,000 containers running in that platform — making it very easy for the developers to consume a container platform, and we build it for them. Then the native-cloud services, which sit on top of that. That's the kind of cloud platform and products that we've traditionally been building.
I haven't got all the tools on here. I noticed on the previous slides that you've got Artifactory and, I don't know, Checkmarx and all those other tools. They all exist up here as well. I've just not included them on the slide. But we're trying here to have a managed Terraform Enterprise cloud provisioning service. And for all those native services or the products we develop ourselves, to have standard modules that can be consumed.
With those modules comes a set of policies, and those policies will tell you if they're compliant or not. I think — slightly different from the last presentation — we're not enforcing many of those policies today. We are starting to work a lot closer with the security teams, but most of it today is advisory, and we report. But I'll come on to that in a second.
» Standardized modules
For every new type of service or product we build, we're creating those standard modules, the set of policies that enable those developers to consume and build their business applications. As I said before, not all of our — I don't want to say legacy — core systems are migrated to cloud-native services. They don't all have Terraform providers. We haven't written Terraform providers for them, so we still have to do a degree of automation to integrate those products into those backend systems.
An easy example: if you deploy an EC2 VM into AWS, we need to register that with a number of backend systems. I don't know; it's Puppet, identity systems, privileged user access management systems, our configuration management database. Not all of those have Terraform providers. We haven't developed that yet. If you are using those modules, we've got those foundations running in those cloud platforms. We've got a provisioning event. We can use that provisioning event to do some integration into those backend systems. Eventually, we want to move all of that as part of the provisioning process, but it's not today. We can't fix everything in one go. We've got, as I showed earlier, a lot of applications to migrate.
We've come up with this orchestration platform capability. It's event-driven. It's an API service with various workflows for registering services into those backend systems. So, the scenario is that developers, migration teams, they'll use the modules, they'll provision those services into the strategic provider platforms using our products. An event will happen. The event will trigger something, that something will do some integration into some backend services.
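The event-to-workflow pattern described here can be sketched in a few lines. This is only an illustration of the general technique, not AXA's implementation: the event type `vm.provisioned`, the event fields, and both workflow functions are hypothetical, and the backend calls are replaced by placeholder strings.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], str]

# Registry of workflows keyed by event type (names are hypothetical).
_workflows: Dict[str, List[Handler]] = defaultdict(list)


def on(event_type: str) -> Callable[[Handler], Handler]:
    """Register a workflow to run when an event of this type arrives."""
    def register(handler: Handler) -> Handler:
        _workflows[event_type].append(handler)
        return handler
    return register


@on("vm.provisioned")
def register_in_cmdb(event: Event) -> str:
    # Placeholder: a real workflow would call the CMDB's API here.
    return f"cmdb: registered {event['resource_id']}"


@on("vm.provisioned")
def enroll_in_config_management(event: Event) -> str:
    # Placeholder for enrolling the server in configuration management.
    return f"config-mgmt: enrolled {event['resource_id']}"


def dispatch(event: Event) -> List[str]:
    """Run every workflow registered for the event's type, in registration order."""
    return [handler(event) for handler in _workflows.get(event["type"], [])]
```

Calling `dispatch({"type": "vm.provisioned", "resource_id": "i-123"})` runs both workflows in turn. Adding a new backend integration then means registering one more handler, without touching the modules the developers consume.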
» Improved insights and observability
Because we've got all of that, we've got some very good insights into what's been deployed and some very good observability into our products and the cloud services. We just use the data lake technologies and collect as much data as possible about all of the resources we deploy into the public cloud.
Actually, it's not only public cloud. We've got something on here we call POD. It's a point of distribution. That's just a VMware-based on-premise. There will be some stuff that we know will remain in our core data centers. We want to migrate everything, but we know we're not going to do it, and we're not going to do it immediately.
But everything that gets deployed, we collect all the data about that into data lake technologies. Then we have some insights and can use tools like Power BI or other business intelligence tools to analyze that data and then to report. That observability can go to the governance teams, the architecture teams, the finance teams. It can be the operations teams and primarily the security teams who want to know those business applications are running in a secure environment and we're not exposing our customer data on the public internet, for example.
This is the responsibility of my teams: We provide that provisioning capability, the orchestration capability, and the observability. The foundation stuff — that was my old role. Now that's being run by our Azure Competency Centers — and we've got competency centers based on the different technologies, Azure, AWS, GCP, OpenShift, etc.
» Full integration into AXA landscape
Well, first of all, we run Terraform Enterprise itself. We just run it in AWS today. As I said, we want to look at moving that toward a SaaS service in the future. We also run a small instance for our Swiss operating company. They've got an implementation on Azure. So, we've got an AWS implementation, which serves most of our customers, and an Azure one. If anyone's worked in a Swiss company, you'll know that they want to keep everything in Switzerland (very highly regulated), so that's running in Azure because Azure has a region in Zurich.
But primarily, our Terraform Enterprise implementation is in AWS. It's on a private network, and we've integrated it with all the other AXA services: Power BI for reporting, Dynatrace for monitoring. Obviously, we integrate with all of our CI/CD tools. We use them to deploy Terraform itself. And we back up all the state. We do all of those things.
» Reporting Terraform Enterprise use
This picture on the left, this is an old picture. It's not the latest figures, but we've done a lot of work on reporting the use of Terraform Enterprise. We built our own workspace explorer. I just saw the announcement this morning. Maybe a lot of what we've done here will be replaced by out-of-the-box solutions provided by HashiCorp.
But we've done a lot of work in taking all that data, reporting the number of organizations. We've got reports which tell us whether we've got compliant workspaces. Compliant workspaces means they've got all the right policies attached and all the right naming conventions being used.
We've got reports which tell us about low-utilized workspaces that were touched seven months ago, and not touched since. So probably either the resources are just there and never changing, or those resources are simply not used — in which case we've got other information which would tell us how much it's costing, and we can look at removing those if necessary.
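The two report checks described above, compliance and low utilization, could look something like the sketch below. Everything concrete in it is an assumption made for illustration: the required policy names, the naming pattern, the `Workspace` record, and the 210-day threshold standing in for "seven months". The talk does not say how the real reports are computed.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet

# Assumed values for illustration only.
REQUIRED_POLICIES = frozenset({"tagging", "encryption"})
NAME_PATTERN = re.compile(r"[a-z]{2,5}-[a-z0-9]+-(dev|test|prod)")
STALE_AFTER = timedelta(days=210)  # roughly seven months


@dataclass(frozen=True)
class Workspace:
    name: str
    policies: FrozenSet[str]
    last_run: datetime


def is_compliant(ws: Workspace) -> bool:
    """Compliant: all required policies attached and the name follows the convention."""
    has_policies = REQUIRED_POLICIES <= ws.policies
    has_valid_name = NAME_PATTERN.fullmatch(ws.name) is not None
    return has_policies and has_valid_name


def is_stale(ws: Workspace, now: datetime) -> bool:
    """Low-utilized: no run within the STALE_AFTER window before `now`."""
    return now - ws.last_run > STALE_AFTER
```

A stale workspace is only a candidate for review. As the talk notes, the resources may simply be stable, so the decision to remove them would also rely on cost data.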
» Summary
We started our cloud journey by enabling new managed services in the public cloud. We didn't initially see the benefit of migrating all of our workload. We've since changed. We've now got a large program. It's actually the largest program in AXA for the next four years to migrate all our workloads to the public cloud.
We've done a lot of work to integrate Terraform Enterprise to ensure everything that now gets deployed is infrastructure as code and is fully automated, so there are no manual provisioning actions, let's say. There are still going to be some manual operations. But we try and make sure that everything that now gets provisioned into the cloud will enable us to — in the future — maintain it, sustain it.
Also, reversibility's coming along. So, in the future, will we be able to take an application running in AWS and move it to Azure? I don't know. But that's maybe the long-term goal of where we want to be, and we could only do that by selecting a common tool. The tool we selected was Terraform Enterprise.
That's it for me. I've got one minute left.