Self-Service Infrastructure at Lufthansa Systems with Terraform
Learn how Lufthansa Systems created a vending machine that allows teams to consume prefab, pre-approved cloud components in a consistent, self-service, and on-demand way.
» Transcript
Steffen Wagner: Welcome to our HashiConf talk, "Liftoff to the Cloud." I brought my colleague Bence Wax today. He will take over later for a demo. He's located in Budapest usually, and is in our team as a systems engineer. Our department is called the Technology Center of Excellence.
I'm Steffen Wagner. I've been with Lufthansa for around 3 years and am based in Berlin. I'm an enterprise architect in the same department.
We are both from Lufthansa Systems, a 100% subsidiary of the Lufthansa Group, which also includes Lufthansa Airlines. Lufthansa Systems was founded in 1995. Our headquarters is in Raunheim, a small city right next to the airport in Frankfurt.
We have offices around the world. Our larger ones are in Germany, in Hamburg and Berlin, as well as Budapest in Hungary and Gdansk in Poland. We also have a facility in Zurich and around 2,500 employees worldwide. We provide around 350 products and services. We have external customers of all sizes, including airlines all over the world. If you have been on an aircraft, you have probably been in contact with one of our products, even if you don't know it.
» Facts and Figures about Lufthansa Systems
Our products include everything that has to do with ground operations, including payment and refunding. Around 45% of all flights operated in Europe use one of our biggest products, called Lido Flight. In the operational area, we have various products for crew planning and also various products around the planning and scheduling of aircraft schedules.
We also have various products on the commercial side. Imagine you're booking a flight online. The price is being calculated based on some demand figures, and this is what we supply.
In an aircraft, the entertainment system is also a part of our product portfolio.
» Starting Our Cloud Journey
Before we jump into the topic, I want to give you a bit of an overview of where we started with our cloud journey and where we are at the moment.
Like lots of enterprises, we had classic datacenter operations until some years ago, and we still have them, but we were forced to move into the cloud because those datacenters are going to be closed. Therefore, we have to migrate all of our applications and customers to another landing zone.
We also have a very classic IT organization, mostly ITIL and ticket-based working, which is probably also nothing special in today's world. We started with some automation, basically using Terraform open source and some Ansible for configuration management, and started to do our thing, but it was a start from zero.
» A Need for More Automation
We realized some time ago that we do not have nearly enough automation. This binds our valuable cloud engineers and resources to repetitive tasks. We are not able to deliver new, fancy features because we have to work on migrating applications and spinning up virtual machines and services manually, because there was not enough automation.
Besides that, we had built an ecosystem with very static secrets management. Secrets never expired and were static, and this is just not good for modern operations.
Obviously, our environments grew. We have more customers migrating to the cloud. We weren't able to properly scale up our teams. That overwhelmed our teams. We had bursting ticket queues, and folks were just busy doing ticket work.
On the other hand, everything just gets more complex. Meaning we get new providers, we do new prototypes, we should introduce new technologies. We are developing into a continuous learning organization, and this just brings complexity.
When you have more providers, you have to onboard them, you have to manage them. This gets complex without any automation.
On the other hand, we have our different product teams being responsible for their own product development. They just grew in their maturity level of knowledge in cloud-native technologies, in platform services, and in what the different cloud providers can offer, which is a really good thing, but we didn't expect that to happen so fast.
That is why we have to provide our central services inside Lufthansa Systems with an API, so the product teams do not have to wait for some engineer to execute the work they need. They just want an API so they can do things on their own.
» An API 'Vending Machine'
On the other hand, we are still ramping up teams, we are still spreading knowledge about GitOps expertise and Terraform and the usual tooling. So we realized we have to introduce this API framework. We have created a thing that we call the vending machine.
The vending machine is a 24-hour store. You can go shopping for anything you need, self-service, and you just get the goods out. This whole thing is fully modularized so you can easily extend it with additional goods, like putting new items into a physical vending machine.
The idea is that we are providing this from developers for developers.
Each good in the vending machine represents a central service offering. That could be, for example, an Azure subscription, a project on Google Cloud, or just an AWS account. The vending machine can be understood as a framework. It offers the runtime environment, but it does not itself implement a good. It just defines the API structure for the consumers of the vending machine. The core concept is based on Terraform, and internally we have a code name for it, GitOps 2.0.
The services that we are building in some of our delivery teams in our department, the technology COE, are also plugged into the framework. The teams are responsible for maintaining those services, such as Kubernetes orchestration, the observability platform, etc. We have a lot of those services, and the aim is that they will be plugged into the vending machine. Every one of those internal services should ultimately be provisioned by it.
We are not there yet, but we have a good set of services already integrated.
» Seeding Components
I want to give you an overview of the structure so that you will understand how it works. So we realized that when you are trying to onboard a new component, there is nothing in it.
You get a root account on AWS or an empty Azure enrollment with access to the enterprise portal, and you usually start creating stuff manually. Instead, we started to create things in an automated fashion. This is Step 0, which we call "the seeding."
On screen, you can see a couple of the components that we are seeding. That could be the Terraform Enterprise provider to create service principals or the automation accounts that we can use later on for the proper integration with the vending machine.
You can also see our different cloud environments. That is a service account with an initial set of root privileges that we use to configure Google Cloud, Azure, GitHub, or any of the other components.
The goal of the seeding is, with impersonation, to execute a Terraform run as a one-time thing, more or less. The outcome is a persistent secret that we can then use to configure dynamic secrets engines and continue configuring the actual service.
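To make that concrete, here is a minimal, hypothetical sketch of what such a seeding run could look like for the Azure case. The resource names, variables, and paths are illustrative and are not our actual code.

```hcl
# Seeding (Step 0): a one-time Terraform run, executed with impersonated root
# privileges, that creates the automation identity and hands its persistent
# secret to Vault so all later runs can use dynamic credentials.
# All names and values here are placeholders.
terraform {
  required_providers {
    azuread = { source = "hashicorp/azuread" }
    vault   = { source = "hashicorp/vault" }
  }
}

variable "tenant_id"       { type = string }
variable "subscription_id" { type = string }

resource "azuread_application" "automation" {
  display_name = "tcoe-vending-machine" # hypothetical name
}

resource "azuread_service_principal" "automation" {
  application_id = azuread_application.automation.application_id
}

resource "azuread_service_principal_password" "automation" {
  service_principal_id = azuread_service_principal.automation.object_id
}

# The persistent secret produced by the seeding is used to configure the
# dynamic secrets engine; everything after this uses short-lived credentials.
resource "vault_azure_secret_backend" "azure" {
  path            = "azure"
  tenant_id       = var.tenant_id
  subscription_id = var.subscription_id
  client_id       = azuread_application.automation.application_id
  client_secret   = azuread_service_principal_password.automation.value
}
```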
» Step 1: The Landingzone Self-Service Definitions
Step 1 is the more interesting one, something we call "the base-landingzone." It's self-service, where the consumers can define their services. There is a central repository in Git, and the consumer can just raise pull requests and put in the code of the services that he wants to be provisioned.
Then we continue with various checks running on Actions: validation, some linting, some notifications, and a set of Sentinel policies that has to be extended in one of the future steps. After a short peer review, if everything looks fine, it gets merged, and an apply starts running. The execution itself happens on Terraform Cloud, which spares us the hassle of maintaining our previous setup with Jenkins, different version-control systems, and so on.
This time we decided to use platform services wherever possible to avoid the hassle of operating it ourselves. That is why we go with GitHub Actions and with Terraform Cloud.
Back to the vending machine. On Terraform Cloud, this landingzone provisioner invokes various sets of modules that we have created. We have an umbrella module that we call the "TCOE landingzone module." It contains multiple submodules, which reflect the services.
If a consumer, for example, wants to provision an Azure landingzone, this invokes the Azure landingzone module, in this case based on the CAF (Cloud Adoption Framework) provided by Microsoft.
This will take care of creating new subscriptions and making sure the subscriptions are part of a proper management group structure. It configures the RBAC permissions. It can attach different sets of Azure policies, and so on. This is extensible to whatever you need, as long as the provider supports it.
We do the same thing with the rest of the modules. You'll see modules for GitHub, which is our version-control system in this demo, and the same thing with Vault and, for example, Google Cloud. When this provisioning has finished successfully, the consumer gets notified and he has his landingzone provisioned, ready to use.
You see on the right side, the hashi-lsy-env1.
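To make the umbrella module idea a bit more tangible, here is a heavily simplified, hypothetical sketch of how such a wrapper could dispatch to the per-service modules. The module sources and variable names are placeholders, not our actual registry paths.

```hcl
# Hypothetical umbrella module (e.g. tcoe-landingzone/main.tf).
# Each nested module represents one good; it is only instantiated
# when the consumer asked for it in the landingzone definition.
variable "name" {
  type = string
}

variable "azure" {
  type    = any
  default = {}
}

variable "github" {
  type    = any
  default = {}
}

variable "vault" {
  type    = any
  default = {}
}

module "azure_landingzone" {
  source   = "app.terraform.io/example-org/landingzone-azure/azurerm" # placeholder
  count    = length(var.azure) > 0 ? 1 : 0
  name     = var.name
  settings = var.azure # subscription, management group, RBAC, policies, ...
}

module "github_repository" {
  source   = "app.terraform.io/example-org/landingzone-github/github" # placeholder
  count    = length(var.github) > 0 ? 1 : 0
  name     = var.name
  settings = var.github
}

module "vault_namespace" {
  source   = "app.terraform.io/example-org/landingzone-vault/vault" # placeholder
  count    = length(var.vault) > 0 ? 1 : 0
  name     = var.name
  settings = var.vault
}

# Further goods (Google Cloud, network spokes, ...) follow the same pattern.
```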
» Step 2: Consumer's Landingzone Customizations
This brings me to the second step, what we call "the consumer's landingzone customizations." The customization is now fully up to the consumer. He can do what he wants as long as it's compliant with the policies that are applied.
Effectively, a new GitHub repository has been created, and it has been preconfigured with all of the providers and input variables in the workspace that are required.
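As an illustration of that preparation step, the provisioner could create the repository and the matching Terraform Cloud workspace roughly like this; the organization, token reference, and names are placeholders, not our real configuration.

```hcl
# Sketch: create the consumer's repository and a Terraform Cloud workspace
# bound to it, with a required input variable pre-populated.
terraform {
  required_providers {
    github = { source = "integrations/github" }
    tfe    = { source = "hashicorp/tfe" }
  }
}

variable "vcs_oauth_token_id" {
  description = "VCS connection created during seeding (placeholder)"
  type        = string
}

resource "github_repository" "consumer" {
  name       = "hashi-lsy-env1"
  visibility = "private"
  auto_init  = true
}

resource "tfe_workspace" "consumer" {
  name         = "hashi-lsy-env1"
  organization = "example-org" # placeholder

  vcs_repo {
    identifier     = github_repository.consumer.full_name
    oauth_token_id = var.vcs_oauth_token_id
  }
}

resource "tfe_variable" "vault_namespace" {
  key          = "vault_namespace"
  value        = "consumers/hashi-lsy-env1"
  category     = "terraform"
  workspace_id = tfe_workspace.consumer.id
}
```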
In the example on screen, the consumer requested a new Azure landingzone, and this resulted in a dynamic secrets engine being created in Vault for Azure. This is hooked into his workspace and also templated into his repository.
The consumer will always find a Terraform file called tcoe.tf containing all of the configuration that we give the consumer. He's not supposed to touch it, but he's still fully responsible. So if he changes something and it doesn't work, then usually he has to fix it.
The rest is left to the consumer. We'll push him some pipelines that take care of provisioning and execution, but the rest is up to him. How that works is that he has a consumer pipeline that also runs on Actions; it does the same things: some validation, some documentation, and so on.
When a pull request is fine, it gets merged. The Vault provider is hooked up. We use Vault on HCP. We've been one of the launch customers for the product. We are really happy with it. It's been very stable, not a single outage or anything. I can only recommend it so far.
We invoke that, and a secrets engine gets hooked up. It's connected directly to the subscription that was created for the consumer, so it has the respective permissions to execute and update any resources on Azure. That is what is passed on to the Azure provider. There will be a very temporary service principal available during the execution run of the pipeline.
When the pipeline is finished, that credential gets cycled after the TTL expires.
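Roughly, the templated provider wiring in a consumer's tcoe.tf could look like the sketch below. The organization, Vault address, namespace, and role names are hypothetical.

```hcl
# Sketch of a templated tcoe.tf in the consumer repository (illustrative names).
terraform {
  cloud {
    organization = "example-org" # placeholder
    workspaces {
      name = "hashi-lsy-env1"
    }
  }
}

# Injected as workspace variables by the provisioner.
variable "subscription_id" { type = string }
variable "tenant_id"       { type = string }

provider "vault" {
  address   = "https://vault.example.hashicorp.cloud:8200" # placeholder HCP Vault endpoint
  namespace = "consumers/hashi-lsy-env1"                   # the consumer's namespace
}

# Ask the dynamic secrets engine for a short-lived Azure service principal.
data "vault_azure_access_credentials" "runtime" {
  backend = "azure"
  role    = "landingzone-contributor" # placeholder role scoped to the subscription
}

# The azurerm provider only ever sees temporary credentials; once the pipeline
# run is over, they simply expire with their TTL.
provider "azurerm" {
  features {}
  subscription_id = var.subscription_id
  tenant_id       = var.tenant_id
  client_id       = data.vault_azure_access_credentials.runtime.client_id
  client_secret   = data.vault_azure_access_credentials.runtime.client_secret
}
```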
Also, for some governance, we have cost estimation enabled. So the consumer will see if he accidentally provisioned more resources than he intended to. The goal is to bring in an extensive set of Sentinel policies, but we are in the beginning of that journey.
» The Runtime
Looking at the runtime, you've already seen some of the components that we are using, but I think it's important to bring it onto one slide so you know where stuff is running and what we are using.
We have something that we call the frontend, but the frontend until now has not been well developed. We just have the version-control thing right now, but we do have it on the roadmap to add other systems that give nontechnical users a way to interact.
That could be Jira REST hooks from Atlassian and ServiceNow for provisioning. The idea is that they get hooked into the Gitflow, so we still have the Git as the single source of truth for everything.
Besides that, we have the GitHub Actions ecosystem. If non-Terraform resources are configured, we still use Ansible, because we have an extensive set of roles there written for our different products, so we'll just continue using it.
After a resource has been created, like a virtual machine via Terraform, we pass the hosts and so on into the Ansible ecosystem and then continue to configure them if required. If there are resources that don't need configuration management, this doesn't have to be used.
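As one way to picture that handoff, a pattern like the following could render an inventory for Ansible from what Terraform just created; the variable and file names are illustrative only, not our exact implementation.

```hcl
# Sketch: write an Ansible inventory from addresses Terraform already knows,
# so a follow-up job on a self-hosted runner can apply the existing roles.
variable "app_hosts" {
  description = "Private IPs of machines created earlier in this configuration"
  type        = list(string)
  default     = []
}

resource "local_file" "ansible_inventory" {
  filename = "${path.module}/inventory.ini"
  content  = join("\n", concat(["[app]"], var.app_hosts))
}
```

The pipeline would then run ansible-playbook against that inventory; machines that don't need configuration management simply never enter it.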
After GitHub Actions, we have our Ansible roles and Terraform modules. The modules are stored in the private registry on Terraform Cloud. We also have our workspaces with the states on Terraform Cloud.
Vault on HCP is hooked into the system to take care of the dynamic secrets creation for any of those components, as long as there is a secrets provider available. Then we get out the apply, and this will talk to the different APIs that we're using, be it Google Cloud or Azure or anything else.
Besides the hosted runners or agents on Terraform Cloud, we have a set of self-hosted ones and we can't get rid of them because we need the private connectivity into our network—for example, to let Ansible connect to the hosts, because they're not reachable from the internet.
» What Is in Each Landingzone
This slide represents the structure of the landingzone that a consumer can choose.
As you can see, each of these always has a version and some metadata. We map it to a single Git repository. We plan to support multiple version-control systems. At the moment, it just works with GitHub.
We have a single Terraform workspace on Terraform Cloud and some hidden space where we configure the secrets engines. At the bottom of the screen you see the configuration block, where the consumers can put in the goods that they want to be vended.
Currently we support Azure, Google, Vault, and some network configuration so that the consumers, when they have their subscription, get connectivity into our backbone. We provision spokes into the deployment for them. They can also provision Git repositories and various other things.
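Putting it together, a single landingzone entry in the central base-landingzone repository could look roughly like this; the map keys, module source, and values are illustrative, not our exact schema.

```hcl
# Sketch of one landingzone definition in the base-landingzone repository.
# Each map is one good; an omitted or empty map means the good is not vended.
module "hashi_lsy_env1" {
  source  = "app.terraform.io/example-org/tcoe-landingzone/multi" # placeholder
  version = "~> 1.0"

  name = "hashi-lsy-env1"

  azure = {
    management_group = "sandbox" # where the new subscription is placed
    owner            = "steffen" # RBAC assignment
  }

  github = {
    visibility = "private" # the consumer's preconfigured repository
  }

  vault = {
    namespace = "consumers/hashi-lsy-env1"
  }

  network = {
    spoke_cidr = "10.20.0.0/24" # spoke connectivity into the backbone
  }
}
```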
Now my colleague Bence Wax will show a demo.
» The Demo
Bence Wax: This demo shows how to provision a consumer landingzone from our base-landingzone. As you can see on screen, we have the landingzone definition, which incorporates our own TCOE services coming from the umbrella module, where we put in the metadata, and a lot of stuff is configured through this.
We send the information through maps. Here, we are setting the Terraform Enterprise workspace variables and all the runtime configuration for it. We also define the Vault consumer namespace, and the consumer himself can use his own Vault namespace.
And we have Azure here. We would use some Azure resources as well in the demo.
Let's go to the umbrella module structure. We have nested modules here, which are describing all the services that we want to provide. Steffen will be our consumer and will create a pull request for his new consumer landingzone.
We just got the request Steffen made to create a consumer landingzone. He wants it very urgently. Our process is now started and we review his code.
Let's check what Steffen wants to do as a consumer. He needs a new landingzone with this configuration. Let's review these changes and approve it. Looks good to me.
Now all the pre-checks have started and also the documentation generation, and here we have the steps when the pull request is opened, those basic steps like linting and others.
But the most important part in this pull request workflow is the plan phase, which could take some time. But we have plans for further optimization of the runtime.
You can see the plan is running nicely. After a moment, we have the actual plan, and then I need to review again, because documentation generation just made some modifications. That looks good to me as well.
We can see that everything's green and we can hit merge, which will fire our build workflow, which will do the real stuff behind all this. Let's go back to the base-landingzone and you see that we've merged. So our commit is on master. Now we start the actual build, which will take more time.
All this stuff is running in Terraform Cloud. Therefore, we don't need to hassle with the Terraform environment setup.
This will take some time, so we'll jump a little and see that our resources are being created. Then we are good to go with a short Terraform apply.
Let's check our new, fresh, and crisp consumer landingzone. As you can see, we already have a pull request open, which is for the TCOE services to be merged, as all of the backend configuration for our services comes from there. The consumer himself doesn't need to hassle with the workflows or with the backend configurations of our services.
In tcoe.tf, we have the preconfigured backend endpoints, for example, the Terraform workspace, the Vault namespace for generating dynamic service principals for Azure, and the Azure provider itself with the Vault consumer namespace as its backend.
Once we know everything's fine, we create the pull request. Now you can see that the test pipeline has finished and the plan output is what we've defined in the consumer landingzone with our example resources for Azure and for Vault as well.
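The example resources in the demo are of the simple kind sketched below; these are stand-ins, not the exact demo code, and they rely on the providers already wired up in tcoe.tf.

```hcl
# Illustrative consumer resources: an Azure resource group created with the
# dynamic service principal, and a secret written into the consumer's own
# Vault namespace. The azurerm and vault providers are configured in tcoe.tf.
resource "azurerm_resource_group" "demo" {
  name     = "rg-hashi-lsy-env1-demo"
  location = "westeurope"
}

resource "vault_generic_secret" "demo" {
  path      = "secret/demo/hello"
  data_json = jsonencode({ greeting = "hello from the vending machine" })
}
```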
Now let's go back and I use my administrator privileges to merge this, but this would be on the consumer's side. So the merge is finished. Then we fire up our build pipeline as you saw before in the base-landingzone. And after everything's finished, we see that the apply is checked. So the build has finished.
Let's check what resources we've created. We've seen that the Terraform apply is OK, we bring up the Vault terminal, and let's just go and get the secrets from Vault. We can see that we have all the resources that we've generated on all the corresponding platforms.
Thank you. Steffen?
» Limitations
Steffen Wagner: Thanks, Wax, for that awesome demo.
I will keep it short now. I just want to bring up a few points about the limitations that we had to deal with during our implementation. One of the bigger things that we were dealing with was that providers were lacking features, so we usually had to find workarounds with non-Terraform resources, doing it in the Azure CLI or Bash, whatever fit.
One of the persistent issues is the authentication of the dynamic providers. Because we are configuring stuff that we are handing over to the consumer, we want to configure it before we hand it over. This forces us to do another Terraform run in a different workspace, which could be avoided if this had been implemented in Terraform. As you can imagine, this is getting to be quite a beast. We have complex graphs, and this sometimes requires long plans, but we are optimizing it.
There are some API issues now and then with Azure. We also run into funny things that we couldn't explain until they were magically solved by the Microsoft support.
Also, with GitHub Actions, we have to compensate for some of the features that Terraform Cloud doesn't have yet. We know that there's a feature for some of these tasks now, but we just don't use it yet.
We still have a lot of things to do, but I think we have a good basis on which we can continue to develop. We are planning on doing some more integration of Sentinel policies and definitely improving the frontends. And we have to get a generally available version, like version 1, shipped, so we'll just keep it coming and continue our flow here.
Wax and I are very happy you joined us, and we wish you all a very nice HashiConf.