Using Terraform to Manage Multiple Kubernetes Clusters On-Premises and in the Cloud

This is a guest blog case study written by Jerry Meisner and Michael Little. They work as the DevOps Practice Lead and the Director of Engineering Services respectively at Redapt, an end-to-end technology solutions provider that brings clarity to a dynamic technical environment.

At Redapt, we often help organizations adopt better DevOps practices, and one of the tools we love working with is HashiCorp Terraform. Recently, a SaaS company in the operations software industry needed the ability to provision and manage multiple Kubernetes clusters both on-premises and in various public clouds.

Redapt used HashiCorp Terraform and various Terraform providers to make the process efficient, repeatable, and recoverable in case of disaster. Combined with the power of Rancher’s Kubernetes management, the client was able to perform disaster recovery while maintaining the benefits of Terraform pipelines.

The Client’s Needs

When this client approached us, they were already using Terraform Enterprise to manage infrastructure across multiple clouds. Since they knew they were going to leverage Rancher to manage their Kubernetes clusters, they wanted to ensure they had a solid Terraform Enterprise workflow that would allow them to deploy Rancher, manage upgrades, and handle disaster recovery (DR).

The client was also looking for a solution that could work across multiple infrastructure providers, starting with AWS-based instances. The Rancher control plane in the cloud would manage Kubernetes clusters across multiple environments, including backups and DR.

The client needed a full solution for the Rancher control plane that wouldn’t impact their ability to use Terraform for upgrades or de-provisioning going forward. If the control plane lost quorum due to infrastructure failure and had to be brought up on new instances, they wanted to be able to use a Terraform-friendly process to not only restore the cluster, but continue managing Rancher upgrades. They also had multiple environments, each with its own control plane, that needed to be covered in the same way.

The Solution

Terraform is a fantastic tool designed with immutability in mind. The general method for restoring a Rancher Kubernetes Engine (RKE)-based Rancher control plane is to bring up a brand-new Rancher installation and use a command-line utility to restore it from backup with the original configuration files. In the event of a disaster, new nodes can be brought up with Terraform, and the RKE Terraform provider can be used to install a fresh Rancher control plane onto those nodes. The backup, together with the original configuration files conveniently stored in the original tfstate by the RKE provider, can then be used to restore the cluster to its original state with the RKE CLI.
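
To illustrate that first step, a minimal sketch of the fresh control-plane definition with the RKE provider is shown below. The aws_instance references, SSH details, and node roles are assumptions for this example, and the provider source points at the published rancher/rke plugin rather than the custom build used on this project.

```hcl
terraform {
  required_providers {
    rke = {
      source = "rancher/rke"
    }
  }
}

# Fresh RKE cluster on the replacement nodes; Rancher is reinstalled on top of it afterwards.
resource "rke_cluster" "rancher_control_plane" {
  # One nodes block per replacement instance; addresses, user, and key path are illustrative.
  nodes {
    address = aws_instance.rancher_node[0].private_ip
    user    = "ubuntu"
    role    = ["controlplane", "etcd", "worker"]
    ssh_key = file("~/.ssh/rancher_nodes")
  }

  nodes {
    address = aws_instance.rancher_node[1].private_ip
    user    = "ubuntu"
    role    = ["controlplane", "etcd", "worker"]
    ssh_key = file("~/.ssh/rancher_nodes")
  }
}
```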

The problem with this approach is that, after the restore, Terraform would no longer be able to connect to and manage the restored cluster. The data in the tfstate would still reference the fresh installation, which was overwritten by the restore. And while the data in the original tfstate was more accurate, it referenced the old nodes and API endpoint.

To solve this problem, we leaned into our knowledge of RKE and the Terraform state manipulation procedures, such as import, pull, and push. We also leveraged the RKE provider, which for this project was a custom plugin provided by Rancher.

Since the rke_cluster resource we were leveraging supported cluster imports, we knew we could import the cluster, with some caveats. As mentioned earlier, the RKE restore process effectively reverts etcd to its previous state, so the underlying Rancher installation, which was deployed with the Helm provider, also needed to be imported, along with other resources applied directly to Kubernetes. While we could remove and re-import resources into the current tfstate manually or via script, we knew the original, pre-disaster tfstate already referenced all of that data.

Because of this, it made more sense to push the original tfstate to the backend, remove the stale rke_cluster entry from it, and import the new rke_cluster resource. Once the modified original tfstate was in the backend, subsequent Terraform plan/apply operations could continue as normal, including updates to Rancher and its underlying Kubernetes cluster.
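
Sketched with the standard Terraform CLI, and assuming a local copy of the pre-disaster state file plus an illustrative resource address, that sequence looks roughly like this:

```shell
# Make the pre-disaster state the current state in the backend
# (state push may require -force if the lineage or serial differs).
terraform state push terraform.tfstate.pre-disaster

# Drop the stale rke_cluster entry that still points at the old nodes and API endpoint.
terraform state rm rke_cluster.rancher_control_plane

# Import the cluster that was just reinstalled on the new nodes.
# <import-id> is a placeholder for whatever ID format the RKE provider expects;
# the Rancher Helm release and other Kubernetes resources were re-imported the same way.
terraform import rke_cluster.rancher_control_plane <import-id>

# Confirm that only expected drift remains before resuming normal plan/apply runs.
terraform plan
```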

Since Terraform Enterprise has API support, and RKE has CLI support, we could write a script that connected to the backend and performed all the necessary steps, including the RKE restore. This script was later integrated with the client’s CI/CD system to be run as a parameterized job. Whenever the client encountered a disaster scenario, they could trigger the job and target the last good backup.
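
A stripped-down sketch of that job is below. The Terraform Enterprise address, workspace ID, snapshot name, and file paths are all placeholders supplied as job parameters, and the real script also performed the state push/remove/import sequence described above.

```shell
#!/usr/bin/env bash
# Hypothetical parameterized DR job; every value below comes from the CI/CD trigger.
# TFE_TOKEN is expected to be present in the environment.
set -euo pipefail

TFE_ADDR="https://tfe.example.com"   # Terraform Enterprise address (placeholder)
WORKSPACE_ID="$1"                    # workspace holding the Rancher control-plane state
SNAPSHOT_NAME="$2"                   # last known-good etcd snapshot to restore from

# Fetch the workspace's current state version metadata from the Terraform Enterprise API.
curl -sS \
  --header "Authorization: Bearer ${TFE_TOKEN}" \
  "${TFE_ADDR}/api/v2/workspaces/${WORKSPACE_ID}/current-state-version" \
  -o current-state-version.json

# Restore the freshly installed control plane from the target etcd snapshot with the RKE CLI.
rke etcd snapshot-restore --config cluster.yml --name "${SNAPSHOT_NAME}"

# ...then run the terraform state push / rm / import steps shown earlier so that
# future plan/apply runs manage the restored cluster.
```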

Conclusion

By using our understanding of how Terraform and its providers store their state, Redapt was able to create a DR procedure that allowed our client to have full lifecycle control of their Kubernetes clusters, including their Rancher control plane, regardless of the hosting infrastructure.

For more educational resources on Terraform and other HashiCorp products, visit HashiCorp Learn. To read more case studies about HashiCorp products, visit the HashiCorp Case Studies page.
