How to keep the Death Star (Terraform Enterprise) safe and performant
Learn how G-Research maintained high Terraform Enterprise performance at scale by using Terraform agents and Active/Active architecture.
» Transcript
Carl Heinst:
This presentation is a real-world example of the approach advocated by Nick when he opened the leadership track this morning. There's something about this slide that doesn't look quite right. That's better. We are here to talk to you about the various approaches we've taken to maintain responsiveness and uptime in our on-prem instance of Terraform Enterprise, or, as I have evocatively named it for this presentation, the Death Star.
» Introductions
We will start by introducing ourselves, telling you about G-Research, and then moving on to how we deploy and configure Terraform Enterprise. I shall discuss our customers and their requirements, how we initially mitigated some performance issues, and George will provide an overview of how we successfully deployed Terraform Enterprise Active/Active.
I'm Carl Heinst, and I manage automation engineering within G-Research. I'm blessed to lead two of the most technically capable teams I've known throughout my career — Infrastructure Automation and Infrastructure Development. George joins me from the Infra Auto Team, who, among their numerous responsibilities, own our instance of Terraform Enterprise.
I love automating manual tasks almost as much as I love science fiction films, which perhaps explains the title of today's presentation. My interest in automation stems from a comment by the PowerShell author, Don Jones, to Windows administrators in 2010: "Your choice is — learn PowerShell or, would you like fries with that?" As a Windows engineer at the time, the idea that you could get things done without having to click next, next, yes, I agree, go away, finish sounded like Nirvana.
PowerShell led to Python, Python to Ansible, Ansible to Terraform, and two years ago, I fell to the dark side and moved from senior engineer to management. These days, my Python skills are used more for creating PowerPoint presentations and data visualizations than for Ansible playbooks and Jenkins pipelines. As the Infra Auto Team is so fond of reminding me, "You managers love your pie charts."
George Short:
Hello, everyone. Good afternoon. I'm George Short. I am an Automation Engineer at G-Research with Carl. I've been there for around 18 months. Previously, I was a Senior DevOps Engineer focusing on public cloud at an investment bank. To touch on the Infrastructure Automation Team, as Carl mentioned, our primary focus is to deliver TFE as a service within G-R.
That ranges from tasks such as keeping it up-to-date and support activities, to developing and maintaining our self-service capability for other engineers to consume. Also, around this, as a team that specializes in automation, we have developed a lot of automated workflows around the product. So, very minimal manual effort is required from the group, which is obviously a good thing.
Also, in terms of Terraform Enterprise, we've got a great relationship with HashiCorp, where we provide persistent feedback and feature requests to keep improving the product. We want it to be the best product for us and other customers. They might call it nagging, but it's positive feedback.
Aside from TFE, our team primarily works alongside other teams within G-R to remove any existing manual processes they may have or to develop new solutions for projects that arise. We've got a broad skill set within the team, ranging from Terraform and Ansible to Python, Golang, Kafka, Kubernetes, and many more. That's it for me for now. I'll hand back to Carl.
Carl Heinst:
Let me set the scene by talking about G-Research and our instance of TFE.
» G-Research
G-Research is the most successful quantitative research and technology firm in Europe. We have made significant investments in the spheres of machine learning and artificial intelligence — where possible, leveraging open source technologies to achieve this.
We are headquartered in London, which is the office George and I are based out of. We are currently growing an engineering hub in Dallas, Texas. Good morning to our colleagues over there. And in terms of size, our calc farm has a comparable number of GPU cores to companies such as Tesla.
Notably, five years ago, we embarked on a migration from what was largely a Windows environment to a new Linux-based infrastructure platform. Our goal for this platform was that it would be 100% deployed and managed from code, and TFE is a fundamental part of that capability.
100% deployment from code allowed us to achieve two broad goals across the organization: consistent and more rapid deployment of applications and services, and security and risk reduction thanks to the benefits of operating in a zero trust environment.
» The Death Star (Terraform Enterprise)
Now you know who we are. Let's get back to our subject matter — the Death Star, one of the most maligned fictional technologies of the 20th century. Contrary to popular belief, this peaceful space station was designed to destroy dangerous asteroids before they could devastate populated planets. There are some unhinged, anti-establishment, unfounded conspiracy theories about it being used to destroy entire planets.
Prior to Active/Active, Terraform Enterprise (TFE) had much in common with the Death Star. Realistically, you would only ever build one at a time. Deploying that instance would be resource-intensive and time-consuming. And, like the Death Star's Super Laser, TFE workspaces are optimized to handle a finite number of resources. Too many resources in a workspace, and it can be overwhelmed.
There are, of course, a few differences between our instance of TFE and the Death Star. Unlike the Death Star, our instance of TFE is entirely deployed from code. We make use of Jenkins as a scheduler and orchestrator. Ansible playbooks are used for configuration of the application servers. Then we make use of Terraform command line executions to handle the configuration of the application. The majority of teams that use TFE do so to deploy and manage their applications or platforms from code.
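As a purely illustrative sketch of what configuring the application from code can look like (this isn't our exact pipeline, and the hostname, organization, and workspace names are hypothetical), the hashicorp/tfe provider lets you manage TFE's own organizations and workspaces declaratively:

```hcl
# Minimal sketch: managing TFE itself with the hashicorp/tfe provider.
# Hostname, organization, and workspace names are illustrative only.
terraform {
  required_providers {
    tfe = {
      source = "hashicorp/tfe"
    }
  }
}

provider "tfe" {
  hostname = "tfe.example.internal" # on-prem TFE instance
  # API token supplied via the TFE_TOKEN environment variable
}

resource "tfe_organization" "platform" {
  name  = "platform-engineering"
  email = "infra-auto@example.internal"
}

resource "tfe_workspace" "github_config" {
  name         = "github-org-config"
  organization = tfe_organization.platform.name
  auto_apply   = false
}
```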
» Why Terraform Enterprise?
It's the plethora of providers that makes Terraform an extremely flexible tool, further enhanced by our internal ability to develop custom providers. Within G-R, it is one of our de facto standards for infrastructure as code — harmonizing with our new platform's goal of immutable deployments.
Infra Auto use TFE to provide our internal customers with a secure, self-service, fully automated offering for the consistent deployment of their environments. And, as we're about to see, we have some demanding customers who want to manage very large numbers of resources within our estate as efficiently as possible. Over time, it became clear to us that workspaces with tens of thousands of resources were unsustainable.
» The Rebel Alliance
Referring to them as customers seems a little pedestrian. What name would be better for a group who are conspiring to attack the Death Star? Perfect. First in this band of anarchists and criminals bent on the destruction of expensive government property are the GitHub team.
They don't look like much, but they've got it where it counts, with 30,000 resources in their workspaces. The Vault team is next with 27,000 resources — the Kubernetes team with a staggering 84,000 resources. Finally, the OpenStack team with 30,000 resources. And if all of them decided to run their Terraform plan and apply on TFE at the same time, we get this.
Look how happy the GitHub team were in that clip as they fly off to their next task, leaving Infra Auto to pick up the pieces as CPU utilization on Terraform Enterprise hits 100%, and it stops responding to anybody else. I am now anticipating an argument amongst the GitHub team as to which one of them is Lando.
» Our Mitigations
There's the crux of the problem: too many resources in a single workspace, resulting in high CPU and memory utilization and, ultimately, the TFE platform becoming unavailable. From a customer experience perspective, simple configuration changes could require upwards of 30-60 minutes to land in production, which was unacceptable.
Unlike aboard the Death Star, we didn't have the option to destroy our problems ship-to-ship. Shooting down your customers in a starfighter isn't very good customer service, apparently. Nor, as one of my fellow managers suggested, could we put a vent over the thermal exhaust port.
» Evading turbo lasers
So, we invested in a number of other solutions. Early on, the team increased the T-shirt size of the VMs we were using to have more CPUs and more RAM available. This brought us quite a bit of breathing room. Some of our medium-sized customers were able to successfully shard their workspaces, splitting them up to reduce the number of resources within a single workspace.
As time went on and resource counts continued to grow, the large workspaces were consuming more memory and taking longer to plan and apply. We were eventually forced to drop our overall concurrency to just eight jobs, with six gigs of RAM allocated to each job, while we worked with large workspace owners to shard their workspaces.
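For reference, on Replicated-based TFE installs that concurrency and per-run memory are driven by the capacity_concurrency and capacity_memory application settings; a settings fragment along the lines below matches the numbers above (capacity_memory is in megabytes, so six gigs is 6144). Treat the exact keys as something to confirm against the docs for your TFE release.

```json
{
  "capacity_concurrency": { "value": "8" },
  "capacity_memory": { "value": "6144" }
}
```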
» Deploy the fleet (TFE Agents)
However, our old nemesis GitHub was preparing a fresh assault. Unfortunately, for our colleagues in the GitHub team, due to certain structural decisions around our GitHub organization, it's really difficult for them to shard their workspaces. So, just like in Return of the Jedi, we deployed our fleet to keep them from escaping as we had something special planned for them — TFE agents. A fleet of star destroyers would've been awesome. I'm going to teach them to use all those CPU cycles.
The GitHub team was the first to be moved across to using TFE agents. By deploying a fleet of virtual machines to act as executors, we could move our most problematic workloads to execute there — alleviating load on the TFE application servers. The Kubernetes team followed suit shortly after, as the increased available resources of the agents gave them the breathing room needed to shard their workspaces into more manageable chunks.
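As a hedged sketch of what moving a workspace onto agents looks like in code (the pool, organization, and workspace names here are hypothetical, not our real ones), the tfe provider exposes agent pools, agent tokens, and an agent execution mode:

```hcl
# Sketch: create an agent pool, a token for the agent VMs, and move a
# workspace's execution onto that pool. Names are illustrative.
resource "tfe_agent_pool" "github_pool" {
  name         = "github-agents"
  organization = "platform-engineering"
}

resource "tfe_agent_token" "github_pool" {
  agent_pool_id = tfe_agent_pool.github_pool.id
  description   = "Token for the GitHub team's agent VMs"
}

resource "tfe_workspace" "github_org_config" {
  name           = "github-org-config"
  organization   = "platform-engineering"
  execution_mode = "agent"
  agent_pool_id  = tfe_agent_pool.github_pool.id
}
```

The agent VMs themselves then run HashiCorp's tfc-agent, pointed at the TFE instance via the TFC_ADDRESS environment variable and authenticated with the pool token via TFC_AGENT_TOKEN.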
» Active/Active Death Stars
So, we reach our ultimate solution. Why deploy only one Death Star at a time when you can have two? Now let me hand over to George, who will take us on a deep dive into deploying Active/Active.
George Short:
As Carl mentioned, I'll now go into a bit more detail on how we adopted Active/Active at G-R and our migration process towards it, but I'll try to keep it as high level as possible. I'll also say, for those enjoying the Star Wars theme, it is not my area of expertise, so I'll stop that for a minute. But like the Jedi, Carl will return.
» Existing architecture
Where were we before Active/Active? This is pretty much a representation of our architecture: a single TFE server, with some underlying core services replicated to our DR site, where we've got an offline TFE server ready to be brought online in the event of a failover. With a single instance of TFE, there are two clear issues. The first is a single point of failure. And what comes with that? Restricted performance.
As Carl mentioned, we were stuck at eight concurrent runs. For those not using agents, that meant only eight runs at once within our environment, which didn't meet the demand from our customers. So, when Active/Active was made generally available, we saw it as an opportunity to improve the customer experience of the service. I'm going to touch on some of the high-level steps involved.
» Migrating to Active/Active
The first is highlighted in HashiCorp's docs: you need an externalized Redis solution. If you're fortunate enough to benefit from being in the public cloud, it's simple: a few clicks and the Redis solution is yours. However, being on-prem at G-R, we do things the hard way, and rapid upskilling was required as a team, which we undertook.
As a result, we ended up with a three-node Redis cluster — three VMs running Docker, each with four containers: Redis, Redis Sentinel, and then, for observability, Filebeat and Telegraf. I'll touch on those later. If anyone in the room has looked at Active/Active, the docs are clear that neither Redis Sentinel nor Redis clustering is supported by Terraform Enterprise.
However, for us, TFE isn't actually aware of Sentinel; it doesn't know we're using it. We are just using it under the hood to provide cluster-like capabilities. And because TFE isn't aware of Redis Sentinel and doesn't support it, it's unable to consume Redis out-of-the-box in a highly available manner. So, we had to come up with an additional solution.
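Before getting to that solution, a quick aside on what Sentinel is doing under the hood: a quorum of Sentinel processes watches the master and promotes a replica when it fails, driven by a handful of directives like the sketch below. The master name, address, and timings are placeholders rather than our production values.

```
# Minimal Redis Sentinel sketch: each of the three Sentinels monitors the
# master named "tfe-redis"; a quorum of 2 must agree it is down before failover.
sentinel monitor tfe-redis 10.0.0.11 6379 2
sentinel down-after-milliseconds tfe-redis 5000
sentinel failover-timeout tfe-redis 60000
sentinel parallel-syncs tfe-redis 1
```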
We introduced HAProxy into the architecture. As you can see in the diagram, we've got HAProxy running as a container on each TFE active node. Its primary responsibility is to poll Redis to discover which Redis node is the master, and TFE points to HAProxy for its Redis configuration. So, from a TFE point of view, it sees HAProxy as the Redis connection, and HAProxy is doing the clever stuff under the hood to route that traffic to the current master.
You probably can't see it on the slide, as it's a bit small, but I've included an example polling configuration below, if anyone's interested. I'm sure the deck will be sent out, but it took us a little while to get this right, so it's quite useful information. The key thing here is that TFE is unaware we are using Sentinel; it uses HAProxy for the high availability functionality.
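For anyone who can't read the slide, the shape of that polling configuration is a TCP health check that asks each Redis node for its replication role and only routes to the node reporting itself as master. Addresses, server names, and timings below are placeholders, and if Redis password auth is enabled you'd also need an AUTH step in the check.

```
# HAProxy sketch: TFE connects to HAProxy on 6379; only the Redis node
# currently reporting "role:master" passes the health check.
frontend redis_frontend
    bind *:6379
    mode tcp
    default_backend redis_master

backend redis_master
    mode tcp
    option tcp-check
    tcp-check send PING\r\n
    tcp-check expect string +PONG
    tcp-check send info\ replication\r\n
    tcp-check expect string role:master
    tcp-check send QUIT\r\n
    tcp-check expect string +OK
    server redis-01 10.0.0.11:6379 check inter 1s
    server redis-02 10.0.0.12:6379 check inter 1s
    server redis-03 10.0.0.13:6379 check inter 1s
```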
» Configuring Terraform Enterprise
That's the prereqs out of the way from an external point of view. It's now time to configure Terraform Enterprise itself. First, we needed to ensure it was on a compatible version that supports Active/Active. Active/Active was introduced a while ago, so most people should be on a compatible version. For us, we always stay within two versions of the latest release, so it wasn't too much of an issue, but it's something to highlight.
The next step is to configure the application settings for TFE themselves, including enabling Active/Active and the Redis configuration that points to the solution we've just spun up. It's probably a bit small on the slide, but you will see in the configuration I've included below that the Redis host says HAProxy rather than the Redis cluster itself. That's how we're tricking TFE in that sense.
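To give a flavour of those settings (values are placeholders, and the exact keys should be checked against the docs for your TFE release), the application settings JSON for an automated install ends up looking something like this, with the Redis host pointing at the local HAProxy rather than at a Redis node:

```json
{
  "enable_active_active": { "value": "1" },
  "redis_host": { "value": "haproxy" },
  "redis_port": { "value": "6379" },
  "redis_use_password_auth": { "value": "0" },
  "redis_use_tls": { "value": "0" }
}
```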
For those that are used to the Replicated UI, it becomes deprecated when using Active/Active. As part of the migration, you run a couple of commands to disable the Replicated UI, so if you're familiar with it, that goes away. Then it's time for the all-important step: scaling up your Active/Active nodes.
Between two and five nodes are supported. At G-R, we run on two nodes and use the agents, as discussed, to back up our run capacity. That seems to be sufficient for us; maybe we'll introduce a third. But as we deploy from code, it's easy for us to spin one up if we need to. And again, if you're in the public cloud, auto-scaling groups are a perfect way to manage those nodes. But it's down to your individual requirements.
» Current architecture
You'll end up with something like this: active TFE nodes fronted by a load balancer. We've got TFE looking at HAProxy, HAProxy monitoring Redis for the current master, and Sentinel under the hood providing cluster-like capabilities — failover, health checking, etc., for Redis itself — ensuring Redis stays healthy.
And then the existing DR services: this is replicated to our DR site, but offline. In terms of deployment and security, this is fully autonomous, with no human interaction involved in deploying it. We deploy everything from code, based on zero trust.
If we go back to the original slides where we talked about the issues: we had a single point of failure. Well, that's now gone. We've got more than one active server; we could have two to five. Increased performance: we've already doubled our run capacity by introducing another node. We could triple it if we added another, quadruple it with one more after that, and so on. So, the customer experience benefits straight away from us going to Active/Active.
» Observability
It might be useful to touch on observability and how we monitor this environment. For TFE, we use Prometheus and Grafana to expose metrics and have nice Grafana dashboards that management love. In the top right-hand side, you'll see I've included some run data. That's how we monitor ongoing runs within the environment. We use ELK for audit logging and log troubleshooting.
For HAProxy, we use the default stats backend. In this image, you'll see the middle line is green. That's highlighting which node is currently the Redis master — a stat that we can expose. Then, for Redis itself, as I mentioned earlier, we deploy the Filebeat and Telegraf containers to expose metrics and logs to also consume within ELK, Prometheus, and Grafana.
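As an illustration of the Prometheus side (target hostnames, ports, and job names here are hypothetical, and TFE's metrics endpoint has to be enabled in the application settings for this to work), the scrape configuration is nothing exotic:

```yaml
# Prometheus sketch: scrape TFE's metrics endpoint and the Telegraf
# exporters on the Redis nodes. All targets below are placeholders.
scrape_configs:
  - job_name: "tfe"
    metrics_path: /metrics
    static_configs:
      - targets:
          - "tfe-node-01.example.internal:9090"
          - "tfe-node-02.example.internal:9090"
  - job_name: "redis-telegraf"
    static_configs:
      - targets:
          - "redis-01.example.internal:9273"
          - "redis-02.example.internal:9273"
          - "redis-03.example.internal:9273"
```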
» Rolling rebuilds
Now for one of our team's key achievements out of moving to Active/Active. As mentioned previously, our deployment process is fully autonomous. There's no human interaction. Alongside having multiple active servers with the ability to deploy in a fully autonomous manner, we rebuild our environments once a month on a schedule — no human interaction.
We turn up in the office to a message in Slack saying, "TFE prod was rebuilt this morning." Literally no involvement from us. This is done through the use of custom Jenkins pipelines. They rebuild Redis, then rebuild TFE, performing health checks throughout, with the ability to roll back and alert in an automated manner — so there's a rollback plan in the event of any failures.
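Purely as an illustration of the shape of such a pipeline (every stage body below is a placeholder for our real Ansible and Terraform tooling, and the schedule and script names are made up), a declarative Jenkinsfile for the rebuild might look like this:

```groovy
// Sketch of a scheduled rebuild pipeline. Stage contents are placeholders.
pipeline {
    agent any
    triggers { cron('H 2 1 * *') }  // e.g. run in the early hours on a monthly schedule
    stages {
        stage('Rebuild Redis cluster') {
            steps { sh './rebuild-redis.sh' }
        }
        stage('Rebuild TFE nodes') {
            steps { sh './rebuild-tfe-nodes.sh' }
        }
        stage('Health checks') {
            steps { sh './healthcheck-tfe.sh' }
        }
    }
    post {
        failure {
            sh './rollback-tfe.sh'  // automated rollback path
            slackSend(message: 'TFE rebuild failed and was rolled back')  // requires the Slack plugin
        }
        success {
            slackSend(message: 'TFE prod was rebuilt this morning')
        }
    }
}
```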
The key point of this piece of work is that we automatically rebuild, not patch (we don't do any patching), every TFE environment once a month on a schedule, without any human interaction. That's pretty cool. That's it from me. I'll hand you back to Carl. But if you have any questions about what I've spoken about, please grab me after this, or I'll definitely be on the terrace.
Carl Heinst:
Thank you, George.
» Benefits
I'll now summarize our presentation and discuss the benefits and outcomes it's brought us. First and most importantly, as George said, we've doubled our workspace execution capacity. Whereas before, we were restricted to eight concurrent runs on a single application server, we now have a total of 16 spread across two. This leads to a superior customer experience for small and large workspace owners.
By having an additional application server and using anti-affinity rules at the hypervisor level, the impact of large workspace execution is lessened. Even if both servers are hit simultaneously by large workspace runs, we are not allocating them against the same CPU. We're using two distinct hypervisor CPUs to perform that execution. Combining this with TFE agents mitigates most of our performance bottlenecks and provides options for dealing with any future ones.
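As a hedged example of what such a rule looks like as code (we haven't tied this to a specific hypervisor here, so take the vSphere provider below purely as an illustration with placeholder names), an anti-affinity rule keeping the two TFE VMs on separate hosts reads roughly like this:

```hcl
# Sketch: keep the two TFE application server VMs on different hosts.
# Datacenter, cluster, and VM names are placeholders.
data "vsphere_datacenter" "dc" {
  name = "dc-01"
}

data "vsphere_compute_cluster" "cluster" {
  name          = "cluster-01"
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_virtual_machine" "tfe_nodes" {
  for_each      = toset(["tfe-node-01", "tfe-node-02"])
  name          = each.key
  datacenter_id = data.vsphere_datacenter.dc.id
}

resource "vsphere_compute_cluster_vm_anti_affinity_rule" "tfe" {
  name                = "tfe-anti-affinity"
  compute_cluster_id  = data.vsphere_compute_cluster.cluster.id
  virtual_machine_ids = [for vm in data.vsphere_virtual_machine.tfe_nodes : vm.id]
}
```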
As George said, Active/Active can now scale to five application servers in a cluster, unlocking the option for horizontal scale. We could reduce the T-shirt size of our application servers and deploy more of them. Those dangerous asteroids won't stand a chance. And the nicest benefit of all is happy customers. Our desire is for TFE to be a utility service within G-R. When you push a button, it's there.
» Outcomes
The amazing work George and I have discussed with you was a combined effort across the entire Infrastructure Automation Team. While the two of us are fortunate enough to be here talking to you today, I would be remiss if I didn't mention the rest of the team. Please indulge me as I thank Sam Calvert, Rudy Zau, Matt Sheridan, Nick Mackelroy, Tom Iskra, and of course, George Short, for the hard work and technical excellence that has brought us to this point.
Finally, let me review the outcome of moving to Active/Active: TFE running at N-2 on 2x2 application servers, 100% immutable, and deployed from code (because, as one indelicate member of the Infra Auto Team declared, "patching is for suckers"), rebuilt in 90 minutes, every four weeks, fully automated, with no customer downtime.
Thank you very much.