5 Lessons Learned From Writing Over 300,000 Lines of Infrastructure Code
This talk is a concise masterclass on how to write infrastructure code. Yevgeniy (Jim) Brikman shares key lessons from the “Infrastructure Cookbook” developed at Gruntwork while creating and maintaining a library of over 300,000 lines of infrastructure code that’s used in production by hundreds of companies.
Come and hear Gruntwork's war stories, laugh about all the mistakes they made along the way, and learn what Terraform, Packer, Docker, and Go look like in the wild. Topics include:
- how to design infrastructure APIs
- automated tests for infrastructure code
- patterns for reuse and composition
- patterns for zero-downtime deployments
- refactoring
- namespacing
- versioning
- CI/CD for infrastructure code
Speakers
- Yevgeniy (Jim) Brikman, Co-founder, Gruntwork
Transcript
Hey, everybody! Thank you for coming. In this talk, I'm gonna try to share some of the lessons that we learned from writing 300,000 lines of infrastructure code at Gruntwork. First lesson: I actually originally called this talk "10 Lessons Learned." Picking a number before you write the talk is a really bad idea, because then you write it and it's like three hours long, and you're like, ugh, okay.
It's gonna be five lessons, which hopefully will be enough for now. The goal is, as you start writing infrastructure code, as you start using some of these tools, hopefully you'll be able to avoid some of the mistakes, some of the silly little things that we did. Before I get into that, I wanna set the context a little bit. I wanna give you a little bit of a confession.
Here's the truth. I've been doing this stuff for a while, working on DevOps and infrastructure and the reality is DevOps is still very much in the stone ages. I don't say that to be mean, we're new to this. We've only been doing DevOps, we've only been figuring out SRE practices, how to do software delivery for a little bit of time, and this is a little scary because we're being asked to build infrastructure that looks like this (skyline view of Shanghai).
We're supposed to build these incredible, modern, powerful systems, but it feels like we're trying to do that with tools that look like that (duct tape and paper clips). You got your duct tape and a little bit of bubblegum and good luck. Of course, if you read the headlines it sounds like everything is so cutting edge.
You open any web browser, you look at any blog post and you hear nothing but "Kubernetes" and "Docker" and "Serverless" and "Microservices" and buzzword this and buzzword that; and it sounds really cool, but it doesn't feel like that to me. I don't know about the rest of you, but when I'm actually doing this, when I'm actually working on infrastructure, it doesn't feel cutting edge. It feels a little more like I'm trying to heat up a slice of pizza on an iron.
I don't know what the hell is going on here, but this is DevOps. This is my new hashtag, #thisisdevops. It doesn't feel cutting edge, it feels like this. I mean, I don't know, I guess this works but why? Why are we doing it this way? This is crazy, alright. Nothing fits together right, I don't know, come on man. Like this (badly spliced power cords), this is probably an accurate depiction of the typical CI/CD pipeline at most companies. This is what it feels like.
We don't talk about that enough. We don't admit it to ourselves often enough. We like to talk about all the cool stuff, but we don't talk enough about the fact that this stuff is hard. Building production-grade infrastructure is hard. It's really hard, and it can actually be really stressful. I think, generally speaking, it's probably not crazy to say that people in the DevOps industry tend to be a little more stressed out than other parts of software engineering. It's also really time-consuming, and this aspect is always hidden, because you only see the end result. You read a blog post and at the end of the blog post, it's like, "and ta-da! Kubernetes." You're like, "Wow, that sounds great." But you have no idea what time or work went into it.
So I wanna share with you some of the rough numbers we have found from working with a lot of companies, doing this stuff a lot of times. And these numbers are probably best case scenarios, to be honest with you.
Production-grade infrastructure timelines
Hopefully you can read that, this is what it's gonna take to write infrastructure code that you can actually use in production. I'll come back to what I really mean by "production-grade infrastructure" a little later in the talk, but here are the basic numbers.
Managed services
If you're using a managed service, something that AWS or Azure or GCP is managing for you, you can probably get up and running, in a production capacity, in a couple of weeks: 1-2 weeks, because the managed service provider is doing all the hard work.
Distributed system (stateless)
But if you're gonna spin up your own app and it's some sort of distributed system—but it's stateless, it's not writing to disk or anything that you care about on disk; like a Node.js app or some sort of frontend application—that's gonna be double, 2-4 weeks.
Distributed system (stateful)
You wanna build a distributed system that's stateful, something that writes to disk and you don't wanna lose that data—something like setting up your own Elasticsearch cluster, your own Kafka cluster, MongoDB—now we're talking 2-4 months. We're gonna jump an order of magnitude here, and that's for one distributed data system. If you have multiple, it's 2-4 months for each one, and then the interactions between them actually make things even more complicated.
Entire cloud architecture
Finally, if you wanna put together your entire architecture from scratch, we're talking a 6-24 month project. 6 months is probably for a tiny little startup, 24 months is more realistic for most larger companies, for some, it'll take far, far longer.
So when you read those blog posts that are celebrating these incredible milestones, put a little checkbox in your head that these are the timeframes we're talking about, these are the best case scenarios for things that are understood.
Infrastructure as code is making us faster
The good news is a lot of this stuff is getting better. I don't want to be a downer, there are problems but they are getting better. One trend that I really like, I think something that's very positive is we're able to manage more and more of what we do as code. We're seeing this across the stack. We're going from manually provisioning hardware to infrastructure as code. We're going from manually configuring our servers to using configuration management tools and using code. Going from manual application config to configuration files. Everything is moving to code.
There's a lot of benefits to code, this is why I'm excited about it. Hopefully if you're at this conference, you don't need to be convinced too much of the benefits of code; but it's everything from being able to automate things instead of having your poor SRE do everything by hand 300 times, you get to store the state of all of your infrastructure in version control, you get to review every change, you get to test every change. The code is documentation, so it's not just locked in somebody's head, and you get reuse, so you can use code written by other people as well.
At Gruntwork, what we've done is, we've kind of run with this idea of infrastructure as code and we built a library of reusable code. Most of it is written in Terraform, Go, Bash, I'll come back to the tools we use a little bit later on. We have these off the shelf solutions that give you production-grade infrastructure for a whole bunch of different technologies.
We have a whole lot of customers using this stuff in production. What that really enables is you go from these timeframes of weeks and months to do these things from scratch, to about a day, mostly because the Gruntwork team has invested that time on your behalf. That's the benefit of code, you get to reuse us toiling away for months and years on this stuff—and we've been working on it for a while. We've been working on this library for more than 3 years, it's well over 300,000 lines of code. It's a pretty sizable amount of infrastructure code. So based on all the ways we got this wrong, hopefully I can help teach you all some pretty useful lessons.
I'm Yevgeniy Brikman, go by the nickname 'Jim', I'm one of the co-founders of Gruntwork, author of a couple books, "Terraform Up & Running" and "Hello, Startup." These books talk quite a bit about these DevOps concepts and these are the five lessons that I'm gonna share with you today.
First, I'm gonna go through a checklist of what it really means for infrastructure to be ready for production. Then I'll talk about the tools that we use and some of the lessons we learned around that. I'll talk about how to build modules in a way that they're actually useful, I'll talk about how to test your infrastructure code, it's a very rare topic. Then I'll talk about how to release this stuff into the wild.
Lesson 1: Use an infrastructure checklist and be realistic
We'll start with the checklist. When I tell people about these numbers, when I show somebody that's relatively new that it's gonna take this long, I get a lot of skeptical faces. Usually they look something like, "Come on, really? 6 to 24 months? I read a blog post, I read this thing, I'll get this done in a couple days." No, you won't. It's gonna take a long time, you're just not gonna. So why? I get a lot of incredulity and people are asking, "Why? Why does it take this long?" So I think there are two main reasons, they're very similar but they have slightly different flavors to them.
Yak shaving
The first one is something called yak shaving, who here is familiar with the term "yak shaving?" Wow, that's less than half, okay. For the rest of you, you're welcome. This is gonna be a term you're gonna use probably every day of your life from now on, so pay attention. Here's what yak shaving is: yak shaving is this seemingly endless list of tasks you have to do before you can work on the thing you actually wanted to do that day.
The best definition I've seen of this is from Seth Godin's blog, and he gives an example like this: you come home one day and you want to wax your car, so you go out to your backyard, you grab the hose and oop, crap, my hose is broken. No problem, I'll go to Home Depot, I'll get a new hose. You get in your car, you're just about to go and then you remember, ah, to get to Home Depot I need to go through a bunch of toll booths, I need an E-ZPass. No problem, I'll go to the neighbor's house, grab his E-ZPass. Then you remember, he's not gonna lend me his E-ZPass because I borrowed his pillows, I gotta return the pillows first. So you go rooting through your house, you find the pillows and you realize that all the yak hair has fallen out of the pillows while you were borrowing them, and so the next thing you know, there you are, sitting in a zoo, shaving a yak, just so you can wax your car.
So that's yak shaving, and if you do DevOps, you know what I'm talking about. You wanna go deploy a service over here and then there's some TLS thing you have to do and while you're working on that you uncover this weird bug, then while working on that, you broke something in production. Then you're miles away from what you originally wanted to do, and this happens all the time. DevOps just seems to consist of a thousand tiny little interconnected systems and they all break and everything you do, you end up yak shaving. The weirdest part is, at the end of the project you're like, "The hell did I just spend the last three months doing? I feel like this should have been five minutes." It's maddening, so yak shaving is one big reason.
A long list of things to do
The second reason, there is genuinely a long checklist of things you have to do to build infrastructure for production. So, I'm gonna share our checklist with you, these are the things we look for when we're building infrastructure that we wanna use in production. What I mean by that is this is the kind of infrastructure you would bet your company on. If you're gonna deploy a database, you don't want it to lose your data because if you do, that could be a company ending event. This is what I'm talking about when I say production-grade.
Here's the checklist (see slide 45 in the embedded slides). Actually, this is just the beginning of the checklist. The first part of it, most people are familiar with. These are the parts that just about everybody can guess. You have to have a way to install your software, duh. You have to configure it, you know, port numbers, users, folders, etc. You have to provision some hardware to run your software, and you have to then deploy it, and potentially update it over time. Everybody knows this, this part is easy, but that's part 1 of 4. This is what you've got; next is what most people forget.
Here's part 2: security, which is usually an afterthought. Things like, "How do I encrypt things in transit?" That whole TLS thing. "How do I encrypt things on disk?" Authorization, authentication, secrets management.
How about monitoring? If you want to deploy something to production, you probably wanna know what's going on with that thing, have some sort of metrics and alerts. You want logging: you wanna rotate the logs on disk so you don't run out of disk space, and you wanna aggregate them somewhere so you can search them. Backup and restore: again, if you're running a distributed data system, or any data system, you probably wanna know that if it crashes, you're not gonna lose all your data and go out of business.
Here are a few more items that tend to get forgotten. How about the networking setup? All your subnets, all your route tables, static and dynamic IPs. How about SSH access? VPN access? You have to think through these things; they're not optional, you need them for almost every piece of infrastructure. Then you gotta think about high availability: what happens if a server crashes? What about a whole data center? Scalability: what happens when traffic goes up and down, how do you react to that?
Performance: there's a whole field of things that you have to do. If you're deploying a JVM app, you have to think about GC (garbage collection) settings. For just about anything, you have to think about memory and CPU. You might have to do load testing. That could be months of work just by itself.
Then there's part 4 of 4; almost nobody gets to this page. People usually have dropped out long before this. Cost optimization: how do you make this thing affordable? Documentation: if you ask people, "Where are the docs for your infrastructure code?" they're like, "What?" Tests: it's rare to see tests for infrastructure code.
This is the checklist. These 4 pages are what you want to use. When you're building infrastructure, this is what you need to do for just about every piece. Now, not every single piece of infrastructure needs absolutely every single item here, but you want to at least make the conscious decision that we're going to build these aspects and not build some others. You don't want that to happen by accident. You don't want to find out 6 months later, "Oh, crap. I forgot about backup and restore," and you usually learn that the hard way.
Key takeaway: Next time you go to build a piece of infrastructure, use this checklist. I'll be sharing this slide deck. Actually, it's already on SlideShare, but I'll tweet out the link, and I'll post it in the HashiConf hashtag. We also have, on the Gruntwork website, a checklist for your entire architecture, not just one piece of infrastructure but everything that you're going to do. You can find that on our website: gruntwork.io/devops-checklist/. It's also in the footer, so it might just be easier just to go to Gruntwork.io and click it.
Okay, that's the checklist.
Lesson 2: Learn your tools well
The second thing that's worth thinking about is the tools. You know what you need to do; what are the tools you're going to use to actually implement that checklist? When we're thinking about the tools that we want to use, these are usually the principles that we're following. We want things that let us, of course, define our infrastructure as code. We want them to be open source and have a large community. We don't want to have to build everything ourselves.
Hopefully, they're cloud agnostic and work across many different providers. We want tools that have very strong support for code reuse and composition. That'll come up in the next section. We don't want to run extra infrastructure. I want to deploy my infrastructure, not deploy infrastructure to deploy my infrastructure. I want to keep it simple. We like immutable infrastructure practices so, when possible, we like to use tools that make that easier.
The tool set that we're using now, and bear in mind, this is not some religious debate here. This is just what we're using today because, for practical reasons, it makes sense:
Our basic layer, the underlying infrastructure, we use Terraform. That's going to lay out all of the basic networking configuration, load balancers, databases, and, of course, all of the servers that we're going to run our apps on. Those servers, we need to tell them what software to run, so on top of that, we're going to use Packer to build our virtual machine images. Some of those VMs will have things like Kafka installed or Redis. Some of those VMs are going to have Kubernetes agents or ECS agents, and so they're going to form a Docker cluster.
In that Docker cluster, we're going to be able to run Docker containers. Generally speaking, the way we work as of today is we use Docker containers for stateless applications, and stateful applications we run outside of Docker, usually as their own VMs. That's starting to change. I think Docker and Kubernetes are becoming a little more mature in that space, but I would still be terrified to run MySQL on top of Kubernetes. I just wouldn't do it.
The final piece of technology that we're using here, under the hood of all of this stuff, is our duct tape. This is our paper clip and our twist tie: a combination of scripts like Bash and Python. We use them, for example, to define the Docker containers. We use them as glue code. We also use Go: anytime we need a standalone binary that's going to work on any operating system, we'll write that application in Go.
Those are the main tools, basically: Docker, Packer, Terraform, and some mix of general purpose programming languages to hold everything together. Here's an important note: These are the tools we've picked. They fit our use cases. You're welcome to use them. You might find other tools are a better fit for your use cases, but the key insight here is not, "You should use those tools." That's not really what I'm trying to teach you here. The key insight here is that, while tools are useful, they aren't enough. You can use whatever tools you want in the world. It could be the most incredible piece of technology out there, but if you don't change the behavior of your team, it's not actually going to help you very much.
Here's why. If you think about it, before you start using infrastructure as code, the way that your team is used to working is: if you need to make a change, you do it directly. You SSH to the server, and you run some command. You connect to the database, and you run some query. What you're saying now is, "I want to change this," so now there's this level of indirection, this layer of indirection. You need to go change some code. There might be code reviews, there might be some tests, then you've got to run a tool, and that thing is going to apply the actual change. That's great. There's a lot of good reasons to do that, but these pieces in the middle, they take time to learn and understand and use properly.
If you don't invest that time, you're going to have the following scenario: Something is going to break, and the ops person is going to have 2 choices. They can either do what they already know how to do, make a change directly, 5 minutes. Or they can spend 2 weeks learning that stuff. If it's a rational person, every time they're going to choose to make the change manually. If they do that, now your code, whatever tool you chose, that code is not going to accurately reflect reality. The next time somebody tries to use the code, they're going to get an error. Then they're not going to trust that code. "I tried to run Terraform. I got a weird error. Nevermind." They're going to go and make a change manually. Then the next person is going to be even more screwed and the next person, and, within a few weeks, that incredible tool that you spent all that time working on is basically useless because people are changing things by hand.
The problem with that, of course, is that doesn't scale. Making changes manually, if you have a few people, a few pieces of infrastructure, fine. But as your infrastructure grows, as your team grows, this does not grow with it very nicely, whereas this does. The key takeaway here is you need to change behavior. If you're going to be adopting Terraform, Packer, Docker, any of these tools, you have to give people the time to learn these things, the resources to learn these things. You know, do training, go to conferences like this, go read blog posts.
And you have to make sure that they're bought into it. The last thing you want to do is have an ops person who's just like, "Ah, I hate this Terraform thing. I'm never going to use it." That's not going to work. Terraform's not going to help you. You've got to change behavior. That's really the takeaway here.
Lesson 3: Build infrastructure from small, composable modules
Okay, third item is how do you build modules? Something that we've found, something that we learned the hard way, is even if you pick a good set of tools, even if you change behavior, and you just sit somebody down who hasn't done this before and tell them, "okay, go build your infrastructure," they tend to shove all of it into a single file or a single set of files that's all going to get deployed together. The dev environment, test environment, stage environment, everything is just shoved into one super-module, essentially. What you'll find out very quickly and very painfully is this has a huge number of downsides.
For one thing, it's just going to be slower. We've had customers where running terraform plan would take 5 minutes, just to get the plan output, because there are like 3,000 resources, and you have to make API calls and fetch data about them. Even if that wasn't as big of an issue, the plan output is impossible to understand. If you run terraform plan and you get 900 lines of output, no one's going to read that. You're just going to say, "Yeah, I hope it's good. Fine. Apply, yes." You're not going to catch that little line in the middle that's red that says, "By the way, I'm deleting your database." You're not going to see that if you have a gigantic module.
It's also harder to test. We'll talk about testing a little later, but the bigger the module is, the harder it is to test it. It's harder to reuse the code, because if everything is in one giant super-module, it's hard to reuse parts of it. Another one that a lot of people miss is, if everything is in one super-module, then to run anything, you need permissions and access for everything. Basically every user has to be an admin because all of your infrastructure is in there. That's terrible, from a security perspective.
Finally, your concurrency is limited, basically, to one. Only one person at a time can change anything in any environment, ever. That's obviously a problem. Of course, the biggest issue with all of this is, if you make a mistake anywhere in this code, you could break everything. You might be editing your staging environment, you make a silly typo, and you're not paying attention, and because it's all shoved into one set of code, you blow away production. This happens. This isn't hypothetical. This happens on a regular basis.
Here's what I'm going to claim: large modules, in infrastructure code world, are harmful. They're an anti-pattern. They're a very bad idea. What you really want is isolation. You want the different parts of your environment to be separated from each other at the code level and, obviously, in the infrastructure level. That's why you have separate environments in the first place. Right? You want stage to be isolated from prod.
You actually also want isolation between different types of infrastructure. For example, you'll probably deploy your VPC once, and you're probably not going to touch it again for a year or a long time. Whereas your frontend app, you might be deploying that 10 times a day. If they're in the same set of code then, 10 times a day, you're putting your entire VPC at risk because of some silly typo, some silly wrong command, for no reason whatsoever. There's absolutely no reason to do that.
You want to group your infrastructure by risk level, by how often it gets deployed. The way you do that is you take your architecture, and it could be arbitrarily complex. You can have your VPCs, you can have your servers, your databases, load balancers, etc. You're not going to define that out of one gigantic super-module. You're going to define it out of a bunch of tiny, little, individual, standalone modules. This pattern is the only one we've seen that works well, especially at scale, especially as your infrastructure gets more complicated. It's building it out of these little, tiny building blocks.
What that's going to look like, in terms of code, is you'll start with your environments, as the top-level items. So dev, stage, prod, and they're all separate from each other. Within each environment, you'll have the different types of infrastructure, so maybe the VPC is separate from your database is separate from your frontend application. Of course, you don't have to copy and paste between all of these. The actual code itself can be implemented in these small, reusable modules. Even the modules themselves can be built out of still smaller pieces. It's kind of like modules all the way down.
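As a rough sketch of what that can look like on disk (the folder and module names here are purely illustrative, not from the talk), each environment gets its own top-level folder, each type of infrastructure gets its own sub-folder, and the actual logic lives in small, reusable modules:

```
live/
  dev/
    vpc/
    mysql/
    frontend-app/
  stage/
    vpc/
    mysql/
    frontend-app/
  prod/
    vpc/
    mysql/
    frontend-app/
modules/
  vpc/
  mysql/
  frontend-app/
```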
There are a lot of advantages to doing this. Basically, it's the opposite of all the problems I talked about. Things will run faster. You can limit permissions based on just what you need for that one particular module. If you break something, the damage you do is limited, etc.
Here's a quick example that I'll browse through. In the Terraform registry, we have a module to run Vault, which is a reasonably complicated distributed system. This is some open-source code. If you take a look at that module, there are three key folders. This is how we organize the code for even one piece of infrastructure.
Those three folders are modules, examples, and tests. I'll actually show you in IntelliJ itself. Here we have Vault ((23:43)[https://youtu.be/RTEgE2lcyk4?t=1423]). This is the modules folder. That font might be a little small. Essentially, in here, we have a few things. I'll switch back to the slide deck, since the font is not super readable. We have, for example, Vault cluster as a separate sub-module.
You notice, even though it's just Vault, it still consists of a whole bunch of different little sub-modules. The Vault cluster sub-module is Terraform code to deploy an autoscaling group, a launch configuration, a security group, whatever you need to run a cluster of Vault servers.
Separate from that, we have other modules, such as Vault ELB. This is Terraform code to deploy a load balancer. Security group rules are separate. Install Vault isn't Terraform code; that's a Bash script. You can see it takes in what version of Vault to install and, if you're using Vault Enterprise, the path to where the binary is. We also have run-vault in here. That's a script that will start Vault on boot, and it will configure TLS, port numbers, etc.
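Pulling that together, a repo like that tends to be laid out roughly as follows. This is a sketch based on the description above; check the actual open-source repo for the exact names:

```
modules/
  vault-cluster/                # Terraform: autoscaling group, launch configuration, security group
  vault-elb/                    # Terraform: load balancer
  vault-security-group-rules/   # Terraform: reusable security group rules
  install-vault/                # Bash: installs Vault (e.g., into a Packer-built image)
  run-vault/                    # Bash: starts and configures Vault at boot
examples/                       # different ways to combine and compose the sub-modules
test/                           # automated tests that deploy the examples
```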
Why do we have all of these tiny little sub-modules in that code base? The reason for that is it gives you a lot of power. You're going to be able to combine and compose those pieces in many different ways. If everything was in one gigantic module, it either works for your use case or it doesn't.
This isn't some hypothetical thing where you just want to support every possible use case out there. It turns out that you're going to want to run your own infrastructure in different ways in different environments. For example, in production, you might want to run Vault across three separate nodes. Vault, usually, you run it with Consul, you might want to run Consul on three separate nodes. In a pre-prod environment, you might want to co-locate all of those on maybe one server or just one autoscaling group.
You need to write your code in a way that you can do that, and the best way to do that is to break it up into these tiny little modules. Just to recap, these are actually patterns. You might want to commit them to memory or refer back to the slide deck later when you're building your own modules, because they come up again and again and again. Almost every piece of infrastructure has a similar structure. You have the modules folder. There's some sort of install script, because, as you can remember from the checklist, you need to install the thing. There's a run script, which is what's going to configure it during boot. There's going to be some sort of Terraform code to provision the infrastructure for that module, and there's also going to be a bunch of these reusable components that you might want to share.
Security group rules are a separate module because I might want to use it with my Vault cluster, but if I'm running Vault co-located with Consul, I might want to attach the same Security group rules to my Consul cluster instead. These little pieces are very, very important, and each of them exposes a bunch of input variables and output variables so that you can wire them all together in a bunch of different ways.
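To make that wiring concrete, a root configuration might compose two of those sub-modules roughly like this. This is a sketch only: the source paths, input variables, and the asg_name output are hypothetical and do not necessarily match the real Vault module's interface.

```hcl
# Hypothetical wiring of sub-modules via input and output variables.
# Paths and names are illustrative, not the actual module interface.

variable "vault_ami_id" {
  description = "AMI built with Packer that has Vault installed"
}

module "vault_cluster" {
  source       = "./modules/vault-cluster"
  cluster_name = "vault-prod"
  cluster_size = 3
  ami_id       = var.vault_ami_id
}

module "vault_elb" {
  source       = "./modules/vault-elb"
  cluster_name = "vault-prod"

  # Feed an output of one sub-module into an input of another.
  vault_asg_name = module.vault_cluster.asg_name
}
```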
You can have one configuration of Vault that's set up like this ((26:34)[https://youtu.be/RTEgE2lcyk4?t=1590]) or, using the same set of code, you can configure it differently with a different load balancer or a different way to install it, configure it. You have that flexibility and power, and you're going to use it yourself. Let alone what happens as the infrastructure changes over time.
There's an examples folder, and that's going to show you examples of different ways to combine and compose those little sub-modules. Now, we build that, in part, as documentation; it's executable documentation. It turns out it's more than that. Some of these examples are Terraform code, some could be a Packer template, etc. Where the examples become really powerful is with testing. In the next section, I'll talk about testing, but the key thing to remember here is, when we're doing testing, what we're testing is usually those examples. We're usually deploying those actual examples as the way to validate that these modules work. The examples are really, really, really useful.
Key takeaway: Build your infrastructure out of these small, composable pieces. If you see a gigantic module, you should react to it a lot like you would if you were using Java or Python or any general-purpose language and you saw a single function that was 10,000 lines long. We all know that's an anti-pattern. You don't write 10,000-line-long functions. It's the same in infrastructure code. You should not have 10,000-line-long infrastructure modules. They should be built out of these little pieces.
Lesson 4: Infrastructure code needs automated tests
Okay, how do we test this stuff? Something that we've found, again, the hard way, and something that you're going to find as you start on this journey, is that infrastructure code rots very, very quickly. Pretty much everything we're using is changing all the time. Terraform 0.12 is about to come out; that's a massive change to the whole language. Docker is changing, Packer, Kubernetes, AWS, GCP. These are all moving targets. None of them stand still. Your code, even if it works today, might not tomorrow, and this happens really, really quickly. Much faster, I think, than in a lot of other programming languages.
Really, the best way to put it is not that it rots very quickly, but I think this is the general law: Infrastructure code that does not have automated tests is broken. I don't mean it's going to be broken in the future. I mean it's probably broken right now.
Something we found out the hard way is, even if you test this thing manually, you're building a module to run Vault, you deploy it, it runs, you can store secrets, you can read secrets. As best as you can tell manually, it's working. When you take the time to write the automated test, every single time, we have caught non-trivial bugs in our code. Every single time, almost without exception. If you don't have tests on your infrastructure code, it's broken. Just be aware of that.
Now, we know how to test code in general-purpose languages: Bash, Go, etc. We can write unit tests that mock the outside world, and we can run those tests on localhost. For infrastructure code, this is a little harder. There's no localhost for Terraform. If I have Terraform code to deploy an AWS VPC, I can't deploy that onto my own laptop. There's also no real unit testing. I can't, for the most part, mock out the outside world, because the only thing Terraform does is talk to the outside world.
Really, all of your Terraform testing is actually going to be integration tests, and the test strategy looks like this:
You're going to deploy the infrastructure for real. Remember those examples that we saw earlier in the modules? You're going to deploy those examples for real. If you're building infrastructure for AWS, you're going to deploy it into a real AWS account.
You're then going to validate that it works. Now, how that works obviously depends on your use case. For Vault, our automated tests actually initialize the cluster, they unseal the cluster, they store some data, they read some data.
At the end, you're going to undeploy that infrastructure.
It's not very sexy, but this is how you test infrastructure code. To make that a little easier, we've open-sourced a tool called Terratest. I did a talk here last year. It was not open source then, so I'm happy to announce that, now, it is available as an open-source library.
It allows you to write your tests in Go. It gives you a lot of utilities that make this style of testing easier. The philosophy here is, basically, if you're trying to figure out, "How do I test this module using Terratest?" All you really need to think about is, "How would I have tested it manually?" Because that's all we're doing. We're just automating whatever it is we would have done by hand.
Here's a typical test structure that you would use with Terratest. You can see, at the top, what we're testing here is one of those examples in the Vault repo. So we tell Terraform, "Okay, this is where the example code lives." At the start of the test, we're going to run terraform init and apply to deploy that example into a real AWS account. We're then going to validate that this thing is working.
Now, what does that do? That, of course, is very specific to whatever you're testing, but Terratest has a whole suite of utilities to make that piece easier as well. There are libraries for looking up IP addresses of EC2 instances. There are libraries for making HTTP calls, checking the response code and body, and doing it in a retry loop, because you'll find that there are a lot of asynchronous and eventually consistent systems here.
There's a lot of code that uses this SSH helper. It basically SSHes to a server, executes a command, and gives you back the standard out and standard error. There are all these utilities built in for validating that your code actually works.
Then, at the end of the test (that's what that "defer" word means: "at the very end"), we're going to run terraform destroy to clean up after ourselves. These are all Terratest helpers. They're not going to make the tests easy to write, but a heck of a lot easier than writing them from scratch. Terratest, these days, supports AWS, GCP, Oracle Cloud. There was a talk today about using it with Azure, so you should be able to use it for a whole lot of use cases.
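Putting those pieces together, a minimal Terratest test tends to look roughly like the sketch below. The example folder path, the Terraform output name, and the HTTP check are hypothetical; the real Vault tests do more (initializing and unsealing the cluster over SSH), but the init/apply, validate, defer-destroy structure is the same.

```go
package test

import (
	"testing"
	"time"

	http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
	"github.com/gruntwork-io/terratest/modules/terraform"
)

func TestVaultClusterExample(t *testing.T) {
	t.Parallel()

	// Point Terraform at one of the example configurations in the repo.
	// The path and the output name below are illustrative.
	opts := &terraform.Options{
		TerraformDir: "../examples/vault-cluster",
	}

	// At the very end of the test, run "terraform destroy" to clean up.
	defer terraform.Destroy(t, opts)

	// Deploy the example for real, into an isolated sandbox account.
	terraform.InitAndApply(t, opts)

	// Validate that the deployed infrastructure works, retrying because
	// the system comes up asynchronously and is eventually consistent.
	url := "http://" + terraform.Output(t, opts, "load_balancer_dns_name")
	http_helper.HttpGetWithRetry(t, url, nil, 200, "OK", 30, 5*time.Second)
}
```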
One important note is that, when you're running these tests, we're going to be spinning up a lot of infrastructure and tearing it down all the time. Two important things to note about that.
The first one: You want to run your tests in a completely isolated sandbox account. Don't run them in prod. I hope that's obvious. Yeah, I wouldn't even use my staging or dev account. I would actually create a completely separate place to run these tests, where you don't have to worry, "What happens if I accidentally delete the wrong thing or create the wrong thing?"
Second tip is: Occasionally, the tests don't clean up after themselves very well. The test might crash, you might lose internet connectivity. You don't want to leave a whole bunch of resources around, and so we built a tool called cloud-nuke. As the name implies, it's quite destructive. Don't run it in prod, but, in your isolated sandbox account, you can use this to clean up resources older than a day or older than a week, and basically not lose lots of money.
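For example, you might run something along these lines on a schedule in the sandbox account (the exact flags can vary between cloud-nuke versions, so check its help output):

```bash
# Delete AWS resources in this account that are older than 24 hours.
cloud-nuke aws --older-than 24h
```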
Okay, one final thing to chat about with testing is this test pyramid. Usually, you're going to have a lot of unit tests, smaller number of integration tests, smaller number of end-to-end tests. The reason for that is, as you go up the pyramid, it takes longer to write those tests, they're more brittle, and they run slower. You want to catch as many errors at the bottom of the pyramid as you can.
Within infrastructure code, it's a similar idea. The bottom of the pyramid, that's going to be those individual sub-modules. You want to have small little sub-modules that you can test and be confident that they work as you put them together. Integration tests combine a few sub-modules, and end-to-end tests cover your entire stack. You're not going to have too many of those, because they're really slow. The unit tests for those individual sub-modules, they're not fast, but they're as fast as you're going to get, and then it gets slower and slower as you go up the stack. Check out the Terratest docs; we have a lot of best practices about how to speed up testing, so make sure to read that. Add tests.
Lesson 5: Release and promote versioned code from environment to environment
I'm almost out of time, so final section is releases. This one's really quick. I'm going to put together the whole talk for all of you. Here's how you're going to build your infrastructure from now on:
Use the checklist. Make sure you're building the right thing.
Go write code. Don't do the stuff by hand, go write code.
Write automated tests with Terratest.
Do a code review. Make sure your code is actually working. Have your team members take a look.
Release a new version of your code. What that really means is add a Git tag.
Now, you can take that code, this versioned artifact, and you can deploy it to QA. If it works well, then you can take the exact same code, because the artifact is immutable, deploy it to staging. If it works well, finally, to prod.
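Concretely, the versioned artifact is just a Git tag on the modules repo, and each environment pins to the tag it has promoted. Here's a sketch, with a hypothetical repo URL, module path, and tag:

```hcl
# stage/vault/main.tf (hypothetical): pin this environment to an immutable, tagged release.
module "vault" {
  source = "git::https://github.com/acme/infrastructure-modules.git//vault-cluster?ref=v0.3.2"

  cluster_size = 3
}

# Promoting to prod means pointing prod's configuration at the exact same
# ?ref=v0.3.2 tag once it has proven itself in QA and stage.
```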
Here's the key takeaway for this talk: we're going to go from that to this. We go from heating pizza on an iron to tested code that's been through a checklist and a code review, that's versioned, and that we roll out from environment to environment.
It's not easy. It will take a long time, but it is helpful. That's it. Thank you, everybody.