Upgrading Your Provider for Terraform 0.12
Learn about Terraform 0.12 and how you can update your providers to harness its language improvements.
Terraform 0.12 was released earlier this year with major improvements to HCL, the declarative configuration language used in Terraform. To take advantage of these great new features, provider maintainers will need to update the code in their providers to support Terraform 0.12.
In this talk, HashiCorp senior software engineer Paddy Carver will run down the checklist for upgrading your providers to 0.12 and point out some potential pitfalls. He'll also share some data about how the Terraform community is adopting version 0.12.
Speakers
- Paddy Carver, Senior Software Engineer, HashiCorp
Transcript
Good morning, HashiConf.
This is “Upgrading Your Provider for Terraform 0.12.” My name is Paddy Carver. I use he/him pronouns. I am a senior software engineer at HashiCorp, and you might recognize me from GitHub or Twitter. I go by @paddycarver on both, and if you're going to live-tweet this talk, please, by all means, tag me. I love having the little red dot when I get off stage. Please do remember that our conference hashtag is #hashiconf.
Today we're going to have some fun, and you know we're going to have some fun because I'm starting this talk off with disclaimers. Nothing says a good time like disclaimers right out of the gate.
First disclaimer: This is a provider development talk. If you're here and you don't know what provider schema is, first of all, you're valid, you're a developer, everything's cool, I don't think less of you. But some things might go over your head. If you're expecting me to talk a lot about HCL today, that's probably not going to happen.
My coworker Alvin Huang does have a talk, "Upgrading Terraform Code from 0.11 to 0.12," at 3:05 in our hallway track, and he's going to be talking through that, so you should definitely check that out if that's something you're interested in.
Also, I'm sorry, I'm going to pull a little bit of a Jedi mind trick on you. I am absolutely awful at writing abstracts, so this might not be the talk you think you're going to get. When I was writing the abstract, I was like, "This is going to be great. I'm going to take a provider and walk everyone through how you go about upgrading that provider for 0.12 support." It turns out that's not super-useful, so I decided not to do that.
Create tests for Terraform 0.12
If that is the talk you're looking for, this is the abridged version of it: You want to make sure your tests pass, and that you have tests. You want to upgrade to use Go modules. That's not strictly required, but it's a good idea. Just do that, please. You're going to go get github.com/hashicorp/terraform@v0.12.8. You're going to run your tests, and you're going to fix the test failures that come up, because there may be some. And then you're going to run the tests again.
You're going to keep cycling between those last 2 steps as many times as necessary. And that's what I want to talk about today. I want to talk about those last 2 steps, why they may be necessary, and why you may have to cycle between them.
Today we're talking about context. Today we're talking about, Why are tests going to break? Why did 0.12 happen in the first place? What's going on?
But if that's not your cup of tea, that's totally cool. We've got 3 other great talks going on right now. These fine people are doing a great job. Please feel free to sneak out the back. Go see them. I won't be hurt. It's totally cool. Get your money's worth.
That's the end of our disclaimers. After this point, you're all stuck with me.
I work for the Terraform Ecosystem Team. We maintain the providers that HashiCorp has a maintenance team for, which means the things that we pay people to maintain. That's not all of our providers; that's just some of them. It's a complicated and weird thing, how we figure out which ones those are. I don't really have time to go into that today, so I'm not going to.
The many people who work on Terraform
Specifically, I work on the Google Cloud Platform provider. I'm not the only person working on the Google Cloud Platform provider. I have a teammate. She's pretty great. But of course, HashiCorp co-maintains the Google Cloud Platform provider with Google. We have a team of engineers working at Google on the Google Cloud Platform provider, and they're all pretty great.
And the Terraform Ecosystem Team isn't just me and my coworker. There's a bunch of people, because we have multiple providers, and we need people to maintain all of them.
And we're not the only team at HashiCorp that's working on Terraform, or even on Terraform open source. The core team does things like maintaining our CLI, maintaining our protocol, and otherwise taking care of Terraform as a whole.
Our enablement team does things that make our lives easier, like maintaining the GitHub HashiBot. They take care of our provider SDK, and they write a lot of the documentation on how to develop a provider. They're really like a force multiplier on provider developers in general, and they're fantastic. I can't say enough nice things about them.
To keep all of these Terraform open-source teams working cohesively and building the right things, we have a design director, a product director, an engineering director, and a technical program manager, who not only manage Terraform open source, but also Terraform Enterprise.
And finally, HashiCorp and our partners aren't the only people building Terraform. Terraform's an open-source project, and so we have open-source contributors. In fact, we had over 4,700 open-source contributors to Terraform and its providers as of last week, I think, which is kind of a large number.
First of all, if anyone in this room is represented on this slide, could you just raise your hand, stand up, acknowledge yourself as you're comfortable or able to? Could we just give these people a round of applause? They do great work. You all are awesome.
I want to talk about the 4,750-ish people I just put on that screen, because it's easy to be like, "Yeah, 4,750 people. I've got 5 million users. That's just how software works." But that's the crux of what I'm talking about today.
There are about 1,600 people at HashiConf. This is the largest HashiConf we've ever held. If each and every person at HashiConf this year brought 2 friends with them, that is roughly the number of people contributing to Terraform over the last 5 years.
Washington state, where we're giving this talk right now, requires 1,500 inhabitants to incorporate a city. It's not important why I know that. Let's just accept that I do and move on.
Now, I'm not implying anything, but we could take all of our Terraform contributors and form, not 1 city in Washington state, but 3 of them. And really, I'm really not trying to suggest anything, but Init, Washington, Plan, Washington, and Apply, Washington, do have a nice ring to them.
My point is that a lot of people work on Terraform. And Terraform's only as useful as its providers. I don't mean to knock the work of the core and enablement teams, or the other things that go into making Terraform great, because they're necessary and make Terraform wonderful as well.
The importance of the provider
But at the end of the day, what a user sees, what a user interacts with, the part that's moving that does what the user wants it to, is the provider. It's the one interacting with the API.
Over time, the number of providers that Terraform has had has increased rapidly. We have almost 1,200 providers that are open source and available on GitHub today. That does not include providers that are developed internally at companies and never released to the public. That is startling growth over the last 5 years.
Terraform has a people problem. I hate classifying it as a problem, but it is a problem, because with all those people, and all those disparate projects, how do we keep everyone moving in the same direction? How do we cook with that many cooks in the kitchen? How do we alter the recipe to make sure that we're making the dish we intend to? As we learn more things, as we taste the dish, how do we adjust course?
Terraform’s provider evolution
I'm going to abuse that metaphor a little bit further, because we've adjusted our recipe quite a bit in the last 5 years. These are my top 10 favorite moments in the last 5 years, of times that we've changed provider development. I'm not going to go through them all today, because we don't have time, and they're really just the tip of the iceberg.
But we've evolved how we think about providers, and how we develop providers, a lot in the last 5 years.
And I have a secret to tell you, and before I tell you my secret, I need you all to promise to be cool. This is a safe space. Everyone just don't panic.
My secret is: Terraform works by accident. I don't mean that terraform apply works by accident, that you're going to run it on your production and it's going to fail at runtime. We work very hard to make sure that doesn't happen. We work very hard to make sure Terraform is reliable.
But the way that Terraform works, the way that we're able to tell what a provider is able to do or not, is an accident sometimes, and that's because Terraform's implementation is its contract. When you are a provider developer, and you try something out, and it works, and you ship it, at that point we feel responsible for it, and we feel like we need to keep it working.
So, a lot of things that people end up depending on are not some abstraction above our implementation. They're relying directly on the implementation details. And this is probably fine, as we've rewritten basically our entire implementation over the last year. Nothing can possibly come of this that would make us feel sad and regretful, because we definitely do not have any spacebar-heating moments, as was the case in that XKCD comic. No, sir. None at all.
Why is it like this? That's something I've been thinking a lot about. I want to talk about that today.
Learning about our own creation
Terraform's emergent. We didn't know, when Terraform was started, all the things people would use Terraform for. We didn't know all the things that it could do, and what people expected of it, and we didn't know the entire problem domain, just like any engineering project.
And over time, we had to deal with that emergent behavior. We had to deal with learning new things about our problem space and figuring out solutions for them. So, over time, we allowed the flexibility to go down and the abstraction layer to go up.
That's kind of the ideal: Over time, you lock in what the program is for, what it does, and how it works, and you take away some of the flexibility to solve other problems by adding abstractions that let you keep those details hidden.
But because we're so concerned with backwards compatibility, we didn't necessarily take away all of the hooks into those implementation details. That's what Terraform 0.12 was.
Some things will no longer work in 0.12
It was a little bit of cleaning house on technical debt, and getting back to working with abstractions, and taking away some of the flexibility. So a lot of what you're going to encounter as you upgrade your provider to 0.12 is that things that you were relying on, things that required some flexibility on Terraform's part, will no longer work.
I want to talk a little bit about what those are today. Give you some examples, and some insight into what to look out for, what to make sure you've thoroughly tested, and where you might find some sharp corners.
Because Terraform is like an ogre, or a cake, or a parfait, or an onion. It's got some layers. These are some of Terraform's layers. Just some of them:
HCL
HIL
Helper/schema
State
Provider RPC
Diff customization
Remote Backends
Complex types
Validation
Graph(s)
The layers of Terraform
We have HCL, which is our configuration language, which has a type system.
We have HIL. If you've ever used the dollar sign curly brackets, that's HIL. It's got its own type system, and also everyone's like, "This is a first-class part of Terraform." But it's really not. If you've ever noticed that you've got to smuggle it inside the double quotes, that's because to HCL, that's just a string, and then Terraform parses it as a string.
It's like, "Ah, yes. I understand. This is HIL," and then further parses it from there, which leads to some weird stuff.
There's helper/schema, which is how providers define their types, and so it has its own type system.
There's state, which represents types in a different way than the other 3.
There's the provider RPC, which is how provider plugins communicate with the Terraform core library, and of course it has its own representation of how types are.
We've added things like diff customization and remote backends, which interact in strange ways with some of our assumptions.
There are complex types to be considered here, and how we've implemented them, and we've got validation.
And so I say "graph(s)" here, with an "s" in parentheses. It originally said "the graph," and then I showed it to the core team, and everyone laughed at me and they're like, "Yeah, sure, Paddy. There's just the 1 graph."
And it's like, "Please don't tell me that. Let me exist in my world where things are nice and simple, and there is just 1 graph." Apparently, there's not just 1 graph. Apparently, there are multiple graphs. So for correctness, "graphs." Today we're not talking about the multiple graphs, so you may all stay in my nice, safe world of simplicity, where there is only 1 graph.
My point here is there's a lot of type systems. I think I listed 4 or 5 different representations of types there, and I want to talk about some of them really quick.
Things in Terraform’s type systems changed with 0.12
Let's talk about flatmap. If you don't know what flatmap is, that's OK. We're about to talk about what flatmap is. Have you ever opened up a terraform.tfstate file? It's a big JSON file; it looks like this.
Inside it, you'll notice something like this, and that's a flatmap representation. We're taking complex types and we're mapping them down to map[string]string, because all of Terraform's state is stored, in 0.11 and before, as a map of string to string, no matter what it is.
You're like, "Wait a minute, Paddy. How does it do that? You've got complex types. You've got scalar types that are not strings." We just coerce them all to strings. It's probably fine.
How we coerce complex types down into strings, or into a map[string]string, is we use things like this. This is telling me that I've got a list or a set. You know it's a list or a set because it's got the "dot number" (the ".#" count) at the end of it, and it's got 1 item in it, so the value of it is 1.
That 1 item is a set. You can tell it's a set because it's got this long number in there, which is really just a hash of the values of the set. You used to have to calculate this, manually in the provider, for each and every resource. We did away with that 2 or 3 years ago. You used to have to write a function that would tell it what this number was. It was a good time.
And within that set, we've got a block that contains a map. You can tell it's a map because it's got that percent sign at the end, and the map has 1 item, because the value's set to 1. It's got a key in there, and the key is GCE_PROJECT, and the value of that is my-project.
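For reference, the old pattern behind those set hashes looked roughly like this. This is a sketch with hypothetical field names, using the pre-SDK-split helper/schema and hashcode packages, not code from any real provider; the int the Set function returns is that long hash number in the state keys.

```go
package example

import (
	"fmt"

	"github.com/hashicorp/terraform/helper/hashcode"
	"github.com/hashicorp/terraform/helper/schema"
)

// environmentSchema is a hypothetical TypeSet field showing the old
// pattern: the provider supplies the Set function, and the int it
// returns is the hash that keys each set element in the flatmap state,
// e.g. env.2451543163.key.
func environmentSchema() *schema.Schema {
	return &schema.Schema{
		Type:     schema.TypeSet,
		Optional: true,
		Elem: &schema.Resource{
			Schema: map[string]*schema.Schema{
				"key":   {Type: schema.TypeString, Required: true},
				"value": {Type: schema.TypeString, Required: true},
			},
		},
		Set: func(v interface{}) int {
			m := v.(map[string]interface{})
			return hashcode.String(fmt.Sprintf("%s=%s", m["key"], m["value"]))
		},
	}
}
```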
We're using this dot-separated syntax to create complex types in a map of string to string. You can see it when interpolating. In this case, we're interpolating a resource, then the name of the resource, and then a field of that resource, the network interface.
And then the network interface is a list, so we're using the first item in that list, and the first item in that list is a block, and that block has an access config field on it, and that access config is a list, so we're using the first item in that, and that's a block, so that has a nat_ip field in it, and that's what we're ultimately selecting.
That's a flatmap string.
We ran into a problem with maps, because if you had this interpolation, Terraform's not sure whether it's a map with a key of bar whose value is a map with a key of baz, or a map with a single key of bar.baz. Because the fun thing about maps is that their keys are strings, and strings can have periods in them, so Terraform's not really quite sure which one that is.
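Pulling that all together, here is a rough reconstruction of the kind of flatmap the slides show; the resource and field names are placeholders, and the last two entries show the map-key ambiguity just described.

```go
package example

// flattenedState is a rough reconstruction (hypothetical values) of the
// flatmap scheme: everything is a map of string to string, ".#" counts
// list/set items, a hash keys each set element, and "%" counts map
// entries.
var flattenedState = map[string]string{
	"env.#":                           "1",
	"env.2451543163.vars.%":           "1",
	"env.2451543163.vars.GCE_PROJECT": "my-project",

	// The dot-separated path you interpolate, e.g.
	// google_compute_instance.foo.network_interface.0.access_config.0.nat_ip
	"network_interface.#":                        "1",
	"network_interface.0.access_config.#":        "1",
	"network_interface.0.access_config.0.nat_ip": "203.0.113.10",

	// The ambiguity: a map key literally named "bar.baz", or a key "bar"
	// holding a nested map with key "baz"? The flat keys look identical.
	"labels.%":       "1",
	"labels.bar.baz": "qux",
}
```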
And you're like, "Wait a minute, Paddy. What do you mean it doesn't know what type it is? Terraform should know about my schema." But Terraform doesn't know about your schema. And you're like, "That's not right. I told Terraform about my schema. I wrote it into my provider code."
You have to know how Terraform actually works
So let's talk about how Terraform works really quick.
At a very high level, how Terraform works is: the config and state get parsed, the graph is built, and the diff is calculated. That all happens in the Terraform core binary, and then it sends all of that over to the provider, which then makes the API call to turn that into reality. And finally the provider sends the new state back to Terraform core, which then saves and persists it.
And we have a problem there. Terraform doesn't know about your schema because your provider knows about your schema, and it's not the one making the graph, and it's not the one doing your diffing, so at interpolation time, Terraform doesn't know what your schema is.
How do we fix this? First step is we standardize the type system, so we're not working with like 5 different type systems, because that's just confusing.
Second thing we do is, at the very beginning, Terraform says, "Providers, what is your schema?" and the providers say, "My schema is this." And then when Terraform is parsing all of its config and doing all its interpolation, it knows what your schema is.
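Concretely, the schema a provider already declares with helper/schema, sketched here for a hypothetical resource, is what core now requests from every provider up front over the plugin protocol, before it parses config, builds the graph, or interpolates anything.

```go
package example

import "github.com/hashicorp/terraform/helper/schema"

// resourceExampleInstance is a hypothetical resource. In 0.12, core asks
// the provider for this schema at the very beginning, so it knows the
// types of every field at interpolation time.
func resourceExampleInstance() *schema.Resource {
	return &schema.Resource{
		Schema: map[string]*schema.Schema{
			"name": {Type: schema.TypeString, Required: true},
			"network_interface": {
				Type:     schema.TypeList,
				Optional: true,
				Elem: &schema.Resource{
					Schema: map[string]*schema.Schema{
						"network": {Type: schema.TypeString, Optional: true},
					},
				},
			},
		},
	}
}
```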
That's one of the major things we introduced in 0.12 that we didn't talk about much at all, but one of the major reasons behind 0.12 was to fix our protocol to allow for this.
Unfortunately, fixing this breaks things.
As I said, we have some spacebar heating moments, where users are relying on implementation details, and relying on the type flexibility of Terraform not knowing what our types were.
One of the first things that breaks is that what you set in state matters. I know people are like, "Of course what you set in state matters; that's why I set things in state. I don't just set random data there."
But there are rules to what you can set in state, and Terraform 0.11 didn't have enough information to enforce them properly. But Terraform 0.12 does, so you're going to be subject to these new rules now, and you can run afoul of them.
I know, because I definitely did.
An example of why you need to test
If a value is known in the config or the state, only the values from the config or state are valid to set. That's a high-level description. The protocol's text for this is: "Any attribute that was non-null in the configuration must either preserve the exact configuration value or return the corresponding attribute value from the prior state."
I know those are both very abstract, so we're going to talk through an example which may or may not be the exact thing that I did that made me realize this.
I work on the Google Cloud Platform provider, so this is my Google Cloud Platform example.
It's got these things called "self links," and these are URLs, and there are multiple versions of them that are all semantically equivalent. I've got 5 different versions up on screen right now, and these are all basically the same element. They point to the same thing. Terraform should consider them equivalent.
Our issue is, we decided at the provider level that we were going to be liberal about what we accept and conservative about what we set. That helped us allow users to have flexibility, but it also gave us predictability in what was going to be the input when you interpolated.
So we accept any of these 5 things, but we always set the top 1 in state. That's always what's going to be in state, because that's the easiest one for us to get all the information we need out of.
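The "liberal about what we accept" half usually shows up as a DiffSuppressFunc. Here is a minimal sketch of that idea; the function name is made up, and it only compares the trailing resource name, which is a simplification of what the real provider does.

```go
package example

import (
	"strings"

	"github.com/hashicorp/terraform/helper/schema"
)

// suppressEquivalentSelfLinks treats the different self-link spellings
// as equal by comparing only the trailing resource name. The real
// provider logic compares more than this; the point is the shape.
func suppressEquivalentSelfLinks(k, old, new string, d *schema.ResourceData) bool {
	name := func(s string) string {
		parts := strings.Split(s, "/")
		return parts[len(parts)-1]
	}
	return name(old) == name(new)
}

// It gets attached to a field like:
//
//	"network": {
//		Type:             schema.TypeString,
//		Optional:         true,
//		DiffSuppressFunc: suppressEquivalentSelfLinks,
//	},
```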
Our problem is when you have something like this. You have a compute instance. It's got a network interface, and you're specifying a specific network, and then you're trying to interpolate that network elsewhere.
What Terraform's going to do when you try and create this, it's going to run a plan, and the plan is going to assume that the network is going to be default when it's interpolated, because that's what's in the config, and there's nothing in the state.
So the only thing Terraform has to build that plan with is the default, and it's like, "I've got a value for this. I know what it's going to be. This is going to be default." Then we go ahead and apply it, it creates the instance, and what we set in state is the full URL of it.
The problem with that is that the full URL of it is not, in fact, a default, and Terraform does not like that. You can't stray from the plan. You must keep to the plan.
That's why you've got to be careful. You've got to make sure that what you set into your state follows these rules.
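One way to stay on the right side of that rule, sketched here with a hypothetical resource, a stand-in API client, and a made-up network_self_link field rather than the Google provider's actual fix, is to write back exactly what the user configured and expose the canonical form as a separate computed attribute.

```go
package example

import "github.com/hashicorp/terraform/helper/schema"

// instanceAPIResult and createInstanceInAPI stand in for a real API
// client; everything here is a sketch, not provider code.
type instanceAPIResult struct {
	Name            string
	NetworkSelfLink string // the API always reports the full URL
}

func createInstanceInAPI(network string) (*instanceAPIResult, error) {
	return &instanceAPIResult{
		Name:            "example",
		NetworkSelfLink: "https://www.googleapis.com/compute/v1/projects/my-project/global/networks/" + network,
	}, nil
}

func resourceExampleInstanceCreate(d *schema.ResourceData, meta interface{}) error {
	// What the user configured, e.g. "default".
	configuredNetwork := d.Get("network").(string)

	instance, err := createInstanceInAPI(configuredNetwork)
	if err != nil {
		return err
	}
	d.SetId(instance.Name)

	// Writing the API's expanded URL over a field the user configured as
	// "default" is exactly the stray-from-the-plan problem in 0.12, so
	// preserve the configured form...
	if err := d.Set("network", configuredNetwork); err != nil {
		return err
	}
	// ...and expose the canonical URL separately, in a hypothetical
	// Computed "network_self_link" attribute.
	return d.Set("network_self_link", instance.NetworkSelfLink)
}
```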
Don’t let schema migration bite you
But there's also schema migration. In previous versions, 0.11 and before, when you changed the layout or the format of a resource, you changed what fields are available, or the types of them, and we gave you this way to migrate your schema to use the new layout and the new format, so that you didn't lose any data.
The way we did this was by giving you that map[string]string of state and saying, "Fix this, and give it back to us when you're done." And that was fine, because you could just mutate this map[string]string, make it look like what you wanted it to, and set it back in state, and that was cool.
Unfortunately, now that it's typed, we can't do that anymore. You need to be able to parse the state into each and every representation of the schema that you want to migrate between, so you need to keep all of those different versions of your schema around, because, again, we need those types. We have type information, so we need to unmarshal it to the right thing, and you need to convert between types, not just map[string]string anymore.
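In the 0.12-era SDK this is the state upgrader pattern: you keep the old schema around so the old state can be decoded with its real types, and your upgrade function works on typed values. This is a sketch for a hypothetical resource, not a drop-in recipe.

```go
package example

import "github.com/hashicorp/terraform/helper/schema"

// resourceExampleV0 is the old shape of the resource. You keep it around
// so old state can be decoded with its real types.
func resourceExampleV0() *schema.Resource {
	return &schema.Resource{
		Schema: map[string]*schema.Schema{
			"port": {Type: schema.TypeInt, Optional: true},
		},
	}
}

func resourceExample() *schema.Resource {
	return &schema.Resource{
		SchemaVersion: 1,
		StateUpgraders: []schema.StateUpgrader{
			{
				Version: 0,
				Type:    resourceExampleV0().CoreConfigSchema().ImpliedType(),
				Upgrade: upgradeExampleV0ToV1,
			},
		},
		Schema: map[string]*schema.Schema{
			"ports": {
				Type:     schema.TypeList,
				Optional: true,
				Elem:     &schema.Schema{Type: schema.TypeInt},
			},
		},
	}
}

// The upgrade function now works on typed values instead of a flat map
// of string to string: here the single "port" number becomes a
// one-element "ports" list.
func upgradeExampleV0ToV1(rawState map[string]interface{}, meta interface{}) (map[string]interface{}, error) {
	if v, ok := rawState["port"]; ok {
		rawState["ports"] = []interface{}{v}
		delete(rawState, "port")
	}
	return rawState, nil
}
```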
Types are now inflexible
The last thing I want to talk about today is that types are suddenly inflexible.
You used to be able to take advantage of Terraform's type system in some really bizarre and unexpected ways. And people did, because they wanted to offer a good user experience for their users. They wanted to create the user experience that their users came to expect.
We wanted to support that.
Types suddenly became inflexible, and this was one of the biggest problems. This caused, on its own, a couple months' delay in Terraform 0.12, because we ran into this, and had to figure out a fix for it.
I want to talk about optional and computed blocks. This was a major issue that we ran into, and I really want to stress the importance of looking for this in your provider and testing for it.
Because we had exactly 1 test failure across the main 3 providers that the Ecosystem Team manages. A test failed and a partner noticed it, and if they hadn't, we would have shipped Terraform 0.12 and broken all 3 of our main providers. This is definitely a subtle and insidious bug that usually doesn't get covered in test cases, and you want to make sure that you're testing for this.
When an API has an optional default value, it's like, "If you don't set anything, I'm going to specify a value for you, an API default." There are 4 user intents that we want to capture.
First is, "I want to set it to a specific value." Second is, "Whatever is set is fine. I don't actually care about this value at all." The third is, "I don't want it set to anything. I want it set to an empty string, or an empty list, or zero, something like that." And the fourth is, "I want it set to whatever the API's default is. Even if I changed it to something before, I want to reset it back to the API default."
Capturing that in Terraform, this is how we traditionally did it, and usually that's fine with strings and scalar types, because we overload the empty value to mean either an actual empty value or whatever the API default is, as is appropriate for that specific resource.
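For a scalar, that traditional pattern is just Optional plus Computed on the same field. A minimal sketch, with a hypothetical field name:

```go
package example

import "github.com/hashicorp/terraform/helper/schema"

// exampleSchema shows the traditional Optional+Computed pattern for a
// scalar (hypothetical field): Optional so the user can set it, including
// setting it to "", and Computed so the API default can fill it in when
// they don't. The empty value does double duty.
func exampleSchema() map[string]*schema.Schema {
	return map[string]*schema.Schema{
		"machine_type": {
			Type:     schema.TypeString,
			Optional: true,
			Computed: true,
		},
	}
}
```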
Blocks presented an issue for us. In my imaginary resource here, there's a disk block, and it's got an image, and the disk block is optional, and if you don't specify disk, the API's going to pick a default one for you.
The issue here is if we go ahead and create the resource and specify a disk, but afterwards we're like, "I don't want any disks at all," or, "I would like to go back to the API default."
Normally you would just say, "OK, remove the disk." Right? That's going to get rid of our disk. The problem is, because this is set from an API value, it has to be computed. Because it's computed, if you remove it from your config entirely, and something is set in state, Terraform will say, "This is fine. There's no diff. You said whatever was in state is fine, so there's no issue here."
This doesn't actually work. This is going to say, "There's no diff, nothing to apply, everything's cool," and your disk is going to stay attached to your resource.
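Here is roughly how that disk block would be declared, again with hypothetical names, and why removing it from config produces no diff:

```go
package example

import "github.com/hashicorp/terraform/helper/schema"

// exampleResourceSchema sketches the disk block from the imaginary
// resource as Optional + Computed. Because it's Computed, removing the
// block from config entirely reads as "whatever is in state is fine", so
// Terraform reports no diff and the disk never gets detached.
func exampleResourceSchema() map[string]*schema.Schema {
	return map[string]*schema.Schema{
		"disk": {
			Type:     schema.TypeList,
			Optional: true,
			Computed: true,
			Elem: &schema.Resource{
				Schema: map[string]*schema.Schema{
					"image": {Type: schema.TypeString, Optional: true},
				},
			},
		},
	}
}
```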
How we worked around this is, we abused the type system. We said, "Blocks are implemented internally as a list. You access a list, and that's how you get access to the block. So we're going to abuse this fact, and we're going to set it to an empty list."
And that'll be like, "Aha!" and that successfully tricked Terraform's type system into being like, "Oh, there's a diff here. Remove the disk." And we're like, "Hooray, problem solved."
Until 0.12, when the problem was no longer solved, because, here's the fun thing, a block is not an attribute, apparently.
In 0.11, there was no real distinction between a block and an attribute. There was no type for them. A block was just a list of complex objects. In 0.12, a block is a type distinct from an attribute. If it has an equals sign, it's an attribute. If it does not, it's a block. You can't change between the 2 of those. You get to use 1, and a block is either present or not present. There's no way to explicitly say "not present."
To reiterate: Make sure you test
Let's recap. When upgrading your provider, you want to make sure you have tests. I really cannot stress enough how much you want to have tests. Even if you're not going to upgrade to 0.12, if you don't have tests, you should make some. They're really nice to have.
You want to run your tests, and make sure that either they're passing or you understand their failures. I'd like to say you need to make sure they're passing, but I don't think I've seen all of our tests pass in quite some time, because we create a whole lot of resources, and APIs don't always manage to do that reliably.
If you're not on Go modules already, you want to upgrade to that. Again, that's technically not entirely necessary, but you want to do it. It's a good idea. Please do it. It's fine. I promise everything's OK.
And finally, you want to get upgraded to the latest 0.11 SDK. Go to the last 0.11 release and upgrade your provider to that. Make sure everything works, so that you know what's a 0.12 breaking change and what's just weirdness between Terraform versions.
That's a good way to get started, a good way to prep, before you change anything, before you get any 0.12 features.
There are some sharp corners you want to watch out for. When you're setting things in state on Create, you want to be careful about what you're setting in state, because that's when you're going to get into the problem where you're straying from the plan. Also, if you're doing creative things when setting things in state, you just want to be a little bit wary of that.
Generally, if you're setting something in state on Create, technically what you're supposed to be setting in state on Create is exactly what's in the config. That's, in theory, the thing that's supposed to be set there, and doing anything else may lead to some hard-to-find and hard-to-debug "plan is different from apply" errors.
If you're taking advantage of ambiguity in the type system, you're going to have a bad time. I know that's a super-hard one to find, because you don't think of it as taking advantage of the type system. You think of it as just doing the thing that we told you to do. Unfortunately, that is something that we're tightening down now.
Finally, how to stay aligned. How do you make sure that in the future, you're not digging through your codebase trying to find these sharp edges, trying to figure out what's going on?
There's really no shortcut for it except to familiarize yourself with how Terraform works, and the rules it follows, and I know that's super-difficult, and not always the easiest thing to find.
There is a docs folder in the HashiCorp Terraform repo that has a lot of information about the protocol, about how Terraform thinks about resource lifecycles, and generally about how to think about Terraform the way the core team does. That's going to help you stay aligned and make sure that your provider's using Terraform the way that Terraform expects to be used, which is something that we will make sure we explicitly don't break on you in the future.
What to look forward to
The good news is, once you've done all of this, there's a lot of good stuff to look forward to. I have not run these by the core and enablement teams, or, more importantly, their product managers, so let's maybe not take these as promises, but as suggestions of things that might happen in the future.
We are working on a new SDK (Now Available). I don't know when it'll land. I don't know what's going to be in it. I don't think there's any consensus on that. But we are eventually going to get a new SDK that takes advantage of some of the new features that 0.12 has to offer.
Things like support for null values might make it in. I think we have at least 3 different ways right now of getting a value out of Terraform's config or state, and trying to figure out whether it's set or not, and none of them work reliably all of the time.
Some of them work sort of reliably most of the time, but also break at some random weird points. That's because everything's turning into a map[string]string, and an empty string and a null both end up being the empty string in state, and weirdly you can't tell the difference between those 2 anymore.
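A miniature version of that problem, for a hypothetical description field; none of the existing accessors can reliably tell "unset" from "explicitly empty":

```go
package example

import "github.com/hashicorp/terraform/helper/schema"

// describeDescription shows the ambiguity for a hypothetical
// "description" field: once everything is flattened to strings, "unset"
// and "set to the empty string" are stored identically.
func describeDescription(d *schema.ResourceData) string {
	v := d.Get("description")                 // "" whether unset or explicitly empty
	_, ok := d.GetOk("description")           // false in both cases, too
	_, exists := d.GetOkExists("description") // a third accessor, and still not a reliable null check

	if !ok && !exists {
		return "unset... or maybe just empty"
	}
	return v.(string)
}
```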
Warnings: You have the ability to output warnings to users in the Terraform protocol, and it's just not exposed yet in the SDK. Things like, "This API doesn't implement a delete method, but we let your terraform destroy continue so that we didn't throw errors that you can't fix in your face."
But you need to know that your infrastructure's still there. That is a common one that I've had, where I just put it in the debug log, and put it in the documentation, and hope for the best.
But we'll be able to surface those warnings to users now. That's super-exciting and something I'm looking forward to.
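For a sense of what that looks like, here is a rough sketch using the diagnostics support that later shipped in terraform-plugin-sdk v2; if you're on an older SDK, treat the exact API here as an assumption.

```go
package example

import (
	"context"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// resourceExampleDelete sketches the warning case described above: the
// API has no delete method, so destroy "succeeds" but the user gets told
// their infrastructure is still there.
func resourceExampleDelete(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	// Remove the resource from state so destroy can complete...
	d.SetId("")

	// ...but surface a warning instead of burying it in the debug log.
	return diag.Diagnostics{
		{
			Severity: diag.Warning,
			Summary:  "Resource cannot be deleted via the API",
			Detail:   "The remote object still exists; it was only removed from Terraform state.",
		},
	}
}
```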
That's everything I had to talk about today. I know I touched on a lot of topics. My hope is that I identified some of the hard edges, some of the pointy bits of Terraform that you want to make sure you're looking out for and that your tests are covering, so that as you upgrade your provider, you don't release something only to find out, when your users file bug reports, that something you expected to work broke.
Thank you so much for your time. I really appreciate it, and I hope you enjoy the rest of your conference. Thank you.