How Ellie Mae delivers "everything as code" using Terraform Enterprise
Ellie Mae engineers share their DevOps best practices and lessons learned after migrating some Jenkins processes to TFE.
Scott Winkler and Anthony Johnson, two engineers at Ellie Mae, call their presentation "the missing DevOps handbook for running Terraform Enterprise (TFE) with everything as code." Their talk provides a real-world case study that weaves in the best practices they discovered while deploying, configuring, and automating Terraform Enterprise.
After using Jenkins for several years to tie together infrastructure provisioning and deployment, they migrated to TFE, which allowed them to truly take an "everything as code" approach. Hear about the challenges they overcame around multi-team environments, governance, and security. You'll also receive some inspiration from their use case on how to leverage the TFE API and custom providers. They used these two elements to create an application layer for managing promotions between workspaces.
Speakers
- Anthony Johnson, principal cloud engineer, Ellie Mae
- Scott Winkler, software engineer, Ellie Mae
Transcript
Anthony: So at Ellie Mae, we go back with Terraform Enterprise probably about eight or nine months, so our title today is essentially, "Winning with Terraform," and we'll get into exactly what the winning is.
First, we only have 35 minutes, so we're gonna burn through some of these quick slides. These two photos are literally jokes; the smug guy is me, principal engineer at Ellie Mae. Just a little bit about me: I joined Ellie Mae a couple years back building their cloud platform team, cloud platform engineering team, and public cloud team. I like to bake sourdough, and the Mars up there in the corner is because, you know, if Elon's successful, my wife's not gonna be happy: I'll be living on Mars.
With me today is Scott. Just for the sake of time, I'm going to introduce Scott. He is a software engineer who joined about a year ago. He is a lead engineer on our deployment project, so I had these big ideas and he's the one who unfortunately had to implement them. He actually teaches ballroom dancing, so he's not a one-dimensional guy. He has that caffeine symbol at the top because he's actually trained as a chemical engineer, and the funny thing about his photo is that it's actually not really him. That's his twin; it's all about efficiency.
So Ellie Mae: we're not Fannie Mae. We're actually a play on that name; the idea is basically "electronic Mae." We automate the mortgage process, meaning all of the automation it takes to close a loan. Think about it: if you want to go and get a home, it takes you 45-60 days to close that mortgage. Essentially our goal is to make that as short as possible. Our founder wants an ATM-type transaction at some point, where you see a home, you want it, you go up to the ATM, you push a button, all the magic happens, and then you get that home maybe that day. This is a someday kind of thing, but the point is that's what drives the company.
We are a public company. Our customers are big banks and mortgage brokers. About 30% of the mortgages that happen in the US go through our backend. And we are in the East Bay. If you have a bad commute, don't forget that.
So today we're gonna talk about a bit of what we've learned implementing Terraform Enterprise, a bit of why we went with Terraform Enterprise, and how we maintain Terraform Enterprise. We also wanted to do true CD with Terraform Enterprise, so we actually built some software as well that we'll demo. We've been looking at potentially trying to open source that one. Whether we do largely depends on whether anybody here really cares about it.
But again, we'll have a couple demos. Scott gets the luxury of annotating those, so we'll see how well that goes.
Deployment goals
So any good project starts with a goal, and Jeff Barr actually talked today about working backwards. We work backwards at Ellie Mae too, so this all started with kind of a press release, like Jeff Barr said.
If I'm gonna summarize it, we wanted a self-service system. I don't want people bothering me to create a workspace, to deploy a workspace, to provision credentials. I want the least amount of touches possible for a centralized, horizontal cloud operations team. So it's really building services for the business to leverage and move forward with.
Continuous deployment: I already mentioned this. But the point is that we're talking Terraform, so we're talking Terraform code. I wanna be able to take that same piece of Terraform code and push it through each environment with proper quality gates all along the way. With our AWS deployments, we actually cross from a sort of development AWS account to a prod account. Security should probably know what's going to be changing in that prod account if it puts our customers' data at risk.
We want this to be multi-tenant: A lot of you probably support Jenkins farms of hundreds of Jenkins servers for all these different teams, and a lot of places do that centralized Jenkins, which honestly is kind of what we wanna do. We don't wanna have 50 different Terraform Enterprise installations to maintain; we want one main one that services everybody. So this is really being able to govern the system to make sure that Team A doesn't affect Team B, but we can still use the same system to handle them.
Separation of duties: Because we process mortgages, we are a FinTech company, which means lots of compliance regulations. One of those, unfortunately, is separation of duties, which simply means that if Guy A writes the code, he can't be the guy who actually deploys the code. There has to be some sort of check and balance all along through the process. Which is a pain in the butt, but it also makes the problems fun.
And of course, still related to FinTech, it needs to be fully auditable: When somebody comes in and they say, "Prove it," we need to have the audit trail to be able to prove that.
Business control enforcement: And then obviously, you guys already know that this is Sentinel, but the point is that we do want to apply controls as early as possible in the development process so people can fail fast when they're doing the wrong things. So those are, in a nutshell, the things we were looking at.
Journey to automation
But to give a little bit of context, and like I said, we'll go through really quickly: this started before I was at the company, and I think a lot of people in here probably share a similar path. Maybe you're ahead of us, maybe you're behind us, or maybe you haven't even started. But really, the company started with all these different teams working ad hoc: maybe the QA engineers are copying stuff to a shared drive somewhere, the DevOps guys are pushing buttons or doing things by hand, moving .jar files around. Basically, it just started in not a great state.
And then these guys started doing some scripting. Pretty straightforward: they just write a script, a Python script, a shell script, a PowerShell script, and then good old Jenkins makes the magic happen. This was mainly for our private data center, and it worked, but it really didn't address a lot of our compliance concerns. We weren't very happy with it.
Journey to Terraform
And so enter the cloud journey with AWS, and people get a taste of CloudFormation: "Gee, it's declarative. It's great." And then who deploys the CloudFormation, because of separation of duties? Of course, our friend Jenkins.
As a principal engineer, I sat back and I go, "Well, gee. Luckily I own both of these deployment concerns, and I inherited these problems." I basically looked at it and said, "Is there any way we could centralize this?" This was a year and a half, two years ago. And I said, "Gee, Terraform meets that need." So that's really how we got to Terraform.
And then we had the Jenkins guy in the room who was handling our build concerns, our deploy concerns, our credential management, and our Terraform state management. Ultimately, Jenkins had become this sort of golden hammer for every nail. He became a beast that was really, really hard for us to audit. It was really, really hard for us to know who could do what in the system, and at the end of the day, it just ended up being a lot of grief and a lot of trouble. Of course, the more load you put on any system, the harder it is to maintain for that central team. So he was kind of a jack-of-all-trades Jenkins.
TFE vision
So that led us to re-envision things. We looked at a whole bunch of different software, but really this goes back to our deployment goals and thinking, "Let's let Jenkins do what he's supposed to do." Jenkins was originally envisioned as essentially a build system: run unit tests, run tests, run any sort of security tests. All that stuff, Jenkins is really, really good at. He can do all the other scripted things, the scripted jobs, but the reality is that he was becoming that jack-of-all-trades. So we wanted to say, "Hey, let's let Jenkins be Jenkins." He can still do that other stuff; we don't have to stop it overnight.
And then let's focus on Terraform Enterprise: still taking source directly from engineers, getting it into GitHub, and having Terraform Enterprise hook into git webhooks so it's aware of changes. Then those artifacts get put into Artifactory, which, by the way, means we can actually scan them for security, license violations, things like that. We actually wrote a provider that can, at deploy time, take those artifacts out of Artifactory. And then what's nice is that we have the clouds that we're gonna use.
I didn't put Azure up there because it just didn't fit on the slide, but Azure's perfectly fine; we actually use Azure as well. We have our clouds, including our private cloud with vSphere. I also want people to reimagine that it's not just about deploying infrastructure; there's provisioning as well. So if you need a Vault ACL, a Consul ACL, a Kafka topic, an Elasticsearch index, a Kibana view into the system, or just about anything, essentially you can start to do that as code.
So everything is code. That was kind of my original vision: everything is code, with Terraform Enterprise as the hub of it all.
Deploying
What we learned when we went to deploy TFE: we started with the Amazon AMI, HashiCorp told us to get off of it, and we resisted, but ultimately we basically said there's no reason not to deploy Terraform Enterprise with Terraform. It just makes sense.
And who watches the watchman? Who deploys the deployer? Basically, we were able to use HashiCorp's hosted SaaS. I think it's a very elegant pattern: you put your Terraform in a private or public GitHub repo, and the SaaS has no problem accessing that and deploying it. So you essentially get a full Terraform deployment stack all along the way; you just use the public SaaS version to help you deploy your private stack.
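For those curious what "deploy the deployer" can look like in code, here is a minimal, hypothetical sketch; the organization and workspace names are made up, and the exact backend block depends on your Terraform version:

```hcl
# The Terraform that deploys our private TFE runs from, and stores its
# state in, the hosted SaaS at app.terraform.io (illustrative names).
terraform {
  backend "remote" {
    hostname     = "app.terraform.io"
    organization = "example-org"        # hypothetical organization

    workspaces {
      name = "private-tfe-deploy"       # hypothetical workspace
    }
  }
}

# The rest of the configuration then provisions the private TFE
# instance itself (instances, load balancer, DNS, and so on).
```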
A lot of the documentation mentions point and click: logging into an EC2 instance or the virtual machine and running commands. You may have to look pretty hard and experiment some, but it's all fully automatable. There might be things here and there, but the point is, as Jeff Barr was saying, "Settle for excellence," right? There's no reason to leave that work undone even though we're not going to be installing Terraform Enterprise all the time; we still need to upgrade instances, manage them, and create dev environments to test things out. It doesn't make sense for us to leave that work there to become somebody else's debt. So fully automate it.
And of course, single sign-on. Everybody should be doing single sign-on, but the point is that if you do single sign-on early, it really does affect how your organization names its teams, implements them, and maintains them, so I definitely recommend doing that. We got burned recently with failover and recovery. This is mainly our own fault and a little bit of HashiCorp's fault, but we basically ran into the Vault unseal problem: everybody has to unseal Vault, right? We have a method where we do not allow access to production, but that doesn't mean we don't have break-glass access to production via some sort of public-cloud-owned key pairs. And of course, we didn't set them on the instance, and we didn't have the Vault unseal key. We couldn't really log in without a restart, and it was just a mess.
But luckily, you can account for those early: test your dire situations and figure out how it goes.
Managing Terraform Enterprise
We need to manage all of that: organizations, workspaces, teams, VCS connections, credentials, and module management as well.
Teams typically would do something like JIRA. I like this JIRA guy because I feel like JIRA is human orchestration, ticket-monkey kind of stuff. No one wants to be a ticket monkey, but the point is, you've got that guy in the middle literally juggling tickets. I looked at it and I go, "Well, can we do better? Can we do better than juggling tickets?" I'm going to let Scott tell us a little bit more about how we organize this code.
Scott: The first thing that we looked at was how we can administer Terraform Enterprise using Terraform. When we started the journey of using Terraform Enterprise, there really wasn't a provider. Now there is, and you can see the two resources there: we can declare an organization, we can declare a team. These are just administrative things that we don't have to create point-and-click in the dashboard, in the UI. This makes operations work significantly easier. If they want to provision a new organization, a new team, even new workspaces, this is all possible with the Terraform Enterprise provider.
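As a rough illustration of those two resources, a sketch with the tfe provider might look like this; the hostname, names, and email are all made up:

```hcl
# Administer TFE itself with Terraform (illustrative values throughout).
provider "tfe" {
  hostname = "tfe.example.com"  # your private TFE install
  # the API token is typically supplied via the TFE_TOKEN env variable
}

resource "tfe_organization" "public_cloud" {
  name  = "public-cloud"
  email = "cloud-team@example.com"
}

resource "tfe_team" "engineers" {
  name         = "engineers"
  organization = tfe_organization.public_cloud.name
}
```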
What this makes possible is that instead of using JIRA, now we're using GitHub pull requests, because most of this work really should be done in a versioned fashion using GitHub.
This is our Terraform Enterprise deployment. We have an organization here for managing our Terraform Enterprise, actually through Terraform Enterprise. One of the things someone might want to do is create a new organization. So we have this workspace, TFE management, that someone can make a pull request against. This is actually the code here.
In this case I want to make a public cloud organization. I'm just using a Terraform module; Terraform best practice here is to always use modules. I declare an organization, I declare a VCS connection, I declare a team, an OAuth token, a workspace. So, pretty straightforward. There's really not too much here. This is a little more detailed one that I already have existing; I already made a pull request for this. You can declare the remote plan now; that's cool. And in this pull request you can actually view the result of what that plan would be.
This is just what changed, and you can merge the pull request. So in this case I'm just going to be creating a few resources.
We actually made this possible using our shell provider. The Terraform Enterprise provider doesn't cover all the resources that we need to make this. So in this case I'm just going to be merging the pull request. There's really not a lot here.
So we kind of filled the gap of being able to create other resources that the Terraform Enterprise provider doesn't fill by having our own provider that just allows us to call some simple shell scripts, like create, read, and update. These are just hooks into normal Terraform lifecycle events. We're able to use this custom provider in Terraform Enterprise; it's actually a feature of Terraform Enterprise to be able to bundle your own providers that aren't part of the public registry of providers. That's actually one of the most appealing things for us as Enterprise users of Terraform.
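As a rough sketch of that idea, a shell-script-backed resource might look something like this; the exact resource and argument names depend on the provider version, so treat this as illustrative:

```hcl
# Each Terraform lifecycle event maps to a shell script (illustrative).
resource "shell_script" "tfe_gap_filler" {
  lifecycle_commands {
    create = file("scripts/create.sh")
    read   = file("scripts/read.sh")
    update = file("scripts/update.sh")
    delete = file("scripts/delete.sh")
  }

  environment = {
    TFE_HOST = "tfe.example.com"  # hypothetical endpoint for the scripts
  }
}
```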
Again, you can just see the plan being done. It takes a few seconds. You get all the logs, you maintain the state files. This just makes working with Terraform a little bit easier, especially if you wanted a rollback or whatnot. And here, when I refresh the page, you can see the new organization was created successfully. It was created with that workspace, it was configured with a team, and it has a VCS connection to our internal GitHub.
Module registry
The next thing we wanted to do was look at the private Terraform Enterprise module registry. I think the module registry is a great tool because it's all about sharing code: sharing modules that people have made and being able to reuse them in other places. You have producers of Terraform modules who can version and release them, and consumers who can use these modules very easily. It's quite cool.
The problem with the private Terraform Enterprise module registry is that it's very point-and-click-ish. You have to add each provider one at a time, or each module one at a time. So we decided that wasn't good enough, and we wanted to automate that process using Terraform. You can see here in this snippet we have two modules. The first one just gives a list of the organizations in Terraform Enterprise. Then the second one actually provisions that module in the organizations specified by the list organizations module.
In this case I want to register this module with that VCS repo identifier. CloudPlatform/terraform-cool-mod actually points to a repo in our GitHub. It has releases, so it's a versioned thing. And I'm just registering it with all organizations in Terraform Enterprise. Of course, xkcd; everyone loves that.
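A hedged sketch of the two-module pattern described above; the module sources and output names are hypothetical, not the actual internal code:

```hcl
# Module 1: look up every organization in Terraform Enterprise.
module "list_organizations" {
  source = "./modules/list-organizations"  # hypothetical module
}

# Module 2: register a versioned module, identified by its VCS repo,
# with each of those organizations.
module "register_cool_mod" {
  source        = "./modules/register-module"  # hypothetical module
  vcs_repo      = "CloudPlatform/terraform-cool-mod"
  organizations = module.list_organizations.names
}
```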
For the demo here, you can see that we've created the workspace and done some of the initial provisioning, but the modules are just empty. Again, if you wanted to do this in the UI, be my guest. I've done this a few times; it's not hard, but if you have many organizations it's not that easy to use either. So we created this GitHub repo where you can register your modules, kind of like a central library. And you can declare where you want these modules to go. I'm always just passing them to all organizations, but you could actually restrict this to just the list of organizations that you want to pass them to.
If somebody wants to create a new module that they want registered, they're welcome to make a pull request. Again, same as before, same as how we managed the organizations and teams with TFE management. Then it populates in the module registry.
This actually takes a while; it takes a few minutes.
We actually have the shell provider as open source. It has a whopping two stars on GitHub. So if you want to make it three stars, or maybe even four stars please be my guest. It's actually very useful because it helps patch the need for resources that just don't exist right now. Actually, that shell provider is how we managed to fill in a lot of the gaps. Whether that's a good thing or a bad thing remains to be seen.
But you have a create operation, update, read, delete. So it's the same as creating a normal Terraform provider, but rather than having to go through the process of writing Go code and compiling the provider, you can just write a couple of scripts, and these get called during normal Terraform lifecycle events.
So, if you just wanted to do a POST on create, a GET on read, or a DELETE on delete, you just make an HTTP request. These are just one-liners now; it's very interesting.
Provisioning credentials
Anthony: One thing we haven't covered is the idea of credentials. Like I said, we're FinTech, so credential management is something we care about, and obviously people should only have the credentials they're supposed to have. I do think there's a bright future with Vault integration, but even with Vault integration there still has to be a human who maps a Vault ACL to, say, a workspace or something like that. I definitely think there has to be a human interaction. And of course, humans themselves are lossy. We don't like humans. Not only will they forget passwords, but they'll also lose the passwords. So they're doubly lossy, I guess you could say.
The concept here is that we wanted to be able to delegate to different teams in the organization the ability to provision credentials for certain teams, giving us least privilege within these things.
You can imagine it; we'll use our public cloud team as a good example. Our public cloud team delegates AWS rights to certain AWS accounts. They own that; Scott and I do not. I technically still can, but that's only historical. The point is, they are the ones who need to make the judgement call: does Team A, in whatever environment they're deploying to, get access to this particular AWS account? We have about 120 AWS accounts. Managing those credentials is something we need, and we need it, essentially, in a way that we can audit: we can know when it happened and what it was.
Using the same kinds of concepts, we've actually done a similar thing, but the difference is that we don't own the repo and we don't own the pull request cycle. So we end up with a very similar fashion in the demo here.
Scott: We have another demo here, which is provisioning credentials. I wasn't able to show the end of the modules, but the modules are there, I promise. In the workspace now we have these variables and, as you can see, they're empty. Now, if we wanted to do this manually, you could go through the UI and click the button to add such-and-such variable. But then you have somebody looking at the credentials, first of all, and if you have many workspaces, many deployments, you have to put these in manually, one at a time. Imagine key rotation: just a nightmare.
What we have here is a workspace where, first of all, you can provision the IAM user; this is for AWS. Then we have this mapping of workspaces to environment variables. In this case, we have three workspaces, and we want to set the AWS access key and the AWS secret access key created from this IAM user.
We have this other workspace that manages that, and I've already made the pull request here. You can just confirm that. It actually provisions the IAM user, creates the AWS access keys, and then, through that variable set mapping, maps those to the workspaces where they need to go. Now an operator or a public cloud engineer can look in one place and see where these credentials are being used and what policy they're used with.
Just to show you that this actually does work: we have the variables now in the workspace, and no one has seen them. You could trigger this workspace to run, for example, once a month, to have key rotation in Terraform Enterprise. It's actually modular. Right now we just have support for AWS access keys, because the public cloud team really insisted on that, but we can make modules for any kind of secret, potentially even a Vault integration or something like that.
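A minimal sketch of the flow just described, assuming the aws and tfe providers; the user name and workspace ID are illustrative:

```hcl
# Provision an IAM user and access key, then map the key into a TFE
# workspace as sensitive environment variables (illustrative values).
resource "aws_iam_user" "deployer" {
  name = "team-a-deployer"
}

resource "aws_iam_access_key" "deployer" {
  user = aws_iam_user.deployer.name
}

resource "tfe_variable" "access_key" {
  key          = "AWS_ACCESS_KEY_ID"
  value        = aws_iam_access_key.deployer.id
  category     = "env"
  sensitive    = true
  workspace_id = "ws-XXXXXXXX"  # hypothetical target workspace
}

resource "tfe_variable" "secret_key" {
  key          = "AWS_SECRET_ACCESS_KEY"
  value        = aws_iam_access_key.deployer.secret
  category     = "env"
  sensitive    = true
  workspace_id = "ws-XXXXXXXX"
}
```

Re-running this workspace on a schedule would rotate the key and update the consuming workspaces in one pass.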
Also, I know that the Terraform Enterprise people are working on their own thing, maybe just-in-time credentials. This is another solution. It's okay.
Continuous delivery pipelines
Anthony: Scott and I had a debate on this slide, because we couldn't decide what we wanted to show for pipelines. Drawing a pipe, a bunch of pipes, a graph tree: they're all cliche. I wanted something different, so you can tell me whether it works. Scott did tell me that somebody's version of hell was rolling a boulder uphill and never getting to the top. So take it as you want, but I do like the boulder aspect, because the boulder stays the same all the way up the hill. You're pushing it, and along the way you think about your quality gates, approvals, things like that.
I like to think that they're getting out and helping to push, and of course, once you get to the top, that's the launch. Of course, it's got rollback. I like to think the metaphor is there, and Scott didn't like that either. The point is, I thought it was fun. We didn't really have much more to put on the slide; this is just the introduction. We did want continuous delivery, essentially pushing the same code all the way through. The secret of Terraform Enterprise right now is that the APIs already exist; the UI just doesn't implement them. So we decided, hey, we'll go ahead and do it.
Scott: Please excuse my UX skills. I'm not the best here, but I tried. I did try. This is actually our implementation of pipelines using Terraform Enterprise. I think this is actually one of the coolest things that we did. Organizations have pipelines, and pipelines have pipeline steps. In this case, I have a simple pipeline that only has three steps: dev, int, and prod. Again, back to the TFE management: here's where we provision the organizations and the workspaces and stuff like that. The reason why this is important is because now we can also create pipelines and pipeline steps. Again, this was made possible using our shell provider. The pipeline steps really just map to workspaces in Terraform Enterprise. So you give it an ID, and it knows where to do the run for the configuration version.
Terraform Enterprise is actually cool, because you don't need a VCS connection for all of the workspaces. You don't need to connect the workspaces directly to GitHub. In Terraform Enterprise, you can see that only dev is connected to GitHub. I make a pull request in dev, I approve it, that goes to dev, and then int and prod actually get their configuration version through this UI that we created. You can see the list of all the configuration versions that are valid for dev, and you can promote from dev to int to prod. So you can see that only the ones that I have promoted have made it to the next stage.
You can see that I'm going to be promoting the current one; I can actually look at the diff and see what changed. Only people on the list of approvers can promote. In this case, I am on the list of approvers, so you don't really see that. In Terraform Enterprise, you can see it's actually executing a run using that configuration version that I pushed. We actually have the ability to do rollbacks too. You can roll back to a previous configuration version that you have successfully used before, because it makes sense: if you push from dev to int and find that it's not working for you, and you want to roll back to a previous Terraform configuration that you knew was good, you can make that possible.
Also, you can see that I implemented pipelines as a linked list. That's because I'm not the best UX developer here, but you could make this a graph. You could envision having one-to-many relationships, or even pipeline steps connecting back to previous ones. You could use this for provisioning many different workspaces. The idea is that the workspaces are just containers for your Terraform configuration code. The only difference is the variables that you use for them. So you'd use different access keys, you'd send them to different AWS accounts, for example, but the configuration code that you're promoting from dev to int to prod is actually the same, and you can track where it goes.
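Under the hood, a promotion boils down to TFE API calls. Here is a hypothetical sketch of a pipeline step using the shell provider, where the create hook uploads the promoted code as a new configuration version on the target workspace; the hostname, workspace ID, and file name are made up:

```hcl
resource "shell_script" "promote_to_int" {
  lifecycle_commands {
    create = <<-EOF
      # 1. Create a configuration version on the int workspace and
      #    capture its upload URL (illustrative host and workspace ID).
      UPLOAD_URL=$(curl -s \
        -H "Authorization: Bearer $TFE_TOKEN" \
        -H "Content-Type: application/vnd.api+json" \
        -X POST \
        -d '{"data":{"type":"configuration-versions"}}' \
        https://tfe.example.com/api/v2/workspaces/ws-INT00000/configuration-versions \
        | jq -r '.data.attributes."upload-url"')
      # 2. Upload the same code that ran in dev; TFE then queues a run.
      curl -s -X PUT --data-binary @promoted.tar.gz "$UPLOAD_URL"
    EOF
    delete = "true"  # a one-shot promotion has nothing to tear down
  }
}
```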
You can see here that I also implemented a little email system. So if somebody is not on the list of approvers, they can send a message through the app, implemented using SES. Say, hey, maybe they need to have this promoted ASAP, and they want to know who the people are that can do that.
Winning
Anthony: The title of our talk was winning. We've done a lot of work, and we implemented this self-service, self-service with quality gates. If your business doesn't have FinTech requirements and you don't have to have somebody approve it, you can get rid of all those quality gates and it truly is self-service; unfortunately, we have to have those. The point is that the right person in the business to approve whether an application change is going to hurt the business is not some DevOps guy who just joined the company two weeks ago. It's going to be the business owners for that content, the guy who loses his job when it goes down. The DevOps guy is just doing what he's told a lot of the time. So we do want to push those quality gates out to the people who really are accountable for it.
We've reduced ambiguity. Ambiguity is important, because think about the JIRA tickets. What we see with JIRA tickets is essentially somebody saying, "I don't know what I really need, but here's what I got." Then you go back and forth, back and forth. The actual things that happen are never truly captured in the JIRA ticket; it's just closed as done. You're left saying, "Hey, here's a random firewall rule maybe over here, maybe there's one over there." The point is, now we're getting to the point where even those firewall rules could be done as code and could be in the GitHub repo, if we wanted to implement that in a next version of this. We're pushing the business toward keeping everything as DRY as possible: essentially, state your intentions and let's make that reality.
Less work: essentially, I think this puts us in a really good position where the team maintaining this is not going to be doing this day in and day out, juggling those tickets to sustain it.
Sustainability. The wisest engineer I ever knew told me something. I took over his code, and I said, "Where's the documentation?" He looked at me and said, "The code is the documentation." I thought he was trolling me; you're sitting there thinking, "The code is the documentation? You're just lazy." But learning from his code, I realized he was right: I learned an incredible amount from it. Honestly, we should want the code to be the documentation, because it is the only thing that matters, and the only thing that's authoritative is what is actually there.
With all this time you have left, you get to finally schedule and take that vacation. But Scott and I both know the reality: you can be looking up cats on the internet, or unicorns, or figuring out your favorite UFO stories. By the way, I know the guy who wrote this book. I don't think he's very popular, so if you want it, support him.
Anyways, the point is that we enjoy these things. I don't enjoy juggling tickets and being stressed out about people coming over and saying, "When are you going to do the ticket?" So, keeping it simple, keeping it easy, we get to do more of the stuff we like and look like we're amazing.