State Farm's Terraform journey: The good, the bad, the ugly
This presentation will follow State Farm’s journey as it embraced infrastructure as code.
In this talk, State Farm architecture manager Karl Cardenas discusses his company's Terraform adoption journey: the challenges encountered, lessons learned, and how the Platform Enablement team was able to provision public cloud accounts in under 5 minutes, while previously it took 3 days.
Speakers
- Karl Cardenas, Manager of Product Education, HashiCorp
Transcript
As you can probably tell from the title slide, today's presentation is about Terraform and our journey embracing infrastructure as code. I'm going to walk you through the story of our platform enablement team, which went from taking three days to provision a public cloud environment down to under five minutes.
Before I get started, let's do a quick thought exercise here. Who here reads technical articles? Raise your hand. It's most of us. We all probably read those articles where they talk about various technologies, infrastructure, and how to consume them. Like Terraform, React—someone, give me a buzzword.
Kubernetes. You read about all these technologies. You have authors that explain how to use them. At the very end, it's almost like it's the greatest thing in the world—a ta-dah moment. Let me ask you something. Is that how it is in real life? Has that been your experience? No, infrastructure is hard. Infrastructure as code is also hard. So, let's have an honest conversation.
Terraform is awesome, and if you're in this room, chances are I don't have to convince you. But it does introduce a series of challenges. I'm going to go over some of the challenges we encountered.
This is going to be the outline for today: I'm going to talk a little bit about State Farm, who I am. Then we'll talk about the public cloud enablement team—its composition. I want to break it up into three stages—crawl, walk, and run. We'll talk a little about automation; the impact it has on the workforce, what are some of our next steps, and then we'll wrap it up with takeaways.
State Farm, most of you have probably heard of us. If you're not familiar with us, we're an insurance company. But we're more than just an insurance company. We also provide a lot of financial products. We have a bank—we deal with a lot of stuff.
At the end of the day, insurance is nothing more than an intangible product; it's a promise. Well, those promises require a very large IT system. If you look at the number of policies, we have over 83 million. That requires a massive IT infrastructure. Not only that, but staff. Setting IT aside and just looking at our agency workforce, we have almost 19,000 agents. That's more State Farm agents than there are McDonald's in the US. Pretty crazy.
Who am I? I've been with State Farm for five years—started as an intern in a mainframe networking area. Did that for about three years, focusing on a lot of infrastructure stuff. Got a little bit tired, and I went back to application development where I got to work on modern JavaScript frameworks. Then after a year of that, I came back to infrastructure where I focused on automation of public cloud space.
Most recently, I'm an architecture manager responsible for architecture in public cloud, and our developer evangelist movement in public cloud. In my free time, I like to hike, exercise. I'm a technical writer—but when I write these articles, I don't do ta-dah moments. I share my code with you, so you can see where I go along. In general—just love to dabble with technology. Then you've got the generic friends and adventures.
The public cloud adoption team (PCAT)
Let's talk about the main character of today's story, the public cloud adoption team. We are about two years into our public cloud journey. This team is responsible for standing up and providing a public cloud platform for our developers. We have the risk and the governance side. These guys focus on legislation; making sure we have hard controls, making sure we do stuff that our auditors expect. And that allows our leadership to sleep soundly at night.
Then we have the automation and support side. This is what we're going to talk about today. At this point, the team is composed of three analysts. They have a lot of infrastructure experience, but almost zero developer background.
I bring this up because it's going to have a big impact on our journey. We were doing some infrastructure as code, but mainly with the public cloud providers' native IaC solutions, which are usually in YAML or JSON format.
Stage 1: Crawl
A little bit over a year ago, we decided to embrace Terraform as our public cloud infrastructure as code solution. And that kicks us off to a crawl. I picked this ASCII emoji because it's a pretty accurate representation of where we were at this point.
We weren't sure what we were doing. We knew that we were going to do Terraform, but we didn't know where to start. The natural beginning was to convert the public cloud provider solution to Terraform. That's where we started, and we had an intern that summer who converted all of it to Terraform. Connor is here in the audience today. Thank you for doing that, Connor; appreciate it. Give him a round of applause. That's not fun work.
At this point, it took our team three days to provision a public cloud environment to a product team—product team being developers. That's not very agile. That's actually not good; three days is unacceptable.
I called this presentation Terraform, the good, the bad, and the ugly. Well ladies and gentlemen, strap in because we're about to get into the ugly. To give you an idea of how bad of a shape we were in, we were leveraging monolithic Terraform—meaning that it was all in one big massive file; not pretty. There was a lot of code repetition. We weren't leveraging modules. We certainly didn't have a CI/CD pipeline, so Terraform was run locally.
We weren't working collaboratively. We were also using local credentials. And here's the worst part: we weren't even using a version control system; we weren't using Git. It's pretty bad. Oh, it keeps going; we were doing semi-automation. That means we were doing some stuff in Terraform, but other stuff manually.
I call this automation purgatory. You don't want to do this. This is the worst of both worlds. For those in the audience who have dabbled in this space, you know that it's ugly to fix stuff when you're doing semi-automation. It can be pretty time-consuming. But we were doing one thing right: we were leveraging a remote backend. At least we had that going for us.
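For readers newer to Terraform, here is a minimal sketch of what a remote backend configuration can look like, assuming an AWS S3 backend with DynamoDB state locking; the bucket, key, and table names are hypothetical:

```hcl
# Hypothetical remote backend: state lives in S3 instead of on a
# laptop, with a DynamoDB table providing state locking.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "test/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-terraform-locks"
    encrypt        = true
  }
}
```

Even with everything else run locally, a remote backend like this keeps the state file shared, locked, and encrypted.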
Dealing with a complex state file
We were also struggling with the scope of our state file: for an environment, how should we structure it? The way we started wasn't easy; it was pretty complex. So let me walk you through that. If we take an environment, this test folder for instance, you can see that we split it into a couple of different folders.
You can see that there are multiple state files. In the global folder, we would usually put resources that aren't tied to a single region, so global in nature. A good example of that would be identity and access management. The init folder is where we define the Terraform resources that initialize the account. It would also include things like service control policies.
Then you have the regional folder, and within the regional folder, you have a folder for each respective region. Here you would find things such as compute resources and anything else specific to that region.
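To make that layout concrete, here is a rough sketch of the kind of directory structure being described; the folder names are illustrative:

```
test/
├── global/           # IAM and other non-regional resources (own state file)
├── init/             # account initialization, service control policies (own state file)
└── regional/
    ├── us-east-1/    # compute and other region-specific resources (own state file)
    └── us-west-2/    # same layout, repeated per region (own state file)
```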
It's a little bit complex at this point, and it was difficult to bring new people on board. Take a common scenario from a pipeline standpoint: if you wanted to do a `terraform plan` and `apply`, you could hardcode your pipeline to cd into each folder. But you don't really want to do that.
You could write a script that loops through every folder and does `terraform plan` and `apply`. That's still not pretty. You could write a script that loops through every folder, identifies whether there's a Git difference in that folder, and then does `terraform plan` and `apply`. But it's too complex, and it shouldn't be this hard.
We were pretty paranoid about the state file, and we paid a price for that. What did we do to go from crawling to walking? Well, we hired a software developer onto our team. This had a big impact, because originally we were composed of infrastructure analysts who had a tremendous amount of infrastructure experience but no programming background. Now we had someone who could show us the ropes, someone who could pair program with us, someone who could show us what tools to use and how to structure a repository for the organization.
After we brought in this developer, we started using Git. We also started giving each product team its own project repository: if you were a product team with a public cloud environment, you had your own repo, because we laid down the core infrastructure for all our product teams.
We also started looking into modules. We put all of our modules in one repository; that way, if we ever need to change a module, we do it in one place. Then, whenever we make an update, our downstream consumers just run a `terraform plan` and `apply`. Pretty good.
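As a hedged sketch of that pattern, a downstream environment can pin a module from the central repository to a tagged release; the repository URL and module names here are hypothetical:

```hcl
# Hypothetical example: consuming a shared VPC module from the central
# modules repository, pinned to a release tag.
module "vpc" {
  source = "git::https://git.example.com/pcat/terraform-modules.git//vpc?ref=v1.2.0"

  cidr_block  = var.vpc_cidr
  environment = var.environment
}
```

Bumping the `ref` and running `terraform plan` and `apply` is then how a consumer picks up a module update.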
By embracing Terraform modules, we were able to reduce our code base by over 10,000 lines. That tells you that there's a lot of inefficiency going on. That takes us over to the walking phase.
Stage 2: Walk
We're about four months into our journey, and we're not quite walking; it's more like a baby learning how to walk. We were stumbling around, but we were making progress. First of all, we implemented a pipeline. We tied it natively into our version control system, running on Kubernetes. Now we were running Terraform in a containerized environment.
There's a lot of good stuff with that. We no longer need local credentials, or to run it from our workstation. That made our friends at InfoSec pretty happy—I know that.
You also have a fast feedback loop. That way, when someone does a Git commit and pushes their branch up, you can get that Terraform plan and see what you're doing—or if you're trying to do something crazy. If everyone's happy with it, you can do the merge request, and then that will deploy your Terraform. Everyone in your team has a chance to see what's going on.
We also created a Terraform Docker image for everyone in the organization. Terraform is awesome, but there are some limitations. A lot of you in the room know that you can get around those limitations by using a `local-exec` provisioner: sometimes you call a CLI, a Python script, you name it. We created a Docker image with the most common tools that people in our organization needed, such as Bash, zip, and so forth. We made it on their behalf so that everyone can consume it and no one has to reinvent the wheel.
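For context, the `local-exec` pattern looks roughly like this; the script name and variable are hypothetical:

```hcl
# Hypothetical gap-filler: run a shell script after provisioning for a
# step Terraform can't do natively. Relies on bash being present,
# which is why a shared Docker image with common tools helps.
variable "account_id" {
  type = string
}

resource "null_resource" "post_provision" {
  # Re-run the script whenever the account changes.
  triggers = {
    account_id = var.account_id
  }

  provisioner "local-exec" {
    command = "bash ./scripts/post-provision.sh ${var.account_id}"
  }
}
```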
At this point PCAT, the public cloud adoption team, were pretty much the Terraform leads for the organization. But we're not scalable, because we were four analysts at the time. So we created a Terraform cookbook, a single-page static site where you can go and read about how to do things in Terraform. Such as: How do you achieve a for loop? How do you configure Terraform to work with the enterprise proxy? All this good stuff. This was pre-Terraform 0.12; now that we have for_each, it's amazing.
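As an example of the kind of recipe such a cookbook covers, here is one hedged way to express a loop in Terraform 0.12-style syntax; the resource and names are illustrative:

```hcl
# Illustrative 0.12-style loop: create one IAM user per name in a set,
# instead of copying the resource block for each user.
variable "team_members" {
  type    = set(string)
  default = ["alice", "bob", "carol"]
}

resource "aws_iam_user" "team" {
  for_each = var.team_members
  name     = each.value
}
```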
We also created a Terraform forum internally. PCAT is not scalable, but we're big believers in the community support model. We believe that for an organization of our magnitude, the only way we're going to be successful in our public cloud journey is by embracing a community model where people share knowledge back and forth with one another.
We're doing a lot of stuff to help people consume Terraform in the product teams, but we still weren't happy with our state file, or how we had it structured. You guys saw that it was pretty complex. What did we do? We rolled up our sleeves and started doing research.
We came to HashiConf last year, and we spoke to several of you. We spoke to HashiCorp's Terraform product leads. We got a lot of insights, got to ask a lot of good questions, and learned from other people.
We read a book, Terraform: Up & Running. If you're new to Terraform, that's a fantastic resource by Yevgeniy (Jim) Brikman, and something I would strongly recommend.
We tried a lot of new things—and with a lot of them, we fell flat on our face. Some of them worked; some of them didn't. But remember, our vision is to be able to provision a public cloud environment in under five minutes.
Stage 3: Run
At about eight months, we finally settled on what we thought was going to be a good solution, one that could take us to the five-minute vision. We modularized everything, meaning that at this point we had a pretty good understanding of what you need to do to deploy a public cloud environment for our product teams. The other thing is that we started leveraging `tfvars` files a lot.
Leveraging tfvars
To help you picture our architecture and hierarchy, take a look at this. If you take these two environments, you're going to notice that they look pretty much identical. The `backend.tf` and the `tfvars` file are the only two files that are unique. The `backend.tf` contains the key to a remote object storage location, while the `tfvars` file has all the unique values for each environment, such as your network CIDR and the account email. All the good stuff that makes that environment different.
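Sketched out, the two per-environment files might look like this, with every value hypothetical:

```hcl
# backend.tf: the only per-environment state wiring.
terraform {
  backend "s3" {
    bucket = "example-terraform-state"
    key    = "team-a/test/terraform.tfstate" # unique key per environment
    region = "us-east-1"
  }
}
```

And alongside it, the `tfvars` file carries the handful of values that differ:

```hcl
# terraform.tfvars: everything unique about this environment.
account_email = "team-a-test@example.com"
vpc_cidr      = "10.10.0.0/16"
environment   = "test"
tags = {
  team = "team-a"
}
```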
At this point, we almost have a cookie-cutter template to provision an account. This is awesome because, by doing this, we got down to under a day. We've come a long way from three days.
This is great because if you were to look at the `main.tf`, all you're going to see are modules. We still had a couple of challenges, but from an organizational standpoint we could now create a Git repo, copy these files in, push them up to the repo, and run a pipeline, and that creates the account.
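A hedged sketch of such a modules-only `main.tf`, with hypothetical module names and sources:

```hcl
# main.tf: nothing but module calls; every input comes from tfvars.
module "account_init" {
  source = "git::https://git.example.com/pcat/terraform-modules.git//account-init?ref=v1.2.0"

  account_email = var.account_email
}

module "network" {
  source = "git::https://git.example.com/pcat/terraform-modules.git//network?ref=v1.2.0"

  vpc_cidr = var.vpc_cidr
  tags     = var.tags
}
```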
But we're still not happy, because there's still copying and pasting going on. There's still a lot of manual action, and as all of you know, when humans are involved, there's a good chance of error.
Ten months into the project: A new automation tool
We knew we had to step up our game to get to the next point, which takes me to 10 months into our journey, when we created a tool: a single-page application. It's a UI that abstracts Terraform away from our operators. This allowed us to automate copying files and creating repos.
Now, when you go to the single-page application, you click on a button that says "create account." That takes you to a screen with all these input values: the account email, the VPC CIDR, tags, all the different things that we want for the environment.
When you fill those out, you hit the "create" button, and a lot of things happen. First, it starts making API calls to our version control system, where it creates a repository, adds the proper merge approvals and merge approvers, and sets the proper push rules. It generates all the Terraform files, creating the `backend.tf` and the `tfvars` from those input fields. We're able to do this because our Terraform is so modularized that the environments all look the same; the only thing you need to worry about is creating the `tfvars`.
At this point, we can create an account in under five minutes, because those same API calls that created the repo and the Terraform files also triggered a pipeline and kicked it off. Now all we had to do was look at the `terraform plan`, approve it, hit apply, and it creates the account.
A couple of pain points
Changing and updating modules
This isn't a ta-dah moment, because it took us over 10 months to get here, and we still have some pain points. A pretty big one comes from our structure: each product team gets its own repository, and then we have one repository that holds all of our modules.
If I'm making a change to a module, I just merge that into the modules repo, do all that stuff, and the module has been updated. But in order for all my downstream dependencies to consume the updated module code, I've got to run `terraform plan` and `apply` in each of them.
That's not an issue when you have 10 different project repositories. But what do you do when you have over 20? Over a hundred? You don't want to be that guy who clicks through a hundred different repos and starts a pipeline in each one. I know, because we did that at one point. Connor did it at one point, too. It's very time-consuming.
Adding a different environment
The other aspect is that we could create an account. That's wonderful. But what about when a product team wants to add a different environment? Let's say they started by creating a test environment, and now they're mature enough for a prod environment. Well, the tool couldn't help there. Now we're back to copying and pasting and filling out values.
Automation, automation, automation
Here's pretty much what the tool looks like. We did get around those two issues I mentioned. We added the capability to add an environment: now the tool is smart enough to identify which repositories we have and which environments each one already has. If it already had research and test, we can now add a prod folder.
We also got around running all the pipelines. Now the tool is smart enough to identify our repositories, and you can either start all the pipelines at once (which, we found out, is an easy way to max out our Kubernetes cluster), or select individually which project repositories you want to run the pipeline for.
Creating this tool really helped us out, as we no longer have to carry the responsibility of creating accounts; we can delegate that now. We delegated it to our second-level team, our operations team, and now we're focusing on other automation.
Now you can give someone the ability to create a public cloud environment without any knowledge of Terraform, and have it done the right way every time. That, for us, was huge, because it takes a big burden off our backs.
Terraform's positive impact on our workforce
I do want to talk a little about the impact that Terraform has on an organization. There's no way we would be where we are today if it wasn't for Terraform. It's amazing, but it does introduce a series of challenges.
If you're young in your public cloud journey, and you're embracing Terraform—which I hope you are—you really have to embrace it. That means yourself, your team, and your leadership. You can't go half in. And if you are embracing Terraform, you're going to have to learn some programming basics.
If you don't, you're going to have a hard time. You guys saw how bad we were in our crawl phase. That’s because we didn't have any developer experience. Add a developer to the team if you find yourself in the same situation.
Don't just let go of your infrastructure people, no! They have a tremendous amount of experience that will pay dividends. But bring them up to speed, take them along with you, teach them, coach them, show them the ropes.
We wouldn't be able to get to where we are today if it wasn't for our attitude. We were excited. We had a positive mindset. But we knew we had a lot to learn, and we fell on our face many, many times. I can promise you that.
But you also need leadership support, and we have support from the highest level of the organization to continue our endeavor of embracing infrastructure as code. Not just in approving the work and not breathing down our necks, but also in enabling us to leverage third-party learning resources or third-party consultants for input. That's huge.
We also promote public cloud certifications. There are a couple of benefits to that. One, you learn good skills on the given platform you're working with. But when you're embracing infrastructure as code and you're young in your public cloud journey, there are a lot of new technologies and concepts you're juggling. Sometimes, getting one of those first certifications is a big win. It can be a big confidence booster, and you need those confidence boosters when you're learning all these new technologies.
Next steps
Terraform Enterprise
What are some of the next things that we're doing? Well, we're going with Terraform Enterprise, and to be honest, it comes down to three main reasons. The first is Sentinel. We've gotten good enough now that we have event-driven automation: if someone tries to do something crazy, like allowing everyone on the internet to come in, we have automation that will stop it, revert the change, notify that individual, and notify their leadership. Our auditors love that story. Our leadership loves that story.
But we want to step up our game and be proactive instead of reactive. That's where Sentinel comes in: now we can prevent bad actions, and stop developers from shooting themselves in the foot by accident. And now that the new Cost Estimation feature is out, we can do a lot more cool stuff with Sentinel.
The second thing we've been seeing in our organization is that a lot of product teams have issues creating a pipeline for Terraform. Everyone's looking for that DevOps engineer, or someone that understands how to tie infrastructure, and pipelines, and application development practice together.
Well with TFE you get that pipeline out of the box. It makes it easier. It reduces the technical barrier for people to come up with a Terraform pipeline.
Lastly, there's the private module registry. The PCAT team is responsible for creating architectural patterns. Normally when you see a pattern, you have a nice little diagram and an idea of how you should set up that architecture. But it's a more compelling story when you can consume that architecture pattern as a Terraform module and deploy it in a matter of minutes.
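Consuming a pattern from a private module registry looks roughly like this; the organization, module name, and inputs are hypothetical, and on a Terraform Enterprise install the hostname would be your own rather than app.terraform.io:

```hcl
# Hypothetical example: deploying a published architectural pattern
# straight from the private module registry.
module "three_tier_app" {
  source  = "app.terraform.io/example-org/three-tier-app/aws"
  version = "~> 1.0"

  environment = "test"
  vpc_cidr    = "10.20.0.0/16"
}
```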
That's our vision for the private module registry: reduce the technical barrier, deploy architectures much more rapidly, and do it the right way.
Enable training environments
We're an enterprise, so we're surrounded by firewalls and proxies. Before this, if someone wanted to learn about any public cloud environment, including the ones we were consuming, they had no way of doing so inside State Farm unless they were part of a product team that already had a public cloud environment. They would have to do it on their own, and this has been a roadblock for us.
We've gotten good enough now that we can provision public cloud environments for people to come in, learn, and do whatever they want, including Terraform. We clean each account every three days, in a continuous loop.
We believe this is going to be a tremendous asset in getting our organization to the next level, as a whole.
Automation
We're never done investing in automation. It keeps on going. That's one of the reasons why we created that tool.
Community investment
For an organization of our size to be successful, we have to embrace the community model. It goes back to PCAT not being scalable. As we get more product teams into the public cloud space, there are a lot more questions being asked.
By making sure that the community stays highly technical—that they embrace knowledge-sharing—we are going to be able to sustain that growth. But it comes down to making sure that community stays strong, and that they do share knowledge back.
That's one of the reasons why I also have a team of developer evangelists to help focus on that: to create architectural patterns, to create modules, to create things that reduce the technical barrier for product teams. But also to capture the successes and failures teams have had, to make it easier for other teams.
Share knowledge back
That's something that we at State Farm haven't always been the best at. Probably the insurance and financial industry as a whole—we like to keep our mouth shut, and stay out of the spotlight. That's something we intend to change, and something we have approval from our highest level in the organization to start changing.
That tool I showed you guys—we plan on open-sourcing that. We're not there yet, but it's something we want to give back to everyone. We also have other content that we plan on open-sourcing and contributing back on.
We're also going to help out by sharing knowledge back, such as with presentations like today's. Be on the lookout for more of that.
Wrapping up
To wrap things up here, it's okay to not know everything. One of the reasons I'm here today is not to showcase, “Oh, look at us." No, we were pretty bad. You guys saw how bad it was in the crawl phase.
And a lot of times, we have all these big brands that showcase their successes but hide their learning growth. Learning is a clumsy process, and sometimes you've got to be open about it. I feel like we're not always open about that.
Learning can be hard. But my point is, don't be afraid. Try new things, ask questions. You're all doing the right thing by being here, attending HashiConf, learning. That's how you get past these challenges.
Embrace infrastructure as code. Don't do semi-automation, don't do automation purgatory. It's hard. It's painful.
Invest in learning and development; that comes in many forms, whether it's taking some time on your own to read technical articles, trying new things, or leveraging third-party learning resources.
If the tool doesn't exist, then make it. Don't sit there and feel sorry for yourself: "Oh, the tool doesn't exist." I know that sounds harsh to say, but sometimes we all fall into the victim mentality.
Bring developers onto your team. We're all developers, because we like to solve business problems, and sometimes this is a business problem. So bring in developers and create tools if needed. It goes back to this: if you find yourself in the situation we found ourselves in, an infrastructure team without a developer, then bring one along, because they can teach you and show you the way.
Lastly, share knowledge back, help others. That's part of being a good neighbor. One of our mottos is to be a good neighbor—and we believe in helping others back.
So if you're starting out on your public cloud journey, hopefully some of the lessons learned I shared with you today can help you out.
There's a quote by Oscar Wilde: "We are all in the gutter, but some of us are looking at the stars." That's how it is sometimes when you try out these new technologies. Those technical articles sometimes make it seem like a ta-dah moment, but that's never the case. And the only way we're going to get better is by sharing our successes and failures back.
I implore you: don't just try something new when you go back, but share it with others. We've been here for two days now, and most of you get some motivation when you're at a conference. I know I get motivated coming to a conference. But don't let your motivation stay here today.
Use that motivation, go back, and just try something new. It doesn't matter if it doesn't work. Maybe it's a crazy idea, but you never know when that crazy idea can pay huge dividends. That tool we created—that was a crazy idea, but man, are we glad we tried it.
We're always looking for good talent. So, a shameless plug: check out StateFarm.com/careers if you ever want to work on some cool projects.
I included my email address in case you want to shoot me questions or discuss some of the tools we created.
But at the end of the day, I want to say, thanks for listening and go try something new. Thanks.