Change Management At Scale: How Terraform Helps End Out-of-Band Anti-Patterns

Get 7 best practices for preventing configuration drift in enterprise-scale IT operations.

Michael Fonseca

Terraform

Dec 11, 2020

Infrastructure as code (IaC) has seen large adoption over the past several years; the benefits and importance within DevOps and the overall technology operating portfolio are no longer up for debate. Now that adoption has hit a critical scale and organizations have a large percentage or all of their technology portfolio managed by IaC, a few important questions arise:

What happens when I need to make one change that will impact a large percentage or all of my infrastructure?
I need to make an urgent security vulnerability update to 1,000 accounts/resources today. How can I do that?
How do I avoid out-of-band changes for my IaC managed infrastructure?
We have existing ITIL change management practices. How does IaC work in our existing process?
How can I easily show my Security & Auditing team that changes were properly approved, reviewed, and executed to completion?

Many organizations operating today have not put the proper best practices in place to address changes at scale and resort to “out-of-band” changes. Instead of using an IaC tool like Terraform to make the change, they make the change directly through the cloud console/operations management system, not via code. This causes “state drift” or “configuration drift” and, over time, can cause larger issues with infrastructure management.

So what are the best practices? In this blog, I’ll show how very large organizations typically use Terraform Enterprise or Terraform Cloud for Business to make infrastructure changes at scale without resorting to out-of-band anti-patterns.

»Best Practice #1: Work Toward Having All Changes Go Through Terraform

When managing infrastructure with Terraform, you can’t stop out-of-band changes and configuration drift without mandating that all changes are managed within Terraform.

Do not allow any out-of-band changes to occur. We recommend disabling any access/controls that will allow changes to an account or resources that are managed by Terraform.
If out-of-band changes are absolutely required due to current operating practices, then we recommend establishing notification/alerting systems for drift detection so changes can be identified and remediated in Terraform code or tracked as a known out-of-band management practice by the Security & Auditing team moving forward.

Diagram of Terraform-only infrastructure workflow

Best Practice #1 — Manage all changes with Terraform

»Best Practice #2: Module Strategies

Use Terraform Modules to clearly define discrete areas of infrastructure management that can be standardized, centralized, and applied to manage the broader core infrastructure.

There is an extensive amount of information available from HashiCorp and others on the creation, usage, and value of modules.
Modules allow you to make a change in one place and apply it to many, thus promoting code reuse. With that architecture in mind, develop modules in a Producer/Consumer model so that changes can be made centrally and rolled out across your environments.

Helpful hint: Use Version Constraints in Terraform code to ensure proper module lifecycle management.
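As a minimal sketch of that hint, two consumers below call the same centrally produced VPC module and pin it with a version constraint. The registry address `example-org/vpc/aws` and the inputs `name` and `cidr_block` are hypothetical placeholders, not a real published module; substitute whatever your own module exposes.

```hcl
# Two consumers of one centrally maintained VPC module.
# NOTE: the source address and input names are hypothetical.
module "vpc_payments" {
  source  = "example-org/vpc/aws"
  version = "~> 2.1" # accepts 2.1 and later 2.x releases, never 3.0

  name       = "payments"
  cidr_block = "10.10.0.0/16"
}

module "vpc_analytics" {
  source  = "example-org/vpc/aws"
  version = "~> 2.1" # same constraint, so both pick up the same fixes

  name       = "analytics"
  cidr_block = "10.20.0.0/16"
}
```

With a pessimistic constraint like `~> 2.1`, a producer can publish 2.2 and consumers receive it on their next `terraform init -upgrade`, while a breaking 3.0 release waits until each consumer opts in by changing the constraint.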

As depicted by the image below, with a centralized VPC Module, one change to the VPC module can be applied to “N” number of VPCs or resources.

Best Practice #2 — Terraform Modules

»Best Practice #3: Map Your Resources

Clearly define and align repos, workspaces, environments, and test.

A clearly defined mapping and consistent usage of resources is key to understanding where changes are required and the impact of those changes.
Breaking down code repositories into simpler functional managed units (and using modules) eases the complexity of targeting and scoping changes pushed through your repository structure.
In Terraform, aligning code repositories as closely as possible to a one-to-one repo-to-workspace mapping will provide a more transparent infrastructure and better change execution auditability.
Many recommendations are available for workspace best practices. As a reminder, breaking workspaces down into sub-application-level environments/functional operating levels ensures that changes can be tracked and tested as they are rolled out.
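One possible way to keep the repo-to-workspace mapping explicit is to manage the workspaces themselves as code with the `hashicorp/tfe` provider. The sketch below is an illustration under assumptions, not a prescribed layout: the organization name, workspace names, repository identifiers, and the `oauth_token_id` variable are all hypothetical.

```hcl
terraform {
  required_providers {
    tfe = {
      source = "hashicorp/tfe"
    }
  }
}

# Hypothetical input: the ID of an existing VCS OAuth connection.
variable "oauth_token_id" {
  type = string
}

locals {
  # One workspace per repository (hypothetical names).
  workspace_repos = {
    "network-test" = "example-org/terraform-network-test"
    "network-prod" = "example-org/terraform-network-prod"
  }
}

resource "tfe_workspace" "mapped" {
  for_each     = local.workspace_repos
  name         = each.key
  organization = "example-org"

  # Each workspace tracks exactly one repository.
  vcs_repo {
    identifier     = each.value
    oauth_token_id = var.oauth_token_id
  }
}
```

Keeping the mapping in a single map makes it easy to review which repository drives which workspace when scoping a change.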

Terraform workflow with aligned repos, workspaces, environments, and test.

Best Practice #3 — Terraform Repos, Workspaces, Environments, and Test

»Best Practice #4: Centralized Auditing

Use Terraform Enterprise/Cloud for Business for centralized workflow, state management, and auditability.

A centralized workflow, view, and auditing capability are necessary to ensure changes happen on time and with transparency.
Terraform Enterprise/Cloud for Business centralizes the storage of all versions of state files safely and securely with access control and encryption. Centralization enables auditors to safely inspect and validate state changes via the UI and API.

Workflow showing centralized auditing through Terraform

Best Practice #4 — Terraform Workflow, State Management, and Auditability

»Best Practice #5: Use a Module Registry

Use the Terraform Enterprise/Cloud for Business Private Module Registry (PMR) to publish modules for Consumers.

The Private Module Registry provides a Producer/Consumer self-service approach that ensures Terraform users are using the correct and most up-to-date modules within their deployments. When a module changes, the change is published to the PMR, which acts as the single source of truth.
The PMR enables module versioning, so when a large-scale change needs to be made, a new module version can be published to the PMR and consumed at scale.
Note: If you are not using the PMR, you can use a generic Git module source and select a specific revision (a non-default branch or a tag) with the ref argument.
Providing transparency and a single source of truth enables easy replication of changes across all required resources.
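The two consumption styles described above can be sketched as follows. The private registry address follows the `<HOSTNAME>/<ORGANIZATION>/<MODULE NAME>/<PROVIDER>` form; the organization `example-org`, the module name, the Git URL, and the tag are hypothetical, and on Terraform Enterprise the hostname would be your own installation rather than `app.terraform.io`.

```hcl
# Consuming a module from the Private Module Registry.
# NOTE: organization and module name are hypothetical.
module "vpc_from_registry" {
  source  = "app.terraform.io/example-org/vpc/aws"
  version = "~> 2.1" # version constraints work with registry sources
}

# Alternative without the PMR: a generic Git source pinned to a tag.
# NOTE: the repository URL and tag are hypothetical.
module "vpc_from_git" {
  source = "git::https://example.com/vpc.git?ref=v2.1.0"
  # The version argument is not supported for Git sources;
  # the ref query parameter selects the tag or branch instead.
}
```

The registry form lets consumers float within a constraint range, whereas the Git form pins one exact revision, so a mass change there means updating the `ref` in every consumer.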

Terraform workflow going through the Private Module Registry

Best Practice #5 — Terraform Private Module Registry

»Best Practice #6: Concurrency and Transparency

Terraform Enterprise/Cloud for Business can also help manage many changes to your infrastructure simultaneously at scale with concurrent runs.

Terraform Enterprise/Cloud for Business can queue and process multiple concurrent runs.
Concurrency provides the capability to queue, process, and track the execution of changes with auditability.
Concurrency is key to ensuring mass-scale changes are executed with transparency across hundreds to thousands of changes in a timely manner.

Concurrent workspace deployment in Terraform Enterprise or Cloud

Best Practice #6 — Terraform Concurrent Runs

»Best Practice #7: Policy as Code

Policy as code is becoming critical in many enterprises to automate custom rules and guardrails so that a team or organization’s infrastructure change policies are enforced without slowing operators down by days or weeks.

Sentinel is a high-level framework and language used by Terraform Enterprise/Cloud for Business to automatically inspect Terraform code in the provisioning process and enforce customizable policy as code to ensure that proper changes are being made.
Sentinel can enforce that modules can only get deployed from the PMR.
Sentinel can enforce that the correct versions of modules are being applied.
With Sentinel, compliance is enforced at the provisioning and re-provisioning stages of every change. Identifying and fixing security issues before provisioning carries much less risk than doing so after provisioning.
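As a rough sketch of how such policies might be rolled out centrally, the `hashicorp/tfe` provider can attach a set of Sentinel policies stored in version control to every workspace in an organization. The organization, repository, path, and `oauth_token_id` variable below are hypothetical, and the Sentinel policies themselves (for example, one restricting module sources to the PMR) are assumed to live in that repository and are not shown.

```hcl
terraform {
  required_providers {
    tfe = {
      source = "hashicorp/tfe"
    }
  }
}

# Hypothetical input: the ID of an existing VCS OAuth connection.
variable "oauth_token_id" {
  type = string
}

resource "tfe_policy_set" "module_guardrails" {
  name         = "module-guardrails"
  description  = "Sentinel policies for module sources and versions"
  organization = "example-org"

  # Apply to every workspace in the organization.
  global = true

  # Directory within the repository that holds the policy set
  # (hypothetical path).
  policies_path = "policies/modules"

  vcs_repo {
    identifier     = "example-org/sentinel-policies"
    branch         = "main"
    oauth_token_id = var.oauth_token_id
  }
}
```

Sourcing the policy set from version control means policy changes go through the same review and approval trail as infrastructure changes.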

Policy as code in the Terraform workflow

Best Practice #7 — Terraform Sentinel Policy as Code

»Centralizing Your Workflow Through Version Control

Architecting your code and workflow are key components of change management at scale. In this new model, Git (hosted on your platform of choice, such as GitHub, Bitbucket, or GitLab) becomes your change management tracking system or, at a minimum, a sub-tracking system operating underneath a change initiator system such as ServiceNow.

In this new “GitOps” model supported by Terraform Enterprise/Cloud, you will focus on three clear steps to support change at scale:

Use modules to drive re-usability and best practice enforcement at scale (single-point-of-change).
Use Terraform as the single point of execution for scalability and auditability.
Use Sentinel to ensure that changes are being executed and that nothing can be provisioned outside of the defined organizational security profiles.

I hope this brief introduction to change management at scale gets you thinking and researching some of these practices to make life easier.