Benchling is a vertical B2B SaaS company that builds cloud software for biologists. With hundreds of engineers supporting scientists who rely on the platform for critical research workflows, Benchling operates a large and complex cloud environment spanning more than 165,000 cloud resources. As the company scaled, developers were dealing with 30-minute Terraform plans, heavy manual coordination, and releases they had to babysit from start to finish. These workflows were a huge drain on engineering time and energy.
To address this, Benchling changed how it managed infrastructure as code: no more Terraform runs on individual machines, replaced by centralized governance and execution. Those changes, and several others, eliminated significant operational toil, saving an estimated 8,000 developer hours across the engineering organization.
This post will drill down into the specific workflows and organizational changes that made those time and cost savings possible. Then we’ll see how Benchling used those extra hours to focus on higher-value work, including disaster recovery automation and targeted cloud cost optimizations.
This story comes from a HashiConf fireside chat featuring Christian Monaghan, Senior Infrastructure Engineer at Benchling.
»The cost of decentralized Terraform runs
In its early days, Benchling relied on what Christian described as laptop-based Terraform execution. Developers ran terraform plan and terraform apply directly from their own machines, carrying responsibility for execution, timing, and coordination themselves. That approach worked early on, but it became increasingly fragile as the company and its cloud footprint grew.
At the time, Benchling was managing roughly 350 Terraform workspaces. Some infrastructure changes touched dozens of those workspaces, and in certain cases, a single change could fan out to nearly 100 workspaces.
That scale turned routine changes into slow, attention-draining work:
- Terraform plans could take up to 30 minutes per workspace
- Large changes required manual coordination across environments
- Developers often had to babysit releases as plans and applies progressed
- Releasing infrastructure changes demanded sustained focus over long periods
To avoid conflicts, teams relied on a Slack channel to call “dibs” on environments — a way of signaling that someone was testing a change and asking others not to touch that environment until they were done.
The team had already tried to reduce some of this friction by building a custom Python script to orchestrate Terraform runs across multiple environments. While the script helped kick off runs in parallel, it didn’t eliminate the underlying problem. Developers still had to review results workspace-by-workspace, watching long-running plans and intervening when something went wrong.
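Benchling's script isn't public, but a minimal sketch of this kind of fan-out orchestration (with hypothetical function and directory names) shows both what it buys you and what it doesn't: runs happen in parallel, yet each result still lands back on a human.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def plan_cmd(workspace_dir: str) -> list[str]:
    """Build the `terraform plan` invocation for one workspace directory."""
    return ["terraform", f"-chdir={workspace_dir}", "plan",
            "-detailed-exitcode", "-no-color"]

def run_plan(workspace_dir: str) -> tuple[str, int]:
    """Run one plan and return (workspace, exit code).

    With -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes pending.
    """
    result = subprocess.run(plan_cmd(workspace_dir), capture_output=True, text=True)
    return workspace_dir, result.returncode

def plan_all(workspace_dirs: list[str], parallelism: int = 8) -> dict[str, int]:
    """Fan plans out across workspaces in parallel.

    Parallelism helps, but someone still has to read every plan output
    and intervene on failures, which is the toil the script couldn't remove.
    """
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return dict(pool.map(run_plan, workspace_dirs))
```

Even with the fan-out automated, the review loop stays serial and human, which is why the script reduced elapsed time but not attention cost.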
When an incident occurred, figuring out what went wrong meant chasing down:
- Commit history
- Execution timing
- Who ran the change, and from which machine
The pain was visible across the engineering organization, and the limits of laptop-based workflows made the case for change obvious.
»Shifting execution from laptops to a shared platform
Benchling adopted HCP Terraform roughly two and a half years ago, giving developers the platform they needed to centralize infrastructure execution and drastically reduce coordination toil.
Instead of developers running Terraform locally, HCP Terraform became the system responsible for:
- Executing plans and applies
- Managing state and locking
- Recording which commit triggered each run
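On the configuration side, pointing an existing Terraform root module at HCP Terraform is a small change. A sketch, with placeholder organization and workspace names:

```hcl
terraform {
  cloud {
    # Placeholder names for illustration, not Benchling's actual setup
    organization = "example-org"

    workspaces {
      name = "networking-prod"
    }
  }
}
```

With a block like this in place, running `terraform plan` from a developer machine streams a remote run instead of executing locally, and state, locking, and run history live in HCP Terraform rather than on laptops.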
That change alone removed a significant amount of manual coordination and guesswork from the process, especially when incidents occurred.
It also laid the groundwork for broader changes in how Benchling structured its infrastructure.
»Rethinking workspace strategy once automation removed friction
Benchling’s initial workspace strategy was shaped by pain. Running Terraform was slow and attention-heavy, so the team tried to minimize how often it had to happen.
That pain led to some bad practices:
- Action: Fewer workspaces, to reduce the number of runs
- Result: Very large workspaces, some managing 4,000+ resources
HCP Terraform eventually led the team to adopt better practices because its centralized and automated execution workflows drastically reduced the amount of attention required.
That allowed the team to shift in the opposite direction:
- Biasing toward more workspaces
- Keeping workspaces smaller to reduce the blast radius if a harmful change was introduced
- Letting automation, instead of engineers, absorb the cost of scale
»Managing everything as code, not just IaaS
As Benchling’s infrastructure matured, the team extended its infrastructure as code approach beyond cloud resources alone. Wherever a Terraform provider existed for a tool they were using, they preferred to orchestrate that system with Terraform.
That meant that, instead of just orchestrating AWS infrastructure, they could also orchestrate setup for:
- Datadog
- PagerDuty
- and more

… all through one platform.
This approach helped standardize workflows across a growing SaaS surface area and reduced the number of one-off operational processes engineers had to learn or maintain.
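As an illustration of that pattern (resource names and arguments here are simplified examples, not Benchling's actual config), a single Terraform configuration can manage monitoring and paging alongside cloud resources, using the same plan/apply workflow for all of them:

```hcl
terraform {
  required_providers {
    datadog = {
      source = "DataDog/datadog"
    }
    pagerduty = {
      source = "PagerDuty/pagerduty"
    }
  }
}

# A Datadog monitor managed as code, side by side with cloud infrastructure
resource "datadog_monitor" "api_latency" {
  name    = "API p95 latency"
  type    = "metric alert"
  query   = "avg(last_5m):p95:trace.http.request{service:api} > 0.5"
  message = "Page the on-call team"
}

# A PagerDuty team defined through the same workflow
resource "pagerduty_team" "platform" {
  name = "Platform Engineering"
}
```

Because every one of these systems goes through the same pull request, plan, and apply cycle, engineers learn one operational process instead of one per tool.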
»Improving security by changing the release model
One of the most significant side effects of adopting HCP Terraform was how it changed Benchling’s security posture — without requiring a separate security-driven initiative.
In their earlier workflows, developers needed direct production access to run Terraform from their laptops. After the transition, that model changed entirely:
- Developers submit changes via pull requests
- HCP Terraform handles execution in production
- Direct production access for developers is reduced to almost zero
That shift made several things possible:
- Smaller blast radius for security exposures
- Easier integration with security processes
- The ability to introduce Sentinel policy as code into the workflow — something that was effectively impossible when runs happened on laptops
Because HCP Terraform was already in place, later efforts to harden production access were significantly easier than they would have been otherwise.
»Making the transition without disrupting teams
While the technical case for HCP Terraform was clear, the bigger challenge was changing how developers worked day to day. As Christian noted during the discussion, selling the value of a new tool wasn’t especially difficult. What took more care was shifting established workflows.
Benchling approached the transition gradually, focusing first on building confidence rather than enforcing change:
- Starting with a small number of development workspaces
- Running hands-on demos and tutorials
- Expanding adoption only after teams could see how the workflow worked in practice
Team selection also played an important role. Although many teams touched infrastructure code, three teams managed roughly 90% of the infrastructure. Onboarding those teams first helped establish clear patterns that the rest of the organization could follow.
The rollout moved quickly but deliberately:
- About 3 months to build a prototype workflow in development
- Roughly 6 months to migrate most production workspaces
As HCP Terraform became the standard way to manage infrastructure, onboarding additional teams became more predictable. The security team, in particular, emerged as an early adopter. Their involvement, especially around policy and governance, reinforced HCP Terraform’s role as a central platform rather than a tool owned by a single team.
»Security and governance as enablers, not gates
As Benchling centralized infrastructure execution, it also created an opportunity to rethink how security and governance fit into their workflows. Instead of treating policy as a blocking mechanism, the team deliberately approached it as a learning tool.
The security team took ownership of authoring automated Sentinel policies, with the initial goal of increasing visibility rather than enforcing strict controls. Policies were first deployed in informational mode (called advisory in Sentinel), allowing teams to see how proposed infrastructure changes aligned with Benchling’s standards without preventing releases.
That approach was intentional:
- Visibility before enforcement, so teams could understand the impact of policies and how they would have to change their code before submission
- Avoid blocking releases while platform, compliance, and security teams fine-tune policies to be clear and correct
- Build confidence in developers’ ability to meet policies before turning blocking enforcement on
At Benchling, Sentinel policies today only flag issues during runs but do not block changes. Developers see policy results alongside other signals they already trust, such as test results and CI checks.
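To make the mechanics concrete, a minimal advisory-style Sentinel policy (a hypothetical example, not one of Benchling's actual policies) might flag any AWS resource created without a cost-allocation tag:

```sentinel
import "tfplan/v2" as tfplan

# Hypothetical check: flag resources being created without a "team" tag
untagged = filter tfplan.resource_changes as _, rc {
	rc.change.actions contains "create" and
	rc.change.after.tags else {} not contains "team"
}

main = rule {
	length(untagged) is 0
}
```

Run at the advisory enforcement level, a failing result like this appears in the run output as feedback rather than as a blocked apply, which is the behavior the section above describes.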
Over time, this has shifted how policy is perceived. Rather than acting as a gate that slows teams down, policy becomes another source of feedback that helps developers understand expectations and improve infrastructure changes before anything is deployed. With that confidence in place, stricter enforcement is possible when the organization is ready.
»Freeing up time to focus on DR, cloud cost reduction, and customers
Eliminating manual infrastructure toil saved thousands of hours and made day-to-day operational work much easier. What do you do with 8,000 extra hours?
Benchling decided they would use that reclaimed time to have their teams focus on long-deferred priorities, including:
- Automating disaster recovery processes
- Driving deeper cost optimization efforts
- Enabling engineers to focus on features and improvements that directly support Benchling’s customers
Strong infrastructure tagging discipline also played an important role. By embedding tags into reusable Terraform modules, Benchling ensured that new resources inherited cost allocation metadata by default, without relying on manual enforcement.
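One common way to embed that discipline in a reusable module (shown here as an illustrative sketch with hypothetical tag keys, not Benchling's exact setup) is to merge caller-supplied tags over a set of required defaults, so every resource the module creates carries cost-allocation metadata:

```hcl
variable "tags" {
  description = "Extra tags supplied by the calling configuration"
  type        = map(string)
  default     = {}
}

locals {
  # Caller tags are merged over required defaults; callers can add
  # or override keys but cannot accidentally omit cost metadata.
  default_tags = merge(
    {
      team       = "platform"  # hypothetical default for illustration
      managed_by = "terraform"
    },
    var.tags,
  )
}

resource "aws_s3_bucket" "data" {
  bucket = "example-data-bucket"
  tags   = local.default_tags
}
```

Because tagging happens inside the module, cost allocation works by default for every consumer rather than depending on each team remembering to tag resources manually.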
Their cost explorations also uncovered some unexpected findings. NAT gateways, for example, emerged as one of the larger cost drivers, which was a surprise even for a team with a strong operations background. Having the time and data to investigate those patterns made targeted cost optimization possible in ways that hadn’t been feasible before.
»Business impact and what comes next
The most tangible outcome of these changes was time. By moving away from laptop-based Terraform execution and eliminating manual coordination and babysitting, Benchling estimates it reclaimed approximately 8,000 engineering hours across the organization.
Those savings were driven by a few core shifts:
- No more running Terraform plan and apply from individual machines
- Centralized, parallelized runs handled by HCP Terraform
- Less human attention required to coordinate and monitor releases
That reclaimed time was reinvested into higher-value work, including:
- Reducing dev environment sprawl by reassessing how many environments were truly needed and how large they needed to be
- Right-sizing overprovisioned databases, particularly in non-production environments
- Fully automating disaster recovery
Several lessons emerged from the journey:
- Centralize execution, not ownership: Teams can still contribute through code without running infrastructure themselves
- Start with visibility before enforcement: Learning builds trust faster than hard gates
- Remove humans from repetitive infrastructure tasks wherever possible
Looking ahead, Benchling continues to iterate on its infrastructure practices. Areas of focus include:
- Staged or canary-style rollouts for production changes
- Reducing repetitive human approvals for production changes
- Exploring AI-assisted review to cut down on manual oversight
The team is also interested in better workflow orchestration for processes that extend beyond Terraform alone, including emerging capabilities like Terraform Stacks.
Throughout this process, Benchling has leaned on HashiCorp’s field experience to benchmark its approach against other large-scale users.
You can take advantage of that field experience too and benchmark your approach against other large-scale users. Let’s have a chat!
That outside perspective, combined with ongoing experimentation, has helped the team validate what’s working, identify what needs improvement, and continue evolving its infrastructure operations as the company grows.
You can watch Benchling’s full fireside chat here: