
How to control cloud costs without doing less

Cloud and platform teams are always looking for new ways to cut costs without lowering the quality of their software. Here are some tips that might help you do that.

»Transcript

Today, I wanted to do a quick video on controlling cloud costs. I think there's often this preconception that controlling cloud costs by definition means that you fundamentally have to do less with cloud. While that might be true in some instances where you might have a very optimized cloud environment, in many instances, it's actually not the case. There are a bunch of other things to think about in terms of the approach to doing this. So, I wanted to highlight a few of those so people can think about the various means to control cloud costs. 

»Dev-test environments

The first one, and I think in many cases the best place to start, is dev-test environments. For many enterprise customers that we work with, cloud spend in dev-test is often almost as large as what they're spending in production environments, and a lot of that ends up being driven by waste.

Oftentimes, these dev-test environments get spun up for something ephemeral. You might be running a CI/CD pipeline where every test run spins up some infrastructure, or each developer has their own dev environment where they're working on their application. But oftentimes the resources that are spun up in a dev-test environment far outlive the actual purposes of development or testing, which results in a whole lot of waste.

So, when we think about how you optimize dev-test, it's not that you shouldn't have dev-test environments; it's that you should be much more diligent about cleaning up those resources when they're not used. 

»Automated cleanup 

When we say being more diligent, one option is to depend on human discipline, but you'll find that doesn't scale particularly well. Instead, the question is how to take a more programmatic approach. If you're running these dev-test environments, you want automated ways to clean them up.

A great example of this: anything that's auto-provisioned as part of a CI/CD pipeline gets a time to live of, say, 72 hours. So, I'm provisioning this as part of my CI/CD process and creating a bunch of infrastructure, but after 72 hours, I'm going to automatically clean up those resources.

Same thing for a development environment. Any developer might be able to ask for a sandbox in a cloud environment to do some dev-test work, but I'm going to auto-destroy that environment after seven days if they're not using it.

That way, I'm not depending on the discipline of Alice who went on vacation but forgot to turn off the infrastructure before she left, and so for three weeks, I'm paying for it. There's an automated process that says because you spun it up on Monday and you still weren't using it the next Monday, we auto-destroy it, and now you don't have to be paying that cloud bill. Or Bob left the company, and nobody realized that Bob's development environment is still running 90 days later. And now, you've spent a quarter paying for those completely unused resources. 

The core strategy for dev-test is a programmatic approach: apply a time to live and automate the cleanup so that you're not paying for those resources. That's the first opportunity when we think about dev-test.
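To make that concrete, here's a minimal Terraform sketch of the idea. It assumes a scheduled cleanup job outside of Terraform (a "reaper" script or pipeline, not shown) that scans for an expiry tag and destroys anything past its time to live; the tag key and resource names are illustrative, not a standard convention.

```hcl
variable "runner_ami" {
  description = "AMI for the ephemeral CI runner (placeholder)"
  type        = string
}

resource "aws_instance" "ci_runner" {
  ami           = var.runner_ami
  instance_type = "t3.medium"

  tags = {
    environment = "ci"
    # 72-hour TTL: the external cleanup job compares this tag to the
    # current time and destroys the instance once it has expired.
    # Note: timestamp() is re-evaluated on every plan, so a real setup
    # would record the creation time once (e.g., via a pipeline variable).
    ttl-expires-at = timeadd(timestamp(), "72h")
  }
}
```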

»Standard patterns

The next area where I think there's a very obvious solution is standardizing on a set of patterns. This becomes important because standard patterns let you do things like optimizing your cloud agreements, whether through reserved instances, enterprise discount plans, or similar programs. If I have ten VM instances spread across five different instance classes, it's hard to get any economies of scale.

But if I have a standard pattern that says I have a set of small, medium, and large sizes, and I'm mapping those into a consistent set of instance types, then on the backend I can say I need a hundred of the smallest, 50 of the mediums, and 30 of the largest, and I can use things like reserved instances to optimize that. This only becomes possible if you're driving a level of standardization, so you have 50 people using the same small image rather than 50 slight variations that never aggregate into anything you can optimize.

I think the best way to do that is to build a library of standard components, for example, Terraform modules. Those might abstract things like VM instances or containers, but the module can be opinionated about how it maps an input of small, medium, or large into a concrete instance type.

That opinion can be baked into the Terraform module. If I drive standard usage of that module, then great: all hundred applications using it are going to map down into a consistent set of underlying instances, and as a platform organization I can optimize that. This becomes the obvious next layer to address.
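As a sketch of what that opinionated module might look like (the size names and instance types here are illustrative assumptions, not a recommendation):

```hcl
# modules/vm/main.tf
variable "size" {
  description = "Abstract size exposed to application teams"
  type        = string

  validation {
    condition     = contains(["small", "medium", "large"], var.size)
    error_message = "Size must be one of: small, medium, large."
  }
}

variable "ami" {
  type = string
}

locals {
  # The platform team's opinion: every "small" in the company maps to the
  # same concrete instance type, so capacity can be aggregated and reserved.
  instance_types = {
    small  = "t3.small"
    medium = "m5.large"
    large  = "m5.2xlarge"
  }
}

resource "aws_instance" "this" {
  ami           = var.ami
  instance_type = local.instance_types[var.size]
}

# A caller (e.g., an application's main.tf) only ever sees the abstraction:
# module "app_vm" {
#   source = "./modules/vm"
#   size   = "small"
#   ami    = var.base_ami
# }
```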

»Vertical rightsizing

Then when you think about the next opportunity, it's what I'd consider rightsizing. Rightsizing can happen in multiple dimensions. One is vertical, and what I mean by that is you might have a quadruple extra-large VM that should be a medium-sized VM. Because you allowed your developers to pick the size, they generally will way over-specify, so you're running 64 cores for an app that really needs eight. 

So, if you’re talking about vertically scaling it, you can look at that and say I'm way underutilized, why do I need such a large VM? Let me downscale to something more reasonable. That'd be vertical scaling. 

»Horizontal rightsizing

On the other side of it is horizontal. Oftentimes, for availability and performance, you're not deploying one instance of an application. You're deploying tens, dozens, hundreds of them, depending on the scale of your application. 

It might turn out that you don't actually need that level of horizontal scale. So, you might look at it and say rather than running 20 instances of this application, I can get away with ten instances and still solve for availability and performance.

»Observability and autoscaling 

When you start to look at these different opportunities, to me, vertical rightsizing is best solved by integration with APM and observability solutions. You can look and say: what's my actual utilization? I have this many cores; how many am I actually using? How far over-provisioned am I? Is there an opportunity to go to a smaller instance size?
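What that integration looks like depends on your observability stack. As one hedged example on AWS, a CloudWatch alarm can flag a chronically idle instance as a rightsizing candidate; the threshold, evaluation window, and the referenced instance and SNS topic are all assumptions for illustration.

```hcl
# Flag an instance whose average CPU stays below 10% for 24 consecutive
# hours as a candidate for a smaller instance type.
resource "aws_cloudwatch_metric_alarm" "underutilized" {
  alarm_name          = "rightsizing-candidate-${aws_instance.app.id}"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  comparison_operator = "LessThanThreshold"
  threshold           = 10
  period              = 3600 # hourly datapoints
  evaluation_periods  = 24   # sustained for a full day

  dimensions = {
    InstanceId = aws_instance.app.id # assumes an aws_instance.app elsewhere
  }

  # Notify the platform team rather than acting automatically.
  alarm_actions = [aws_sns_topic.rightsizing.arn] # assumed SNS topic
}
```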

And then horizontal, I think this is the best opportunity to move to things like autoscaling. If I know my scale is dependent upon some scale factor—number of requests per second, for example, number of clients, etc.—then can I use an autoscaling pattern, where I might say I have a minimum for availability. I always want to have at least five instances, three instances, whatever is the case. Then I might set a maximum that's reasonably high so I don't have to worry if I get a burst of traffic. 

Then I allow my autoscaler to dynamically sit between my minimum and maximum. That way, my horizontal scale is capacity-dependent instead of a fixed size. Rather than saying I'm always running 50 nodes because at peak I need 50, I use the elasticity of a cloud environment to do that horizontal rightsizing.
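Here's a minimal sketch of that min/max pattern as an AWS Auto Scaling group; the floor, ceiling, target value, and the referenced launch template and subnets are illustrative assumptions.

```hcl
resource "aws_autoscaling_group" "app" {
  name                = "app-asg"
  min_size            = 3  # availability floor: never fewer than this
  max_size            = 50 # burst ceiling: high enough to absorb peaks
  desired_capacity    = 3
  vpc_zone_identifier = var.private_subnet_ids # assumed subnet list

  launch_template {
    id      = aws_launch_template.app.id # assumed launch template
    version = "$Latest"
  }
}

# Let the autoscaler float between min and max based on load, so capacity
# tracks demand instead of being pinned at peak size.
resource "aws_autoscaling_policy" "target_cpu" {
  name                   = "scale-on-cpu"
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 60 # aim for ~60% average CPU across the group
  }
}
```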

»Architectural patterns

Once we've done these, which in many ways are the lower-hanging fruit, I think you can start talking about architectural patterns. When I think about architectural ways to optimize, a few different opportunities come to mind.

In particular, in a lot of cloud environments, your network architecture makes a big impact. What I mean by that is oftentimes you're paying for traffic moving between availability zones, in and out of VPCs, over managed NAT gateways.

You can start to look and ask: do I have an architecture that optimizes my spend given the exact same traffic pattern? I've seen cases where, given the way people have designed their networks, the same traffic—the same packet—moves between multiple availability zones and multiple networks, in and out of multiple API gateways, load balancers, and NATs.

So for one packet, you're paying ten times over based on that network path. The question is whether, architecturally, there's a way to optimize the flow of those packets so I'm not paying for the same packet ten times over as it moves in and out of the network.

Oftentimes, that's looking at east-west traffic. So, as service A talks to service B, what's the network path between them? Is there an opportunity to simplify that and keep the traffic within the same VPC, minimizing the number of hops—things like that?

Here you want to look at how east-west traffic moves and how to minimize the number of hops along the way. Oftentimes this is an opportunity to look at how service discovery works, for example, and whether you can optimize how traffic moves between these services.
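One concrete example of trimming a hop, and this is my illustration rather than something named in the talk: if workloads reach S3 through a managed NAT gateway, you pay a per-GB processing charge for that traffic, while a gateway VPC endpoint carries it privately with no data-processing fee.

```hcl
# Route S3-bound traffic through a gateway endpoint instead of the NAT
# path. The VPC, region, and route table references are placeholders.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}
```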

»Data storage and retention  

Then, beyond the network, there's storage. We often see applications suffer from a few different patterns. One is that they store data indefinitely, even though they don't actually look at it for that long.

That's where there's an opportunity to look at your retention intervals. Oftentimes, people will write data to S3, for example, but they might only need that data for 30, 60, 90, or 120 days. Yet they're storing it for five or ten years, effectively forever, because they're never deleting it, rather than using something like S3 lifecycle policies.

Here, you might say the maximum I need it for is 90 days. So, to be safe, let's keep it for a year, and then we can auto-delete that data. That can dramatically reduce the amount of data that you're retaining and lead to a much smaller bill without impacting the application because it's not even referencing that data. There's often this opportunity from a retention perspective. 
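In Terraform, that retention policy is a few lines of S3 lifecycle configuration; the bucket reference and the one-year window below are illustrative.

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id # assumes the bucket is defined elsewhere

  rule {
    id     = "expire-after-one-year"
    status = "Enabled"

    filter {} # apply to every object in the bucket

    # The app only references ~90 days of data; keep a conservative year,
    # then auto-delete instead of retaining indefinitely.
    expiration {
      days = 365
    }
  }
}
```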

»Data compression 

Then I think there's another opportunity around things like compression. Oftentimes, developers aren't compressing data before it's going out to cold storage.

For many types of data, compressing before it's stored at rest can lead to savings of 90+ percent, depending on whether it's text or other highly repetitive patterns. I think there are a lot of opportunities to use storage in slightly different ways that yield large benefits with very minor tweaks to how the application interacts with the underlying stores.
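Compression usually happens in the application or the delivery pipeline rather than in the storage service itself. As one hedged example, not something named in the talk: if data reaches S3 via Kinesis Data Firehose, enabling GZIP on the stream compresses records before they land in cold storage.

```hcl
resource "aws_kinesis_firehose_delivery_stream" "archive" {
  name        = "app-logs-archive" # illustrative name
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn           = aws_iam_role.firehose.arn # assumed IAM role
    bucket_arn         = aws_s3_bucket.archive.arn # assumed bucket
    compression_format = "GZIP" # default is UNCOMPRESSED
  }
}
```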

»Summary and conclusions 

Hopefully, that's helpful as we think about different ways of controlling cloud costs without having to do less. I think it starts by asking what you can do within your dev-test environments regarding automated cleanup. Ultimately, that doesn't impact production because it's dev-test, so it's an easy place to start.

From there, it's about standardization of patterns, because if your app teams use a consistent set of patterns, it's easier for the platform teams to drive those FinOps optimizations downstream.

Then you get into rightsizing, both vertical and horizontal, to use compute more effectively. And from an architectural perspective, think about storage and networking: how they're utilized, and where there's relatively low-hanging fruit to use them more efficiently without doing less in cloud, whether that means fewer workloads or fewer requests. The workloads stay the same, but you're using cloud in a more efficient way.

Hopefully, that's a helpful overview, and it gives a few different things to look at as you think about trying to control cloud costs.

 
