The First Cut is the Deepest: Avoiding Costly Mistakes in Cloud Spend Optimization
Wasting your way to Chapter 11
It’s an accepted industry estimate that around 30-33% of cloud spend is wasted [1]. In my experience, that is an underestimate, but let’s take it as a starting point. That means that of the roughly $560 billion spent on cloud in 2023 [2], somewhere between $170 and $185 billion was overspend. Given the size of the problem, and the current changes to the financial landscape, my view is that non-optimized cloud spend is a phenomenon whose time has passed. SMEs (55% of which spend between $600k and $12 million annually on cloud [3]) have a strong incentive to control their spend, and given how fast the fiscal landscape changed, they are often turning to quick-fix solutions like SaaS bill optimizers or free vendor-provided tooling. In doing so, they are making, in my estimation, business-damaging mistakes.
With that background, let’s get into it.
Defeating the underpants gnomes
The golden path for cost optimization of an existing workload is deceptively simple:
- Optimize Workload & Tooling (only pay for what you use)
- Optimize Rate (never pay retail)
- Account Discount (buy in bulk)
The problem is that most SaaS tooling, both from third parties and from vendors, uses a different approach:
- Optimize rate for current workload
- …
- Profit!
If you’re lucky, the rate optimization recommendation will come with some context-free instance rightsizing recommendations, but even then SaaS tooling lacks, and will always lack, the architectural view necessary to achieve the best discounts.
This underpants gnomes [4] approach misses the biggest opportunities for savings and, worse, locks you into that mistake for the duration of your pricing agreement. If you treat FinOps as a button you push using available tooling, not only are you wasting money, you’re stopping yourself from fixing the bigger problem, often for years.
To give an example of the risks of doing this out of order, I’ll use a real-world case study. It’s long, detailed, and a little dry, but it’s necessary to understand the cost of *not* doing it (which in this case was around $18,000,000).
Setting the scene
$500k/month average EMR spend, running for the last 5 years.
$300k/month on a single static cluster, plus an average of $200k/month on ephemeral EMR workloads that varied widely by month: some months none, some months $500k.
The $300k/month static cluster was covered by a 1-year upfront compute savings plan. Ephemeral workloads were 100% on-demand.
Previous FinOps work was limited to implementing the rate optimisation recommended by tooling and vendors (compute savings plans).
Pressure to find cost savings in a market downturn, plus a change in internal chargeback, meant that when the compute savings plan came up for renewal, there was appetite for a more in-depth FinOps engagement.
So, the first task was to make sure we were only paying for what we were using. That means minimizing idle capacity wherever possible by using serverless or autoscaling resources.
This is where the biggest savings for the least effort are available, but to be effective, it relies on a deep understanding of the workload, the current architecture, and the vendor’s offerings.
The “well-architected” choice is to autoscale or use serverless to closely match compute to demand. In this case, autoscaling was a non-starter.
At inception time, the choice had been made to use an external vendor for the granular permissions, which increased node scale out time to 30 minutes (serving as a great reminder of the 12-factor rule of avoiding external hard dependencies). Further, registering nodes with the external API was unreliable. It was not feasible to revisit this choice as it was foundational to the industry-required certification of the platform.
EMR Serverless was not available in the partition, and even if we had looked at moving the workload into AWS commercial, its local storage was insufficient for the usage pattern of the workload. Again, a ‘FinOps as a button’ approach would have led to a broken cluster.
So, paint by numbers doesn’t work here. The previous team had evaluated, been stymied, and gone no further.
The fix is in
Step one: even if we can’t autoscale, we can adjust the workload for demand. The workload was predictable, and with 5 years of logs, there was high confidence in the accuracy of those predictions. Task nodes were cut in half on weekdays out of hours, and cut entirely at the weekend (using only core nodes for weekend scheduled jobs). Out of the gate, this single change saved 50% of the running cost of the static cluster, $150k a month, for approximately 1 hour of engineering work.
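As a rough illustration of that schedule, here is a minimal sketch in Python. The article doesn’t give the actual business hours, and the 40-node fleet size is borrowed from the spot discussion later on, so treat `BUSINESS_HOURS`, `FULL_TASK_NODES`, and both function names as assumptions rather than the real implementation.

```python
from datetime import datetime, timedelta

# Assumed values for illustration only; the real cluster's figures are not given.
FULL_TASK_NODES = 40            # assumed size of the task-node fleet
BUSINESS_HOURS = range(9, 17)   # assumed 09:00-17:00 local time


def target_task_nodes(now: datetime, full: int = FULL_TASK_NODES) -> int:
    """Return the desired task-node count for a given moment."""
    if now.weekday() >= 5:           # Saturday or Sunday: core nodes only
        return 0
    if now.hour in BUSINESS_HOURS:   # weekday, in hours: full capacity
        return full
    return full // 2                 # weekday, out of hours: half capacity


def weekly_task_node_fraction(full: int = FULL_TASK_NODES) -> float:
    """Fraction of always-on task-node hours still scheduled over one week."""
    monday = datetime(2024, 1, 1)    # a Monday, used only to walk one full week
    scheduled = sum(
        target_task_nodes(monday + timedelta(hours=h), full)
        for h in range(7 * 24)
    )
    return scheduled / (full * 7 * 24)


if __name__ == "__main__":
    print(f"{weekly_task_node_fraction():.1%} of task-node hours remain")
```

Under those assumed hours, a little under half of the original task-node hours remain, which is in line with the roughly 50% saving described above. A scheduled job could feed the output of `target_task_nodes` into whatever resize mechanism the cluster supports.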
The most cost-effective compute instance is the one you’re not running. Cut the fat. Step one in any FinOps plan should be identifying idle load and killing it. If you can run load only during office hours (Mon-Fri, 9 til 5), you switch off 128 hours out of 168, which saves about 76% of your cost. This is a deeper discount than any RI or savings plan, and competitive with the best spot instance pricing.
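The office-hours arithmetic is easy to check for any schedule. The helper below is a small sketch (the function name is mine) that assumes cost scales linearly with running hours.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def schedule_saving(days_per_week: int, hours_per_day: int) -> float:
    """Fraction of an always-on bill saved by running only on a schedule.

    Assumes cost is proportional to running hours.
    """
    running_hours = days_per_week * hours_per_day
    return 1 - running_hours / HOURS_PER_WEEK


office_hours_saving = schedule_saving(days_per_week=5, hours_per_day=8)
print(f"{office_hours_saving:.1%}")  # prints 76.2%
```

Swap in your own days and hours to see what a schedule is worth before comparing it against a commitment discount.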
Once the scheduling had been done, the next step was taking advantage of the type of workload. EMR workloads are by nature fault tolerant to task node failure. That makes them the ideal use case for preemptible compute (spot), yet the entire workload (static and ephemeral clusters) was 100% on-demand. When we queried this, the customer expressed a very low appetite for risk. This was a combined outcome of a record of failed batch jobs (which, on investigation, were due to resource contention on the shared cluster) and the belief that using spot instances would result in more failed jobs (not understanding the fault-tolerant nature of the platform).
We overcame this by using metrics and education. Most of that can be skimmed over, but a large factor was being able to use the AWS Spot Instance Advisor to show that the instance classes used in the cluster experienced a sub-5% interruption rate (in practical terms, lower than the granularity of the tool), so in a cluster with 40 task nodes, fewer than 2 per month would be interrupted. On that basis, we persuaded the client to undertake a trial, where we replaced 5 on-demand task nodes with 20 spot nodes (a cost-neutral position) for a month, during which time we saw zero interruptions. Once this was complete, the client agreed to let us update the code for the cluster to use instance fleets instead of uniform instance groups (to allow auto-balancing of spot and on-demand). We further diversified the instance types to include 4 equivalent compute profiles and tuned the fleet for price-capacity optimisation.
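For readers who haven’t used instance fleets, here is a hedged sketch of what a task fleet definition can look like. It builds a plain dictionary in the shape accepted by the `InstanceFleets` parameter of the boto3 EMR `run_job_flow` call. The four instance types, the capacity split, and the 30-minute timeout are placeholders I’ve chosen for illustration, not the values used on this engagement.

```python
from typing import Any

# Hypothetical set of four similarly sized instance types (16 vCPU, 64 GiB).
# The real cluster's types are not given in the article.
ASSUMED_TASK_TYPES = ["m5.4xlarge", "m5a.4xlarge", "m5d.4xlarge", "m6i.4xlarge"]


def task_fleet(
    on_demand_units: int,
    spot_units: int,
    instance_types: list[str] = ASSUMED_TASK_TYPES,
) -> dict[str, Any]:
    """Build a TASK instance fleet definition mixing on-demand and spot."""
    return {
        "Name": "task-fleet",
        "InstanceFleetType": "TASK",
        "TargetOnDemandCapacity": on_demand_units,
        "TargetSpotCapacity": spot_units,
        # Diversifying across equivalent types gives EMR more spot pools to draw on.
        "InstanceTypeConfigs": [
            {"InstanceType": t, "WeightedCapacity": 1} for t in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 30,            # assumed value
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",  # fall back rather than fail
                "AllocationStrategy": "price-capacity-optimized",
            }
        },
    }


fleet = task_fleet(on_demand_units=5, spot_units=35)
```

The resulting dictionary would sit alongside the master and core fleet definitions in the cluster’s launch configuration. Check the allocation strategies supported by your EMR release before relying on `price-capacity-optimized`.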
After 6 months of operation, there had been 1 preemption event across all clusters. We then further optimized the ephemeral workloads: they were originally constructed to use core nodes only, which are not fault tolerant. An investigation showed that this was due to a misapprehension about how task node storage worked. We therefore applied the same methodology, rebuilding with a mix of master, core, and task nodes, with task nodes running as spot fleets, and using a spread of instance types of the same size.
Once that was done, aggregate cluster-wide running costs before commit or enterprise discounts dropped by ~60% on ephemeral workloads, and by nearly 70% for the static workload. We then purchased a 1-year upfront compute savings plan for the static (master/core) nodes and ran everything else on spot or on-demand instances. With the combination of workload sizing, re-engineering, and spot fleets, our total run cost reduction was 80%, reducing a $500k monthly bill to about $100k. In real terms, this was the difference between the business unit being a sustainable business in a new financial landscape, or flying into the side of a mountain.
The last step was to activate cost-allocation tags on the ephemeral workloads, so they were charged back to the right team, and to put policies in place to block any untagged deployment. The shared cluster was a separate and more gnarly chargeback problem, relying as it did on analysis of YARN logs, so it was out of scope for this work. Cost-allocation tagging doesn’t directly save money, but you should do it anyway in a shared environment. Proper attribution of shared costs helps to avoid all kinds of moral hazards and drives good behaviour. Simply making sure teams pay the bill for their work democratizes the incentives to spend efficiently, and the closer they are to the workload, the better equipped they are to optimize it.
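The “block anything untagged” rule can be enforced in several places (organisation-level policy, CI checks, deployment tooling). As a minimal sketch of the check itself, assuming hypothetical required tag keys `team` and `cost-centre` that are not taken from the actual engagement:

```python
# Hypothetical required cost-allocation tag keys; pick whatever your chargeback needs.
REQUIRED_TAGS = ("team", "cost-centre")


def missing_cost_tags(tags: dict[str, str]) -> list[str]:
    """Return the required tag keys that are absent or blank."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]


def assert_deployable(tags: dict[str, str]) -> None:
    """Refuse a deployment whose cost-allocation tags are incomplete."""
    missing = missing_cost_tags(tags)
    if missing:
        raise ValueError(f"deployment blocked, missing tags: {', '.join(missing)}")


assert_deployable({"team": "data-eng", "cost-centre": "1234"})  # passes silently
```

A check like this is most useful early in the pipeline, so an untagged cluster is rejected before it ever starts accruing cost.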
If we had taken the approach recommended by SaaS/vendor tooling, the total discount available to us was only 20% of the bill (all up front, and on the static cluster portion only), so ~$100k/month. Compared to the ~$400k/month saved by the architecture-first approach, that left $300k a month on the table.
Worse, if we’d done that, we’d have locked ourselves out of realizing the bulk of the remaining savings because, and this is the real core of the argument, if we do the optimisation afterwards, the spend is already committed: you keep paying for capacity you no longer use. Likewise, if you have an enterprise discount plan based on projections of pre-optimised spend, the value of your cost optimisation can be wiped out by that commitment too. All you can do is hope to soak some of it up elsewhere in the org.
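To make the cost of getting the order wrong concrete, the figures from this case study can be worked through in a few lines. The variable names are mine; the numbers are the rounded monthly figures quoted above and the five years the workload had been running.

```python
MONTHLY_BILL = 500_000             # original monthly EMR spend
ARCHITECTURE_FIRST_BILL = 100_000  # monthly spend after the full engagement
TOOLING_ONLY_SAVING = 100_000      # monthly saving from rate optimisation alone
MONTHS = 5 * 12                    # the workload had been running for five years

architecture_first_saving = MONTHLY_BILL - ARCHITECTURE_FIRST_BILL
left_on_table_per_month = architecture_first_saving - TOOLING_ONLY_SAVING
five_year_cost = left_on_table_per_month * MONTHS

print(f"${left_on_table_per_month:,}/month, ${five_year_cost:,} over five years")
# prints $300,000/month, $18,000,000 over five years
```

That $18,000,000 is the headline cost of the case study: the gap between pushing the rate-optimisation button and doing the work in the right order.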
And they all lived happily ever after
Thanks for getting through to the end: I hope after all that effort you’ve picked up the two main points I was trying to get across.
First, in cost optimization, order of the work is not only important, it is critical. If you let yourself be driven by SaaS tooling in this space, you’re not just losing out on efficiency, you risk painting yourself into a very expensive corner.
Second: the problem is too idiosyncratic, and the problem space too broad, for it ever to be easily solved by software alone. You’re leveraging expertise: the ability to understand the workload, and in-depth knowledge of the platform, the tooling, and the vendor’s cost strategies that can be brought to bear. Every cookie-cutter ‘best practices’ approach here was a dead end or a mistake. FinOps isn’t a button, it’s a fully realized, managed, and potentially complex process. Whether you hire it in or build the capability yourself, given the scope of impact and the duration of the value that you derive from the work, sleeping on that capability is often a costly choice.
Post Credits Jumpscare Bonus Scene
The best time to optimise your cloud architecture spend is before you build it. The second best time is now. There was nothing stopping this tuning being done before a single line of code was written, and this is one scenario where you should absolutely shift *all the way left*. Privileging the fastest, simplest path to delivery and underspending on skilled cloud architect resources at the start ended up costing ~$18,000,000 over five years. A two-week architecture design sprint, and spending $25k/month on an architect for the first three months during implementation (and listening to them) would have prevented that.