Reducing AWS EBS Volume Cost — Lessons from an Instana SRE

At Instana, we store a lot of customer telemetry data in various databases. Part of our production environment runs in Amazon Web Services (AWS). We use encrypted EBS volumes to store the telemetry data securely and reliably, but the default volume type, GP2, costs around $0.10 per gigabyte per month. That does not sound like much, but when your storage is measured in many terabytes, it starts to matter.

When we recently reviewed our cost structure with Mike Julian and Corey Quinn from Duckbill Group, they pointed out that many of our volumes could actually be switched from GP2 to ST1. This would be a significant cost reduction, as ST1 volumes cost roughly half as much!
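To put a number on that, here is a minimal sketch of the arithmetic. The GP2 price is the approximate figure quoted above; the ST1 price is simply assumed to be half of it ("roughly half"), not taken from a price list, and the function names are made up for illustration. Real prices vary by region.

```python
# Approximate GP2 price from the text above (USD per GB per month).
GP2_PRICE_PER_GB_MONTH = 0.10
# Assumption: ST1 is "roughly half" the GP2 price; check the real price list.
ST1_PRICE_PER_GB_MONTH = GP2_PRICE_PER_GB_MONTH / 2


def monthly_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Monthly storage cost of one volume in USD."""
    return size_gb * price_per_gb_month


def monthly_savings(
    size_gb: float,
    from_price: float = GP2_PRICE_PER_GB_MONTH,
    to_price: float = ST1_PRICE_PER_GB_MONTH,
) -> float:
    """Monthly saving in USD when moving one volume to the cheaper type."""
    return monthly_cost(size_gb, from_price) - monthly_cost(size_gb, to_price)


if __name__ == "__main__":
    size_gb = 5 * 1000  # a 5TB volume, as discussed in this post
    print(f"GP2: ${monthly_cost(size_gb, GP2_PRICE_PER_GB_MONTH):.2f} per month")
    print(f"Saving on ST1: ${monthly_savings(size_gb):.2f} per month")
```

Multiply the per-volume saving by the number of volumes that qualify to estimate the total.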

So we looked at the main factors that differentiate GP2 and ST1 (Source: Amazon EBS Volume Types):

  • ST1 volumes must be at least 500 GB, while GP2 volumes can be smaller
  • Max throughput for ST1 is twice as high as for GP2 (500 MiB/s vs. 250 MiB/s)
  • GP2 offers far more Input/Output Operations Per Second (IOPS) than ST1

So it turns out that what really matters for being able to switch is the required IOPS. However, the IOPS one gets are not that easy to calculate. For GP2, it is a function of the volume size. A bigger volume gets more IOPS, up to a cap of 16,000 IOPS, which is reached at a little over 5TB. Volumes smaller than 1TB follow a more complicated formula, with a baseline performance plus a burst budget. Since we are dealing mostly with volumes larger than 2TB, we never paid attention to the burst budget.
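As a rough sketch, the GP2 baseline described above can be written down in a few lines. It assumes the figures from the Amazon EBS volume type documentation at the time of writing: 3 IOPS per GiB, a floor of 100 IOPS, a ceiling of 16,000 IOPS, and bursting to 3,000 IOPS only for volumes whose baseline is below that. The function names are made up for illustration.

```python
GP2_IOPS_PER_GIB = 3       # baseline IOPS earned per GiB of volume size
GP2_MIN_IOPS = 100         # floor for very small volumes
GP2_MAX_IOPS = 16_000      # cap, reached at a little over 5TB
GP2_BURST_IOPS = 3_000     # burst level available to small volumes


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS of a GP2 volume as a function of its size in GiB."""
    return min(GP2_MAX_IOPS, max(GP2_MIN_IOPS, GP2_IOPS_PER_GIB * size_gib))


def gp2_can_burst(size_gib: int) -> bool:
    """Bursting only matters while the baseline is below the burst level."""
    return gp2_baseline_iops(size_gib) < GP2_BURST_IOPS


if __name__ == "__main__":
    for size_gib in (100, 500, 2048, 5334):
        print(size_gib, gp2_baseline_iops(size_gib), gp2_can_burst(size_gib))
```

For volumes of 2TB and more, the baseline already exceeds the burst level, which is why the burst budget can be ignored for them.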

So we looked at some of our volumes, and indeed, we found this:

At first glance, this looks very good. We do not use the 16k IOPS, but rather a maximum of 141 for writes, and just 32MiB/s of write throughput. But since these are hourly averages over the week, we need to zoom into the high-fidelity data that Instana has to verify how big the bursts can be:

Yikes! So it does occasionally spike very high. Now the question remained: would ST1 actually work?

The math for ST1 volumes is more complex than it looks on the surface. Our 5TB volume would get a “Base Throughput” of 200MiB/s. Amazon does not state how many IOPS that is, but it does mention that on ST1 one I/O operation equals 1MiB. So that means we only get 200 IOPS! That looks insufficient for us. Or is it?

This is where “Burst Throughput” comes into play. The volume can do 500MiB/s (which, as we just learned, equates to 500 IOPS) during bursts. The burst budget goes up to 1TiB, which translates to roughly 2,100 seconds, or about 35 minutes, of continuous 500MiB/s usage. Since we average around 100 IOPS, well below the 200 IOPS baseline, the burst budget keeps refilling and the occasional spikes can draw from it.
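The burst reasoning can be checked with a small simulation. This is a simplified token-bucket sketch, not Amazon's exact accounting: it assumes credits refill at the base rate every second, are spent at the delivered rate, and that the volume may go up to the burst rate whenever any credits remain. The 200MiB/s, 500MiB/s and 1TiB figures are the ones from this post; the load pattern and function names are made up for illustration.

```python
MIB_PER_TIB = 1024 * 1024


def seconds_of_burst(bucket_mib: float, burst_mib_s: float) -> float:
    """Simple estimate: a full bucket divided by the burst rate."""
    return bucket_mib / burst_mib_s


def simulate_bucket(load_mib_s, base_mib_s, burst_mib_s, bucket_cap_mib):
    """Replay a per-second load and return (delivered throughput, final credits).

    Simplified model: each second the bucket gains base_mib_s credits and
    loses whatever was actually delivered, clamped to [0, bucket_cap_mib].
    """
    credits = bucket_cap_mib  # assume the bucket starts full
    delivered = []
    for demand in load_mib_s:
        limit = burst_mib_s if credits > 0 else base_mib_s
        served = min(demand, limit)
        credits = min(bucket_cap_mib, max(0.0, credits + base_mib_s - served))
        delivered.append(served)
    return delivered, credits


if __name__ == "__main__":
    print(f"{seconds_of_burst(MIB_PER_TIB, 500):.0f} s of continuous burst")
    # Hypothetical load: steady 100MiB/s with one 10-second spike to 450MiB/s.
    load = [100] * 10 + [450] * 10 + [100] * 30
    delivered, credits = simulate_bucket(load, 200, 500, MIB_PER_TIB)
    print("spike fully served:", delivered == load)
    print("bucket full again:", credits == MIB_PER_TIB)
```

In this model, a short spike barely dents the bucket, and a load below the base rate refills it within seconds.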

After a decent amount of analysis, we decided to trial it on canary nodes in our clusters.

The volume type can be easily changed in the AWS console, or via API.
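For the API route, a minimal sketch using boto3's EC2 `modify_volume` call could look like the following. The helper name and the volume ID in the usage comment are placeholders, not taken from our setup, and the sketch leaves out waiting for the modification to finish.

```python
def switch_volume_type(ec2_client, volume_id: str, target_type: str = "st1") -> str:
    """Request a volume type change and return the reported modification state.

    ec2_client is expected to be a boto3 EC2 client, for example
    boto3.client("ec2"). The change happens online; the volume stays attached.
    """
    response = ec2_client.modify_volume(VolumeId=volume_id, VolumeType=target_type)
    return response["VolumeModification"]["ModificationState"]


# Usage (requires boto3 and AWS credentials; the volume ID is a placeholder):
#
#   import boto3
#   ec2 = boto3.client("ec2")
#   print(switch_volume_type(ec2, "vol-0123456789abcdef0"))
```

Progress can then be followed with the client's `describe_volumes_modifications` call or in the AWS console.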

Are you able to see when we changed it? No, the release marker is for the rollout of our release-173 on the evening of Tuesday, April 14th.

You can also actually see the slightly lower load we had over the long Easter Weekend (this is a server from our EU datacenter).

We changed the volume type at noon on Saturday, April 11th, and there was no detectable change in any metrics, except cost!

So what about those crazy read spikes? Yeah, they still occur:

But obviously they cannot get to 1000 IOPS anymore. The peak takes a few seconds longer than it would on GP2, but that is totally worth the saved cost.

Thanks Mike and Corey for the tip!
