At Instana, we store a lot of customer telemetry data in various databases. A part of our production environment runs in Amazon Web Services (AWS). We use encrypted EBS volumes to securely and reliably store the telemetry data, but the default volume type GP2 comes with a cost of around $0.10 per Gigabyte per month. This does not sound like a lot, but when your storage is measured in multiples of Terabytes, it starts to matter.
When we recently reviewed our cost structure with Mike Julian and Corey Quinn from Duckbill Group, they pointed out that many of our volumes could be switched from GP2 to ST1. That would be a significant cost reduction, as ST1 volumes are roughly half the cost!
So we looked at the main factors that differentiate GP2 and ST1 (Source: Amazon EBS Volume Types):
- ST1 volumes must be at least 500GB, while GP2 can be smaller
- Max throughput for ST1 is twice as high as for GP2 (500 MiB/s vs. 250 MiB/s)
- GP2 has much higher Input/Output operations Per Second (IOPS) than ST1
So it turns out that what really matters for being able to switch is the required IOPS. However, the IOPS you actually get are not that easy to calculate. For GP2, it is a function of the volume size: a bigger volume gets more IOPS, capped at 16,000 IOPS, which is reached at around 5 TB. Volumes smaller than 1 TB follow a more complicated model with a baseline performance plus a burst budget. Since we mostly deal with volumes larger than 2 TB, we never had to pay attention to the burst budget.
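To make the size-to-IOPS relationship concrete, here is a minimal sketch. The constants (3 IOPS per GiB, a floor of 100 and a ceiling of 16,000) are taken from the AWS documentation for GP2 as we understand it, not from our own measurements, and the function name is made up for this example.

```python
# Assumed GP2 constants from the AWS EBS documentation (not measured by us).
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000
GP2_BURST_IOPS = 3_000  # volumes with a lower baseline can burst up to this


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS of a GP2 volume: 3 IOPS per GiB, clamped to [100, 16000]."""
    return max(GP2_MIN_IOPS, min(GP2_MAX_IOPS, GP2_IOPS_PER_GIB * size_gib))


def gp2_relies_on_burst(size_gib: int) -> bool:
    """True if the baseline is below the burst ceiling, so burst credits matter."""
    return gp2_baseline_iops(size_gib) < GP2_BURST_IOPS


if __name__ == "__main__":
    for size in (100, 500, 2048, 5334):
        print(size, gp2_baseline_iops(size), gp2_relies_on_burst(size))
```

Under these assumptions a 2 TiB volume already has a baseline above the burst ceiling, which is why the burst budget never mattered for our larger volumes.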
So we looked at some of our volumes, and indeed, we found this:
At first glance, this looks very good. We do not come anywhere near the 16k IOPS: the maximum is 141 IOPS for writes, and just 32 MiB/s for write throughput. But since these values are hourly averages over the week, we need to zoom into the high-fidelity data that Instana has to verify how big the bursts can be:
Yikes! So it does occasionally spike very high. So the question remained: would ST1 actually work?
The math for ST1 volumes is more complex than it looks on the surface. Our 5 TB volume would get a “Base Throughput” of 200 MiB/s. Amazon does not state how many IOPS that is, but it does mention that on ST1 one I/O operation equals 1 MiB. So that means we only get 200 IOPS! That would be insufficient for our spikes. Or would it?
This is where “Burst Throughput” comes into play. The volume can burst to 500 MiB/s (which, as we just learned, equates to 500 IOPS). The burst budget goes up to 1 TiB, which translates to roughly 2000 seconds, or about 33 minutes, of continuous 500 MiB/s usage. Since we average around 100, well below the baseline of 200, our burst budget is constantly refilled and the occasional spikes can draw from it.
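The reasoning above can be sketched as a simple per-second token bucket. This is a simplified model under stated assumptions, not Amazon's actual implementation: we assume 40 MiB/s of base throughput per TiB (which matches the 200 MiB/s for 5 TB), a 500 MiB/s burst ceiling, a 1 TiB bucket that starts full, and credits that accrue at the base rate. All function names are made up for this example.

```python
MIB_PER_TIB = 1024 * 1024


def st1_base_mib_s(size_tib: float) -> float:
    """Assumed ST1 base throughput: 40 MiB/s per TiB, capped at 500 MiB/s."""
    return min(500.0, 40.0 * size_tib)


def simulate_bucket(demand_mib_s, base_mib_s, burst_mib_s=500.0,
                    bucket_mib=MIB_PER_TIB):
    """Simulate the burst bucket second by second.

    Returns (served, credits): the throughput actually served each second
    and the credits left in the bucket at the end.
    """
    credits = bucket_mib  # assumption: the bucket starts full
    served = []
    for demand in demand_mib_s:
        # Credits accrue at the base rate, capped at the bucket size.
        credits = min(bucket_mib, credits + base_mib_s)
        # Serve as much as demanded, limited by burst ceiling and credits.
        rate = min(demand, burst_mib_s, credits)
        credits -= rate
        served.append(rate)
    return served, credits


if __name__ == "__main__":
    base = st1_base_mib_s(5)  # 200 MiB/s for a 5 TiB volume
    # One minute at our typical load, then a 30-second spike at the ceiling.
    demand = [100.0] * 60 + [500.0] * 30
    served, credits = simulate_bucket(demand, base)
    print(served == demand, credits / MIB_PER_TIB)
```

In this model a short spike barely dents the bucket, and as long as the average load stays below the base rate the bucket refills between spikes. Only once the credits are exhausted does the served rate fall back to the base throughput.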
After a decent amount of analysis, we decided to trial it on canary nodes in our clusters.
The volume type can easily be changed in the AWS console or via the API.
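For the API route, a minimal sketch using boto3's EC2 `modify_volume` call might look like the following. The volume ID is a placeholder, the helper name is made up for this example, and it assumes boto3 is installed and AWS credentials and region are already configured. It is an illustration, not the exact script we ran.

```python
def switch_volume_type(volume_id: str, target_type: str = "st1", client=None):
    """Request an in-place volume type change for an existing EBS volume.

    `client` can be any object exposing the EC2 `modify_volume` call; if it is
    omitted, a boto3 EC2 client is created (assumes boto3 and credentials).
    """
    if client is None:
        import boto3  # imported lazily so the helper can be tested without AWS
        client = boto3.client("ec2")
    # ModifyVolume changes the type without detaching the volume.
    return client.modify_volume(VolumeId=volume_id, VolumeType=target_type)


if __name__ == "__main__":
    # Placeholder ID: replace with a real volume ID before running.
    response = switch_volume_type("vol-0123456789abcdef0")
    print(response)
```

The modification runs in the background while the volume stays attached and in use, which is what makes trying this on a canary node low-risk.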
Are you able to see when we changed it? No, the release marker is for our rollout of our release-173 on Tuesday April 14th in the evening.
You can also actually see the slightly lower load we had over the long Easter Weekend (this is a server from our EU datacenter).
We changed the volume type at noon on Saturday, April 11th, and there was no detectable change in any metric, except cost!
So what about those crazy read spikes? Yeah, they still occur:
But obviously they cannot get to 1000 IOPS anymore. The peak takes a few seconds longer than it would on GP2, but that is totally worth the saved cost.
Thanks, Mike and Corey, for the tip!