Things break all the time in distributed systems: Part 2 Cassandra
This post is a continuation of the previous post, “Things break all the time in distributed systems: Part 1 ClickHouse”.
In these blog posts I want to give some details about what happened, how we discovered the problems and how we resolved them.
Cassandra node failure
This time we got several built-in warnings from Instana that alerted us about problems with one of our Cassandra nodes. A single Cassandra node failing in our large clusters is not a problem and does not impact customers since all data is replicated. It is still critical to resolve problems quickly, just in case the underlying problem impacts more nodes.
From the alerts we jumped directly to the affected Cassandra node and checked some key metrics. The JVM metrics are a good indicator of whether a Cassandra process is running stably: in this case, Garbage Collection and Suspension looked off.
For Cassandra we have several service level objectives defined and one of those SLOs makes sure that we do not run out of EBS burst credits for the ST1 volumes. In this case we did run out of burst credits and had to convert the volume back to GP2. After restarting the affected Cassandra node the problem was resolved.
You can read up on how to monitor EBS metrics using the AWS sensor here: https://www.instana.com/docs/ecosystem/aws-ebs/
Throughput credits and burst performance: Like gp2, st1 uses a burst-bucket model for performance. Volume size determines the baseline throughput of your volume, which is the rate at which the volume accumulates throughput credits. Volume size also determines the burst throughput of your volume, which is the rate at which you can spend credits when they are available. Larger volumes have higher baseline and burst throughput. The more credits your volume has, the longer it can drive I/O at the burst level. Source: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html
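To make the burst-bucket model quoted above more concrete, here is a minimal Python sketch that estimates how long an st1 volume can sustain a given throughput before its credits run out. The numbers are assumptions taken from the AWS documentation at the time of writing (40 MiB/s baseline and 250 MiB/s burst per TiB, both capped at 500 MiB/s, and a bucket of 1 TiB of credits per TiB of volume size), so please verify them before relying on the result. The function names are made up for this illustration.

```python
MIB_PER_TIB = 1024 * 1024

# Assumed st1 figures from the AWS documentation; verify before relying on them.
BASELINE_MIB_S_PER_TIB = 40
BURST_MIB_S_PER_TIB = 250
MAX_THROUGHPUT_MIB_S = 500


def st1_limits(size_tib):
    """Return (baseline, burst) throughput in MiB/s for a volume of the given size."""
    baseline = min(BASELINE_MIB_S_PER_TIB * size_tib, MAX_THROUGHPUT_MIB_S)
    burst = min(BURST_MIB_S_PER_TIB * size_tib, MAX_THROUGHPUT_MIB_S)
    return baseline, burst


def seconds_until_bucket_empty(size_tib, demand_mib_s, credits_mib=None):
    """Estimate how long the volume can serve demand_mib_s before credits hit zero.

    Returns None when the demand does not exceed the baseline, because the
    bucket then refills at least as fast as it drains.
    """
    baseline, burst = st1_limits(size_tib)
    if credits_mib is None:
        # Assume a full bucket: 1 TiB of credits per TiB of volume size.
        credits_mib = size_tib * MIB_PER_TIB
    # The volume cannot go faster than its burst limit.
    throughput = min(demand_mib_s, burst)
    # Credits are earned at the baseline rate and spent at the actual throughput.
    drain_mib_s = throughput - baseline
    if drain_mib_s <= 0:
        return None
    return credits_mib / drain_mib_s
```

Under these assumptions, `seconds_until_bucket_empty(1, 250)` for a 1 TiB volume with a full bucket comes out at roughly 83 minutes of sustained burst throughput. After that the volume falls back to its baseline rate, which is the kind of slowdown a busy Cassandra node notices quickly.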
Using our built-in Cassandra cluster dashboards we can easily verify that the cluster read / write performance is not impacted.
Monitoring burst credits for the ST1 EBS volumes helped us identify the underlying root cause in this case. Keeping an eye on the ST1 volume burst credits using the AWS UI is a bit cumbersome. Therefore, I am glad we monitor these metrics with Instana’s AWS sensor.
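If you want to see what such a check looks like without any tooling in between, the following Python sketch reads the `BurstBalance` metric (a percentage in the `AWS/EBS` CloudWatch namespace) for a single volume and compares it against a threshold. This is not how Instana's AWS sensor works internally; it is only an illustration. The function names, the 20% threshold and the one-hour lookback window are assumptions made up for this example, and the `cloudwatch` argument is expected to be a boto3 CloudWatch client.

```python
from datetime import datetime, timedelta, timezone


def lowest_burst_balance(cloudwatch, volume_id, lookback_minutes=60):
    """Return the lowest BurstBalance (percent) in the lookback window, or None."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=lookback_minutes)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName="BurstBalance",
        Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
        StartTime=start,
        EndTime=end,
        Period=300,  # 5-minute buckets
        Statistics=["Minimum"],
    )
    datapoints = response.get("Datapoints", [])
    if not datapoints:
        return None
    return min(point["Minimum"] for point in datapoints)


def burst_balance_ok(balance_percent, threshold_percent=20.0):
    """Hypothetical SLO check: the balance must be known and at or above the threshold."""
    if balance_percent is None:
        # No data means we cannot confirm the volume is healthy.
        return False
    return balance_percent >= threshold_percent
```

A call could look like `burst_balance_ok(lowest_burst_balance(boto3.client("cloudwatch"), volume_id))`, where `volume_id` is the ID of the EBS volume you want to check. Treating missing data as a failure is a deliberate choice in this sketch: a silent metric should not count as a healthy volume.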