Life of an SRE at Instana – Things break all the time in distributed systems – Part 2: Cassandra

September 10, 2020


Things break all the time in distributed systems: Part 2 Cassandra

This post is a continuation of the previous blog post, “Things break all the time in distributed systems: Part 1 ClickHouse”.

In these blog posts I want to give some details about what happened, how we discovered the problems and how we resolved them.

Tweet
Source: https://twitter.com/MarcelBirkner/status/1301399614503911425

Cassandra node failure

This time we got several built-in warnings from Instana that alerted us about problems with one of our Cassandra nodes. A single Cassandra node failing in our large clusters is not a problem and does not impact customers since all data is replicated. It is still critical to resolve problems quickly, just in case the underlying problem impacts more nodes.

Cassandra alert

From the alerts we jumped directly to the affected Cassandra node and checked some key metrics. The JVM metrics are a good indicator of whether a Cassandra process is running stably: in this case, Garbage Collection and Suspension looked off.

JVM metrics

For Cassandra we have defined several service level objectives (SLOs), and one of them makes sure that we do not run out of EBS burst credits on the ST1 volumes. In this case we did run out of burst credits and had to convert the volume back to GP2. After we restarted the affected Cassandra node, the problem was resolved.
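To make the idea of a burst-credit SLO concrete, here is a minimal sketch of how such a check could work. It is not Instana's implementation: the function names, the sample format and the warning and critical thresholds are all hypothetical. It assumes you already have a series of burst balance readings in percent, for example from the CloudWatch `BurstBalance` metric.

```python
from typing import Optional, Sequence, Tuple

# One reading: (unix timestamp in seconds, burst balance in percent).
Sample = Tuple[float, float]


def burst_balance_status(
    samples: Sequence[Sample],
    warn_below: float = 50.0,      # hypothetical threshold, tune to your SLO
    critical_below: float = 20.0,  # hypothetical threshold, tune to your SLO
) -> str:
    """Classify the most recent burst balance reading."""
    if not samples:
        return "unknown"
    latest = samples[-1][1]
    if latest < critical_below:
        return "critical"
    if latest < warn_below:
        return "warning"
    return "ok"


def seconds_to_depletion(samples: Sequence[Sample]) -> Optional[float]:
    """Linearly extrapolate from the first and last reading to an empty bucket.

    Returns None when there is too little data or the balance is not falling.
    """
    if len(samples) < 2:
        return None
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    slope = (b1 - b0) / (t1 - t0)  # percent per second
    if slope >= 0:
        return None
    return b1 / -slope


readings = [(0.0, 80.0), (600.0, 70.0)]
print(burst_balance_status(readings), seconds_to_depletion(readings))
```

Alerting on the projected time to depletion, rather than only on the current balance, leaves more room to react before the volume falls back to its baseline throughput.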

EBS burst balance alert
You can read up on how to monitor EBS metrics using the AWS sensor here: https://www.instana.com/docs/ecosystem/aws-ebs/

Throughput credits and burst performance: Like gp2, st1 uses a burst-bucket model for performance. Volume size determines the baseline throughput of your volume, which is the rate at which the volume accumulates throughput credits. Volume size also determines the burst throughput of your volume, which is the rate at which you can spend credits when they are available. Larger volumes have higher baseline and burst throughput. The more credits your volume has, the longer it can drive I/O at the burst level. Source: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html
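The burst-bucket model described in the quote can be sketched in a few lines. This is a simplified simulation, not AWS's actual accounting: the `BurstBucket` class is hypothetical, and the throughput and capacity numbers in the example are illustrative placeholders rather than the real limits of any volume size.

```python
from dataclasses import dataclass


@dataclass
class BurstBucket:
    baseline_mib_s: float  # rate at which credits accumulate
    burst_mib_s: float     # maximum rate while credits are available
    capacity_mib: float    # size of the credit bucket
    credits_mib: float     # current credit balance

    def step(self, demand_mib_s: float, seconds: float = 1.0) -> float:
        """Advance the model by `seconds` and return the delivered throughput."""
        # With credits left the volume may burst; otherwise it is capped at baseline.
        limit = self.burst_mib_s if self.credits_mib > 0 else self.baseline_mib_s
        delivered = min(demand_mib_s, limit)
        # Credits refill at the baseline rate and are spent at the delivered rate.
        self.credits_mib += (self.baseline_mib_s - delivered) * seconds
        # The balance can neither go negative nor exceed the bucket size.
        self.credits_mib = max(0.0, min(self.capacity_mib, self.credits_mib))
        return delivered

    @property
    def balance_percent(self) -> float:
        return 100.0 * self.credits_mib / self.capacity_mib


# Illustrative numbers only: a full bucket under sustained burst-level demand.
bucket = BurstBucket(baseline_mib_s=40, burst_mib_s=250,
                     capacity_mib=2100, credits_mib=2100)
seconds = 0
while bucket.credits_mib > 0:
    bucket.step(demand_mib_s=250)
    seconds += 1
print(f"bucket empty after {seconds}s, now capped at {bucket.step(250)} MiB/s")
```

Once the bucket is empty, sustained demand above the baseline no longer gets served at burst speed, which matches the throughput drop we saw on the affected node when the credits ran out.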

 

Using our built-in Cassandra cluster dashboards we can easily verify that the cluster read / write performance is not impacted.

Cassandra cluster dashboard

Summary

Monitoring burst credits for the ST1 EBS volumes helped us identify the underlying root cause in this case. Keeping an eye on the ST1 volume burst credits in the AWS UI is a bit cumbersome, so I am glad we monitor these metrics with Instana’s AWS sensor.
