Life of an SRE at Instana – Things break all the time in distributed systems – Part 2: Cassandra

September 10, 2020

Cassandra Failure


This post continues the previous blog post, “Things break all the time in distributed systems: Part 1 ClickHouse”.

In these blog posts I want to give some details about what happened, how we discovered the problems and how we resolved them.


Cassandra node failure

This time we got several built-in warnings from Instana that alerted us about problems with one of our Cassandra nodes. A single Cassandra node failing in our large clusters is not a problem and does not impact customers since all data is replicated. It is still critical to resolve problems quickly, just in case the underlying problem impacts more nodes.

Cassandra alert

From the alerts we jumped directly to the affected Cassandra node and checked some key metrics. The JVM metrics are a good indicator of whether a Cassandra process is running stably: in this case, Garbage Collection and Suspension looked off.

JVM metrics

For Cassandra we have several service level objectives defined and one of those SLOs makes sure that we do not run out of EBS burst credits for the ST1 volumes. In this case we did run out of burst credits and had to convert the volume back to GP2. After restarting the affected Cassandra node the problem was resolved.

EBS burst balance alert
You can read up on how to monitor EBS metrics using the AWS sensor here.


Throughput credits and burst performance: Like gp2, st1 uses a burst-bucket model for performance. Volume size determines the baseline throughput of your volume, which is the rate at which the volume accumulates throughput credits. Volume size also determines the burst throughput of your volume, which is the rate at which you can spend credits when they are available. Larger volumes have higher baseline and burst throughput. The more credits your volume has, the longer it can drive I/O at the burst level. Source:
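To make the burst-bucket model quoted above more concrete, here is a minimal sketch of a throughput credit bucket in Python. It is not AWS's implementation: the function names are hypothetical, the one-second step is a simplification, and the numbers in the usage note are illustrative rather than the published limits of any real volume type.

```python
from typing import List, Tuple


def seconds_of_burst(credits: float, baseline: float, burst: float) -> float:
    """Seconds a full-speed burst can last before the bucket is empty.

    While bursting, credits are spent at `burst` and earned at `baseline`,
    so the bucket drains at (burst - baseline) per second.
    """
    if burst <= baseline:
        return float("inf")  # the bucket never drains
    return credits / (burst - baseline)


def simulate(credits: float, capacity: float, baseline: float,
             burst: float, demand: float, seconds: int) -> Tuple[float, List[float]]:
    """Step the bucket one second at a time.

    Returns the remaining credits and the throughput delivered each second.
    All rates and credits share one unit (for example MiB and MiB/s).
    """
    delivered_per_second = []
    for _ in range(seconds):
        available = credits + baseline            # credits earned this second
        delivered = min(demand, burst, available)  # cannot exceed burst or bucket
        credits = min(capacity, available - delivered)
        delivered_per_second.append(delivered)
    return credits, delivered_per_second
```

With a bucket of 1000 credits, a baseline of 40 and a burst rate of 250, `simulate(1000, 1000, 40, 250, 250, 10)` delivers the burst rate for the first four seconds, a partial fifth second, and then falls back to the baseline of 40. That drop to baseline is the kind of throughput cliff a busy Cassandra node can hit once its credits are gone.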


Using our built-in Cassandra cluster dashboards we can easily verify that the cluster read / write performance is not impacted.

Cassandra cluster dashboard


Monitoring burst credits for the ST1 EBS volumes helped us identify the underlying root cause in this case. Keeping an eye on the ST1 volume burst credits in the AWS UI is a bit cumbersome, so I am glad we monitor these metrics with Instana’s AWS sensor.
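As an illustration of the kind of check such monitoring enables, the sketch below estimates how long it will take for the burst balance to reach zero, using a straight line between the oldest and newest sample. It is not how Instana evaluates its SLOs. The function name is hypothetical, and the input is assumed to be a series of (Unix timestamp, remaining burst balance in percent) pairs, such as the values reported by the CloudWatch `BurstBalance` metric for EBS.

```python
from typing import Optional, Sequence, Tuple


def seconds_until_empty(samples: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Linear estimate of the seconds left until the burst balance hits 0 %.

    `samples` holds (unix_seconds, balance_percent) pairs, oldest first.
    Returns None when there is too little data or the balance is not falling.
    """
    if len(samples) < 2:
        return None
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    slope = (b1 - b0) / (t1 - t0)  # percent per second
    if slope >= 0:
        return None  # balance is flat or recovering
    return b1 / -slope
```

For example, a balance that falls from 100 % to 70 % over ten minutes gives `seconds_until_empty([(0, 100.0), (600, 70.0)])` of roughly 1400 seconds. An alert on a small remaining time leaves room to react before the volume drops to its baseline throughput.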

