Life of an SRE at Instana – Scaling down Cassandra nodes

February 27, 2020


We, the SRE team at Instana, are starting a blog series to share technical stories on how we manage and monitor our Instana SaaS infrastructure 24/7, across several continents, for customers around the world. Each article is kept short, tackles a single topic, and is only loosely coupled to the others, so the series stays focused and interesting. This is the first entry in the series.

Scaling down Cassandra nodes

We run large Cassandra clusters to store span and metric data. One of these clusters grew to about 207 TiB across 65 nodes. Thanks to architectural improvements from Engineering, we were able to bring the total storage size down from 207 TiB to about 120 TiB, saving roughly 40% of storage capacity. To reduce our infrastructure costs, we then needed to scale down the Cassandra cluster itself. Due to the size of our clusters, we decommission one Cassandra node at a time and wait until it has left the cluster before removing the next one.
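The one-node-at-a-time workflow boils down to waiting until no node is in a transitional state before touching the next one. As a minimal sketch (a hypothetical helper, not part of our tooling), the function below reads nodetool status output on stdin and succeeds only when no node is Leaving, Joining, or Moving:

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) only when no node in the
# `nodetool status` output read from stdin is in a transitional state.
# Node lines start with two letters: Up/Down plus Normal/Leaving/Joining/Moving,
# so any line starting with UL/UJ/UM (or DL/DJ/DM) means the cluster is not stable.
cluster_is_stable() {
  ! grep -Eq '^[UD][LJM]'
}

# Usage sketch (run on a Cassandra node; assumes nodetool is on PATH):
#   nodetool status | cluster_is_stable && nodetool decommission
```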

All components and datastores in our infrastructure are monitored using Instana. Here is a screenshot of the current Cassandra cluster disk size:

How to decommission a Cassandra node

First we check the cluster health in the Cassandra cluster dashboard.

When the cluster is healthy, we decommission the first node. Be careful not to decommission any of the seed nodes you have configured in cassandra.yaml.


# Log in to the Cassandra node and start the decommissioning task in the background.
# Since decommissioning can run for many hours, consider starting it inside
# tmux/screen or via nohup so it survives a disconnected session.
> nodetool decommission &


Instana immediately detects state changes in the Events view. In the following case you can see that the node is leaving the cluster.

The Events view is very useful for filtering change events, e.g. by searching for “DECOMMISSIONED” we can see when nodes left the Cassandra cluster. Here is a screenshot of how this looks for one of our regions, showing all change events for nodes leaving the cluster during a two-week period.

Cassandra Change Events

Using nodetool status you can verify the current state of the leaving node.


> nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load      Tokens  Owns (effective)  Host ID                               Rack
UN           2.77 TiB  256     6.4%              58e5ceab-fee4-401b-a2db-6d900c18694e  rack1
UL           2.92 TiB  256     6.8%              5f83764a-09dc-4bb2-92cf-c815adce2906  rack1
UN           2.89 TiB  256     6.8%              97139797-20e1-4322-a0a4-c7b2a3562b23  rack1


Once a node has left the cluster, it no longer appears in this list. You will also find the following message in the Cassandra logs once the node has left the cluster.

INFO  ... 2020-02-22 22:18:52,632 - Announcing that I have left the ring for 30000ms
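To confirm this from the command line rather than by scrolling through logs, a small grep wrapper does the trick. This is a sketch; the log path is an assumption (package installs typically write to /var/log/cassandra/system.log) and may differ on your system:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given Cassandra log file contains the
# "left the ring" announcement written when decommissioning completes.
node_left_ring() {
  grep -q 'Announcing that I have left the ring' "$1"
}

# Usage sketch (log path is an assumption, adjust to your installation):
#   node_left_ring /var/log/cassandra/system.log && echo "node has left the cluster"
```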


To check progress you can use nodetool netstats -H. This helps verify that the leaving process has not become stuck along the way.


> nodetool netstats -H | grep -v 100%

Unbootstrap db3f84d0-4a9f-11ea-9a05-77939f28ad95
       Sending 2852 files, 70.37 GiB total. Already sent 2444 files, 45.3 GiB total
           /mnt/data/cassandra/data/<keyspace>/<table>-d38c23809c7a11e89218356b8d10d187/md-93739-big-Data.db 2515599/18301568 bytes(13%) sent to idx:0/
       Sending 2915 files, 79.49 GiB total. Already sent 2308 files, 36.43 GiB total
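When watching a decommission that runs for a day or more, it helps to reduce the netstats output to a single number. The helper below is a sketch against the output format shown above (not official Cassandra tooling): it sums the Sending lines read on stdin and prints how many files still have to be streamed:

```shell
#!/bin/sh
# Hypothetical helper: reads `nodetool netstats -H` output on stdin and prints
# the number of files still to be streamed (total files minus files already sent).
netstats_files_remaining() {
  awk '/Sending/ { total += $2; sent += $9 } END { print total - sent }'
}

# Usage sketch (run on the decommissioning node):
#   nodetool netstats -H | netstats_files_remaining
```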


Create Alert for change event

With our current setup it takes about 24 to 36 hours for a Cassandra node to leave the cluster before we can terminate the instance. To be notified as soon as a node has left the cluster, we added an alert that sends a message to OpsGenie and notifies us via Slack. This allows us to act right away and save money.

Create Alert:

  1. Specify a name for the alert so you can easily find it again
  2. Under “Alert on Event Type(s)”, enable “Changes”
  3. Use a “Dynamic Focus Query” to find all events that contain “DECOMMISSIONED”
  4. Specify the “Alert Channel” you want to use; we use OpsGenie and Slack to be notified


Instana’s automatic discovery and monitoring of Cassandra nodes and clusters, as well as alerting on change events, makes our daily work as SREs a lot easier. We do not have to scan the logs for specific messages and can rely on Instana’s alerting mechanism to automate as much of our daily workload as possible.

