
Life of an SRE at Instana – Scaling down Cassandra nodes

February 27, 2020

We, the SRE team at Instana, are starting a blog series to share technical stories on how we manage and monitor our Instana SaaS infrastructure 24/7, across several continents, for customers around the world. Each article is loosely coupled to the others and tackles a single topic with a limited scope, to keep the posts short and interesting. This is the first entry in the series.

Scaling down Cassandra nodes

We run large Cassandra clusters to store span and metric data. One of these clusters grew to around 200 TiB spread across 65 nodes. Thanks to architectural improvements from Engineering, we were able to bring the total storage size down from 207 TiB to about 120 TiB, saving roughly 40% of storage capacity. To realize these savings in our infrastructure costs, we need to scale down the Cassandra cluster. Given its size, we decommission one Cassandra node at a time and wait until it has left the cluster before removing the next one.
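
In practice, the workflow for a single node boils down to the sketch below; the individual steps are described in the rest of this post. The polling loop and sleep intervals are only illustrative assumptions, not a hardened script.

# Per-node sketch (run on the node that should be removed):
nodetool decommission &      # start the decommission, takes 24-36h in our setup
sleep 60                     # give the node a moment to switch into LEAVING mode

# wait while the node is still streaming its data to the remaining nodes
while nodetool netstats | grep -q "Mode: LEAVING"; do
    sleep 600
done

echo "node has left the ring - the instance can now be terminated"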

All components and datastores in our infrastructure are monitored using Instana. Here is a screenshot of the current Cassandra cluster disk size:

How to decommission a Cassandra node

First we check the cluster health in the Cassandra cluster dashboard.
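
The dashboard gives us the cluster-wide view; for a quick double check on the command line, the same information is available from nodetool status. The small awk filter below is only an illustrative sketch that counts nodes which are not in the healthy UN (Up/Normal) state.

# Sketch: count nodes that are not in the "UN" (Up/Normal) state.
# Status lines of the nodetool output start with a two-letter status/state code.
> nodetool status | awk '/^[UD][NLJM] / && $1 != "UN" { bad++ } END { print bad+0 " node(s) not in UN state" }'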

When the cluster is healthy, we decommission the first node. Be careful not to decommission the seed nodes that you have configured in cassandra.yaml.

 

# login to cassandra node and start decommissioning task
> nodetool decommission &

 

Instana will immediately detect state changes in the Events view. In the following case you will see that the node is leaving the cluster. 

The Events view is very useful for filtering change events: by searching for “DECOMMISSIONED”, for example, we can see when nodes left the Cassandra cluster. Here is a screenshot of how this looks for one of our regions; it shows all change events for nodes leaving the cluster during a two-week period.

Cassandra Change Events

Using nodetool status you can verify the current state of the leaving node.

 

> nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load      Tokens  Owns (effective)  Host ID                               Rack
[...]
UN  10.255.185.88   2.77 TiB  256     6.4%              58e5ceab-fee4-401b-a2db-6d900c18694e  rack1
UL  10.255.154.83   2.92 TiB  256     6.8%              5f83764a-09dc-4bb2-92cf-c815adce2906  rack1
UN  10.255.170.209  2.89 TiB  256     6.8%              97139797-20e1-4322-a0a4-c7b2a3562b23  rack1
[...]

 

Once a node has left the cluster, it no longer appears in this list. The following message also shows up in the Cassandra logs once the node has left:

INFO  ... 2020-02-22 22:18:52,632 StorageService.java:4094 - Announcing that I have left the ring for 30000ms
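
If you prefer to wait for this message on the shell rather than searching the logs afterwards, something like the following works; the log file location is an assumption and depends on how Cassandra is installed.

# Sketch: follow the Cassandra system log until the node announces that it
# has left the ring (the log path below is installation-specific).
> tail -F /var/log/cassandra/system.log | grep --line-buffered -m 1 "Announcing that I have left the ring" && echo "decommission finished"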

 

To check progress you can use nodetool netstats -H. This can help to verify that the leaving process did not get stuck along the way.

 

> nodetool netstats -H | grep -v 100%

Mode: LEAVING
Unbootstrap db3f84d0-4a9f-11ea-9a05-77939f28ad95
   /10.255.179.78
       Sending 2852 files, 70.37 GiB total. Already sent 2444 files, 45.3 GiB total
           /mnt/data/cassandra/data/<keyspace>/<table>-d38c23809c7a11e89218356b8d10d187/md-93739-big-Data.db 2515599/18301568 bytes(13%) sent to idx:0/10.255.179.78
   /10.255.185.33
       Sending 2915 files, 79.49 GiB total. Already sent 2308 files, 36.43 GiB total
...
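
Since a node needs roughly a day or more to leave in our setup, we re-run this command from time to time. Wrapping it in watch (the interval below is arbitrary) is a convenient way to keep an eye on the remaining files and bytes:

# Sketch: refresh the streaming progress every 5 minutes,
# hiding transfers that have already reached 100%
> watch -n 300 'nodetool netstats -H | grep -v 100%'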

 

Creating an alert for the change event

With our current setup it takes about 24 to 36 hours for a Cassandra node to leave the cluster before we can terminate the instance. To be notified as soon as a node has left the cluster, we added an alert that sends us a message to OpsGenie and notifies us via Slack. This allows us to act immediately and save money.

Create Alert:

  1. Specify a name for the alert so you can easily find it again.
  2. Select “Alert on Event Type(s)” and enable “Changes”.
  3. Use a “Dynamic Focus Query” to find all events that contain “DECOMMISSIONED”.
  4. Specify the “Alert Channel” that you want to use. We use OpsGenie and Slack to get notified.

Summary

Instana’s automatic discovery and monitoring of Cassandra nodes and clusters, as well as its alerting on change events, make our daily work as SREs a lot easier. We do not have to scan the logs for specific messages and can rely on Instana’s alerting mechanism to automate as much of our daily workload as possible.

