We, the SRE team at Instana, are starting a blog series to share technical stories on how we manage and monitor our Instana SaaS infrastructure 24/7, across several continents, for customers around the world. Each article tackles a single topic with a deliberately limited scope, so the entries stay short and interesting. This is the first entry in the series.
Scaling down Cassandra nodes
We run large Cassandra clusters to store span and metrics data. One of these clusters grew to around 200 TB spread across 65 nodes. Thanks to architectural improvements from Engineering, we were able to bring the total store size down from 207 TiB to about 120 TiB, saving roughly 40% of the storage capacity. To reduce our infrastructure costs, we needed to scale down the Cassandra cluster. Because of the size of our clusters, we had to decommission one Cassandra node at a time and wait until it had fully left the cluster before removing the next one.
All components and datastores in our infrastructure are monitored using Instana. Here is a screenshot of the current Cassandra cluster disk size:
How to decommission a Cassandra node
First we check the cluster health in the Cassandra cluster dashboard.
When the cluster is healthy, we decommission the first node. Be careful not to decommission a seed node that you have configured in cassandra.yaml.
# login to cassandra node and start decommissioning task
> nodetool decommission &
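Before running the command, a small pre-flight check can guard against accidentally decommissioning a seed node. The following is a minimal sketch (the helper name `check_not_seed` is ours, not part of Cassandra or Instana), assuming the default cassandra.yaml style of listing seeds as a comma-separated, quoted string:

```shell
# Hypothetical pre-flight check before `nodetool decommission`:
# refuse to proceed if the target IP appears in the seed list of cassandra.yaml.
check_not_seed() {
  local yaml_file="$1" node_ip="$2"
  local seeds
  # cassandra.yaml lists seeds as e.g.:  - seeds: "10.0.0.1,10.0.0.2"
  seeds=$(sed -n 's/.*seeds: *"\([^"]*\)".*/\1/p' "$yaml_file")
  if printf '%s\n' "$seeds" | tr ',' '\n' | grep -qxF "$node_ip"; then
    echo "ABORT: $node_ip is a seed node" >&2
    return 1
  fi
  echo "OK: $node_ip is not a seed node"
}
```

A wrapper script could call this with the node's own IP before kicking off the decommission, e.g. `check_not_seed /etc/cassandra/cassandra.yaml "$NODE_IP" && nodetool decommission &`.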
Instana immediately detects state changes and shows them in the Events view. In the following case you can see that the node is leaving the cluster.
The Events view is very useful for filtering change events, e.g. by searching for “DECOMMISSIONED” we can see when nodes left the Cassandra cluster. Here is a screenshot of how this looks for one of our regions; it shows all change events for nodes leaving the cluster during a two-week period.
Using nodetool status you can verify the current state of the leaving node.
> nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load      Tokens  Owns (effective)  Host ID                               Rack
[...]
UN  10.255.185.88   2.77 TiB  256     6.4%              58e5ceab-fee4-401b-a2db-6d900c18694e  rack1
UL  10.255.154.83   2.92 TiB  256     6.8%              5f83764a-09dc-4bb2-92cf-c815adce2906  rack1
UN  10.255.170.209  2.89 TiB  256     6.8%              97139797-20e1-4322-a0a4-c7b2a3562b23  rack1
[...]
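Since the decommission takes many hours, it is handy to script this check instead of eyeballing the output. A minimal sketch (the helper name `node_state` is ours) that extracts the two-letter status/state flags for one node from captured `nodetool status` output:

```shell
# Extract the status/state flags (e.g. UN = Up/Normal, UL = Up/Leaving)
# for a given node IP from captured `nodetool status` output.
node_state() {
  local status_output="$1" node_ip="$2"
  # Column 1 holds the two-letter status/state, column 2 the node address
  printf '%s\n' "$status_output" | awk -v ip="$node_ip" '$2 == ip { print $1 }'
}
```

A cron job or loop could poll `node_state "$(nodetool status)" "$NODE_IP"` every few minutes: the state stays `UL` while the node is leaving and the line disappears entirely once it has left the ring.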
Once a node has left the cluster, it no longer appears in this list. You will also find the following message in the Cassandra logs once the node has left:
INFO ... 2020-02-22 22:18:52,632 StorageService.java:4094 - Announcing that I have left the ring for 30000ms
To check progress you can use nodetool netstats -H. This helps verify that the leaving process has not gotten stuck along the way.
> nodetool netstats -H | grep -v 100%
Mode: LEAVING
Unbootstrap db3f84d0-4a9f-11ea-9a05-77939f28ad95
    /10.255.179.78
        Sending 2852 files, 70.37 GiB total. Already sent 2444 files, 45.3 GiB total
            /mnt/data/cassandra/data/<keyspace>/<table>-d38c23809c7a11e89218356b8d10d187/md-93739-big-Data.db 2515599/18301568 bytes(13%) sent to idx:0/10.255.179.78
    /10.255.185.33
        Sending 2915 files, 79.49 GiB total. Already sent 2308 files, 36.43 GiB total
...
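To turn this into a single progress number, the per-peer totals can be summed. Here is a minimal sketch (the helper name `remaining_gib` is ours), assuming all totals in the captured output are reported in GiB as in the example above:

```shell
# Sum up how many GiB are still to be streamed, parsed from captured
# `nodetool netstats -H` output. Assumes all per-peer totals are in GiB.
remaining_gib() {
  printf '%s\n' "$1" | awk '
    {
      for (i = 1; i <= NF; i++) {
        # "Sending <n> files, <x> GiB total."  -> x is 3 fields after "Sending"
        if ($i == "Sending") total += $(i + 3)
        # "Already sent <n> files, <y> GiB total" -> y is 4 fields after "Already"
        if ($i == "Already") sent += $(i + 4)
      }
    }
    END { printf "%.2f\n", total - sent }'
}
```

Calling `remaining_gib "$(nodetool netstats -H)"` periodically and watching the number shrink is a quick sanity check that streaming is still making progress.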
Create an alert for the change event
With our current setup it takes about 24 to 36 hours for a Cassandra node to leave the cluster before we can terminate the instance. To be notified right away once a node has left the cluster, we added an alert that sends a message to OpsGenie and notifies us via Slack. This allows us to act immediately and save money.
- Specify a name for the Alert so you can easily find it again
- Select Event “Alert on Event Type(s)” and enable “Changes”
- Use a “Dynamic Focus Query” to find all events that contain “DECOMMISSIONED”
- Specify the “Alert Channel” that you want to use. We use OpsGenie and Slack to be notified
Instana’s automatic discovery and monitoring of Cassandra nodes and clusters, combined with alerting on change events, makes our daily work as SREs a lot easier. We no longer have to scan the logs for certain messages and can rely on Instana’s alerting mechanism to automate as much of our daily workload as possible.