Things break all the time in distributed systems: Part 1 ClickHouse
Some weeks are more challenging for an SRE than others. Some days you have more time working on projects, on other days you need to deal with production problems.
Last week was one of those weeks for me.
We had data node failures, had to do root-cause analysis, fix issues and find ways to prevent the same problem from occurring again. It all started out with failures in different datastore clusters: ClickHouse, Kafka, ElasticSearch and Cassandra. One thing all of these datastores have in common is that they run in large clusters with data replication enabled across availability zones. Therefore a single failing node does not impact our customers. Our system is built to handle these kinds of things. Nevertheless it is critical to resolve these problems so that they do not escalate.
In the following blog posts I want to talk a bit about what happened, how we discovered the problems and how we resolved them.
ClickHouse node failure
For each component in our architecture we have a set of service level objectives defined as events that are automatically updated with each release. One of these events alerted us via OpsGenie, that distributed inserts for a ClickHouse node are piling up.
A little bit of background on ClickHouse. We run several ClickHouse clusters in our regions. Each cluster consists of up to ten shards with two nodes per shard for data replication. We use distributed tables to store the data. Every time a component wants to write data to a distributed table it sends data to an arbitrary ClickHouse node. This node then takes care of forwarding the data to other nodes. ClickHouse stores this data temporarily on disk before forwarding the data to another node.
Here is one of the alerts we received:
By drilling down we could see that files are constantly piling up and that the CH node is not able to send the files to the other nodes in the cluster.
Next we checked the health of the underlying EC2 instance and compared it to the other ClickHouse nodes. Looking at CPU usage, network utilization and overall LOAD it was clear that the ClickHouse process was not healthy anymore.
We stopped the ClickHouse process, did some regular maintenance tasks and restarted the process. After that, ClickHouse was back up and running and started sending the distributed files to other nodes again.
As a result of the post-mortem for this issue, we improved the ClickHouse cluster dashboard with release 185. We added combined node metrics to the cluster dashboard so it is easier to see the overall cluster state and alerts on these metrics.
We were able to quickly fix the broken ClickHouse node and have the cluster back in a stable state in less than 15min. Our custom alerts as well as built-in alerts helped us identify the problem quickly. Another thing I like about Instana is that you get nice immediate visual feedback once the problem is fixed due to the fine grain one-second resolution on the charts.
In part 2 of this mini blog series we will cover the problems we encountered with one of our Cassandra clusters.
If you want to learn more on how we utilize ClickHouse check out this guest webinar Yoann Buch and myself did together with Altinity.