Life of an SRE at Instana – Things break all the time in distributed systems – Part 1: ClickHouse

September 10, 2020

Post

Things break all the time in distributed systems: Part 1 ClickHouse

Some weeks are more challenging for an SRE than others. Some days you have more time working on projects, on other days you need to deal with production problems.

Last week was one of those weeks for me

We had data node failures, had to do root-cause analysis, fix issues and find ways to prevent the same problem from occurring again. It all started out with failures in different datastore clusters: ClickHouse, Kafka, ElasticSearch and Cassandra. One thing all of these datastores have in common is that they run in large clusters with data replication enabled across availability zones. Therefore a single failing node does not impact our customers. Our system is built to handle these kinds of things. Nevertheless it is critical to resolve these problems so that they do not escalate.

In the following blog posts I want to talk a bit about what happened, how we discovered the problems and how we resolved them.

Tweet
Source: https://twitter.com/MarcelBirkner/status/1301399614503911425

ClickHouse node failure

For each component in our architecture we have a set of service level objectives defined as events that are automatically updated with each release. One of these events alerted us via OpsGenie, that distributed inserts for a ClickHouse node are piling up.

A little bit of background on ClickHouse. We run several ClickHouse clusters in our regions. Each cluster consists of up to ten shards with two nodes per shard for data replication. We use distributed tables to store the data. Every time a component wants to write data to a distributed table it sends data to an arbitrary ClickHouse node. This node then takes care of forwarding the data to other nodes. ClickHouse stores this data temporarily on disk before forwarding the data to another node.

Here is one of the alerts we received:

SLO alert

By drilling down we could see that files are constantly piling up and that the CH node is not able to send the files to the other nodes in the cluster.

ClickHouse DistributedFilesToInsert metric
Next we checked the health of the underlying EC2 instance and compared it to the other ClickHouse nodes. Looking at CPU usage, network utilization and overall LOAD it was clear that the ClickHouse process was not healthy anymore.

LOAD ClickHouse cluster

We stopped the ClickHouse process, did some regular maintenance tasks and restarted the process. After that, ClickHouse was back up and running and started sending the distributed files to other nodes again.

ClickHouse metric

Improvement

As a result of the post-mortem for this issue, we improved the ClickHouse cluster dashboard with release 185. We added combined node metrics to the cluster dashboard so it is easier to see the overall cluster state and alerts on these metrics.

ClickHouse cluster dashboard

Summary

We were able to quickly fix the broken ClickHouse node and have the cluster back in a stable state in less than 15min. Our custom alerts as well as built-in alerts helped us identify the problem quickly. Another thing I like about Instana is that you get nice immediate visual feedback once the problem is fixed due to the fine grain one-second resolution on the charts.

Teaser

In part 2 of this mini blog series we will cover the problems we encountered with one of our Cassandra clusters.

 

Links

If you want to learn more on how we utilize ClickHouse check out this guest webinar Yoann Buch and myself did together with Altinity.

 

Play with Instana’s APM Observability Sandbox

Announcement, Featured, Product
IBM MQ (previously Websphere MQ and MQ Series) is an enterprise-oriented messaging middleware. According to the IBM website, “It works with a broad range of computing platforms, applications, web services and communications...
|
Developer, Thought Leadership
When monitoring services, websites or mobile applications, immediate feedback is important. Operations, like restarting services, can cause large issues and costly delays. Once Upon A Time In the early days of monitoring,...
|
Thought Leadership
Every Monitoring Tool is Now Offering Observability You can hardly find a product in the cloud-native ecosystem today that doesn’t claim to solve some observability problem. These claims rely either directly or...
|

Start your FREE TRIAL today!

As the leading provider of Automatic Application Performance Monitoring (APM) solutions for microservices, Instana has developed the automatic monitoring and AI-based analysis DevOps needs to manage the performance of modern applications. Instana is the only APM solution that automatically discovers, maps and visualizes microservice applications without continuous additional engineering. Customers using Instana achieve operational excellence and deliver better software faster. Visit https://www.instana.com to learn more.