Life of an SRE at Instana – Things break all the time in distributed systems – Part 1: ClickHouse

September 10, 2020

ClickHouse Failure

Things break all the time in distributed systems: Part 1 ClickHouse

Some weeks are more challenging for an SRE than others. Some days you have more time working on projects, on other days you need to deal with production problems.

Last week was one of those weeks for me.

We had data node failures, had to do root-cause analysis, fix issues and find ways to prevent the same problem from occurring again. It all started out with failures in different datastore clusters: ClickHouse, Kafka, ElasticSearch and Cassandra. One thing all of these datastores have in common is that they run in large clusters with data replication enabled across availability zones. Therefore a single failing node does not impact our customers. Our system is built to handle these kinds of things. Nevertheless it is critical to resolve these problems so that they do not escalate.

In the following blog posts I want to talk a bit about what happened, how we discovered the problems and how we resolved them.


ClickHouse node failure

For each component in our architecture we have a set of service level objectives defined as events that are automatically updated with each release. One of these events alerted us via OpsGenie, that distributed inserts for a ClickHouse node are piling up.

A little bit of background on ClickHouse. We run several ClickHouse clusters in our regions. Each cluster consists of up to ten shards with two nodes per shard for data replication. We use distributed tables to store the data. Every time a component wants to write data to a distributed table it sends data to an arbitrary ClickHouse node. This node then takes care of forwarding the data to other nodes. ClickHouse stores this data temporarily on disk before forwarding the data to another node.

Here is one of the alerts we received:

SLO alert

By drilling down we could see that files are constantly piling up and that the CH node is not able to send the files to the other nodes in the cluster.

ClickHouse DistributedFilesToInsert metric
Next we checked the health of the underlying EC2 instance and compared it to the other ClickHouse nodes. Looking at CPU usage, network utilization and overall LOAD it was clear that the ClickHouse process was not healthy anymore.

LOAD ClickHouse cluster

We stopped the ClickHouse process, did some regular maintenance tasks and restarted the process. After that, ClickHouse was back up and running and started sending the distributed files to other nodes again.

ClickHouse metric


As a result of the post-mortem for this issue, we improved the ClickHouse cluster dashboard with release 185. We added combined node metrics to the cluster dashboard so it is easier to see the overall cluster state and alerts on these metrics.

ClickHouse cluster dashboard


We were able to quickly fix the broken ClickHouse node and have the cluster back in a stable state in less than 15min. Our custom alerts as well as built-in alerts helped us identify the problem quickly. Another thing I like about Instana is that you get nice immediate visual feedback once the problem is fixed due to the fine grain one-second resolution on the charts.


In part 2 of this mini blog series we will cover the problems we encountered with one of our Cassandra clusters.


If you want to learn more on how we utilize ClickHouse check out this guest webinar Yoann Buch and myself did together with Altinity.


Play with Instana’s APM Observability Sandbox

It was 3:38 am and our smoke alarm blared loudly and proudly somewhere in our house. My husband and I jolted from bed and into emergency mode. I rushed into our son’s...
Conceptual, Customer Stories, Engineering
Halloween is a scary time to be in abandoned buildings, cemeteries, and dark forests… and DevOps teams. Developers, operations engineers, and SREs told us some DevOps horror stories that have haunted them...
Announcement, Product
This blog post was originally published by IBM What is IBM Observability by Instana? IBM Observability by Instana provides businesses with advanced application performance monitoring and observability capabilities, manages the performance of...

Start your FREE TRIAL today!

Instana, an IBM company, provides an Enterprise Observability Platform with automated application monitoring capabilities to businesses operating complex, modern, cloud-native applications no matter where they reside – on-premises or in public and private clouds, including mobile devices or IBM Z.

Control hybrid modern applications with Instana’s AI-powered discovery of deep contextual dependencies inside hybrid applications. Instana also gives visibility into development pipelines to help enable closed-loop DevOps automation.

This provides actionable feedback needed for clients as they to optimize application performance, enable innovation and mitigate risk, helping Dev+Ops add value and efficiency to software delivery pipelines while meeting their service and business level objectives.

For further information, please visit