Built-in ClickHouse Monitoring
ClickHouse has powerful built-in monitoring capabilities, accessible through its system tables (system.trace_log, system.metrics, system.query_log, etc.). They are great for diving deep into an issue, but when ClickHouse is just one of many things you have to watch, a fully managed and automated monitoring solution such as Instana is a better fit. It is no wonder that Instana provides excellent support for ClickHouse: Instana’s developers and operations teams use ClickHouse to power Instana, and Instana to monitor ClickHouse.
Let’s see how Instana exposes ClickHouse’s built-in monitoring data visually and over time, puts it into context, provides additional insights across your stack, and does it all with very little effort.
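To make the system tables concrete, here is a minimal sketch of how the slowest queries of the day could be read from system.query_log over ClickHouse's HTTP interface. The URL (8123 is the default HTTP port), the one-second threshold, and the helper names slow_queries_sql and build_request are assumptions for illustration. Query logging must also be enabled on the server.

```python
import urllib.request


def slow_queries_sql(min_duration_ms: int = 1000, limit: int = 10) -> str:
    """Build a query listing today's slowest finished queries."""
    return (
        "SELECT event_time, query_duration_ms, query "
        "FROM system.query_log "
        "WHERE type = 'QueryFinish' "
        "AND event_date = today() "
        f"AND query_duration_ms >= {int(min_duration_ms)} "
        "ORDER BY query_duration_ms DESC "
        f"LIMIT {int(limit)} "
        "FORMAT TabSeparated"
    )


def build_request(sql: str, base_url: str = "http://localhost:8123/") -> urllib.request.Request:
    """Wrap the SQL in an HTTP POST; base_url is an assumed local server."""
    return urllib.request.Request(base_url, data=sql.encode("utf-8"), method="POST")
```

Against a reachable server, urllib.request.urlopen(build_request(slow_queries_sql())) would return one tab-separated line per slow query.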
ClickHouse infrastructure monitoring
To get Instana running, all you need to do is install a single agent per operating system instance. Once the Instana agent is installed on each host where a ClickHouse server is running, Instana automatically discovers and reports real-time information about the health and performance not only of the hosts themselves (CPU, memory, and IO) but also of the running ClickHouse servers. That makes it a great tool for operations teams: it helps them keep clusters healthy, upgrade to newer versions smoothly, and get capacity planning right.
Each ClickHouse server gets a dedicated dashboard where you have access to most of the ClickHouse metrics and other information like the number of active parts per table or the running queries. Metrics can also easily be compared across multiple servers.
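For reference, a figure such as the number of active parts per table can also be read directly from the system.parts table. The sketch below is not how Instana collects the metric; it only illustrates the underlying data. The helper names and the sample threshold are hypothetical.

```python
from typing import Dict, List, Tuple

# Counts the active data parts of every table on the server.
ACTIVE_PARTS_SQL = (
    "SELECT database, table, count() AS active_parts "
    "FROM system.parts WHERE active "
    "GROUP BY database, table "
    "ORDER BY active_parts DESC "
    "FORMAT TabSeparated"
)


def parse_active_parts(tsv: str) -> Dict[Tuple[str, str], int]:
    """Parse the tab-separated result into {(database, table): part_count}."""
    counts: Dict[Tuple[str, str], int] = {}
    for line in tsv.splitlines():
        if not line.strip():
            continue
        database, table, parts = line.split("\t")
        counts[(database, table)] = int(parts)
    return counts


def tables_over(counts: Dict[Tuple[str, str], int], max_parts: int) -> List[Tuple[str, str]]:
    """Return the tables whose active part count exceeds max_parts."""
    return sorted(name for name, n in counts.items() if n > max_parts)
```

A steadily growing part count on one table usually means inserts are arriving faster than background merges can keep up.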
Application monitoring based on distributed tracing
Relying only on logs or infrastructure metrics does not give you the full picture, because ClickHouse servers do not live on their own. Queries to your ClickHouse cluster may come from multiple services, sometimes managed by different teams. Typical problems such as failed queries, slow queries, N+1 queries, and overly frequent inserts can only be fixed once the service causing the issue has been identified. This is why Instana captures the SQL statement, latency, potential error, receiving host, and caller information of every single query to ClickHouse.
Instana provides this level of insight because it comes with automatic tracing (no code changes required) across distributed systems composed of services you built (e.g. in Java, Python, or Node.js), databases, and messaging systems. Thanks to Instana End User Monitoring, it can even trace a ClickHouse query all the way back to the user or page that initiated it from a web or mobile interface.
Not only does Instana give you access to all of the individual transactions that ever touched your ClickHouse cluster, it also aggregates them to form higher level concepts that everyone is familiar with: applications, services, and endpoints. Their corresponding dashboards make it easy to spot trends and outliers at a glance. With ClickHouse you get one service representing your ClickHouse cluster, and as many endpoints as there are tables. On each, you’ll find typical performance indicators such as call count, error rate, and latency, but also top queries and error messages.
From these dashboards and charts it’s easy to jump to the ClickHouse queries of interest, which can then be further filtered or grouped by all kinds of query properties, including information specific to your domain: tenant, timeframe, known query name, etc.
Automatic issue detection and alerting
Dashboards are useful when looking for the root cause of a known problem (e.g. one reported by your users) or when trying to improve reliability or response times in general. The rest of the time, however, it’s best to let Instana do the work for you and detect issues as they arise: a disk about to run out of space, load that is too high, a sudden drop in the number of requests, an error rate or latency that is too high, etc.
To prevent alert fatigue, Instana does not alert you on every single issue. Instead, it runs a root cause analysis for you, correlating events into incidents (e.g. a sudden spike in CPU usage on a ClickHouse server is correlated with a change to the ClickHouse setting max_threads). Incidents are then reported within the product or sent to third-party services such as PagerDuty or Opsgenie.
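The idea of folding related events into a single incident can be illustrated with a deliberately naive sketch: events on the same entity that occur close together in time are grouped. This is not Instana's root cause analysis, whose internals are not described here. The Event type, the five-minute window, and the function name are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    timestamp: float  # seconds since epoch
    entity: str       # e.g. the name of a ClickHouse server
    description: str


def group_into_incidents(events: List[Event], window_seconds: float = 300.0) -> List[List[Event]]:
    """Group events on the same entity that are at most window_seconds apart."""
    incidents: List[List[Event]] = []
    # Sorting by entity first keeps each entity's events contiguous.
    for event in sorted(events, key=lambda e: (e.entity, e.timestamp)):
        last = incidents[-1] if incidents else None
        if (
            last is not None
            and last[-1].entity == event.entity
            and event.timestamp - last[-1].timestamp <= window_seconds
        ):
            last.append(event)
        else:
            incidents.append([event])
    return incidents
```

With this grouping, a settings change followed a minute later by high CPU on the same server yields one incident instead of two separate alerts.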
Instana comes with built-in knowledge and rules for all kinds of technologies, including ClickHouse, and these can be extended based on your SLAs and on what you’ve learned from your own experience operating ClickHouse. For example, you could create an issue every time some set of queries exceeds a certain latency threshold or when insert queries are being throttled.
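The logic behind a custom latency rule can be sketched as a comparison between a latency percentile and a threshold. This only illustrates the check itself, not Instana's rule configuration. The function names, the nearest-rank percentile method, and the default 95th percentile are assumptions.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence (pct in 0..100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_rule_violated(latencies_ms: Sequence[float], threshold_ms: float, pct: float = 95.0) -> bool:
    """Return True when the chosen latency percentile is above the threshold."""
    return percentile(latencies_ms, pct) > threshold_ms
```

Using a percentile rather than the mean keeps a single slow outlier from hiding behind many fast queries, and a single fast majority from hiding a slow tail.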
How to get ClickHouse monitoring
Instana provides great monitoring capabilities for ClickHouse and more. Insights are readily available and actionable whether you are a developer building services on top of ClickHouse or an operator in charge of running a ClickHouse cluster. If you are interested in trying it out, it’s easy: start a free trial right away, install the Instana agent on your machines, and watch your cluster appear on the map!