No matter how many levels of abstraction are introduced, from containers to serverless, your code is ultimately run by the operating system in processes. And those processes can and will encounter issues, causing them to exit with error status codes. In other cases, the operating system, or something else running on it, will kill your processes or send them signals that cause them to shut down. The first step toward finding the root cause of these crashes is understanding why a process terminated. To do so successfully, developers and operators need a Crash Detector.
What monitoring tools traditionally do is regularly check whether the processes being monitored are still up and running, a pattern known as the watchdog. However, without knowing how and why your process terminated the way it did, you are left grasping at straws and digging through all types of logs to find out why your database called it a day and shut down. The task of finding out what is wrong has been made somewhat worse by the layers of orchestration that restart processes for you: by the time you manage to log into your (production) machine and start digging around, a new process is up, writing out more logs and muddying the waters.
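The watchdog pattern itself takes only a few lines to sketch. The following is a minimal, hypothetical Python example, not Instana's implementation: it probes a process with signal 0, which performs the existence and permission checks without actually delivering a signal, and reports whether the process is still alive. A watchdog would run such a probe periodically for every monitored PID.

```python
import os

def is_alive(pid: int) -> bool:
    """Probe a process: signal 0 checks existence without delivering anything."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False           # no such process: it terminated (or never existed)
    except PermissionError:
        return True            # the process exists, but belongs to another user

print(is_alive(os.getpid()))   # True: we are, presumably, still running
```

Note what the probe cannot tell you: only that the process is gone, never how or why it went.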
The fact that processes get seamlessly restarted also leads to situations in which you do not even know that issues are occurring: you see sporadic, inexplicable connection errors from clients to servers that appear to be running fine while, in reality, those servers crash and are brought back up. And trying to reproduce those process crashes is often a frustrating exercise, because if you knew how to trigger those crashes, chances are you’d know how to fix them too. As a popular cartoon taught children in the ’90s, knowing is half the battle.
Don’t get me wrong: bringing processes with issues back up is great for resilience and you absolutely should do it (or let your orchestration do it for you), but even better is finding out why they break and fixing it.
Playing Clue with Process Crashes
You are trying to understand why your database node went to meet its (init) maker. To break this case, you need the following pieces of information:
- When did this process terminate? What issues in other components in your infrastructure are caused by or related to this termination? Are there spikes in error rates in upstream components because of it? Or maybe also spikes in latency, as clients try to connect and wait until a timeout occurs?
- Why did the process terminate? Did it exit of its own accord, or did a signal like SIGKILL or something else not properly handled dispose of it?
- Assuming the process exited of its own accord, which status code did it exit with? More often than not, error status codes are something you can look up in the documentation to see if there are known causes.
- Assuming something sent the process a signal that caused its termination, which signal was it? Which process sent it?
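These last two questions map directly onto the wait status the kernel reports for every terminated child, which encodes either an exit code or a terminating signal. A minimal Python sketch (the scenario, killing a sleeping child with SIGKILL, is of course contrived):

```python
import os
import signal
import time

def describe_termination(status: int) -> str:
    """Translate a raw wait status into a human-readable verdict."""
    if os.WIFEXITED(status):
        return f"exited with status code {os.WEXITSTATUS(status)}"
    if os.WIFSIGNALED(status):
        sig = signal.Signals(os.WTERMSIG(status))
        return f"killed by signal {sig.name}"
    return "stopped or unknown"

# Fork a child, "kill -9" it, and inspect the verdict:
pid = os.fork()
if pid == 0:                      # child: pretend to be a busy server
    time.sleep(60)
    os._exit(0)
os.kill(pid, signal.SIGKILL)      # parent: play the role of the OOM killer
_, status = os.waitpid(pid, 0)
print(describe_termination(status))  # killed by signal SIGKILL
```

The catch, of course, is that only the parent of a process gets this status for free; an external observer needs another mechanism, which is where eBPF comes in below.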
Abnormal Process Termination in Instana
In the screenshot above, you see how the abnormal termination of the MySQL database causes issues on its clients. The detection and explanation of the abnormal termination supercharges the root-cause analysis capabilities of Instana for an entire class of pernicious, hard-to-detect and harder-to-troubleshoot problems. And, because it required a lot of ingenuity and hard work to achieve, let me mention it again: it works entirely out of the box.
Anatomy of Crash Detector – the Abnormal Process Termination Feature
Instana’s novel Abnormal Process Termination feature is built, among other things, on eBPF.
What is eBPF
eBPF (extended Berkeley Packet Filter) is a Linux kernel technology that lets you run small, sandboxed programs inside the kernel in response to events, without modifying kernel source code or loading kernel modules. In other words, eBPF allows you to hook into many different system calls of the Linux kernel and execute logic that extracts, for example, diagnostic information. It is the same principle powering Instana’s AutoTrace™ in the various runtimes it supports: hook into something interesting, and find out what happens. For Instana, considering a hook and wondering what type of useful information we could collect with it is pretty much a way of life.
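The hook-and-observe principle is easiest to see at the runtime level. The decorator below is purely illustrative, a userspace analogy; it is neither AutoTrace nor eBPF's kernel-side machinery. It wraps "something interesting" and extracts diagnostic information every time it runs, without the wrapped code having to cooperate:

```python
import functools
import time

def observe(fn):
    """Hook a function and record what happens on each call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            wrapper.calls.append((fn.__name__, round(elapsed_ms, 3)))
    wrapper.calls = []
    return wrapper

@observe
def query_database(sql):
    return f"rows for: {sql}"

query_database("SELECT 1")
print(query_database.calls)  # one (name, duration_ms) entry per call
```

eBPF applies the same idea one level down: instead of wrapping a function in your process, you attach a program to a kernel hook point and collect data about every process on the host.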
How Instana uses eBPF to detect crashes and Abnormal Process Terminations
As soon as the Instana agent bootstraps on a host, it starts learning about its surrounding environment by performing what we call discovery. Discovery consists of running small modules of the Instana agent specialized in, well, discovering particular technologies running on the same host, be it containers and what runs within them, databases, or runtimes like Java Virtual Machines or Node.js and the applications they run. When a particular technology is discovered, the agent bootstraps a sensor, a specialized component in charge of collecting telemetry from the discovered technology. And, of course, discovery is performed continuously, so that the host agent is always up to date with what is available for monitoring.
During discovery, the host agent also learns about the underlying operating system. If it finds it is running on Linux then, among other things, it detects whether the kernel supports eBPF and, if so, starts using it right away, spinning up a dedicated process called the ebpf-sensor. The ebpf-sensor process hooks into the kernel and starts listening for process terminations. All of them, whether “normal”, like an exit with status code 0, or not. The abnormal ones are signaled to the host agent, which screens them against the processes it monitors to avoid noise about irrelevant processes, and then sends the information upstream to the Instana backend. Crash detection occurs in near real time: it normally takes one or two seconds between the crash of an important process and the accurate diagnostics of its demise becoming visible in Instana. (And, by the way, most of those seconds are spent on the network.)
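The screening step can be pictured as a simple filter: keep only events that are abnormal (terminated by a signal, or exited with a nonzero code) and that concern a process the agent actually monitors. The event shape and function below are hypothetical, for illustration only, and are not Instana's actual data model:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExitEvent:
    pid: int
    comm: str                 # process name, as the kernel reports it
    exit_code: int            # meaningful only when signal is None
    signal: Optional[str]     # e.g. "SIGKILL", or None for a plain exit

def screen(events, monitored_pids):
    """Drop normal exits and events about processes we do not monitor."""
    for ev in events:
        abnormal = ev.signal is not None or ev.exit_code != 0
        if abnormal and ev.pid in monitored_pids:
            yield ev

events = [
    ExitEvent(101, "mysqld", 0, "SIGKILL"),   # abnormal: killed
    ExitEvent(102, "cron", 0, None),          # normal exit, ignored
    ExitEvent(103, "backup.sh", 1, None),     # nonzero exit, but unmonitored
]
survivors = list(screen(events, monitored_pids={101, 102}))
print([ev.comm for ev in survivors])          # ['mysqld']
```

Filtering at the agent keeps the upstream traffic proportional to actual incidents rather than to the host's total process churn.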
Besides the crash detector functionality itself, which is a game changer for troubleshooting many incidents, there are some other interesting things about the Abnormal Process Termination feature:
- Written in Rust. The ebpf-sensor is actually the first part of Instana written in Rust, which truly lives up to the awesome reputation it is rightfully accruing.
- It just works. There is no setup work, and no dependencies, like Linux headers, that must be available on the host for it to work. You install the host agent, it does its magic, and you effortlessly get the insights you need.
The point about no dependencies is something that may be hard to appreciate without knowing in detail the kind of hurdles we had to overcome to bring you the crash detector feature working out-of-the-box. Suffice it to say, we worked for months and, in the process, invented new ways of doing what others do with eBPF by means of manual setups, in order to make it work the Instana way.
Saving time and headaches with Crash Detector
Knowing how and why a process terminated abnormally puts at your fingertips answers that, without the Abnormal Process Termination feature, would cost you hours or days to find out by other means, sifting through enormous and enormously cryptic operating system logs. With Crash Detector, Instana provides the critical pieces of data, the root cause, for an entire class of issues affecting the performance of your applications. And, in the purest Instana way, it works out of the box on all but the most outdated Linux boxes: no manual setup, no dependencies. Enjoy!
To see Abnormal Process Termination in action, sign up for a free trial of Instana today.