Solving the Out of Memory Killer Puzzle


Instana recently introduced Crash Detector, which automatically detects and reports abnormal process terminations on Linux machines running kernel 4.8 and above. Instana uses the Extended Berkeley Packet Filter (eBPF) capabilities of the Linux kernel to hook into the kernel itself and listen for process terminations. Abnormal terminations are signaled to the host agent, which screens them against the processes it monitors to filter out noise from irrelevant processes, and then sends the information upstream to the Instana backend. This functionality has proven to be a game changer for our customers as they troubleshoot incidents.

With Crash Detector, Instana provides the critical pieces of data for many of the issues affecting the performance of our customers’ applications. We are now enhancing this functionality by adding Out of Memory Killer events to Crash Detector, an especially valuable addition given its relevance for containerized applications.

What is the Out of Memory Killer?

The cloud may make it seem as if, budget permitting, you have infinite computing power at your disposal. However, that computing power comes in slices: hosts, physical and virtual alike, containers, functions; they all come with limits on how much memory you can allocate.

On Linux, the Out of Memory Killer (OOM Killer) is a kernel mechanism in charge of preventing processes from collectively exhausting the host’s memory. When a process tries to allocate more memory than is available, the process with the highest badness score (based, for example, on how much memory it uses relative to what it is allowed) receives an out-of-memory signal, which fundamentally means: “You are way out of line. Quit or get some of your child processes to quit, or it is lights out.” Notice that the process that triggers the OOM condition may not be the process that receives the OOM signal: an application that has not recently increased its memory usage may, all of a sudden, be issued an OOM signal because too many other applications have started on the same host!
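
If you are curious how likely any given process is to be picked, the kernel exposes the badness score it computes per process through procfs. As a quick sketch on any modern Linux host (replace <pid> with a real process ID):

     # Current badness score of the process (higher = more likely to be picked by the OOM Killer)
     cat /proc/<pid>/oom_score

     # Score adjustment applied by the kernel; -1000 exempts the process from OOM kills entirely
     cat /proc/<pid>/oom_score_adj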

The mechanics of OOM sound harsh, but they are a very effective way to prevent memory exhaustion on hosts, especially when applications are not sized correctly, or when too many applications run in parallel, that is, when the hosts are not sized correctly for the workload.

For containerized platforms like Kubernetes, Cloud Foundry and Nomad, managing memory, both in terms of sizing applications appropriately and deciding how many applications to run at any one time on a host, is even more important. Generally, you do not plan out in detail which applications run on any one node; in many setups, containers are placed by the orchestrator according to its own scheduling logic. Enforcing maximum memory consumption is therefore critical for containers, and cgroups, the foundation of virtually every container technology on Linux, also use the Out of Memory Killer to ensure that processes running in the same group (that is, a container) do not allocate more memory than they are allowed to. When the processes in your containers try to allocate more memory than their limit, some will be terminated, often bringing their containers down with them.
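
As a rough sketch of how this looks on a host using cgroup v2 (the cgroup path below is a placeholder; the exact path depends on your container runtime):

     # Memory limit enforced on the container's cgroup; "max" means no limit
     cat /sys/fs/cgroup/<your-container-cgroup>/memory.max

     # Event counters for the cgroup; oom_kill increments every time a process in it is killed
     cat /sys/fs/cgroup/<your-container-cgroup>/memory.events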

At scale, everything is harder, including sizing. The more containers you run in your environments, the harder it is to understand when, how and why some of them go down. OOM can create unhealthy situations in which something is always crashing somewhere and getting restarted, producing a steady stream of errors for your end users that skews your SLOs and is really hard to troubleshoot.

Where Monitoring Has Let OOM Slip Through the Cracks

Finding out why any one process has been disposed of by the OOM Killer depends a lot on the technology you use. Some software packages will log it in their own logs. Or you may end up running a command like this on your hosts (on each of them!):

     #CentOS
     grep -i "out of memory" /var/log/messages
     #Debian / Ubuntu
     grep -i "out of memory" /var/log/kern.log
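
On hosts where the kernel log is captured by systemd-journald rather than a flat file, the equivalent hunt looks like this:

     # systemd-based hosts: search kernel messages in the journal
     journalctl -k | grep -i "out of memory"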

Looks tame, but this is definitely not the kind of task you want to run across your production fleet to try to understand at 3AM why MySQL has kicked the bucket again, especially when you are acting on a hunch because nothing else seems to explain why the database process is no longer there.

In other words, the OOM Killer is a mechanism of undeniable importance and efficacy for reliability, but one that does not offer sufficient observability out of the box. Instana is here to fix that for you 😉

How Instana Detects OOM Killed Processes with eBPF

Building further upon the eBPF foundation that brought you Crash Detector, Instana now ships with an out-of-the-box OOM Killer detector. When a process monitored by Instana receives an OOM signal, you will see in Instana, in real time, not only that it happened, but also how the situation was resolved, that is, which process got killed.
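
Instana’s detector is built into the host agent, so there is nothing for you to wire up. If you want a feel for the underlying idea, though, reasonably recent kernels expose an oom:mark_victim tracepoint that eBPF tooling can attach to. Purely as an illustration (this is not Instana’s actual implementation), a bpftrace one-liner can print OOM victims as the kernel selects them:

     # Requires bpftrace and root; prints the PID the kernel picks as the OOM victim
     bpftrace -e 'tracepoint:oom:mark_victim { printf("OOM victim pid: %d\n", args->pid); }'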

[Screenshot: an OOM Killer event reported in Instana]

^This process decided to fall on its sword, very honorable!

As with most Instana features, all you need to do is install the Instana host agent, then watch the OOM Killer go about its grim business. The event also shows how much memory the killed process had allocated at the time, so that you can understand why the OOM Killer marked it as “bad”. This new functionality is already making a difference for our customers; as Gregory, Technical Director at Altissia, put it, “Don’t know who worked on the new issue detector, but you made my week!”

Problems you can solve with OOM Killer detection

Determining how and why a process was terminated, or why it was killed by the OOM Killer, can take hours if not days without the proper tools. With Instana’s Crash Detector, users now have immediate access to the root cause of every abnormal process termination and every OOM Killer kill.

Need to understand why a container died? No problem: with Crash Detector’s OOM Killer detection you’ll know, in near real time, that perhaps your JVM running a very important batch job allocated more memory than it was allowed to. Or maybe you need to determine why you’re seeing so many PHP request failures, or why your database disappeared. Again, with Instana’s Crash Detector you have immediate access to the root cause of these issues, regardless of whether the PHP request failures were caused by some workers taking up too much memory, or your database was disposed of to protect the rest of the operating system because of misconfigured resource limits in systemd.
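
If you suspect the systemd scenario, the memory limit and accounting systemd keeps for a unit can be checked from the shell. A minimal sketch, assuming a cgroup-v2 host and using mysqld purely as an example unit name:

     # Show the memory limit and current usage systemd tracks for the service
     systemctl show mysqld --property=MemoryMax,MemoryCurrent

     # Raise the limit at runtime if it is too tight for the workload
     systemctl set-property mysqld MemoryMax=2G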

Save time troubleshooting application performance issues with OOM Killer detection

To start saving yourself and your DevOps teams time troubleshooting OOM Killer events, simply install the Instana agent on your Linux hosts today. If you don’t already have an Instana instance, fear not: you can see how Instana’s Crash Detector with OOM Killer detection works with a free trial of Instana.

Play with Instana’s APM Observability Sandbox

Start your FREE TRIAL today!

As the leading provider of Automatic Application Performance Monitoring (APM) solutions for microservices, Instana has developed the automatic monitoring and AI-based analysis DevOps needs to manage the performance of modern applications. Instana is the only APM solution that automatically discovers, maps and visualizes microservice applications without continuous additional engineering. Customers using Instana achieve operational excellence and deliver better software faster. Visit https://www.instana.com to learn more.