There is no Root Cause in Microservice Applications

When troubleshooting a problem in an application, finding the root cause has always been treated as the ultimate goal of optimization.

“Root cause is used to describe the depth in the causal chain where an intervention could reasonably be implemented to improve performance or prevent an undesirable outcome.” [Wikipedia]

APM tools in particular tend to claim that they find the root cause of every problem, and they show nice pictures of traces with a highlighted SQL statement at the end that was the root cause of a slow transaction. To any developer it looks obvious that there is a linear chain of events and that the reason for a problem must be a single event somewhere in this chain.

This is not true anymore – there is no single root cause in complex systems! This blog post will talk through the reasons for this shift and discuss a new approach to managing quality in complex systems.

Complex System

“A complex system is a system that exhibits some (and possibly all) of the following characteristics:

  • feedback loops;
  • some degree of spontaneous order;
  • robustness of the order;
  • emergent organization;
  • numerosity;
  • hierarchical organization.

A complex system can also be viewed as a system composed of many components and their dependencies/interactions with each other. In many cases it is useful to represent such a system as a network where the nodes represent the components and the links their interactions.” [Wikipedia]

Microservice-based applications fulfill most of these criteria. They are built to be resilient and robust on top of dynamically changing infrastructure (containers) that emerges and organizes itself based on demand. The application itself is a hierarchical organization of these microservices. Each service has a role, can run as multiple instances, emits telemetry data (metrics), and changes over time.

In his paper How Complex Systems Fail, Richard Cook describes complex systems as intrinsically hazardous, which is why their designers protect them against failure (resilience). On what it takes for such systems to fail, Cook writes:

Catastrophe requires multiple failures – single point failures are not enough.

Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure.
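The "necessary, but only sufficient in combination" argument can be made concrete with a small sketch. The scenario and all numbers below are hypothetical and are not taken from Cook's paper: a database connection pool is exhausted only when a traffic spike, a slower query and a reduced pool size coincide. Little's law is used to estimate how many connections are busy on average.

```python
from itertools import product

# Hypothetical numbers, chosen only to illustrate the argument.
BASE_RATE = 100.0      # requests per second
BASE_QUERY_S = 0.10    # seconds a request holds a DB connection
BASE_POOL = 20         # connections in the pool


def pool_exhausted(traffic_spike: bool, slow_query: bool, smaller_pool: bool) -> bool:
    """Return True if the average number of busy connections exceeds the pool."""
    rate = 120.0 if traffic_spike else BASE_RATE
    query_s = 0.14 if slow_query else BASE_QUERY_S
    pool = 16 if smaller_pool else BASE_POOL
    # Little's law: average connections in use = arrival rate * holding time.
    return rate * query_s > pool


def failing_combinations() -> list[tuple[bool, bool, bool]]:
    """Enumerate all 8 combinations of the three issues and keep the failing ones."""
    return [combo for combo in product([False, True], repeat=3) if pool_exhausted(*combo)]
```

With these assumed numbers, none of the three issues exhausts the pool alone, and neither does any pair. Only the combination of all three does, so asking which one was "the" root cause has no good answer.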

This describes exactly why complex systems have no single root cause: a failure is the result of multiple small issues and changes that together can end in a catastrophe. The idea is closely related to the Butterfly Effect.

“In chaos theory, the butterfly effect is the sensitive dependence on initial conditions in which a small change in one state of a deterministic nonlinear system can result in large differences in a later state.” [Wikipedia]

[CC BY-SA 3.0, Lorenz Attractor]
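The sensitive dependence described in the quote can be reproduced numerically. The following is a minimal sketch, assuming the classic Lorenz parameters (sigma = 10, rho = 28, beta = 8/3) and a fourth-order Runge-Kutta integrator; the function names, the starting point and the step size are illustrative choices. Two runs start a tiny distance apart and the sketch records how far they drift from each other.

```python
import math


def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Advance the Lorenz system by one fourth-order Runge-Kutta step."""

    def deriv(s):
        x, y, z = s
        return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

    def shifted(s, k, h):
        # State moved along slope k by step h.
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = deriv(state)
    k2 = deriv(shifted(state, k1, dt / 2))
    k3 = deriv(shifted(state, k2, dt / 2))
    k4 = deriv(shifted(state, k3, dt))
    return tuple(
        s + dt / 6.0 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def separation_over_time(delta=1e-9, steps=4000):
    """Distance between two runs whose initial x value differs by delta."""
    a = (1.0, 1.0, 1.0)
    b = (1.0 + delta, 1.0, 1.0)
    distances = [math.dist(a, b)]
    for _ in range(steps):
        a, b = lorenz_step(a), lorenz_step(b)
        distances.append(math.dist(a, b))
    return distances
```

Calling separation_over_time() returns the distance after every step. The initial difference is about one part in a billion, yet over the simulated interval the two trajectories end up on entirely different parts of the attractor.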

It is important to say that failures are the new normal in these types of systems. There will always be some issues in the system, so it is even more important to understand their impact on the overall system. Instana filters out most issues that have no relevant impact, which reduces noise and raises alarms only for important incidents.

The paper also states two other important things:

Human expertise in complex systems is constantly changing.

Complex systems require substantial human expertise in their operation and management. This expertise changes in character as technology changes but it also changes because of the need to replace experts who leave.

Change introduces new forms of failure.

The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology, to decrease the number of low consequence but high frequency failures. These changes may actually create opportunities for new, low frequency but high consequence failures.

Understanding failure in Microservice Applications

Classical application performance management (APM) tools were built to find the single root cause of a problem. This worked well with static 3-tier and SOA architectures, but as we have learned from complex system theory, it doesn’t work anymore in modern web-scale and microservice-based environments.

In order to manage something, there must first be a model. Instana models an application by building a Dynamic Graph of all components to understand the hierarchical organization, the state of all nodes and their interactions (tracing). Using this Graph, which is automatically and continuously updated, Instana tracks all issues and changes since we know that any of them in combination can cause an incident. By analyzing the impact of issues and changes in real-time, Instana understands the health of the application and can reconstruct the state of the system at any given time to show how an incident evolved over time.
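To make the idea of such a model more tangible, here is a deliberately simplified sketch. It is not Instana's implementation: the class and field names (Event, DynamicGraph, state_at) are hypothetical, and for brevity only the events are time-indexed while the topology is kept static. It shows a graph of components with their interactions, a log of issues and changes, and a way to reconstruct what was known at a given point in time.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    timestamp: float   # seconds since some epoch
    component: str     # node the event belongs to
    kind: str          # "issue" or "change"
    detail: str


@dataclass
class DynamicGraph:
    # component -> set of components it calls (would be derived from traces)
    calls: dict[str, set[str]] = field(default_factory=dict)
    events: list[Event] = field(default_factory=list)

    def add_call(self, caller: str, callee: str) -> None:
        """Register both components and the interaction between them."""
        self.calls.setdefault(caller, set()).add(callee)
        self.calls.setdefault(callee, set())

    def record(self, event: Event) -> None:
        """Store an issue or change, registering the component if needed."""
        self.calls.setdefault(event.component, set())
        self.events.append(event)

    def state_at(self, timestamp: float) -> dict[str, list[Event]]:
        """Return, per component, every event recorded up to `timestamp`."""
        state: dict[str, list[Event]] = {name: [] for name in self.calls}
        for event in sorted(self.events, key=lambda e: e.timestamp):
            if event.timestamp <= timestamp:
                state[event.component].append(event)
        return state
```

Replaying state_at for increasing timestamps gives a simple view of how an incident built up, one issue or change at a time.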

[Screenshot: Instana]

The screenshot shows our physical map with our Timeline, which presents all issues and changes over time. An Incident is created for a problem and includes all issues and changes that led to the problem, for all involved components in the graph.

The changes are detected automatically for:
– Any configuration change of a component,
– Deployment of new code,
– Starting or stopping of processes and containers.
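One simple way to picture this kind of change detection is to compare two snapshots of a component. The sketch below is an assumption about how it could be done, not a description of Instana's agent: the function name detect_changes and the flat key naming ("config.", "code.", "process.") are hypothetical.

```python
def detect_changes(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Describe every key that was added, removed, or modified between snapshots."""
    changes = []
    for key in sorted(before.keys() | after.keys()):
        if key not in before:
            changes.append(f"added {key}={after[key]}")
        elif key not in after:
            changes.append(f"removed {key}={before[key]}")
        elif before[key] != after[key]:
            changes.append(f"changed {key}: {before[key]} -> {after[key]}")
    return changes
```

A changed "config." key would correspond to a configuration change, a changed "code.version" to a deployment, and added or removed "process." keys to processes or containers starting and stopping.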

The issues are detected based on:
– Machine learning, such as anomaly detection or pattern recognition, that runs on a real-time stream of events,
– A knowledge base that contains curated semantic knowledge about each component, for example, the proper Garbage Collection patterns in the JVM.
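As an illustration of the first point only, the following sketch flags anomalies on a stream of metric values with a rolling z-score. This is a textbook technique swapped in for illustration; the post does not say which algorithms Instana uses, and the class name, window size and threshold are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScoreDetector:
    """Flag values that deviate strongly from a sliding window of recent values."""

    def __init__(self, window: int = 30, threshold: float = 3.0, warmup: int = 5):
        self.values = deque(maxlen=window)
        self.threshold = threshold
        self.warmup = warmup

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to the recent window."""
        anomalous = False
        if len(self.values) >= self.warmup:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            # With zero variance there is no scale to compare against, so skip.
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.threshold
        self.values.append(value)
        return anomalous
```

Each new measurement, for example a response time in milliseconds, is passed to observe() as it arrives, so the check runs on the stream without storing more than the window.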

By leveraging the Dynamic Graph, Instana can connect seemingly disparate events into correlated, meaningful “Incidents”. This reduces the dependency on experts and assists the development and operations teams in their quest to manage the performance dynamics and quality of even the most complex systems.
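A rough sketch of what graph-based correlation might look like is shown below. The rule used here is an assumption made for illustration and is not Instana's algorithm: two events are linked when they occur close together in time on the same component or on directly dependent components, and an incident is a connected group of linked events. The function name correlate and the 300-second window are hypothetical.

```python
from itertools import combinations


def correlate(events, edges, window=300.0):
    """Group (timestamp, component, detail) events into incidents.

    Two events are linked when they occur within `window` seconds of each
    other on the same component or on components joined by a dependency edge.
    Incidents are the connected groups of linked events.
    """
    adjacent = {frozenset(edge) for edge in edges}
    parent = list(range(len(events)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(events)), 2):
        (t1, c1, _), (t2, c2, _) = events[i], events[j]
        related = c1 == c2 or frozenset((c1, c2)) in adjacent
        if related and abs(t1 - t2) <= window:
            parent[find(i)] = find(j)

    incidents = {}
    for index, event in enumerate(events):
        incidents.setdefault(find(index), []).append(event)
    # Sort events inside each incident, then incidents by their first event.
    return sorted(sorted(group) for group in incidents.values())
```

Given a configuration change on a database, a latency issue on the service calling it and an error-rate issue one hop further up, all within a few minutes, this rule yields one incident with three events. An unrelated event hours later on another service stays separate.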

So in complex systems, like highly distributed microservice applications, there is no single root cause. APM solutions that are built to find such single causes will not work in these environments. To understand the cause of a problem there, tools have to understand the dependencies of the whole system in order to correlate issues and understand their impact.