Welcome to Part 3 of the “9 New Issues” series. In this post we’ll explore issues #4 and #5:
- Data from mission-critical components is not monitored or not correlated, causing a lack of visibility into impact
- Abstraction techniques such as cloud, container, and orchestration technologies create a lack of context, making performance optimization and determination of root cause a challenge
There is an important difference between correlation and causation. The term correlation, defined from a statistical perspective, means "the degree to which two or more attributes or measurements on the same group of elements show a tendency to vary together" (source: Dictionary.com).
You can find correlation between all kinds of unrelated statistics. The chart below is a great example of meaningless correlation whose only value is the entertainment of examining its absurdity.
It's utterly ridiculous to think that the divorce rate in Maine is a direct result of how much margarine is consumed per capita. Yet there it is: the data shows a high level of correlation. Only when we apply context and logic do we realize that even though there is statistical correlation, the change in the consumption of margarine is not the root cause of changing divorce rates in Maine.
The term causation is defined as "the action of causing or producing" (source: Dictionary.com). To arrive at causation we must use a combination of correlation, context, and experience. In the world of cloud- and container-based applications, context and experience can both be very difficult to come by.
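To see how easily unrelated series can look related, here is a short Python sketch. The numbers are invented stand-ins for the margarine/divorce example (not the real published data), but they show how two independently declining trends produce a near-perfect Pearson correlation coefficient:

```python
# Pearson correlation of two illustrative, unrelated series.
# These values are invented for demonstration -- not the actual statistics.
margarine_lbs = [8.2, 7.0, 6.5, 5.3, 5.2, 4.0, 4.6, 4.5, 4.2, 3.7]  # per-capita consumption
divorce_rate = [5.0, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1]   # per 1,000 people

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(margarine_lbs, divorce_rate)
print(f"r = {r:.2f}")  # close to 1.0, yet causally meaningless
```

A correlation coefficient near 1.0 from two series that have nothing to do with each other is exactly why correlation alone, without context, cannot establish root cause.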
Partial Visibility Creates Blind Spots
Your cloud- and container-based applications are quite complex. They most likely consist of a mixture of hosts (mostly virtual), hypervisors, operating systems, containers, processes, application environments (JVM, CLR, etc.), application servers (JBoss, Tomcat, .NET, etc.), databases, caches, web servers, message buses, orchestration tools, and more. When you think about it, there are an incredible number of components working in harmony to deliver modern application functionality, performance, scalability, and stability. The rate of adoption of new technologies is also increasing as DevOps practices take effect.
The ultimate impact of using this large number of new application technologies is that monitoring visibility is non-existent, monitoring data is siloed in disconnected point solutions, visibility is rudimentary (just a few metrics), or some combination thereof. Any one of these conditions creates blind spots and major problems when you are trying to figure out the root cause of performance or stability issues.
But let's assume for a moment that you are capable of monitoring every component in your application stack within a single tool. Is there another issue lurking, waiting to bite you during a customer-impacting event? Of course there is…
Context and Expertise FTW (For The Win)
Context is a critical element in understanding the root cause of performance and stability issues. To prove the importance of context, let's take a look at the word "tear". What does this word mean? Is it the liquid secretion that lubricates our eyeballs, or is it what happens when you pull too hard on the two ends of a piece of paper? You can't know until I provide the context in the following expression: "shed a tear".
Given that background, what do I really mean when I say “context” in relation to the IT world? Here’s a list of examples…
- What was the end user doing? (Login, search, checkout, pay bill, check balance, etc.)
- What infrastructure components are involved in delivering each service?
- What software versions are we running?
- How are my services connected?
- What configuration changes were made recently?
- What availability zone is this host running in?
- What pod does this container belong to?
Context is derived from the important data that exists outside of time-series metrics, often referred to as meta-data. Meta-data is available in many places, but it is often overlooked and does not get collected or associated with the relevant components. Here are some example sources of vitally important meta-data:
- AWS Tags
- Azure Tags
- Google Cloud Labels
- Docker Labels
- Kubernetes Labels
- Transaction names
This meta-data, when associated with the proper application and infrastructure components, creates a wealth of context.
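As a concrete sketch of what that association looks like, consider the difference between a bare metric and one enriched with meta-data. The structure and field names below are invented for illustration (not any specific vendor's data model):

```python
from dataclasses import dataclass, field

# Illustrative sketch: the same CPU reading becomes far more actionable
# once meta-data such as cloud tags and Kubernetes labels is attached.
@dataclass
class MetricPoint:
    name: str
    value: float
    timestamp: int
    metadata: dict = field(default_factory=dict)

# Without meta-data: "some host is at 97% CPU" -- no context at all.
bare = MetricPoint("cpu.usage.percent", 97.0, 1_700_000_000)

# With meta-data: which team, pod, namespace, release, and user action.
enriched = MetricPoint(
    "cpu.usage.percent", 97.0, 1_700_000_000,
    metadata={
        "aws.tag.team": "payments",       # AWS tag
        "k8s.pod": "checkout-7d9f",       # Kubernetes label
        "k8s.namespace": "prod",
        "docker.label.version": "2.4.1",  # Docker label
        "transaction": "checkout",        # transaction name
    },
)

print(f"{enriched.name}={enriched.value} "
      f"({enriched.metadata['transaction']} in {enriched.metadata['k8s.namespace']})")
```

An alert built from the enriched point can name the affected service, pod, and release; an alert built from the bare point can only name a host.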
So now that we understand the need for correlation and context, what is their connection with expertise? Here's an example provided by Instana's CTO Pavlo Barron:
“Let’s take a quite raw metric: retrans/s as provided by Linux’s sar tool. It clearly is a number, but it’s not defined as such anywhere except the sar man page which is already an external resource decoupled from the metric itself as well as the software system providing it. It doesn’t describe itself as a number that is collected per second and reset to zero for the next second, instead of being cumulated. And even if a human might interpret it as such, a monitoring tool or any sort of machine observing this metric has absolutely no chance to know semantics of it.
But it doesn’t end there. Looking at the man page, the description of the metric leaves you alone with the interpretation of its operational meaning, for example judging if it’s good or bad, when to ignore it, and when it will have impact and on what parts of the system. Namely it states: “The total number of segments retransmitted per second – that is, the number of TCP segments transmitted containing one or more previously transmitted octets [tcpRetransSegs]”. The description assumes that anybody looking at the metric will have the relevant knowledge of networking and operating systems needed to understand the operational meaning and impact of it.”
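The semantics described above can be made concrete. On Linux, the underlying RetransSegs counter (exposed in /proc/net/snmp) is cumulative, and a tool like sar derives the per-second retrans/s figure by differencing successive samples. A minimal sketch of that derivation, using invented sample values in place of real kernel reads:

```python
# RetransSegs is a cumulative counter: it only ever grows.
# A per-second rate (sar's "retrans/s") is the delta between two
# samples divided by the elapsed time. Values below are invented.

def retrans_per_sec(prev_count, curr_count, interval_secs):
    """Per-second TCP retransmission rate from two cumulative samples."""
    if interval_secs <= 0:
        raise ValueError("interval must be positive")
    return (curr_count - prev_count) / interval_secs

# Two hypothetical readings of RetransSegs taken 5 seconds apart:
rate = retrans_per_sec(prev_count=120_340, curr_count=120_415, interval_secs=5)
print(f"retrans/s = {rate:.1f}")  # (120415 - 120340) / 5 = 15.0
```

Nothing in the raw counter says any of this; whether the counter is cumulative, what interval to difference over, and whether 15 retransmissions per second is alarming are all expert knowledge living outside the metric itself.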
Expert knowledge is accumulated over time through experience. Any single team that manages an application or service has a finite amount of expert knowledge and can only develop expertise in new technologies at a limited pace. If your monitoring tool has been trained with expert knowledge, provides context across every component, and has no gaps in correlation, you will have the best chance of quickly identifying and resolving incidents in your cloud- and container-based applications. AI has an important role to play in managing these applications, as you'll see in a future blog post.
A Final Twist
We can never forget to consider the dynamic nature of cloud- and container-based applications. There is constant change in these environments, and it can wreak havoc on your IT management platforms. Constant change in your application means that context is also constantly changing. What's the solution? A modern APM solution that is built upon a dynamic data model.
In my next post I'll discuss issue #6: inflexible data models in monitoring tools make it impossible to understand impact and causality when using containers, orchestration, serverless, etc.
Most people underestimate the importance of the underlying data model on the ability to properly assess and diagnose performance and stability issues. We’ll explore the problem and uncover the solution.