Welcome to Part 2 of the “9 New Issues” series. In this post we’ll explore issue #3 - “Data is not granular enough to accurately detect issues in dynamic environments”.
There are many potential benefits to deploying applications using containers. You use fewer resources than you would with VMs, deploy faster with greater consistency, gain portability between cloud platforms, and much more. One interesting side effect is that containers have become the new adult Mayfly. What’s the connection? Adult Mayflies only live for a few hours after they hatch, and they show up in massive swarms. They are the shortest-lived animal on our planet if you only consider their adult stage.
Containers, along with orchestration tools like Kubernetes, make it incredibly easy to dynamically expand and contract your services to accommodate changing workloads. This high level of dynamism is an important part of right-sizing your applications so that you are not wasting money on resources that you don’t need.
Many companies now have containers with an adult lifespan (the time after they boot up) of less than 1 minute, in swarms of tens or hundreds of thousands. This highly ephemeral behavior is great for efficient IT operations but brings a new challenge in the monitoring realm. Most monitoring tools either have poor granularity (1 minute averages) or have higher granularity but cannot scale to monitor the entire estate.
Why High Granularity Is So Important
When your containers only exist for less than a minute it’s obvious why you need high granularity data. If you ever expect to understand the behavior of those ephemeral containers you MUST have rich data with enough granularity to properly troubleshoot any performance bottlenecks. But what is the right level of granularity? Is 5 seconds good enough? And what about 1 minute granularity for my less ephemeral systems?
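To put rough numbers on that, here is a minimal sketch of how many data points a short-lived container yields at different collection intervals. It assumes a simple polling model in which the collector samples at fixed multiples of the interval; the function name, the 45 second lifespan, and the 10 second start offset are all hypothetical values chosen for illustration.

```python
def samples_captured(start: int, lifespan: int, interval: int) -> int:
    """Count collection ticks (0, interval, 2*interval, ...) that fall
    inside the container's lifetime [start, start + lifespan).

    All arguments are in whole seconds. This is an assumed polling model,
    not a description of any particular monitoring tool.
    """
    end = start + lifespan
    # Integer ceiling division: ticks strictly before `end` minus ticks
    # strictly before `start` gives the ticks inside the lifetime.
    ticks_before_end = -(-end // interval)
    ticks_before_start = -(-start // interval)
    return ticks_before_end - ticks_before_start


# Hypothetical container that boots 10 seconds into the minute
# and lives for 45 seconds.
for interval in (60, 5, 1):
    count = samples_captured(start=10, lifespan=45, interval=interval)
    print(f"{interval:>2}s interval: {count} samples")
```

Under these assumptions the container is never observed at all at a 60 second interval, gets 9 data points at 5 seconds, and 45 at 1 second.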
In the image below you can see a few different metrics plotted first at 5 second granularity and then at 1 second granularity. Notice there is also a resource limitation plotted on the same chart. The 5 second averages hide a resource contention issue that is exposed by the 1 second granularity data.
The 5 second data completely hides the fact that there is a resource issue causing poor performance. For the sake of completeness I have included a chart of the same data, but this time averaged to 1 minute. The result is as you would expect: little to no troubleshooting value from the 1 minute data.
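The same effect is easy to reproduce with made-up numbers. The sketch below is not the data behind the charts; it assumes a hypothetical minute of CPU usage (as a percentage of the container's limit) with a 40% baseline and a one second spike to the limit every 10 seconds, then averages it to 5 second and 1 minute granularity.

```python
from statistics import mean


def downsample(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]


def breaches(samples, limit):
    """Count data points that reach or exceed the resource limit."""
    return sum(1 for value in samples if value >= limit)


# Hypothetical data: 60 one-second samples, 40% baseline,
# hitting the 100% limit for one second out of every ten.
LIMIT = 100.0
one_second = [100.0 if t % 10 == 0 else 40.0 for t in range(60)]

five_second = downsample(one_second, 5)
one_minute = downsample(one_second, 60)

for label, series in (("1s", one_second), ("5s avg", five_second), ("1m avg", one_minute)):
    print(f"{label:>6}: peak={max(series):.1f}  points at limit={breaches(series, LIMIT)}")
```

With this synthetic series the 1 second data shows six points at the limit, while the 5 second averages peak at 52% and the 1 minute average sits at 46%, so neither averaged view shows any contention at all.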
Granularity, Scale, and AI
There is an important relationship between granularity, scale, and artificial intelligence. Modern AI systems do not see the world simply in terms of true or false like they used to. Probability allows AI systems to draw conclusions based upon data that may or may not be highly granular. The lower the granularity and less complete the data is, the lower the probability that the AI’s conclusion is accurate. Higher granularity with a more complete data set will yield much higher probability of accuracy of AI results.
Consider your AI system of choice working with the data sets charted above. Which data sets do you believe will provide the best probability of accurate results, 1 second, 5 second average, or 1 minute average? I would definitely prefer to feed 1 second data into my AI system and I would be highly suspicious of the results derived using 1 minute average data.
At one point in time, back when we had mostly static 3-tier applications, 1 minute average data was good enough to make do with. Even then, most of us would use the 1 minute data as a pointer to where a problem might exist, and then we would explore further with more granular data in real time. As practitioners we don’t have that luxury anymore. We have too many systems that are changing too fast for us to keep up with. We need monitoring tools that collect high granularity data, at massive scale, to deliver accurate results from our AI systems. Anything less leaves us exposed to the risk of not knowing what is really happening in our applications.
In my next post I’ll discuss issues #4 and #5...
- Data from mission critical components is not monitored or not correlated, causing lack of visibility into impact
- Abstraction techniques such as cloud, container, and orchestration technologies create a lack of context, making performance optimization and determination of root cause a challenge