We are in a period of unprecedented growth and interest in observability technologies such as metrics. Developers of applications, frameworks, and libraries are making it a point to publish metrics that expose various measures (counters and gauges) for both external and internal operations and outcomes. But the industry as a whole does not seem to be reaping the benefits of this effort, which carries an ongoing maintenance cost in managing metric stores as well as an unrestrained proliferation of custom dashboards growing daily. One reason for this is that operations staff don’t fully understand what a metric means, how it is measured, how it relates to other collected metrics, and what type of problem can be inferred from it when it changes. This is not only a consumer issue. Many developers tasked with instrumentation are unable to give due consideration to the use of a proposed metric beyond a very narrow need.
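To make the two measure kinds mentioned above concrete, here is a minimal, hand-rolled sketch of a counter (a monotonically increasing tally) and a gauge (a last-observed level). The registry class and metric names are hypothetical, not any particular vendor's API:

```python
import threading
from collections import defaultdict

class MetricRegistry:
    """A toy metric registry: counters accumulate, gauges record the
    most recent level. Purely illustrative, not a real library API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)   # monotonically increasing tallies
        self._gauges = {}                   # last observed values

    def inc(self, name, amount=1):
        """Counter: how many times something happened."""
        with self._lock:
            self._counters[name] += amount

    def set_gauge(self, name, value):
        """Gauge: the current level of something."""
        with self._lock:
            self._gauges[name] = value

    def snapshot(self):
        """Copy out the current state for scraping or shipping."""
        with self._lock:
            return dict(self._counters), dict(self._gauges)

# Instrumenting a hypothetical operation:
metrics = MetricRegistry()
metrics.inc("http.requests")            # counter: one more request served
metrics.set_gauge("pool.active", 7)     # gauge: current active connections
```

Note that the registry says nothing about what "pool.active" means or how it relates to "http.requests"; that gap between publishing a measure and conveying its meaning is the problem the rest of this piece is about.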
Drowning in metrics
In some ways, metrics are the unit tests of monitoring. Everyone agrees they are a good idea, but only a few seem to have enough experience in designing them beyond measuring the obvious. This is made somewhat worse by new cloud offerings targeting customers who initially look only to store and search large volumes of data collected by instrumentation. Insight is not necessarily on the menu here, because those choosing where to send their data tend to favor keeping the presentation of the data as close as possible to its raw input form.
The absence of abstraction
Models of abstraction, in the backend or in the UI/UX, that might be more effective in monitoring and managing the system under observation are deemed unnecessary overhead in the early days, when the chief concern is verifying that all the sensors are operating correctly and pushing measures out from their origin. Naturally, this changes very quickly once customers experience incidents that take far too long to identify, and then resolve, in production with condensed metric storage backends. One typical response is to turn to each third-party developer behind a significant component and purchase a license for a component-specific dashboard solution that understands the meaning of the metrics published and tailors the presentation accordingly. Unfortunately, this leads to many dashboards, a different model of abstraction in each, and yet more data stores, though many developers also offer the ability to push metrics elsewhere. The more data that is created before this point, the more likely this situation will arise and remain for some time. This is the nature of the un(tr|ch)ained beast.
A model within a model
Today observability seems focused primarily on creating a sufficiently large information model from an ever-growing list of sensors. Instead of hundreds or thousands of metrics, we have companies claiming to have millions. This number is probably inflated and reflects the diversity of the naming of things rather than the types of measures recorded for each context. How true this is is beside the point. If we have more than a handful of metrics per object of concern within a management model, then human operators still have a problem that is not magically solved by machine learning: there is always the question of what these metrics mean and how they are affected by changes in the environment and by the actions of operations. At Instana we consider it paramount to the success of our engineering effort to link cause(s) with effect(s), metrics with traces, events with signals, etc. We carefully consider what belongs in the foreground (trace) and what in the background (metric), depending on the situation.
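The inflation argument above can be shown with a little arithmetic: a flat series count is mostly naming diversity (entities) multiplied by a small, fixed set of measure types. The entity and measure names below are invented for illustration:

```python
# Hypothetical example: only five kinds of measures are actually
# recorded per context...
measure_types = ["requests", "errors", "latency.p99", "cpu", "memory"]

# ...but naming diversity multiplies them: 200 services x 10 pods
# gives 2,000 distinct entity contexts.
entities = [f"service-{i}.pod-{j}" for i in range(200) for j in range(10)]

# Flattening entity and measure into one namespace yields
# 2,000 * 5 = 10,000 "metrics" from just five measure kinds.
flat_series = [f"{entity}.{measure}" for entity in entities
               for measure in measure_types]
print(len(flat_series))  # 10000
```

A headline number in the millions is thus mostly a statement about cardinality of naming, not about how much is genuinely being measured per object of concern.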
The stacking of simplicity
So how do we identify the management model, a small subset, from the information model? That is easier said than done. There are various ways of extracting and learning it, including active exploration of the information space, but let’s first step back and consider how such a management model might evolve over the lifecycle of an unremarkable software component. Why might the information model grow far bigger than the management model?
Once upon a time, there was a software component with three points of configuration, control set points if you wish, and various measures exposed at the surface of the component’s interface, tracking the execution and efficiency of both internally and externally stimulated activity. Unfortunately, users of the component did not fully understand how best to use the configuration points, or how they related to the observations available. To prevent future mishaps in configuration, and to improve the adaptiveness of the component to actual usage and environment characteristics, the developer created a higher-level interface with fewer touch points and lots of smart algorithmic magic under the hood. This new layer codified the intelligence expected of the first users of the component. But unfortunately, the new adaptive mechanism needed a few set points of its own to steer the adaptive algorithm, and so a new, simpler surface layer was designed. And so on, and so on. It’s turtles on turtles – up and down.
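The fable above can be sketched in a few lines. All class, knob, and metric names here are invented for illustration: a low-level pool with three explicit set points, an adaptive layer with one steering set point, and a metric export that flattens both layers into one namespace, discarding which knob steers which:

```python
class ConnectionPool:
    """Low-level component: three explicit set points (hypothetical)."""
    def __init__(self, min_size, max_size, idle_timeout_s):
        self.min_size = min_size
        self.max_size = max_size
        self.idle_timeout_s = idle_timeout_s
        self.active = 0  # observable measure

class AdaptivePool:
    """Higher layer: one steering set point; the three knobs below are
    now derived by a (toy) policy rather than configured directly."""
    def __init__(self, target_latency_ms):
        self.target_latency_ms = target_latency_ms
        self.pool = ConnectionPool(min_size=2, max_size=16, idle_timeout_s=30)

    def observe(self, latency_ms):
        # Toy adaptation rule: widen the pool while latency is over target.
        if latency_ms > self.target_latency_ms and self.pool.max_size < 64:
            self.pool.max_size *= 2

    def export_metrics(self):
        # Every layer's knobs and measures land in one flat namespace;
        # the layering and steering relationships are not represented.
        return {
            "adaptive.target_latency_ms": self.target_latency_ms,
            "pool.min_size": self.pool.min_size,
            "pool.max_size": self.pool.max_size,
            "pool.idle_timeout_s": self.pool.idle_timeout_s,
            "pool.active": self.pool.active,
        }

p = AdaptivePool(target_latency_ms=50)
p.observe(120)  # latency over target, so the adaptive layer widens the pool
print(p.export_metrics()["pool.max_size"])  # 32
```

An operator reading the exported dictionary sees five equally weighted numbers; nothing marks `adaptive.target_latency_ms` as the one set point a human should touch, or `pool.max_size` as a value the layer above now controls.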
The component developer, with all their operational wisdom, decided it would be useful to keep all the control (managed) points and observable values (measures) at each layer of the stack accessible for inspection and reporting. But lost in this new flat information model were the structure, scope, sense, sets, and simplification. This is the problem with observability when it stands apart from what application monitoring is for – delivering meaning and enabling understanding.
A goal for engineering at Instana is not just improving our ability to see more, but creating models and user experiences that help our customers learn, in greater depth, the dynamics of their systems, services, components, flows, and operational states.