Bringing semantics to monitoring: metrics and health

Software systems need to be built with observability in mind. Observability allows you to look inside running software to understand how it is operating and to help find issues. Modern software systems that are designed and implemented with observability in mind usually provide a comprehensive set of metrics that can be used to monitor the system. Unfortunately, software systems usually don't accompany their metrics with metadata describing their type, relevance, importance, or contextual role in a way that helps humans or machines interpret their meaning. Metrics lack semantics entirely, and monitoring built on them lacks an adequate semantic model as well.

As an example of this lack of semantics, let's take a fairly raw metric: retrans/s as provided by Linux's sar tool. It clearly is a number, but it's not defined as such anywhere except the sar man page, which is an external resource decoupled from both the metric itself and the software system providing it. The metric doesn't describe itself as a number that is collected per second and reset to zero for the next second, rather than being cumulative. And even if a human might interpret it as such, a monitoring tool, or any kind of machine observing this metric, has absolutely no chance of knowing its semantics.
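To make the per-second vs. cumulative distinction concrete, here is a minimal Python sketch of how such a rate is typically derived from the kernel's cumulative tcpRetransSegs counter in /proc/net/snmp (a simplified reading of what sar does; error handling omitted):

import time

def read_retrans_segs():
    # The cumulative TCP RetransSegs counter lives in /proc/net/snmp:
    # one "Tcp:" line holds the field names, the next one the values.
    with open("/proc/net/snmp") as f:
        tcp_lines = [line.split() for line in f if line.startswith("Tcp:")]
    keys, values = tcp_lines[0], tcp_lines[1]
    return int(values[keys.index("RetransSegs")])

# retrans/s is the delta of the cumulative counter per sampling
# interval, effectively "reset to zero" every second.
previous = read_retrans_segs()
while True:
    time.sleep(1)
    current = read_retrans_segs()
    print(f"retrans/s: {current - previous}")
    previous = current

Nothing in the resulting number tells an observer that this derivation took place; the semantics live only in the code that produced it.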

But it doesn't end there. Looking at the man page, the description of the metric leaves the interpretation of its operational meaning entirely to you: judging whether a value is good or bad, when to ignore it, and when it will have an impact, and on which parts of the system. It states: "The total number of segments retransmitted per second - that is, the number of TCP segments transmitted containing one or more previously transmitted octets [tcpRetransSegs]". The description assumes that anybody looking at the metric has the networking and operating systems knowledge needed to understand its operational meaning and impact.

This is a tiny but typical example of how software systems leave the semantics completely to the observer instead of describing them through some sort of metadata. They leave you alone with a web search that might help clarify the purpose of the metric, but more often ends in outdated documentation and pointless philosophical discussions where everybody is right, each for their own narrow case. In our example, the software system, instead of deferring to external human-readable resources, could attach a semantic description to the metric, something like:

metric:
  name: retrans/s
  level: atomic
  resource: network
  type: number
  granularity: 1s
  cumulative: no
  importance: severe
  good: = 0
  urgent: > 5

This description is readable by humans as well as machines, though of course incomplete in this simple example. An intelligent monitoring tool could automatically derive processing logic for this metric and support the human in judging the urgency whenever this particular metric indicates a misbehaviour.
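As an illustration of what "automatically derive processing logic" could mean, here is a small Python sketch that judges a single sample purely from the description above; the dictionary layout and the judge function are hypothetical and only serve the example:

import operator

# Hypothetical in-memory form of the semantic description above.
METRIC_META = {
    "name": "retrans/s",
    "importance": "severe",
    "good": ("=", 0),
    "urgent": (">", 5),
}

OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}

def judge(value, meta):
    # Apply the metric's own thresholds instead of hard-coded ones.
    op, threshold = meta["urgent"]
    if OPS[op](value, threshold):
        return "urgent"
    op, threshold = meta["good"]
    if OPS[op](value, threshold):
        return "good"
    return "degraded"  # between "good" and "urgent": worth watching

print(judge(0, METRIC_META))   # good
print(judge(3, METRIC_META))   # degraded
print(judge(12, METRIC_META))  # urgent

The point is that no human had to configure these thresholds in the monitoring tool: they travel with the metric itself.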

But this is still not enough. Since this metric is raw and essentially atomic, even if it were semantically annotated, it would describe itself at a level that is wrong most of the time it is being observed. It's too low-level. In-depth observability based on metrics is great when you're hunting an issue. That doesn't make it useless otherwise, but a typical software system isn't in trouble all the time; there are usually longer periods of operation that appear stable. And the last thing you're interested in during these stable periods is metrics. What you're interested in are coarse statements such as:

  • is everything alright?
  • is there an issue coming?
  • is there already an issue?

These coarse statements are of course derived from the metric's value and its behaviour over time, but they provide information that is absolutely sufficient to describe the current and historic health of the system according to this obviously important metric, which, we believe, is the level of detail necessary to adequately describe quality of service.
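One way such a derivation could look, sketched in Python; the window size and the trend heuristic are illustrative assumptions, not a prescription:

from collections import deque

class HealthAssessor:
    # Turns a stream of raw samples into the three coarse statements.
    def __init__(self, urgent_threshold=5, window=60):
        self.urgent = urgent_threshold
        self.samples = deque(maxlen=window)  # last `window` samples

    def observe(self, value):
        self.samples.append(value)
        if value > self.urgent:
            return "there is already an issue"
        # A rising average approaching the threshold hints at trouble.
        half = len(self.samples) // 2
        if half > 0:
            older = list(self.samples)[:half]
            recent = list(self.samples)[half:]
            older_avg = sum(older) / len(older)
            recent_avg = sum(recent) / len(recent)
            if recent_avg > older_avg and recent_avg > self.urgent / 2:
                return "there is an issue coming"
        return "everything is alright"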

Taking the same example, higher-order semantics indicating health per metric could look something like:

metric:
  name: retrans/s
  health: good

It's as simple as that. When you consider the whole monitored environment as a highly dynamic Graph (a concept in the context of monitoring that we will explain in upcoming blog posts), based on how the parts of the system are currently related to each other, a whole hierarchy of decisions can be made on health statements like this one. Should the overall health not be good, you can drill down level by level, all the way to the lowest metric that indicated a severe issue. That's the moment you need the numbers, ideally together with the metadata described above.
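To illustrate the drill-down, here is a deliberately tiny Python sketch of a worst-case health roll-up over such a hierarchy; the node structure is invented for this example and implies nothing about how Instana's Dynamic Graph is actually built:

HEALTH_ORDER = {"good": 0, "degraded": 1, "urgent": 2}

def roll_up(node):
    # Propagate the worst health upward, so "is everything alright?"
    # can be answered at every level of the hierarchy.
    child_healths = [roll_up(child) for child in node.get("children", [])]
    worst = max(child_healths + [node.get("health", "good")],
                key=HEALTH_ORDER.get)
    node["effective_health"] = worst
    return worst

# A made-up slice of the environment: host -> network -> retrans/s.
environment = {
    "name": "host-1",
    "children": [
        {"name": "network", "children": [
            {"name": "retrans/s", "health": "urgent"},
        ]},
        {"name": "disk", "health": "good"},
    ],
}

print(roll_up(environment))  # urgent: drill down to find retrans/s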

But sadly, this is a hypothetical example. Virtually none of the systems being monitored provide semantics for their metrics themselves. One should be happy that most modern systems are built with observability in mind, though the quality of the metrics they provide can and should be questioned. Knowing that observed systems do not provide semantics, it is all the more surprising and disappointing that most monitoring tools simply ignore this fact. They either resort to blind data crunching on raw numbers, attaching no meaning to the outliers and changes they detect, or they merely present the metrics more nicely to the human observer, confining themselves to alerting based on thresholds. The better ones codify these semantics implicitly in plugins, mixing them up with the alerts based on them.

None of these approaches goes deep enough to fill the gap of missing semantics. But Instana does fill this gap. Our knowledge base is designed from the ground up to describe metrics semantically, to collect and process them based on their type and relevance, to derive hierarchical health statements from their behaviour, and to provide just enough information, during sunshine periods as well as in times of trouble, to identify and fix the issue.

We believe that in the world of modern monitoring, with new technologies and frameworks appearing every day, varied and unpredictable mixes of technologies, continuous delivery and the DevOps approach, throw-away infrastructures and infrastructure abstraction through virtual resource management, a clear and dynamic semantic model is essential. Adding semantics to metrics and deriving coarse health statements is just one part of how we model this world in our product, but there is a lot more, so stay tuned: in further posts we will introduce more concepts that we consider missing and extremely helpful.

The number of technologies we support based on our model is rapidly growing. Check out our demo application and give us your feedback: