Automatic Incident recognition to rid the world of Alert Storms

It really has become a problem! A plethora of applications, tools and systems generate alarms and alerts at a frequency and volume greater than humans can cope with. Thus the ability to group, filter, prioritize and otherwise automate the management of the continuous stream, or flood, or storm of alerts hammering a dev/IT-ops team seems to make sense. Or does it?

It does, if it were not for the fact that alert storms are actually an ‘ancillary’ problem, caused by a solution to another problem: the monitoring and management of systems. So if putting an Alert Management tool on top of System/IT/Application/Network Monitoring tools looks like bolting one solution on top of another, then that’s pretty much exactly what it is, no matter how ‘complementary’ such a solution is positioned.

Alert storms in distributed applications happen for a simple reason: most sources of alerts are tools designed primarily to collect observational data in the form of a continuous stream of metrics. The ‘processing end’ of these tools watches the metrics for crossings of user-defined thresholds and triggers alerts. When something goes wrong with a particular component, there is often a chain reaction in other, related components that causes additional alerts, and these alerts typically repeat indiscriminately over time until the metric data falls back below the triggering thresholds.
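To see why the repetition itself becomes the storm, consider a deliberately naive sketch of such a threshold check. The metric source, the threshold and the interval below are purely illustrative, not taken from any particular tool:

```python
import random
import time

# Deliberately naive threshold alerter: with no memory of previous firings
# and no notion of related components, the same condition produces a fresh
# alert on every evaluation cycle until the metric drops back under the
# threshold. All values are illustrative only.
CPU_THRESHOLD = 90.0      # percent
CHECK_INTERVAL = 60       # seconds between evaluations

def read_cpu_percent() -> float:
    """Stand-in for whatever agent or exporter supplies the metric."""
    return random.uniform(50.0, 100.0)   # simulated reading

def alert(message: str) -> None:
    print(message)                       # stand-in for paging/notification

while True:
    cpu = read_cpu_percent()
    if cpu > CPU_THRESHOLD:
        alert(f"CPU at {cpu:.0f}% exceeds threshold of {CPU_THRESHOLD:.0f}%")
    time.sleep(CHECK_INTERVAL)
```

One stuck process is enough to make this loop page someone every single minute, and every downstream component running its own copy of the loop multiplies the noise.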

The original root cause easily gets lost in the resulting storm, and that really compounds the problem! The whole point of monitoring, after all, is to help IT precisely identify issues so they can be fixed quickly! Identifying and attacking the original problem, and thereby avoiding the alert storm in the first place, is the better avenue.

Let’s remember that the goal is to find the root cause and fix it. A modern monitoring tool should be able to cut through the noise and determine the exact root cause: the actual issue, not a storm of alerts and not a symptom. But let’s think about what would be needed to accomplish this.

While the ‘mechanics’ of monitoring - the data collection, the triggering of alerts - are by now “tried and true” (30k+ hits on GitHub when searching for ‘monitoring’!), one crucially important aspect always seems to be missing, and that is ‘understanding context’. A distributed application is a collection of software components that leverage and depend on the services they offer to each other. In other words, every component depends on other components to do its job. As mentioned before, a problem in one component will likely cause a chain reaction that makes other components trigger alerts as well. To realize that these alerts are in fact reactions to other alerts, we need a ‘contextual understanding of the dependencies’ among components!

With traditional monitoring, there is no implicit, automatic understanding of such a context. It is up to the ops team to approximate this understanding by manually creating layers of rules within a rule engine - that is, if the tool offers such a capability at all.

Let’s look at an example: “Trigger an alert when CPU is over 65%, but only if load (expressed in transactions per second) is lower than 100 (if it were higher, a high CPU load would actually be a good thing because actual work gets done), and also take into account that the database access rate (expressed in query execution time in milliseconds) is below 20 (meaning data read and write operations are not backed up and thus CPU cycles are not spent uselessly in futile I/O attempts).”
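For illustration only, a hand-written version of that rule might look roughly like this in code. The function name, parameter names and thresholds are invented for the example; a real tool would express the same thing in its own rule or DSL syntax:

```python
# A hand-maintained alerting rule approximating the scenario above.
# Names and thresholds are illustrative, not from any particular tool.

def should_alert_high_cpu(cpu_percent: float,
                          transactions_per_sec: float,
                          db_query_time_ms: float) -> bool:
    """Alert on high CPU only when the box is not doing useful work
    and the database is not the bottleneck."""
    cpu_is_high = cpu_percent > 65.0
    load_is_low = transactions_per_sec < 100.0   # low throughput: CPU is not 'earning' its usage
    db_is_healthy = db_query_time_ms < 20.0      # fast queries: CPU is not burned in futile I/O
    return cpu_is_high and load_is_low and db_is_healthy

# Every threshold above is a maintenance liability: change the hardware,
# the load profile, or the query mix, and the rule silently goes stale.
```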

While it may be a rather “contrived” description, the scenario is certainly valid and reflects observations and experience an ops team gathers over time. Its implementation as a rule may not be simple; the bigger problem, however, is its maintainability. Whenever anything involving the rule (its inputs, its thresholds, load patterns, the whole architecture of the app!) changes, the rule must be updated, otherwise it goes out of sync and the monitoring tool’s effectiveness suffers. And obviously there is not just one such rule governing a production system! Continuous Integration practices have only exacerbated the issue by accelerating the rate of change. How could one keep up with maintaining the whole rule-set if the system to monitor constantly changes?

The above scenario is meant to help explain what I mean by ‘contextual understanding of dependencies’. With such an understanding we could correctly attribute multiple, separate alerts (about CPU utilization, load, database access) to the same origin, consequently treat them as belonging to the same ‘Event’, and come to an accurate conclusion about the root cause. But let’s not forget what I mentioned earlier: alerts typically repeat over time until the underlying metric data falls below its thresholds. To deal with this aspect, we also need to develop the ‘contextual understanding’ over time, that is, realize that ‘Events’ may themselves be related and thus part of the same ‘Incident’. For example, the fact that one Event is still ongoing while another Event on a related component just happened may be reason to join them into one overarching Incident. An Incident, therefore, is a collection of Events over time, related by their dependencies.
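Purely as an illustration of the idea (not a description of any product’s implementation), grouping Events into Incidents along dependencies and time could be sketched like this; the component names, the `depends_on` map and the overlap logic are all assumptions made for the example:

```python
# Illustrative sketch only: group Events into Incidents when they occur on
# components that depend on each other and their time windows overlap.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Event:
    component: str            # e.g. "checkout-service"
    start: float              # epoch seconds
    end: Optional[float]      # None while still ongoing

@dataclass
class Incident:
    events: list[Event] = field(default_factory=list)

def related(a: Event, b: Event, depends_on: dict[str, set[str]]) -> bool:
    """Two Events are related if their components depend on each other
    and their time windows overlap (an ongoing Event overlaps anything later)."""
    linked = (b.component in depends_on.get(a.component, set())
              or a.component in depends_on.get(b.component, set()))
    a_end = a.end if a.end is not None else float("inf")
    b_end = b.end if b.end is not None else float("inf")
    overlap = a.start <= b_end and b.start <= a_end
    return linked and overlap

def group_into_incidents(events: list[Event],
                         depends_on: dict[str, set[str]]) -> list[Incident]:
    incidents: list[Incident] = []
    for ev in sorted(events, key=lambda e: e.start):
        for inc in incidents:
            if any(related(ev, other, depends_on) for other in inc.events):
                inc.events.append(ev)   # joins an existing Incident
                break
        else:
            incidents.append(Incident(events=[ev]))  # starts a new Incident
    return incidents
```

The point of the sketch is only this: once dependencies and time are both part of the model, the CPU alert, the load alert and the database alert collapse into one Incident instead of three parallel streams of repeats.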

Understanding the context of dependencies over time would let us grasp an issue as a precise root cause with its downstream impact. The alternative in practice today is receiving separate alerts for the CPU, for the transaction rate, for the database access rate, for the disk access - all repeating every 60 seconds. That’s the making of an alert storm right there.

And this is where Instana brings a different approach to the table. Instana is built around a core capability that all other monitoring solutions lack: a Dynamic Graph! Instana's Dynamic Graph holds a comprehensive understanding of the configuration, dependencies and health of every component over time. This ‘understanding of context’ is automatically discovered and maintained. There is no external, artificial rule-set needed to approximate the system to monitor. Instead, knowledge is built by automatically discovering components (servers, processes, frameworks, HTTP requests, databases, queries) and tracking the dependencies of how components interact (run on, are used by, are consuming, service, invoke, transport). This knowledge is continuously updated, meaning all changes impacting a distributed, complex software system are recorded and become part of the Dynamic Graph over time.
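As a rough mental model only - this is not Instana’s actual data model, just a sketch of the idea of a continuously updated graph of components and typed dependencies - such a structure might be represented like this:

```python
# Rough mental model of a dependency graph of discovered components.
# Illustrative sketch with invented names; not Instana's internal representation.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str                  # e.g. "order-db", "jvm-1234", "host-a"
    kind: str                  # e.g. "database", "process", "server"
    health: str = "healthy"

@dataclass
class Edge:
    source: str
    target: str
    relation: str              # e.g. "runs on", "invokes", "consumes"
    observed_at: float         # when the dependency was last seen

@dataclass
class DynamicGraphSketch:
    nodes: dict[str, Node] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)

    def upsert_node(self, node: Node) -> None:
        """Discovery keeps the graph current: new components are added,
        known ones are updated in place."""
        self.nodes[node.name] = node

    def record_dependency(self, source: str, target: str,
                          relation: str, observed_at: float) -> None:
        self.edges.append(Edge(source, target, relation, observed_at))

    def dependents_of(self, name: str) -> set[str]:
        """Components that would feel the downstream impact of 'name' failing."""
        return {e.source for e in self.edges if e.target == name}
```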

Using the Dynamic Graph, Instana can immediately and precisely recognize Events, determine their causality, effects and ramifications, and, with the latest release, join them into Incidents!

[Image: incident analysis]

Alert storms solved! No, Alert Storms prevented!