Monitoring Needs to be Immediate and Accurate

Former Gartner APM analyst Jonah Kowall said of the requirements for APM tools: "The Second is the New Minute". I could not agree more! I see two dimensions in which this holds: data granularity and delay.

Data

Examining the plethora of current IT monitoring tools for infrastructure and application management, it is clear that they all do the same thing: they collect data. They perform an incredible number of individual measurements using their sensors. The data collection ranges from sensors built into hardware, through sensors for components like databases or operating systems, up to higher-level application sensors and even sensors probing us, the end users. In its purest form, this data is a timestamp and a value.

Granularity

When today's monitoring tools collect data, they consume it in processed form. The most common form of processing is some variation of averaging: sensors collect data over a time period and then return an arithmetic mean or median. Humans also prefer processed data. Raw data is too large to understand in a reasonable time, so tools aggregate it to simplify. Humans like to think about values per unit of time, like requests per minute. While tools could compute this every second, the result would be confusing, so it is quite natural to wait a minute and sum up all requests during that timeframe. However, simple aggregations can easily destroy the useful information contained in the measurements.

Take a look at the following graph, where I have deliberately cut off the value axis. The data up to about 10:21:30 is the result of a 5-second tumbling-window mean; after that, you see the value plotted each second.

To be clear, this compares the data processed as a 5-second mean with 1-second raw data. Here is a picture of the same data using the 1-minute mean rollup that typical IT infrastructure monitoring tools use:

While one can easily see that the more granular data presents better information, the story becomes even more interesting once we realize that the grey line represents a capacity limit. The blue spikes touching that limit show that the system is in danger.
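To make the effect concrete without the graphs, here is a minimal sketch in Python. It does not use the data from the figures: the series, the capacity value of 100 and the helper name `tumbling_mean` are all assumptions chosen for illustration. It builds one minute of per-second values with brief spikes to the capacity limit and then applies 5-second and 1-minute tumbling-window means.

```python
def tumbling_mean(samples, window):
    """Mean of consecutive, non-overlapping windows of `window` samples."""
    return [
        sum(samples[i:i + window]) / len(samples[i:i + window])
        for i in range(0, len(samples), window)
    ]


CAPACITY = 100  # hypothetical capacity limit (the grey line in the graph)

# Hypothetical per-second values for one minute: a baseline of 40 with a
# one-second spike to the capacity limit every 10 seconds.
per_second = [CAPACITY if t % 10 == 0 else 40 for t in range(60)]

five_second = tumbling_mean(per_second, 5)   # 12 values
one_minute = tumbling_mean(per_second, 60)   # 1 value


def times_at_capacity(series):
    """Count how many points in the series reach the capacity limit."""
    return sum(1 for value in series if value >= CAPACITY)


for name, series in [
    ("1 s raw", per_second),
    ("5 s mean", five_second),
    ("1 min mean", one_minute),
]:
    print(f"{name:>10}: max={max(series):6.1f}  at capacity={times_at_capacity(series)}")
```

In this made-up series the raw data touches the limit six times, while neither rollup ever comes close to it. The spikes are still in the measurements, but the averaging has removed them from view.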

Delay

When managing our application systems, it is surprising how long we are forced to wait for results. I once worked with an operations team fighting an outage. We knew we had to reboot the server, but it hung for 20 minutes in the boot sequence, and then we missed the single second in which we could get into the BIOS to change something, only to wait another 20 minutes. It's no fun when computers make us wait, especially in times of crisis. Another example: when we roll out a software update, we want to know whether it had the intended effect within a couple of seconds, not a couple of minutes.

In monitoring, delay is closely coupled to granularity. Let's stay with the example of a software rollout into production. Roughly speaking, if a sensor has to wait a minute before it can transmit its data for processing because it is calculating means, the processing unit then has to account for late reporting (usually by waiting another minute) and needs time to crunch the data (let's say another minute). As a result, the user sees the effect of the new software update about 3 minutes after the rollout. And because of the realities of aggregated data detailed above, we might see only a distorted picture of the problems caused by the new software, or nothing at all.
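The arithmetic behind that estimate can be written down as a small Python sketch. The `Pipeline` class and its field names are hypothetical, and the one-second figures in the second example are purely illustrative assumptions, not measurements of any product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipeline:
    """Hypothetical model of the three delays described above, in seconds."""
    aggregation_window_s: float  # sensor waits this long before transmitting
    late_data_grace_s: float     # processing waits this long for late reports
    processing_s: float          # time needed to crunch the data

    def time_to_visibility_s(self) -> float:
        """Delay between a change and the user seeing its effect."""
        return self.aggregation_window_s + self.late_data_grace_s + self.processing_s


# The minute-based example from the text: one minute per stage.
minute_based = Pipeline(aggregation_window_s=60, late_data_grace_s=60, processing_s=60)

# Illustrative assumption only: the same stages at one second each.
second_based = Pipeline(aggregation_window_s=1, late_data_grace_s=1, processing_s=1)

print(minute_based.time_to_visibility_s() / 60, "minutes")
print(second_based.time_to_visibility_s(), "seconds")
```

The point of the model is that the aggregation window sets the scale for every later stage, so shrinking it from a minute to a second shrinks the whole chain.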

Why Instana is Faster and More Accurate

Instana is building a monitoring solution that can predict and determine root cause. To accomplish this, we must work from the most granular, precise data that is practical. As the example above illustrated, the system reaching capacity was only obvious when visualizing second-level metrics. Capturing granular data also allows users to view realistic graphs when they need to cross-check the details directly.

Because we send data every second and process it as soon as it arrives, automated reasoning and visualization happen immediately. To make this even clearer: showing more data on a dashboard may be nice, but it is not a revolution. Applying machine learning, real-time analytics and a knowledge-base approach to data of this granularity opens a totally new era in monitoring the health of systems and applications!

Of course we face the same challenges as all monitoring tools, such as accounting for late data, but the wait is a matter of seconds, not minutes. This allows a much faster reaction to problems caused by system changes such as new software deployments or scaling down cloud services.

Processing inbound data every second may look like a case of “just do more faster”, but in fact this improvement requires a completely different approach. To accomplish it, Instana has re-invented three layers.

First, we have built a totally new agent data collection concept with a new compression model to reduce data volume. This allows us to collect and send data every second.

Second, storing this volume of data requires a custom time series database which can store data long term, but also pass along massive quantities of data for processing.

And third, on the analysis side, we have constructed a hugely scalable complex event processing engine tailored to the needs of monitoring system and application health. Raw data is streamed to the knowledge engine and the user interface, and is also used to compute statistical information for long-term storage.

Check out our demo application and see the difference for yourself: