The ROI of Real-Time, AI-Driven Incident Monitoring and Prediction: Nine New Application Issues, Part 5

Losing Time

There’s a dirty little secret in the legacy monitoring world: in most monitoring tools, it can take 5-10 minutes for an alert to fire. Collection intervals measured in minutes, plus evaluation rules that wait for several consecutive breaching data points, add up quickly. That’s 5-10 minutes before you have any indication that there is a problem. Here’s the good news: there’s a better, faster way.

Welcome to Part 5 of the “9 New Issues Facing Cloud Applications” blog series. In this post, we’ll explore one of the issues with the most significant impact on lost revenue – Issue #7: “Alerts take too long to trigger, making impact to business too costly.”

Let’s take a look at the major steps in the incident management process:

  1. Incident begins - nobody in IT knows about it yet :-(
  2. Incident detection - find out about the problem from an alert (or worse, from an end user)
  3. Incident prioritization - is this a high enough priority to work on right now?
  4. Incident investigation
    • Level 1 runbooks
    • Level 2 escalation/analysis (Application support / DevOps)
    • Level 3 escalation/analysis (Domain experts / All hands on deck)
  5. Problem resolution
  6. Service/application testing
  7. Incident resolution - Declare Victory!

Every step above contributes to the overall time it takes to restore service. The purpose of monitoring tools is to either shorten this entire incident lifecycle or to avoid it altogether. Why is this so important? Because revenue loss due to downtime is significant across every industry.
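
To make the math concrete, here’s a minimal sketch (Python, with invented stage durations) of how every lifecycle stage adds to the final downtime bill:

```python
# Hypothetical per-stage durations, in minutes, for one incident. The stage
# names mirror the lifecycle steps above; every number is illustrative only.
STAGE_MINUTES = {
    "detection":      10,  # time before anyone in IT knows (the alerting delay)
    "prioritization":  5,
    "investigation":  30,  # L1 runbooks -> L2 -> L3 escalation
    "resolution":     15,
    "testing":        10,
}

def downtime_cost(stages, cost_per_hour):
    """Total revenue impact: the sum of all stage durations times hourly cost."""
    total_minutes = sum(stages.values())
    return total_minutes / 60 * cost_per_hour

print(f"Time to restore service: {sum(STAGE_MINUTES.values())} minutes")
print(f"Cost at $100K/hour: ${downtime_cost(STAGE_MINUTES, 100_000):,.0f}")
# Shaving 10 minutes off detection alone removes ~$16,667 from that bill.
```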

Time is Money

A 2017 web poll of over 800 companies about the cost of downtime, conducted by Information Technology Intelligence Consulting (ITIC), showed the following:

  • 98% reported that 1 hour of downtime costs at least $100,000 US ($16,667 / 10 minutes)
  • 81% reported that 1 hour of downtime costs at least $300,000 US ($50,000 / 10 minutes)
  • 33% reported that 1 hour of downtime costs at least $1.5 million US ($250K / 10 minutes)

These are staggering numbers, but not surprising given the ongoing digital transformation occurring in every company across every sector.

Can you afford to put up with an extra 10-minute delay? On EVERY problem?

Save Time, Save Money

If you want to protect revenue, you have two main options: avoid issues altogether, or cut as much time as possible between the moment an incident starts and the moment service has been restored.

Here are some tips on minimizing loss of revenue due to incidents:

  • Be proactive. Countless applications make it into production every day with little to no monitoring in place. Consider your monitoring implementation from the beginning of any application project. The first question I always hear when an incident occurs is “What kind of monitoring do we have in place?” If you’re asking that question during an incident, it might be too late. Put a plan in place to add proper monitoring as soon as possible.
  • Be inclusive. Ensure that your monitoring contains both infrastructure AND application data. You’ll want as much correlated data across the full technology stack as possible to detect and isolate issues early.
  • Incorporate tracing. Distributed tracing complements full-stack infrastructure monitoring by following individual requests across service boundaries.
  • Automate wherever possible - using automation during initial installation, configuration, and ongoing operations will ensure you have the most accurate monitoring coverage possible. You NEED accurate, granular, correlated data from all of your infrastructure and application components.
  • Data fidelity matters. The granularity of your data will impact your ability to detect and isolate problems. 1-second granularity is the new standard for all time series metric data so that “averaging” does not hide bad behavior (see the first sketch after this list).
  • Stream data. Collecting and analyzing data as it arrives - rather than polling in batches - will have a major impact on reducing the time it takes to detect and isolate problems (see the second sketch after this list).
  • AI-powered analysis has become a requirement. There are so many layers of virtualization across complex microservices that it’s nearly impossible to troubleshoot these systems manually in a timely manner. Using AI has the following benefits:
    • Predictive analytics - possibly avoid problems
    • Incident correlation and aggregation - avoid alert storms
    • Automated root cause analysis with full causation (not simple correlation)
    • Remediation recommendations using expert advice
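
Here’s a small sketch of the data fidelity point: a synthetic latency series (all numbers invented) where a 10-second spike is obvious at 1-second granularity but disappears into a 1-minute average:

```python
import statistics

# One latency sample per second: a ~50 ms baseline with a 10-second
# spike to 2000 ms. All values are synthetic, for illustration only.
latency_ms = [50.0] * 60
latency_ms[20:30] = [2000.0] * 10

minute_avg = statistics.mean(latency_ms)  # what a 1-minute rollup reports
worst_second = max(latency_ms)            # what 1-second data preserves

print(f"1-minute average: {minute_avg:.0f} ms")    # 375 ms - merely "degraded"
print(f"worst second:     {worst_second:.0f} ms")  # 2000 ms - clearly broken
# An alert thresholded at 500 ms never fires on the averaged series, even
# though users saw 2-second responses for 10 straight seconds.
```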
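
And a sketch of the streaming point: evaluate every sample the moment it arrives instead of waiting for a polling cycle. The rolling z-score below is a deliberately simple stand-in for the statistical and AI models a real monitoring product would use; all data is invented:

```python
from collections import deque
import statistics

def stream_detector(window=60, z_threshold=4.0):
    """Flag each incoming sample that deviates sharply from recent history."""
    history = deque(maxlen=window)

    def check(value):
        anomalous = False
        if len(history) >= window // 2:  # wait for enough baseline data
            mean = statistics.mean(history)
            stdev = statistics.pstdev(history) or 1e-9  # avoid divide-by-zero
            anomalous = abs(value - mean) / stdev > z_threshold
        history.append(value)
        return anomalous

    return check

# Two minutes of normal latency, then a spike (synthetic data).
check = stream_detector()
for second, latency in enumerate([50.0] * 120 + [2000.0] * 5):
    if check(latency):
        print(f"anomaly at t={second}s, latency={latency} ms")
        break  # fires within a second of the spike, not minutes later
```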

All of the tips above, when used together, significantly reduce downtime due to incidents. The ability to perform AI analysis as part of your stream processing can cut the traditional 5-10 minute alerting delay down to a few seconds.

Based on the ITIC figures above, most companies can save anywhere from $16,667 to $250,000 per incident just by detecting incidents 10 minutes faster. The savings really add up when you take advantage of advanced capabilities like AI that can help resolve issues 30-60 minutes faster.

The table below assumes six major incidents per year. As you can see, these simple calculations show annual revenue savings ranging from $100K to $9M per year.

Hourly downtime cost   10 min faster   30 min faster   60 min faster
$100,000               $100K           $300K           $600K
$300,000               $300K           $900K           $1.8M
$1,500,000             $1.5M           $4.5M           $9M

*Table assumes 6 major incidents per year
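
For anyone who wants to check the math, this short sketch reproduces the table from the ITIC hourly figures:

```python
# Annual savings = hourly downtime cost, scaled to the minutes saved per
# incident, times the assumed six major incidents per year.
HOURLY_COSTS = [100_000, 300_000, 1_500_000]  # from the ITIC poll above
MINUTES_SAVED = [10, 30, 60]
INCIDENTS_PER_YEAR = 6

for cost in HOURLY_COSTS:
    row = [cost / 60 * m * INCIDENTS_PER_YEAR for m in MINUTES_SAVED]
    print(f"${cost:>9,}/hr: " + "  ".join(f"${v:>9,.0f}" for v in row))
# $  100,000/hr: $  100,000  $  300,000  $  600,000
# $  300,000/hr: $  300,000  $  900,000  $1,800,000
# $1,500,000/hr: $1,500,000  $4,500,000  $9,000,000
```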

Technology is always progressing, and monitoring technologies get to take advantage of the latest breakthroughs to become more effective and efficient. Stream processing and AI have combined to dramatically reduce the time it takes to detect, isolate, and remediate production issues.

In my next post, I’ll discuss Issue #8: “Performance analysis expertise for new technologies is hard to find, making it difficult to troubleshoot problems.”

In the meantime, expand your technical skills with Stan’s Robot Shop sample microservice application or try Instana in your environment to see how quickly you can monitor, detect, isolate, and resolve problems.