Monitoring Microservices (Part III) – Investigate: Troubleshooting with Machine Intelligence

Performance incident response is still largely a manual process: dashboards are created on the fly and data is correlated visually. Instana automates the incident response process.

In Part I of this ‘Monitoring Microservices’ series we described how we automatically discover all the components, services and their changes in dynamic and scaled environments like microservice-based applications.

Part II described our approach to component health, our KPI- and machine-learning-based approach to determining quality of service, and how the right visualizations give DevOps a correct understanding of their environment.

The ultimate goal of monitoring is to ensure quality of service. When service is compromised, an immediate reaction is needed to minimize the disruption of business services. This article is about the investigation process and how Instana augments this process using machine intelligence. Let’s first look at what it means to respond to an incident and how this is done in practice.

Incident Response Process

Netflix pioneered microservices adoption and has often set the best practices for DevOps.

In a recent talk Brendan Gregg shared how, at Netflix, they streamlined the “SRE Performance Incident Response” and built a checklist in order to ensure a timely and effective reaction to an incident. This checklist is an internally shared doc with as many as 66 steps.

When an incident occurs, the Netflix team responsible for the service/application must act as fast as possible – even at 3 in the morning. The metric Mean Time To Repair (MTTR) measures the time from the reporting of an incident to the time when the problem is fixed. A fix can be a restart of a service, a reconfiguration of the load balancer, or a rollback of a deployment, but it can also be a code fix and the deployment of a hotfix.
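As a minimal sketch of the metric, MTTR over a set of resolved incidents is simply the average of "fixed" minus "reported". The incident log and the function name `mean_time_to_repair` below are hypothetical and only illustrate the arithmetic.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average of (fixed_at - reported_at) over resolved incidents."""
    durations = [fixed_at - reported_at for reported_at, fixed_at in incidents]
    if not durations:
        raise ValueError("need at least one resolved incident")
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incident log: (reported_at, fixed_at) pairs.
incidents = [
    (datetime(2017, 1, 10, 3, 0), datetime(2017, 1, 10, 3, 45)),     # 45 minutes
    (datetime(2017, 1, 12, 14, 0), datetime(2017, 1, 12, 15, 15)),   # 75 minutes
]

print(mean_time_to_repair(incidents))  # 1:00:00
```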

The Netflix process when an incident occurs looks like this:

  1. Check incident and details,
  2. Get an overview of the situation and check impact,
  3. Evaluate the health of the services,
  4. Drill into the details of problematic services,
  5. Correlate metrics of different components to find patterns,
  6. Correlate metrics over time to see patterns,
  7. Repeat steps 3-6 for other services and new findings,
  8. Fix problem based on investigation.

During the process many custom dashboards are created on the fly. Let’s go through the process in detail.

The first step is a Performance and Reliability Engineering (PRE) checklist that aims to quantify the problem (# support calls, regional vs global, record timestamps, etc.) and to narrow it down by building a first dashboard with error rates and the change in error rates. The PRE checklist also includes the first service health investigation, with another custom dashboard showing CPU usage, memory, load average and more.
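The two numbers on that first dashboard are easy to state precisely. The sketch below computes an error rate per time window and its change between two adjacent windows; the window values and the function names are hypothetical and are not taken from the Netflix checklist.

```python
def error_rate(errors, requests):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def error_rate_change(previous, current):
    """Difference in error rate between two (errors, requests) windows."""
    return error_rate(*current) - error_rate(*previous)


# Hypothetical five-minute windows: (errors, requests).
before_incident = (5, 1000)
during_incident = (40, 1000)

print(f"error rate now: {error_rate(*during_incident):.1%}")
print(f"change: {error_rate_change(before_incident, during_incident):+.1%}")
```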

Among the several custom-built dashboards, the Cloud App Performance Dashboard collects:

  1. Load
  2. Errors
  3. Latency
  4. Saturation
  5. Instances.

Once the incident is narrowed down to a service, it’s time to move to the underlying infrastructure and build the “Bad Instance Dashboard”.

Figure: Bad Instance Dashboard (Brendan Gregg, Performance Checklists for SREs, slide 28/79)

And all of that comes before digging into the Linux performance analysis, the Linux Disk Checklist and the Linux Network Checklist.

You might see where we are going with this; in Brendan Gregg’s words, “We have countless more, mostly app specific and reliability focused [dashboards]”. Some of them are predefined and pre-built, while others need to be built on the fly, under stress, while investigating a specific incident.

Figure: More dashboards (Brendan Gregg, Performance Checklists for SREs, slide 29/79)

To summarize, it seems that the state-of-the-art incident response process primarily consists of visual correlation of data on dashboards.

This job is especially difficult in highly dynamic and scaled microservice environments, where finding the right metrics to correlate is a challenge in itself and humans are not really good at correlating large numbers of metrics. Machines, on the other hand, love doing so.
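As a rough illustration of the kind of correlation a machine can do tirelessly, the sketch below ranks candidate metrics by the strength of their Pearson correlation with a degraded KPI. The metric names and sample values are made up, and Pearson correlation is just one simple choice; this is not a description of how Instana correlates data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(target, candidates):
    """Sort candidate metrics by absolute correlation with the target."""
    scores = {name: pearson(target, series) for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical samples taken at the same six points in time.
latency_ms = [100, 105, 98, 240, 410, 395]
candidates = {
    "db_cpu_percent": [20, 22, 19, 65, 93, 90],
    "cache_hit_ratio": [0.95, 0.94, 0.95, 0.93, 0.95, 0.94],
    "disk_free_gb": [50, 50, 50, 50, 50, 50],
}

for name, score in rank_by_correlation(latency_ms, candidates):
    print(f"{name}: {score:+.2f}")
```

With three metrics a human can do this by eye on a dashboard; with thousands of metrics across hundreds of instances, only the automated version remains practical.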

Automating the Investigation Process

As mentioned, the state-of-the-art incident response process is still visual correlation of data on dashboards. That is because current monitoring tools are still limited to filling screens with time charts and metrics.

Not anymore: this is precisely why we started Instana!

In Part I we covered why continuous and automatic discovery of your services and infrastructure is key to keeping up with your ever-changing environments.

Part II explains how Instana analyzes the collected data and creates the Dynamic Graph. The Graph enables dependency understanding and the visualization of your environment from both the physical and service perspectives. The KPIs Instana derives for each service are the same ones tracked by Netflix: Load, Latency, Errors, Saturation and Instances. Instana uses these KPIs to trigger “Incidents” when quality of service is impacted, e.g. by slow response times or a sudden drop in requests. When an Incident occurs, Instana traverses the Dynamic Graph and immediately analyzes the impact upstream and downstream.
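To make the upstream and downstream analysis concrete, here is a minimal sketch of a breadth-first traversal over a service dependency graph. It is an illustration only, not Instana’s implementation: the `calls` map, the service names, and the helpers `reachable`, `invert` and `impact` are all hypothetical, and we assume that “upstream” means the callers of the affected service while “downstream” means the services it depends on.

```python
from collections import deque


def reachable(graph, start):
    """Breadth-first search: all nodes reachable from start, excluding start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor != start and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def invert(graph):
    """Reverse every edge, turning 'calls' into 'is called by'."""
    inverted = {}
    for caller, callees in graph.items():
        for callee in callees:
            inverted.setdefault(callee, set()).add(caller)
    return inverted


def impact(calls, service):
    """Return (upstream, downstream) sets for the affected service."""
    upstream = reachable(invert(calls), service)  # callers that may see the problem
    downstream = reachable(calls, service)        # dependencies that may cause it
    return upstream, downstream


# Hypothetical call graph: each service maps to the services it calls.
calls = {
    "web": {"checkout", "search"},
    "checkout": {"payments", "inventory"},
    "search": {"inventory"},
    "payments": {"db"},
    "inventory": {"db"},
}

upstream, downstream = impact(calls, "inventory")
print("upstream:", sorted(upstream))      # ['checkout', 'search', 'web']
print("downstream:", sorted(downstream))  # ['db']
```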

As said, failure is normal in such complex environments and every component can have individual issues. A monitoring tool must not only identify specific issues, but also understand how any issue correlates to incidents that violate service quality.

All changes and issues must be detected and analyzed in order to identify and predict incidents, but also to reduce the noise coming from issues that don’t impact the quality of service and should not wake up the DevOps team in the middle of the night.

Instana’s “Incident” is the starting point of the investigation for the DevOps team. An Incident is essentially a mini report indicating the affected elements (services, middleware and hosts) and their changes (including configuration changes), along with the timestamps, history and traces that led to the deterioration in quality of service.

All the dashboards of the components and clusters involved in the Incident are part of the Instana solution and just one click away; they don’t need to be built manually at the moment of the investigation.

Mapping this back to the Netflix process, Instana automates steps 2 to 6 of the above list: no manual narrowing down of the incident, no manual building of dashboards, no visual correlation of data.

The Instana investigation process starts by reading through the plain-English explanation of what happened, where and when.

From the Incident one can immediately jump to each and every trace involved, analyze its call patterns, and look for errors.

Should you wonder why certain calls are being made, you can directly access the source code that performed the calls in question.

Conclusion

Ensuring quality of service is the ultimate goal of monitoring tools. But in today’s world of high scale, machine assistance has become a requirement to sift through all the data and complexity. Instana’s Incident approach to QoS management means:

  • Fewer alarms, because the service KPI approach reduces noise,
  • No manual dashboard building,
  • No manual correlation hunting,
  • Immediate understanding of all automatically correlated issues,
  • Direct access to the most relevant evidence to identify and solve issues.

Let Stan do the dirty work of sifting through all the system data to find problems. Your engineers can focus on their core job: creating new business services!