Root Cause Analysis

Concept

DevOps practitioners face significant problems in today’s world of dynamic applications composed of hundreds or possibly thousands of components. First, when things break, they need to detect and understand the problem as soon as possible, ideally before end users start to feel the service impact. Second, after restoring the service as quickly as possible, they need to find and fix the exact root cause to ensure the problem does not occur again. Practitioners trawl through log files, look at metrics, comb through events, consult crystal balls, and do whatever it takes to find the answer. It can take hours or days to identify the root cause of an issue, and often the cause is left unidentified, lurking in the background and waiting to reappear.

Thankfully, Instana has made significant strides in managing incidents and accelerating the identification of the root cause. Instana automatically detects changes, issues and incidents to help you detect, understand and investigate Quality of Service issues in your applications.

Changes

A Change is an event representing anything from a server start or stop to a deployment or a configuration change on a system. Changes are further separated into:

  • Changes - Changed configuration of components, e.g. versions, environment variable values, etc.
  • Offline/Online - Tracking presence of components under management.

Change events are used together with the Dynamic Graph to automatically relate configuration changes to incidents.

change details example

Issues

An Issue is an event that is created when an application, a service or any part of it becomes unhealthy. Instana comes with several hundred curated, out-of-the-box health signatures that detect problems ranging from degradations of service quality, to complex infrastructure issues, to disk saturation. Issues are automatically resolved as soon as the metrics, events or metadata return to the expected values.

In addition to built-in issues, you can define custom events to detect problems which are specific to your system.

To see all issues (both built-in and custom) detected by Instana, go to the "Events" view and choose the "Issues" tab. You can use Dynamic Focus to filter issues.

Each Instana issue contains the following information:

  • Severity - either CRITICAL or WARNING. CRITICAL means there is a direct or indirect risk of data loss or of the service being unavailable; WARNING means any other performance issue that might impact user experience or lead to a problem in the long term
  • Start, end time and duration of the issue
  • Affected entities - one or more entities affected by the problem
  • Details - a description providing further context and measures to resolve the problem
  • Metrics - charts showing the metric values relevant to the problem around the time it occurred
  • Where applicable, users can navigate to Unbounded Analytics to investigate traces, calls or page loads affected by the issue.

event details example

In this example, the CPU steal time on one Linux machine is suspicious and is therefore marked as an issue. An issue by itself does not trigger an alert; Instana simply notes that it happened. Should the service to which this system is connected behave badly, this issue will become part of the incident. This approach is one of the major benefits of Instana because it frees you from manually correlating events and performance problems. Just because something uses too much CPU for a while doesn’t mean there is a problem as such. Only when a service is impacted does this become relevant information.

Check out Manage Built-in Events for more information on managing built-in and custom issues.

Since Instana knows all dependencies between monitored services, it triggers Incidents for all Quality of Service issues that impact end users. Some critical infrastructure issues, such as disk saturation and Elasticsearch cluster split-brain situations, also trigger incidents because their most likely end result is data loss.

Note: Applications, services or endpoints which receive infrequent traffic (e.g. one call every 15 minutes) are not considered to have a sufficient basis for our issue detection.

The severity of an issue can change during its lifetime. The issue's severity represents the highest severity that was ever reached by that particular issue.
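The issue fields listed above, together with the rule that an issue keeps the highest severity it ever reached, can be modeled in a few lines. This is a minimal sketch for illustration only, not Instana's data model or API: the `Issue` class, its field names and the `observe` and `resolve` methods are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import IntEnum
from typing import List, Optional


class Severity(IntEnum):
    # Ordered so that max() picks the more severe level.
    WARNING = 1
    CRITICAL = 2


@dataclass
class Issue:
    """Hypothetical model of an issue; field names are illustrative."""
    title: str
    start: datetime
    affected_entities: List[str]
    severity: Severity
    details: str = ""
    end: Optional[datetime] = None  # None while the issue is still active

    def observe(self, current: Severity) -> None:
        """Record a newly observed severity; the stored value never decreases."""
        self.severity = max(self.severity, current)

    def resolve(self, when: datetime) -> None:
        """Close the issue once the monitored values are back to normal."""
        self.end = when

    def duration(self, now: datetime) -> timedelta:
        """Elapsed time up to the end, or up to `now` if still active."""
        return (self.end or now) - self.start


issue = Issue("CPU steal time high", datetime(2024, 1, 1, 12, 0),
              ["host-1"], Severity.WARNING)
issue.observe(Severity.CRITICAL)  # escalates
issue.observe(Severity.WARNING)   # does not de-escalate
issue.resolve(datetime(2024, 1, 1, 12, 30))
```

Using an `IntEnum` keeps the "highest severity wins" rule down to a single `max()` call.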

Incidents

Incidents carry the highest severity level. They are created when edge services accessed by end users are impacted or when there is an imminent risk of impact. Using the Dynamic Graph, all relevant events are correlated for each incident to provide context and root cause hypotheses.
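To make the idea of graph-based correlation concrete, the sketch below walks a dependency graph from an impacted edge service and collects the events found on every entity it depends on. This is not how the Dynamic Graph is implemented; the graph, the entity names, the event titles and the `correlate` function are hypothetical and only illustrate the principle.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical dependency graph: each entity maps to the entities it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "shop-frontend": ["cart-service", "catalog-service"],
    "cart-service": ["redis-1"],
    "catalog-service": ["es-cluster"],
    "redis-1": ["host-a"],
    "es-cluster": ["host-b"],
    "batch-job": ["host-c"],
}

# Hypothetical open events, keyed by the entity they were detected on.
EVENTS_BY_ENTITY: Dict[str, List[str]] = {
    "shop-frontend": ["Sudden increase in average latency"],
    "cart-service": ["Change: new version deployed"],
    "host-a": ["CPU steal time high"],
    "host-c": ["Disk usage high"],
}


def reachable(start: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return `start` plus everything it depends on, directly or transitively."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def correlate(edge_service: str,
              graph: Dict[str, List[str]],
              events: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Collect the events on every entity the edge service depends on."""
    related = reachable(edge_service, graph)
    return {entity: events[entity] for entity in sorted(related) if entity in events}


incident_context = correlate("shop-frontend", DEPENDS_ON, EVENTS_BY_ENTITY)
```

In this toy data the disk event on `host-c` is left out of `incident_context` because `shop-frontend` does not depend on that host, while the change on `cart-service` and the CPU issue on `host-a` are included.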

Below is an example of an incident. A service is suddenly responding more slowly than usual; we call this a sudden increase in average latency. The incident is automatically marked in yellow as a warning. The colour remains for as long as the incident is active. Once the incident is resolved, the colour changes to grey, and the incident remains available in the drill-down menu.

incidents expanded

The incident detail view is organized into three parts:

  1. The header contains the key facts of the incident:

    • Start time;
    • End time (current time if the incident is still ongoing);
    • The number of events still active;
    • The number of changes involved;
    • The number of affected entities.

incident KPIs

  2. The second section provides a visual representation of how the incident developed over time. The chart shows the complete time frame, from start to end, and all events sorted by start time. The view is limited to seven events when collapsed. Press the expand button to see the full view if your incident contains more than seven events. Clicking any of the bars opens the detail view for that issue:

incident population

  3. The third section contains the details for the graph view in section 2. A list of all events, sorted by start time, shows all available information for each event. Click an event to expand it:

expanded incident event

The details help in understanding the event and are followed by charts plotting the corresponding metrics. If an event is still active, the charts continue to render incoming metric values. Two flags may be shown, indicating that the event affects a service and/or that the event triggered the incident. If available, the flags are placed at the top of each event in the list.

When you focus on an event, the detail section provides the same information as the incident's event list described in point 3.
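The behaviour of the incident timeline (events sorted by start time, at most seven shown while collapsed) can be sketched as follows. The `(title, start)` tuple shape and the `timeline_rows` function are assumptions made for illustration; only the limit of seven comes from the description above.

```python
from typing import List, Tuple

COLLAPSED_LIMIT = 7  # the collapsed view shows at most seven events

# Hypothetical event shape: (title, start time as epoch seconds).
Event = Tuple[str, int]


def timeline_rows(events: List[Event], expanded: bool = False) -> List[Event]:
    """Return events ordered by start time, truncated unless expanded."""
    ordered = sorted(events, key=lambda event: event[1])
    return ordered if expanded else ordered[:COLLAPSED_LIMIT]


# Ten events supplied in reverse chronological order.
sample = [(f"event-{i}", 100 - i) for i in range(10)]
collapsed = timeline_rows(sample)
full = timeline_rows(sample, expanded=True)
```

With ten sample events, `collapsed` holds the seven earliest and `full` holds all ten.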

Events view

To see all events detected by Instana, go to the "Events" view and choose between the "Incidents", "Issues", "Changes" or "All" tabs to see the corresponding event types. Searching through events discovered by Instana relies on the Dynamic Focus feature. When you click one bar or select multiple bars in the events bar chart at the top, the events table lists only the events included in the selected bars. This allows detailed inspection of events without changing the current time interval.

In addition, you can use the search box to find specific items by the data shown in the columns “Title” or “On” (the name of the service on which the incident occurred) in the overview table. In this example, the search query is event.text:"Error rate". The result is a list of all events containing the phrase "Error rate" in the title:

incident view search
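The two filters described in this section, selecting bars in the chart and searching by title, amount to simple predicates over the event list. The sketch below applies them to local sample data; it does not call Instana or parse Dynamic Focus queries. The `Event` class, both functions, the half-open bar intervals and the case-insensitive matching are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical row of the events table."""
    title: str
    on: str      # entity the event occurred on
    start: int   # epoch seconds


def title_contains(events: Iterable[Event], phrase: str) -> List[Event]:
    """Mimic a query such as event.text:"Error rate" on the title column."""
    needle = phrase.lower()  # case-insensitive matching is an assumption
    return [e for e in events if needle in e.title.lower()]


def in_selected_bars(events: Iterable[Event],
                     bars: Iterable[Tuple[int, int]]) -> List[Event]:
    """Keep events whose start falls inside any selected [start, end) bar."""
    selected = list(bars)
    return [e for e in events
            if any(lo <= e.start < hi for lo, hi in selected)]


events = [
    Event("Error rate is higher than normal", "cart-service", 120),
    Event("Sudden increase in average latency", "shop-frontend", 180),
    Event("Error rate is higher than normal", "catalog-service", 400),
]

matches = title_contains(events, "Error rate")
visible = in_selected_bars(matches, [(100, 200)])
```

Here `matches` contains the two "Error rate" events, and `visible` narrows them to the one that started inside the selected bar.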