To help you manage the quality of service of your applications, Instana detects three types of events: incidents, issues, and changes.
An incident helps you to understand situations impacting your edge services and critical infrastructure by automatically learning their behavior and health, and then sending alerts when they become unhealthy. Edge services are what customers or other systems outside the monitored application actually access; they are the external deliverables of the application.
Incidents are created as soon as Instana detects either a breached key performance indicator (KPI) on an edge service or a critical infrastructure issue. For more context, see our Analyze and Derive Application Health blog.
For discovered application services, Instana tracks the following KPIs:
- Load (calls/second).
- Latency (response time in milliseconds).
- Errors (error rate).
Instana automatically measures these KPIs for every service and applies machine learning to each one to determine the health of the service. Typical problems that are detected are:
- The error rate is higher than normal.
- The performance of the service is slow.
- There is a sudden drop of load.
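To make the three KPIs concrete, the following is a minimal sketch of how load, latency, and error rate could be derived from a window of calls and compared against recent history. The `Call` record, the function names, and the z-score check are assumptions made for illustration; the z-score is a simple stand-in and not the machine learning that Instana applies.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Call:
    """One observed call to a service (hypothetical record layout)."""
    timestamp: float    # seconds since the start of the window
    latency_ms: float   # response time in milliseconds
    erroneous: bool     # True if the call ended in an error


def compute_kpis(calls, window_seconds):
    """Return load (calls/second), mean latency (ms), and error rate for one window."""
    if not calls:
        return {"load": 0.0, "latency_ms": 0.0, "error_rate": 0.0}
    return {
        "load": len(calls) / window_seconds,
        "latency_ms": mean(c.latency_ms for c in calls),
        "error_rate": sum(c.erroneous for c in calls) / len(calls),
    }


def deviates(value, history, z_threshold=3.0):
    """Flag a value that lies more than z_threshold standard deviations from the historical mean.

    A plain z-score is an assumption used here in place of a learned baseline.
    """
    if len(history) < 2:
        return False
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


calls = [
    Call(0.0, 100.0, False),
    Call(0.5, 200.0, True),
    Call(1.0, 300.0, False),
    Call(1.5, 400.0, False),
]
kpis = compute_kpis(calls, window_seconds=2.0)
error_rate_is_unusual = deviates(kpis["error_rate"], [0.01, 0.02, 0.01, 0.02])
```

In this sketch each KPI is checked independently, which mirrors the three typical problems listed above: an unusually high error rate, unusually high latency, or an unusually low load.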
KPIs are determined by capturing and analyzing every trace across the services and the application. Traces automatically capture errors like status codes or exceptions to find out if something went wrong. Traces also measure the time spent in each service and underlying components. Based on the Google Dapper architecture, a trace is a tree that consists of spans, where a span is a basic unit of work. In the microservice world, one span normally equals one request to a service or component, like a database. This means that we not only have an end-to-end trace of the application, but also the information about the performance of each individual service and component.
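The trace structure described above can be sketched as a small tree of spans. The `Span` class, the service names, and the self-time calculation are hypothetical and assume that child spans run sequentially inside their parent; this is an illustration of the idea, not Instana's data model.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Span:
    """A basic unit of work: one request to a service or component."""
    service: str
    duration_ms: float
    error: bool = False
    children: list = field(default_factory=list)


def self_time_ms(span):
    """Time spent in the span itself, excluding its direct children.

    Assumes children run one after another within the parent; the result is
    clamped at zero in case children overlap.
    """
    return max(0.0, span.duration_ms - sum(c.duration_ms for c in span.children))


def summarize(root):
    """Walk the trace tree and aggregate self time and error counts per service."""
    totals = defaultdict(lambda: {"self_time_ms": 0.0, "errors": 0})
    stack = [root]
    while stack:
        span = stack.pop()
        entry = totals[span.service]
        entry["self_time_ms"] += self_time_ms(span)
        entry["errors"] += int(span.error)
        stack.extend(span.children)
    return dict(totals)


# One end-to-end trace: the root span calls two services, one of which calls a database.
trace = Span("shop-frontend", 120.0, children=[
    Span("cart-service", 80.0, children=[Span("database", 50.0, error=True)]),
    Span("auth-service", 20.0),
])
summary = summarize(trace)
```

The root span's duration gives the end-to-end view, while the per-service summary shows where the time was spent and where the error originated.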
If the health of a service is impacted, Instana creates a new incident and, by traversing the Dynamic Graph of issues and events, correlates it with all the other incidents. This gives a comprehensive overview of the situation regarding service and event impact.
An issue is an event that is triggered if something out of the ordinary happens.
Critical infrastructure issues, such as disk saturation and Elasticsearch cluster split-brain situations, trigger incidents because their most likely end result is data loss.
In the preceding example, the CPU steal time on one Linux machine is suspicious and is therefore marked as an issue. An issue by itself does not trigger an alert; Instana simply notes that it happened. If the service to which this system is connected behaves badly, the issue becomes part of the incident. This methodology is one of the major benefits of Instana because it frees you from manually correlating events and performance problems. High CPU usage for a while does not, by itself, mean that there is a problem. The information becomes relevant only when a service is impacted.
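The correlation idea can be illustrated with a small dependency graph. The following is a sketch under stated assumptions: the graph layout, the entity names, the event dictionaries, and the time-overlap rule are all hypothetical and do not describe how the Dynamic Graph is implemented.

```python
from collections import deque


def reachable(graph, start):
    """Breadth-first traversal returning every entity reachable from start, including start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def correlate(graph, service, incident_start, incident_end, events):
    """Select events on entities the service depends on that overlap the incident window."""
    entities = reachable(graph, service)
    return [
        e for e in events
        if e["entity"] in entities
        and e["start"] <= incident_end
        and e["end"] >= incident_start
    ]


# Hypothetical topology: each entity maps to the entities it runs on or depends on.
graph = {
    "checkout-service": ["jvm-1"],
    "jvm-1": ["linux-host-1"],
    "search-service": ["linux-host-2"],
}

# Times are seconds on an arbitrary clock.
events = [
    {"entity": "linux-host-1", "title": "CPU steal time high", "start": 100, "end": 237},
    {"entity": "linux-host-2", "title": "CPU steal time high", "start": 110, "end": 150},
    {"entity": "linux-host-1", "title": "Host restarted", "start": 5000, "end": 5001},
]

related = correlate(graph, "checkout-service", 120, 300, events)
```

Only the first event is selected: it sits on a host that the unhealthy service depends on and overlaps the incident in time. The issue on the unrelated host and the later restart are noted but stay out of the incident.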
Instana records the time when an issue first occurs and the time when the condition ceases to exist (start and end time). In this case, you see that the CPU steal exceeded the 5% limit for only around two and a half minutes (from 07:08:37 to 07:10:54). Click the issue line to see the details on the right-hand side of the screen. The increase in CPU steal is evident at around 17:10.
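As an illustration of how a start and end time can be derived from a threshold condition, here is a minimal sketch. The function name, the sample values, and the calendar date are hypothetical; only the 5% limit and the two timestamps come from the example above.

```python
from datetime import datetime


def issue_intervals(samples, threshold):
    """Return (start, end) pairs for each run of samples above the threshold.

    samples is a list of (timestamp, value) tuples sorted by time. An interval
    that is still open at the end of the data has None as its end.
    """
    intervals = []
    start = None
    for ts, value in samples:
        if value > threshold and start is None:
            start = ts                      # condition begins
        elif value <= threshold and start is not None:
            intervals.append((start, ts))   # condition ceases to exist
            start = None
    if start is not None:
        intervals.append((start, None))
    return intervals


# Hypothetical CPU steal samples in percent; the date is arbitrary.
samples = [
    (datetime(2024, 1, 1, 7, 8, 0), 1.0),
    (datetime(2024, 1, 1, 7, 8, 37), 6.2),
    (datetime(2024, 1, 1, 7, 9, 30), 7.5),
    (datetime(2024, 1, 1, 7, 10, 54), 2.0),
]

intervals = issue_intervals(samples, threshold=5.0)
start, end = intervals[0]
duration_seconds = (end - start).total_seconds()
```

With these samples the sketch yields a single interval from 07:08:37 to 07:10:54, which is 137 seconds.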
The View Built-in Event link brings you directly to the corresponding definition in the Events & Alerts settings for this issue. This helps you understand the basis on which a particular issue was created.
- Applications, services, or endpoints that receive infrequent traffic, for example, one call every 15 minutes, do not provide a sufficient basis for issue detection.
- The severity of an issue can change during its lifetime. The severity shown represents the highest severity that the issue ever reached.
Changes within an environment include, but are not limited to, deployments, configuration changes, and server starts or stops. They are categorized as:
- Changes - Changed configuration of components.
- Offline/Online - Tracking presence of components under management.
Instana recognizes changes by tracking relevant configurations specific to each monitored technology, as well as by monitoring whether something comes online (is monitored by Instana) or goes offline (is no longer monitored by Instana).
Every change is recorded and typically has a duration of only 1 second (the difference between start and end time). Like issues, change events are correlated into an incident when they are relevant, which spares you an alert just because a system went offline. For example, the system might have been turned off because the load decreased at the end of the working day and it was no longer needed.
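The two change categories can be sketched by comparing two snapshots of the monitored entities. The snapshot format, the entity names, and the event dictionaries are assumptions for illustration; the 1-second duration follows the description above.

```python
def detect_changes(previous, current, timestamp):
    """Compare two snapshots and emit one change event per difference.

    Each snapshot maps an entity name to its tracked configuration. Every
    event gets a 1-second duration (end = start + 1).
    """
    events = []
    for entity in sorted(previous.keys() | current.keys()):
        if entity not in previous:
            kind = "online"     # entity is now monitored
        elif entity not in current:
            kind = "offline"    # entity is no longer monitored
        elif previous[entity] != current[entity]:
            kind = "changed"    # tracked configuration differs
        else:
            continue
        events.append({
            "entity": entity,
            "kind": kind,
            "start": timestamp,
            "end": timestamp + 1,
        })
    return events


# Hypothetical snapshots taken at two points in time.
previous = {"tomcat-1": {"maxThreads": 200}, "host-a": {}}
current = {"tomcat-1": {"maxThreads": 400}, "host-b": {}}

change_events = detect_changes(previous, current, timestamp=1000)
```

Here `host-a` is reported as offline, `host-b` as online, and `tomcat-1` as changed. None of these events raises an alert on its own in this sketch; they are only recorded so that they can be attached to an incident if one occurs.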