Monitoring Microservices (Part II) – Understand: Analyze and Derive Application Health

In Part 1 of this blog series about monitoring microservices, we looked at the building blocks of microservice-based applications and the necessity of automatically discovering all components and their changes. This article takes the story to the next step: how to understand all the dependencies, the ‘big picture’, the application’s real behavior, and ultimately, how to understand and interpret the health of the microservices and of the application as a whole.

Microservice applications use many technologies and architecture paradigms to make the whole system robust and resilient. Failure of components in these complex systems is normal. Resiliency often means that if one service fails, an identical backup service is ready to take over. In microservice-based environments, each service relies on others to deliver business processes. A common failure scenario occurs when one service makes an outbound call to a dependent service that fails somewhere in the call chain.

Traditional threshold- and baseline-based alerting creates too much noise and too many false positives, and it cannot analyze the impact of component failures in the context of the overall system or application. Likewise, using a single trace (or business transaction, PurePath) as the base model of quality management does not account for the dynamics and scale of these systems; it provides only one piece of the puzzle, and correlating it to the other pieces manually is impossible in dynamic, large-scale systems.

Modeling is the Key to Understanding

We concluded the first part of this series by explaining how the Instana agents collect and stream events to our real-time knowledge engine for further analysis. Simply stated, without a comprehensive model, there can be no good machine analysis of the situation. To enable an analysis of quality of service, metric data, dependency data and change events are processed and modeled into a Dynamic Graph. The Graph is a time-continuous representation of all components, their configuration and metrics, their dependencies, and a calculated health value. The Graph organizes and persists all of this information.
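To make the idea of the Dynamic Graph concrete, here is a minimal sketch of such a model: components as nodes carrying configuration, metric samples and a health value, with directed dependency edges between them. All names (`ComponentNode`, `DynamicGraph`, the service names) are hypothetical illustrations, not Instana's internal data structures.

```python
class ComponentNode:
    """One component in the graph: configuration, metrics, and a current health value."""
    def __init__(self, name, component_type):
        self.name = name
        self.component_type = component_type
        self.config = {}
        self.metrics = []        # (timestamp, metric_name, value) samples
        self.health = 0.0        # 0 = healthy, 1 = unhealthy
        self.depends_on = set()  # directed edges to downstream components

class DynamicGraph:
    """Holds components and their directed dependencies."""
    def __init__(self):
        self.nodes = {}

    def upsert(self, name, component_type):
        # Create the node on first sight, reuse it on later updates.
        return self.nodes.setdefault(name, ComponentNode(name, component_type))

    def add_dependency(self, upstream, downstream):
        self.nodes[upstream].depends_on.add(downstream)

    def downstream_of(self, name):
        """All components `name` depends on, directly or transitively."""
        seen, stack = set(), [name]
        while stack:
            for dep in self.nodes[stack.pop()].depends_on:
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        return seen

graph = DynamicGraph()
graph.upsert("shop-frontend", "service")
graph.upsert("order-service", "service")
graph.upsert("orders-db", "database")
graph.add_dependency("shop-frontend", "order-service")
graph.add_dependency("order-service", "orders-db")
print(graph.downstream_of("shop-frontend"))
```

A model like this is what lets an analysis engine reason about failure impact: when `orders-db` turns unhealthy, walking the edges in reverse reveals every service that could be affected.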

Dynamic Graph

By having a holistic understanding of an application (application code/microservices, middleware and underlying infrastructure), Instana can deeply understand the impact of one component on the others. Returning to the example above, where one service comes under stress because a downstream service stops responding, the Dynamic Graph allows Instana to understand how one microservice depends on the other and to point you to the real underlying issue.

Instana continuously updates the Dynamic Graph with the data streaming in from the Agents and automatically creates two views to help DevOps teams visualize and understand their environment. The cycle time from discovery to visualization is under 3 seconds. No manual configuration is needed.

A physical view showing the physical components and their dependencies:

A logical view showing the discovered services and their dependencies:

A Knowledge Base Approach for Component Health

Instana uses a knowledge base approach to determine a “health index” value between 0 and 1 for each component under management, where 0 means completely healthy and 1 means critically unhealthy.

For every discovered component, Instana applies semantic knowledge specific to that type of component. Some examples are:

  • Prediction of a host file system running up against its physical limit – this sounds easy, but we use a sophisticated statistical approach, as simple linear regression creates false positives.
  • Analysis of a JVM’s garbage collection to determine whether the young and tenured generations are configured correctly and GC overhead is low.
  • Analysis of an Elasticsearch cluster for split-brain problems – in some master-based cluster technologies the cluster can elect a second master, which can cause data loss.
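To illustrate the first item, here is the naive baseline for "time until the disk is full": a least-squares trend over recent usage samples. As the text notes, this simple approach produces false positives in production (short bursts skew the slope), so treat it purely as a sketch; the function name and sample format are invented for this example.

```python
def hours_until_full(samples, capacity_gb):
    """Estimate hours until a filesystem fills, via a least-squares trend.

    samples: list of (hour, used_gb) observations. Returns None when usage
    is flat or shrinking. This is the naive linear-regression baseline that
    a production check must harden against false positives.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    cov = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var            # growth rate in GB per hour
    if slope <= 0:
        return None              # not growing, nothing to predict
    current_used = samples[-1][1]
    return (capacity_gb - current_used) / slope

# Disk growing 2 GB/hour on a 100 GB volume, currently at 80 GB:
samples = [(0, 70), (1, 72), (2, 74), (3, 76), (4, 78), (5, 80)]
print(hours_until_full(samples, 100))  # → 10.0
```

A more robust version would, for example, require the trend to persist across several windows and weight recent samples before raising an alert.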

Instana comes with a fully working knowledge base for all supported components, curated by our team and continuously enhanced based on the data Instana gathers and the feedback of our customers.

If the value of a component’s health index rises above a certain threshold (i.e., health deteriorates), an issue is triggered. These issues are fed into an event stream together with the automatically detected change events for all components (start/stop of a component, code deploy, configuration change, etc.). We apply machine learning on top of this stream to detect patterns and predict problems.
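The mechanics of that event stream can be sketched as follows: change events and health-derived issues land in one ordered stream that later analysis consumes. The threshold value, the `Event` shape and the service names are assumptions for illustration, not Instana's actual implementation.

```python
import time
from dataclasses import dataclass

ISSUE_THRESHOLD = 0.7  # hypothetical cutoff; the real rules live in the knowledge base

@dataclass
class Event:
    timestamp: float
    component: str
    kind: str      # "issue", or a change event such as "deploy" or "config-change"
    detail: str

def process_health_sample(stream, component, health_index):
    """Turn a deteriorated health index into an issue event on the shared stream."""
    if health_index >= ISSUE_THRESHOLD:
        stream.append(Event(time.time(), component, "issue",
                            f"health index {health_index:.2f}"))

stream = []
stream.append(Event(time.time(), "order-service", "deploy", "v42 rolled out"))
process_health_sample(stream, "order-service", 0.85)  # unhealthy -> issue appended
process_health_sample(stream, "orders-db", 0.10)      # healthy -> no event
print([e.kind for e in stream])  # → ['deploy', 'issue']
```

Keeping deploys and configuration changes in the same stream as issues is what makes pattern detection possible: a health issue that appears seconds after a deploy event is a strong correlation signal.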

Issues do not trigger an alarm or notification by default, as this would create too much noise in a complex system. Each issue carries a description and a suggestion on how to fix it, based on our curated knowledge. Instana uses Incidents to trigger alarms, which will be described in the next chapter.

A KPI Based Approach for Service Health

Services cannot be measured by a simple metric threshold. Here is why: a threshold is a moving target based on other variables. For example, the number of requests to a service can depend on the number of users visiting the corresponding website, and the number of users varies with the time, the date, commercials shown on TV, Facebook ads and many more unpredictable factors. We therefore decided to take a KPI approach to measuring the health of a service. We took the Four Golden Signals defined by Google and added a fifth one to account for the dynamism of container- and cloud-based applications. The five KPIs are:

  1. Load – measures how much demand/traffic is on the service. Normally measured in requests per second.
  2. Latency – the response time of the service requests that have no error. Normally measured in milliseconds.
  3. Errors – the number of errors. Can be measured as errors per second or as the percentage of failed requests out of all requests.
  4. Saturation – measures how full the most constrained resources of a service are, for example the utilization of a thread pool.
  5. Instances – the number of instances of a service. Can be the number of containers delivering the same service or the number of Tomcat application servers that have the service deployed.
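A minimal sketch of how four of these KPIs fall out of one measurement window of request data (saturation would come from resource metrics such as thread-pool utilization, so it is omitted). The function name and tuple format are assumptions for this example:

```python
def service_kpis(requests, window_seconds, instances):
    """Compute load, latency, errors and instances for one measurement window.

    requests: list of (duration_ms, is_error) tuples observed in the window.
    """
    ok_durations = [d for d, err in requests if not err]
    error_count = sum(1 for _, err in requests if err)
    return {
        "load_rps": len(requests) / window_seconds,
        # Latency counts error-free requests only, matching KPI 2 above.
        "latency_ms": sum(ok_durations) / len(ok_durations) if ok_durations else None,
        "error_rate": error_count / len(requests) if requests else 0.0,
        "instances": instances,
    }

# Four requests in a 2-second window, one of them failing, served by 3 containers:
reqs = [(120, False), (80, False), (300, True), (100, False)]
print(service_kpis(reqs, window_seconds=2, instances=3))
```

Note how excluding the failed 300 ms request keeps the latency KPI honest: errors are often either very fast (immediate rejection) or very slow (timeout), and would distort the signal either way.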

Instana automatically measures these KPIs for every detected service and also on each unique connection between services.


Instana applies machine learning on these KPIs to figure out the health of a service. Typical problems that are detected are:

  • The error rate is higher than normal.
  • The performance of the service is slow.
  • There is a sudden drop or increase of load.
  • The saturation of the service is close to reaching a limit.
  • There are too few or too many instances available for the current load.
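Instana's learned models are proprietary, but the flavor of "higher than normal" detection on a KPI stream can be illustrated with a simple statistical stand-in: compare the current sample to its recent history in units of standard deviation. This is a crude baseline, not the actual algorithm.

```python
import statistics

def is_anomalous(history, current, z_cutoff=3.0):
    """Flag a KPI sample that deviates strongly from its recent history.

    history: recent KPI samples (e.g. per-minute error rates).
    Returns True when `current` lies more than z_cutoff standard
    deviations from the historical mean.
    """
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return current != mean   # perfectly flat history: any change is anomalous
    return abs(current - mean) / stdev > z_cutoff

error_rates = [0.01, 0.02, 0.01, 0.015, 0.02, 0.01]
print(is_anomalous(error_rates, 0.02))  # normal fluctuation → False
print(is_anomalous(error_rates, 0.30))  # error-rate spike → True
```

A production system needs far more than this (seasonality, trend, correlation across KPIs), which is precisely why a fixed z-score, like a fixed threshold, is only a starting point.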

KPIs are determined by capturing and analyzing every trace across the services and the application. Traces automatically capture error signals such as status codes, exceptions or error logs to determine whether something went wrong. Traces also measure the time spent in each service and underlying component. Following the Google Dapper architecture, a trace is a tree of spans, where a span is a basic unit of work – which in the microservice world is normally one request to a service or a component such as a database. This way Instana automatically has an end-to-end trace of the application as well as information about the performance of each individual service and component. Our tracing is based on the OpenTracing standard, so customers can integrate their own tracing using the standard format.
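The span-tree idea above can be sketched in a few lines: each span records its service and timing, and from the tree you can derive both the end-to-end duration and the time spent in each individual service. The `Span` class and the self-time calculation (which assumes child spans do not overlap) are simplified illustrations, not Instana's trace format.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """A basic unit of work: one request to a service or component."""
    service: str
    start_ms: float
    end_ms: float
    error: bool = False
    children: list = field(default_factory=list)

def total_duration_ms(span):
    return span.end_ms - span.start_ms

def self_time_ms(span):
    """Time spent in this span itself, excluding its child spans.

    Assumes children do not overlap in time; real tracers must handle
    parallel child calls.
    """
    return total_duration_ms(span) - sum(total_duration_ms(c) for c in span.children)

# frontend (0-120ms) calls order-service (10-100ms), which queries the DB (30-80ms):
db = Span("orders-db", 30, 80)
svc = Span("order-service", 10, 100, children=[db])
root = Span("shop-frontend", 0, 120, children=[svc])

print(total_duration_ms(root))  # → 120  (end-to-end latency)
print(self_time_ms(svc))        # → 40   (time inside order-service itself)
```

This is exactly why one trace feeds both views: the root span yields the application-level latency KPI, while per-span self-time attributes that latency to individual services and components.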


To monitor the health and quality of your microservices, you need a clear understanding of all middleware building blocks, how they are connected and constructed, and how they behave. Instana achieves this by continuously and immediately discovering all components, then dynamically modeling a directed graph of their dependencies. Service KPIs are automatically derived from the tracing data. By applying machine learning to these KPIs, Instana automatically monitors the health of the microservices and the application. Different views empower the user; we like to say that Instana understands your application, then helps you understand your application and its behavior.

In Part III of this Monitoring Microservices series, we’ll review how to investigate an Incident and rapidly get to the exact cause of problems.