A Zero Configuration Approach to Service Discovery and Quality of Service Management
At Instana, we believe in the Lean Startup approach: deliver a Minimum Viable Product (MVP) as soon as possible, get it into the hands of customers, measure usage and success, learn, and adapt.
When we announced our Alpha version a year ago, just four months after founding Instana, we introduced the physical view. The first release focussed on our automatic discovery technology, and we used it to evaluate whether our 3D map UI would work for our users.
As we engaged with customers, they reported frustration with the amount of effort it took to configure and maintain the monitoring and APM tools they were using to manage their services and applications. In other words, we saw that customers want maximum understanding of how their services are behaving, with truly zero effort to configure the tool (other than the simple act of installing an agent).
So we set a goal for ourselves: Completely automatic, zero configuration, total discovery and understanding of service quality even in highly dynamic and scaled environments.
To accomplish this, we again followed the Lean Startup approach. We first introduced the concept of health: a knowledge-based and machine learning approach (“Stan”) to understand the health of components.
A lot of innovation followed, like automatic change detection, impact analysis with Incidents, Timeshift, and cluster support for big data technologies. The biggest new concepts were the Dynamic Graph and Tracing. The Dynamic Graph enables us to model all dependencies between the physical and logical components that are discovered. Our end-to-end Tracing is built from the ground up for microservice applications that are highly scaled and dynamic. Our approach is based on Google Dapper, and the trace information is added to our Dynamic Graph so that Instana can correlate data when errors or performance issues occur.
All these technologies were the basis for a capability that we have just released: The Logical View.
By analyzing the communication between application components with our instrumentation-based tracing, Instana automatically discovers all services and their communication, with zero configuration. This discovered service topology is added to the Dynamic Graph and visualized in the new Logical View. Because the Graph contains all dependencies of all monitored components, Instana can track each service right down to the hosts running it.
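To make the idea concrete, here is a minimal sketch of how a service topology could be derived from Dapper-style trace spans. It is an illustration only, not Instana's implementation: the `Span` fields, the `build_topology` function, and the host names are hypothetical, and it assumes each span records its parent span, the logical service that handled the call, and the host that executed it.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """One unit of work in a trace (hypothetical, simplified shape)."""
    span_id: str
    parent_id: Optional[str]  # None for the entry span of a trace
    service: str              # logical service that handled the call
    host: str                 # physical host that executed it


def build_topology(spans):
    """Return (edges, hosts).

    edges maps (caller_service, callee_service) to a call count;
    hosts maps each service to the set of hosts it was seen on.
    """
    by_id = {s.span_id: s for s in spans}
    edges = defaultdict(int)
    hosts = defaultdict(set)
    for s in spans:
        hosts[s.service].add(s.host)
        parent = by_id.get(s.parent_id)
        # A parent span in a different service means one service called another.
        if parent is not None and parent.service != s.service:
            edges[(parent.service, s.service)] += 1
    return dict(edges), dict(hosts)


# One trace: shop -> productsearch -> products (an Elasticsearch index).
spans = [
    Span("a", None, "shop", "host-1"),
    Span("b", "a", "productsearch", "host-2"),
    Span("c", "b", "products", "host-3"),
]
edges, hosts = build_topology(spans)
for (caller, callee), count in edges.items():
    print(f"{caller} -> {callee}: {count} call(s)")
```

Keeping the host next to each service in the same structure is what allows a logical connection to be followed down to the physical components behind it.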
Examples of these services are applications, database tablespaces, keyspaces, search indexes or queues. In the example above you can see a Java web application “shop” talking to a Java service “productsearch” that uses an Elasticsearch index “products”, including all the statistics and a dashboard for each connection. These statistics can also be visualized on the maps using our particles feature: for each call we send a small particle between the two services along the connection line. Calls that have an error are shown as red particles, so the user can intuitively see which services have high call rates and where the error rates are high.
Here, the UI is visualizing requests to the “productsearch” service with a throughput of 2 calls per second and an average response time of 47.50 ms with an error rate of 0%.
The service “productsearch” itself then calls an Elasticsearch index called “products” with 2 calls/s, an average response time of 1.50 ms and an error rate of 0%.
With this release Instana automatically discovers three different types of services plus connections:
The Monitored Services are shown with a circle and are services that are directly monitored and instrumented with an Instana Agent.
The Unmonitored/External Services (with the cloud symbol) are called by a monitored service, but either no Instana Agent is running on that service or the recognition failed. These will usually be calls to external services, such as a third-party service, or to an internal service that is not yet instrumented with an Instana Agent.
The icons above the symbols show the different types (Entry, Application, Queue in the above screen) as specifically as possible.
The metrics shown are for the corresponding services and connections. If you hover over them you will be presented with sparkcharts to get a fast understanding of the recent history of the metrics:
Looking at each individual service, we provide health management based on KPIs (which we will talk about in more detail later), a link to all the traces that started at the service, and all inbound and outbound connections to other services.
Because we analyse all dependencies of all components, we not only link to the traces but also to the underlying infrastructure components of the physical view.
In this example we have a shop service that is a Java web application running on Tomcat. The physical instance would be the Tomcat that this application is deployed on.
It is important to recall that all this information is discovered automatically with no configuration needed, and that all data is available at 1-second granularity, streaming in real time.
Introducing Service KPIs
Service quality cannot be measured by a simple threshold on a single metric. We decided to take a KPI approach to measure the health of a service. We took the Four Golden Signals by Google and added a fifth one for the dynamics of container and cloud-based applications. The five KPIs are:
- Load – Measures how much demand/traffic is on the service in calls/sec.
- Latency – Response time of the service requests in milliseconds.
- Errors – Number of errors. Measured as the percentage of requests with an error out of the overall number of requests.
- Saturation – Measures how full the most constrained resources of a service are, for example the utilization of a thread pool (not yet implemented in the current release).
- Instances – The number of instances of a service. Number of containers that are delivering the same service – e.g. number of Tomcat application servers that have that service deployed.
Instana automatically measures KPIs for every detected service, for its underlying instances and for each individual connection between services, and visualizes them on the map. Again, we have implemented a zero configuration approach.
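As a rough illustration of the KPI definitions above, the following sketch computes Load, Latency, Errors and Instances for one service over a time window. It is a simplified example under stated assumptions, not how Instana computes its KPIs: the `Call` record, the `service_kpis` function and the instance names are hypothetical, latency is taken as a plain average, and Saturation is left out because it is not part of the current release.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One observed request to a service (hypothetical, simplified shape)."""
    duration_ms: float
    error: bool
    instance: str  # e.g. the container or application server that served it


def service_kpis(calls, window_seconds):
    """Compute four of the five KPIs for the calls seen in one time window."""
    total = len(calls)
    if total == 0:
        return {"load": 0.0, "latency_ms": 0.0, "error_rate": 0.0, "instances": 0}
    errors = sum(1 for c in calls if c.error)
    return {
        "load": total / window_seconds,                           # calls per second
        "latency_ms": sum(c.duration_ms for c in calls) / total,  # mean response time
        "error_rate": 100.0 * errors / total,                     # percent of calls with an error
        "instances": len({c.instance for c in calls}),            # distinct instances seen
    }


calls = [
    Call(40.0, False, "tomcat-1"),
    Call(50.0, False, "tomcat-2"),
    Call(60.0, True, "tomcat-1"),
    Call(50.0, False, "tomcat-2"),
]
kpis = service_kpis(calls, window_seconds=2)
print(kpis)
```

The same function can be applied to the calls on a single connection between two services, which mirrors how the KPIs are described per service and per connection.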
To understand the health of a service, we apply machine learning algorithms suited to each KPI type and its relevant behaviour. Because the algorithms are chosen to fit the semantics of the metric, including seasonality detection, we are able to understand situations much better than by simply flagging a deviation from the usual.
Typical Instana Incidents are based on outliers, trends or anomalies.
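To give a feel for what an outlier on a KPI looks like, here is a deliberately simple sketch that flags values far away from a rolling baseline. It is not the algorithm Instana uses, which is not described here and also accounts for seasonality; the `find_outliers` function, its `window` and `threshold` parameters, and the sample latencies are all assumptions made for illustration.

```python
from statistics import mean, stdev


def find_outliers(values, window=5, threshold=3.0):
    """Return the indices of values that deviate from the mean of the
    preceding `window` values by more than `threshold` standard deviations."""
    outliers = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: any change at all counts as an outlier.
            if values[i] != mu:
                outliers.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            outliers.append(i)
    return outliers


# Latency samples in milliseconds with a single spike at index 6.
latencies = [47.0, 48.0, 47.5, 46.5, 48.5, 47.0, 120.0, 47.5]
print(find_outliers(latencies))
```

A fixed rolling window like this reacts to sudden spikes but cannot tell a daily traffic peak from a real problem, which is why seasonality detection matters for KPIs such as Load.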
For each Incident raised, we analyze the Dynamic Graph for underlying issues and changes (see Automatic Change Detection) in all dependent components and correlate them into the Incident.
All the other features of Instana are naturally available in the Logical View as well. Our Timeshift feature provides a full historic view of all metrics for the services, including their connections and physical representations, and our advanced search functionality finds services and components quickly based on name or attributes.
This is a great milestone for Instana. A year after our first prototype we have innovated the APM space with powerful new capabilities specifically relevant for highly dynamic modern applications. The Logical View is a game changer in APM in terms of service discovery and service quality measurement. We are excited to get feedback from our users and continue along our agile innovation path!