Modern microservice- and container-based applications have attributes that make operating and monitoring them challenging:
Modern applications can be constructed out of hundreds or thousands of µServices and containers which are loosely coupled but communicate with each other. These applications can dynamically scale up and down depending on the load of the system and can be spread all over the world across cloud datacenters and availability zones.
µServices can be deployed independently of each other and are built for Continuous Delivery processes, so changes happen at a high frequency.
µServices are no longer static. They can be started and stopped just for a single function call, and containers are moved around by tools like Kubernetes or Mesos to make better use of the hardware. This requires that monitoring tools identify these services and understand their role in the overall context - without relying on manual tagging.
Using the right tool for the problem is one of the paradigms of modern applications. This leads to multiple languages, frameworks and even persistence models for an application. In particular, new distributed caching and database models are getting adopted quickly.
Adrian Cockcroft has built a simulation tool for such environments called Spigo (see screenshot above for a small µService environment) - one of his intentions was to give monitoring tool providers the test bed they need to test for scalability and rapid change.
The challenges for monitoring
It is kind of easy: if you have 10x more components involved in modern applications, monitoring must be at least 10x more difficult. What you really care about, the reason for monitoring in the first place, is an understanding of the operational health of your application: performance, availability, reliability, etc. Understanding an application’s health means knowing how everything works together, including the entire surrounding context.
The current generation of monitoring tools has no understanding of an environment - these tools only have the concept of single components with metrics and traces (business transactions) for understanding health and performance. Interpreting how all these components work together in order to find the root cause of problems is left to the user:
- Understanding what components are affected by an issue.
- Understanding what metrics to look at.
- Interpreting and getting the right information out of the metrics/data.
- Correlating metrics and data to find the root cause of the problem.
Some tools even provide “war rooms” for collaboration, because whole teams of experts are normally needed to find the root cause of a problem.
As a simple example, a trace could show that the performance of a request was bad because of a slow Elasticsearch query. Finding the root cause means looking at different components - maybe the payload of the query, the number of resulting documents, the thread pool of the app server, the performance of the caching infrastructure and the configuration of the Elasticsearch cluster. This is already an expert task - and most companies have a small team (normally around 1-3 people, even in bigger companies) who troubleshoot these problems based on their experience, using current tools to get indicators of where to look.
If hundreds of µServices are involved, distributed caching and database technologies are being used, and changes are deployed multiple times a day - all this in a highly dynamic, fluid, cloud-based and containerized environment - the task of finding the root cause is like looking for a needle in a haystack. Netflix simply stated in a blog post: “Humans cannot monitor these environments anymore.”
The Dynamic Graph
The core technology powering Instana is what we call the Dynamic Graph. The Graph is a model of your application that understands all physical and logical dependencies of components. Components are the building blocks of your application, like Host, OS, JVM, Cassandra Node, MySQL, etc. The Graph contains more than the physical components - it also includes logical components like traces, applications, services, clusters or tablespaces. Components and their dependencies are discovered automatically by the Instana Agent and Sensors, so the Graph is continuously kept up to date. Every node in the Graph is also continuously updated with state information like metrics, configuration data and a calculated health value based on semantic knowledge and a machine learning approach. This intelligence also analyses the dependencies in the Graph to find logical groupings like services and applications, in order to understand impact on that level and derive the criticality of issues. The whole Graph is persisted, and the Instana application can go back and forth in time to leverage the knowledge of the Graph for many operational use cases.
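To make the idea concrete, here is a minimal sketch of such a graph in Python. It is not Instana's implementation: the class names (`Component`, `DynamicGraph`), the field names, the metric name `io_wait` and the green/yellow/red health scale are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One node in the graph, physical (host, JVM) or logical (cluster, service)."""
    name: str
    kind: str                                        # e.g. "host", "jvm", "cluster"
    properties: dict = field(default_factory=dict)   # configuration data
    metrics: dict = field(default_factory=dict)      # latest metric values
    health: str = "green"                            # assumed scale: green/yellow/red


class DynamicGraph:
    """Hypothetical container for components and their dependencies."""

    def __init__(self):
        self.components = {}      # name -> Component
        self.dependencies = {}    # name -> set of names it depends on

    def upsert(self, component):
        # Discovery is continuous, so adding and updating are the same operation.
        self.components[component.name] = component
        self.dependencies.setdefault(component.name, set())

    def depend(self, source, target):
        # Record that `source` depends on `target` (both must be known already).
        self.dependencies[source].add(target)

    def dependents_of(self, target):
        # Everything that may be impacted when `target` degrades.
        return sorted(n for n, deps in self.dependencies.items() if target in deps)


graph = DynamicGraph()
graph.upsert(Component("host-1", "host", metrics={"io_wait": 0.02}))
graph.upsert(Component("jvm-1", "jvm", properties={"JVM_Version": "1.7.21"}))
graph.depend("jvm-1", "host-1")
print(graph.dependents_of("host-1"))
```

Storing dependencies separately from the components keeps updates to metrics and health cheap, while still allowing the reverse lookup ("who is affected if this host degrades?") that impact analysis needs.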
Based on the dynamic graph we calculate the impact of changes and issues on the application or service and, if the impact is critical, we combine a set of correlated issues and changes into an Incident. An incident shows how issues and changes evolve over time enabling Instana to point directly to the root cause of the incident. Any change is then automatically discovered and we calculate its impact on surrounding nodes. A change can be a degradation of health (which we call an “Issue”), a configuration change, a deployment or appearance/disappearance of a process, container or server.
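The relationship between changes and incidents can be sketched as follows. This is an illustrative model only: the `Change` and `Incident` classes, the integer timestamps, and the rule that an incident fires once an issue reaches a component in `critical_components` are assumptions, not Instana's actual logic.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    timestamp: int      # seconds since an arbitrary start; illustrative only
    component: str
    kind: str           # "issue", "config", "deployment", "appeared", "disappeared"
    detail: str = ""


@dataclass
class Incident:
    """Correlated changes, ordered by time. Fires only once impact is critical."""
    changes: list = field(default_factory=list)
    critical_components: set = field(default_factory=set)  # assumed: services/applications

    def add(self, change):
        self.changes.append(change)

    def timeline(self):
        # Changes may arrive out of order, so sort on demand.
        return sorted(self.changes, key=lambda c: c.timestamp)

    def fired(self):
        # Assumption: impact is critical when an issue reaches a component
        # that users depend on directly.
        return any(c.kind == "issue" and c.component in self.critical_components
                   for c in self.changes)

    def earliest(self):
        # The earliest change is a natural starting point for root cause analysis.
        return self.timeline()[0] if self.changes else None


incident = Incident(critical_components={"shop-app"})
incident.add(Change(120, "es-node-1", "issue", "throughput degraded"))
incident.add(Change(60, "host-1", "issue", "slow disk I/O"))
fired_before = incident.fired()
incident.add(Change(300, "shop-app", "issue", "slow product search"))
print(fired_before, incident.fired(), incident.earliest().component)
```

In this sketch the incident collects issues silently while only infrastructure components are affected, and fires once the hypothetical `shop-app` application is impacted.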
To make this a real-world example I will describe how we would model and understand a simple application that uses an Elasticsearch cluster to search for a product using a web interface. In fact this could be just one µService but it shows how we understand clusters and dependencies in Instana.
Understanding a dynamic application
Let’s develop a model of the Dynamic Graph for an Elasticsearch cluster to understand how this works and why this is useful in distributed and fluid environments.
We start with a single Elasticsearch node. An Elasticsearch node technically is a Java application, so the graph looks like this:
The nodes show the automatically discovered components on the host and their relationships. For an Elasticsearch node we would discover a JVM, a Process, a Docker container (if the node runs inside of a container) and the host that it is running on. If it is running in a cloud environment like Amazon AWS we would also discover the availability zone it is running in and add the zone to the graph.
Each node has properties (like JVM_Version=1.7.21) and all the relevant metrics in real time, e.g. I/O and network statistics of the Host, Garbage Collection statistics of the JVM and number of documents indexed by the ES node.
The edges between the nodes describe their relationships. In this case these are “runs on” relationships. So the ES node runs on the JVM.
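The “runs on” edges for a single node can be sketched as a simple mapping. The component names and the exact nesting order (JVM on process, process in container, container on host, host in availability zone) are assumptions for illustration; the text above only states that the ES node runs on the JVM.

```python
# Each key "runs on" its value. All names are hypothetical.
RUNS_ON = {
    "es-node-1": "jvm-1",
    "jvm-1": "process-4711",
    "process-4711": "docker-container-a",
    "docker-container-a": "host-1",
    "host-1": "zone-us-east-1a",   # strictly "located in", modelled the same way here
}


def runs_on_chain(component, edges):
    """Follow "runs on" edges from a component down to the infrastructure."""
    chain = [component]
    while chain[-1] in edges:
        nxt = edges[chain[-1]]
        if nxt in chain:          # guard against accidental cycles
            break
        chain.append(nxt)
    return chain


print(" -> ".join(runs_on_chain("es-node-1", RUNS_ON)))
```

Walking the chain answers the question "what does this Elasticsearch node ultimately depend on?", which is the basis for relating a host-level problem to the node running on it.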
An Elasticsearch cluster consists of multiple nodes that together form the cluster.
In this case we add a cluster node to the graph that represents the state and the health of the whole cluster. It has dependencies on all four Elasticsearch nodes that form the cluster.
The logical unit of Elasticsearch is the index - the index is used by applications to access documents in Elasticsearch. An index is physically structured in shards that are distributed to the ES nodes in the cluster.
We add the index to the graph to understand the statistics and health of the index used by applications.
To get a little bit further we assume that we access the Elasticsearch index with a simple Spring Boot application.
Now the graph includes the Spring Boot application.
As our sensor for the Java application will inject some instrumentation for tracing distributed transactions, Instana will automatically “see” if the Spring Boot application accesses an index of Elasticsearch.
You can see in the screen above the transaction including a waterfall chart showing the calls to different services and below the details of one service call to an Elasticsearch index including performance and payload data.
We insert this trace and its relationships to the logical components into the graph and track statistics and health for the different traces.
Using this Graph, we can understand different Elasticsearch issues and show how we analyze the impact on the overall service health.
Let’s assume that we have two different problems:
- An I/O problem on one host makes reads and writes of index/shard data slow.
- The thread pool in one Elasticsearch node is overloaded, so requests are queued because they cannot be handled until a thread is free.
In this case the Host (1) starts having I/O problems and our health intelligence would set the health of the host to yellow and fire an issue to our issue tracker. A few minutes later the ES (Elasticsearch) Node (2) would be affected by this, and our health intelligence would see that the throughput on this node has degraded to a level at which we mark the node as yellow - firing an issue again. Our engine would then correlate the two issues and add them to one incident. This incident would not be marked as problematic, because the cluster health is still good and the service quality is therefore not affected.
Then on another ES node (3) the thread pool for processing queries fills up and requests are queued. As the performance is badly affected by this, our engine marks the status of the node as red. This affects the ES cluster (4), which turns yellow as the throughput decreases. The two issues generated are added to the initial incident.
As the cluster affects the performance of the index (5), we mark the index as yellow and add the issue to the incident. Now the performance of the product search transactions is affected and our performance health analytics will mark the transaction as yellow (6), which also affects the health of the application (7).
As the application and the transaction are affected, our incident will actually fire with a yellow status, saying that the product search performance is decreasing and users are affected - showing the path to the two root causes: the I/O problem and the thread pool problem. As seen in the screenshot, Instana will show the evolution of the incident, and the user can drill into the components at the time the issue was happening - including the exact historic environment and metrics at that point in time.
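The path from the affected application back to the two root causes can be illustrated with a small graph walk. This is a simplified sketch, not Instana's algorithm: the component names, the dependency layout, and the rule "an unhealthy component whose dependencies are all healthy is a root cause" are assumptions made for the example.

```python
# component -> components it depends on (hypothetical names, mirroring the example)
DEPENDS_ON = {
    "app": ["product-search-trace"],
    "product-search-trace": ["index"],
    "index": ["es-cluster"],
    "es-cluster": ["es-node-1", "es-node-2", "es-node-3", "es-node-4"],
    "es-node-1": ["host-1"],
    "es-node-2": ["host-2"],
    "es-node-3": ["host-3"],
    "es-node-4": ["host-4"],
}

# Health after all issues in the walkthrough have fired; anything missing is green.
HEALTH = {
    "host-1": "yellow",                 # I/O problem
    "es-node-1": "yellow",              # degraded by its host
    "es-node-3": "red",                 # thread pool overloaded
    "es-cluster": "yellow",
    "index": "yellow",
    "product-search-trace": "yellow",
    "app": "yellow",
}


def root_causes(start, depends_on, health):
    """Return unhealthy components reachable from `start` whose own
    dependencies are all healthy."""
    causes, seen, stack = [], set(), [start]
    while stack:
        node = stack.pop()
        if node in seen or health.get(node, "green") == "green":
            continue
        seen.add(node)
        unhealthy = [d for d in depends_on.get(node, [])
                     if health.get(d, "green") != "green"]
        if unhealthy:
            stack.extend(unhealthy)     # keep following the degraded path
        else:
            causes.append(node)         # nothing below explains this issue
    return sorted(causes)


print(root_causes("app", DEPENDS_ON, HEALTH))
```

Starting at the application, the walk follows only degraded dependencies and stops at the host with the I/O problem and the node with the overloaded thread pool, skipping the healthy nodes and hosts entirely.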
This shows the unique capabilities of Instana:
- Combining physical, process and trace information using the Graph to understand their dependencies
- Intelligence to understand the health of single components but also the health of clusters, applications and traces
- Intelligent impact analysis to understand if an issue is critical or not
- Showing the root cause of a problem and giving actionable information and context
- Keeping the history of the graph, its properties, metrics, changes and issues, and providing a “timeshift” feature to analyse any given problem with a clear view of the state and dependencies of all components.
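The “timeshift” idea from the last point, looking up the state of a component as it was at a given moment, can be sketched with a time-indexed history. The `History` class, the integer timestamps and the `queue_size` metric are hypothetical and only illustrate the lookup; they say nothing about how Instana stores its data.

```python
import bisect


class History:
    """Time-indexed snapshots of one component's state (hypothetical helper)."""

    def __init__(self):
        self._times = []      # sorted timestamps
        self._states = []     # state recorded at the matching timestamp

    def record(self, timestamp, state):
        # Insert while keeping both lists sorted by time.
        i = bisect.bisect_right(self._times, timestamp)
        self._times.insert(i, timestamp)
        self._states.insert(i, state)

    def at(self, timestamp):
        # Latest state recorded at or before `timestamp`, or None if none exists.
        i = bisect.bisect_right(self._times, timestamp)
        return self._states[i - 1] if i else None


history = History()
history.record(100, {"health": "green", "queue_size": 0})
history.record(160, {"health": "red", "queue_size": 250})
history.record(400, {"health": "green", "queue_size": 3})
print(history.at(200))
```

Asking for the state at time 200 returns the snapshot recorded at 160, which is the kind of lookup needed to inspect a component as it was while an issue was happening.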
Finding the root cause in modern environments will only get more challenging in the coming years. The simple example above has shown that finding the root cause is not a trivial task without an understanding of the context, dependencies and impact. Now think of “liquid” systems based on µServices that add and remove services all the time, with new releases pushed out frequently - Instana keeps track of the state and health and understands the impact of any of these changes or issues. This is all done in real time and without any manual configuration.
Instana helps keep your application healthy and dramatically reduces the time needed to find the root causes of problems and opportunities for optimization.