“And at our scale, humans cannot continuously monitor the status of all of our systems.” – Netflix
This is especially true for traditional APM tools, which have primarily been used by performance-tuning experts to manually analyze and correlate information to identify bottlenecks and errors in production. At higher scale and with greater dynamics, this task becomes like finding a needle in a haystack: there are simply too many moving parts and metrics to correlate.
If we are to apply a machine intelligence approach to system management, the core model and data set must be impeccable. Microservice applications are made up of hundreds to thousands of building blocks, all constantly evolving. It is therefore necessary to understand all the blocks and their dependencies, which demands an advanced approach to discovery.
The building blocks that application monitoring needs to cover are:
It’s not uncommon for thousands of service instances in different versions, running on hundreds of hosts across different zones and on more than one continent, to provide an application to its users. This creates a network of dependencies between the components, which must work together perfectly to ensure the application’s service quality and deliver its business value. A traditional monitoring tool alerts when a single component crosses a threshold; however, the failure of one or even many of these components does not necessarily mean that the quality of the application is affected. A modern monitoring tool must therefore understand the whole network of components and their dependencies in order to monitor, analyze, and predict the quality of service.
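To make the dependency argument concrete, here is a minimal sketch of why component-level alerting differs from quality-of-service reasoning. The service names and graph structure are purely illustrative, not Instana's actual model: the point is that a failure's impact is determined by walking the dependency network, not by a single threshold.

```python
from collections import defaultdict

# Hypothetical dependency graph: "checkout" depends on "payments", etc.
dependencies = {
    "checkout": ["payments", "catalog"],
    "payments": ["postgres"],
    "catalog": ["elasticsearch"],
}

def affected_services(failed, dependencies):
    """Return every service whose quality may degrade when `failed` fails,
    by walking the dependency edges in reverse."""
    reverse = defaultdict(set)
    for svc, deps in dependencies.items():
        for dep in deps:
            reverse[dep].add(svc)
    impacted, stack = set(), [failed]
    while stack:
        node = stack.pop()
        for dependent in reverse[node]:
            if dependent not in impacted:
                impacted.add(dependent)
                stack.append(dependent)
    return impacted

# A Postgres failure ripples up through payments to checkout, while an
# alert on one catalog instance need not degrade the whole application.
print(affected_services("postgres", dependencies))
```

A real model would also weigh redundancy (one of many instances failing may have no user-visible effect), which is exactly why the whole network, not a single metric, must be analyzed.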
As described, the number of services and their dependencies is 10-100x higher than in SOA-based applications, which poses a challenge for monitoring tools. And the situation is getting worse – continuous delivery methodology, automation tools, and container platforms exponentially increase the rate of change of applications, making it impossible for humans to keep up with the changes or to continuously reconfigure monitoring tools for newly deployed blocks (e.g. a new container just spun up by an orchestration tool). A modern monitoring solution must therefore discover each and every block automatically and immediately, before analyzing and understanding it.
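The core of such automatic discovery can be sketched as a reconciliation step: compare what the agent already knows against what the runtime currently reports, and attach or detach monitoring accordingly. The function and callback names below are hypothetical stand-ins for a real container-runtime API, not Instana internals.

```python
def reconcile(known, current, attach_sensor, detach_sensor):
    """One pass of a discovery loop: diff the set of containers the agent
    already monitors against what the runtime reports right now."""
    for cid in current - known:      # freshly scheduled containers
        attach_sensor(cid)
    for cid in known - current:      # containers the orchestrator tore down
        detach_sensor(cid)
    return set(current)

# One tick of the loop: "web-2" just appeared, "web-0" is gone.
attached, detached = [], []
known = reconcile({"web-0", "web-1"}, {"web-1", "web-2"},
                  attached.append, detached.append)
```

Run continuously (or driven by runtime events), this keeps monitoring coverage in step with the orchestrator without any human configuration.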
The changes then need to be linked to the previous snapshot so that history is preserved and a model can be reconstructed for any given point in time to investigate incidents.
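One way to picture point-in-time reconstruction is a chain of timestamped deltas, each linked to the snapshot before it; replaying deltas up to a timestamp rebuilds the model as it looked at that moment. The data below is illustrative only, not Instana's actual data model.

```python
# Each discovered change is a (timestamp, delta) pair linked to the
# previous snapshot, in chronological order.
changes = [
    (100, {"nginx": "1.19"}),   # initial discovery
    (160, {"nginx": "1.21"}),   # version upgrade detected
    (200, {"redis": "6.2"}),    # new component appears
]

def model_at(timestamp, changes):
    """Reconstruct the component model at a given point in time by
    replaying every change up to (and including) that timestamp."""
    model = {}
    for ts, delta in changes:
        if ts > timestamp:
            break
        model.update(delta)
    return model

model_at(150, changes)  # {'nginx': '1.19'}
model_at(250, changes)  # {'nginx': '1.21', 'redis': '6.2'}
```

This is what makes incident investigation possible: an engineer can ask what the environment looked like at the moment an issue began, not just what it looks like now.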
Changes can happen in any of the building blocks at any time. See this graphic for examples of changes in each component:
A key ingredient of the Instana Dynamic APM solution is our agent architecture, and specifically, our use of sensors. Sensors are mini agents – small programs specifically designed to attach to and monitor one thing. They are automatically managed by our single agent (one per host), which is deployed either as a standalone process on the host, or as a container via the container scheduler.
The agent first automatically detects the physical components, like zones in AWS, Docker containers running on the host or on Kubernetes, processes like HAProxy, Nginx, the JVM, Spring Boot, Postgres, Cassandra, or Elasticsearch, and clusters of these processes, like a Cassandra cluster. For each component it detects, the agent collects its configuration data and starts monitoring it for changes. It also starts sending important metrics for each component every second. The agent automatically detects and utilizes metrics exposed by the services themselves, for example via JMX or Dropwizard Metrics.
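Detecting configuration changes cheaply, without shipping the full configuration every second, can be sketched with a fingerprint comparison. This is a minimal illustration of the idea, assuming a JSON-serializable configuration; it is not how Instana's agent is actually implemented.

```python
import hashlib
import json

def config_fingerprint(config):
    """Stable hash of a component's configuration, so an agent can cheaply
    detect configuration changes between collection intervals."""
    canonical = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# A change in one setting yields a different fingerprint, so only the
# changed configuration needs to be reported upstream.
old = config_fingerprint({"port": 5432, "max_connections": 100})
new = config_fingerprint({"port": 5432, "max_connections": 200})
```

Sorting the keys before hashing matters: it makes the fingerprint independent of the order in which settings were read.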
As a next step, the agent starts to inject trace functionality into the service code. For example, it intercepts HTTP calls, database calls, and queries to Elasticsearch. It captures the context of each call, such as stack traces or payloads.
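The interception pattern can be sketched as a wrapper around a call that records a span with its context before forwarding the data onward. The `capture` callback below is a hypothetical stand-in for handing the span to the agent; real instrumentation hooks into HTTP and database client libraries rather than using a decorator.

```python
import functools
import time
import traceback

def traced(capture):
    """Wrap a call so its context (name, duration, stack trace, errors)
    is captured and passed to `capture`, even when the call raises."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span = {"name": fn.__name__, "start": time.time(),
                    "stack": traceback.format_stack(limit=5)}
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                span["error"] = repr(exc)
                raise
            finally:
                span["duration"] = time.time() - span["start"]
                capture(span)
        return wrapper
    return decorator

# Illustrative usage: spans accumulate as the wrapped call executes.
spans = []

@traced(spans.append)
def query(sql):
    return "rows"
```

The `finally` clause is the important detail: the span is captured whether the call succeeds or fails, so failed calls show up in traces too.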
The intelligence combining this data into traces, discovering dependencies and services, and detecting changes and issues is done on the server. The agent is therefore lightweight and can be injected into thousands of hosts.
Automatic, immediate, and continuous discovery is a requirement for the new generation of monitoring solutions. Instana has been fundamentally designed around this requirement.
Instana uses a single agent with multiple sensors, and we currently support over one hundred sensors. These sensors are not extensions; they are updated, loaded, and unloaded entirely by the agent. There is an optional command line interface which provides access to the agent state, individual sensors, and agent logs.
A sensor is designed to automatically discover and monitor a specific technology, and pass its data to the agent. The agent manages all communication to the Instana Service Quality Engine. After discovery, the sensor collects the details and metric data to provide an accurate representation of the component's state. Each sensor gathers the types of data relevant to its respective technology, which vary from one technology to another. Sensors collect the following:
In addition, discovery is recursive within a sensor. For example, the Java Virtual Machine (JVM) sensor continues up the stack and discovers the frameworks running on it (like Tomcat or Spring Boot), then assists the agent in loading the appropriate additional sensors.
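Recursive discovery can be sketched as a walk over a registry that maps each discovered component to the things a sensor may find running on top of it. The registry contents below are illustrative examples from the text (host, JVM, frameworks), not Instana's actual sensor catalog.

```python
# Hypothetical registry: component -> things discoverable on top of it.
DISCOVERS = {
    "host": ["jvm", "postgres"],
    "jvm": ["tomcat", "spring-boot"],
    "tomcat": [],
    "spring-boot": [],
    "postgres": [],
}

def load_sensors(root, discovers):
    """Starting from one discovered component, recursively discover what
    runs on top of it and return every sensor the agent should load."""
    loaded = []
    def visit(component):
        loaded.append(component)
        for child in discovers.get(component, []):
            visit(child)
    visit(root)
    return loaded

load_sensors("host", DISCOVERS)
# ['host', 'jvm', 'tomcat', 'spring-boot', 'postgres']
```

Each level of the recursion mirrors the text: the JVM sensor finds Tomcat or Spring Boot, and the agent then loads the matching sensors for those frameworks.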
The Instana backend utilizes streaming technology capable of processing millions of events per second streamed from the agents. Our streaming engine is effectively real-time, taking only 3 seconds to process incoming data and display the resulting situation to the user.
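A toy sketch of the kind of pre-aggregation a streaming backend performs: bucketing a stream of metric events into one-second windows before anything reaches the user. This is a minimal illustration of windowed aggregation, not Instana's actual streaming engine.

```python
from collections import defaultdict

def aggregate_per_second(events):
    """Bucket (timestamp, metric, value) events into one-second windows
    and average each window, as a streaming backend might pre-aggregate."""
    windows = defaultdict(list)
    for ts, metric, value in events:
        windows[(int(ts), metric)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in windows.items()}

events = [(10.1, "cpu", 40), (10.7, "cpu", 60), (11.2, "cpu", 50)]
aggregate_per_second(events)
# {(10, 'cpu'): 50.0, (11, 'cpu'): 50.0}
```

A production engine would do this incrementally over an unbounded stream rather than over a list, which is what keeps end-to-end latency down to seconds at millions of events per second.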