Monitoring Microservices (Part I) – Discovery: Putting the Puzzle Together

Microservices have been pioneered by companies like Netflix, Amazon and Google and are the successor to SOA. New applications from startups and enterprises alike use this new architectural style, which Martin Fowler defines as:

“In short, the microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API.” (Microservices by Martin Fowler)

Infrastructure automation has also become an essential part of microservice applications. Containers have become a popular way to package and run microservices as encapsulated units of work. Tools like Kubernetes, Mesos and Nomad provide platforms to schedule and run these containers automatically. A whole ecosystem of tools is currently evolving around them, including cloud platforms like AWS Container Service and Google Container Engine. On these new platforms, service instances may live for only seconds or minutes, not for days or weeks as in the “old”, static SOA world.

Each microservice tends to have its own data management technique, but often it uses centralized platform services. It is also quite normal that microservices are implemented in different programming languages where developers choose the best language for the job.

The main difference between classical application architectures and modern microservice architectures can be described from an operations perspective in two words: scale and dynamism. Compared to a SOA architecture, there are many more service instances running and, for each of those services, many more (automatic) changes happening due to continuous deployment of new versions and the starting and stopping of instances to adapt to load. Applications have become a composition of these service instances at runtime. The business processing is accomplished with orders of magnitude more messages between the services, and the communication is much more asynchronous and chaotic.

As Netflix states in one of its blogs: “And at our scale, humans cannot continuously monitor the status of all of our systems”. This is especially true for traditional APM tools, which were built for performance tuning experts to manually analyze and correlate information to identify bottlenecks and errors in production. With higher scale and dynamism, this task is like finding a needle in a haystack. There are just too many moving parts and metrics to correlate.

With all these differences, one thing remains true: devops teams are still responsible for delivering acceptable (high) Quality of Service (QoS). Instana’s response to this new application world is to apply machine intelligence and learning to the art of QoS management. This series of three articles focuses on Instana’s approach to monitoring and managing scaled, dynamic microservice applications, empowering devops teams to deliver the quality of service needed by their business.

Part I – Discovery: Describes how Instana addresses the data collection challenge and keeps up with ever-changing modern, containerized and microservice-based environments.

Part II – Understand: Describes how Instana models and analyzes services and applications using KPIs, knowledge, and machine learning in order to understand their health.

Part III – Investigate: Describes how Instana gives context so that issues can be quickly resolved, even in very complex, constantly evolving environments.

The Pieces of the Puzzle

There is an old saying in computing: “garbage in, garbage out”. If we are to apply a machine intelligence approach to system management, then the core model and data set must be impeccable. Microservice applications are made of hundreds to thousands of building blocks that are constantly evolving. Understanding all of these blocks and their dependencies therefore demands an advanced approach to discovery.

The building blocks that application monitoring needs to cover are:

Physical Components:

  • Datacenter/Availability Zones – Zones can be in different continents and regions. They can fail or have different performance characteristics.
  • Hosts/Machines – Either physical, virtual or “as a service”. Each host has resources like CPU, memory and IO that can become a bottleneck. Each host runs in one zone.
  • Containers – Running on top of a host and can be managed by a scheduler like Kubernetes or Mesos.
  • Processes – Running in the container (usually one per container) or on the host. Can be runtime environments like Java or PHP but also middleware like Tomcat, Oracle or Elasticsearch.
  • Clusters – Many services can act as a group or cluster, so that they appear as one distributed process to the outside world. The number of instances within a cluster can change and can have an impact on the cluster’s performance.

Logical Components:

  • Services – Logical units of work that can have many instances and different versions running on top of the previously mentioned physical building blocks.
  • Traces/Flow – A trace is the sequence of synchronous and asynchronous communications between services. Services talk to each other and deliver a result for a user request. Transforming data in a data flow can involve many services.
  • Applications – A set of services and traces that have a common context and users. Microservices tend to be written in different programming languages (polyglot) – always choosing the best language for the job of the service.

Business Components:

  • Business Services – Can be compositions of services and applications that deliver unique business value and services.
  • Business Process – A combination of technical traces that form a process. For example, it could be the “buying” trace in e-commerce, followed by an order trace in the ERP system, followed by a logistics trace for the FedEx delivery to the customer.

It’s not uncommon for thousands of service instances in different versions to run on hundreds of hosts in different zones on more than one continent to provide an application to its users. This creates a network of dependencies between the components, which must work together perfectly so that the service quality of the application is ensured and the business value delivered. A traditional monitoring tool would alert when a single component crosses a threshold; however, the failure of one or many of these components does not necessarily mean that the quality of the application is affected. A modern monitoring tool must therefore understand the whole network of components and their dependencies to monitor, analyze and predict the quality of service.
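To make the idea concrete, here is a minimal sketch (not Instana’s implementation) of modeling such a dependency network: a component’s effective health depends not only on its own state but on its dependency groups, so a single failed instance in a redundant cluster does not degrade the service above it. The class and group names are purely illustrative.

```python
# Illustrative sketch: components with "all" (hard) and "any" (redundant)
# dependency groups; effective health is evaluated over the whole graph.

class Component:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.depends_on = []  # list of (kind, [Component]) groups

    def requires_all(self, *components):
        # every member of the group must be healthy
        self.depends_on.append(("all", list(components)))

    def requires_any(self, *components):
        # redundancy: at least one member of the group must be healthy
        self.depends_on.append(("any", list(components)))

def effective_health(component):
    """A component is effectively healthy only if it and its
    dependency groups are satisfied, evaluated recursively."""
    if not component.healthy:
        return False
    for kind, group in component.depends_on:
        states = [effective_health(c) for c in group]
        if kind == "all" and not all(states):
            return False
        if kind == "any" and not any(states):
            return False
    return True

# A service backed by a three-node cluster: one failing node is tolerated.
nodes = [Component(f"cassandra-{i}") for i in range(3)]
service = Component("order-service")
service.requires_any(*nodes)

nodes[0].healthy = False
print(effective_health(service))   # True: the cluster is redundant

nodes[1].healthy = nodes[2].healthy = False
print(effective_health(service))   # False: the whole cluster is down
```

This is why a threshold alert on a single node is not enough: whether the application is actually affected depends on the dependency structure around the failing component.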

Identifying and Cataloging Change

As described, the number of services and their dependencies is 10-100x higher than in SOA-based applications, which poses a challenge for monitoring tools. And the situation is getting worse – continuous delivery methodology, automation tools and container platforms exponentially increase the rate of change in applications, making it impossible for humans to keep up with the changes or to continuously configure monitoring tools for newly deployed blocks (e.g. a new container just spun up by an orchestration tool).
A modern monitoring solution is therefore required to have automatic and immediate discovery of each and every block, before analyzing and understanding them.

The changes then need to be linked to the previous snapshot so that history is preserved and a model can be reconstructed for any given point in time to investigate incidents.
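As a hedged sketch of this snapshot-linking idea (the class and field names are assumptions, not Instana’s data model), each change can be recorded as a delta against the previous snapshot, which makes reconstructing the state at any past timestamp a simple lookup:

```python
# Toy sketch: link each change to the previous snapshot so the component
# model can be reconstructed for any point in time.
import bisect

class SnapshotHistory:
    def __init__(self):
        self._times = []       # sorted timestamps
        self._snapshots = []   # snapshot dicts, parallel to _times

    def record(self, timestamp, changes):
        # Each snapshot is the previous one plus the new delta.
        base = self._snapshots[-1] if self._snapshots else {}
        self._times.append(timestamp)
        self._snapshots.append({**base, **changes})

    def at(self, timestamp):
        # Reconstruct the model as it looked at the given time.
        i = bisect.bisect_right(self._times, timestamp) - 1
        return self._snapshots[i] if i >= 0 else {}

history = SnapshotHistory()
history.record(100, {"version": "1.0", "replicas": 2})
history.record(200, {"replicas": 5})      # scaled up under load
history.record(300, {"version": "1.1"})   # new deployment

print(history.at(250))  # {'version': '1.0', 'replicas': 5}
```

During incident investigation, `history.at(t)` answers “what did this component look like when the problem started?” without anyone having written the state down manually.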

Changes can happen in any of the building blocks at any time. See this graphic for examples of changes in each component:

How Instana Discovers Each and Every Piece of the Puzzle

Instana uses an agent to automatically discover all the described components and changes. The agent is deployed either as a standalone process on the host or as a container via the container scheduler.

The agent first automatically detects the physical components: zones in AWS, Docker containers running on the host or on top of Kubernetes, processes like HAProxy, Nginx, JVM, Spring Boot, Postgres, Cassandra or Elasticsearch, and clusters of these processes, like a Cassandra cluster. For each component it detects, the agent collects its configuration data and starts monitoring it for changes. It also starts sending important metrics for each component every second, automatically detecting and making use of metrics exposed through interfaces like JMX or Dropwizard.
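The discovery step can be illustrated with a small sketch. This is an assumption about the general technique (matching running processes against known signatures), not Instana’s actual agent code; the signature table and function names are hypothetical.

```python
# Illustrative sketch: discover components on a host by matching process
# command lines against known signatures. A real agent would read the
# process list from /proc or the Docker API and then attach sensors.

KNOWN_SIGNATURES = {
    "java":          "JVM",
    "nginx":         "Nginx",
    "haproxy":       "HAProxy",
    "postgres":      "PostgreSQL",
    "cassandra":     "Cassandra",
    "elasticsearch": "Elasticsearch",
}

def discover(process_list):
    """Map raw (pid, cmdline) pairs to recognized component types."""
    found = []
    for pid, cmdline in process_list:
        for signature, component in KNOWN_SIGNATURES.items():
            if signature in cmdline.lower():
                found.append({"pid": pid, "type": component, "cmdline": cmdline})
                break  # first matching signature wins
    return found

processes = [
    (101, "/usr/bin/java -jar order-service.jar"),
    (202, "nginx: master process"),
    (303, "/bin/sh some-script.sh"),  # unknown: ignored
]
for component in discover(processes):
    print(component["type"], component["pid"])
```

Once a component is recognized this way, the agent knows which configuration data and metrics to collect for it, and repeating the scan continuously is what catches newly started processes without any manual setup.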


As a next step, the agent starts to inject tracing functionality into the service code. For example, it intercepts HTTP calls, database calls, and queries to Elasticsearch, and it captures the context of each call, such as stack traces or payloads.
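In spirit, this kind of interception works by wrapping the functions that make outbound calls so each invocation is recorded as a span with timing and error context. The following is a hedged Python sketch of that idea; a real agent instruments bytecode or native libraries, and the names here (`traced`, `SPANS`, `http_get`) are illustrative only.

```python
# Illustrative sketch of call interception: wrap a function so every call
# is recorded as a span with timing and, on failure, a captured stack trace.
import functools
import time
import traceback

SPANS = []  # collected trace data; a real agent streams this to a back-end

def traced(operation):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span = {"operation": operation, "args": args, "start": time.time()}
            try:
                return fn(*args, **kwargs)
            except Exception:
                span["error"] = traceback.format_exc()  # capture call context
                raise
            finally:
                span["duration"] = time.time() - span["start"]
                SPANS.append(span)
        return wrapper
    return decorator

@traced("http.get")
def http_get(url):
    # stand-in for a real HTTP client call
    return {"status": 200, "url": url}

http_get("http://orders/api/v1/items")
print(SPANS[0]["operation"])  # http.get
```

Because the wrapper is applied at discovery time, the service code itself does not change and developers do not add any instrumentation by hand.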

It is important to note that the agent’s discovery process is continuous and automatic, no manual configuration is needed.

The intelligence combining this data into traces, discovering dependencies and services, and detecting changes and issues is done on the server. The agent is therefore lightweight and can be injected into thousands of hosts.


The Instana back-end utilizes streaming technology capable of processing millions of events per second streamed from the agents. Our streaming engine is effectively real-time, taking only 3 seconds to process the situation and display it on our 3-D GUI.
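The core idea behind such per-second metric streams can be shown in a few lines. This toy sketch only illustrates the windowing concept (grouping incoming events into one-second buckets and aggregating each bucket); the real back-end is a distributed streaming engine, and everything here is an assumption for illustration.

```python
# Toy sketch of per-second stream aggregation: bucket metric events into
# one-second windows and average each window.
from collections import defaultdict

def aggregate_per_second(events):
    """events: iterable of (timestamp_seconds, metric_name, value)."""
    windows = defaultdict(list)
    for ts, name, value in events:
        windows[(int(ts), name)].append(value)  # one-second bucket
    return {key: sum(vals) / len(vals) for key, vals in windows.items()}

events = [
    (10.1, "cpu", 0.4), (10.7, "cpu", 0.6),  # same one-second window
    (11.2, "cpu", 0.9),
]
print(aggregate_per_second(events))
```

Running the aggregation continuously over the incoming stream, rather than querying a database after the fact, is what keeps the delay between an event and its appearance on a dashboard down to seconds.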

Automatic, immediate and continuous discovery is a requirement for the new generation of monitoring solutions. Instana has been fundamentally designed around this requirement. Read the coming blogs in this “Monitoring Microservices” series where we complete the story on how we meet the challenges of monitoring modern applications.