Application Performance Management (APM) first came to market around 2000, when the Java language and the "3-tier architecture" were becoming dominant. Around 2006 a new generation of APM tools (APM 2.0) was introduced. Built for service-oriented architectures (SOA), AppDynamics, Dynatrace and New Relic came to market with technology to trace the processing flow through services, helping to troubleshoot performance issues in these distributed environments.
Since the APM 2.0 vendors started developing their tools, a lot has changed in IT:
- Cloud computing has become the new normal: Infrastructure as a Service (IaaS), complete application platforms (PaaS) and serverless architectures or Functions as a Service (FaaS), like AWS Lambda.
- With the release of the iPhone, the mobile revolution began with billions of people using the internet on their mobile devices for private and business use.
- IoT is bringing software onto almost every “thing”, streaming massive amounts of data and connecting these devices to the internet.
- Microservices and containers are now being used at “normal” companies led by the experience of web-scale companies like Facebook, Google and Netflix.
- New types of datastores have been developed that can handle massive amounts of data at scale, running on hundreds or thousands of nodes.
- Many new programming languages have evolved and are being used to solve specific types of problems.
In parallel, organizations and processes have also changed, adapting to the higher speed of the digital world. Agile methodologies, continuous delivery and DevOps have been adopted not only by new startups but also by traditional enterprises. As software is structured in smaller, encapsulated units, releases are much smaller, more frequent and less impactful.
More Dynamism and More Scale
Looking at this in a simplified way, applications today are more complex and scaled, and have a dynamic structure that is constantly changing based on the stimuli they receive - like user requests, service calls or data flowing in. I once asked Dion Almaer, a friend and now Director of Engineering at Google, how he sees the difference between modern applications and the old SOA applications. His answer: "It is like the difference between biology and chemistry." I found that to be a really good answer. It captures the idea that new types of applications are "organic", living objects instead of static structures. These structures learn to survive failures and outages (resilience) and can change depending on the behavior of their services and users.
Netflix, one of the pioneers of the new cloud and µService world, even implemented a “chaos monkey” that randomly and constantly creates different failures to test the survival of these structures.
Adrian Cockcroft, one of the architects of the Netflix platform, even developed a simulator for creating and growing these "organic" structures, so that monitoring tools can test their ability to work with this high scale and dynamism.
When we started thinking about the third revolution of APM around two years ago, it was obvious that managing "biology" requires a totally different approach than managing "chemistry". It was also clear that the technologies used by the leading APM 2.0 vendors would be challenged by the scale and dynamism demanded by the new generation of applications.
Learnings from customers
As soon as we got our founding team together, we started with three questions to answer:
- What would customers need for their new applications and how are they monitoring today? How are they using their APM 2.0 tools?
- What are the new capabilities, use cases and technologies needed for an APM 3.0 solution?
- How could we architect an APM 3.0 tool to deliver the needed functionality today, yet last into the foreseeable future?
We found a common pattern by talking to more than 100 companies worldwide: customers are rapidly shifting to agile development practices and DevOps. Almost every customer had already developed new µService-based applications or was in the process of doing so. New types of databases like Cassandra or Elasticsearch are found everywhere - as are container technologies like Docker, Kubernetes, Mesos or HashiCorp's Nomad. People are automating their environments using Puppet, Chef or Ansible, and they use the cloud intensively - including processing data with AWS Lambda or Google Cloud Functions. Data pipelines using Kafka, Spark and streaming technologies are also quite common. Besides Java and PHP, we also saw a lot of Node.js and Python used for coding new services.
Of course we asked how they monitor these new stacks, and the answer was interesting: most of the customers were not using their existing monitoring and APM solutions. Instead, they were building their own monitoring based on logging (ELK stack, Graylog, Splunk, ...), time-series databases like Prometheus, or bigger "performance data lakes" using HBase or Cassandra with Spark or other technologies for processing and querying that data. For basic visibility, people were using tools like Nagios, Zabbix or cloud-based solutions like Datadog.
Everybody agreed that building their own monitoring is very time consuming: developing the needed instrumentation, getting the right metrics and building the right queries and dashboards on top of everything. Correlating data and finding outliers and patterns in particular requires skills that almost no DevOps team really has. Companies were forced to build their own mainly because, in their opinion, no vendor provided a solution that can monitor quality of service in production.
Architecture for APM 3.0
After analyzing the feedback from customers, we concluded that a new architecture for APM 3.0 that would work in highly scaled and dynamic environments would require:
- Real-time data, because the 1-minute averages and 5-minute delays common in APM 2.0 tools do not give enough detail in environments whose components may live only for seconds and change frequently.
- Continuous automatic discovery of physical and logical components and their dependencies, because new applications are not static and this "semantic" information is needed to correlate data and analyze health.
- Historical persistence of all components and their dependencies to understand the evolution of containers, servers and services.
- Automatically determine the health of, and track all changes to, components, because the application's foundation frameworks must be running perfectly to deliver a high quality of service.
- Automatically identify all services and define and analyze their key KPIs, because managing service quality is the #1 responsibility of the DevOps team.
- Automatically garner intelligence and knowledge that assists in understanding the health of components and services (including automatic correlation and localization of the root causes of incidents), because at high scale, humans cannot find root causes anymore.
- Automatically trace all distributed transactions to understand the runtime behavior of the applications and provide developers with the details and context to fix code-related problems. The performance impact on the application must be close to zero.
- Visualize operations and quality in an easy-to-understand way, because no one has time to attend training classes. We were quite sure that dashboards couldn't be the solution for a highly dynamic and complex environment.
- Feed in user-defined data and integrate it into our semantics and intelligence, because DevOps teams want to capture and automate their domain-specific knowledge.
So we started blueprinting this new APM 3.0 product and went back to the customers to present our ideas. Everybody immediately understood that the Dynamic Graph, our structure for modeling applications, would be of great value, and why understanding all dependencies is needed to manage modern applications. Applying machine learning and our knowledge base on top of the graph was also appreciated, as people agreed that humans cannot monitor highly complex and dynamic applications anymore. So we took on this challenge and started developing Instana - the next generation APM tool for next generation applications!
Six months ago we released a first version for physical components and now we are releasing the first version of the full APM 3.0 capabilities!
With Instana we want to help DevOps teams manage highly distributed and dynamic applications in production. We support this in three different steps:
The discovery process cannot be manual anymore, as systems and applications have become dynamic (recall: biology vs. chemistry). That's why a modern APM tool has to automatically discover all components that make up an application, and do so continuously so that changes are detected immediately.
Instana automatically discovers datacenters, hosts, middleware, clusters, the logical services and the process flow of the application, along with all component dependencies. For each element, we track configuration data and metrics in real time to discover changes and find issues.
To understand the data that the sensors are streaming to our server, we build a Dynamic Graph that adds semantics to the metrics and traces. The Dynamic Graph models the whole application and systems with all their dependencies and keeps the configuration data and metrics in real-time. See this blog post on the Dynamic Graph for more details.
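As a rough illustration of the idea (the class names and structure here are hypothetical, not Instana's actual model), a dynamic graph can be thought of as nodes for components, each carrying its configuration and metrics, connected by directed dependency edges that can be traversed when analyzing impact:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A physical or logical component (host, process, service, ...)."""
    kind: str
    name: str
    config: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)

class DynamicGraph:
    """Toy dependency graph: nodes plus directed 'depends on' edges."""
    def __init__(self):
        self.nodes = {}
        self.edges = {}  # name -> set of names it depends on

    def add(self, node):
        self.nodes[node.name] = node
        self.edges.setdefault(node.name, set())

    def depends_on(self, upstream, downstream):
        self.edges[upstream].add(downstream)

    def downstream_of(self, name):
        """All components a given component transitively depends on."""
        seen, stack = set(), [name]
        while stack:
            for dep in self.edges.get(stack.pop(), ()):
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        return seen

g = DynamicGraph()
g.add(Node("service", "checkout"))
g.add(Node("process", "tomcat-1"))
g.add(Node("host", "host-a"))
g.depends_on("checkout", "tomcat-1")
g.depends_on("tomcat-1", "host-a")
print(sorted(g.downstream_of("checkout")))  # ['host-a', 'tomcat-1']
```

The value of such a model is that an issue on `host-a` can immediately be correlated with every service that transitively depends on it.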
To understand the health of a single physical component, Instana uses a knowledge base approach that applies different techniques to continuously determine a health value between 0 and 1 for each component, where 0 means very healthy and 1 means very unhealthy. For every component, Instana provides semantic knowledge of that specific component.
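To make the idea concrete, here is a minimal sketch of a knowledge-base style health score. The rules and thresholds are invented for illustration and are not Instana's actual rules; the only property taken from the text is the continuous 0-to-1 scale per component:

```python
# Hypothetical sketch: a "knowledge base" as a list of rules per component
# type, each mapping current metrics to a severity in [0, 1].
# 0 = very healthy, 1 = very unhealthy; overall health = worst rule result.

def jvm_heap_rule(metrics):
    # Heap usage above 90% is increasingly unhealthy (illustrative threshold).
    used = metrics.get("heap_used_pct", 0.0)
    return max(0.0, min(1.0, (used - 0.9) / 0.1))

def gc_pause_rule(metrics):
    # GC pauses above 500 ms per minute are increasingly unhealthy.
    pause_ms = metrics.get("gc_pause_ms_per_min", 0.0)
    return max(0.0, min(1.0, (pause_ms - 500.0) / 500.0))

KNOWLEDGE_BASE = {"jvm": [jvm_heap_rule, gc_pause_rule]}

def component_health(kind, metrics):
    """Worst (maximum) severity reported by any rule for this component type."""
    rules = KNOWLEDGE_BASE.get(kind, [])
    return max((rule(metrics) for rule in rules), default=0.0)

print(round(component_health("jvm", {"heap_used_pct": 0.95,
                                     "gc_pause_ms_per_min": 250}), 2))  # 0.5
```

Taking the maximum over rules means one clearly unhealthy signal is never averaged away by other, healthy metrics.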
Instana also automatically discovers the services of an application. The health of a service cannot be measured by a simple threshold on a single metric, so we decided to take a KPI approach. We took the Four Golden Signals defined by Google and added a fifth one for the dynamics of container- and cloud-based applications. The five KPIs are:
- Load - how much demand/traffic is on the service, in requests/sec.
- Latency - the response time, in milliseconds, of service requests that completed without error.
- Errors - the number of errors per second, or as a percentage of the overall number of requests.
- Saturation - how full the most constrained resources of a service are, e.g. the utilization of a thread pool.
- Instances - the number of instances of a service, e.g. the number of containers delivering the same service or the number of Tomcat application servers that have the service deployed.
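The five KPIs above can be sketched as a computation over one time window of raw request records. This is only an illustration with assumed field names, not Instana's implementation; note that saturation and instance count come from infrastructure monitoring rather than from the requests themselves:

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    error: bool

def service_kpis(requests, window_s, saturation, instances):
    """Compute the five service KPIs for one time window.

    `requests`: records observed in the window.
    `saturation`: 0..1, e.g. thread-pool utilization, from infra monitoring.
    `instances`: number of containers/servers backing the service.
    """
    ok = [r for r in requests if not r.error]
    errors = len(requests) - len(ok)
    return {
        "load_rps": len(requests) / window_s,
        # Latency is averaged only over error-free requests.
        "latency_ms": sum(r.duration_ms for r in ok) / len(ok) if ok else 0.0,
        "errors_per_window": errors,
        "error_rate": errors / len(requests) if requests else 0.0,
        "saturation": saturation,
        "instances": instances,
    }

reqs = [Request(120, False), Request(80, False), Request(300, True)]
print(service_kpis(reqs, window_s=10, saturation=0.4, instances=3))
```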
Instana automatically measures these KPIs for every detected service and for each individual connection between services, and applies machine learning to these KPIs to determine the health of a service. Typical problems detected include:
- The error rate is higher than normal.
- The performance of the service is slower than normal.
- There is a sudden drop or increase in load.
- The saturation of the service is close to reaching a limit.
- There are not enough instances available for the current load.
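One common way to detect "higher than normal" without fixed thresholds is to compare the current value against a rolling baseline. The z-score sketch below is a deliberately simplified stand-in for the machine learning the text describes, with invented example values:

```python
from statistics import mean, stdev

def is_anomalous(history, current, z_threshold=3.0):
    """Flag `current` if it deviates from the recent baseline by more
    than `z_threshold` standard deviations (a simple z-score test)."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu  # flat baseline: any deviation is anomalous
    return abs(current - mu) / sigma > z_threshold

# Error rate has hovered around 1%; a jump to 9% is flagged, 1.1% is not.
baseline = [0.010, 0.012, 0.009, 0.011, 0.010, 0.008]
print(is_anomalous(baseline, 0.09))   # True
print(is_anomalous(baseline, 0.011))  # False
```

The same test applied to load in both directions covers sudden drops as well as sudden increases.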
With the knowledge-based component health approach and the KPI-based service health capabilities, Instana is able to precisely determine and predict real service quality issues, with minimal false positives, in highly dynamic and scaled modern environments.
The goal of our UI is to externalize this understanding. To help DevOps teams leverage the same deep insight, we took a new approach to UX/UI: it is based on ideas from visual engineering and uses maps, with flows on the maps, to visualize dependencies and impact in a way that is intuitive even in complex environments. We provide different perspectives on your discovered application environment.
A Physical View of all physical components and their dependencies:
When Instana finds issues and incidents, users can investigate the details directly in the tool.
For each discovered component and service, Instana provides a real-time dashboard out of the box. The dashboards contain all relevant data. For clusters and groups, Instana generates aggregated dashboards, including capacity information and aggregations across all group members. For services, the dashboards contain performance and availability information as well as all data about upstream and downstream connections.
“Incidents” are triggered by service KPI violations and combine all changes and issues related to that incident in chronological order. Incident reports help users see how a problem evolved and let them investigate the relevant components directly from the incident.
Most investigation concerns the past, so to assist with this we keep a full history of the graph and of all metrics and changes. Users can use our Timeshift feature to roll back in time and analyze specific situations or evolving patterns in the context of the whole monitored system, simply by shifting back the timeline.
Our unique distributed tracing technology traces every service request without configuration and without measurable impact on the application. This sets a new standard for data quality and accuracy and, at the same time, lowers the risk of instrumenting applications to a minimum.
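Conceptually, distributed tracing works by propagating a trace context (a trace id plus the calling span's id) across service boundaries, so that spans recorded on different hosts can be reassembled into one call tree. The sketch below illustrates that mechanism in general terms; the header names and classes are hypothetical, not Instana's wire format:

```python
import time
import uuid

class Span:
    """One unit of work in a distributed trace."""
    def __init__(self, name, trace_id=None, parent_id=None):
        self.name = name
        self.trace_id = trace_id or uuid.uuid4().hex  # shared by the whole trace
        self.span_id = uuid.uuid4().hex
        self.parent_id = parent_id                    # None for the root span
        self.start = time.time()
        self.duration = None

    def finish(self):
        self.duration = time.time() - self.start

    def context(self):
        """Context a client would send downstream, e.g. as HTTP headers."""
        return {"X-Trace-Id": self.trace_id, "X-Parent-Id": self.span_id}

    @classmethod
    def from_context(cls, name, headers):
        """Continue the trace on the receiving service."""
        return cls(name, trace_id=headers["X-Trace-Id"],
                   parent_id=headers["X-Parent-Id"])

# Service A handles a request and calls service B:
root = Span("GET /checkout")
child = Span.from_context("GET /inventory", root.context())
child.finish()
root.finish()
assert child.trace_id == root.trace_id and child.parent_id == root.span_id
```

Because every span carries the shared trace id and its parent's span id, the backend can reconstruct the full request flow across any number of services.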
We are proud to release our APM 3.0 capabilities today! Try Instana and see how game-changing a total understanding of your production applications can be for your continuous delivery cycle.