For DevOps, speed is the new Six Sigma (so to speak) – it is the ultimate measure for successful application delivery teams.
The traditional model of building a Center of Excellence (CoE) around complicated toolsets and a small group of experts is outdated and slows down modern application development. DevOps promotes decentralization and empowerment of smaller teams, and the CoE is in direct conflict with that principle.
APM, Experts, and Application Triage
Application Performance Management (APM) is a critical tool and skill set for managing the performance and stability of applications and for measuring the digital experience of end users. APM tools also provide troubleshooting capabilities that enable quick problem resolution and root cause analysis.
For years, APM and troubleshooting have been the domain of experts. APM tools required extensive configuration and optimization to effectively monitor applications, and getting value out of them required significant training and knowledge on the part of the users. Most companies established some kind of Center of Excellence (CoE) where a handful of APM experts supported operations and development teams in troubleshooting critical problems. CoEs go by different names – sometimes they are called the APM team or performance engineering – but at the end of the day they serve the same purpose: optimization and troubleshooting of applications by a highly skilled team.
In critical situations, the CoE helps triage the problem, normally together with operations and development in some form of “war room”. Brendan Gregg, a systems performance expert at Netflix, gives a good overview of this process in his presentation: https://www.slideshare.net/brendangregg/srecon-2016-performance-checklists-for-sres
(Screenshot from Netflix presentation)
The presentation outlines the process Netflix experts follow to triage a problem – in essence, they correlate data to find anomalies. Normally they do this by building custom dashboards, applying their expert knowledge and experience to put the right metrics onto them. They also use distributed tracing systems to understand application flow, bottlenecks, and the dependencies between their services, which gives the team a good understanding of which custom dashboards to build manually for each service and component. Centralized log systems add to the toolbox, allowing them to search for errors and log entries that could be related to a problem, and chat tools like Slack help the war room team put all the findings together and triage the problem.
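To make the “correlate data to find anomalies” step concrete, here is a minimal, purely illustrative sketch (the metric names and values are invented, and real tooling is far more sophisticated): given the symptom metric the war room is chasing, rank every other metric by how strongly it moves with it.

```python
import pandas as pd

# Hypothetical per-minute metrics pulled from a metrics store
# (names and values are invented for illustration).
metrics = pd.DataFrame({
    "checkout_latency_ms": [120, 118, 125, 119, 480, 510, 495],
    "db_connections":      [40, 42, 41, 43, 44, 42, 41],
    "payment_error_rate":  [0.01, 0.01, 0.02, 0.01, 0.22, 0.25, 0.24],
    "cpu_utilization":     [55, 54, 57, 56, 58, 57, 55],
})

# Rank every candidate metric by how strongly it correlates with the symptom;
# the strongest correlations are the first suspects to investigate.
symptom = "checkout_latency_ms"
suspects = metrics.corr()[symptom].drop(symptom).sort_values(ascending=False)
print(suspects)  # payment_error_rate ranks highest in this toy data set
```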
This methodology was developed by Netflix to manage performance of their highly distributed (microservice) architecture running on AWS. If you reviewed the slide deck referenced earlier you might be thinking, “wow, that looks complicated” – and you’d be right.
There are several issues with using this approach in most organizations:
- Dependency on experts – It is hard to find people with these skill sets if you are not Netflix or Google.
- Dynamics and Scale – With the introduction of microservices and containers, applications have become more dynamic and larger in scale. Even standard business applications (nowhere near Netflix scale) now have hundreds of microservices running on thousands of containers. Manually correlating data is much harder, and finding the root cause of a problem can be like finding a needle in a haystack.
- Significant manual effort – Building custom dashboards, configuring the right data sources, configuring or even hand-coding instrumentation, adding log messages, and so on can be a lot of manual work for your developers (a small instrumentation sketch follows this list). Over time this adds up, because every deployment can change the information you need – so you have to review and adapt the dashboards, rules, and so forth constantly.
- Teams are slowed down by their dependency on experts – Especially in DevOps organizations, teams are working more autonomously and in smaller, faster iterations. Optimizing performance and stability is an ongoing task for each team, so they need to be enabled to do this without external dependencies – otherwise they will slow down.
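To illustrate the kind of manual instrumentation work mentioned above, here is a small sketch using the OpenTelemetry Python SDK (our choice for illustration only; the article does not prescribe a specific library). Every service and every code path you care about needs spans and attributes like these, and they have to be kept up to date as the code evolves:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# One-time setup: wire up a tracer provider and an exporter (console, for simplicity).
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer(__name__)

def process_order(order_id: str) -> None:
    # Every interesting code path needs spans and attributes added by hand,
    # and this code must be revisited whenever the service changes.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("charge_payment"):
            pass  # call out to the payment service here

process_order("A-1234")  # hypothetical order id, for illustration
```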
The rest of this article focuses on this last issue and discusses how DevOps teams can maintain their speed of delivery while ensuring performance and stability.
The Epiphany of the New APM Challenge for DevOps
When we founded Instana, we focused on the technical challenges that microservices, containers, and the cloud introduce to application development, and therefore to monitoring and management. We reinvented monitoring agent technology to auto-discover all the moving components and to collect high-resolution metrics and every single trace without the need for manual configuration. We also reinvented the way the monitoring platform backend works, because we wanted real-time stream processing for quick response, a semantic data model for traces and metrics to apply machine learning on top of, and a way to efficiently capture and store all the required data. Our incredible engineering team found solutions for all these challenges, and we gained many customers that were faced with the challenges of operating dynamic, scaled applications. We were able to help them operate these environments by providing granular visibility, intelligent alerting, and automatic root cause analysis.
But then we observed behavior that surprised us, given our extensive experience in the APM space:
(The chart above shows the number of active Instana users over time at one of our customers)
We have many customers with more than a hundred active users on our platform. This is very unusual for an APM tool; traditionally you would see fewer than five people actively using it – as noted in the introduction, APM used to be a troubleshooting tool for experts. When we talk to our customers, we realize that they have changed the way they develop and deliver applications – mostly with an organizational structure similar to the one shown in the graphic below:
Instana is used by all of the teams across the whole application delivery organization. The platform engineering team uses it to plan and monitor the capacity of their platform, analyze and troubleshoot problems in Kubernetes or database clusters, and see traffic by availability zone or tenant.
The development teams use it to analyze and optimize their services: understanding upstream and downstream calls, seeing errors with their context and stack traces, and checking timings on calls to databases and other subsystems.
The business and application teams use Instana to understand the end-user experience, optimize frontend code, and understand how the application’s performance depends on the underlying services and platform.
So what we figured out was that every team in the new application delivery organization has use cases that require the data and information of an APM tool. This introduces new challenges, which we summarize as the Democratization of APM – you could also say the personalization of APM.
Basically APM has to evolve from a tool used by a few performance and troubleshooting experts into a tool that provides value to all stakeholders of modern application development organizations.
What does that mean for APM?
- The user experience of the tool has to evolve: away from serving an expert who uses the tool every day (and requires extensive training), toward a UX that empowers non-experts, who use the tool only occasionally (with minimal training), to identify and resolve application problems.
- Low-touch onboarding – First-time users, or occasional users (every few weeks), must be able to operate the tool effectively right away and get the information they need.
- Information instead of Data – Users need information in the proper context to solve problems or understand the situation, especially in today’s high-scale, dynamic environments. Machine learning and modern visualization techniques extract the right information from the raw data and make it understandable in the right context (a toy sketch of this idea follows this list).
- Highly automated – The tool must adapt and configure itself so that it is usable at any given time. It is not acceptable to spend one or two days reconfiguring the tool so that it works with the currently deployed application and infrastructure.
- The tool must speak a language and domain that the user understands. For example, an “application” can be something very different for a platform operations user than it is for an application developer; static models of applications therefore no longer work.
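As a toy illustration of “information instead of data” (this is not Instana’s actual implementation; the series and threshold are invented), a rolling baseline can turn a raw latency series into a single, actionable signal:

```python
import pandas as pd

# Invented latency samples for one service; a real system would stream these in.
latency = pd.Series([110, 112, 108, 115, 111, 109, 113, 400, 420, 410],
                    name="orders_service_p99_ms")

# Baseline each point against its own recent history (rolling mean and standard
# deviation of the previous five samples), then flag strong deviations.
window = 5
baseline = latency.rolling(window).mean().shift(1)
spread = latency.rolling(window).std().shift(1)
z_score = (latency - baseline) / spread

anomalies = latency[z_score.abs() > 3]
print(anomalies)  # the jump to ~400 ms is surfaced as information, not raw data
```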
With Instana we already have a lot of the right functionality in place for this new evolutionary step of APM: extensive automation and machine learning to extract the right information out of the monitoring data. Lately, we have been working on a new UX that will allow us to bring APM to a much wider audience and make it easy for non-experts to get value out of Instana.
This new UX enables users to explore and analyze applications quickly and easily, with Instana automatically discovering all the needed pieces – traces, services, APIs, and the underlying infrastructure – in a consistent model and user interface. We worked with many beta customers to deeply understand the needs of the different personas and how they work with modern application architectures in high-velocity teams.
We are excited about our upcoming release and new capabilities that we will launch for the new DevOps teams, like better support for A/B testing and canary deployments, better team collaboration, and integrations into the whole CI/CD stack.
Our goal is to make APM an essential tool for all DevOps team members and users and to get APM out of the expert war rooms! Sign up for your free trial of Instana today.