Welcome to the final installment of the “9 New Issues” series. In this article we’ll explore application performance Issue #8, “Performance analysis expertise for new technologies is hard to find making it difficult to troubleshoot problems.”
As companies migrate to cloud, container, and microservices architectures, they are adding new technologies faster than at any point in the history of IT, and it’s causing problems:
- Delays in adoption of containerized microservices due to added complexity
- Poor monitoring coverage and not knowing which KPIs are important
- Longer MTTR due to lack of experience and expertise
What are these new technologies I’m referring to? Here are a few examples:
- new schedulers like Kubernetes, OpenShift, Swarm, and Marathon
- new caches like Redis and Memcached
- new databases like Couchbase and MongoDB
- new infrastructure components like Docker and Mesosphere
- new distributed stream technologies like Kafka and Kinesis
- new API gateways and service meshes like Kong and Istio
These aren’t all brand new technologies, but they have all become common components of modern applications. Often they are adopted by teams with very little practical experience in implementing, troubleshooting, and maintaining them in production. This is a natural consequence of the pace of innovation. The faster new technologies are adopted and consumed, the harder it is to find meaningful expertise in those technologies.
Where are the microservices application performance experts?
Malcolm Gladwell famously proposed that it takes 10,000 hours of practice to become an expert at something. Most people agree that this is an oversimplification of reality, and even Gladwell himself clarified his position by stating, “There is a lot of confusion about the 10,000 rule that I talk about in Outliers. It doesn’t apply to sports. And practice isn’t a SUFFICIENT condition for success. I could play chess for 100 years and I’ll never be a grandmaster. The point is simply that natural ability requires a huge investment of time in order to be made manifest. Unfortunately, sometimes complex ideas get oversimplified in translation.”
It’s probably not going to take an SRE (Site Reliability Engineer) 10,000 hours of practice to become an expert in Kafka, but in my personal experience it takes months to years to develop intermediate-to-expert competency in many technologies. The reality is that most of us don’t get to troubleshoot production issues caused by new technology every day (if you do, then you’re either practicing chaos engineering or your apps are in desperate need of an intervention). As IT practitioners, we rely on software tools to assist with developing, integrating, deploying, maintaining, and troubleshooting our applications. Many of these tools incorporate some level of expert knowledge to make them more valuable to the practitioner. The best of these tools typically include extensive productized expertise as a foundational component.
Monitoring KPIs and understanding application performance impact
What constitutes meaningful expert knowledge required for troubleshooting purposes?
- Knowing what metrics are most important (KPIs) for each individual technology
- Knowing good or bad states for these KPIs – a simple example of this is “iowait time on Unix systems should not average more than 20% per 5 second interval”
- Being able to identify known problems for each technology – such as identifying a split brain condition in Elasticsearch clusters which causes data integrity issues
- Understanding the impact of KPIs on system state – a simple example would be knowing that a disk filling up is undesirable and leads to performance and stability problems
- The ability to recommend corrective action
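To make the list above a little more concrete, here is a minimal Python sketch of how one piece of expert knowledge could be captured as data plus a check. It uses the iowait threshold from the example above. Everything else is hypothetical and invented for illustration: the `KpiRule` class, the `evaluate` function, the KPI name, and the impact and recommendation wording. It is not how Instana or any other monitoring product implements this.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass(frozen=True)
class KpiRule:
    """Hypothetical rule: a KPI, its bad-state threshold, impact, and advice."""
    kpi: str
    max_average: float  # highest acceptable average over one sampling window
    impact: str
    recommendation: str


# The 20% threshold comes from the iowait example in the list above.
# The KPI name, impact, and recommendation text are illustrative assumptions.
IOWAIT_RULE = KpiRule(
    kpi="cpu.iowait_percent",
    max_average=20.0,
    impact="Processes stall waiting on disk I/O, which slows the application.",
    recommendation="Check for saturated disks or a runaway I/O-heavy process.",
)


def evaluate(rule: KpiRule, samples: Sequence[float]) -> Optional[str]:
    """Return a finding if the window's average breaches the rule, else None."""
    if not samples:
        return None  # no data in this window, so nothing to report
    average = mean(samples)
    if average <= rule.max_average:
        return None  # KPI is in a good state
    return (
        f"{rule.kpi} averaged {average:.1f}% (limit {rule.max_average:.0f}%). "
        f"Impact: {rule.impact} Suggested action: {rule.recommendation}"
    )
```

Calling `evaluate(IOWAIT_RULE, [35.0, 28.0, 31.0, 22.0, 40.0])` returns a finding that names the KPI, its impact, and a suggested action, while a healthy window such as `[5.0, 8.0, 12.0, 3.0, 6.0]` returns `None`. A real tool would need many such rules per technology, which is exactly the expertise that is hard to find.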
Unfortunately, many of the characteristics listed above are missing from most monitoring tools that I’ve used, seen, or heard about. Over the years a few tools have incorporated one to three of these characteristics but never (in my experience) all five of them. Never, until Instana did it.
What’s the value of instant application performance expertise?
What’s the value of an employee who never sleeps, sees and understands every KPI, knows what constitutes good or bad, identifies common problems and anti-patterns, understands the impact to your system, and can recommend how to fix most of these issues, for EVERY one of the technologies in your stack? This might be the greatest employee in the history of employees (who would unfortunately burn out within a week). Well, that’s exactly what you get when you put Instana to work monitoring your environment.
Putting some numbers to this value equation we can reference the diagram from my previous blog post…
The incident investigation and problem resolution phases contribute about 1 hour of time in this example. Expert knowledge can conservatively reduce these stages by 50%. How much revenue is saved by resolving production incidents 30 minutes faster? It depends on your business but it’s a lot.
A 2017 web poll of over 800 companies about downtime costs, conducted by Information Technology Intelligence Consulting (ITIC), showed the following:
- 98% reported that 1 hour of downtime costs at least $100,000 US (about $16,667 / 10 minutes)
- 81% reported that 1 hour of downtime costs at least $300,000 US ($50,000 / 10 minutes)
- 33% reported that 1 hour of downtime costs at least $1.5 million US ($250K / 10 minutes)
These are staggering numbers, but not surprising given the ongoing digital transformation and cloud migrations occurring in every company across every sector.
How long does an incident last in your organization? What if you could cut 30-60 minutes off of each incident on average? The figures above show that the revenue impact is significant when you resolve incidents faster. Annual revenue savings of $300K-$9M are possible for companies of every size. Investing in an application performance monitoring solution that works with your microservice applications helps protect against this revenue loss.
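If you want to run these numbers for your own organization, the arithmetic is simple enough to sketch in a few lines of Python. The hourly costs below are the three tiers from the ITIC poll, and the 30 and 60 minutes saved come from the question above. The incident count is my own assumption: the article doesn’t say how many incidents per year sit behind the $300K-$9M range, and six per year is just one value that happens to reproduce it. The function names are hypothetical, and the model assumes a flat cost per minute of downtime.

```python
def downtime_cost(hourly_cost: float, minutes: float) -> float:
    """Cost of an outage of the given length, assuming a flat hourly rate."""
    return hourly_cost * minutes / 60.0


def annual_savings(hourly_cost: float, minutes_saved: float,
                   incidents_per_year: int) -> float:
    """Yearly savings from shortening every incident by minutes_saved."""
    return downtime_cost(hourly_cost, minutes_saved) * incidents_per_year


# Hourly downtime cost tiers reported in the poll (US dollars).
HOURLY_COST_TIERS = [100_000, 300_000, 1_500_000]

# ASSUMPTION: not stated in the article; replace with your own incident count.
INCIDENTS_PER_YEAR = 6

for hourly in HOURLY_COST_TIERS:
    per_10_minutes = downtime_cost(hourly, 10)
    low = annual_savings(hourly, 30, INCIDENTS_PER_YEAR)   # 30 minutes saved
    high = annual_savings(hourly, 60, INCIDENTS_PER_YEAR)  # 60 minutes saved
    print(f"${hourly:,}/hour: ${per_10_minutes:,.0f} per 10 minutes; "
          f"${low:,.0f} to ${high:,.0f} saved per year")
```

Under these assumptions the low end (30 minutes saved at $100,000 per hour) works out to $300,000 per year and the high end (60 minutes saved at $1.5 million per hour) to $9,000,000 per year. Swap in your own hourly cost and incident count to see where your organization lands.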
As I mentioned at the beginning of this article, this is the last post in this series. You’re probably wondering about issue #9, since this post only covers issue #8. If you recall, issue #9 is “Monitoring data exists in too many silos causing inefficiency and errors as the IT organization troubleshoots and optimizes their applications,” and that topic is covered in a recent post by the CEO of Instana, Mirko Novakovic. The solution to issue #9 is something we call the Democratization of APM.
Instana helps companies manage and understand their cloud, container, and microservice applications. Request your free trial of Instana today.