Real-time Observability is a critical monitoring and tracing capability that leads to improved software health. Why? Because it provides the fastest possible notification of software and infrastructure issues. It enables preventative or remedial triage to begin as rapidly as possible, avoiding downtime or performance degradation and the escalated costs associated with those conditions. This is especially true for hyperscale and rapidly changing cloud applications.
Real-time Observability provides immediate visibility into the state of your applications, systems software and infrastructure. It doesn’t cobble together information from less timely metrics and sampled traces, as other Observability platforms do. Those platforms can leave visibility gaps that result in significant performance degradation and downtime.
Only Real-time Observability provides the precise visibility you need to prevent software issues from turning into incidents. It also provides a fast Mean Time to Notification (MTTN) for issues that require manual repair, measured by Mean Time to Repair (MTTR). The benefit can be summarized by a simple axiom: the faster an issue is detected, the faster remediation can be applied.
The age of preventative software remediation, measured by Mean Time to Prevention (MTTP), is upon us, and real-time Observability is leading the way to achieving higher application performance, reliability, and availability.
YourSurprise Customer Story – Preventative Remediation
“With Instana, we can find really quick relations between issues, so if one service is failing, and a lot of other services are depending on that service, you can very quickly pinpoint the exact service or application causing the initial problem.”
“Instana really saves us downtime, without a doubt, because we can see things happening, for example, the SQL database gets overloaded sometimes. We now have the insight to see and fix that before it actually goes down.”
Niels van der Linde, Developer, YourSurprise.
What is Real-time Observability?
Real-time observability consists of:
- Real-time telemetry that includes one-second metrics
- Full end-to-end traces for all transactions
- AI-driven contextual correlation
This provides immediate indicators when issues arise so that remedial actions can commence rapidly. The telemetry can be fed into Automated Resource Management platforms such as Turbonomic, AIOps platforms such as IBM Watson AIOps, Incident Management systems such as PagerDuty and other platforms to either prevent or speed repair of issues or incidents.
LiquidShare Customer Story – Proactive Incident Remediation
Instana is empowering LiquidShare’s developers, allowing them to save time when it comes to monitoring all resources, including visibility into CPU and memory requests, latency, and tracing.
“With all that time gained from using Instana and all that time that we saved, we just focus on the thing that really matters.” – Cedric Arnoult, Lead DevOps, for LiquidShare.
Why is that important? Because there’s never a situation in which a slow response is good enough. When performance or availability issues occur, users are impacted. Thus, the speed of notification is always important because the notification triggers the remediation process.
Without real-time information, visibility gaps occur. This is the case with observability platforms that capture metrics only at intervals of 10 seconds or more, or that sample transaction traces instead of capturing full end-to-end transaction traces. Those gaps delay the remediation response, which prolongs the impact of the issue and may lead to the issue becoming an incident.
Once incidents occur, problems compound. Downtime from incidents can cost as much as $11,000 per minute, depending on the severity of the incident and the triage effort required to fix the problem.
Statistics for application degradation haven’t been published, but it’s reasonable to assume that degradation incurs some portion of the costs associated with downtime. Degradation and downtime both impact application performance and user experience and can result in lost business, further impacting cost.
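The effect of metric granularity can be illustrated with a minimal sketch. It assumes a hypothetical per-second CPU series with a six-second saturation spike, and compares a collector that reads the value every second with one that reads it every 10 seconds. The function names, the 90% threshold, and the data are illustrative assumptions, not taken from any particular platform.

```python
from typing import List, Optional


def first_alert_time(samples: List[float], interval_s: int, threshold: float) -> Optional[int]:
    """Return the second at which a collector polling every `interval_s`
    seconds first reads a value above `threshold`, or None if it never does."""
    for t in range(0, len(samples), interval_s):
        if samples[t] > threshold:
            return t
    return None


def window_averages(samples: List[float], interval_s: int) -> List[float]:
    """Average the per-second samples over consecutive windows of `interval_s` seconds."""
    return [
        sum(samples[i:i + interval_s]) / len(samples[i:i + interval_s])
        for i in range(0, len(samples), interval_s)
    ]


# Hypothetical CPU utilisation (%), one value per second for 30 seconds,
# with a saturation spike from t=12 through t=17.
cpu = [40.0] * 30
for t in range(12, 18):
    cpu[t] = 97.0

one_second = first_alert_time(cpu, interval_s=1, threshold=90.0)
ten_second = first_alert_time(cpu, interval_s=10, threshold=90.0)
averaged = window_averages(cpu, interval_s=10)

print(one_second)  # 12: the spike is seen in the second it starts
print(ten_second)  # None: polls at t=0, 10 and 20 all fall outside the spike
print(averaged)    # the window containing the spike averages well below 90
```

In this made-up series the 10-second collector never crosses the alert threshold, whether it reads a point-in-time value or an average over the window. Real collectors differ in how they aggregate, so this is a simplification rather than a model of any specific product.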
Enento Group Story – Real-time Observability
Enento Group Boosts Service Reliability with Real-Time Visibility into Application Performance.
“Our solutions are designed to enable decisions that move money. Many banks are dependent on our services for making credit decisions. If our service is down, consumers may not receive their credit decisions which have a real-life impact. So maintaining service quality is highly critical for us.” -Eero Arvonen, Strategic Architect
To ensure software health, real-time Observability telemetry generates immediate notification that an issue has begun or is about to begin. Traditionally, this would initiate a manual triage response to repair whatever problem is occurring. These mean time to repair (MTTR) procedures are still the predominant methods used when issues and incidents are caused by code-related problems.
With the shift from on-premises monolithic applications to highly distributed, cloud-based microservices applications, the nature of what can go wrong with applications has also shifted. Application performance and availability issues now stem from a much larger range of problems, especially as application microservices expand and contract in elastic cloud-native configurations. The real-time Observability goal is to facilitate automated issue remediation, measured by Mean Time to Prevention (MTTP).
Common performance issues are now more often related to resources or network latency than to code. Certainly, code-related issues still occur, but the smaller size of many microservices, which enables refactoring, scaling, replaceability and other microservices development practices, reduces the amount of complex code triage required. After all, it’s much less complicated to find an issue in a 100-line microservice than it is to isolate an issue in a service of 10,000+ lines.
Average lifetime also impacts microservice remediation strategy. Frequently, new microservices that have issues are rolled back and replaced with an older version until the new version is repaired. These rollback strategies can either be automated or semi-automated based upon the complexity of the microservice.
A 2019 study by Sysdig found that the share of containers that are alive for 10 seconds or less doubled, from 11% to 22%. The share of containers that live for five minutes or less more than doubled as well, from 20% in 2018 to 54% in 2019.
The diagram below, provided by an Instana customer, shows an example of a complex microservices-based application interconnectivity map. The complexity of modern microservices-based architectures means that, for cloud-native applications, many of the issues that lead to performance degradation or downtime stem from resource issues, such as network bandwidth and latency or CPU and memory allocation, rather than from code.
Real-time Observability provides the issue metrics that trigger microservice remediation efforts. The type of remediation that’s triggered likely depends on the complexity of the issue. For resource-related issues, such as lack of memory, CPU overload, insufficient network bandwidth, lack of storage capacity, and the like, Automated Resource Management platforms such as IBM Turbonomic can handle those issues based upon runtime policies that trigger automated remediation procedures.
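Short container lifetimes matter for metric collection. As a rough sketch, assume a collector that polls at a fixed interval and a container that starts at a random moment relative to those polls. Under that simplified model, the chance that at least one poll lands during the container’s life is the lifetime divided by the polling interval, capped at 1. The function name and the example values below are illustrative assumptions. Agents that also receive lifecycle events from the container runtime are not modelled here.

```python
def scrape_hit_probability(lifetime_s: float, interval_s: float) -> float:
    """Probability that a container living `lifetime_s` seconds is seen by at
    least one poll, given polls every `interval_s` seconds and a start time
    that is uniformly random relative to the polling schedule."""
    if lifetime_s < 0 or interval_s <= 0:
        raise ValueError("lifetime must be >= 0 and interval must be > 0")
    return min(1.0, lifetime_s / interval_s)


# Compare one-second and ten-second polling for a few short lifetimes.
for lifetime in (2, 5, 10):
    fast = scrape_hit_probability(lifetime, interval_s=1)
    slow = scrape_hit_probability(lifetime, interval_s=10)
    print(f"lifetime={lifetime:>2}s  1s polling: {fast:.0%}  10s polling: {slow:.0%}")
```

Under these assumptions a container that lives five seconds is always seen by one-second polling, but only half the time by ten-second polling.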
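The idea of runtime policies triggering remediation can be sketched in a few lines. This is a generic illustration only: the `Policy` type, the metric names, the thresholds, and the action names are all hypothetical and do not represent the Turbonomic or Instana APIs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Policy:
    """A hypothetical runtime policy: when `metric` reaches `threshold`,
    the named remediation `action` should be triggered."""
    metric: str
    threshold: float
    action: str


# Illustrative policies for the resource issues named in the text.
POLICIES = [
    Policy("memory_used_pct", 90.0, "increase_memory_limit"),
    Policy("cpu_used_pct", 85.0, "scale_out_replicas"),
    Policy("disk_used_pct", 80.0, "expand_volume"),
]


def plan_actions(sample: Dict[str, float], policies: List[Policy]) -> List[str]:
    """Return the actions whose policy threshold is met by the latest sample.
    Metrics missing from the sample are treated as 0 and never trigger."""
    return [p.action for p in policies if sample.get(p.metric, 0.0) >= p.threshold]


# A made-up one-second telemetry sample for a single service.
sample = {"memory_used_pct": 93.5, "cpu_used_pct": 61.0, "disk_used_pct": 82.0}
actions = plan_actions(sample, POLICIES)
print(actions)  # ['increase_memory_limit', 'expand_volume']
```

The fresher the sample that is passed in, the sooner a policy like this can act, which is the connection to the telemetry interval discussed earlier.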
When remediation requires a decision between different microservices or a rollback to the previous version, semi-automated remediation methods using runbooks or other procedures can be used to solve the problem.
When code triage is needed, there are additional tools to supplement real-time Observability such as AI-driven code completion tools as well as partner programs that feed telemetry to a developer’s IDE (Lightrun) or enhance database telemetry (DBMarlin) to pinpoint database problems.
Real-time Observability enables software health by supporting the following remediation options:
Software Health Options for Software Sustainability
- User experience and performance issue detection (EUM)
- Real-time issue and incident detection (MTTD/MTTN)
- Automated issue remediation (MTTP)
- AI-driven semi-automated service rollout and rollback (CI/CD)
- AI-driven code recommendations (AI/CR)
- AI-driven software incident management and repair processes (MTTR)
- Compliance – SLO/SLI, Security, Cost, etc.
All of these remediation options are triggered by the real-time Observability telemetry. The faster the trigger, the sooner remediation can be initiated. That is absolutely critical for automated and semi-automated procedures, because 10-second metrics and sampled transactions can dramatically delay and impair remediation.
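The cost of sampling transactions can be estimated with simple probability. As a sketch, assume each request is traced independently with a fixed probability (uniform, head-based sampling). The chance that at least one of k failing requests is captured is then 1 - (1 - p)^k. The function name and the rates below are illustrative assumptions, not the settings of any particular platform.

```python
def capture_probability(sample_rate: float, failing_requests: int) -> float:
    """Probability that at least one of `failing_requests` requests is traced
    when each request is sampled independently with probability `sample_rate`."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if failing_requests < 0:
        raise ValueError("failing_requests must be >= 0")
    return 1.0 - (1.0 - sample_rate) ** failing_requests


# How likely is it that a trace exists for an issue affecting only a few requests?
for rate in (0.01, 0.10, 1.00):
    for failures in (1, 5, 50):
        p = capture_probability(rate, failures)
        print(f"sample rate {rate:>4.0%}, {failures:>2} failing requests: {p:.1%}")
```

With full tracing (a sample rate of 100%) the probability is always 1 once a single failing request exists. At low sample rates, an issue that has touched only a handful of requests will often have no trace at all.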
Dealerware Customer Story – Real-time Observability for Amazon EKS
“With Instana, our day-to-day goal is to be able to guarantee a latency expectation. Our goal for service calls is to complete within less than 250 milliseconds. So, it’s not just for fire drills. In the day-to-day, we’re able to improve performance, and that drives us toward that 250 ms goal. Instana makes this possible.”
The saying, “closing the stable door after the horse has bolted” definitely applies. Ten-second metrics and sampled traces cannot initiate automated or semi-automated remediation in real time. With code-related issues declining and resource-related issues increasing in cloud-native microservices, those overly long and incomplete Observability measurements are problematic.
Those delays can lead to incidents and downtime, but also to performance degradation, which has become the new downtime. They can also lead to increased costs and lost business. This raises a question: how much degradation and downtime is your enterprise willing to absorb in order to use Observability platforms that dramatically delay your triage efforts?
Real-time Observability Importance
The importance of real-time Observability is that it puts the information you need at your fingertips, so you can reduce the prohibitive costs of downtime and use automation to prevent issues from turning into incidents. From its inception, the purpose of computing and software has been to make things work faster and more reliably, and to reduce costs. Real-time Observability and Automated Remediation are modern methods to help you achieve that goal.
Prisa Customer Story – Real-time Observability
For PRISA, performance is key. When they encounter a performance problem, it has an immediate and detrimental impact on the business performance and the consumer’s perception of their brand.
“A one-second time difference in displaying content makes a huge difference to our audience’s experience.”
To learn how Instana and Turbonomic can help you achieve greater Software Health and Sustainability, download our Achieving Software Health in the Microservices Age e-book.