This is Real-Time Observability
“With Instana, our day-to-day goal is to be able to guarantee a latency expectation. Our goal for service calls is to complete within less than 250 milliseconds. So, it’s not just for fire drills. In the day-to-day, we’re able to improve performance, and that drives us toward that 250 ms goal. Instana makes this possible.” – Bryce Hendrix (Lead Platform Architect, Dealerware)
No matter what the environment, no matter what the required precision, Real-time Observability delivers.
Observability is a critical capability that enables cloud-native applications to be both reliable and resilient. It’s an extremely hot and fast-growing technology market segment. Observability is designed to provide rigorous transaction tracing with, in some cases, the addition of automated code profiling.
Observability also provides robust context for transaction traces. Context means information about, and correlation between, each step of an end-to-end transaction. With the simultaneous evolution of microservices application architecture, cloud-native and other modern applications now have a large array of services that require monitoring and correlation.
You might ask, what does this have to do with Real-time Observability and how does it differ from APM?
APM is code-centric application monitoring with some aspects of transaction tracing. Conversely, Observability is designed to provide rigorous transaction tracing with, in some cases, the addition of automated code profiling. APM was designed for larger code stacks with many internal calls between procedures and a smaller set of calls to other services on other systems.
With the rapid emergence of containers and microservices, the monitoring script flipped. APM focused on medium-to-large code stacks connected to a small-to-medium number of services. Applications are now built from many small, containerized services, each connected to many other services.
The diagram is a representation of a microservices interconnect map for a medium-sized application and is not at all unusual for applications today. An observability platform needs to monitor all of these connections and containers/endpoints which, as illustrated, is a complicated task. APM tools struggled to keep up with capturing metrics and traces for the large number of new services and interconnects of a microservices-based application.
APM platforms struggled because they were not designed to handle the large volume of metrics and traces required to accurately observe highly distributed, highly scalable cloud-native applications. In fact, many of those platforms end up sampling metrics and traces because they can’t store the amount of data needed to provide real-time observability.
Seeing the evolution in computing architectures from SOA to more highly distributed microservices, many APM vendors began to adapt their APM platforms in an attempt to adequately observe cloud-native applications. They added more transaction tracing, at least as much as their monitoring platform could handle, and even started using AI to help associate context with those transactions.
But even the adapted APM platforms struggled with the sheer volume of information needed to accurately observe a microservices application. They simply were not designed with an efficient way to stream information, nor with a storage architecture efficient enough to hold the volume of information required for Real-time Observability.
Instead, they cobble together sampled information to detect trends and anomalies, which takes time even for the most advanced machine learning algorithms. In some cases, depending on the nature of the issue, it can take minutes to discover the issue and provide enough context to get to its root cause.
Real-Time Observability Specified
Real-time Observability is:
- 1-second metrics stored for 24 hours
- Full end-to-end traces for every transaction, without sampling
- 3-second context notification for each metric and trace
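To give a sense of scale for the first item, one-second resolution kept for 24 hours works out to 86,400 samples per metric. The following is a minimal sketch of such a rolling window, not a description of how Instana stores data; the `MetricWindow` class and its methods are hypothetical names introduced only for illustration.

```python
from collections import deque

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 one-second samples per metric


class MetricWindow:
    """Hypothetical rolling store: one sample per second, 24 hours deep."""

    def __init__(self, capacity: int = SECONDS_PER_DAY) -> None:
        # A deque with maxlen drops the oldest sample once it is full.
        self.samples = deque(maxlen=capacity)

    def record(self, timestamp: int, value: float) -> None:
        """Append one (timestamp, value) sample."""
        self.samples.append((timestamp, value))

    def oldest(self):
        """Return the oldest retained sample, or None if empty."""
        return self.samples[0] if self.samples else None


window = MetricWindow()
# Simulate 24 hours plus 10 seconds of one-second CPU readings.
for second in range(SECONDS_PER_DAY + 10):
    window.record(second, 0.5)
```

After the loop, the window still holds exactly 86,400 samples, and the first ten seconds have aged out. Multiply that by every metric on every host and container, and the storage demands described earlier become concrete.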
All of the elements of the Real-time Observability platform are designed specifically to precisely capture all of the information you need and automatically combine it with pertinent context to ensure immediate usefulness.
Real-time Observability also leverages automation for near-instantaneous installation to deliver immediate results. It also automatically scales with microservices-based applications, so you don’t end up with gaps in your end-to-end observability as your application and service topology inevitably changes.
Why Real-Time Observability is Important
The rapid shift to cloud infrastructure and microservices changed how applications are deployed. The benefit of this shift is the agility to quickly modify, enhance and deploy new application features. What took months of effort can now be accomplished in a matter of days.
But the more highly distributed, network-centric nature of containers and microservices has complicated application architectures significantly.
Real-time Observability must capture every transaction trace, without sampling. It also captures the underlying infrastructure metrics (host and network) every second and correlates them with the end-to-end trace information. It detects issues within one second (MTTD) and provides notification (MTTN) within 3 seconds.
Real-time Observability makes it possible either to prevent an issue from turning into an incident immediately, or to apply a runbook or start manual triage much sooner than with observability platforms that take far longer to identify an issue.
Platforms that capture metrics much less frequently or sample traces instead of capturing every trace leave visibility gaps that negate the benefits of true Real-Time Observability: identifying issues before they’ve become incidents.
The underlying goal of Real-time Observability is to immediately identify issue trends, such as increasing latency or CPU use, and to apply remedial actions to prevent those issue trends from turning into incidents.
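As an illustration of what identifying an issue trend can look like, the sketch below fits a straight line to recent one-second latency samples and estimates how long until latency crosses a threshold. This is a simplified stand-in, not Instana's detection method; the function names are hypothetical, and the 250 ms default is borrowed from the latency goal quoted at the top of this article.

```python
def latency_slope(samples_ms):
    """Least-squares slope (ms per second) over one-second latency samples."""
    n = len(samples_ms)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples_ms) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples_ms))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


def seconds_until_breach(samples_ms, threshold_ms=250.0):
    """Estimate seconds until latency reaches the threshold.

    Returns 0.0 if already at or above it, and None if there is no
    data or latency is not rising.
    """
    if not samples_ms:
        return None
    current = samples_ms[-1]
    if current >= threshold_ms:
        return 0.0
    slope = latency_slope(samples_ms)
    if slope <= 0:
        return None
    return (threshold_ms - current) / slope


rising = [200 + 5 * t for t in range(10)]  # 200, 205, ..., 245 ms
steady = [180.0] * 10

print(seconds_until_breach(rising))  # latency climbing 5 ms every second
print(seconds_until_breach(steady))  # flat latency, no breach projected
```

With one-second samples, a steadily rising series like `rising` is flagged while there is still time to act. With coarser or sampled data, the same trend would only show up after the threshold had already been crossed.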
Why is that important? Because incidents lead to either 1) downtime or 2) severely impaired application performance. Downtime costs are approximately $11,000 (US) per minute and impaired application performance can lead to a lot of unhappy users, lost sales, and reputation damage.
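To put the per-minute figure in perspective, the short calculation below compares the cost of a few seconds of downtime with the cost of a few minutes. It assumes, purely for illustration, that cost accrues linearly at the approximate $11,000 per minute quoted above; the constant and function names are hypothetical.

```python
DOWNTIME_COST_PER_MINUTE_USD = 11_000  # approximate figure quoted above


def downtime_cost(seconds: float) -> float:
    """Cost in USD of an outage of the given length, assuming linear accrual."""
    return DOWNTIME_COST_PER_MINUTE_USD * seconds / 60


three_seconds = downtime_cost(3)      # 550.0
five_minutes = downtime_cost(5 * 60)  # 55000.0

print(f"3 seconds of downtime: ${three_seconds:,.0f}")
print(f"5 minutes of downtime: ${five_minutes:,.0f}")
```

Under that linear assumption, every minute shaved off the time to identify an issue is worth roughly $11,000.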
Real-time Observability is now being combined with AIOps to provide preventative or rapid triage remediation action to reduce the impact of issues. Real-time Observability provides the trigger to initiate the AIOps actions. The faster the issue can be identified, the faster that AIOps remediation can be applied.
If you care about application quality as reflected in performance and availability, then settling for “good enough” metrics or “sampled traces” does not provide you with the time-critical insight you need to keep your application optimized and your users happy. Spotty metrics and traces with limited context also do not provide adequate input to engage and benefit from AIOps to prevent or rapidly remediate issues.
Instana is the only Real-time Observability platform that provides 1) continuous one-second metrics, 2) FULL end-to-end traces for all transactions, and 3) full context with upstream and downstream service information for each transaction step, in real time.
Instana IS Real-time Observability.
It’s the only platform that enables AIOps to be used effectively to prevent issues from escalating into incidents. It also provides a much faster path for developers, DevOps, SREs and others to get to the root cause of an issue. And, its comprehensive automation enables practitioners to receive information as rapidly as possible and to remain Always On as cloud-native applications scale.