The Role of Context and Dependencies in Observability


To be successful with observability, you must be able to observe the behavior of a system and derive its health state from it. Deriving the health state of any given application entity, however, requires the observability system to understand the context of every piece of information it receives, to automatically correlate all the data, and to deliver hard evidence – quickly – on what is going on.

Traditionally, systems monitoring has focused on metrics and logs, and the approach has stayed largely the same for more than a decade. Observability advances this concept: it implies that we actually observe the system, at a more granular level than we used to monitor it. While still inclusive of all the numbers and graphs from monitoring tools, observability adds the knowledge of what is meaningful to monitor for every team that has a stake in application performance and availability.

The Importance of Observability

When working with distributed systems, we face challenges that are vastly different from those of the monolithic applications we built in the past. While we would love to think of our world in simple terms, the reality is often very different.

[Image: The dream vs. the reality of modern architectures © ROELBOB]

Even though we present our users and co-workers with simple-looking APIs, many different systems and services run independently and concurrently to fulfill their requests. The majority of services make one or more sub-calls to other services or databases. Understanding the interplay between those services is one of the most important prerequisites when trying to fix a bug or mitigate a performance problem.

Metrics and charts in dashboards are a great first step toward discovering issues. When solving an issue, however, what we are looking for is the relevant information that leads us to the cause of the problem as quickly as possible. The system should tell us which metrics and traces are important to the failure domain of each specific service, focusing our attention only on what matters.

The Pillars of Observability

Observability, as we think of it today, consists of three main pillars: health and performance metrics, (distributed) traces, and logs.

Health and performance metrics provide important, easy-to-grasp signals that the system is healthy or, conversely, that something fishy may be going on. They make it easy to alert on breached response-latency guarantees, and sudden, extreme changes are easy for the human eye to spot in a visualization.
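As a concrete illustration of alerting on a breached latency guarantee, here is a minimal sketch; the function names and the 500 ms threshold are assumptions for the example, not any particular monitoring product's API:

```python
# Minimal sketch of a latency-SLO check over recent response-time samples.
# All names and the 500 ms threshold are illustrative assumptions.

def p99(samples_ms):
    """Return the 99th-percentile latency from a list of samples (in ms)."""
    ordered = sorted(samples_ms)
    index = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return ordered[index]

def breaches_slo(samples_ms, threshold_ms=500):
    """True when the p99 latency exceeds the guaranteed threshold."""
    return p99(samples_ms) > threshold_ms
```

A real system would evaluate such a check continuously over a sliding window and fire an alert when it trips.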

Distributed traces are to distributed systems what stack traces are to applications and their exceptions or panics. Distributed tracing is a technique to capture and time service handlers and internal calls while a request makes its way through a system landscape to generate its response. For that reason, it is also referred to as distributed request tracing.

Distributed tracing helps developers analyze request flows and pinpoint the root cause of issues or performance bottlenecks. It should be noted that while many solutions offer "distributed tracing," not all of them capture a distributed trace of every request, which can lead to information gaps, especially when troubleshooting.
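To make the mechanics concrete, here is a minimal, hand-rolled sketch of how a trace ID and parent-span ID propagate across service calls. Real systems use a tracing SDK such as OpenTelemetry and propagate the context via request headers; every name here is made up for illustration:

```python
import time
import uuid


def new_trace_context():
    """Start a new trace at the edge of the system (e.g. the API gateway)."""
    return {"trace_id": uuid.uuid4().hex, "parent_span_id": None}


def record_span(ctx, service, operation, work):
    """Time one unit of work and emit a span carrying the shared trace_id."""
    span_id = uuid.uuid4().hex[:16]
    start = time.time()
    result = work()
    span = {
        "trace_id": ctx["trace_id"],        # shared by every span of the request
        "span_id": span_id,
        "parent_span_id": ctx["parent_span_id"],
        "service": service,
        "operation": operation,
        "duration_ms": (time.time() - start) * 1000,
    }
    # A downstream call would receive this child context, e.g. via HTTP headers.
    child_ctx = {"trace_id": ctx["trace_id"], "parent_span_id": span_id}
    return result, span, child_ctx
```

Because every span carries the same `trace_id` and links to its parent, the backend can reassemble the full request tree and show exactly where time was spent.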


[Image: A typical service request flow, hitting multiple different services and systems on its way to being responded to.]

Last but not least, logs provide us with use-case-, application- and situation-specific context information. Logs are most useful once the component in question has already been found and further digging into the issue is necessary; they may provide errors or warnings.

To be really useful, though, logs must contain only information that helps pinpoint errors. Trace- and debug-level log messages must be deactivated in production, except in specific situations where a bug is being chased down.
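One common way to keep log lines useful, and connectable to the other pillars, is to emit them in a structured form that carries the service and trace context. A minimal sketch using Python's standard logging module; the field names and the trace ID value are illustrative assumptions:

```python
import json
import logging


class ContextLogger(logging.LoggerAdapter):
    """Attach trace context to every log line so logs can be joined with traces."""

    def process(self, msg, kwargs):
        # Wrap the human-readable message in a JSON object with the context.
        record = {"message": msg, **self.extra}
        return json.dumps(record), kwargs


logging.basicConfig(level=logging.INFO)
log = ContextLogger(logging.getLogger("payment"),
                    {"service": "payment", "trace_id": "4bf92f3577b34da6"})

log.warning("card authorization retried")
# Message body becomes a JSON object carrying service and trace_id.
```

A log pipeline can then index these fields and jump from an error log straight to the trace it belongs to.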

Disconnected Information Syndrome

What must be prevented is the so-called "Disconnected Information Syndrome," where information from the three pillars is collected but stored in independent, unrelated, and disconnected form – also known as "information silos."

When dealing with disconnected information, correlating it becomes a largely manual process, which can be quite cumbersome, sometimes even impossible, due to the massive amount of data to take into account. The main reason people hate having multiple dashboards for different services is that it is practically impossible to match up time spans and gather the overall context of an issue.
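By contrast, a correlated store can answer "what is related to this incident?" in a single query. A toy sketch, with entirely made-up records, of joining the three pillars on a shared trace ID, service, and time window:

```python
from datetime import datetime, timedelta

# Illustrative records from three separate stores ("silos"); all field
# names and values are invented for this example.
spans = [{"trace_id": "abc", "service": "checkout",
          "ts": datetime(2021, 3, 1, 12, 0, 4), "duration_ms": 2300}]
logs = [{"trace_id": "abc", "service": "checkout",
         "ts": datetime(2021, 3, 1, 12, 0, 5), "level": "ERROR",
         "message": "db connection pool exhausted"}]
metrics = [{"name": "db.pool.in_use", "service": "checkout",
            "ts": datetime(2021, 3, 1, 12, 0, 0), "value": 100}]


def evidence_for(trace_id, window=timedelta(seconds=30)):
    """Join the three pillars on trace_id, service, and a shared time window."""
    trace = [s for s in spans if s["trace_id"] == trace_id]
    services = {s["service"] for s in trace}
    times = [s["ts"] for s in trace]
    lo, hi = min(times) - window, max(times) + window
    related_logs = [entry for entry in logs
                    if entry["trace_id"] == trace_id
                    or (entry["service"] in services and lo <= entry["ts"] <= hi)]
    related_metrics = [m for m in metrics
                       if m["service"] in services and lo <= m["ts"] <= hi]
    return {"spans": trace, "logs": related_logs, "metrics": related_metrics}
```

With silos, each of those three lookups is a separate tool, a separate query language, and a manual eyeballing of timestamps.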

With Mean Time To Resolution (MTTR) being the major metric Ops teams are judged on, finding the root cause of an issue has to be as fast and as easy as possible.

Automation and Observability

Observability is only as good as the level of insight it can provide. It is therefore important that observability systems always stay on top of what is going on in your environment. Keeping them up to date can be quite tedious, though, especially with a low level of monitoring automation in place.

Advanced monitoring and observability systems provide sophisticated automation:

  • Service Discovery
  • Framework and Library Discovery
  • Automatic Instrumentation of running services
  • (Service) Dependency Tree building
  • Multiple Data Sources
  • Data Correlation and Pattern Recognition

The major benefit of those features is the ability to correlate machine, infrastructure, and application/service metrics and traces. Distributed traces deliver an understanding of a request's flow, while metrics provide the necessary performance data points.

The biggest benefit of automatic correlation, though, is immediate insight. By the time we look at an issue or incident, the observability system has already done all the detective work, providing the important pieces of information as contextual evidence and leading us right to the interesting spot.


[Image: Which airplane would you prefer to fly? The old model 747 (left) or the current model (right)?]

A good analogy is an airplane. Vendors have done an amazing job over the last decades of making flying an airplane as easy and "intuitive" (if that word can be used for anything regarding flying) as possible. The original Boeing 747 had a massive number of knobs and meters and required a cockpit crew of three. Today's model gets by with a crew of only two. This is possible because the on-board computer analyzes all the information from the airplane's sensors in real time and provides the pilots with only the necessary information, or evidence, in case of a situation.

The Role of Context and Dependencies

Without an understanding of how all the elements of an airplane work together, the on-board computer wouldn't be able to decide which information is important and necessary to the pilots at any given time.

Just as in an airplane, our "on-board computer," the observability system, takes every bit of data into account and presents only the necessary pieces of information – in context!

The data correlation, the knowledge of how the infrastructure and services are deployed, and the dependency tree of applications, services and (eventually) hardware must all be taken into account when providing hard evidence of what is going on in your system.

That said, the context of any piece of information, such as when, how, and where a value was captured, is just as important as the information itself. The same goes for the context of dependent pieces of information.
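For instance, a bare number like 0.93 answers nothing on its own; a metric data point becomes evidence only once it carries its capture context. A sketch of what that might look like, where every field name is an illustrative assumption rather than any specific product's schema:

```python
# A raw value alone is not evidence; the surrounding context makes it so.
data_point = {
    "metric": "cpu.utilization",
    "value": 0.93,                          # the information itself
    "timestamp": "2021-03-01T12:00:04Z",    # when it was captured
    "unit": "ratio",                        # how to interpret the value
    "host": "node-17",                      # where it was captured
    "service": "payment",                   # which entity it belongs to
    "aggregation": "1m-average",            # how the value was derived
}
```

Only with the `unit`, `host`, and `aggregation` fields can the system (or a human) decide whether 0.93 is alarming or routine.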

Conclusion

It’s safe to say that information without its context is pretty much as useless as no information at all – especially with the knowledge that the number of data points we’re collecting is constantly growing. Automatic correlation of events and data is the only possible way forward. It’s true for monitoring – and it’s true for observability.

