To be successful in Observability, you must have the ability to observe the behavior of a system and derive its health state from it. But deriving the health state of any given application entity requires the observability system to understand the context of every piece of information given, to automatically correlate all the data, and deliver hard evidence – quickly – on what is going on.
Traditionally, systems monitoring has focused on metrics and logs, and the approach has barely changed in more than a decade. Observability extends monitoring: instead of only watching predefined numbers, we observe the behavior of the system itself, and we do so at a more granular level than we monitor. While it includes all the numbers and graphs from monitoring tools, observability adds the knowledge of what is meaningful to monitor and makes it available to all teams that have a stake in application performance and availability.
The Importance of Observability
When working with distributed systems, we face challenges that are vastly different from those of the monolithic applications we built in the past. While we like to think about our world in simple terms, the reality is often very different.
[Image: The dream vs the reality of modern architectures. © ROELBOB]
Even though we provide our users and co-workers with simple-looking APIs, many different systems and services run independently and concurrently to fulfil their requests. The majority of services still make one or more sub-calls to other services or databases. Understanding the interplay between those services is one of the most important aspects of fixing a bug or mitigating a performance problem.
Metrics and charts in dashboards are a great first step to find out about issues. However, when solving an issue, what we are looking for is relevant information that leads us to the cause of a problem as quickly as possible. The system should tell us which metrics and traces are important to the failure domain of each specific service, focusing our attention only on what matters.
The Pillars of Observability
Observability, as we think of it today, consists of three main pillars: Health and Performance Metrics, (Distributed) Traces, and Logs.
Health and Performance Metrics provide the important, easy-to-grasp information of whether the system is healthy or whether something fishy may be going on. It is easy to alert on breached guarantees, such as a response latency threshold. Sudden and extreme changes are also easy to spot for the human eye once the metrics are visualized.
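As a rough illustration of alerting on a breached latency guarantee, the following sketch checks whether the 95th percentile of a set of response times exceeds a threshold. The function names, the sample values, and the 300 ms threshold are assumptions made up for this example, not part of any particular monitoring product.

```python
import math


def percentile(values, pct):
    """Return the nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_breached(latencies_ms, threshold_ms, pct=95):
    """True if the chosen latency percentile exceeds the guaranteed threshold."""
    return percentile(latencies_ms, pct) > threshold_ms


# Hypothetical response times in milliseconds, including one slow outlier.
samples = [120, 95, 110, 480, 105, 130, 99, 101, 115, 125]

if latency_breached(samples, threshold_ms=300):
    print("ALERT: p95 latency above the guaranteed 300 ms")
```

A percentile is used here instead of the average because a few very slow requests can hide behind a healthy-looking mean.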
Distributed Traces are to distributed systems as stack traces are to applications and exceptions or panics. It’s a technique to capture and time service handlers and internal calls while a request makes its way through a systems landscape to generate its response. For that reason, it is also referred to as distributed request tracing.
Distributed Tracing helps developers analyze request flows and pinpoint the root cause of issues or performance bottlenecks. It should be noted that while many solutions have “distributed tracing,” they don’t all capture a distributed trace of every request, which can lead to information gaps, especially when troubleshooting.
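To make the idea of a distributed trace more concrete, here is a minimal sketch of how timed spans could be linked into one trace. The `Span` class and the `child_of` helper are hypothetical names for this illustration; real tracing systems define their own data models and propagate the IDs across process boundaries.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One timed unit of work that belongs to a trace (illustrative model)."""
    trace_id: str                      # shared by every span of one request
    name: str                          # e.g. handler or query name
    parent_id: Optional[str] = None    # None marks the entry span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.perf_counter)
    duration_ms: Optional[float] = None

    def finish(self):
        """Record the elapsed time since the span was created."""
        self.duration_ms = (time.perf_counter() - self.start) * 1000
        return self


def child_of(parent, name):
    """Create a span for a sub-call that inherits the parent's trace ID."""
    return Span(trace_id=parent.trace_id, name=name, parent_id=parent.span_id)


# One request: the entry span and a database sub-call made while handling it.
root = Span(trace_id=uuid.uuid4().hex, name="GET /checkout")
db_call = child_of(root, "SELECT orders").finish()
root.finish()

print(root.name, "->", db_call.name, f"({db_call.duration_ms:.3f} ms)")
```

Because every span carries the same trace ID and a pointer to its parent, the full request path can be reassembled later, even if the spans were recorded by different services.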
Last but not least, Logs provide us with context information that is specific to the use case, the application, and the situation. Logs are most useful once the component in question has been found and further digging into the issue is necessary. They may contain errors or warnings that explain what went wrong.
To be really useful though, logs must contain only information useful to pinpoint errors. Tracing and debugging log messages must be deactivated in production, except for specific situations where a bug is being chased down.
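One common way to keep debug output disabled in production while still allowing it to be switched on temporarily is to read the log level from configuration. The sketch below uses Python's standard `logging` module; the `APP_LOG_LEVEL` variable and the `checkout` logger name are assumptions chosen for this example.

```python
import logging
import os


def configure_logging(env=None, default_level="WARNING"):
    """Set the logger threshold from APP_LOG_LEVEL (a hypothetical variable)."""
    env = os.environ if env is None else env
    level_name = env.get("APP_LOG_LEVEL", default_level).upper()
    level = getattr(logging, level_name, logging.WARNING)
    if not isinstance(level, int):
        # Unknown or non-level names fall back to the production default.
        level = logging.WARNING
    logger = logging.getLogger("checkout")
    logger.setLevel(level)
    return logger


logger = configure_logging()
logger.debug("cart contents dumped")   # dropped unless APP_LOG_LEVEL=DEBUG
logger.warning("payment provider responded slowly")
```

With this setup, chasing a bug only requires changing one setting for the affected service rather than shipping a new build.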
Disconnected Information Syndrome
What must be prevented is the so-called “Disconnected Information Syndrome”, where information of the three pillars is collected, but stored in independent, unrelated and disconnected form – also known as “Information Silos”.
When dealing with disconnected information, correlating information is a more manual process, which can be quite cumbersome, sometimes even impossible, due to the massive amount of data to take into account. The main reason people hate having multiple dashboards for different services is that it’s practically impossible to match any time spans and gather the overall context of the issue.
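The manual matching described above becomes much simpler when every record carries a shared identifier. The following sketch groups spans and log records by a common trace ID; the field names (`trace_id`, `service`, `message`) and the sample records are assumptions for illustration only.

```python
from collections import defaultdict


def correlate(spans, logs):
    """Group spans and log records that share the same trace ID."""
    by_trace = defaultdict(lambda: {"spans": [], "logs": []})
    for span in spans:
        by_trace[span["trace_id"]]["spans"].append(span)
    for record in logs:
        by_trace[record["trace_id"]]["logs"].append(record)
    return dict(by_trace)


# Hypothetical data that would otherwise live in two separate tools.
spans = [
    {"trace_id": "t1", "service": "checkout", "duration_ms": 840},
    {"trace_id": "t2", "service": "checkout", "duration_ms": 35},
]
logs = [
    {"trace_id": "t1", "level": "ERROR", "message": "payment timeout"},
]

incident = correlate(spans, logs)["t1"]
print(incident["spans"][0]["duration_ms"], incident["logs"][0]["message"])
```

Without a shared key such as the trace ID, the only remaining option is to line records up by timestamp, which is exactly the cumbersome process the silos force on us.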
With Mean Time To Resolution (MTTR) being the major metric Ops teams are judged on, finding the root cause of an issue has to be as fast and as easy as possible.
Automation and Observability
Observability is only as good as the level of insight that can be provided. Therefore, it is important that observability systems are always on top of what is going on in your systems. Keeping systems up to date can be quite tedious though, especially if you have a low level of monitoring automation in place.
Advanced monitoring and observability systems provide sophisticated automation:
- Service Discovery
- Framework and Library Discovery
- Automatic Instrumentation of running services
- (Service) Dependency Tree building
- Multiple Data Sources
- Data Correlation and Pattern Recognition
The major benefit of those features is the ability to correlate machine, infrastructure, and application / services metrics and traces. Distributed traces deliver the understanding of a request’s flow, while metrics provide the necessary performance points.
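As an illustration of the dependency tree building mentioned in the list above, the sketch below derives service-to-service dependencies from the parent and child relationships of trace spans. The span fields and service names are hypothetical, and a real observability system would also draw on discovery and infrastructure data.

```python
def build_dependencies(spans):
    """Map each calling service to the set of services it calls."""
    service_of = {span["span_id"]: span["service"] for span in spans}
    dependencies = {}
    for span in spans:
        parent_id = span.get("parent_id")
        if parent_id in service_of:
            caller, callee = service_of[parent_id], span["service"]
            if caller != callee:  # ignore calls inside the same service
                dependencies.setdefault(caller, set()).add(callee)
    return dependencies


# Hypothetical spans of a single request.
spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "checkout"},
    {"span_id": "c", "parent_id": "b", "service": "orders-db"},
    {"span_id": "d", "parent_id": "b", "service": "checkout"},
]

for caller, callees in build_dependencies(spans).items():
    print(caller, "->", sorted(callees))
```

Since the edges come from observed requests, the resulting graph reflects what the services actually call, not what a diagram says they should call.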
The biggest benefit of automatic correlation, though, is the immediate insight. When we look at an issue or incident, the observability system has already done all the detective work. It provides the important pieces of information as contextual evidence and leads us right to the interesting spot.
A good analogy is an airplane. Vendors have done an amazing job over the last decades to make flying an airplane as easy and “intuitive” (if you can use that word for anything regarding flying) as possible. The original Boeing 747 had a massive number of knobs and meters, with a cockpit crew of 3. Today’s model, though, has a crew of only 2. This is possible because the on-board computer analyzes all the information from the airplane’s sensors in real time and provides the pilots with only the necessary information or evidence when a situation arises.
The Role of Context and Dependencies
Without an understanding of how all the elements of an airplane work together, the on-board computer wouldn’t be able to decide which information is important and necessary to the pilot at any given time.
Just like in an airplane, our “on-board computer,” the observability system, takes every bit of data into account and presents only the necessary pieces of information – in context!
The data correlation, the knowledge of how the infrastructure and services are deployed, as well as the dependency tree of applications, services and (eventually) hardware, must be taken into account when providing hard evidence of what is going on in your system.
That said, the context of any piece of information, such as when, how, and where a value was captured, is just as important as the information itself. Same goes for the context of dependent information pieces.
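A small sketch can show what carrying context along with a value might look like. The `DataPoint` class, its field names, and the five-second skew are assumptions for this example; the point is only that when, where, and how travel together with the value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DataPoint:
    """A measured value together with the context it was captured in."""
    name: str
    value: float
    captured_at: datetime  # when the value was captured
    source: str            # where: host, container, or service
    method: str            # how: e.g. agent sample or trace-derived


def same_context(a, b, max_skew_s=5.0):
    """True if two data points come from the same source at about the same time."""
    skew = abs((a.captured_at - b.captured_at).total_seconds())
    return a.source == b.source and skew <= max_skew_s


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
cpu = DataPoint("cpu_percent", 97.0, now, "host-7/checkout", "agent sample")
latency = DataPoint("latency_ms", 840.0, now + timedelta(seconds=2),
                    "host-7/checkout", "trace-derived")

print(same_context(cpu, latency))
```

A bare value of 97 says little on its own; knowing that it is CPU usage on the same host and within seconds of a latency spike is what turns it into evidence.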
It’s safe to say that information without its context is pretty much as useless as no information at all – especially with the knowledge that the number of data points we’re collecting is constantly growing. Automatic correlation of events and data is the only possible way forward. It’s true for monitoring – and it’s true for observability.