Each new day we see observability attached to yet another aspect of software and systems engineering. Observability pipelines. Observability platforms. Clearly, others want the love shown to observability to shine a light into their own secluded corners. Unfortunately, we have reached a point where observability has lost all meaning and any connection to what it was originally intended to mean. Let me try to correct this, starting with a few definitions I’ve taken from elsewhere.
“In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals…one can determine the behavior of the entire system from the system’s outputs. If a system is not observable, this means that the current values of some of its state variables cannot be determined through output sensors. This implies that their value is unknown to the controller (although they can be estimated by various means).” – Wikipedia
“Observability is a notion that plays a major role in filtering and reconstruction of states from inputs and outputs. Together with reachability, observability is central to the understanding of feedback control systems.” – Lectures on Dynamic Systems and Control, MIT
The above is how I visually depict observability and controllability as I consider them to be mostly inseparable for effective and efficient performance in either process. Application monitoring and management is that important process space in between – perceiving, attending, and (re)acting.
Now you could argue that most of the above definitions pertain to control theory (cybernetics). I counter that much of what is happening today in the software industry is centered around streamlining and stabilizing a much bigger feedback loop: managing change and complexity.
The critical elements of cybernetics are feedback, flow, control, and communication, which are pretty much what many would agree are the essential aspects of successful DevOps as well. The big difference is that we must now somehow bring together the worlds of man and machine in a way that plays to the strengths of each without significantly taxing the capabilities and capacities of either. Talk of exploring the infinite space of the “unknown unknown” is, for the most part, not at all helpful or productive. There are real limits and constraints in practice and, most certainly, in production. We need to refocus on what should be observed and controlled.
Before proceeding further with which behaviors and states of a system to observe and, in turn, control, let me present another definition of observability that I found of interest.
“most scientific observations consist of drawing inferences from what we sense. But we do not count just any inference made from sensations as results of “observation”. The inference must be reasonably credible, or, made by a reliable process. All observation must be based on sensation, but what matters most is what we can infer safely from sensation…I would define observation as a reliable determination from sensation.” – Inventing Temperature
What matters most is what we can infer from observations, and how significant those observations are to the state of the system (of systems) that we are attempting to determine and control.
As I see it, the problem with observability since its inception is that we, the application monitoring and management industry, have not defined a model consisting of a relatively small set of universal signals and states that reflect the nature of modern application systems of services, flows, streams, etc. Niche observability vendors keep talking up raw data: traces, metrics, events, and logs. Somehow engineering is expected to pull a bunny out of a hat, one that houses much complexity and change. Control is another illusion here. We need to get back to basics.
- What system (health) states are of concern and at what granularity? Cluster? Service? Endpoint? Instance?
- How is the state determination made? By the system, or service, itself or by external dependents? How reliable, durable, and predictive can such determinations be?
- What is the set of qualitative signals that can describe much of the interaction (dialog, as opposed to calls) that occurs between systems and between parts of a system?
- What is the mapping of signal to state and how sensitive should it be at various exchange points?
- What mechanism and policy can be employed to scale up (aggregation) and down (decomposition) to the appropriate level of operational attention?
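To make these questions a little more concrete, here is a minimal sketch of a signal-to-state mapping with a simple roll-up policy. Everything in it is an assumption made for illustration: the signal names, the state names, the thresholds, and the "worst child wins" aggregation rule are hypothetical and not taken from any product or standard.

```python
from collections import Counter
from enum import Enum, IntEnum


class Signal(Enum):
    """Hypothetical qualitative signals emitted at an exchange point."""
    SUCCESS = "success"
    RETRY = "retry"
    TIMEOUT = "timeout"
    FAIL = "fail"


class State(IntEnum):
    """Hypothetical health states, ordered so that a higher value is worse."""
    OK = 0
    DEGRADED = 1
    DOWN = 2


def infer_state(signals, degraded_at=0.2, down_at=0.5):
    """Map a window of signals to a state.

    The two thresholds are the sensitivity knobs; the defaults are arbitrary.
    """
    if not signals:
        return State.OK  # assumption: no signals means nothing to report
    counts = Counter(signals)
    total = len(signals)
    failed = (counts[Signal.FAIL] + counts[Signal.TIMEOUT]) / total
    troubled = failed + counts[Signal.RETRY] / total
    if failed >= down_at:
        return State.DOWN
    if troubled >= degraded_at:
        return State.DEGRADED
    return State.OK


def roll_up(states):
    """Aggregate child states into a parent state: the worst child wins."""
    return max(states, default=State.OK)


# Endpoint-level signal windows for one service (made-up sample data).
endpoints = {
    "/checkout": [Signal.SUCCESS] * 7 + [Signal.RETRY] * 3,
    "/search": [Signal.SUCCESS] * 10,
}
endpoint_states = {name: infer_state(window) for name, window in endpoints.items()}
service_state = roll_up(endpoint_states.values())
print(endpoint_states, service_state)
```

The same `infer_state` and `roll_up` pair could be applied again from service to cluster, which is one possible reading of scaling attention up and down. A real policy would likely weight signals differently at different exchange points.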
Starting at the bottom and working your way up loses far too much context, and leads to far too much work and cognitive overload for both man and machine (learning). Most of the current observability technologies don’t fare well as a source of signals or inferred states. They are not designed to reconstruct behavior in a way that would allow the level of inspection we need to translate effectively from measurement to signal and, in turn, to state. They are designed with data collection and reporting of the event in mind, not the signal or state.
Yesteryear’s observability technologies are not fully aware of the resilience mechanisms that many service-to-service communication libraries now employ. They count errors at the request level and not failures of an exchange within the context of a workflow; an exchange can include one or more errors and still be regarded as a successful completion. The default should be to surface only signals and states in tooling and to drill down into low-level events only on exception. When we talk only of signals and states, then and only then can we say there is observability. Anything else is just sensing and storage. Today’s observability lacks scalability, sensibility, and sensitivity.
At Instana we have an ambitious product roadmap and engineering plan to radically change how organizations monitor, manage, and continuously effect change throughout their systems.
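The gap between request-level errors and exchange-level failures can be sketched in a few lines. This is an illustration under stated assumptions: retries stand in for the resilience mechanism, and the `Attempt` and `Exchange` types, along with the sample data, are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attempt:
    """One low-level request made on behalf of an exchange (hypothetical)."""
    ok: bool


@dataclass
class Exchange:
    """One logical service-to-service exchange, including any retries."""
    attempts: List[Attempt]

    @property
    def succeeded(self) -> bool:
        # Assumption: the exchange succeeds if its final attempt succeeds.
        return bool(self.attempts) and self.attempts[-1].ok


def request_errors(exchanges):
    """Request-level view: every failed attempt counts as an error."""
    return sum(1 for ex in exchanges for a in ex.attempts if not a.ok)


def exchange_failures(exchanges):
    """Exchange-level view: only exchanges that never succeeded count."""
    return sum(1 for ex in exchanges if not ex.succeeded)


# Made-up sample: one exchange recovers after two retries, one succeeds
# immediately, and one exhausts its retries without success.
exchanges = [
    Exchange([Attempt(False), Attempt(False), Attempt(True)]),
    Exchange([Attempt(True)]),
    Exchange([Attempt(False), Attempt(False), Attempt(False)]),
]
print("request errors:", request_errors(exchanges))
print("exchange failures:", exchange_failures(exchanges))
```

With this sample, the request-level view reports five errors while the exchange-level view reports a single failure. Only the latter would be surfaced by default, with the individual attempts available on drill down.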
Simplify. Signify. Rectify.