Over the last two months, I’ve had the pleasure of spending time with Mark Burgess discussing the lack of rigor in defining what effective observability and monitoring actually mean in the context of distributed systems, and how we can design and develop a better and far more compelling future for application monitoring with the advent of microservices and the rise of event-based architectures. Today Mark has made available one of the papers we collaborated on. It is an impressive piece of research and understanding, and I hope it will receive far more consideration than the lesser, largely legacy efforts that seem to grab all the attention at tech conferences.
Here is the abstract and link to the paper.
Abstract: To understand and explain process behaviour we need to be able to see it, and decide its significance, i.e. be able to tell a story about its behaviours. This paper describes the modelling challenges that underlie monitoring and observation of processes in IT, by human or by software. The topic of the observability of systems has been elevated recently in connection with computer monitoring and tracing of processes for debugging and forensics. It raises the issue of well-known principles of measurement, in bounded contexts, but these issues have been left implicit in the Computer Science literature. This paper aims to remedy this omission, by laying out a simple promise theoretic model, summarizing a long-standing trail of work on the observation of distributed systems, based on elementary distinguishability of observations, and classical causality, with history. Three distinct views of a system are sought, across a number of scales, that describe how information is transmitted (and lost) as it moves around the system, aggregated into journals and logs.
Here are some of the critical high-level points and observations that make this such an insightful paper on observability in the context of distributed systems, along with my own more personal commentary and vision.
“The state of the art in monitoring relies principally on brute force data collection and graphical presentation. There is surprisingly little discussion about the semantics of the process”.
Much of the focus in application monitoring is on collecting vast amounts of data, with little regard for the cost of that data or its effectiveness in helping users understand their systems and actively learn how best to manage them. This needs to change, and it soon will!
In application monitoring, the “learning” in the marketing literature is focused on machine learning. This is understandable when we consider the vast amount of data being generated, stored, and queried. But I think this is just one part of the problem space and solution. Our primary objective should be to enable our users, as opposed to our customers (the corporate entity tasking a user), to increase their ability to learn what is needed to perform their task. With machines, we feed them data to train them (more precisely, the algorithm and model). With humans, this is not ideal unless you believe in learning by rote. Humans learn best as active participants in a process where meaning is made or derived.
When we look at what we provide to our users, we need to take this into consideration. We need to ask ourselves: do users learn from this? Is the learning process effective and adaptive as they gain knowledge? Can we help facilitate learning within the UX over an hour, a day, a week? How can we bring the user, the human, back into the observation and control loop? How can we play to the strengths of both machine and human, so that each party gains from the other? For example, the machine can use a human to reinforce its learning by asking for verification of an assessment it has made in some situation. The learning algorithms themselves need to learn; they need to adapt to other data feeds such as human dialog. In the other direction, we need to reduce the cognitive load on the human: not everything blinking at once, but the user guided into an area worthy of their attention, and able to request more data on demand.
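The feedback loop described above, where a machine asks a human to verify its assessment and adapts accordingly, can be sketched minimally. Everything here is illustrative and my own invention, not a mechanism from the paper: a detector flags anomalies against a cutoff, and operator confirmations or rejections nudge that cutoff.

```python
# Hypothetical human-in-the-loop sketch: the detector flags an anomaly,
# asks an operator to confirm, and adjusts its sensitivity based on the
# answer. Names, thresholds, and step sizes are all illustrative.

class FeedbackDetector:
    def __init__(self, threshold=3.0, step=0.25):
        self.threshold = threshold  # z-score cutoff for flagging
        self.step = step            # how far one piece of feedback moves it

    def is_anomalous(self, zscore):
        return abs(zscore) > self.threshold

    def incorporate_feedback(self, confirmed):
        # Confirmed alert: the cutoff was useful, tighten slightly so
        # similar cases keep surfacing. Rejected alert: loosen it so the
        # operator sees fewer false positives next time.
        if confirmed:
            self.threshold = max(1.0, self.threshold - self.step)
        else:
            self.threshold += self.step

detector = FeedbackDetector()
assert detector.is_anomalous(3.5)
detector.incorporate_feedback(confirmed=False)  # operator: false alarm
assert not detector.is_anomalous(3.1)           # cutoff rose to 3.25
```

The point is not the arithmetic but the direction of flow: the human’s judgment becomes a data feed for the algorithm, rather than the human only consuming the algorithm’s output.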
“The automation of alarms (usually based on simple-minded absolute thresholds) tells human operators when to pay attention, at which point they have to rely on what has been traced. The promise to maintain awareness is an expensive one, and we rely heavily on our skills of reconstruction after the fact.”
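The “simple-minded absolute thresholds” the quote criticizes can be contrasted with an alarm that tracks a learned baseline. This is a minimal sketch of my own, not from the paper; the exponentially weighted moving average, tolerance, and readings are illustrative.

```python
# Contrast: a fixed absolute threshold vs. an alarm relative to an
# adaptive baseline (an exponentially weighted moving average).

def absolute_alarm(value, limit=100.0):
    return value > limit

class EwmaAlarm:
    """Fire when a sample deviates strongly from a learned baseline,
    rather than from a fixed limit chosen in advance."""
    def __init__(self, alpha=0.2, tolerance=0.5):
        self.alpha = alpha          # weight of the newest sample
        self.tolerance = tolerance  # allowed fractional deviation
        self.baseline = None

    def observe(self, value):
        if self.baseline is None:
            self.baseline = value
            return False
        fired = abs(value - self.baseline) > self.tolerance * self.baseline
        self.baseline = self.alpha * value + (1 - self.alpha) * self.baseline
        return fired

alarm = EwmaAlarm()
readings = [50, 52, 51, 53, 120]   # a sudden spike at the end
fired = [alarm.observe(v) for v in readings]
assert fired == [False, False, False, False, True]
assert not absolute_alarm(120, limit=150)  # fixed limit misses the spike
```

The fixed limit encodes a one-time guess about what “bad” looks like; the adaptive version encodes a (still simple) policy about deviation from recent behaviour, which is closer to having a model at all.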
Observability serves two primary purposes. The first is the acquisition of knowledge and human understanding via reconstruction of a past or recent reality in some comparable form. The second is controllability: the active monitoring and management of a system via direct and indirect dynamic interventions around (work)flows and resource assignments.
“Monitoring and measurement serve no actionable purpose unless there is already a policy for behaviour in place. Ashby’s model of requisite complexity or ‘good regulator’ in cybernetics, summarizes how matching information with information on the same level is required when there is no intrinsic stability of a model to compress such fluctuations by.”
The data is not the model. Not all data has value, and where there is value it is not necessarily equal. A model helps form understanding and guides appropriate action. We need to question the production of measurement data with regard to how it is consumed and transformed into something of meaning that aids observers and operators, human and machine. We can’t manage a system if all we have as a model are stored records, be they metrics or events.
When we think about learning there are two essential aspects to consider:
- the method of learning
- the model(s) used to facilitate learning
Systems thinking is a useful set of skills and techniques built around a simplified model of the dynamics within a system. It aids in understanding systems, predicting future behaviors and states, and devising ways to intervene in order to steer a system towards a targeted outcome.
The key concepts systems thinking focuses on are:
- agent-to-agent interactions (over time)
- feedback loops
- flows and stocks
- delays and dynamics
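A toy simulation can make each concept in the list above concrete. This sketch is entirely my own illustration: a stock (queue depth), flows (arrivals and processing), a feedback loop (scale capacity when the queue grows), and a delay (the scaling action takes effect several ticks later).

```python
# Toy stock-and-flow model with delayed feedback. All quantities are
# invented for illustration.

def simulate(ticks=20, arrivals=10, delay=3):
    stock = 0                  # queued work items (the "stock")
    capacity = 8               # items processed per tick (the outflow)
    pending = []               # (tick_effective, new_capacity) scaling actions
    history = []
    for t in range(ticks):
        # delayed feedback: earlier scaling decisions take effect now
        for when, cap in list(pending):
            if when == t:
                capacity = cap
                pending.remove((when, cap))
        stock = max(0, stock + arrivals - capacity)
        # feedback loop: queue growth triggers a (delayed) capacity increase
        if stock > 20 and not pending:
            pending.append((t + delay, capacity + 4))
        history.append(stock)
    return history

hist = simulate()
assert max(hist) > 20          # the delay lets the backlog overshoot
assert hist[-1] < max(hist)    # feedback eventually drains the stock
```

Even this crude model exhibits the behaviour systems thinking warns about: because the corrective flow is delayed, the stock overshoots before the feedback loop catches up, which is exactly the dynamic a purely point-in-time metric view hides.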
The monitoring models of the future need to extract and incorporate these concepts in both observability and controllability.
The purpose of monitoring is to be able to explain behaviour and even predict problems in advance. Without predictability, monitoring is little more than arcane entertainment. One assumes that, by learning about the past or by building a relationship with system behaviours in band, we are able to predict something about the future behaviour. This assumes stability under the repetition of patterns.
Observability and monitoring are far more critical today not just because we have gone from monoliths to microservices but because of the increase in the rate of change and how it is unevenly dispersed across time (staggered rollouts) and space (containers within containers) in an ever-expanding system of interconnected parts. But because monitoring requires some degree of stability (in patterns of behavior) we now must revisit the past models of monitoring and the approaches to management with such models.
Observability must not just offer an efficient method of understanding, learning, and sharing meaning across technical and social boundaries, it must also enable the prediction of inferred states from the present into one or more predicted and near immediate futures. Prevention over restoration. Intervention over post-mortem analysis. Proactive over reactive.
“It’s slightly paradoxical that the time when most users want to monitor systems is when they are least predictable and providing observations of no value.”
Humans and machines must learn through experience: “learning through reflection on doing”. Knowing is doing, and doing is knowing. The process involves reflecting on some concrete experience, forming an abstract conceptualization (a model), and then actively experimenting with that model in another concrete experience: a constant loop of action, reflection, reasoning, and (re-en)action.
In monitoring, such learning takes place under extreme conditions: a problem occurs in production, and operations staff (including developers) are tasked with identifying the cause(s) of the related incidents. This typically involves scanning for candidate (clue) metrics with spikes or dips (change) just before the problem became apparent elsewhere. In many cases, this is the first time operations staff have actually had to look at such metrics and fully understand how they relate to, and are impacted by, changes before and after a series of failures or performance degradations. With increased observability comes cognitive overload and a resulting blindness. It does not help that, under this kind of stress, humans are not generally well suited to thinking clearly.
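The hunt described above, scanning metrics for a spike or dip just before an incident, is mechanical enough to sketch. This is my own minimal illustration, not tooling from the paper; the series, window, and cutoff are invented.

```python
# Flag points that deviate strongly from the preceding window of
# samples -- a crude automation of "look for the spike before the outage".

import statistics

def spikes(series, window=5, z_cutoff=3.0):
    """Return indices where a sample's z-score against the preceding
    window exceeds the cutoff."""
    found = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mean = statistics.mean(recent)
        stdev = statistics.pstdev(recent)
        if stdev > 0 and abs(series[i] - mean) / stdev > z_cutoff:
            found.append(i)
    return found

latency_ms = [21, 20, 22, 21, 20, 21, 22, 95, 90, 88]  # incident at index 7
assert spikes(latency_ms) == [7]
```

Note what even this toy gets right and wrong: it finds the onset, but the elevated samples after it are “normal” relative to their own recent window, which is one reason spike-hunting by eye across hundreds of charts is so error-prone.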
“The significance or meaning of a signal is the heuristic inverse of the (incompressible) information within it. The more information we need to characterize a room, the less stands out about it. If there is one part that dominates, the rest is negligible — hence the principle of signalling significance.”
Lean and meaningful is how we must reimagine and rethink the monitoring of the future. Less is more, and probably the only realistic option going forward. Collecting hundreds of fields for each and every event is invariably wasteful for machines, networks, and humans. Yesterday’s data is not today’s data. There is no magic for extracting a signal from data when it comes to the changing nature of software and the dynamics of systems. Signals are the new data currency, originating at the source and propagating outwards.
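One crude way to operationalize the paper’s “significance as the heuristic inverse of incompressible information” is to use a general-purpose compressor as a stand-in for information content: a repetitive log stream compresses well (little stands out), while a stream full of novel content does not. This is my own rough sketch, and the log lines are invented.

```python
# Compression ratio as a cheap proxy for the incompressible information
# in a stream of text. Illustrative only.

import zlib

def incompressibility(text):
    """Compressed size over raw size: near 0 for highly repetitive
    input, higher for novel content."""
    raw = text.encode()
    return len(zlib.compress(raw)) / len(raw)

boring = "GET /health 200\n" * 50
novel = "".join(f"GET /order/{i * 7919 % 1000}/pay 500\n" for i in range(50))

assert incompressibility(boring) < incompressibility(novel)
```

Read through the lens of the quote: the repetitive health checks carry almost no significance precisely because they are so compressible, while the stream of varied failing requests is where the signal lives.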
“The proliferation of logs in IT systems means that they are used for far more than they are capable of representing. What happens to order and distance relationships in logs after aggregation? There are many tools that imply log aggregation is a good way to bring together all logs into one location, but there is little discussion around the significance or usefulness of the result.”
The deep dynamic behavior of execution flow is all but lost in the promotion of records, tags, and labeled indices. The narrative that observability should be capturing is nowhere to be seen, no matter how a chart is prettied up or pivoted. Instead, users are clouded and confused by hundreds of labels and fields. It is like picking up a book and, instead of immersing ourselves in the narrative constructed by the author, being presented with records detailing the act of its creation: on what date a particular sentence or word was typed, the duration between such acts, how many edits were made. This is because the observer, agent or API, is focused not on the narrative but on its construction, the writing of sentences, words, letters, and so on. Observability needs to see beyond data, deep into the nature of the code.
“Keeping data and even models around forever is a senseless squandering of resources and an irresponsible and unsustainable use of technology. One wonders how many of the photos now being eagerly accumulated in the cloud will be preserved in ten years’ time. The same is true of monitoring data that were collected last week. If we don’t understand the timescales, context, and relevance of data, then we have no business collecting it, because it cannot tell us anything of value.”
The faster a system moves forward and adapts to circumstances within its environment, the less value there is to be found in the distant past. The deeper we dive, the more we understand and from there change, leading to the need to forget more, to incorporate the present much faster, and to some degree the predicted future. This will be a challenge unless we find a set of signals that are stable across time and the change it encompasses. That is where we’re heading.
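Forgetting by design can be made concrete as tiered retention: keep recent samples at full resolution and progressively coarsen older ones, so storage reflects how fast the data loses value. This sketch is my own, with invented tiers and boundaries.

```python
# Tiered retention sketch: full resolution for recent samples, 10x
# coarser averages for everything older. Parameters are illustrative.

def downsample(samples, bucket):
    """Average consecutive samples into buckets of the given size."""
    return [sum(samples[i:i + bucket]) / len(samples[i:i + bucket])
            for i in range(0, len(samples), bucket)]

def tiered_retention(samples, recent=60):
    """Keep the newest `recent` samples raw; downsample the rest."""
    old, new = samples[:-recent], samples[-recent:]
    return downsample(old, 10) + new

raw = list(range(600))                # 600 raw samples
kept = tiered_retention(raw)
assert len(kept) == 54 + 60           # 540 old samples -> 54 averages
```

Real metric stores apply the same idea with multiple tiers; the point here is only that “how long is this data meaningful at this resolution” is a question the model must answer, not the disk.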