Whenever I get into a conversation regarding what I consider unnecessary surface complexity of various trace instrumentation interfaces and toolkits, the words “confusion”, “complicated”, and “challenging” invariably pop up.
The state of tracing is troubling on so many levels within the Observability community. There are competing standards, OpenTracing versus OpenTelemetry, that don’t seem to be at all standards-driven or to offer much in the way of stability and simplicity. Much of what is there at the surface level is a collage of multiple concerns collapsed into one disordered ensemble. There is minimal clean and consistent separation of the interface based on the different actors and roles – application developer, protocol propagator, service provider, data pipeline, and extensions. Then there is the arcane terminology of spans, span kinds/types, and baggage items.
Today I still question the sanity and sensibilities of those who have pushed so hard for developers to manually hardcode a particular type of instrument and model directly within their codebases, and with such verbosity that it hampers code readability even for experienced engineers. That battle seems lost for now.
Manual trace instrumentation makes logging look painless in comparison, when it should not be the case. I would dearly love to see the community reconsider the current approach and go in one of two directions. One direction would be to offer an enterprise workflow interface with hooks to support local and remote tracing needs at various stages and steps in the processing. The other would be to go down the minimalist route with a generic Instrument interface that can dynamically, at runtime, add and mix in different types of measures, collections, and models. There is so much early overcommitment within code when there are other concerns that are far more complicated – like time.
After more than 20 years of building automatic monitoring and observability technologies and toolkits, I have come to question the significance of measuring time and disaggregating it into splits, slices, spans, and whatnot. I can’t help wondering whether time should be measured as an interval when a client is, for the most part, unaware of or unconcerned with its change. Unless the client is a (waiting) human or there is an explicit deadline, maybe we should consider it an unnecessary complication within an already highly complex and fast-changing environment?
It seems strange that there is a complete abandonment of encapsulation concerning how a service executes. What criteria must we use in determining whether a caller of a (micro|cloud)service can peer deep into the inner workings and nested call paths of a requested execution? It does not seem, in general, a scalable or appropriate practice in light of the increased adoption of resilience engineering techniques – which are not always visible to instruments. Request-level measures are from a bygone era. It is all about conversations – more varied interactions and outcomes. We’ve made our systems more tolerant of errors, so maybe it is time to extend that tolerance, explicitly, to time as well?
Timing a service-to-service call seems simple initially, but it is anything but when it comes to distributed tracing. Let’s first start with the synchronous and straightforward case. On one side there is a caller; on the other, a callee. The caller is a client of the callee, which is in turn a server to the caller. There are two primary joinpoints of interception here – the call at a callsite and the execution at a target. I am using terminology familiar from aspect-oriented programming in breaking down a local method invocation.
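As a minimal sketch of these two joinpoints in plain Python (all names here are illustrative, not from any real framework): measuring “at the target” wraps the method body, while measuring “at the callsite” wraps the call expression itself.

```python
import time

def measure_at_target(fn):
    """Interception at the execution target – the callee's side of the joinpoint."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # Time spent inside the target body only.
            wrapper.last_elapsed = time.perf_counter() - start
    return wrapper

@measure_at_target
def handle(request):
    # Hypothetical service operation.
    return f"ok:{request}"

# Interception at the callsite – the caller's side – wraps the call itself,
# so it also includes the (locally tiny) cost of frame and argument setup.
start = time.perf_counter()
result = handle("ping")
call_time = time.perf_counter() - start

print(result)
```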
In the CORBA specification world, a portable interceptor uses either client or server qualifiers. The Google gRPC team has since copied this from CORBA and used it to support distributed tracing across service endpoints.
With remoting, in general, a client transmits a request, and a server replies with a response. Within a local context, we replace requests with arguments and responses with return values. Both local and remote calls involve routing – in the remote case, a service proxy or load balancer; in the local case, a virtual lookup table.
In the code (method-level) profiling world, time is separated into total time and self-time. The self-time, also referred to as the inherent or exclusive time, is the total time minus the time spent calling other methods. Relating this to the remote procedure call (RPC) world, the total time would be the call time measured at the callsite within the caller, and the self-time would be the total time minus the time spent executing at the remote callee target. In general, being able to break down total time into exclusive (self or inherent) and non-exclusive components is useful. It helps isolate the cost of remoting – a cost that could be apportioned to either side.
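A minimal sketch of that decomposition, with purely hypothetical timings:

```python
def self_time(total_ms: float, callee_ms: float) -> float:
    """Exclusive (self/inherent) time: total call time measured at the
    caller's callsite, minus time spent executing at the remote callee."""
    return total_ms - callee_ms

# Illustrative numbers, not from any real trace.
call_total_ms = 120.0   # measured at the caller's callsite
callee_exec_ms = 85.0   # measured within the callee's execution target

# The remainder is the cost of remoting (serialization, network, routing) –
# the cost that could be apportioned to either side of the call.
remoting_cost_ms = self_time(call_total_ms, callee_exec_ms)
print(remoting_cost_ms)  # 35.0
```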
When it comes to local method profiling, there is no significant measurable difference whether you instrument and measure from the caller callsite or from within the target method body as the cost to set up the call frame and arguments is relatively minuscule. However, one could argue that the lazy evaluation and transformation of call arguments can inflate the unit cost model of the callee when the measurement is done from a caller callsite.
In the distributed tracing world, the measurement of a call involves one or more spans. With instrumentation at both ends, whether manual or dynamically injected, a client and a server span will be created – the client span on the caller side and the server span on the callee side. I’m leaving aside the concept of a trace context or flow here, though it should be noted that a span is not fully realized until it is finished, and so the server span will be stored or transmitted to a data backend before the actual dispatch and delivery of the completed client span. Neither party in this service interaction can assess timing-related information about the other at runtime – the complete reconciliation is, in general, done within a backend after span processing.
With both client and server spans, the remote networking cost can be easily calculated; what is not so easy is deciding which side of the caller-to-callee invocation chain should bear responsibility for the cost and be assigned it. What if the service employs a proxy or load balancer? Should that not be included in the server span timing as execution time? What if the caller passes calls through a sidecar as part of a service mesh infrastructure? Surely that is a cost, outside of the control of the callee, that must be assigned to the caller side?
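The easy part of that reconciliation can be sketched as follows; the span fields and timings are hypothetical, and real backends must also contend with clock skew between the two hosts, which is ignored here:

```python
# Two spans recorded for the same call, reconciled post hoc in a backend.
client_span = {"kind": "client", "start_ms": 0.0,  "end_ms": 130.0}
server_span = {"kind": "server", "start_ms": 20.0, "end_ms": 110.0}

def duration(span):
    return span["end_ms"] - span["start_ms"]

# Remote networking (and intermediary) cost: time the client waited that
# the server cannot account for. Which side "owns" this cost – proxy,
# load balancer, sidecar – is the open question.
network_cost_ms = duration(client_span) - duration(server_span)
print(network_cost_ms)  # 40.0
```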
There are various good and bad reasons for deciding the assignment of cost (timing) to either side of the call chain. From this complication and confusion comes the requirement to add more spans to the call trace (or flow) by instrumenting the intermediaries on the network. Instead of componentizing the time, more nodes (spans) are added to the resulting trace tree, making it far more difficult to calculate the self-time as distinct from the server execution time.
To this day, I still routinely question my thinking in this area, especially in an era of rampant microservice adoption. Is the caller the one pushing for process separation, or is it the server? I suppose it is a necessary cognitive and computational cost we all must bear and reflect on in the course of doing business across space and time.
Now imagine having to reason beyond two hops – possible but not very productive, which is why so much engineering effort is spent on simplifying the span model within tools via visualizations and abstractions.
It is not just the intermediaries on the network that complicate the reasoning about trace measurements. With tracing instrumentation now becoming more commonly supported in the client libraries of popular services like Kafka and Cassandra, there is no longer just one client span for a single remote service invocation. In essence, a client span can be created and started in the application space, and another child span created and nested within this parent client span from within the client library. This additional measurement can be useful when the service or operation name used by the client library to name a span is far too generic, such as “send”, “dispatch”, or “call”.
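A sketch of that nesting, with a toy span type (the structure is illustrative; “checkout.place_order” is a hypothetical application-level name, and “send” stands in for the overly generic names some client libraries emit):

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    kind: str                                  # "client" on this side of the call
    children: list = field(default_factory=list)

# Application code opens a client span with a meaningful name...
app_span = Span(name="checkout.place_order", kind="client")

# ...and the instrumented client library (say, a messaging producer)
# nests its own, generically named client span inside it.
lib_span = Span(name="send", kind="client")
app_span.children.append(lib_span)

# The generic library span is only interpretable via its parent.
path = f"{app_span.name} -> {app_span.children[0].name}"
print(path)  # checkout.place_order -> send
```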
Would it not be better to break up time into components representing the nested scopes of callers? I believe so, and many tools do so, but only after storing and post-processing far too many needless span records. The community had a chance to leave behind the past and offer a choice in how to model multistaged timed execution flow in terms of a mix of spans or time components. Instead, we have span multiplicity with a single time measure. Simple?
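The alternative argued for here can be sketched as a single timed record whose total is broken into named components, rather than one span per intermediary; the component names and values below are hypothetical:

```python
# One record, componentized – instead of four separate spans, each carrying
# a single time measure, to be stored and reconciled in a backend.
components = {
    "caller.self": 10.0,   # time in the caller before/after the call
    "sidecar":      8.0,   # service-mesh proxy on the caller side
    "network":     12.0,   # wire transfer and routing
    "callee.exec": 85.0,   # server-side execution
}

total_ms = sum(components.values())
print(total_ms)  # 115.0
```

With this shape, the self-time of any scope falls out by subtraction within one record, rather than by walking and matching nodes in a trace tree.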
We’ll have to see how it plays out, but with increasing levels of one-way calls and asynchronous chaining, it might be time to think in terms of queues (length) and throughput all over again – at least internally within systems.
Time will tell, or maybe not.