If you’ve spent any time at all reading about Instana, then you’ve likely come across the claim that we trace every request. I write “claim,” but it is in fact true – Instana traces every request.
The Instana SaaS backend currently analyses more than one million traces per second. That is an enormous volume of data, and Instana’s engineering team has spent considerable time optimising the code to achieve this processing capability with minimal resource utilisation. Other observability platforms sample traces instead. They argue that the vast majority of requests are processed perfectly, and that those traces are therefore of no interest. This article explores both arguments.
First, a quick recap of how tracing works. The tracer is a separate thread inside the instrumented language runtime. Instrumentation code, injected either manually or automatically, collects spans and passes them to the tracer for dispatch to a backend. Running the tracer on a separate thread prevents it from blocking regular code execution. The tracer thread receives the collected spans, batches them into traces and sends them to the backend, handling authentication, retries and so on. The backend receives traces from the various tracers, stitches them together into end-to-end traces, and handles storage and the user interface.
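The span-collection flow described above can be sketched in a few lines of Python. This is an illustrative toy, not Instana’s actual tracer: the class name, batch size and in-memory "dispatch" list are all assumptions standing in for a real agent’s network send with authentication and retries.

```python
import queue
import threading

class Tracer:
    """Toy sketch of an in-process tracer (hypothetical API, not Instana's).

    Application threads call record(); a background thread drains the queue,
    batches spans and "dispatches" them (a real tracer would POST to a
    backend, handling auth and retries).
    """

    def __init__(self, batch_size=3):
        self.spans = queue.Queue()
        self.batch_size = batch_size
        self.dispatched = []  # stands in for the network send
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def record(self, span):
        # Called from application threads; enqueueing is cheap, so the
        # instrumented code is not blocked by dispatch work.
        self.spans.put(span)

    def _run(self):
        batch = []
        while True:
            span = self.spans.get()
            if span is None:  # shutdown sentinel: flush and stop
                if batch:
                    self.dispatched.append(batch)
                return
            batch.append(span)
            if len(batch) >= self.batch_size:
                self.dispatched.append(batch)
                batch = []

    def shutdown(self):
        self.spans.put(None)
        self._worker.join()

tracer = Tracer()
for i in range(7):
    tracer.record({"span_id": i, "op": "GET /orders"})
tracer.shutdown()
print([len(b) for b in tracer.dispatched])  # → [3, 3, 1]
```

The key design point the sketch illustrates is the queue boundary: the application thread only pays the cost of an enqueue, while batching and dispatch happen off the hot path.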
There are two sampling approaches: head-based and tail-based.
With head-based sampling, the decision on which traces to send to the backend is made by the tracer itself. This increases the resource overhead of the tracer on the language runtime being traced. Additionally, separate metrics must be collected per endpoint for the rate, errors and duration (RED) KPIs, because they cannot be accurately derived from sampled trace data.
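A minimal sketch of a head-based decision follows (an assumption for illustration, not any vendor’s actual algorithm). The decision is typically made deterministically from the trace ID so that every service in the call chain keeps or drops the same traces without coordination; note that it is made before the request runs, so slow or failing requests are dropped at the same rate as healthy ones.

```python
import zlib

def keep_trace(trace_id: str, sample_rate: float = 0.1) -> bool:
    """Head-based sampling sketch: decide up front from the trace id.

    A stable hash (crc32, not Python's salted hash()) means every service
    hashing the same id reaches the same keep/drop decision, so sampled
    traces stay complete end to end.
    """
    bucket = zlib.crc32(trace_id.encode()) % 10_000
    return bucket < sample_rate * 10_000

# The outcome of the request plays no part in the decision:
kept = sum(keep_trace(f"trace-{i}") for i in range(100_000))
print(kept)  # roughly 10,000, i.e. about 10% of traces
```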
With tail-based sampling, every trace is sent to the backend, which then decides which traces to store. Running the baseline algorithm in a separate process reduces the overhead on the instrumented language runtime, and the complete trace stream can be analysed to provide the RED metrics. However, sending every trace raises the possibility of increased charges for trace data egress.
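Because the backend sees the whole trace before deciding, a tail-based rule can inspect outcomes. The sketch below is a hypothetical rule set for illustration: keep a trace if any span carries an error flag or if its duration exceeds a latency threshold.

```python
def tail_sample(trace, p90_latency_ms):
    """Tail-based sampling sketch (hypothetical rules): the complete trace
    is available, so the keep/drop decision can look at errors and latency."""
    has_error = any(span.get("error") for span in trace["spans"])
    is_slow = trace["duration_ms"] > p90_latency_ms
    return has_error or is_slow

traces = [
    {"duration_ms": 40, "spans": [{"op": "db.query"}]},                 # healthy, fast
    {"duration_ms": 35, "spans": [{"op": "db.query", "error": True}]},  # erroneous
    {"duration_ms": 900, "spans": [{"op": "http.get"}]},                # slow outlier
]
kept = [t for t in traces if tail_sample(t, p90_latency_ms=200)]
print(len(kept))  # → 2: the erroneous trace and the slow one are kept
```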
The Problem with Algorithms
Sampling algorithms attempt to ensure that interesting traces are stored for future analysis while the vast majority of normal traces are dropped. What makes a trace interesting? DevOps engineers and SREs use traces to find out where something is broken, or where it can be tuned or optimised. The traces of interest are therefore those that are slower than normal or erroneous. Traces containing spans tagged with an error flag are the easy ones to spot. Determining which traces are performance outliers requires historical data to establish a normal baseline; a moving average (MA) is a typical example. Traces that fall above, say, the 90th percentile can then be identified and stored.
Moving Average baseline example
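A moving baseline like the one pictured can be sketched as follows. This is an illustrative simplification, not any vendor’s production algorithm: it keeps a sliding window of recent latencies and flags a request that lands above the window’s 90th percentile.

```python
from collections import deque

def flag_outliers(latencies, window=50):
    """Sliding-window baseline sketch: flag requests above the window's
    90th percentile. Window size and percentile are illustrative choices."""
    recent = deque(maxlen=window)
    flagged = []
    for i, latency in enumerate(latencies):
        if len(recent) == recent.maxlen:  # baseline needs a full window first
            p90 = sorted(recent)[int(0.9 * len(recent))]
            if latency > p90:
                flagged.append(i)
        recent.append(latency)
    return flagged

# Steady ~100 ms traffic with a single 500 ms spike at request 60:
latencies = [100] * 60 + [500] + [100] * 10
print(flag_outliers(latencies))  # → [60]: only the spike is flagged
```

Note the warm-up behaviour: nothing is flagged until the window is full, which foreshadows the training-time problem discussed below.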
As a final safety net, sampling algorithms impose a maximum throughput limit. For example, no more than 1,000 traces per minute will be captured, interesting or not.
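Such a hard cap can be sketched as a simple fixed-window counter (again an illustrative assumption, not a specific product’s implementation). Everything past the cap is dropped, which is exactly where the blind spots discussed next come from.

```python
import time

class TraceRateLimiter:
    """Hypothetical hard-cap safety net: at most `limit` traces per window,
    interesting or not. Anything past the cap is silently dropped."""

    def __init__(self, limit=1000, window_s=60):
        self.limit, self.window_s = limit, window_s
        self.window_start, self.count = 0.0, 0

    def try_capture(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self.window_start >= self.window_s:
            self.window_start, self.count = now, 0  # new window
        if self.count < self.limit:
            self.count += 1
            return True
        return False  # trace dropped: a potential blind spot

limiter = TraceRateLimiter(limit=3, window_s=60)
captured = [limiter.try_capture(now=10.0) for _ in range(5)]
print(captured)  # → [True, True, True, False, False]
```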
There are a couple of failure modes when sampling: hitting the hard limit, and baseline training time.
One scenario in which the hard limit is hit is cascading errors. In distributed systems, one small downstream problem can have a tsunami ripple effect across connected services. The result is a large number of spans with the error flag set; the hard limit is soon reached, other traces are dropped, and blind spots inevitably follow.
When an entirely new service is deployed, there is no historical data at all. Likewise, when a new version of an existing service is deployed, its historical performance data is no longer valid: it is an invalid assumption that its performance will be similar to the previous release. The new version may add extra functionality at the cost of an acceptable increase in response time. Relying on the stale baseline then falsely triggers the algorithm, capturing unwanted traces and potentially hitting the rate limit, which results in data loss and blind spots. Alternatively, the new release may bring a significant performance improvement over the previous one. Its outliers will not be captured, because they will not rise sufficiently above the lagging baseline. The algorithm will eventually retrain on the new normal, but the lost data is gone forever.
Any calculation of a normal baseline becomes more accurate with more data, and right after a release there is simply not enough history to be accurate. This training lag is a major flaw: trace capture is at its most unreliable exactly when it is needed most, right after a fresh deployment.
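The lag is easy to demonstrate with an exponential moving average, one common baseline choice (used here purely for illustration). After a deployment shifts normal latency from roughly 100 ms to roughly 200 ms, the baseline takes dozens of requests to catch up; in the meantime every request looks like an outlier.

```python
def ema_baseline(latencies, alpha=0.05):
    """Exponential moving average baseline sketch. A small alpha smooths
    noise but makes the baseline slow to adapt after a step change."""
    baseline = latencies[0]
    out = []
    for x in latencies:
        baseline = alpha * x + (1 - alpha) * baseline
        out.append(baseline)
    return out

# Old release: ~100 ms. From request 50, the new release runs at ~200 ms.
latencies = [100] * 50 + [200] * 50
baselines = ema_baseline(latencies)

# Right after the deploy the baseline still reflects the old release
# (~105 ms at request 50), so the new normal is flagged as anomalous;
# fifty requests later it has only climbed to ~192 ms.
print(round(baselines[50]), round(baselines[99]))  # → 105 192
```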
To Sample or Not to Sample, That is the Question
Apologies to the great bard. With any form of sampling, Murphy’s Law will always hold – the traces you need are the ones that were dropped. The only way to ensure that your team has all the traces they need is to trace every request. Only Instana traces every request end to end without any sampling. No matter what happens, DevOps engineers will always have all the data they need to quickly validate a release and perform a post-mortem when something breaks. Tracing every request also provides the high-density data SREs need to find optimisation opportunities.