Note: This page is about the general concept of distributed tracing and how it is implemented in Instana AutoTrace™. Refer to the Tracing in Instana page for more information on which technologies and runtimes can be traced with Instana.
Tracing, or User-Defined Transaction Profiling in Gartner's terminology, is at the core of every Application Performance Management tool. At Instana, we focus on providing a comprehensive view of your application architecture and distributed call patterns by understanding how transactions flow through all connected components. This approach is especially relevant in highly distributed microservice environments.
A trace represents a single request and its path through a system of services. It may be triggered directly by a request from a customer's browser, but it can also be initiated by a scheduled job or any other internal execution. Each trace is made up of one or more calls.
A call represents communication between two services: a request and its response, which may be synchronous or asynchronous. Each call is a set of data and time measurements corresponding to a particular RPC or service call. Within our UI, we highlight each type of call, such as HTTP, messaging, database, batch, or internal.
To capture this data, we measure on both the caller and the callee side, which is crucial in distributed systems. In distributed tracing, these individual measurements are called spans.
An internal call is a particular type of call that represents work done inside a service. It can be created from intermediate spans sent through custom tracing. If you prefer to write your own custom instrumentation, Instana supports OpenTracing, OpenCensus, the Web Trace SDK, and many language-specific tracing SDKs.
Calls can represent operations that resulted in errors. For example, a call representing an HTTP operation could return a 5xx status code, or the invocation of an API through Java RMI may throw an exception. Such calls are considered erroneous, and we mark them accordingly in the UI.
Note: HTTP calls resulting in a status code 4xx are not considered erroneous, as 4xx are defined as client-side errors.
As shown in the image above, error logs are shown in the call they are associated with. Instana automatically collects logs with level ERROR (or equivalents, depending on the logging framework). In the image, the call is erroneous and has one error log associated with it. In general, however, a call may be erroneous without having error logs associated with it, and vice versa.
The name span is derived from Google's Dapper paper and is short for timespan. A span represents the timing of a code execution: an action with a start and an end time. It also carries a set of data consisting of a timestamp and a duration; different types of spans can have one or several such sets, complete with metadata annotations. Every trace comprises a hierarchical set of spans, linked by 64-bit identifiers used for reference between parent (caller) and child (callee) spans. In each trace, the first span serves as the root, and its 64-bit identifier is the identifier for the whole trace.
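A minimal sketch of this span model may help; the class and field names below are illustrative, not Instana's implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the span model described above: each span carries
// a random 64-bit identifier, a reference to its parent, and timing data.
// The root span's identifier doubles as the identifier of the whole trace.
public class SpanModel {

    static class Span {
        final long spanId = ThreadLocalRandom.current().nextLong();
        final Long parentId;   // null for the root span
        final long traceId;    // equal to the root span's spanId
        final String name;
        long startMillis;
        long durationMillis;
        final List<Span> children = new ArrayList<>();

        Span(String name, Span parent) {
            this.name = name;
            if (parent == null) {
                this.parentId = null;
                this.traceId = this.spanId;  // root id == trace id
            } else {
                this.parentId = parent.spanId;
                this.traceId = parent.traceId;
                parent.children.add(this);
            }
        }
    }

    // Builds a two-span trace and checks the parent/child references.
    static boolean demoSameTrace() {
        Span entry = new Span("HTTP GET /cart", null);  // root: entry span
        Span exit = new Span("RPC convert", entry);     // child: exit span
        return entry.traceId == exit.traceId
                && exit.parentId == entry.spanId
                && entry.children.contains(exit);
    }

    public static void main(String[] args) {
        System.out.println(demoSameTrace()); // prints true
    }
}
```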
The first span of a particular service indicates that a call entered the service, and is called an entry span (named "server span" in the Dapper paper). Spans of calls leaving a service are called exit spans (named "client span" in the Dapper paper). In addition to entry and exit spans, intermediate spans mark significant sections of code, so that run time can be clearly attributed to the correct code.
Each span has an associated type, such as HTTP call or database connection, and depending on the type of span, additional contextual data is also associated. To follow a sequence of spans across services, Instana sends correlation headers automatically with instrumented exits, and those correlation headers are automatically read by Instana’s entries. For more information, see HTTP Tracing Headers.
A callstack is an ordered list of code executions. Whenever code invokes other code, the new code is put onto the top of the stack. Callstacks are used by runtimes of all programming languages and are usually printed out as a stacktrace. When an error occurs, the stacktrace allows you to trace back to the calls that led to the error.
For example, the following error message states "Apple is not a number". Combined with the callstack, it's possible to narrow down where in a complex system the error occurred. The message alone is usually insufficient, as the NumberUtil algorithm might be used in many places.
Thread.run()
  HttpFramework.service()
    HttpFramework.dispatch()
      ShoppingCart.update()
        ShoppingCart.updateCart()
          ShoppingCart.parseQuantity()
            ShoppingCart.convertParams()
              NumberUtil.convert() <-- Error: "Apple is not a number"
To understand why the error occurred, use the callstack to trace back from the error to the relevant business method, which in this case is ShoppingCart.updateCart().
Callstacks by themselves are insufficient for monitoring. They are not easy to read, and they provide no information that correlates the performance and availability of a system with its overall health. To understand what happens during a code execution, much more information needs to be correlated: process activity, resource usage, queuing, access patterns, load, and throughput, as well as system and application health.
With the introduction of service-oriented architectures (SOA), the callstack was broken apart. For example, the ShoppingCart logic may now reside on server A, while NumberUtil resides on server B. An error trace on server B only contains the short callstack of the parse error, while server A produces a new error stating that something went wrong on server B, without stating what the problem was. Instead of a single, easy-to-troubleshoot error callstack, you end up with two callstacks and two errors, with no connection between them that would let you look at both at the same time.
Thread.run()
  HttpFramework.service()
    HttpFramework.dispatch()
      ShoppingCart.update()
        ShoppingCart.updateCart()
          ShoppingCart.parseQuantity()
            ShoppingCart.convertParams()
              RestClient.invokeConversion() <-- Error: Unknown

Thread.run()
  HttpFramework.service()
    HttpFramework.dispatch()
      NumberUtil.convert() <-- Error: "Apple is not a number"
The idea behind distributed tracing is to fix this problem by connecting the two error callstacks with each other. Most implementations use a simple mechanism to do so: when server A calls server B, the APM tool adds an identifier to the call that serves as a common reference point between the callstacks in the APM system. This mechanism is called correlation, and it joins the two callstacks so they can be presented as one error.
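The joining step can be sketched as follows; the identifier, method names, and data structures are illustrative, not how any particular APM backend stores its data:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of correlation: server A tags the outgoing call with
// an identifier, server B records its callstack under the same identifier,
// and the backend joins the two callstacks on that identifier.
public class Correlation {

    static String joinByCorrelationId(Map<String, List<String>> serverA,
                                      Map<String, List<String>> serverB,
                                      String correlationId) {
        List<String> joined = new ArrayList<>(serverA.get(correlationId));
        joined.addAll(serverB.get(correlationId));
        return String.join(" -> ", joined);
    }

    // Recreates the tail ends of the two broken callstacks from the example.
    static String demo() {
        String id = "trace-42"; // identifier sent along with the call from A to B
        Map<String, List<String>> serverA = Map.of(id, List.of(
                "ShoppingCart.convertParams()", "RestClient.invokeConversion()"));
        Map<String, List<String>> serverB = Map.of(id, List.of(
                "NumberUtil.convert() <-- Error: \"Apple is not a number\""));
        return joinByCorrelationId(serverA, serverB, id);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```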
Thread.run()
  HttpFramework.service()
    HttpFramework.dispatch()
      ShoppingCart.update()
        ShoppingCart.updateCart()
          ShoppingCart.parseQuantity()
            ShoppingCart.convertParams()
              RestClient.invokeConversion()
                Thread.run()
                  HttpFramework.service()
                    HttpFramework.dispatch()
                      NumberUtil.convert() <-- Error: "Apple is not a number"
Added decoration, showing where the remote call takes place and on which servers the parts of the callstack were executed, enables you to see that the ShoppingCart was the context of the error, and that NumberUtil caused the shopping cart activity to fail.
The preceding examples illustrated error tracing; however, APM tools use the same mechanism for taking and presenting performance measurements. The trace is annotated with performance numbers like this:
413 Thread.run()
413   HttpFramework.service()
413     HttpFramework.dispatch()
412       ShoppingCart.update()
411         ShoppingCart.updateCart()
211           ShoppingCart.parseQuantity()
210             ShoppingCart.convertParams()
200               RestClient.invokeConversion()
 10                 Thread.run()
 10                   HttpFramework.service()
 10                     HttpFramework.dispatch()
  5                       NumberUtil.convert()
The total time for executing the shopping cart update was approximately 413ms. The number conversion (NumberUtil.convert()) took 5ms. The time in between is distributed among many calls, so you are looking for bigger cliffs. In the example, updating the cart (ShoppingCart.updateCart()) took a total of 411ms, while the parsing (ShoppingCart.parseQuantity()) only required 211ms, most of which was in turn spent doing the remote call.
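The cliff-spotting arithmetic boils down to self time, the difference between a frame's total time and the total time of its direct child; a small sketch using the millisecond values from the example above:

```java
// Self time of a frame = its total time minus its direct child's total time.
// Big self-time values are the "cliffs" worth investigating. The numbers
// below are taken from the annotated example trace.
public class SelfTime {

    static long selfTime(long totalMillis, long childTotalMillis) {
        return totalMillis - childTotalMillis;
    }

    public static void main(String[] args) {
        // updateCart took 411ms overall, its child parseQuantity 211ms:
        System.out.println(selfTime(411, 211)); // prints 200 -- a cliff
        // parseQuantity took 211ms, convertParams 210ms: almost no self time
        System.out.println(selfTime(211, 210)); // prints 1
        // convertParams took 210ms, the remote call 200ms
        System.out.println(selfTime(210, 200)); // prints 10
    }
}
```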
In the case of errors or slow performance, we provide detailed context, so that all the data required for troubleshooting a particular case is available. We don't collect this data, including the callstack, for every trace, because doing so is invasive and can cause overhead and side effects in the code itself.
Referring to the preceding example, this is how Instana displays the transaction:
Service A | ShoppingCart.update           - 412ms |
Service A |   RestClient.invokeConversion - 200ms |
Service B |     NumberService             -   5ms |
This is a clearer visual representation of call nesting and length, as it's reduced to the critical parts, showing where time is spent and where remote calls took place. It also connects to our Dynamic Graph, which knows that the CPU of the Service B server is overloaded, and for root cause analysis it can correlate this with the above transaction. We also capture other relevant information, such as service URLs or database queries.
To ensure trace continuity across different services, our tracing feature utilizes three HTTP headers:

X-INSTANA-T: the TRACE_ID of the preceding exit span.
X-INSTANA-S: the SPAN_ID of the preceding exit span.
X-INSTANA-L: carries information about the trace level; 0 means that no trace is recorded.

If you have any firewalls, proxies, or similar infrastructure in place that operates on an HTTP header whitelist, add these headers to it. To make this future-proof, we recommend that you also allow traceparent and tracestate, as these are the W3C-standardized headers for tracing purposes.
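As a minimal sketch of how these correlation headers travel, the following shows an exit injecting the Instana headers into an outgoing request and the receiving entry reading them back; the helper methods are hypothetical, not Instana's SDK API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of header-based correlation: the caller's exit writes
// trace and span identifiers into the outgoing HTTP headers, and the entry
// on the callee side reads them back, so both measurements end up in the
// same trace.
public class HeaderPropagation {

    static Map<String, String> inject(String traceId, String exitSpanId) {
        Map<String, String> headers = new HashMap<>();
        headers.put("X-INSTANA-T", traceId);    // TRACE_ID of the trace
        headers.put("X-INSTANA-S", exitSpanId); // SPAN_ID of the preceding exit span
        headers.put("X-INSTANA-L", "1");        // trace level: 0 would suppress tracing
        return headers;
    }

    // Returns {traceId, parentSpanId}, or null when tracing is suppressed.
    static String[] extract(Map<String, String> headers) {
        if ("0".equals(headers.get("X-INSTANA-L"))) {
            return null;
        }
        return new String[] { headers.get("X-INSTANA-T"), headers.get("X-INSTANA-S") };
    }

    public static void main(String[] args) {
        Map<String, String> headers = inject("ab87ca6f2e784c39", "1d9eff3f04f08bd1");
        String[] parent = extract(headers);
        System.out.println(parent[0]); // the callee continues trace ab87ca6f2e784c39
    }
}
```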
Trace continuity means that all calls triggered by one external request are collected into one trace. Instana uses protocol-specific means to add this metadata, such as JMS headers, Kafka headers, RabbitMQ headers, or gRPC metadata, to name a few. This ensures trace continuity when the communication doesn't happen via HTTP.
Communication protocols without support for any metadata do not support trace continuity. This means that when calling another service over such a protocol, the outgoing call will be a leaf in the trace tree. The work happening in the receiver of the call will not be part of that trace. Instead, receiving the call starts a new trace and all subsequent calls triggered in the receiver will belong to this new trace.
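This behavior can be sketched as follows; the helper is hypothetical, not Instana's API:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of broken trace continuity: if no trace identifier
// arrives with a call (because the protocol cannot carry metadata), the
// receiver has nothing to continue and starts a brand-new trace.
public class Continuity {

    static long continueOrStartTrace(Long incomingTraceId) {
        return incomingTraceId != null
                ? incomingTraceId                         // join the caller's trace
                : ThreadLocalRandom.current().nextLong(); // start a new trace
    }

    public static void main(String[] args) {
        long upstreamTrace = 42L;
        // Protocol with metadata support: the receiver joins the existing trace.
        System.out.println(continueOrStartTrace(upstreamTrace)); // prints 42
        // Protocol without metadata support: the identifier is lost in transit,
        // so the receiver's work lands in a new, unrelated trace.
        long newTrace = continueOrStartTrace(null); // a fresh random trace id
    }
}
```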
Trace continuity is not supported in the following cases:

Node.js applications that work with Kafka through the npm package kafka-node: that package does not support message headers. We recommend using the npm package kafkajs instead; with kafkajs, trace continuity is supported automatically.