Tracing is a core element of every Application Performance Management (APM) tool. Gartner calls it “User-Defined Transaction Profiling”, and it provides two important kinds of value to the users of APM tools:
- Understand transaction flows through all involved components to get a better view of the application architecture and distributed call patterns.
- Troubleshoot performance problems by getting method/code level details including execution times.
The current generation of APM tools and their tracing capabilities focus mainly on the second point, with what I call the “callstack”. At Instana, we believe this focus is wrong for a couple of reasons, and we came to the conclusion that the first point is much more important, especially in highly distributed microservices environments. In this blog post I am going to explain our reasoning.
Callstacks
To set the stage for understanding tracing, I will give a quick introduction to how callstacks work: a callstack is an ordered list of code executions. Whenever code invokes other code, the new code is put onto the top of said stack. Callstacks are used by the runtimes of all programming languages and can usually be seen printed out as a “stacktrace” when errors occur, as they allow “tracing back” through the calls that led to the error.
For example, an error message might say “Apple is not a number”. Combined with the callstack, it is possible to narrow down where in a complex system the error actually occurred. The message alone is usually insufficient, as the NumberUtil algorithm might be used in many places.
Thread.run()
  HttpFramework.service()
    HttpFramework.dispatch()
      ShoppingCart.update()
        ShoppingCart.updateCart()
          ShoppingCart.parseQuantity()
            ShoppingCart.convertParams()
              NumberUtil.convert()  <-- Error: "Apple is not a number"
Developers can use this callstack to trace back from the error to the relevant business method (ShoppingCart.parseQuantity()) to get a better understanding of why the error occurred.
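To make this concrete, here is a rough sketch of what the code behind such a callstack might look like. The class and method names are taken from the example above; the method bodies and the Map-based parameters are assumptions, added purely for illustration.

import java.util.Map;

class NumberUtil {
    static int convert(String value) {
        // Non-numeric input such as "Apple" ends up here and triggers the error
        if (value == null || !value.matches("\\d+")) {
            throw new NumberFormatException(value + " is not a number");
        }
        return Integer.parseInt(value);
    }
}

class ShoppingCart {
    void update(Map<String, String> params) {
        updateCart(params);
    }

    void updateCart(Map<String, String> params) {
        int quantity = parseQuantity(params);
        // ... apply the quantity to the cart ...
    }

    int parseQuantity(Map<String, String> params) {
        return convertParams(params);
    }

    int convertParams(Map<String, String> params) {
        // Each nested call is pushed onto the callstack, so the stacktrace of the
        // NumberFormatException leads back through parseQuantity() to update().
        return NumberUtil.convert(params.get("quantity"));
    }
}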
Distributed Tracing
When service-oriented architectures appeared ten years ago, the callstack was broken apart. The “ShoppingCart” logic might now reside on server A, while “NumberUtil” resides on server B.
Suddenly, the error trace on server B contained only the short callstack of the parse error, while on server A a new error was produced, telling us that something went wrong on server B but not what the problem was. Instead of a single error callstack which allowed troubleshooting the problem, one now ended up with two callstacks and two errors. Even worse, there was no connection between the two, sometimes making it impossible to have access to both at the same time.
Server A:
Thread.run()
  HttpFramework.service()
    HttpFramework.dispatch()
      ShoppingCart.update()
        ShoppingCart.updateCart()
          ShoppingCart.parseQuantity()
            ShoppingCart.convertParams()
              RestClient.invokeConversion()  <-- Error: Unknown
Server B:
Thread.run()
  HttpFramework.service()
    HttpFramework.dispatch()
      NumberUtil.convert()  <-- Error: "Apple is not a number"
This is where distributed tracing came into the picture. The idea behind distributed tracing was to fix this problem by connecting the two error callstacks with each other. While many ideas exist for how to do that, most implementations use a simple mechanism: when server A calls server B, the APM tool adds an identifier to the call, so this shared identifier can be used to join the callstacks in the APM system. Such a mechanism is usually called “correlation”.
Thread.run()
  HttpFramework.service()
    HttpFramework.dispatch()
      ShoppingCart.update()
        ShoppingCart.updateCart()
          ShoppingCart.parseQuantity()
            ShoppingCart.convertParams()
              RestClient.invokeConversion()
                Thread.run()
                  HttpFramework.service()
                    HttpFramework.dispatch()
                      NumberUtil.convert()  <-- Error: "Apple is not a number"
Much better: with some added decoration showing where the remote call actually takes place and which parts of the callstack were executed on which server, this again allows finding out that the “ShoppingCart” was the context of the error, and that “NumberUtil” caused the shopping cart activity to fail.
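In code, this correlation is conceptually very simple. The sketch below assumes a plain HTTP call and a hypothetical header name (“X-Trace-Id”); real APM agents inject and read such an identifier transparently, without any change to the application code.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class RestClient {
    private final HttpClient client = HttpClient.newHttpClient();

    String invokeConversion(String value, String traceId) throws Exception {
        // Server A attaches the shared identifier to the outgoing call ...
        HttpRequest request = HttpRequest
                .newBuilder(URI.create("http://server-b/convert?value=" + value))
                .header("X-Trace-Id", traceId)
                .build();
        // ... server B reads the same header from the incoming request and tags its
        // own callstack with it, so the APM backend can join the two callstacks.
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}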
Measuring Performance
While the examples so far illustrated error tracing, APM tools use the same mechanism for taking and presenting performance measurements. They annotate the trace with performance numbers like this:
413 Thread.run()
413   HttpFramework.service()
413     HttpFramework.dispatch()
412       ShoppingCart.update()
411         ShoppingCart.updateCart()
211           ShoppingCart.parseQuantity()
210             ShoppingCart.convertParams()
200               RestClient.invokeConversion()
 10                 Thread.run()
 10                   HttpFramework.service()
 10                     HttpFramework.dispatch()
  5                       NumberUtil.convert()
The total time for executing that shopping cart update was apparently 413ms. The actual number conversion took 5ms. The time spent in between is distributed among many calls, so one usually looks for bigger cliffs: in this case, updating the cart takes 411ms in total, while the parsing itself only requires 211ms, most of which was spent doing the remote call.
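Conceptually, these numbers are taken by recording a timestamp before and after each instrumented call and attaching the elapsed time to the corresponding node of the trace. The helper below is a minimal illustration of that idea, not Instana's actual instrumentation API.

import java.util.function.Supplier;

class TimedCall {
    static <T> T time(String name, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // An APM agent would attach this to the trace node; here we just print
            // it, e.g. "211 ShoppingCart.parseQuantity()"
            System.out.println(elapsedMs + " " + name);
        }
    }
}

Wrapping a call like TimedCall.time("ShoppingCart.parseQuantity()", () -> parseQuantity(params)) yields exactly the kind of per-node timing shown above.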
In most APM tools, the trace has become the core element and central domain of the product. We saw that, because of this focus, most customers of existing APM solutions do not actually use the tool for operations and monitoring, but only for troubleshooting performance issues or errors. Callstacks themselves are insufficient for monitoring: they are not easy to read for users of the APM tool, and they do not provide the information needed to correlate the performance and availability of a system with its overall health. System and application health needs to take into account not only what happens during a code execution, but also much more correlated information, such as process activity, resource usage, queuing, access patterns, load and throughput.
Tracing at Instana
At Instana we are building the next generation of APM. We want to provide DevOps teams with the right information to manage their environments. Modern IT environments can consist of many different components, ranging from a few monolithic applications to thousands of microservices. These components are individually configurable and manageable; they can be rescaled or restarted. They are connected by application code into an ever-changing, complex composition that performs business-relevant activities and adapts to constant technical and business change. Instead of focusing on the callstack, which would be just a small part of this big picture, we focus comprehensively on system and application health. We use tracing to understand the interaction between different components and the messages that are passed between them. We enhance the trace with information about the underlying service, application and system infrastructure using our unique Dynamic Graph technology. This way we understand how a system of services works together, and we understand the impact on traces and applications if one of the components fails.
In case of errors or slow performance we provide detailed context, so that all the data needed for troubleshooting is available. We don’t collect this data (including the callstack) for every trace, as that is a very invasive task that can cause overhead and side effects in the code itself.
For the tracing implemented in Instana, we took inspiration from “Dapper”, a paper describing distributed tracing at Google, and the OpenTracing initiative started by developers of “Zipkin”, a Dapper-like tracing system developed at Twitter. Google and Twitter both run highly distributed service architectures and faced exactly the problems many microservice users have today.
What’s more, we also contribute to OpenTracing, advancing the state of the art. There are a couple of key takeaways from Dapper and OpenTracing I want to highlight:
- The precise lines of code being executed are interesting neither to operations nor to development.
  For operations, code changes are rarely possible, which turns most APM tools into development-focused tools. But even for development departments, most of the code-level data contained in callstacks is noise: it is either unchangeable runtime code, framework code or third-party code.
- Interaction of services and components matters.
  Most problems can be resolved once the failing component or service has been identified. While there can be internal errors, service availability and interaction are the most important things to monitor. Instana also uses tracing to understand the logical dependencies of components, which feeds into its Dynamic Graph.
- Remote calls are especially interesting due to their inherent slowness.
  While many APM tools focus on in-process callstacks, it turns out that most of the time these are really fast. Performance most often comes into play when accessing remote services.
- One-off problems are time wasters to troubleshoot; insights can be gained even when looking at a very low percentage of all traces.
  Every engineer of course wants to make sure his or her code runs as fast as possible, but it is just not worth investigating one call out of a thousand, even when it is 100ms slower than the average.
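This is also the view the OpenTracing API exposes: a trace is built from “spans” around business-relevant operations, not from full callstacks. As a rough sketch using the OpenTracing Java API (io.opentracing); how the Tracer instance is obtained (here via GlobalTracer) and the handler class around it are assumptions of this example.

import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

class CartUpdateHandler {
    void handleUpdate() {
        Tracer tracer = GlobalTracer.get();
        // One span for the business operation; framework and runtime frames
        // in between are deliberately not recorded.
        Span span = tracer.buildSpan("ShoppingCart.update").start();
        try {
            // ... update the cart, call the remote number service ...
        } finally {
            span.finish();
        }
    }
}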
Going back to the example above, this is roughly how Instana will show that transaction:
Service A:  ShoppingCart.update - 412ms
Service A:    RestClient.invokeConversion - 200ms
Service B:      NumberService - 5ms
This is a more visual representation of call nesting and duration, reduced to the important parts: it shows where time was spent and where remote calls took place. More importantly, it connects to our Dynamic Graph, which contains the information that, for example, the CPU on the server running Service B was overloaded, making it very easy to deduce a correlation and a possible root cause. Should other information, like service URLs or database queries, be relevant, Instana will of course capture that as well.
We are very excited about our tracing capabilities and the value they deliver. By rethinking what is really needed, we came up with a solution which cuts away most of the noise and presents application behaviour in a much more readable fashion, without compromising the details needed to improve application code. By connecting tracing to our Dynamic Graph, we can correlate important information from outside the callstack.