Recently there has been a lot of discussion around OpenTracing and APM. It all started with a Gartner research note about Application Performance Management (APM) in a microservice world, titled Advance Your Application Performance Monitoring Strategy to Support Microservices. The paper outlines the need for a new approach to APM in ephemeral, microservice-based applications. It highlights the dynamic nature of these architectures and argues that it is impossible to statically map service quality to the underlying infrastructure. To collect the required data, Gartner also recommends using open source projects like OpenTracing to capture business transaction flows in these environments.
The Reaction of Old-School APM vendors to OpenTracing
This started a chain of reactions from classical APM vendors. It began with Jonah Kowall, a former Gartner APM analyst and today VP of market development at AppDynamics, writing about Misunderstanding “Open Tracing” for the Enterprise. He criticizes that OpenTracing is not a broadly adopted open standard and explains that enterprises have highly heterogeneous environments that require automated, agent-based tracing. He closes by arguing that lock-in with vendor agents is not a major issue because agents are easy to replace.
Alois Reitbauer, Chief Technology Strategist at Dynatrace, responds in his blog post A CTO’s strategy towards OpenTracing that OpenTracing is not a standard, and agrees with Jonah about the need to standardize TraceContext.
Jonah and Alois support OpenCensus as a project and strategy that aligns with the concept of standardizing the automatic collection of traces and metrics.
Adrian Cole, Zipkin project lead/initiator and also an early contributor to OpenTracing, replied to these discussions with his post My take on: Misunderstanding “Open Tracing” for the Enterprise – a very interesting view on the world of tracing and Open Source projects in this space.
Why has a relatively simple technology like OpenTracing touched off such a response and discussion? Jonah, Alois, Adrian and the Gartner authors all make good points. Our opinion is that a bigger shift is in play, beyond the adoption of new microservice architectures and ephemeral approaches to constructing applications. The need for OpenTracing is driven by a shift in relative power within the application delivery organization. We just published a blog post describing this organizational change in The New Application Delivery Organization.
Developers: The New Kingmakers
In these new organizations, developers play an increasingly influential role, as described in The New Kingmakers. They are also important users of APM. At Instana, we think of this as The Democratization of APM.
Not only does this mean that the APM user experience must better meet the expectations and use cases of a developer, but also that APM needs a new interface to this end user: code. In some situations (much more common in microservice applications), developers will want to instrument their services as easily as they add log statements for debugging. In fact, tracing is not much different from logging, except that it has predefined semantics and an approach for correlating the “log messages”: the trace context discussed earlier.
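To make the comparison with logging concrete, here is a minimal sketch in Python. It treats each span as a structured log record that carries a shared trace id and a pointer to its parent. The function names and record fields (`new_trace_context`, `record_span`, `trace_id`, `parent_id`) are illustrative assumptions, not part of OpenTracing or of any vendor's format.

```python
import json
import time
import uuid


def new_trace_context():
    """Start a new trace: one shared trace id, no parent span yet."""
    return {"trace_id": uuid.uuid4().hex, "parent_id": None}


def record_span(context, operation, records):
    """Append one span record and return the context to hand to child work."""
    span_id = uuid.uuid4().hex[:16]
    records.append({
        "trace_id": context["trace_id"],    # correlates all records of one request
        "span_id": span_id,
        "parent_id": context["parent_id"],  # gives the records a call hierarchy
        "operation": operation,
        "timestamp": time.time(),
    })
    return {"trace_id": context["trace_id"], "parent_id": span_id}


records = []
root = new_trace_context()
checkout = record_span(root, "POST /checkout", records)
record_span(checkout, "SELECT orders", records)

# Each record looks like a log line, but the ids allow a tool to rebuild the call tree.
for record in records:
    print(json.dumps(record))
```

The only difference from an ordinary log line is the agreed set of fields and the ids that link one record to another.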
OpenTracing does exactly what developers need to implement their own instrumentation. It provides an API and creates a facade that lets vendors plug in their “Tracer”, similar to what SLF4J (https://www.slf4j.org/) did for logging in Java. In addition, OpenTracing describes the model of a trace and its semantics. Finally, OpenTracing provides libraries to instrument code in different languages such as Go, Java or PHP.
That is basically it, and OpenTracing did a good job of defining this contract for developers. The core concept of OpenTracing so far is not to standardize the protocol for APM vendors, nor is that required to be successful in the developer space. As a project of the Cloud Native Computing Foundation (CNCF), it is also part of an open organization that drives the new age of cloud native applications. The CNCF also hosts Jaeger, a distributed tracing system that supports OpenTracing.
This doesn’t mean that standardizing trace context makes no sense. I personally think that it makes a lot of sense, and that OpenCensus and the W3C Trace Context standardization effort are great initiatives. The developer just does not really care about it; it is more us vendors who care. It would also provide some value to users who run different tools and could benefit from the compatibility.
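The facade idea can be sketched in a few lines of Python. This is a simplified stand-in written for illustration, not the actual OpenTracing API. The names `Tracer`, `NoopTracer`, `InMemoryTracer`, `set_tracer` and `get_tracer` are hypothetical. The point is that application code depends only on the interface, while the backend is chosen once at startup.

```python
import time
from abc import ABC, abstractmethod
from contextlib import contextmanager


class Tracer(ABC):
    """Hypothetical facade interface; application code depends only on this."""

    @abstractmethod
    def report(self, operation, duration_seconds, tags):
        """Backend-specific delivery of a finished span."""

    @contextmanager
    def span(self, operation, **tags):
        start = time.perf_counter()
        try:
            yield tags  # the caller may add more tags while the span is open
        finally:
            self.report(operation, time.perf_counter() - start, tags)


class NoopTracer(Tracer):
    """Default backend: instrumentation stays in place but does nothing."""

    def report(self, operation, duration_seconds, tags):
        pass


class InMemoryTracer(Tracer):
    """Stand-in for a vendor backend; keeps finished spans in a list."""

    def __init__(self):
        self.finished = []

    def report(self, operation, duration_seconds, tags):
        self.finished.append((operation, duration_seconds, dict(tags)))


_tracer = NoopTracer()


def set_tracer(tracer):
    """Swap the backend without touching instrumented code."""
    global _tracer
    _tracer = tracer


def get_tracer():
    return _tracer


def load_user(user_id):
    # Instrumented application code: it never names a concrete backend.
    with get_tracer().span("load_user", user_id=user_id) as tags:
        tags["cache_hit"] = False
        return {"id": user_id}


backend = InMemoryTracer()
set_tracer(backend)
load_user(42)
print(backend.finished[0][0], backend.finished[0][2])
```

Replacing `InMemoryTracer` with another implementation changes where spans go without any change to `load_user`, which is the same decoupling SLF4J offers for log statements.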
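To illustrate what a standardized trace context looks like on the wire, the following Python sketch reads and writes the `traceparent` HTTP header defined by the W3C Trace Context specification: a version, a 32-character trace id, a 16-character parent span id and a flags byte, all in lowercase hex. The helper names `parse_traceparent` and `child_traceparent` are hypothetical, and the sketch accepts only the version `00` layout.

```python
import re
import secrets

# version "00" - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return the incoming context as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero ids are defined as invalid
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(context):
    """Build the header for an outgoing call: same trace id, fresh span id."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if context["sampled"] else "00"
    return f"00-{context['trace_id']}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
context = parse_traceparent(incoming)
print(context["trace_id"], context["sampled"])
print(child_traceparent(context))
```

If every tool in the call chain reads and writes this header, the trace id survives hops between services monitored by different vendors.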
The Instana way – A Polyglot Tracing approach
At Instana, we support OpenTracing as a key data source for our APM because we understand the great value it provides to developers who want to manually instrument their code. From that same perspective, we also support AWS X-Ray for developers who instrument their Lambda functions and/or AWS services. Of course we support OpenCensus too. For Instana, all of these trace technologies are excellent sources of contextualized data from which we can derive and present an automated understanding of performance and service quality to our APM users. Once the tracing data is available, Instana leverages it automatically.
We also see great value in agent-based automated tracing, especially for enterprises. As Jonah describes in his blog, enterprises have hundreds of applications, built internally and externally, and they need an end-to-end view into ALL their applications. In these environments, it makes good business sense to understand performance with minimal effort, or automatically, with no code changes needed to create the traces.
This is why we implemented an open approach to tracing for the Instana APM solution. The user can choose the best tool for the job, and we will make sure that the end-to-end view is created automatically. That means that a developer can code OpenTracing APIs into a new microservice and use Instana automated tracing for the other parts of the stack, or even for parts of the microservice where manual instrumentation adds no value over automated tracing (e.g. database statements). Another developer can use X-Ray to trace Lambda functions, and Instana will seamlessly combine the data from all the different tracing technologies.
This is what we call a Polyglot tracing approach. Instana will automatically provide an end-to-end trace combining OpenTracing, OpenCensus, automated tracing and X-Ray into one single distributed trace. It is up to the user, usually developers, to decide which techniques to use.
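As a rough illustration of the idea, and not a description of how Instana implements it, the Python sketch below merges spans reported by different instrumentation sources into one tree per trace. The span fields, the sample data and the helpers `group_by_trace` and `render` are all hypothetical. The sketch also assumes that every source already shares the same trace and span ids, which in practice depends on context propagation between the tools.

```python
from collections import defaultdict

# Hypothetical spans, already normalized to one common shape.
spans = [
    {"source": "opentracing", "trace_id": "t1", "span_id": "a",
     "parent_id": None, "name": "GET /cart"},
    {"source": "auto", "trace_id": "t1", "span_id": "b",
     "parent_id": "a", "name": "SELECT cart_items"},
    {"source": "xray", "trace_id": "t1", "span_id": "c",
     "parent_id": "a", "name": "lambda:price-calc"},
    {"source": "opencensus", "trace_id": "t2", "span_id": "d",
     "parent_id": None, "name": "GET /health"},
]


def group_by_trace(all_spans):
    """Collect spans from every source under their shared trace id."""
    traces = defaultdict(list)
    for span in all_spans:
        traces[span["trace_id"]].append(span)
    return dict(traces)


def render(trace_spans):
    """Return one indented line per span, following parent/child links."""
    children = defaultdict(list)
    for span in trace_spans:
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children[parent_id]:
            lines.append("  " * depth + f"{span['name']} [{span['source']}]")
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


for trace_id, trace_spans in group_by_trace(spans).items():
    print(trace_id)
    for line in render(trace_spans):
        print("  " + line)
```

The origin of each span is kept as a label, but the tree is built purely from ids, so the developer's choice of instrumentation does not affect the resulting end-to-end view.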
Getting back to the Gartner paper, it is clear that tracing is just one element of a broader APM strategy for microservice architectures. Even in ephemeral, dynamic environments, teams still need an understanding of the whole application. APM uses a combination of end user experience data, metrics (e.g. Docker CPU), traces, and topology to derive an understanding of health. Instana uses an AI-based approach to understand the health of infrastructure, endpoints, services and applications. In microservice architectures, understanding dynamic dependencies using topology and graph analysis is more difficult than in traditional architectures. Instana’s AI leverages our unique Dynamic Graph, which provides the context to pinpoint the root cause of problems and ultimately arrive at accurate causation.
We predict that there will be no compelling reason to converge on a single tracing standard. Developers and operators need flexibility and the easiest possible way to get the data they need for the architecture at hand. It’s up to APM tools to collect, organize and analyze any of these data sources into actionable information. Nevertheless, standardizing the TraceContext between APM vendors would be of high value if it works out, as it allows better interoperability between tools (especially important in hybrid cloud environments). But we shouldn’t mix this up with developer-oriented frameworks like OpenTracing.