The Central Challenge in Monitoring Serverless Applications

A modern application is complex. From the perspective of deployment, topology and operation, even the simplest modern application typically consists of numerous heterogeneous parts that interact with each other through various protocols. These parts are commonly referred to as microservices; they usually have completely independent lifecycles, are worked on by different teams, and may even cover different business areas. A modern application built of such microservices rarely fails completely; parts of the overall functionality may become unavailable for some time and come back again at any time, whether due to a crash or a deliberate action.

Operational aspects of serverless applications

An architecture like this accepts the fact that any instance of a service can fail, come back up again and run elsewhere. We refer to this new reality as ephemerality. It also allows an instance to take over a previous instance's work where it stopped, or to start fresh right away. What Erlang/OTP pioneered with the "let it crash" principle now gets projected onto a higher level, with services implemented in different languages and their instances considering themselves movable and ephemeral. They assume that the underlying platform takes care of their availability to others, of satisfying their resource needs and dependencies, and of managing their state if it needs to survive reinstantiation. Foundations like Mesos and Kubernetes provide such a platform.

But the granularity of logic separation doesn't end with service instances. General work execution and orchestration foundations completely decouple ephemeral tasks from the low-level environment details, including the underlying operating system and host. A task can run anywhere at any particular moment in time and do some work. The scope and size of such a task is typically so small that its execution time spans just seconds. AWS Lambda and Spark offer foundations for such task execution.

The serverless architecture, which can be considered the umbrella term for architectures like this, has the following operational aspects:

  • Ephemerality,
  • High granularity of components,
  • Location transparency of components,
  • Ability to replace and move components at any time.

Simply put, code is decoupled from the infrastructure it runs on, but this decoupling brings challenges to monitoring.

Re-identification problem

Ephemerality is the common aspect of failure and short-lived work. In both cases a component can disappear and reappear again. Depending on the granularity of components and on the platform they are managed by, they can be identified by temporary IDs that map 1:1 to a host name or to a process ID (PID) in the underlying operating system, or that don't directly map to those at all, because the managing platform introduces its own identification mechanisms such as task IDs and the like. In most cases, components need to know neither their own IDs nor the IDs of components they depend on, since they rely on the discovery mechanisms of the platform and typically resolve their dependencies by name.

From the perspective of program logic, the underlying identification mechanisms might not play any significant role. But from the perspective of operations, debugging, performance tuning and troubleshooting, it's extremely hard to comprehend or reproduce a complex execution path through a topology that keeps changing all the time, to identify the reason for a bottleneck, or to derive system tuning parameters from behavior under load. Engineers need to know what exactly executed, when and where.

Even the most beautiful serverless application runs on physical and logical infrastructure, including an operating system, and these are typically tuned for the operation of this application. Issues either propagate from the infrastructure into the application, or the application causes trouble in the infrastructure. In order to understand the behaviour of an application over time, it is therefore necessary to catalogue the history of changes, issues and generally any events that happen to any of the application's parts and the underlying infrastructure. To be able to maintain this history, however, it's necessary to precisely identify, or re-identify, any ephemeral component on every appearance over time.

Explicit vs. implicit re-identification

It should be obvious that ephemeral components cannot be re-identified using low-level facts and mechanisms such as host names, PIDs and the like. PIDs change between crashes, restarts and moves. And hosts clearly play the role of execution vehicles and resource providers in a serverless architecture, while also being considered temporary and non-unique in fleets, auto-scaling groups, etc. They are typically grouped by roles and identified as such rather than directly by their names or IP addresses.

Instana has realised that it's problematic for traditional monitoring tools to re-identify ephemeral components. It's not only because poor resolution might simply miss a task that lives for just a second, but also because traditional monitoring tools build on low-level identification mechanisms. Whenever a process dies and comes back, it doesn't have any history, so it's up to humans to piece together the history of a component throughout all its crashes, restarts and moves.

A more sophisticated monitoring solution would allow an ephemeral component to be explicitly identified by a name, so that it could be re-identified after a crash or a restart. But naming components by hand is burdensome in a serverless architecture.

Attacking ephemerality with steadiness

One of the central ideas of Instana is auto-discovery with minimal configuration. We truly believe that it doesn't make sense to feed a monitoring solution with explicit facts and configurations to allow it to work properly, while the application under monitoring is itself built around ephemeral components with service discovery and expectation-based resource allocation. These two approaches don't fit together; combining them breaks abstractions and lifts infrastructure and monitoring details straight into the application.

We see a huge gap between the ephemerality of services and tasks, the differences in how they are managed by different underlying foundations, and how traditional monitoring solutions approach the re-identification problem. So we have from the beginning designed and built a set of adaptive re-identification abstractions and mechanisms that take into account the environment, the foundation used for work execution or service management and scheduling, granularity and ephemerality. We collect all these abstractions and mechanisms under the term steadiness.

Every operational or logical element we automatically discover is considered ephemeral and is never directly identified by its own identification means, hard facts or low-level identification mechanisms. Instead, we derive something we call a steady ID from the element's attributes, its environment and soft facts that are not likely to change semantically, no matter where, when, why or how the element runs, crashes, restarts or gets redeployed.
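
To make the idea concrete, here is a minimal sketch of such a derivation, not Instana's actual implementation: the steady ID is a fingerprint over a canonicalized set of soft facts, and the attribute names used here are made up for the example.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Sketch only: derive a "steady ID" as a stable fingerprint over soft facts
    // that are unlikely to change semantically. Volatile facts such as the PID
    // or the host name are deliberately not part of the input.
    public final class SteadyId {

        public static String of(SortedMap<String, String> softFacts) {
            StringBuilder canonical = new StringBuilder();
            softFacts.forEach((k, v) -> canonical.append(k).append('=').append(v).append(';'));
            return sha256(canonical.toString());
        }

        private static String sha256(String input) {
            try {
                MessageDigest digest = MessageDigest.getInstance("SHA-256");
                StringBuilder hex = new StringBuilder();
                for (byte b : digest.digest(input.getBytes(StandardCharsets.UTF_8))) {
                    hex.append(String.format("%02x", b));
                }
                return hex.toString();
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }

        public static void main(String[] args) {
            SortedMap<String, String> facts = new TreeMap<>();
            facts.put("service.name", "checkout");            // made-up soft fact
            facts.put("jvm.args.logical", "-Dfoo=1 -Dbar=2"); // made-up soft fact
            System.out.println(SteadyId.of(facts));           // stays the same across restarts and moves
        }
    }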

Example: steadiness of a process

In the simplest of all scenarios, let's assume we have a program that is executed on the JVM with the following parameters:

-Xms500m -Dfoo=1 -Dbar=2 -Dmore=3 -Dmuch=4,

along with the current PID 1234.

We see two different types of arguments - technical ones that tune the JVM itself (-X) and logical ones that are likely to control the application in some way (-D). Instana assumes in this case that everything configured with a -X flag is likely to change and doesn't contribute to any logic.

We consider the PID volatile from the beginning and collect history of metrics, connections, traces, etc. behind a steady ID which, in this case, is derived from the -D arguments.

When the program gets restarted, resulting in a new PID 2345, none of the -D arguments change. Instana sees the process going away and coming back, catalogues the corresponding offline and online events, and continues collecting metrics and accumulating history for the process behind the same steady ID.

If somebody decides to do some heap tuning, restarting the program with the following arguments:

-Xms1g -Dfoo=1 -Dbar=2 -Dmore=3 -Dmuch=4,

resulting in PID 3456, Instana still re-identifies this as a program it has previously seen, because none of the -D arguments has changed. We simply ignore the -X change for identification purposes and keep collecting history.
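
The following sketch illustrates this rule; it is an illustration of the idea, not the actual algorithm. Only the -D arguments contribute to the steady ID, so all three invocations above map to the same ID:

    import java.util.Arrays;
    import java.util.stream.Collectors;

    // Sketch: a steady ID for a JVM process built only from its -D arguments.
    // -X flags and the PID are treated as volatile and ignored.
    public final class JvmSteadyId {

        static String steadyId(String commandLine) {
            String logicalArgs = Arrays.stream(commandLine.trim().split("\\s+"))
                    .filter(arg -> arg.startsWith("-D"))   // keep only logical arguments
                    .sorted()                              // argument order must not matter
                    .collect(Collectors.joining(" "));
            return Integer.toHexString(logicalArgs.hashCode()); // stand-in for a real fingerprint
        }

        public static void main(String[] args) {
            String run1 = "-Xms500m -Dfoo=1 -Dbar=2 -Dmore=3 -Dmuch=4"; // PID 1234
            String run2 = "-Xms500m -Dfoo=1 -Dbar=2 -Dmore=3 -Dmuch=4"; // PID 2345, restart
            String run3 = "-Xms1g -Dfoo=1 -Dbar=2 -Dmore=3 -Dmuch=4";   // PID 3456, heap tuning
            // All three runs yield the same steady ID, so history is simply continued.
            System.out.println(steadyId(run1).equals(steadyId(run2))); // true
            System.out.println(steadyId(run1).equals(steadyId(run3))); // true
        }
    }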

Example: steadiness by similarity

If someone decides to tweak some logic arguments:

-Xms1g -Dfoo=10 -Dbar=2 -Dmore=3 -Dmuch=4,

resulting in PID 4567, we need to apply more thinking. Our model for re-identification after a change like this is based on similarity. As long as we see another Java program in the same environment that still has the previous -D arguments, we assume we are now seeing a totally new item. Since we time out elements almost as soon as they disappear, the chance of a mistake is minimal.

But if we don't see a competing process that is currently reporting, we go by similarity. In this case one of the four -D arguments has changed, so what we see is still a 75% match with what we've seen before, most likely the same thing slightly modified. So we re-identify it as such, adopt the modified steady ID and continue its history.
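
A minimal sketch of such a similarity check, with the scoring and the threshold chosen purely for illustration; in the example above three of the four -D arguments still match, hence 75%:

    import java.util.Set;

    // Sketch: re-identification by similarity of logical (-D) arguments.
    // If no competing element with the previous arguments is still reporting,
    // and enough arguments match, the element is treated as the same one.
    public final class SimilarityCheck {

        static double similarity(Set<String> previousArgs, Set<String> currentArgs) {
            long matching = previousArgs.stream().filter(currentArgs::contains).count();
            return (double) matching / previousArgs.size();
        }

        public static void main(String[] args) {
            Set<String> previous = Set.of("-Dfoo=1", "-Dbar=2", "-Dmore=3", "-Dmuch=4");
            Set<String> current  = Set.of("-Dfoo=10", "-Dbar=2", "-Dmore=3", "-Dmuch=4");

            double score = similarity(previous, current);   // 0.75: one of four arguments changed
            boolean competingElementStillReporting = false; // assumed for this sketch

            if (!competingElementStillReporting && score >= 0.75) {
                // Re-identify: adopt the modified steady ID and continue the history.
                System.out.println("same element, similarity " + score);
            } else {
                System.out.println("new element");
            }
        }
    }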

Example: steadiness by slots

In cases where there is no direct mapping between an ephemeral service and, for example, an OS process, and thus the infrastructure in general, we need to apply knowledge about the underlying work execution foundation and how exactly it schedules tasks and maps them to low-level mechanisms. Viewed abstractly, no matter whether a long-lived service or a short-lived task is being monitored, their ephemerality makes them look similar. And the way Mesos, Kubernetes and similar platforms work offers an abstraction that perfectly supports re-identification by steadiness and links to the underlying low-level infrastructure: the slot.

Platforms like this are typically built around fairly dumb workers that, simply put, host work. The underlying foundation offers a number of slots that can be occupied by a service or a task. It doesn't matter whether slots are explicitly configured or derived from available resources, such as the number of CPUs, or from the roles of services and the types of tasks that can run on any of them. It's relatively easy to understand the overall topology of services and tasks by simply asking the underlying platform through an API, which all of these platforms provide. Once slots are discovered, it's of course necessary to keep track of changes after the initial request.
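
As a sketch of this discovery step, assuming a purely hypothetical SchedulerApi client whose types and method names are invented for illustration and are not an actual Mesos or Kubernetes API:

    import java.util.List;
    import java.util.function.Consumer;

    // Sketch only: a hypothetical client for a scheduling platform's API.
    // Mesos, Kubernetes and similar platforms expose comparable topology
    // information; the types and method names here are made up.
    interface SchedulerApi {
        List<Slot> listSlots();                     // current slots and their occupants
        void onSlotChange(Consumer<Slot> listener); // track changes after the initial request
    }

    record Slot(int number, String taskType) { }

    final class SlotDiscovery {
        public static void main(String[] args) {
            // A fake platform answering with a static topology, just to run the sketch.
            SchedulerApi api = new SchedulerApi() {
                public List<Slot> listSlots() {
                    return List.of(new Slot(1, "foobar"), new Slot(2, "foobar"), new Slot(3, "foobar"));
                }
                public void onSlotChange(Consumer<Slot> listener) { /* no changes in this fake */ }
            };
            api.listSlots().forEach(slot ->
                    System.out.println("slot #" + slot.number() + " accepts tasks of type " + slot.taskType()));
            api.onSlotChange(slot -> System.out.println("slot #" + slot.number() + " changed"));
        }
    }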

In such a scenario, in order to re-identify ephemeral tasks and to keep their history, we apply one additional important thought: task instances are themselves anonymous. An individual instance of a task simply doesn't matter; what matters is keeping the history of task behavior on a particular slot, and thus understanding how the task behaves on the underlying infrastructure.

Let's say the underlying platform offers 3 slots, each of which can be occupied by a task of type "foobar". Once all slots are occupied, further work waits for a slot to become available, and so on. Let's assume we start with one task being executed. If the platform assigns this task to slot #1, we build our steady ID out of the task type and the slot number. When a new task gets in and occupies slot #2, we track both of them independently, with different steady IDs. From the perspective of history, this means we know how this type of task behaves when executed on slot #1 and on slot #2.
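
A minimal sketch of such a slot-based steady ID, with an illustrative format: it is composed of the task type and the slot number, never of the anonymous task instance:

    // Sketch: steady IDs for slot-based scheduling, derived from the task type
    // and the slot number, never from the anonymous task instance itself.
    public final class SlotSteadyId {

        static String steadyId(String taskType, int slot) {
            return taskType + "@slot-" + slot; // illustrative format
        }

        public static void main(String[] args) {
            // The first "foobar" task is assigned to slot #1, a second one to slot #2.
            System.out.println(steadyId("foobar", 1)); // foobar@slot-1
            System.out.println(steadyId("foobar", 2)); // foobar@slot-2
            // A later "foobar" task occupying slot #1 maps back to foobar@slot-1,
            // so its history on that slot is continued.
        }
    }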

Whenever the task executed on slot #1 finishes, the history collection for this slot stops. When another task of the same type comes up and occupies slot #1, we continue the history with a gap in between.

In such a scenario, we might also want to monitor whether work distribution on the platform is symmetric for this type of task, or whether the topology is well utilized. Mixing all task instances into just one element with one steady ID wouldn't make any sense.

And what about hosts?

No, I didn't forget to mention hosts. I just wanted to first show the abstractions and mechanisms we have designed and built as a solution to the re-identification problem before sending out one simple message: hosts behave the same way as ephemeral services and tasks. We consider hosts ephemeral as well, and we also derive steady IDs for them. It is a bit trickier here: we calculate a fingerprint of the operating system instance from the soft facts we automatically discover.

Ephemerality of hosts plays an important role in auto-scaling groups, fleets and the like. But in some scenarios, even for ephemeral hosts, it might be reasonable to replace the derived steady ID with a configured one. Think of cases where an ephemeral host is fired up for a particular customer: instead of calculating a steady ID from technical facts, it might be more relevant to track the history by a customer ID. Even in this case our approach is more progressive than what traditional monitoring tools do: we don't automatically identify the host by its FQDN or IP address, because we consider these facts volatile.
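
The same idea applied to hosts, again as a minimal sketch; the soft facts and the override mechanism shown here are assumptions for illustration:

    import java.util.Optional;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Sketch: a host is identified by a fingerprint over soft facts of its OS
    // instance, never by FQDN or IP address; a configured ID (for example a
    // customer ID) may replace the derived one.
    public final class HostSteadyId {

        static String of(SortedMap<String, String> softFacts, Optional<String> configuredId) {
            if (configuredId.isPresent()) {
                return configuredId.get();
            }
            StringBuilder canonical = new StringBuilder();
            softFacts.forEach((k, v) -> canonical.append(k).append('=').append(v).append(';'));
            return Integer.toHexString(canonical.toString().hashCode()); // stand-in for a fingerprint
        }

        public static void main(String[] args) {
            SortedMap<String, String> facts = new TreeMap<>();
            facts.put("os.name", "Linux");    // made-up soft facts
            facts.put("host.role", "worker");
            // FQDN and IP address are deliberately not part of the fingerprint.
            System.out.println(of(facts, Optional.empty()));
            System.out.println(of(facts, Optional.of("customer-42"))); // configured override
        }
    }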

Conclusion

We believe that in order to reliably monitor a serverless application that builds on ephemeral infrastructure, services and tasks, it is not reasonable to solve identification by configuration, nor by hard facts or low-level details. We believe that being able to accurately re-identify ephemeral elements is a central challenge for any monitoring solution attempting to manage modern applications, and that our approach solves the problem. Of course, we're still working on improving it for different target platforms and edge cases, but steadiness is our response to the total ephemerality of serverless applications.