Life of an SRE at Instana – Rolling out releases and hotfixes

March 16, 2020

This is the third post in a series on the Life of an SRE at Instana. Check out the first post and second post.

Rolling out releases and hotfixes

Our Instana SaaS platform runs approximately 4000 containers hosting Java and Node.js processes across multiple regions. We have a bi-weekly release process in which new features are rolled out to all customers. On top of that, we deploy hotfixes several times a day. To keep up with these ever-changing environments, we rely heavily on Instana’s auto-discovery mechanism. Newly deployed components appear on the infrastructure map within a few seconds, and we can filter and drill down into all the metrics we are interested in.

We use Nomad from HashiCorp as the scheduler for running our containers. Nomad is a highly available, distributed cluster and application scheduler, and it has been very stable for us. Every time we roll out a hotfix or release, we create new deployment requests via the Nomad API. Nomad takes care of scheduling the containers: it restarts them if they crash and packs them tightly onto our Nomad cluster, which lets us utilize resources as efficiently as possible. For more information on Nomad please have a look at
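As a rough illustration of what a deployment request via the Nomad API can look like, here is a minimal Python sketch. It assumes the job registration endpoint of Nomad's HTTP API (`POST /v1/jobs` with the job wrapped in a `Job` key) and a local agent address. The job definition is a trimmed, hypothetical example and not our actual job specification; the request is only built here, not sent.

```python
import json
import urllib.request

# Assumption: a Nomad agent reachable at the default local address.
NOMAD_ADDR = "http://127.0.0.1:4646"


def build_register_request(job: dict, nomad_addr: str = NOMAD_ADDR) -> urllib.request.Request:
    """Build (but do not send) a job registration request for Nomad's HTTP API."""
    body = json.dumps({"Job": job}).encode("utf-8")
    return urllib.request.Request(
        url=f"{nomad_addr}/v1/jobs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical, heavily trimmed job definition; a real one also needs task groups.
job = {
    "ID": "cashier-acceptor",
    "Name": "cashier-acceptor",
    "Type": "service",
    "Datacenters": ["eu-west-1a", "eu-west-1b", "eu-west-1c"],
}
request = build_register_request(job)
# urllib.request.urlopen(request) would submit it to a reachable Nomad agent.
```

Separating request construction from sending keeps the sketch easy to inspect and test without a running cluster.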

Container Map

Verify pending allocations

After each deployment, we monitor pending allocations in the Nomad dashboards. Changing the timeframe to get 1-second resolution lets us follow the “Pending Allocations” metric and confirm that the Nomad cluster is handling the new job deployments well. Pending allocations staying close to 0 gives us confidence that the rollout is in good shape and all components have been scheduled successfully.

You can also verify the allocation status of individual components on the command line, but doing that for 4000 containers is tedious.

> nomad status cashier-acceptor

ID            = cashier-acceptor
Name          = cashier-acceptor
Submit Date   = 02/27/20 14:45:24 UTC
Type          = service
Priority      = 50
Datacenters   = eu-west-1a,eu-west-1b,eu-west-1c
Status        = running
Periodic      = false
Parameterized = false

Task Group        Queued  Starting  Running  Failed  Complete  Lost
cashier-acceptor  0       0         2        0       0         0

ID        Node ID   Task Group        Version  Desired  Status   Created     Modified
09c614a9  a29865b3  cashier-acceptor  3        run      running  15h36m ago  15h35m ago
6c18b836  34a35e25  cashier-acceptor  3        run      running  15h36m ago  15h35m ago
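Checking the same counts across many jobs can be scripted instead of reading the CLI output job by job. The following Python sketch assumes the JSON shape returned by Nomad's job summary endpoint (`GET /v1/job/<job_id>/summary`), where each task group carries `Queued`, `Starting`, `Running`, `Failed`, `Complete` and `Lost` counters. The helper name and the second sample are hypothetical; the first sample mirrors the output above.

```python
def unhealthy_task_groups(job_summary: dict) -> dict:
    """Return the task groups that have queued, failed, or lost allocations."""
    problems = {}
    for group, counts in job_summary.get("Summary", {}).items():
        # Keep only the counters that indicate trouble and are non-zero.
        bad = {
            key: counts.get(key, 0)
            for key in ("Queued", "Failed", "Lost")
            if counts.get(key, 0) > 0
        }
        if bad:
            problems[group] = bad
    return problems


# Mirrors the CLI output above: two running allocations, nothing queued.
healthy = {
    "JobID": "cashier-acceptor",
    "Summary": {
        "cashier-acceptor": {
            "Queued": 0, "Starting": 0, "Running": 2,
            "Failed": 0, "Complete": 0, "Lost": 0,
        }
    },
}

# Hypothetical job with one allocation that could not be placed.
stuck = {
    "JobID": "example-job",
    "Summary": {
        "example-job": {
            "Queued": 1, "Starting": 0, "Running": 1,
            "Failed": 0, "Complete": 0, "Lost": 0,
        }
    },
}

healthy_result = unhealthy_task_groups(healthy)
stuck_result = unhealthy_task_groups(stuck)
```

An empty result means every task group of the job is fully scheduled; anything else names the groups worth a closer look.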

Create alerts for pending allocations

Additionally, we configure alerts for “Pending Allocations”. If pending allocations stay above 1 for a given time interval, that is a signal to check the system. We may need to expand the Nomad cluster, or a Docker process may have crashed on a machine and need to be restarted.

create alert part 1

create alert part 2
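The alert condition described above (a metric staying above a threshold for a sustained interval) can be sketched in a few lines of Python. This is only an illustration of the logic, not how Instana evaluates alerts internally; the function name, the 60-second window and the sample data are assumptions.

```python
from typing import Iterable, Tuple


def should_alert(
    samples: Iterable[Tuple[float, float]],
    threshold: float = 1,
    window_seconds: float = 60,
) -> bool:
    """Return True if the value stayed above `threshold` continuously
    for at least `window_seconds`.

    `samples` is an iterable of (timestamp_in_seconds, value) pairs
    ordered by timestamp.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of a new breach
            if timestamp - breach_start >= window_seconds:
                return True
        else:
            breach_start = None  # metric recovered, so reset the window
    return False


# Hypothetical samples: pending allocations stay above 1 from t=1 to t=61.
sustained = [(0, 0), (1, 3), (30, 2), (61, 2)]
# Hypothetical samples: a short spike that recovers, then a second short spike.
transient = [(0, 3), (30, 0), (60, 3), (90, 3)]

sustained_fires = should_alert(sustained)
transient_fires = should_alert(transient)
```

Requiring a sustained breach avoids alerting on the brief spikes in pending allocations that are normal right after a deployment.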

Verify running component versions

After each release, we make sure that no old versions of components are still running. This can happen if components were accidentally forgotten during a release. In our experience, containers can also get stuck when so many Docker processes run under high load. Filtering components in the infrastructure map by release version helps us find old containers and replace them with the new version.

In the following example, we filter by the semantic version 1.172.* to find all components that have release version 172. Each Java Dropwizard component we run contains the full version in its JAR file path name. Instana automatically stores this information, and you can filter by version in the infrastructure map and the container perspective.

We mostly use the “Container” perspective to find old containers:

Change to container view

Then we filter all components by the old release tag prefix. If any components show up on the map, you can hover over the entity and Instana will show you the component name.

Filtered infrastructure map
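The version filter itself boils down to extracting a version from the JAR path and matching it against a wildcard pattern such as 1.172.*. The Python sketch below shows that idea outside the Instana UI. The JAR paths and the `name-<major>.<minor>.<patch>.jar` naming scheme are hypothetical, chosen only to illustrate the matching.

```python
import fnmatch
import re
from typing import List, Optional

# Assumption: the JAR file name ends in "-<major>.<minor>.<patch>.jar".
VERSION_IN_JAR = re.compile(r"-(\d+\.\d+\.\d+)\.jar$")


def jar_version(jar_path: str) -> Optional[str]:
    """Extract the version string from a JAR path, or None if absent."""
    match = VERSION_IN_JAR.search(jar_path)
    return match.group(1) if match else None


def components_matching(jar_paths: List[str], pattern: str) -> List[str]:
    """Return the paths whose embedded version matches a wildcard pattern."""
    matches = []
    for path in jar_paths:
        version = jar_version(path)
        if version is not None and fnmatch.fnmatchcase(version, pattern):
            matches.append(path)
    return matches


# Hypothetical paths for illustration only.
paths = [
    "/opt/app/cashier-acceptor-1.172.450.jar",
    "/opt/app/example-component-1.173.12.jar",
    "/opt/app/unversioned.jar",
]

old_components = components_matching(paths, "1.172.*")
```

Anything returned for the old release pattern is a candidate for replacement with the new version.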


Instana’s automatic discovery of newly deployed containers and 1-second metric resolution help us during releases and hotfixes. When updating 4000 containers, we get instant feedback if processes are stuck. After each release, we verify that no old components are running by filtering by the old release version in the container perspective.
