
Life of an SRE at Instana – Rolling out releases and hotfixes

March 16, 2020

This is the third post in a series on the Life of an SRE at Instana. Check out the first post and second post.

Rolling out releases and hotfixes

Our Instana SaaS platform runs approximately 4000 containers hosting Java and Node.js processes across multiple regions. We have a bi-weekly release process in which new features are rolled out to all customers. On top of that, we deploy hotfixes several times a day. To keep up with these ever-changing environments, we rely heavily on Instana’s auto-discovery mechanism. All newly deployed components appear on the infrastructure map after a few seconds, and we can filter and drill down into the metrics that we are interested in.

We use Nomad from HashiCorp as the scheduler for running our containers. Nomad is a highly available, distributed cluster and application scheduler, and it has been very stable for us. Every time we roll out a hotfix or release, we create new deployment requests via the Nomad API. Nomad takes care of scheduling the containers: it restarts them if they crash and packs them tightly onto our Nomad cluster, which lets us use resources as efficiently as possible. For more information on Nomad, have a look at https://nomadproject.io/.
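As a rough illustration of what a deployment request to the Nomad API can look like, here is a minimal sketch that builds (but does not send) a job registration request for Nomad's `POST /v1/jobs` endpoint. The helper name `build_register_request`, the agent address, and the job contents (including the image name) are assumptions made for this example, not our actual deployment tooling.

```python
import json
import urllib.request


def build_register_request(job, nomad_addr="http://127.0.0.1:4646"):
    """Build an HTTP request that registers `job` with a Nomad agent.

    `job` is a dict in Nomad's JSON job format. The request is only
    constructed here; pass it to urllib.request.urlopen() to send it.
    """
    body = json.dumps({"Job": job}).encode("utf-8")
    return urllib.request.Request(
        url=f"{nomad_addr}/v1/jobs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical job definition; the image name is a placeholder.
example_job = {
    "ID": "cashier-acceptor",
    "Name": "cashier-acceptor",
    "Type": "service",
    "Datacenters": ["eu-west-1a", "eu-west-1b", "eu-west-1c"],
    "TaskGroups": [{
        "Name": "cashier-acceptor",
        "Count": 2,
        "Tasks": [{
            "Name": "cashier-acceptor",
            "Driver": "docker",
            "Config": {"image": "example/cashier-acceptor:1.172.0"},
        }],
    }],
}

request = build_register_request(example_job)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes the sketch easy to inspect without a running Nomad cluster.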

[Screenshot: Container map]

Verify pending allocations

After each deployment, we monitor pending allocations in the Nomad dashboards. Switching the time frame to 1-second resolution lets us follow the “Pending Allocations” metric and make sure the Nomad cluster is handling the new job deployments well. Pending allocations staying close to 0 give us confidence that the rollout is in good shape and all components were scheduled successfully.

You can also verify the allocation status of individual components on the command line, but doing that for 4000 containers is tedious.

> nomad status cashier-acceptor

ID            = cashier-acceptor
Name          = cashier-acceptor
Submit Date   = 02/27/20 14:45:24 UTC
Type          = service
Priority      = 50
Datacenters   = eu-west-1a,eu-west-1b,eu-west-1c
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group        Queued  Starting  Running  Failed  Complete  Lost
cashier-acceptor  0       0         2        0       0         0

Allocations
ID        Node ID   Task Group        Version  Desired  Status   Created     Modified
09c614a9  a29865b3  cashier-acceptor  3        run      running  15h36m ago  15h35m ago
6c18b836  34a35e25  cashier-acceptor  3        run      running  15h36m ago  15h35m ago
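The per-job check shown in the `nomad status` output above could also be scripted. The following is a minimal sketch that inspects a job summary in the JSON shape returned by Nomad's `GET /v1/job/<job_id>/summary` endpoint (a `Summary` map from task group name to counters). The function name and the sample payloads are made up for illustration; fetching the JSON from the API is left out.

```python
def pending_task_groups(job_summary):
    """Return {task_group: counters} for groups with queued or starting allocations."""
    pending = {}
    for group, counts in job_summary.get("Summary", {}).items():
        waiting = {key: counts.get(key, 0) for key in ("Queued", "Starting")}
        if any(waiting.values()):
            pending[group] = waiting
    return pending


# Sample payload mirroring the healthy `nomad status` output above.
healthy = {
    "JobID": "cashier-acceptor",
    "Summary": {
        "cashier-acceptor": {
            "Queued": 0, "Starting": 0, "Running": 2,
            "Failed": 0, "Complete": 0, "Lost": 0,
        }
    },
}

# Hypothetical payload for a job that is still waiting to be placed.
stuck = {
    "JobID": "cashier-acceptor",
    "Summary": {
        "cashier-acceptor": {
            "Queued": 1, "Starting": 1, "Running": 0,
            "Failed": 0, "Complete": 0, "Lost": 0,
        }
    },
}

print(pending_task_groups(healthy))  # {}
print(pending_task_groups(stuck))    # {'cashier-acceptor': {'Queued': 1, 'Starting': 1}}
```

Running such a check over all jobs gives a quick list of the task groups that still need attention, rather than reading 4000 status outputs by hand.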

Create alerts for pending allocations

Additionally, we configure alerts for “Pending Allocations”. If pending allocations stay greater than 1 for a given time interval, that is a signal to check the system. Typically, either the Nomad cluster needs to be expanded or a Docker process has crashed on a machine and needs to be restarted.
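The logic behind such a rule can be sketched in a few lines: the alert should fire only when the metric stays above the threshold for the whole interval, not on a short spike. This is an illustrative sketch, not how Instana evaluates alerts internally; the function name, the 60-second window, and the sample format are assumptions.

```python
def breaches_for_duration(samples, threshold=1, window_seconds=60):
    """Return True if the value stays above `threshold` for at least `window_seconds`.

    `samples` is a list of (timestamp_in_seconds, value) tuples sorted by time.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of a new breach
            if timestamp - breach_start >= window_seconds:
                return True
        else:
            breach_start = None  # value recovered, reset the breach
    return False


# A short spike right after a deployment should not alert.
spike = [(0, 0), (1, 5), (2, 0), (3, 0)]

# Pending allocations stuck at 3 for a full minute should alert.
sustained = [(second, 3) for second in range(0, 61)]

print(breaches_for_duration(spike))      # False
print(breaches_for_duration(sustained))  # True
```

Requiring a sustained breach avoids alerting on the brief rise in pending allocations that every rollout causes.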

[Screenshots: creating the alert, parts 1 and 2]

Verify running component versions

After each release, we make sure that we are not running old versions of components. This can happen if a component was accidentally left out of a release. In our experience, containers can also get stuck when so many Docker processes run under high load. Filtering components in the infrastructure map by release version helps us find old containers and replace them with the new version.

In the following example, we filter by the semantic version 1.172.* to find all components that have the release version 172. Each Java Dropwizard component we run contains the full version in its JAR file path name. Instana stores this information automatically, so you can filter by version in both the infrastructure map and the container perspective.
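To make the matching concrete, here is a small sketch of how a version embedded in a JAR path can be extracted and compared against a wildcard such as 1.172.*. The path layout (`<name>-<version>.jar`) and the example paths are assumptions for illustration; this does not describe how Instana implements its filter.

```python
import re
from fnmatch import fnmatchcase

# Assumed naming scheme: the JAR file name ends in "-<dotted version>.jar".
VERSION_IN_JAR = re.compile(r"-(\d+(?:\.\d+)+)\.jar$")


def jar_version(path):
    """Return the dotted version found in a JAR path, or None if there is none."""
    match = VERSION_IN_JAR.search(path)
    return match.group(1) if match else None


def matches_release(path, pattern):
    """Check whether the version in `path` matches a wildcard like '1.172.*'."""
    version = jar_version(path)
    return version is not None and fnmatchcase(version, pattern)


# Hypothetical JAR paths.
old_jar = "/opt/app/cashier-acceptor-1.172.345.jar"
new_jar = "/opt/app/cashier-acceptor-1.173.2.jar"

print(jar_version(old_jar))                 # 1.172.345
print(matches_release(old_jar, "1.172.*"))  # True
print(matches_release(new_jar, "1.172.*"))  # False
```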

We mostly use the “Container” perspective to find old containers:

[Screenshot: Change to container view]

Then we filter all components by the old release tag prefix. If components show up on the map, you can hover over an entity and Instana will show you the component name.

[Screenshot: Filtered infrastructure map]

Summary

Instana’s automatic discovery of newly deployed containers and its 1-second metric resolution help us during releases and hotfixes. When updating 4000 containers, we get instant feedback if processes are stuck. After each release, we verify that no old components are still running by filtering by the old release version in the container perspective.
