This is the third post in a series on the Life of an SRE at Instana. Check out the first post and second post.
Rolling out releases and hotfixes
Our Instana SaaS platform runs approximately 4000 containers hosting Java and Node.js processes across multiple regions. We have a bi-weekly release process in which new features are rolled out to all customers. On top of that, we deploy hotfixes several times a day. To keep up with these ever-changing environments, we rely heavily on Instana’s auto-discovery mechanism. Newly deployed components appear on the infrastructure map within a few seconds, and we can filter and drill down into whichever metrics we are interested in.
We use Nomad from HashiCorp as the scheduler for our containers. Nomad is a highly available, distributed cluster and application scheduler, and it has been very stable for us. Every time we roll out a hotfix or release, we create new deployment requests via the Nomad API. Nomad takes care of scheduling the containers, restarts them if they crash, and packs them tightly onto our Nomad cluster, which lets us use resources as efficiently as possible. For more information on Nomad, have a look at https://nomadproject.io/.
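To make the idea of a deployment request more concrete, here is a minimal sketch of what registering a job through Nomad's HTTP API could look like. It only builds the request for the job registration endpoint (`POST /v1/jobs`) and does not send it. The helper name `build_job_request`, the image name, and the agent address are assumptions for illustration, and a real job would carry much more configuration (resources, networking, health checks) than shown here.

```python
import json
import urllib.request


def build_job_request(job_id, image, count, datacenters,
                      nomad_addr="http://127.0.0.1:4646"):
    """Build (but do not send) a job registration request for Nomad.

    The job body is deliberately minimal; every value passed in is
    supplied by the caller and the helper itself is hypothetical.
    """
    job = {
        "ID": job_id,
        "Name": job_id,
        "Type": "service",
        "Datacenters": list(datacenters),
        "TaskGroups": [{
            "Name": job_id,
            "Count": count,  # number of containers to run
            "Tasks": [{
                "Name": job_id,
                "Driver": "docker",
                "Config": {"image": image},
            }],
        }],
    }
    body = json.dumps({"Job": job}).encode("utf-8")
    return urllib.request.Request(
        f"{nomad_addr}/v1/jobs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_job_request(
    "cashier-acceptor",
    image="registry.example.com/cashier-acceptor:1.173.0",  # hypothetical image
    count=2,
    datacenters=["eu-west-1a", "eu-west-1b", "eu-west-1c"],
)
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen` would submit the job to a reachable Nomad agent. Building the request separately keeps the payload easy to inspect before anything is deployed.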
Verify pending allocations
After each deployment we monitor pending allocations in the Nomad dashboards. Switching the timeframe to 1-second resolution lets us follow the “Pending Allocations” metric and confirm that the Nomad cluster is handling the new job deployments well. When pending allocations stay close to 0, we can be confident that the rollout is in good shape and all components have been scheduled successfully.
You can also verify the allocation status of individual components on the command line, but doing that for 4000 containers is tedious and tiring.
> nomad status cashier-acceptor
ID            = cashier-acceptor
Name          = cashier-acceptor
Submit Date   = 02/27/20 14:45:24 UTC
Type          = service
Priority      = 50
Datacenters   = eu-west-1a,eu-west-1b,eu-west-1c
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group        Queued  Starting  Running  Failed  Complete  Lost
cashier-acceptor  0       0         2        0       0         0

Allocations
ID        Node ID   Task Group        Version  Desired  Status   Created     Modified
09c614a9  a29865b3  cashier-acceptor  3        run      running  15h36m ago  15h35m ago
6c18b836  34a35e25  cashier-acceptor  3        run      running  15h36m ago  15h35m ago
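If you did want to script this check rather than run the command once per job, a minimal sketch could aggregate the same counters across many jobs. The example below assumes the input is a list of job summaries shaped like the response of Nomad's job summary endpoint (`/v1/job/<job_id>/summary`): a `JobID` plus a `Summary` map of per-task-group counters. The function name `pending_by_job`, the choice to count `Queued` and `Starting` as pending, and the second sample job are assumptions for illustration.

```python
from typing import Dict, Iterable

# Counters treated as "not yet running"; these mirror the Queued and
# Starting columns of the summary table printed by `nomad status`.
PENDING_FIELDS = ("Queued", "Starting")


def pending_by_job(summaries: Iterable[dict]) -> Dict[str, int]:
    """Return {job_id: pending_count} for jobs with pending allocations."""
    pending = {}
    for summary in summaries:
        groups = summary.get("Summary") or {}
        count = sum(
            int(group.get(field, 0))
            for group in groups.values()
            for field in PENDING_FIELDS
        )
        if count > 0:
            pending[summary["JobID"]] = count
    return pending


# Sample data: the first job matches the output above, the second is made up.
sample = [
    {"JobID": "cashier-acceptor",
     "Summary": {"cashier-acceptor": {"Queued": 0, "Starting": 0, "Running": 2,
                                       "Failed": 0, "Complete": 0, "Lost": 0}}},
    {"JobID": "example-job",
     "Summary": {"example-job": {"Queued": 1, "Starting": 2, "Running": 0,
                                  "Failed": 0, "Complete": 0, "Lost": 0}}},
]
print(pending_by_job(sample))  # {'example-job': 3}
```

Only jobs with something pending are returned, so an empty result means every job in the input is fully scheduled.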
Create alerts for pending allocations
Additionally, we configure alerts for “Pending Allocations”. If pending allocations stay greater than 1 for a given time interval, that is a signal to check the system. We may need to expand the Nomad cluster, or a Docker process may have crashed on a machine and need to be restarted.
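The alert condition itself is simple enough to write down. The sketch below is not how Instana evaluates alerts internally. It only illustrates the rule "pending allocations stay above a threshold for a given interval" on a list of timestamped samples. The function name `pending_alert` and the 60-second default interval are assumptions. The threshold of 1 comes from the rule described above.

```python
from typing import Iterable, Tuple


def pending_alert(samples: Iterable[Tuple[float, int]],
                  threshold: int = 1,
                  min_duration_s: float = 60.0) -> bool:
    """Return True if the pending count stayed above `threshold` for at
    least `min_duration_s` seconds without dropping back in between.

    `samples` are (timestamp_in_seconds, pending_count) pairs in time order.
    """
    breach_start = None
    for timestamp, pending in samples:
        if pending > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of this breach
            if timestamp - breach_start >= min_duration_s:
                return True
        else:
            breach_start = None  # back to normal, reset the window
    return False


# One sample per second for two minutes.
short_spike = [(t, 5 if 10 <= t < 20 else 0) for t in range(120)]
sustained = [(t, 3) for t in range(120)]
print(pending_alert(short_spike), pending_alert(sustained))  # False True
```

Requiring the breach to last for the whole interval keeps the brief spikes that accompany every rollout from triggering an alert.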
Verify running component versions
After each release we make sure that we are not running old versions of components. This can happen if a component was accidentally skipped during a release. In our experience, containers can also get stuck when this many Docker processes run under high load. Filtering components in the infrastructure map by release version helps us find old containers and replace them with a new version.
In the following example we filter by the semantic version 1.172.* to find all components that have the release version 172. Each Java Dropwizard component we run includes the full version in its JAR file path. Instana automatically stores this information, and you can filter by the version in the infrastructure map and the container perspective.
We mostly use the “Container” perspective to find old containers:
Then we filter all components by the old release tag prefix. If components show up on the map, you can hover over an entity and Instana will show you the component name.
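The same version filter can be expressed in a few lines of code, which may help if you keep your own inventory of containers. This is a sketch under assumptions, not Instana's filter implementation. It assumes the JAR file name ends in `-<major>.<minor>.<patch>.jar`, and the container names and paths in the sample inventory are made up.

```python
import re
from fnmatch import fnmatchcase
from typing import Dict, List, Optional

# Assumed naming scheme: <name>-<major>.<minor>.<patch>.jar
VERSION_IN_JAR = re.compile(r"-(\d+\.\d+\.\d+)\.jar$")


def jar_version(jar_path: str) -> Optional[str]:
    """Extract the version from a JAR path, or None if it has none."""
    match = VERSION_IN_JAR.search(jar_path)
    return match.group(1) if match else None


def containers_matching(jar_paths: Dict[str, str], pattern: str) -> List[str]:
    """Return sorted container names whose JAR version matches a glob
    pattern such as '1.172.*'."""
    matches = []
    for name, path in jar_paths.items():
        version = jar_version(path)
        if version is not None and fnmatchcase(version, pattern):
            matches.append(name)
    return sorted(matches)


# Hypothetical inventory: container name -> JAR path.
inventory = {
    "cashier-acceptor-1": "/opt/app/cashier-acceptor-1.173.12.jar",
    "cashier-acceptor-2": "/opt/app/cashier-acceptor-1.172.40.jar",
    "example-service-1": "/opt/app/example-service-1.173.12.jar",
}
print(containers_matching(inventory, "1.172.*"))  # ['cashier-acceptor-2']
```

After a rollout of release 173, any name returned for the pattern `1.172.*` is a container that is still on the old release and needs to be replaced.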
Summary
Instana’s automatic discovery of newly deployed containers and its 1-second metric resolution help us during releases and hotfixes. When updating 4000 containers, we get instant feedback if processes are stuck. After each release, we verify that no old components are still running by filtering by the old release version in the container perspective.