Life of an SRE at Instana – Handling AWS EC2 Instance Retirements

This is the second post in a series on the Life of an SRE at Instana. Check out the first post here.

Handling AWS EC2 Instance Retirements

Most people running a production system on AWS have at some point received an email from AWS informing them of an “Instance Retirement”. These emails typically have the following format:

"AWS Amazon EC2 Instance Retirement [AWS Account ID: XXXXXXXXXXXX]"

Hello,

EC2 has detected degradation of the underlying hardware hosting your 
Amazon EC2 instance (instance-ID: i-08c21960e74b7a276) associated 
with your AWS account (AWS Account ID: XXXXXXXXXXXX) in the eu-west-1 region. 
Due to this degradation your instance could already be unreachable. 
We will stop your instance after 2020-03-06.

...
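Since these emails follow a fixed template, the details we care about can be pulled out mechanically. The following is a minimal sketch, assuming the body matches the wording shown above; the function name and the regular expressions are our own illustration, not an AWS-provided format specification.

```python
import re
from datetime import date

# Patterns derived from the sample email above (assumptions, not an AWS spec).
INSTANCE_ID = re.compile(r"\bi-[0-9a-f]{8,17}\b")
REGION = re.compile(r"in the ([a-z]{2}(?:-[a-z]+)+-\d) region")
STOP_DATE = re.compile(r"stop your instance after (\d{4}-\d{2}-\d{2})")


def parse_retirement_email(body: str) -> dict:
    """Extract instance ID, region and stop date from a retirement email body."""
    text = " ".join(body.split())  # undo the hard line wraps of the email
    instance = INSTANCE_ID.search(text)
    region = REGION.search(text)
    stop = STOP_DATE.search(text)
    return {
        "instance_id": instance.group(0) if instance else None,
        "region": region.group(1) if region else None,
        "stop_after": date.fromisoformat(stop.group(1)) if stop else None,
    }


SAMPLE = """
EC2 has detected degradation of the underlying hardware hosting your
Amazon EC2 instance (instance-ID: i-08c21960e74b7a276) associated
with your AWS account (AWS Account ID: XXXXXXXXXXXX) in the eu-west-1 region.
Due to this degradation your instance could already be unreachable.
We will stop your instance after 2020-03-06.
"""

if __name__ == "__main__":
    print(parse_retirement_email(SAMPLE))
```

Fields that cannot be found come back as `None`, so a changed email template shows up as missing data rather than as an exception.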

At Instana we run about 1000 EC2 instances, which results in 1-2 retirement emails per week across multiple AWS regions. There can be many reasons for such incidents:

  • cloud hardware failures
  • network component failures
  • software upgrades

As SREs in charge of a SaaS platform, we typically have a few questions that we want answered right away when we receive these emails:

  1. Which host is impacted? (AWS instance IDs such as i-08c21960e74b7a276 are hard to keep in your head.)
  2. Was the host already rebooted?
    1. Depending on the answer, we either check various metrics to see the impact or schedule a manual reboot before the deadline specified in the email.
  3. What impact does this EC2 instance retirement have on my system?
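The same scheduled events that trigger these emails are also exposed by the EC2 API, which can serve as a cross-check for the first two questions. The sketch below is an assumption about how one might do this with boto3 and is not part of our Instana workflow: `pending_events` is a hypothetical helper that flattens a `DescribeInstanceStatus` response, and `fetch_pending_events` needs boto3 installed and AWS credentials configured.

```python
def pending_events(status_response: dict) -> list:
    """Return one row per open scheduled event in a DescribeInstanceStatus response."""
    rows = []
    for status in status_response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            # EC2 prefixes the description of finished events with "[Completed]".
            if event.get("Description", "").startswith("[Completed]"):
                continue
            rows.append({
                "instance_id": status["InstanceId"],
                "code": event["Code"],  # e.g. "instance-retirement"
                "not_before": event.get("NotBefore"),
            })
    return rows


def fetch_pending_events(region: str) -> list:
    """Query one region for open scheduled events (requires AWS credentials)."""
    import boto3  # third-party dependency, imported lazily

    ec2 = boto3.client("ec2", region_name=region)
    paginator = ec2.get_paginator("describe_instance_status")
    rows = []
    # IncludeAllInstances=True also reports instances that are not running.
    for page in paginator.paginate(IncludeAllInstances=True):
        rows.extend(pending_events(page))
    return rows
```

Keeping the response handling separate from the API call makes it possible to test the filtering logic without touching AWS.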

Instana can help answer all of these questions and more. Here is the typical flow we follow once we receive an instance retirement email from AWS.

Step 1: Show impacted host

We copy the instance ID into the search bar on the infrastructure map. Within a few seconds the EC2 instance appears on the map, and we can drill down to all relevant information.

Infrastructure Map

Step 2: Was the host already rebooted?

Looking at the host dashboard, we can easily see whether the instance was already rebooted (uptime) and what the current CPU load is. Once an instance is rebooted, the Instana Agent starts monitoring the host again and detects all running processes. This allows us to quickly see whether all processes are up and running.

Host Dashboard

By selecting the timeframe you are interested in, you can go through the product and check other charts and metrics.

CPU usage
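The uptime check in this step boils down to a simple comparison: if the host booted after the retirement notice arrived, the reboot has already happened. Here is a minimal sketch of that reasoning, assuming a Linux host where `/proc/uptime` is readable; the function names are hypothetical and this is not how the Instana Agent computes uptime.

```python
from datetime import datetime, timedelta, timezone


def boot_time(uptime_seconds: float, now: datetime) -> datetime:
    """Derive the boot timestamp from the current time and the uptime."""
    return now - timedelta(seconds=uptime_seconds)


def rebooted_since(uptime_seconds: float, notified_at: datetime, now: datetime) -> bool:
    """True if the host booted after the retirement notice was received."""
    return boot_time(uptime_seconds, now) > notified_at


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Read the uptime in seconds (first field of /proc/uptime on Linux)."""
    with open(path) as handle:
        return float(handle.read().split()[0])


if __name__ == "__main__":
    # Hypothetical timestamp of the notification email.
    notified = datetime(2020, 3, 4, 9, 0, tzinfo=timezone.utc)
    now = datetime.now(timezone.utc)
    print(rebooted_since(read_uptime_seconds(), notified, now))
```

If the result is false, the instance is still on the degraded hardware and a manual reboot needs to be scheduled before the deadline in the email.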

Step 3: What impact does this EC2 instance retirement have on my system?

In our case, the instance retirement was for a Cassandra node. As a first step, we check whether the Cassandra node is running fine, whether latencies and compactions are in good shape, and whether the Cassandra cluster was negatively impacted.

Cassandra Node Dashboard
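For readers without such a dashboard at hand, the cluster-membership part of this check can be approximated from the output of `nodetool status`, where each node line starts with a two-letter code such as `UN` (Up/Normal) or `DN` (Down/Normal). The parser below is a rough sketch under that assumption; the function name is hypothetical, and it says nothing about latencies or compactions.

```python
import re

# Node lines start with status (U/D) and state (N/L/J/M), then the address.
NODE_LINE = re.compile(r"^([UD])([NLJM])\s+(\S+)")


def unhealthy_nodes(nodetool_status: str) -> list:
    """Return (address, code) for every node that is not Up/Normal."""
    bad = []
    for line in nodetool_status.splitlines():
        match = NODE_LINE.match(line.strip())
        if not match:
            continue  # header, separator or blank line
        code = match.group(1) + match.group(2)
        if code != "UN":
            bad.append((match.group(3), code))
    return bad
```

An empty result means every node reported itself as Up/Normal at the time the command ran.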

Summary

After a few checks in Instana, we are confident that the reboot has not caused any outages. Therefore, we can archive the email and continue with our daily work. Instana’s 1-second metrics resolution and auto-detection of components make it easy to get answers to the most pressing questions when hosts get rebooted without prior notice.
