This is the second post in a series on the Life of an SRE at Instana. Check out the first post here.
Handling AWS EC2 Instance Retirements
Most people running a production system on AWS have at some point received an email from AWS informing them about an “Instance Retirement”. The emails typically have the following format:
"AWS Amazon EC2 Instance Retirement [AWS Account ID: XXXXXXXXXXXX]" Hello, EC2 has detected degradation of the underlying hardware hosting your Amazon EC2 instance (instance-ID: i-08c21960e74b7a276) associated with your AWS account (AWS Account ID: XXXXXXXXXXXX) in the eu-west-1 region. Due to this degradation your instance could already be unreachable. We will stop your instance after 2020-03-06. ...
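With 1-2 of these emails arriving per week, it can help to pull the key fields out of the message automatically. The following is a minimal sketch, assuming the email body follows the wording of the sample above; the regular expressions, the `RetirementNotice` type, and the `parse_retirement_email` function are hypothetical names for illustration, and AWS may change the wording at any time.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class RetirementNotice:
    """Fields extracted from a retirement email (hypothetical helper type)."""
    instance_id: str
    account_id: Optional[str]
    region: Optional[str]
    stop_after: Optional[date]


# Patterns derived from the sample email above; treat them as assumptions.
# EC2 instance IDs are "i-" followed by 17 (or, for older ones, 8) hex digits.
INSTANCE_RE = re.compile(r"\bi-(?:[0-9a-f]{17}|[0-9a-f]{8})\b")
# The account ID is redacted in the sample; real IDs are 12 digits.
ACCOUNT_RE = re.compile(r"AWS Account ID:\s*(\d{12})")
REGION_RE = re.compile(r"in the ([a-z]{2}(?:-[a-z]+)+-\d) region")
DEADLINE_RE = re.compile(r"stop your instance after (\d{4}-\d{2}-\d{2})")


def parse_retirement_email(body: str) -> Optional[RetirementNotice]:
    """Return the extracted fields, or None if no instance ID is found."""
    instance = INSTANCE_RE.search(body)
    if instance is None:
        return None
    account = ACCOUNT_RE.search(body)
    region = REGION_RE.search(body)
    deadline = DEADLINE_RE.search(body)
    return RetirementNotice(
        instance_id=instance.group(0),
        account_id=account.group(1) if account else None,
        region=region.group(1) if region else None,
        stop_after=date.fromisoformat(deadline.group(1)) if deadline else None,
    )
```

Fields other than the instance ID are optional on purpose, so a slightly different email wording still yields the ID needed for the lookup.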
At Instana we run about 1000 EC2 instances, which results in 1-2 retirement emails per week across multiple AWS regions. There can be many reasons for such incidents:
- cloud hardware failures
- network component failures
- software upgrades
As SREs in charge of a SaaS platform, we typically have a few questions that we want answered right away when receiving these emails:
- Which host is impacted? (It is hard to remember AWS instance IDs such as i-08c21960e74b7a276.)
- Was the host already rebooted?
- Depending on the answer, we need to either check various metrics to see the impact or schedule a manual reboot before the deadline specified in the email.
- What impact does this EC2 instance retirement have on my system?
Instana can help answer all of these questions, and more. Here is the typical flow we follow once we receive an instance retirement email from AWS.
Step 1: Show impacted host
We take the instance ID and paste it into the search bar on the infrastructure map. Within a few seconds the EC2 instance appears on the map, and we can drill down into all relevant information.
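Outside the Instana UI, the same ID-to-host lookup can also be scripted against the EC2 API. The sketch below is an illustration, not part of the workflow described here: it assumes a boto3 EC2 client is passed in, and `describe_host` is a hypothetical helper name. It relies only on the `describe_instances` call and the `Reservations`/`Instances`/`Tags` fields of its response.

```python
from typing import Any, Dict, Optional


def describe_host(ec2_client: Any, instance_id: str) -> Optional[Dict[str, str]]:
    """Map an instance ID to a human-readable summary (hypothetical helper).

    ec2_client is expected to behave like boto3.client("ec2", region_name=...).
    Note that the real API raises an error for an unknown instance ID rather
    than returning an empty result; that case is not handled here.
    """
    response = ec2_client.describe_instances(InstanceIds=[instance_id])
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # Tags arrive as a list of {"Key": ..., "Value": ...} dicts.
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            return {
                "name": tags.get("Name", "(no Name tag)"),
                "state": instance["State"]["Name"],
                "type": instance["InstanceType"],
                "private_ip": instance.get("PrivateIpAddress", ""),
            }
    return None


# Usage (requires boto3 and AWS credentials; the region is an assumption):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="eu-west-1")
#   print(describe_host(ec2, "i-08c21960e74b7a276"))
```

Passing the client in as a parameter keeps the function easy to exercise with a stub, and lets one script loop over several regions.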
Step 2: Was the host already rebooted?
Looking at the host dashboard we can easily see if the instance was already rebooted (uptime) and what the current CPU Load is. Once an instance gets rebooted, the Instana Agent will start monitoring the host again and detect all running processes. This allows us to quickly see if all processes are up and running.
By selecting the timeframe you are interested in, you can then go through the product and check other charts and metrics for that period.
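The uptime check in this step boils down to simple arithmetic: if the boot time derived from the uptime is later than the time the retirement email arrived, the host has been restarted since the notice. A minimal sketch of that logic follows; the function names are hypothetical, and `read_uptime_seconds` assumes a Linux host where `/proc/uptime` is available.

```python
from datetime import datetime, timedelta, timezone


def boot_time(uptime_seconds: float, now: datetime) -> datetime:
    """Derive the boot time from the current time and the reported uptime."""
    return now - timedelta(seconds=uptime_seconds)


def rebooted_since(notified_at: datetime, uptime_seconds: float, now: datetime) -> bool:
    """True if the host booted after the retirement email was received."""
    return boot_time(uptime_seconds, now) > notified_at


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Linux only: the first field of /proc/uptime is seconds since boot."""
    with open(path) as f:
        return float(f.read().split()[0])


# Usage on the affected host (the notification time is an example value):
#   notified = datetime(2020, 3, 4, 9, 0, tzinfo=timezone.utc)
#   now = datetime.now(timezone.utc)
#   print(rebooted_since(notified, read_uptime_seconds(), now))
```

Using timezone-aware datetimes throughout avoids comparing the email timestamp and the host clock in different time zones.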
Step 3: What impact does this EC2 instance retirement have on my system?
In our case, the instance retirement was for a Cassandra node. As a first step, we check whether the Cassandra node is running fine, whether latencies and compactions are in good shape, and whether the Cassandra cluster was negatively impacted.
After a few checks in Instana, we are confident that the reboot has not caused any outages. Therefore, we can archive the email and continue with our daily work. Instana’s 1-second metrics resolution and auto-detection of components make it easy to get answers to the most pressing questions when hosts get rebooted without prior notice.