Running Instana’s SaaS service presents the same challenges our customers face running their own applications. It is a highly dynamic environment, under constant change and at the mercy of external influences.
How does our SRE team ensure that Instana continues to provide a high-quality service to our SaaS customers? They don’t achieve consistently excellent service without help: they have Stan on their team. That’s right, Instana uses Instana to monitor Instana.
The Instana SaaS platform is continually expanding as new customers sign up or existing customers add more agents. It is also perpetually changing: new releases are pushed live every two weeks, with patches applied in between. Does this sound like the microservices application you are running?
When Instana monitors an application, it traces every single request, acquires metric data at 1-second resolution, and processes all of that data into actionable information within 3 seconds. Let’s look at some numbers to get an idea of the scale of the task. Currently there are:
- 380 hosts
- 3064 containers
- 13TB of data ingress per week
- Expands to 80TB after denormalisation/decompression
- 600,000 Kafka messages per second
- 400TB data storage
- All growing at a rapid pace
The ingress volume may appear low, but that is by design. The Instana agent makes very efficient use of bandwidth when communicating with the backend. For time series metrics, only differences are transmitted; if a value has not changed between samples, no data is sent. Trace data is JSON text, which compresses very effectively.
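The idea of sending only changed samples can be illustrated with a simple delta-encoding sketch. This is a hypothetical illustration of the principle, not the agent’s actual wire protocol:

```python
def delta_encode(samples):
    """Emit (index, value) pairs only when a metric value changes.

    Unchanged samples produce no output, so a flat time series
    costs almost no bandwidth on the wire.
    """
    encoded = []
    previous = None
    for i, value in enumerate(samples):
        if value != previous:
            encoded.append((i, value))
            previous = value
    return encoded


def delta_decode(encoded, length):
    """Reconstruct the full series from the sparse change list."""
    changes = dict(encoded)
    samples = []
    value = None
    for i in range(length):
        value = changes.get(i, value)
        samples.append(value)
    return samples


cpu = [42, 42, 42, 43, 43, 42, 42, 42]
packed = delta_encode(cpu)
# Only 3 of the 8 samples need to be transmitted.
assert packed == [(0, 42), (3, 43), (5, 42)]
assert delta_decode(packed, len(cpu)) == cpu
```

For a mostly steady metric, the saving grows with the sampling rate: at 1-second resolution, a value that changes once a minute costs one sample instead of sixty.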
The Instana backend, where Stan lives, is itself a microservices application running in 2 regions and 6 availability zones on AWS and utilising these technologies:
- Java Dropwizard
- Apache Kafka
- Clickhouse data store
- Elasticsearch data store
- Cassandra data store
- Nomad scheduler
- Consul key value store
The rapid rate of change in the environment is not a challenge for our SRE team, as Stan does all the grunt work here. The agent automatically detects new components as they are deployed or scaled, ensuring total monitoring coverage, and the new components automatically appear on the appropriate dashboards.
Most of the functionality in Instana is provided by Java Dropwizard microservices. These services have been written to be robust and self-healing, so they do not cause many problems; if one does fail, the scheduler automatically restarts it. Multiple restarts of a container will trigger an alert so that our SRE team can investigate. The AWS infrastructure is very reliable, but it still has its own problems that can impact services, and Instana also alerts on these AWS outages.
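The “alert on repeated restarts” logic can be sketched as a sliding-window counter. The threshold and window below are illustrative values, not Instana’s built-in defaults:

```python
import time
from collections import deque


class RestartAlerter:
    """Alert when a container restarts too often within a time window.

    A single restart is routine (the scheduler recovers the service);
    several restarts in quick succession suggest a real problem.
    """

    def __init__(self, max_restarts=3, window_seconds=600):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.restarts = {}  # container id -> deque of restart timestamps

    def record_restart(self, container_id, timestamp=None):
        now = timestamp if timestamp is not None else time.time()
        events = self.restarts.setdefault(container_id, deque())
        events.append(now)
        # Drop restarts that fell out of the window.
        while events and now - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.max_restarts  # True -> raise an alert


alerter = RestartAlerter(max_restarts=3, window_seconds=600)
assert alerter.record_restart("web-1", timestamp=100) is False
assert alerter.record_restart("web-1", timestamp=200) is False
assert alerter.record_restart("web-1", timestamp=300) is True
```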
The technologies that require the keenest monitoring are the Nomad scheduler and the data stores. Should the scheduler have any problems, it jeopardises the whole application, so its health is of the utmost importance; its status is closely monitored, with any potential problems triggering alerts. On a day-to-day basis, the technologies that need the most care and feeding are the data stores. These can have an occasional performance blip, or a cluster member may not rejoin cleanly after an interruption. Again, these conditions are alerted on so that our SRE team can investigate and remediate. We take great pride in the quality of service we deliver, and the service status is publicly available.
Our SRE team members are beginning to suspect that Stan has an evil sense of humour. Alerts are often triggered just before lunch or at the end of a shift. Is this Stan having a bit of fun, or is it just our team being paranoid? Most of the alerting comes directly from Stan’s built-in knowledge and AI processing, with only 10 custom rule definitions required to cover very specific use cases, primarily around the interaction of services and data stores. Due to the way the Dynamic Graph handles entities, those custom rules require virtually zero maintenance.
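Why do such rules need virtually zero maintenance? Because they target entity *types* rather than named instances, so newly discovered components are covered automatically. The sketch below is a hypothetical illustration of that idea, not Instana’s actual rule syntax:

```python
def evaluate_rule(rule, entities):
    """Apply one rule to every entity of the matching type.

    The rule never names a specific host or container, so a newly
    scaled-up node is covered the moment it is discovered and the
    rule itself never needs editing.
    """
    violations = []
    for entity in entities:
        if entity["type"] == rule["entity_type"]:
            if entity["metrics"].get(rule["metric"], 0) > rule["threshold"]:
                violations.append(entity["name"])
    return violations


# Hypothetical rule: alert when a Cassandra node's pending
# compactions exceed 100.
rule = {"entity_type": "cassandra",
        "metric": "pending_compactions",
        "threshold": 100}

entities = [
    {"name": "cassandra-1", "type": "cassandra",
     "metrics": {"pending_compactions": 12}},
    {"name": "cassandra-2", "type": "cassandra",
     "metrics": {"pending_compactions": 240}},  # newly scaled-up node
    {"name": "es-1", "type": "elasticsearch", "metrics": {}},
]

assert evaluate_rule(rule, entities) == ["cassandra-2"]
```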
Triggered alerts contain summary information and a deep link to all the data on the Instana dashboard. The alert pulls together not only the metric data, it also provides cause and effect analysis. Using the relationship and event data from the Dynamic Graph, the alert visualises the issue rippling across the interdependent services. This automatic root cause analysis enables our operators to quickly identify what remedial action they need to take to ensure service continuity.
The saying goes “the proof of the pudding is in the eating”, and ours tastes pretty good. Our SRE team consistently meets our SLA of 99.5% uptime, with most issues fixed within 15–30 minutes.
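To put those numbers in perspective, a 99.5% SLA implies a concrete downtime budget:

```python
def downtime_budget_minutes(sla_percent, period_days=30):
    """Minutes of allowed downtime per period for a given SLA."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# 99.5% over a 30-day month allows 216 minutes (3.6 hours) of
# downtime, so 15-30 minute fixes leave plenty of headroom.
assert round(downtime_budget_minutes(99.5)) == 216
```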
What we learn from using Instana to monitor Instana is fed directly back into the product, making it better for our customers. This feedback loop continually makes Stan smarter, and he is already quite a polymath.