This March, teams from Instana and Turbonomic travelled to Santa Clara, California, to represent IBM at the first USENIX SREcon since 2019. For those unfamiliar, SREcon is described as “a gathering of engineers who care deeply about site reliability, systems engineering, and working with complex distributed systems at scale.”
The conference offered two to three speaking tracks each day for the approximately 500 attendees. The sessions focused both on the challenges of building reliable systems and on how to define good SRE practice. Some specific topics included:
- “Logs Told Us It Was DNS, It Looked like DNS, It Had to Be DNS, It Wasn’t DNS”
- “Epic Incidents of History: The 1979 NORAD Nuclear Near Miss”
- “Building an APM with OpenTelemetry and OpenSource”
- “On the Wings of SREs; J.P. Morgan’s Journey into the Cloud”
The complete list of presentations is available, and video recordings will be posted over the coming weeks.
The Current State of SRE
Reliability Engineering is maturing as a practice, and this was reflected in the sessions and hallway discussions at the event. The community is focused on defining and evolving best practices to manage both the technological and the sociological processes involved in software delivery and maintenance.
We heard different challenges from organizations at different scales. Smaller organizations are able to adopt new tools and practices rapidly, but they lack the resources needed for defining best practices. Vendors that offer strong opinions about best practices provide an easy onboarding ramp for these small teams.
Larger organizations are slower to adopt new tools and practices but tend to have more tools in use. Consolidation onto centralized developer platforms is seen as the path to SRE maturity for these organizations. However, the siloed nature of large organizations means that many unpopular decisions are likely, and we should avoid thinking of large institutions as monolithic in their opinions.
Both small and large companies are looking for ways to reduce costs, and tool consolidation is becoming a hot topic. DevSecOps tools have proliferated wildly during cloud migrations. According to one report, most companies are using at least 6 (!) observability tools in production. Developers and executives alike see tool consolidation as a path to reducing both costs and toil, even if a multitool may not be as refined as a scalpel.
The Current State of Observability
Observability is so critical that many SRE teams have dedicated Observability leads or sub-teams. (In some cases, the observability team is a sibling team to the SRE team, but based on our conversations this is rare.)
But practitioners are beginning to get tired of hearing the same things from every observability vendor in the market. One engineer proclaimed, “Everybody here is selling answers. I don’t want you to sell me answers, I want the ability to ask better questions.”
Of course, the engineer was referring to root cause analysis. The narrative that our observability tools can immediately pinpoint a problem to a specific line of code is, and will continue to be, attractive. But experienced SREs know that this will never be possible for every incident, and ultimately they will need to pull out their analytics tools and dig deeper.
As one speaker put it, “Your root cause is not my root cause.”
Another hot topic was OpenTelemetry. The real-world best practices for OpenTelemetry are still emerging. eBay has established itself as an SRE thought leader through its use of OpenTelemetry in production, but beyond that example it is hard to find many large-scale production deployments.
This does not diminish the excitement for OpenTelemetry and open standards within the community — which creates an opportunity for more thought leadership in the Observability industry. For many less mature organizations, “monitoring” means application logs and infrastructure metrics. When they eventually look to add distributed tracing or APM to their pipeline, they will need guidance as well as tools.
What’s Next?
We were thrilled to participate in the return of SREcon. The discussions and sessions left us excited for more ahead. You can find us at KubeCon + CloudNativeCon Europe in Amsterdam this April and at GlueCon in Denver this May. And maybe take Instana for a spin to see if it can be your SRE team’s consolidated tool for Observability.