How Automatic Change Detection Improves Root Cause Determination

OK, you’re an ops guy (if you’re not, pretend for a moment that you’re an ops guy or gal in charge of maintaining a dynamically scaled, highly distributed software system in the cloud). Let me ask you a question.

What is the number one reason your system gets into trouble? This includes everything from unpredictable (and unexplainable) spikes and volatility to (more or less catastrophic) crashes.

The answer is: Change!

It’s not the software bug (it happens) or the hardware failure (that happens too) that most often causes your system to fail. It is usually the perfectly normal, tested, QA-passed bug fix, the new feature or enhancement to a software component, or the version update of a framework, OS or other piece of infrastructure.

In itself, each change is not only harmless, it is in fact intentional: it is supposed to bring about an improvement, and the faster it can be deployed, the quicker the improvement can be realized. And this is usually the case. The problem is that change can also bring about unintended, unanticipated side effects, which then cause new problems.

Here is a short list of change vectors common in DevOps:

  • New microservice deployed
  • Version update to the OS, JVM or other technical layer
  • Configuration change to software components (OS, JVM, MySQL, Tomcat, Cassandra, etc.)
  • Adding or removing nodes in clusters
  • Code pushes (bug fixes or new releases)
  • Container deployments from orchestration tools
  • Re-design of deployment to enable scaling

Any one of these can unleash a stream of side effects that, in combination, can have unforeseen and mostly negative consequences. Who would have thought that increasing the memory of a caching layer would end up starving access to a message queue, or that the new version of the HTML templating engine would end up filling the email server’s disk? The butterfly effect is a very concrete reality in complex software systems.

These are the head-scratching moments when you go ‘hmm’ and, before hitting the ‘rollback’ button, want to be able to see which change is correlated with the current situation.

Well, now you can! Instana has added change detection to its capabilities!
We capture the startup and environment settings of a process, we recognize a recurring process as such (yes, we actually do!) and we find the deltas (aka changes!) in those settings. These changes become events, just like other events such as notifications fired because of a drop in throughput on a message queue or a disk volume filling up. Thanks to Instana’s dynamic dependency graph, we can now walk the graph and correlate cause (changes) with effect (performance impacts), uncovering ramifications that would otherwise be difficult to find.
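
To make the "find the deltas" step a bit more tangible, here is a minimal sketch in Python. It is not Instana's implementation, just an illustration under simple assumptions: a process's settings are captured as a flat key/value snapshot, and the names `ChangeEvent` and `diff_settings`, as well as the sample keys and values, are made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One detected delta in a process's settings (hypothetical type)."""
    key: str
    old: Optional[str]  # None means the setting did not exist before
    new: Optional[str]  # None means the setting was removed


def diff_settings(previous: Dict[str, str], current: Dict[str, str]) -> List[ChangeEvent]:
    """Compare two snapshots of the same recurring process and return the deltas."""
    events = []
    # Look at every key seen in either snapshot, in a stable order.
    for key in sorted(previous.keys() | current.keys()):
        old, new = previous.get(key), current.get(key)
        if old != new:
            events.append(ChangeEvent(key, old, new))
    return events


# Illustrative snapshots; these keys and values are invented for the example.
before = {"JAVA_OPTS": "-Xmx2g", "cache.size": "512m", "log.level": "info"}
after = {"JAVA_OPTS": "-Xmx4g", "cache.size": "512m", "pool.size": "20"}

changes = diff_settings(before, after)
for event in changes:
    print(f"{event.key}: {event.old!r} -> {event.new!r}")
```

In this sketch each `ChangeEvent` would then be treated like any other event, so it can be lined up in time against things like a throughput drop on a dependent component.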

Stan, Instana’s virtual DevOps assistant, will inform you if any change is about to bring disruptive consequences to your business.
Thank you Stan!