A recurring question in the DevOps community is, “How do you fix the many incidents that happen during or right after a release?” The answer comes down to detecting incidents fast and decreasing the time spent on remediation.
Rolling out new software can be overwhelming, and it has been suggested to be the cause of the majority of incidents at most companies. But while developers work toward faster release cycles, the SRE focus is to ensure the stability and availability of your application. Why? Because your users expect both new innovations and a reliable service.

How do we achieve the second aspect of this equation?
What is automated remediation?
Automated remediation enables your team to fix an incident faster. It ranges from basic alerting mechanisms and logging to fully automated remediation. To benefit most from automation, organizations are better off working their way through these levels of automation step by step. AI that can identify and use cause-and-effect relationships goes beyond correlation-based predictive models, moving toward systems that can prescribe actions more effectively and act more autonomously.
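To make the idea of levels of automation concrete, here is a minimal sketch in which the same incident is handled differently depending on how much autonomy a team has granted. The level names, the `Incident` fields, the `RUNBOOK` mapping, and the `handle` function are all hypothetical illustrations, not part of any Instana API.

```python
from dataclasses import dataclass
from enum import IntEnum


class AutomationLevel(IntEnum):
    """Hypothetical maturity levels, from least to most autonomous."""
    ALERT_ONLY = 1      # notify a human, take no action
    SUGGEST = 2         # notify and propose a runbook action
    AUTO_REMEDIATE = 3  # run the runbook action without waiting


@dataclass
class Incident:
    service: str
    root_cause: str


# Hypothetical mapping from a detected root cause to a runbook action.
RUNBOOK = {
    "memory_leak": "restart_process",
    "bad_deployment": "rollback_deployment",
    "faulty_feature_flag": "disable_feature_flag",
}


def handle(incident: Incident, level: AutomationLevel) -> list[str]:
    """Return the ordered steps taken for an incident at a given level."""
    steps = [f"alert: {incident.service} ({incident.root_cause})"]
    action = RUNBOOK.get(incident.root_cause)
    if action is None:
        # Unknown causes always go to a human, whatever the level.
        steps.append("escalate: no runbook entry")
        return steps
    if level >= AutomationLevel.AUTO_REMEDIATE:
        steps.append(f"execute: {action}")
    elif level >= AutomationLevel.SUGGEST:
        steps.append(f"suggest: {action}")
    return steps
```

A team could start every root cause at `ALERT_ONLY` and raise the level for individual causes only once the suggested action has proven reliable in practice.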
Let’s start by reviewing the benefits of automated remediation: increased efficiency, increased security, consistency, and continuous compliance logging.
- Increased efficiency: You no longer have to react and take action manually. The system takes actions based on past remediation, which shortens MTTR and frees your team to work on higher-value tasks. At the enterprise level especially, the time saved is significant.
- Increased security: Vulnerabilities and problems are addressed immediately upon discovery, preventing issues from escalating into incidents. Deployment rollbacks, for example, happen automatically.
- Consistency: Every action runs with the exact same workflow, so organizations can be sure that the prescribed procedures are always followed correctly.
- Continuous compliance logging: Real-time corrections are logged as proof of their results, keeping cloud environments continuously compliant instead of relying on periodic audits. This decreases risk for the business.
What are the requirements for rapid problem remediation that prevents downtime?
Recent reports suggest that the most critical issue during remediation is manual toil (lack of automation), including communication challenges, e.g., using the right runbooks or reaching the right people.
What you need is an observability application with automated remediation that detects problems, underlying incident root causes, and SLO impact across your full-stack production deployments.
We’ve identified these top five use cases for automated problem remediation:
- Feature flag settings: Observe application and service behaviors, identify error-causing feature flags, and switch them accordingly to guarantee stable environments.
- Process restarts (for example, JVM memory leaks): Trigger a service restart or related actions for applications with underlying bug fixes that have been deprioritized or delayed.
- Kubernetes resource adoption: Act on external, holistic, and customer-centric behavior observations (rather than on internal parameters alone) and automatically roll out Custom Resource Definitions (CRDs) to designated environments.
- Deployment and rollback: Trigger predefined rollback or roll-forward actions when a faulty deployment violates SLOs or burns through your error budget faster than the target allows.
- Targeted notifications: Based on the auto-detected details of underlying root causes, keep your business and technical users, SREs, and Operations team updated regarding ongoing remediation actions, and escalate if the situation requires higher visibility.
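The deployment and rollback use case can be sketched as a simple burn-rate check. This is a minimal sketch under stated assumptions: the `SloWindow` fields, the `burn_rate` and `decide` functions, and the default threshold of 2.0 are hypothetical, and a real system would read these numbers from its monitoring data rather than take them as arguments.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    slo_target: float     # e.g. 0.99 means 99% of requests must succeed
    total_requests: int   # requests observed since the deployment
    failed_requests: int  # failed requests observed since the deployment


def burn_rate(window: SloWindow) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the error budget is being spent exactly as fast
    as the SLO permits; higher values mean it is being spent faster.
    """
    if window.total_requests == 0:
        return 0.0
    allowed_error_rate = 1.0 - window.slo_target
    observed_error_rate = window.failed_requests / window.total_requests
    if allowed_error_rate <= 0.0:
        # A 100% target leaves no budget: any failure is an infinite burn.
        return float("inf") if window.failed_requests else 0.0
    return observed_error_rate / allowed_error_rate


def decide(window: SloWindow, max_burn_rate: float = 2.0) -> str:
    """Return "rollback" when the budget burns faster than allowed (assumed threshold)."""
    return "rollback" if burn_rate(window) > max_burn_rate else "keep"
```

For example, with a 99% target, 50 failures in 1,000 requests burn the budget at roughly five times the allowed rate, so `decide` returns `"rollback"`, while 5 failures in 1,000 stay within budget and return `"keep"`.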
Check out our previous post on how Instana wants to help you resolve issues faster. If you are interested in helping us with our research in this area and potentially trying out some prototype solutions, please contact [email protected]. Curious to test-drive Instana’s fast incident detection capabilities? Sign up for an Instana trial right now and get the level of visibility and contextual information you need to solve incidents fast.