AIOps tools (artificial intelligence for IT operations) have recently become a hot topic. As distributed systems grow larger and, with microservices architectures, more complex and harder to manage, these tools are becoming essential for operating them properly.
AIOps tools and the Digital Transformation of IT Operations
When thinking of digital transformation, one of the first things that comes to mind is “moving to the paperless office”. While offices are still full of sheets of paper, another area, IT operations, is becoming more important by the day. The DevOps ethos proclaims how important it is to automate everything in operations and deployment, and as systems become more complex this need keeps growing. Collecting logs, trace data, and health or performance metrics has become commonplace, but correlating that information manually is tedious and error-prone.
As the next logical step, AIOps tools consume data from various services, collecting application logs and measuring a system’s health and performance. In doing so, they break down siloed IT information and bridge issues across software, hardware, and the cloud.
Image credit: https://www.gartner.com/en/documents/3893494/use-aiops-for-a-data-driven-approach-to-improve-insights
Intelligent algorithms, based on machine learning (ML), analyze the different data sets to find connections and relationships between services and infrastructure. Understanding these dependencies and relationships is critical for automatically correlating issues, identifying root causes, and taking remediation actions when appropriate. The Gartner AIOps model in the image above includes Digital Experience Monitoring (DEM) as a data input to AIOps. DEM includes data from Real User Monitoring, Synthetic Monitoring, and Mobile Application Monitoring.
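To make the role of dependency data more concrete, here is a minimal sketch of topology-based correlation. It is not how any particular vendor does it: the `DEPENDS_ON` map, the service names, and both helper functions are hypothetical, and the rule used (an alerting service with no alerting dependency is a root cause candidate) is a deliberate simplification of what ML-based tools do.

```python
# Hypothetical topology: each service maps to the services it depends on.
DEPENDS_ON = {
    "web": ["checkout", "search"],
    "checkout": ["payments-db"],
    "search": ["search-index"],
    "payments-db": [],
    "search-index": [],
}


def upstream(service, graph):
    """Return every service that `service` depends on, directly or transitively."""
    seen = set()
    stack = list(graph.get(service, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def root_cause_candidates(alerting, graph):
    """Return alerting services that have no alerting dependency of their own."""
    alerting = set(alerting)
    return sorted(s for s in alerting if not (upstream(s, graph) & alerting))


if __name__ == "__main__":
    # Three alerts fire at once, but only one service has no failing dependency.
    print(root_cause_candidates(["web", "checkout", "payments-db"], DEPENDS_ON))
```

In this sketch, three simultaneous alerts collapse into a single candidate, which is the kind of reduction that only becomes possible once the relationships between services are known.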
AIOps tools and automation
A core capability of most AIOps tools is automation, which comes in several levels. At the most basic level, AIOps tools automatically send notifications via Slack or email to the appropriate person or team.
At the next level, AIOps tools recognize problem patterns and suggest remediation steps that IT staff can take to mitigate the issue at hand.
The most advanced AIOps tools are capable of automatically identifying known problems using “fingerprinting” techniques, then automatically taking the appropriate remediation steps. The most common scenario right now is identifying and fixing simple issues automatically, such as detecting a JVM that has run out of memory and restarting it.
A major obstacle to fully automated remediation is trust. It takes time for IT staff to trust that an AIOps tool will accurately identify an issue and run the proper remediation steps without causing an even bigger problem.
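The levels of automation described above, and the trust problem, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the normalization rules, the `KNOWN_PROBLEMS` table, the `restart_jvm` action name, and the `auto_approved` set are hypothetical and do not reflect any specific product. The idea is that a fingerprint is a hash of a log message with its variable parts removed, and that a known fingerprint only triggers automatic execution once the team has explicitly approved that action.

```python
import hashlib
import re


def fingerprint(message: str) -> str:
    """Collapse variable parts of a log line so similar errors hash alike."""
    normalized = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)  # memory addresses
    normalized = re.sub(r"\d+", "<num>", normalized)          # pids, counters, ports
    normalized = re.sub(r"\s+", " ", normalized).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


# Hypothetical catalog of known problems, keyed by fingerprint.
KNOWN_PROBLEMS = {
    fingerprint("java.lang.OutOfMemoryError: Java heap space (pid 0)"): "restart_jvm",
}


def plan_remediation(message, auto_approved=frozenset()):
    """Decide whether to notify, suggest an action, or execute it automatically."""
    fp = fingerprint(message)
    action = KNOWN_PROBLEMS.get(fp)
    if action is None:
        # Unknown problem: fall back to the most basic level, a notification.
        return {"fingerprint": fp, "action": None, "mode": "notify"}
    # Known problem: only execute if the team has explicitly trusted this action.
    mode = "execute" if action in auto_approved else "suggest"
    return {"fingerprint": fp, "action": action, "mode": mode}


if __name__ == "__main__":
    line = "java.lang.OutOfMemoryError: Java heap space (pid 4711)"
    print(plan_remediation(line)["mode"])
    print(plan_remediation(line, auto_approved={"restart_jvm"})["mode"])
```

Keeping the approval list separate from the problem catalog lets a team start in suggestion mode and move individual actions to automatic execution as confidence grows.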
Are these tools even needed?
As an evolutionary step of IT Operations Analytics (ITOA), AIOps is designed to keep up with the changes found in today’s IT systems and the growing focus on SLA, SLI/SLO, and Time to Resolution. The following problems are a reality in every technology-driven enterprise:
- Growing IT Complexity: Systems have grown beyond the scale a human can grasp. Manually correlating information and events from different sources and systems is no longer feasible, especially given the highly dynamic nature of cloud environments and microservices architectures.
- Information Overload: More complex systems produce more logs and more metrics, and the volume of information to make sense of keeps growing rapidly. Furthermore, systems consist of more different technologies than ever before, which means operating more and more services whose failure domains are unknown.
- Faster Time to Resolution: With ever-increasing requirements on SLA and SLI/SLO, the desired Time to Resolution shrinks. Consumers expect systems to be continuously available, a direct effect of the consumerization of recent years. Every service outage directly or indirectly affects people’s work or lives.
- More Edge Computing: Moving more computation to the edge offers advantages in uptime, response time, and average latency. Managing systems running “on the edge” comes with previously unseen issues, and the elastic nature of edge computing keeps increasing the focus on automation.
- DevOps Convergence: The gap between engineering and operations is constantly shrinking, with developers having more influence and power over operational tasks. With this increasing power comes the responsibility to understand the system and ensure optimal performance and health.
To meet the challenges listed above, IT staff need assistance from AIOps tools. It’s simply not possible for humans to keep up with the ever-increasing demands of business and technology.
AIOps tools are still immature
The AI in AIOps can’t replace humans at this time. Instead, AI takes over the tedious work of correlating information and events from many varied data streams. Real-time anomaly detection, correlation of millions of events per second, and identification of patterns and relationships are the specialty of ML algorithms. AIOps tools reduce information overload, provide focused insights, and enable humans to focus on the creative aspects of problem resolution.
AIOps tools are currently not very mature when it comes to identifying complex issues and figuring out the best way to permanently resolve them. Simplistic resolutions like system restarts or adding more capacity are typically short-term methods for masking problems rather than long-term fixes.
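As a rough illustration of what anomaly detection on a metric stream means, the sketch below flags values that deviate strongly from a trailing window using a simple z-score. This is a basic statistical stand-in, not the ML models commercial AIOps tools use, and the function name, window size, and threshold are assumptions chosen for the example.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return indices of values far outside the trailing window's normal range."""
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # A perfectly flat window (sigma == 0) is skipped in this sketch.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        # The value joins the window either way, so a spike briefly widens the
        # baseline that later values are compared against.
        history.append(value)
    return anomalies


if __name__ == "__main__":
    latencies_ms = [100, 102, 98, 101, 99, 100, 250, 101]
    print(detect_anomalies(latencies_ms))
```

Even this simple approach shows why machines suit the task: the same check can run continuously over thousands of metrics, leaving people to investigate only the few points that stand out.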
With all of that said, AIOps tools are still invaluable to all levels of IT staff.
AIOps tool vendors
With AIOps requiring the integration of different information sources, the landscape of AIOps tool providers is broad. Generally, tools can be split into two types depending on how they collect information. First, there are the Domain Agnostic AIOps tools, which rely heavily on integrations with many different services to collect data. Second, there are the Domain Centric AIOps tools, which tend to collect most, or all, of the required information themselves. These tools also tend to be specific to a particular domain, such as log management or Application Performance Monitoring (APM). The following is an overview of some different AIOps tools and a description of where they fit in this broad space.
Domain Agnostic tools
BigPanda is a classic example of a tool that collects data from many disparate services to feed its ML algorithms for correlation and problem identification. “BigPanda captures and combines alerts with change and topology data from all your tools, then uses ML to spot problems and patterns that identify the root cause of performance issues or outages in real-time.” BigPanda offers an open approach to its ML capabilities to encourage trust among its users.
Moogsoft’s AIOps platform provides all necessary components and, in the company’s words, “herds a stampede of data into the appropriate chutes”. Integrating with external services to collect the necessary information, Moogsoft is another classic domain agnostic AIOps tool, providing noise reduction and causality analytics for the systems under observation. Service-specific features include the workflow engine, which provides an easy way to create custom logic for handling events, and the Situation Room, which optimizes communication and collaboration around a specific issue.
The Now Platform of ServiceNow includes all necessary components to deliver a full-fledged AIOps experience based on data collected from external data sources. As part of a massive platform, AIOps with ServiceNow is fully integrated with all other services and departments supported by ServiceNow. Given that tight integration with other parts of the platform, operational intelligence and AIOps become part of the company’s basic habits.
VictorOps / Splunk
Splunk and VictorOps can be used either independently or together. Splunk itself falls into the Domain Centric world, being designed for large amounts of log data. When used in conjunction with VictorOps, however, Splunk is “just” one of many integrations to source data from. VictorOps provides the analytical engine needed to correlate data not only from Splunk but from a multitude of other services, too.
Domain Centric tools
Instana is a modern APM platform, and therefore falls into the Domain Centric AIOps tools category. Instana uses its own Agent technology to automatically and continuously discover and monitor infrastructure and services, as well as to collect Distributed Traces of all requests flowing through the system. Instana uses a combination of machine learning, expert knowledge, and event correlation to automatically identify problems. Instana’s unique Dynamic Graph is the logical model used as the basis for deterministic root cause analysis.
Dynatrace comes from a more traditional APM background but has rebuilt its monitoring platform to support modern application architectures. Dynatrace collects most data using its own agents, making it a domain centric solution. Using its machine learning algorithms, Dynatrace focuses the user’s attention on the issues and resources that really matter. Dynatrace provides features such as Root-Cause Analysis, mapping of cloud environments, and event correlation. Dynatrace has named its AIOps engine DAVIS, which provides a robust set of AIOps capabilities.
CA Wily (Broadcom)
Broadcom provides CA Wily (formerly a product of CA Technologies Inc). As part of Broadcom’s broader BizOps category, it provides all the tools necessary to capture the information it needs to deliver the insight and automation required for AIOps. While CA Wily integrations exist as part of the Introscope APM product, they seem to focus on large enterprise solutions, such as SAP.
AppDynamics is another long-standing APM product which provides its own data collectors. Furthermore, integrations with external systems exist, and the data retrieved is fed into the correlation algorithms. With the Central Nervous System platform, AppDynamics (part of Cisco) provides a specific service platform purpose-built for AIOps.
New Relic’s newly introduced New Relic AI capability provides features dedicated to AIOps. The New Relic One platform collects the necessary information using its own agents (although the company recently announced that it will rely on OpenTelemetry agents in the future), then stores and correlates it and creates incidents. A vast range of integrations with external services and tools exists.
Datadog began as an IT Infrastructure Monitoring (ITIM) tool but over the years has expanded its offerings to include logs, tracing, security, AIOps, and more. While the rapid expansion of its portfolio means that it is not best in class at any one thing, the breadth of its offerings makes it interesting to many buyers looking for tool consolidation. Datadog’s AIOps capabilities center on Watchdog, which relies on loose correlation between disparate products to come up with meaningful insights.
Zenoss Cloud is a full-stack monitoring solution combined with log and incident management, as well as the intelligence and automation for AIOps. Data for Zenoss is mostly collected using “ZenPacks”, which are plugins or integrations into all kinds of different systems. Furthermore, an SDK can be used to develop custom plugins.
StackState sits somewhere between domain centric and domain agnostic. It collects much of its information through external services, but also fetches metrics and events using custom integrations. With support for a vast set of products and integration into other monitoring or APM systems, StackState gathers and correlates a broad range of information, providing a full-stack view of the operational landscape.
There are many more Domain Agnostic and Domain Centric AIOps tools in this very crowded landscape. We’ve tried to call out some of the most interesting and relevant ones in this article, but we know this list could change daily if we tried to keep up with them all. Domain Centric and Domain Agnostic tools can, and should, be used in conjunction with each other to create the best possible solution given the vast amounts of data being collected every day, even by the smallest of enterprises.