Monitoring is a tool. It is one of many tools in the hands of the IT department used to deliver the best service to their customers, be they company-internal or external. “The Art of Monitoring” book nails the term: “Monitoring provides the translation between business value and the metrics generated by your systems and applications. Your monitoring system translates those metrics into a measurable user experience. That measurable user experience provides feedback to the business to help ensure it's delivering what customers want. The user experience also provides feedback to IT to indicate what isn't working and what's delivering insufficient quality of service.”
So, in order to achieve this translation between business value and metrics, billions of metrics and their aggregates flow every day from various systems into all kinds of monitoring platforms. Thousands of collectors consume metrics through different protocols, on different time scales. At the other end of the line, a person watches the metrics as they come in and applies some sort of rules to catch anomalies, that is, metrics that deviate from the assumed or statistically computed normal.
This approach might work fine in an environment with just a few IT systems and dependencies between them. But it is considerably challenging for modern dynamic application environments. This 4-part blog post will explain why.
Management vs. Monitoring
Management is the act of maintaining a common direction across smaller parts. It is also applied to make larger decisions based on smaller facts. All in all, it is a tool to control complexity. Managing an organization, for example, typically means breaking it down into a healthy number of units, each managed in the way that suits it best, and making overall decisions only when needed, based on the work of the units and the overall direction of the organization.
With IT systems, it is no different: even at a fairly low level of complexity, they need to be managed. Most of the time these systems do their work well within their own scope, but sometimes they cannot. And if they cannot solve a problem themselves, they need a managing instance outside their own scope that maintains the overall, larger context and can solve overarching problems. The solution might lead back to one of the involved systems and change its behavior, but the decision itself has been made outside of that system.
People are always part of IT systems. In an IT organization, teams of people typically are responsible for specific IT systems - for their development, maintenance, health and user satisfaction. In order to follow the direction of the organization, these teams have their own goals within their own scope, and they use monitoring among other things to achieve them.
The problem is that monitoring keeps people busy. They have to continuously watch the data and make observations in order to understand the health of the system. If, on the other hand, machine power and intelligence are used to filter noise, recognize recurrent patterns, and find technology-typical issues and non-obvious anomalies, people can focus on making decisions based on indicators provided by the machine. Even more: by defining objectives, people can tell the machine when to notify them and even what automated actions to trigger should an important objective be missed. This clearly frees people’s hands and heads, and turns continuous work that keeps them busy all day long into occasional ad-hoc activity, performed only when absolutely necessary. This defines the big difference between monitoring and management.
Let’s look, for example, at charts showing metrics. We see them everywhere in the IT organization. They sure look good on large screens and impress visitors. But they also tend to attract the human eye. People tend to look at charts when they see them, even if they don’t have to. With that in mind, shouldn’t charts be an instrument for analyzing a problem rather than a continuous reminder to the team of their existence?
Organizations with continuous fine-grained reporting to higher management, even when there is nothing to report, suffer from micromanagement and tend to fail. In the world of IT systems it is the same: they should only tell you when they have problems, and otherwise report status at fairly long intervals. You don’t want to sit there and watch how they behave all day long - you don’t want to micromanage your IT systems. And even if you apply some basic rules for when your IT systems should demand your attention, you’re still watching them when they report nonsense to you, because it’s still you who maintains all the explicit rules about when they should report.
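To make the "tell me only when there is a problem, plus an occasional status" idea a bit more concrete, here is a minimal sketch in Python. The class name `QuietReporter`, the `observe` method and the one-hour interval are all made up for illustration; they are not part of any real monitoring product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QuietReporter:
    """Reports problems immediately and a status summary only rarely."""

    # Hypothetical default: at most one "all is well" message per hour.
    status_interval_s: float = 3600.0
    _last_status_at: float = field(default=float("-inf"), init=False)

    def observe(self, now_s: float, healthy: bool) -> List[str]:
        """Return the messages that should reach a human for this observation."""
        messages: List[str] = []
        if not healthy:
            # Problems are always reported right away.
            messages.append("problem")
        elif now_s - self._last_status_at >= self.status_interval_s:
            # Healthy: only speak up once the long interval has elapsed.
            messages.append("status: ok")
            self._last_status_at = now_s
        return messages
```

With a reporter like this, a healthy system stays silent between its infrequent status messages, so nobody has to watch it. Note that the rule for what counts as healthy is still supplied from outside, which is exactly the limitation described above.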
Service vs. System
The difference between a service and a system is that an IT system is turned into a service through the explicitly defined and managed expectations of its users. To exaggerate a little: an IT system that has not been turned into a service is unusable outside of the IT organization. Tech people tend to be patient with tools and adjust their expectations as they go. But when the business has made promises to customers, the supporting IT systems need to meet these expectations. This automatically turns them into services. A service is far more than just an API that hides the implementation details - it is the perception of the service by its users, be they humans or other services, and a contract expressed in terms of quality of service expectations, such as reliability, availability, performance etc.
Quality vs. Health
Knowing that a service is broken is not sufficient. It is essential to know how broken it is in order to judge whether anything needs to be done, which makes this knowledge part of the management process. The difference between a statement of health and a statement of quality is that quality is expressed as a number or percentage within a range, while health in the worst case is simply a boolean value.
To use a metaphor again, a binary health statement about a patient says only that they are sick or that they are not. Qualifying which body part is unhealthy, and by how much, on the other hand, makes it possible to decide whether medication will help, whether the patient needs to be hospitalized for observation, or whether they need urgent surgery. Of course, there are extreme cases where the doctor needs to resuscitate the patient, but honestly, we don’t want this to happen to people if we can prevent it. Neither do we want this to happen to our services if we can prevent it. Qualifying health more precisely than with a simple yes-or-no statement gives us the ability to make better decisions earlier, instead of just reacting to bring a service back to life.
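The contrast between a boolean health statement and a graded quality statement can be sketched in a few lines of Python. The function names and the thresholds in `suggested_action` are purely illustrative assumptions, chosen only to show how a graded value allows graded decisions.

```python
def health(successes: int, total: int) -> bool:
    """Binary health: True only if every observed request succeeded."""
    return total > 0 and successes == total


def quality(successes: int, total: int) -> float:
    """Quality as the percentage of successful requests (0.0 to 100.0)."""
    if total == 0:
        raise ValueError("no requests observed")
    return 100.0 * successes / total


def suggested_action(quality_pct: float) -> str:
    """Map a quality percentage to a reaction (thresholds are hypothetical)."""
    if quality_pct >= 99.9:
        return "none"
    if quality_pct >= 99.0:
        return "investigate"
    if quality_pct >= 90.0:
        return "intervene"
    return "emergency"


# Two very different situations look identical as a boolean...
slightly_degraded = (995, 1000)
badly_broken = (500, 1000)
same_health = health(*slightly_degraded) == health(*badly_broken)

# ...but lead to different decisions once quality is quantified.
action_a = suggested_action(quality(*slightly_degraded))
action_b = suggested_action(quality(*badly_broken))
```

Both services are simply "unhealthy" in the binary view, yet the first one only calls for a closer look while the second one is an emergency.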
Modeling is Crucial
You cannot manage if you don’t properly model what you would like to manage. A model is an abstraction of the whole that hides the underlying complexity behind an ideally small set of generalizations and their connections. There are multiple factors that make for a good model, such as:
The Model is Explicit
Everybody involved needs to share the same model. It must not hide in individual heads, because two people who believe they are talking about the same thing may well hold different models of it. Keeping models in people’s heads is also brittle from the perspective of the organization. When your service is in danger, the last thing you want is a discussion about terms and misunderstandings around the problem. Instead, you want your model to be clear about what’s going on, so you can derive your decisions from the situation, which again brings you closer to managing.
The Model is Minimal
If you try to fight complexity through modeling, the model should consist of as few generalizations and connections as possible. Otherwise it turns into complexity itself and quickly becomes an end in itself.
The Model Defines a Common Language
It is very important that you share the same simple language inside the IT organization and with your business customers when it comes to quality of service. Conflicting terms and misunderstandings will always lead to stressful discussions in case of trouble. Trouble situations are stressful enough without additional debates around misunderstandings.
The Model is Written
Everybody can look up the meaning of every generalization at any time, and the terms and their meaning can be quickly learned by new team members.
The Model is Accepted
A model that even one person involved does not accept is questionable. When it lacks acceptance from several people, it remains abstract and cannot be applied. When it comes to quality, model acceptance from all involved parties is essential.
The Model is Testable
In order to repeatedly verify the correctness of the quality measurement, the quality model needs to be testable, so test data, simulation and load tests can be applied at any time, generating predictable results.
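As a small illustration of what "testable" can mean here, the following Python sketch feeds a quality measurement with simulated data whose failure count is known in advance, so the expected result is predictable. The function names, the availability measure and the seeded generator are assumptions made for this example only.

```python
import random
from typing import List


def availability_pct(outcomes: List[bool]) -> float:
    """Percentage of successful outcomes in a list of booleans."""
    if not outcomes:
        raise ValueError("no outcomes to measure")
    return 100.0 * sum(outcomes) / len(outcomes)


def simulated_outcomes(total: int, failures: int, seed: int = 0) -> List[bool]:
    """Test data with an exactly known number of failures, in shuffled order."""
    outcomes = [False] * failures + [True] * (total - failures)
    # A seeded generator keeps the simulation repeatable from run to run.
    random.Random(seed).shuffle(outcomes)
    return outcomes


def test_measurement_is_predictable() -> None:
    """The measured quality must not depend on the order of the events."""
    for seed in range(5):
        data = simulated_outcomes(total=200, failures=10, seed=seed)
        assert availability_pct(data) == 95.0
```

Because the simulated input is fully controlled, the same check can be run at any time, for instance before changing how the quality measurement is computed.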
The model that Instana provides for quality of service management takes these factors into account. It is simple yet powerful, and is derived from industry best practices.
Instana’s model for quality of service management
Our model allows formalization of expectations as well as measurement of quality and opens the way for quick decisions once something unwanted happens. It consists of the following generalizations:
- Indicator - defines the quality aspects. It will be handled in depth in the second part of this series.
- Objective - defines expectations using indicators and some other criteria. It will be handled in depth in the third part of this series.
- Trigger - defines what happens when objectives are not reached. We discuss this in the last part of this series.
That’s it. We think that with this simple model of just three generalizations, you can evolve from monitoring, where you do all the work, to managing, where you do only the necessary work and leave the rest to Instana. But what looks simple on the surface is typically complex and well-conceived underneath. The rest of the series will give you an idea of how we implement and provide this model to you.
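Until then, here is a rough Python sketch of how the three generalizations could relate to each other. It is not Instana’s API and does not anticipate the later parts of the series: every class, field and function name below, as well as the latency example and its 200 ms limit, is a hypothetical choice made only to illustrate the relationships.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Indicator:
    """A quality aspect, computed from raw observations."""
    name: str
    compute: Callable[[Sequence[float]], float]


@dataclass
class Objective:
    """An expectation expressed in terms of an indicator."""
    indicator: Indicator
    max_value: float  # hypothetical criterion: the indicator must not exceed this

    def is_met(self, observations: Sequence[float]) -> bool:
        return self.indicator.compute(observations) <= self.max_value


@dataclass
class Trigger:
    """What happens when an objective is not reached."""
    objective: Objective
    action: Callable[[str], None]

    def evaluate(self, observations: Sequence[float]) -> bool:
        """Run the action if the objective is missed; return True if it fired."""
        if self.objective.is_met(observations):
            return False
        self.action(
            f"objective missed for indicator '{self.objective.indicator.name}'"
        )
        return True


def mean_latency_ms(samples: Sequence[float]) -> float:
    """Average of the observed latencies, in milliseconds."""
    return sum(samples) / len(samples)


# Usage: a person is only notified when the objective is actually missed.
notifications: List[str] = []
latency = Indicator("mean latency (ms)", mean_latency_ms)
objective = Objective(latency, max_value=200.0)
trigger = Trigger(objective, notifications.append)

trigger.evaluate([120.0, 150.0, 180.0])  # objective met, nothing happens
trigger.evaluate([250.0, 300.0, 350.0])  # objective missed, the action runs
```

In this sketch the person defines the objective once and then hears from the system only when it is missed, which is the shift from monitoring to managing that this post argues for.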