Gaining Knowledge: How Stan Learns

“Monitoring systems are tempting choices for gathering performance metrics, but they usually end up having to trade off resolution for economy of storage and rarely have a resolution higher than 1 sample/minute. Low-resolution metrics are certainly useful for capacity planning, useless for performance tuning,” writes Al Tobey, Senior Site Reliability Engineer at Netflix, in his comprehensive Cassandra 2.1 Tuning Guide.
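
To make the resolution trade-off concrete, here is a minimal Python sketch. The latency values and the `downsample` helper are made up for illustration; they are not taken from the tuning guide or from Instana. The sketch averages a series of 1-second samples into 1-minute samples and shows how a short spike all but disappears.

```python
from statistics import mean

def downsample(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]

# Hypothetical data: two minutes of 1-second latency samples (ms),
# flat at 10 ms except for a 5-second spike to 200 ms.
per_second = [10.0] * 120
per_second[30:35] = [200.0] * 5

per_minute = downsample(per_second, 60)

# At 1-second resolution the spike is plainly visible.
print("max at 1 s resolution:", max(per_second))
# At 1 sample/minute it is smeared into a mildly elevated average.
print("1-minute averages:", [round(v, 1) for v in per_minute])
```

A 20-fold spike ends up as a one-minute average of roughly 26 ms, which is easy to overlook when tuning performance.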

Go-to tools for observation?

Actually, even if Cassandra is not the technology of your choice, his tuning guide serves as an excellent example of the state of the art in system performance analytics. Just skim the first few pages and you will see all kinds of low-level Unix tools such as dstat, vmstat, htop and the usual suspects.

These tools have a proven track record of helping well-versed performance engineers track down problems in complex application setups. They do one thing well: they display data. The engineer knows how to interpret that data and infer information from it. This information can then be used to find the root cause of a current problem and, ultimately, a fix for it.

Getting back to the blog post that our own COO, Pete Abrams, published about a month ago: the issue is that just collecting the data does not solve your problem. You still need a human expert to interpret that data and get to the root cause.

Learning from Data

Enter Stan! The good thing about Stan is that he is always learning. By learning, we mean not only leveraging machine-learning techniques but also acquiring explicit knowledge from human experts. Within Instana, we have several environments we like to think of as labs. These labs replicate production setups of components and hosts so that we can study their behavior. As described in the guide above, we can use Instana itself for data gathering because of our 1-second-resolution metrics. Having recorded all historical data, we then ask a human specialist to “teach” Stan the pattern that identifies the problem, and we integrate this learning into Stan’s knowledge base. Stan never sleeps, so the new knowledge becomes instantly available to all our customers. In many cases, our analytics engineers are even able to identify leading-indicator patterns or employ machine-learning techniques to robustly predict an upcoming problem. This way, Stan allows your operations to be proactive instead of reactive when system health is about to deteriorate.

Let me now give you an example of why it is important to use the lab environment to simulate problems and refine problem detection.

Learning from Expert Knowledge - An Example

For our Apache Cassandra knowledge, we use different setups that suffer from various problems. On one occasion, we simulated a cluster that was well tuned but already running at its maximum capacity: the CPU would not be able to handle much more load. We then introduced a sudden spike in end-user demand, shown on the left. In another run, we introduced an IO device that showed erratic behavior at times, shown on the right.

When we look at the data that our specialists recorded, the metrics over time look remarkably similar. First, take a look at the infrastructure-level metric, load. For this particular node, a load above 4 indicated that the system was over capacity (the red line). Thus, a simple threshold would have successfully detected that there was a problem but left you totally in the dark about what the problem actually was.
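
The limitation of a plain threshold can be sketched in a few lines of Python. The threshold of 4 comes from the text above; the two load series and the `breaches` helper are hypothetical stand-ins for the recorded data, not the actual measurements.

```python
LOAD_THRESHOLD = 4.0  # from the text: load above 4 means this node is over capacity

def breaches(load_samples, threshold=LOAD_THRESHOLD):
    """Return the indices of the samples that exceed the threshold."""
    return [i for i, load in enumerate(load_samples) if load > threshold]

# Hypothetical load series, one per simulated scenario.
demand_spike = [3.1, 3.4, 3.9, 4.6, 5.2, 5.0, 4.4]
erratic_io = [3.0, 3.3, 4.1, 4.8, 5.1, 4.9, 4.2]

for name, series in [("demand spike", demand_spike), ("erratic IO", erratic_io)]:
    alert = bool(breaches(series))
    print(f"{name}: alert={alert}")
```

Both scenarios raise the same alert, so the threshold alone says nothing about which of the two problems is occurring.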

A bit more interesting are the metrics provided by the technology under observation itself. When operating database systems, you are usually interested in read and write latencies. For those, you could employ a slightly smarter approach. For example, you could infer trend information and thus detect a rise in latencies, or you could learn from the past what a normal distribution of these metrics looks like and identify a problem that way. Still, you only know that something bad is going on, not exactly what it is.
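
Both ideas can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not Instana's implementation: the latency values, the `slope` and `z_score` helpers, and both cut-off values are hypothetical.

```python
from statistics import mean, stdev

def slope(values):
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

def z_score(value, baseline):
    """Number of standard deviations between `value` and the baseline mean."""
    return (value - mean(baseline)) / stdev(baseline)

# Hypothetical read latencies in ms: a learned "normal" period and a recent window.
baseline = [2.0, 2.2, 1.9, 2.1, 2.0, 1.8, 2.1, 2.0]
recent = [2.1, 2.6, 3.4, 4.1, 5.0]

rising = slope(recent) > 0.5                   # assumed cut-off, in ms per sample
unusual = z_score(recent[-1], baseline) > 3.0  # assumed 3-sigma rule

print(f"rising trend: {rising}, outside normal range: {unusual}")
```

Either check flags the latency increase, but neither carries any information about its cause.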

Up until now, the two issues seem to be indistinguishable from each other. Only when you look at metrics that are specific to the technology at hand - in this case, Apache Cassandra’s thread pool statistics - can you separate them. Specifically, you could look at the pending tasks of the thread pools depicted below and correlate their behavior. You also have to know their semantics; only then are you able to tell whether the problem is more likely to be an IO issue or a capacity problem.
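
As a rough sketch of what such semantic knowledge could look like in code, consider the Python rule below. The pool names appear in Cassandra's thread pool statistics, but the rule itself, the `floor` value, and the sample data are assumptions made for illustration. The post does not state which pools or thresholds Stan actually uses.

```python
from statistics import mean

def backlog(pending, floor=1.0):
    """True if the average number of pending tasks exceeds `floor` (assumed value)."""
    return mean(pending) > floor

def classify(pending_by_pool):
    """Guess the kind of problem from per-pool pending-task counts.

    Hypothetical rule, for illustration only:
    - a backlog in the flush writer pool suggests the disk cannot keep up
    - a backlog only in the request-serving pools suggests the node is out of capacity
    """
    flush = backlog(pending_by_pool["MemtableFlushWriter"])
    requests = (backlog(pending_by_pool["ReadStage"])
                or backlog(pending_by_pool["MutationStage"]))
    if flush:
        return "more likely an IO issue"
    if requests:
        return "more likely a capacity problem"
    return "no backlog"

# Hypothetical pending-task counts sampled once per second.
demand_spike = {
    "ReadStage": [0, 4, 12, 20, 15],
    "MutationStage": [0, 3, 9, 14, 10],
    "MemtableFlushWriter": [0, 0, 0, 1, 0],
}
erratic_io = {
    "ReadStage": [0, 2, 8, 11, 6],
    "MutationStage": [0, 5, 10, 16, 9],
    "MemtableFlushWriter": [0, 2, 4, 6, 5],
}

print("demand spike:", classify(demand_spike))
print("erratic IO:", classify(erratic_io))
```

The point of the sketch is that the rule only works because someone encoded what each pool's backlog means; the raw numbers in the two scenarios look much alike.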

While the above graphs are the result of our offline analytics efforts, we have also made a short video that shows the derived knowledge in action.

To summarize, this post should have given you an idea of how Stan learns expert knowledge. Although there is no particularly fancy data science in play here, many traditional tools would fail to tell the two example problems apart, let alone tell you what exactly the problem is and suggest a suitable, actionable fix. To find the precise root cause and suggest fixes, Instana combines a multivariate model for understanding modern system architectures, precise real-time data, and semantic insight into that data.

In our opinion, Stan's approach, which combines expert knowledge for different components and their interplay, semantic knowledge about the data that is being collected, and sophisticated algorithms, is not only the best but the only way to provide our customers with what we call The Stan Experience.
