“We live by software, our competitive advantage is driven by software,” explains David Merryweather, Vice President of Infrastructure and Site Reliability at Macmillan Learning. The company is known for its 175-year history as a textbook publisher, but today Macmillan uses technology to provide not just content but an always-on learning platform where students and teachers worldwide collaborate in real time.
The 2.5 million students using Macmillan software aren’t just reading textbooks. They depend on Macmillan to submit their homework and to see instructor feedback and recorded grades.
“When we have an outage, our biggest concern is, ‘are students impacted?’” Merryweather says. Students often submit work at the last minute, so if Macmillan’s platform goes down for 15 minutes, students might miss their deadline. There’s also the risk that work students or instructors have done on the platform will be lost, forcing them to redo assignments or feedback.
Macmillan’s 80 applications, 9 of which are customer-facing, were previously built on Windows or LAMP stack systems. Macmillan was using New Relic to monitor the system, a setup that was “good enough, mostly because we didn’t know any different,” Merryweather says, though it was supported by extensive homegrown additions. Still, whenever there was downtime, the team relied on four or five different tools and often spent at least 45 minutes just figuring out what was happening.
When the company moved to a container-based architecture, Macmillan realized the monitoring system wasn’t container-native and didn’t provide nearly enough granular visibility into the containerized application.
“We needed something that would give us a high level of granularity, without requiring a great deal of effort.”
Looking for a Container-Native, Off-the-Shelf Solution
As Macmillan began building Achieve, a containerized platform built on Docker Swarm, the company started looking for a new monitoring solution that would provide at least as much visibility as it had for the legacy applications. Any solution had to be container-native, but it also needed to configure itself and to scale up and down automatically.
One of the first things about Instana that attracted Merryweather was the one-second data granularity. New Relic, Macmillan’s previous monitoring solution, updated data every five minutes as an averaged window of the last 15 minutes. This diluted visibility and contributed to longer discovery and resolution times during an outage.
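To make the dilution effect concrete, here is a minimal sketch in Python. All of the numbers other than the 15-minute window are assumptions chosen for illustration (a 100 ms baseline, a 30-second spike to 5,000 ms); they are not Macmillan's data, and the code does not model how either product actually aggregates metrics.

```python
from statistics import mean

# Illustrative assumptions, not measurements from the case study.
BASELINE_MS = 100          # assumed normal response time
SPIKE_MS = 5000            # assumed response time during a short incident
SPIKE_SECONDS = 30         # assumed incident duration
WINDOW_SECONDS = 15 * 60   # the 15-minute averaging window mentioned in the text


def build_window(spike_start: int = 600) -> list:
    """Return one latency sample per second for a 15-minute window with one spike."""
    samples = [float(BASELINE_MS)] * WINDOW_SECONDS
    for second in range(spike_start, spike_start + SPIKE_SECONDS):
        samples[second] = float(SPIKE_MS)
    return samples


def summarize(samples: list) -> dict:
    """Compare what one-second data shows with what a window average shows."""
    return {
        "peak_1s_ms": max(samples),       # visible with one-second granularity
        "avg_15min_ms": mean(samples),    # what an averaged window reports
    }


if __name__ == "__main__":
    print(summarize(build_window()))
```

Under these assumptions the one-second series shows the full 5,000 ms peak, while the 15-minute average comes out at roughly 263 ms, which could easily pass for a mild slowdown rather than an incident.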
In addition, with Instana there is no need to manually manage configurations, so Macmillan wouldn’t need to invest a lot of engineering resources to get it set up in different environments.
Merryweather was also attracted by the Instana SDK, since the developers on his team like to tinker with things. After setting up Instana, Macmillan used the SDK to create a custom feature that injects transaction reference data across service boundaries, and the team is still evaluating how to use the OpenTracing SDK to improve performance and granularity.
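The general idea of carrying a transaction reference across service boundaries can be sketched as follows. This is a generic illustration, not the Instana SDK and not Macmillan's implementation: the header name `X-Transaction-Ref` and the functions `inject_reference`, `extract_reference`, and `call_downstream` are hypothetical names invented for this example.

```python
import uuid
from typing import Optional

# Hypothetical header name, chosen only for this illustration.
TRANSACTION_HEADER = "X-Transaction-Ref"


def inject_reference(headers: dict, reference: Optional[str] = None) -> dict:
    """Return a copy of the outgoing headers that carries a transaction reference.

    If no reference is supplied, a new one is generated at the entry point.
    """
    outgoing = dict(headers)
    outgoing[TRANSACTION_HEADER] = reference or uuid.uuid4().hex
    return outgoing


def extract_reference(headers: dict) -> Optional[str]:
    """Read the transaction reference from incoming headers, if present."""
    return headers.get(TRANSACTION_HEADER)


def call_downstream(incoming_headers: dict) -> dict:
    """Simulate a middle service forwarding the reference it received."""
    reference = extract_reference(incoming_headers)
    return inject_reference({"Accept": "application/json"}, reference)


if __name__ == "__main__":
    first_hop = inject_reference({"Accept": "application/json"}, "order-123")
    second_hop = call_downstream(first_hop)
    print(extract_reference(second_hop))
```

Because each service copies the same reference onto its outgoing calls, every hop in a request can be correlated afterward by that single value.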
“We have over 1,800 instances, and we manage anywhere between 300 and 800 containers at any one time,” Merryweather says. Nonetheless, Macmillan was able to roll out Instana on the Achieve platform in under four days. The rollout involved only deploying Instana onto the hosts and didn’t require any complex configuration management.
After setting up Instana, the mean time to respond dropped from six minutes to two minutes. Performance of Macmillan’s Gradebook application improved by a factor of ten. “We are using Instana as a tool to meter an application’s behavior,” Merryweather says.
“We’re able to change performance metrics and understand the application performance from the inside-out, in real-time, leading to major performance improvements.”
Macmillan’s 200 developers also use Instana on a daily basis. They all use it differently: some track down bugs in the integration environment, and others fine-tune performance in the load test environment. Either way, it’s used much more actively than any monitoring system Macmillan has used before.
“I think for a lot of engineers, it was a game-changer in terms of being able to understand how all these services fit together,” Merryweather says.
“Instana works the way we do. The way they work with containers, high-resolution data, zero configuration. That’s how we build our infrastructure and what we expect from vendors.”
Macmillan uses Instana for its Achieve platform, built on Docker Swarm. Over the course of 2020, the team plans to move to Kubernetes, specifically Amazon EKS. It also plans to move more of the user base to the container-based platform: there are currently around 30,000 users on the new platform, and the company expects 250,000 by fall 2020.