From “A Free-Food Frenzy” to “NASA, We Have a Problem”: Cautionary Tales of Disruptive Glitches & System Downtime

Healthy software and reliably running applications are essential to optimizing business performance, creating value through cost savings, and enhancing the consumer experience. However, failing to catch small or unusual glitches can cost a hefty sum and possibly create a PR crisis. In fact, according to Gartner, the average cost of network downtime for a medium-to-large enterprise can range from $5,600 to $9,000 per minute. Below is a simple formula for calculating the average cost of downtime for an enterprise or small business.

Downtime cost = Minutes of Downtime × Cost-Per-Minute. For a small business, use $427 as the cost-per-minute. For a large organization, use $9,000.

In more advanced cases, itemized labor, legal, goods, overhead, and lost revenue costs would factor into the formula.
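As a rough illustration, the simple formula can be written as a small Python function. The function name and the 30-minute outage are made up for this example; the per-minute figures are the ones quoted above.

```python
# Per-minute figures quoted in the text above.
SMALL_BUSINESS_COST_PER_MINUTE = 427
LARGE_ORG_COST_PER_MINUTE = 9_000


def downtime_cost(minutes_of_downtime, cost_per_minute):
    """Downtime cost = minutes of downtime x cost-per-minute."""
    if minutes_of_downtime < 0 or cost_per_minute < 0:
        raise ValueError("inputs must be non-negative")
    return minutes_of_downtime * cost_per_minute


# A hypothetical 30-minute outage.
print(downtime_cost(30, SMALL_BUSINESS_COST_PER_MINUTE))  # 12810
print(downtime_cost(30, LARGE_ORG_COST_PER_MINUTE))       # 270000
```

Even at the small-business rate, half an hour of downtime comes to nearly $13,000 under this formula.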

For instance, a glitch in DoorDash’s point-of-sale (POS) system recently set off a nationwide “free-food frenzy.” Customers ordered hundreds of thousands of dollars in food, alcohol, personal hygiene products, and even electronics without an authorized method of payment. IT professionals at the food delivery app worked feverishly to fix the glitch as word spread like wildfire across social media platforms. DoorDash customers celebrated by posting images of their newfound loot with catchy captions on Twitter and Instagram. For DoorDash, the cost of system downtime was exacerbated by its easy-to-use phone app and the rapid pace of social media communication.

It may take weeks or months for DoorDash to sort out the payment details for customers who seemingly received goods for “free,” and maybe even longer for the company to regain the trust of organizations and customers who were negatively impacted by the glitch. The food delivery app company could have limited the disruption with an observability tool that promptly alerts IT professionals to operational failures.

A lot can happen in 10 seconds, and in incidents like this every second counts, so it’s important to have context-rich alerts. Instana already correlates, groups, and orders multiple related issues (across different services and infrastructure elements) into a single incident. This understanding is powered by Instana’s Dynamic Graph, which provides an immediate view of the components affected by an overall incident.

To correlate issues, Instana’s Agent automatically collects, analyzes, and contextualizes information about:

  • Services and Service Instances
  • Frameworks used to build the services
  • Process and Runtime Environments, such as the JVM, .NET CLR, Node.js, …
  • Service Dependencies, such as HTTP calls between services
  • Distributed Traces across the service landscape
  • Cluster nodes and infrastructure those services run on
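To make the idea of correlating issues into one incident concrete, here is a minimal sketch that groups open issues whose components are linked by a service dependency. It is only an illustration of the general technique (connected components over a dependency graph), not Instana’s Dynamic Graph or its actual algorithm; the function name, issue IDs, and service names are all hypothetical.

```python
from collections import defaultdict


def group_issues(issues, dependencies):
    """Group issues whose components are linked by a dependency edge.

    issues: dict mapping issue id -> component name
    dependencies: iterable of (caller, callee) component pairs
    Returns a list of sorted issue-id lists, one list per incident.
    """
    affected = set(issues.values())

    # Keep only edges where both ends currently have an open issue.
    neighbors = defaultdict(set)
    for a, b in dependencies:
        if a in affected and b in affected:
            neighbors[a].add(b)
            neighbors[b].add(a)

    by_component = defaultdict(list)
    for issue_id, component in issues.items():
        by_component[component].append(issue_id)

    incidents, seen = [], set()
    for start in sorted(affected):
        if start in seen:
            continue
        # Walk the connected group of affected components.
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.extend(by_component[node])
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        incidents.append(sorted(group))
    return incidents


# Hypothetical open issues and service dependencies.
issues = {"i1": "checkout-service", "i2": "payment-db", "i3": "search-service"}
dependencies = [("checkout-service", "payment-db"), ("search-service", "catalog-db")]
print(group_issues(issues, dependencies))  # [['i1', 'i2'], ['i3']]
```

The two issues on the checkout path collapse into one incident, while the unrelated search issue stays separate, so an on-call engineer would see two alerts instead of three.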

Instana recognizes system incidents within a second of the glitch and delivers notification of the incident within three seconds. Instana’s major observability competitors either sample metrics at 10-second intervals or aggregate metrics over intervals of one minute or more, compared to Instana’s ultra-precise one-second metric interval.
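The effect of metric granularity can be shown with a toy example. The data below is invented (one quiet minute with a three-second burst of errors), and the sampling is deliberately simplistic; real monitoring tools may sample or aggregate differently.

```python
def sample(series, interval):
    """Keep one reading every `interval` seconds (series holds one value per second)."""
    return series[::interval]


# Hypothetical per-second error counts for one minute: quiet except a 3-second spike.
errors = [0] * 60
errors[24:27] = [50, 80, 60]

one_second = sample(errors, 1)
ten_second = sample(errors, 10)
minute_avg = sum(errors) / len(errors)

print(max(one_second))  # 80: the spike is fully visible
print(max(ten_second))  # 0: every 10-second sample falls outside the spike
print(round(minute_avg, 2))  # the one-minute average dilutes the spike to a low value
```

In this made-up series, the one-second view shows the burst clearly, the 10-second samples miss it entirely, and the one-minute average flattens it into a number that would be unlikely to cross an alert threshold.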

A Forbes article, entitled “How IT Service Management Delivers Value To The Digital Enterprise,” notes, “Many organizations still spend most of their IT budgets—and a good deal of staff time—keeping the lights on. Thirty-seven percent of survey respondents report that the majority of their IT budgets go to ongoing maintenance and management. Close to half of executives, 47%, indicate they are responding to the challenge of budget and resources going into maintenance and management by turning to cloud-based services.” As more organizations transition to cloud-based or cloud-native services, synthetic monitoring, which helps indicate when there are network problems between the end user and backend services, has become a critical observability function.

Consumer phone apps aren’t the only systems susceptible to glitches. NASA recently experienced a huge setback with James Webb, a $10 billion space observatory, when it almost failed to send images and new scientific data to mission operations. The space observatory has sun shield guards that roll back and hit a switch, alerting the team on Earth that the unfolding is complete and the ability to take images and collect data is available. The mission control team waited and waited to receive an alert that did not come.

After several days of waiting, the mission control team decided to fire the switch again – to no avail. Thermal engineers from NASA used telemetry data to identify a malfunction in the sun shield’s alerting system. Telemetry metrics and heat measurements indicated the sun shield guards unrolled, but they did not trigger the alert switch. Eventually, mission control engineers fixed the issue and James Webb produced its first images from outer space.

Telemetry helped NASA discover a glitch in James Webb’s alerting system, and OpenTelemetry (often abbreviated OTel) helps developers discover issues with software. OpenTelemetry is an open-source observability framework with a collection of software development kits (SDKs), vendor-neutral APIs, and tools for instrumentation. Instana uses this technology to generate, collect, and export telemetry data for analyzing your platform’s behavior and performance.

OpenTelemetry offers a vendor-neutral data format that can be integrated with any data processing backend. This is possible thanks to a concept called “exporters.” An exporter lets you configure which backend(s) your telemetry data is sent to. The exporter decouples the instrumentation from the backend configuration. This makes it easy to switch backends without the pain of re-instrumenting your code.
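The decoupling that exporters provide can be sketched in a few lines of plain Python. This is a conceptual illustration only and does not use the OpenTelemetry SDK; the class names (`Tracer`, `ConsoleExporter`, `InMemoryExporter`) and the `handle_checkout` function are hypothetical stand-ins, not OpenTelemetry APIs.

```python
class ConsoleExporter:
    """Hypothetical backend: prints each span."""

    def export(self, spans):
        for span in spans:
            print(f"console: {span['name']} took {span['duration_ms']} ms")


class InMemoryExporter:
    """Hypothetical backend: keeps spans in a list."""

    def __init__(self):
        self.received = []

    def export(self, spans):
        self.received.extend(spans)


class Tracer:
    """Hands every recorded span to whichever exporters were configured."""

    def __init__(self, exporters):
        self._exporters = list(exporters)

    def record(self, name, duration_ms):
        span = {"name": name, "duration_ms": duration_ms}
        for exporter in self._exporters:
            exporter.export([span])


def handle_checkout(tracer):
    # Instrumented application code: it knows nothing about backends.
    tracer.record("checkout", 42)


# Switching backends is a configuration change; handle_checkout is untouched.
handle_checkout(Tracer([ConsoleExporter()]))
memory = InMemoryExporter()
handle_checkout(Tracer([memory]))
print(memory.received)
```

The application function is written once against the tracer, and the choice of backend lives entirely in the line that builds the tracer, which is the property the exporter concept is meant to deliver.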

Glitches, system failures, and network downtime are problematic and can be costly for many reasons, but they don’t have to be. Installing a data-driven observability tool such as Instana, with 3-second notification, context-rich alerts, and advanced artificial intelligence automation, will save your organization from the headache and public embarrassment that can occur when weird stuff happens. Take a test drive: play with Instana or sign up for a free, 14-day trial.


Instana, an IBM company, provides an Enterprise Observability Platform with automated application monitoring capabilities to businesses operating complex, modern, cloud-native applications no matter where they reside – on-premises or in public and private clouds, including mobile devices or IBM Z.

Control hybrid modern applications with Instana’s AI-powered discovery of deep contextual dependencies inside hybrid applications. Instana also gives visibility into development pipelines to help enable closed-loop DevOps automation.

This provides the actionable feedback clients need as they optimize application performance, enable innovation, and mitigate risk, helping Dev+Ops add value and efficiency to software delivery pipelines while meeting their service and business level objectives.

For further information, please visit instana.com.