Kubernetes: Resource Contention and Change Risk Assessment

November 17, 2020


An ever-increasing number of system architectures and deployment strategies depend on Kubernetes-based environments.

Kubernetes (also known as k8s) is an orchestration platform and abstraction layer for containerized applications and services. As such, k8s manages and limits the resources containers can use on the physical machine, and takes care of deploying, (re)starting, stopping, and scaling service instances.

Kubernetes

Originally developed by Google, Kubernetes was donated to the Cloud Native Computing Foundation, which formed around it. Its design is heavily influenced by Google’s internal cluster manager Borg and combines many of the common “lessons learned”.

In k8s, everything is modelled around building blocks known as Kubernetes Objects. They define the requested CPU and memory resources, network interfaces and forwarding rules, as well as resource limits. The underlying resources are shared through the host operating system.

One of the most important concepts in k8s is the pod. A pod is a higher abstraction around containerized components. While many people think of a pod as a container, a more fitting definition would be “a group of containers and other components”. Additional components, next to the actual service, may be service mesh components, firewall components, or any other type of sidecar interacting with or managing the actual service container.

Additionally, Kubernetes has components, such as:

  • Services, which provide a simple service discovery abstraction (a DNS name and load balancing)
  • ReplicaSets, to define the requested number of pod instances and enable automated rollouts
  • Volumes, for ephemeral or persistent storage
  • ConfigMaps and Secrets, as a backend for configuration data, files, and secrets
  • Namespaces, which enable the user to partition managed resources into separate sets, for example per team, project, or staging / production environment
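To make the requests-and-limits idea above concrete, here is a sketch of the shape of a minimal Pod manifest, written as the plain Python dictionary you would serialize to YAML or JSON before sending it to the API server (the name, image, and resource values are purely illustrative):

```python
# A minimal Pod manifest with per-container CPU/memory requests
# (the guaranteed share) and limits (the hard ceiling).
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web", "namespace": "staging"},
    "spec": {
        "containers": [{
            "name": "web",
            "image": "nginx:1.19",
            "resources": {
                "requests": {"cpu": "250m", "memory": "256Mi"},  # used for scheduling
                "limits": {"cpu": "500m", "memory": "512Mi"},    # enforced at runtime
            },
        }],
    },
}

print(pod["spec"]["containers"][0]["resources"]["limits"]["memory"])  # 512Mi
```

The scheduler places the pod based on the requests, while the limits cap what the container may consume on the node, which is exactly where the contention discussed below comes from.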

Additionally, Kubernetes offers a few more advanced concepts. A great overview of their functionality and use cases is available in the extensive k8s documentation.

Resource Contention

By design, the resources available to pods are shared. They are divided by available time, size, or processing power.

While sharing resources is great and was introduced to increase resource utilization, the dark side is resource contention. Many situations can lead to pods fighting over resources.

In the past, especially in the early days of virtual machines, a common issue was plain overcommitment of resources. Imagine the physical machine has 10 GB of RAM and 20 VMs get started with 1 GB of RAM allocated each. As long as only a few virtual machines require their full gigabyte at the same time, there is no issue. But when they all request it at once …
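The arithmetic behind that scenario is simple enough to sketch in a few lines, using the numbers from the example above:

```python
# Back-of-the-envelope overcommitment check: a 10 GB host
# running 20 VMs with 1 GB of RAM allocated each.
physical_ram_gb = 10
vm_count = 20
ram_per_vm_gb = 1

committed_gb = vm_count * ram_per_vm_gb
overcommit_ratio = committed_gb / physical_ram_gb

print(committed_gb)      # 20 -> twice the physical 10 GB is promised
print(overcommit_ratio)  # 2.0
```

A ratio above 1.0 is not automatically a problem, but it means the host depends on the VMs never claiming their full allocation at the same time.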

As experience with resource management grows, overcommitment seems to be less of an issue. However, resource planning is only as good as the data at hand to plan ahead. It also requires a good understanding of the changes between deployed versions and upcoming releases.

Enterprise Observability

Capturing the necessary information can be tedious. The required data come from a wide variety of sources, such as performance metrics, log files, and distributed traces. Service and system dependencies, in the best case, are documented; otherwise they need to be reverse engineered from the source code or, even worse, from traffic captures.

The use of observability tools is often limited to keeping an eye on the system’s health and performance. The massive amount of high-quality data collected, though, can also be used for system planning. This is specifically true for solutions that are Enterprise Observability grade, adding automatic dependency discovery, service mapping, and data such as performance profiles, with comparisons between environments, deployment types, and releases.

Resource Planning

Planning for the future is never an easy task, no matter if it is a job choice, a marriage, or capacity planning for a server landscape.

Thanks to the cloud, the speed at which additional resources can be provided to our services has increased dramatically. Still, careful planning can prevent situations where the phone rings at 2 am because the system is running out of resources.

To understand the resources a service needs, we can look at historical data.

  • What is the typical spike under load or during peak times?
  • Has any new or major feature or release during the last few weeks changed that behavior?
  • Are any new services being deployed?
  • Is a new technology being phased in?
  • Is a high season, like Christmas, coming up?

The number of questions to ask is almost endless, and heavily depends on the development methodology, the deployment frequency, and the system’s architecture.

Change Risk Assessment

Just like with overcommitment, many people already keep those metrics in mind when planning. Thankfully, as mentioned, the cloud has decreased the need for long-term planning, since you can “just” add additional compute power or resources to your cluster. However, every change comes with disruption, and not the positive kind. Careful planning can prevent service disruptions and short-term latency increases.

Change Risk Assessment adds the idea that resource usage and contention are not the only important pieces of the resource planning puzzle. It enhances this line of thought with a risk metric specific to a service, a cluster node, or any other component of the system.

That is, whenever an upgrade or change of a component is prepared, part of that preparation is to calculate a number that represents the risk of this update. So far, though, no magic formula has emerged, as there are many different risk factors to consider.

Risk factors may include, but are not limited to:

  • The number of dependencies (up- and downstream)
  • For high-risk dependencies, a weighting factor for their current risk
  • The cost of a (potential) downtime, such as lost business
  • Startup and warmup time
  • User impact
  • The typical time to recover, influenced by, for example:
    • A deployment pipeline with rollback functionality
    • Components without automatic reconnection

With no specific formula in place to match all use cases, a good starting point is something akin to a recommendation or PageRank-style algorithm. The higher the number, the higher the risk of a component update, and the more time should be put into planning the change and preparing backup plans. Low-risk components, on the other hand, require less time-intensive planning.
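As an illustration of such a scoring scheme (a toy formula of my own, not a formula from any particular product), a component’s score could combine its intrinsic risk factors with the weighted scores of its dependencies, so that heavily depended-upon components bubble up, similar in spirit to PageRank:

```python
# Illustrative risk score: a component's risk is its own base risk
# (downtime cost, warmup time, user impact, ...) plus a weighted
# contribution from the risk of its up-/downstream dependencies.
def risk_score(base_risk, dependency_risks, weight=0.5):
    """base_risk: intrinsic risk of the component itself.
    dependency_risks: scores of the components it is coupled to."""
    propagated = weight * sum(dependency_risks)
    return base_risk + propagated

# A database tied to a risky consumer scores higher than a leaf service.
leaf = risk_score(base_risk=0.25, dependency_risks=[])
db = risk_score(base_risk=0.25, dependency_risks=[leaf, 0.75])
print(leaf, db)  # 0.25 0.75
```

The exact weights would need tuning per system; the point is only that the score is comparable across components, so planning effort can be allocated by rank rather than by gut feeling.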

The main reason behind the risk assessment is twofold. Firstly, human capacity planning: high-risk updates require more people to prepare, execute, and monitor the component change. Secondly, the risk of lost business when an update fails.

In any case though, it prevents the company from either wasting or losing money.

The Need of the Full View

To gather the information necessary to execute a Change Risk Assessment, we need a full understanding of all the components that make up a system. This includes services (like REST APIs), resource managers (like Kubernetes), virtual machines, and physical machines, but also all downstream dependencies like external services, databases, and others.

Gathering the context for each component, as mentioned before, can be tedious. In the best case, the necessary information is documented somewhere, and the documentation is current. Experience, though, shows that this is most commonly not the case. A good observability solution, such as Instana, can provide all the required data, such as dependency trees, out of the box.

Instana and the Dynamic Graph

Based on the Dynamic Graph, Instana automatically finds up- and downstream dependencies and maps out a dependency graph, not only between internal services, but also to external services, and even “down to the metal”.

Powered by the automatic discovery and change recognition of system components, Instana has an always up-to-date understanding of the system’s architecture and presents the latest contextual information for any selected service, providing the user with the data necessary to execute the Change Risk Assessment.

On top of that, Instana understands releases and canary deployments. If a release is recognized, Instana automatically offers comparisons of health, performance, and latency before and after the change, so that problems can quickly be spotted and mitigated, massively reducing the time to restore.

Know your Risk

The concept of Change Risk Assessment itself isn’t new, and quite a few software solutions are available. None of them, though, seems to fully support the needs of software or system components. Most of the time, the “risk” factors are too basic, differentiating only between low, medium, and high.

That may be good enough for simple systems. For more complex systems, however, these three categories are not enough, and a numeric factor on every component can also help increase the awareness of complexity and risk in the corresponding development teams.
