An ever-increasing number of system architectures and deployment strategies depend on Kubernetes-based environments.
Kubernetes (also known as k8s) is an orchestration platform and abstraction layer for containerized applications and services. As such, k8s manages and limits the resources available to containers on the physical machine, and takes care of deploying, (re)starting, stopping, and scaling service instances.
Originally developed by Google, Kubernetes was contributed to the Cloud Native Computing Foundation (CNCF) and became its seed project. Its design is heavily influenced by Google’s internal cluster manager Borg and incorporates many of the “lessons learned” from it.
In k8s, everything is modelled around building blocks known as Kubernetes Objects. They define the requested CPU and memory resources, network interfaces and forwarding rules, as well as resource limits. Resources are shared by the host operating system.
One of the most important concepts in k8s is the pod. A pod is a higher-level abstraction around containerized components. While many people think of a pod as a container, a more fitting definition would be “a group of containers and other components”. Next to the actual service, these additional components may be service mesh components, firewall components, or any other type of sidecar that interacts with or manages the actual service container.
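To make the idea of a pod as a group of containers with requested and limited resources more concrete, here is a minimal sketch that builds a Pod manifest as a Python dictionary and prints it as JSON. The field names (`resources`, `requests`, `limits`) follow the Kubernetes Pod specification; the pod name, container names, images, and resource values are purely hypothetical.

```python
import json


def build_pod_manifest(name: str) -> dict:
    """Return a Pod manifest with one service container and one sidecar.

    All names, images, and resource values are illustrative assumptions.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    # The actual service (hypothetical name and image)
                    "name": "orders-service",
                    "image": "example.com/orders-service:1.0",
                    "resources": {
                        # What the scheduler reserves for the container
                        "requests": {"cpu": "250m", "memory": "256Mi"},
                        # The upper bound the container may use
                        "limits": {"cpu": "500m", "memory": "512Mi"},
                    },
                },
                {
                    # A sidecar, for example a service mesh proxy (hypothetical)
                    "name": "mesh-proxy",
                    "image": "example.com/mesh-proxy:1.0",
                    "resources": {
                        "requests": {"cpu": "100m", "memory": "64Mi"},
                        "limits": {"cpu": "200m", "memory": "128Mi"},
                    },
                },
            ]
        },
    }


manifest = build_pod_manifest("orders")
print(json.dumps(manifest, indent=2))
```

Both containers live in the same pod, so they are scheduled together, while each one still declares its own requests and limits.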
Additionally, Kubernetes has components, such as:
- Services, which provide a simple service discovery abstraction (DNS name and load balancing)
- Replica Sets, to define the requested number of pod instances, as well as automatic deployments
- Volumes, for ephemeral or persistent storage
- Config Maps and Secrets, as a backend for configuration data or files, as well as secrets
- Namespaces, which enable the user to partition managed resources into separate sets, such as teams, projects, or staging and production environments
Additionally, Kubernetes has a few more advanced concepts. A great overview of their functionality and use cases is available in the extensive k8s documentation.
By design, resources available to the pods are shared resources. They are split by available time, size, or processing power.
While sharing resources is great and was introduced to increase resource utilization, the dark side is resource contention. Many situations can lead to pods fighting over resources.
In the past, especially in the early days of virtual machines, a common issue was plain overcommitment of resources. Imagine the physical machine has 10 GB of RAM and 20 VMs get started with 1 GB of RAM allocated each. As long as only a few virtual machines require their full gigabyte at the same time, there is no issue. But when they all request it at once …
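The arithmetic behind this example can be sketched in a few lines. The function names are hypothetical helpers introduced only for illustration; the numbers are the ones from the paragraph above.

```python
def overcommit_ratio(host_ram_gb: float, vm_count: int, ram_per_vm_gb: float) -> float:
    """Ratio of allocated RAM to physically available RAM (above 1.0 means overcommitted)."""
    return (vm_count * ram_per_vm_gb) / host_ram_gb


def max_vms_at_full_usage(host_ram_gb: float, ram_per_vm_gb: float) -> int:
    """How many VMs can use their full allocation at the same time."""
    return int(host_ram_gb // ram_per_vm_gb)


# 20 VMs with 1 GB each on a host with 10 GB of RAM
ratio = overcommit_ratio(host_ram_gb=10, vm_count=20, ram_per_vm_gb=1)
safe_vms = max_vms_at_full_usage(host_ram_gb=10, ram_per_vm_gb=1)

print(f"Overcommit ratio: {ratio}")          # 2.0
print(f"VMs at full usage: {safe_vms}")      # 10
```

With a ratio of 2.0, only half of the VMs can claim their full gigabyte at the same moment before the host runs out of memory.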
As experience with resource management grows, overcommitment seems to be less of an issue. However, resource planning is only as good as the data available to plan with. It also requires a good understanding of the changes between deployed versions and upcoming releases.
Capturing the necessary information can be tedious. The required data comes from a wide variety of sources, such as performance metrics, log files, and distributed traces. Services and system dependencies, on the other hand, are documented in the best case; otherwise, they need to be reverse engineered from source code or, even worse, from traffic captures.
The use of Observability tools is often limited to keeping an eye on the system’s health and performance state. The massive amount of data collected, and its quality, can also be used for system planning, though. This is especially true for Enterprise Observability grade solutions, which add automatic dependency discovery, service mapping, and data like performance profiles, with comparisons between environments, deployment types, and releases.
Planning for the future is never an easy task, no matter if it is a job choice, a marriage, or capacity planning for a server landscape.
Thanks to the cloud, the speed at which we can provide additional resources to our services has increased dramatically. Still, careful planning can prevent situations where the phone rings at 2 a.m. because the system is running out of resources.
To understand the resources a service needs, we can look at historical data.
- What is the typical spike under load or during peak times?
- Has any new or major feature or release happened during the last few weeks that changed that behavior?
- Are any new services being deployed?
- Is there a new technology being phased in?
- Is a high season, like Christmas, upcoming?
The number of questions to ask is almost infinite and depends heavily on the development methodology, the deployment frequency, and the system’s architecture.
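As a rough illustration of the first question, the typical spike under load, historical usage samples can be condensed into a few numbers. The sketch below is an assumption about how such a summary might look, not a prescribed method: the function name, the nearest-rank percentile, the 20 percent headroom, and the sample values are all hypothetical.

```python
from statistics import mean


def summarize_usage(samples, percentile=95, headroom_pct=20):
    """Summarize historical usage samples (for example memory in MiB).

    Uses the nearest-rank method for the percentile and adds a
    configurable headroom on top of it as a suggested resource request.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    # Nearest-rank percentile, computed with integer arithmetic
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    high = ordered[rank - 1]
    return {
        "mean": mean(ordered),
        "percentile": high,
        "peak": ordered[-1],
        "suggested_request": high * (100 + headroom_pct) / 100,
    }


# Hypothetical memory usage samples in MiB, collected over the last weeks
samples = [310, 325, 298, 402, 515, 330, 342, 610, 355, 320]
summary = summarize_usage(samples)
print(summary)
```

Comparing such summaries before and after a release, or before and after a high season, gives a first data-driven answer to several of the questions above.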
Just like with overcommitment, many people already keep those metrics in mind when planning. Thankfully, as mentioned, the cloud has decreased the need for long-range planning, since you can “just” add compute power or other resources to your cluster. However, every change comes with disruption, and not the positive kind. Careful planning can prevent service disruption and short-term latency increases.
Change Risk Assessment adds the idea that resource usage and contention are not the only important pieces of the resource planning puzzle. It extends this line of thought with a risk metric that is specific to a service, a cluster node, or any other component of the system.
That said, whenever an upgrade or change to a component is prepared, part of that preparation is to create a number that represents the risk of this update. So far, though, no magic formula has emerged, as there are many different risk factors to consider.
Risk factors may include, but are not limited to:
- The number of dependencies (up- and downstream)
- For high-risk dependencies, a weight factor for the current risk
- Cost of a (potential) downtime, such as lost business
- Startup and warmup time
- User impact
- Typical time to recover, such as
  - Deployment pipeline with rollback functionality
  - Components without automatic reconnection
With no specific formula in place to match all use cases, a good starting point is what we normally know as a recommendation or PageRank-style algorithm. The higher the number, the higher the risk of a component update, and the more time should be put into planning the change and preparing backup plans. Low-risk components, on the other hand, require less time-intensive planning.
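A minimal sketch of such a PageRank-style score over a dependency graph could look as follows. It is one possible interpretation, not an established formula: it assumes that a component becomes riskier to change the more other components depend on it, so rank flows from callers to the components they call. The function name, the damping factor of 0.85, and the example services are hypothetical.

```python
def dependency_rank(calls, damping=0.85, iterations=50):
    """PageRank-style score per component.

    `calls` maps each component to the list of components it calls.
    Components that many others depend on end up with a higher score.
    """
    nodes = set(calls) | {t for targets in calls.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = calls.get(node, [])
            if targets:
                # Pass this component's rank on to everything it calls
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # No outgoing calls: spread the rank evenly over all components
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical system: who calls whom
calls = {
    "frontend": ["orders", "catalog"],
    "orders": ["database", "payments"],
    "catalog": ["database"],
    "payments": ["database"],
}

scores = dependency_rank(calls)
for component, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{component:10s} {score:.3f}")
```

In this example the database ends up with the highest score, since every other service depends on it directly or indirectly. Further risk factors from the list above, such as downtime cost or recovery time, could be multiplied in as weights on top of this structural score.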
The reasoning behind the risk assessment is twofold. Firstly, there is human resource capacity planning: high-risk updates require more people to prepare, execute, and monitor the component change. Secondly, there is the risk of lost business when an update fails.
In either case, it prevents the company from wasting or losing money.
To gather the information needed to execute the Change Risk Assessment, we need a full understanding of all components that make up a system. This includes services (like REST APIs), resource managers (like Kubernetes), virtual machines, and physical machines, but also all downstream dependencies like external services, databases, and others.
Gathering the context for each component, as mentioned before, can be tedious. In the best case, the necessary information is documented somewhere, and the documentation is current. Experience shows, though, that this is most commonly not the case. A good Observability solution, such as Instana, can provide all the required data, such as dependency trees, out of the box.
Based on the Dynamic Graph, Instana automatically finds up- and downstream dependencies and maps out a dependency graph, not only between internal services, but also towards external services, and even “down to the metal”.
Powered by the automatic discovery and change recognition of system components, Instana has an always up-to-date understanding of the system’s architecture and presents the latest contextual information for any selected service, providing the user with the data necessary to execute the Change Risk Assessment.
On top of that, Instana understands releases and canary deployments. If a release is recognized, Instana automatically offers comparisons of health, performance, and latencies before and after the change, so that problems can quickly be seen and mitigated, massively reducing the Time to Restore.
The concept of Change Risk Assessment itself isn’t new. There are quite a few software solutions available, too. None of them, though, seems to fully support the needs of software or system components. Most of the time, the “risk” factors are too basic, only differentiating between low, medium, and high.
That may be good enough for simple systems. For more complex systems, however, these three categories are not enough, and a numeric factor on every component can also help increase the awareness of complexity and risk in the corresponding development teams.