Lessons Learned Shipping Instana Self-Hosted On Kubernetes


At Instana, we recently improved the installation process for our self-hosted customers. Instana’s self-hosted platform now uses a fully Docker-based installation process. In a previous blog post, Lessons Learned From Dockerizing Our Application, we discussed how containerizing our product led to an enterprise-ready self-hosted platform that is scalable and continuously upgradable with as little effort as possible.

This initial single-host, dockerized approach has proven quite useful for the vast majority of our self-hosted customer base. We are now building on that success to further improve scalability for our biggest customers running extremely large infrastructures. To do so, we are piggybacking on what our SaaS team learned while moving our SaaS infrastructure to Kubernetes (K8s). With those learnings in hand, it is time to deliver an even more scalable, Kubernetes-based option for our customers, fulfilling the scalability demands a single box could not fully handle.

Instana strives never to add complexity unless the benefits greatly outweigh it. From our experience with K8s, we have learned that it is no silver bullet that magically makes your life easier. That being said, K8s does solve many problems by providing an abstraction layer, and while it does introduce some complexity, the benefits are more than worth it. By extending our product to be operable on K8s, we are providing the following benefits to our self-hosted customers:

  • K8s makes scalability easier by abstracting away the underlying hardware and providing a standard API
  • There is already high adoption of K8s within the market, and in particular within Instana’s customer base
  • Customers consolidating all their workloads on K8s already understand how K8s works, minimizing the impact of introducing new complexity
  • It allows our customers to manage Instana the same way they operate their own applications
  • It enables Instana to work on more advanced operational use cases, like guided or automated scale-out

The rest of this article looks at our reasons for going beyond dockerization and orchestrating our application with Kubernetes, why we decided to use the K8s operator model, and some key things we learned along the way while moving to K8s.

Why make our self-hosted application run on K8s now?

After the successful adoption of our dockerized single-host installation, we projected the challenges we might face in the future and identified two clear trends: K8s is seeing widespread adoption, and our customer base is growing, with increasing amounts of data that need to be processed. We are well aware that taking this step will bring in some complexity. Since our product is hosted by the customer, we cannot simply rely on cloud APIs for storage and inbound traffic, and we need to be prepared for a variety of implementations. But Kubernetes brings a standardized API that we can use to abstract away a lot of work while adopting proven architectural patterns, like the operator, to increase automation.

Why we chose the K8s operator pattern deployment model

There are many different ways to deploy applications to Kubernetes. For smaller applications, DevOps teams typically go with something like handcrafted, templated YAML files that are then handled by the CI/CD infrastructure.

Helm was the next evolutionary step. It was, and for many still is, the de facto way of providing an easy installation path for applications ranging from OSS databases to commercial products. But even Helm doesn’t solve the problems a complex distributed application like Instana faces when deploying to Kubernetes.
Let’s look at a few numbers. Instana consists of:

  • 6 different databases
  • Around 30 application components, each one scaled individually
  • 3 ingress endpoints
  • Web UI and API
  • A 4-week release cycle

Component updates and database migrations have to be coordinated to minimize downtime, while configurations have to be updated on the fly, taking into account the different scaling and deployment requirements of each component.

We all know that a well-behaved distributed application should just recover from whatever is thrown at it. While Instana can handle all kinds of failures, ranging from dying nodes to network outages and overload scenarios, we also know that the best failure is the one that is mitigated so fast no one ever notices.

Taking this into account, together with our mission to have a mostly hands-off, self-maintaining system, we see that a more active component is required.

Enter: the Operator Pattern. The operator pattern provides several benefits for the way Instana delivers its self-hosted application. These include:

  • Different ways of deploying and operating workloads on Kubernetes
  • Mapping an organizational structure onto a single environment, e.g. company divisions to tenant units
  • Minimal downtime during maintenance
  • Platform independence through the Kubernetes abstraction
  • Adoption of as much experience as possible from our SaaS infrastructure

Using the operator framework also allows auto-scaling functionality that exactly matches the needs of Instana’s components.
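Conceptually, the operator watches a custom resource describing the desired Instana installation and continuously reconciles the cluster toward that state. A purely illustrative sketch of what such a resource could look like (the API group, kind, and fields here are hypothetical, not our actual CRD):

apiVersion: instana.example.com/v1
kind: Core
metadata:
  name: instana-core
spec:
  # Sizing profile the operator translates into replica counts and resources
  profile: medium
  agentIngress: fullapm.instana.rocks
  units:
    - name: unit1
      tenant: tenant1

The operator compares this desired state with what is actually running and creates, updates, or scales the component deployments accordingly.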

Four lessons about moving to K8s

Accessing K8s-external databases from inside the cluster across namespaces

As mentioned above, there are a lot of databases in Instana, and there are many different ways of running them:

  • on a dedicated cluster
  • on a single box
  • via an operator in the same K8s cluster

We had to find a way to account for all of these options since customers will run them in whatever way they please. For Instana, we decided to embrace Kubernetes fully and treat databases as just another set of services. So, instead of expecting databases to be fully configured for the services, we wanted to rely on something we could look up via DNS. Now, there are a few things to know before diving into this.

Most databases support resolving cluster members via a DNS facility called SRV records. SRV records allow you to assign multiple IPs to a single name, and a DNS client can resolve and use all of them. In Java this is done with a simple call to InetAddress.getAllByName(<name>), and this is what most Java-based databases do out of the box (and yes, it’s not the nicest way to do this, since it ignores TTLs, but that’s how they do it). So a service in K8s will have its dedicated DNS name and all associated IP addresses as its SRV records. Pretty convenient.

Defining a service that accesses other pods running in K8s is straightforward, and there are plenty of examples out there. The case we wanted to support, databases external to the K8s cluster, wasn’t really that well explained, and it took a while to figure it out.

The trick is to have a service with clusterIP: None and the endpoints defined explicitly:

apiVersion: v1
kind: Service
metadata:
  name: cassandra-spans
spec:
  clusterIP: None
  ports:
    - name: "tcp"
      protocol: "TCP"
      port: 9042
      targetPort: 9042
---
apiVersion: v1
kind: Endpoints
metadata:
  name: cassandra-spans
subsets:
  - addresses:
      - ip: 10.12.1.2
      - ip: 10.12.1.3
      - ip: 10.12.1.4
    ports:
      - port: 9042
        name: "tcp"

Calling InetAddress.getAllByName(<name>) will return the unordered list 10.12.1.2, 10.12.1.3, 10.12.1.4.

This already hints at a problem with this approach: K8s doesn’t follow the SRV spec completely, which requires entries to be ordered by priority. For most databases this doesn’t matter, but in other scenarios you might be interested in that information, or you might have to rely on a guaranteed order of the entries. There is work being done on this issue, so we expect to see progress there in the near future.

Another downside of this approach is that Endpoints only accept IP addresses; host names are simply not allowed.
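If an external database is reachable under a single, stable host name, a Service of type ExternalName can serve as a workaround; note that it merely creates a DNS CNAME and does not give you the multi-IP SRV behavior described above. A minimal sketch (the host name is a placeholder):

apiVersion: v1
kind: Service
metadata:
  name: cassandra-spans
spec:
  # Resolves as a CNAME to the external host; no cluster IP,
  # no proxying, and no multi-IP SRV records.
  type: ExternalName
  externalName: cassandra.example.com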

Persistent volumes must be readable and writable from many pods

In cloud environments we use cheap object stores to persist details of traces that are ingested into Instana. This approach has proven to be the most efficient, but it is not always applicable in private data centers. While customers do tend to have some S3-API-compliant storage on-premises, the limits are usually so restrictive that it’s not usable for our high-data-volume use case. For the single-host dockerized installation we use local disks to store the data and have achieved very good results with that approach.

Now, in the Kubernetes environment, we face some challenges. The first is that we cannot simply write to the local disk, because we do not want to bind a component to a single node and force a second component to be collocated with it just to read the data. The other dimension to keep in mind is that we are preparing to scale out the components, so we need one place to store the data. Thus we need to offer multiple configurable options, including one for NFS storage, which we abstract as a persistent volume and attach to through a persistent volume claim. For reference, you can check our template, but some challenges only surfaced in real customer environments. In one Rancher environment where we supported the customer in rolling out the new deployment model, the NFS PVC was created automatically through Rancher, with a unique path each time a PVC was created. That, of course, does not fit our use case, as we need to rely on the same path to find the persisted data.
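To make the pattern concrete, here is a rough sketch of such an NFS-backed volume and claim (server, path, and sizes are placeholders, not the actual values from our template):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: spans-volume
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany          # must be readable and writable from many pods
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: nfs.example.com
    path: /instana/spans     # the path has to stay stable to find persisted data again
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spans-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""       # bind to the pre-provisioned volume, not a dynamic provisioner
  volumeName: spans-volume
  resources:
    requests:
      storage: 500Gi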

Ingress

Currently, the only scenario where the ingress configuration for a K8s cluster is simple is in the cloud: the cloud provider takes over the heavy lifting, automatically creating a load balancer, dynamically assigning IP addresses, and wiring all the bits together. With customers running K8s in a private datacenter, we face a lot of different implementations and scenarios. In one scenario we exposed nodeports and configured a DNS load balancer to distribute connection attempts from the agents to the backend. This led to multiple challenges: in scaled-out environments we have multiple instances of our component called acceptor, which holds the connections to the agents and forwards the data to the subsequent processing layers. With the nodeport configuration, we are no longer able to let the components float freely in the K8s cluster; we are forced to pin them to specific nodes so that traffic from the outside can always reach its destination. In another scenario, the customer used HAProxy HTTP load balancers with TLS termination at the edge, whereas we rely on encrypted traffic and expect termination to happen in our components. All of these scenarios require us to get deeply involved in individual customer setups, showing where K8s fails to provide a sufficient abstraction layer for delivering distributed software products without headaches.
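To illustrate the nodeport scenario, here is a minimal sketch of such a service for the acceptor (names and ports are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: acceptor
spec:
  type: NodePort
  selector:
    app: acceptor
  ports:
    - port: 8600
      targetPort: 8600
      nodePort: 30600        # the external DNS load balancer points agents at <node>:30600

Because agents connect from outside the cluster to fixed node addresses, the acceptor pods end up pinned to those nodes instead of floating freely.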

Configuration management on K8s is hard

Putting all options and dimensions together, we faced a pretty complex system. At the same time, we wanted it to be as easy as possible to install and operate. For operations, as previously mentioned, we decided to rely on the operator model and to codify the expertise our SREs acquired running Instana at high scale. This is especially true for the zero-day experience, where it is crucial to reduce the time to value.

For the beta launch we created a GitHub repo, https://github.com/instana/self-hosted-k8s/, with predefined Kustomize templates (templating functionality built into K8s) to generate the configurations. What we learned pretty fast is that the barrier is still too high: configuring the YAML files by hand brings a lot of pitfalls, and the user needs to be familiar with the Instana architecture to get started. Looking at the adoption of our settings file from the single-host installation, where we reduced configuration to the bare minimum, we decided to take a similar approach for the Kubernetes deployment going forward:

  1. Helm chart for deployment of the operator with all other prerequisites
    1. One-liner to quickly get started
    2. Delivers zero-day utility functions
  2. A simple settings.hcl used to generate the configuration
    1. Known to our customers from the single-host installation
    2. Better experience than YAML

An example settings.hcl:
download_key = "dl_key_1234"
sales_key = "sales_key_1234"
base_domain = "instana.rocks"
agent_ingress = "fullapm.instana.rocks"
core_name = "wow"

databases "cassandra_service" {
  database = "cassandra"
  namespace = "cassandra_namespace"
  schemas = ["schema1", "schema2"]
}

email {
  smtp {
    from = "[email protected]"
    host = ""
    port = 0
    user = ""
    password = ""
    use_ssl = false
    start_tls = false
  }
}

profile = "medium"

spans {
  persistent_volume {
    volume_name = "volumina"
    storage_class = "classica"
  }
}

units "unit1" {
  namespace = "unit_namespace"
  tenant_name = "tenant1"
  initial_agent_key = "supersecret_key"
}

Should you move your workloads to K8s?

There should be no illusions: K8s will add complexity to your tech stack. In cloud environments the barrier is lower, as you can rely on cost-efficient managed services like Google GKE or Amazon EKS. But K8s also provides a standardized abstraction and, as the wide adoption in the market clearly shows, it is the way forward.

Planning to Virtually Visit Dockercon?

To learn more about the innovative ways Instana works with Docker, drop in on our virtual session, Monitoring in a Microservices World, at Dockercon on May 28th at 1pm EST. Fabian Stäber, Senior Engineering Manager, will discuss the paradigm shift in software engineering away from static, monolithic applications towards dynamic, distributed, horizontally scalable architectures, and how Docker is one of the key technologies enabling this development.

