Introduction to Mesosphere DC/OS

What is Mesosphere DC/OS?

Datacenter Operating System (DC/OS) is an open source operating system based on the Apache Mesos distributed systems kernel. Developed by Mesosphere, DC/OS is available as both an open source and a commercial offering. The commercial Enterprise Edition adds enhancements on top of the open source version, focused on security and compliance, multi-tenancy, load balancing and dedicated support.

Mesosphere Glossary

Mesos – A cluster manager that provides resource isolation and sharing across a group of hosts, whether physical, virtual or cloud. The cluster can be composed of different host types, e.g. one with 2 CPUs and 2GB of memory and another with 8 CPUs and 64GB of memory. Mesos offers available resources to a framework, and the framework decides which of those resources to consume. A number of modern frameworks can run on Mesos, including Apache Spark, Hadoop, Kafka and JBoss Data Grid.
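The offer model described above can be sketched in a few lines of Python. This is only a toy illustration, not the Mesos API: the `Offer` and `Task` types, the `choose_offers` function and the host and task names are all hypothetical, and a real framework scheduler handles far more than CPU and memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """Free resources on one host, as offered to a framework (hypothetical type)."""
    host: str
    cpus: float
    mem_gb: float


@dataclass(frozen=True)
class Task:
    """Resources a framework wants for one task (hypothetical type)."""
    name: str
    cpus: float
    mem_gb: float


def choose_offers(offers, tasks):
    """Framework-side decision: place each task on the first offer that fits.

    Returns (placements, declined): placements maps task name to host,
    declined lists hosts whose offers were not used at all.
    """
    # Track what is still free on each host as tasks are placed.
    remaining = {o.host: [o.cpus, o.mem_gb] for o in offers}
    placements = {}
    for task in tasks:
        for host, free in remaining.items():
            if free[0] >= task.cpus and free[1] >= task.mem_gb:
                free[0] -= task.cpus
                free[1] -= task.mem_gb
                placements[task.name] = host
                break
    used = set(placements.values())
    declined = [o.host for o in offers if o.host not in used]
    return placements, declined


# Two host types, matching the example sizes in the text.
offers = [Offer("small-1", cpus=2, mem_gb=2), Offer("large-1", cpus=8, mem_gb=64)]
tasks = [Task("web", cpus=1, mem_gb=1), Task("spark-executor", cpus=4, mem_gb=16)]
placements, declined = choose_offers(offers, tasks)
print(placements)
print(declined)
```

The point of the sketch is the division of labour: the cluster manager only advertises what is free, and the placement decision belongs to the framework.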

[Figure: Mesos architecture]

Marathon – A container orchestration platform for DC/OS and Mesos. This bit can be a tad confusing. When running on Mesos, Marathon receives resource offers and uses those resources to schedule tasks in either Docker or Mesos containers. Like other container orchestration platforms, it ensures that the required number of instances of the requested tasks are running. If an instance fails, a new one is started with resources offered by Mesos. Typically the first task that Marathon starts is either Chronos or Metronome, which are responsible for running periodic jobs. Service discovery and load balancing also need to be configured; Marathon-LB, a packaged implementation of HAProxy, can be used for this purpose.
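As a rough illustration of how a task and its required instance count are described to Marathon, the sketch below builds a minimal app definition as JSON. The field names (`id`, `instances`, `cpus`, `mem`, `container.docker.image`) follow Marathon's app definition format as commonly documented; the app id, image name and the `marathon_app` helper are hypothetical. Such a definition is normally submitted to Marathon's REST API at `/v2/apps`, which is left out here so the example runs without a cluster.

```python
import json


def marathon_app(app_id, image, instances, cpus=0.5, mem_mb=128):
    """Build a minimal Marathon app definition for a Docker task.

    Hypothetical helper: only a handful of the available fields are set.
    """
    return {
        "id": app_id,
        "instances": instances,  # Marathon keeps this many instances running
        "cpus": cpus,            # CPU share requested per instance
        "mem": mem_mb,           # memory in MB requested per instance
        "container": {
            "type": "DOCKER",
            "docker": {"image": image},
        },
    }


# App id and image are made up for illustration.
app = marathon_app("/shop/frontend", "example/frontend:1.0", instances=3)
payload = json.dumps(app, indent=2)
print(payload)
```

Changing `instances` and resubmitting the definition is how the desired number of running copies is raised or lowered; Marathon then starts or stops tasks to match.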

DC/OS – Marathon on Mesos already provides container orchestration and cluster resource management, so why use DC/OS? In short, you don’t need to, but it does provide some extra functionality that may come in handy when running a microservices application in a large enterprise. DC/OS is a layer on top that provides virtual IP routing: your service endpoint is allocated a dedicated virtual address, making it reachable from anywhere in the cluster, no matter which node it has been scheduled to run on. Load balancing and rerouting around failures are performed automatically. A services catalogue provides a list of popular applications (Elasticsearch, Kafka, Spark and of course the Instana agent) that are available for one-click deployment. To make administration easier, a web-based dashboard is provided as well as a command line interface. The Enterprise edition adds security and compliance, multi-tenancy and a dedicated support offering.
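The idea behind virtual IP routing can be shown with a toy model. This is not how DC/OS implements it; the `VirtualIP` class, the addresses and the round-robin policy are all assumptions chosen to keep the sketch short. It only demonstrates the behaviour described above: clients use one stable address, and traffic is spread over the instances while skipping any that have failed.

```python
import itertools


class VirtualIP:
    """Toy model of one virtual address fronting several task instances."""

    def __init__(self, address, backends):
        self.address = address
        self.backends = list(backends)   # "host:port" of each instance
        self.healthy = set(self.backends)
        self._cycle = itertools.cycle(self.backends)

    def mark_failed(self, backend):
        """Record that an instance is down so it receives no more traffic."""
        self.healthy.discard(backend)

    def route(self):
        """Return the next healthy backend in round-robin order."""
        if not self.healthy:
            raise RuntimeError(f"no healthy backend behind {self.address}")
        for backend in self._cycle:
            if backend in self.healthy:
                return backend


# Hypothetical addresses: one virtual endpoint, three instances on three nodes.
vip = VirtualIP("10.0.0.100:80", ["node-1:31001", "node-2:31002", "node-3:31003"])
before = [vip.route() for _ in range(3)]

vip.mark_failed("node-2:31002")
after = [vip.route() for _ in range(4)]
print(before)
print(after)
```

After `node-2` is marked as failed, requests to the same virtual address continue to succeed and simply alternate between the two remaining instances.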

To add another layer of complexity, and possibly confusion, it is now possible to run Kubernetes on DC/OS alongside Marathon. This is currently available as a beta release.

[Figure: DC/OS stack]

Orchestrator Comparison

The combination of Mesos, Marathon and DC/OS is considerably more complex than the other container orchestration platforms, Kubernetes and Docker Swarm. What does all this extra complexity give you in return? The primary reason to choose the Mesos, Marathon and DC/OS stack is its ability to scale out to very large numbers of nodes: simulations have shown the capacity to scale to 50,000 nodes, whereas 1,000-node clusters are more typical for Kubernetes and Docker Swarm. The additional features available in DC/OS Enterprise, as mentioned earlier, are also useful for large organisations wanting to run a central platform.

Challenges with DC/OS

The complete DC/OS Mesos stack is itself a complex arrangement of software components providing a high degree of abstraction and automation. All this functionality makes managing web-scale microservice applications much easier, providing considerable savings in time to market, human resources and computing costs. However, when something is not working as it should, troubleshooting is much more difficult because there are so many moving parts.

[Figure: DC/OS Enterprise components]

The built-in dashboard of DC/OS provides a good overview of resource utilisation and service state but does not provide enough detail for effective troubleshooting. The command line interface provides more detailed information, but not in an easily consumable format.

[Figure: DC/OS dashboard. The built-in dashboard does not provide enough detail for troubleshooting.]

Running a DC/OS system efficiently requires a monitoring system that understands and monitors both the DC/OS Mesos stack and the applications running on top of it. In such large and complex systems, monitoring has to be more than just collecting events, time series data and request traces; the amount of data is too vast for the human mind to sift and correlate. The monitoring system must be able to both collect the data and make sense of it, producing actionable information as an output. To achieve this, the monitoring system needs to combine several Artificial Intelligence techniques: curated knowledge and machine learning with high-resolution training data. All of this needs to be built on a flexible data model to cope with the myriad of technologies and application topologies, both current and future.

We’ll write about monitoring containerised applications running in DC/OS orchestration in the future.