KubeCon 2019 Takeaway: We Live in a Multi-Cluster and Multi-Distro World

This blog was co-authored by Rupinder (Robbie) Gill and Haseeb Budhani.

As more enterprise users deploy Kubernetes as their preferred container orchestrator, the following trend has become more widespread:

Development teams that are early in their Kubernetes journey build out larger clusters and use Kubernetes namespaces to implement multi-tenancy.

This seems like a logical choice, given the namespace concept is designed to do exactly this. But in practice, teams that have been at it for some time and have experienced multiple Kubernetes version upgrades tend to spin up many, smaller clusters and choose to group fewer services into the same cluster.

Why the difference in opinion? Experience.

The following are the technical reasons why these experienced teams are choosing to go with the many, smaller clusters approach:

Blast radius: Every time Kubernetes or a supporting component (e.g. service mesh or metrics collection packages) is upgraded, each service may need to be updated to work with the new version. Someone needs to make sure that all services in a given cluster are ready to work with new APIs if older APIs have been deprecated in the new version. This type of upgrade can impact schedules across multiple teams. Best to let teams run smaller clusters where the impact can be broken up into smaller morsels.
Security requirements: A set of services may have unique hardening and data retention requirements, and it may make sense to deploy these services in a hardened cluster with stringent auditing, auth and logging policies. But doing this across the board may lead to unnecessary slowdowns and overhead.
Scaling requirements: If a few services have massive scaling requirements, it may be best to deploy them into dedicated clusters to protect against other services experiencing “pod pending” events due to busier services taking up an inordinate percentage of available resources.
Integration requirements: Some services may need a special admission controller, high-speed storage, and so on, while others may not. Such special integration requirements may also apply to service meshes and key-management services. Services that need such integrations may be best grouped together in clusters that are pre-configured with required packages or the right storage class, while other services can be deployed on clusters running vanilla Kubernetes.
Custom enhancements: Some services may lead the DevOps team to develop enhancements to Kubernetes. To protect against unforeseen side effects (bugs) from such enhancements, services that need these enhancements can be deployed on customized clusters, while other services can be deployed on clusters running vanilla Kubernetes.
Network load requirements: Services that are expected to drive high network load (by way of Kubernetes API calls) are best deployed on separate clusters to protect other services that may get starved otherwise.
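As a rough illustration of the blast-radius point, the following Python sketch checks which services in a cluster still use API versions that a Kubernetes upgrade stops serving. The lookup table covers a few API versions removed in Kubernetes 1.16 and is intentionally incomplete. The service names, the manifest dictionaries, and the find_blocked_services helper are hypothetical, so treat this as a sketch under those assumptions rather than a complete upgrade check.

```python
# Illustrative subset of API versions that Kubernetes 1.16 stopped serving,
# mapped to their replacements. Deliberately not exhaustive.
REMOVED_IN_1_16 = {
    ("extensions/v1beta1", "Deployment"): "apps/v1",
    ("apps/v1beta1", "Deployment"): "apps/v1",
    ("apps/v1beta2", "Deployment"): "apps/v1",
    ("extensions/v1beta1", "DaemonSet"): "apps/v1",
    ("extensions/v1beta1", "NetworkPolicy"): "networking.k8s.io/v1",
}


def find_blocked_services(manifests_by_service):
    """Return {service: [(kind, old_api, new_api), ...]} for every service
    that still uses an API version listed in REMOVED_IN_1_16."""
    blocked = {}
    for service, manifests in manifests_by_service.items():
        for manifest in manifests:
            key = (manifest.get("apiVersion"), manifest.get("kind"))
            if key in REMOVED_IN_1_16:
                blocked.setdefault(service, []).append(
                    (key[1], key[0], REMOVED_IN_1_16[key])
                )
    return blocked


# Hypothetical cluster contents: manifests already parsed into dicts.
cluster = {
    "checkout": [{"apiVersion": "apps/v1", "kind": "Deployment"}],
    "billing": [{"apiVersion": "extensions/v1beta1", "kind": "Deployment"}],
    "search": [{"apiVersion": "extensions/v1beta1", "kind": "NetworkPolicy"}],
}

blocked = find_blocked_services(cluster)
for service, findings in sorted(blocked.items()):
    for kind, old_api, new_api in findings:
        print(f"{service}: {kind} uses {old_api}, move to {new_api}")
```

In this made-up cluster, two of the three services would have to change before the upgrade could proceed. The more services share a cluster, the more teams end up on that list at the same time.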

There are also valid business reasons for running many, smaller clusters:

Compliance: If end users are distributed globally, it's better to deploy clusters in target geographies to comply with data sovereignty or other regional regulations instead of implementing complex data management strategies centrally.
Hybrid or multi-cloud strategies: Many companies need to manage a mix of environments for a variety of reasons, ranging from pre-existing assets (colo contracts and servers) and M&A activity to demanding customers (“I don’t do business with vendors that run their apps on AWS.”).
Performance: If the end user population is spread across geographies and the application is designed appropriately, it may make sense to deploy the web and application tiers in multiple regions, i.e. across multiple clusters. And if you’re thinking of running a cross-region cluster, be ready to address etcd sync issues and cross-pod traffic management across the WAN.

The fact that companies such as VMware (see Tanzu Mission Control) and Microsoft (see Azure Arc) recently announced tech previews of products to help companies manage clusters across hybrid environments suggests they also recognize this trend.

Your peers are not only running multiple clusters but are also leveraging more than one Kubernetes distribution: a high percentage of teams run two or more distributions across their public cloud and on-premises footprints. The common belief is that cloud providers will do their best to optimize their Kubernetes offerings for their infrastructure, so it’s best to use EKS in AWS, GKE in GCP, AKS in Azure, and OpenShift or PKS on premises.

At Rafay, we follow the same methodology: We use the resident managed Kubernetes service in our cloud provider of choice instead of spinning up, for example, v1.16.1 ourselves on virtual machines. And - of course - we run many, small clusters.

There are many hurdles to cross in keeping modern applications operational, and if the public cloud (or VMware and Red Hat) is able to take away the pain of keeping the Kubernetes control plane up and running, why would anyone not leverage that? What’s more, the cost of running Kubernetes in public clouds is fast approaching zero. You pay for the worker node VMs, and the master node costs are a rounding error or downright free.

Operating services across multiple clusters and multiple distros simplifies the development process. Ongoing complexity is reduced because developers no longer have to add complex logic in each service to address environmental characteristics such as service meshes, storage classes and admission controllers. The complexity moves to operations instead: with a growing cluster fleet that may span multiple clouds and data centers and leverage multiple Kubernetes distributions, SRE/Ops teams need tooling to manage that fleet. SRE/Ops teams must now solve for:

Complete visibility and governance across the company’s fleet of Kubernetes clusters, regardless of distribution. They must be able to quickly figure out where a given service is running at present, which app is experiencing restarts in a given cluster, which apps have been upgraded across the fleet in the last month, and much more.
On-demand cluster bring-up and customization in any cloud environment or on premises. In addition to simply bringing up EKS, GKE, etc., SRE/Ops are responsible for ensuring that a given cluster is customized appropriately and conforms to the business’s security/compliance requirements.
Continuous deployment capabilities across the entire cluster fleet without requiring multiple deployment tools. SRE/Ops need access to deployment tools that work across any cluster type, in any environment, and do not require scripting/coding investments that traditionally fall on DevOps teams.
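To make the visibility requirement concrete, here is a minimal Python sketch of the kind of fleet-wide questions listed above. The inventory, cluster names, restart counts, and helper functions are all hypothetical. In practice such records would be collected from each cluster's API, which this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    cluster: str   # cluster name (hypothetical)
    distro: str    # Kubernetes distribution the cluster runs
    service: str   # service deployed on that cluster
    restarts: int  # recent container restarts observed for the service


# Hypothetical fleet inventory, hard-coded for illustration only.
FLEET = [
    Workload("prod-us-east", "EKS", "checkout", 0),
    Workload("prod-eu-west", "GKE", "checkout", 7),
    Workload("prod-eu-west", "GKE", "search", 0),
    Workload("onprem-dc1", "OpenShift", "billing", 2),
]


def clusters_running(fleet, service):
    """Where is a given service running at present?"""
    return sorted({w.cluster for w in fleet if w.service == service})


def restarting_services(fleet, cluster, threshold=1):
    """Which services are experiencing restarts in a given cluster?"""
    return sorted(
        {w.service for w in fleet
         if w.cluster == cluster and w.restarts >= threshold}
    )


print(clusters_running(FLEET, "checkout"))
print(restarting_services(FLEET, "prod-eu-west"))
```

The queries themselves are trivial. The hard part, and the reason tooling is needed, is keeping an inventory like this accurate across distributions and environments.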

We live in a multi-distro, multi-cluster world. At Rafay, we run our SaaS controller as a cloud-native service that leverages a variety of open source and home-grown components - we are living the gospel we preach. We operate many, small clusters. And so should you, if you aren’t doing it already.

But this is an issue that cloud providers don’t have an incentive to solve for you. When engaging with a vendor focused on helping your SRE/Ops teams address the complexity of managing multiple clusters running a variety of distros, be sure to ask for their plan for the three requirements above. Rafay delivers these capabilities today and can simplify ongoing operations for Kubernetes environments as a service.

Want to understand how Rafay can help you operate a fleet of clusters across any environment? Get in touch.