We are hardcore Kubernetes users and contributors. We love the automation it provides. However, as our team grew and added more clusters and microservices, capacity and resource management became a massive pain for us. We started suffering from a lot of outages and unexpected behavior as we promoted our code from dev to production environments. Luckily, we were working on our AI-powered tools to understand different dependencies, predict usage, and calculate the right resources and configurations that should be applied to our infrastructure and microservices. We dogfooded our agent (http://github.com/magalixcorp/magalix-agent) and were able to stabilize, as the #autopilot continuously corrected any miscalculations we made and adapted to unexpected changes in workloads. We are open sourcing our agent in a few days. Check it out and let us know what you think! We run workloads on Microsoft Azure, Google Kubernetes Engine, and Amazon EC2, and we're all about Go and Python!
Any advice for a k8s late-adopter to avoid the problems you had? Are we talking sheer dollars and cents of nodes, or something deeper?
I'm the co-founder of Magalix and lived that experience as well :)
I'd recommend the following:
- Budget your CPU/memory pretty much as you budget your money. It is important to keep your team accountable for the resources it uses. In k8s you enforce this by setting a quota for each namespace. Track it with something as simple as an Excel sheet mapping resources to containers.
- Make sure you compare the actual CPU and memory utilization of your pods against what you reserved for them. If one of your team members requested, for example, 2GB of memory for a pod, check whether the pod is actually using that memory. If it is not, the unused part stays reserved and K8s won't be able to allocate it to other pods that need it.
- Set limits on CPU and memory wherever possible. This protects your VMs and cluster from being hijacked by any one of your containers, which would otherwise impact other well-behaving microservices.
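
To make the first two points concrete, here is a minimal Python sketch of the kind of check the spreadsheet would do. Everything in it is an assumption for illustration: the `PodUsage` record, the namespace names, the per-namespace memory budgets, and the 50% threshold are all made up, and the usage numbers are typed in by hand (in practice you would pull them from something like `kubectl top pods` or your monitoring system). It is not how the Magalix agent works.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    """Hypothetical record: one row of the budgeting spreadsheet."""
    namespace: str
    name: str
    mem_request_mib: int  # memory the pod reserved (its request)
    mem_used_mib: int     # memory the pod was observed using


def over_provisioned(pods, threshold=0.5):
    """Return pods using less than `threshold` of the memory they requested."""
    return [
        p for p in pods
        if p.mem_request_mib > 0
        and p.mem_used_mib / p.mem_request_mib < threshold
    ]


def namespace_overruns(pods, budgets_mib):
    """Return {namespace: MiB over budget} for namespaces whose summed
    requests exceed the budget assigned to them."""
    totals = {}
    for p in pods:
        totals[p.namespace] = totals.get(p.namespace, 0) + p.mem_request_mib
    return {
        ns: total - budgets_mib[ns]
        for ns, total in totals.items()
        if ns in budgets_mib and total > budgets_mib[ns]
    }


# Made-up sample data, not real measurements.
pods = [
    PodUsage("payments", "api", 2048, 400),     # asked for 2GB, uses ~400MiB
    PodUsage("payments", "worker", 1024, 900),
    PodUsage("search", "indexer", 512, 450),
]
budgets_mib = {"payments": 2560, "search": 1024}

wasteful = [p.name for p in over_provisioned(pods)]
overruns = namespace_overruns(pods, budgets_mib)

print("Over-provisioned pods:", wasteful)
print("Namespaces over budget (MiB):", overruns)
```

With this sample data, the `api` pod is flagged because it uses well under half of its 2GB request, and the `payments` namespace is reported as over its budget. The same comparison works for CPU if you add the matching fields.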