Scaling Kubernetes with Assurance at Pinterest

845
Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

By Anson Qian | Software Engineer, Cloud Runtime


Introduction

It has been more than a year since we shared our Kubernetes Journey at Pinterest. Since then, we have delivered many features to facilitate customer adoption, ensure reliability and scalability, and build up operational experience and best practices.

In general, Kubernetes platform users gave positive feedback. Based on our user survey, the top three benefits shared by our users are reducing the burden of managing compute resources, better resource and failure isolation, and more flexible capacity management.

By the end of 2020, we orchestrated 35K+ pods with 2500+ nodes in our Kubernetes clusters — supporting a wide range of Pinterest businesses — and the organic growth is still rocket high.

2020 in a Short Story

As user adoption grows, the variety and number of workloads increases. It requires the Kubernetes platform to be more scalable in order to catch up with the increasing load from workload management, pods scheduling and placement, and node allocation and deallocation. As more business critical workloads onboard the Kubernetes platform, the expectations on platform reliability naturally rise to a new level.

Platform-wide outage did happen. In early 2020, one of our clusters experienced a sudden spike of pods creation (~3x above planned capacity), causing the cluster autocalor to bring up 900 nodes to accommodate the demand. The kube-apiserver started to first experience latency spikes and increased error rate, and then get Out of Memory (OOM) killed due to resource limit. The unbound retry from Kubelets resulted in a 7x jump on kube-apiserver load. The burst of writes caused etcd to reach its total data size limit and start rejecting all write requests, and the platform lost availability in terms of workload management. In order to mitigate the incident, we had to perform etcd operations like compacting old revisions, defragmenting excessive spaces, and disabling alarms to recover it. In addition, we had to temporarily scale up Kubernetes master nodes that host kube-apiserver and etcd to reduce resource constraint.

Figure 1: Kubernetes API Server Latency Spikes

Later in 2020, one of the infra components had a bug in kube-apiserver integration that generated a spike of expensive queries (listing all pods and nodes) to kube-apiserver. This caused the Kubernetes master node resource usage spikes, and kube-apiserver entered OOMKilled status. Luckily the problematic component was discovered and rolled back shortly afterwards. But during the incident, the platform performance suffered from degrationation, including delayed workload execution and stale status serving.

Figure 2: Kubernetes API Server OOMKilled

Getting Ready for Scale

We continue to reflect on our platform governance, resilience, and operability throughout our journey, especially when incidents happen and hit hard on our weakest spots. With a nimble team of limited engineering resources, we had to dig deep to find out root causes, identify low hanging fruits, and prioritize solutions based on return vs. cost. Our strategy for dealing with the complex Kubernetes ecosystem is to try our best to minimize divergence from what’s provided by the community and contribute back to the community, but never rule out the option of writing our own in house components.

Figure 3: Pinterest Kubernetes Platform Architecture (blue is in-house, green is open source)

Governance

Resource Quota Enforcement

Kubernetes already provides resource quotas management to ensure no namespace can request or occupy unbounded resources in most dimensions: pods, cpu, memory, etc. As our previous incident mentioned, a surge of pod creation in a single namespace could overload kube-apiserver and cause cascading failure. It is key to have resource usage bounded in every namespace in order to ensure stability.

One challenge we faced is that enforcing resource quota in every namespace implicitly requires all pods and containers to have resource requests and limits specified. In Pinterest Kubernetes platform, workloads in different namespaces are owned by different teams for different projects, and platform users configure their workload via Pinterest CRD. We achieved that by adding default resource requests and limits for all pods and containers in the CRD transformation layer. In addition, we also rejected any pod specification without resource requests and limits in the CRD validation layer.

Another challenge we overcame was to streamline quota management across teams and organizations. To safely enable resource quota enforcement, we look at historical resource usage, add 20% headroom on top of peak value, and set it as the initial value for resource quota for every project. We created a cron job to monitor quota usage and send business hour alerts to project owning teams if their project usage is approaching a certain limit. This encourages project owners to do a better job of capacity planning and request a resource quota change. The resource quota change gets manually reviewed and automatically deployed after sign-off.

Client Access Enforcement

We enforce all KubeAPI clients to follow the best practices Kubernetes already provides:

Controller Framework

Controller framework provides a shareable cache for optimizing read operations, which leverages informer-reflector-cache architecture. Informers are set up to list and watch objects of interest from the kube-apiserver. Reflector reflects object changes to the underlying Cache and propagates out watched events to event handlers. Multiple components inside the same controller can register event handlers for OnCreate, OnUpdate, and OnDelete events from Informers and fetch objects from Cache instead of Kube-apiserver directly. Therefore, it reduces the chance of making unnecessary and redundant calls.

Figure 4: Kubernetes Controller Framework

Rate Limiting

Kubernetes API clients are usually shared among different controllers, and API calls are made from different threads. Kubernetes ships its API client along with a token bucket rate limiter that supports configurable QPS and bursts. API calls that burst beyond threshold will be throttled so that a single controller will not jam the kube-apiserver bandwidth.

Shared Cache

In addition to the kube-apiserver built-in cache that comes with the controller framework, we added another informer based write through cache layer in the platform API. This is to prevent unnecessary read calls hard hitting the kube-apiserver. The server side cache reuse also avoided thick clients in application code.

For kube-apiserver access from applications, we enforce all requests to go through the platform API to leverage shared care and assign security identity for access control and flow control. For kube-apiserver access from workload controllers, we enforce that all controllers implement based on control framework with rate limiting.

Resilience

Hardening Kubelet

One key reason why Kubernetes’ control plane entered cascading failure is that the legacy reflector implementation had unbounded retry when handling errors. Such imperfections can be exaggerated, especially when the API server is OOMKilled, which can easily cause a synchronization of reflectors across the cluster.

To resolve this issue, we worked very closely with the community by reporting issues, discussing solutions, and finally getting PRs (1, 2) reviewed and merged. The idea is to add exponential backoff with jitter reflector’s ListWatch retry logic, so the kubelet and other controllers will not try to hammer the kube-apiserver upon kube-apiserver overload and request failures. This resilience improvement is useful in general, but we found it critical on the kubelet side as the number of nodes and pods increases in the Kubernetes cluster.

Tuning Concurrent Requests

The more nodes we manage, the faster workloads are created and destroyed, and the larger the API call QPS server needs to handle. We first increased the maximum concurrent API call settings for both mutating and non-mutating operations based on estimated workloads. These two settings will enforce that the amount of API calls processed doesn’t exceed the configured number and therefore keeps CPU and memory consumption of kube-apiserver at a certain threshold.

Inside Kubernetes’s chain of API request handling, every request will pass a group of filters as the very first step. The filter chain is where max inflight API calls are enforced. For API calls burst to more than the configured threshold, a ‘too many requests” (429) response will be returned to clients to trigger proper retries. As future work, we plan to investigate more on EventRateLimit features with more fine-grained admission control and provide better quality of services.

Caching More Histories

Watch cache is a mechanism inside kube-apiserver that caches past events of each type of resource in a ring buffer in order to serve watch calls from a particular version with best effort. The larger the caches are, the more events can be retained in the server and are more likely to seamlessly serve event streams to clients in case of connection broken. Given this fact, we also improved the target RAM size of kube-apiserver, which internally is finally transferred to the watch cache capacity based on heuristics for serving more robust event streams. Kube-apiserver provides more detailed ways to configure fine grained watch cache size, which can be further leveraged for specific caching requirements.

Figure 5: Kubernetes Watch Cache

Operability

Observability

Aiming to reduce incident detection and mitigation time, we devote efforts continuously to improve observability of Kubernetes control planes. The challenge is to balance failure coverage and signal sensitivity. For existing Kubernetes metrics, we triage and pick important ones to monitor and/or alert so we can more proactively identify issues. In addition, we instrument kube-apiserver to cover more detailed areas in order to quickly narrow down the root cause. Finally, we tune alert statistics and thresholds to reduce noise and false alarms.

At a high level, we monitor kube-apiserver load by looking at QPS and concurrent requests, error rate, and request latency. We can breakdown the traffic by resource types, request verbs, and associated service accounts. For expensive traffic like listing, we also measure request payload by object counts and bytes size, since they can easily overload kube-apiserver even with small QPS. Lastly we monitor etcd watch events processing QPS and delayed processing count as important server performance indicators.

Figure 6: Kubernetes API calls by type

Debuggability

In order to better understand the Kubernetes control plane performance and resource consumption, we also built etcd data storage analysis tool using boltdb library and flamegraph to visualize data storage breakdown. The results of data storage analysis provide insights for platform users to optimize usage.

Figure 7: Etcd Data Usage Per Key Space

In addition, we enabled golang profiling pprof and visualized heap memory footprint. We were able to quickly identify the most resource intensive code paths and request patterns, e.g. transforming response objects upon list resource calls. Another big caveat we found as part of kube-apiserver OOM investigation is that page cache used by kube-apiserver is counted towards a cgroup’s memory limit, and anonymous memory usage can steal page cache usage for the same cgroup. So even if kube-apiserver only has 20GB heap memory usage, the entire cgroup can see 200GB memory usage hitting the limit. While the current kernel default setting is not to proactively reclaim assigned pages for efficient re-use, we are currently looking at setup monitoring based on memory.stat file and force cgroup to reclaim as many pages reclaimed as possible if memory usage is approaching limit.

Figure 8: Kubernetes API Server Memory Profiling

Conclusion

With our governance, resilience, and operability efforts, we are able to significantly reduce sudden usage surges of compute resources, control plane bandwidth, and ensure the stability and performance of the whole platform. The kube-apiserver QPS (mostly read) is reduced by 90% after optimization rollout (as graph shown below), which makes kube-apiserver usage more stable, efficient, and robust. The deep knowledge of Kubernetes’ internals and additional insights we gained will enable the team to do a better job of system operation and cluster maintenance.

Figure 9: Kube-apiserver QPS Reduction After Optimization Rollout

Here are some key takeaways that can hopefully help your next journey of solving Kubernetes scalability and reliability problem:

  1. Diagnose problems to get at their root causes. Focus on the “what is” before deciding “what to do about it.” The first step of solving problems is to understand what the bottleneck is and why. If you get to the root cause, you are halfway to the solution.
  2. It is almost always worthwhile to first look into small incremental improvements rather than immediately commit to radical architecture change. This is important, especially when you have a nimble team.
  3. Make data-driven decisions when you plan or prioritize the investigation and fixes. The right telemetry can help make better decisions on what to focus and optimize first.
  4. Critical infrastructure components should be designed with resilience in mind. Distributed systems are subject to failures, and it is best to always prepare for the worst. Correct guardrails can help prevent cascading failures and minimize the blast radius.

Looking Forward

Federation

As our scale grows steadily, single cluster architecture has become insufficient in supporting the increasing amount of workloads that try to onboard. After ensuring an efficient and robust single cluster environment, enabling our compute platform to scale horizontally is our next milestone moving forward. By leveraging a federation framework, we aim at plugging new clusters into the environment with minimum operation overhead while keeping the planform interface steady to end users. Our federated cluster environment is currently under development, and we look forward to the additional possibilities it opens up once productized.

Capacity Planning

Our current approach of resource quota enforcement is a simplified and reactive way of capacity planning. As we onboard user workloads and system components, the platform dynamics change and project level or cluster wide capacity limit could be out of date. We want to explore proactive capacity planning with forecasting based on historical data, growth trajectory, and a sophisticated capacity model that can cover not only resource quota but also API quota. We expect more proactive and accurate capacity planning can prevent the platform from over-committing and under-delivering.

Acknowledgements

Many engineers at Pinterest helped scale the Kubernetes platform to catch up with business growth. Besides the Cloud Runtime team — June Liu, Harry Zhang, Suli Xu, Ming Zong, and Quentin Miao who worked hard to achieve the scalable and stable compute platform as we have for today, Balaji Narayanan, Roberto Alcala and Rodrigo Menezes who lead our Site Reliability Engineering (SRE) effort, have worked together on ensuring the solid foundation of the compute platform. Kalim Moghul and Ryan Albrecht who lead the Capacity Engineering effort, have contributed to the project identity management and system level profiling. Cedric Staub and Jeremy Krach, who lead the Security Engineering effort, have maintained a high standard such that our workloads can run securely in a multi-tenanted platform. Lastly, our platform users Dinghang Yu, Karthik Anantha Padmanabhan, Petro Saviuk, Michael Benedict, Jasmine Qin, and many others, provided a lot of useful feedback, requirements, and worked with us to make the sustainable business growth happen.

Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.
Tools mentioned in article
Open jobs at Pinterest
Backend Engineer, Measurement User Match
Seattle, WA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Our mission is to help advertisers gain a deep understanding of their ad performance and generate helpful insights so they can make good decisions about their ad campaigns. You’d design and build systems and services to help advertisers learn more about conversions, viewability, brand lift, sales lift, offline conversions, etc. We’re building end-to-end Big Data distributed systems using a board mix of leading open source and Cloud technologies and integrating with 3rd party tools that Advertisers already trust.

What you’ll do:

  • Increase visibility and scale of conversion capture to power our measurement, targeting, and auction products
  • Create cutting edge technical solutions to match conversion events to Pinners
  • Design and build conversion tags, APIs, and data processing algorithms around tracking and reporting against conversions

What we’re looking for:

  • 3+ years of software engineering experience
  • Experiences in developing backend large scale distributed services and data processing workflows in Java and Python

#LI-GK1

Engineering Manager, Shopping Content...
Toronto, ON, CA

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Pinterest is aiming to build a world-class shopping experience for our users, and has a unique advantage to succeed due to the high shopping intent of Pinners. The new Shopping Content Mining team being founded in Toronto plays a critical role in this journey. This team is responsible for building a brand new platform for mining and understanding product data, including extracting high quality product attributes from web pages and free texts that come from all major retailers across the world, mining product reviews and product relationships, product classification, etc. The rich product data generated by this platform is the foundation of the unified product catalog, which powers all shopping experiences at Pinterest (e.g., product search & recommendations, product detail page, shop the look, shopping ads).

There are unique technical challenges for this team: building large scale systems that can process billions of products, Machine Learning models that require few training examples to generate wrappers for web pages, NLP models that can extract information from free-texts, easy-to-use human labelling tools that generate high quality labeled data.Your work will have a huge impact on improving the shopping experience of 400M+ Pinners and driving revenue growth for Pinterest.

What you’ll do:

  • As the Engineering Manager, you’ll be responsible for:
    • Growing this team further in Toronto
    • Driving execution and deliver impact
    • Setting long term technical visions for this area
  • Work with tech leads to provide technical guidance on:
    • Large scale systems that can process billions of products
    • ML models for wrapper induction that require few training examples, NLP models for understanding free-texts
  • Drive cross functional collaborations with partner teams working on shopping

What we’re looking for:

  • 7+ years of industry experience, including 2+ years of management experience
  • Experience on large scale machine learning systems (full ML stack from modelling to deployment at scale.)
  • Experience with big data technologies (e.g., Hadoop/Spark) and scalable realtime systems that process stream data

Nice to have:

  • PhD in Machine Learning or related areas, publication on top ML conferences
  • Familiarity with information extraction techniques for web-pages and free-texts.
  • Experience working with shopping data is a plus.
  • Experience building internal tools for labeling / diagnosing.

#LI-EA1

Staff Machine Learning Software Engin...
Toronto, ON, CA

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Shopping is at the core of Pinterest’s mission to help people create a life they love. The shopping discovery team at Pinterest is inventing a brand new, more visual and personalized shopping experience for 350M+ users worldwide. The team is responsible for delivering mid-funnel shopping experience on shopping surfaces like Product Detail Page, Shopping Search, Shopping on Board etc. As an engineer of the team you will be working on the most cutting edge recommendation algorithms to develop diverse types of shopping recommendations that will be displayed across different shopping surfaces on Pinterest. 

You’ll also be responsible for optimizing the whole page layout by appropriately selecting and slotting the UI templates and recommendation modules optimizing towards a shopping metric. As an engineer of the team you’ll be running experiments and directly improving the shopping metrics contributing to the bottom line of the company.

If you are excited about large scale machine learning problems in the area of recommendation, search and whole page optimization then you must consider this role

What you'll do: 

  • Develop large scale shopping recommendation algorithms
  • Build data pipelines to do data analysis and collect training data
  • Train deep learning models to improve quality and engagement of shopping recommenders
  • Work on backend and infrastructure to build, deploy and serve machine learning models
  • Develop algorithms to optimize the whole page layout of the shopping surfaces
  • Drive the roadmap for next generation of shopping recommenders

What we're looking for: 

  • 6+ years working experience in the area of applied Machine Learning
  • Interest or experience working on a large-scale search, recommendation and ranking problems
  • Interest and experience in doing full stack ML, including backend and ML infrastructure
  • Experience is any of the following areas
    • Developing large scale recommender systems
    • Contextual bandit algorithms
    • Reinforcement learning

#LI-JY1

Senior Machine Learning Engineer, Sho...
Toronto, ON, CA

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Pinterest is aiming to build a world-class shopping experience for our users, and has a unique advantage to succeed due to the high shopping intent of Pinners. The new Shopping Content Mining team being founded in Toronto plays a critical role in this journey. This team is responsible for building a brand new platform for mining and understanding product data, including extracting high quality product attributes from web pages and free texts that come from all major retailers across the world, mining product reviews and product relationships, product classification, etc. The rich product data generated by this platform is the foundation of the unified product catalog, which powers all shopping experiences at Pinterest (e.g., product search & recommendations, product detail page, shop the look, shopping ads).

There are unique technical challenges for this team: building large scale systems that can process billions of products, Machine Learning models that require few training examples to generate wrappers for web pages, NLP models that can extract information from free-texts, easy-to-use human labelling tools that generate high quality labeled data. Your work will have a huge impact on improving the shopping experience of 400M+ Pinners and driving revenue growth for Pinterest.

What you’ll do:

  • As a ML engineer, you will design and build large scale ML systems that can process billions of products
  • ML models for wrapper induction that require few training examples, NLP models for understanding free-texts
  • Drive cross functional collaborations with partner teams working on shopping

What we’re looking for:

  • 3+ years of industry experience
  • Hands-on experience on large scale machine learning systems (full ML stack from modelling to deployment at scale.)
  • Hands-on experience with big data technologies (e.g., Hadoop/Spark) and scalable realtime systems that process stream data
  • Nice to have: PhD in Machine Learning or related areas, publication on top ML conferences, Familiarity with information extraction techniques for web-pages and free-texts, Experience working with shopping data is a plus

#LI-EA1

Verified by
Security Software Engineer
Tech Lead, Big Data Platform
Software Engineer
Talent Brand Manager
Sourcer
Software Engineer
You may also like