Scaling Kubernetes with Assurance at Pinterest

Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

By Anson Qian | Software Engineer, Cloud Runtime


Introduction

It has been more than a year since we shared our Kubernetes Journey at Pinterest. Since then, we have delivered many features to facilitate customer adoption, ensure reliability and scalability, and build up operational experience and best practices.

In general, Kubernetes platform users gave positive feedback. Based on our user survey, the top three benefits shared by our users are reducing the burden of managing compute resources, better resource and failure isolation, and more flexible capacity management.

By the end of 2020, we orchestrated 35K+ pods across 2,500+ nodes in our Kubernetes clusters, supporting a wide range of Pinterest businesses, and organic growth remains rapid.

2020 in a Short Story

As user adoption grows, the variety and number of workloads increase. This requires the Kubernetes platform to be more scalable in order to keep up with the increasing load from workload management, pod scheduling and placement, and node allocation and deallocation. As more business-critical workloads onboard onto the Kubernetes platform, expectations of platform reliability naturally rise to a new level.

Platform-wide outages did happen. In early 2020, one of our clusters experienced a sudden spike in pod creation (~3x above planned capacity), causing the cluster autoscaler to bring up 900 nodes to accommodate the demand. The kube-apiserver first experienced latency spikes and an increased error rate, and was then Out of Memory (OOM) killed due to its resource limit. Unbounded retries from kubelets resulted in a 7x jump in kube-apiserver load. The burst of writes caused etcd to reach its total data size limit and start rejecting all write requests, and the platform lost availability in terms of workload management. To mitigate the incident, we had to perform etcd operations such as compacting old revisions, defragmenting excessive space, and disabling alarms to recover it. In addition, we had to temporarily scale up the Kubernetes master nodes that host kube-apiserver and etcd to relieve resource constraints.

Figure 1: Kubernetes API Server Latency Spikes

Later in 2020, one of the infra components had a bug in its kube-apiserver integration that generated a spike of expensive queries (listing all pods and nodes) to kube-apiserver. This caused resource usage on the Kubernetes master nodes to spike, and kube-apiserver was OOMKilled. Luckily, the problematic component was discovered and rolled back shortly afterwards. During the incident, however, platform performance degraded, including delayed workload execution and stale status serving.

Figure 2: Kubernetes API Server OOMKilled

Getting Ready for Scale

We continue to reflect on our platform governance, resilience, and operability throughout our journey, especially when incidents happen and hit our weakest spots hard. As a nimble team with limited engineering resources, we had to dig deep to find root causes, identify low-hanging fruit, and prioritize solutions based on return vs. cost. Our strategy for dealing with the complex Kubernetes ecosystem is to minimize divergence from what the community provides and to contribute back to the community, while never ruling out the option of writing our own in-house components.

Figure 3: Pinterest Kubernetes Platform Architecture (blue is in-house, green is open source)

Governance

Resource Quota Enforcement

Kubernetes already provides resource quota management to ensure no namespace can request or occupy unbounded resources in most dimensions: pods, CPU, memory, etc. As the earlier incident showed, a surge of pod creation in a single namespace can overload kube-apiserver and cause cascading failure. Bounding resource usage in every namespace is key to ensuring stability.

One challenge we faced is that enforcing resource quota in every namespace implicitly requires all pods and containers to have resource requests and limits specified. In the Pinterest Kubernetes platform, workloads in different namespaces are owned by different teams for different projects, and platform users configure their workloads via Pinterest CRDs. To meet this requirement, we added default resource requests and limits for all pods and containers in the CRD transformation layer. In addition, we rejected any pod specification without resource requests and limits in the CRD validation layer.
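The defaulting and validation steps described above could look roughly like the following. This is a minimal sketch, not Pinterest's CRD layer: pod specs are modeled as plain dictionaries, and `DEFAULT_RESOURCES`, `apply_default_resources`, and `validate_resources` are hypothetical names and values chosen for illustration.

```python
import copy

# Hypothetical platform defaults; real values would be tuned per workload type.
DEFAULT_RESOURCES = {
    "requests": {"cpu": "500m", "memory": "512Mi"},
    "limits": {"cpu": "1", "memory": "1Gi"},
}


def apply_default_resources(pod_spec):
    """Transformation step: fill in missing requests/limits on every container.

    Values the user already set are kept; the input spec is not modified.
    """
    spec = copy.deepcopy(pod_spec)
    for container in spec.get("containers", []):
        resources = container.setdefault("resources", {})
        for kind, defaults in DEFAULT_RESOURCES.items():
            merged = dict(defaults)
            merged.update(resources.get(kind) or {})
            resources[kind] = merged
    return spec


def validate_resources(pod_spec):
    """Validation step: list every missing cpu/memory request or limit."""
    errors = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        name = container.get("name", "<unnamed>")
        for kind in ("requests", "limits"):
            values = resources.get(kind) or {}
            for dimension in ("cpu", "memory"):
                if dimension not in values:
                    errors.append(f"{name}: missing {kind}.{dimension}")
    return errors
```

Running validation after transformation means a spec is only rejected if something bypassed the defaulting step, which keeps the user-facing configuration small while still guaranteeing that quota accounting has a value for every container.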

Another challenge we overcame was streamlining quota management across teams and organizations. To safely enable resource quota enforcement, we looked at historical resource usage, added 20% headroom on top of the peak value, and set the result as the initial resource quota for every project. We created a cron job to monitor quota usage and send business-hour alerts to project-owning teams if their project usage approaches a certain limit. This encourages project owners to do a better job of capacity planning and to request a resource quota change when needed. Resource quota changes are manually reviewed and automatically deployed after sign-off.
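The quota arithmetic above is simple enough to show directly. The sketch below assumes usage is tracked as whole units (for example, pod counts or millicores) and uses integer math to avoid rounding surprises. The 20% headroom comes from the text; the 80% alert threshold is an assumed value, since the article only says "a certain limit".

```python
HEADROOM_PERCENT = 20   # from the text: 20% on top of the historical peak
ALERT_PERCENT = 80      # assumed alerting threshold, not stated in the article


def initial_quota(historical_usage, headroom_percent=HEADROOM_PERCENT):
    """Return peak usage plus headroom, rounded up to a whole unit."""
    peak = max(historical_usage)
    return (peak * (100 + headroom_percent) + 99) // 100


def should_alert(current_usage, quota, alert_percent=ALERT_PERCENT):
    """True when current usage has reached the alerting share of the quota."""
    return current_usage * 100 >= quota * alert_percent
```

For example, a project whose pod count peaked at 100 would start with a quota of 120 pods, and under the assumed threshold its owners would be alerted once usage reaches 96.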

Client Access Enforcement

We require all kube-apiserver clients to follow the best practices Kubernetes already provides:

Controller Framework

The controller framework provides a shareable cache for optimizing read operations, built on an informer-reflector-cache architecture. Informers are set up to list and watch objects of interest from the kube-apiserver. The reflector reflects object changes to the underlying cache and propagates watch events to event handlers. Multiple components inside the same controller can register event handlers for OnCreate, OnUpdate, and OnDelete events from informers and fetch objects from the cache instead of from kube-apiserver directly. This reduces the chance of making unnecessary and redundant calls.

Figure 4: Kubernetes Controller Framework
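To make the informer-reflector-cache flow concrete, here is a toy in-memory model. It is a sketch of the pattern only and does not use the real Kubernetes client libraries: `Informer`, `sync`, and `handle_watch_event` are illustrative names, objects are plain dictionaries keyed by `name`, and watch events are delivered by calling a method rather than over a streaming connection.

```python
class Informer:
    """Toy stand-in for an informer: one list call, then a cache fed by events."""

    def __init__(self, list_func):
        self._list_func = list_func   # the expensive call to the API server
        self._cache = {}
        self._handlers = []

    def add_event_handler(self, on_create=None, on_update=None, on_delete=None):
        """Register callbacks; several components can share one informer."""
        self._handlers.append(
            {"create": on_create, "update": on_update, "delete": on_delete}
        )

    def sync(self):
        """Initial list: a single backend call populates the local cache."""
        for obj in self._list_func():
            self._cache[obj["name"]] = obj

    def handle_watch_event(self, event_type, obj):
        """Reflector role: apply the change to the cache, then notify handlers."""
        if event_type == "delete":
            self._cache.pop(obj["name"], None)
        else:
            self._cache[obj["name"]] = obj
        for handlers in self._handlers:
            callback = handlers.get(event_type)
            if callback is not None:
                callback(obj)

    def get(self, name):
        """Read from the local cache; no API server call is made."""
        return self._cache.get(name)
```

The point of the pattern is visible in `get`: after the initial list, reads are served locally, and the API server is only involved in streaming changes.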

Rate Limiting

Kubernetes API clients are usually shared among different controllers, and API calls are made from different threads. Kubernetes ships its API client with a token bucket rate limiter that supports configurable QPS and burst. API calls that burst beyond the threshold are throttled so that a single controller cannot saturate kube-apiserver bandwidth.
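A token bucket with QPS and burst parameters can be sketched in a few lines. The version below is an illustrative, single-threaded model of the technique, not the rate limiter shipped with the Kubernetes client; the class and method names are assumptions, and the injectable `clock` exists only to make the behavior easy to demonstrate.

```python
import time


class TokenBucket:
    """Token bucket: refills at `qps` tokens per second, holds at most `burst`."""

    def __init__(self, qps, burst, clock=time.monotonic):
        self.qps = float(qps)
        self.burst = float(burst)
        self._clock = clock
        self._tokens = float(burst)   # start full so an initial burst is allowed
        self._last = clock()

    def try_accept(self):
        """Take one token if available; return False when the call should be throttled."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.burst, self._tokens + elapsed * self.qps)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

With `qps=2` and `burst=2`, two back-to-back calls are accepted, a third is throttled, and another call is accepted half a second later once one token has been refilled. A limiter shared across threads would additionally need a lock around `try_accept`.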

Shared Cache

In addition to the kube-apiserver built-in cache that comes with the controller framework, we added another informer-based write-through cache layer in the platform API. This prevents unnecessary read calls from hitting the kube-apiserver hard. Reusing the server-side cache also avoids thick clients in application code.

For kube-apiserver access from applications, we require all requests to go through the platform API, which leverages the shared cache and assigns a security identity for access control and flow control. For kube-apiserver access from workload controllers, we require that all controllers be implemented on the controller framework with rate limiting.

Resilience

Hardening Kubelet

One key reason why the Kubernetes control plane entered cascading failure is that the legacy reflector implementation retried without bound when handling errors. Such imperfections can be amplified, especially when the API server is OOMKilled, which can easily cause a synchronized wave of reflector retries across the cluster.

To resolve this issue, we worked very closely with the community by reporting issues, discussing solutions, and finally getting PRs (1, 2) reviewed and merged. The idea is to add exponential backoff with jitter to the reflector’s ListWatch retry logic, so the kubelet and other controllers do not hammer the kube-apiserver when it is overloaded and requests fail. This resilience improvement is useful in general, but we found it critical on the kubelet side as the number of nodes and pods in the Kubernetes cluster increases.
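Exponential backoff with jitter, the technique named above, can be illustrated with a small generator. This is a generic sketch of the idea rather than the code from the merged PRs; the `base`, `factor`, `cap`, and `jitter` parameters and their defaults are assumptions chosen for readability.

```python
import random


def backoff_delays(base=1.0, factor=2.0, cap=30.0, jitter=0.5, rng=random.random):
    """Yield retry delays in seconds.

    The un-jittered delay grows by `factor` on each retry up to `cap`.
    Each yielded value adds a random extra of up to `jitter` times the delay,
    so clients that failed at the same moment do not retry at the same moment.
    """
    delay = base
    while True:
        yield delay * (1.0 + jitter * rng())
        delay = min(cap, delay * factor)
```

A retry loop would call `next()` on the generator after each failed list or watch attempt and sleep for the returned duration, creating a fresh generator after a success so the delay resets. The growth bounds the load one client can generate, and the jitter spreads many clients apart in time.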

Tuning Concurrent Requests

The more nodes we manage and the faster workloads are created and destroyed, the higher the API call QPS the server needs to handle. We first increased the maximum concurrent API call settings for both mutating and non-mutating operations based on estimated workloads. These two settings ensure that the number of API calls being processed does not exceed the configured value, which keeps the CPU and memory consumption of kube-apiserver under a certain threshold.

Inside Kubernetes’s chain of API request handling, every request passes through a group of filters as the very first step. The filter chain is where the maximum number of in-flight API calls is enforced. When API calls burst beyond the configured threshold, a “too many requests” (429) response is returned to clients to trigger proper retries. As future work, we plan to investigate the EventRateLimit feature for more fine-grained admission control and better quality of service.
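The max-inflight behavior described above amounts to a non-blocking concurrency gate in front of the request handler. The following sketch models that gate with a semaphore; it is an illustration of the idea, not kube-apiserver's filter code, and `MaxInflightFilter` and its return shape are assumptions.

```python
import threading


class MaxInflightFilter:
    """Reject requests with 429 once `limit` requests are already in flight."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def handle(self, handler):
        """Run `handler` if a slot is free; otherwise return a 429 immediately."""
        if not self._slots.acquire(blocking=False):
            return 429, "too many requests"
        try:
            return 200, handler()
        finally:
            # Always free the slot, even if the handler raises.
            self._slots.release()
```

Because the article mentions separate settings for mutating and non-mutating operations, a fuller model would keep two such filters and route each request to one of them by verb. Rejecting immediately instead of queueing is what keeps memory bounded under a burst: excess requests cost almost nothing on the server, and well-behaved clients back off and retry.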

Caching More Histories

The watch cache is a mechanism inside kube-apiserver that caches past events of each resource type in a ring buffer in order to serve watch calls from a particular version on a best-effort basis. The larger the caches are, the more events can be retained in the server, and the more likely it is that event streams can be served seamlessly to clients after a broken connection. Given this, we also increased the target RAM size of kube-apiserver, which is internally translated into watch cache capacity based on heuristics, in order to serve more robust event streams. Kube-apiserver also provides more detailed ways to configure fine-grained watch cache sizes, which can be leveraged for specific caching requirements.

Figure 5: Kubernetes Watch Cache
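The effect of watch cache size can be seen in a small ring-buffer model. This sketch makes a simplifying assumption that resource versions are consecutive integers, which is not how real resource versions behave, and `WatchCache` and `events_since` are illustrative names rather than kube-apiserver internals.

```python
from collections import deque


class WatchCache:
    """Fixed-size ring buffer of (resource_version, event) pairs."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest entry when a new one is appended.
        self._events = deque(maxlen=capacity)

    def add(self, resource_version, event):
        self._events.append((resource_version, event))

    def events_since(self, resource_version):
        """Return events newer than `resource_version`.

        Returns None when the requested version has already fallen out of the
        buffer, meaning the client cannot resume and must re-list everything.
        """
        if not self._events:
            return []
        oldest = self._events[0][0]
        if resource_version < oldest - 1:
            return None
        return [event for version, event in self._events if version > resource_version]
```

With a capacity of 3 and events 1 through 5 recorded, a client that last saw version 2 can resume from the buffer, while a client that last saw version 1 has missed an evicted event and must fall back to a full list. A larger buffer widens the window in which a reconnecting client avoids that expensive re-list.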

Operability

Observability

To reduce incident detection and mitigation time, we continuously work to improve the observability of the Kubernetes control plane. The challenge is to balance failure coverage and signal sensitivity. For existing Kubernetes metrics, we triage and pick the important ones to monitor and/or alert on so we can identify issues more proactively. In addition, we instrument kube-apiserver to cover more detailed areas in order to quickly narrow down the root cause. Finally, we tune alert statistics and thresholds to reduce noise and false alarms.

At a high level, we monitor kube-apiserver load by looking at QPS and concurrent requests, error rate, and request latency. We can break down the traffic by resource type, request verb, and associated service account. For expensive traffic like listing, we also measure request payload by object count and byte size, since such requests can easily overload kube-apiserver even at low QPS. Lastly, we monitor etcd watch event processing QPS and delayed processing count as important server performance indicators.

Figure 6: Kubernetes API calls by type
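The traffic breakdown described above is essentially a group-by over request records. The sketch below assumes each record is a dictionary with `resource`, `verb`, and `service_account` fields plus optional `objects` and `bytes` payload sizes; these field names are hypothetical and do not correspond to a specific kube-apiserver log or metric format.

```python
from collections import defaultdict


def summarize_requests(requests):
    """Aggregate call count and payload size per (resource, verb, service account)."""
    stats = defaultdict(lambda: {"count": 0, "objects": 0, "bytes": 0})
    for request in requests:
        key = (request["resource"], request["verb"], request["service_account"])
        entry = stats[key]
        entry["count"] += 1
        entry["objects"] += request.get("objects", 0)
        entry["bytes"] += request.get("bytes", 0)
    return dict(stats)


def heaviest_callers(summary, top=3):
    """Rank keys by total bytes returned, which can differ sharply from call count."""
    return sorted(summary, key=lambda key: summary[key]["bytes"], reverse=True)[:top]
```

Ranking by bytes rather than by count is what surfaces the pattern from the second incident: a client issuing a handful of list-all-pods calls can outweigh thousands of cheap single-object reads.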

Debuggability

To better understand Kubernetes control plane performance and resource consumption, we also built an etcd data storage analysis tool using the boltdb library and flamegraph to visualize the data storage breakdown. The results of the data storage analysis give platform users insights for optimizing their usage.

Figure 7: Etcd Data Usage Per Key Space
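The core of a per-key-space breakdown is summing stored bytes by key prefix. The sketch below shows only that aggregation step and assumes the `(key, value_size)` pairs have already been read from the etcd data file by some other means; it does not use boltdb, and `usage_by_prefix` is a hypothetical helper. The example keys follow the `/registry/<resource>/<namespace>/<name>` layout Kubernetes uses in etcd.

```python
from collections import defaultdict


def usage_by_prefix(pairs, depth=2):
    """Sum value sizes by the first `depth` path segments of each key."""
    totals = defaultdict(int)
    for key, size in pairs:
        segments = key.strip("/").split("/")[:depth]
        totals["/" + "/".join(segments)] += size
    return dict(totals)
```

Calling it with `depth=2` gives totals per resource type, and `depth=3` gives totals per resource type and namespace, which is the kind of view that lets a platform user see which of their projects dominates storage.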

In addition, we enabled Go pprof profiling and visualized the heap memory footprint. We were able to quickly identify the most resource-intensive code paths and request patterns, e.g. transforming response objects upon list resource calls. Another big caveat we found as part of the kube-apiserver OOM investigation is that page cache used by kube-apiserver is counted towards a cgroup’s memory limit, and anonymous memory usage can steal page cache usage for the same cgroup. So even if kube-apiserver only has 20GB of heap memory usage, the entire cgroup can see 200GB of memory usage and hit the limit. Since the current kernel default is not to proactively reclaim assigned pages for efficient re-use, we are looking at setting up monitoring based on the memory.stat file and forcing the cgroup to reclaim as many pages as possible when memory usage approaches the limit.

Figure 8: Kubernetes API Server Memory Profiling
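Monitoring based on memory.stat could start from something like the sketch below. It assumes the cgroup v1 format, where memory.stat contains lines such as `cache <bytes>` and `rss <bytes>` (cgroup v2 names the comparable fields `file` and `anon`). Treating cache plus rss as total usage is an approximation, the 90% threshold is an assumed value, and the function names are hypothetical.

```python
def parse_memory_stat(text):
    """Parse 'name value' lines from a memory.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def page_cache_report(stat_text, limit_bytes, threshold_percent=90):
    """Split usage into page cache and anonymous memory and flag limit pressure."""
    stats = parse_memory_stat(stat_text)
    cache = stats.get("cache", 0)   # page cache charged to the cgroup
    rss = stats.get("rss", 0)       # anonymous memory, e.g. the Go heap
    used = cache + rss
    return {
        "cache": cache,
        "rss": rss,
        "near_limit": used * 100 >= limit_bytes * threshold_percent,
    }
```

A report in which `near_limit` is true while `rss` is a small share of the total matches the situation described above: the process heap is modest, but page cache charged to the same cgroup has pushed usage toward the limit, so reclaiming cache is a better response than raising the limit.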

Conclusion

With our governance, resilience, and operability efforts, we were able to significantly reduce sudden surges in compute resource usage and control plane bandwidth, and ensure the stability and performance of the whole platform. The kube-apiserver QPS (mostly reads) dropped by 90% after the optimization rollout (as shown in Figure 9), which makes kube-apiserver usage more stable, efficient, and robust. The deep knowledge of Kubernetes’ internals and the additional insights we gained will enable the team to do a better job of system operation and cluster maintenance.

Figure 9: Kube-apiserver QPS Reduction After Optimization Rollout

Here are some key takeaways that can hopefully help on your next journey of solving Kubernetes scalability and reliability problems:

  1. Diagnose problems to get at their root causes. Focus on the “what is” before deciding “what to do about it.” The first step of solving problems is to understand what the bottleneck is and why. If you get to the root cause, you are halfway to the solution.
  2. It is almost always worthwhile to first look into small incremental improvements rather than immediately commit to radical architecture change. This is important, especially when you have a nimble team.
  3. Make data-driven decisions when you plan or prioritize investigations and fixes. The right telemetry can help you make better decisions on what to focus on and optimize first.
  4. Critical infrastructure components should be designed with resilience in mind. Distributed systems are subject to failures, and it is best to always prepare for the worst. Correct guardrails can help prevent cascading failures and minimize the blast radius.

Looking Forward

Federation

As our scale grows steadily, a single-cluster architecture has become insufficient to support the increasing number of workloads that try to onboard. After ensuring an efficient and robust single-cluster environment, enabling our compute platform to scale horizontally is our next milestone. By leveraging a federation framework, we aim to plug new clusters into the environment with minimal operational overhead while keeping the platform interface stable for end users. Our federated cluster environment is currently under development, and we look forward to the additional possibilities it opens up once productized.

Capacity Planning

Our current approach to resource quota enforcement is a simplified and reactive form of capacity planning. As we onboard user workloads and system components, the platform dynamics change, and project-level or cluster-wide capacity limits can become out of date. We want to explore proactive capacity planning with forecasting based on historical data, growth trajectory, and a sophisticated capacity model that covers not only resource quota but also API quota. We expect that more proactive and accurate capacity planning can prevent the platform from over-committing and under-delivering.

Acknowledgements

Many engineers at Pinterest helped scale the Kubernetes platform to keep up with business growth. Besides the Cloud Runtime team (June Liu, Harry Zhang, Suli Xu, Ming Zong, and Quentin Miao), who worked hard to achieve the scalable and stable compute platform we have today, Balaji Narayanan, Roberto Alcala, and Rodrigo Menezes, who lead our Site Reliability Engineering (SRE) effort, have worked together on ensuring the solid foundation of the compute platform. Kalim Moghul and Ryan Albrecht, who lead the Capacity Engineering effort, have contributed to project identity management and system-level profiling. Cedric Staub and Jeremy Krach, who lead the Security Engineering effort, have maintained a high standard so that our workloads can run securely in a multi-tenant platform. Lastly, our platform users Dinghang Yu, Karthik Anantha Padmanabhan, Petro Saviuk, Michael Benedict, Jasmine Qin, and many others provided a lot of useful feedback and requirements, and worked with us to make sustainable business growth happen.
