Improving Distributed Caching Performance and Efficiency at Pinterest

496
Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

Kevin Lin | Software Engineer, Storage and Caching


Introduction

Pinterest’s distributed caching system, built on top of open source technologies memcached and mcrouter, is a critical component of the production infrastructure stack. Pinterest’s cache-as-a-service platform is responsible for driving down application latency across the board, reducing the overall cloud cost footprint, and ensuring adherence to strict sitewide availability targets.

Today, Pinterest’s memcached fleet spans over 5000 EC2 instances across a variety of instance types optimized along compute, memory, and storage dimensions. Collectively, the fleet serves up to ~180 million requests per second and ~220 GB/s of network throughput over a ~460 TB active in-memory and on-disk dataset, partitioned among ~70 distinct clusters.

As a core driver of reduced sitewide latency, the distributed caching tier is subject to stringent performance and latency requirements. Additionally, a key consequence of the sheer size of the fleet is that even small efficiency optimizations have an outsized impact on the total service cost footprint. Several years of operational experience running memcached at scale in production have provided unique insight into practical optimizations for driving improved performance and efficiency across the entire caching stack.

In this article, we will share some context on the observability and performance testing tools that enable optimization exploration work, followed by a deep dive into practical optimizations currently running in our production environment along dimensions of hardware selection strategy, compute efficiency, and networking performance.

High-level description of the available surface area for performance optimization for memcached running on virtual machines in public cloud environments

Monitoring, observability, and evaluation

All performance optimization efforts start with precise quantitative measurement and a structured, reproducible mechanism for generating workloads for isolated evaluation.

Critical monitoring prerequisites for all the performance evaluation conducted over the years include:

  • Server-side metrics for request throughput, network throughput, resource utilization, and hardware-level parameters (NIC statistics like per-queue packet throughput and EC2 allowance exhaustion, disk response times and in-flight I/O requests, etc.)
  • Client-side metrics for cache request percentile latency, timeout and error rates, and per-server availability (SLIs), as well as top-level application performance indicators like service RPC P99 response time

Pinterest leverages both synthetic load generation and production shadow traffic to evaluate the impact of tuning and optimizations. Historically, synthetic benchmarking has been useful for detecting performance regressions or improvements under maximum load, while shadow traffic evaluation has been more reflective of server resource utilization and overall performance under a real workload at scale.

  • Synthetic load generation: memtier_benchmark is an open source tool capable of generating a synthetic load against a memcached cluster with configurable parameters for the number of concurrent clients and threads, read/write ratio, data sizes, and transport mechanism (plaintext or TLS).
  • Production shadow traffic: mcrouter is an open source memcache-protocol routing proxy deployed as a client-side sidecar in the Pinterest fleet. It provides building blocks to design transparent shadow traffic routing policies with configurable traffic percentages and source/target cluster(s), allowing for flexible dark traffic experimentation across a variety of workload classes.

Together, these tools permit high-signal performance evaluation with zero or minimal impact to critical-path production traffic.

Performance and efficiency

Cloud hardware

Distributed caching at Pinterest serves a diverse array of workloads. In general, each class of workload can be categorized along the following high-level dimensions:

  • Throughput (compute)
  • Data volume (memory and/or disk capacity)
  • Data bandwidth (network and compute)
  • Latency requirement (compute)

While memcached can be arbitrarily horizontally scaled in and out to address a particular cluster’s bottleneck, vertically scaling individual hardware dimensions allows for greater cost efficiency for specific workloads. In practice, this entails standardization on a fixed pool of EC2 instance types optimized for each workload class.

Workload profile: Moderate throughput, moderate data volume

EC2 instance family: r5

Rationale: r5 family instances offer a vCPU-DRAM ratio that works well for most vanilla cache use cases at Pinterest. This instance type is considered the “baseline” against which others are evaluated.

Workload profile: High throughput, low data volume

EC2 instance family: c5

Rationale: c5 family instances are more cost efficient for use cases that would otherwise slot into the r5 type but hold significantly less memory. Maintaining the same vCPU count as its r5 counterpart allows it to serve the same throughput volume at a lower overall cost.

Workload profile: High data volume, relaxed latency requirement

EC2 instance family: r5d

Rationale: r5d family instances are functionally equivalent to r5 family instances but with an instance-colocated NVMe SSD used by extstore for secondary storage. r5d is cost efficient for clusters with high data volume such that there are tangible improvements to hit rate as data is written to disk. Due to the slower disk (relative to i3 family instances), higher tail latency is expected.

Workload profile: Massive data volume, relaxed latency requirement

EC2 instance family: i3 and i3en

Rationale: i3 and i3en family instances ship with a fast and sizable instance-colocated disk, which tangibly increases extstore performance for workloads with a very high ratio of working data on disk relative to DRAM. Additionally, they offer comparable memory capacity to r5 series instances, which reduces extstore thrashing by maintaining a reasonable DRAM to disk usage ratio.

EC2 instance type distribution for the Pinterest memcached fleet

In particular, using extstore to expand storage capacity beyond DRAM into a local NVMe flash disk tier increases per-instance storage efficiency by up to several orders of magnitude, and it reduces the associated cluster cost footprint proportionally. EC2’s storage-optimized instance types provide locally attached solid state drives capable of high random IOPS and R/W throughput, allowing onboarding of extstore use cases with massive data volumes and high request throughput without compromising tail latency.

The introduction of different shapes of storage-optimized EC2 instance types to the fleet (in particular, the lower-tier variants of the i3en instance family containing multiple independent disks per instance) further drives down costs while offering improvements in I/O performance and cost efficiency. Pinterest configures these instances with Linux software RAID at level RAID0 to combine multiple hardware block devices into a single, logical disk for userspace consumption. By striping reads and writes fairly across two disks, RAID0 doubles maximum theoretical I/O throughput with a best-case two-fold reduction in effective disk response time at the cost of a doubled MTTF. This increased hardware performance for extstore at the expense of an increased theoretical failure rate is a highly worthwhile tradeoff. Operating workloads on a public cloud necessitates designing infrastructure to be ephemeral cattle, capable of self-healing in the event of instance failures. A topology control plane for mcrouter automatically and gracefully responds to unexpected changes in server capacity; instance loss is a non-issue.

Compute

Approximately half of all caching workloads at Pinterest are compute-bound (i.e. purely request throughput-bound). Successful optimizations in compute efficiency translate into the ability to downsize clusters without compromising serving capacity.

More precisely, compute efficiency for memcached is defined as the additional rate of requests that can be serviced by a single instance for each percentage point increase in instance CPU usage, without increasing request latency. In simpler terms, an optimization that improves compute efficiency is one that allows memcached to serve a higher request rate at lower CPU usage, without changing request latency characteristics.

At Pinterest, most workloads (including the distributed cache fleet) run on dedicated EC2 virtual machines. Many historical efficiency improvements stem from optimizations in the hardware layer itself, like migrating to different instance families or upgrading to newer generations of existing instance types. However, operating workloads on dedicated (virtualized) machines offers unique opportunities for optimizations at the hardware-software boundary.

Memcached is somewhat unique among stateful data systems at Pinterest in that it is the exclusive primary workload, with a static set of long-lived worker threads, on every EC2 instance on which it is deployed. This is in contrast to database workloads which might have, for example, multiple colocated processes for decoupled storage and serving layers. To this end, one simple but highly effective optimization is tuning process scheduling in order to request the kernel prioritize CPU time for memcached at the expense of deliberately withholding CPU time from other processes on the host, like monitoring daemons. This involves running memcached under a real-time scheduling policy, SCHED_FIFO, with a high priority — instructing the kernel to, effectively, allow memcached to monopolize the CPU by preempting (and thus deliberately starving) all non-realtime processes whenever a memcached thread becomes runnable.

$ sudo chrt — — fifo <priority> memcached …

Example invocation of memcached under a SCHED_FIFO real-time scheduling policy

This one-line change, after rollout to all compute-bound clusters, drove client-side P99 latency down by anywhere between 10% and 40%, in addition to eliminating spurious spikes in P99 and P999 latency across the board. Additionally, it afforded the ability to raise the steady-state operation CPU usage ceiling by 20% without introducing latency regressions. Ultimately, this shaved close to 10% off memcached’s total fleet-wide cost footprint.

Week-over-week comparison of client-side P99 cache latency for one service while real-time scheduling was rolled out to its corresponding dedicated memcached cluster

Ratio of time spent by memcached waiting for execution by the kernel versus wall clock time, before and after real-time scheduling was enabled (data is collected from schedstat in the /proc filesystem)

Stabilization of spurious latency spikes after real-time scheduling was enabled on a canary host (red-colored series)

Networking

There are a few key dimensions when considering networking performance:

  • Data bandwidth, packet throughput, and TCP connection limits. EC2 imposes hard limits on per-instance PPS, aggregate bandwidth, and TCP connections (though only when deployed in a security group with TCP ingress rules). Excess usage beyond these limits is reported by the Elastic Network Adapter (ENA) driver and accessible via ethtool. Confusingly, EC2 also expresses total NIC bandwidth capabilities in terms of burst loads rather than steady-state loads, thus requiring some degree of trial-and-error to determine the practical bandwidth ceiling for workloads like memcached with predictable network characteristics.
  • Connection latency and reliability. Is there a way to make initial TCP connections to memcached faster and more reliable, especially under burst scenarios where thousands of clients are simultaneously establishing connections?
  • Overhead associated with transport-layer features like TLS. Is there a way to reduce the encryption/decryption compute overhead of TLS? Additionally, is there a way to reduce the cost of the initial setup cost (i.e. TLS handshake)?

From a cloud consumer’s perspective, EC2-enforced network limits can and should effectively be considered inherent hardware limitations. Unfortunately, there is no mechanism to work around these limits other than horizontally scaling the fleet to reduce per-instance usage.

In Pinterest’s caching architecture, mcrouter is a universal routing proxy and the single application-facing entry point into the distributed caching tier. Each mcrouter instance (effectively, every individual host in a service cluster) creates a statically sized, long-lived TCP connection pool to every individual memcached server in a cluster. Connection pool sizes are deterministically derived from the number of logical cores available on the host system, typically ranging from 8 to 72 for canonical instance types. This results in upwards of tens of thousands of active established TCP connections per server host, and easily over a million total connections per server cluster — necessitating a strategy for maintaining minimal connection latency and connection reliability at scale.

TCP Fast Open (TFO) is a mechanism for reducing the latency overhead of establishing a TCP connection by optimizing away one RTT in an otherwise costly TCP 3WHS (3-way handshake), while also allowing eager transmission of early data during the handshake itself. While originally intended for end users on unreliable home and mobile networks connecting to remote edge servers, TFO has demonstrated tangible improvements in connection reliability in a closed cloud environment as well. Implementing TFO support in memcached reduced average TCP connection durations of successive sessions by ~10%, most prominently in connections established across an Availability Zone boundary.

Packets exchanged between client and server during TFO cookie setup and a subsequent TFO-initiated session with early data

Separately, raising the sysctl parameter value for net.core.somaxconn and associated listen backlog size in the userspace listen(2) callsite in memcached improved burst connection availability for high-throughput clusters. Previously, deploying a new memcached binary would cause spikes in ECONNREFUSED server errors caused by exhausted server-side TCP accept queues driven by thundering herds of simultaneous inbound connections from thousands of client mcrouter instances. A more generous listen backlog threshold reduced per-server downtime and fixed the brief but frequent SLO violations whenever a shared tenancy cluster was deployed.

Lastly, TLS plays an important role for in-transit data encryption between memcached and mcrouter, and it is enabled for 100% of cache traffic within Pinterest in order to comply with sitewide authentication, authorization, and auditing policies. Even with hardware-accelerated cryptography, TLS adds non-trivial initial and steady-state overhead, due to a post-connect TLS handshake and application-layer encryption/decryption during network I/O, respectively. TLS session resumption, after implementation in memcached, reduced fleet-wide client-side connection timeout rates by allowing reuse of previously cached TLS sessions. One avenue for tackling steady-state overhead is kernel TLS (kTLS) — a mechanism to offload the TLS record layer from userspace to the kernel, implemented either in software or offloaded to supported dedicated NIC hardware for completely transparent inline data encryption/decryption. TLS session resumption was upstreamed by Pinterest to memcached and is available in version 1.6.3 onward; kTLS is an ongoing and relatively experimental optimization area.

Future work

Infrastructure optimization is a critical objective for Pinterest that ultimately drives a more delightful experience for Pinners while reducing our own cloud cost footprint. We look forward to continuing to explore avenues for improving cache performance and efficiency at all layers of the stack, from application clients and routing proxies to the servers themselves. In the near term, we intend to continue evaluation of software kernel TLS, explore compatibility of memcached with newer generations of EC2 instance types for improved price-to-performance characteristics, and application/proxy-side software optimizations like in-flight compression for improved storage efficiency. We hope to additionally build an end-to-end automated performance regression testing framework to track the impact of these optimizations over time.

Thanks to the entire Storage and Caching team at Pinterest for supporting this work, especially Ankita Girish Wagh and Lianghong Xu.

Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.
Tools mentioned in article
Open jobs at Pinterest
Senior Backend Engineer, User Underst...
San Francisco, CA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

The user understanding team builds cutting edge user understanding models and systems to deeply understand the evolving interests, intents and tastes of our 450m+ users, which is one of the most essential ML components powering personalization of Pinterest products across Discovery (Homefeed, Search, Related Pins), Ads, Shopping and Growth. As a backend engineer, you will have the opportunity to help build the next generation large-scale user signal platform and potentially affect every surface of Pinterest with deeper user understanding.

What you’ll do:

  • As a backend engineer, design and build large scale systems that can process profile, activities and feedback from hundreds of millions of pinterest users.
  • Design and build systems / tools, e.g.,
  • Drive cross functional collaborations with partner teams adopting user signals and models, e.g. ads, Homefeed, shopping, search and etc.
  • Work with ML Engineers to manage large-scale ML models in production

What we’re looking for:

  • 4+ years of industry experience.
  • Expert in Python and Java or other static languages like Go or C++
  • Hands-on experience building complex backend systems leveraged by multiple clients.
  • Hands-on experience with big data technologies (e.g., Hadoop/Spark/Kafka) and scalable realtime systems that process stream data.
  • Nice to have:
    • Experience working with privacy sensitive data or GDPR compliance.
    • Basic knowledge of machine learning (or willing to learn!): feature extraction, training, model serving etc.

#LI-TG1

Engineer Manager, Content Knowledge S...
San Francisco, CA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Pinterest helps people Discover and Do the things they love. We have more than 450M monthly active users who actively curate an ecosystem of more than 100B Pins on more than 1B boards, creating a rich human curated graph of immense value. 

Technically, we are building out an internet scale personalized recommendation engine in 22+ languages, which requires a deep understanding of the users and content on our platform. As engineer manager on the Content Knowledge Signal team, you’ll work on building 20+ content understanding signals based on Pinterest Knowledge Graph, which will make measurably positive impact on hundreds of millions of users with improved recommendation and featurization breakthroughs on almost all Pinterest product surfaces (Discovery, Shopping, Growth, Ads, etc). 

What you'll do:

  • Manage a horizontal team of talented and dedicated ML engineers to build the foundational content understanding and engagement features of our contents to be used across all Pinterest ecosystems
  • Utilize state of the art algorithms/industry best practice to build and improve content understanding signals 
  • Partner with other engineering teams and sales & marketing team to discover future opportunities to improve content recommendation on Pinterest
  • Hire new engineers to grow the team
  • Build ML models using text and visual information of a pin, identify the most relevant set of text annotations for that pin. These sets of highly relevant annotations are among the most important features used in more than 30 use cases within Pinterest, including key ranking models of Homefeed, Search and Ads.
  • Build ML models using text and images of the products, to understand their product categories (bags, shoes, shirts, etc) and their attributes (brand, color, style, etc). They are used to greatly improve relevance for product recommendation on major shopping surfaces. 
  • Build ML models to understand search queries, then use them, together with Pin level signals, to boost search relevance. 
  • Build graph based embedding as well as explicit annotation to represent the specialties of our native content creators, to improve creator and native content recommendation.
  • Build highly efficient and expandable data pipelines to understand engagement data at various entity levels. Such engagement signals are the major feature of the ranking models for our three main Discovery surfaces. 
  •  

What we're looking for:

  • 2+ years of industrial experience in ML team’s EM or TL for one or multiple of the following use cases with large scale: ads targeting, search and discovery, growth, content/user understanding
  • Hands-on experience working with ML algorithm development and productization.  
  • Experience working with PMs and XFN partners on E2E systems and moving business metrics

#TG1

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

Software Engineer, Machine Learning P...
San Francisco, CA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

We are seeking a senior software engineer to build and boost Pinterest’s machine learning training and serving platforms and infrastructure. The candidate will work with different teams to design, build and improve our ML systems, including the model training computation platform, serving systems and model deployment systems.

What you'll do:

  • Design and build solutions to make the model training, serving and deployment process more efficient, more reliable, and less error-prone by human mistakes.
  • Design and build long term solutions to boost the model iteration velocity for machine learning engineers and data scientists.
  • Work extensively with ML engineers across Pinterest to understand their requirements, pain points, and build generalized solutions. Also work with partner teams to drive projects requiring cross-team coordination. 
  • Provide technical guidance and coaching to other junior engineers in the team.

What we're looking for:

  • Hands-on experience developing large-scale machine learning models in production, or experience working on the systems supporting onboarding large-scale machine learning models.
  • Ability to drive cross-team projects; Ability to understand our internal customers (ML practitioners), their common usage patterns and pain points.
  • Flexibility to work across different areas: tool building, model optimization, infrastructure optimization, large scale data processing pipelines, etc.
  • 5+ years of professional experience in software engineering.
  • Fluency in Python and either Java or Scala (Fluency in C++ for the MLS role).
  • Past tech lead experience is preferred, but not required. (Not necessary for the MLS role).

#LI-GB2

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

Engineering Manager, Ads Engagement M...
San Francisco, CA, US; Palo Alto, CA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Pinterest is one of the fastest growing online ad platforms, and our success depends on mining rich user interest data that helps us connect users with highly relevant advertisers/products. We’re looking for an Engineering Manager with experience in machine learning, data mining, and information retrieval to lead a team that develops new data-driven techniques to show the most engaging and relevant promoted content to the users. You’ll be leading a world-class ML team that is growing quickly and laying the foundation for Pinterest’s business success.

What you’ll do:

  • Manage and grow the engineering team, providing technical vision and long-term roadmap
  • Design features and build large-scale machine learning models to improve ads engagement prediction
  • Effectively collaborate and partner with several cross functional teams to build the next generation of ads engagement models
  • Mentor and grow ML engineers to allow them to become experts in modeling/engagement prediction 

What we’re looking for:

  • Degree in Computer Science, Statistics or related field
  • Industry experience building production machine learning systems at scale, data mining, search, recommendations, and/or natural language processing
  • 1+ years of experience leading projects/ teams either as TL/ TLM/ EM
  • Cross-functional collaborator and strong communicator
  • Experience with ads domain is a big plus

#LI-SM4

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

Verified by
Security Software Engineer
Tech Lead, Big Data Platform
Software Engineer
Talent Brand Manager
Sourcer
Software Engineer
You may also like