Kevin Lin | Software Engineer, Storage and Caching

Introduction

Pinterest’s distributed caching system, built on top of open source technologies memcached and mcrouter, is a critical component of the production infrastructure stack. Pinterest’s cache-as-a-service platform is responsible for driving down application latency across the board, reducing the overall cloud cost footprint, and ensuring adherence to strict sitewide availability targets.

Today, Pinterest’s memcached fleet spans over 5000 EC2 instances across a variety of instance types optimized along compute, memory, and storage dimensions. Collectively, the fleet serves up to ~180 million requests per second and ~220 GB/s of network throughput over a ~460 TB active in-memory and on-disk dataset, partitioned among ~70 distinct clusters.

As a core driver of reduced sitewide latency, the distributed caching tier is subject to stringent performance and latency requirements. Additionally, a key consequence of the sheer size of the fleet is that even small efficiency optimizations have an outsized impact on the total service cost footprint. Several years of operational experience running memcached at scale in production have provided unique insight into practical optimizations for driving improved performance and efficiency across the entire caching stack.

In this article, we will share some context on the observability and performance testing tools that enable optimization exploration work, followed by a deep dive into practical optimizations currently running in our production environment along dimensions of hardware selection strategy, compute efficiency, and networking performance.

High-level description of the available surface area for performance optimization for memcached running on virtual machines in public cloud environments

Monitoring, observability, and evaluation

All performance optimization efforts start with precise quantitative measurement and a structured, reproducible mechanism for generating workloads for isolated evaluation.

Critical monitoring prerequisites for all the performance evaluation conducted over the years include:

Server-side metrics for request throughput, network throughput, resource utilization, and hardware-level parameters (NIC statistics like per-queue packet throughput and EC2 allowance exhaustion, disk response times and in-flight I/O requests, etc.)
Client-side metrics for cache request percentile latency, timeout and error rates, and per-server availability (SLIs), as well as top-level application performance indicators like service RPC P99 response time

Pinterest leverages both synthetic load generation and production shadow traffic to evaluate the impact of tuning and optimizations. Historically, synthetic benchmarking has been useful for detecting performance regressions or improvements under maximum load, while shadow traffic evaluation has been more reflective of server resource utilization and overall performance under a real workload at scale.

Synthetic load generation: memtier_benchmark is an open source tool capable of generating a synthetic load against a memcached cluster with configurable parameters for the number of concurrent clients and threads, read/write ratio, data sizes, and transport mechanism (plaintext or TLS).
Production shadow traffic: mcrouter is an open source memcache-protocol routing proxy deployed as a client-side sidecar in the Pinterest fleet. It provides building blocks to design transparent shadow traffic routing policies with configurable traffic percentages and source/target cluster(s), allowing for flexible dark traffic experimentation across a variety of workload classes.

Together, these tools permit high-signal performance evaluation with zero or minimal impact to critical-path production traffic.

Performance and efficiency

Cloud hardware

Distributed caching at Pinterest serves a diverse array of workloads. In general, each class of workload can be categorized along the following high-level dimensions:

Throughput (compute)
Data volume (memory and/or disk capacity)
Data bandwidth (network and compute)
Latency requirement (compute)

While memcached can be arbitrarily horizontally scaled in and out to address a particular cluster’s bottleneck, vertically scaling individual hardware dimensions allows for greater cost efficiency for specific workloads. In practice, this entails standardization on a fixed pool of EC2 instance types optimized for each workload class.

Workload profile: Moderate throughput, moderate data volume

EC2 instance family: r5

Rationale: r5 family instances offer a vCPU-DRAM ratio that works well for most vanilla cache use cases at Pinterest. This instance type is considered the “baseline” against which others are evaluated.

Workload profile: High throughput, low data volume

EC2 instance family: c5

Rationale: c5 family instances are more cost efficient for use cases that would otherwise slot into the r5 type but hold significantly less memory. Maintaining the same vCPU count as its r5 counterpart allows it to serve the same throughput volume at a lower overall cost.

Workload profile: High data volume, relaxed latency requirement

EC2 instance family: r5d

Rationale: r5d family instances are functionally equivalent to r5 family instances but with an instance-colocated NVMe SSD used by extstore for secondary storage. r5d is cost efficient for clusters with high data volume such that there are tangible improvements to hit rate as data is written to disk. Due to the slower disk (relative to i3 family instances), higher tail latency is expected.

Workload profile: Massive data volume, relaxed latency requirement

EC2 instance family: i3 and i3en

Rationale: i3 and i3en family instances ship with a fast and sizable instance-colocated disk, which tangibly increases extstore performance for workloads with a very high ratio of working data on disk relative to DRAM. Additionally, they offer comparable memory capacity to r5 series instances, which reduces extstore thrashing by maintaining a reasonable DRAM to disk usage ratio.

EC2 instance type distribution for the Pinterest memcached fleet

In particular, using extstore to expand storage capacity beyond DRAM into a local NVMe flash disk tier increases per-instance storage efficiency by up to several orders of magnitude, and it reduces the associated cluster cost footprint proportionally. EC2’s storage-optimized instance types provide locally attached solid state drives capable of high random IOPS and R/W throughput, allowing onboarding of extstore use cases with massive data volumes and high request throughput without compromising tail latency.

The introduction of different shapes of storage-optimized EC2 instance types to the fleet (in particular, the lower-tier variants of the i3en instance family containing multiple independent disks per instance) further drives down costs while offering improvements in I/O performance and cost efficiency. Pinterest configures these instances with Linux software RAID at level RAID0 to combine multiple hardware block devices into a single, logical disk for userspace consumption. By striping reads and writes fairly across two disks, RAID0 doubles maximum theoretical I/O throughput with a best-case two-fold reduction in effective disk response time at the cost of a doubled MTTF. This increased hardware performance for extstore at the expense of an increased theoretical failure rate is a highly worthwhile tradeoff. Operating workloads on a public cloud necessitates designing infrastructure to be ephemeral cattle, capable of self-healing in the event of instance failures. A topology control plane for mcrouter automatically and gracefully responds to unexpected changes in server capacity; instance loss is a non-issue.

Compute

Approximately half of all caching workloads at Pinterest are compute-bound (i.e. purely request throughput-bound). Successful optimizations in compute efficiency translate into the ability to downsize clusters without compromising serving capacity.

More precisely, compute efficiency for memcached is defined as the additional rate of requests that can be serviced by a single instance for each percentage point increase in instance CPU usage, without increasing request latency. In simpler terms, an optimization that improves compute efficiency is one that allows memcached to serve a higher request rate at lower CPU usage, without changing request latency characteristics.

At Pinterest, most workloads (including the distributed cache fleet) run on dedicated EC2 virtual machines. Many historical efficiency improvements stem from optimizations in the hardware layer itself, like migrating to different instance families or upgrading to newer generations of existing instance types. However, operating workloads on dedicated (virtualized) machines offers unique opportunities for optimizations at the hardware-software boundary.

Memcached is somewhat unique among stateful data systems at Pinterest in that it is the exclusive primary workload, with a static set of long-lived worker threads, on every EC2 instance on which it is deployed. This is in contrast to database workloads which might have, for example, multiple colocated processes for decoupled storage and serving layers. To this end, one simple but highly effective optimization is tuning process scheduling in order to request the kernel prioritize CPU time for memcached at the expense of deliberately withholding CPU time from other processes on the host, like monitoring daemons. This involves running memcached under a real-time scheduling policy, SCHED_FIFO, with a high priority — instructing the kernel to, effectively, allow memcached to monopolize the CPU by preempting (and thus deliberately starving) all non-realtime processes whenever a memcached thread becomes runnable.

$ sudo chrt — — fifo <priority> memcached …

Example invocation of memcached under a SCHED_FIFO real-time scheduling policy

This one-line change, after rollout to all compute-bound clusters, drove client-side P99 latency down by anywhere between 10% and 40%, in addition to eliminating spurious spikes in P99 and P999 latency across the board. Additionally, it afforded the ability to raise the steady-state operation CPU usage ceiling by 20% without introducing latency regressions. Ultimately, this shaved close to 10% off memcached’s total fleet-wide cost footprint.

Week-over-week comparison of client-side P99 cache latency for one service while real-time scheduling was rolled out to its corresponding dedicated memcached cluster

Ratio of time spent by memcached waiting for execution by the kernel versus wall clock time, before and after real-time scheduling was enabled (data is collected from schedstat in the /proc filesystem)

Stabilization of spurious latency spikes after real-time scheduling was enabled on a canary host (red-colored series)

Networking

There are a few key dimensions when considering networking performance:

Data bandwidth, packet throughput, and TCP connection limits. EC2 imposes hard limits on per-instance PPS, aggregate bandwidth, and TCP connections (though only when deployed in a security group with TCP ingress rules). Excess usage beyond these limits is reported by the Elastic Network Adapter (ENA) driver and accessible via ethtool. Confusingly, EC2 also expresses total NIC bandwidth capabilities in terms of burst loads rather than steady-state loads, thus requiring some degree of trial-and-error to determine the practical bandwidth ceiling for workloads like memcached with predictable network characteristics.
Connection latency and reliability. Is there a way to make initial TCP connections to memcached faster and more reliable, especially under burst scenarios where thousands of clients are simultaneously establishing connections?
Overhead associated with transport-layer features like TLS. Is there a way to reduce the encryption/decryption compute overhead of TLS? Additionally, is there a way to reduce the cost of the initial setup cost (i.e. TLS handshake)?

From a cloud consumer’s perspective, EC2-enforced network limits can and should effectively be considered inherent hardware limitations. Unfortunately, there is no mechanism to work around these limits other than horizontally scaling the fleet to reduce per-instance usage.

In Pinterest’s caching architecture, mcrouter is a universal routing proxy and the single application-facing entry point into the distributed caching tier. Each mcrouter instance (effectively, every individual host in a service cluster) creates a statically sized, long-lived TCP connection pool to every individual memcached server in a cluster. Connection pool sizes are deterministically derived from the number of logical cores available on the host system, typically ranging from 8 to 72 for canonical instance types. This results in upwards of tens of thousands of active established TCP connections per server host, and easily over a million total connections per server cluster — necessitating a strategy for maintaining minimal connection latency and connection reliability at scale.

TCP Fast Open (TFO) is a mechanism for reducing the latency overhead of establishing a TCP connection by optimizing away one RTT in an otherwise costly TCP 3WHS (3-way handshake), while also allowing eager transmission of early data during the handshake itself. While originally intended for end users on unreliable home and mobile networks connecting to remote edge servers, TFO has demonstrated tangible improvements in connection reliability in a closed cloud environment as well. Implementing TFO support in memcached reduced average TCP connection durations of successive sessions by ~10%, most prominently in connections established across an Availability Zone boundary.

Packets exchanged between client and server during TFO cookie setup and a subsequent TFO-initiated session with early data

Separately, raising the sysctl parameter value for net.core.somaxconn and associated listen backlog size in the userspace listen(2) callsite in memcached improved burst connection availability for high-throughput clusters. Previously, deploying a new memcached binary would cause spikes in ECONNREFUSED server errors caused by exhausted server-side TCP accept queues driven by thundering herds of simultaneous inbound connections from thousands of client mcrouter instances. A more generous listen backlog threshold reduced per-server downtime and fixed the brief but frequent SLO violations whenever a shared tenancy cluster was deployed.

Lastly, TLS plays an important role for in-transit data encryption between memcached and mcrouter, and it is enabled for 100% of cache traffic within Pinterest in order to comply with sitewide authentication, authorization, and auditing policies. Even with hardware-accelerated cryptography, TLS adds non-trivial initial and steady-state overhead, due to a post-connect TLS handshake and application-layer encryption/decryption during network I/O, respectively. TLS session resumption, after implementation in memcached, reduced fleet-wide client-side connection timeout rates by allowing reuse of previously cached TLS sessions. One avenue for tackling steady-state overhead is kernel TLS (kTLS) — a mechanism to offload the TLS record layer from userspace to the kernel, implemented either in software or offloaded to supported dedicated NIC hardware for completely transparent inline data encryption/decryption. TLS session resumption was upstreamed by Pinterest to memcached and is available in version 1.6.3 onward; kTLS is an ongoing and relatively experimental optimization area.

Future work

Infrastructure optimization is a critical objective for Pinterest that ultimately drives a more delightful experience for Pinners while reducing our own cloud cost footprint. We look forward to continuing to explore avenues for improving cache performance and efficiency at all layers of the stack, from application clients and routing proxies to the servers themselves. In the near term, we intend to continue evaluation of software kernel TLS, explore compatibility of memcached with newer generations of EC2 instance types for improved price-to-performance characteristics, and application/proxy-side software optimizations like in-flight compression for improved storage efficiency. We hope to additionally build an end-to-end automated performance regression testing framework to track the impact of these optimizations over time.

Thanks to the entire Storage and Caching team at Pinterest for supporting this work, especially Ankita Girish Wagh and Lianghong Xu.

Application and Data