3 Innovations While Unifying Pinterest’s Key-Value Storage

By Jessica Chan | Engineering Manager, MySQL & Key-Value Storage


Engineers hate migrations. What do engineers hate more than migrations? Data migrations. Especially critical, terabyte-scale, online serving migrations which, if done badly, could bring down the site, enrage customers, or cripple hundreds of critical internal services.

So why did the Key-Value Systems Team at Pinterest embark on a two-year, real-time migration of all our online key-value serving data to a single unified storage system? Because the cost of not migrating was too high. In 2019, Pinterest had four separate key-value systems owned by different teams, each with its own API and feature set. This resulted in duplicated development effort, high operational overhead and incident counts, and confusion among engineering customers.

In unifying all of Pinterest’s 500+ key-value use cases (over 4PB of unique data serving hundreds of millions of QPS) onto a single interface, we not only made huge gains in reducing system complexity and lowering operational overhead, but also achieved a 40–90% performance improvement by moving to the most efficient storage engine, and saved the company a significant amount in costs per year by moving to the most optimal replication and versioning architecture.

In this blog post, we dive into three of the many innovations that helped us notch these wins.

But first, some background

Before this effort, Pinterest had four key-value storage systems:

  • Terrapin: a read-only, batch-load key-value store built at Pinterest on top of HDFS, and featured in Designing Data-Intensive Applications
  • Rockstore: a multi-mode (read-only, read-write, streaming-write) key-value store, also built at Pinterest, based on the open-source Rocksplicator framework, written in C++, and using RocksDB as its storage engine
  • UserMetaStore: a read-write key-value storage with a simplified thrift API on top of HBase
  • Rocksandra: a read-write, key-value storage based on a version of Cassandra, which used RocksDB under the hood

One of the biggest challenges when consolidating to a single system is assessing the feasibility of both achieving feature parity across all systems and integrating those features well into a single platform. Another challenge is to determine which system to consolidate to, and whether to go with an existing system or to consider something that doesn’t already exist at Pinterest. And a final, nontrivial challenge is to convince leadership and hundreds of engineers that migrating in the first place is a good idea.

Before embarking on such a large undertaking, we had to step back. A working group spent a few months deep-diving into requirements and technologies, analyzing tradeoffs and benefits, and producing a final proposal that was ultimately approved. Rockstore, which was the most cost-efficient and performant, the simplest to operate and extend, and the cheapest to migrate to, was chosen as the one storage system to rule them all.

We won’t describe the entire migration project in this post, but we’ll highlight some of the best parts.

Innovation 1: API abstractions allow us to seamlessly migrate customer data

We know that in code, strong abstractions lead to cleaner interfaces and more flexibility to make changes “under the hood” without disruption. The same is true of organizations. Each of the four storage systems had its own thrift API abstraction, but having four interfaces, some of which (like Terrapin) still required customers to know internal details of the architecture in order to use them (a leaky abstraction), made life difficult for both customers and platform owners.

A diagram might help illustrate the complexity of maintaining four separate key-value storage systems. If you were a customer, which would you choose?

Figure 1: Four separate Key Value Systems at Pinterest, each with their own APIs, set of unique features and underlying architectures, and varying degrees of performance and cost.

We introduced a new API, aptly called the KVStore API, to be the new unified thrift interface that would absorb the rest. Once everyone is on a single unified API built with generality in mind, the platform team has the flexibility to make changes under the hood, even to swap storage engines, without involving customers. This is the ideal state:

Figure 2: The ideal state is a single unified Key-Value interface, reducing the complexity both for customers and for platform owners. When we can consolidate our resources as a company and invest in a single platform, we can move faster and build better.
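
To make this concrete, here is a minimal sketch of what such a unified interface could look like. The class, method names, and signatures are hypothetical illustrations, not Pinterest’s actual KVStore thrift definitions:

```python
# Hypothetical sketch of a unified key-value interface. Names and
# signatures are illustrative, not the actual KVStore thrift API.
from abc import ABC, abstractmethod
from typing import Optional

class KVStoreClient(ABC):
    """One API for every key-value use case. The storage engine behind it
    (e.g., RocksDB via Rockstore) can change without callers noticing."""

    @abstractmethod
    def get(self, dataset: str, key: bytes) -> Optional[bytes]:
        """Point lookup within a logical dataset."""

    @abstractmethod
    def put(self, dataset: str, key: bytes, value: bytes) -> None:
        """Write to a read-write dataset."""
```

The point of the abstraction is that callers name only a logical dataset and a key; nothing in the interface leaks which cluster or engine serves the request.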

The migration from four systems to the ideal state above was split into two phases: the first targeting read-only data, and the second targeting read-write data. Each phase required its own migration strategy to minimize disruption to customers.

Phase 1: Read-only data migration (totally seamless)

The read-only phase was first because it was simpler (immutable data is easier to migrate than mutable data receiving live writes) and because it targeted the majority of customers (about 70% were using Terrapin). Because Terrapin was so prolific and established in our code base, having everyone migrate their call sites to the KVStore API would have taken a ton of time and effort for very little incremental value.

We decided instead to migrate most Terrapin customers seamlessly: no changes were required of users calling Terrapin APIs, but unbeknownst to callers, the Terrapin API service was augmented with an embedded KVStore API library to retrieve data from Rockstore. And because Terrapin was a batch-loaded system, we also found a central base class and rerouted workflows to double-load data into both Rockstore and Terrapin (and eventually we cut Terrapin off).

Figure 3: By introducing a routing layer between the Terrapin APIs and the Terrapin leaf storage, we can achieve a data migration and eliminate the costly and less stable Terrapin storage system for immediate business impact, all without asking customers to take any action. The tradeoff here is the tech debt and layer of indirection: we are now asking customers to clean up their usage of the Terrapin API in order to directly call KVStore API.
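
Here is a minimal sketch of that routing idea, assuming hypothetical class and method names (the real Terrapin and KVStore interfaces differ):

```python
# Hypothetical sketch of the routing layer from Figure 3. All names here
# are illustrative stand-ins for the real Terrapin/KVStore services.
class TerrapinAPIService:
    """Keeps the legacy Terrapin API surface, but serves migrated datasets
    from Rockstore through an embedded KVStore client."""

    def __init__(self, kvstore_client, terrapin_reader, migrated_filesets):
        self.kvstore = kvstore_client           # embedded KVStore API library
        self.terrapin = terrapin_reader         # legacy Terrapin leaf storage
        self.migrated = set(migrated_filesets)  # filesets already double-loaded

    def get(self, fileset: str, key: bytes) -> bytes:
        # Callers are unchanged; the routing decision is invisible to them.
        if fileset in self.migrated:
            return self.kvstore.get(dataset=fileset, key=key)
        return self.terrapin.get(fileset, key)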

Because Rockstore was more performant and cost-efficient than Terrapin, users saw a 30–90% decrease in latency. When we decommissioned the storage infrastructure of Terrapin, the company also saw $7M of annualized savings, all without users needing to lift a finger (with just a few exceptions). The tradeoff is that we now have some tech debt of ensuring that users clean up their code by moving off of deprecated Terrapin APIs and onto KVStore API so that we no longer have a layer of indirection.

Phase 2: Read-write data migration (partially seamless)

The read-write side presented a different picture: there were fewer than 200 use cases to tackle, and the number of call sites was less extreme, but building feature parity for a read-write system (as opposed to a read-only one) involved some serious development. To be on par with UserMetaStore (essentially HBase), Rockstore needed a brand new wide-column format, additional consistency modes, offline snapshot support, and higher durability guarantees.

While the team took the time to develop these features, we decided to “bite the bullet” and ask all users to migrate from UserMetaStore’s API to the KVStore API from the get-go. The benefit of doing this is that it’s a low-risk, low-effort move: thanks again to the power of abstraction, we implemented a reverse proxy so that customers moving to the KVStore API were actually still calling UserMetaStore under the hood. By making this small change now, customers were buying a lasting contract that wouldn’t require such changes again for the foreseeable future.

Figure 4: Instead of taking the same approach as we did with Terrapin in Figure 3, we decided asking customers to migrate their APIs up front made more sense for unifying the read-write storage systems. Once customers moved to our KVStore API abstraction layer, we were free to move their data from UserMetaStore to Rockstore under the hood.
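
A minimal sketch of the reverse-proxy idea, again with hypothetical names; the real service is a thrift API, and the per-dataset backend map here is an assumption for illustration:

```python
# Hypothetical sketch of the reverse proxy from Figure 4. Names and the
# backend map are illustrative, not the actual KVStore API service.
class KVStoreAPIService:
    """Customers call the new KVStore API, but requests are forwarded to
    UserMetaStore until Rockstore reaches feature parity and the backfill
    completes. Flipping a dataset's backend needs no caller changes."""

    def __init__(self, usermetastore_client, rockstore_client, backend_for):
        self.ums = usermetastore_client
        self.rockstore = rockstore_client
        self.backend_for = backend_for  # dataset -> "ums" or "rockstore"

    def get(self, dataset: str, key: bytes) -> bytes:
        if self.backend_for.get(dataset, "ums") == "rockstore":
            return self.rockstore.get(dataset, key)
        return self.ums.get(dataset, key)  # proxied to UserMetaStore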

Some of the biggest challenges were actually not technical. Finding owners of the data was an archeological exercise, and holding hundreds of owners accountable for completing their part was difficult due to competing priorities. But when it was done, and when the Rockstore platform was ready, the team was completely unblocked to backfill the data from UserMetaStore to Rockstore without any customer involvement. We also vowed to make sure all data was attributed to owners going forward.

Innovation 2: A wide-column format eliminated both CPU and network load for large payloads

Some of the most popular Terrapin workloads had an interesting property: use cases would store values consisting of large blobs of thrift structures but only needed to retrieve a very small piece of that data at read time.

At first, these callers would download the huge values that they stored, deserialize them on the client side, and read the property they needed. This very quickly revealed itself to be inefficient in terms of unnecessary network load, throughput degradation, and wasteful client CPU utilization.

The Terrapin solution to this was to introduce an API feature called the “trimmer,” where you could specify a Thrift struct and the fields you wanted from it in the request itself. Terrapin would not only retrieve the object, it would also deserialize it and return only the requested fields. This reduced network bandwidth, which was especially important for cutting cross-AZ traffic costs, but it was worse in terms of both platform cost and leaky abstractions. More CPU utilization meant more machines were needed, and business logic in the platform meant that Terrapin had to know about its customers’ thrift structures. Performance also took a hit, since clients were left waiting for this extra processing time.
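
For illustration, here is a minimal sketch of trimmer-style server-side projection, using JSON as a stand-in for thrift deserialization; the function and its signature are hypothetical, not Terrapin’s actual trimmer API:

```python
import json

def trimmed_get(storage, key: bytes, fields: list[str]) -> dict:
    """Hypothetical trimmer sketch: fetch the whole value, deserialize it
    on the server, and return only the requested fields. JSON stands in
    for thrift; the real trimmer took a thrift struct in the request."""
    raw = storage.get(key)               # the full blob is still read from disk
    obj = json.loads(raw)                # the platform pays the CPU cost here
    return {f: obj[f] for f in fields}   # only these fields cross the network
```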

To solve this in Rockstore and unblock the migration, the team decided against simply re-implementing the trimmer. Instead, we introduced a new file format that accommodates a wide-column access pattern. Instead of storing a binary blob that must be deserialized into a thrift structure, you can store and encode your data structure in a native format that can be retrieved like a key-value pair, using a combination of primary keys and local keys. For example, if you have a struct UserData that maps 30 fields to a user id, instead of storing a single key-value pair of (key: user id, value: UserData), you can store (key: user id, (local key: UserData field 1, local value: UserData value 1), (local key: UserData field 2, local value: UserData value 2)), and so on.

The API is then designed to let you access either the entire row (all columns associated with a user id) or only certain properties (say, UserData fields 3 and 12 for a user id). Under the hood, Rockstore performs a blazing fast range scan or a single-point key-value lookup. This accounted for some of the more extreme performance improvements we ultimately observed. Goodbye, network and CPU costs!
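
A minimal sketch of the wide-column idea, with a plain Python dict standing in for RocksDB; the key encoding and function names are assumptions for illustration, not Rockstore’s actual format:

```python
# Hypothetical wide-column sketch: each (primary key, local key) pair
# becomes its own storage key, so single fields are point lookups and
# whole rows are a prefix range scan.

def column_key(primary_key: str, local_key: str) -> bytes:
    # Encode (primary key, local key) so all columns of a row sort together.
    return f"{primary_key}\x00{local_key}".encode()

# Instead of one blob per user id, each UserData field is its own entry:
store = {
    column_key("user:42", "country"): b"US",
    column_key("user:42", "language"): b"en",
    # ... one entry per UserData field (30 in the example above)
}

def get_columns(store, primary_key: str, local_keys):
    # Fetch only the requested fields: point lookups, no deserialization.
    return {lk: store[column_key(primary_key, lk)] for lk in local_keys}

def get_row(store, primary_key: str):
    # Fetch the whole row: the analogue of a prefix range scan in RocksDB.
    prefix = f"{primary_key}\x00".encode()
    return {k: v for k, v in store.items() if k.startswith(prefix)}
```

Retrieving two fields is then just get_columns(store, "user:42", ["country", "language"]), with no client-side or server-side deserialization at all.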

Innovation 3: A versioning system for batch-loaded, read-only data unblocked instant data migrations between clusters

One of the biggest pain points of the read-only mode of Rockstore was the inability to move data once it was loaded onto a cluster. If customer data grew beyond what was provisioned for it, or if a certain cluster became unstable, it took two weeks and two or three teams to coordinate changes to workflows, reconfigure thrift call sites, and budget time to double-upload, monitor, and read data to and from the new location.

Another pain point of the read-only mode of Rockstore was that it supported exactly two versions, due to how it implemented versioning. This was incompatible with Terrapin requirements, which called for fewer than two versions for cost savings and more than two for critical datasets that require instant on-disk rollback.

The solution to this is what we call “timestamp-based versioning.” Rockstore read-only used to have “round-robin versioning,” where each new version uploaded into the system would be either version one or version two. Once all the partitions of an uploaded version were online, the version map would simply flip. This created the exactly-two-versions constraint. Another constraint, one that bound customers to a specific cluster, was that customers needed to specify a serverset address corresponding to the cluster on which their data lived. Another leaky abstraction! When the data moved, customers needed to make changes to follow it.

In timestamp-based versioning, every upload is assigned a timestamp and registered in a central metastore called the Key-Value Store Manager (KVSM), which coordinates cluster map configurations. Once more, the power of abstraction comes in: by calling KVStore APIs, you as a customer no longer need to know which cluster your data lives on. KVStore figures that out for you using the cluster map configuration.

Not only does this abstraction allow as few as one version or as many as 10 to be stored on disk or in S3 (trading off cost savings against rollback safety), but moving a dataset from one cluster to another is as simple as making a single API call to change the cluster metadata in KVSM and kicking off a new upload. Once the metadata is updated, the new upload is automatically loaded onto the new cluster, and once it is online, all serving maps point requests to that location. Thanks to timestamp-based versioning, two weeks of effort has been reduced to a single API call.
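
A minimal sketch of how timestamp-based versioning and cluster moves could fit together in a KVSM-like metastore; all names, retention logic, and APIs here are assumptions for illustration, not KVSM’s actual design:

```python
import time

class KVSM:
    """Hypothetical sketch of a KVSM-like metastore with timestamp-based
    versioning. Real KVSM semantics and APIs will differ."""

    def __init__(self, versions_to_keep: int = 2):
        self.cluster = {}             # dataset -> cluster for new uploads
        self.versions = {}            # dataset -> {timestamp: cluster}
        self.keep = versions_to_keep  # anywhere from 1 to 10 in the post

    def register_upload(self, dataset: str) -> int:
        ts = int(time.time())
        v = self.versions.setdefault(dataset, {})
        v[ts] = self.cluster[dataset]
        for old in sorted(v)[:-self.keep]:  # prune beyond retention
            del v[old]
        return ts

    def move_dataset(self, dataset: str, new_cluster: str) -> None:
        # The "single API call": new uploads land on the new cluster, and
        # serving maps follow once the first upload there completes.
        self.cluster[dataset] = new_cluster

    def serving_location(self, dataset: str):
        ts = max(self.versions[dataset])      # newest complete version wins
        return ts, self.versions[dataset][ts]  # hidden behind KVStore APIs
```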

Thank you for reading about our journey to a single, abstracted key-value storage system at Pinterest. I’d like to acknowledge all the people who contributed to this critical and technically challenging project: Rajath Prasad, Kangnan Li, Indy Prentice, Harold Cabalic, Madeline Nguyen, Jia Zhan, Neil Enriquez, Ramesh Kalluri, Tim Jones, Gopal Rajpurohit, Guodong Han, Prem Thangamani, Lianghong Xu, Alberto Ordonez Pereira, Kevin Lin, all our partners in SRE, security, and Eng Productivity, and all of our engineering customers at Pinterest, who span teams from ads to homefeed and machine learning to signal platform. None of this would have been possible without the teamwork and collaboration of everyone here.
