Pinterest Visual Signals Infrastructure: Evolution from Lambda to Kappa Architecture

Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

By Ankit Patel | Software engineer, Content Acquisition and Media Platform


With the growing need for machine learning signals from Pinterest’s huge visual dataset, we decided to take a closer look at our infrastructure that produces and serves these signals. A few parameters we were particularly interested in were signal availability, infra complexity and cost optimization, tech integration, developer velocity, and monitoring. In this post, we will describe our journey from a Lambda architecture to the new real-time signals infrastructure inspired by Kappa architecture.

Background

In order to understand the existing visual signals infrastructure, we need to understand some of the basic content processing systems at Pinterest. Pinterest’s Content Acquisition and Media Platform, formerly known as Video and Image Platform (VIP), is responsible for ingesting, processing, and serving all of Pinterest’s content on every surface of the application. We ingest media at a massive scale every single day. This post will not go into details about the ingestion and serving part, and it will mostly focus on the processing part, as that is where most of the magic happens.

Media is ingested through 50 different pipelines. Pipelines are namespaces in VIP systems, e.g. Pinner uploaded content, crawled images, shopping images, video keyframes, user profile images, etc. Each pipeline maps to custom media processing configurations tailored for the use case it serves.

When we started building our visual signals processing infrastructure, we utilized this existing namespace philosophy and also partitioned our visual signals around pipelines. Pinterest’s homegrown signal processing and serving tech stack is called Galaxy. Namespaces keyed by different entity IDs in Galaxy are called Joins or Galaxies. Each VIP pipeline is mapped to its equivalent Galaxy. Sometimes we combine multiple closely related pipelines into a single Galaxy.
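As a rough illustration of this mapping, the sketch below models pipelines and Galaxies as a plain lookup table. All pipeline and Galaxy names here are hypothetical; the post does not list the real identifiers, and the actual configuration format is not described.

```python
from typing import Dict, List

# Hypothetical identifiers: the real pipeline and Galaxy names are not given in the post.
PIPELINE_TO_GALAXY: Dict[str, str] = {
    "pinner_upload": "image_galaxy",
    "crawled_image": "image_galaxy",   # closely related pipelines share one Galaxy
    "shopping_image": "shopping_galaxy",
    "video_keyframe": "video_galaxy",
}


def galaxy_for(pipeline: str) -> str:
    """Return the Galaxy namespace that holds signals for a VIP pipeline."""
    try:
        return PIPELINE_TO_GALAXY[pipeline]
    except KeyError:
        raise ValueError(f"no Galaxy configured for pipeline {pipeline!r}") from None


def pipelines_in(galaxy: str) -> List[str]:
    """List every pipeline that was folded into the given Galaxy, sorted by name."""
    return sorted(p for p, g in PIPELINE_TO_GALAXY.items() if g == galaxy)
```

With this table, `pipelines_in("image_galaxy")` shows which pipelines were combined into one Galaxy.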

Lambda Architecture

Until recently, we used the Lambda architecture illustrated below to compute visual signals from our media content.

As you can see in the diagram above, there are two modes to this architecture: online and offline. Let’s discuss the offline architecture first.

  1. A nightly workflow (a Hadoop-based MapReduce job) is scheduled to run. We calculate the delta of unique image signatures from that particular day and create a new partition.
  2. We then create batches of image signatures and enqueue PinLater jobs for processing these batches.
  3. We spin up a GPU cluster with the ML models needed for the signal computation and then start processing the PinLater jobs. This is an expensive cluster.
  4. We download the images from Amazon S3, run model inference on them, and then store the output in S3. We then spin down the GPU cluster to optimize EC2 spend.
  5. The signal is generated in a delimited bytes format with Protobuf-encoded values.
  6. We kick off a workflow that transforms this output into the equivalent Apache Thrift encoding, because Pinterest heavily uses Thrift as its wire and storage format for data.
  7. The Thrift output is stored in a Parquet columnar Hive table. Downstream batch clients consume this output.
  8. In order to support real-time RPC clients, we have a final workflow that uploads the Hive partition to a Rockstore key-value database.
  9. Galaxy Signal Service provides an RPC API to look up these signals from Rockstore for our online clients.
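Steps 1 and 2 above can be sketched in a few lines. This is a minimal in-memory illustration, not the actual Hadoop workflow: the function names, the use of plain strings for image signatures, and the batch size are all assumptions made for the example.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Set


def delta_signatures(todays: Iterable[str], already_processed: Set[str]) -> List[str]:
    """Unique signatures seen today that have no signal yet, in first-seen order."""
    seen: Set[str] = set()
    delta: List[str] = []
    for sig in todays:
        if sig not in already_processed and sig not in seen:
            seen.add(sig)
            delta.append(sig)
    return delta


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Split signatures into fixed-size batches; each batch would become one queued job."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Hypothetical data: "b" already has a signal, "a" appears twice today.
new_today = delta_signatures(["a", "b", "a", "c", "d"], already_processed={"b"})
jobs = list(batches(new_today, size=2))
```

Because every batch is enqueued only after the nightly delta is computed, the whole chain inherits the daily cadence, which is where the 24-hour wait discussed next comes from.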

As you can see, there are a lot of moving pieces in the above design. Even though these processes matured over time and became more stable, VIP team engineers faced frequent issues while on call. Some of the biggest concerns were:

  • Workflows depending on other workflows
  • Consumers have to wait 24 hours for the signals to be available
  • Application logic issues are extremely hard to debug
  • Granular retries don’t exist; it’s all or nothing
  • Delays due to the nature of these systems

As we continued to build additional features powered by these machine learning signals, producing these signals faster became a priority across the company. That is where the online mode comes into the picture:

  1. We create a PinLater job for each image signature in real time.
  2. In this job, we call a GPU cluster running the machine learning model inference service to calculate a signal and store its result directly in the Rockstore key-value database. GSS (Galaxy Signal Service) then serves this signal.
  3. We then publish an event through Kafka to notify the downstream consumers about the signal availability.
  4. Downstream clients consume this event and fetch the signal from GSS in real time.
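The four online steps can be sketched as follows. This is an in-memory stand-in under stated assumptions: the dictionary plays the role of the key-value store, the list plays the role of the Kafka topic, and `infer` is a hypothetical callable in place of the GPU inference service. None of the class, method, or key names come from Pinterest’s systems.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class OnlineSignalJob:
    infer: Callable[[str], bytes]                         # stand-in for the inference service call
    store: Dict[str, bytes] = field(default_factory=dict) # stand-in for the key-value database
    events: List[dict] = field(default_factory=list)      # stand-in for the notification topic

    def run(self, signature: str, signal_name: str) -> None:
        """One job per image signature: compute, store, then notify."""
        value = self.infer(signature)
        self.store[f"{signal_name}:{signature}"] = value
        # Publish only after the write, so a consumer reacting to the event finds the value.
        self.events.append({"signal": signal_name, "signature": signature})

    def lookup(self, signature: str, signal_name: str) -> Optional[bytes]:
        """Stand-in for the serving-side lookup; returns None if the signal is missing."""
        return self.store.get(f"{signal_name}:{signature}")


job = OnlineSignalJob(infer=lambda sig: sig.upper().encode())
job.run("abc", "embedding")

# A downstream client consumes each event and fetches the signal it announces.
fetched = [job.lookup(e["signature"], e["signal"]) for e in job.events]
```

Since each signature is its own job, a failure can be retried for that one image without rerunning anything else.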

VIP wasn’t the only team at Pinterest attracted to this new paradigm of signal processing. Other teams became excited about the idea of having their signals calculated in real time. It provided more visibility into signal generation compared to the black box of Hadoop-based workflows. This approach came with multiple benefits:

  • Better developer experience
  • Easier to debug, test, and deploy changes
  • Granular retries
  • Low latency signals

However, it came with its share of cons as well. The main ones were:

  • Complex operations, like group-bys and signal joins, are not possible with a simple event-driven processing framework like PinLater. We needed a robust stream processing framework like Apache Flink.
  • The extra cost of running a duplicate GPU cluster that processes the same pins as the batch pipeline on a continuous basis.
  • There was no shared infrastructure to address common needs.

Kappa Architecture

Finally, the Signal Platform team at Pinterest saw an opportunity to address this concern for all signal developers and build the next version of signal development infrastructure called “Near-real-time Galaxy,” or simply NRTG. The technologies of choice were Apache Kafka and Apache Flink. Flink seemed like it was specifically designed to address the concerns mentioned above. Given that the Signal Platform team already had a framework in place to build signals on a batch technology using Galaxy Dataflow APIs, extending it to also work on a stream technology was arguably the best way forward without causing massive amounts of refactoring, rewriting, or just reinventing.

The VIP team decided to be among the early adopters of this initiative. Our media signals are some of the most upstream in the whole tree of signals at Pinterest, so they would naturally be the easiest to onboard onto a platform that was still being built. We scoped out the signals we wanted to experiment with and got started on this mission.

The gist of it is very simple: you write a simple Flink job that computes the signal in streaming. NRTG makes this process easy and quick by leveraging standard design patterns and annotations.

Based on the annotations, the NRTG framework does most of the heavy lifting, hidden from the signal developers. The configs and the mapping to underlying native Flink are managed by this middleware layer. This makes signal development extremely fast because developers do not need to learn Flink in detail; they are already familiar with these annotations. The Xenon (Flink) platform team at Pinterest provides the infrastructure capabilities to deploy and maintain Flink applications.
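The post does not show NRTG’s annotations, so the following is only a generic illustration of the annotation-driven pattern, written as a Python decorator rather than anything Flink-specific. The `signal` decorator, the registry, the source name, and the example signal are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

# Framework side: a registry filled in by the decorator (hypothetical design).
_REGISTRY: Dict[str, Tuple[str, Callable[[dict], object]]] = {}


def signal(name: str, source: str):
    """Declare a function as the computation for signal `name` over stream `source`."""
    def register(fn: Callable[[dict], object]) -> Callable[[dict], object]:
        _REGISTRY[name] = (source, fn)
        return fn
    return register


# Developer side: only the per-event logic plus a declaration of what it is.
@signal(name="image_area", source="image_events")
def image_area(event: dict) -> int:
    return event["width"] * event["height"]


def run_stream(source: str, events: Iterable[dict]) -> Iterator[Tuple[str, str, object]]:
    """Framework side: route each event to every signal registered on its source."""
    for event in events:
        for name, (src, fn) in _REGISTRY.items():
            if src == source:
                yield (name, event["signature"], fn(event))


results = list(run_stream("image_events", [{"signature": "abc", "width": 4, "height": 3}]))
```

The design point is the split: the developer writes `image_area` and its declaration, while wiring, routing, and deployment stay in the framework layer.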

Once we onboarded to NRTG, we wanted to turn down our existing batch workflows. However, a number of consumers read these signals in batch, so we had to provide a solution that would serve them from our streaming pipelines. In order to simulate a daily Hive table for our signals, we wrote a simplified workflow that takes a daily dump of our KVStore and transforms it into the existing Hive output. No GPU computations were required to recompute the signal values: the data was simply computed in streaming, moved to S3 via a daily dump, and filtered to the correct format. This not only allowed us to save on GPU cost but also trimmed a chain of complex workflows down to a simple data transformation job. With FlinkSQL in active development, we will be able to completely migrate the offline portion from Spark/Hadoop to Flink.
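A minimal sketch of that transformation step is below. It assumes each dumped record carries a key, a value, and a `computed_at` Unix timestamp, and that the daily table is partitioned by a `dt` date string; these field names and the filter-by-day rule are assumptions for illustration, since the post does not describe the dump schema.

```python
from datetime import date, datetime, timezone
from typing import Iterable, List


def daily_partition(dump: Iterable[dict], day: date) -> List[dict]:
    """Keep records computed on `day` (UTC) and reshape them into table rows.

    No inference runs here: values were already computed in streaming.
    """
    rows: List[dict] = []
    for record in dump:
        computed = datetime.fromtimestamp(record["computed_at"], tz=timezone.utc).date()
        if computed == day:
            rows.append({
                "dt": day.isoformat(),       # hypothetical partition column
                "signature": record["key"],
                "signal": record["value"],
            })
    return rows


def _ts(y: int, m: int, d: int, h: int) -> float:
    """Helper for the example: Unix timestamp for a UTC date and hour."""
    return datetime(y, m, d, h, tzinfo=timezone.utc).timestamp()


# Hypothetical dump with one record per day.
dump = [
    {"key": "abc", "value": b"\x01", "computed_at": _ts(2021, 1, 1, 9)},
    {"key": "def", "value": b"\x02", "computed_at": _ts(2021, 1, 2, 9)},
]
rows = daily_partition(dump, date(2021, 1, 2))
```

The job is a pure filter and reshape, which is why it can replace the earlier chain of dependent workflows without touching the GPU cluster.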

Conclusion

Migration to this new fast-signals infrastructure is the beginning of a great future for signal generation at Pinterest. It allows signal developers to build signals quickly with a much gentler learning curve, and the underlying Flink capabilities also support advanced signal designs. Batch backfill support is still a work in progress in NRTG, and signal producers need to adapt their outputs to avoid disrupting their consumers, but the benefits still outweigh the costs of duplication in the existing Lambda infrastructure. The NRTG team already has end-to-end support on its roadmap, in the form of Hive integration as part of the framework. Bringing the end-to-end lifecycle of a signal under one platform would massively benefit innovation and the productizing of ideas across different teams at Pinterest. The migration has reduced infra complexity, and we are able to leverage cost optimization on GPU and other compute resources. We expect other teams at Pinterest to follow in the same footsteps and boost their developer velocity by moving to a simpler and more robust architecture as outlined in this blog.

This project is a joint effort across multiple teams at Pinterest: Video & Image Platform (VIP), Near-real time Galaxy (NRTG), Xenon, Hermes, Rockstore, and Visual Search.
