By Ankit Patel | Software engineer, Content Acquisition and Media Platform
With the growing need for machine learning signals from Pinterest’s huge visual dataset, we decided to take a closer look at our infrastructure that produces and serves these signals. A few parameters we were particularly interested in were signal availability, infra complexity and cost optimization, tech integration, developer velocity, and monitoring. In this post, we will describe our journey from a Lambda architecture to the new real-time signals infrastructure inspired by Kappa architecture.
In order to understand the existing visual signals infrastructure, we need to understand some of the basic content processing systems at Pinterest. Pinterest’s Content Acquisition and Media Platform, formerly known as Video and Image Platform (VIP), is responsible for ingesting, processing, and serving all of Pinterest’s content on every surface of the application. We ingest media at a massive scale every single day. This post will not go into details about the ingestion and serving part, and it will mostly focus on the processing part, as that is where most of the magic happens.
Media is ingested through 50 different pipelines. Pipelines are namespaces in VIP systems, e.g. Pinner uploaded content, crawled images, shopping images, video keyframes, user profile images, etc. Each pipeline maps to custom media processing configurations tailored for the use case it serves.
When we started building our visual signals processing infrastructure, we utilized this existing namespace philosophy and also partitioned our visual signals around pipelines. Pinterest’s homegrown signal processing and serving tech stack is called Galaxy. Namespaces keyed by different entity IDs in Galaxy are called Joins or Galaxies. Each VIP pipeline is mapped to their equivalent Galaxy. Sometimes we combine multiple similar pipelines into a single galaxy, as they are closely related.
Until recently, we used the Lambda architecture illustrated below to compute visual signals from our media content.
As you can see in the diagram above, there are 2 modes to this architecture: online and offline. Let’s discuss the offline architecture first.
- A nightly workflow (Hadoop based map-reduce job) is scheduled to run. We calculate the delta unique image signatures from that particular day and create a new partition.
- We then create batches of image signatures and enqueue PinLater jobs for processing these batches.
- We spin up a GPU cluster with the ML models needed for the signal computation and then start processing the PinLater Jobs. This is an expensive cluster.
- We download the image from Amazon S3, run the model inference on them, and then store the output in S3. We would then spin down the GPU cluster to optimize EC2 spend.
- The signal is generated in a delimited bytes format with Protobuf encoded values.
- We kickoff a Workflow which transforms this output into equivalent Apache Thrift encoding because Pinterest heavily uses Thrift as its wire format and storage for data.
- Thrift output is stored in a Parquet columnar Hive table. Downstream batch clients consume this output.
- In order to support real time RPC clients, we have a final workflow that uploads the Hive partition to a Rockstore key value pair database.
- Galaxy Signal Service provides an RPC API to look up these signals from Rockstore for our online clients.
As you can see, there are a lot of moving pieces in the above design. Even though these processes matured over time and became more stable, VIP team engineers faced frequent issues while on call. Some of the biggest concerns were:
- Workflows depending on other workflow
- Consumers have to wait 24 hours for the signals to be available
- Application logic issues are extremely hard to debug
- Granular retries don’t exist, it’s all or nothing
- Delays due to the nature of these systems
As we continued to build additional features that are powered by these machine learning signals, the need for producing these signals faster became a priority across the company. That is where the online mode comes into picture:
- We create a PinLater job for each image signature in real time.
- In this Job, we call a GPU cluster running the machine learning model inference service to calculate a signal and directly store its result into the key value pair based Rockstore database. GSS (Galaxy Signal Service) would then serve this signal.
- We then publish an event through Kafka to notify the downstream consumers about the signal availability.
- Downstream clients consume this event and fetch the signal from GSS in real time.
VIP wasn’t the only team in Pinterest that was getting attracted to this new paradigm of signal processing. Other teams became excited about the idea of having their signals calculated in real time. It provided more visibility into the signal generation compared to the black box that is hadoop based workflows. This approach came with multiple benefits like:
- Better developer experience
- Easier to debug, test, and deploy changes
- Granular retries
- Low latency signals
However, it came with its share of cons as well. The main ones being:
- Complex operations, like group bys and signal joins, are not possible with a simple event-driven processing framework like Pinlater. We needed a robust stream processing framework like Apache Flink.
- The extra cost of running a duplicate GPU cluster that processes the same pins as the batch pipeline on a continuous basis.
- There was no shared infrastructure to address common needs.
Finally, the Signal Platform team at Pinterest saw an opportunity to address this concern for all signal developers and build the next version of signal development infrastructure called “Near-real-time Galaxy,” or simply NRTG. The technologies of choice were Apache Kafka and Apache Flink. Flink seemed like it was specifically designed to address the concerns mentioned above. Given that the Signal Platform team already had a framework in place to build signals on a batch technology using Galaxy Dataflow APIs, extending it to also work on a stream technology was arguably the best way forward without causing massive amounts of refactoring, rewriting, or just reinventing.
The VIP team decided to be among the early adopters of this initiative as our media signals are some of the most upstream in the whole tree of signals at Pinterest, so it would naturally be the easiest to onboard a platform while it is still being built. We scoped out the signals we wanted to experiment with and got started on this mission.
The gist of it is very simple: you would write a simple flink job that computes the signal in streaming (on Apache Flink). NRTG would make this process easy and quick by leveraging standard design patterns and annotations.
Based on the annotations, the NRTG framework mentioned above does most of the heavy lifting away hidden from the signal developers. The configs and the mapping to underlying native Flink is managed by this middleware layer. This makes signal development extremely fast because the developers do not need to learn Flink in detail as they are already familiar with these annotations. Xenon (Flink) platform team at Pinterest provides the infrastructure capabilities to deploy and maintain Flink applications.
Once we onboarded to NRTG, we wanted to turn down our existing batch workflows setup. There are a number of consumers who consume these signals in batch. We had to provide a solution that would work for them from our streaming pipelines. In order to simulate a daily Hive table for our signals, we wrote a simplified workflow that takes a daily dump of our KVStore and transforms it into existing Hive output. No GPU computations were required to recompute the signal values — the data was simply computed in streaming, moved to S3 via a daily dump, and filtered to the correct format. This not only allowed us to save on the GPU cost but also trim down chained complex workflows design into a simple data transformation job. With the FlinkSQL being in active development, we will be able to completely migrate the offline portion from Spark/Hadoop to Flink.
Migration to this new fast-signals infrastructure is the beginning of a great future for Pinterest in signal generation. It allows the signal developers to quickly build signals with a lot less learning curve. Underlying Flink capabilities also support advanced signals design. Even though batch backfill support is a work in progress in NRTG and the signal producers need to adapt outputs to avoid disruption to their consumers, the benefits still outweigh the costs of duplication in the existing lambda infrastructure. NRTG team already has this in the roadmap to offer end to end support by providing Hive integration as part of the framework. Bringing the end to end lifecycle of a signal under one platform would massively benefit the innovation and productizing ideas across different teams at Pinterest. It has reduced the infra complexity, and we are able to leverage cost optimization on GPU and other compute resources. We expect other teams at Pinterest to follow in the same footsteps and boost their developer velocity by moving to a more simple and robust architecture as outlined in this blog.
This project is a joint effort across multiple teams at Pinterest: Video & Image Platform (VIP), Near-real time Galaxy (NRTG), Xenon, Hermes, Rockstore, and Visual Search.