Pinterest Visual Signals Infrastructure: Evolution from Lambda to Kappa Architecture

Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

By Ankit Patel | Software engineer, Content Acquisition and Media Platform


With the growing need for machine learning signals from Pinterest’s huge visual dataset, we decided to take a closer look at our infrastructure that produces and serves these signals. A few parameters we were particularly interested in were signal availability, infra complexity and cost optimization, tech integration, developer velocity, and monitoring. In this post, we will describe our journey from a Lambda architecture to the new real-time signals infrastructure inspired by Kappa architecture.

Background

In order to understand the existing visual signals infrastructure, we need to understand some of the basic content processing systems at Pinterest. Pinterest’s Content Acquisition and Media Platform, formerly known as Video and Image Platform (VIP), is responsible for ingesting, processing, and serving all of Pinterest’s content on every surface of the application. We ingest media at a massive scale every single day. This post will not go into details about the ingestion and serving part, and it will mostly focus on the processing part, as that is where most of the magic happens.

Media is ingested through 50 different pipelines. Pipelines are namespaces in VIP systems, e.g. Pinner uploaded content, crawled images, shopping images, video keyframes, user profile images, etc. Each pipeline maps to custom media processing configurations tailored for the use case it serves.

When we started building our visual signals processing infrastructure, we utilized this existing namespace philosophy and also partitioned our visual signals around pipelines. Pinterest’s homegrown signal processing and serving tech stack is called Galaxy. Namespaces keyed by different entity IDs in Galaxy are called Joins or Galaxies. Each VIP pipeline is mapped to its equivalent Galaxy. Sometimes we combine multiple closely related pipelines into a single Galaxy.
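To make the pipeline-to-Galaxy mapping concrete, here is a minimal sketch. The pipeline and Galaxy names are hypothetical, chosen only to illustrate that several related pipelines can share one Galaxy; the real configuration format is not described in this post.

```python
# Hypothetical pipeline and Galaxy names, for illustration only.
PIPELINE_TO_GALAXY = {
    "pinner_uploads": "image_galaxy",
    "crawled_images": "image_galaxy",   # related pipelines share one Galaxy
    "shopping_images": "shopping_galaxy",
    "video_keyframes": "video_galaxy",
}


def galaxy_for(pipeline: str) -> str:
    """Return the Galaxy that stores signals for the given pipeline."""
    try:
        return PIPELINE_TO_GALAXY[pipeline]
    except KeyError:
        raise ValueError(f"no Galaxy configured for pipeline {pipeline!r}") from None


def pipelines_in(galaxy: str) -> list:
    """Return all pipelines combined into the given Galaxy, sorted by name."""
    return sorted(p for p, g in PIPELINE_TO_GALAXY.items() if g == galaxy)
```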

Lambda Architecture

Until recently, we used the Lambda architecture illustrated below to compute visual signals from our media content.

As the diagram above shows, this architecture has two modes: online and offline. Let’s discuss the offline mode first.

  1. A nightly workflow (a Hadoop-based MapReduce job) is scheduled to run. We calculate the delta of unique image signatures from that particular day and create a new partition.
  2. We then create batches of image signatures and enqueue PinLater jobs for processing these batches.
  3. We spin up a GPU cluster with the ML models needed for the signal computation and then start processing the PinLater jobs. This is an expensive cluster.
  4. We download the images from Amazon S3, run model inference on them, and store the output in S3. We then spin down the GPU cluster to optimize EC2 spend.
  5. The signal is generated in a delimited bytes format with Protobuf encoded values.
  6. We kick off a workflow that transforms this output into the equivalent Apache Thrift encoding, because Pinterest heavily uses Thrift as its wire and storage format for data.
  7. The Thrift output is stored in a Parquet columnar Hive table. Downstream batch clients consume this output.
  8. In order to support real-time RPC clients, we have a final workflow that uploads the Hive partition to the Rockstore key-value database.
  9. Galaxy Signal Service provides an RPC API to look up these signals from Rockstore for our online clients.
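
The first two steps (computing the daily delta of unique signatures and splitting it into batches) can be sketched in plain Python. This is only an illustration under stated assumptions: the real job runs on Hadoop and enqueues PinLater jobs, and the function names and in-memory sets here are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Set


def delta_signatures(todays: Iterable[str], already_processed: Set[str]) -> List[str]:
    """Unique signatures seen today that have no signal yet, in first-seen order."""
    seen: Set[str] = set()
    delta: List[str] = []
    for sig in todays:
        if sig not in already_processed and sig not in seen:
            seen.add(sig)
            delta.append(sig)
    return delta


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items; each would become one job."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

Because a whole batch is the unit of work in this mode, a single bad image can force the entire batch to be reprocessed, which is one source of the on-call pain described next.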

As you can see, there are a lot of moving pieces in the above design. Even though these processes matured over time and became more stable, VIP team engineers faced frequent issues while on call. Some of the biggest concerns were:

  • Workflows depend on other workflows
  • Consumers have to wait 24 hours for the signals to be available
  • Application logic issues are extremely hard to debug
  • Granular retries don’t exist; it’s all or nothing
  • Delays due to the nature of these systems

As we continued to build additional features powered by these machine learning signals, producing these signals faster became a priority across the company. That is where the online mode comes into the picture:

  1. We create a PinLater job for each image signature in real time.
  2. In this job, we call a GPU cluster running the machine learning model inference service to calculate a signal and store its result directly in the Rockstore key-value database. GSS (Galaxy Signal Service) then serves this signal.
  3. We then publish an event through Kafka to notify the downstream consumers about the signal availability.
  4. Downstream clients consume this event and fetch the signal from GSS in real time.
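
The online flow above can be sketched as follows. The inference call, the key-value store, and the event topic are replaced by in-memory stand-ins, and the class and field names are hypothetical; this is not the PinLater, Rockstore, or Kafka API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class OnlineSignalJob:
    # Stand-in for the call to the GPU-backed model inference service.
    infer: Callable[[str], bytes]
    # Stand-in for the key-value store that the signal service reads from.
    kv_store: Dict[str, bytes] = field(default_factory=dict)
    # Stand-in for the topic that notifies downstream consumers.
    events: List[dict] = field(default_factory=list)

    def run(self, signature: str) -> None:
        """Process one image signature: compute, store, then notify."""
        signal = self.infer(signature)
        self.kv_store[signature] = signal
        # Publish only after the write, so consumers can fetch immediately.
        self.events.append({"signature": signature, "status": "available"})

    def lookup(self, signature: str) -> Optional[bytes]:
        """What a downstream client would fetch after seeing the event."""
        return self.kv_store.get(signature)
```

Publishing the event after the store write is the ordering that lets a consumer react to the event and find the signal already in place.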

VIP wasn’t the only team at Pinterest attracted to this new paradigm of signal processing. Other teams became excited about the idea of having their signals calculated in real time. It provided more visibility into signal generation than the black box of Hadoop-based workflows. This approach came with multiple benefits:

  • Better developer experience
  • Easier to debug, test, and deploy changes
  • Granular retries
  • Low latency signals
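
As an illustration of what granular retries mean in practice, the sketch below retries each item independently instead of failing a whole batch. The helper name and the retry policy (a fixed number of attempts, no backoff) are assumptions made for the example, not a description of PinLater's behavior.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def process_with_retries(
    items: Iterable[str],
    work: Callable[[str], str],
    max_attempts: int = 3,
) -> Tuple[Dict[str, str], List[str]]:
    """Run `work` per item, retrying only the items that fail."""
    done: Dict[str, str] = {}
    failed: List[str] = []
    for item in items:
        for attempt in range(1, max_attempts + 1):
            try:
                done[item] = work(item)
                break  # this item succeeded; move on to the next one
            except Exception:
                if attempt == max_attempts:
                    failed.append(item)  # give up on this item only
    return done, failed
```

One permanently failing item ends up in `failed` while every other item still completes, which is the opposite of the all-or-nothing behavior of the batch workflow.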

However, it came with its share of cons as well. The main ones were:

  • Complex operations, like group bys and signal joins, are not possible with a simple event-driven processing framework like PinLater. We needed a robust stream processing framework like Apache Flink.
  • The extra cost of running a duplicate GPU cluster that processes the same pins as the batch pipeline on a continuous basis.
  • There was no shared infrastructure to address common needs.
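
To see why a signal join needs more than independent per-event jobs, consider this plain-Python sketch. It is not Flink code; the stream names `embedding` and `category` are hypothetical. The point is that the join must keep per-key state across events, which is what a stream processor manages for you.

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, str, object]  # (stream name, key, value)


def join_streams(
    events: Iterable[Event],
    left: str = "embedding",
    right: str = "category",
) -> Iterator[Tuple[str, object, object]]:
    """Emit (key, left value, right value) once both sides have arrived for a key."""
    pending: Dict[str, Dict[str, object]] = {}  # per-key state held between events
    for stream, key, value in events:
        if stream not in (left, right):
            continue  # ignore events from unrelated streams
        state = pending.setdefault(key, {})
        state[stream] = value
        if left in state and right in state:
            yield key, state[left], state[right]
            del pending[key]  # release state once the join has fired
```

An isolated job that handles one event at a time has nowhere to keep `pending`, so it cannot express this operation without building its own state layer.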

Kappa Architecture

Finally, the Signal Platform team at Pinterest saw an opportunity to address this concern for all signal developers and build the next version of signal development infrastructure called “Near-real-time Galaxy,” or simply NRTG. The technologies of choice were Apache Kafka and Apache Flink. Flink seemed like it was specifically designed to address the concerns mentioned above. Given that the Signal Platform team already had a framework in place to build signals on a batch technology using Galaxy Dataflow APIs, extending it to also work on a stream technology was arguably the best way forward without causing massive amounts of refactoring, rewriting, or just reinventing.

The VIP team decided to be among the early adopters of this initiative. Our media signals are some of the most upstream in the whole tree of signals at Pinterest, so they would naturally be the easiest to onboard onto a platform that was still being built. We scoped out the signals we wanted to experiment with and got started on this mission.

The gist of it is very simple: you write a simple Flink job that computes the signal in streaming mode. NRTG makes this process easy and quick by leveraging standard design patterns and annotations.

Based on the annotations, the NRTG framework does most of the heavy lifting, hidden from the signal developers. The configs and the mapping to underlying native Flink are managed by this middleware layer. This makes signal development extremely fast because developers do not need to learn Flink in detail; they are already familiar with these annotations. The Xenon (Flink) platform team at Pinterest provides the infrastructure capabilities to deploy and maintain Flink applications.
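
The actual NRTG annotations are not shown in this post, so the following is only a rough analogy of the idea in Python: a hypothetical `signal` decorator records where a signal reads from and writes to, and a small runner (standing in for the middleware layer) uses that metadata so the developer only writes the compute function. None of these names come from NRTG.

```python
from typing import Callable, Dict, List

# Hypothetical registry that the framework layer would read.
SIGNAL_REGISTRY: Dict[str, dict] = {}


def signal(name: str, source: str, sink: str) -> Callable:
    """Hypothetical annotation: declare a signal's name, input, and output."""
    def decorator(fn: Callable) -> Callable:
        if name in SIGNAL_REGISTRY:
            raise ValueError(f"signal {name!r} is already registered")
        SIGNAL_REGISTRY[name] = {"source": source, "sink": sink, "compute": fn}
        return fn
    return decorator


@signal(name="image_brightness", source="image_events", sink="signal_store")
def image_brightness(pixels: List[int]) -> float:
    """The only code a signal developer writes in this sketch."""
    return sum(pixels) / len(pixels)


def run_signal(name: str, records: List[List[int]]) -> List[float]:
    """Stand-in for the middleware: look up the declaration and apply it."""
    spec = SIGNAL_REGISTRY[name]
    return [spec["compute"](record) for record in records]
```

In a real framework the runner would translate the declaration into a streaming job; here it just maps the function over a list.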

Once we onboarded to NRTG, we wanted to turn down our existing batch workflow setup. However, a number of consumers consume these signals in batch, so we had to provide a solution that would serve them from our streaming pipelines. In order to simulate a daily Hive table for our signals, we wrote a simplified workflow that takes a daily dump of our KVStore and transforms it into the existing Hive output. No GPU computations were required to recompute the signal values: the data was simply computed in streaming, moved to S3 via a daily dump, and filtered to the correct format. This not only allowed us to save on GPU cost but also trimmed a chain of complex workflows down to a simple data transformation job. With FlinkSQL in active development, we will be able to completely migrate the offline portion from Spark/Hadoop to Flink.
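
A minimal sketch of that dump-and-transform step is shown below. It assumes, purely for illustration, that each dumped entry carries its signal bytes and a last-updated timestamp, and that the daily partition contains the entries updated on that UTC day; the real dump format, filter, and Hive schema are not described in this post.

```python
from datetime import date, datetime, timezone
from typing import Dict, List, Tuple


def dump_to_partition(
    kv_dump: Dict[str, Tuple[bytes, float]],  # key -> (signal, updated_at epoch seconds)
    day: date,
) -> List[dict]:
    """Turn a key-value dump into rows for one daily partition.

    No model inference happens here: the values were already computed in
    streaming, so this is only a filter and a reshaping step.
    """
    rows: List[dict] = []
    for signature, (value, updated_at) in sorted(kv_dump.items()):
        updated_day = datetime.fromtimestamp(updated_at, tz=timezone.utc).date()
        if updated_day == day:
            rows.append({"signature": signature, "signal": value, "dt": day.isoformat()})
    return rows
```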

Conclusion

Migration to this new fast-signals infrastructure is the beginning of a great future for signal generation at Pinterest. It allows signal developers to build signals quickly, with a much gentler learning curve. The underlying Flink capabilities also support advanced signal designs. Batch backfill support is still a work in progress in NRTG, and signal producers need to adapt their outputs to avoid disrupting their consumers, but the benefits still outweigh the costs of duplication in the existing Lambda infrastructure. The NRTG team already has end-to-end support on its roadmap, in the form of Hive integration as part of the framework. Bringing the end-to-end lifecycle of a signal under one platform would greatly benefit innovation and the productizing of ideas across different teams at Pinterest. The migration has reduced infra complexity, and we are able to leverage cost optimization on GPU and other compute resources. We expect other teams at Pinterest to follow in the same footsteps and boost their developer velocity by moving to a simpler and more robust architecture, as outlined in this blog.

This project is a joint effort across multiple teams at Pinterest: Video & Image Platform (VIP), Near-real-time Galaxy (NRTG), Xenon, Hermes, Rockstore, and Visual Search.
