Pinterest Visual Signals Infrastructure: Evolution from Lambda to Kappa Architecture

2,050
Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

By Ankit Patel | Software engineer, Content Acquisition and Media Platform


With the growing need for machine learning signals from Pinterest’s huge visual dataset, we decided to take a closer look at our infrastructure that produces and serves these signals. A few parameters we were particularly interested in were signal availability, infra complexity and cost optimization, tech integration, developer velocity, and monitoring. In this post, we will describe our journey from a Lambda architecture to the new real-time signals infrastructure inspired by Kappa architecture.

Background

In order to understand the existing visual signals infrastructure, we need to understand some of the basic content processing systems at Pinterest. Pinterest’s Content Acquisition and Media Platform, formerly known as Video and Image Platform (VIP), is responsible for ingesting, processing, and serving all of Pinterest’s content on every surface of the application. We ingest media at a massive scale every single day. This post will not go into details about the ingestion and serving part, and it will mostly focus on the processing part, as that is where most of the magic happens.

Media is ingested through 50 different pipelines. Pipelines are namespaces in VIP systems, e.g. Pinner uploaded content, crawled images, shopping images, video keyframes, user profile images, etc. Each pipeline maps to custom media processing configurations tailored for the use case it serves.

When we started building our visual signals processing infrastructure, we utilized this existing namespace philosophy and also partitioned our visual signals around pipelines. Pinterest’s homegrown signal processing and serving tech stack is called Galaxy. Namespaces keyed by different entity IDs in Galaxy are called Joins or Galaxies. Each VIP pipeline is mapped to their equivalent Galaxy. Sometimes we combine multiple similar pipelines into a single galaxy, as they are closely related.

Lambda Architecture

Until recently, we used the Lambda architecture illustrated below to compute visual signals from our media content.

As you can see in the diagram above, there are 2 modes to this architecture: online and offline. Let’s discuss the offline architecture first.

  1. A nightly workflow (Hadoop based map-reduce job) is scheduled to run. We calculate the delta unique image signatures from that particular day and create a new partition.
  2. We then create batches of image signatures and enqueue PinLater jobs for processing these batches.
  3. We spin up a GPU cluster with the ML models needed for the signal computation and then start processing the PinLater Jobs. This is an expensive cluster.
  4. We download the image from Amazon S3, run the model inference on them, and then store the output in S3. We would then spin down the GPU cluster to optimize EC2 spend.
  5. The signal is generated in a delimited bytes format with Protobuf encoded values.
  6. We kickoff a Workflow which transforms this output into equivalent Apache Thrift encoding because Pinterest heavily uses Thrift as its wire format and storage for data.
  7. Thrift output is stored in a Parquet columnar Hive table. Downstream batch clients consume this output.
  8. In order to support real time RPC clients, we have a final workflow that uploads the Hive partition to a Rockstore key value pair database.
  9. Galaxy Signal Service provides an RPC API to look up these signals from Rockstore for our online clients.

As you can see, there are a lot of moving pieces in the above design. Even though these processes matured over time and became more stable, VIP team engineers faced frequent issues while on call. Some of the biggest concerns were:

  • Workflows depending on other workflow
  • Consumers have to wait 24 hours for the signals to be available
  • Application logic issues are extremely hard to debug
  • Granular retries don’t exist, it’s all or nothing
  • Delays due to the nature of these systems

As we continued to build additional features that are powered by these machine learning signals, the need for producing these signals faster became a priority across the company. That is where the online mode comes into picture:

  1. We create a PinLater job for each image signature in real time.
  2. In this Job, we call a GPU cluster running the machine learning model inference service to calculate a signal and directly store its result into the key value pair based Rockstore database. GSS (Galaxy Signal Service) would then serve this signal.
  3. We then publish an event through Kafka to notify the downstream consumers about the signal availability.
  4. Downstream clients consume this event and fetch the signal from GSS in real time.

VIP wasn’t the only team in Pinterest that was getting attracted to this new paradigm of signal processing. Other teams became excited about the idea of having their signals calculated in real time. It provided more visibility into the signal generation compared to the black box that is hadoop based workflows. This approach came with multiple benefits like:

  • Better developer experience
  • Easier to debug, test, and deploy changes
  • Granular retries
  • Low latency signals

However, it came with its share of cons as well. The main ones being:

  • Complex operations, like group bys and signal joins, are not possible with a simple event-driven processing framework like Pinlater. We needed a robust stream processing framework like Apache Flink.
  • The extra cost of running a duplicate GPU cluster that processes the same pins as the batch pipeline on a continuous basis.
  • There was no shared infrastructure to address common needs.

Kappa Architecture

Finally, the Signal Platform team at Pinterest saw an opportunity to address this concern for all signal developers and build the next version of signal development infrastructure called “Near-real-time Galaxy,” or simply NRTG. The technologies of choice were Apache Kafka and Apache Flink. Flink seemed like it was specifically designed to address the concerns mentioned above. Given that the Signal Platform team already had a framework in place to build signals on a batch technology using Galaxy Dataflow APIs, extending it to also work on a stream technology was arguably the best way forward without causing massive amounts of refactoring, rewriting, or just reinventing.

The VIP team decided to be among the early adopters of this initiative as our media signals are some of the most upstream in the whole tree of signals at Pinterest, so it would naturally be the easiest to onboard a platform while it is still being built. We scoped out the signals we wanted to experiment with and got started on this mission.

The gist of it is very simple: you would write a simple flink job that computes the signal in streaming (on Apache Flink). NRTG would make this process easy and quick by leveraging standard design patterns and annotations.

Based on the annotations, the NRTG framework mentioned above does most of the heavy lifting away hidden from the signal developers. The configs and the mapping to underlying native Flink is managed by this middleware layer. This makes signal development extremely fast because the developers do not need to learn Flink in detail as they are already familiar with these annotations. Xenon (Flink) platform team at Pinterest provides the infrastructure capabilities to deploy and maintain Flink applications.

Once we onboarded to NRTG, we wanted to turn down our existing batch workflows setup. There are a number of consumers who consume these signals in batch. We had to provide a solution that would work for them from our streaming pipelines. In order to simulate a daily Hive table for our signals, we wrote a simplified workflow that takes a daily dump of our KVStore and transforms it into existing Hive output. No GPU computations were required to recompute the signal values — the data was simply computed in streaming, moved to S3 via a daily dump, and filtered to the correct format. This not only allowed us to save on the GPU cost but also trim down chained complex workflows design into a simple data transformation job. With the FlinkSQL being in active development, we will be able to completely migrate the offline portion from Spark/Hadoop to Flink.

Conclusion

Migration to this new fast-signals infrastructure is the beginning of a great future for Pinterest in signal generation. It allows the signal developers to quickly build signals with a lot less learning curve. Underlying Flink capabilities also support advanced signals design. Even though batch backfill support is a work in progress in NRTG and the signal producers need to adapt outputs to avoid disruption to their consumers, the benefits still outweigh the costs of duplication in the existing lambda infrastructure. NRTG team already has this in the roadmap to offer end to end support by providing Hive integration as part of the framework. Bringing the end to end lifecycle of a signal under one platform would massively benefit the innovation and productizing ideas across different teams at Pinterest. It has reduced the infra complexity, and we are able to leverage cost optimization on GPU and other compute resources. We expect other teams at Pinterest to follow in the same footsteps and boost their developer velocity by moving to a more simple and robust architecture as outlined in this blog.

This project is a joint effort across multiple teams at Pinterest: Video & Image Platform (VIP), Near-real time Galaxy (NRTG), Xenon, Hermes, Rockstore, and Visual Search.

Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.
Tools mentioned in article
Open jobs at Pinterest
Machine Learning Engineer
San Francisco, CA, US; Palo Alto, CA, US; Seattle, WA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our PinFlex landing page to learn more. 

With more than 400 million users around the world and 300 billion ideas saved, Pinterest Machine Learning engineers build personalized experiences to help Pinners create a life they love. With just over 3,000 global employees, our teams are small, mighty, and still growing. At Pinterest, you’ll experience hands-on access to an incredible vault of data and contribute large-scale recommendation systems in ways you won’t find anywhere else.

What you’ll do:

  • Build cutting edge technology using the latest advances in deep learning and machine learning to personalize Pinterest
  • Partner closely with teams across Pinterest to experiment and improve ML models for various product surfaces (Homefeed, Ads, Growth, Shopping, and Search), while gaining knowledge of how ML works in different areas
  • Use data driven methods and leverage the unique properties of our data to improve candidates retrieval
  • Work in a high-impact environment with quick experimentation and product launches
  • Keeping up with industry trends in recommendation systems 

 

What we’re looking for:

  • 2+ years of industry experience applying machine learning methods (e.g., user modeling, personalization, recommender systems, search, ranking, natural language processing, reinforcement learning, and graph representation learning)
  • End-to-end hands-on experience with building data processing pipelines, large scale machine learning systems, and big data technologies (e.g., Hadoop/Spark)
  • Nice to have:
    • M.S. or PhD in Machine Learning or related areas
    • Publications at top ML conferences
    • Expertise in scalable realtime systems that process stream data
    • Passion for applied ML and the Pinterest product

 

#LI-HYBRID
#LI-LA1

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

iOS Engineer, Product
San Francisco, CA, US; New York City, NY, US; Portland, OR, US; Seattle, WA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our PinFlex landing page to learn more. 

We are looking for inquisitive, well-rounded iOS engineers to join our Product engineering teams. Working closely with product managers, designers, and backend engineers, you’ll play an important role in enabling the newest technologies and experiences. You will build robust frameworks & features. You will empower both developers and Pinners alike. You’ll have the opportunity to find creative solutions to thought-provoking problems. Even better, because we covet the kind of courageous thinking that’s required in order for big bets and smart risks to pay off, you’ll be invited to create and drive new initiatives, seeing them from inception through to technical design, implementation, and release.

What you’ll do:

  • Build out Pinner-facing frontend features in iOS to power the future of inspiration on Pinterest
  • Contribute to and lead each step of the product development process, from ideation to implementation to release; from rapidly prototyping, running A/B tests, to architecting and building solutions that can scale to support millions of users
  • Partner with design, product, and backend teams to build end to end functionality
  • Put on your Pinner hat to suggest new product ideas and features
  • Employ automated testing to build features with a high degree of technical quality, taking responsibility for the components and features you develop
  • Grow as an engineer by working with world-class peers on varied and high impact projects

What we’re looking for:

  • Deep understanding of iOS development and best practices in Objective C and/or Swift, e.g. xCode, app states, memory management, etc
  • 2+ years of industry iOS application development experience, building consumer or business facing products
  • Experience in following best practices in writing reliable and maintainable code that may be used by many other engineers
  • Ability to keep up-to-date with new technologies to understand what should be incorporated
  • Strong collaboration and communication skills

Product iOS Engineering teams: 

Creator Incentives 

Home Product

Native Publishing

Search Product

Social Growth

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

Machine Learning Engineer, Core Engi...
San Francisco, CA, US; Palo Alto, CA, US; Seattle, WA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our PinFlex landing page to learn more. 

With more than 400 million users around the world and 300 billion ideas saved, Pinterest Machine Learning engineers build personalized experiences to help Pinners create a life they love. With just over 3,000 global employees, our teams are small, mighty, and still growing. At Pinterest, you’ll experience hands-on access to an incredible vault of data and contribute large-scale recommendation systems in ways you won’t find anywhere else.

What you’ll do:

  • Build cutting edge technology using the latest advances in deep learning and machine learning to personalize Pinterest
  • Partner closely with teams across Pinterest to experiment and improve ML models for various product surfaces (Homefeed, Ads, Growth, Shopping, and Search), while gaining knowledge of how ML works in different areas
  • Use data driven methods and leverage the unique properties of our data to improve candidates retrieval
  • Work in a high-impact environment with quick experimentation and product launches
  • Keeping up with industry trends in recommendation systems 

 

What we’re looking for:

  • 2+ years of industry experience applying machine learning methods (e.g., user modeling, personalization, recommender systems, search, ranking, natural language processing, reinforcement learning, and graph representation learning)
  • End-to-end hands-on experience with building data processing pipelines, large scale machine learning systems, and big data technologies (e.g., Hadoop/Spark)
  • Nice to have:
    • M.S. or PhD in Machine Learning or related areas
    • Publications at top ML conferences
    • Expertise in scalable realtime systems that process stream data
    • Passion for applied ML and the Pinterest product

 

#LI-HYBRID
#LI-LA1

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

Software Engineer, Infrastructure
San Francisco, CA, US; Palo Alto, CA, US; Seattle, WA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our PinFlex landing page to learn more. 

The Pinterest Infrastructure Engineering organization builds, scales, and evolves the systems which the rest of Pinterest Engineering uses to deliver inspiration to the world.  This includes source code management, continuous integration, artifact packaging, continuous deployment, service traffic management, service registration and discovery, as well as holistic observability and the underlying compute runtime and container orchestration.  A collection of platforms and capabilities which accelerate development velocity while protecting Pinterest’s production availability for one of the world’s largest public cloud workloads. 

What you’ll do:

  • Design, develop, and operate large scale, distributed systems and networks
  • Work with Engineering customers to understand new requirements and address them in a scalable and efficient manner
  • Actively work to improve the developer process and experience in all phases from coding to operation

What we’re looking for:

  • 2+ years of industry software engineering experience
  • Experience building & operating large scale distributed systems and/or networks
  • Experience in Python, Java, C++, or Go or another language and a willingness to learn
  • Bonus: Experience deploying and operating large scale workloads on a public cloud footprint

Available Hiring Teams: Cloud Delivery Platform (Infra Eng), Code & Language Runtime (Infra Eng), Traffic (Infra Eng), Cloud Systems (Infra Eng), Online Systems (Data Eng), Key Value Systems (Data Eng), Real Time Analytics (Data Eng), Storage & Caching (Data Eng), ML Serving Platform (Data Eng)

 

#LI-SG1

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

Verified by
Software Engineer
Sourcer
Software Engineer
Talent Brand Manager
Tech Lead, Big Data Platform
Security Software Engineer
You may also like