Unified Flink Source at Pinterest: Streaming Data Processing

Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

By Lu Niu & Chen Qin | Software Engineers, Stream Processing Platform Team


To best serve Pinners, creators, and advertisers, Pinterest leverages Flink as its stream processing engine. Flink is a data processing engine for stateful computation over data streams. It provides rich streaming APIs, exactly-once support, and state checkpointing, which are essential for building stable and scalable streaming applications. Today, Flink is widely used at companies like Alibaba, Netflix, and Uber for mission-critical use cases.

Xenon is the Flink-based stream processing platform at Pinterest. Its mission is to build a reliable, scalable, easy-to-use, and efficient stream platform to enable data-driven products and timely decision making. The platform includes:

  • Reliable and up-to-date Flink compute engine
  • Improved Pinterest dev velocity on stream application development
  • Platform reliability and efficiency
  • Security & Compliance
  • Documentation, User Education, and Community

Xenon has enabled several critical use cases, including:

  • Ads real-time ingestion, spending, and reporting — Calculate spending against budget limits in real time to quickly adjust budget pacing and update advertisers with more timely reporting results.
  • Fast signals — Make content signals available quickly after content creation and use these signals in ML pipelines for a customized user experience.
  • Realtime Trust & Safety — Reduce levels of unsafe content as close to content creation time as possible.
  • Content activation — Distribute fresh creator content and surface engagement metrics to creators so they can refine their content with minimal feedback delay.
  • Shopping — Deliver a trustworthy shopping product experience by updating product metadata in near real time.
  • Experimentation — Accurately deliver metrics to engineers for faster experiment setup, verification, and evaluation.

Streaming Data Architecture

Figure 1: the setup of streaming data infrastructure at Pinterest

  1. Singer, a highly performant logging agent, uploads log messages from serving hosts to Kafka. Singer supports multiple log formats and guarantees at-least-once delivery for log messages.
  2. Merced, a variant of Secor, moves data from Kafka to S3. Merced guarantees exactly-once message persistence from Kafka to S3.
  3. Most of our Flink applications consume from Kafka and output to Kafka, Druid, or RocksStore based on different use cases.

Although all our use cases consume data from Kafka, accessing Kafka alone cannot satisfy all user requirements:

  1. Some use cases need to access historical data; however, data in Kafka has a short retention, ranging from less than eight hours to three days.
  2. We require all use cases to go through load testing before going into production. Simulating load by rewinding Kafka doesn’t scale.
  3. Replaying historical data via Kafka is one option, but that comes with added operational costs and a 20x increase in infrastructure costs.

Thanks to Merced, we have historical data on hand. This lets us provide a single API that concatenates historical data with real-time data — the concept of an unlimited log. Users can seek to any offset or timestamp without worrying about which storage system holds the data.
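The seek logic behind an unlimited log can be sketched as a simple routing decision: timestamps older than the Kafka retention horizon are served from the S3 archive (Merced output), and everything newer comes from Kafka. This is a minimal illustration only; the class and method names are hypothetical, not Pinterest's actual API.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the "unlimited log" seek decision. Data older than
// the Kafka retention horizon must come from the S3 archive written by Merced;
// newer data is read directly from Kafka.
class UnifiedLogSeek {

    public enum Backend { S3_ARCHIVE, KAFKA }

    private final Duration kafkaRetention;

    public UnifiedLogSeek(Duration kafkaRetention) {
        this.kafkaRetention = kafkaRetention;
    }

    /** Pick the storage system that still holds data for the requested timestamp. */
    public Backend backendFor(Instant seekTimestamp, Instant now) {
        Instant retentionHorizon = now.minus(kafkaRetention);
        return seekTimestamp.isBefore(retentionHorizon)
                ? Backend.S3_ARCHIVE
                : Backend.KAFKA;
    }
}
```

Because Merced preserves the original encoding, ordering, and partitioning, a consumer can cross this boundary without seeing any difference in the data itself.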

This design brings several benefits:

  1. The encoding and schema of a topic are the same in Merced (aka bounded stream) and Kafka (aka unbounded stream). No extra effort is needed to convert them into a consistent view.
  2. Merced stores the data on S3 and keeps the original event ordering and partitions at a small fraction of the infra cost. Replaying data through Merced acts just like reading the original data from Kafka.

Features in Unified Source

One API for Both Realtime and Historical Data

Here is a brief summary of the implementation:

Figure 2: Unified Source Implementation

  1. UnifiedSource extends RichParallelSourceFunction. Each SubTask runs one instance of UnifiedSource, and each instance starts both FlinkKafkaConsumer and MercedFileSource. FlinkKafkaConsumer is provided by Flink out of the box; MercedFileSource transforms Merced output into stream format.
  2. During partition scheduling, the FlinkKafkaConsumer and MercedFileSource in a single SubTask are guaranteed to receive data belonging to the same partition, which enables a seamless transition from reading files to reading from Kafka.
  3. To collect all files belonging to the same partition, MercedFileSource listens to a Kafka topic published by the Merced notification system. This also lets us handle late-arriving events.
  4. Merced outputs are hourly partitioned, and the file naming encodes the Kafka creation time and partition. Therefore, MercedFileSource is able to read files and recreate the same event ordering and partitioning.
  5. UnifiedSource follows Flink best practices to generate watermarks from Kafka. The API forces users to provide a timestamp extractor, which is used by KafkaSource to generate watermarks from source. In this way, we standardize watermark generation across Flink applications. Similarly, MercedFileSource combines watermark updates of multiple partitions into one single watermark stream.
  6. Merced observes late arriving events in Kafka and writes back to previous data partitions that have already been consumed by UnifiedSource.
  7. Sometimes event time skew across multiple Merced partitions leads to an ever-growing watermark gap, which in turn puts pressure on downstream checkpointing. We implemented watermark synchronization across subtasks using GlobalAggregateManager to keep watermarks within a global range.
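The two watermark rules above (combining per-partition watermarks into one stream, and clamping subtasks to a global range) can be sketched with plain Java. This is an illustrative model of the logic only, not Pinterest's implementation; in a real job the global minimum would be maintained via Flink's GlobalAggregateManager.

```java
import java.util.Map;

// Illustrative model of the watermark handling described above.
class WatermarkCoordinator {

    /** Rule 5: a subtask's stream watermark is the minimum across the
     *  partitions it reads, so no partition's events are marked late early. */
    public static long combine(Map<Integer, Long> partitionWatermarks) {
        return partitionWatermarks.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(Long.MIN_VALUE);
    }

    /** Rule 7: a subtask may keep advancing only while it stays within
     *  maxGapMs of the global minimum watermark, bounding the gap that
     *  would otherwise build up downstream buffering and checkpoint pressure. */
    public static boolean mayAdvance(long localWatermark, long globalMinWatermark, long maxGapMs) {
        return localWatermark - globalMinWatermark <= maxGapMs;
    }
}
```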

Traffic Control & Sources Synchronization

Unlike processing unbounded streams, processing and reprocessing historical data require extra features like traffic control and source synchronization to achieve stable operation.

Here is a simplified use case to explain this problem:

A product team wants near-real-time insight into how audiences view Pins posted within the last seven days. The team builds a Flink job on XenonUnifiedSource that consumes both the published-Pins Kafka logs and the Pin-views Kafka logs. For the job to produce meaningful results, the unified source must first pull the last seven days of published Pins in sequence.

Using a plain file source or Kafka source here raises two issues:

The views topic is much larger than the published-Pins topic, so it consumed most of the bandwidth and slowed down the backfill of published Pins. Because the pipeline was written in an event-time fashion, this slowed the progression of the low watermark in the inner-join operator. Over time, more and more events were buffered in the inner join waiting to be cleared. This caused backpressure and checkpointing failures.

In response to the traffic control issue, we developed rate limiting at the per-topic level. The Pin-views topic was throttled to a lower rate, making room for the published-Pins topic to progress quickly.
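Per-topic throttling of this kind is commonly built on a token bucket. The sketch below is a hypothetical, simplified stand-in for the topic-level rate limiter described above (names are illustrative): the oversized views topic would be constructed with a smaller permit rate than the published-Pins topic.

```java
// Hypothetical per-topic token-bucket rate limiter, in the spirit of the
// topic-level throttling described above. One permit = one record read.
class TopicRateLimiter {
    private final double permitsPerSecond;
    private double availablePermits;
    private long lastRefillNanos;

    public TopicRateLimiter(double permitsPerSecond, long nowNanos) {
        this.permitsPerSecond = permitsPerSecond;
        this.availablePermits = permitsPerSecond; // start with one second of budget
        this.lastRefillNanos = nowNanos;
    }

    /** Returns true if one record may be read from the topic at time nowNanos. */
    public boolean tryAcquire(long nowNanos) {
        // Refill proportionally to elapsed time, capped at one second of budget.
        double elapsedSeconds = (nowNanos - lastRefillNanos) / 1e9;
        availablePermits = Math.min(permitsPerSecond,
                availablePermits + elapsedSeconds * permitsPerSecond);
        lastRefillNanos = nowNanos;
        if (availablePermits >= 1.0) {
            availablePermits -= 1.0;
            return true;
        }
        return false;
    }
}
```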

The views topic should “wait” until the entire seven days of published Pins is complete; otherwise, a low match ratio is almost certain. In Flink terminology, the watermark of the published-Pins topic should progress to “current” (the time the job is launched) before the watermark of the views topic progresses beyond “current.”

In response to the synchronization issue, we developed an allreduce-based watermark synchronization implementation that periodically updates each topic’s rate-limiter threshold. The views source’s rate limiter is granted quota to pull from its Kafka topic only when watermarks from all subtasks reach “current.”
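The gating rule above can be modeled in a few lines: reduce the subtask watermarks of the primary (published-Pins) source to a global minimum, and grant the secondary (views) topic its quota only once that minimum passes the job launch time. This is an illustrative sketch under assumed names, not the actual implementation.

```java
import java.util.Collection;

// Sketch of allreduce-style source synchronization: the secondary topic is
// held at zero quota until every subtask of the primary source catches up
// to "current" (the job launch time).
class SourceSynchronizer {

    /** Allreduce step: the global watermark is the minimum across all subtasks. */
    public static long globalWatermark(Collection<Long> subtaskWatermarks) {
        return subtaskWatermarks.stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(Long.MIN_VALUE);
    }

    /** Grant the secondary topic quota only after the primary backfill completes. */
    public static double secondaryQuota(Collection<Long> primaryWatermarks,
                                        long jobLaunchTime, double fullQuota) {
        return globalWatermark(primaryWatermarks) >= jobLaunchTime ? fullQuota : 0.0;
    }
}
```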

Building Reliable Flink Applications Using UnifiedSource

UnifiedSource does more than provide easy access to real-time and historical data:

  1. UnifiedSource provides out-of-the-box features such as deserialization, watermark generation, traffic control, and corrupted-message metrics, letting Flink developers focus on business logic.
  2. Users are now able to generate deterministic results by loading fixed datasets and verify correctness against code changes or framework upgrades.
  3. At Pinterest, critical Flink jobs are required to go through load testing before moving to production. Simulating traffic load through UnifiedSource is more scalable and cost effective than rewinding Kafka offsets.

Figure 3: simulating large load through UnifiedSource

Figure 4: watermark change when transitioning from file to Kafka

We’re hiring

We are always evolving our streaming processing platform, and there are interesting challenges and opportunities ahead. If you’re interested in tackling big data challenges with us, join us!

Acknowledgments

Thanks to Hannah Chen and Ang Zhang for reviewing. Thanks to Divye Kapoor and Karthik Anantha Padmanabhan for the integration with the Near Real-Time Galaxy framework. Thanks to our client teams, Ads Measurement, Core Product Indexing, and Content Quality, for early adoption and feedback. And thanks to the whole Stream Processing Platform Team for their support.
