Unified Flink Source at Pinterest: Streaming Data Processing


By Lu Niu & Chen Qin | Software Engineers, Stream Processing Platform Team


To best serve Pinners, creators, and advertisers, Pinterest leverages Flink as its stream processing engine. Flink is a data processing engine for stateful computation over data streams. It provides rich streaming APIs, exactly-once support, and state checkpointing, which are essential for building stable and scalable streaming applications. Today, Flink is widely used at companies such as Alibaba, Netflix, and Uber for mission-critical use cases.

Xenon is the Flink-based stream processing platform at Pinterest. Its mission is to build a reliable, scalable, easy-to-use, and efficient stream platform to enable data-driven products and timely decision making. The platform's scope includes:

  • Reliable and up-to-date Flink compute engine
  • Improved Pinterest dev velocity on stream application development
  • Platform reliability and efficiency
  • Security & Compliance
  • Documentation, User Education, and Community

Xenon has enabled several critical use cases, including:

  • Ads real-time ingestion, spending, and reporting — Calculate spending against budget limits in real time to quickly adjust budget pacing and update advertisers with more timely reporting results.
  • Fast signals — Make content signals available quickly after content creation and use these signals in ML pipelines for a customized user experience.
  • Realtime Trust & Safety — Reduce levels of unsafe content as close to content creation time as possible.
  • Content activation — Distribute fresh creator content and surface engagement metrics to creators so they can refine their content with minimal feedback delay.
  • Shopping — Deliver a trustworthy shopping product experience by updating product metadata in near real time.
  • Experimentation — Accurately deliver metrics to engineers for faster experiment setup, verification, and evaluation.

Streaming Data Architecture

Figure 1: The setup of the streaming data infrastructure at Pinterest

  1. Singer, a highly performant logging agent, uploads log messages from serving hosts to Kafka. Singer supports multiple log formats and guarantees at-least-once delivery for log messages.
  2. Merced, a variant of Secor, moves data from Kafka to S3. Merced guarantees exactly-once message persistence from Kafka to S3.
  3. Most of our Flink applications consume from Kafka and output to Kafka, Druid, or RocksStore, depending on the use case (a minimal job sketch follows this list).
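
For illustration, here is a minimal sketch of what a Flink job in this setup typically looks like: read a Kafka topic populated by Singer, apply the job's logic, and write results back to Kafka. The topic names, broker address, consumer group, and use of SimpleStringSchema are placeholder assumptions, not Pinterest's actual configuration.

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

    public class KafkaToKafkaJob {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder brokers
        props.setProperty("group.id", "xenon-example");       // placeholder consumer group

        // Source: events uploaded by Singer land in a Kafka topic.
        DataStream<String> events = env.addSource(
            new FlinkKafkaConsumer<>("raw_events", new SimpleStringSchema(), props));

        // Business logic goes here; an identity map keeps the sketch minimal.
        DataStream<String> results = events.map(value -> value);

        // Sink: results are written back to Kafka for downstream consumers.
        results.addSink(new FlinkKafkaProducer<>("processed_events", new SimpleStringSchema(), props));

        env.execute("kafka-to-kafka-example");
      }
    }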

Although all our use cases consume data from Kafka, accessing Kafka alone cannot satisfy all user requirements:

  1. Some use cases need to access historical data; however, data in Kafka has a short retention period, ranging from three days down to less than eight hours.
  2. We require all use cases to go through load testing before going into production. Simulating load by rewinding Kafka doesn’t scale.
  3. Replaying historical data via Kafka is one option, but that comes with added operational costs and a 20x increase in infrastructure costs.

Thanks to Merced, we have the historical data in hand, so we are able to provide a single API that concatenates historical data with real-time data: in effect, an unlimited log. Users can seek to any offset or timestamp without worrying about which storage system holds the data.
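
The post does not show UnifiedSource's API, but open-source Flink later introduced HybridSource, which expresses the same bounded-then-unbounded idea. The following is a minimal sketch using that public API (Flink 1.15+) just to make the concept concrete; the S3 path, topic, brokers, and cutover timestamp are placeholders, and this is not Pinterest's implementation.

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.base.source.hybrid.HybridSource;
    import org.apache.flink.connector.file.src.FileSource;
    import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class UnlimitedLogSketch {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        long cutoverTs = System.currentTimeMillis() - 24 * 3600 * 1000L; // example boundary

        // Bounded part: historical data persisted on S3 (stand-in for Merced output).
        FileSource<String> historical = FileSource
            .forRecordStreamFormat(new TextLineInputFormat(), new Path("s3://bucket/raw_events/"))
            .build();

        // Unbounded part: the live Kafka topic, starting where the files end.
        KafkaSource<String> live = KafkaSource.<String>builder()
            .setBootstrapServers("kafka:9092")
            .setTopics("raw_events")
            .setStartingOffsets(OffsetsInitializer.timestamp(cutoverTs))
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        // One logical source: consumers see a single, seamless "unlimited log".
        HybridSource<String> unlimitedLog =
            HybridSource.builder(historical).addSource(live).build();

        env.fromSource(unlimitedLog, WatermarkStrategy.noWatermarks(), "unlimited-log").print();
        env.execute("unlimited-log-sketch");
      }
    }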

This design brings several benefits:

  1. The encoding and schema of a topic are the same in Merced (aka bounded stream) and Kafka (aka unbounded stream). No extra effort is needed to convert them into a consistent view.
  2. Merced stores the data on S3 and keeps the original event ordering and partitions at a small fraction of the infrastructure cost. Replaying data through Merced behaves just like reading the original data from Kafka.

Features in Unified Source

One API for Both Realtime and Historical Data

Here is a brief summary of the implementation:

Figure 2: Unified Source Implementation

  1. UnifiedSource extends RichParallelSourceFunction. Each SubTask runs one instance of UnifiedSource, and each instance starts both a FlinkKafkaConsumer and a MercedFileSource. FlinkKafkaConsumer is provided by Flink out of the box; MercedFileSource transforms Merced output into stream form (a simplified structural sketch follows this list).
  2. During partition scheduling, the FlinkKafkaConsumer and MercedFileSource in a single SubTask are guaranteed to receive data belonging to the same partition, which enables a seamless transition from reading files to reading from Kafka.
  3. In order to discover all files belonging to the same partition, MercedFileSource listens to a Kafka topic published by the Merced notification system. This also lets us handle late-arriving events.
  4. Merced outputs are hourly partitioned, and the file naming encodes the Kafka creation time and partition. Therefore, MercedFileSource is able to read files and recreate the same event ordering and partitioning.
  5. UnifiedSource follows Flink best practices to generate watermarks from Kafka. The API requires users to provide a timestamp extractor, which the Kafka source uses to generate watermarks at the source. In this way, we standardize watermark generation across Flink applications. Similarly, MercedFileSource combines the watermark updates of multiple partitions into a single watermark stream.
  6. Merced observes late-arriving events in Kafka and writes them back to previous data partitions that have already been consumed by UnifiedSource.
  7. Sometimes event-time skew across multiple Merced partitions leads to an ever-growing watermark gap, which in turn puts pressure on downstream checkpointing. We implemented watermark synchronization across subtasks using GlobalAggregateManager to keep watermarks within a global range.
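
As referenced in point 1, here is a simplified structural sketch of the delegation pattern, built only from Flink's public RichParallelSourceFunction and FlinkKafkaConsumer. It is not Pinterest's code: the Merced file reading is replaced by an in-memory backlog list, and checkpointing, offset handoff, and watermark handling are omitted for brevity.

    import java.util.List;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    // Structural sketch of points 1 and 2 above: each subtask replays a bounded
    // backlog first, then delegates to the FlinkKafkaConsumer for its partitions.
    public class UnifiedSourceSketch extends RichParallelSourceFunction<String> {

      private final FlinkKafkaConsumer<String> kafkaConsumer;
      private final List<String> backlog;   // stand-in for MercedFileSource output
      private volatile boolean running = true;

      public UnifiedSourceSketch(FlinkKafkaConsumer<String> kafkaConsumer, List<String> backlog) {
        this.kafkaConsumer = kafkaConsumer;
        this.backlog = backlog;
      }

      @Override
      public void open(Configuration parameters) throws Exception {
        // The inner consumer runs inside this subtask, so it only picks up the
        // Kafka partitions assigned to this subtask index.
        kafkaConsumer.setRuntimeContext(getRuntimeContext());
        kafkaConsumer.open(parameters);
      }

      @Override
      public void run(SourceContext<String> ctx) throws Exception {
        // Phase 1: replay the bounded, historical portion in its original order.
        for (String record : backlog) {
          if (!running) {
            return;
          }
          ctx.collect(record);
        }
        // Phase 2: hand off to the live Kafka stream.
        kafkaConsumer.run(ctx);
      }

      @Override
      public void cancel() {
        running = false;
        kafkaConsumer.cancel();
      }
    }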

Traffic Control & Source Synchronization

Unlike processing unbounded streams, processing and reprocessing historical data require extra features like traffic control and source synchronization to achieve stable operation.

Here is a simplified use case to explain this problem:

A product team wants near-real-time insight into how audiences view Pins posted within the last seven days. The team builds a Flink job with XenonUnifiedSource that consumes both the published-Pin Kafka logs and the Pin-view Kafka logs. For the job to produce meaningful results, the unified source must first replay the last seven days of published Pins in sequence.

Using the file source and Kafka source naively here raises two issues:

The Pin-view topic is much larger than the published-Pins topic, so it consumes most of the bandwidth and slows down the backfilling of published Pins. Because the pipeline is written in event-time fashion, this slows the progression of the low watermark in the inner join operator. Over time, more and more events are buffered in the inner join waiting to be cleared, which causes backpressure and checkpointing failures.

In response to the traffic control issue, we developed per-topic rate limiting. The Pin-view topic is throttled to a lower rate, making room for the published-Pin topic to progress quickly.
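
One plausible way to implement such per-topic throttling is a token bucket around the emit path. The sketch below uses Guava's RateLimiter; the emit() wrapper and the idea of wiring it to an external quota update are illustrative assumptions, not Pinterest's actual design.

    import com.google.common.util.concurrent.RateLimiter;

    import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext;

    // Illustrative per-topic throttle: each topic's source wraps record emission
    // in a token-bucket permit. In the scenario above, the Pin-view topic would
    // be configured with a lower rate than the published-Pins topic.
    public class TopicRateLimiter {

      private final RateLimiter limiter;

      public TopicRateLimiter(double recordsPerSecond) {
        this.limiter = RateLimiter.create(recordsPerSecond);
      }

      /** Blocks until a permit is available, then emits the record. */
      public <T> void emit(SourceContext<T> ctx, T record) {
        limiter.acquire();
        ctx.collect(record);
      }

      /** Adjusts the throttle, e.g. when a coordinator grants a new quota. */
      public void updateRate(double recordsPerSecond) {
        limiter.setRate(recordsPerSecond);
      }
    }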

The Pin-view topic should “wait” until the entire seven days of published Pins has been replayed; otherwise, a low match ratio is almost certain. In Flink terminology, the watermark of the published-Pin stream should progress to “current” (the time the job is launched) before the watermark of the Pin-view stream progresses beyond “current.”

In response to the synchronization issue, we developed an allreduce-style watermark synchronization and periodically update each topic’s rate limiter threshold. The Pin-view source’s rate limiter is only granted quota to pull its Kafka topic when the watermarks from all subtasks reach “current.”
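
Flink’s GlobalAggregateManager lets every subtask ship a value to the JobManager and receive the result of a shared AggregateFunction, which is the building block for this kind of synchronization. Below is a sketch of how subtasks could report watermarks and learn the global minimum; the aggregate name, the gap threshold, and the accumulator layout are assumptions for illustration, not Pinterest's implementation.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.flink.api.common.functions.AggregateFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.runtime.taskexecutor.GlobalAggregateManager;
    import org.apache.flink.streaming.api.operators.StreamingRuntimeContext;

    // Sketch: each subtask reports (subtaskIndex, localWatermark) and receives the
    // minimum of the latest watermarks across all subtasks. A source that is too
    // far ahead of that minimum can back off, or have its rate limiter quota cut.
    public class WatermarkSync {

      private static final String AGGREGATE_NAME = "unified-source-min-watermark"; // assumed name
      private static final long MAX_WATERMARK_GAP_MS = 5 * 60 * 1000L;             // assumed bound

      private final GlobalAggregateManager aggregateManager;
      private final int subtaskIndex;

      public WatermarkSync(StreamingRuntimeContext runtimeContext) {
        this.aggregateManager = runtimeContext.getGlobalAggregateManager();
        this.subtaskIndex = runtimeContext.getIndexOfThisSubtask();
      }

      /** Reports the local watermark; returns true if this subtask should back off. */
      public boolean reportAndCheck(long localWatermark) throws IOException {
        long globalMin = aggregateManager.updateGlobalAggregate(
            AGGREGATE_NAME, Tuple2.of(subtaskIndex, localWatermark), new MinOfLatestWatermarks());
        return localWatermark - globalMin > MAX_WATERMARK_GAP_MS;
      }

      /** Keeps the latest watermark per subtask and returns the minimum of them. */
      private static class MinOfLatestWatermarks
          implements AggregateFunction<Tuple2<Integer, Long>, Map<Integer, Long>, Long> {

        @Override
        public Map<Integer, Long> createAccumulator() {
          return new HashMap<>();
        }

        @Override
        public Map<Integer, Long> add(Tuple2<Integer, Long> value, Map<Integer, Long> acc) {
          acc.put(value.f0, value.f1);
          return acc;
        }

        @Override
        public Long getResult(Map<Integer, Long> acc) {
          return acc.values().stream().mapToLong(Long::longValue).min().orElse(Long.MIN_VALUE);
        }

        @Override
        public Map<Integer, Long> merge(Map<Integer, Long> a, Map<Integer, Long> b) {
          a.putAll(b);
          return a;
        }
      }
    }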

Building Reliable Flink Applications Using UnifiedSource

The benefits of UnifiedSource are not limited to easy access to real-time and historical data; it provides several others as well:

  1. UnifiedSource provides out-of-the-box features such as deserializer, watermark generation, traffic control, and corrupted message metrics. This allows Flink developers to focus on business logic implementation.
  2. Users are now able to generate deterministic results by loading fixed datasets and to verify correctness against code changes or framework upgrades (a sketch of one approach follows below).
  3. At Pinterest, critical Flink jobs are required to go through load testing before moving to production. Simulating traffic load through UnifiedSource is more scalable and cost effective than rewinding Kafka offsets.

Figure 3: Simulating a large load through UnifiedSource

Figure 4: Watermark change when transitioning from file to Kafka
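
To make point 2 above concrete, here is one way to replay a fixed historical window as a deterministic, bounded input, using the open-source KafkaSource API as a stand-in for UnifiedSource. The topic, brokers, and timestamps are placeholders.

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FixedWindowReplay {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounding the source by start and end timestamps yields a finite,
        // repeatable input: two runs over the same window can be compared
        // record-for-record to validate code changes or framework upgrades.
        KafkaSource<String> fixedWindow = KafkaSource.<String>builder()
            .setBootstrapServers("kafka:9092")
            .setTopics("published_pins")
            .setStartingOffsets(OffsetsInitializer.timestamp(1_640_995_200_000L)) // window start
            .setBounded(OffsetsInitializer.timestamp(1_641_081_600_000L))         // window end
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        env.fromSource(fixedWindow, WatermarkStrategy.noWatermarks(), "fixed-window").print();
        env.execute("fixed-window-replay");
      }
    }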

We’re hiring

We are always evolving our stream processing platform, and there are interesting challenges and opportunities ahead. If you’re interested in tackling big data challenges with us, join us!

Acknowledgments

Thanks to Hannah Chen and Ang Zhang for reviewing. Thanks to Divye Kapoor and Karthik Anantha Padmanabhan for integration in the Near Real-Time Galaxy framework. Thanks to our client teams, Ads Measurement, Core Product Indexing, and Content Quality, for early adoption and feedback. Thanks to the whole Stream Processing Platform Team for their support.
