Pinterest Flink Deployment Framework

Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

By Rainie Li | Software Engineer, Stream Processing Platform Team


Background

At Pinterest, stream processing allows us to unlock value from real-time data for Pinners and partners. The Stream Processing Platform team is building a reliable and scalable platform to support many critical streaming applications, including real-time experiment analytics and real-time machine learning signals.

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. It provides exactly-once guarantees, low latency, high throughput, and a powerful computation model. At Pinterest, we have adopted Flink as our unified stream processing engine.

Requirements

Standardize Flink Build

At Pinterest, we use Bazel as our build system. We need a standardized Bazel rule to build all Flink jobs without changing Makefiles. Once a build is done, JARs should be uploaded to remote storage automatically, instead of users copying Flink JARs to YARN clusters by hand.

Deployment and Operations History

Users used to copy Flink JARs to YARN clusters and run commands manually. That made it hard to track previous executions when we needed to recover failed jobs. We need to provide standard Flink operations such as launching a job, killing it, triggering a savepoint, and resuming from the most recent savepoint.

Job Deduplication

Flink applications are deployed as services, so only one instance of each application should be running at a time. We need to prevent users from accidentally deploying the same job twice, in which case both instances might write to the same Kafka topic. Such double writes to Kafka could affect downstream jobs.

Deployment Framework

We built our Flink deployment framework on top of Bazel, Hermez (internal continuous deployment platform), Job Submission Service (internal service), and YARN clusters.

Figure 1. Deployment high level architecture

Create Bazel BUILD file

The BUILD file needs to contain load("flink_release"). Users also need to insert a Bazel rule like this:

Define Hermez Deployment File

Hermez is the Pinterest Continuous Deployment System. To launch a Flink job with Hermez, users need to create a Hermez.yml file. This file specifies which YARN cluster the Flink job runs in, which YARN parameters to use, what resources to request, and so on. Users should set up a separate YAML file for each instance of a Flink job. For example, if users run their jobs in dev, staging, and prod environments, they need three different YAML files (one for each environment).

Here’s an example of a yml file:

Automatic Flink Job Building

The numbers in parentheses below refer to steps in Figure 1 (Deployment high level architecture).

Whenever a user lands a change in the Git repo, a Jenkins job is triggered to build the Flink job JARs (1). The Jenkins job follows the flink_release rules described in the BUILD file to build the Flink JAR and upload it to the S3 bucket (3). Meanwhile, it uploads the deployment-related Hermez YAML files to Artifactory (2). Hermez monitors Artifactory; when it sees a new yml file, it displays it in the UI so that users can launch a job using that yml (5).

Flink Job Launching

When users launch a Flink job, Hermez converts the yml file into JSON and submits it to the Job Submission Service (JSS) (6). JSS is a service maintained by Pinterest that can schedule and launch Flink jobs on YARN clusters.
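As a rough illustration of the conversion step, the sketch below takes an already-parsed deployment file (represented as a Python dict) and serializes it into a JSON payload. Every field name (`job_name`, `environment`, `yarn_cluster`, `flink_config`) and the `build_submission_payload` helper are hypothetical; the real Hermez and JSS schemas are internal and are not shown in this post.

```python
import json

# Hypothetical contents of a parsed Hermez yml file; the real field names are internal.
parsed_yml = {
    "job_name": "example-flink-job",
    "environment": "prod",
    "yarn_cluster": "example-yarn-cluster",
    "flink_config": {"parallelism": 8},
}


def build_submission_payload(deployment: dict) -> str:
    """Serialize a parsed deployment file into a JSON string for submission."""
    # Assumed minimum set of fields a submission would need.
    required = ("job_name", "yarn_cluster")
    missing = [key for key in required if key not in deployment]
    if missing:
        raise ValueError(f"deployment file is missing fields: {missing}")
    # sort_keys yields a stable payload, which keeps logs and diffs readable.
    return json.dumps(deployment, sort_keys=True)


payload = build_submission_payload(parsed_yml)
```

Validating required fields before serializing is a design choice of this sketch; it surfaces a malformed deployment file at launch time rather than after the request has reached the cluster.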

JSS examines the request and ensures that the Flink JARs and Flink job state exist in S3 (7). If everything checks out, JSS first launches a shell-runner job, which executes a command on a YARN cluster (8). The shell-runner job downloads the Flink job’s JAR from S3 and then kicks off the actual Flink job using the configuration provided by JSS (9). We add the shell-runner job to keep JSS a thin layer that does not have to deal with different compute engine clients (Flink, Spark, MapReduce, etc.) or with different configurations for each cluster.
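The check-then-launch sequence described above can be sketched as follows. This is a minimal model under stated assumptions, not the JSS implementation: S3 is stood in for by a plain set of object keys, and `LaunchRequest`, `validate_request`, and `plan_launch` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class LaunchRequest:
    job_name: str
    jar_key: str                     # hypothetical storage key of the job JAR
    state_key: Optional[str] = None  # savepoint/checkpoint key; None means fresh state


def validate_request(request: LaunchRequest, existing_keys: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means the request can proceed."""
    problems = []
    if request.jar_key not in existing_keys:
        problems.append(f"JAR not found: {request.jar_key}")
    if request.state_key is not None and request.state_key not in existing_keys:
        problems.append(f"job state not found: {request.state_key}")
    return problems


def plan_launch(request: LaunchRequest, existing_keys: Set[str]) -> List[str]:
    """Validate first, then describe the steps the shell-runner would carry out."""
    problems = validate_request(request, existing_keys)
    if problems:
        raise ValueError("; ".join(problems))
    return [
        "launch shell-runner on YARN",
        f"download {request.jar_key}",
        f"start Flink job {request.job_name}",
    ]
```

Collecting all problems before failing lets a user fix a missing JAR and a missing savepoint in one pass instead of two.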

JSS Deduplication

When resuming a Flink job, we provide several options: resume from the most recent savepoint or checkpoint, start from fresh state, or specify a savepoint or checkpoint path. The job deduplication feature ensures that only one instance of a Flink job is running at a time.

Job deduplication relies on each job having a unique name at submission time. If an instance of the job is already running, JSS triggers a savepoint and stops that instance first, then submits the new job. If the stop request fails because the savepoint fails, the new submission fails and the running instance keeps running. If a deployment for the job is already in progress, the new submission is rejected.
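The deduplication rules above can be sketched as a small state machine keyed by job name. This is an illustrative model only: the `Deduplicator` class, the `JobState` values, and the two callbacks are hypothetical, and a real service would also need locking and persistent state, which the sketch omits.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class JobState(Enum):
    DEPLOYING = "deploying"
    RUNNING = "running"


class SubmissionRejected(Exception):
    """Raised when a new submission must not proceed."""


class Deduplicator:
    def __init__(self,
                 stop_with_savepoint: Callable[[str], bool],
                 launch: Callable[[str], None]) -> None:
        # stop_with_savepoint returns True only if the savepoint and stop succeeded.
        self._stop_with_savepoint = stop_with_savepoint
        self._launch = launch
        self._jobs: Dict[str, JobState] = {}

    def state_of(self, job_name: str) -> Optional[JobState]:
        return self._jobs.get(job_name)

    def submit(self, job_name: str) -> None:
        state = self._jobs.get(job_name)
        if state is JobState.DEPLOYING:
            raise SubmissionRejected("a deployment is already in progress")
        if state is JobState.RUNNING and not self._stop_with_savepoint(job_name):
            # State is left untouched, so the existing instance keeps running.
            raise SubmissionRejected("savepoint failed; existing instance kept")
        self._jobs[job_name] = JobState.DEPLOYING
        try:
            self._launch(job_name)
        except Exception:
            # Launch failed: no instance of this job is running any more.
            del self._jobs[job_name]
            raise
        self._jobs[job_name] = JobState.RUNNING
```

Marking the job as `DEPLOYING` before calling `launch` is what lets a second, overlapping submission be rejected rather than racing the first one.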

Flink Job Configuration Hotfix

Because Flink configuration is packaged together with the Flink job binary, users used to check config changes in to the repo and rebuild the package. This whole process could take more than 10 minutes, which is a problem when we need to adjust parameters quickly during an incident. For example, when Flink jobs failed in production due to a lack of resources, we used to go through the entire build process to roll out resource config changes. After the incident was resolved, we needed to check in another change to roll back these configs. To speed up this process, we provide a hotfix feature on Hermez that overwrites Flink job configuration without a code change. Users can adjust Flink configuration values during deployment. Behind the scenes, Hermez directly overwrites these values in the ymls that it reads from Artifactory.
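Conceptually, a hotfix is an overlay of a few values on top of the configuration that was built with the job. The sketch below shows one way to model that; the `apply_hotfix` helper is hypothetical, and rejecting unknown keys is an assumption of this sketch rather than documented Hermez behavior. The keys used in the test data are standard Flink option names chosen for illustration.

```python
import copy


def apply_hotfix(base_config: dict, overrides: dict) -> dict:
    """Return a copy of base_config with the override values applied."""
    # Assumption: a hotfix may only change keys that already exist,
    # so a typo in a key name fails loudly instead of being ignored.
    unknown = set(overrides) - set(base_config)
    if unknown:
        raise KeyError(f"unknown configuration keys: {sorted(unknown)}")
    patched = copy.deepcopy(base_config)  # leave the packaged config untouched
    patched.update(overrides)
    return patched


packaged = {
    "parallelism.default": 8,
    "taskmanager.memory.process.size": "4g",
}
hotfixed = apply_hotfix(packaged, {"taskmanager.memory.process.size": "8g"})
```

Because the packaged configuration is never mutated, dropping the override on the next deployment restores the original values without a second code change.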

What’s Next

Reducing Deployment Latency

The current approach launches the shell-runner first, and the shell-runner then launches the Flink job on the YARN cluster. This extra hop can increase latency. We plan to improve this process to reduce end-to-end Flink job launch time.

Automatic Job Failover

To further improve platform and Flink application availability, we built YARN clusters in multiple AWS Availability Zones (AZs) to provide backup when one cluster or one AZ becomes unavailable. We are also building a service that can automatically detect cluster failures and fail over the affected jobs to backup clusters in different AZs, or detect application failures and restart the application automatically.

Stay tuned!

Acknowledgments

Thanks to Steven Bairos-Novak and Yu Yang for their countless contributions. Thanks to Ang Zhang for updating this blog. This project is a joint effort across multiple teams at Pinterest. Thanks to the Engineering Productivity Team for Hermez support.
