Empowering Pinterest Data Scientists and Machine Learning Engineers with PySpark

Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

Data scientists and machine learning engineers at Pinterest were hitting major challenges with existing tools. Hive and Presto were readily accessible for large-scale data transformations, but complex logic is difficult to write in SQL. Some engineers wrote complex logic in Cascading or Scala Spark jobs, but these have a steep learning curve, and jobs take significantly longer to build. Furthermore, data scientists and machine learning engineers often trained models in a small-scale notebook environment but lacked the tools to perform large-scale inference.

To address these challenges, we (machine learning and data processing platform engineers) built and productionized PySpark infrastructure. It gives our users the following capabilities:

  • Writing logic using the familiar Python language and libraries, in isolated environments that allow experimenting with new packages.
  • Rapid prototyping from our JupyterHub deployment, enabling users to interactively try out feature transformations, model ideas, and data processing jobs.
  • Integration with our internal workflow system, so that users can easily productionize their PySpark applications as scheduled workflows.

PySpark on Kubernetes as a minimum viable product (MVP)

We first built an MVP PySpark infrastructure on Pinterest’s Kubernetes infrastructure using Spark standalone mode, and tested it with users to gather feedback.

Figure 1: An overview of the MVP architecture

The infrastructure consists of Kubernetes pods carrying out different tasks:

  • Spark Master managing cluster resources
  • Workers — where Spark executors are spawned
  • Jupyter servers assigned to each user

When users launch PySpark applications from those Jupyter servers, the Spark driver is created in the same pod as Jupyter, and the requested executors are created in worker pods.

This architecture enabled our users to experience the power of PySpark for the first time. Data scientists were able to quickly grasp Python UDFs, transform features, and perform batch inference of TensorFlow models with terabytes of data.

This architecture, however, had some limitations:

  • The Jupyter notebook and the PySpark driver share resources because they run in the same pod.
  • The driver’s port and address are hard-coded in the config.
  • Users can launch only one PySpark application per assigned Jupyter server.
  • Managing Python dependencies per user or team is difficult.
  • Resource management is limited to a FIFO approach across all users (no queues are defined).

As the demand for PySpark grew, we worked on a production-grade PySpark infrastructure based on YARN, Livy, and Sparkmagic.

Production-grade PySpark infrastructure

Figure 2: An overview of the production architecture

In this architecture, each Spark application runs on the YARN cluster. We use Apache Livy to proxy between our internal JupyterHub, the Spark application and the YARN cluster. On Jupyter, Sparkmagic provides a PySpark kernel that forwards the PySpark code to a running Spark application. Conda provides isolated Python environments for each application.

With this architecture, we offer two development approaches.

Interactive development:

  1. A user creates a conda environment zip containing Python packages they need, if any.
  2. From JupyterHub, they create a notebook with the PySpark kernel from Sparkmagic.
  3. In the notebook, they declare the required resources, the conda environment, and other configuration. Livy launches a Spark application on the YARN cluster.
  4. Sparkmagic ships the user’s Jupyter cells (via Livy) to the PySpark application. Livy proxies results back to the Jupyter notebook.
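The interactive flow above maps onto two Livy REST calls: one that creates a session (POST /sessions) and one that runs a piece of code in it (POST /sessions/{id}/statements). Below is a minimal sketch of the request bodies, not our exact configuration. The field names are Livy's; the archive location, queue name, resource sizes, and the "environment" alias are hypothetical placeholders. Sparkmagic's %%configure magic accepts a JSON body with the same session fields.

```python
import json


def build_session_request(conda_archive, queue, num_executors,
                          executor_memory, driver_memory):
    """Build the JSON body for Livy's POST /sessions endpoint."""
    # Hypothetical alias: the directory name the archive is unpacked under.
    env_alias = "environment"
    return {
        "kind": "pyspark",
        "queue": queue,
        "numExecutors": num_executors,
        "executorMemory": executor_memory,
        "driverMemory": driver_memory,
        # "<uri>#<alias>" asks YARN to unpack the archive under <alias>.
        "archives": [f"{conda_archive}#{env_alias}"],
        "conf": {
            "spark.yarn.appMasterEnv.PYSPARK_PYTHON":
                f"./{env_alias}/bin/python",
        },
    }


def build_statement_request(cell_source):
    """Build the JSON body for POST /sessions/{id}/statements."""
    return {"code": cell_source}


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    body = build_session_request(
        "s3://example-bucket/envs/my_env.zip", "adhoc", 10, "4g", "2g")
    print(json.dumps(body, indent=2))
    print(json.dumps(build_statement_request("spark.range(10).count()")))
```

The sketch only builds payloads; sending them to a Livy server and polling for the statement result is left out.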

See the Appendix for a full annotated example of a Jupyter notebook.

Non-interactive development (ad-hoc and production workflow runs):

  1. A Pinterest-internal Job Submission Service acts as the gateway to the YARN cluster.
  2. In development, the user’s local Python code base is packaged into an archive and submitted to launch a PySpark application in YARN.
  3. In scheduled production runs, the production build’s archive is submitted instead.
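To illustrate the packaging step, here is a minimal sketch that zips a local Python code base and builds a launch command for it. The Job Submission Service is internal and its interface is not shown here, so plain spark-submit arguments stand in for it; the function names and the paths in the usage note are hypothetical.

```python
import zipfile
from pathlib import Path


def package_code(src_dir, archive_path):
    """Zip every .py file under src_dir, keeping package-relative paths."""
    src = Path(src_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*.py")):
            # Store "pkg/module.py" rather than the absolute path so the
            # archive can be put directly on the Python path.
            zf.write(path, path.relative_to(src).as_posix())
    return archive_path


def spark_submit_args(archive_path, entry_point, app_args=()):
    """Stand-in for a job submission request: plain spark-submit arguments."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--py-files", str(archive_path),
        str(entry_point),
        *app_args,
    ]
```

For example, package_code("my_project", "code.zip") followed by spark_submit_args("code.zip", "main.py", ["--date", "2020-01-01"]) yields an argument list that could be handed to a process launcher. A scheduled production run would point at the production build's archive instead.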

Benefits

This infrastructure offers us the following benefits:

  1. No resource sharing between the Jupyter notebook and PySpark drivers
  2. No hard-coded driver ports and addresses
  3. Users can launch many PySpark applications
  4. Efficient resource allocation and isolation, with aggressive dynamic allocation for high resource utilization
  5. Per-user Python dependencies are supported
  6. Resource accountability
  7. Dr. Elephant for PySpark job analysis

Technical details

Pinterest JupyterHub Integration: (benefits #1,2,3)

We made the Sparkmagic kernel available in Jupyter. When the kernel is selected, a config managed by ZooKeeper is loaded with all necessary dependencies.

We set up Apache Livy, which provides a REST API proxy from Jupyter to the YARN cluster and PySpark applications.

A YARN cluster: (benefit #4)

  • Efficient resource allocation and isolation. We define a queue structure with the Fair Scheduler so that each queue has dedicated resources. Resources can be preempted under certain conditions (e.g. after a queue has waited for at least 10 minutes), but queues with minResources set keep a portion of non-preemptable resources. Scheduler and resource manager logs are used to manage cluster resources.
  • Aggressive dynamic allocation policy for high resource utilization. We set a policy under which a PySpark application holds at most a certain number of executors and automatically releases executors it no longer needs. Resources are recycled faster, which leads to better resource utilization.
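A dynamic allocation policy of this kind can be expressed with standard Spark settings. The sketch below is an illustration under assumed values, not our production configuration: the executor cap and idle timeout are placeholders, and the helper names are hypothetical.

```python
def dynamic_allocation_conf(max_executors, idle_timeout_s=60):
    """Spark settings that cap executors and release idle ones."""
    if max_executors < 1:
        raise ValueError("max_executors must be at least 1")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # The external shuffle service lets executors be removed without
        # losing the shuffle files they wrote.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": "0",
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # A shorter timeout returns idle executors to the cluster sooner.
        "spark.dynamicAllocation.executorIdleTimeout": f"{idle_timeout_s}s",
    }


def to_submit_flags(conf):
    """Render a settings dict as spark-submit --conf flags."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags
```

Lowering the idle timeout makes the policy more aggressive: executors go back to the queue sooner, at the cost of re-requesting them if the application becomes busy again.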

Python Dependency Management: (benefit #5)

Users can try various Python libraries (e.g. different ML frameworks) without asking platform engineers to install them. To that end, we created a Jenkins job that packages a conda environment based on a requirements file and archives it as a zip file on S3. PySpark applications are launched with “--archives” to distribute the zip file to the driver and all executors, and they reset both “PYSPARK_PYTHON” (for the driver) and “spark.yarn.appMasterEnv.PYSPARK_PYTHON” (for executors). That way, each application runs in an isolated Python environment with all the libraries it needs.
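As a minimal sketch of such a launch, the function below builds spark-submit arguments and the environment variable for a packaged conda environment. The S3 location, the "environment" alias, and the function name are hypothetical; the --archives flag, PYSPARK_PYTHON, and spark.yarn.appMasterEnv.* are standard Spark-on-YARN options.

```python
def conda_env_launch(env_zip_uri, entry_point, alias="environment"):
    """Return (args, env) that launch a job inside a shipped conda env."""
    # After YARN unpacks the archive under <alias>, the interpreter lives
    # at this path relative to each container's working directory.
    python_path = f"./{alias}/bin/python"
    env = {"PYSPARK_PYTHON": python_path}
    args = [
        "spark-submit",
        "--master", "yarn",
        # "<uri>#<alias>" unpacks the zip under the directory <alias>.
        "--archives", f"{env_zip_uri}#{alias}",
        "--conf", f"spark.yarn.appMasterEnv.PYSPARK_PYTHON={python_path}",
        entry_point,
    ]
    return args, env
```

The returned env dict is meant to be merged into the environment of the process that runs spark-submit, so each application picks up its own interpreter instead of the cluster default.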

Integrating with Pinterest-internal Job Submission Service (JSS): (benefit #6)

To productionize PySpark applications, users schedule them through the internal workflow system. We provide a workflow template that integrates with the job submission interfaces and lets users specify the code location, parameters, and the Python environment artifact to use.

Self-service job performance analysis: (benefit #7)

We forked the open-source Dr. Elephant and added new heuristics that analyze an application’s configuration together with various kinds of runtime metrics (executor, job, stage, …). The service provides tuning suggestions and guidelines on how to write a Spark job properly. It eases users’ debugging and troubleshooting, which increases development velocity. It also avoids resource waste and improves cluster stability. Below is an example of the performance analysis.
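To give a feel for what a heuristic of this kind does, here is a purely hypothetical sketch in Python. It is not one of the heuristics we added and does not use Dr. Elephant's actual API; the metric names, threshold, headroom factor, and severity labels are all assumptions. It compares peak executor memory against the configured allocation and suggests a smaller setting when utilization is low.

```python
from dataclasses import dataclass


@dataclass
class ExecutorMetrics:
    allocated_memory_mb: int  # memory requested per executor
    peak_memory_mb: int       # highest memory the executor actually used


def memory_heuristic(executors, low_utilization=0.5, headroom=1.25):
    """Return (severity, suggestion) for over-allocated executor memory."""
    if not executors:
        return ("NONE", "no executor metrics available")
    allocated = max(e.allocated_memory_mb for e in executors)
    peak = max(e.peak_memory_mb for e in executors)
    if allocated <= 0:
        return ("NONE", "no executor metrics available")
    utilization = peak / allocated
    if utilization < low_utilization:
        # Suggest the observed peak plus some headroom.
        suggested = int(peak * headroom)
        return ("MODERATE",
                f"peak usage is {utilization:.0%} of the allocation; "
                f"consider lowering executor memory to about {suggested} MB")
    return ("NONE", "executor memory looks reasonably sized")
```

A real heuristic would draw on many more signals (per-stage skew, garbage collection time, spill), but the shape is the same: metrics in, a severity and a tuning suggestion out.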

Figure 3: An overview of Dr. Elephant

Impacts

PySpark is now used by our Product Analytics and Data Science teams and our Ads teams for a wide range of use cases.

  • Training: users can train models iteratively with MLlib or any Python machine learning framework (e.g. TensorFlow) on data of any size.
  • Inference: users can test and productionize their Python inference code without depending on platform engineers.
  • Ad-hoc analyses: users can perform various ad-hoc analyses as needed.

Moreover, our users now have the freedom to explore various Python dependencies and to use Python UDFs on large-scale data.

Acknowledgement

We thank David Liu (EM, Machine Learning Platform team), Ang Zhang (EM, Data Processing Platform team), Tais (our TPM), the Pinterest Product Analytics and Data Science organization (Sarthak Shah, Grace Huang, Minli Zhang, Dan Lee, Ladi Ositelu), the Compute-Platform team (Harry Zhang, June Liu), the Data Processing Platform team (Zaheen Aziz), and the Jupyter team (Prasun Ghosh, Tech Lead) for their support and collaboration.

Appendix: An example of our use case

Below is an example of how our users train a model, and run inference logic at scale from their Jupyter notebook with PySpark. We leave explanations in each cell.
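The notebook is shown as a picture, so as a text complement here is a minimal, hypothetical sketch of the batch-inference pattern it illustrates: a plain Python scoring function applied to every row through a Python UDF. The logistic-regression weights stand in for a trained model, and the function and column names are assumptions, not taken from the notebook.

```python
import math


def score(features, weights=(0.4, -0.2, 0.1), bias=0.0):
    """Logistic-regression score for one row; stands in for a real model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def add_scores(df, feature_col="features", output_col="score"):
    """Apply `score` to every row of a Spark DataFrame as a Python UDF."""
    # Imported lazily so the scoring logic can be tested without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    score_udf = udf(lambda features: float(score(features)), DoubleType())
    return df.withColumn(output_col, score_udf(df[feature_col]))
```

In a notebook, add_scores would be called on a DataFrame whose feature column holds an array of numbers, and the result written back to a table. Keeping the model logic in an ordinary function makes it easy to check on a few rows locally before running it at scale.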
