Empowering Pinterest Data Scientists and Machine Learning Engineers with PySpark


Data scientists and machine learning engineers at Pinterest found themselves hitting major challenges with existing tools. Hive and Presto were readily accessible for large-scale data transformations, but complex logic is difficult to express in SQL. Some engineers wrote complex logic as Cascading or Scala Spark jobs, but these have a steep learning curve and take significantly more time to build. Furthermore, data scientists and machine learning engineers often trained models in a small-scale notebook environment but lacked the tools to perform large-scale inference.

To combat these challenges, we (machine learning and data processing platform engineers) built and productionized a PySpark infrastructure, which gives our users the following capabilities:

  • Writing logic using the familiar Python language and libraries, in isolated environments that allow experimenting with new packages.
  • Rapid prototyping from our JupyterHub deployment, enabling users to interactively try out feature transformations, model ideas, and data processing jobs.
  • Integration with our internal workflow system, so that users can easily productionize their PySpark applications as scheduled workflows.

PySpark on Kubernetes as a minimum viable product (MVP)

We first built an MVP PySpark infrastructure on Pinterest's Kubernetes infrastructure using Spark Standalone Mode and tested it with users to gather feedback.

Figure 1. An overview of the MVP architecture

The infrastructure consists of Kubernetes pods carrying out different tasks:

  • Spark Master managing cluster resources
  • Workers — where Spark executors are spawned
  • Jupyter servers assigned to each user

When users launch PySpark applications from those Jupyter servers, the Spark driver is created in the same pod as Jupyter, and the requested executors are spawned in the worker pods.

This architecture enabled our users to experience the power of PySpark for the first time. Data scientists were able to quickly grasp Python UDFs, transform features, and perform batch inference of TensorFlow models with terabytes of data.
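
As a concrete illustration, the following is a minimal sketch of the kind of Python UDF a user might write for feature transformation; the column names and the normalization logic are hypothetical examples, not taken from the original notebooks.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, FloatType

    spark = SparkSession.builder.appName("feature-transform-sketch").getOrCreate()

    # Hypothetical raw features: a list of floats per pin.
    df = spark.createDataFrame(
        [(1, [1.0, 2.0, 3.0]), (2, [4.0, 5.0, 6.0])],
        ["pin_id", "raw_features"],
    )

    # Plain Python wrapped as a UDF; any library shipped with the Python
    # environment (e.g. a TensorFlow model for batch inference) could be
    # called here instead.
    @F.udf(returnType=ArrayType(FloatType()))
    def l2_normalize(features):
        norm = sum(x * x for x in features) ** 0.5
        return [float(x / norm) for x in features] if norm else features

    df.withColumn("features", l2_normalize("raw_features")).show(truncate=False)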

This architecture, however, had some limitations:

  • The Jupyter notebook and the PySpark driver share resources, since they run in the same pod.
  • The driver's port and address are hard-coded in the configuration.
  • Users can launch only one PySpark application per assigned Jupyter server.
  • Managing Python dependencies per user or team is difficult.
  • Resource management is limited to a FIFO approach across all users (no queues are defined).

As the demand for PySpark grew, we worked on a production-grade PySpark infrastructure based on YARN, Livy, and Sparkmagic.

Production-grade PySpark infrastructure

Figure 2: An overview of the production architecture

In this architecture, each Spark application runs on the YARN cluster. We use Apache Livy to proxy between our internal JupyterHub, the Spark application and the YARN cluster. On Jupyter, Sparkmagic provides a PySpark kernel that forwards the PySpark code to a running Spark application. Conda provides isolated Python environments for each application.

With this architecture, we offer two development approaches.

Interactive development:

  1. A user creates a conda environment zip containing Python packages they need, if any.
  2. From JupyterHub, they create a notebook with PySpark kernel from Sparkmagic.
  3. In the notebook, they declare the resources required, the conda environment, and other configuration (see the sketch after this list). Livy then launches a Spark application on the YARN cluster.
  4. Sparkmagic ships the user’s Jupyter cells (via Livy) to the PySpark application. Livy proxies results back to the Jupyter notebook.
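
For illustration, declaring resources and a conda environment from the notebook can be done with a Sparkmagic %%configure cell such as the one below; the queue name, S3 path, and resource values are hypothetical.

    %%configure -f
    {
      "driverMemory": "4g",
      "executorMemory": "8g",
      "executorCores": 4,
      "numExecutors": 50,
      "queue": "example_team_queue",
      "archives": ["s3://example-bucket/envs/my_conda_env.zip#env"],
      "conf": {
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "env/bin/python"
      }
    }

Sparkmagic forwards these settings to Livy, which starts the Spark application on YARN with the requested resources and the shipped conda environment.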

See the Appendix for a fully annotated example of a Jupyter notebook.

Non-interactive development (ad-hoc and production workflow runs):

  1. A Pinterest-internal Job Submission Service acts as the gateway to the YARN cluster.
  2. In development, the user's local Python code base is packaged into an archive and submitted to launch a PySpark application on YARN (a simplified submission sketch follows this list).
  3. In scheduled production runs, the production build's archive is submitted instead.
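
The Job Submission Service itself is Pinterest-internal, so its API is not shown here. As a rough stand-in, the sketch below packages a local code base and submits it with a plain spark-submit, which captures the general shape of what the service does on the user's behalf; all paths, names, and the queue are hypothetical.

    import shutil
    import subprocess

    # Package the local Python code base (hypothetical path) into a zip archive.
    archive = shutil.make_archive("my_project", "zip", root_dir="./my_project")

    # Submit to the YARN cluster; the internal Job Submission Service wraps a
    # step equivalent to this on behalf of the user or the workflow system.
    subprocess.run(
        [
            "spark-submit",
            "--master", "yarn",
            "--deploy-mode", "cluster",
            "--queue", "example_team_queue",
            "--archives", "s3://example-bucket/envs/my_conda_env.zip#env",
            "--py-files", archive,
            "my_project/main.py",
        ],
        check=True,
    )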

Benefits

This infrastructure offers us the following benefits:

  1. No resource sharing between Jupyter notebooks and PySpark drivers
  2. No hard-coded driver ports and addresses
  3. Users can launch multiple PySpark applications
  4. Efficient resource allocation and isolation, with aggressive dynamic allocation for high resource utilization
  5. Per-user Python dependencies are supported
  6. Resource accountability
  7. Dr. Elephant for PySpark job analyses

Technical details

Pinterest JupyterHub Integration: (benefits #1, 2, 3)

We made the Sparkmagic kernel available in Jupyter. When the kernel is selected, a config managed by ZooKeeper is loaded with all necessary dependencies.

We set up Apache Livy, which provides a REST API proxy from Jupyter to the YARN cluster and PySpark applications.
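
Under the hood, Sparkmagic drives Livy's REST API. The sketch below shows that interaction directly with the requests library, roughly what happens when a PySpark kernel starts and runs a cell; the Livy host and resource values are hypothetical.

    import time
    import requests

    LIVY = "http://livy.example.internal:8998"  # hypothetical Livy endpoint

    # Create an interactive PySpark session (what Sparkmagic does when the
    # PySpark kernel starts).
    session = requests.post(
        f"{LIVY}/sessions",
        json={"kind": "pyspark", "executorMemory": "4g", "numExecutors": 10},
    ).json()
    session_url = f"{LIVY}/sessions/{session['id']}"

    # Wait until the session is ready, then submit a code statement, much as
    # a notebook cell would be submitted.
    while requests.get(session_url).json()["state"] != "idle":
        time.sleep(5)

    statement = requests.post(
        f"{session_url}/statements",
        json={"code": "spark.range(10).count()"},
    ).json()
    print(requests.get(f"{session_url}/statements/{statement['id']}").json())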

A YARN cluster: (benefit #4)

  • Efficient resource allocation and isolation. We define a queue structure with the Fair Scheduler to guarantee dedicated resources: queues are preemptable under certain conditions (e.g., after waiting for at least 10 minutes), while a portion of non-preemptable resources is reserved for queues with minResources set. Scheduler and resource manager logs are used to manage cluster resources.
  • Aggressive dynamic allocation policy for high resource utilization. We set a policy in which a PySpark application holds at most a certain number of executors and automatically releases them once they are no longer needed. This policy recycles resources faster, leading to better resource utilization (a configuration sketch follows this list).
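
The exact limits and timeouts we use are tuned internally and not listed in this post; the sketch below only illustrates, with standard Spark configuration keys and placeholder values, what such a dynamic allocation policy looks like when a PySpark application is launched.

    from pyspark.sql import SparkSession

    # Illustrative dynamic allocation settings (placeholder values).
    spark = (
        SparkSession.builder
        .appName("dynamic-allocation-sketch")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.shuffle.service.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "0")
        .config("spark.dynamicAllocation.maxExecutors", "200")
        .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
        .config("spark.yarn.queue", "example_team_queue")
        .getOrCreate()
    )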

Python Dependency Management: (benefit #5)

Users can try various Python libraries (e.g., different ML frameworks) without asking platform engineers to install them. To that end, we created a Jenkins job that packages a conda environment based on a requirements file and archives it as a zip file on S3. PySpark applications are launched with "--archives" to ship the zip file to the driver and all executors, and both "PYSPARK_PYTHON" and "spark.yarn.appMasterEnv.PYSPARK_PYTHON" are reset so that the driver and the executors use the shipped environment. That way, each application runs in an isolated Python environment with all the libraries it needs (see the sketch below).
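
Concretely, the mechanism described above corresponds roughly to the settings below (the S3 path and alias are illustrative); YARN unpacks the archive on every container under the alias given after the "#", and the Python-related settings point the driver and executors at the unpacked environment.

    from pyspark.sql import SparkSession

    # Ship the packaged conda environment to the driver and every executor,
    # and point Python at the unpacked environment (illustrative path/alias).
    spark = (
        SparkSession.builder
        .appName("conda-env-sketch")
        .config("spark.yarn.dist.archives",
                "s3://example-bucket/envs/my_conda_env.zip#env")
        .config("spark.yarn.appMasterEnv.PYSPARK_PYTHON", "env/bin/python")
        .config("spark.executorEnv.PYSPARK_PYTHON", "env/bin/python")
        .getOrCreate()
    )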

Integrating with Pinterest-internal Job Submission Service (JSS): (benefit #6)

To productionize PySpark applications, users schedule them with the internal workflow system. We provide a workflow template that integrates with the job submission interfaces, letting users specify the code location, parameters, and the Python environment artifact to use.

Self-service job performance analysis: (benefit #7)

We forked the open-source Dr. Elephant and added new heuristics that analyze an application's configuration together with various runtime metrics (executor, job, stage, etc.). The service provides tuning suggestions and offers guidelines on how to write a Spark job properly. It alleviates users' debugging and troubleshooting pain, boosting their velocity; it also avoids resource waste and improves cluster stability. Below is an example of the performance analysis.

Figure 3: An overview of Dr. Elephant

Impacts

PySpark is now used throughout our Product Analytics, Data Science, and Ads teams for a wide range of use cases.

  • Training: users can iteratively train models with MLlib or any Python machine learning framework (e.g., TensorFlow) on data of any size.
  • Inference: users can test and productionize their Python code for inference without depending on platform engineers.
  • Ad-hoc analyses: users can perform various ad-hoc analyses as needed.

Moreover, our users now have the freedom to explore various Python dependencies and to use Python UDFs on large-scale data.

Acknowledgement

We thank David Liu (EM, Machine Learning Platform team), Ang Zhang (EM, Data Processing Platform team), Tais (our TPM), the Pinterest Product Analytics and Data Science organization (Sarthak Shah, Grace Huang, Minli Zhang, Dan Lee, Ladi Ositelu), the Compute Platform team (Harry Zhang, June Liu), the Data Processing Platform team (Zaheen Aziz), and the Jupyter team (Prasun Ghosh, Tech Lead) for their support and collaboration.

Appendix: An example of our use case

Below is an example of how our users train a model and run inference logic at scale from a Jupyter notebook with PySpark. Explanations are included in each cell.
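
The original notebook screenshots are not reproduced here. As a minimal stand-in, the sketch below shows the same flow with Spark MLlib: train a model, then run inference at scale with the same SparkSession; the data and feature names are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("train-and-infer-sketch").getOrCreate()

    # Hypothetical training data with two numeric features and a binary label.
    train = spark.createDataFrame(
        [(0.0, 1.2, 0.0), (1.5, 0.3, 1.0), (0.2, 0.9, 0.0), (2.1, 0.1, 1.0)],
        ["f1", "f2", "label"],
    )
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(
        assembler.transform(train)
    )

    # Inference over a (potentially much larger) table; shown here with a
    # small in-memory frame for brevity.
    scoring = spark.createDataFrame([(0.4, 1.0), (1.8, 0.2)], ["f1", "f2"])
    model.transform(assembler.transform(scoring)) \
        .select("f1", "f2", "prediction").show()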
