Simplifying Web Deploys

3,016
Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

In 2019, Pinterest has moved to a CI/CD model for our API and web layers, which has truly improved agility by reducing time between merge and production. Prior to the update, we had been deploying our web code in the same way for years, and it began showing its age. That mechanism is internally called A/B deploys and externally it’s referred to as Blue-Green deploys. In this post we describe how and why we replaced it with rolling deploys.

Our old deployment model (Blue-Green deploys)

Since the early days, the CD approach for the web layer of our main web property was based on the blue-green deployment model, where we kept two instances of the web layer deployed at all times. These instances were called A and B, therefore we commonly referred to this deployment model as A/B (not to be confused with A/B testing).

At any given time, only one of these instances would be active and taking traffic (let’s say A for example), so we’d deploy a new version to the other instance (B in this case) and switch over as soon as it had been verified with some canary traffic. B would then be on the latest version, active and receiving traffic. The cycle would then repeat with the next deploy happening on A and so on.

This model had a few positive aspects:

Instant rollbacks

When a regression somehow managed to make it past integration tests and canary traffic to be later on detected in production, we could instantly remove it by reactivating the previous version.

Only one version of the application runs at a given time

With only one of the instances active at a time and the cut-over happening virtually instantly, we could always rely on the fact that we were serving only one version of the application at a given time, which really simplified dealing with production metrics.

No capacity loss during deploys

Because deploys only targeted inactive instances, we could deploy very fast and then proceed to activate the new version when it was available everywhere. You can’t really do this that fast if you are updating production endpoints in-place.

However, things weren’t perfect. Here are a few things we didn’t like about this setup:

Need to keep two instances running

Since we had two instances of the webapp running at almost all times, our fleet had to be sized accordingly in terms of memory, disk and CPU. We also had to address other aspects of the instance duplicity, e.g. port and naming conflicts, which added complexity to our code.

No ramp up

To turn on a new version, we went from 0% to 100%. There’s a certain family of regressions that did not show up during canary phase and when they did show up, it was too late.

Statefulness

We had to maintain a lot of state in ZooKeeper to keep track of what had previously been served, when the new version became ready, etc. Over the years, the state machine controlling all of this grew wildly complex to the point where it was hard to change something without causing an incident.

Complex routing logic

The logic to ensure that requests were routed to the right version is hard to get right when you have more than a few possible states. We had to account for all the possible combinations of A serving, B serving, canary serving A, canary serving B, etc. This, coupled with logic to signal version upgrades to our Javascript code, made everything hard to maintain and even harder to extend that code base to support new use cases.

Uniqueness

Most other stateless clusters at Pinterest use the well known rolling deploy model based on Teletraan, so there’s a real cognitive tax in having a hard to understand deploy model just for our web cluster.

The new deployment model (Rolling deploys)

Last year we decided it was time to move to a rolling deployment model. A cross-functional team was assembled to plan and execute the project, comprised of engineers across the delivery platform, traffic and web teams.

After exploring multiple approaches — each one essentially differing in how much complexity happened at the client vs the frontend proxy vs backend web clusters — we decided that we could handle the bulk of our routing logic in our Envoy ingress cluster.

Rolling deploys from the web application perspective

From the application perspective, the move to rolling deploys represented a fundamental change in the way we dealt with production metrics and issues: we could no longer simply rely on the fact that only one version was being served at a given time; in fact, mid-deploy we would have two different versions each running on half of the fleet. Therefore one of our action items was to update our systems and metrics to be more version-aware.

The version of our client-side application also became a key point of discussion, since we have long had a requirement for version affinity between the client-side and the server-side portions of our application. That means that 1) XHR requests coming from a client running a certain version of the app should be processed by server-side code from the same version and 2) our client refreshes to a new version when a new server-side version is detected.

Graph showing web client refreshes during the day, each color represents a new version being rolled out to the web clients. There, peaks coincide with the period when a new version is being deployed to our servers. At that moment, we signal to web clients that a new version is available on the server-side and instruct it to refresh. Once the deploy is complete, the number of refreshes rolls off until a new deploy starts.

We decided to maintain this approach since it provides a number of benefits in terms of development and operations as a consequence of the consistency between client-side and server-side code. However, with rolling deploys the cut-off to a new version is no longer a single point in time but instead a longer interval where two or more servers versions can co-exist. We quickly learned that we would need to roll the client updates along with the server-side updates to maintain a healthy ratio of requests per host while keeping the version affinity mechanism.

A day in the life of the Pinterest web app.

The graph above shows active user sessions, with each color representing a different version.

Notice how we “roll” web clients from one version to the other following our deploys throughout the day. The smaller blue peak represents a deployment that was rolled back when an issue was identified before its completion. It shows one of the many advantages of this model: early incident detection.

Rolling deploys and traffic routing

Last year, the Traffic team replaced our ingress tier based on Varnish with the new and powerful Envoy proxy. Envoy is easily extensible via filters which can be written in modern C++. The ability to extend our edge load-balancers with custom functionality and powerful metrics gave us confidence to explore a replacement for the Blue-Green deployment model. We set out with the goal of having an almost identical deployment model to every other cluster, while maintaining version affinity between client and server during deploys so that the Web team can carry on with the existing premise. This is also important, because switching across versions comes with a cost (e.g.: a browser refresh). So this needs to happen at most once for every active Pinner during a deploy.

We first simplified the client-side logic to ensure that the state machine which handles version switching had only one entry point, to make it easier to operate. Because of our unique requirements we couldn’t just use Envoy’s existing routing mechanism. Our requirements were:

  • During deprecation, both deployment types should be supported (Blue-Green and Rolling)
  • We should be able to gracefully shift over a % of traffic across stages
  • Behaviors should be as deterministic as possible. E.g.: when forcing an existing session into a new version, it shouldn’t jump back to the previous one unless there’s a rollback

So, we designed and prototyped a routing filter that would be in charge of distributing requests during a rolling deploy, while honoring the above requirements.

The first requirement is critical, and most successful migrations are so because they provide a good story around gracefully moving from the Old World into the New World. This allowed us to build confidence while we moved along, even though it came with a tax of supporting more complexity.

The Envoy filter’s state machine ended up looking something like this:

  • If a request has no routing id, assign it one
  • For a given routing id, pick a stage. E.g.: hash(routing_id) % len(stages)
  • Within a given stage, if it’s using rolling deploys then pick a version. E.g.: hash(routingid) % len(versionsforthatstage)

To avoid permanently sticking users to a stage, we established that a routing id has a duration of 24 hours. We also came up with the concept of a Route Map, which describes the traffic distribution across stages and versions. Here’s an example map:

This route map will send 99.5% of traffic to prod and 0.5% to canary. Within each stage, it’ll distribute traffic dynamically and consistently across versions. Dynamically means it’ll route based on the available capacity for each version. Consistently means it’ll apply an ordering between a routing id and the available versions to ensure a given routing_id is not jumping across versions during a deploy and that it only jumps once.

The route map is stored in ZooKeeper and distributed via our config pipeline. The capacity per version per stage is calculated from the available endpoints on our published serversets (which also exist in ZooKeeper). That is, endpoints have metadata about their versions which is then used for capacity calculation. This was all very convenient, because we could rely on existing and battle tested systems. However, it also comes with the challenge of eventual consistency. Not all Envoy servers have the same view of the world at the same time.

To work around this, we extended our filter to give it the notion of “deployment direction”. That is, when a route map is changing you can infer which version is being deployed by observing how capacity changes. A version that is increasing in capacity is the new version. Thus, when there’s a mismatch between the version a session wants versus what the filter thinks it should get we use the deployment’s direction to break the ambiguity. This ended up being very useful for quelling the version bouncing happening because of lack of synchronization across Envoys.

Conclusion

Deployment strategies and traffic routing are fun challenges. Getting them right can really smooth out your developer and operational experience. They can also greatly increase your reliability, when the pipeline is easy to reason about and debug. Being able to build this on top of Envoy really made things easier, given the vitality of the project and how easy it is to extend its core logic via filters.

Changing core infrastructure that has been around for years is always challenging because there’s a lot of undocumented behavior. However, our approach of a phased transition across deployment models made it possible to get steady feedback and ensure an incident-free migration.

This project was a joint effort across multiple teams: Delivery Platform, Core Web, Service Framework and Traffic. During the process we also received very valuable feedback from other teams and actors.

Credits for design ideas & code reviews: James Fish, Derek Argueta, Scott Beardsley, Micheal Benedict, Chris Lloyd

We’re building the world’s first visual discovery engine. More than 250 million people around the world use Pinterest to dream about, plan and prepare for things they want to do in life. Come join us!

Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.
Tools mentioned in article
Open jobs at Pinterest
Machine Learning Engineer
San Francisco, CA, US; Palo Alto, CA, US; Seattle, WA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our PinFlex landing page to learn more. 

With more than 400 million users around the world and 300 billion ideas saved, Pinterest Machine Learning engineers build personalized experiences to help Pinners create a life they love. With just over 3,000 global employees, our teams are small, mighty, and still growing. At Pinterest, you’ll experience hands-on access to an incredible vault of data and contribute large-scale recommendation systems in ways you won’t find anywhere else.

What you’ll do:

  • Build cutting edge technology using the latest advances in deep learning and machine learning to personalize Pinterest
  • Partner closely with teams across Pinterest to experiment and improve ML models for various product surfaces (Homefeed, Ads, Growth, Shopping, and Search), while gaining knowledge of how ML works in different areas
  • Use data driven methods and leverage the unique properties of our data to improve candidates retrieval
  • Work in a high-impact environment with quick experimentation and product launches
  • Keeping up with industry trends in recommendation systems 

 

What we’re looking for:

  • 2+ years of industry experience applying machine learning methods (e.g., user modeling, personalization, recommender systems, search, ranking, natural language processing, reinforcement learning, and graph representation learning)
  • End-to-end hands-on experience with building data processing pipelines, large scale machine learning systems, and big data technologies (e.g., Hadoop/Spark)
  • Nice to have:
    • M.S. or PhD in Machine Learning or related areas
    • Publications at top ML conferences
    • Expertise in scalable realtime systems that process stream data
    • Passion for applied ML and the Pinterest product

 

#LI-HYBRID
#LI-LA1

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

iOS Engineer, Product
San Francisco, CA, US; New York City, NY, US; Portland, OR, US; Seattle, WA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our PinFlex landing page to learn more. 

We are looking for inquisitive, well-rounded iOS engineers to join our Product engineering teams. Working closely with product managers, designers, and backend engineers, you’ll play an important role in enabling the newest technologies and experiences. You will build robust frameworks & features. You will empower both developers and Pinners alike. You’ll have the opportunity to find creative solutions to thought-provoking problems. Even better, because we covet the kind of courageous thinking that’s required in order for big bets and smart risks to pay off, you’ll be invited to create and drive new initiatives, seeing them from inception through to technical design, implementation, and release.

What you’ll do:

  • Build out Pinner-facing frontend features in iOS to power the future of inspiration on Pinterest
  • Contribute to and lead each step of the product development process, from ideation to implementation to release; from rapidly prototyping, running A/B tests, to architecting and building solutions that can scale to support millions of users
  • Partner with design, product, and backend teams to build end to end functionality
  • Put on your Pinner hat to suggest new product ideas and features
  • Employ automated testing to build features with a high degree of technical quality, taking responsibility for the components and features you develop
  • Grow as an engineer by working with world-class peers on varied and high impact projects

What we’re looking for:

  • Deep understanding of iOS development and best practices in Objective C and/or Swift, e.g. xCode, app states, memory management, etc
  • 2+ years of industry iOS application development experience, building consumer or business facing products
  • Experience in following best practices in writing reliable and maintainable code that may be used by many other engineers
  • Ability to keep up-to-date with new technologies to understand what should be incorporated
  • Strong collaboration and communication skills

Product iOS Engineering teams: 

Creator Incentives 

Home Product

Native Publishing

Search Product

Social Growth

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

Machine Learning Engineer, Core Engi...
San Francisco, CA, US; Palo Alto, CA, US; Seattle, WA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our PinFlex landing page to learn more. 

With more than 400 million users around the world and 300 billion ideas saved, Pinterest Machine Learning engineers build personalized experiences to help Pinners create a life they love. With just over 3,000 global employees, our teams are small, mighty, and still growing. At Pinterest, you’ll experience hands-on access to an incredible vault of data and contribute large-scale recommendation systems in ways you won’t find anywhere else.

What you’ll do:

  • Build cutting edge technology using the latest advances in deep learning and machine learning to personalize Pinterest
  • Partner closely with teams across Pinterest to experiment and improve ML models for various product surfaces (Homefeed, Ads, Growth, Shopping, and Search), while gaining knowledge of how ML works in different areas
  • Use data driven methods and leverage the unique properties of our data to improve candidates retrieval
  • Work in a high-impact environment with quick experimentation and product launches
  • Keeping up with industry trends in recommendation systems 

 

What we’re looking for:

  • 2+ years of industry experience applying machine learning methods (e.g., user modeling, personalization, recommender systems, search, ranking, natural language processing, reinforcement learning, and graph representation learning)
  • End-to-end hands-on experience with building data processing pipelines, large scale machine learning systems, and big data technologies (e.g., Hadoop/Spark)
  • Nice to have:
    • M.S. or PhD in Machine Learning or related areas
    • Publications at top ML conferences
    • Expertise in scalable realtime systems that process stream data
    • Passion for applied ML and the Pinterest product

 

#LI-HYBRID
#LI-LA1

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

Software Engineer, Infrastructure
San Francisco, CA, US; Palo Alto, CA, US; Seattle, WA, US

About Pinterest:  

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. In your role, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping Pinners make their lives better in the positive corner of the internet.

Our new progressive work model is called PinFlex, a term that’s uniquely Pinterest to describe our flexible approach to living and working. Visit our PinFlex landing page to learn more. 

The Pinterest Infrastructure Engineering organization builds, scales, and evolves the systems which the rest of Pinterest Engineering uses to deliver inspiration to the world.  This includes source code management, continuous integration, artifact packaging, continuous deployment, service traffic management, service registration and discovery, as well as holistic observability and the underlying compute runtime and container orchestration.  A collection of platforms and capabilities which accelerate development velocity while protecting Pinterest’s production availability for one of the world’s largest public cloud workloads. 

What you’ll do:

  • Design, develop, and operate large scale, distributed systems and networks
  • Work with Engineering customers to understand new requirements and address them in a scalable and efficient manner
  • Actively work to improve the developer process and experience in all phases from coding to operation

What we’re looking for:

  • 2+ years of industry software engineering experience
  • Experience building & operating large scale distributed systems and/or networks
  • Experience in Python, Java, C++, or Go or another language and a willingness to learn
  • Bonus: Experience deploying and operating large scale workloads on a public cloud footprint

Available Hiring Teams: Cloud Delivery Platform (Infra Eng), Code & Language Runtime (Infra Eng), Traffic (Infra Eng), Cloud Systems (Infra Eng), Online Systems (Data Eng), Key Value Systems (Data Eng), Real Time Analytics (Data Eng), Storage & Caching (Data Eng), ML Serving Platform (Data Eng)

 

#LI-SG1

Our Commitment to Diversity:

At Pinterest, our mission is to bring everyone the inspiration to create a life they love—and that includes our employees. We’re taking on the most exciting challenges of our working lives, and we succeed with a team that represents an inclusive and diverse set of identities and backgrounds.

Verified by
Software Engineer
Sourcer
Software Engineer
Talent Brand Manager
Tech Lead, Big Data Platform
Security Software Engineer
You may also like