Improving Efficiency and Reducing Runtime Using S3 Read Optimization

Pinterest
Pinterest's profile on StackShare is not actively maintained, so the information here may be out of date.

By Bhalchandra Pandit | Software Engineer


Overview

We describe a novel approach to improving S3 read throughput and how we used it to improve the efficiency of our production jobs. The results have been very encouraging. A standalone benchmark showed a 12x improvement in S3 read throughput (from 21 MB/s to 269 MB/s). The increased throughput allowed our production jobs to finish sooner. As a result, we saw a 22% reduction in vcore-hours, a 23% reduction in memory-hours, and a similar reduction in the run time of a typical production job. Although we are happy with the results, we plan to explore additional enhancements, which are briefly described at the end of this blog.

Motivation

We process petabytes of data stored on Amazon S3 every day. When we inspect the relevant metrics of our MapReduce/Cascading/Scalding jobs, one thing stands out: slower-than-expected mapper speed. In most cases, the observed mapper speed is around 5–7 MB/sec. That is roughly 30 to 40 times slower than the observed throughput of commands such as aws s3 cp, where speeds of 200+ MB/sec are common (observed on a c5.4xlarge instance in EC2). If we can increase the speed at which our jobs read data, they will finish sooner and save us considerable time and money. Given that processing is costly, these savings can quickly add up to a substantial amount.

S3 read optimization

The Problem: Throughput bottleneck in S3A

If we inspect the implementation of S3AInputStream, it is easy to notice the following potential areas of improvement:

  1. Single-threaded reads: Data is read synchronously on a single thread, which results in jobs spending most of their time waiting for data to be read over the network.
  2. Multiple unnecessary reopens: The S3 input stream is not seekable. A split has to be closed and reopened each time one performs a seek or encounters a read error. The larger the split, the greater the chance of that happening. Each such reopening further reduces the overall throughput.

The Solution: Improving read throughput

Architecture

Figure 1: Components of a prefetching+caching S3 reader

Our approach to addressing the above-mentioned drawbacks includes the following:

  1. We treat a split as being made up of fixed-size blocks. The block size defaults to 8 MB but is configurable.
  2. Each block is read asynchronously into memory before it can be accessed by a caller. The size of the prefetch cache (in terms of number of blocks) is configurable.
  3. A caller can only access a block that has already been prefetched into memory. That decouples the client from network flakiness and allows us to add a retry layer that increases overall resiliency.
  4. Each time we encounter a seek outside of the current block, we cache the prefetched blocks in the local file system.
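To make the block layout in the first item concrete, here is a minimal sketch in Python. It is not the S3E code; the names `Block`, `split_into_blocks`, and `block_index_for` are hypothetical, and the only value taken from the text is the 8 MB default block size.

```python
from dataclasses import dataclass

DEFAULT_BLOCK_SIZE = 8 * 1024 * 1024  # 8 MB default mentioned in the article


@dataclass(frozen=True)
class Block:
    index: int  # position of the block within the split
    start: int  # absolute offset of the first byte (inclusive)
    end: int    # absolute offset one past the last byte (exclusive)


def split_into_blocks(split_start, split_length, block_size=DEFAULT_BLOCK_SIZE):
    """Return the fixed-size blocks covering a split; the last may be shorter."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    split_end = split_start + split_length
    blocks = []
    offset = split_start
    while offset < split_end:
        end = min(offset + block_size, split_end)
        blocks.append(Block(len(blocks), offset, end))
        offset = end
    return blocks


def block_index_for(offset, split_start, block_size=DEFAULT_BLOCK_SIZE):
    """Map an absolute offset to the index of the block that contains it."""
    return (offset - split_start) // block_size
```

With this layout, a seek to any offset reduces to an index lookup, which is what lets a reader decide whether the target block is already in memory.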

We further enhanced the implementation to make it a mostly lock-free producer-consumer interaction. This enhancement improves read throughput from 21 MB/sec to 269 MB/sec as measured by a standalone benchmark (see Figure 2 below for details).
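The producer-consumer interaction can be illustrated with a short Python sketch. It is a simplification under stated assumptions: `fetch_block` is a hypothetical callable standing in for a ranged S3 read, the retry count is arbitrary, and the bounded `queue.Queue` used here relies on locks internally, so it does not reproduce the mostly lock-free design described above.

```python
import queue
import threading


def prefetching_reader(fetch_block, num_blocks, prefetch_blocks=2, max_retries=3):
    """Yield blocks in order while a background thread fetches ahead.

    fetch_block(i) -> bytes is a hypothetical stand-in for reading block i.
    """
    buffer = queue.Queue(maxsize=prefetch_blocks)  # bounds memory use
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for i in range(num_blocks):
                for attempt in range(max_retries):
                    try:
                        data = fetch_block(i)
                        break
                    except IOError:
                        # Retry transient failures; give up on the last attempt.
                        if attempt == max_retries - 1:
                            raise
                buffer.put(data)  # blocks while the prefetch buffer is full
            buffer.put(done)
        except Exception as exc:
            buffer.put(exc)  # hand the failure over to the consumer

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()  # the consumer only ever sees fetched blocks
        if item is done:
            return
        if isinstance(item, Exception):
            raise item
        yield item
```

The consumer never touches the network itself, which is why retries can live entirely on the producer side.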

Sequential reads

Any data consumer that processes data sequentially (for example, a mapper) benefits greatly from this approach. While a mapper is processing the currently retrieved data, the data next in sequence is being prefetched asynchronously. Most of the time, that data has already been prefetched by the time the mapper is ready for the next block. As a result, a mapper spends more time doing useful work and less time waiting for data, which effectively increases CPU utilization.
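A back-of-the-envelope model shows why overlapping fetch and processing helps. The sketch below assumes constant per-block fetch and processing times and one block of lookahead; the function names and the sample numbers are illustrative, not measurements from our jobs.

```python
def sequential_time(num_blocks, fetch_s, process_s):
    """Total time when each block is fetched and then processed, one at a time."""
    return num_blocks * (fetch_s + process_s)


def pipelined_time(num_blocks, fetch_s, process_s):
    """Total time when block i+1 is fetched while block i is processed.

    Assumes constant per-block times and at least one block of lookahead.
    """
    if num_blocks == 0:
        return 0.0
    # The first fetch and the last processing step cannot overlap with anything;
    # every step in between costs whichever of the two activities is slower.
    return fetch_s + (num_blocks - 1) * max(fetch_s, process_s) + process_s


# Hypothetical numbers: 10 blocks, 0.4 s to fetch and 0.5 s to process each.
print(sequential_time(10, 0.4, 0.5))
print(pipelined_time(10, 0.4, 0.5))
```

In this model the pipelined run is bounded by the slower of the two activities, so once fetching is faster than processing, the mapper stops waiting on the network almost entirely.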

More efficient Parquet reads

Parquet files require non-sequential access as dictated by their on-disk format. Our initial implementation did not use a local cache. Each time there was a seek outside of the current block, we had to discard any prefetched data. That resulted in worse performance compared to the stock reader when it came to reading from Parquet files.

We observed significant improvement in the read throughput for Parquet files once we introduced the local caching of prefetched data. Currently, our implementation increases Parquet file reading throughput by 5x compared to the stock reader.
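The idea of keeping fetched blocks on local disk so that a later seek does not trigger a refetch can be sketched as follows. This is an illustration only: `LocalBlockCache` and `read_block` are hypothetical names, `fetch_block` stands in for a remote read, and the sketch caches every block on first fetch rather than only when a seek leaves the current block.

```python
import os
import shutil
import tempfile


class LocalBlockCache:
    """Keeps already-fetched blocks on local disk, keyed by block index."""

    def __init__(self):
        self._dir = tempfile.mkdtemp(prefix="block-cache-")
        self._paths = {}  # block index -> path of the cached file

    def put(self, index, data):
        if index in self._paths:
            return  # already cached
        path = os.path.join(self._dir, f"block-{index}")
        with open(path, "wb") as f:
            f.write(data)
        self._paths[index] = path

    def get(self, index):
        path = self._paths.get(index)
        if path is None:
            return None
        with open(path, "rb") as f:
            return f.read()

    def close(self):
        """Delete all cached files and the temporary directory."""
        self._paths.clear()
        shutil.rmtree(self._dir, ignore_errors=True)


def read_block(index, fetch_block, cache):
    """Return block `index`, preferring the local cache over a remote fetch."""
    data = cache.get(index)
    if data is None:
        data = fetch_block(index)  # hypothetical remote read
        cache.put(index, data)
    return data
```

A reader that jumps between a file footer and scattered column chunks, as Parquet readers do, would then pay the network cost for each block at most once.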

Improvement in production jobs

Improved read throughput leads to a number of efficiency improvements in production jobs.

Reduced job runtime

The overall runtime of a job is reduced because mappers spend less time waiting for data and finish sooner.

Potentially reduced number of mappers

If mappers take sufficiently less time to finish, we can reduce the number of mappers by increasing the split size. Fewer mappers means less CPU wasted on the fixed overhead of each mapper. More importantly, this can be done without increasing the run time of a job.
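The relationship between split size, mapper count, and fixed overhead is simple arithmetic, sketched below. The input size, split sizes, and the 10-second per-mapper overhead are hypothetical values chosen for illustration.

```python
def mapper_count(input_bytes, split_bytes):
    """Number of mappers needed when each mapper handles one split."""
    if split_bytes <= 0:
        raise ValueError("split_bytes must be positive")
    return -(-input_bytes // split_bytes)  # integer ceiling division


def total_fixed_overhead_s(input_bytes, split_bytes, per_mapper_overhead_s):
    """CPU seconds spent on per-mapper fixed overhead across the whole job."""
    return mapper_count(input_bytes, split_bytes) * per_mapper_overhead_s


MiB = 1024 ** 2
TiB = 1024 ** 4

# Hypothetical job: 1 TiB of input, 10 s of fixed overhead per mapper.
for split in (128 * MiB, 256 * MiB):
    print(split // MiB, mapper_count(TiB, split), total_fixed_overhead_s(TiB, split, 10))
```

Doubling the split size halves the mapper count and therefore the total fixed overhead; faster reads are what keep the larger splits from lengthening each mapper's run time.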

Improved CPU utilization

The overall CPU utilization increases because the mappers are doing the same work in less time.

Results

For now, our implementation (S3E) is in a separate git repository to allow faster iterations over enhancements. We will eventually contribute it back to the community by merging it back into S3A.

Standalone benchmark

Figure 2: Throughput of S3A vs S3E

In each case, we read a 3.5 GB S3 file sequentially and wrote it locally to a temp file. The local write simulates the I/O overlap that takes place during a mapper operation. The benchmark was run on a c5.9xlarge instance in EC2. We measured the total time taken to read the file and computed the effective throughput of each method.
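Effective throughput here is just bytes read divided by elapsed time. The sketch below shows that calculation along with the speedup implied by the reported 21 MB/s and 269 MB/s figures. It assumes decimal units (1 MB = 1,000,000 bytes), which the benchmark description does not specify, and the function names are hypothetical.

```python
def effective_throughput_mb_s(num_bytes, elapsed_s):
    """Throughput in MB/s, assuming 1 MB = 1,000,000 bytes."""
    return num_bytes / 1_000_000 / elapsed_s


def seconds_to_read(num_bytes, throughput_mb_s):
    """Time needed to read num_bytes at a given throughput in MB/s."""
    return num_bytes / 1_000_000 / throughput_mb_s


def speedup(baseline_mb_s, improved_mb_s):
    """Ratio of improved throughput to baseline throughput."""
    return improved_mb_s / baseline_mb_s


FILE_BYTES = 3_500_000_000  # the 3.5 GB benchmark file, in decimal units

print(round(seconds_to_read(FILE_BYTES, 21), 1))   # stock reader, reported rate
print(round(seconds_to_read(FILE_BYTES, 269), 1))  # prefetching reader, reported rate
print(round(speedup(21, 269), 1))
```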

Production run

We tested many large production jobs with the S3E implementation. Those jobs typically use tens of thousands of vcores per run. Figure 3 summarizes the metrics obtained with and without S3E enabled.

Measuring resource savings

We use the following method to compute resource savings resulting from this optimization.

Observed results

Figure 3: Comparison of MapReduce job resource consumption

Given the variation in workload characteristics across production jobs, we saw vcore reductions of anywhere between 6% and 45% across 30 of our most expensive jobs. The average saving was a 16% reduction in vcore-days.

One thing that is attractive about our approach is that it can be enabled for a job without requiring any change to a job’s code.

Future direction

At present, the enhanced implementation lives in a separate git repository. In the future, we will likely update the existing S3A implementation and contribute the changes back to the community.

We are in the process of rolling out this optimization across a number of our clusters. We will publish the results in a future blog.

Given that the core implementation of the S3E input stream does not depend on any Hadoop code, we can use it in any other system where large amounts of S3 data are accessed. Currently, we are using this optimization to target MapReduce, Cascading, and Scalding jobs. However, we have also seen very encouraging results with Spark and Spark SQL in our preliminary evaluation.

The current implementation could benefit from further tuning to improve its efficiency. It is also worth exploring whether we can use past execution data to automatically tune the block size and the prefetch cache size used for each job.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog, and visit our Pinterest Labs site. To view and apply to open opportunities, visit our Careers page.
