Cost Reduction in Goku

Pinterest
Pinterest is a social bookmarking site where users collect and share photos of their favorite events, interests and hobbies. One of the fastest growing social networks online, Pinterest is the third-largest such network behind only Facebook and Twitter.

By Monil Mukesh Sanghavi | Software Engineer, Real Time Analytics Team; Rui Zhang | Software Engineer, Real Time Analytics Team; Hao Jiang | Software Engineer, Real Time Analytics Team; Miao Wang | Software Engineer, Real Time Analytics Team


In 2018, we launched Goku, a scalable and highly performant time series database system, which served as the storage and query serving engine for short term metrics (less than one day old). In early 2020, we launched GokuL (Goku long term), which extended Goku’s capability by supporting long term metrics data (i.e. data older than a day and up to a year old). Together, these completely replaced OpenTSDB. For GokuL, we used three clusters of i3.4xlarge SSD-backed EC2 instances, which over time proved very costly. Reducing this cost was one of our primary aims going into 2021. This blog post covers the approach we took to achieve that goal.

Background

We use a tiered approach to segregate the long term data and store it in the form of buckets.

Table 1: table of a tiered approach

Tiers 1–5 contain the data stored on the GokuL (long term) clusters. GokuL uses RocksDB to store its long term data, and the data is ingested in the form of SST files.

Query Analysis

We analyzed the queries going to the long term cluster and observed the following:

  1. Very few metrics (approximately 6K out of a total of 10B) had data points older than three months queried from GokuL.
  2. More than half of the GokuL queries had specified rollup intervals of one day or more.

Tier 5 Data Analysis

We randomly selected a few shards in GokuL and analyzed the data. We observed that the storage footprint of tier 5 data (measured as SST file size) was much larger than that of all the other tiers (1–4) combined. This was despite the fact that tier 5 contains only data rolled up to one hour intervals, whereas the other tiers contain a mix of raw and 15 minute rolled up data.

Table 2: SST File size for each bucket in MiB

Solutions

From the query and tier 5 analyses, we inferred that tier 5 data (which holds six buckets of 64 days of data each) was both the least queried and the most disk consuming. We targeted this tier because it would yield the largest benefit. The solutions we discussed are described below.

Namespace

A feature called namespace would store configurations such as ttl, rollup interval, and tier configuration for the set of metrics belonging to that namespace. Uber’s M3 has a similar solution. This would let us set appropriate configurations for a select set of metrics (e.g. a lower ttl for metrics that do not require longer retention). Because the time to production for this project was longer, we decided to make it a separate project, which is now being actively worked on.
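A minimal sketch of what a namespace lookup could look like is shown here. It is not Goku's design: the `NamespaceConfig` fields, the prefix-based matching rule, and every namespace name and value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamespaceConfig:
    name: str
    metric_prefix: str         # hypothetical matching rule: metric name prefix
    ttl_days: int              # how long data for matching metrics is retained
    rollup_interval_secs: int  # rollup interval applied to long term data


# Hypothetical namespaces; the empty prefix acts as a catch-all default.
NAMESPACES = [
    NamespaceConfig("default", "", ttl_days=365, rollup_interval_secs=3600),
    NamespaceConfig("short_lived", "experiments.", ttl_days=30, rollup_interval_secs=900),
]


def resolve_namespace(metric_name: str) -> NamespaceConfig:
    """Return the namespace whose prefix is the longest match for the metric."""
    matches = [ns for ns in NAMESPACES if metric_name.startswith(ns.metric_prefix)]
    return max(matches, key=lambda ns: len(ns.metric_prefix))
```

With a lookup like this, a metric such as `experiments.foo` would get the shorter ttl, while every other metric falls back to the default retention.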

Rollup Interval Adjust for Tier 5 Data

We experimented with changing the rollup interval of tier 5 data from one hour to one day and observed the change in the final SST file(s) size for the tier 5 bucket.

Table 3

The savings from this solution were not large enough to justify putting it into production.
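For readers unfamiliar with rollups, the following sketch shows the kind of re-aggregation this experiment implies: hourly points are folded into one day buckets. The `rollup` function and the choice of sum as the aggregator are assumptions for illustration, not Goku's implementation.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86_400


def rollup(points, interval_secs=SECONDS_PER_DAY):
    """Sum (timestamp, value) points into buckets of interval_secs seconds."""
    buckets = defaultdict(float)
    for ts, value in points:
        bucket_start = ts - (ts % interval_secs)  # align to the bucket boundary
        buckets[bucket_start] += value
    return sorted(buckets.items())


# Two days of hourly points, each with value 1.0.
hourly = [(hour * 3600, 1.0) for hour in range(48)]
daily = rollup(hourly)  # one point per day
```

Going from one hour to one day reduces the number of points per series by up to 24x, but, as reported above, the measured reduction in SST file size was not enough to justify the change.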

On Demand Loading of Tier 5 Data

In this approach, GokuL clusters would load only the data from tiers 1–4 on startup and would load tier 5 buckets as needed (based on queries). The cons of this solution were:

  • Users would have to wait and retry the query once the corresponding tier 5 bucket had been ingested from S3 by the GokuL host.
  • Once ingested, the bucket would remain in GokuL unless thrown away by an eviction algorithm.

We decided not to go with this solution because it was not user friendly.
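The wait-and-retry behavior described above can be sketched as follows. This is an illustration only: the class and exception names are hypothetical, `fetch_bucket` stands in for whatever would download a bucket from S3, and least-recently-used eviction is an assumption, since the post does not name an eviction algorithm.

```python
from collections import OrderedDict


class BucketNotLoaded(Exception):
    """Signals that the caller should retry once the bucket has been ingested."""


class OnDemandBucketCache:
    def __init__(self, capacity, fetch_bucket):
        self.capacity = capacity          # max number of tier 5 buckets kept locally
        self.fetch_bucket = fetch_bucket  # hypothetical loader, e.g. a download from S3
        self.loaded = OrderedDict()       # bucket_id -> data, oldest use first

    def query(self, bucket_id):
        if bucket_id in self.loaded:
            self.loaded.move_to_end(bucket_id)  # mark as most recently used
            return self.loaded[bucket_id]
        # Miss: ingest the bucket, evict the least recently used one if full,
        # and tell the caller to retry.
        self.loaded[bucket_id] = self.fetch_bucket(bucket_id)
        if len(self.loaded) > self.capacity:
            self.loaded.popitem(last=False)
        raise BucketNotLoaded(bucket_id)
```

The first query for a cold bucket always fails with `BucketNotLoaded`, which is the user-facing cost that led to this option being rejected.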

Tiered Storage

We decided to move tier 5 data into a separate HDD-based cluster. Query latency on this tier increased noticeably, but we judged this acceptable because far fewer queries hit it. We calculated that tier 5 was consuming approximately 1 TB on each of the 650 hosts in the GokuL cluster. We decided to use d2.2xlarge instances to store and serve the tier 5 data in GokuL.

Table 4

The cost savings from this solution were substantial. We replaced around 325 i3.4xlarge instances with 111 d2.2xlarge instances, which reduced our costs by nearly 30–35%.
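The host count can be sanity-checked with simple arithmetic. The sketch below takes the post's figure of about 1 TB on each of 650 hosts and assumes 12 TB of local HDD per d2.2xlarge (AWS's published 6 x 2 TB configuration). It gives only a raw-capacity lower bound and ignores replication, filesystem overhead, and headroom, none of which the post itemizes.

```python
import math


def min_hosts(total_data_tb, raw_tb_per_host):
    """Smallest number of hosts whose raw disk can hold total_data_tb."""
    return math.ceil(total_data_tb / raw_tb_per_host)


tier5_total_tb = 650 * 1   # from the post: ~1 TB of tier 5 data on each of 650 hosts
D2_2XLARGE_HDD_TB = 6 * 2  # assumption: 6 x 2 TB HDD per d2.2xlarge instance

lower_bound = min_hosts(tier5_total_tb, D2_2XLARGE_HDD_TB)
```

Under these assumptions the lower bound is 55 hosts, so the 111 instances reported above would leave roughly half of the raw disk as headroom.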

To support this, we had to design and implement tier-based routing in the Goku root cluster, which routes queries to the short term and long term leaf clusters.
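One way to picture tier-based routing is as a function from a query's time range to the set of leaf clusters that hold data for it. The sketch below is not the Goku root's implementation: the cluster names are hypothetical, and the 80 day tier 5 boundary is a placeholder, since the tier table is not reproduced here. Only the one day short term window comes from the post.

```python
from datetime import datetime, timedelta

SHORT_TERM_WINDOW = timedelta(days=1)  # from the post: short term is under a day old
TIER5_MIN_AGE = timedelta(days=80)     # hypothetical boundary where tier 5 begins


def route(start, end, now):
    """Return the leaf clusters whose data overlaps the range [start, end]."""
    short_term_cutoff = now - SHORT_TERM_WINDOW
    tier5_cutoff = now - TIER5_MIN_AGE
    clusters = []
    if end > short_term_cutoff:
        clusters.append("goku-short-term")      # hypothetical cluster name
    if start <= short_term_cutoff and end > tier5_cutoff:
        clusters.append("gokul-ssd-tiers-1-4")  # hypothetical cluster name
    if start <= tier5_cutoff:
        clusters.append("gokul-hdd-tier-5")     # hypothetical cluster name
    return clusters
```

A query over the last week would fan out to the short term and SSD clusters only, so the slower HDD cluster is touched only by queries that reach far enough back in time.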

In the future, we can evaluate whether to reduce the number of replicas for this tier, trading some availability for lower cost given the low number of queries.

RocksDB Tuning

As mentioned above, GokuL uses RocksDB to store the long term data. We observed that the RocksDB options we were using were not optimal for Goku’s workload, which has high data volume and low QPS.

We experimented with using a stronger compression algorithm (ZSTD with level 5), and this reduced the disk usage by 40%. In addition to this, we enabled the partitioned index filter wherein only the top level index is loaded into memory. On top of this, we enabled caching with higher priority for filter and index blocks so that they use the same cache as the data blocks and also minimize the performance impact.

With both of the above changes, we found that the latency difference was small and the reduction in disk space usage was approximately 50%. We immediately put this into production and shrank the size and cost of our GokuL clusters by another half.
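The settings described in this section correspond to options that RocksDB exposes on `Options` and `BlockBasedTableOptions`. The summary below records them as a plain Python mapping rather than calling any RocksDB binding. The option names are RocksDB's, but mapping the post's description onto these particular fields is an interpretation, and only the ZSTD level of 5 is stated in the source.

```python
# Option names follow RocksDB's C++ structs. Which exact fields GokuL sets is an
# assumption based on the description in the post.
ROCKSDB_TUNING = {
    "Options": {
        "compression": "kZSTD",       # stronger compression algorithm
        "compression_opts.level": 5,  # level stated in the post
    },
    "BlockBasedTableOptions": {
        "index_type": "kTwoLevelIndexSearch",   # partitioned (two-level) index
        "partition_filters": True,              # partition the filter blocks as well
        "cache_index_and_filter_blocks": True,  # share the block cache with data blocks
        "cache_index_and_filter_blocks_with_high_priority": True,
    },
}


def describe(tuning):
    """Flatten the mapping into 'Group.option = value' lines for review."""
    return [
        f"{group}.{name} = {value}"
        for group, options in tuning.items()
        for name, value in options.items()
    ]
```

Keeping index and filter blocks in the block cache at high priority is meant to stop them from being evicted by data blocks, which fits the post's observation that latency changed little.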

What’s Next

Namespace

As mentioned, we are actively working on the namespace feature, which will help us reduce long term cluster costs even further by lowering the ttl for the many current metrics that do not need long retention.

Acknowledgments

Huge thanks to Brian Overstreet, Wei Zhu, and the observability team for providing and supporting solutions on the table.
