How Algolia Reduces Latency For 21B Searches Per Month

By Josh Dzielak, Developer Advocate at Algolia.


Algolia Paris meeting room


Algolia helps developers build search. At the core of Algolia is a built-from-scratch search engine exposed via a JSON API. In February 2017, we processed 21 billion queries and 27 billion indexing operations for 8,000+ live integrations. Some more numbers:

  • Query volume: 1B/day peak, 750M/day average (13K/s during peak hours)
  • Indexing operations: 10B/day peak, 1B/day average (spikes can be over 1M/s)
  • Number of API servers: 800+
  • Total memory in production: 64TB
  • Total I/O per day: 3.9PB
  • Total SSD storage capacity: 566TB

We’ve written about our stack before and are big fans of StackShare and the community here. In this post we’ll look at how our stack is designed from the ground up to reduce latency and the tools we use to monitor latency in production.

I’m Josh and I’m a Developer Advocate at Algolia, formerly the VP Engineering at Keen IO. Being a developer advocate is pretty cool. I get to code, write and speak. I also get to converse daily with developers using Algolia.

Frequently, I get asked what Algolia’s API tech stack looks like. Many people are surprised when I tell them:

  1. The Algolia search engine is written in C++ and runs inside of nginx. All searches start and finish inside of our nginx module.

  2. API clients connect directly to the nginx host where the search happens. There are no load balancers or network hops.

  3. Algolia runs on hand-picked bare metal. We use high-frequency CPUs like the 3.9GHz Intel Xeon E5-1650v4 and load machines with 256GB of RAM.

  4. Algolia uses a hybrid-tenancy model. Some clusters are shared between customers and some are dedicated, so we can use hardware efficiently while providing full isolation to customers who need it.

  5. Algolia doesn’t use AWS or any cloud-based hosting for the API. We have our own servers spanning 47 datacenters in 15 global regions.


Algolia architecture diagram


Why this infrastructure?

The primary design goal for our stack is to aggressively reduce latency. For the kinds of searches that Algolia powers, which serve demanding consumers who are used to Google, Amazon and Facebook, latency is a UX killer. Search-as-you-type experiences, which have become the norm since Google announced instant search in 2010, have demanding requirements. Any more than 100ms end-to-end can be perceived as sluggish, glitchy and distracting. But at 50ms or less the experience feels magical. We prefer magic.

Monitoring

Our monitoring stack helps us keep an eye on latency across all of our clusters. We use Wavefront to collect metrics from every machine. We like Wavefront because it’s simple to integrate (we have it plugged in to StatsD and collectd), provides good dashboards, and has integrated alerting.

We use PagerDuty to fire alerts for abnormalities like CPU depletion, resource exhaustion and long-running indexing jobs. For non-urgent alerts, like single process crashes, we dump and collect the core for further investigation. If the same non-urgent alert repeats more than a set number of times, we do trigger a PagerDuty alert. We keep only the last 5 core dumps to avoid filling up the disk.
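The two rules in this paragraph (escalate a non-urgent alert once it repeats too often, and keep only the last 5 core dumps) can be sketched as follows. This is a minimal illustration, not our production code: `AlertEscalator`, `prune_core_dumps` and the threshold value of 3 are hypothetical names and numbers chosen for the example.

```python
from collections import Counter
from pathlib import Path

MAX_CORE_DUMPS = 5        # from the text: only the last 5 dumps are kept
ESCALATION_THRESHOLD = 3  # assumed value; the text only says "a set number"


def prune_core_dumps(dump_dir: Path, keep: int = MAX_CORE_DUMPS) -> list:
    """Delete all but the `keep` most recently modified files in dump_dir.

    Returns the list of paths that were removed.
    """
    dumps = sorted(dump_dir.iterdir(), key=lambda p: p.stat().st_mtime, reverse=True)
    removed = dumps[keep:]
    for path in removed:
        path.unlink()
    return removed


class AlertEscalator:
    """Counts repeats of a non-urgent alert and says when to page someone."""

    def __init__(self, threshold: int = ESCALATION_THRESHOLD):
        self.threshold = threshold
        self.counts = Counter()

    def record(self, alert_key: str) -> bool:
        """Record one occurrence; return True once it exceeds the threshold."""
        self.counts[alert_key] += 1
        return self.counts[alert_key] > self.threshold
```

With a threshold of 2, the third occurrence of the same alert key is the first one that would trigger a page.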

When a query takes more than 1 second we send an alert into Slack. From there, someone on our Core Engineering Squad will investigate. On a typical day, we might see as few as 1 or even 0 of these, so Slack has been a good fit.

Probes

We have probes in 45 locations around the world to measure the latency and the availability of our production clusters. We host the probes with 12 different providers, not necessarily the same as where our API servers are. The results from these probes are publicly visible at status.algolia.com. We use a custom internal API to aggregate the large amount of data that probes fetch from each cluster and turn it into a single value per region.
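As a rough illustration of reducing many probe measurements to a single value per region, here is a small sketch. The internal API is not public, so the function name, the input shape and the choice of the median as the aggregate are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import median


def aggregate_by_region(samples):
    """Collapse probe samples into one latency value per region.

    `samples` is an iterable of (region, latency_ms) pairs. The median is
    an assumed choice of aggregate, picked because it resists outliers.
    """
    by_region = defaultdict(list)
    for region, latency_ms in samples:
        by_region[region].append(latency_ms)
    return {region: median(values) for region, values in by_region.items()}
```

For instance, three probes reporting 10ms, 30ms and 20ms for one region would collapse to a single value of 20ms for that region.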


Algolia probes


Downed Machines

Downed machines are detected within 30 seconds by a custom Ruby application. Once a machine is detected to be down, we push a DNS change to take it out of the cluster. The upper bound of propagation for that change is 2 minutes (DNS TTL). During this time, API clients implement their internal retry strategy to connect to healthy machines in the cluster, so there is no customer impact.
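The client-side retry idea can be sketched like this. It is a simplified illustration rather than the code of any actual API client: `request_with_retry`, `AllHostsFailed` and the `send` callable are hypothetical, and real clients also deal with timeouts and backoff.

```python
import random


class AllHostsFailed(Exception):
    """Raised when every host in the cluster refused or dropped the request."""


def request_with_retry(hosts, send, shuffle=True):
    """Try each host in turn until one of them answers.

    `send(host)` is a hypothetical callable that returns a response or
    raises OSError (which includes ConnectionError and TimeoutError).
    """
    candidates = list(hosts)
    if shuffle:
        random.shuffle(candidates)  # spread load across the cluster
    errors = {}
    for host in candidates:
        try:
            return send(host)
        except OSError as exc:
            errors[host] = exc  # remember the failure, move to the next host
    raise AllHostsFailed(errors)
```

Because the client moves on to the next host as soon as one fails, a downed machine costs a failed attempt rather than a failed search while the DNS change propagates.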

Debugging Slow Queries

When a query takes abnormally long (more than 1 second), we dump everything about it to a file. We keep everything we need to rerun it, including the application ID, index name and all query parameters. High-level profiling information is also stored. With it, we can figure out where time is spent in the heaviest 10% of query processing. The getrusage syscall reports the resource usage of the calling process or of its children.

For the kernel, we record the number of major page faults (ru_majflt), number of block inputs, number of context switches, elapsed wall clock time (using gettimeofday, so that we don’t skip counting time on a blocking I/O like a major page fault since we’re using memory mapped files) and a variety of other statistics that help us determine the root cause.
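To make the idea concrete, here is a small sketch using Python's `resource` module, which wraps the same getrusage call on Unix systems. The engine itself is C++, so this only illustrates the technique: `profile_call`, `dump_if_slow` and the record layout are hypothetical names invented for the example.

```python
import json
import resource
import time

SLOW_QUERY_SECONDS = 1.0  # threshold from the text
# Page faults, block inputs, voluntary and involuntary context switches.
COUNTERS = ("ru_majflt", "ru_inblock", "ru_nvcsw", "ru_nivcsw")


def profile_call(fn, *args, **kwargs):
    """Run fn and return (result, stats) with getrusage deltas and wall time."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    start = time.monotonic()  # wall clock, so time blocked on I/O is counted
    result = fn(*args, **kwargs)
    wall_seconds = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_SELF)
    stats = {name: getattr(after, name) - getattr(before, name) for name in COUNTERS}
    stats["wall_seconds"] = wall_seconds
    return result, stats


def dump_if_slow(log_file, query, stats, threshold=SLOW_QUERY_SECONDS):
    """Append one JSON line with the query and its stats if it was slow."""
    if stats["wall_seconds"] <= threshold:
        return False
    log_file.write(json.dumps({"query": query, "stats": stats}, sort_keys=True) + "\n")
    return True
```

Here `query` would be a dictionary holding the application ID, index name and query parameters, which is what makes the slow query reproducible later.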

With data in hand, the investigation proceeds in this order:

  1. The hardware
  2. The software
  3. Operating system and production environment

Hardware

The easiest problem to detect is a hardware issue. We see burned SSDs, broken memory modules and overheated CPUs. We automate the reporting of the most common failures like SSDs by alerting on S.M.A.R.T. data. For infrequent errors, we might need to run a suite of specific tools to narrow down the root cause, like mbw for uncovering memory bandwidth issues. And of course, there is always syslog which logs most hardware failures.

Individual machine failures will not have a customer impact because each cluster has 3 machines. Where it’s possible in a given geographical region, each machine is located in a different datacenter and attached to a different network provider. This provides further insulation from network or datacenter loss.

Software

We have some close-to-zero cost profiling information obtained from the getrusage syscall. Sometimes that’s enough to diagnose an issue with the engine code. If not, we need to look to profiling. We can’t run a profiler in production for performance reasons, but we can do this after the fact.

We attach a profiler to an external binary that contains exactly the same code as the module running inside of nginx. The profiler uses information obtained by google-perftools, a very accurate stack-sampling profiler, to simulate the exact conditions of the production machine.

OS / Environment

If we can rule out hardware and software failure, the problem might have been with the operating environment at that point in time. That means analyzing system-wide data in the hope of discovering an anomaly.

Once, we discovered that defragmentation of huge pages in the kernel could block our process for several hundred milliseconds. This defragmentation isn’t necessary because, like nginx, we keep large memory pools. Now we make sure it doesn’t happen, to the benefit of more consistent latency for all of our customers.
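One way to check this setting on a Linux host is to read the transparent huge pages defrag file in sysfs, where the kernel marks the active option with square brackets. The sketch below assumes that mechanism. The post does not say exactly how the defragmentation was disabled, and the helper names are hypothetical.

```python
from pathlib import Path

# Standard Linux sysfs location for the transparent huge pages defrag policy.
THP_DEFRAG = Path("/sys/kernel/mm/transparent_hugepage/defrag")


def active_setting(sysfs_text):
    """Return the bracketed option from text such as 'always [madvise] never'."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_thp_defrag(path=THP_DEFRAG):
    """Return the active defrag policy, or None if the file is not present."""
    return active_setting(path.read_text()) if path.exists() else None
```

A monitoring check could then alert whenever `read_thp_defrag()` returns something other than the expected value for the fleet.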

Deployment

Every Algolia application runs on a cluster of 3 machines for redundancy and increased throughput. Each indexing operation is replicated across the machines using a durable queue.

Clusters can be mirrored to other global regions across Algolia’s Distributed Search Network (DSN). Global coverage is critical for delivering low latency to users coming from different continents. You can think of DSN like a CDN without caching - every query is running against a live, up-to-date copy of the index.

Early Detection

When we release a new version of the code that powers the API, we do it in an incremental, cluster-aware way so we can roll back immediately if something goes wrong.

Automated by a set of custom deployment scripts, the order of the rolling deploy looks like this:

  • Testing machines
  • Staging machines
  • ⅓ of production machines
  • Another ⅓ of production machines
  • The final ⅓ of production machines

First, we test the new code with unit tests and functional tests on a host with an exact production configuration. During the API deployment process we use a custom set of scripts to run the tests, but in other areas of our stack we’re using Travis CI.

One thing we guard against is a network issue that produces a split-brain partition during a rolling deployment. Our deployment strategy considers every new version as unstable until it has consensus from every server, and it will continue to retry the deploy until the network partition heals.

Before deployment begins, another process has encrypted our binaries and uploaded them to an S3 bucket. The S3 bucket sits behind CloudFlare to make downloading the binaries fast from anywhere.

We use a custom shell script to do deployments. The script launches the new binaries and then checks to make sure that the new process is running. If it’s not, the script assumes that something has gone wrong and automatically rolls back to the previous version. Even if the previous version also can’t come up, we still won’t have a customer impact while we troubleshoot because the other machines in the cluster can still service requests.
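The staged rollout with an automatic rollback can be sketched as follows. Our real deployment is a shell script, so this Python version is only an illustration under stated assumptions: the `deploy`, `is_healthy` and `rollback` callables are hypothetical, each production third is assumed to contain one machine from every 3-machine cluster, and the rollout is assumed to halt at the first unhealthy host.

```python
def production_batches(clusters):
    """Split production into thirds: batch i holds machine i of every cluster.

    Assumes each cluster is a list of exactly 3 hosts, so at most one
    machine per cluster is being updated at any time.
    """
    return [[cluster[i] for cluster in clusters] for i in range(3)]


def rolling_deploy(stages, deploy, is_healthy, rollback):
    """Deploy stage by stage, rolling back and stopping on the first failure.

    `stages` is a list of (stage_name, hosts) pairs in rollout order.
    Returns (succeeded, hosts_updated_so_far).
    """
    updated = []
    for _stage_name, hosts in stages:
        for host in hosts:
            deploy(host)
            if not is_healthy(host):
                rollback(host)  # restore the previous version on this host
                return False, updated
            updated.append(host)
    return True, updated
```

The stage list would be built as testing, then staging, then the three batches returned by `production_batches`, mirroring the order of the rolling deploy listed earlier.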

Scaling

For a search engine, there are two basic dimensions of scaling:

  • Search capacity - how many searches can be performed?
  • Storage capacity - how many records can the index hold?

To increase your search capacity with Algolia, you can replicate your data to additional clusters using the point-and-click DSN feature. Once a new DSN cluster is provisioned and brought up-to-date with data, it will automatically begin to process queries.

Scaling storage capacity is a bit more complicated.

Multiple Clusters

Today, Algolia customers who cannot fit on one cluster need to provision a separate cluster and create logic at the application layer to balance between them. This is often needed by SaaS companies whose customers grow at different rates. Sometimes one customer is 10x or 100x the size of the others and has to be moved somewhere they can fit.

Soon we’ll be releasing a feature that takes this complexity behind the API. Algolia will automatically balance data across a customer’s available clusters based on a few key pieces of information. The way it works is similar to sharding but without the limitation of shards being pinned to a specific node. Shards can be moved between clusters dynamically. This avoids a very serious problem encountered by many search engines - if the original shard key guess was wrong, the entire cluster will have to be rebuilt down the road.
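To illustrate why movable shards avoid a rebuild, here is a toy sketch in which keys map to a fixed set of logical shards and a separate, mutable table maps shards to clusters. This is not the design of the upcoming feature, whose details are not described here. `ShardDirectory` and its methods are hypothetical.

```python
import hashlib


class ShardDirectory:
    """Maps keys to logical shards, and logical shards to clusters."""

    def __init__(self, num_shards, clusters):
        self.num_shards = num_shards
        # Initial placement: spread shards round-robin over the clusters.
        self.assignment = {s: clusters[s % len(clusters)] for s in range(num_shards)}

    def shard_for(self, key):
        """Stable key-to-shard mapping; it never changes when shards move."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_shards

    def cluster_for(self, key):
        return self.assignment[self.shard_for(key)]

    def move_shard(self, shard, cluster):
        """Relocate one shard; only the lookup table entry is updated."""
        self.assignment[shard] = cluster
```

Moving a fast-growing customer then means relocating the shard that holds them, while every other key keeps resolving to the same shard and cluster as before.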

Collaboration

Our humans and our bots congregate on Slack. Last year we had some growing pains, but now we have a prefix-based naming convention that works pretty well. Our channels are named #team-engineering, #help-engineering, #notif-github, etc. The #team- channels are for members of a team, #help- channels are for getting help from a team, and #notif- channels are for collecting automatic notifications.


Algolia Zoom Room


It would be hard to count the number of Zoom meetings we have on a given day. Our two main offices are in Paris and San Francisco, making 7am-10am PST the busiest time of day for video calls. We now have dedicated "Zoom Rooms" with iPads, high-resolution cameras and big TVs that make the experience really smooth. With new offices in New York and Atlanta, Zoom will become an even more important part of our collaboration stack, which also includes GitHub, Trello and Asana.

Team

When you're an API, performance and scalability are customer-facing features. The work that our engineers do directly affects the 15,000+ developers that rely on our API. Being developers ourselves, we’re very passionate about open source and staying active with our community.


Algolia values


We’re hiring! Come help us make building search a rewarding experience. Algolia teammates come from a diverse range of backgrounds and 15 different countries. Our values are Care, Humility, Trust, Candor and Grit. Employees are encouraged to travel to different offices - Paris, San Francisco, or now Atlanta - at least once a year, to build strong personal connections inside of the company.

See our open positions on StackShare.

Questions about our stack? We love to talk tech. Comment below or ask us on our Discourse forum.

Thanks to Julien Lemoine, Adam Surak, Rémy-Christophe Schermesser, Jason Harris and Raphael Terrier for their much-appreciated help on this post.
