How Sentry Receives 20 Billion Events Per Month While Preparing to Handle Twice That

Sentry
Sentry’s Application Monitoring platform helps developers see performance issues, fix errors faster, and optimize code health.

By James Cunningham, Operations Engineer, Sentry.


About Sentry

Sentry illustration

Unless your engineering team is staffed by angels who commute down to the office from heaven every morning, we’re pretty confident you run into plenty of problems developing and iterating on your applications in production. Sentry provides all the tools you need to find, triage, reproduce, and fix application-level issues before your users even know there was a problem. With the added bonus that you won’t get any more nasty looks from support engineers at happy hour.

By automating error detection and aggregating and adding important context to stack traces, Sentry helps you proactively correct the errors that are doing the most harm to your business more efficiently and durably and with minimal disruption. Closing the gap between the product team and customers improves productivity, speeds up the entire development process, and helps engineers focus on what they do best: build apps that make users’ lives better.

I was personally a Sentry user way before I was an employee. Early on at my previous company, I was tasked with upgrading the open-source error tracking service that hadn’t really been maintained or used for a while. I reached out for help and heard back from David (Sentry’s co-founder) and Matt (Sentry’s second engineer), meeting two of my future co-workers on IRC years before I ever saw their faces (protip: connect with Matt on LinkedIn).


This is Matt


They were incredibly helpful and, when I went looking for a new job, I thought, “Hey, this is a very nice piece of software, and the people who are running it are really mindful of their community. I’d love to be a part of that.” Today, I spend my waking hours happily keeping Sentry’s hosted service operational, available, and responsive to our exponentially-increasing event volume (editor’s note: when he’s not trolling new hires on Slack for their taste in hip-hop and Fruit Gushers).

A Powerful Side Project

Sentry started as (and remains) an open-source project, growing out of an error logging tool David built in 2008. He displayed a truly shrewd notion of branding even then, giving the project a catchy name that companies the world over remain jealous of to this day: django-db-log. For the longest time, Sentry’s subtitle on GitHub was “A simple Django app, built with love.” A slightly more accurate description probably would have included Starcraft and Soylent alongside love; regardless, this captured what Sentry was all about.

That original build nine years ago was Django and Celery (a Python asynchronous task queue), with Postgres as the database and Redis as the broker powering Celery.

A Fast-Growing Company

As you might expect, Sentry usage has grown exponentially over the past decade, and the infrastructure has changed and matured to accommodate massive scale. We now host the open-source project as a SaaS product. Sentry has SDKs for just about every framework, platform, and language and integrations with the most popular developer tools, which helps make it incredibly easy to adopt. Today, Sentry is central to the error tracking and resolution workflows of tens of thousands of organizations and more than 100,000 active users around the world, many of whom support implementations for some of the biggest properties on the internet: Dropbox, Uber, Stripe, Airbnb, Xbox Live, HubSpot, and more. That’s 5 billion events per week, just from the hosted service.

When a customer sends events to Sentry, they don’t receive a laundry list of notifications; they get the aggregate issue with counts of how often it has occurred and which of their users are experiencing it. This is all presented very simply and cleanly in Sentry, but if a user wants individual events, we’ll provide those also. We save every single event we accept, which gets very expensive to do in a traditional relational database.
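To make the aggregation idea concrete, here is a minimal sketch of rolling individual events up into issues with an occurrence count and a set of affected users. The `fingerprint` rule (exception type plus top stack frame), the field names, and the sample events are all assumptions made for illustration; this is not Sentry's actual grouping algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    """Aggregate view of many events that share a fingerprint."""
    fingerprint: str
    count: int = 0
    users: set = field(default_factory=set)
    event_ids: list = field(default_factory=list)


def fingerprint(event: dict) -> str:
    # Hypothetical grouping rule: exception type plus the top stack frame.
    return f"{event['type']}@{event['frames'][0]}"


def aggregate(events: list[dict]) -> dict[str, Issue]:
    issues: dict[str, Issue] = {}
    for event in events:
        key = fingerprint(event)
        issue = issues.setdefault(key, Issue(fingerprint=key))
        issue.count += 1
        issue.users.add(event["user"])
        issue.event_ids.append(event["id"])  # individual events stay retrievable
    return issues


events = [
    {"id": "e1", "type": "KeyError", "frames": ["views.py:42"], "user": "ana"},
    {"id": "e2", "type": "KeyError", "frames": ["views.py:42"], "user": "bo"},
    {"id": "e3", "type": "KeyError", "frames": ["views.py:42"], "user": "ana"},
    {"id": "e4", "type": "ValueError", "frames": ["tasks.py:7"], "user": "cy"},
]
issues = aggregate(events)
```

Four events collapse into two issues here, while the per-issue `event_ids` list keeps every individual event reachable.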

One of the first improvements Sentry made to address scalability was storing all of these events in a distributed key-value store. There are a variety of key-value stores out there, all with their promises and pitfalls, but when evaluating solutions, we ultimately chose Riak. Our Riak cluster does exactly what we want it to: write event data to more than one location, grow or shrink in size upon request, and persist through normal failure scenarios.

The first major infrastructure project that I contributed to when joining Sentry was horizontally scaling our ability to execute offline tasks. As Sentry runs throughout the day, there are about 50 different offline tasks that we execute, anything from “process this event, pretty please” to “send all of these cool people some emails.” There are some that we execute once a day and some that we execute thousands of times per second.

Managing this variety requires a reliably high-throughput message-passing technology. We run Celery on top of RabbitMQ, and we stumbled upon a great RabbitMQ feature called Federation that allows us to partition our task queue across any number of RabbitMQ servers and gives us the confidence that, if any single server gets backlogged, others will pitch in and distribute some of the backlogged tasks to their consumers.
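The "others will pitch in" behavior can be pictured with a toy model. The sketch below is plain Python, not RabbitMQ or Celery code: the `Broker` class, the `federate` helper, and the borrow-from-the-busiest-peer rule are assumptions chosen only to illustrate how consumers on idle servers can drain a backlogged one.

```python
from collections import deque


class Broker:
    """Toy stand-in for one queue server; not RabbitMQ's real implementation."""

    def __init__(self, name: str):
        self.name = name
        self.queue: deque = deque()
        self.peers: list["Broker"] = []

    def publish(self, task: str) -> None:
        self.queue.append(task)

    def consume(self):
        """Return a local task, or borrow one from the most backlogged peer."""
        if self.queue:
            return self.queue.popleft()
        busiest = max(self.peers, key=lambda b: len(b.queue), default=None)
        if busiest is not None and busiest.queue:
            return busiest.queue.popleft()
        return None


def federate(brokers: list[Broker]) -> None:
    # Link every broker to every other broker.
    for broker in brokers:
        broker.peers = [b for b in brokers if b is not broker]


a, b, c = Broker("a"), Broker("b"), Broker("c")
federate([a, b, c])

# Only broker "a" receives work, so it becomes backlogged.
for i in range(6):
    a.publish(f"task-{i}")

# Consumers attached to every broker each take two turns.
handled = {broker.name: [broker.consume(), broker.consume()] for broker in (a, b, c)}
```

Even though every task was published to broker `a`, consumers attached to `b` and `c` end up handling two thirds of the work.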

Another project we’ve undertaken is setting up safeguards in front of our application to protect it from unpredictable and unwanted traffic. When accepting events, we would be crazy to just expose the Python web process to the public Internet and say, “Alright, give me all you got!” Instead, we use two different proxying services that sit in front of our web machines:

  • NGINX, our product-aware proxy, enforces many of the upper bounds that we have deemed reasonable. It is responsible for a variety of limits, but its most popular one is protecting Sentry from exceedingly large event volumes. Every so often, a user will run into a problem where they’ve deployed their code out into the abyss, and their event volume clocks in at a few zeroes higher than what they signed up for.
  • In front of NGINX, we use another proxying service called HAProxy, which acts as a delta of connections without any of that product awareness logic and has much higher throughput. All it does is accept connections and send them off to different NGINX servers, allowing us to gracefully add or remove NGINX servers as we see fit.
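The division of labor between the two proxy layers can be sketched in a few lines of Python. This models the idea only: it is not NGINX or HAProxy configuration, and the fixed-window counter, the host names, and the quota of three events per minute are all assumptions for illustration.

```python
import itertools
from collections import Counter


class RoundRobinBalancer:
    """Product-unaware layer: hand each connection to the next backend."""

    def __init__(self, backends: list[str]):
        self._cycle = itertools.cycle(backends)

    def pick(self) -> str:
        return next(self._cycle)


class ProjectRateLimiter:
    """Product-aware layer: cap accepted events per project per time window."""

    def __init__(self, limit_per_window: int):
        self.limit = limit_per_window
        self.window = None
        self.counts: Counter = Counter()

    def allow(self, project: str, now: float, window_seconds: int = 60) -> bool:
        window = int(now // window_seconds)
        if window != self.window:  # a new window resets every counter
            self.window = window
            self.counts.clear()
        self.counts[project] += 1
        return self.counts[project] <= self.limit


balancer = RoundRobinBalancer(["nginx-1", "nginx-2"])  # hypothetical host names
limiter = ProjectRateLimiter(limit_per_window=3)       # hypothetical quota

results = []
for second in range(5):
    backend = balancer.pick()
    accepted = limiter.allow("noisy-project", now=float(second))
    results.append((backend, accepted))
```

The balancer never looks at who is sending the event, which is what keeps it cheap. Only the limiter needs to know about projects and quotas.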


Everything is fine now


An Evolving Architecture

Sentry began life as a traditional Django application, and has gone through a couple of architecture iterations since. The current Sentry dashboard, which is what customers use to browse and debug their production issues, has evolved into a single-page application written in React and Reflux (an early Flux library). We write ES6 and transpile it to browser-compatible JavaScript using Babel and Webpack. For fetching and submitting data, we communicate with the Django backend through a straightforward REST-based HTTP API.

The event processing pipeline, which is responsible for handling all of the ingested event data that makes it through to our offline task processing, is written primarily in Python. For particularly intense code paths, like our source map processing pipeline, we have begun rewriting those bits in Rust. Rust’s lack of garbage collection makes it a particularly convenient language for embedding in Python. It allows us to easily build a Python extension where all memory is managed from the Python side: if the Python wrapper gets collected by the Python GC, we clean up the Rust object as well.


Sentry Releases animation


A Simple Deploy Workflow

For the most part, Sentry is still a classically monolithic app. This is driven, in part, by the fact that Sentry is still open-source, and we want to make it easy for our community to install and run the server themselves. To do this, we provide installation details for a Docker image that contains all of Sentry’s core services in one place. This monolithic nature makes contributing to and deploying Sentry ourselves relatively straightforward.

When someone wants to commit a change to the codebase, it is submitted as a pull request to our public project on GitHub. From there, Travis CI runs a set of parallelized builds, which include not only unit and integration tests, but also visual regression tests that are managed through Percy. Since we’re still an open-source project that supports different relational databases, we run test suites not only for Postgres, but also for MySQL and SQLite.

Once all tests are green, the code has been reviewed, and any detected UI changes have been approved, the code is merged through GitHub. We then use an internal open-source tool named Freight to build and deploy our Docker image to production. Additionally, Freight injects the only closed-source piece of Sentry, our billing platform. Once the image is in production, we trigger a rolling restart of every Sentry container to pick up the new image.
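A rolling restart, in its simplest form, replaces containers a few at a time so that the service never goes fully offline. The sketch below is a generic illustration, not Freight's implementation: the batch size, the health check, the container names, and the image tags are all assumptions.

```python
from typing import Callable


def rolling_restart(
    containers: list[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    batch_size: int = 1,
) -> list[list[str]]:
    """Restart containers batch by batch, halting if a batch fails its health check."""
    batches = []
    for start in range(0, len(containers), batch_size):
        batch = containers[start:start + batch_size]
        for name in batch:
            restart(name)
        if not all(is_healthy(name) for name in batch):
            raise RuntimeError(f"unhealthy after restart: {batch}")
        batches.append(batch)
    return batches


running_image = {f"web-{i}": "sentry:old" for i in range(5)}  # hypothetical names


def restart(name: str) -> None:
    running_image[name] = "sentry:new"  # a restarted container picks up the new image


def is_healthy(name: str) -> bool:
    return running_image[name] == "sentry:new"


batches = rolling_restart(list(running_image), restart, is_healthy, batch_size=2)
```

Stopping at the first unhealthy batch keeps a bad image from reaching every container.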


Sentry plus Slack integration GIF


An Unpredictable World

One of our biggest challenges is that Sentry’s traffic is inherently unpredictable, and there’s simply no way to foresee when a user’s application is going to melt down and send us a huge influx of events. On bare metal, we handled this by preparing for the worst(ish) and over-provisioning machines in case of an event deluge. Unfortunately, as demand grew, our time window for needing new machines shrank. We started demanding more from our provider, requesting machines before they were needed, and keeping common machines idle for days on end, waiting to see which component needed them the most.

For that reason, we made the leap to Google Cloud Platform (GCP) in July 2017 to give ourselves greater flexibility. Calling it a “leap” makes it sound impulsive, but the transition actually took months of planning. And no matter how long we spent projecting resource usage within Google Compute Engine, we never would have predicted our increased throughput. Due to GCP’s default microarchitecture, Haswell, we noticed an immediate performance increase across our CPU-intensive workloads, namely source map processing. The operations team spent the next few weeks making conservative reductions in our infrastructure, and still managed to cut our costs by roughly 20%. No fancy cloud technology, no giant infrastructure undertaking -- just new rocks that were better at math.

You can find way more detail about it on the Google Cloud Platform Blog.

Observability and Action

A big reason we can sustain Sentry is that it falls into a category of observability tooling that requires a non-trivial amount of resources to host. We run Sentry ourselves because we’ve gotten pretty good at it. We rely on Sentry to track errors in our production app and help us set priorities for iteration, based on user experience and impact.

But when it comes to the rest of our monitoring stack, we apply the same thinking as the users signing up for Sentry’s hosted service every day: “It’s better to pay for uptime in dollars than in engineering hours.” (If you haven’t used Sentry’s hosted service, it only takes a couple minutes and a few lines of code to set up.)

We use a few toolchains outside of our production environment. I could write an essay detailing each (and I probably will), but let’s just outline how I would get notified that we’ve regressed in our 95th percentile of request latency:

  • Each host running a web server sends the timing of requests to Stripe’s Veneur
  • Veneur creates histograms of request timings and forwards those to Datadog
  • A Datadog threshold alert detects we’ve gone higher than 500ms
  • The threshold alert is configured to notify a Slack channel and a PagerDuty rotation
  • The PagerDuty rotation notifies both operations engineers currently on-call
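The alerting chain above boils down to "compute the 95th percentile of request timings and compare it to a threshold." Here is a minimal sketch of that logic, using a nearest-rank percentile over raw samples. The real pipeline works on histograms inside Veneur and Datadog, so the functions, channel names, and sample data below are illustrative assumptions and not those tools' APIs.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile; real histogram backends approximate this."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_latency(samples_ms, threshold_ms=500.0, notify=print) -> bool:
    """Notify every channel if the p95 latency exceeds the threshold."""
    p95 = percentile(samples_ms, 95)
    if p95 > threshold_ms:
        for channel in ("slack", "pagerduty"):
            notify(f"[{channel}] p95 latency {p95:.0f}ms exceeds {threshold_ms:.0f}ms")
        return True
    return False


healthy = [120.0] * 95 + [900.0] * 5     # only 5% of requests are slow
regressed = [120.0] * 90 + [900.0] * 10  # 10% of requests are slow

sent: list[str] = []
healthy_alert = check_latency(healthy, notify=sent.append)
regressed_alert = check_latency(regressed, notify=sent.append)
```

A percentile is used instead of an average so that a slow tail of requests trips the alert once it grows past 5% of traffic, even when the typical request is still fast.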


Sentry welcome gif

We introduce every new employee with their own welcome gif


Fantastic Co-Workers

Our Engineering org is split into four teams in two programs: Product and Infrastructure. Their names do a pretty solid job describing their purposes, but:

  • Product is broken into the Workflow and Growth teams. Workflow focuses specifically on how our users interact with Sentry throughout their own workflows and development processes. Growth looks at the tweaks we can make that will increase the likelihood that a new user will find Sentry relevant, onboard effectively, and stick around to use it more and more.

  • Infrastructure is broken into the Platform and Operations teams. Platform is dedicated to all of the Sentry code that powers our API, including event ingestion. Operations is where I live, and we’re dedicated to building, deploying, maintaining, and monitoring all of the components that keep sentry.io stable.

We also have an unofficial fifth team that plays a large part in Sentry’s development and will always outnumber the others: our open-source contributors. Sentry’s entire codebase is right on GitHub for the whole world to see, and many improvements to our service have been introduced by users and community members who don’t work here.

Other Stacks

Just as Sentry is a part of many software teams’ stacks, we rely on a number of additional commercial and open-source services to help run our business. We use Stripe to handle customer billing, SendGrid for reliable email delivery, Slack for team communication, Google Analytics for basic web analytics, BigQuery for data warehousing, and Jira for project management.

On the open-source side, our growth and BI teams use Redash to derive useful statistics from our data. We use Jekyll to publish sentry.io and other online marketing content, like our blog.

Closing


Sentry team photo


Open source, open company. That’s our credo, and it really captures what we’re all about. As I mentioned earlier, I applied for a job at Sentry because it’s such a nice piece of software, and the people who run the company are mindful about the role of the community. Since everyone who works here is also a member of the open-source community, that mindfulness extends to and flows between employees.

Growth is inevitable here. The hard decision is not what to scale, but when. It’s the Operations team’s responsibility to put engineering hours into the right initiative and balance scale with security, reliability, and productivity. Maybe you want to make some of those hard decisions on my team?

Or maybe operations isn’t your thing, but you want to build something open-source. Want to contribute to Sentry beyond just code? We’re hiring pretty much across the organization and would love to talk to you if you’ve read this entire post and think you still might be as into Sentry as I am.
