Unless your engineering team is staffed by angels who commute down to the office from heaven every morning, we’re pretty confident you run into plenty of problems developing and iterating on your applications in production. Sentry provides all the tools you need to find, triage, reproduce, and fix application-level issues before your users even know there was a problem. With the added bonus that you won’t get any more nasty looks from support engineers at happy hour.
By automating error detection and aggregating and adding important context to stack traces, Sentry helps you proactively correct the errors that are doing the most harm to your business more efficiently and durably and with minimal disruption. Closing the gap between the product team and customers improves productivity, speeds up the entire development process, and helps engineers focus on what they do best: build apps that make users’ lives better.
I was personally a Sentry user way before I was an employee. Early on at my previous company, I was tasked with upgrading the open-source error tracking service that hadn’t really been maintained or used for a while. I reached out for help and heard back from David (Sentry’s co-founder) and Matt (Sentry’s second engineer), meeting two of my future co-workers on IRC years before I ever saw their faces (protip: connect with Matt on LinkedIn).
This is Matt
They were incredibly helpful and, when I went looking for a new job, I thought, “Hey, this is a very nice piece of software, and the people who are running it are really mindful of their community. I’d love to be a part of that.” Today, I spend my waking hours happily keeping Sentry’s hosted service operational, available, and responsive to our exponentially-increasing event volume (editor’s note: when he’s not trolling new hires on Slack for their taste in hip-hop and Fruit Gushers).
A Powerful Side Project
Sentry started as (and remains) an open-source project, growing out of an error logging tool David built in 2008. He displayed a truly shrewd notion of branding even then, giving the project a catchy name that companies the world over remain jealous of to this day: django-db-log. For the longest time, Sentry’s subtitle on GitHub was “A simple Django app, built with love.” A slightly more accurate description probably would have included Starcraft and Soylent alongside love; regardless, this captured what Sentry was all about.
A Fast-Growing Company
As you might expect, Sentry usage has grown exponentially over the past decade, and the infrastructure has changed and matured to accommodate massive scale. We now host the open-source project as a SaaS product. Sentry has SDKs for just about every framework, platform, and language and integrations with the most popular developer tools, which helps make it incredibly easy to adopt. Today, Sentry is central to the error tracking and resolution workflows of tens of thousands of organizations and more than 100,000 active users around the world, many of whom support implementations for some of the biggest properties on the internet: Dropbox, Uber, Stripe, Airbnb, Xbox Live, HubSpot, and more. That’s 5 billion events per week, just from the hosted service.
When a customer sends events to Sentry, they don’t receive a laundry list of notifications, they get the aggregate issue with counts of how often it’s occurred and which of their users are experiencing the issue. This is all presented very simply and cleanly in Sentry, but if a user wants individual events, we’ll provide those also. We save every single event we accept, which gets very expensive to do in a traditional relational database.
One of the first improvements Sentry made to address scalability was storing all of these events in a distributed key-value store. There are a variety of key-value stores out there, all with their promises and pitfalls, but when evaluating solutions, we ultimately chose Riak. Our Riak cluster does exactly what we want it to: write event data to more than one location, grow or shrink in size upon request, and persist through normal failure scenarios.
The first major infrastructure project that I contributed to when joining Sentry was horizontally scaling our ability to execute offline tasks. As Sentry runs throughout the day, there are about 50 different offline tasks that we execute—anything from “process this event, pretty please” to “send all of these cool people some emails.” There are some that we execute once a day and some that execute thousands per second.
Managing this variety requires a reliably high-throughput message-passing technology. We use Celery’s RabbitMQ implementation, and we stumbled upon a great feature called Federation that allows us to partition our task queue across any number of RabbitMQ servers and gives us the confidence that, if any single server gets backlogged, others will pitch in and distribute some of the backlogged tasks to their consumers.
Another project we’ve undergone is setting up safeguards in front of our application to protect from unpredictable and unwanted traffic. When accepting events, we would be crazy to just expose the Python web process to the public Internet and say, “Alright, give me all you got!” Instead, we use two different proxying services that sit in front of our web machines:
- NGINX, our product-aware proxy, handles many of the upper bounds that we have deemed reasonable. It is responsible for a variety of bounds, but its most popular one is protecting Sentry from exceedingly large event volumes. Ever so often, a user will run into a problem where they’ve deployed their code out into the abyss, and their event volume clocks in at a few zeroes higher than what they signed up for.
- - In front of NGINX, we use another proxying service called HAProxy, which acts as a delta of connections without any of that product awareness logic and has a lot higher throughput. All it does is accept connections and send them off to different NGINX servers, allowing us to gracefully add or remove NGINX servers as we see fit.
An Evolving Architecture
The event processing pipeline, which is responsible for handling all of the ingested event data that makes it through to our offline task processing, is written primarily in Python. For particularly intense code paths, like our source map processing pipeline, we have begun re-writing those bits in Rust. Rust’s lack of garbage collection makes it a particularly convenient language for embedding in Python. It allows us to easily build a Python extension where all memory is managed from the Python side (if the Python wrapper gets collected by the Python GC we clean up the Rust object as well.)
A Simple Deploy Workflow
For the most part, Sentry is still a classically monolithic app. This is driven, in part, by the fact that Sentry is still open-source, and we want to make it easy for our community to install and run the server themselves. To do this, we provide installation details for a Docker image that contains all of Sentry’s core services in one place. This monolithic nature makes contributing to and deploying Sentry ourselves relatively straightforward.
When someone wants to commit a change to the codebase, it is submitted as a pull request to our public project on GitHub. From there, Travis CI runs a set of parallelized builds, which include not only unit and integration tests, but also visual regression tests that are managed through Percy. Since we’re still an open-source project that supports different relational databases, we run test suites not only for Postgres, but also for MySQL and SQLite, as well.
Once all tests are green, the code has been reviewed, and any detected UI changes have been approved, the code is merged through GitHub. We then use an internal open-source tool named Freight to build and deploy our Docker image to production. Additionally, Freight injects the only closed source piece of Sentry, our billing platform. Once the image is in production, we trigger a rolling restart of every Sentry container to pick up the new image.
An Unpredictable World
One of our biggest challenges is that Sentry’s traffic is inherently unpredictable, and there’s simply no way to foresee when a user’s application is going to melt down and send us a huge influx of events. On bare metal, we handled this by preparing for the worst(ish) and over-provisioning machines in case of an event deluge. Unfortunately, as demand grew, our time window for needing new machines shrunk. We started demanding more from our provider, requesting machines before they were needed, and keeping common machines idle for days on end, waiting to see which component needed it the most.
For that reason, we made the leap to Google Cloud Platform (GCP) in July 2017 to give ourselves greater flexibility. Calling it a “leap” makes it sound impulsive, but the transition actually took months of planning. And no matter how long we spent projecting resource usage within Google Compute Engine, we never would have predicted our increased throughput. Due to GCP’s default microarchitecture, Haswell, we noticed an immediate performance increase across our CPU-intensive workloads, namely source map processing. The operations team spent the next few weeks making conservative reductions in our infrastructure, and still managed to cut our costs by roughly 20%. No fancy cloud technology, no giant infrastructure undertaking -- just new rocks that were better at math.
You can find way more detail about it on the Google Cloud Platform Blog.
Observability and Action
A big reason we can sustain Sentry is that it falls into a category of observability tooling that requires a non-trivial amount of resources to host. We run Sentry ourselves because we’ve gotten pretty good at it. We rely on Sentry to track errors in our production app and help us set priorities for iteration, based on user experience and impact.
But when it comes to the rest of our monitoring stack, we apply the same thinking as the users signing up for Sentry’s hosted service every day: “It’s better to pay for uptime in dollars than in engineering hours.” (If you haven’t used Sentry’s hosted service, it only takes a couple minutes and a few lines of code to set up.)
We use a few toolchains outside of our production environment. I could write an essay detailing each (and I probably will), but let’s just outline how I would get notified that we’ve regressed in our 95th percentile of request latency:
- Each host running a web server sends the timing of requests to Stripe’s Veneur
- Veneur creates histograms of request timings and forwards those to Datadog
- A Datadog threshold alert detects we’ve gone higher than 500ms
- The threshold alert is configured to notify a Slack channel and a PagerDuty rotation
- The PagerDuty rotation notifies both operations engineers currently on-call
We introduce every new employee with their own welcome gif
Our Engineering org is split into four teams in two programs: Product and Infrastructure. Their names do a pretty solid job describing their purposes, but:
Product is broken into the Workflow and Growth teams. Workflow focuses specifically on how our users interact with Sentry throughout their own workflows and development processes. Growth looks at the tweaks we can make that will increase the likelihood that a new user will find Sentry relevant, onboard effectively, and stick around to use it more and more.
Infrastructure is broken into the Platform and Operations teams. Platform is dedicated to all of the Sentry code that powers our API, including event ingestion. Operations is where I live, and we’re dedicated to building, deploying, maintaining, and monitoring all of the components that keep sentry.io stable.
We also have an unofficial fifth team that plays a large part in Sentry’s development and will always outnumber the others: our open-source contributors. Sentry’s entire codebase is right on GitHub for the whole world to see, and many improvements to our service have been introduced by users and community members who don’t work here.
Just as Sentry is a part of many software teams’ stacks, we rely on a number of additional commercial and open-source services to help run our business. We use Stripe to handle customer billing, SendGrid for reliable email delivery, Slack for team communication, Google Analytics for basic web analytics, BigQuery for data warehousing, and Jira for project management.
Open source, open company. That’s our credo, and it really captures what we’re all about. As I mentioned earlier, I applied for a job at Sentry because it’s such a nice piece of software, and the people who run the company are mindful about the role of the community. Since everyone who works here is also a member of the open-source community, that mindfulness extends to and flows between employees.
Growth is inevitable here. The hard decision is not what to scale, but when. It’s the Operations team’s responsibility to put engineering hours into the right initiative and balance scale with security, reliability, and productivity. Maybe you want to make some of those hard decisions on my team?
Or maybe operations isn’t your thing, but you want to build something open-source. Want to contribute to Sentry beyond just code? We’re hiring pretty much across the organization and would love to talk to you if you’ve read this entire post and think you still might be as into Sentry as I am.