This post is by Robert Beekman from AppSignal
As an ever growing all-in-one APM, we spend a lot of time on making sure AppSignal can cope with our increase in traffic. Usually, we don’t talk about how we do that; our blog is full of articles about great things under the hood of Ruby or doing crazy things with Elixir, but not about what makes AppSignal tick.
This time however, we’d like to share some of the bigger changes in our stack we’ve made over the past few years, so we can (easily) process the double-digit billions of requests sent our way every month. In real-time. So today we use our scaling experience to discuss our own stack and help you that way.
From a standard Rails setup to more custom parts
AppSignal started out as a pretty standard Rails setup. We used a Rails app that collected data through an API endpoint which created Sidekiq jobs to process in the background.
After a while, we replaced the Rails API with a Rack middleware to gain a bit of speed and later this was replaced with a Go web server that pushed Sidekiq compatible jobs to Redis.
App state and increments/updates
While this setup worked well for a long time, we began to run into issues where the databases couldn’t keep up with the amount of queries run against them. At this point we were processing tens of billions of requests already. The main reason for this bottleneck was that each Sidekiq process needed to get the entire app's state from the database, in order to increment the correct counters and update the right documents.
We could alleviate this somewhat with local caching of data, but because of the round-robin nature of our setup it still meant that each server needed to have a full cache of all data, because we couldn’t be sure on what server the payload would end up. We realized that with the data growth we were experiencing, this setup would become impossible in the future.
In search for a better way to handle the data we settled on using Kafka as the data processing pipeline. Instead of aggregating metrics in the database, we now aggregate the metrics in Kafka processors. Our goal is that our Kafka pipeline never queries the database until the aggregated data has to be flushed. This drives the amount of queries per payload down from up to ten reads and writes to just one write at the end of the pipeline.
We specify a key for each Kafka message and Kafka guarantees that the same keys end up on the same partition, that's consumed by the same server. We use the app's ID as a key for messages, this means that instead of having a cache for all customers on the server, we only have to cache data for the apps a server receives from Kafka, not all apps.
Kafka is a great system and we’ve migrated over in the past two years. Right now almost all processing is done in Rust through Kafka, but there are still things that are easier done in Ruby, such as sending Notifications and other database-heavy tasks. This meant that we needed some way to get data from Kafka to our Rails stack.
Connecting Kafka and Ruby/Rails
When we began this transition there were a couple of Kafka Ruby gems, but none worked with the latest (at the time 0.10.x) release of Kafka and most were unmaintained.
We looked at writing our own gem (which we eventually did). We will write more about that in a different article. But having a nice driver is only part of the requirements. We also needed a system to consume the data and execute the tasks in Ruby and spawn new workers when old ones crash.
Eventually we came up with a different solution. Our Kafka stack is built in Rust and we wrote a small binary that consumes a
sidekiq_out topic and creates Sidekiq compatible jobs in Redis. This way we could deploy this binary on our worker machines and it would feed new jobs into Sidekiq just as you would do within Rails itself.
The binary has a few options such as limiting the amount of data in Redis to stop consuming the Kafka topic until the threshold is cleared. This way all the data from Kafka won’t end up in Redis' memory on the workers if there is a backlog.
From Ruby’s point of view, there is no difference at all between jobs generated in Rails and those that come from Kafka. It allows us to prototype new workers that get data from Kafka and process it in Rails–to send notifications and update the database–without having to know anything about Kafka.
This made the migration to Kafka easier as we could switch over to Kafka and back without having to deploy new Ruby code. It also made testing super easy as you could easily generate jobs in the test suite to be consumed by Ruby without having to setup an entire Kafka stack locally.
We use Protobuf to define all our (internal) messages, this way we can be pretty sure that if the test passes, the worker will correctly process jobs from Kafka.
In the end this solution saved us a lot of time and energy and made life a lot simpler for our Ruby team.
Pros and cons
As with everything there are a few pros and cons for this setup:
- No changes in Ruby required, API compatible
- Easy to deploy and revert
- Easy to switch between Kafka and Ruby
- Redis isn’t overloaded by messages when using the limiter, saves memory on the server, keeping the messages in Kafka instead.
- Horizontal scaling leads to smaller caches on each server, because of the keyed messages.
- Still has the issue that each Sidekiq thread needs access to a cache of all data for the apps from the partitions the server consumes. (e.g. Memcache).
- Separate process running on the server
- The Rust processor commits the message offset when the message is flushed to Redis, this means that it’s guaranteed to be in Redis, but there’s no guarantee the message is processed by Ruby, this means that in case of a server crash, there is a chance some messages that were in Redis, but not processed are not processed.
Sidekiq and Kafka
Using Sidekiq helped us tremendously while migrating our processing pipeline to Kafka. We've now almost completely moved away from Sidekiq and are handling everything via our Kafka driver directly, but that's for another article.
This happy ending wraps up the love story. We hope you enjoyed this perspective on performance and scaling, and our experience scaling AppSignal. We hope this story on the decisions we made around our stack, in turn, help you.
Check out the rest of the blog or follow us to keep an eye out on when the next episode about our Kafka setup is published. And if you end up looking for an all-in-one APM that is truly by developers for developers, come find us.