Rust at OneSignal

OneSignal
OneSignal is a high volume mobile push, web push, email, and in-app messaging service.

This post is by Joe Wilm of OneSignal

Earlier last year, we announced OnePush, our notification delivery system written in Rust.

In this post, we cover improvements in our delivery capabilities since then, give an interactive tour of OnePush’s subsystems, and share reflections on our experience shipping production Rust code. We hope you'll find it insightful!

Delivery Stats

OnePush was built to scale deliveries at OneSignal. To know whether this endeavor was a success, we collect metrics such as historical delivery counts and delivery throughput. Here's how OnePush is performing:

  • OneSignal had ~10,000 users at the start of 2016 and has over 110,000 at the time of publishing this post. (Over 10x growth!)
  • We've increased the number of daily notifications sent by 20x in the same period.
  • OnePush delivers over 2 billion notifications per week.
  • OnePush is fast: we've observed sustained deliveries of up to 125,000/second and spikes of up to 175,000/second.

The title image on this post is a screenshot from our live delivery monitoring. Each bar represents deliveries occurring in that second, and each vertical division denotes 5,000 deliveries. The colors represent different platforms like iOS, Android, Chrome WebPush, etc. Every single one of them was delivered by OnePush.

OnePush

OnePush is composed of several subsystems for loading notifications, delivering notifications over HTTP/1.1 and HTTP/2, and processing events and results.

Choosing Rust

Choosing the programming language for a core system is a big decision. If you are not careful, you can end up with months of time invested and still be stuck writing library code instead of the application itself. This is less of a concern with programming languages that have a mature ecosystem, but that's not exactly Rust just yet. On the other hand, Rust enables one to build robust, complex systems quickly and fearlessly thanks to its powerful type system and ownership rules.

Given that we now have a production system written in Rust, it's obvious which side of this trade-off we landed on. Our experience has been positive overall, and we have had fantastic results. The following sections discuss the specific pros and cons we considered for building OnePush in Rust, the risks we accepted at the outset, the successes we had, and the issues we ran into.

Reasons to not use Rust

The Rust ecosystem is young. Even if there exists a library for your purpose, it's not guaranteed to be robust enough for a production deployment. Additionally, many libraries today have a "truck factor" of 1. If the library's developer gets hit by a truck, it's going to be on you to maintain it.

Next, Rust's tooling story is weak. You can use tools like Racer and YCM to get pretty far, but they fail in a lot of cases. Good tooling is a necessity, especially for developers that are getting up-to-speed.

Having team members (who may be unfamiliar with Rust) contribute to the project may take a lot of "ramp-up" time. This risk has turned out to be quite real, but it hasn't stopped other members of our team from contributing patches to the project. Mentoring from team members more proficient with the language and familiar with the code base helped a lot here.

Finally, iteration times can be long. This wasn't something we anticipated up front, but build times have become onerous for us. A build from scratch now falls into the category of "go make coffee and play some ping-pong." Recompiling a couple of changes isn't exactly quick either.

Before settling on Rust, we considered writing OnePush in Go. Go has a lot going for it for this sort of application - its concurrency model is perfectly suited for managing many async TCP connections, and the ecosystem has good libraries for HTTP requests, Redis and PostgreSQL clients, and serialization. Go is also more approachable for someone unfamiliar with the language; this makes the code base more accessible to the rest of your team. Go's developer tools have also had more time to mature than Rust's.

Why choose Rust

Despite the negatives and the presence of a good alternative, Rust has a lot going for it that makes it a good choice for us. As mentioned earlier,

Rust enables one to build robust, complex systems quickly and fearlessly thanks to its powerful type system and ownership rules

This is huge. Being able to encode constraints of your application in the type system makes it possible to refactor, modify, or replace large swaths of code with confidence. The type system is our ultimate "move quickly and don't break things" secret weapon.

Rust's error handling model pushes developers to handle every corner case explicitly. Even if a subsystem has the potential to panic, it can be moved into its own thread for isolation. More recently, it has become possible to catch panics within a thread instead of only at the thread boundary. By contrast, languages like Go make it too easy to ignore errors.
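
To make the two panic-containment options above concrete, here is a minimal sketch using only the standard library. The `risky_divide` function and its wrappers are hypothetical names invented for illustration, not OnePush code. The first wrapper isolates a panic at a thread boundary; the second catches it within the current thread via `std::panic::catch_unwind`.

```rust
use std::panic;
use std::thread;

/// Hypothetical fallible step: panics when the divisor is zero.
fn risky_divide(a: u32, b: u32) -> u32 {
    if b == 0 {
        panic!("division by zero");
    }
    a / b
}

/// Isolate the panic in its own thread: `join` returns `Err` if the thread panicked.
fn divide_in_thread(a: u32, b: u32) -> Option<u32> {
    thread::spawn(move || risky_divide(a, b)).join().ok()
}

/// Catch the panic inside the current thread instead of at a thread boundary.
fn divide_caught(a: u32, b: u32) -> Option<u32> {
    panic::catch_unwind(|| risky_divide(a, b)).ok()
}

fn main() {
    println!("{:?}", divide_in_thread(10, 2));
    println!("{:?}", divide_in_thread(1, 0));
    println!("{:?}", divide_caught(10, 2));
    println!("{:?}", divide_caught(1, 0));
}
```

Both wrappers only work when the program is built with unwinding panics (the default); with `panic = "abort"` the process exits instead. The panic message is still printed to stderr by the default panic hook.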

Next, OnePush needed to be fast. Rust makes writing multithreaded programs quite easy. The Send and Sync traits work together to ensure such programs are free from data races.
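
As a small illustration of the Send and Sync point, the sketch below shares one counter across several worker threads. The function name and the counter are assumptions made for this example, not OnePush internals. It compiles because `Arc<AtomicU64>` is both `Send` and `Sync`; swapping in `Rc<Cell<u64>>` would be rejected at compile time, since `Rc` is not `Send`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Spawn `workers` threads that each record `per_worker` deliveries
/// on a shared counter, then return the total.
fn count_deliveries(workers: u64, per_worker: u64) -> u64 {
    let delivered = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            // Each thread gets its own handle to the same counter.
            let delivered = Arc::clone(&delivered);
            thread::spawn(move || {
                for _ in 0..per_worker {
                    delivered.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    // Joining every worker guarantees all increments are visible below.
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    delivered.load(Ordering::Relaxed)
}

fn main() {
    println!("{}", count_deliveries(4, 1_000));
}
```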

At the end of the day, our OnePush service is just a program optimized for sending a lot of HTTP requests. The library ecosystem offered everything we needed to build this system: An async HTTP/2 client, an async HTTP/1.1 client, a Redis client library and a PostgreSQL client library. We are fortunate that the Rust community is full of talented and ambitious developers who have already published a great deal of quality libraries that suit our specific needs.

Finally, the developer leading the effort had experience with, and a strong preference for, Rust. There are plenty of technologies that would have met our requirements, but deferring to a personal preference made a lot of sense. Having engineers excited about what they are working on is incredibly valuable. Such intrinsic motivation increases developer happiness and reduces burnout. Imagine going to work every day and getting to work on something you're excited about! Developer happiness is important to us as a company. Being able to provide so much of it by choosing one technology over another was a no-brainer.

Risks

Aside from the general reasons not to choose Rust discussed above, we had a few additional concerns for this particular project.

As a glorified HTTP client, OnePush needed to be able to send lots of HTTP/1.1 requests very quickly. In the beginning, this wasn't quite as true because of our scale and because Android notifications could be batched into single requests. Going forward, we expected a huge increase in HTTP/1.1 outgoing request volume due to growth and the new WebPush specification with encrypted payloads. Hyper (Rust's HTTP library) had an async branch that was just a prototype when we started. We hoped that, by the time we truly needed an async client, it would be ready.

As it turned out, the initial async Rotor-based branch of Hyper never stabilized because tokio and futures were announced in August 2016. By the time we really needed the async branch, we ended up having to spend a week or two debugging, stress-testing and fixing the Rotor-based hyper::Client. This turned out to be OK since it was a chance to give back to the Rust community.

Since we would be on the nightly channel for serde derive and clippy lints, another risk was spending a lot of time doing rustc upgrades. We avoided this situation by pinning to specific versions of the compiler and upgrading infrequently. When we did upgrade, the process required finding a recent rustc that was supported by both libraries. This will become less of an issue very soon with the advent of Macros 1.1.

Finally, Solicit (Rust's HTTP/2 library) uses three threads per connection. Although this is fine in isolation, having 20,000 connections quickly becomes expensive. We've mitigated this issue by using a short keep-alive to limit the number of active connections and by taking advantage of Apple's HTTP/2 provider API (APNs), which allows 500 requests in-flight per connection.
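
The arithmetic behind this concern can be sketched as follows. The three-threads-per-connection and 500-in-flight figures come from the paragraph above; the load of 20,000 concurrent requests in the second calculation is a hypothetical number chosen only to mirror the connection count, and the function names are made up for this example.

```rust
/// Threads Solicit uses per connection, as stated above.
const THREADS_PER_CONNECTION: u64 = 3;
/// Requests APNs allows in flight on one connection, as stated above.
const MAX_IN_FLIGHT_PER_CONNECTION: u64 = 500;

/// Threads needed to hold `connections` open at once.
fn threads_for(connections: u64) -> u64 {
    connections * THREADS_PER_CONNECTION
}

/// Fewest connections that can carry `in_flight` concurrent requests
/// (ceiling division by the per-connection limit).
fn connections_for(in_flight: u64) -> u64 {
    (in_flight + MAX_IN_FLIGHT_PER_CONNECTION - 1) / MAX_IN_FLIGHT_PER_CONNECTION
}

fn main() {
    // 20,000 open connections cost 60,000 threads.
    println!("{} threads", threads_for(20_000));

    // Hypothetical load: 20,000 concurrent requests, multiplexed 500 per connection.
    let connections = connections_for(20_000);
    println!("{} connections, {} threads", connections, threads_for(connections));
}
```

Under these assumptions, multiplexing brings the same number of concurrent requests down from 60,000 threads to 120, which is why the per-connection in-flight limit matters so much here.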

Unexpected Issues

For the most part, we knew what we were getting into building such a system in Rust. However, one thorn in our side that we didn't anticipate was rust-openssl upgrades. We are stuck on an earlier version of rust-openssl since the Solicit library depends on an API that has been removed since v0.8.0. This means that we are unable to upgrade other dependencies which rely on rust-openssl until we fix the Solicit issue.

Another minor issue at one time was the limited test framework. A common feature for test frameworks is to have some setup and teardown steps that run before and after a test. We say this issue was minor because we were able to work around its absence by generating many tests declaratively with macros (discussed below).

Successes

Writing OnePush in Rust has been hugely successful for us. We've been able to easily meet our performance and scaling goals with the application. OnePush is capable of delivering over 100k notifications per second and efficiently maximizes the use of system resources. Despite being highly multithreaded, race conditions have not been an issue for us. Even better, OnePush needs very little attention. We were able to leave it running without any issues through the holiday break.

Regressions are very infrequent. There's a huge class of bugs in languages like Ruby that just aren't possible in Rust. When combined with good test coverage, it becomes difficult to break things - all thanks to Rust's fantastic type system. This isn't just about regressions either. The compiler and type system make refactoring basically fool-proof. We like to say that Rust enables belligerent refactoring - making dramatic changes and then working with the compiler to bring your project back to a working state.

The macro system has been another big win. Our favorite example of how this saves us engineering time is using macros for writing tests declaratively. For example, a large set of our tests is for the Terminal. Each test takes some Events as input, and the state of Redis and Postgres is then checked for correctness after the events are processed. The macro system enabled us to remove all of the boilerplate for these tests and declaratively say what the event is and what the expected outcome should be. Writing a test for this system today looks like this:

// Invoking terminal test-writing macro
push_test! {
    // The part before the arrow ends up being the test name.
    // The `response` describes an `Event`, and the rest describes the system
    // state after processing it. There are more parameters that can be
    // specified, but the default values are acceptable in this case.
    apns_success => {
        response: apns::Response::Success,
        success: 1,
        sending_done: true
    },
    // .. and so on
}

Writing a lot of similar tests in this fashion enables us to get a lot of coverage without a lot of work. It also helps us work around the lack of features in the Rust test system (such as before/after hooks).
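
For readers curious how a macro like this could be put together, here is a simplified sketch with `macro_rules!`. It is not the real `push_test!` implementation: the `Response` enum, the `Outcome` struct, and the `process` function are hypothetical stand-ins for the real events and the Redis/Postgres state, and unspecified fields fall back to `Default`, mirroring the "default values" mentioned in the example above.

```rust
/// Hypothetical stand-in for a provider response event.
#[derive(Debug)]
enum Response {
    Success,
    Failure,
}

/// Hypothetical stand-in for the state the real tests read back
/// from Redis and Postgres.
#[derive(Debug, Default, PartialEq)]
struct Outcome {
    success: u32,
    failed: u32,
    sending_done: bool,
}

/// Hypothetical event handler: tallies one response and marks sending done.
fn process(response: Response) -> Outcome {
    let mut outcome = Outcome { sending_done: true, ..Outcome::default() };
    match response {
        Response::Success => outcome.success += 1,
        Response::Failure => outcome.failed += 1,
    }
    outcome
}

/// Expands each `name => { response: ..., field: value, ... }` entry into
/// one function that processes the event and compares the resulting state.
macro_rules! push_test {
    ($($name:ident => {
        response: $response:expr,
        $($field:ident: $value:expr),* $(,)?
    }),* $(,)?) => {
        $(
            // A `#[test]` under `cargo test`; a plain function otherwise,
            // so that `main` below can call it directly.
            #[cfg_attr(test, test)]
            fn $name() {
                // Fields not listed in the entry keep their default values.
                let expected = Outcome { $($field: $value,)* ..Outcome::default() };
                assert_eq!(process($response), expected);
            }
        )*
    };
}

push_test! {
    apns_success => {
        response: Response::Success,
        success: 1,
        sending_done: true
    },
    apns_failure => {
        response: Response::Failure,
        failed: 1,
        sending_done: true
    },
}

fn main() {
    apns_success();
    apns_failure();
    println!("all generated checks passed");
}
```

Any shared setup or teardown can be written once inside the generated function body, which is one way a macro can stand in for before/after hooks.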

The final thing we want to comment on here is serde. This library enables adding a #[derive(Deserialize)] attribute to a struct and getting a deserialize implementation. Combined with our serde-redis library, this makes it possible to load data out of Redis like so:

/// A person has a name and an ID.
///
/// This is just some data with a derived
/// Deserialize implementation
#[derive(Deserialize)]
struct Person {
    name: String,
    id: u64
}

// Gets a `Person` out of redis
let person: Person = redis.hgetall("person")?;

On the left-hand side of the line fetching person, there's a binding name with a type annotation. On the right-hand side, there's a call to Redis with HGETALL, and a ?. The ? is a bit of error handling; if the request is successful and deserialization works, person will be a valid Person, and the name and id fields can be used directly with knowledge that they were returned from Redis. If something goes wrong, such as Redis being unreachable or data missing for the Person (for example, a missing id), an error is returned from the current function.

This is really powerful! We can just describe our data, add this derive attribute and then safely load the data out of Redis. To get the same effect in a dynamic language, one would need to load this dictionary out of Redis and write a bunch of boilerplate to validate that the returned fields are correct. This sort of thing makes Rust more expressive than many high-level languages.
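
To show roughly what the derive saves, here is a hand-written sketch of the same validation using only the standard library. It is an illustration under assumptions, not how serde or serde-redis work internally: the `HashMap` stands in for the field/value pairs returned by HGETALL, and `LoadError`, `person_from_hash`, and `hash_of` are hypothetical names. Each `?` returns early with an error, just like the `?` in the line above.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct Person {
    name: String,
    id: u64,
}

/// Hypothetical error type covering the failure cases described above.
#[derive(Debug, PartialEq)]
enum LoadError {
    MissingField(&'static str),
    InvalidId(String),
}

/// Build a `Person` from raw string fields, validating each one by hand.
fn person_from_hash(fields: &HashMap<String, String>) -> Result<Person, LoadError> {
    // `?` returns early from this function if the field is absent.
    let name = fields
        .get("name")
        .ok_or(LoadError::MissingField("name"))?
        .clone();
    let raw_id = fields.get("id").ok_or(LoadError::MissingField("id"))?;
    // `?` also propagates a failure to parse the id as a `u64`.
    let id = raw_id
        .parse::<u64>()
        .map_err(|_| LoadError::InvalidId(raw_id.clone()))?;
    Ok(Person { name, id })
}

/// Helper for the example: build an owned map from string pairs.
fn hash_of(pairs: &[(&str, &str)]) -> HashMap<String, String> {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

fn main() {
    let complete = hash_of(&[("name", "Ada"), ("id", "7")]);
    println!("{:?}", person_from_hash(&complete));

    let missing_id = hash_of(&[("name", "Ada")]);
    println!("{:?}", person_from_hash(&missing_id));
}
```

Every additional field needs another lookup, conversion, and error case written this way; the derive attribute generates the equivalent checks from the struct definition alone.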

Open Source

Early adoption in an ecosystem means there are lots of opportunities for open source contributions. The most notable of our contributions is a project called serde-redis, a Redis deserialization backend for serde. We've also had the opportunity to contribute several patches to Hyper's Rotor-based async client. We use that client in OnePush and have made billions of HTTP requests with it.

What's next

We've come far with OnePush, but there's still more work to do! Here are just a few of our upcoming projects related to OnePush:

  • Upgrade to Hyper's Tokio-based async implementation. We probably won't be super early adopters here since we've got an HTTP client with a lot of production miles on it right now.
  • Rework result processing to use futures. The Terminal's concurrency from threads is limited, whereas something backed by mio could have much higher throughput. This would require futures-compatible Redis and Postgres clients.
  • Replace Solicit's thread-based async client with a mio-based one. We've actually got a prototype of something from earlier in 2016.

We also have a new internal application written in Rust which we hope to blog about soon! It's a core piece of our monitoring which is responsible for collecting statistics from our production systems and storing them in InfluxDB.

Conclusion

We've had fantastic results building one of our core systems in Rust. It has delivered many billions of notifications, and it's delivering more and more each day. We hope that sharing our experience as early adopters in the Rust ecosystem will be helpful to others when making similar decisions. We've certainly found Rust to be a secret weapon for quickly building robust systems.

Like what we're doing? We're hiring!
