Tilt is the easiest way to collect money from a group. At Tilt, we enable our users to make anything a reality, from weekend getaways to helping a bobsled team go to the Olympics to bringing your favorite band back to your city. The possibilities are endless, and we're always striving to make the process as easy and as frictionless as possible.
Our team is growing fast across the U.S., Canada, and the UK, and more than 300,000 groups have used Tilt. Our user base continues to grow each day, and we are actively expanding into more countries across continents.
I've been with Tilt for three years now, since joining as the first hire. I'm currently the Director of Technology, and I work on constantly improving all areas of our stack: on-boarding, our development environment, testing, architecture, code reviews, the deploy process, and everything in between. Prior to Tilt, I worked at Rackspace helping build the OpenStack open cloud computing platform. I learned a ton through Rackspace's rapid growth leading up to and through its IPO, and that experience has been invaluable in building Tilt.
Currently the Tilt engineering team comprises around 30 engineers, split into the following 7 teams:
- Tilts - Focused on all things Tilt Pages (Tilt creation, edit, contributions, etc.)
- Tilters - Focused on all things Tilters (login/signup/social/sharing)
- Payments - Building the best group payments layer around
- Notifications - Owning Email, Push Notifications, Notification Center, Events
- Mobile - Gorgeous Android and iOS apps for everyone
- Internal Tools - Providing everything we need internally to run Tilt smoothly (for Support, Biz Dev, Fraud, etc.)
- SysOps (TiltOps) - Keeping everything running smoothly, building out our future PAAS
All of our teams are cross-functional and largely full-stack. At about 30 engineers, we all wear many hats, and we encourage each other to improve, learn about, and tinker with any and all parts of our stack, even those outside of our own team or responsibilities.
We keep our teams small and moving fast. You can read more about how our teams evolved from one team in the beginning to the current structure here. Check back on our blog for more updates on this in the future.
Our team is also highly distributed (we have offices in Toronto and Austin, with engineers in both, as well as engineers in Colorado, Maine, Philadelphia, Virginia, etc.), so we're really passionate about digital tools that bring us all closer. We make heavy use of Slack, Google Hangouts, Jira/Confluence, Skype, Mumble, etc., to keep everyone on the same page. Over-communication is very much baked into our culture.
We chose PostgreSQL as our main datastore very early on because we had grown unhappy with MySQL after extensive painful experiences with it in the past, and we wanted better data integrity and constraint checking. PostgreSQL has been very reliable, and it has seen a huge resurgence in popularity over the three years we've been building Tilt. At the time, Amazon RDS did not support PostgreSQL, so we had the extra overhead of managing it ourselves, including setting up our own monitoring and wal-e backups, but it turned out to be well worth it. We were also lucky enough to be in Amazon's early beta to help them test their PostgreSQL RDS implementation, and we now use several RDS instances in production ourselves (though we still manage our core DB ourselves).
We use Nginx as a reverse-proxy to our apps, and also to serve static files directly in many cases. Varnish sits in front of our web application to help with performance for both desktop and mobile web users, and we make heavy use of Amazon's ELB's throughout our stack to load balance our application servers.
The core of Tilt's architecture from the very beginning has centered around our RESTful Tilt API. The Tilt API is the central place for our business logic and domain objects, and all of our products (Web, Mobile, Tilt/Open, etc.) work via this API. This API is written in Perl (check out our api-style-guide) using the Dancer web framework. The API makes use of Memcached (via ElastiCache) for performance improvements as well as locking where necessary.
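One technique worth spelling out is using a cache for "locking where necessary": memcached's `add` operation only succeeds if the key doesn't already exist, which makes a simple distributed mutex. Our API does this in Perl, but here is a minimal Python sketch of the idea; the `FakeMemcache` stand-in, key names, and TTL are illustrative, not our actual implementation (real clients such as pymemcache expose the same `add` semantics).

```python
import time

class FakeMemcache:
    """In-memory stand-in for a memcached client, for illustration only.
    add() succeeds only if the key does not already exist (or has expired),
    which is the property the lock relies on."""
    def __init__(self):
        self._store = {}  # key -> (value, expiry timestamp; 0 = never)

    def add(self, key, value, expire=0):
        now = time.time()
        entry = self._store.get(key)
        if entry is not None and (entry[1] == 0 or entry[1] > now):
            return False  # key already held by someone else
        self._store[key] = (value, now + expire if expire else 0)
        return True

    def delete(self, key):
        self._store.pop(key, None)

def with_lock(cache, name, fn, ttl=30):
    """Run fn() only if we win the lock. The TTL guards against a
    crashed worker holding the lock forever."""
    key = "lock:" + name
    if not cache.add(key, "1", expire=ttl):
        raise RuntimeError("lock %s is held elsewhere" % name)
    try:
        return fn()
    finally:
        cache.delete(key)
```

The important design point is that the check-and-set is a single atomic operation on the cache server, so two API workers can never both think they hold the lock.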
For logging, we currently use Logstash to parse all of our various application logs and ship them into an ElasticSearch cluster for easy investigation/debugging. We also use StatsD with InfluxDB and a Grafana dashboard for server, application, request/response, and deploy data monitoring.
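For a sense of why StatsD is so easy to sprinkle through application code: metrics are just tiny plaintext UDP packets, so instrumented code never blocks on (or even notices) a down collector. A minimal sketch, with illustrative metric names and the default StatsD port:

```python
import socket

def statsd_packet(metric, value, metric_type):
    """Build a StatsD plaintext packet: '<metric>:<value>|<type>'.
    Common types: 'c' = counter, 'ms' = timer, 'g' = gauge."""
    return ("%s:%s|%s" % (metric, value, metric_type)).encode("ascii")

def send_metric(metric, value, metric_type, host="127.0.0.1", port=8125):
    # UDP is fire-and-forget: no connection, no acknowledgement,
    # so instrumentation adds almost no latency to the request path.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(statsd_packet(metric, value, metric_type), (host, port))
    finally:
        sock.close()

# e.g. send_metric("api.request", 1, "c") to count a request,
#      send_metric("api.response_time", 45, "ms") to record a timing
```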
We've always managed our servers with Chef, which enabled us to easily migrate from Rackspace to Amazon in the early days, but we're constantly looking for ways to reduce our management overhead and leverage AWS services as much as we can. All of our applications run in Auto Scaling groups, and we're experimenting with services like Elastic Beanstalk and the EC2 Container Service, with an eye toward reducing the amount of Chef code we have to maintain.
Owning Group Payments
Tilt has had an incredibly educational journey in the payments space, to say the least. We’ve had to learn more about payments than we could have imagined, from navigating banking relationships, to becoming PCI compliant (level 1!), learning all of the 3-letter acronyms (KYC, AML, FBO, PCI, etc.), integrating with multiple payment processors (and their quirks), and the list goes on and on.
A lot of the difficulty we had along the way came from the unique challenges of group payments. In the traditional payments model consumers transfer money to a merchant for a good or service as a 1-to-1 transaction. In the group payments model, the ‘merchant’ (Tilt, in this case) facilitates one person collecting money from many people as a many-to-one transaction. This presents numerous challenges in how to properly escrow funds, comply with regulatory differences, properly account for funds, and even which payment methods and networks can be supported. While figuring all of these things out, we’ve completely migrated payment processors 3 times without a hitch. Now we seamlessly support multiple payment processors and multiple currencies.
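The core of the many-to-one model is that contributions are collected conditionally: money only moves once the group hits its goal (the "tilt" point). Here is a deliberately toy Python sketch of that threshold behavior; the class and method names are made up for illustration, and real escrow, capture, currency, and regulatory handling are far more involved than this.

```python
class GroupCollection:
    """Toy model of many-to-one group collection: pledges accumulate,
    and the collection only 'tilts' (i.e. contributors would actually
    be charged) once the total reaches the goal."""
    def __init__(self, goal_cents):
        self.goal_cents = goal_cents
        self.pledges = []  # (contributor, amount) authorized, not yet charged

    def contribute(self, who, amount_cents):
        """Record a pledge; returns True once the collection has tilted."""
        self.pledges.append((who, amount_cents))
        return self.tilted()

    def tilted(self):
        return sum(a for _, a in self.pledges) >= self.goal_cents
```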
We’re building a platform and service that will provide the best localized group payments experience in every country we operate in, while drastically reducing the PCI scope in our stack. This allows the rest of our product services and teams to iterate even faster without compromising our users’ security.
For our development environment at Tilt, we love Vagrant. We use Vagrant to spin up local VMs (or AWS machines) which are automatically provisioned via the same chef recipes that provision our production servers. This gives us a high confidence that our chef recipes work, and that our development environment closely mimics production.
It also makes it really easy for new hires to get up and running and deploy code on day one :)
Shipping code at Tilt has improved by leaps and bounds over the past 6 months to the point where it's now just a few commands in Slack to our trusty "TiltBot", and takes just seconds for each app. Recently, we've been doing 60-70 deploys a week across all of our engineering teams. Most of our new engineers ship code on their first day!
We have a few fun bots that visit our chat rooms, from Code ReviewTratron, to "skynet", to our beloved TiltBot. They do things like help with code reviews, talk back to us, deploy our code, see who's deploying what and when, provide us with useless facts, "helpful" links, and much more.
Here's a little taste of what our Slack rooms look like:
Code Review Halp!
Assets Not Ready =(
We uh, have a lot of fun in Slack...but back to shipping code.
Our current deploy process is a big improvement over our old Chef + git based deploy model, where servers would actually check out a git repository. Now, we build a versioned artifact (currently a .tgz) with Jenkins, which bundles up the application with all of its dependencies localized inside. These artifacts are named by their git commit hash, stored in S3, and moved through the pipeline (staging -> production). This gives us a strong guarantee that the code we tested in our CI process and on staging is the exact same code deployed to production, with no room for race conditions or drift. Each application repository contains its own Capistrano deploy scripts, which helps individual teams stay in control of how their apps get deployed.
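The hash-named-artifact pipeline can be sketched in a few lines. This is a toy Python model of the flow described above, not our actual tooling; the stage names, key layout, and method names are illustrative. The key properties it captures: artifacts are immutable and named by commit hash, production only ever receives what staging already ran, and keeping the previous hash around makes rollback trivial.

```python
class DeployPipeline:
    """Toy model of a build -> staging -> production artifact pipeline."""
    def __init__(self, app):
        self.app = app
        self.current = {}   # stage -> git sha deployed right now
        self.previous = {}  # stage -> prior sha, kept for rollback

    def artifact_key(self, sha):
        # Artifacts are immutable and named by commit hash, so the
        # same bytes move through every stage of the pipeline.
        return "builds/%s/%s.tgz" % (self.app, sha)

    def deploy(self, stage, sha):
        self.previous[stage] = self.current.get(stage)
        self.current[stage] = sha

    def promote_to_production(self):
        # Production only ever receives the sha already on staging,
        # so the tested artifact and the shipped artifact are identical.
        self.deploy("production", self.current["staging"])

    def rollback(self, stage):
        # Rolling back is just another deploy of the previous hash.
        self.deploy(stage, self.previous[stage])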
When a developer builds a new feature, they do so in a new branch for the appropriate application, using a standard git flow-esque model. All our code lives in GitHub, so we do our code reviews via Pull Requests there.
Every Pull Request triggers a Jenkins build for that application, which runs our unit tests plus Selenium tests on the frontend app, and triggers several thousand unit and integration tests on the API. We've made lots of performance improvements to get all of this down to just a few minutes using some cool tools we've built internally, called Tilty and Omni, which help us use EC2 instances to parallelize our test runs (we're looking to open source some of these, so stay tuned!).
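Tilty and Omni aren't public yet, so here is only a generic sketch of the underlying idea: split the test suite into shards and run one shard per worker (e.g. per EC2 instance), so wall-clock time drops roughly in proportion to the number of workers. Sorting first keeps the split deterministic across machines; a smarter scheduler would bin-pack by historical test runtimes instead of round-robin.

```python
def shard_tests(test_files, n_workers):
    """Deterministic round-robin split of a test suite across workers.

    Each worker i runs shards[i]; every test appears in exactly one
    shard, so the shards can run fully in parallel.
    """
    shards = [[] for _ in range(n_workers)]
    for i, test in enumerate(sorted(test_files)):
        shards[i % n_workers].append(test)
    return shards
```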
Once your branch has a review + signoff by another engineer (Code ReviewTratron will let you know!) you're free to merge your branch.
Developers can see what commits are awaiting a deploy pretty easily with our Skynet Bot:
Commits to Deploy
And doing the deploy itself is pretty easy, too (Staging shown here):
There are still lots of improvements to make, but we've already made deploys incredibly easy, which means we do them more often, and with smaller change deltas. Once code is reviewed and the tests pass, we ship!
Of course, since our deploys are so quick, and based on git hashes, we can rollback to the previous deploy point just as easily and quickly :)
Inevitably as a company experiences rapid growth, there will be growing pains and challenges to overcome. A large number of the challenges we’ve run into revolve around optimizing and scaling our datastore: PostgreSQL.
PostgreSQL is amazing, and we love it more and more every day, but we’re constantly learning how to better tune it to our needs, improve our indexes, and do more efficient migrations.
One of our worst outages came from a rogue data migration that was reviewed, tested, and worked flawlessly forwards and backwards in our development and staging environments. Nonetheless, it still managed to bring our database to a screeching halt as soon as it hit production.
The issue? Well, our `users` table turns out to be incredibly busy, and, by default, adding a foreign key from another table takes an exclusive lock on the referenced table. That's exactly what we were trying to do: our migration hung while locking the `users` table, breaking large portions of functionality throughout our platform.
Killing the migration and sorting out the aftermath was "fun", but we quickly learned about the `NOT VALID` option when adding constraints/foreign keys, which tells Postgres to skip validating existing rows at creation time (so the lock is held only for an instant rather than for a full table scan). We also learned how to make better use of `DEFERRABLE INITIALLY DEFERRED` to have Postgres defer constraint checking until the end of the transaction.
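The lock-friendly pattern for adding a foreign key to a busy table is a two-step migration: add the constraint `NOT VALID` (enforced only for new rows, so the lock is brief), then `VALIDATE CONSTRAINT` separately, which checks existing rows under a much weaker lock. Below is a small Python helper that generates that DDL; the table/column names and the constraint-naming scheme are illustrative, not our actual migration tooling.

```python
def safe_add_foreign_key(table, column, ref_table, ref_column="id"):
    """Generate two-step DDL for adding a FK without a long exclusive lock.

    Step 1 creates the constraint NOT VALID (new rows checked, old rows
    skipped, lock held only briefly). Step 2 validates existing rows
    under a weaker lock that doesn't block normal reads and writes.
    """
    name = "%s_%s_fkey" % (table, column)
    return [
        "ALTER TABLE %s ADD CONSTRAINT %s "
        "FOREIGN KEY (%s) REFERENCES %s (%s) NOT VALID;"
        % (table, name, column, ref_table, ref_column),
        "ALTER TABLE %s VALIDATE CONSTRAINT %s;" % (table, name),
    ]
```

Running the two statements in separate transactions is what keeps the busy table available the whole time.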
This is just one example of our many scaling adventures with PostgreSQL. You can check out more SQL migration hacks we’ve learned here.
We're currently working on lots of improvements across all parts of our stack. We're building a much easier platform for developers to spin up new services on, which will be built on top of Amazon ECS and EB. This platform will help us shift towards smaller, more isolated services, which we hope will help us move faster with fewer bugs.
Mobile is eating the world, and that's certainly true for payments as well. That's why we continue to invest heavily in beautiful, simple iOS and Android apps. People love the ease and simplicity of doing things on their phones over bulky computers, and we're really pushing the envelope when it comes to group payments on mobile. We've adopted a mobile-first mentality (for web, too), which is helping us simplify the product, move faster, and give our users the experience they expect.
One current project that we're incredibly excited about is our next-generation payments layer. We've realized that group payments are still a very new space in the payments world. Group payments involve complicated technical and regulatory challenges that are definitely not solved yet, but we've come up with some really innovative solutions and abstractions to help us expand internationally into new currencies, payment methods, and payment networks. We'll be posting more about this project very soon, so be on the lookout for updates.
And, of course, the reason we focus on any of these improvements at all, is because of our relentless drive to build the easiest, most frictionless group payments product possible, for everyone, everywhere. We're already in the US, Canada, and the UK, and incredibly excited to be expanding further. I know there are many hurdles and challenges to come as we scale our payments and social infrastructure to more people in more countries using more languages and more payment methods.
We're actively looking for great engineers to join all parts of our growing team. Are you passionate about building great user experiences? Want to help build a state-of-the-art, next-gen group payments system? Are rock-solid APIs your thing? How about a scalable platform as a service? Or maybe data infrastructure, machine learning, and analytics? If any of these interest you, drop us a note and tell us why you'd like to help us create a better world, one Tilt at a time. #TiltTheWorld