By John Barton, Director of Engineering at 99designs. You can find him here on StackShare and Twitter.
99designs is the world's leading graphic design marketplace. We're best known for our design contests, but we connect designers and clients in a variety of ways including contests, fixed price tasks, 1:1 projects, and even off the shelf logos from our stock store. We were founded in Melbourne, Australia, but after raising VC our headquarters moved to San Francisco. The bulk of our product development group is still based here in Melbourne - just in a much bigger office!
As of April 2015, we've facilitated 390,000+ crowdsourced design contests for small businesses, startups, agencies, non-profits and other organisations, and have paid out over $110M to our community of 1M+ graphic designers around the world. We also serve localised versions of the site in English, Spanish, German, French, Italian, Dutch, Portuguese and Japanese.
I'm the Director of Engineering here at 99, which puts me one rung under the CTO where I take care of the day to day running of our tech team and the short to mid-term architecture vision. My background is as a Ruby on Rails developer turned dev manager, and if there's a career trend it seems to be working in small to mid-sized teams in fast growing dual sided marketplaces.
Architecture & Engineering Team
We're big believers in Conway's Law at 99:
organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations
Like so many other tech startups, 99 started out with the classic LAMP monolithic architecture and team, and as the company grew rapidly that approach added friction to our development processes. Early attempts at splitting the monolith and moving onto newer technologies weren't 100% successful: some services suffered bitrot, which made new changes there more expensive than going back to the monolith. Maintaining a wide spread of services and languages also created a high operational burden, and our ratio of sysadmins to developers was uneconomical.
Around two years ago Lachlan, our CTO, went back to the drawing board, taking Conway's Law to heart. We now almost exclusively design the staffing around a particular product or "platform" challenge and allow the architecture to be an almost emergent property of our team structure.
We're now 33 engineers across both San Francisco and Melbourne arranged into around 8 cross functional teams, each of which is predominantly responsible for one major system. We've got a couple of developer positions in our Melbourne office opening very soon. Keep an eye on our jobs page for ads over the next few days. We generally look for anyone with PHP or Ruby experience, with Go knowledge being icing on top.
The architecture that has fallen out from this structure is roughly:
- Varnish providing our outermost ring tying everything together - caching, carving up our route map to various sub-products, etc
- Some core services written in Go, most notably our Identity & Single Sign On service and the system handling all of our marketing emails
- 4 main product teams with their own "mini-monoliths" in either PHP or Ruby on Rails, in a fairly standard LAMP shape of load balancer plus application tier plus Amazon RDS database cluster
- A cross platform payments service in Ruby on Rails
- A data science team tying everything together in an Amazon Redshift database powering our business intelligence
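The "route map" idea in the first bullet can be sketched roughly as follows. This is an illustration in Python, not the actual Varnish configuration: the path prefixes and backend names are hypothetical, and longest-prefix matching is an assumption about how such a map would typically be resolved.

```python
from typing import Dict, Optional

# Hypothetical route map: path prefix -> backend name (illustrative only).
ROUTE_MAP: Dict[str, str] = {
    "/": "marketing-site",
    "/contests": "contests-app",
    "/projects": "projects-app",
    "/payments": "payments-service",
    "/accounts": "identity-service",
}


def pick_backend(path: str, routes: Dict[str, str] = ROUTE_MAP) -> Optional[str]:
    """Return the backend whose prefix is the longest match for `path`."""
    best = None
    for prefix in routes:
        # Match whole path segments so "/contests-old" does not hit "/contests".
        matches = (
            prefix == "/"
            or path == prefix
            or path.startswith(prefix + "/")
        )
        if matches and (best is None or len(prefix) > len(best)):
            best = prefix
    return routes[best] if best is not None else None
```

With this table, `pick_backend("/contests/123")` resolves to the contests backend, while any path without a more specific prefix falls through to the catch-all `/` entry.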
Most of the systems listed are deployed using Docker, which we manage directly on EC2. Our technology choices per product have been driven largely by the sizes of our pager rotations. Every engineer takes part in an on call rotation, and we've divided the engineers into four roughly equal groups along technology lines: PHP on bare VMs, PHP in Docker, Rails/MySQL in Docker, and Rails/Postgres.
We're running more than 130 EC2 instances; if I had to guess, around 70-80 of those hosts run Docker.
In development our container orchestration is 80% bash and env vars, and we've added Docker Compose to manage our always-on containers. A lot of the complexity in our dev orchestration is about selectively starting containers (and their dependent containers) based on what you'll actually be working on that day - we don't have infinite resources in our little MacBooks :-) In production, for now, we just have one Docker container per EC2 host and pretty much just manage those with CloudFormation.
Our main application images are built up from our own base Docker image in dev and prod, but we use a lot of the stock images on the Docker registry for things like databases, Elasticsearch, etc. in development as rough equivalents for off-the-shelf Amazon products we use (like ElastiCache or RDS).
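Selectively starting containers together with the containers they depend on boils down to walking a dependency graph and ordering the result. The following is a minimal sketch of that idea in Python rather than bash. The container names and the dependency map are hypothetical, and it is not the actual tooling described above.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: container -> containers it needs running first.
DEPENDS_ON = {
    "contests-app": {"mysql", "identity"},
    "payments": {"postgres", "identity"},
    "identity": {"postgres"},
    "emails": {"identity"},
    "mysql": set(),
    "postgres": set(),
}


def start_order(targets, depends_on=DEPENDS_ON):
    """Return the containers needed for `targets`, dependencies first."""
    needed = {}
    stack = list(targets)
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        # An unknown container name raises KeyError, which is deliberate.
        deps = depends_on[name]
        needed[name] = deps
        stack.extend(deps)
    # static_order() yields each node only after all of its dependencies.
    return list(TopologicalSorter(needed).static_order())
```

Calling `start_order(["payments"])` returns only the three containers that payments work needs, with `postgres` ahead of `identity` and `identity` ahead of `payments`; unrelated containers such as `mysql` are left stopped.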
Workflow & Deployment
Our development environments are 100% Docker-powered. Every codebase & system we develop day to day has a Docker container set up, and we wire those together with some bespoke bash and a standardised Vagrant image. We deliberately cobbled our current solution together in a language we didn't like so that once community consensus emerged on container linking we wouldn't feel the least bit bad about deleting what we've got, and could fight off "Not Invented Here" syndrome.
We practice full continuous delivery across all of our systems except our credit card processor: every commit to master is built and tested in CI and automatically deployed to production.
We use Buildkite to manage our build/test/deploy pipeline. It's a hosted management service with agents installed on your build servers, and it works just as well managing Docker containers as it does working with legacy "handcrafted" CI boxes for some of our older bits and pieces. Having a unified build management system without necessarily having a unified test environment is really useful.
Tools & Services
For monitoring we use New Relic, Papertrail for unified logging, Bugsnag for error reporting, PagerDuty for on call rotations, Cloudability for AWS cost analysis, lots of CloudWatch alerts, and Wormly for our external HTTP health checks.
For issue tracking, we stopped using GitHub Issues internally quite a while back. It was a real barrier to getting team members outside of engineering involved with the bug reporting and triaging process, so now we just handle it all with Trello cards.
We use Segment for tracking our business events for analytics on both client and server side. It makes product development a lot easier for each team to have one API to work with and only worry about what kind of events they emit, without getting bogged down in how they'll be analysed.
We use a bunch of different payment processors depending on which market the customer is in, which method the customer wants to use, and which services are up. The main ones in production now are Stripe, Braintree, PayPal, and Adyen. We use Sift as one among several fraud prevention measures.
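Choosing a processor from market, payment method and current availability can be sketched as a small lookup with fallback. This is only an illustration: the preference table, the market codes and the way health is reported are all assumptions, not a description of the real payments service.

```python
# Hypothetical preference table: (market, method) -> processors in priority order.
PREFERENCES = {
    ("US", "card"): ["stripe", "braintree"],
    ("DE", "card"): ["adyen", "stripe"],
    ("US", "paypal"): ["paypal"],
}


def choose_processor(market, method, healthy, preferences=PREFERENCES):
    """Return the first preferred processor that is currently healthy, else None."""
    for name in preferences.get((market, method), []):
        if name in healthy:
            return name
    return None
```

For example, `choose_processor("US", "card", {"braintree", "adyen"})` skips the unavailable first choice and returns `"braintree"`, and a `None` result signals that no suitable processor is up for that combination.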
As I mentioned, 99designs is available in eight different languages. Our localisation efforts proceeded in two waves, both of which heavily rely on Smartling. For the initial rollout we used their proxy service in front of the site, which swapped out content on the fly with translated sentences we managed inside their CMS, as we had just too many English-only pages to convert by hand. For the second phase we've been using Smartling to export XLIFF files so that we can display the right content directly from our servers. The second phase has been a much more organic process - as we redesign pages or launch new products we'll roll them out as internationalised from our hosts, but we haven't treated that as a project in and of itself.
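Serving translations from exported XLIFF files means reading each `trans-unit` and looking up its `target` text at render time. Here is a minimal sketch using Python's standard library, assuming XLIFF 1.2; the sample document, the unit ids and the fall-back-to-source behaviour are made up for illustration and are not taken from a real export.

```python
import xml.etree.ElementTree as ET

# XLIFF 1.2 namespace; the sample document below is invented for illustration.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file source-language="en" target-language="de" datatype="plaintext" original="homepage">
    <body>
      <trans-unit id="homepage.title">
        <source>Launch a design contest</source>
        <target>Starte einen Designwettbewerb</target>
      </trans-unit>
      <trans-unit id="homepage.cta">
        <source>Get started</source>
      </trans-unit>
    </body>
  </file>
</xliff>
"""


def load_translations(xliff_text):
    """Map trans-unit id -> translated text, falling back to the source text."""
    root = ET.fromstring(xliff_text)
    translations = {}
    for unit in root.iterfind(".//x:trans-unit", XLIFF_NS):
        source = unit.findtext("x:source", default="", namespaces=XLIFF_NS)
        target = unit.findtext("x:target", default=None, namespaces=XLIFF_NS)
        # A missing or empty <target> means the string is not translated yet.
        translations[unit.get("id")] = target if target else source
    return translations
```

Falling back to the source string keeps a page readable when a unit has not been translated yet, which suits an incremental rollout where pages are internationalised one at a time.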
We've standardised on SASS across the business for CSS preprocessing, and use a framework one of our devs created called Asimov to manage the way we share SASS across projects in a component driven design process.
Each product has its own Makefile and gulp/grunt based asset pipeline. We eschewed the Rails asset pipeline as we felt it was better to have company-wide consistency in front-end workflow (as the asset pipeline is not available to our PHP teams) than it was to keep consistency with the Rails community.
Right now we consistently build our assets in CI (rather than on our production boxes), but what we do with them varies by team. Most of the products ship the assets along with the server-side code and serve them directly alongside dynamic content. Some of the newer projects have started shipping all static assets to S3 as part of the build, requiring us only to ship an asset manifest alongside our server-side code. We'll be converging over time on this solution as it gives us a few operational wins. Firstly, it makes for smaller git repos or Docker images, as our compiled assets are proportionately quite heavyweight. Secondly, it puts much less traffic through our load balancers, giving us a lot more headroom for future growth in pageviews without adding latency, something that can happen if you overburden ELBs.
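An asset manifest in this style is usually just a map from logical asset names to the fingerprinted files that the build uploaded. The sketch below shows how server-side code might resolve URLs from one. The manifest format, file names and asset host are hypothetical placeholders, not the real build output.

```python
import json

# Hypothetical manifest emitted by the CI asset build:
# logical asset name -> fingerprinted file uploaded to the asset host.
MANIFEST_JSON = """
{
  "app.css": "app-3f9a1c.css",
  "app.js": "app-b71e04.js"
}
"""

ASSET_HOST = "https://assets.example.com"  # placeholder for an S3 bucket or CDN


def asset_url(name, manifest, host=ASSET_HOST):
    """Resolve a logical asset name to its fingerprinted URL on the asset host."""
    try:
        return f"{host}/{manifest[name]}"
    except KeyError:
        # Fail loudly: a missing entry means the build and the code are out of sync.
        raise KeyError(f"asset {name!r} missing from manifest") from None


manifest = json.loads(MANIFEST_JSON)
```

Because only the small manifest ships with the server-side code, the compiled assets themselves never need to enter the git repo or the Docker image.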
What's coming up next
The biggest change for us right now is the introduction of React in the front end. It's been a huge benefit for some of our complex real-time parts of the site. Our designer to customer messaging system in contests is all React-based now and it made a very challenging problem much more tractable.
Go is also becoming a bigger and bigger piece of our stack. It's such a pleasure to use as a JSON API layer, and paired with our use of React it should allow us to see some great results.
As we get closer and closer to all of our production environment being Docker-based we've been keeping a close eye on how Amazon's Elastic Container Service has been evolving. Getting rid of our custom code managing our Docker containers on EC2 will be a big maintenance cost cutter for us, freeing up more engineering time to do product work.