500px is an online community for premium photography. Millions of users all over the world share, discover, sell and buy the most beautiful images. We value design, simplicity in the code, and getting stuff done.
I'm a DevOps Developer at 500px working on the Platform Team. I work on stuff like backend systems, monitoring, configuration management, and deployment and automation. Prior to joining 500px, I spent many years as a sysadmin working in the insurance industry.
Engineering at 500px
The 500px Engineering team is split into four groups: the Web team, the Mobile team, the QA/Release Engineering team and my team, the Platform team, which is a combined team that handles building our API and backend services as well as technical operations and infrastructure management.
Our teams are highly cross functional and boundaries are fairly loose, so engineers wind up moving around a lot between teams to work on specific projects. This helps us spread knowledge and prevent siloing. There is also a very tight communications loop between engineering, product, design and the customer excellence teams, which helps to keep us honest, agile, and focused on delivering the right things.
The architecture of 500px can be thought of as a large Ruby on Rails monolith surrounded by a constellation of microservices. The Rails monolith serves the main 500px web application and the 500px API, which powers the web app, mobile apps and all our third party API users. The various microservices provide specific functionality of the platform to the monolith, and also serve some API endpoints directly.
The Rails monolith is a fairly typical stack: the App and API servers serve requests with Unicorn, fronted by Nginx. We have clusters of these Rails servers running behind either HAProxy or LVS load balancers, and the monolith's primary datastores are MySQL, MongoDB, Redis and Memcached. We also have a bunch of Sidekiq servers for background task processing. We currently host all of this on bare metal servers in a datacenter.
The microservices are a bit more interesting. We have about ten of them at the moment, each centered around providing an isolated and distinct business capability to the platform. Some of our microservices are:
- Search related services, built on Elasticsearch
- Content ingestion services, in front of S3
- User feeds and activity streams, built on Roshi and AWS Kinesis
- A dynamic image resizing and watermarking service
- Specialized API frontends for our web and mobile applications
We run our microservices in Amazon EC2 and in our datacenter environment. They are mostly written in Go, though there are a couple outliers which use NodeJS or Sinatra. However, regardless of the language in use, we try to make all of our microservices good 12-factor apps, which helps to reduce the complexity of deployments and configuration management. All of our services are behind an HAProxy or an ELB.
The microservices pattern is great because it allows us to abstract away complex behaviour and domain-specific knowledge behind APIs and move it out of the monolith. Front-end product teams consuming these services only have to know about the service's API, and service maintainers are free to change anything about their service (as long as they maintain that API!). For example, you can query the search service without having to know a thing about Elasticsearch. This flexibility has proven to be extremely powerful for us as we evolve our platform because it lets us try out new technologies and new techniques in a safe and isolated way. If you are curious about implementing microservices yourself, former 500pxer and overall rad dude Paul Osman gave a great talk at QConSF last year about how and why we did it. My face is on one of the slides, but that is only one of the many reasons why this talk is awesome.
Probably the most interesting set of microservices we run at 500px are the ones having to do with serving images and image processing. Every month we ingest millions of high resolution photos from our community and we serve hundreds of terabytes of image traffic from our primary CDN, Edgecast. Last month we did about 569TB of total data transfer, with 95th percentile bandwidth of about 2308Mbps. People really like looking at cool pictures of stuff!
To ingest and serve all these images we run a set of three of microservices in EC2, all built around S3, which is where we store all of our images. All three of these services are written in Go. We really like using Go for these cases because it allows us to write small, fast, and concurrent services, which means we can host them on fewer machines and keep our hosting costs under control.
The first microservice users encounter when they upload a photo is one we call the Media Service. The Media Service is fairly simple: it accepts the user's upload, does some housekeeping stuff, persists to S3, and then finally enqueues a task into RabbitMQ for further processing.
Next, consuming those tasks off of RabbitMQ is another service called (creatively) the Converter Service. The Converter Service downloads the original image from S3, does a bunch of image processing to generate various-sized thumbnails, and then saves these static conversions back to S3. We then use these conversions in lots of places around our site and for our mobile apps.
Probably so far this isn't very surprising for a photo-sharing website, and, for awhile, these two services did everything we needed -- we simply set the S3 bucket containing the resulting thumbnails as the origin for our CDN. However, as the site continued to grow, we found this solution was pretty costly and space inefficient, as well as not very flexible when new products required new sizes.
To solve this problem, we recently built what we creatively call our Resizer Service (yes, we tend to choose descriptive names for these things). This new service now acts as the CDN origin and dynamically generates any size or format of image we need using the S3 original. It can also watermark the image with a logo and apply photographer attribution, which is reassuring to our community.
The Resizer Service is fairly high throughput, with the cluster handling about 1000 requests per second during peak times. Doing all this resizing and watermarking is pretty compute-intensive, so it's a bit of a challenge to keep response times reasonable when the load is high. We've worked really hard on this problem, and at peak traffic we're able to maintain a 95th percentile response time that is below 180ms. We do this through the use of a really cool, really fast image processing library called VIPS, aggressive caching, and by optimizing like crazy. Outside of peak hours, we can usually get below 150ms.
And we're not done with this problem yet! There are almost certainly more optimizations to be found, and we hope to keep pushing those response times down further and further in the future.
We use Github and practice continuous integration for all of our primary codebases.
For the Rails monolith, we use Semaphore and Code Climate. We use a standard rspec setup for unit testing, and a smaller set of Capybara/Selenium tests for integration testing. Fellow 500pxer and professional cool guy Devon Noel de Tilly has written at length about how we use those tools, so I won't try to out do him -- just go check it out.
For our Go microservices, we use Travis CI to run tests and to create Debian packages as build artifacts. Travis uploads these packages to S3, and then another system pulls them down, signs them, and imports them into our private Apt repository. We use FPM to create packages, and Aptly to manage our repos. Lately, though, I've trying out packagecloud.io and I really like it so far, so we may be changing how we do this in the near future.
For deployments, we use a combination of tools. At the lowest level we use Ansible and Capistrano for deploys and Chef for configuration management. At a higher level, we've really embraced chatops at 500px, so we've scripted the use of those tools into our beloved and loyal Hubot friend, BMO.
Anyone at 500px can easily deploy the site or a microservice with a simple chat message like
bmo deploy <this thing>. BMO goes out, deploys the thing, and then posts a log back into the chat. It's a simple, easy mechanism that has done wonders to increase visibility and reduce complexity around deploys. We use Slack, which is where you interact with BMO, and it makes everything really nicely searchable. If you want to find a log or if you forget how to do something, all you have to do is search the chat. Magical.
Other Important Apps
We monitor everything with New Relic, Datadog, ELK (Elasticsearch, Logstash and Kibana), and good old Nagios. We send all our emails with Mandrill and Mailchimp and we process payments with Stripe and Paypal. To help us make decisions, we use Amazon's Elastic MapReduce and Redshift, as well as Periscope.io. We use Slack, Asana, Everhour, and Google Apps to keep everyone in sync. And when things go wrong, we've got Pagerduty and Statuspage.io to help us out and to communicate with our users.
The Future, Conan?
Right now I'm working on experimenting with running our microservice constellation in Docker containers for local dev (
docker-compose up), with an eye to run them in production in the future. We've got a CI pipeline working with Travis and Docker Hub, and I'm really excited by the potential of cloud container services like Joyent Triton and Amazon ECS. As we build more and more microservices and expand the stack, we're also looking at service discovery tools like Consul and task frameworks like Mesos to make our system scale harder better and faster.
More Faster Is More Better
We're expanding quickly and hiring all kinds of positions. We're looking for DevOps types, Backend and Frontend developers, Mobile developers (both Android and IOS), UX designers, and sales people. We build cool stuff, we're passionate, and we're flying by the seat of our pants at breakneck speed, building the best thing we know how to make. If you like doing awesome cool stuff and you aren't afraid to get your hands dirty, come join us.