There are three ideals at the core of every technical decision we make at Opsee:
- Everything we do should be reproducible.
- Automate tasks when they should be automated.
- Write only the code that is necessary to get our job done.
When we began evaluating and planning for our infrastructure needs, we kept these three ideals in mind. During the evaluation process, we set out to streamline our workflow so that, after a change is reviewed in GitHub and merged to master, it reaches production as quickly as possible.
In an ideal world, our infrastructure would be as easy to use as our product. It should be easy to deploy, and once deployed, it should continue to work without constant maintenance from our engineers. We believe that building complex systems presents enough challenges without having to worry about infrastructure. That's why we built Opsee: monitoring presents many of the same complications as application infrastructure, and we want engineers to be in the business of building applications, not application infrastructure.
Reproducibility is paramount
The key to iterating quickly is reproducibility. There should be no surprises when your tests run in a continuous integration environment. To make this a reality, we standardized the build process for all of our projects very early, using Docker and a two-phase build process. In the first phase, our standard Go build container automates testing and compilation for each project.
In the second phase, the artifacts collected from the Go build container are placed into a lightweight runtime container for deployment to production. We use Makefiles to automate the build pipeline, so building any project is as simple as typing make at the command line immediately after checking out a repository from GitHub.
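As a rough illustration, a project Makefile in this setup might look like the following sketch. The image and container names here are placeholders, not our actual repositories, and the exact targets are an assumption about how such a pipeline is typically wired together.

```makefile
# Hypothetical Makefile -- image and builder names are placeholders.
IMAGE   := quay.io/example/myservice   # runtime image (assumption)
BUILDER := example/go-build            # shared Go build container (assumption)
SHA     := $(shell git rev-parse --short HEAD)

all: build container

# Phase 1: run tests and compile a static binary inside the standard
# Go build container, leaving artifacts in the working directory.
build:
	docker run --rm -v $(CURDIR):/build $(BUILDER)

# Phase 2: copy the artifacts into a lightweight runtime image.
container:
	docker build -t $(IMAGE):$(SHA) .

push: container
	docker push $(IMAGE):$(SHA)

.PHONY: all build container push
```

With a layout like this, `make` after a fresh checkout runs both phases, and the same targets drive the CI pipeline.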
Automate when you should
Whether or not to automate depends largely on the time spent on a particular task. If you perform a low- to medium-effort task infrequently, it may not be worth automating. There are costs associated with both performing a task manually and automating it. Generally, once the cumulative cost of the task approaches or exceeds the cost of automation, it is time to automate. The most commonly automated components of any application or platform are provisioning infrastructure, building and packaging the application, and deploying it to a runtime environment.
Provisioning infrastructure is almost always worth automating. We use Ansible to provision and manage all of our resources running in AWS. The Ansible libraries for AWS are reasonably easy to use, but care should be taken to make it easy to deploy to multiple regions and multiple VPCs (in case you need high availability or intend to deploy a staging environment). With Ansible, we deploy EC2 instances running CoreOS Linux. We believe that CoreOS is a best-of-breed container hypervisor, and its update service keeps our EC2 instances up to date with little effort on our part.
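A provisioning play along these lines might look like the sketch below. The `ec2` module is a real Ansible AWS module, but the variable names, AMI, subnet, and instance type are placeholders; parameterizing region and subnet is what makes the same play reusable across regions and VPCs.

```yaml
# Hypothetical playbook sketch -- values are placeholders.
- hosts: localhost
  connection: local
  vars:
    region: us-west-2          # swap per deployment target
    vpc_subnet: subnet-abc123  # e.g. production vs. staging VPC
  tasks:
    - name: Launch CoreOS compute instances
      ec2:
        region: "{{ region }}"
        vpc_subnet_id: "{{ vpc_subnet }}"
        image: ami-00000000    # CoreOS stable AMI for the target region
        instance_type: m4.large
        group: compute         # security group name
        count: 4
        instance_tags:
          role: compute
```

Running the same play with a different `region` and `vpc_subnet` stands up an equivalent cluster elsewhere, which is the property worth designing for early.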
The primary drivers of our build and deploy pipeline are CircleCI and, again, Ansible. CircleCI makes building with Docker painless, which is perfect for the two-phase build process described above. All pull requests are built on CircleCI, and merges to master push the resulting runtime containers to CoreOS's Quay.io container registry.
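A minimal CircleCI configuration for this flow might look like the following sketch (CircleCI 1.0 `circle.yml` syntax, which was current at the time). The registry path and Make targets are placeholders; `CIRCLE_SHA1` is CircleCI's built-in environment variable holding the commit SHA.

```yaml
# Hypothetical circle.yml -- image names and targets are placeholders.
machine:
  services:
    - docker
test:
  override:
    - make            # phase 1: test and build in the Go build container
    - make container  # phase 2: assemble the runtime image
deployment:
  quay:
    branch: master    # only merges to master push a runtime container
    commands:
      - docker login -u $QUAY_USER -p $QUAY_PASS quay.io
      - docker push quay.io/example/myservice:$CIRCLE_SHA1
```

Gating the push on `branch: master` keeps pull-request builds from publishing images while still exercising the full build.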
Write as little code as possible
Opsee is a young company and needs to maintain a very high velocity in engineering. We have only six engineers, and each of us is busy building a critical piece of the product. We don't want to run infrastructure services ourselves, and we don't want to write the often necessary supporting structures around a custom platform.
Because of this, we focused our efforts on finding a platform that would do as much of the heavy lifting as possible. Our product is built specifically for companies that operate in Amazon Web Services, and it relies on many of the components AWS provides (CloudFormation, Kinesis, SNS, SQS, Lambda). We therefore limited our choices in container platforms to those we could deploy in AWS or those AWS provided.
Finding a container platform
We investigated a number of container orchestration platforms:
- Docker Swarm
- Kubernetes
- Fleet
- Amazon EC2 Container Service (ECS)
Ultimately, the decision of which platform to use revolved around the effort required to reach production readiness. We measured this with a very simple metric: the number of components we would have to deploy, secure, and maintain ourselves. Part of Opsee's functionality relies on cross-account access through the AWS Identity and Access Management (IAM) service, so we take security very seriously. We're also a monitoring company, so high availability of all of our services is extremely important.
Docker Swarm has no innate role-based access control. At the time we were investigating platforms, Docker's Universal Control Plane (UCP, part of Docker Datacenter) did not exist. Likewise, the Attribute Based Access Control (ABAC) authorization mode in Kubernetes didn't exist. Had these features existed, we would have more seriously considered Docker Swarm via Docker Datacenter deployed in AWS or possibly considered an enterprise offering for Kubernetes from Kismatic.
Without an enterprise offering, both of these platforms require significant operational investment to secure, maintain, and make highly available. Each has multiple moving parts that require configuration for high availability, and we would have had to automate deploying them in a highly available manner ourselves. That would have required considerable engineering effort; we could have accepted the technical debt, but we were trying to avoid it if possible.
One of the tools included in CoreOS is Fleet, a container scheduling service that controls systemd at the cluster level. Early experiments with Fleet, however, showed that it wasn't appropriate for scheduling services in a production environment. Fleet's scheduling algorithm is overly simplistic: tools like Kubernetes and ECS require that you declare resources for your services before scheduling them, allowing them to bin-pack and overprovision by a predictable amount. During experimentation, we found that without this functionality several scenarios could cause cascading failures, so we abandoned work on Fleet.
AWS EC2 Container Service (ECS) has proven to be a robust and feature-rich platform for scheduling containers. It was easy to deploy to CoreOS, and it requires little operational overhead from our team. It natively supports docker-compose and is easy to deploy with CloudFormation. It integrates directly with AWS Elastic Load Balancing, making rolling deploys with ELB connection draining possible. ECS represents considerable infrastructure automation that we did not have to build ourselves. In becoming ECS users, we have also begun building our ECS integration as a product, which should launch in the coming weeks, allowing ourselves and our users to easily monitor services running in ECS with almost no configuration required.
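The docker-compose support works through the ECS CLI, which translates a compose file into an ECS task definition. A sketch of what that looks like, with hypothetical service names, ports, and resource values:

```yaml
# Hypothetical docker-compose.yml for the ECS CLI -- names are placeholders.
myservice:
  image: quay.io/example/myservice:a1b2c3d
  cpu_shares: 256         # ECS requires declared CPU and memory,
  mem_limit: 268435456    # here 256 MB, so it can bin-pack predictably
  ports:
    - "8080:8080"
```

With a file like this, `ecs-cli compose service up` registers the task definition and starts the service on the cluster, which is what makes the declared-resources model from the Fleet comparison above concrete.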
The state of infrastructure automation is rapidly changing, and we're currently looking for automation to replace our cluster management and service deployment code. For this, we are excited by Convox's work. Convox Rack is a fully automated container scheduling platform built on top of ECS. We're in the early stages of experimentation, but so far we have been very impressed. We firmly believe that platforms like Convox are the next generation of application infrastructure.
Where we are now
Today, we're still running the same infrastructure we stood up a year ago with ECS. We have four m4.medium instances running about 50 containers, a small cluster of three machines for our etcd quorum, and a couple of additional CoreOS instances that serve as endpoints for customer EC2 instances to connect back to Opsee. All of our stateless microservices rely on other AWS services for their data stores: Kinesis, DynamoDB, Amazon Elasticsearch, or, most commonly, RDS.
Our build workflow is fairly simple and almost entirely automated. After opening a pull request, CircleCI runs tests and builds a runtime container that can be used to test locally or deployed as a canary in production. Once the PR is merged, CircleCI builds and then pushes a new runtime container to a Docker registry. All containers are tagged with the git SHA of the revision that generated the container to make its source easy to identify.
Ansible is then used to deploy the containers to production. Ansible updates the ECS task definition and then the service definition; after that, ECS takes over. ECS launches containers with the new task definition and registers them with the ELB configured for that service. It then starts draining connections from the old containers in the ELB, and traffic slowly shifts to the new containers. Once the new containers have taken over all traffic, the old containers are deregistered from the ELB and shut down. The only missing piece is a cleanup step in which old container images are deleted from Docker (stopped containers are cleaned up after a three-hour grace period). To make up for this, we use a scheduled systemd unit that deletes all stopped containers and unused Docker images.
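The deploy step can be sketched with Ansible's ECS modules. The `ecs_taskdefinition` and `ecs_service` modules are real; the service name, cluster, resource values, and the `git_sha` variable are placeholders, and the registered return field is an assumption about the module's output shape.

```yaml
# Hypothetical deploy tasks -- names and values are placeholders.
- name: Register a new task definition for this revision
  ecs_taskdefinition:
    state: present
    family: myservice
    containers:
      - name: myservice
        image: "quay.io/example/myservice:{{ git_sha }}"
        cpu: 256
        memory: 256
        portMappings:
          - containerPort: 8080
            hostPort: 8080
  register: task

- name: Point the service at the new task definition
  ecs_service:
    state: present
    name: myservice
    cluster: production
    task_definition: "{{ task.taskdefinition.taskDefinitionArn }}"
    desired_count: 2
```

Once the service definition is updated, everything that follows (launching, ELB registration, connection draining) is handled by ECS itself.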
Learning to operate a Linux distribution like CoreOS has been something of a challenge. CoreOS has forced us to rethink how we manage our compute instances. It is not a traditional Linux distribution: it does not even include a package manager. Installing programs on CoreOS requires either downloading static binaries at boot time or running everything in containers. Log aggregation is a bit of a challenge, but ECS provides a mechanism for centralizing logs in CloudWatch with the awslogs Docker log driver. It's important to consider logging configuration when working with Docker, because otherwise long-running containers risk filling the disk at /var/lib/docker. After a couple of production incidents where runaway containers filled the disk on our compute nodes, we isolated /var/lib/docker on its own EBS volume on every compute instance.
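The scheduled cleanup unit mentioned above might look something like the following pair of systemd units. This is a sketch: the unit names, schedule, and exact docker invocations are assumptions (Docker of that era had no `docker system prune`, so stopped containers and dangling images are removed explicitly).

```ini
# docker-cleanup.service (hypothetical)
[Unit]
Description=Remove stopped containers and unused Docker images

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'docker rm $(docker ps -aq -f status=exited); \
                      docker rmi $(docker images -q -f dangling=true)'

# docker-cleanup.timer (hypothetical)
[Unit]
Description=Run docker-cleanup daily

[Timer]
OnCalendar=daily

[Install]
WantedBy=timers.target
```

Enabling the timer (`systemctl enable --now docker-cleanup.timer`) runs the cleanup on schedule without any agent running on the host.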
It's simple finishing touches like these that we would like to see from CoreOS, Docker, and ECS. All of these tools do an excellent job of getting us most of the way there, but fall slightly short of being production ready. Since ECS is the management layer for Docker, we believe it should take additional steps to ensure containers are cleaned up. We'd also like to see log aggregation become a first-class citizen in the CoreOS ecosystem. While we could add systemd units that forward logs from journalctl's standard output via ncat to a logging server, or configure rsyslog, we think a self-described cluster operating system should consider this a requirement.
Because Opsee is a monitoring company, monitoring our own services is critically important. We use our own product, which integrates with CloudWatch, ECS clusters, and ELBs, to respond to the changes in our infrastructure that happen with every deploy. We have ancillary monitoring in CloudWatch that alerts us when certain dependent services go down, and we forward metrics from our services and systems to Librato for visibility into performance during operational incidents. Opsee responds to changes in our infrastructure as soon as they happen: it inspects ECS services to see where tasks are running, then monitors the containers directly on those instances. We also have alerts that help us ensure our compute cluster isn't over-provisioned.
It's common for people, particularly when building the platform on which they will build their product, to give in to the inclination to build everything themselves. The most common reason we hear is that they want to avoid vendor lock-in. We have instead optimized for rapidly building and iterating on a product. We believe we should not build what we do not sell, and this has not led us astray so far.