By Martin Cozzi, Infrastructure Engineer at Cotap.
Background
Cotap is a secure messaging app for mobile workers who need a fast way to communicate via text, voice and video. We currently have approximately 150,000 users and process about 2 million messages each month.
I am a Software Engineer who nowadays spends more time architecting systems than actually writing code. After being part of the Engineering team at Formspring, a Q&A site, and helping to scale the site to billions of posts, I was on the hunt for a startup facing modern engineering challenges, which brought me to Cotap.
Our team of 15 engineers hails from both large corporations and startups, and ranges from graduates with master's degrees to self-taught engineers. It's a great balance! We mix and match teams based on the projects we are working on.
Architecture & Infrastructure
In order to quickly get a proof of concept into the hands of users, Cotap started as a monolithic Rails app deployed with the database and everything it needed to run on a single AWS host.
Three months before the launch, we went to the drawing board and decided that we would automate all configurations and code deployments using what we were all familiar with - code. CloudFormation became our tool of choice for AWS hardware-related configuration and Chef for software configuration.
We rely a lot on Auto Scaling Groups. So much so, in fact, that all hosts except our database hosts are behind Auto Scaling Groups. This gives us the flexibility of rotating an entire ElasticSearch cluster with a single-line change in our configuration. It also spares us being paged when individual nodes go down, since they get immediately replaced.
We run every service behind ELBs, and almost all of them have a Public and a Private ELB allowing us to query our APIs internally and externally based on the situation. This saves us precious milliseconds when querying internally.
PostgreSQL was our database of choice from the get-go, and we deployed and configured everything in-house on EC2 with the help of Chef rather than relying on services such as RDS or ElastiCache.
With security and compliance on our minds, we knew that EC2 would give us more flexibility than RDS in how we handle our encrypted backups and failovers. We also found out later that AWS does not include RDS in its Business Associate Agreement, which means you cannot be HIPAA compliant and run on RDS.
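To give a rough idea of what configuring a datastore in-house with Chef looks like, here is a minimal sketch of a recipe that installs PostgreSQL, drops in our own configuration, and keeps the service running. The package version, paths, and template name are illustrative rather than copied from our actual cookbook:

```ruby
# Illustrative Chef recipe: install PostgreSQL, manage our own config file,
# and keep the service enabled and running. Names and versions are hypothetical.
package 'postgresql-9.3'

template '/etc/postgresql/9.3/main/postgresql.conf' do
  source 'postgresql.conf.erb'
  owner  'postgres'
  group  'postgres'
  mode   '0644'
  notifies :restart, 'service[postgresql]'
end

service 'postgresql' do
  supports status: true, restart: true, reload: true
  action [:enable, :start]
end
```

Owning this layer ourselves is what gives us full control over how backups are encrypted and how failovers happen.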
All of our clients, whether iOS, Android or Web, consume our private JSON APIs. Our web clients live in codebases separate from the backend, which allows teams to work in parallel without stepping on each other's toes.
In a nutshell:
- Our main API is backed by Rails (messages, authentication and authorization).
- We have a service written in Go that dynamically resizes images, encodes video, and uses groupcache to cache the content.
- User information and messages are stored in PostgreSQL.
- Relationships between users are stored in ElasticSearch.
- Everything runs behind Auto Scaling Groups.
- Configuration is done using CloudFormation and Chef.
We run about 60 instances in production. Because our hardware, configurations, and deployments are managed by code, it's easy for us to maintain a parallel sandbox environment: a replica of our production environment that runs on less expensive hardware, with smaller clusters, in a different AWS region. The sandbox runs about 50 instances, so at any given time we are running over 100 instances across two regions.
Development & Deployment Workflow
Everyone on the backend team has access to all of our code repositories and is able to run the services locally, which makes debugging issues and developing new features easy. Once we're ready to test with our client teams, we'll spin up a "stack" in our Sandbox environment for them to work against. A stack is a new set of instances that runs a user-specified branch of code and has a CNAME we can give our client teams to point to, instead of the default that points at our master branch. When the feature is ready, we'll open a pull request in GitHub for code review.
We deploy to production multiple times per day. As soon as a feature or bug-fix has been approved, we merge the branch into master and tag it, which automatically triggers Travis-CI to run through our specs and create a tarball of the code at that revision. Once the specs pass and the tarball has been pushed to S3, we can trigger a converge through Chef, which will do a zero-downtime rolling deploy to all of our instances.
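The converge itself is ordinary Chef. As a rough sketch of the idea (the bucket, paths, attribute, and service names below are placeholders, not our actual recipe), each instance pulls the tagged tarball from S3, unpacks it into a release directory, and restarts the app server only when the release changes:

```ruby
# Hypothetical deploy recipe: fetch the tagged tarball that CI pushed to S3,
# unpack it into a release directory, and restart the app server on change.
# Bucket, paths, attribute, and service names are all placeholders.
release = node['app']['release_tag']
tarball = "/var/cache/deploys/app-#{release}.tar.gz"

directory '/var/cache/deploys'

directory "/srv/app/releases/#{release}" do
  recursive true
end

remote_file tarball do
  source "https://s3.amazonaws.com/example-deploy-bucket/app-#{release}.tar.gz"
  notifies :run, 'execute[unpack-release]', :immediately
end

execute 'unpack-release' do
  command "tar -xzf #{tarball} -C /srv/app/releases/#{release}"
  action :nothing
  notifies :restart, 'service[app-server]'
end

service 'app-server'
```

Because the recipe is idempotent, re-converging an instance that already runs the current release is a no-op, which is what makes the rolling deploy safe to repeat.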
To keep it simple, we follow 4 principles that allow us to move fast and stay out of trouble:
- No instance should be launched manually. Ever.
- All changes to the infrastructure should be under version control.
- All changes should be deployed to a sandbox environment before being deployed to production.
- Production is nothing more than a larger and more powerful version of our sandbox environment.
To make sure those principles are maintained, everyone meets in a room every other week to run a disaster recovery drill on one of our services or datastores. Breaking things is always fun, and that way we all learn from each other.
All infrastructure-related changes are first developed against Vagrant and VirtualBox, using a bare Ubuntu 12.04 image. We use kitchen-ci to run integration tests and chefspec for unit tests. We experimented with Docker as a replacement for slow VMs, but quickly realized that the time to install all the missing components on the Docker image was greater than the VM bootstrap time. Maintaining a Docker image for the sake of development would also put us further away from our goal of parity between dev, sandbox, and production, so we decided now was not the time.
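As an example, a chefspec unit test for a recipe like the PostgreSQL sketch above runs entirely in memory, so we get feedback in seconds before kitchen-ci ever boots a VM. Cookbook and resource names here are again illustrative:

```ruby
require 'chefspec'

describe 'postgresql-cookbook::default' do
  let(:chef_run) do
    # Converge the recipe in memory against the same platform we target in dev.
    ChefSpec::SoloRunner.new(platform: 'ubuntu', version: '12.04')
                        .converge(described_recipe)
  end

  it 'installs the PostgreSQL server package' do
    expect(chef_run).to install_package('postgresql-9.3')
  end

  it 'enables and starts the PostgreSQL service' do
    expect(chef_run).to enable_service('postgresql')
    expect(chef_run).to start_service('postgresql')
  end
end
```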
Becoming HIPAA-compliant
Having worked at a bank before joining Cotap, I knew that many security measures are about people and process as much as code. Working at a startup gives me more flexibility, but achieving the same rigorous level of security without drastically slowing down our team's productivity by adding process is a big challenge. Changing code is easy; changing people's behavior is extremely hard.
Ultimately, going through any sort of security compliance forces you to take a step back and look at the processes in place from a different perspective. In our day-to-day work, we're always thinking about security, disaster recovery, scaling, permissions, etc. When going through compliance you start thinking differently about what your biggest assets are and how well protected they are. Security goes beyond just software: payroll, contracts with third parties, insurance, and so on. How well protected do you think your company is? Does it have insurance to cover your clients in case of a data leak? How easy would it be for one of your own employees to destroy your most important assets?
Security is also about finding that one account that has access to everything, and working to reduce the damage a compromise of it could cause. Regardless of whether the account belongs to a co-founder or an executive, if it were compromised, attackers could potentially delete all your servers or empty the company's bank account.
You have to be methodical when evaluating your assets, and it is a rather tedious process. We started by doing a risk assessment of each asset, and applying potential threats to them. There is a wide variety of threats, such as natural disaster, hardware failure, malicious code or social engineering.
Once you are done with this lengthy process, you calculate your risks and sort them from highest to lowest. This process immediately highlights risks you may not have even thought about in the past, and so you start working on them.
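As a toy illustration of the scoring step (the assets, threats, and numbers below are made up, and the real assessment lives in a much longer spreadsheet), each asset/threat pair gets a likelihood and an impact, and the product gives you a risk score you can sort on:

```ruby
# Toy risk scoring: risk = likelihood x impact, sorted from highest to lowest.
# All entries are illustrative, not our actual assessment.
assets = [
  { asset: 'Production database',       threat: 'Compromised credentials',   likelihood: 2, impact: 5 },
  { asset: 'Internal service traffic',  threat: 'Eavesdropping on the wire',  likelihood: 1, impact: 5 },
  { asset: 'Office laptops',            threat: 'Theft or loss',              likelihood: 3, impact: 3 },
]

assets
  .map     { |a| a.merge(risk: a[:likelihood] * a[:impact]) }
  .sort_by { |a| -a[:risk] }
  .each    { |a| puts format('%-26s %-26s risk=%d', a[:asset], a[:threat], a[:risk]) }
```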
For example, right after we started the risk assessment, we realized that even though we trust AWS, communication between internal services wasn't happening over SSL, meaning an AWS employee could potentially listen to our traffic. The risk of something like that happening is obviously low, but the impact would be extremely high, so we decided to upgrade the standards of our internal services to the same ones as our public services.
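In practice this just means an internal call now looks no different from a public one: it goes over TLS and the certificate is still verified. A minimal Ruby sketch, with a hypothetical internal hostname:

```ruby
require 'net/http'
require 'openssl'
require 'uri'

# Hypothetical internal endpoint behind a private ELB; internal calls use TLS too.
uri  = URI('https://api-internal.example.com/v1/health')
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl     = true
http.verify_mode = OpenSSL::SSL::VERIFY_PEER # don't skip verification just because the call stays inside AWS
response = http.request(Net::HTTP::Get.new(uri.request_uri))
puts response.code
```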
From day one, we were running Cotap inside of a VPC, following best practices and isolating public-facing instances in their own subnets behind restrictive ACLs. However, one thing we did not see coming was a requirement imposed by AWS: in order to sign a BAA with them, they require that all hardware handling PHI be dedicated. Unfortunately, CloudFormation only supports the choice of tenancy at the VPC level instead of at the instance level. The timing of our announcement had already been planned and press releases were queued when we realized, a week before launch day, that we had to switch all our hardware to a dedicated VPC. We spent about two days planning the transition, migrated sandbox in one day, and migrated the entire production environment to a new dedicated VPC in under six hours without downtime. Seeing workflow decisions made in the early days of Cotap pay off two years later was a truly rewarding moment for us.
I would say that meeting HIPAA compliance requirements made our workflow even tighter and brought us closer without slowing us down at all. We pay more attention to details, review each other’s code more often and are generally more aware that our industry is constantly at risk of getting hacked.
To sum up, a few words of advice for readers looking to achieve HIPAA compliance on AWS:
- Don’t run your databases on RDS.
- Encrypt all communication with TLS.
- Encrypt your EBS volumes with AES-256.
- Run on a dedicated VPC.
Product Changes
Originally, when a Cotap user sent a message or picture to a non-user, we would fall back to sending them an email with the content of the message. Unfortunately, if the message contains PHI (Protected Health Information), HIPAA compliance requires that the data be sent to the end user over encrypted channels. If the recipient has TLS configured on their email server this wouldn't be a problem, but email is an old protocol, and it's impossible to guarantee a TLS connection between email servers. This forced us to change the way we send emails: we no longer include the contents of messages.
The same restrictions apply to push notifications for customers that require HIPAA compliance. We now hide the contents of those push notifications because they have to go through Apple's or Google's servers, which are not HIPAA compliant, before being delivered to the device.
Other than that, we did everything we could to keep the product experience as close as possible to what it was like before HIPAA compliance. We believe that security is possible without the added cost of a bad user experience.
What's Next
We are currently rethinking how clients connect to our services by moving away from the traditional HTTP transport layer to MQTT. This should considerably lower the latency of sending messages by keeping persistent connections to the backend. As a bonus, it will also lower the clients' battery usage. There are definitely unique challenges in scaling MQTT to hundreds of thousands of concurrent connections. Not that many people have done it, especially with as few engineers as we have!
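As a rough sketch of what this looks like from a client's perspective (using the ruby-mqtt gem, a hypothetical broker, and made-up topic names rather than our actual protocol design), a single long-lived connection replaces one HTTP request per message:

```ruby
require 'mqtt'

# Hypothetical broker and topic names; the point is one persistent connection
# instead of a new HTTP request for every message sent or received.
MQTT::Client.connect('mqtts://broker.example.com') do |client|
  client.subscribe('users/42/inbox')

  client.publish('users/87/inbox', 'Shift change at 6pm')

  # Block and stream incoming messages over the same connection.
  client.get do |topic, message|
    puts "#{topic}: #{message}"
  end
end
```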