Dan Robinson, CTO of Heap
When Dan Robinson joined Heap as the company's first engineer, it was unclear whether it was even possible to build the product to scale. And that's exactly why he joined. Most startups, he says, face a significant risk in finding product-market fit. He was absolutely confident that a need for this product existed. The real challenge would be a technical one.
Most analytics platforms require the user to choose the events they want tracked ahead of time. This requires significant developer time and the foresight to know which analytics events you'll care about later. Heap instead tracks everything up front and lets the user define events with a visual tool afterward.
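The "capture everything, define events later" idea can be illustrated with a minimal Python sketch. It is not Heap's implementation: the `RawEvent` fields, the sample log, and the `define_click` helper are all hypothetical names chosen for illustration. The point is only that an event definition is a predicate applied to data that has already been collected, so it works retroactively.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RawEvent:
    """One captured interaction (hypothetical schema)."""
    user_id: str
    kind: str       # e.g. "click" or "pageview"
    target: str     # e.g. a CSS selector or a URL path
    timestamp: int  # seconds since some epoch


# Everything is recorded up front, before anyone decides what matters.
RAW_LOG: List[RawEvent] = [
    RawEvent("u1", "pageview", "/pricing", 100),
    RawEvent("u1", "click", "button#signup", 130),
    RawEvent("u2", "click", "a.docs-link", 150),
    RawEvent("u2", "click", "button#signup", 190),
]

# An event definition is just a predicate over raw events.
EventDefinition = Callable[[RawEvent], bool]


def define_click(selector: str) -> EventDefinition:
    """Build a definition that matches clicks on the given selector."""
    return lambda e: e.kind == "click" and e.target == selector


def matching(definition: EventDefinition, log: List[RawEvent]) -> List[RawEvent]:
    """Apply a definition to data that was captured before it existed."""
    return [e for e in log if definition(e)]


# Defined after the fact, yet it covers the full history.
signup_clicked = define_click("button#signup")
signups = matching(signup_clicked, RAW_LOG)
```

Scanning the whole log for every definition is what makes this approach expensive at scale, which is the technical risk the rest of the story deals with.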
Heap was founded by Matin Movassate, a former Product Manager at Facebook, and Ravi Parikh. They entered the Winter '11 class of Y Combinator with a simple MVP of the product: a single Node.js server running on EC2 with PostgreSQL. All of the persistent data had to be mirrored in memory for the queries to be fast. Dan joined soon after, and his first project was to rebuild the infrastructure to handle more than ~200 GB of data.
CEO Matin Movassate cutting Dan's "welcome cake"
PostgreSQL was an easy early decision for the founding team. The relational data model fit the types of analyses they would be doing (filtering, grouping, joining, etc.), and it was the database they knew best. Shortly after adopting PG, they discovered Citus, a tool that makes it easy to distribute queries. Although it was a young project and a fork of Postgres at that point, Dan says the Citus team was very available and highly expert, and that it wouldn’t be very difficult to move back to PG if they needed to:
The stuff they forked was in query execution. You could treat the worker nodes like regular PG instances.
Citus also gave them a ton of flexibility to make queries fast, and again, they felt the data model was the best fit for their application.
In early 2014, Heap released an event visualizer tool that allowed non-technical people to use the product, which Dan believes was the key piece in achieving product-market fit. As a result, the company grew to the point where they had users who were processing millions of events per month. They started to hit the limits of the initial Citus infrastructure. As larger customers began signing up, the large datasets they brought with them became difficult to handle. Eventually, the analyses became too slow to be viable, and simply “throwing more machines” at the problem was cost-prohibitive.
The early version Heap was using didn’t have much distributed systems functionality, so they had rolled their own solutions for things like recovering from a failed node, splitting data into different sharding schemes, and moving data between machines. That homegrown functionality was starting to have issues at scale.
The major breakthrough came when they found a way to cheaply index the event definitions users were creating. These were the only points of data that users were querying, and each event definition represented far less than 1% of the overall data Heap was collecting. It became clear that they could achieve substantial performance gains if they could build an infrastructure around indexing these events.
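The article does not say which mechanism Heap used to index event definitions. One plausible way to index only the small slice of rows that a definition matches is a partial index, which PostgreSQL supports through `CREATE INDEX ... WHERE ...`. The sketch below uses SQLite (which accepts the same partial-index syntax) as a stand-in so that it runs without a database server; the table, columns, and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id TEXT, kind TEXT, target TEXT, ts INTEGER)"
)

# Hypothetical data: only 1 row in 200 is a click on the signup button.
total = 10_000
rows = [
    (f"u{i}", "click", "button#signup" if i % 200 == 0 else "other", i)
    for i in range(total)
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# A partial index holds entries only for rows matching the event definition,
# so its size tracks the defined event rather than the whole table.
conn.execute(
    """
    CREATE INDEX signup_clicks ON events (ts)
    WHERE kind = 'click' AND target = 'button#signup'
    """
)

# A query whose WHERE clause implies the index predicate is eligible to use
# the partial index instead of scanning every raw event.
signup_count = conn.execute(
    """
    SELECT COUNT(*) FROM events
    WHERE kind = 'click' AND target = 'button#signup' AND ts >= 0
    """
).fetchone()[0]

fraction = signup_count / total  # share of raw data the definition touches
```

In this toy dataset the defined event covers 0.5% of the rows, in line with the "far less than 1%" figure above; the rest of the table never has to be indexed for that query.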
Heap searched for an existing tool that would let them express the full range of analyses they needed, index the event definitions that made up those analyses, and run as a mature, natively distributed system. After coming up empty on this search, they decided to compromise on the “maturity” requirement and build their own distributed system around Citus and sharded PostgreSQL. It was at this point that they also introduced Kafka as a queueing layer between the Node.js application servers and Postgres.
The front end had also begun to grow unwieldy. The original jQuery pieces became difficult to maintain and scale, so the team introduced Backbone, Marionette, and TypeScript. Ultimately this ended up being a “detour” in the search for a scalable and maintainable front-end solution. The system did let developers reuse components efficiently, but adding features was difficult, and it eventually became a bottleneck in advancing the product.
Reducing Cost and Improving Performance
Because of the massive amounts of data that Heap is ingesting, it’s taken a great deal of work to get to a cost-viable product. One of the major projects in reducing cost involved switching to ZFS, which allows compression at the file system level. That switch alone allowed them to compress their data by a factor of 2. Dan says they’re currently experimenting with further improvements that could increase this compression to 3-3.5x. Additional gains have come from doing some “low-level CPU profiling” to determine where their resources were being used on the EC2 instances.
Today, they’re doubling query speed each quarter and constantly seeking even more improvements. As Dan points out, the size of the customer they can support is directly correlated to the performance of the application.
If we can make queries 3x faster, we can support a customer who is 3x larger.
DevOps and Organizational Structure
Engineering teams at Heap are made up of 3-5 developers, and about half of them are working on infrastructure. This is mostly related to business priorities, since again, the primary challenge behind the product is not “What new features should we add?” but “How do we scale to customers who are 100x larger?”
Dan with members of the engineering team
All of the code at Heap lives on GitHub, and they use CircleCI, which in turn kicks off Ansible scripts for deployment. They use Salt for managing machine configuration, and Terraform to manage all of their AWS configuration. Everything at Heap currently runs on AWS, so Terraform was an easy choice; the team loved the modularity and “great dev workflows” it provides.
After 5 years of building one of the fastest growing tools for analytics and ingesting billions of events, Dan Robinson has some valuable advice for budding CTOs:
There are decisions you make that are hard to reverse and decisions that are easy to reverse - like one-way doors and two-way doors. Most things are two-way doors and it's better to just go fast and learn something.
He advises that if you’re building a serious distributed system where the performance is critical to the success of your application, one-way doors include things like selecting your data model and data system:
If you want to change from PostgreSQL to MySQL, that's going to be a rewrite.
If he could go back in time, Dan probably would have started using Kafka on day one. He’s learned that it’s a very good fit for an analytics tool, since you can handle a huge number of incoming writes with relatively low latency. Kafka also gives you the ability to “replay” the data flow: “It’s like a commit log for your whole infrastructure.”
One of the biggest benefits in adopting Kafka has been the peace of mind that it brings. In an analytics infrastructure, it’s often possible to make data ingestion idempotent. In Heap’s case, that means that, if anything downstream from Kafka goes down, they won’t lose any data – it’s just going to take a bit longer to get to its destination. He’s also learned that you want the path between data hitting your servers and your initial persistence layer (in this case, Kafka) to be as short and simple as possible, since that is the surface area where a failure means you can lose customer data.
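Why idempotent ingestion plus a replayable log means no data is lost can be shown with a small simulation. This is a sketch under stated assumptions, not Heap's pipeline and not the Kafka client API: `Log` stands in for a single Kafka partition, `Store` for the downstream database, and the consumer "crashes" after writing a record but before committing its offset.

```python
class Log:
    """Append-only log standing in for one Kafka partition (hypothetical)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        """Return (offset, record) pairs starting at the given offset."""
        return list(enumerate(self._records))[offset:]


class Store:
    """Downstream store keyed by event ID, so repeated writes are harmless."""

    def __init__(self):
        self.rows = {}

    def upsert(self, event):
        self.rows[event["id"]] = event


def consume(log, store, start_offset, fail_at=None):
    """Apply records from start_offset; return the next offset to read.

    If fail_at is set, simulate a crash after writing that record but
    before committing its offset.
    """
    committed = start_offset
    for offset, event in log.read_from(start_offset):
        store.upsert(event)
        if fail_at is not None and offset == fail_at:
            return committed  # the write happened, the commit did not
        committed = offset + 1
    return committed


log = Log()
for i in range(5):
    log.append({"id": f"evt-{i}", "payload": i})

store = Store()
first = consume(log, store, 0, fail_at=2)  # crashes on the third record
resumed = consume(log, store, first)       # replays from the last commit
```

After the restart the third record is written a second time, but because the store is keyed by event ID the replay neither duplicates nor drops anything; recovery only costs time.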
Dan also says he’s been “continuously shocked” at how often YAGNI has been true:
I remember writing our exports feature in 2014. It was a simple feature that let you get a nightly dump of your Heap data on S3. The code was littered with TODOs that I was completely sure we were going to need to resolve within the next few weeks – minor extensions of the feature, configurability options, operability improvements, or known technical debt items. A lot of those TODOs didn't come up for years, and some of them are still there.
Instead of focusing on writing perfect software, he believes it’s much more important to get something in front of users. This is something you may have heard by now from a product perspective, but it’s equally important for code:
You’ll learn what areas of your technical debt actually matter, and that learning is a lot more important than getting decisions right if the decisions are easily reversible.
The Present and Future of Heap
Most recently, Heap has released a feature that also pulls in data from third-party providers like MailChimp, Stripe, Optimizely, and Shopify. The team there realized that a large percentage of their customers’ data didn’t actually live on their own platform, but instead was scattered around these various vendors.
Dan points out that the additional data volume hasn’t been a significant challenge - especially since events like payments are very high value from an analytics perspective. The real challenge has been learning the “language” of these providers. Each one has a different API, a different way to format event data, and different semantics for retrieving it.
Once the data is in Heap, they also had to figure out how to correlate that data with their own. How do you attribute an email sent in one system to a button click in another? The answer was building out their own UI for Heap users to tie the events together.
Beyond features, performance improvements continue to be a major focus today and in the future. Heap currently stores 1 petabyte of data, ingests 1 billion events per day, and performs over 250,000 analyses per week.