Redux: Scaling LaunchDarkly From 4 to 200 Billion Feature Flags Daily

LaunchDarkly
Serving over 200 billion feature flags daily to help software teams build better software, faster. LaunchDarkly helps eliminate risk for developers and operations teams from the software development cycle.

Written By John Kodumal, CTO and Co-Founder, LaunchDarkly


Background

LaunchDarkly is a feature management platform—we make it easy for software teams to adopt feature flags, helping them eliminate risk in their software development cycles. When we first wrote about our stack, we served about 4 billion feature flags a day. Last month, we averaged over 200 billion flags daily. To me, that's a mind-boggling number, and a testament to the degree to which we're able to change the way teams do software development. Some additional metrics:

  • Our global P99 flag update latency (the time it takes for a feature flag change on our dashboard to be reflected in your application) is under 500ms
  • Our primary Elasticsearch cluster indexes 175M+ docs / day
  • At daily peak, 1.5 million+ mobile devices and browsers and 500k+ servers are connected to our streaming APIs
  • Our event ingestion pipeline processes 40 billion events per day

We've scaled all our services through a process of gradual evolution, with an occasional bit of punctuated equilibrium. We've never rewritten a service from scratch, nor have we ever had to completely re-architect any of our services (we did migrate one service from a SaaS provider to a homegrown one; more on that later). In fact, from a high level, our stack is very similar to what we described in our earlier post:

  • A Go monolith that serves our REST API and UI (JS / React)
  • A Go microservice that powers our streaming API
  • An event ingestion / transformation pipeline implemented as a set of Go microservices

We use AWS as our cloud provider, and Fastly as our CDN.

Let's talk about some of the changes we've made to scale these systems.

Buy first, build if necessary

Over the past year, we've shifted our philosophy on managed services and have moved several critical parts of our infrastructure away from self-managed options. The most prominent was our shift away from HAProxy to AWS's managed application load balancers (ALBs). As we scaled, managing our HAProxy fleet became a larger and larger burden. We spent a significant amount of time tuning our configuration files and benchmarking different EC2 instance types to maximize throughput. Emerging needs like DDoS protection and auto scaling turned into large projects that we needed to schedule urgently. Instead of continuing this investment, we chose to shift to managed ALB instances. This was a large project, but it quickly paid for itself as we've nearly eliminated the time spent managing load balancers. We also gained DDoS protection and auto scaling "for free".

As we've evolved or added additional infrastructure to our stack, we've biased towards managed services:

  • Most new backing stores are Amazon RDS instances now. We do use self-managed PostgreSQL with TimescaleDB for time-series data—this is made HA with the use of Patroni and Consul.
  • We also use managed ElastiCache instances instead of spinning up EC2 instances to run Redis workloads.
  • In our previous StackShare article, I wrote about a project to incorporate Kafka into our event ingestion pipeline. In keeping with our shift towards managed services, we shifted to Amazon's Kinesis instead of Kafka.

Managed services do have some drawbacks:

  • They're almost never cheaper (in raw dollars) than self-managed alternatives. Pricing is often more opaque, more variable, and harder to predict.
  • They offer much less visibility into the operation, errors, and availability of the service.
  • They introduce vendor lock-in.

Still, it's a false economy to compare only the raw cost of a managed service with that of a self-managed one. Factor in your team's time and the math is usually pretty clear.

There is one notable case where we've moved from a managed SaaS solution to a homegrown one. LaunchDarkly relies on a novel streaming architecture to push feature flag changes out in near real-time. Our SDKs create persistent outbound HTTPS connections to the LaunchDarkly streaming APIs. When you change a feature flag on your dashboard, that change is pushed out using the server-sent events (SSE) protocol. When we initially built our streaming service, we relied heavily on a third-party service, Fanout, to manage persistent connections. Fanout worked well for us, but over time we found that we could introduce domain-specific performance and cost optimizations if we built a custom service for our use case. We created a Go microservice that manages persistent connections and is heavily optimized for the unique workloads associated with feature flag delivery. We use NATS as a message broker to connect our REST API to a fleet of EC2 instances running this microservice. Each of these instances can manage over 50,000 concurrent SSE connections.
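
To make the SSE mechanics concrete, here is a minimal sketch of a Go handler that holds a connection open and writes one SSE frame per update. It is not LaunchDarkly's service: the event name "flag-update", the JSON payload, the per-connection channel, and the helper renderStream are all hypothetical, and a production service would fan updates out from the message broker to many connections rather than read from a single channel.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// streamHandler returns an SSE handler that forwards payloads from a
// channel to one connected client. Payloads are assumed to be single-line
// JSON; multi-line data would need one "data:" line per line of payload.
func streamHandler(updates <-chan string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "text/event-stream")
		w.Header().Set("Cache-Control", "no-cache")
		for {
			select {
			case <-r.Context().Done(): // client disconnected
				return
			case payload, open := <-updates:
				if !open {
					return
				}
				// "flag-update" is a hypothetical event name.
				fmt.Fprintf(w, "event: flag-update\ndata: %s\n\n", payload)
				flusher.Flush() // push the frame out immediately
			}
		}
	}
}

// renderStream runs the handler against an in-memory recorder and returns
// the bytes a client would have received for the given payloads.
func renderStream(payloads ...string) string {
	updates := make(chan string, len(payloads))
	for _, p := range payloads {
		updates <- p
	}
	close(updates)
	rec := httptest.NewRecorder()
	streamHandler(updates)(rec, httptest.NewRequest(http.MethodGet, "/stream", nil))
	return rec.Body.String()
}

func main() {
	fmt.Print(renderStream(`{"key":"new-checkout","on":true}`))
}
```

The explicit Flush after every frame is what keeps update latency low: without it, the frame could sit in a buffer until more data arrives.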

At scale, everything is a tight loop

Some of our analytics services receive tens of thousands of requests per second. One of the biggest things we've learned over the past year is that at this scale, there's almost no such thing as premature optimization. Because of the sheer volume of requests, every handler you write is effectively running in a tight loop. We found that to keep meeting our service level objectives and cost goals at scale, we had to do two things repeatedly:

  1. Profile aggressively to identify and address CPU and memory bottlenecks
  2. Apply a set of micro-patterns to handle specific workloads

Profiling must be done periodically, as new bottlenecks will constantly emerge as traffic scales and old bottlenecks are eliminated. As an example, at one point, we found that the "front-door" microservice for our analytics pipeline was CPU-bound parsing JSON. We switched from Go's built-in encoding/json package to easyjson, which uses compile-time specialization to eliminate slow runtime reflection in JSON parsing.
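
As an illustration of the profiling step, the sketch below captures a CPU profile around a JSON-decoding loop using only the standard library. The event struct and its fields are hypothetical (the article does not give the real schema), and the decoder here is encoding/json, the package the text describes moving away from, so the reflection cost would show up in the profile.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"runtime/pprof"
)

// event is a hypothetical analytics payload used only for this sketch.
type event struct {
	Kind string `json:"kind"`
	Key  string `json:"key"`
	User string `json:"user"`
}

// profileParse decodes the same payload n times under the CPU profiler and
// returns the size of the captured profile in bytes.
func profileParse(n int) (int, error) {
	payload := []byte(`{"kind":"feature","key":"new-checkout","user":"u-1"}`)
	var profile bytes.Buffer
	if err := pprof.StartCPUProfile(&profile); err != nil {
		return 0, err
	}
	for i := 0; i < n; i++ {
		var e event
		if err := json.Unmarshal(payload, &e); err != nil {
			pprof.StopCPUProfile()
			return 0, err
		}
	}
	pprof.StopCPUProfile() // blocks until the profile is fully written
	return profile.Len(), nil
}

func main() {
	size, err := profileParse(200000)
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of CPU profile\n", size)
}
```

Writing the buffer to a file instead would let you open the profile with go tool pprof. A long-running service would more likely expose profiles over HTTP via net/http/pprof than wrap individual loops like this.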

We also identified a set of "micro-patterns" that we have extracted as self-contained libraries so they can be applied in appropriate contexts. Some examples:

  • Read coalescing: In a read-heavy workload, concurrent requests for the same expensive-to-fetch data can wait on the first in-flight read and share its result, a kind of memoization. This pattern is encapsulated in the singleflight package (golang.org/x/sync/singleflight).
  • Write coalescing: The dual of read coalescing. In a write-heavy workload where the last write wins, pending writes can be queued and then discarded in favor of the latest write attempt.
  • Multi-layer caching: In scenarios where an in-process, in-memory cache is necessary for performance, horizontal scaling can reduce cache hit rates, because requests are spread across more instances, each with its own cache. We make our fleet more resilient to this effect by employing multiple layers of caching. For example, we back an in-memory cache with a shared Redis cache before finally falling back to a slower persistent disk-backed store.
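
Read coalescing can be sketched with the standard library alone. The code below is a simplified, hand-rolled stand-in for singleflight, not the package itself: the Coalescer type, its Waiting method, the key "flags:env-123", and coalesceDemo are all hypothetical names. Unlike the real package, it does not handle a panicking fetch.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call tracks one in-flight fetch and the result shared by all waiters.
type call struct {
	done    chan struct{} // closed when the leader's fetch returns
	waiters int           // followers queued behind the leader
	val     string
	err     error
}

// Coalescer deduplicates concurrent fetches for the same key.
type Coalescer struct {
	mu       sync.Mutex
	inflight map[string]*call
}

func NewCoalescer() *Coalescer {
	return &Coalescer{inflight: make(map[string]*call)}
}

// Do runs fetch at most once at a time per key; callers that arrive while
// a fetch is in flight wait for it and share its result.
func (c *Coalescer) Do(key string, fetch func() (string, error)) (string, error) {
	c.mu.Lock()
	if cl, ok := c.inflight[key]; ok {
		cl.waiters++
		c.mu.Unlock()
		<-cl.done
		return cl.val, cl.err
	}
	cl := &call{done: make(chan struct{})}
	c.inflight[key] = cl
	c.mu.Unlock()

	cl.val, cl.err = fetch()

	c.mu.Lock()
	delete(c.inflight, key)
	c.mu.Unlock()
	close(cl.done)
	return cl.val, cl.err
}

// Waiting reports how many followers are queued behind the fetch for key.
func (c *Coalescer) Waiting(key string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cl, ok := c.inflight[key]; ok {
		return cl.waiters
	}
	return 0
}

// coalesceDemo starts n concurrent readers for one key and returns how
// many times the backing fetch actually ran.
func coalesceDemo(n int) int64 {
	c := NewCoalescer()
	var fetches int64
	release := make(chan struct{})
	fetch := func() (string, error) {
		atomic.AddInt64(&fetches, 1)
		<-release // hold the fetch open until every follower has queued
		return "flag-config-v1", nil
	}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Do("flags:env-123", fetch)
		}()
	}
	for c.Waiting("flags:env-123") < n-1 {
		time.Sleep(time.Millisecond)
	}
	close(release)
	wg.Wait()
	return atomic.LoadInt64(&fetches)
}

func main() {
	fmt.Println("fetches for 100 concurrent readers:", coalesceDemo(100))
}
```

In the demo, the fetch is held open until all other readers have queued, so 100 concurrent readers result in exactly one backing fetch.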

These simple patterns improved performance at scale and also helped us deal with bad traffic patterns like reconnection storms.
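
The multi-layer caching pattern can be sketched in the same spirit. Everything here is illustrative: the Layer interface, mapLayer, LayeredCache, and layeredDemo are hypothetical names, and an in-memory map stands in for the shared Redis tier. The sketch leaves out expiry and invalidation, which a real cache would need.

```go
package main

import (
	"fmt"
	"sync"
)

// Layer is a hypothetical cache tier, for example an in-process map or a
// shared Redis cache.
type Layer interface {
	Get(key string) (string, bool)
	Set(key, val string)
}

// mapLayer is an in-memory Layer; here it also stands in for the shared tier.
type mapLayer struct {
	mu sync.RWMutex
	m  map[string]string
}

func newMapLayer() *mapLayer { return &mapLayer{m: make(map[string]string)} }

func (l *mapLayer) Get(key string) (string, bool) {
	l.mu.RLock()
	defer l.mu.RUnlock()
	v, ok := l.m[key]
	return v, ok
}

func (l *mapLayer) Set(key, val string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.m[key] = val
}

// LayeredCache checks tiers from fastest to slowest and falls back to the
// slow persistent store only when every tier misses.
type LayeredCache struct {
	layers []Layer
	load   func(key string) (string, error)
}

func (c *LayeredCache) Get(key string) (string, error) {
	for i, l := range c.layers {
		if v, ok := l.Get(key); ok {
			// Backfill the faster tiers that missed.
			for _, faster := range c.layers[:i] {
				faster.Set(key, v)
			}
			return v, nil
		}
	}
	v, err := c.load(key)
	if err != nil {
		return "", err
	}
	for _, l := range c.layers {
		l.Set(key, v)
	}
	return v, nil
}

// layeredDemo simulates two instances that share a second tier and returns
// how many times the persistent store was hit.
func layeredDemo() int {
	loads := 0
	load := func(key string) (string, error) {
		loads++
		return "config-for-" + key, nil
	}
	shared := newMapLayer()
	instanceA := &LayeredCache{layers: []Layer{newMapLayer(), shared}, load: load}
	instanceB := &LayeredCache{layers: []Layer{newMapLayer(), shared}, load: load}
	instanceA.Get("env-123") // miss in every tier: loads from the store
	instanceA.Get("env-123") // hit in A's in-memory tier
	instanceB.Get("env-123") // cold in-memory tier, hit in the shared tier
	return loads
}

func main() {
	fmt.Println("store loads:", layeredDemo())
}
```

The third lookup is the case that matters when scaling out: a freshly started instance has a cold in-memory tier, but the shared tier absorbs the miss, so the persistent store is loaded only once.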

Get good at managing change

Scaling up isn't just about improving your services and architecture. It requires equal investment in people, processes, and tools. On the process and tools front, one thing we really focused on is understanding change. Better visibility into changes being made to the service had a massively positive impact on service reliability. Here are a few things we did to improve visibility:

  • Internal changelog service: This service catalogues intentional changes being made to the system. This includes deploys, instance type changes, configuration changes, feature flag changes, and more. Anything that could potentially impact the service (either in a positive or negative way) is catalogued here. We couldn't find anything off the shelf here, so we built something ourselves.
  • COGS (cost of goods sold) log: Very similar to our changelog, but focused on price changes to our services. If we scale out a service, or change instance types, or make reserved instance reservations, we add an entry to this log. For us, this is just a Confluence page.
  • Observability / APM: We use a number of services to gain observability into what is happening to our service at runtime. We use a mix of Graphite / Grafana and Honeycomb.io to give us the observability we need. We're big fans of Honeycomb here.
  • Operational and release feature flags: We feature flag most changes using LaunchDarkly. Most new changes are protected by release flags (short-lived flags that are used to protect the initial rollout and rollback of a feature). We also create operational flags, which are long-lived flags that act as control switches for the application. Observability lets us understand change, and feature flags allow us to react to change to maintain availability or improve user experience.
  • Spinnaker / Armory: LaunchDarkly is almost five years old, and our methodology for deploying was state of the art... for 2014. We recently undertook a project to modernize the way we deploy our software, moving from Ansible-based deploy scripts that executed on our local machines to using Spinnaker (along with Terraform and Packer) as the basis of our deployment system. We've been using Armory's enterprise Spinnaker offering to make this project a reality.

Like the sound of this stack? Learn more about LaunchDarkly.
