Continuous Delivery and Effective Feature Flagging with LaunchDarkly
Guest post by John Kodumal, CTO and co-founder, LaunchDarkly
We started LaunchDarkly in 2014 to help software teams build better software, faster. My co-founder, Edith Harbaugh, and I met at Harvey Mudd College, and we’ve both worked in software since then. We’d both seen the shift from multi-year releases to quarterly to monthly to weekly and now, with continuous delivery, to multiple times a day.
We believe that continuous delivery allows software teams to provide more value to their own customers with less risk. Feature flagging (wrapping a feature in a flag that’s controlled outside of deployment) is a technique for effective continuous delivery. We saw the larger companies (Google, Facebook, Twitter) invest heavily in custom-built feature flagging infrastructure to roll features out to whom they want, when they want. Smaller companies were building and maintaining their own feature flagging infrastructure or doing without. That’s where we saw an opportunity to start up LaunchDarkly. We’re going to share how we started, issues we ran into, and how AWS helped us scale.
LaunchDarkly lets software teams focus on their core competencies and depend on our platform for effective feature flagging. With LaunchDarkly’s feature flagging as a service, any company, no matter its size, can separate deployment from rollout. Our customers deploy features “off” (dark), and then use LaunchDarkly’s dashboard to control who has access. Customers use LaunchDarkly to reduce risk and move more quickly. When a risky new infrastructure project is wrapped in a flag, it can easily be stopped or turned off independently of code deployment. Customers also use LaunchDarkly to run private betas, giving early access to some of their own users. A business user can easily add end users via the LaunchDarkly dashboard. Finally, customers use LaunchDarkly to get analytics on how many people are using a given feature and to run A/B tests on the effectiveness of features.
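To make the idea of deploying a feature dark more concrete, here is a minimal Python sketch of a flag that is checked at runtime. It is not the LaunchDarkly SDK: `Flag`, `is_enabled`, `checkout`, and the user keys are hypothetical names chosen for illustration, and a real flag store would live outside the process rather than in a local object.

```python
from dataclasses import dataclass, field


@dataclass
class Flag:
    """A flag whose state is controlled outside the deployed code (hypothetical shape)."""
    key: str
    on: bool = False                               # deployed "off" (dark) by default
    beta_users: set = field(default_factory=set)   # users given early access

    def is_enabled(self, user_key: str) -> bool:
        # Targeted users see the feature even while it is dark for everyone else.
        return self.on or user_key in self.beta_users


def checkout(flag: Flag, user_key: str) -> str:
    # The new code path ships with the deploy but only runs when the flag allows it.
    if flag.is_enabled(user_key):
        return "new checkout"
    return "old checkout"


new_checkout = Flag(key="new-checkout")
new_checkout.beta_users.add("alice")   # private beta: no redeploy needed
```

Here `alice` gets the new path while everyone else stays on the old one, and flipping `on` rolls the feature out (or back) without touching the deployed code.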
Our lead backend engineer, Patrick Kaeding, is responsible for scaling up LaunchDarkly. Patrick and I worked together at Atlassian, building the Atlassian Marketplace initially as a “ShipIt” project, and then continuing to scale it up. I’ll let Patrick describe how we’re preparing LaunchDarkly to support billions of daily requests.
Operational Challenges to Scaling Up
At LaunchDarkly, the biggest operational challenge in our application is recording and processing events. We have an SDK that customers include in their applications, which they can call into to evaluate feature flags. In the background, the SDK records an event indicating the flag value that was given to the user. The customer code can also signal other goal-related events to the SDK (for example, a purchase was completed). These events are batched up and flushed to the LaunchDarkly servers every couple of seconds. In addition, there is a client-side snippet that can record goal-related browser events (like clicks and page views), and a mobile SDK that can also send events. This means we are processing up to 10k event batches per second (and each batch contains many events).
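The batch-and-flush behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the actual SDK: `EventBatcher`, its parameters, and the event fields are hypothetical, and for brevity the flush is triggered when an event is recorded rather than by a background timer as a production SDK would likely do.

```python
import time
from typing import Callable, List


class EventBatcher:
    """Buffers events in memory and hands them off in batches (illustrative only)."""

    def __init__(self, send: Callable[[List[dict]], None],
                 flush_interval: float = 2.0,
                 clock: Callable[[], float] = time.monotonic):
        self._send = send                      # e.g. an HTTPS POST to the event endpoint
        self._flush_interval = flush_interval  # "every couple of seconds"
        self._clock = clock
        self._buffer: List[dict] = []
        self._last_flush = clock()

    def record(self, event: dict) -> None:
        # Recording is cheap: append, and flush only if the interval has elapsed.
        self._buffer.append(event)
        if self._clock() - self._last_flush >= self._flush_interval:
            self.flush()

    def flush(self) -> None:
        self._last_flush = self._clock()
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._send(batch)                      # one request carries many events
```

Injecting `clock` and `send` keeps the sketch easy to exercise without real time passing or a network call.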
Once we record the raw events, we need to correlate the feature flag events with the goal-related events to be able to drive the A/B test results, and indicate how likely someone is to fulfill a goal, depending on which feature variation they get. The feature event and the goal event might come in on different channels, with a feature event coming from one backend server, a click event from the front end, and the purchase complete event coming from another backend.
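As a rough illustration of that correlation step, the sketch below joins feature events to goal events by user and computes a conversion rate per variation. The event shape (`kind`, `user`, `variation`, `key`) and the function name are assumptions made for this example, and it deliberately ignores timing (for instance, whether the goal happened after the user saw the variation), which a real pipeline would need to handle.

```python
from collections import defaultdict
from typing import Dict, Iterable


def conversion_by_variation(events: Iterable[dict], goal_key: str) -> Dict[str, float]:
    """Join feature and goal events by user; return the conversion rate per variation."""
    variation_of = {}   # user -> variation the user was served
    converted = set()   # users who triggered the goal

    for event in events:   # events may arrive from any channel, in any order
        if event["kind"] == "feature":
            variation_of[event["user"]] = event["variation"]
        elif event["kind"] == "goal" and event["key"] == goal_key:
            converted.add(event["user"])

    exposed = defaultdict(int)
    goals = defaultdict(int)
    for user, variation in variation_of.items():
        exposed[variation] += 1
        if user in converted:
            goals[variation] += 1

    # Goal events from users with no feature event are ignored.
    return {v: goals[v] / exposed[v] for v in exposed}
```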
Overcoming Operational Challenges with AWS
To deal with the fact that the goal events and the related feature event might come in different batches, we delay this correlation process by a couple of minutes. We use Amazon EC2 to be able to easily scale out our front-end tier, which accepts these event posts over HTTPS. It then queues up asynchronous jobs to save the events, and then more jobs to follow up after a delay to correlate the events from the same user. The front end just queues up the jobs in Amazon SQS and forgets about them, which means that we can always quickly accept more jobs, regardless of how busy the backend is. We can monitor the queue depth using Amazon CloudWatch and even automatically scale our worker group to be sure it doesn’t get too far behind. But even if there is a brief lag, it won’t impact our customers. For persisting the events, we first used a single unsharded MongoDB replica set. As the heavy write load overburdened it, rather than deal with the operational complexity of a sharded cluster, we opted for the simplicity and scalability of Amazon DynamoDB.
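The enqueue-and-forget step described above can be sketched as follows. The `sqs` argument is meant to be a boto3 SQS client (`boto3.client("sqs")`), whose `send_message` call accepts `QueueUrl`, `MessageBody`, and `DelaySeconds`. Everything else is an assumption for illustration: the queue URLs are placeholders, the use of two separate queues is a guess, and the message format is invented.

```python
import json
from typing import Any

SAVE_QUEUE_URL = "https://sqs.example.invalid/save-events"            # hypothetical
CORRELATE_QUEUE_URL = "https://sqs.example.invalid/correlate-events"  # hypothetical
CORRELATE_DELAY_SECONDS = 120  # "a couple of minutes"; SQS allows at most 900


def accept_event_batch(sqs: Any, user_key: str, events: list) -> None:
    """Queue the work and return immediately; workers do the rest later."""
    # Job 1: persist the raw events as soon as a worker is free.
    sqs.send_message(
        QueueUrl=SAVE_QUEUE_URL,
        MessageBody=json.dumps({"user": user_key, "events": events}),
    )
    # Job 2: correlate this user's events, visible to workers only after the delay.
    sqs.send_message(
        QueueUrl=CORRELATE_QUEUE_URL,
        MessageBody=json.dumps({"user": user_key}),
        DelaySeconds=CORRELATE_DELAY_SECONDS,
    )
```

Because the handler only writes two messages, its latency does not depend on how far behind the workers are.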
We still use Mongo for some data, where we need to be able to query by keys that would cause hot shards in DynamoDB. For this, we have three replicas with local disks (in different Availability Zones), and one hidden replica set member backed by Amazon Elastic Block Store (EBS). This member just follows along, but does not accept any queries or writes. Periodically, it takes a snapshot of its EBS volume, which we use for backup. This allows us to get the performance of local SSDs for the live system, while allowing point-in-time snapshot backups, with no downtime.
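For readers unfamiliar with hidden members, the sketch below builds a replica set configuration document of the kind described here: ordinary members for the live system plus one hidden member for backups. The fields `_id`, `host`, `priority`, and `hidden` are standard MongoDB replica set configuration fields; the function name, set name, and host names are hypothetical, and this is not our actual configuration.

```python
def replica_set_config(name: str, live_hosts: list, backup_host: str) -> dict:
    """Build a replica set config with one hidden, non-electable backup member."""
    members = [{"_id": i, "host": host} for i, host in enumerate(live_hosts)]
    members.append({
        "_id": len(live_hosts),
        "host": backup_host,
        "priority": 0,    # hidden members must have priority 0 (never primary)
        "hidden": True,   # not visible to client applications
    })
    return {"_id": name, "members": members}


# Example with placeholder host names: three live members and one backup member.
config = replica_set_config(
    "rs0",
    ["mongo-a.example.invalid:27017",
     "mongo-b.example.invalid:27017",
     "mongo-c.example.invalid:27017"],
    "mongo-backup.example.invalid:27017",
)
```

A document like this is what would be passed to the replica set initiation or reconfiguration command.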
In the event we need to restore from backup, we just create a new instance from the snapshot, create a new replica set, and add new instance-store-backed instances to replicate the data. Once all the nodes have caught up, we can demote the EBS-backed instance to a hidden secondary, and we are back to where we need to be. The fact that EC2 makes it so easy to take point-in-time snapshots at the block level, and easily clone from those snapshots, is a huge help to our operations. I’ve dealt with backing up Mongo databases where this was not an option, and it is a real pain, especially with very large databases that take a long time to dump. Because we take snapshots so frequently, they are pretty quick (each snapshot just saves the delta since the prior snapshot).
The following illustration shows a simplified version of LaunchDarkly’s AWS architecture.
What We Learned
The biggest thing I’ve learned from using AWS at this scale is to put as much work as possible in a persistent queue like SQS to be handled later, so that accepting incoming requests is never held up by a backlog in processing. At first, we used beanstalkd as the queue service, but as our load grew, beanstalkd could not keep up. We were persisting the raw events in the web application, and then just enqueueing a beanstalkd job to correlate them after a delay. We switched to SQS for everything, and it is rock-solid.
LaunchDarkly is serving hundreds of millions of requests daily, and we anticipate being in the billions as feature flags are increasingly adopted as a way to do effective continuous delivery. We’re getting customers who’ve built their own in-house feature flag management system but are tired of maintaining it. We’re also getting customers who want to adopt the best practices of feature flagging but don’t want to invest in building their own feature flag management system. Amazon Web Services has been a fantastic partner in helping us scale with demand.