How Statsig runs 100x more cost-effectively using Amazon ElastiCache for Redis
Today, developers use feature gates to control feature rollout and manage the risk of bad deployments, but some remain blind to how customers respond to a new feature. Statsig is a modern feature management and product experimentation platform that enables teams to ship faster using smart feature gates backed by crucial insights into customer and system behavior. In addition to controlling rollouts, Statsig’s smart feature gates automatically run A/B tests to validate a feature’s impact on customer behavior and application usage. With this comprehensive understanding of how new features perform, developers can grow product adoption while keeping the engineering and operational best practices that feature gates commonly serve. To deliver these benefits to our customers efficiently and cost-effectively, we turned to Amazon ElastiCache for Redis.
In this post, we highlight how Statsig uses ElastiCache for Redis for three use cases. First, how it is used for smart log sampling in an efficient and cost-effective way. Second, how it is used as a no-hassle event streaming solution. Third, how it is used as an abstraction over the backend database to deliver a highly performant, scalable, and efficient data access layer that simplifies the developer experience. At the end of this post, we summarize how we ran our solution 100x more cost-effectively with ElastiCache.
Statsig processes large volumes of logs. In fact, most of the AWS costs we incur come from infrastructure that stores and processes these logs. This presents a challenge for instrumentation, because each request received by a server generates multiple instrumentation loglines, multiplying our costs. A popular way to solve this problem is static sampling, where only 1 out of every n instrumentation loglines is preserved. Redis lets us implement dynamic sampling, which improves significantly on this.
When we receive a request, we decide whether or not to sample it. Deciding per request, rather than per logline, means that for each sampled request we have a complete picture of every log it generated. If we simply rolled the dice and selected requests uniformly at random, we would bias toward logs from our largest customers calling our most common endpoints. However, we want to catch issues that appear for customers who are just starting to integrate with Statsig, or who are integrating with our less frequently used endpoints.
To capture those cases, we maintain a Redis counter keyed on the combination of customer and endpoint, with a quarter-hour expiration. We sample a request only while its counter is small. This simple dynamic sampling on Redis lets us log at least the first few requests from each company and endpoint every quarter hour, while requiring 400x fewer logs than static sampling.
First, we define a Lua script for incrementing and expiring counters. This helps avoid race conditions between INCR and EXPIRE operations, and it keeps our execution efficient:
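The original script isn’t reproduced here; the following is a minimal sketch of what it might look like, registered as a custom ioredis command. The command name `incrWithExpire` and the minimal client interface are our assumptions:

```typescript
// Lua: atomically increment a counter and set its expiration the first
// time the key is created. Running both in one script avoids the race
// between separate INCR and EXPIRE calls.
// KEYS[1] = counter key, ARGV[1] = TTL in seconds.
export const INCR_WITH_EXPIRE_LUA = `
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
`;

// Subset of the ioredis API used here (illustrative).
type RedisClient = {
  defineCommand(name: string, opts: { numberOfKeys: number; lua: string }): void;
};

// Register the script so it can be called like a first-class command.
export function registerIncrWithExpire(redis: RedisClient): void {
  redis.defineCommand("incrWithExpire", {
    numberOfKeys: 1,
    lua: INCR_WITH_EXPIRE_LUA,
  });
}
```

Because the script runs atomically on the server, the counter can never be left without a TTL, even under concurrent increments.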
Then we use this command to decide whether or not to sample a request:
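The decision logic isn’t shown in this excerpt; a sketch of how it might look follows. The key format, the per-window threshold, and the custom command name `incrWithExpire` are our assumptions:

```typescript
const SAMPLING_WINDOW_SECONDS = 15 * 60; // quarter-hour window
const SAMPLES_PER_WINDOW = 10;           // illustrative threshold

// Client surface: the counter script registered as a custom ioredis
// command (the name `incrWithExpire` is our assumption).
type RedisClient = {
  incrWithExpire(key: string, ttlSeconds: number): Promise<number>;
};

// Build the counter key from the request's company and endpoint
// (the key format is illustrative).
export function samplingKey(companyId: string, endpoint: string): string {
  return `sampling:${companyId}:${endpoint}`;
}

// Sample a request only while its per-window counter is small, so the
// first few requests per company and endpoint are always captured.
export async function shouldSample(
  redis: RedisClient,
  companyId: string,
  endpoint: string,
): Promise<boolean> {
  const count = await redis.incrWithExpire(
    samplingKey(companyId, endpoint),
    SAMPLING_WINDOW_SECONDS,
  );
  return count <= SAMPLES_PER_WINDOW;
}
```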
A problem with this implementation is that after a key has hit its sampling quota, the application keeps sending requests to Redis for it until the sampling window elapses. These unnecessary requests can be avoided with a self-clearing hashset holding the keys the application already knows are fully sampled. This introduces some memory overhead and can cause a few false positives, so you can tweak the trade-off with different data structures and clearing strategies. Here is a final implementation using the self-clearing hashset approach:
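Since the final implementation isn’t included in this excerpt, here is a sketch of the idea: the "self-clearing hashset" is a local `Set` cleared on a timer, and the threshold, window length, and `incrWithExpire` command name are our assumptions:

```typescript
const SAMPLING_WINDOW_SECONDS = 15 * 60; // quarter-hour window
const SAMPLES_PER_WINDOW = 10;           // illustrative threshold

// Counter script registered as a custom ioredis command
// (the name `incrWithExpire` is our assumption).
type RedisClient = {
  incrWithExpire(key: string, ttlSeconds: number): Promise<number>;
};

// Keys the application already knows are fully sampled. Clearing the
// whole set every window is coarse, but a stale entry costs at most a
// few false positives near the window boundary.
const sampledKeys = new Set<string>();
const timer: any = setInterval(
  () => sampledKeys.clear(),
  SAMPLING_WINDOW_SECONDS * 1000,
);
timer.unref?.(); // don't keep the process alive for this timer

export async function shouldSample(redis: RedisClient, key: string): Promise<boolean> {
  // Short-circuit: skip the Redis round trip for keys already known
  // to be over quota in the current window.
  if (sampledKeys.has(key)) return false;
  const count = await redis.incrWithExpire(key, SAMPLING_WINDOW_SECONDS);
  if (count > SAMPLES_PER_WINDOW) {
    sampledKeys.add(key);
    return false;
  }
  return true;
}
```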
When customers log in to the Statsig console, we show a live stream of events being logged from their application. We chose Redis over Kafka because it was simpler to build and operate, as well as more cost-effective. The diagram below illustrates our high-level approach:
When a customer logs an event using the Statsig SDK, we push it onto a Redis list. If the list has 50 or more elements, we also pop from it, so the list always holds at most the last 50 events sent by that customer. We have found this to be more than enough for our debugging use cases. Redis Cluster Mode facilitates this implementation by allowing Redis to scale horizontally as write and read throughput fluctuates. Redis makes this strategy easy to implement. For instance, when using ioredis in a TypeScript environment, all we do is write to and trim the list whenever an event appears:
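The original snippet isn’t shown in this excerpt; a minimal sketch follows, using an atomic LPUSH + LTRIM pipeline in place of the push-then-pop description above (the key format and the minimal client interface are our assumptions):

```typescript
const MAX_EVENTS = 50; // most recent events kept per customer

// Subset of the ioredis pipeline API used here (illustrative).
type Pipeline = {
  lpush(key: string, value: string): Pipeline;
  ltrim(key: string, start: number, stop: number): Pipeline;
  exec(): Promise<unknown>;
};
type RedisClient = { multi(): Pipeline };

// The per-customer list key (format is illustrative).
export function recentEventsKey(companyId: string): string {
  return `recent_events:${companyId}`;
}

// Push the newest event and trim in one atomic pipeline, so the list
// never grows beyond the most recent MAX_EVENTS entries.
export async function recordEvent(
  redis: RedisClient,
  companyId: string,
  event: object,
): Promise<void> {
  const key = recentEventsKey(companyId);
  await redis
    .multi()
    .lpush(key, JSON.stringify(event))
    .ltrim(key, 0, MAX_EVENTS - 1)
    .exec();
}
```

Because LPUSH and LTRIM run in one MULTI transaction, readers never observe a list longer than 50 entries.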
Our web console polls this list to display the most recent events to a given customer:
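The polling snippet isn’t included in this excerpt; a sketch might look like the following, assuming the same illustrative key format as the writer side:

```typescript
// Subset of the ioredis API used here (illustrative).
type RedisClient = {
  lrange(key: string, start: number, stop: number): Promise<string[]>;
};

// Return the customer's recent events, newest first (events are
// LPUSHed, so index 0 is the most recent). The list is capped at 50
// entries by the writer, so LRANGE 0 -1 stays cheap.
export async function fetchRecentEvents(
  redis: RedisClient,
  companyId: string,
): Promise<unknown[]> {
  const raw = await redis.lrange(`recent_events:${companyId}`, 0, -1);
  return raw.map((e) => JSON.parse(e));
}
```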
This approach lets us maintain a stream of recent events even at high volume while incurring little Redis load. We also use Redis counters and HyperLogLog extensively to display event counts to our customers.
We use Amazon DocumentDB (with MongoDB compatibility) for our persistent storage needs on AWS. Inspired by TAO, Facebook’s geographically distributed data store, we model our data as a graph, where objects maintain pointers to other objects. This lets us abstract our database queries as graph operations: for example, fetch the object with this ID, or fetch all objects connected to this ID via this connection type. As a result, our engineers can issue even the most complex queries through a simple API.
We also optimize our reads from DocumentDB using Redis. Because the API for our graph queries is simple, each query is easy to serialize into a compact format. We take the serialized form of a query (or a fragment of it) and use it as a Redis key; the query’s result becomes the Redis value. This lets all queries, no matter how complex, be cached automatically in ElastiCache for Redis. The diagram below illustrates this approach:
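A minimal cache-aside sketch of this idea follows. The query shape, key format, and TTL are our assumptions, not Statsig’s actual scheme:

```typescript
// Subset of the ioredis API used here (illustrative).
type RedisClient = {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, mode: "EX", seconds: number): Promise<unknown>;
};

// A graph query in the simple API: fetch by ID, or fetch neighbors of
// an ID via a connection type (shape is illustrative).
export type GraphQuery =
  | { op: "get_object"; id: string }
  | { op: "get_connected"; id: string; connection: string };

// The serialized query doubles as the cache key.
export function queryKey(q: GraphQuery): string {
  return q.op === "get_object"
    ? `q:get_object:${q.id}`
    : `q:get_connected:${q.id}:${q.connection}`;
}

// Cache-aside: return the cached result if present; otherwise run the
// query against the database and cache its serialized result.
export async function cachedQuery(
  redis: RedisClient,
  q: GraphQuery,
  runQuery: (q: GraphQuery) => Promise<unknown>,
  ttlSeconds = 300,
): Promise<unknown> {
  const key = queryKey(q);
  const hit = await redis.get(key);
  if (hit !== null) return JSON.parse(hit);
  const result = await runQuery(q);
  await redis.set(key, JSON.stringify(result), "EX", ttlSeconds);
  return result;
}
```

Because every query serializes deterministically, identical queries from any server hit the same cache entry with no per-query caching code.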
When we write to DocumentDB, we also construct query patterns for all affected object IDs, and we invalidate any keys matching those patterns in Redis. At Facebook, TAO implemented this pattern to replace Memcached for many data types that fit this model; TAO processes a billion reads and millions of writes per second across thousands of machines. To optimize our resource usage, we use the new auto scaling features of ElastiCache for Redis to automatically scale our Redis cluster based on the cluster’s memory utilization.
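The invalidation step described above might be sketched as follows. The key patterns and the SCAN-based sweep are our assumptions; Statsig’s actual scheme is not shown in this excerpt:

```typescript
// Subset of the ioredis API used here (illustrative).
type RedisClient = {
  scan(
    cursor: string,
    matchToken: "MATCH",
    pattern: string,
    countToken: "COUNT",
    count: number,
  ): Promise<[string, string[]]>;
  del(...keys: string[]): Promise<number>;
};

// On a write to object `id`, delete every cached query key that could
// reference it. The patterns mirror an assumed serialized-query format.
export async function invalidateObject(redis: RedisClient, id: string): Promise<void> {
  const patterns = [`q:get_object:${id}`, `q:get_connected:${id}:*`];
  for (const pattern of patterns) {
    let cursor = "0";
    do {
      // SCAN walks the keyspace incrementally without blocking Redis.
      const [next, keys] = await redis.scan(cursor, "MATCH", pattern, "COUNT", 100);
      if (keys.length > 0) {
        await redis.del(...keys);
      }
      cursor = next;
    } while (cursor !== "0");
  }
}
```

Note that in Cluster Mode, SCAN operates per node, so a production sweep must cover every master; exact-match keys can be deleted directly with DEL rather than scanned.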
Redis remains a critical enabler of our data infrastructure as we aim to put the power of data science in the hands of every product team in the most cost-effective way possible. With ElastiCache for Redis, we run our dynamic sampling and database abstraction patterns 100x more cost-effectively than we could without it.