Building resiliency at scale at Tinder with Amazon ElastiCache

This is a guest post from William Youngs, Software Engineer, Daniel Alkalai, Senior Software Engineer, and Jun-young Kwak, Senior Engineering Manager with Tinder. Tinder was introduced on a college campus in 2012 and is the world’s most popular app for meeting new people. It has been downloaded more than 340 million times and is available in 190 countries and 40+ languages. As of Q3 2019, Tinder had nearly 5.7 million subscribers and was the highest grossing non-gaming app globally.

At Tinder, we rely on the low latency of Redis-based caching to service 2 billion daily member actions while hosting more than 30 billion matches. The majority of our data operations are reads. The following diagram illustrates the general data flow architecture that our backend microservices use to build resiliency at scale.

Fig.1 Cache-aside approach

In this cache-aside approach, when one of our microservices receives a request for data, it queries a Redis cache for the data before it falls back to a source-of-truth persistent database store (typically Amazon DynamoDB, although PostgreSQL, MongoDB, and Cassandra are sometimes used). In the event of a cache miss, our services then backfill the value into Redis from the source of truth.
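The read path described above can be sketched in a few lines of Python. This is a minimal illustration rather than our production client: the `CacheAsideReader` class and its counters are hypothetical names, a plain dictionary stands in for the Redis cache, and a second dictionary stands in for the source-of-truth store.

```python
from typing import Callable, Dict, Optional


class CacheAsideReader:
    """Try the cache first; on a miss, read the database and backfill the cache."""

    def __init__(
        self,
        cache: Dict[str, str],
        load_from_db: Callable[[str], Optional[str]],
    ) -> None:
        self.cache = cache                # stand-in for a Redis client (assumption)
        self.load_from_db = load_from_db  # stand-in for a source-of-truth lookup
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        value = self.cache.get(key)
        if value is not None:
            self.hits += 1
            return value
        # Cache miss: fall back to the source of truth.
        self.misses += 1
        value = self.load_from_db(key)
        if value is not None:
            # Backfill so that the next read is served from the cache.
            self.cache[key] = value
        return value


database = {"user:1": "Alice"}
reader = CacheAsideReader(cache={}, load_from_db=database.get)
first = reader.get("user:1")   # miss: loaded from the database, then cached
second = reader.get("user:1")  # hit: served from the cache
```

A real client would also set a TTL on the backfilled entry and handle cache connection errors, both of which are omitted here for brevity.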

Fig. 2 Sharded Redis configuration on EC2

Before we adopted Amazon ElastiCache for Redis, we used Redis hosted on Amazon EC2 instances with application-based clients. We implemented sharding by hashing keys against a static partitioning scheme. The diagram above (Fig. 2) illustrates a sharded Redis configuration on EC2.

Specifically, our application clients maintained a fixed configuration of the Redis topology (including the number of shards, number of replicas, and instance size). Our applications then accessed the cache data on top of this fixed configuration schema. The static configuration that this solution required caused significant issues whenever we added shards or rebalanced. Still, this self-implemented sharding solution functioned reasonably well for us early on. However, as Tinder’s popularity and request traffic grew, so did the number of Redis instances, which increased the overhead and the challenges of maintaining them.
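To illustrate why a static shard count makes shard addition painful, the following sketch assumes the simplest possible scheme: hash the key and take it modulo the number of shards. This is an assumption made for illustration, not a description of our actual hashing algorithm, and `shard_for` and `fraction_remapped` are hypothetical helper names.

```python
import zlib
from typing import Iterable


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash and a fixed shard count."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def fraction_remapped(keys: Iterable[str], old_shards: int, new_shards: int) -> float:
    """Return the fraction of keys whose shard changes when the shard count changes."""
    keys = list(keys)
    moved = sum(
        1 for key in keys
        if shard_for(key, old_shards) != shard_for(key, new_shards)
    )
    return moved / len(keys)


keys = [f"user:{i}" for i in range(10_000)]
moved = fraction_remapped(keys, old_shards=4, new_shards=5)
print(f"{moved:.0%} of keys map to a different shard after adding one shard")
```

Under modulo hashing with a uniform hash, roughly four out of five keys land on a different shard when going from four shards to five, which means that most of the cache has to be repopulated or copied. That is consistent with the need to stand up an entirely new cluster in order to rebalance.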

Motivation

First, the operational burden of maintaining our sharded Redis cluster was becoming problematic. It took a significant amount of development time to maintain our Redis clusters. This overhead delayed important engineering efforts that our engineers could have focused on instead. For example, it was an immense ordeal to rebalance clusters. We needed to duplicate an entire cluster just to rebalance.

Second, inefficiencies in our implementation required infrastructural overprovisioning and increased cost. Our sharding algorithm was inefficient and led to systematic issues with hot shards that often required developer intervention. Additionally, if we needed our cache data to be encrypted, we had to implement the encryption ourselves.

Finally, and most importantly, our manually orchestrated failovers caused app-wide outages. The failover of a cache node that one of our core backend services used caused the connected service to lose its connectivity to the node. Until the application was restarted to reestablish connection to the necessary Redis instance, our backend systems were often completely degraded. This was by far the most significant motivating factor for our migration. Before our migration to ElastiCache, the failover of a Redis cache node was the largest single source of app downtime at Tinder. To improve the state of our caching infrastructure, we needed a more resilient and scalable solution.

Investigation

We decided fairly early that cache cluster management was a task that we wanted to abstract away from our developers as much as possible. We initially considered using Amazon DynamoDB Accelerator (DAX) for our services, but ultimately decided to use ElastiCache for Redis for a couple of reasons.

Firstly, our application code already uses Redis-based caching, and our existing cache access patterns did not make DAX a drop-in replacement in the way that ElastiCache for Redis was. For example, some of our Redis nodes store processed data from multiple source-of-truth data stores, and we found that we could not easily configure DAX for this purpose.

Secondly, we performed some benchmark latency testing of both technologies and found that, for our specific use cases, ElastiCache was a more efficient and cost-effective solution. The decision to maintain Redis as our underlying cache type allowed us to swap from self-hosted cache nodes to a managed service nearly as simply as changing a configuration endpoint.

After we decided to use a managed service that supports the Redis engine, ElastiCache quickly became the obvious choice. ElastiCache satisfied our two most important backend requirements: scalability and stability. The prospect of cluster stability with ElastiCache was of great interest to us. Before our migration, faulty nodes and improperly balanced shards negatively impacted the availability of our backend services. ElastiCache for Redis with cluster-mode enabled allows us to scale horizontally with great ease.

Previously, when using our self-hosted Redis infrastructure, we would have to create and then cut over to an entirely new cluster after adding a shard and rebalancing its slots. Now we initiate a scaling event from the AWS Management Console, and ElastiCache takes care of data replication across any additional nodes and performs shard rebalancing automatically. AWS also handles node maintenance (such as software patches and hardware replacement) during planned maintenance events with limited downtime.

In addition, we appreciate the data encryption at rest that ElastiCache supports out-of-the-box.

Finally, we were already familiar with other products in the AWS suite of digital offerings, so we knew we could easily use Amazon CloudWatch to monitor the status of our clusters.

Migration strategy

The following diagram (Fig. 3) illustrates our new migration strategy.

Fig. 3 Migration strategy

First, we created new application clients to connect to the newly provisioned ElastiCache cluster. Our legacy self-hosted solution relied on a static map of cluster topology, whereas new ElastiCache-based solutions need only a primary cluster endpoint. This new configuration schema led to dramatically simpler configuration files and less maintenance across the board.
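The difference between the two configuration schemas might look something like the following sketch. All host names, field names, and the `addresses_to_maintain` helper are hypothetical and chosen only for illustration; they do not reflect our real configuration files.

```python
from typing import Any, Dict, List

# Legacy style: every client carries a static map of the whole topology.
# (Hypothetical hosts; a real map would list every shard and replica.)
legacy_config: Dict[str, Any] = {
    "shards": [
        {
            "primary": "redis-shard-0.example.internal:6379",
            "replicas": ["redis-shard-0-replica-0.example.internal:6379"],
        },
        {
            "primary": "redis-shard-1.example.internal:6379",
            "replicas": ["redis-shard-1-replica-0.example.internal:6379"],
        },
    ],
}

# ElastiCache style: the client only needs a single cluster endpoint.
elasticache_config: Dict[str, Any] = {
    "endpoint": "my-cluster.example.internal:6379",
}


def addresses_to_maintain(config: Dict[str, Any]) -> List[str]:
    """List every address that has to be kept up to date in a config."""
    if "endpoint" in config:
        return [config["endpoint"]]
    addresses: List[str] = []
    for shard in config["shards"]:
        addresses.append(shard["primary"])
        addresses.extend(shard["replicas"])
    return addresses


print(len(addresses_to_maintain(legacy_config)), "addresses in the legacy config")
print(len(addresses_to_maintain(elasticache_config)), "address in the new config")
```

With the static map, every topology change means editing and redeploying client configuration, whereas a single endpoint stays the same as shards and replicas come and go.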

Next, we migrated production cache clusters from our legacy self-hosted solution to ElastiCache by forking data writes to both clusters until the new ElastiCache instances were sufficiently warm (step 2). Here, “fork-writing” entails writing data to both the legacy stores and the new ElastiCache clusters. Most of our caches have a TTL associated with each entry, so for our cache migrations, we generally didn’t need to perform backfills (step 3) and only had to fork-write to both the old and new caches for the duration of the TTL. Fork-writes may not be necessary to warm the new cache instance if the downstream source-of-truth data stores are sufficiently provisioned to accommodate the full request traffic while the cache is gradually populated. At Tinder, we generally keep our source-of-truth stores scaled down, so the vast majority of our cache migrations require a fork-write cache warming phase. Furthermore, if the TTL of the cache to be migrated is substantial, a backfill can sometimes be used to expedite the process.
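The fork-write phase can be sketched as a thin wrapper that writes to both caches while reads continue to be served from the legacy one. The `TTLCache` and `ForkWriter` classes below are hypothetical in-memory stand-ins, not our production clients, and the injectable clock exists only so that expiry can be demonstrated without waiting.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory stand-in for a Redis cache with per-entry TTLs."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._entries[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Entry has outlived its TTL: drop it and report a miss.
            del self._entries[key]
            return None
        return value


class ForkWriter:
    """Write to both caches; keep serving reads from the legacy cache."""

    def __init__(self, legacy: TTLCache, new: TTLCache) -> None:
        self.legacy = legacy
        self.new = new

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self.legacy.set(key, value, ttl_seconds)
        self.new.set(key, value, ttl_seconds)  # warms the new cluster

    def get(self, key: str) -> Optional[str]:
        return self.legacy.get(key)


fake_now = [0.0]  # controllable clock for the example
legacy = TTLCache(clock=lambda: fake_now[0])
new = TTLCache(clock=lambda: fake_now[0])
writer = ForkWriter(legacy, new)
writer.set("user:1", "Alice", ttl_seconds=60)
```

Once fork-writing has run for a full TTL, every entry that is still live in the legacy cache was written during the fork-write window and therefore also exists in the new cache, which is why a separate backfill is usually unnecessary.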

Finally, to have a smooth cutover as we read from our new clusters, we validated the new cluster data by logging metrics to verify that the data in our new caches matched that on our legacy nodes. When we reached an acceptable threshold of congruence between the responses of our legacy cache and our new one, we slowly cut over our traffic to the new cache entirely (step 4). When the cutover completed, we could scale back any incidental overprovisioning on the new cluster.
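One way to picture the validation and gradual cutover is the sketch below. It assumes a shadow-read comparison and a percentage-based rollout keyed on a stable hash. The `ShadowReadValidator` class, the `should_route_to_new` helper, and the 99% threshold are all hypothetical choices for illustration, since the exact metrics and thresholds we used are not covered here.

```python
import zlib
from typing import Dict, Optional


class ShadowReadValidator:
    """Read from both caches, count agreement, and serve the legacy value."""

    def __init__(self, legacy: Dict[str, str], new: Dict[str, str]) -> None:
        self.legacy = legacy  # dict stand-ins for the two cache clients
        self.new = new
        self.matches = 0
        self.mismatches = 0

    def get(self, key: str) -> Optional[str]:
        legacy_value = self.legacy.get(key)
        new_value = self.new.get(key)
        if legacy_value == new_value:
            self.matches += 1
        else:
            self.mismatches += 1
        return legacy_value  # the legacy cache stays authoritative for now

    def match_rate(self) -> float:
        total = self.matches + self.mismatches
        return self.matches / total if total else 0.0


def should_route_to_new(key: str, rollout_percent: int) -> bool:
    """Send a stable, gradually growing slice of keys to the new cache."""
    return zlib.crc32(key.encode("utf-8")) % 100 < rollout_percent


legacy_cache = {"a": "1", "b": "2", "c": "3"}
new_cache = {"a": "1", "b": "2"}  # "c" has not been warmed yet
validator = ShadowReadValidator(legacy_cache, new_cache)
for key in ("a", "b", "c"):
    validator.get(key)

ready_for_cutover = validator.match_rate() >= 0.99  # hypothetical threshold
```

Hashing the key rather than choosing randomly per request keeps a given key on the same cache throughout a rollout step, which makes mismatches easier to trace.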

Conclusion

As our cluster cutovers proceeded, the frequency of node reliability issues plummeted and we experienced a marked increase in app stability. It became as easy as clicking a few buttons in the AWS Management Console to scale our clusters, create new shards, and add nodes. The Redis migration freed up our operations engineers’ time and resources to a great extent and brought about dramatic improvements in monitoring and automation. We later optimized our application Redis clients to implement smooth failover auto-recovery. For more information, see Taming ElastiCache with Auto-discovery at Scale on Medium.

Our functional and stable migration to ElastiCache gave us immediate and dramatic gains in scalability and stability. We could not be happier with our decision to adopt ElastiCache into our stack here at Tinder.

Disclaimer

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

About the Authors

Will Youngs is a Software Engineer on the Identity team at Tinder. Will works to build and maintain microservices which manage user data and user authentication. He has also worked extensively on the company’s migration to Amazon ElastiCache.

Daniel Alkalai is a Senior Software Engineer on the Identity team at Tinder. He leads the Identity Platform team dealing with profile data management, helping Tinder’s core backend services and infrastructure maintain performance as they scale.

Jun-young Kwak is a Senior Engineering Manager on the Identity team at Tinder. He is leading the engineering efforts to make a Tinder account the ultimate foundation for introducing members to new people.

AWS Database Blog