Beat Increases Scalability, Reduces Compute Costs by 90% on Amazon ElastiCache
When Beat, a top ride-hailing app in Latin America, reached a phase of hypergrowth in 2019, it started to experience outages due to bottlenecks in the system. Though Beat was using Amazon ElastiCache, a fully managed in-memory data store, it needed to update its configuration to take advantage of key features specific to the team’s use case.
When Colombia banned Beat’s primary competitor in February 2020, opening the floodgates to potential new users, the team at Beat had to act fast to find a scalable new cache solution that could accommodate many more users in a short period of time. So it went back to the drawing board, brainstorming how to use Amazon Web Services (AWS) solutions to create a more effective configuration for its needs. After only 2 weeks, the new setup with ElastiCache significantly improved Beat’s performance and scalability while reducing costs and the workload for its team. This ultimately helped prepare the company for an influx of new business by making its infrastructure more scalable and highly available.
“We could split the traffic among as many instances as we liked, basically scaling horizontally, which we couldn’t do in the previous solution.”
Antonis Zissimos, Senior Engineering Manager, Beat
Troubleshooting Based on Scaling Needs
Founded in 2011 in Greece, Beat is the fastest-growing app in Latin America, where it now operates 90 percent of its business. In 2017, Beat was acquired by FREE NOW, a joint venture of BMW Group and Daimler Mobility AG, and became the group’s Latin American company, which drove additional growth. Today, the company has over 700,000 drivers and more than 22 million users across 6 countries and 23 cities. Since its early days, Beat had used Amazon Elastic Compute Cloud (Amazon EC2), Amazon Aurora, AWS Auto Scaling, and Amazon ElastiCache for Redis in a configuration that worked well for years. But as Beat amassed millions of users, it noticed that the existing architecture could no longer scale efficiently. As the app prepared for a dramatic spike in volume after its competitor’s exit from Colombia, it needed to prepare to scale to meet that demand and avoid downtime for its users.
At that point, Beat’s system could scale only vertically, meaning that all Beat’s traffic was routed to one primary instance. This setup led to frequent bottlenecks and hours-long downtime, which prevented Beat from engaging with customers and resulted in a high number of lost rides, the company’s core key performance indicator. Beat experienced decreasing revenue and increased compute costs from not being able to efficiently use instances.
So Beat needed to reevaluate its existing configuration and find a cost-effective way to scale horizontally and balance the traffic load among instances without overworking its engineering team. During an in-depth review, the team consulted the AWS Enterprise Support team of engineers and solutions architects about Beat’s configuration and learned that enabling cluster mode would help resolve its challenges and scale to meet its customer demands. “It’s better for us to use a managed service that is balanced in market price and quality so that we can put our engineering efforts toward developing other features that bring more value to our users,” says Jim Ntosas, site reliability engineer at Beat.
Assembling a Core Team for a Strategic, Fast Transition
Beat first explored different cluster-configuration options. When launching an Amazon ElastiCache for Redis cluster, users can choose one of three configurations: single node, cluster mode disabled, or cluster mode enabled. With cluster mode enabled, users can scale to very large amounts of storage, potentially hundreds of terabytes across up to 500 nodes, whereas a single node can store only as much data in memory as its instance type supports.
Previously, Beat had cluster mode disabled, which meant the infrastructure could only scale vertically. To improve the performance of its architecture, Beat updated its configuration and enabled cluster mode to enhance reliability and availability with little change to the existing workload. Cluster mode also provided a number of additional benefits for Beat. The primary benefit was that it enabled Beat to scale in or out the number of shards (horizontal scaling), versus scaling up or down the node type (vertical scaling).
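Under the hood, cluster mode shards data by partitioning the keyspace into 16,384 hash slots, which are divided among the shards; each key is mapped to a slot with a CRC16 checksum. The following standalone Python sketch illustrates that mapping (it does not talk to ElastiCache; it only reproduces the slot calculation that Redis cluster mode uses):

```python
# Sketch of how Redis cluster mode maps keys to hash slots.
# Redis partitions the keyspace into 16,384 slots using the
# CRC16-CCITT (XModem) checksum; each shard owns a range of slots.

NUM_SLOTS = 16384

def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem): polynomial 0x1021, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc

def key_hash_slot(key: str) -> int:
    """Return the cluster hash slot for a key, honoring {hash tags}."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:          # non-empty tag: hash only its contents
            key = key[start + 1:end]
    return crc16(key.encode()) % NUM_SLOTS

# Keys that share a hash tag land in the same slot (and thus the same
# shard), so multi-key operations on them still work in cluster mode.
print(key_hash_slot("{user42}.following") == key_hash_slot("{user42}.followers"))  # True
```

Because slots, not whole datasets, are assigned to shards, adding a shard simply moves a subset of slots to the new node, which is what makes the scale-out (and online resharding) described here possible.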
Beat aimed to get the new configuration ready as quickly as possible to accommodate the sudden opportunity in the Colombian market. It assembled a core team of people from the infrastructure domain, such as DevOps engineers, backend engineers, and testers. The team began by performing stress and load tests that evaluated the new architecture’s ability to add replica nodes without impacting the cluster and to integrate online resharding and shard rebalancing, as well as to adjust instance types with virtually no downtime. “We could split the traffic among as many instances as we liked, basically scaling horizontally, which we couldn’t do in the previous solution,” says Antonis Zissimos, senior engineering manager at Beat.
The core team made cluster mode fully operational in just 2 weeks, during which it configured Beat’s strategy to optimize traffic distribution and take full advantage of Amazon ElastiCache offerings. “Not all the instances were balancing the traffic as well as expected, because of how our library works and how Redis responds when there are shards,” says Andreas Strikos, lead site reliability engineer at Beat. First, the team rolled out Redis cluster mode on a small cluster in the Greek market and saw that traffic distribution could be further improved by deploying code with a persistent-connection flag for connections to the Redis cluster. That adjustment reduced the load per node by 25–30 percent, boosting performance for end users. The core team also disabled the internal library it had developed as part of its previous setup, which had increased traffic to the primary nodes.
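The effect of persistent connections can be illustrated with a small sketch: a client that caches one connection per shard and reuses it across requests, instead of paying a handshake on every request. This is an illustrative toy, not Beat’s actual library; all names and the routing scheme are hypothetical:

```python
# Illustrative sketch (not Beat's actual library): reusing one persistent
# connection per cluster node instead of reconnecting on every request.
# `connect` stands in for an expensive TCP/TLS handshake to a node.

connection_count = {}          # handshakes performed, per node

def connect(node: str) -> dict:
    """Stand-in for opening a socket to a Redis node (expensive)."""
    connection_count[node] = connection_count.get(node, 0) + 1
    return {"node": node}      # pretend socket handle

class PersistentClient:
    """Routes each request to a node, dialing each node at most once."""
    def __init__(self, nodes):
        self.nodes = nodes
        self._conns = {}       # node -> cached connection

    def _conn_for(self, node):
        if node not in self._conns:           # first use: handshake
            self._conns[node] = connect(node)
        return self._conns[node]              # later uses: reuse socket

    def get(self, key: str):
        node = self.nodes[hash(key) % len(self.nodes)]  # naive routing
        conn = self._conn_for(node)
        return conn["node"]    # a real client would issue GET over conn

client = PersistentClient(["shard-a", "shard-b", "shard-c"])
for i in range(1000):          # 1,000 requests...
    client.get(f"ride:{i}")
print(sum(connection_count.values()))  # ...but at most 3 handshakes
```

Without the cache in `_conn_for`, every request would trigger a handshake, and that per-request overhead is the kind of load the team shaved off each node.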
By using Amazon ElastiCache for Redis with cluster mode enabled to distribute load more evenly, Beat has not only seen virtually zero downtime but also eliminated 90 percent of the hours the engineering team had previously spent on managing the system. “Migrating to the newer cluster mode and using newer standard Redis libraries enabled us to meet our scaling needs and reduce the number of operations our engineers had to do,” says Zissimos. The balanced, distributed load also makes efficient use of instances as the system scales, and Beat has seen a 90 percent drop in compute costs.
Realizing a Better Future with Automated, Simplified Systems on AWS
Following the guidance of AWS Enterprise Support, Beat optimized Amazon ElastiCache for Redis by enabling cluster mode. As a result, Beat became more prepared for the future in just 2 weeks, creating seamless scalability while reducing compute costs, relieving its staff of unnecessary stressful work, and improving performance for its users.
The engineering team has recommended Amazon ElastiCache for Redis for other teams within Beat that want to improve and simplify operations. According to Zissimos, automated systems are the key to success: “We’ll always take the solutions that are robust and simple to implement in an automated way by autonomous teams. That’s the future—building a solution teams can use to reduce their cognitive load and be able to respond to the needs of the business, deliver faster, and outgrow the competition.”
Founded in 2011, Beat is a ride-hailing app that has more than 700,000 drivers and 22 million users globally. Headquartered in Greece, the company is the fastest-growing app in Latin America, serving Peru, Chile, Colombia, Mexico, and Argentina.
Benefits of AWS
- Migrated in 2 weeks
- Reduced load per node by 25–30%
- Reduced compute costs by 90%
- Eliminated 90% of time staff spent on managing caching layer
- Has virtually zero downtime
AWS Services Used
Amazon ElastiCache
Amazon ElastiCache allows you to seamlessly set up, run, and scale popular open-source compatible in-memory data stores in the cloud.
Amazon Elastic Compute Cloud (Amazon EC2)
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers.
Amazon Aurora
Amazon Aurora is a MySQL- and PostgreSQL-compatible relational database built for the cloud that combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open-source databases.
AWS Auto Scaling
AWS Auto Scaling monitors your applications and automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost.
Companies of all sizes across all industries are transforming their businesses every day using AWS. Contact our experts and start your own AWS Cloud journey today.