AWS for Games Blog

Optimizing price and performance: How Dream11 gained up to 92% benefits using Amazon EBS gp3

With over 130 million users, Dream11 is the world’s largest fantasy sports platform, offering fantasy cricket, football, kabaddi, basketball, hockey, volleyball, handball, rugby, futsal, American football, and baseball. Dream11 is the flagship brand of Dream Sports, India’s leading sports technology company, and has partnerships with several national and international sports bodies and cricketers.

Dream11 has seen exponential growth in the past five years and is expected to reach 150 million users in the next few months. In the 2022 edition of the Indian Premier League (IPL), Dream11 achieved its all-time high concurrency. To sustain growth, it’s important for us to offer a best-in-class experience to end users by providing an uptime of 99.9% for end-user-facing applications. With so many users creating their fantasy sports teams in the app during big-ticket events such as the IPL, infrastructure challenges can arise. To ensure a great user experience through advanced and customised cloud operations at lower spend, Dream11 signed an Enterprise Support agreement with Amazon Web Services (AWS). AWS technical account managers (TAMs) drive strategic initiatives such as cost optimisation workshops (COW), cloud operations reviews (COR), and security improvement programs (SIP). They also recommend disaster recovery architectures for critical applications.

In this article, we explore Dream11’s cost optimization journey, focusing on the challenges it faced and its migration approach.

Dream11 has deployed some of its critical applications on AWS as microservices on EC2 instances. Each of these instances has gp2 Elastic Block Store (EBS) volumes attached. Every month, the TAM provides recommendations for cost optimization. In one such report, the TAM noticed that we were spending about $500K monthly on EBS volumes alone. The team did a detailed analysis and found that non-production environments were using instances with 500 GB gp2 EBS volumes whose utilization was less than 10%. In the production account, instances were deployed with oversized storage volumes that were rarely more than 25% utilized.

Cost Optimization Recommendations

The TAM’s recommendations comprised:

  1. A 20% reduction in EBS volume cost by migrating gp2-based EBS volumes to gp3. gp3 offers SSD performance at up to 20% lower cost per GB than gp2 volumes. Furthermore, we could provision higher input/output operations per second (IOPS) and throughput without needing to provision additional block storage capacity.
  2. An 80 to 90% reduction in storage cost for the non-production environment by allocating appropriately sized storage volumes to minimize waste. Specifically, the TAM recommended that instead of 500 GB volumes, we use 50 GB EBS volumes for stateless workloads and 100 GB EBS volumes for top-tier microservices.
  3. In production, the TAM recommended using a monitoring tool to analyse EBS volume usage and then decide where volume sizes could be reduced.
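
The 80 to 90% figure in the second recommendation follows directly from the volume sizes, since EBS storage is billed per provisioned GB. A minimal sketch of that arithmetic, assuming cost scales linearly with size (the function name is illustrative, not from the article):

```python
def size_reduction_pct(old_gb: float, new_gb: float) -> float:
    """Percentage of per-GB storage cost removed by shrinking a volume.

    Assumes cost is proportional to provisioned size, which holds
    when the volume type and per-GB price stay the same.
    """
    return round(100 * (old_gb - new_gb) / old_gb, 1)


# Sizes taken from the recommendation: 500 GB down to 50 GB or 100 GB.
stateless_saving = size_reduction_pct(500, 50)    # 90.0
top_tier_saving = size_reduction_pct(500, 100)    # 80.0
print(stateless_saving, top_tier_saving)
```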

Cost optimization by migrating to gp3 EBS volumes

Dream11 saved over $11,000 monthly in a proof of concept (PoC) that migrated a Kafka workload to gp3 EBS volumes. As a first step, we asked the TAM to help identify which application should be migrated first. We decided to experiment with the data-highway application, which tracks users’ journeys through the Dream11 application. It could be migrated easily and at low risk by creating a new parallel cluster and synchronising the old data to it to avoid any data loss.

The data-highway application comprised 40 instances running a Kafka cluster, each with a 5 TB gp2 EBS volume attached. The primary reason for such a large disk was to obtain the required 15,000 disk IOPS, because gp2 baseline IOPS scale with volume size at 3 IOPS per GB. We also found that peak consumption was 1.5 TB of the allocated 5 TB, so we decided to reduce the disk size from 5 TB to 2 TB. We synchronised the old data to newly provisioned gp3 EBS volumes for a new Kafka cluster while the existing cluster continued to run on gp2 volumes. During the planned cutover window, we ensured that 100% of the old data was in sync between the old and new Kafka clusters before we redirected traffic to the new cluster and retired the old one. The migration lowered the overall EBS volume cost by 56% without affecting the performance of the Kafka cluster.

The table below compares gp2 and gp3 EBS volume costs for the 40 nodes of the Kafka cluster.

| Volume type | Volume size | IOPS per volume | Cost per node per month | Cost saved for 40 nodes per month |
|---|---|---|---|---|
| gp2 | 5 TB | 15,000 | $512 | $11,520 (56%) |
| gp3 | 2 TB | 15,000 | $224 | |

Table 1 – Cost comparison for gp2 vs gp3 EBS volume
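
The per-node figures in Table 1 can be reproduced with a short calculation. The sketch below is illustrative only: the unit prices are assumed list prices (the article does not state a region or price sheet), chosen because they reproduce the table's numbers, and the function names are hypothetical.

```python
# Assumed monthly list prices in USD; not stated in the article.
GP2_PER_GB = 0.10
GP3_PER_GB = 0.08
GP3_PER_EXTRA_IOPS = 0.005   # per provisioned IOPS above the included baseline
GP3_INCLUDED_IOPS = 3000
GP3_PER_EXTRA_MBPS = 0.04    # per MB/s of throughput above the included baseline
GP3_INCLUDED_MBPS = 125


def gp2_monthly_cost(size_gb: float) -> float:
    """gp2 is billed on provisioned size only; IOPS come with the size."""
    return size_gb * GP2_PER_GB


def gp3_monthly_cost(size_gb: float, iops: int = 3000, throughput_mbps: int = 125) -> float:
    """gp3 bills size, plus any IOPS and throughput above the included baseline."""
    extra_iops = max(0, iops - GP3_INCLUDED_IOPS)
    extra_mbps = max(0, throughput_mbps - GP3_INCLUDED_MBPS)
    return (size_gb * GP3_PER_GB
            + extra_iops * GP3_PER_EXTRA_IOPS
            + extra_mbps * GP3_PER_EXTRA_MBPS)


GB_PER_TB = 1024
gp2_cost = gp2_monthly_cost(5 * GB_PER_TB)               # about $512 per node
gp3_cost = gp3_monthly_cost(2 * GB_PER_TB, iops=15000)   # about $224 per node
saving_pct = 100 * (gp2_cost - gp3_cost) / gp2_cost      # about 56%
cluster_saving = 40 * (gp2_cost - gp3_cost)              # 40-node Kafka cluster

print(round(gp2_cost), round(gp3_cost), round(saving_pct), round(cluster_saving))
```

With unrounded per-node costs the 40-node saving comes out slightly above the table's $11,520, which is based on the rounded $512 and $224 figures.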

Migration approach & performance benchmarking

The successful migration of the Kafka cluster encouraged us to prepare a list of applications and datastores that could be migrated to gp3 EBS volumes. At Dream11, applications are deployed as microservices that run on EC2 instances. These EC2 instances are part of Auto Scaling groups (ASGs) that scale the number of instances up or down based on end-user traffic patterns. To migrate the EBS volumes, we changed the deployment configuration to replace the gp2 volume type with gp3 and rolled over the instance deployment so that new instances launched with gp3 volumes. Now, 100% of our microservices run on gp3 EBS volumes. We also reduced the size of the EBS volumes wherever possible.
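
As an illustration of the configuration change described above, the sketch below rewrites EC2-style block device mappings so that gp2 volumes become gp3. It is a hypothetical helper, not Dream11's actual tooling; the article does not say whether the deployment configuration lives in launch templates, launch configurations, or another system.

```python
import copy


def to_gp3(block_device_mappings):
    """Return a copy of EC2-style block device mappings with gp2 switched to gp3.

    The input is left unmodified. Mappings without an "Ebs" entry
    (for example instance-store devices) are passed through unchanged.
    """
    updated = copy.deepcopy(block_device_mappings)
    for mapping in updated:
        ebs = mapping.get("Ebs")
        if ebs and ebs.get("VolumeType") == "gp2":
            ebs["VolumeType"] = "gp3"
    return updated


# Hypothetical mapping for a stateless microservice instance.
old_mappings = [
    {"DeviceName": "/dev/xvda", "Ebs": {"VolumeSize": 50, "VolumeType": "gp2"}},
]
new_mappings = to_gp3(old_mappings)
print(new_mappings)
```

If launch templates are in use, one option would be to pass the result as `LaunchTemplateData={"BlockDeviceMappings": new_mappings}` to boto3's `create_launch_template_version` and then roll the group with `start_instance_refresh`, so that replacement instances come up with gp3 volumes.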

We had a few standalone EC2 instances running applications such as Jenkins. For these instances, we used Amazon EBS Elastic Volumes to modify the volume type from gp2 to gp3 without detaching volumes or restarting instances, so the applications were not interrupted during the modification. A disk with 100 GB of data was converted to gp3 within 10 minutes. Jenkins had a specific storage requirement of 3,000 IOPS. It was originally provisioned with a 1 TB gp2 EBS volume to meet that requirement. While checking volume consumption on the Jenkins server, we realized that only around 80 GB of the allocated 1 TB was in use. So we migrated the gp2 volume to gp3 and resized it to 100 GB with 3,000 IOPS, meeting the application requirement and reducing storage cost by 92%.

The table below compares the cost of gp2 and gp3 EBS volumes for this application.

| Volume type | Volume size | IOPS per volume | Cost per month | Cost saved per month |
|---|---|---|---|---|
| gp2 | 1000 GB | 3,000 | $100 | $92 |
| gp3 | 100 GB | 3,000 | $8 | |

Table 2 – Cost comparison between gp2 vs gp3 EBS volume

We went ahead and automated the migration process without any downtime.
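
The article does not describe how the automation works. One plausible shape, shown here as a sketch under stated assumptions, is to scan a `describe_volumes`-style listing for gp2 volumes and build one Elastic Volumes request per volume, provisioning at least the IOPS the gp2 volume had as its baseline. The helper names and sample volume IDs are hypothetical.

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """gp2 baseline performance: 3 IOPS per GiB, minimum 100, maximum 16,000."""
    return min(16000, max(100, 3 * size_gib))


def plan_gp3_migration(volumes):
    """Build modify_volume keyword arguments for every gp2 volume.

    `volumes` is a list of dicts shaped like the entries under "Volumes"
    in an EC2 describe_volumes response. Volumes that are not gp2 are skipped.
    """
    plans = []
    for vol in volumes:
        if vol["VolumeType"] != "gp2":
            continue
        plans.append({
            "VolumeId": vol["VolumeId"],
            "VolumeType": "gp3",
            # gp3 includes 3,000 IOPS; provision more only if gp2 delivered more.
            "Iops": max(3000, gp2_baseline_iops(vol["Size"])),
        })
    return plans


# Hypothetical inventory: a Jenkins-like volume, an already migrated one, a Kafka-like one.
sample_volumes = [
    {"VolumeId": "vol-aaa", "VolumeType": "gp2", "Size": 1000},
    {"VolumeId": "vol-bbb", "VolumeType": "gp3", "Size": 100},
    {"VolumeId": "vol-ccc", "VolumeType": "gp2", "Size": 5120},
]
plans = plan_gp3_migration(sample_volumes)
for plan in plans:
    print(plan)
```

Each plan could then be submitted with boto3 as `ec2.modify_volume(**plan)`. Elastic Volumes can change the type and IOPS in place but cannot decrease a volume's size, so a reduction such as 1 TB to 100 GB needs a separate step (for example, copying the data to a new, smaller volume); the article does not describe that step.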

Dream11 uses GraphQL, a query language for APIs and a server-side runtime for executing queries. Incoming requests from CloudFront go to Route 53, which distributes the traffic across multiple shards using a weighted routing policy. Each shard consists of an Application Load Balancer that routes traffic to multiple GraphQL servers registered in its target group. The GraphQL application runs on hundreds of servers, a number that grows 3 to 4 times during large-scale cricket tournaments such as the Indian Premier League or the T20 World Cup. Every instance was deployed with a 50 GB gp2 volume. We changed the volume type to gp3 in our Packer flow, which created a service-specific AMI with gp3 volumes in place of gp2. We then load tested the application at 40 million requests per minute (RPM). For the same volume size, the GraphQL (stateless) service showed about 10% lower CPU utilization (from ~57% to ~52%) and 14% lower p95 latency on gp3 than on gp2.

The table and graphs below show performance statistics for gp2 and gp3 EBS volumes.

| Application | RPM | gp2 CPU | gp3 CPU | gp2 p95 latency | gp3 p95 latency |
|---|---|---|---|---|---|
| GraphQL service | 40 million | ~57% | ~52% | 50 ms | 43 ms |

Table 3 – Graphql service tested at 40 million RPM with gp2 and gp3
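
The improvement percentages quoted for this load test can be derived from Table 3 as relative reductions. A minimal sketch, using the approximate values from the table (the function name is illustrative):

```python
def relative_improvement_pct(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage of `before`."""
    return round(100 * (before - after) / before, 1)


# Values from Table 3; the CPU figures are approximate in the source.
cpu_improvement = relative_improvement_pct(57, 52)       # 8.8
latency_improvement = relative_improvement_pct(50, 43)   # 14.0
print(cpu_improvement, latency_improvement)
```

Because the CPU readings are approximate, the computed reduction of about 9% is consistent with the roughly 10% quoted in the text.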

The images below show the load test results at 40 million RPM for gp2 and gp3 volumes.

Image 1 – Graphql service tested at 40 million RPM for gp2 volume

Image 2 – Graphql service latency result at 40 million RPM for gp2 volume

Image 3 – Graphql service tested at 40 million RPM for gp3 volume

Image 4 – Graphql service latency result at 40 million RPM for gp3 volume

Lessons learned

  1. Leverage AWS Enterprise Support (TAM) for cost optimization tools, techniques, and recommendations.
  2. Migrate in place using Elastic Volumes to avoid any disruption to application availability.
  3. The benefit is not limited to storage cost. It can also reduce compute cost if the higher IOPS available per volume lets you run fewer instances. AWS publishes a list of instances that support Amazon EBS-optimized bandwidth.

Summary

After the initial success of migrating the Kafka cluster and the GraphQL service to gp3 EBS volumes, we migrated EBS volumes from gp2 to gp3 for several more applications and datastores. We had hundreds of applications running on tens of thousands of workloads with gp2 EBS volumes attached. After migrating them to gp3 volumes, we achieved cost benefits ranging from 8% to 92%, depending on the storage volume size and IOPS requirements. Overall, we saved about 20% on storage costs.

Thank you for reading this blog post on adopting Amazon EBS gp3 volumes to save on storage costs. If you have any feedback or questions, feel free to leave them in the comments section.

About the authors:

Nikhil Mamania is a Senior Technical Consultant at AWS, based out of Mumbai, India. Nikhil has over 21 years of experience in IT networking and systems, and his more recent experience is as a Cloud Architect. He is passionate about working on complex issues and finding creative solutions, strategizing optimal customer cloud deployments based on each customer’s unique business requirements.
Sanket Raut is a Principal Technical Account Manager at AWS based in Vasai, India. Sanket has more than 16 years of industry experience, including roles in cloud architecture, systems engineering, and software design. He currently focuses on enabling large startups to streamline their cloud operations and optimize their cloud spend. His area of interest is serverless technologies.
Parth Ingole is an SRE-II at Dream11. He has over six years of experience and specializes in AWS and Linux administration. In his current role, Parth is responsible for developing and maintaining tools, solutions, and microservices. He is also part of the engineering team that works on ultra-scalable and highly reliable software systems, which includes monitoring, configuration, troubleshooting, and maintenance of the operating system.
Siddharth Terse is an engineering manager at Dream11. With almost a decade of experience in DevOps and SRE support services, his current role requires him to scale Dream11 in terms of user engagement and concurrency. He works with multiple AWS stacks and monitoring systems like New Relic, automation tools, cost optimization, and more. Siddharth is solution-oriented and thrives on solving fast-paced challenges that directly impact the bottom line.