AWS for Games Blog

Dream11 gained significant performance improvements and saved 42% in compute cost by migrating to the Graviton2 instance family

With over 120 million users, Dream11 is the world’s largest fantasy sports platform, offering fantasy cricket, football, kabaddi, basketball, hockey, volleyball, handball, rugby, futsal, American football, and baseball. Dream11 is the flagship brand of Dream Sports, India’s leading sports technology company, and has partnerships with several national and international sports bodies and cricketers.

In order to enhance its cloud operations efficiently, Dream11 worked closely with Enterprise Support – the highest level of support offered by AWS – to optimize the cost of its infrastructure hosted on the AWS Cloud. The Technical Account Manager (TAM) from Enterprise Support provided guidance on multiple strategic initiatives and suggested migrating workloads from Intel-based to Graviton2-based instances. AWS launched the Graviton2 processor at its annual re:Invent event in 2019; it provides the performance Dream11 expects at up to 40% better price performance. Since Graviton2 has a different instruction set, we first wanted to test the feasibility of deploying our applications on it. We engaged further with the TAM to identify an application that is less critical to the business but has a higher potential for cost optimization. After a series of discussions, we chose a self-managed Spark service to migrate to the Graviton2 processor. This Spark cluster is a CPU- and memory-intensive application, which makes it a perfect candidate for moving to the Graviton2 processor to test its performance and benchmark it against the Intel-based processor.

In this blog, we will walk you through the architecture of the PDF Service Spark application used at Dream11, explaining the significance of each tier and how the tiers interact with each other. Then, we will explain the migration approach, the challenges Dream11 faced, and how Graviton2 came to the rescue in our cost optimization journey.

PDF Service Spark Architecture:

In line with Dream11’s FairPlay policy, before a fantasy sports contest begins, users can download a PDF document that includes details of other users and their teams who are participating in that particular contest. We use a self-managed PDF service Spark cluster to generate the PDF file.

The diagram below depicts the architecture of the PDF Service Spark cluster.

Explanation of data flow and Architecture:

As depicted in the diagram, we use AWS Step Functions to trigger a Spark job that reads the contestants’ details and their team details from an S3 bucket. Spark generates the PDF file using these details and stores it back in the S3 bucket, making it available for fantasy sports users to download through CloudFront. Additionally, these details are written to a Cassandra database, where point calculations take place for the leaderboard functionality. The points keep changing based on the performance of the on-field players during the game, and the leaderboard determines the winner with the highest number of points at the end of the contest. Initially, the Spark cluster was running on the m4.4xlarge instance type; later, it was migrated to the m6g.4xlarge instance type.
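As an illustration of this flow, here is a minimal PySpark sketch. The bucket names, paths, column names, and the simplified per-partition PDF rendering are our own assumptions for illustration, not Dream11’s actual implementation.

```python
# Minimal sketch of the PDF Service Spark flow. Bucket names, paths, and
# column names are hypothetical; the real job and its schemas are not public.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pdf-service-sketch").getOrCreate()

# Read contestant and team details for one contest from S3 (hypothetical path).
contest_id = "contest-12345"
teams = spark.read.json(f"s3a://example-contest-data/{contest_id}/teams/")


def render_pdf(rows):
    """Render the rows of one partition into a single PDF (illustrative only)."""
    import io
    from reportlab.pdfgen import canvas  # assumes reportlab is installed on executors

    buf = io.BytesIO()
    pdf = canvas.Canvas(buf)
    y = 780
    for row in rows:
        pdf.drawString(40, y, f"{row['user_name']}: {row['team_name']}")
        y -= 14
    pdf.save()
    yield buf.getvalue()


# Generate PDF parts per partition; uploading the merged document back to S3
# (from where CloudFront serves it) is left out -- boto3's put_object would suffice.
pdf_parts = teams.rdd.mapPartitions(render_pdf).collect()
```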

Migration approach and challenges:

To begin the migration, the PDF Service Spark application was deployed on the m6g.4xlarge instance type in a staging environment for functional tests and to check the compatibility of the application libraries with the Graviton2 processor. We had to upgrade the application libraries to make the application compatible with the Graviton2 processor. After a successful sanity check, we decided to perform a load test to ensure that the application could handle a load similar to the scale of the Indian Premier League (IPL). The final match of IPL 2020 saw over 5.5 million concurrent users on Dream11, so we decided to use the same concurrency level for the load test. Earlier, the application was running on 80 on-demand instances of the m4.4xlarge instance type; hence, we deployed the application with the same number of instances, but on the m6g.4xlarge instance type. During the load test, we observed that the aggregated CPU utilization on m4.4xlarge was 42%, while the aggregated CPU utilization on m6g.4xlarge was 32%.
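For context, standing up the Graviton2 fleet mainly came down to swapping the instance type and using an arm64 AMI. The sketch below shows the idea with boto3; the region, AMI ID, key pair, and security group are placeholders, not Dream11’s actual configuration.

```python
import boto3

# Placeholder region; Dream11's actual region and account setup are not shown here.
ec2 = boto3.client("ec2", region_name="ap-south-1")

# Launch the load-test fleet on Graviton2: compared to the Intel (x86_64)
# deployment, only the instance type and the arm64 AMI change.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",            # placeholder arm64 (CentOS 8) AMI
    InstanceType="m6g.4xlarge",                 # was m4.4xlarge on Intel
    MinCount=80,
    MaxCount=80,
    KeyName="pdf-service-key",                  # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"],  # placeholder
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "service", "Value": "pdf-service-spark-loadtest"}],
    }],
)
print(f"{len(response['Instances'])} instances requested")
```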

Graviton2 is an ARM-based processor with a different instruction set, so making the application compatible with Graviton2 was another challenge. To reduce kernel- and OS-level library compatibility issues, we planned an upgrade from CentOS 7 to CentOS 8, and we also applied the latest OS security patches to reduce the systems’ vulnerability. Due to an application dependency, we decided to use jdk-8u301-linux-aarch64.rpm on the Graviton2 processor. In terms of packages, we also needed to upgrade the GCC library to the latest version, as it is not available in the default CentOS 8 repository. The rest of the migration was seamless, and the application ran on the Graviton2 processor.
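To make the architecture requirement concrete, a simple pre-flight check along the following lines (our own illustration, not Dream11’s tooling) can fail fast if the host or the JDK is not an aarch64 build before the Spark service starts.

```python
import platform
import shutil
import subprocess


def preflight_check():
    """Fail fast if the host or the JDK is not built for Graviton2 (aarch64)."""
    arch = platform.machine()
    if arch != "aarch64":
        raise SystemExit(f"Expected an ARM64 (Graviton2) host, found {arch}")

    java = shutil.which("java")
    if java is None:
        raise SystemExit("No JDK on PATH; install an aarch64 build (e.g. jdk-8u301 for aarch64)")

    # `java -version` prints to stderr; confirm it runs on this architecture.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    print(result.stderr.strip())


if __name__ == "__main__":
    preflight_check()
```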

Performance Comparison:

We carried out the performance benchmarking at a concurrency of 5.5 million users on Intel-based vs Graviton2-based instances.

Instance Type | Number of Instances | Latency (avg) | Aggregated CPU Utilization
m4.4xlarge    | 80                  | 243 ms        | 42%
m6g.4xlarge   | 80                  | 204 ms        | 32%

Table 1 – Intel vs Graviton2 processor comparison with 80 instances of each instance type.

Evidently, Graviton2 processor-based instances provided better performance than Intel processor-based instances. Hence, we decided to test again, but with compute capacity reduced by 25%. This time, we carried out the load test with 60 instances of m6g.4xlarge. We were surprised to see that PDF generation took almost the same time as it had on 80 instances of m4.4xlarge.

Instance Type | Number of Instances | Latency (avg) | Aggregated CPU Utilization
m4.4xlarge    | 80                  | 243 ms        | 42%
m6g.4xlarge   | 60                  | 241 ms        | 39%

Table 2 – Intel vs Graviton2 processor comparison with 80 Intel-based instances and 60 Graviton2-based instances.
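The utilization numbers in Table 1 also hint at why a 25% smaller fleet was a reasonable target. A rough, CPU-bound scaling of the measured aggregate utilization (our own back-of-the-envelope approximation, which ignores latency targets and non-linear effects) lands close to 60 instances:

```python
# Back-of-the-envelope sizing from the measured aggregate CPU utilization.
# Our own approximation: assumes CPU-bound, roughly linear scaling and ignores
# latency targets, memory pressure, and other non-linear effects.
intel_fleet = 80      # m4.4xlarge instances in the load test
intel_cpu = 0.42      # aggregate CPU utilization on Intel
graviton_cpu = 0.32   # aggregate CPU utilization on Graviton2 at the same fleet size

# Graviton2 instances needed to run at roughly the same utilization as Intel did.
equivalent_fleet = intel_fleet * graviton_cpu / intel_cpu
print(f"~{equivalent_fleet:.0f} m6g.4xlarge instances")  # ~61, close to the 60 tested
```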

We concluded the load test as a success and decided to move the PDF Service Spark application into the production environment. We deployed it before IPL 2021 and used it throughout the tournament. Compared to previous IPL seasons, when the Spark cluster was running on Intel-based processors, we saw improved performance in the PDF generation process throughout the tournament thanks to the Graviton2 processor.

Timelines:

Overall, it took about three weeks to deploy the application in the production environment. For the initial staging setup, we spent one week on infrastructure deployment on the AWS Cloud, and in the staging environment we carried out the code changes and functional testing. Next, a whole week was spent on load testing. We tested with a concurrency of 5.5 million users and examined all the edge cases to ensure there would be no glitches when the application went live in production. After the successful load test, we were confident about using the Graviton2 processor for the PDF Service Spark application.

By the third week, we had moved the application to the production infrastructure. To start using the application in the production environment, we shifted a few low-impact contest rounds to the new architecture for a couple of days, while keeping a backup strategy of running them on Intel-based instances. As expected, we did not encounter any issues, so we decided to run all rounds on the new Graviton2-based application stack. No issues or failures surfaced during the process.

Benefits:

The Graviton2 processor provides much better price performance than Intel-based processors. Dream11 originally deployed the PDF Service Spark application on the m4.4xlarge instance type; to serve a scale of 120+ million users and a user concurrency of over 5.5 million, we expanded our cluster to 80 instances of m4.4xlarge. We realized that just by migrating from the Intel to the Graviton2 processor at the same fleet size, we saved around $10,746 (23%) in compute cost per month.

Instance Type | Instance Count | On-Demand Cost per Instance-Hour | Total Monthly Cost | Difference
m4.4xlarge    | 80             | $0.80                            | $46,720            | $10,746
m6g.4xlarge   | 80             | $0.616                           | $35,974            |

Table 3 – Cost comparison between Intel (80 instances) and Graviton2 (80 instances)

During the load test, we found that the Graviton2 instances deliver better processor performance, which allowed us to reduce the number of compute instances by 25%. So, we carried out the test with 60 instances of m6g.4xlarge. This reduced the overall compute cost by $19,739 (42%) each month.

Instance Type | Instance Count | On-Demand Cost per Instance-Hour | Total Monthly Cost | Difference
m4.4xlarge    | 80             | $0.80                            | $46,720            | $19,739
m6g.4xlarge   | 60             | $0.616                           | $26,981            |

Table 4 – Cost comparison between Intel (80 instances) and Graviton2 (60 instances)
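For reference, the monthly figures in Tables 3 and 4 follow directly from the on-demand hourly prices. The sketch below reproduces them, assuming the 730 hours/month convention used in AWS pricing calculators.

```python
# Reproduce the monthly cost figures from the on-demand hourly prices.
# Assumes the 730 hours/month convention used in AWS pricing calculators.
HOURS_PER_MONTH = 730


def monthly_cost(price_per_hour, instances):
    return price_per_hour * instances * HOURS_PER_MONTH


intel = monthly_cost(0.800, 80)        # m4.4xlarge x 80  -> $46,720
graviton_80 = monthly_cost(0.616, 80)  # m6g.4xlarge x 80 -> ~$35,974
graviton_60 = monthly_cost(0.616, 60)  # m6g.4xlarge x 60 -> ~$26,981

print(f"Same fleet size:   save ${intel - graviton_80:,.0f}/month "
      f"({(intel - graviton_80) / intel:.0%})")   # ~$10,746 (23%)
print(f"25% smaller fleet: save ${intel - graviton_60:,.0f}/month "
      f"({(intel - graviton_60) / intel:.0%})")   # ~$19,739 (42%)
```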

Next Steps:

The project was so successful for Dream11 that we have already planned to replicate it for other applications by migrating them to the Graviton2 processor for its price and performance benefits.

Summary:

After successful test runs at Dream11 scale, Dream11’s developer and DevOps teams were encouraged to adopt Graviton2 for cost and performance optimization. Continuing the load tests, we found that CPU utilization on the Graviton2-based instances was far lower than on the Intel-based instances. We realized there was further scope to reduce the compute fleet by 25%, to 60 Graviton2-based instances, without impacting the performance of the PDF generation process. This helped reduce the overall cost by 42% per month.

Thanks for reading this blog post on the adoption of Amazon Graviton2 instances to save on compute costs. If you have any comments or questions, leave them in the comments section.

About the authors:

Nikhil Mamania is a Senior Technical Consultant at AWS, based out of Mumbai, India. Nikhil has over 21 years of experience in IT Networking and Systems, and his more recent experience is as a Cloud Architect. He is passionate about working on complex issues and finding creative solutions, strategizing optimal customer cloud deployments based on each customer’s unique business requirements.
Sanket Raut is a Principal Technical Account Manager at AWS, based in Vasai, India. Sanket has more than 16 years of industry experience, including roles in cloud architecture, systems engineering, and software design. He currently focuses on enabling large startups to streamline their cloud operations and optimize their cloud spend. His area of interest is serverless technologies.
Parth Ingole is an SRE-II at Dream11. He has over six years of experience and specializes in AWS and Linux administration. In his current role, Parth is responsible for developing and maintaining tools, solutions, and microservices. He is also part of the engineering team that works on ultra-scalable and highly reliable software systems, including monitoring, configuration, troubleshooting, and maintenance of the operating system.
Siddharth Terse is a Senior Site Reliability Engineer at Dream11. With almost a decade of experience in DevOps and SRE support services, his current role requires him to scale Dream11 in terms of user engagement and concurrency. He works with multiple AWS services and monitoring systems such as New Relic, as well as automation tools, cost optimization, and more. Siddharth is solution-oriented and thrives on solving fast-paced challenges that directly impact the bottom line.