AWS Storage Blog

Electronic Arts optimizes storage costs and operations using Amazon S3 Intelligent-Tiering and S3 Glacier

Electronic Arts (EA) is a global leader in digital interactive entertainment. We make games that reach 450+ million players across console, PC, and mobile, including top franchises such as FIFA, Madden, and Battlefield. All of our games are instrumented with telemetry that helps our teams build, enhance, and operate these world-class player experiences at scale. This telemetry is a fundamental part of the overall game development process.

During the last decade, the scale and complexity of this telemetry data have grown tremendously. EA has transformed from a Hadoop-dominant environment to one centered around an AWS Cloud Storage based data lake on Amazon S3, including S3 Glacier for data archiving and long-term backup. This evolution has been a crucial journey, with learnings on how to evolve with scale and complexity while remaining cost conscious. To support our top games, our core telemetry systems routinely deal with tens of petabytes of data, tens of thousands of tables, and 2+ billion objects. These datasets are produced, processed, and consumed at high performance by a wide variety of data tools. Providing solutions that meet the demands of modern games at EA’s scale is a considerable challenge. As part of the central platform organization (a.k.a. EA Digital Platforms, or EADP), our Data and AI team (a.k.a. the EADP Data and AI team) worked closely with AWS to build a resilient solution that directly fits our needs.

In this blog post, we share our experiences during our modernization journey with AWS and talk through how our EADP Data and AI team implemented these solutions. We also discuss our use of Amazon S3 storage classes and S3 features, such as S3 Standard, S3 Intelligent-Tiering, S3 Glacier, S3 Glacier Deep Archive, and object tagging to help us significantly optimize costs and operational overhead.

EA’s telemetry pipeline

The EADP Data and AI team started with simple pipelines collecting data from various games. Back then, the team built aggregates and reports using a self-managed Hadoop deployment on Amazon EC2 instances. Over time, however, the popularity of our games continued to climb, and the volume of generated data grew alongside it. We realized that Hadoop was becoming a bottleneck for building analytics products at that scale: with Hadoop, our storage and compute capacities were coupled, which wasn’t ideal for building customer-facing data marts that typically have unpredictable compute loads. These early use cases, along with our need for archival storage, led us to move our data to Amazon S3 and S3 Glacier. Our storage pipeline with AWS is shown in Figure 1. This approach opened up opportunities for us to spin up multiple compute clusters that operated directly on our telemetry data in Amazon S3, allowing us to scale in pace with our business needs and deliver on the promise of the elastic cloud.


Figure 1: Prefix-based telemetry data lifecycle management (early design)

Overall, the data was distributed into logical databases and tables following traditional Hadoop conventions for storage paths. These datasets were typically produced in Hadoop and propagated into Amazon S3 with objects using the same path structure. This data was managed at the table level, with each table represented by a unique path (prefix) of the following structure. Policies for data access, lifecycle management, ownership, and so on were defined on a per-path/prefix basis:

/warehouse/<database_name>/<table_name>/<partitions and files>
/warehouse/mpst/mpst_event/dt=2021-08-08/hour=23/service=battlefield-1-pc/
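
Under this early design, lifecycle management had to be scoped to each table’s prefix individually. As a rough illustration (not our actual tooling; the bucket name and retention periods are made up), one such per-table rule expressed with boto3 might look like this:

import boto3

s3 = boto3.client("s3")

# One prefix-scoped rule per table. Note that this call replaces the bucket's
# entire lifecycle configuration, so all rules must be managed together.
s3.put_bucket_lifecycle_configuration(
    Bucket="telemetry-bucket",  # illustrative
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "mpst_event-retention",
                "Status": "Enabled",
                "Filter": {"Prefix": "warehouse/mpst/mpst_event/"},
                "Transitions": [{"Days": 360, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 720},
            },
            # ...one rule like this for every table in the bucket
        ]
    },
)

Because each table needed its own rule, the number of rules grew with the number of tables, which created the scaling pressure described in the next section.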

Growth and opportunities with Amazon S3

Over a few years, we accumulated dozens of petabytes of data catering to business cases from several generations of highly successful games. Reflecting on our historical storage solutions about four years ago, we realized we were facing several challenges.

  1. Data growth: The number of datasets, objects, and users started expanding at a much faster rate. We grew to tens of thousands of telemetry tables, with some of them running into hundreds of millions of objects. Many of these tables were stored in a handful of S3 buckets that had bucket-level limits around certain operations, such as fine-grained ACLs. Our existing tools all centered around prefix-based table management, which limited our ability to scale up to more tables in these buckets, and any change to the prefix-based table architecture was too disruptive to the rest of the stack. Our operational overhead was mounting, and the status quo was not an option.
  2. Cost management: As our data grew, so did the costs. Optimizing our storage became a business necessity. We looked into several approaches, including efficient storage formats and compression methods. However, with the rate of growth of our data, it was difficult to keep pace.
  3. Retention/lifecycle management: As mentioned previously, we organized our large buckets into an Apache Hive/Hadoop file structure with data paths, and all management was prefix-based. Depending on the retention/archival needs of each object, we set a prefix-based lifecycle policy. However, our larger buckets grew to more than 10,000 tables, and we ran into S3 limits on the maximum number of lifecycle rules: an S3 Lifecycle configuration can have up to 1,000 rules on a given bucket. To work around this, we intervened manually with operationally inefficient tools, which added to our costs because data was retained longer than needed.
  4. Data usage: Data usage for us varies widely. While there are a large number of structured use cases, it is usually ad-hoc analysis that drives the use of our petabytes of historical data. Given the unpredictable nature of this process, it is hard for us to forecast the right time to archive a given dataset. S3 Glacier allows us to optimize costs for data that isn’t needed immediately, but the time taken to restore archived data for analyst use is often in the critical path for our users (a minimal restore sketch follows this list). Over time, we developed various tools and models to do this at our scale; in practice, however, we often end up being too conservative on table archival dates to minimize impact to users.
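
To give a sense of the restore path mentioned in the last item, the following is a minimal sketch of requesting a restore of an archived object for analyst use. The bucket, key, retrieval tier, and restore duration are illustrative, not a description of our production tooling:

import boto3

s3 = boto3.client("s3")

# Initiate a temporary restore of an object archived in S3 Glacier /
# S3 Glacier Deep Archive so it can be queried again.
s3.restore_object(
    Bucket="telemetry-bucket",
    Key="warehouse/mpst/mpst_event/dt=2021-08-08/hour=23/part-00000",
    RestoreRequest={
        "Days": 7,  # keep the restored copy available for a week of ad-hoc queries
        "GlacierJobParameters": {"Tier": "Bulk"},  # cheapest option; "Standard" is faster
    },
)

Restores like this can take hours to complete, which is why archiving a table too early puts that wait directly in the critical path of an analyst’s work.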

Overall, we had multiple ongoing initiatives building tools to manage these challenges. However, storage management was consuming significant bandwidth from the team. Our team’s engineering effort is best spent on EA-specific challenges that directly support game launches, not on managing our data storage footprint. In addition, our internal data users were also looking to gain more control over their datasets. This meant we were set to upgrade and revamp most of our storage management tools.

Upgrade 1: S3 Glacier Deep Archive

When AWS introduced S3 Glacier Deep Archive in 2019, it was an immediate win for us. With no impact to our existing workflows, we were able to instantly reduce the cost of our archival data; with 20+ PB currently managed in our archives, it provided very significant cost savings. AWS made adopting S3 Glacier Deep Archive easy: it plugged into our existing lifecycle transitions and is part of the standard S3 APIs. After convincing ourselves of its suitability for our workflows, we switched all of our buckets from S3 Glacier directly to S3 Glacier Deep Archive.

Key takeaways:

  • No performance challenges or workflow changes for our workloads.
  • 75% cost savings from cold storage (saving ~$60K/month) within the first quarter of migration.

Upgrade 2: Object tagging and retention management

To scale beyond the prefix-based limitations, we needed a process that works at a table-level abstraction while ensuring that our existing tools and systems were not disrupted. Working with AWS, we switched our approach to lifecycle management based on object tagging, employing only about 100 retention tags. It turns out that our large number of tables needs only a small number of lifecycle policies, and that number is fixed: it does not grow with the number of tables. Because of this, we were able to cater to our large number of tables with ease.

Figure 2 shows our current AWS Lambda based architecture for data/table lifecycle management. A Lambda function listens to S3 object creation events for the applicable buckets. For each captured event, the function parses the bucket name and prefix out of the event, sends these two parameters to the retention metadata service, and composes a retention string from the returned retention values. Finally, the function tags the object with that retention string (for example, Retention: A_360_E_720, indicating that the object will be archived to S3 Glacier Deep Archive 360 days after creation and deleted after 720 days).


Figure 2: Tagging-based data lifecycle management

The following is an example of a metadata service request and response:

GET https://metadata-service-prod.****.ea.com/v1/metadata/bucket1?filePath=data/vpc/mpst/mpst_event/hive/warehouse/mpst/mpst_event/dt=2021-08-08/hour=23/service=battlefield-1-pc/battlefield-1-pc_event-m-00000

{
   "bucket": "bucket1",
   "prefix": "data/vpc/mpst/mpst_event/hive/warehouse/mpst/mpst_event/",
   "db": "mpst",
   "table": "mpst_event",
   "archive": 360,
   "expire": 720,
   ...
}
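
Putting these pieces together, the following is a minimal Python sketch of the tagging flow described above. The metadata endpoint, the Retention tag key, and the archive/expire response fields mirror the example request/response; everything else (the exact URL, error handling, batching) is illustrative rather than our production code:

import json
import urllib.parse
import urllib.request

import boto3

s3 = boto3.client("s3")

# Hypothetical endpoint standing in for the retention metadata service.
METADATA_URL = "https://metadata-service-prod.example.ea.com/v1/metadata"


def lambda_handler(event, context):
    for record in event["Records"]:
        # Parse the bucket and key out of the S3 object-created event.
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Look up the retention values for this bucket/prefix.
        url = f"{METADATA_URL}/{bucket}?filePath={urllib.parse.quote(key)}"
        with urllib.request.urlopen(url) as resp:
            meta = json.load(resp)

        # Compose a retention string such as A_360_E_720
        # (archive after 360 days, expire after 720 days).
        retention = f"A_{meta['archive']}_E_{meta['expire']}"

        # Tag the object; the bucket-level lifecycle rules key off this tag.
        s3.put_object_tagging(
            Bucket=bucket,
            Key=key,
            Tagging={"TagSet": [{"Key": "Retention", "Value": retention}]},
        )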

On the S3 bucket level, we have a set of tag-based lifecycle policies that control storage class transitions based on each object’s retention tag. We successfully consolidated tens of thousands of tables into around 100 lifecycle policies by limiting expiration and archive days to a fixed set of values that reflect our business needs.
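
As a hedged illustration of what one such tag-based rule looks like (the bucket name and rule ID are made up; the tag value follows the Retention format above):

import boto3

s3 = boto3.client("s3")

# One rule per allowed retention value (~100 in total), instead of one rule
# per table.
s3.put_bucket_lifecycle_configuration(
    Bucket="bucket1",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "retention-A_360_E_720",
                "Status": "Enabled",
                # Applies to every object carrying this tag, regardless of prefix.
                "Filter": {"Tag": {"Key": "Retention", "Value": "A_360_E_720"}},
                "Transitions": [{"Days": 360, "StorageClass": "DEEP_ARCHIVE"}],
                "Expiration": {"Days": 720},
            },
            # ...additional rules for the other retention values
        ]
    },
)

Because the set of retention values is fixed, this stays well under the 1,000-rule limit per bucket no matter how many tables we add.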

Key takeaways:

  • All objects are tagged using Lambda functions with each tag representing a predefined retention period.
  • We were able to consolidate the retention of tens of thousands of tables into about 100 lifecycle policies.
  • The solution provided us with fine-grained tools that allow us to implement EA’s strict data policies around access controls and lifecycle management.
  • The solution was backward compatible, and no major changes were needed to other data processing tools in our environment.

Upgrade 3: Amazon S3 Intelligent-Tiering

S3 Intelligent-Tiering provided us an out-of-the-box mechanism to optimize our storage costs, with no data retrieval or lifecycle transition fees. S3 Intelligent-Tiering automatically moves data that has not been accessed for 30 days to the Infrequent Access tier, and back to the Frequent Access tier when we access the data again. The only costs are for the storage itself plus a small per-object monitoring and automation fee, which also means we avoid unpredictable retrieval fees when we do need the data again. To automate cost savings for rarely accessed data, we opted in to the Archive Access tiers in S3 Intelligent-Tiering, which provide the same price and performance as the S3 Glacier and S3 Glacier Deep Archive storage classes. After 90 days of no access, S3 Intelligent-Tiering moves objects to the Archive Access tier, and after 180 days of no access, objects are moved to the Deep Archive Access tier.
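
Opting in to the Archive Access tiers is a bucket-level configuration. A minimal sketch, assuming an illustrative bucket name and configuration ID and the 90/180-day thresholds described above:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_intelligent_tiering_configuration(
    Bucket="telemetry-bucket",  # illustrative
    Id="archive-rarely-accessed-data",
    IntelligentTieringConfiguration={
        "Id": "archive-rarely-accessed-data",
        "Status": "Enabled",
        "Tierings": [
            {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},        # after 90 days of no access
            {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},  # after 180 days of no access
        ],
    },
)

Objects still need to be stored in the S3 Intelligent-Tiering storage class, either by uploading them with StorageClass set to INTELLIGENT_TIERING or via a lifecycle transition; the configuration above only controls when untouched objects move into the archive tiers.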

This pattern is ideal for us. Per our analysis, the majority of our data is used within 30 days of object creation; the rest of the usage is ad-hoc, sporadic, and sparse. When our analysts access data for ad-hoc use, they usually run multiple queries over a few days and complete their analysis. S3 Intelligent-Tiering is ideal for access patterns exactly like ours: unpredictable, changing, or unknown.

Key takeaways:

  • We have not seen any performance gaps or negative effects from switching to the S3 Intelligent-Tiering storage class.
  • With this out-of-the-box solution, we were able to shut down several internal tools that centered around such optimizations.
  • The per-object monitoring cost was negligible compared to the overall storage costs and savings.
  • Overall savings were about 35% of our S3 Standard storage costs (that is HUGE!!).

Current storage architecture

Figure 3 shows our current architecture with S3 Intelligent-Tiering as the foundation of our data lake. We are rapidly adding tools and capabilities that allow for easier interaction with our data lake, with the first phase centered around storage simplification. With the right investments into metadata management, we are enabling a data mesh across EA with our data assets truly democratized.


Figure 3: Current storage architecture

Conclusion

In this blog post, we discussed our journey with large-scale data storage on AWS. We adopted S3 and S3 Glacier in the early days and grew with them in both scale and complexity. Optimizing the costs and operations of an ever-increasing volume of data was a considerable challenge.

With the launch of S3 Intelligent-Tiering and S3 Glacier Deep Archive, our team was able to put these AWS solutions to use immediately. With minimal to no changes to our existing tools, we reduced our archive costs by 75% using S3 Glacier Deep Archive, achieved cost savings of 30% on data with unpredictable access using S3 Intelligent-Tiering, and removed operational overhead. This has helped our data infrastructure team concentrate on other important areas of EADP’s Data and AI stack, such as our core competencies related to game launches. Our collaboration with AWS allows us to focus even more on growing and delighting our customers to continue inspiring the world to play.

Thanks for reading this blog post! If you have any questions or suggestions, please leave your feedback in the comments section.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

Sundeep Narravula

Sundeep Narravula is the principal technical director leading the EADP Data and AI platforms team. As part of the Data and AI platforms, Sundeep’s team handles the central storage infrastructure and provides the storage solutions for the game telemetry across EA. Sundeep has 15+ years of experience in building massive scale platforms in the areas of big data, advertising, experimentation, AI, and high performance computing.

Tony Ma

Tony Ma is the lead architect for the storage management layer for EADP Data and AI Platforms. Tony has 10+ years of experience in building AI/data platforms over cloud infrastructure. Tony is passionate about modernizing traditional platforms to cloud adoption and building highly available, elastic, fully managed, and self-served data platforms for a wide range of EA data users.

Yu Jin

Yu Jin is the senior engineering manager of the data platform team at EA. Yu has 15+ years of experience in research and development of big data and AI platforms for various applications, including data analytics, query engines, online advertising, and fraud detection.