Building an Event-Based Analytics Pipeline for Amazon Game Studios’ Breakaway

All software developers strive to build products that are functional, robust, and bug-free, but video game developers have an extra challenge: they must also create a product that entertains. When designing a game, developers must consider how the various elements—such as characters, story, environment, and mechanics—will fit together and, more importantly, how players will interact with those elements.

It’s not enough to just assume that those interactions occur as intended—is a particular level too difficult? Are the controls responsive enough? Is the combat balanced? While in-house and focus testing can help answer those questions during development, nothing provides a better answer than actual data from real-world players.


We’re currently developing Amazon Game Studios’ new title Breakaway, an online 4v4 team battle sport that delivers fast action, teamwork, and competition. We’re releasing Breakaway in a live alpha state on December 15, 2016, so that we can iterate and improve the game by analyzing massive amounts of gameplay data from our player community.

We’ve built a telemetry pipeline on AWS to ingest, store, and analyze gameplay telemetry to help us answer important game design questions. The elastic scalability and pay-as-you-go cost model of AWS made it the perfect platform for us to quickly assemble a working analytics pipeline during development and be able to scale that pipeline over time to handle our production workload.

We start by thinking about our players, determining what game design questions we want to answer, and then working backward to determine what gameplay telemetry we need. As Breakaway has grown, we’ve started collecting telemetry for almost everything that happens in-game, from coarse-grained events like matches won to fine-grained events like attacks attempted.

Gameplay telemetry is invaluable for improving game design and player experience in numerous ways, but in this blog post, I want to specifically focus on how we use AWS and gameplay telemetry to influence the design of Breakaway’s arenas.

Designing arenas

Arenas are the cornerstone of the Breakaway gameplay experience. The design of an arena can dictate the pace of a match. It influences team strategy and defines where combat occurs. A well-designed arena must provide elements of risk and reward, empower players, encourage exploration, and still be fun to play after tens or even hundreds of matches. Designing an arena is a daunting task.

To better show you how gameplay telemetry can inform arena design, I’m going to discuss Breakaway’s first arena: El Dorado.

[Image: the El Dorado arena]

El Dorado, shown above, is a symmetrical arena. The Relic starts in the center and the opposing teams start next to their colored Relays at each end. A team wins by delivering the Relic to the enemy Relay, eliminating the opposing team, or achieving a territory win based on Relic position when time runs out. There are also secondary objectives atop ziggurats on either side of the arena. On top of these ziggurats are “buff” crystals that provide damage (red crystal) or health (green crystal) boosts to the team that destroys them. Players can fall off (or be knocked off!) the edge of the arena, so they must tread carefully. For deeper context, check out some videos of Breakaway gameplay.

There are many facets to El Dorado’s design, but one major aspect that we know influences the player experience is the distribution of player deaths across the arena. Specifically, we want to answer the question, “What are the most dangerous areas of the arena (that is, where are warriors dying the most)?” If deaths are too evenly distributed, players might become bored by the lack of variety in the arena. If deaths are too heavily concentrated in one area, it could mean that players lack sufficient motivation to explore.

In the remainder of this post, I discuss the technical details of how the Breakaway team uses AWS to collect, process, and analyze gameplay telemetry to answer questions about arena design.

Architecture overview

Before I dive into implementation specifics, it’s important to understand the motivations behind Breakaway’s telemetry architecture. We knew that we wanted a system that could support our internal development efforts, but eventually scale to support our entire player base. We wanted to archive gameplay data for years, but also have that data available for analysis within one hour. Cost efficiency, scalability, and ease of operational maintenance were also major factors.

AWS provides several managed solutions that greatly reduce operational overhead and serve as convenient building blocks for constructing an end-to-end telemetry system. A high-level diagram of Breakaway’s telemetry architecture is shown below:

[Diagram: Breakaway’s telemetry architecture]

Breakaway’s game servers run on Amazon GameLift and transmit gameplay events directly to Amazon Kinesis. We chose to use Amazon Kinesis Streams because it’s a managed service that can scale to ingest massive quantities of data; we wanted to avoid operating, scaling, and maintaining a fleet of servers to do that ourselves.

For long-term data archival, we chose Amazon S3 because it’s a fully managed service that can persist data indefinitely while being cost-effective and highly durable. Amazon S3 is also a good choice because it serves as a gateway to numerous other AWS services, such as Amazon Redshift.

We’ve chosen to use Amazon Redshift as a queryable backend data store because it’s a fully managed service that scales to petabytes of data and provides an SQL query interface that’s familiar to analysts and engineers. We load data into Amazon Redshift using the Amazon Redshift COPY command, which can conveniently read data directly from Amazon S3.

While Amazon Kinesis, Amazon S3, and Amazon Redshift provide our data storage, AWS Elastic Beanstalk provides the glue that holds the architecture together. AWS Elastic Beanstalk is our architectural Swiss Army knife; it greatly simplifies the process of managing and deploying multiple different applications throughout our system. We use AWS Elastic Beanstalk to run Amazon Kinesis Client Library (KCL) applications, trigger periodic cron jobs to perform regular maintenance, and host various web-based tools for analysis and reporting.

Transmitting gameplay events

Whenever an important event occurs in Breakaway (for example, a player death), we transmit it, along with relevant metadata, to our telemetry system. The minimum viable set of metadata for gameplay events is the event type (for example, player death), the arena (for example, El Dorado) and the (x,y) coordinates of the event location in game world space. We also add a unique, client-generated event ID and an event timestamp to de-duplicate events and simplify queries later on. Below is a sample JSON-formatted gameplay event:

{
	"event_id" : "05b00439-6a07-4112-9c8d-165f1643e5d1",
	"event_type" : "player_death",
	"event_timestamp" : "2016-11-01T21:05:18.000Z",
	"arena" : "el_dorado",
	"position_x" : 507.12,
	"position_y" : 551.61
}
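As a rough illustration of how such an event could be assembled, here is a minimal Python sketch. The function name `make_event` is hypothetical, and Breakaway itself builds events in C++ and Lua, so this only mirrors the field layout of the sample above.

```python
import json
import uuid
from datetime import datetime, timezone


def make_event(event_type, arena, x, y, now=None):
    """Build an event dict with the same fields as the sample event."""
    # Normalize to UTC so every timestamp carries the same "Z" suffix.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    millis = now.microsecond // 1000
    return {
        "event_id": str(uuid.uuid4()),  # unique ID used later for de-duplication
        "event_type": event_type,
        "event_timestamp": now.strftime("%Y-%m-%dT%H:%M:%S") + f".{millis:03d}Z",
        "arena": arena,
        "position_x": float(x),
        "position_y": float(y),
    }


event = make_event(
    "player_death", "el_dorado", 507.12, 551.61,
    now=datetime(2016, 11, 1, 21, 5, 18, tzinfo=timezone.utc),
)
payload = json.dumps(event)  # the string that would be sent as a record
```

Generating the ID where the event is created, rather than at ingestion, is what lets a retried event be recognized as a duplicate downstream.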

Breakaway is a server-authoritative game, so all gameplay telemetry is transmitted directly from our servers on Amazon GameLift to the telemetry system. On each game server, we maintain an in-memory ring buffer of pending events and periodically flush them out in batches using the Amazon Kinesis PutRecords API. Because Breakaway is an Amazon Lumberyard game developed using C++ and Lua, we use the AWS C++ SDK to communicate with AWS services directly from our game code.
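The buffering pattern described above can be sketched in Python as follows. This is an illustration under assumptions, not the game’s C++ code: the class `EventBuffer` and the `send_batch` callable (standing in for a PutRecords call) are hypothetical. The chunk size reflects the PutRecords limit of 500 records per request.

```python
from collections import deque

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


class EventBuffer:
    """Fixed-capacity buffer; once full, the oldest pending event is overwritten."""

    def __init__(self, capacity, send_batch):
        self._pending = deque(maxlen=capacity)
        self._send_batch = send_batch  # hypothetical callable wrapping PutRecords

    def add(self, event):
        self._pending.append(event)

    def flush(self):
        """Drain the buffer in chunks no larger than the per-request limit."""
        sent = 0
        while self._pending:
            size = min(len(self._pending), MAX_RECORDS_PER_CALL)
            batch = [self._pending.popleft() for _ in range(size)]
            self._send_batch(batch)
            sent += len(batch)
        return sent


batches = []
buffer = EventBuffer(capacity=1200, send_batch=batches.append)
for n in range(1300):
    buffer.add({"n": n})  # the first 100 events are overwritten
sent = buffer.flush()
```

A real sender would also need to inspect the PutRecords response and resend any records reported as failed. That retry path is omitted here, and it is one reason duplicates can reach the pipeline.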

Archiving gameplay data

Breakaway game servers are constantly pushing gameplay events into Amazon Kinesis and our first priority is to read these events as quickly as possible and archive them. We use the code structure and communication patterns laid out in the Amazon Kinesis Connectors project to marshal data from Amazon Kinesis into Amazon S3 and eventually into Amazon Redshift.

To consume events from Amazon Kinesis, we’ve implemented a KCL consumer application running on top of AWS Elastic Beanstalk to process our incoming telemetry. This initial KCL application, referred to as the S3Connector, is responsible for validating, sanitizing and archiving incoming data. The S3Connector validates each event to ensure that required fields, such as event_id, are present and formatted according to expectations. The S3Connector also sanitizes incoming events to meet the expected format and length so they won’t cause downstream issues loading into Amazon Redshift. The S3Connector batches all valid events in memory and then periodically compresses those batches via Gzip and writes them to Amazon S3.
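The validate, sanitize, and compress steps might look roughly like the following Python sketch. The specific rules (which fields are required, the UUID pattern, the 64-character limit) are assumptions for illustration, and the actual S3Connector is a KCL application rather than this code.

```python
import gzip
import json
import re

REQUIRED_FIELDS = ("event_id", "event_type", "event_timestamp")  # assumed set
UUID_PATTERN = re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$")
MAX_TYPE_LENGTH = 64  # assumed to match a VARCHAR(64) column


def is_valid(event):
    """Check that required fields exist and event_id is shaped like a UUID."""
    if any(field not in event for field in REQUIRED_FIELDS):
        return False
    return bool(UUID_PATTERN.match(str(event["event_id"]).lower()))


def sanitize(event):
    """Truncate over-long strings so they fit the downstream column."""
    cleaned = dict(event)
    # Redshift VARCHAR lengths are in bytes; cutting by characters is a simplification.
    cleaned["event_type"] = str(cleaned["event_type"])[:MAX_TYPE_LENGTH]
    return cleaned


def compress_batch(events):
    """Serialize valid events as newline-delimited JSON and gzip the result."""
    lines = [json.dumps(sanitize(e)) for e in events if is_valid(e)]
    return gzip.compress("\n".join(lines).encode("utf-8"))


good = {
    "event_id": "05b00439-6a07-4112-9c8d-165f1643e5d1",
    "event_type": "x" * 100,
    "event_timestamp": "2016-11-01T21:05:18.000Z",
}
bad = {"event_type": "player_death"}  # missing event_id, so it is dropped
blob = compress_batch([good, bad])
```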

It’s important to batch events to both reduce S3 PUT request costs and optimize file sizes for loading into Amazon Redshift. To be able to easily find data from a particular date and time later, we format our S3 keys as follows:

<year>/<month>/<day>/<hour>/<startSequenceNumber>-<endSequenceNumber>.gzip
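A small Python sketch of building a key in this layout follows. The helper name `build_s3_key` is hypothetical, zero-padding of the date parts is an assumption, and the sequence numbers are made-up short values (real Amazon Kinesis sequence numbers are much longer).

```python
from datetime import datetime, timezone


def build_s3_key(batch_time, start_seq, end_seq):
    """Format a key as <year>/<month>/<day>/<hour>/<start>-<end>.gzip."""
    prefix = batch_time.strftime("%Y/%m/%d/%H")  # zero padding is an assumption
    return f"{prefix}/{start_seq}-{end_seq}.gzip"


key = build_s3_key(datetime(2016, 11, 1, 21, tzinfo=timezone.utc), "4954", "4987")
```

Because the date parts lead the key, listing a prefix such as `2016/11/01/` returns every batch written for that day.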

After a batch telemetry file is successfully written to Amazon S3, the S3Connector sends a pointer to the location of the file in Amazon S3 to a secondary Amazon Kinesis stream to initiate the process of loading it into Amazon Redshift.

Making gameplay data queryable

To process files of events written to Amazon S3 by the S3Connector, we’ve implemented a second KCL consumer application (also running on top of AWS Elastic Beanstalk) called the RedshiftConnector. The RedshiftConnector is responsible for loading batches of events from Amazon S3, getting rid of duplicate events and inserting events into tables in Amazon Redshift.

A simple SQL CREATE statement for an Amazon Redshift table to store gameplay events might look something like this:

CREATE TABLE game.events_2016_11
( 
    event_id VARCHAR(36) NOT NULL ENCODE LZO,
    event_timestamp TIMESTAMP NOT NULL ENCODE RAW,
    event_type VARCHAR(64) NOT NULL ENCODE LZO,
    arena VARCHAR(64) ENCODE LZO,
    position_x FLOAT ENCODE RAW,
    position_y FLOAT ENCODE RAW,
    PRIMARY KEY(event_id)
)
DISTKEY(event_id)
COMPOUND SORTKEY(event_timestamp);

When you design an Amazon Redshift table schema, it’s important to choose the best distribution style to maximize performance. We chose to distribute data evenly across the cluster by using the unique event_id as our DISTKEY. We also defined a primary key constraint on the event_id column to assist the query optimizer in generating more efficient plans. Amazon Redshift treats primary keys as informational and does not enforce them, so duplicates still have to be removed at load time.

To load data from Amazon S3 into Amazon Redshift, we follow the guidelines from Best Practices for Micro-Batch Loading on Amazon Redshift. We also perform some extra processing to further improve load performance and filter out duplicate events.

First of all, we use the event_timestamp field to organize events into time-series tables that improve query time and decrease the overhead of VACUUM operations. We hide our time-series tables behind a UNION ALL view to maintain ease of querying.
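The time-series layout can be sketched as two Python helpers that generate table names and view DDL. The monthly granularity is inferred from the `game.events_2016_11` table name above, the view name `game.events` is taken from the query later in this post, and both helper names are hypothetical.

```python
def table_for(event_timestamp):
    """Map an ISO-8601 timestamp like '2016-11-01T21:05:18.000Z' to its monthly table."""
    year, month = event_timestamp[:4], event_timestamp[5:7]
    return f"game.events_{year}_{month}"


def build_view_sql(tables, view_name="game.events"):
    """Return DDL for a view that unions every monthly table."""
    selects = "\nUNION ALL\n".join(f"SELECT * FROM {table}" for table in tables)
    return f"CREATE OR REPLACE VIEW {view_name} AS\n{selects};"


view_sql = build_view_sql(["game.events_2016_10", "game.events_2016_11"])
```

With this layout, retiring old data means dropping a whole table and regenerating the view, instead of deleting rows and vacuuming.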

Secondly, the RedshiftConnector loads multiple files in parallel by using the manifest file load feature of the Amazon Redshift COPY command. Using manifest files avoids eventual consistency issues reading files from Amazon S3. We also COPY using a JSONPath file to decouple our telemetry event format from our Amazon Redshift table schemas so that they can be updated independently.
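For illustration, the sketch below builds a manifest, a JSONPaths document, and a matching COPY statement in Python. The bucket, table, and role names passed in are hypothetical, and the exact COPY options the RedshiftConnector uses are not stated in this post, so treat the statement as one plausible form.

```python
import json

# JSONPaths entries are matched to table columns by position, so the order
# here must follow the column order of the target table.
JSONPATHS = {
    "jsonpaths": [
        "$.event_id", "$.event_timestamp", "$.event_type",
        "$.arena", "$.position_x", "$.position_y",
    ]
}


def build_manifest(bucket, keys):
    """List every file to load; 'mandatory' makes a missing file fail the COPY."""
    return {"entries": [{"url": f"s3://{bucket}/{key}", "mandatory": True} for key in keys]}


def build_copy_sql(table, manifest_url, jsonpaths_url, iam_role):
    """Assemble a COPY that reads gzipped JSON files named in a manifest."""
    return (
        f"COPY {table}\n"
        f"FROM '{manifest_url}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"JSON '{jsonpaths_url}'\n"
        "GZIP\n"
        "MANIFEST\n"
        "TIMEFORMAT 'auto';"
    )


manifest = build_manifest("example-telemetry-bucket", ["2016/11/01/21/4954-4987.gzip"])
manifest_json = json.dumps(manifest, indent=2)
copy_sql = build_copy_sql(
    "game.events_staging",  # hypothetical staging table
    "s3://example-telemetry-bucket/manifests/batch-0001.manifest",
    "s3://example-telemetry-bucket/config/jsonpaths.json",
    "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
```

Since the mapping lives in the JSONPaths file, a field can be added to the event format without touching the table, and a column can be renamed without changing what the game servers send.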

Finally, we exploit the fact that every telemetry event has a unique, client-generated event_id field to make our loads idempotent. We first load telemetry data into a temporary staging table and then perform a modified merge operation to remove duplicates. We use a SELECT DISTINCT query and then LEFT JOIN our staging table against our destination table on event_id to get rid of duplicate records that may have been introduced by Amazon Kinesis-related retries or during backfills of old data. We also make sure to load our data in sort key order to reduce or even eliminate the need for VACUUM operations.
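The merge step can be demonstrated end to end with a small Python script. It uses the standard-library `sqlite3` module purely as a local stand-in for Amazon Redshift; the table names `staging` and `events` and the reduced column set are assumptions, and the SQL is a sketch of the SELECT DISTINCT plus LEFT JOIN pattern described above rather than the exact production query.

```python
import sqlite3

# Insert only rows whose event_id is not already in the destination table.
MERGE_SQL = """
INSERT INTO events (event_id, event_timestamp, event_type)
SELECT DISTINCT s.event_id, s.event_timestamp, s.event_type
FROM staging AS s
LEFT JOIN events AS e ON s.event_id = e.event_id
WHERE e.event_id IS NULL
ORDER BY s.event_timestamp
"""


def load_batch(conn, rows):
    """Stage the rows, then merge the ones the destination lacks."""
    conn.execute("DELETE FROM staging")
    conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)
    conn.execute(MERGE_SQL)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT, event_timestamp TEXT, event_type TEXT)")
conn.execute("CREATE TABLE staging (event_id TEXT, event_timestamp TEXT, event_type TEXT)")

batch = [
    ("a", "2016-11-01T21:05:18", "player_death"),
    ("a", "2016-11-01T21:05:18", "player_death"),  # duplicate from a retry
    ("b", "2016-11-01T21:05:19", "player_death"),
]
load_batch(conn, batch)
load_batch(conn, batch)  # replaying the same file adds nothing
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

SELECT DISTINCT collapses duplicates inside a single batch, while the join filters out events loaded by an earlier batch. Together they make reloading a file harmless.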

After data has been successfully loaded into Amazon Redshift, it is available for reporting, ad-hoc querying and more detailed analysis.

Analyzing gameplay data

We’ve found that telemetry analysis works best when it follows the scientific method: ask a question, form a hypothesis, run an experiment, and analyze the results. Earlier in this post, we examined the El Dorado arena and asked the question, “What are the most dangerous areas of the arena (that is, where are warriors dying the most)?” Now that our gameplay telemetry is available in Amazon Redshift, we’re equipped to answer that question.

Looking at the arena, we hypothesize that many deaths will occur in the middle of the arena where the Relic is heavily contested, and we theorize that players will periodically plummet to their doom while making risky maneuvers near the arena’s edge. To test these hypotheses, we’ve instrumented a telemetry hook that will send an event, complete with an (x,y) position, every time a warrior dies during a match.

A common method for analyzing spatial data in games is to overlay color-coded, aggregate data on top of a two-dimensional arena image to create a heatmap. Heatmaps are great for visualizing high-level summaries of where player activity is happening. On Breakaway, we generate heatmaps by writing an SQL query to aggregate all the relevant gameplay data into spatial bins and then using custom display logic to overlay the aggregated data onto an arena screenshot.

We created an interactive heatmap generator built on top of an AWS Elastic Beanstalk web server environment. We use jQuery and Bootstrap on the frontend and Java, Apache Tomcat, and Spring MVC on the backend. The website frontend gives users the ability to generate heatmaps based on various parameters such as event_type, arena, and time range. When the user submits a request, a Java servlet dynamically creates an SQL query and sends it to Amazon Redshift to bin all the data into one-meter-by-one-meter cells. A query to aggregate player deaths spatially looks something like this:

SELECT FLOOR(position_x), FLOOR(position_y), COUNT(*)
FROM game.events
WHERE arena='el_dorado'
AND event_type='player_death'
GROUP BY 1, 2
ORDER BY 1, 2;
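Building this query from user-selected parameters might be sketched as below. The real backend is a Java servlet, so this Python function (with the hypothetical name `build_heatmap_query`) only illustrates the idea. It uses `%s` placeholders in the style of DB-API drivers such as psycopg2, so that user input is passed as parameters and not spliced into the SQL text.

```python
def build_heatmap_query(event_type, arena, start=None, end=None):
    """Return (sql, params) for event counts binned into one-meter cells."""
    clauses = [
        "SELECT FLOOR(position_x), FLOOR(position_y), COUNT(*)",
        "FROM game.events",
        "WHERE arena = %s",
        "AND event_type = %s",
    ]
    params = [arena, event_type]
    # The time range is optional; each bound adds one clause and one parameter.
    if start is not None:
        clauses.append("AND event_timestamp >= %s")
        params.append(start)
    if end is not None:
        clauses.append("AND event_timestamp < %s")
        params.append(end)
    clauses += ["GROUP BY 1, 2", "ORDER BY 1, 2;"]
    return "\n".join(clauses), params


sql, params = build_heatmap_query("player_death", "el_dorado", start="2016-11-01")
```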

Amazon Redshift is optimized for aggregating large amounts of data quickly, so it makes short work of binning events into cells. The resulting aggregated data is returned to our custom website, which uses an HTML5 canvas to create an interactive heatmap where users can change color gradients, re-bin data by varying hex sizes and pan or zoom to explore the heatmap. Below is a heatmap that visualizes player deaths in El Dorado:

[Image: heatmap of player deaths in El Dorado]
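To give a sense of how aggregated rows become colors, here is a minimal Python sketch. The real display logic runs in the browser on an HTML5 canvas and is not shown in this post. The log scaling, the blue-to-red gradient, and both function names are assumptions chosen for illustration.

```python
import math


def to_intensity(rows):
    """Map (cell_x, cell_y, count) rows to {(x, y): intensity}, log-scaled to (0, 1]."""
    if not rows:
        return {}
    peak = max(count for _, _, count in rows)
    scale = math.log1p(peak)  # counts are at least 1, so scale is positive
    return {(int(x), int(y)): math.log1p(count) / scale for x, y, count in rows}


def to_rgba(intensity):
    """Blend from transparent blue (few events) to mostly opaque red (many events)."""
    red = round(255 * intensity)
    return (red, 0, 255 - red, round(0.8 * intensity, 2))


cells = to_intensity([(507.0, 551.0, 9), (10.0, 20.0, 1)])
hottest = to_rgba(cells[(507, 551)])
```

Log scaling keeps a handful of very busy cells from washing out the rest of the arena. A linear scale would be an equally valid choice when the counts are more even.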

Revisiting our initial hypotheses, we can see that many player deaths do indeed occur in the middle of the arena. Also, as predicted, there is a good amount of color outside the arena image, which indicates deaths from falling (or getting pushed!) off the edge. The corners near the ziggurats where the arena widens seem especially dangerous. Interestingly, deaths on top of the ziggurats near the buff crystals seem less frequent than we’d expect. There are many possible explanations for this behavior, ranging from players not understanding the value of the buff crystals to players not deeming them worth the risk.

The larger point is that, at a glance, this heatmap has surfaced an anomaly that warrants further investigation and we can use more gameplay telemetry to further drill down into buff crystal usage. Arena designs are not static; we continue to iterate and improve them over time with the help of real-world gameplay data.

Conclusion

Breakaway is a 4v4 team battle sport whose future will be shaped by analyzing gameplay telemetry from real-world players. By using AWS, we’re able to easily ingest large quantities of data, perform in-depth analysis into player behavior, and iterate on our game design to create the best player experience possible. A single developer was able to design and build this end-to-end system in a matter of weeks early in Breakaway’s development and we continue to use it today, with minimal modifications, in our live production environment.

If you’d like to be part of the community driving the future of Breakaway, come find out more at https://playbreakaway.com. We’ll see you in the arena!

If you have questions or suggestions, please leave a comment below.

About the Author

Brent Nash is a Senior Software Development Engineer for Amazon Game Studios. He’s currently involved in designing and building a number of backend services for the Breakaway project. In his spare time, he enjoys following Liverpool FC, playing rogue-likes, reading any fantasy novel with a dragon on the cover, and having dance parties with his daughters.

AWS Big Data Blog
