AWS for Games Blog

Gaming Analytics: Leveraging AWS Glue and Amazon Redshift Spectrum for Player Insights

Introduction

In the dynamic landscape of game development, efficient data management and analysis are pivotal for optimizing player experiences and driving business growth. Game developers and analysts often encounter the challenge of aggregating data from diverse sources, ranging from real-time operational metrics to historical analytical records. To address these challenges, AWS provides a robust suite of analytics services, including AWS Glue for data preparation and transformation and Amazon Redshift Spectrum for seamless querying across data warehouses and data lakes. This article explores the integration of AWS Glue and Amazon Redshift Spectrum to streamline the process of joining operational and analytical data for gaming analytics. By leveraging these services, game developers can extract valuable insights from disparate data sources while minimizing development effort and operational costs.

A cloud computing architecture diagram showing Amazon RDS for MySQL connected with an arrow to AWS Glue, which is connected with an arrow to Amazon S3 (Parquet data), which is connected to Amazon Redshift Spectrum (external tables). The Redshift Spectrum icon has an arrow to the Amazon Redshift cluster (tables) icon and another arrow to View (reports, dashboards, etc.)

Redshift Spectrum and AWS Glue setup requirements

To illustrate this integration, you’ll use Amazon Aurora MySQL-Compatible Edition for operational data and Amazon Redshift for analytical data storage. This scenario involves joining player data from Amazon Aurora MySQL with player statistics stored in Amazon Redshift. Before diving into the implementation, you’ll step through the prerequisite setup, including the creation of Amazon Virtual Private Cloud (Amazon VPC) endpoints and appropriate AWS Identity and Access Management (IAM) roles, and the download of a database driver for connectivity.

Security prerequisites

  1. Add a self-referential rule to the Amazon Aurora MySQL security group.

A screenshot of the EC2 security group rules showing a self-referential rule. The Source of the rule is highlighted to show the security group ID being added for the referenced security group.

  2. Create an AWS Glue role to call other AWS services.

A screenshot of the MyGlueServiceRole and the AWSGlueServiceRole highlighted to show inclusion in the custom role.

  3. Create a Redshift Spectrum role to allow Amazon Redshift to call other AWS services. The Amazon Redshift CREATE EXTERNAL SCHEMA command uses this role.

A screenshot of the MyRedshiftSpectrumRole and the AWSGlueServiceRole highlighted to show inclusion in the custom role. Also, an arrow indicating selection of the 'Create inline policy' option.

  4. Add an inline policy to MyRedshiftSpectrumRole to allow actions for Amazon Simple Storage Service (Amazon S3), AWS Glue, and AWS Lake Formation.
    a. Choose the Permissions tab, Add permissions, and Create inline policy.
    b. Under Specify permissions, toggle JSON and paste the following policy into the Policy editor.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:List*"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateDatabase",
        "glue:DeleteDatabase",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:UpdateDatabase",
        "glue:CreateTable",
        "glue:DeleteTable",
        "glue:BatchDeleteTable",
        "glue:UpdateTable",
        "glue:GetTable",
        "glue:GetTables",
        "glue:BatchCreatePartition",
        "glue:CreatePartition",
        "glue:DeletePartition",
        "glue:BatchDeletePartition",
        "glue:UpdatePartition",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition",
        "lakeformation:GetDataAccess"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
    c. Choose Next and Create policy. (A scripted alternative using boto3 is sketched below.)
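
If you prefer to script this step, the same inline policy can be attached with the AWS SDK for Python (boto3). The following is a minimal sketch; the policy name (SpectrumInlinePolicy) is a placeholder, and the policy document is abbreviated here, so paste the full JSON shown above in practice.

# Minimal boto3 sketch: attach the inline policy to the Spectrum role.
import json

import boto3

iam = boto3.client("iam")

# Abbreviated policy document; use the full JSON from the step above in practice
spectrum_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:Get*", "s3:List*"], "Resource": "*"},
        {"Effect": "Allow",
         "Action": ["glue:GetTable", "glue:GetPartitions", "lakeformation:GetDataAccess"],
         "Resource": ["*"]},
    ],
}

iam.put_role_policy(
    RoleName="MyRedshiftSpectrumRole",   # role created in step 3
    PolicyName="SpectrumInlinePolicy",   # placeholder policy name
    PolicyDocument=json.dumps(spectrum_policy),
)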

Endpoint prerequisites

  1. To allow AWS Glue access to Amazon S3 from within an Amazon VPC, create an Amazon VPC Gateway Endpoint for Amazon S3.

AWS console screenshot of VPC s3 endpoint showing the name of the endpoint and highlighting the type of endpoint: gateway.

  2. AWS Glue requires an Amazon VPC interface endpoint to use a JDBC connection; this endpoint enables networking between AWS Glue and Amazon Aurora MySQL. Both endpoints can also be created programmatically, as sketched after this step.

AWS console screenshot of VPC Glue endpoint showing the name of the endpoint and highlighting the type of endpoint: interface.
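
As an alternative to the console, both endpoints can be created with boto3. The sketch below assumes placeholder VPC, route table, subnet, and security group IDs and the us-west-2 Region used later in this walkthrough.

# Minimal boto3 sketch: create the S3 gateway endpoint and the Glue interface endpoint.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Gateway endpoint so AWS Glue can reach Amazon S3 from inside the VPC
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",                 # placeholder VPC ID
    ServiceName="com.amazonaws.us-west-2.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],       # placeholder route table
)

# Interface endpoint so AWS Glue can use the JDBC connection to Aurora MySQL
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-west-2.glue",
    SubnetIds=["subnet-0123456789abcdef0"],        # placeholder subnet
    SecurityGroupIds=["sg-0123456789abcdef0"],     # placeholder security group
    PrivateDnsEnabled=True,
)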

JDBC driver prerequisite

1. For the final prerequisite step, download and store the latest MySQL driver in Amazon S3. The JDBC connection requires the driver for AWS Glue to crawl the Amazon Aurora MySQL table.

2. To download the connector, choose mysql-connector-j-8.3.0.zip

3. Create an Amazon S3 bucket to host the preceding driver and upload the driver jar to the bucket. No bucket policy is needed for access, as the AWS Glue service role created earlier has the necessary read permissions. A minimal upload sketch follows.
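
The upload itself can also be scripted. This sketch assumes a placeholder bucket name and key, and that the jar has been extracted from the downloaded zip.

# Minimal boto3 sketch: upload the extracted JDBC driver jar to Amazon S3.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="mysql-connector-j-8.3.0/mysql-connector-j-8.3.0.jar",  # local path after unzipping
    Bucket="EXAMPLE-BUCKET",                                         # placeholder driver bucket
    Key="drivers/mysql-connector-j-8.3.0.jar",
)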

Operational and analytics data stores

  1. Queries use an Amazon Aurora MySQL table:

The schema of the player table: id INT(11), player_id BIGINT(20), community_id INT(11), created_ts DATETIME, updated_ts DATETIME

  2. And an Amazon Redshift provisioned cluster table:

Redshift provisioned cluster table schema: player_id BIGINT, total_seconds_played DOUBLE PRECISION, total_session_cnt BIGINT, total_payment_cnt BIGINT, total_payment_amount DOUBLE PRECISION

With prerequisites in place and data stores defined, let’s revisit our objectives: extracting operational data from Amazon Aurora MySQL, transforming it by eliminating unnecessary columns, and loading the data into Amazon S3 with date-format partitions for seamless querying via Redshift Spectrum. The extract, transform, and load (ETL) operations must be repeatable to accommodate recurring reporting needs. AWS Glue provides built-in Apache Spark and Python environments for executing transformation jobs, along with data connections and workflow orchestration capabilities. The strategy requires deploying connections, crawlers, jobs, and a workflow to prepare the data for integration with Redshift tables.

Data extraction and transformation with AWS Glue

  1. First, define data sources in AWS Glue by creating crawlers. These crawlers scan the Amazon Aurora MySQL instance and the data stored in Amazon S3, updating the AWS Glue Data Catalog with schema and partition information.
  2. Create AWS Glue connections for Amazon Aurora MySQL and Amazon S3 to allow AWS Glue crawlers to connect to the database instance for data extraction and to write data to Amazon S3. Add the Amazon S3 bucket hosting the MySQL connector jar file and use com.mysql.cj.jdbc.Driver for ‘Driver class name’.

Screenshot of the AWS Glue connection for Aurora MySQL with the connection name, Aurora MySQL Connection, highlighted. Also highlighted: the connection URL, the driver class name (com.mysql.cj.jdbc.Driver), and the driver path to the S3 bucket containing the connector jar file.

  3. To read objects in Amazon S3 and update the AWS Glue Data Catalog with the transformed schema and new partitions, create an AWS Glue network connection. A boto3 sketch of both connections follows the screenshot below.

Screenshot of AWS Glue connection for S3 with connector type : network highlighted
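
Both connections can also be defined with boto3. The sketch below uses placeholder endpoint, credential, subnet, security group, and bucket values; the connection names are assumptions modeled on the screenshots above.

# Minimal boto3 sketch: create the JDBC and network connections.
import boto3

glue = boto3.client("glue", region_name="us-west-2")

# Placement details shared by both connections (placeholders)
vpc_placement = {
    "SubnetId": "subnet-0123456789abcdef0",
    "SecurityGroupIdList": ["sg-0123456789abcdef0"],
    "AvailabilityZone": "us-west-2a",
}

# JDBC connection used by the crawler and job to reach Aurora MySQL
glue.create_connection(
    ConnectionInput={
        "Name": "Aurora MySQL Connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://aurora-cluster-endpoint:3306/playerDB",
            "JDBC_DRIVER_CLASS_NAME": "com.mysql.cj.jdbc.Driver",
            "JDBC_DRIVER_JAR_URI": "s3://EXAMPLE-BUCKET/drivers/mysql-connector-j-8.3.0.jar",
            "USERNAME": "admin",
            "PASSWORD": "EXAMPLE-PASSWORD",
        },
        "PhysicalConnectionRequirements": vpc_placement,
    }
)

# Network connection used by the S3 crawler to run inside the VPC
glue.create_connection(
    ConnectionInput={
        "Name": "S3 Network Connection",
        "ConnectionType": "NETWORK",
        "PhysicalConnectionRequirements": vpc_placement,
    }
)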

Configure AWS Glue data crawlers

  1. Define two crawlers to extract data from Amazon Aurora MySQL and catalog data written to Amazon S3. The rds_player_db_crawler uses the JDBC connection and identifies the playerDB database and player table as data source.

Screenshot of AWS Glue console crawler configuration with crawler name "rds_player_crawler" highlighted, data source tab pointed out with arrow and the data source type and datasource highlighted: Type: JDBC Data Source: playerDB/player

  2. The s3_player_db_crawler uses the previously created network connection to support crawling of Amazon S3 objects and updating the AWS Glue Data Catalog with table and partition metadata. A boto3 sketch of both crawlers follows the screenshot below.

Screenshot of AWS Glue console crawler configuration with crawler name "s3_player_db_crawler" highlighted, data source tab pointed out with arrow and the data source type and datasource highlighted: Type: S3 Data Source: s3 bucket (obscured) slash player-db/player Parameters: Recrawl all
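
For repeatable environments, the two crawlers can be created with boto3 as well. This sketch assumes the connection names from the previous sketch and a placeholder bucket path; both crawlers write to the default Data Catalog database used by the job later.

# Minimal boto3 sketch: create the JDBC and S3 crawlers.
import boto3

glue = boto3.client("glue", region_name="us-west-2")

# JDBC crawler for the Aurora MySQL player table
glue.create_crawler(
    Name="rds_player_db_crawler",
    Role="MyGlueServiceRole",
    DatabaseName="default",
    Targets={"JdbcTargets": [{"ConnectionName": "Aurora MySQL Connection",
                              "Path": "playerDB/player"}]},
)

# S3 crawler that catalogs the transformed Parquet objects and their partitions
glue.create_crawler(
    Name="s3_player_db_crawler",
    Role="MyGlueServiceRole",
    DatabaseName="default",
    Targets={"S3Targets": [{"Path": "s3://EXAMPLE-BUCKET/player-db/player/",
                            "ConnectionName": "S3 Network Connection"}]},
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVERYTHING"},
)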

Transform data with AWS Glue jobs

1. Next, create an AWS Glue job to transform the operational data extracted from Amazon Aurora MySQL. The job involves dropping redundant columns, formatting data, and writing the transformed data to Amazon S3 in compressed Apache Parquet format. Additionally, you’ll generate a timestamp parameter to facilitate partitioning for optimized query performance and cost efficiency.

2. Create a transform_player_table job using the Apache Spark runtime environment and the Aurora MySQL connection.  Portions of transform_player_table were generated using Amazon Q data integration in AWS Glue.

import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from datetime import datetime

# Convert job runtime to timestamp format, format for partition query
job_run_timestamp = datetime.now().strftime("year=%Y/month=%m/day=%d/hour=%H/minute=%M")
runtime_params_string = job_run_timestamp.replace("/", ",")

# Set s3 partition path
output_path = f"s3://EXAMPLE-BUCKET/player-db/player/{job_run_timestamp}/"

# Create glue session
glueContext = GlueContext(SparkContext.getOrCreate())

# Get runtime parameters
glue_client = boto3.client("glue")
args = getResolvedOptions(sys.argv, ["JOB_NAME", "WORKFLOW_NAME", "WORKFLOW_RUN_ID"])
workflow_name = args["WORKFLOW_NAME"]
workflow_run_id = args["WORKFLOW_RUN_ID"]
workflow_params = glue_client.get_workflow_run_properties(
    Name=workflow_name, RunId=workflow_run_id
)["RunProperties"]

# Set runtime parameters with formatted timestamp
workflow_params["job_run_timestamp"] = runtime_params_string
glue_client.put_workflow_run_properties(
    Name=workflow_name, RunId=workflow_run_id, RunProperties=workflow_params
)

# Create a frame for player table
playerdb_player = glueContext.create_dynamic_frame.from_catalog(
    database="default", table_name="playerdb_player"
)

# Drop unnecessary columns from the DynamicFrame
playerdb_player = playerdb_player.drop_fields(["id"])

# Write frame to s3 in compressed parquet with partitioned path
glueContext.write_dynamic_frame.from_options(
    frame=playerdb_player,
    connection_type="s3",
    connection_options={"path": output_path, "region": "us-west-2"},
    format="parquet",
    format_options={"compression": "SNAPPY"},
)

Screenshot of AWS Glue job details with name: transform_player_table highlighted. Following are selected as configuration choices: IAM role: MyGlueServiceRole Type: Spark Glue Version: Glue 4.0 Language: Python 3 Worker type: G 1X
  3. The job writes the operational data to Amazon S3 with timestamped prefixes in compressed Parquet format. The Amazon S3 crawler run updates the AWS Glue Data Catalog with new partition metadata. This partitioning scheme supports reporting requirements and optimizes queries for cost and performance. When querying the external Redshift Spectrum table, constraining by partition reduces the amount of data scanned and the costs associated with Redshift Spectrum usage. A boto3 sketch of the job definition follows.
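
For reference, the job configuration shown in the screenshot can also be created with boto3. The script location and worker count below are placeholders.

# Minimal boto3 sketch: define the transform_player_table job.
import boto3

glue = boto3.client("glue", region_name="us-west-2")

glue.create_job(
    Name="transform_player_table",
    Role="MyGlueServiceRole",
    Command={
        "Name": "glueetl",                                               # Spark runtime
        "ScriptLocation": "s3://EXAMPLE-BUCKET/scripts/transform_player_table.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,                                                   # placeholder worker count
    Connections={"Connections": ["Aurora MySQL Connection"]},
)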

AWS Glue workflows for data pipeline orchestration

  1. With the data extraction and transformation processes defined, orchestrate the pipeline using AWS Glue workflows. By creating a workflow, you automate the execution of the AWS Glue crawlers and job, ensuring a seamless and repeatable process for preparing data for analysis. A boto3 sketch of the workflow and its triggers follows the screenshot below.

Screenshot of AWS Glue console Workflows page for player_db_to_datawarehouse workflow. Key for default run properties is highlighted with value of job_run_timestamp. The graph section of the page shows a partial view of the workflow including: Start: player_db_snapshot crawler Next step: rds_player_db_crawler Next step: crawler completed Next step: transform_player_tables
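
The workflow and its triggers can likewise be created with boto3. The trigger names below are assumptions, while the workflow, crawler, and job names match the walkthrough.

# Minimal boto3 sketch: create the workflow, chain the steps with triggers, and start a run.
import boto3

glue = boto3.client("glue", region_name="us-west-2")

glue.create_workflow(
    Name="player_db_to_datawarehouse",
    DefaultRunProperties={"job_run_timestamp": ""},   # populated by the job at run time
)

# Start trigger: run the JDBC crawler on demand
glue.create_trigger(
    Name="start_rds_crawl", WorkflowName="player_db_to_datawarehouse",
    Type="ON_DEMAND", Actions=[{"CrawlerName": "rds_player_db_crawler"}],
)

# When the JDBC crawl succeeds, run the transform job
glue.create_trigger(
    Name="run_transform", WorkflowName="player_db_to_datawarehouse",
    Type="CONDITIONAL", StartOnCreation=True,
    Predicate={"Conditions": [{"LogicalOperator": "EQUALS",
                               "CrawlerName": "rds_player_db_crawler",
                               "CrawlState": "SUCCEEDED"}]},
    Actions=[{"JobName": "transform_player_table"}],
)

# When the job succeeds, catalog the new S3 partitions
glue.create_trigger(
    Name="run_s3_crawl", WorkflowName="player_db_to_datawarehouse",
    Type="CONDITIONAL", StartOnCreation=True,
    Predicate={"Conditions": [{"LogicalOperator": "EQUALS",
                               "JobName": "transform_player_table",
                               "State": "SUCCEEDED"}]},
    Actions=[{"CrawlerName": "s3_player_db_crawler"}],
)

glue.start_workflow_run(Name="player_db_to_datawarehouse")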

Redshift Spectrum setup

Before querying the data, set up Redshift Spectrum to access data stored in Amazon S3. This involves creating an external schema in Amazon Redshift that mirrors the schema of the transformed data stored in Amazon S3.

1. The external schema command requires a schema name, a data catalog name, a region, and the ‘MyRedshiftSpectrumRole’ created in the prerequisite steps.

CREATE EXTERNAL SCHEMA IF NOT EXISTS player_stats_s
FROM DATA CATALOG DATABASE 'default'
IAM_ROLE 'arn:aws:iam::111111111111:role/MyRedshiftSpectrumRole'
REGION 'us-west-2'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

  2. For the last Redshift Spectrum step, GRANT USAGE on the external schema to required users. A sketch of running both statements through the Amazon Redshift Data API follows.

GRANT USAGE ON SCHEMA player_stats_s TO PUBLIC;
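
If you want to automate the Redshift Spectrum setup, both statements can be submitted through the Amazon Redshift Data API. The cluster identifier, database, and database user below are placeholders.

# Minimal boto3 sketch: run the external schema DDL and GRANT via the Redshift Data API.
import boto3

rsd = boto3.client("redshift-data", region_name="us-west-2")

create_schema_sql = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS player_stats_s
FROM DATA CATALOG DATABASE 'default'
IAM_ROLE 'arn:aws:iam::111111111111:role/MyRedshiftSpectrumRole'
REGION 'us-west-2'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

for sql in (create_schema_sql, "GRANT USAGE ON SCHEMA player_stats_s TO PUBLIC;"):
    rsd.execute_statement(
        ClusterIdentifier="example-redshift-cluster",  # placeholder cluster
        Database="dev",                                # placeholder database
        DbUser="awsuser",                              # placeholder database user
        Sql=sql,
    )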

Query data with Redshift Spectrum

Once the integration is complete, query the transformed dataset using Amazon Redshift’s SQL capabilities. By leveraging Redshift Spectrum, we can query data stored in Amazon S3 alongside data in an Amazon Redshift cluster, supporting powerful analytics and reporting capabilities.

Select all available partitions:

SELECT schemaname, tablename, values, location FROM svv_external_partitions;

Select all rows from a specific partition:

SELECT * FROM player_stats_s.playerdb_player
WHERE (year = '2024' and month = '3' and day = '14' and hour = '14' and minute = '0');

Select all rows joining the Amazon Redshift provisioned cluster table and the Redshift Spectrum external table for a specific partition:

SELECT COUNT(*)
FROM playerdb_player
JOIN player_stats_s.playerdb_player on playerdb_player.player_id = player_stats_s.playerdb_player.player_id
WHERE (year = '2024' AND month = '3' AND day = '14' AND hour = '14' AND minute = '0');

With data available from both data sources, we can join Amazon Redshift provisioned cluster tables and Redshift Spectrum external tables. Let’s query to find players and communities with the highest number of seconds played:

SELECT DISTINCT playerdb_player.player_id, player_stats_s.playerdb_player.community_id, playerdb_player.total_seconds_played
FROM playerdb_player
JOIN player_stats_s.playerdb_player on playerdb_player.player_id = player_stats_s.playerdb_player.player_id
WHERE (year = '2024' AND month = '3' AND day = '14' AND hour = '14' AND minute = '0')
ORDER BY playerdb_player.total_seconds_played DESC;

Lastly, communities with the highest payment amounts:

SELECT player_stats_s.playerdb_player.community_id, SUM(playerdb_player.total_payment_amount) AS total_payment_amount
FROM playerdb_player
JOIN player_stats_s.playerdb_player on playerdb_player.player_id = player_stats_s.playerdb_player.player_id
WHERE (year = '2024' AND month = '3' AND day = '14' AND hour = '14' AND minute = '0')
GROUP BY player_stats_s.playerdb_player.community_id
ORDER BY total_payment_amount DESC;

Cleaning up

You’ve now successfully created an AWS Glue workflow to join operational and analytics data. To avoid ongoing charges for resources you created following the steps detailed in this article, you should delete:

  1. The Amazon Aurora MySQL database and table used as the operational data source.
  2. The Amazon Simple Storage Service (Amazon S3) buckets used for the MySQL JDBC driver and the AWS Glue job output location.
  3. The Amazon VPC interface and gateway endpoints.
  4. The AWS Glue connections, crawlers, job, and workflow.
  5. The Amazon Redshift provisioned cluster table and Redshift Spectrum external table.

Conclusion

AWS Glue and Redshift Spectrum provide game developers and analysts with a robust platform for combining, transforming, and analyzing data from disparate sources. By automating the extract, transform, and load (ETL) processes with AWS Glue, organizations can optimize costs and operational efficiency. Leveraging the querying capabilities of Amazon Redshift Spectrum, they can also derive actionable insights from their data.

In the fast-paced world of game development, where data-driven decisions are paramount, the integration of AWS Glue and Redshift Spectrum offers a scalable and cost-effective solution. This integration unlocks the full potential of gaming analytics by providing a powerful combination of data processing and querying capabilities. By harnessing the power of these AWS services, game developers can gain deeper insights into player behavior, drive engagement, and ultimately, deliver exceptional gaming experiences.

Steve Phillips


Steve Phillips is a senior technical account manager at AWS in the North America region. Steve has worked with games customers for eight years and currently focuses on data warehouse architectural design, data lakes, data ingestion pipelines, and cloud distributed architectures.

Richard Raseley


Richard Raseley is a Senior Technical Account Manager in North America who works with Games customers. He is passionate about applying his deep background in automation, cloud computing, networking, and storage to help customers build world-class solutions.