Train responsible gaming inference models for sports betting with Amazon SageMaker

The global sports betting market is expanding rapidly and is forecast to grow from USD 68.3B in 2022 to USD 117B by 2027, a compound annual growth rate of 14.3%. Sports betting operators provide an exciting form of entertainment for their patrons, combining knowledge of the game and associated statistics with the possibility of winning big. Although sports betting can be a fun social outlet, for some it comes with the risk of going overboard. According to the Mayo Clinic, problem gambling is the uncontrollable urge to continue gambling despite the toll it takes on your life. Various studies estimate that gambling disorder affects approximately 3% of the world’s population. It is therefore incumbent upon sports betting operators and regulators to ensure that responsible gaming (RG) controls are made available to players to reduce this risk.

AWS customers realize that a powerful approach to address this issue is to use data and machine learning (ML) models to detect and prevent risky behavior before it becomes a bigger problem. Detecting RG cases involves processing numerous data streams, including behavioral, transactional, and financial information. Betting operators typically have access to a limited amount of data, without a full view of an individual’s history across multiple betting sites. AWS customers may collect user journey information from multiple channels, including betting anonymously through a network of physical betting terminals or through intermediate resellers and partners. By using the power of their own data, betting operators can develop risk detection models that align with their specific businesses and applicable governing regulations.

In this blog post, we demonstrate the use of Amazon SageMaker to explore and process sports betting data, and to train models that help predict RG cases in this context. We use Amazon SageMaker Data Wrangler to import, analyze, and prepare data. We also use Amazon SageMaker Autopilot to train an ML model to detect RG cases.

Dataset

We use as an example the dataset from the Transparency Project, Division on Addiction, the Cambridge Health Alliance, a teaching affiliate of Harvard Medical School. The dataset entitled Behavioral Characteristics of Internet Gamblers Who Triggered Corporate Responsible Gambling Interventions includes a sample of 4,113 users who subscribed to an internet betting service provider. Half of this sample includes users who triggered operator RG cases; the other half are randomly selected users.

To follow the steps in this post, load the dataset to an Amazon Simple Storage Service (Amazon S3) bucket of your choice. The dataset includes four distinct files:

Demographics file includes information about the users, and is typically extracted from a Customer Relationship Management tool.
Daily aggregates file lists the daily activity per subscriber and per betting product type. It is typically derived from the operator’s gaming server logs.
RG details file includes the RG trigger types and interventions. It is typically produced by the customer RG team.
We don’t use the analytics file, as we use SageMaker Data Wrangler for feature engineering.
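One way to stage the files in Amazon S3 is with the AWS SDK for Python. The following is a minimal sketch, not part of the original walkthrough: the bucket name, key prefix, and local file names are placeholders, and the helper accepts any client object that exposes `upload_file`, such as `boto3.client("s3")`.

```python
from pathlib import Path


def object_key(prefix, local_path):
    """Build the S3 object key for a local file under the given prefix."""
    return f"{prefix.rstrip('/')}/{Path(local_path).name}"


def upload_dataset(s3_client, bucket, prefix, local_paths):
    """Upload each local file to s3://<bucket>/<prefix>/<file name>; return the keys."""
    keys = []
    for path in local_paths:
        key = object_key(prefix, path)
        # upload_file(Filename, Bucket, Key) is the boto3 S3 client call
        s3_client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys


# Example usage (requires boto3 and AWS credentials; all names are hypothetical):
# import boto3
# upload_dataset(boto3.client("s3"), "my-rg-bucket", "raw",
#                ["demographics.csv", "daily_aggregates.csv", "rg_details.csv"])
```

Keeping the three input files under one prefix makes them easy to find when you import them into the Data Wrangler flow.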

Solution overview

The processing flow, outlined in the following diagram, includes six steps:

Figure 1: High-level data processing flow for model training and deployment.

1. Import the data into a SageMaker Data Wrangler flow
2. Streamline feature engineering
3. Run a processing job to export the data to an S3 bucket
4. Split the dataset for training and querying
5. Use SageMaker Autopilot to train and evaluate the model
6. Get real-time predictions from the inference endpoint

Prerequisites

In your account, use the Quick setup option in the SageMaker console to onboard a SageMaker domain. Create a user profile and an associated AWS Identity and Access Management (IAM) execution role with an IAM policy that allows access to the S3 bucket used to store the dataset. Open SageMaker Studio by selecting the created domain and user profile on the SageMaker console. You will be directed to the SageMaker Studio user interface, where you can use SageMaker Data Wrangler to import and visually prepare the data.

Figure 2: Importing data from S3 using Amazon SageMaker Data Wrangler.
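The prerequisites call for an execution role that can reach the dataset bucket. As a rough illustration, the sketch below builds a policy document that grants list, read, and write access to a single bucket. The bucket name is a placeholder, and the permissions your role actually needs may differ, so treat this as a starting point rather than a recommended policy.

```python
import json


def s3_dataset_policy(bucket):
    """Build a policy document scoped to one dataset bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Listing applies to the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {   # Reading and writing apply to the objects inside the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


# "my-rg-bucket" is a hypothetical bucket name
print(json.dumps(s3_dataset_policy("my-rg-bucket"), indent=2))
```

The printed JSON can be pasted into the IAM console as an inline policy on the execution role.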

Feature engineering

Demographics dataset
The output of feature correlation with SageMaker Data Wrangler shows that the variables CountryName and LanguageName are highly correlated. Drop the CountryName column and convert the LanguageName column to a long type to reflect labels.

1. Choose the plus sign next to Data types and choose Add transform
2. Choose Add step and Custom transform
3. In the Your custom transform field, enter the following PySpark code and choose Add:

# Table is available as variable `df`
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType
from pyspark.ml.feature import StringIndexer

# Convert Gender from a string to a numeric label (0 = "M", 1 = "F", 2 = anything else)
def gender_to_label(gender):
    if gender == "M": return 0
    elif gender == "F": return 1
    else: return 2
gender_to_label_udf = udf(gender_to_label, LongType())
df = df.withColumn("Gender_label", gender_to_label_udf(col("Gender")))

# Index LanguageName (string) into a numeric label, then cast the index to long
string_indexer = StringIndexer(inputCol="LanguageName", outputCol="Languagelabel")
df = string_indexer.fit(df).transform(df)
df = df.withColumn("Languagelabel", col("Languagelabel").cast("long"))
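To make the label assignment less opaque: with its default settings, StringIndexer orders labels by descending frequency, so the most common language receives index 0, and it breaks ties alphabetically. The plain-Python sketch below mimics that ordering on a made-up list of languages. It illustrates the idea and is not the Spark implementation, which also stores the index as a double.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically to break ties
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


# Hypothetical sample values, not taken from the dataset
languages = ["German", "English", "German", "French", "German", "English"]
mapping = frequency_index(languages)
print(mapping)  # {'German': 0, 'English': 1, 'French': 2}
```

Because the indices depend on the frequencies seen at fit time, the same language can map to a different index if the indexer is refitted on new data.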

Next, create aggregate features that capture the age at registration, the age at first deposit, and the time elapsed between user registration and first deposit date. Add a built-in Validate timestamps transform, using drop as the target policy, to drop rows with an invalid Registration_date or First_Deposit_Date. Then, proceed as follows:

4. Choose the plus sign next to the last step and choose Add transform

5. Choose Add step and Featurize Date/Time

6. Select the input columns Registration_date and First_Deposit_Date

7. Select Year to keep as a target feature, and choose Add

8. Add a custom transform step, as in steps 1-3, with the following PySpark code:

# Table is available as variable `df`
from pyspark.sql.functions import lit, col, datediff

df = df.withColumn("age_at_registration", col("Registration_date_year") - col("YearofBirth"))
df = df.withColumn("age_at_first_deposit", col("First_Deposit_Date_year") - col("YearofBirth"))
df = df.withColumn("time_elapsed", datediff(col("First_Deposit_Date"), col("Registration_date")))
df = df.withColumn("time_elapsed", col("time_elapsed").cast("long"))
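For a single user, the three derived features reduce to simple date arithmetic. The sketch below reproduces them with the Python standard library on made-up dates. The function name and sample values are illustrative only.

```python
from datetime import date


def registration_features(year_of_birth, registration, first_deposit):
    """Derive the three demographic features from one user's dates."""
    return {
        "age_at_registration": registration.year - year_of_birth,
        "age_at_first_deposit": first_deposit.year - year_of_birth,
        # Whole days between registration and first deposit, like datediff
        "time_elapsed": (first_deposit - registration).days,
    }


# Hypothetical user: born in 1980, registered in late 2009, deposited in early 2010
features = registration_features(1980, date(2009, 11, 20), date(2010, 1, 5))
print(features)
# {'age_at_registration': 29, 'age_at_first_deposit': 30, 'time_elapsed': 46}
```

Because only the year of birth is used, each age is a year difference and can be off by up to one year from the true age.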

Next, apply the built-in Drop column transform to drop the columns Registration_date and First_Deposit_Date used in the previous steps.

Daily aggregates dataset
Aggregate the Daily aggregates table by UserID and productType. That is, for every UserID and betting product, compute the total amount of turnover and loss, the number of active betting days, and daily aggregate features, such as the average, minimum, and maximum amounts of turnover and loss.

Apply the built-in Drop missing transform, followed by a Custom transform step that executes the following PySpark code:

# Table is available as variable `df`
from pyspark.sql.functions import sum, avg, max, count, col, datediff, min, abs, expr, lit

# Grouping by the UserID and ProductType
grouped_df = df.groupBy('userid', 'productType')

# Feature engineering to create useful features for Machine Learning
df = grouped_df.agg(
  sum('Turnover').alias('sum_Turnover'),
  sum('NumberofBets').alias('Nb_Bets'),
  sum('Hold').alias('Total_Loss'),
  count("*").alias('Active_Days'),
  avg('Turnover').alias('Avg_Turnover_Day'),
  max('Turnover').alias('Max_Turnover_Day'),
  max('Hold').alias('Max_Loss_Day'),
  avg('Hold').alias('Avg_Loss_Day'),
  datediff(max("Date"), min("Date")).alias("Duration_Activity"))

# Offsets used below to shift the loss columns: the absolute value of each column's minimum
minvalloss = abs(lit(df.select(min(col("Total_Loss"))).first()[0]))
minmaxlossday = abs(lit(df.select(min(col("Max_Loss_Day"))).first()[0]))
minavglossday = abs(lit(df.select(min(col("Avg_Loss_Day"))).first()[0]))

# Creating aggregate features
df = df.withColumn("Avg_Bets_day", col("Nb_Bets")/col("Active_Days"))
df = df.withColumn("Avg_Turnover_Bet", col("Avg_Turnover_Day")/col("Avg_Bets_day"))
df = df.withColumn("Total_Loss", col("Total_Loss") + lit(minvalloss))
df = df.withColumn("Avg_Loss_Bet", col("Avg_Loss_Day")/col("Avg_Bets_day"))
df = df.withColumn("Max_Loss_Day", col("Max_Loss_Day") + lit(minmaxlossday))
df = df.withColumn("Avg_Loss_Day", col("Avg_Loss_Day") + lit(minavglossday))
df = df.withColumn("Duration_Activity", col("Duration_Activity").cast("long"))
df = df.withColumnRenamed("userid", "tempuserid")
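The two ideas in this step, grouping daily rows per user and product and then shifting the loss columns by the absolute value of their minimum, can be sketched in plain Python. The sample rows and the product name are invented for illustration, and only a few of the aggregated columns are reproduced.

```python
from collections import defaultdict

# Hypothetical daily rows: (userid, productType, Turnover, Hold, NumberofBets)
rows = [
    (1, "sports", 100.0, 20.0, 4),
    (1, "sports", 50.0, -30.0, 2),
    (2, "sports", 10.0, -5.0, 1),
]


def aggregate(daily_rows):
    """Total turnover, loss, bets, and active days per (userid, productType)."""
    groups = defaultdict(lambda: {"sum_Turnover": 0.0, "Total_Loss": 0.0,
                                  "Nb_Bets": 0, "Active_Days": 0})
    for userid, product, turnover, hold, bets in daily_rows:
        group = groups[(userid, product)]
        group["sum_Turnover"] += turnover
        group["Total_Loss"] += hold
        group["Nb_Bets"] += bets
        group["Active_Days"] += 1  # one input row per active day
    return dict(groups)


def shift_by_abs_min(groups, column):
    """Add |min(column)| to every group so that no value stays negative."""
    offset = abs(min(group[column] for group in groups.values()))
    for group in groups.values():
        group[column] += offset
    return groups


result = shift_by_abs_min(aggregate(rows), "Total_Loss")
print(result[(1, "sports")]["Total_Loss"], result[(2, "sports")]["Total_Loss"])
# 0.0 5.0
```

The shift preserves the differences between users while removing negative values, which correspond to players who are net winners.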

Pivot the table to rearrange the data per userID, while matching the userID column in the first dataset. Add a Custom transform step, with the following PySpark code:

# Table is available as variable `df`
from pyspark.sql.functions import sum, avg, max, col

# Grouping by user (the renamed UserID column)
grouped_df = df.groupBy('tempuserid')

# Pivot on productType so that each user gets one column per product and feature
df = grouped_df.pivot('productType').agg(
  sum('sum_Turnover').alias('sum_Turnover'),
  sum('Nb_Bets').alias('Nb_Bets'),
  sum('Total_Loss').alias('Total_Loss'),
  sum('Active_Days').alias('Active_Days'),
  avg('Avg_Turnover_Day').alias('Avg_Turnover_Day'),
  max('Max_Turnover_Day').alias('Max_Turnover_Day'),
  max('Max_Loss_Day').alias('Max_Loss_Day'),
  avg('Avg_Loss_Day').alias('Avg_Loss_Day'),
  max("Duration_Activity").alias("Duration_Activity"),
  avg('Avg_Bets_day').alias('Avg_Bets_day'),
  avg('Avg_Turnover_Bet').alias('Avg_Turnover_Bet'),
  avg('Avg_Loss_Bet').alias('Avg_Loss_Bet'))
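Because the previous step already left one row per user and product, each aggregate in the pivot simply carries that row's value into a product-specific column. The sketch below shows the resulting wide layout in plain Python, using the `<product>_<feature>` naming that Spark applies to pivoted, aliased aggregates. The product names and values are hypothetical.

```python
def pivot_by_product(per_product_rows):
    """Turn (userid, product, features) rows into one wide row per user."""
    wide = {}
    for userid, product, features in per_product_rows:
        row = wide.setdefault(userid, {})
        for name, value in features.items():
            # Column name follows the <pivot value>_<alias> pattern
            row[f"{product}_{name}"] = value
    return wide


# Hypothetical per-user, per-product aggregates
rows = [
    (1, "sports", {"sum_Turnover": 150.0, "Active_Days": 2}),
    (1, "casino", {"sum_Turnover": 40.0, "Active_Days": 1}),
    (2, "sports", {"sum_Turnover": 10.0, "Active_Days": 1}),
]
wide = pivot_by_product(rows)
print(wide[1])
# {'sports_sum_Turnover': 150.0, 'sports_Active_Days': 2,
#  'casino_sum_Turnover': 40.0, 'casino_Active_Days': 1}
print(wide[2].get("casino_sum_Turnover"))  # None: user 2 has no casino activity
```

In Spark, the missing combinations appear as nulls in the pivoted columns, so users who never played a given product have empty values for its features.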

RG details dataset
The Data Quality and Insights Report applied to the RG details table shows an imbalanced dataset. Half of the RG cases belong to a single category, and the other half is split across 4 out of 13 remaining categories. To avoid overfitting, we train a binary classifier instead, while disregarding the information about RG event types.

Build a single data frame for ML using the Join Datasets built-in transform. Use Left outer as the Join Type, choosing UserID for the left column and tempuserid for the right column, as follows.

Figure 3: Transform the dataset with SageMaker Data Wrangler using a left outer join.
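A left outer join keeps every row from the left table and fills in empty values where the right table has no match. The sketch below shows this in plain Python, assuming one right-hand row per key, which holds after the pivot. The sample rows are hypothetical, and unlike the Data Wrangler transform, the sketch omits the right-hand key column from the output.

```python
def left_outer_join(left_rows, right_rows, left_key, right_key):
    """Keep every left row; attach the matching right row when one exists."""
    right_index = {row[right_key]: row for row in right_rows}
    right_columns = sorted({c for row in right_rows for c in row if c != right_key})
    joined = []
    for row in left_rows:
        match = right_index.get(row[left_key], {})
        merged = dict(row)
        for column in right_columns:
            merged[column] = match.get(column)  # None when there is no match
        joined.append(merged)
    return joined


# Hypothetical rows from the demographics and pivoted activity tables
demographics = [{"UserID": 1, "Gender_label": 0}, {"UserID": 2, "Gender_label": 1}]
activity = [{"tempuserid": 1, "sports_sum_Turnover": 150.0}]
joined = left_outer_join(demographics, activity, "UserID", "tempuserid")
print(joined)
# [{'UserID': 1, 'Gender_label': 0, 'sports_sum_Turnover': 150.0},
#  {'UserID': 2, 'Gender_label': 1, 'sports_sum_Turnover': None}]
```

Using the demographics table on the left means that users with no recorded betting activity stay in the training data with empty activity features.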

Add a destination node to export the data flow to an S3 bucket.

The resulting SageMaker Data Wrangler flow is depicted in the following screenshot.

Figure 4: Visual depiction of Amazon SageMaker Data Wrangler data flow.

Launch a processing job to export the transformations that you made to a destination node. For details, see Launch processing jobs with Amazon SageMaker Data Wrangler.

Data split and model training

To illustrate the inference phase explained later in this post, we split the processed dataset in two parts:

Model_training (95% of the dataset) serves as input to train the ML model
Realtime_inference (5% of the dataset) serves to test the ML model

Import the newly processed dataset into a SageMaker Data Wrangler flow, using the approach outlined in the Prerequisites section. Then, create a randomized split using the built-in Split data transform.

Figure 5: Create a randomized split data transform in order to test the ML model.
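Conceptually, a randomized 95/5 split shuffles the rows and cuts the result at the requested fraction. The sketch below shows this with the Python standard library. The seed and the row count (the original sample size, used only for illustration) are assumptions, and the built-in Split data transform may assign rows differently.

```python
import random


def randomized_split(rows, train_fraction=0.95, seed=42):
    """Shuffle a copy of the rows and cut it at the requested fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the processed dataset: one integer per row
model_training, realtime_inference = randomized_split(range(4113))
print(len(model_training), len(realtime_inference))  # 3907 206
```

Fixing the seed keeps the held-out rows the same across runs, so that test results stay comparable when you retrain.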

Add two destination nodes to export the data flow to an S3 bucket. The resulting SageMaker Data Wrangler flow is depicted in the following screenshot.

Figure 6: Revised data flow visualization after splitting out randomized test data.

Create a processing job to export the transformation flow to a destination node, using the same procedure described earlier.

You are now ready for model training. You can automatically train models on your data flow using the SageMaker Data Wrangler integration with SageMaker Autopilot; refer to Unified data preparation and model training with Amazon SageMaker Data Wrangler and Amazon SageMaker Autopilot for details on implementing this step. SageMaker Autopilot explores different solutions to find the best-performing model, and automatically deploys the model to the specified endpoint to get predictions. The best-performing training job achieved an accuracy of 84.4% based on the input dataset. In this post, we use real-time inference to set up an endpoint and obtain predictions interactively.

Query the model

You can use the following Python script to test the model, using as input the Realtime_inference subset that we previously extracted:

import boto3
from botocore.config import Config

my_config = Config(region_name='eu-west-3')
sgm_rt = boto3.Session().client('sagemaker-runtime', config=my_config)

# Confusion matrix counters
count = tp = tn = fp = fn = 0
with open('realtime_inference.csv') as f:
    lines = f.readlines()
    for curline in lines[1:]:  # skip the header row
        curline = curline.strip().split(',')
        label = curline[0]  # the first column holds the expected label
        curline = curline[1:]  # the remaining columns are the features
        curline = ','.join(curline)
        response = sgm_rt.invoke_endpoint(EndpointName='resp-betting-rt-prediction-endpoint', ContentType='text/csv', Accept='text/csv', Body=curline)
        response = response['Body'].read().decode("utf-8")
        count += 1
        if '1' in label:
            if '1.0' in response: tp += 1
            else: fn += 1
        else:
            if '0.0' in response: tn += 1
            else: fp += 1
print("Number of tested samples: %d" % count)
print("true negative: %d, false positive: %d" % (tn, fp))
print("false negative: %d, true positive: %d" % (fn, tp))
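The four counters printed by the script form a confusion matrix, from which the usual summary metrics follow. The helper below is a small sketch. The counts in the example are illustrative only and are not results from this post.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, and recall, guarding against empty denominators."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Illustrative counts only
print(classification_metrics(tp=60, tn=100, fp=20, fn=20))
# {'accuracy': 0.8, 'precision': 0.75, 'recall': 0.75}
```

For RG detection, recall is often worth watching alongside accuracy, because a false negative corresponds to an at-risk player that the model did not flag.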

As your dataset grows, you may need to update the ML predictions to accommodate the newest data. You can do this by rerunning the previous steps described in this post. Alternatively, you can use incremental training in SageMaker, where you use the artifacts from an existing model and an expanded dataset to train a new model. See Use Incremental Training in Amazon SageMaker for more details.

Clean up

Throughout this post, you deployed the infrastructure components to run a SageMaker Data Wrangler instance and deploy a SageMaker inference endpoint. To avoid incurring additional charges, after you are done with the solution, delete all the resources you created:

1. Delete the SageMaker inference endpoint

2. Empty the S3 bucket that you used to store the data

3. Shut down the Data Wrangler instance
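The first two cleanup steps can also be scripted. The sketch below assumes a boto3 SageMaker client and a boto3 S3 resource are passed in. The bucket name in the usage comment is a placeholder, and shutting down the Data Wrangler instance is still done from SageMaker Studio.

```python
def clean_up(sagemaker_client, s3_resource, endpoint_name, bucket_name):
    """Delete the inference endpoint, then remove every object in the bucket."""
    # delete_endpoint removes the hosted endpoint and stops its instances
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    # objects.all().delete() removes the current objects in the bucket in batches
    s3_resource.Bucket(bucket_name).objects.all().delete()


# Example usage (requires boto3 and AWS credentials; the bucket name is hypothetical):
# import boto3
# clean_up(boto3.client("sagemaker"), boto3.resource("s3"),
#          "resp-betting-rt-prediction-endpoint", "my-rg-bucket")
```

If versioning is enabled on the bucket, older object versions remain after this call and need to be removed separately.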

Conclusion

In this blog post, we demonstrated how to set up SageMaker Data Wrangler to analyze and transform sports betting data, completing all the steps of the data preparation workflow from a single visual interface. The solution described in this post showcases how straightforward it is for a sports betting operator to ingest RG data using SageMaker Data Wrangler; perform data preparation tasks such as exploratory data analysis, feature selection, and feature engineering; and then use the processed data to train an RG detection model. This solution covered only a few of the capabilities of SageMaker Data Wrangler, based on the properties of the input dataset. You can use the same approach for more advanced data analysis, accommodate other data sources, and build custom RG models using an intuitive user interface.