AWS for Industries

Solving Business Challenges in Online Betting & Gaming Using No Code AI & ML

In the world of online betting & gaming, companies often have vast quantities of data at their disposal but can face challenges harnessing that data with artificial intelligence and machine learning (AI & ML) to deliver business value. Small to midsize companies may find it especially difficult to use this data, as they often struggle to hire and retain talent from the in-demand data science and analytics fields.

AI & ML can be applied to solve a variety of problems in gaming, such as toxicity detection, localization of chat or in-game text, content personalization, and early detection of players who may stop using your products (churn detection). This blog post will demonstrate, step by step, how AI & ML can be applied to solve such real-world business challenges in online betting & gaming, using early detection of players who are likely to churn from an online sports betting app as an example.

Walkthrough

Before investing substantial time and resources in applying AI & ML to a problem, it is often best to determine some baseline of performance (for example, a coin-flip for a binary classification problem) and then conduct an experiment to see how the ML solution performs against the baseline. If the ML experiment outperforms the prediction accuracy of the baseline, it may be worth further investment by the business to test and move forward with the ML model for production use.

The first section of this blog post will show how a data or business intelligence (BI) analyst, without any data science experience, can use the no-code Amazon SageMaker Canvas graphical user interface (GUI) to visually analyze a dataset and train a minimum viable product (MVP) model.

The second section of the post will demonstrate how the MVP model can be shared with a junior data scientist or engineer with basic data science experience, who can validate it using Amazon SageMaker Studio, a web-based machine learning development environment, and SageMaker Autopilot, its automatic model training and tuning capability, before deploying it for real-time inference.

Building a churn prediction model with no-code using SageMaker Canvas

A data or BI analyst may be comfortable visualizing data across one or several key features of a dataset to evaluate patterns of correlation, but scaling this effort to identify the influence of many features on predicting a specific target variable can prove difficult without AI & ML experience. SageMaker Canvas solves this challenge by providing a GUI familiar to data and BI analysts that enables no-code visual data analysis and ML model experimentation. Canvas models can be shared with users in SageMaker Studio, a web-based machine learning development platform, for deeper evaluation and tuning before deployment for real-time production inference.

To begin this scenario, assume you are a BI analyst at AnyCompany Sportsbook Inc. AnyCompany recently hired a new Chief Commercial Officer (CCO), and they have tasked you with investigating potential root causes of user churn, as well as whether there is a way to predict churn before it occurs so the business can proactively engage users and prevent it.

You’ll start by uploading your dataset to the cloud. AnyCompany’s data engineering team already curated a subset of user data for this use case, but you need to import it into your AWS account environment. To do this efficiently, you’ll enter AWS Command Line Interface (CLI) commands in AWS CloudShell, a browser-based shell for quickly running scripts or commands in your AWS environment. These commands will create a bucket in Amazon Simple Storage Service (Amazon S3) to securely store your dataset where SageMaker can access it.

CloudShell service console

To accomplish this, you log in to your AWS account and navigate to the CloudShell service console. From there, you enter the Bash and AWS CLI commands highlighted below into the CloudShell terminal. Note that you must replace the <bracketed> text in the declaration of the myname variable with your name before pasting into CloudShell. Bucket names must be globally unique, so the commands make yours unique by including your first name and appending the Unix epoch time.

# Build a globally unique bucket name from your first name and the Unix epoch time
myname="<delete all text inside quotes and add your first name. remove any punctuation or spaces!>"
timestamp=$(date +%s)
mybucketname="${myname}-sagemaker-canvas-test-${timestamp}"
echo $mybucketname

# Create the bucket, then download the sample dataset and copy it to S3
aws s3api create-bucket --bucket $mybucketname
targets3uri="s3://${mybucketname}/data.csv"
wget http://d3ed0sar9a13lk.cloudfront.net/gaming_churn_data.csv
aws s3 cp gaming_churn_data.csv $targets3uri

The SageMaker service has not been used in this account yet, so before it can be used you need to create a SageMaker Domain and Users for yourself and your peers. You follow the SageMaker Quick Setup documentation to create a Domain, ensuring you grant access to the bucket you created for your dataset in the Identity and Access Management (IAM) Execution role step during setup. Then, following the SageMaker documentation on User profiles, you add two Users to your Domain: one named analyst-user for yourself, and one named data-science-user for your data science peer, using the default settings for both.

With your SageMaker Domain and User profiles set up, you navigate in the AWS console to the SageMaker Canvas landing page, select your analyst-user User from the dropdown, and select Open Canvas to launch the SageMaker Canvas GUI.

Once in the GUI, you begin by loading your dataset into SageMaker Canvas from S3. To do this you navigate to Datasets and select + Import. From the resulting page, you select Amazon S3 from the dropdown as your Data Source, find and select the bucket you created for your dataset, then select the data file to preview a sample of your data, choose Select dataset, and finally choose Import data.

Navigating to Datasets in the SageMaker Canvas GUI

With your dataset imported, you select it from your Datasets and choose Create a model. Next, you give your model a name (for example, my-churn-model) and select your problem category, in this case Predictive analysis.

Creating the model object

From the resulting page you can choose your target variable for prediction, analyze the features of your data for correlation, and drop features or even engineer new ones. At this point you pause to review the documentation your data engineering team shared summarizing each of the included features.

  • wallet_balance – a numerical value for the cash balance a given user has in their wallet.
  • state – a categorical value representing the state a given user resides in.
  • session_length – a numerical value representing the average time a given user spends in the AnyCompany app when they open it.
  • session_depth – a numerical percentage value representing the likelihood a given user makes a desired action with an AnyCompany product during a session, such as placing a bet.
  • row_id – the numerical index for the dataset.
  • player_id – a numerical value uniquely identifying a given user.
  • nonpromo_spend – a binary (true/false) value representing if the user has interacted with an AnyCompany product using their own money as opposed to promotional offers.
  • multiple_products – a binary (true/false) value representing if the user has engaged with more than one AnyCompany product offering in the app.
  • label – a binary (true/false) value representing if a user churned from the AnyCompany app, 1 for churn, 0 for still active. This is the target variable desired for prediction.
  • days_active – a numerical value representing the total number of active days a given user opened the AnyCompany app, starting from the time they created their account until when they churned, or the present if they are still active.
  • age – a numerical value representing the age of a given user.
  • account_length – a numerical value representing the time from when a given user created their account to when they churned, or the present if they are still active.

Based on the descriptions for row_id and player_id, you make a reasonable assumption that these features will have no correlation with the label. You can verify this quickly by changing from a Column view of the data to a Grid view to visualize the distribution of each feature. Doing so, you see both features have an almost uniform distribution, indicating they are unlikely to provide any meaningful value for prediction and can be dropped by de-selecting their associated checkboxes. You also note that the intended target variable, label, appears to have a roughly equal 50/50 distribution between the two classification categories. Because of this, you decide you will consider your experiment successful if the ML model can predict churn better than 50% of the time.
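
If you (or a data science peer) ever wanted to sanity-check these observations outside the GUI, a minimal pandas sketch against the downloaded CSV might look like the following; it assumes the column names documented above and is purely illustrative rather than part of the no-code workflow.

import pandas as pd

# Load the curated churn dataset (assumes the gaming_churn_data.csv schema described above)
df = pd.read_csv("gaming_churn_data.csv")

# row_id and player_id are expected to be near-uniform identifiers with no predictive value
print(df["row_id"].nunique(), "unique row_id values out of", len(df))
print(df["player_id"].nunique(), "unique player_id values out of", len(df))

# Check the class balance of the target; a roughly 50/50 split means a coin-flip
# baseline of ~50% accuracy is the bar the ML model must beat
print(df["label"].value_counts(normalize=True))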

Visualizing the distribution of the features in the dataset using Grid view

You use the built-in Data Visualizer to visualize correlation between your remaining features. The visualizations work better for numerical features, so you hone your focus on those. You start by choosing Data visualizer to bring you to the Visualizations options, where a scatter plot is ready for you to drag and drop features for visualization.

By plotting account_length versus days_active you see what appears to be strong correlation, which makes sense given the dependent nature of these features detailed in the documentation. Because these features are highly correlated, you consider dropping one of them to improve model performance; however, you decide to leave them both in for the time being and worry about fine-tuning the model later. By dragging your label target variable to the Color by area, you can also see that churned users skew toward older accounts with more total days active, potentially indicating your longest-active users are losing interest in your products. You make a note to share this with your CCO as a potential cause of churn.

Visualizing correlation between features using the built-in scatter plot visualization

Next, you navigate to the Analytics option of the Data Visualizer tool to view the correlation matrix. This will help you to understand the correlation between your numerical features all at once rather than plotting individually using a scatter plot.

By default, the correlation matrix will use all features in the dataset, so you must manually de-select row_id and player_id, which you earlier identified as meaningless, along with your other non-numerical features. As a refresher, correlation values closer to 1 or -1 mean feature pairs are strongly positively or negatively correlated, while values closer to 0 mean pairs have little or no correlation. You again see that account_length and days_active show strong positive correlation, as do account_length and session_length. You add to your note about investigating churn for the oldest active users to also look into why these users are correlated with longer session times.
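
The Canvas correlation matrix is equivalent to what you could compute yourself with pandas. A minimal sketch, assuming the same CSV and the numerical columns documented above, is shown below for reference only.

import pandas as pd

df = pd.read_csv("gaming_churn_data.csv")

# Restrict to the numerical features, mirroring the de-selection done in the Canvas GUI
numeric_features = ["wallet_balance", "session_length", "session_depth",
                    "days_active", "age", "account_length"]

# Values near 1 or -1 indicate strong positive or negative correlation; near 0, little correlation
print(df[numeric_features].corr().round(2))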

Identifying correlation between features using the built-in correlation matrix visualization

Now that you’ve used visual analysis to prepare your dataset, you’re ready to use it to train an ML model and evaluate it against the assumed baseline accuracy of 50%. You return to the Grid view of your dataset and set the Target column dropdown to label, the target variable for prediction in your dataset. Next, you de-select row_id and player_id to drop them from the features you will use to train the model. Finally, you select Validate data to validate your dataset. This check ensures there are no data quality issues, such as missing or null fields, that would cause the training job to fail.

Configuring the model object and validating the dataset for training

Before submitting for full training, which can take some time, you use the Preview model functionality to get an estimated accuracy of how well the trained model will perform at predicting the target label. Your baseline performance assumption was 50% prediction accuracy, so the estimated accuracy of 93.4% demonstrates that using ML to predict and prevent churn warrants further investment. The Preview also indicates that the age and state features have negligible impact on predicting churn, so you could consider dropping them for future model iterations. Lastly, you select the Standard build button to proceed with the model training build process.

Previewing the model and submitting for model training build

Once training is complete, you summarize your findings in an email to your CCO, then select Share to share the MVP model with the data-science-user User profile so your data science peer can validate it before considering it for production use.

Trained MVP model ready to be leveraged for predictions

Evaluating and tuning a SageMaker Canvas model for production real-time inference

For the second part of this example you will role-play as a junior data scientist at AnyCompany Sportsbook Inc. Your CCO has tasked you with working with your data analyst peer to validate the MVP churn detection model from their recent experiment. Once validated, the CCO has asked that you deploy it to production to enable the Marketing team to conduct targeted campaigns against users who are predicted as likely to churn.

You start by navigating to the SageMaker Studio service landing page, selecting your data-science-user User profile from the dropdown, and choosing Open Studio to launch. Note that since this is the first time launching SageMaker Studio in the Domain created for this project it may take several minutes to load while it creates the backend resources for your Studio environment.

Once SageMaker Studio has loaded, you navigate to the model your analyst peer shared with you by expanding the Models collapsible, selecting Shared models, and choosing View model on the model you see shared.

Navigating to the shared Canvas model in SageMaker Studio

SageMaker Canvas uses SageMaker Autopilot’s AutoML capability behind the scenes, so when your analyst peer built the MVP model in SageMaker Canvas, many classification models using different algorithm types and hyperparameter options were trained. Autopilot then compares the trained models automatically and selects the best-performing candidate. You can see evidence of this in the Hyperparameters section of Explainability, where all the candidate algorithm types used to find the best-performing model are shown. Next, you choose Performance to evaluate the predictive power of the model.

Investigating the hyperparameters of the SageMaker Autopilot AutoML job

Your analyst peer reported an accuracy of 93.4%, a metric representing how often the model classified correctly overall. In this case, due to the high cost to the business of falsely classifying a user that is truly likely to churn as unlikely to churn, you are more interested in recall, a performance metric that takes false negatives into account. You see that the recall appears to be negligibly different from the accuracy, so you do not see significant risk in this model generalizing to real-world data. However, if the model chosen by Autopilot AutoML had low recall, you could choose Update model to identify whether one of the alternate model candidates created by AutoML happens to have higher recall and would make a better choice for this application. Now that you have validated the model, you determine it is suitable for production and begin the production deployment configuration by choosing Deploy model.
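
To see why recall matters here, consider a quick sketch with scikit-learn using made-up labels and predictions purely for illustration (not values from the AnyCompany dataset): a model can score well on accuracy while still missing a costly share of the users who actually churn.

from sklearn.metrics import accuracy_score, recall_score

# Illustrative, made-up values: 1 = user churns, 0 = user stays
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # two true churners are missed (false negatives)

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.8 - looks reasonable overall
print("recall:  ", recall_score(y_true, y_pred))    # 0.5 - half the churners were missed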

Evaluating performance metrics before model deployment

To start, the Marketing team wants to be able to predict on individual users in real time, so you choose the Make real-time predictions endpoint option. If the Marketing team instead wanted to predict against many users in bulk, they could do so with the Make batch predictions endpoint option; note, however, that this type of prediction happens asynchronously.

To save on cost, you lower the Instance type to ml.m5.large and fill in the required Endpoint name field with an appropriate name. Finally, you select Deploy to endpoint to deploy your model. Note that endpoint deployment may take several minutes.

The model endpoint before deployment

Once the endpoint is deployed, you test it to ensure it works before notifying Marketing. To do so, you copy the JSON payload shown below into the JSON editor and choose Send Request. You remember that a label of 1 indicates a user is likely to churn, while 0 indicates churn is not likely. For the test user, the model predicts 1, meaning this user is likely to churn.

{
    "data": {
        "features": {
            "columns": [
                "age",
                "account_length",
                "days_active",
                "session_length",
                "session_depth",
                "multiple_products",
                "nonpromo_spend",
                "wallet_balance",
                "state"
            ],
            "values": [
                53,
                800,
                200,
                165,
                1,
                0,
                1,
                133.67,
                "NY"
            ]
        }
    }
}

Testing the model endpoint using the SageMaker Studio JSON editor

Finally, you navigate to the Deployments collapsible and select Endpoints to note the Amazon Resource Name (ARN) of the model endpoint, which you share with the Marketing team so they can begin testing against the endpoint using the AWS SDKs.
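
As a rough sketch of what that testing could look like, the Marketing team might invoke the endpoint with the AWS SDK for Python (Boto3). The endpoint name below is a placeholder for whatever name you chose at deployment, and the CSV content type and feature ordering are assumptions based on the payload used above.

import boto3

runtime = boto3.client("sagemaker-runtime")

# Feature values in the same column order used when training the model:
# age, account_length, days_active, session_length, session_depth,
# multiple_products, nonpromo_spend, wallet_balance, state
payload = "53,800,200,165,1,0,1,133.67,NY"

response = runtime.invoke_endpoint(
    EndpointName="my-churn-model-endpoint",  # placeholder; use your deployed endpoint name
    ContentType="text/csv",
    Body=payload,
)

# The response body contains the prediction for the submitted user
print(response["Body"].read().decode("utf-8"))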

Conclusion

With data science expertise in high demand, it can be hard for online betting & gaming companies to hire and retain talent. This can make it difficult for companies to take advantage of the data available to them and leverage AI & ML to solve business challenges.

This blog post illustrated how this difficulty can be overcome with Amazon’s managed machine learning platform, SageMaker. First, we demonstrated how a data or BI analyst without any prior data science experience can visually analyze a dataset and build and train an ML model MVP with no code using SageMaker Canvas. Next, we demonstrated how a data engineer or junior data scientist with basic data science experience can use SageMaker Studio and SageMaker Autopilot to evaluate and tune the performance of the MVP model before production deployment for real-time inference.

Other examples of how customers can solve real-world business problems in online betting & gaming using AI & ML on AWS include player cohort modeling, real-time profanity masking, and content personalization.

Rob Percival

Rob is an Account Manager focused on supporting AWS customers in the US Betting & Gaming industry. Prior to his current role, Rob spent 3 years as a Solutions Architect helping startups and midsize enterprise customers learn to build on AWS.