AWS Machine Learning Blog

Build a March Madness predictor application supported by Amazon SageMaker

What an opening round of March Madness basketball tournament games! We had a buzzer beater, some historic upsets, and exciting games throughout. The model built in our first blog post (Part 1) pointed out a few likely upset candidates (Loyola IL, Butler), but did not see some coming (Marshall, UMBC). I’m sure there will be more madness in the coming weeks!

Now let’s take our picks and create a March Madness prediction application leveraging the model and endpoints we created in Part 1 using Amazon SageMaker.

In this blog post, we’ll show you how to create a static Amazon Simple Storage Service (Amazon S3) website that lets users simulate any hypothetical matchup from 2011-2018, displays expected scores for all games on the current day, and shows the expected score and win likelihood for every matchup in the 2018 NCAA tournament!

We’ll leverage an AWS CloudFormation script to launch the AWS services required to create this website. CloudFormation is a powerful service that allows you to describe and provision all the infrastructure and resources required for your cloud environment, in simple JSON or YAML templates. In this case, that includes the following:

  • AWS Lambda functions that transform and send data to the Amazon SageMaker endpoints for predictions.
  • Amazon API Gateway for accepting user input and triggering one of the Lambda functions to generate predictions based on user inputs.
  • Amazon Elastic Compute Cloud (Amazon EC2) instance for pulling down new game results and team efficiency data daily to keep our website and predictions fresh.
  • Amazon S3 bucket to store data and host the static website.
  • AWS Identity and Access Management (IAM) roles that allow the Lambda functions and EC2 instance to interact with other AWS services.

The Amazon SageMaker endpoints that we created in Part 1 work in the background with these resources to generate new predictions for the current day’s games, as well as for user-initiated hypothetical matchups.
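As a rough sketch of how a Lambda function might call one of these endpoints, the following Python code builds a CSV payload, sends it through the SageMaker runtime client, and parses a single numeric prediction. The endpoint name is a placeholder, and the assumption that the model accepts text/csv input and returns one number is ours; the Lambda code deployed by the stack may differ.

```python
def build_payload(features):
    """Join numeric features into one CSV row (assumed input format)."""
    return ",".join(str(value) for value in features)


def parse_prediction(raw_body):
    """Parse the first number from a response body given as bytes."""
    text = raw_body.decode("utf-8").strip()
    return float(text.split(",")[0])


def predict(features, endpoint_name):
    """Send one row of features to a SageMaker endpoint and return a float.

    endpoint_name is a placeholder for one of the Part 1 endpoints.
    """
    import boto3  # imported here so the helpers above work without boto3

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=build_payload(features).encode("utf-8"),
    )
    # The response body is a stream; read it fully before parsing.
    return parse_prediction(response["Body"].read())
```

Keeping the payload and parsing helpers separate from the network call makes them easy to check locally before deploying anything.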

Part 2 of this two-part series requires spinning up an EC2 instance (initially eligible for the Free Tier). As with the Amazon SageMaker endpoints, be sure to shut down the instance when you’re done with it to avoid an unexpected bill! The Amazon S3 bucket created is public, so avoid storing any private data in the bucket. If you deleted the endpoints you created in Part 1, you’ll need to re-create them to complete Part 2.

Before executing the CloudFormation script to start up the resources, we’ll need to complete a few simple tasks starting with downloading the CloudFormation script and index.html file that will populate your website.

CloudFormation pre-work

To get started, you’ll need to download the files to a local directory. You can do this with the following AWS Command Line Interface (AWS CLI) commands, substituting {local_file_location} with your local directory. For more details on installing and using the AWS CLI, see the AWS CLI documentation.

aws s3 cp s3://aws-ml-blog/artifacts/bball/mm_cloudformation.json local_file_location
aws s3 cp s3://aws-ml-blog/artifacts/bball/index.html local_file_location

After executing that code, you should have files called index.html and mm_cloudformation.json in the local directory that you selected.

Create an Amazon EC2 key pair

To build this application, you’ll need to connect to an EC2 instance using SSH, which requires an Amazon EC2 key pair in the Region where you’re launching your CloudFormation stack. If you have an existing key pair in that Region, feel free to use it for this exercise. If not, open the AWS Management Console, navigate to the EC2 console, and choose Key Pairs in the left navigation pane.

Choose Create Key Pair then type in march_madness (be sure to type it exactly like this!), then choose Create. This downloads a file called march_madness.pem. Be sure to keep this in a safe and private place, and don’t upload this file to the public S3 bucket discussed later in this process! Without access to this file, you will lose the ability to use SSH to connect with your EC2 instance.

Changes required if launching in a Region outside of us-west-2

If you launched your Amazon S3 bucket and Amazon SageMaker notebook instance in a Region other than us-west-2 in Part 1, you’ll need to copy data from the public S3 bucket (wp-public-blog-cbb) into the S3 bucket you created in Part 1. wp-public-blog-cbb is in us-west-2, and you’ll use .zip files in that bucket as deployment packages for Lambda functions created by the CloudFormation stack.

Deployment packages must be in the same AWS Region as the Lambda functions that use them. If you deployed Amazon SageMaker and the S3 bucket from Part 1 in us-west-2, copying this data is not necessary.

If you did deploy in a Region other than us-west-2, you must copy the deployment packages (.zip files) from the public bucket in us-west-2 into the S3 bucket you created in Part 1. Use the following commands, replacing {your_s3_bucket} with the name of the S3 bucket you created in Part 1:

aws s3 cp s3://wp-public-blog-cbb/ s3://your_s3_bucket
aws s3 cp s3://wp-public-blog-cbb/ s3://your_s3_bucket
aws s3 cp s3://wp-public-blog-cbb/ s3://your_s3_bucket
aws s3 cp s3://wp-public-blog-cbb/ s3://your_s3_bucket

Execute the CloudFormation Script

Now we’re ready to run the CloudFormation script! Navigate to the CloudFormation console and choose Create Stack in the same Region as your Amazon SageMaker endpoint and S3 bucket. On the next screen, choose the radio button Upload a template to S3 and upload the mm_cloudformation.json file from your local directory, then choose Next.

On the next page, give your stack a name, and then edit the Parameters to fit your environment. The significance of each parameter is as follows: 

  • CoefficientParameter: The model coefficient associated with the logistic regression win probability model built in Part 1. If you did not build this logistic regression model, leave the default value.
  • DeploymentPkgS3BucketParameter: The S3 bucket housing the deployment packages for the Lambda functions created by this CloudFormation stack. If you launched your stack in us-west-2, use the default bucket. If you used a different Region, make sure you completed the preceding step to copy the deployment packages into an S3 bucket in the Region where you’re launching your CloudFormation stack.
  • DifferenceModelParameter: The name of the SageMaker endpoint associated with your difference model. If you did not alter the endpoint deployment code in Part 1, then use the default value.
  • InterceptParameter: The intercept associated with the logistic regression win probability model built in Part 1. If you did not build this logistic regression model, leave the default value.
  • KeyPairParameter: The name of the EC2 Key Pair you are using. If you created the march_madness Key Pair, leave the default value.
  • TotalModelParameter: The name of the SageMaker endpoint associated with your total model. If you did not alter the endpoint deployment code in Part 1, then use the default value.
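To illustrate how CoefficientParameter and InterceptParameter could be used, here is a minimal sketch of a one-variable logistic regression. It assumes the model’s single input is the predicted point difference from the difference model, which this post does not state explicitly, and the numbers in the example call are placeholders rather than the template’s default values.

```python
import math


def win_probability(predicted_diff, coefficient, intercept):
    """Logistic function: P(win) = 1 / (1 + exp(-(intercept + coefficient * x))).

    predicted_diff is assumed to be the expected margin of victory
    (positive when the team is favored).
    """
    z = intercept + coefficient * predicted_diff
    return 1.0 / (1.0 + math.exp(-z))


# Placeholder values for illustration only.
example = win_probability(predicted_diff=5.0, coefficient=0.1, intercept=0.0)
print(round(example, 3))
```

With an intercept of zero, a predicted difference of zero gives a 50% win likelihood, and the probabilities for the two sides of a matchup sum to one.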

After filling out these parameters to fit your environment, choose Next. On the next page, add the same tag you used in Part 1, with a Key of “public_blog” and a Value of “cbb”. Feel free to adjust any of the other settings on the page, but none of them are required to build this application.

Finally, review all the settings on the next page, check the box marked I acknowledge that AWS CloudFormation might create IAM resources (this is required since the script creates IAM resources), then select Create. This will create all the resources required for this application and will take some time to execute. To view the stack’s progress, select the stack you created and choose the Events section or panel.
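If you would rather script these console steps, the sketch below assembles the arguments for the CloudFormation create_stack call in boto3. The stack name is a placeholder, only one parameter is overridden (parameters you leave out fall back to the template defaults), and the CAPABILITY_IAM entry is the programmatic counterpart of the IAM acknowledgement check box. Treat it as an illustration under those assumptions, not as the method used in this post.

```python
def build_stack_request(stack_name, template_body, parameters, tags):
    """Return keyword arguments for boto3's CloudFormation create_stack."""
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": str(value)}
            for key, value in parameters.items()
        ],
        "Tags": [{"Key": key, "Value": value} for key, value in tags.items()],
        # Acknowledges that the template creates IAM resources.
        "Capabilities": ["CAPABILITY_IAM"],
    }


request = build_stack_request(
    stack_name="march-madness-app",  # hypothetical stack name
    template_body="{}",  # replace with the contents of mm_cloudformation.json
    parameters={"KeyPairParameter": "march_madness"},
    tags={"public_blog": "cbb"},
)
# With boto3 installed and credentials configured, the stack could be created with:
#   boto3.client("cloudformation").create_stack(**request)
```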

Fantastic! Now you need to make a few adjustments to a shell script preloaded in the EC2 instance and set up a cron job to allow you to run this application from your own environment with fresh data every day.

Customize code in your EC2 instance

To make these changes, we’ll use SSH to connect to the EC2 instance created by the CloudFormation script. Assuming your local machine has an SSH client, you can do this from the command line. Navigate to the directory that contains the march_madness.pem file you downloaded earlier and run the following commands, replacing your-public-ip and your-region with the relevant values for your EC2 instance. Type yes when prompted after the SSH command:

chmod 400 march_madness.pem
ssh -i "march_madness.pem"

For more details, or if you run into any issues, see the Amazon EC2 documentation on connecting to a Linux instance using SSH.

After you type the ls -l command in the EC2 instance you should see the following information (with different dates and times) in the home directory. These were installed within the UserData portion of the EC2 instance in the CloudFormation startup script.

If you see root instead of ec2-user, the startup scripts need more time to execute. Type exit in the EC2 instance, wait about 5 minutes, and then use SSH to connect back to the instance. Type ls -l again to see if ec2-user now appears. If you see ec2-user, keep going!

Great! Now let’s update the shell script. Type the following in your EC2 instance:

cd ~/

This opens the shell script in the editor. Type i to start editing the file. The script pulls updated team performance data and daily matchups down from the “wp-public-blog-cbb” S3 bucket.

Replace insert-s3-bucket-name with the S3 bucket you created in the CloudFormation stack. This S3 bucket name can be found easily by viewing the Resources panel within the CloudFormation console with the stack executed earlier selected (it will have mycbbpredbucket in the name).

Save the file by pressing the Esc key, then typing :wq and pressing Enter.

Next, let’s set up a cron job that will automate this process daily. Type the following command to open up the cron scheduler:

crontab -e

Then type i to edit the file, insert the following, then save using the same escape :wq command:

00 14 * * * /home/ec2-user/

This will execute the script at 2 PM GMT (7 AM PDT, 10 AM EDT) daily, allowing predictions to update well before games begin on a given day. The two files that necessitate this daily update (kenpomnew_active.csv and new_day_matchups.csv) are updated around 1:30 PM GMT (6:30 AM PDT, 9:30 AM EDT) every day in the wp-public-blog-cbb S3 bucket, so a cron job that runs earlier than that will miss data from the immediately preceding day. Without this cron job, the content on your website will grow stale.
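To double-check the schedule, the short sketch below converts the cron job’s start time into US Pacific and Eastern time. It assumes the instance clock is set to UTC and uses fixed daylight-time offsets, which apply during the tournament; in winter the local times would be an hour earlier. The date is only an example.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for US daylight time (an assumption that holds in late March).
PDT = timezone(timedelta(hours=-7), "PDT")
EDT = timezone(timedelta(hours=-4), "EDT")


def cron_times(year, month, day, hour=14, minute=0):
    """Return the cron start time for one date in UTC, PDT, and EDT."""
    run_utc = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
    return {
        "UTC": run_utc,
        "PDT": run_utc.astimezone(PDT),
        "EDT": run_utc.astimezone(EDT),
    }


times = cron_times(2018, 3, 22)  # example date in March 2018
for label, moment in times.items():
    print(label, moment.strftime("%Y-%m-%d %H:%M"))
```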

Update Lambda function triggers

Great! Now you need to set triggers for your Lambda functions so that each function runs in response to an event. Navigate to the Lambda console (or look in the Resources panel of the CloudFormation console) and select the lambdaupdater Lambda function. In the Add Triggers pane at the left, select S3.

Next, scroll down to the Configure Triggers section, select the S3 bucket created by the CloudFormation stack (it has “mycbbpredbucket” in the name), configure the remaining settings as follows, and then choose Add:

Choose Save at the top right of the page to enact the changes.

Practically speaking, the purpose of this function is to update the code and deployment packages for both the newpreds and userpreds Lambda functions each time a deployment package (.zip file) is saved to Amazon S3. The newpreds Lambda function updates your site with predictions for the current day’s games, and the userpreds Lambda function responds to user input on hypothetical matchups. The cron job we’ve set up pulls down new data from the public S3 bucket and re-uploads the .zip files to your S3 bucket, which triggers the lambdaupdater Lambda function to update your deployment packages. Without this function, your site and predictions will become stale.
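The following sketch shows one way an updater function like this could work: it reads the bucket and key from the S3 event and points the matching Lambda function at the new package with update_function_code. The .zip file names and the mapping to function names are hypothetical, since the actual package names are not listed here, and the deployed lambdaupdater code may be organized differently.

```python
from urllib.parse import unquote_plus

# Hypothetical mapping from deployment package key to Lambda function name.
PACKAGE_TO_FUNCTION = {
    "newpreds.zip": "newpreds",
    "userpreds.zip": "userpreds",
}


def packages_to_update(event, mapping=PACKAGE_TO_FUNCTION):
    """Return (function_name, bucket, key) for each known .zip in an S3 event."""
    updates = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key in mapping:
            updates.append((mapping[key], bucket, key))
    return updates


def handler(event, context):
    """Lambda entry point: refresh each function from its uploaded package."""
    import boto3  # imported here so packages_to_update can be tested offline

    client = boto3.client("lambda")
    for function_name, bucket, key in packages_to_update(event):
        client.update_function_code(
            FunctionName=function_name, S3Bucket=bucket, S3Key=key
        )
```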

Now we’ll repeat the trigger creation process, this time for the newpreds Lambda function. Select the newpreds function and follow the same steps as before, filling out the Configure Triggers section with the following information and choosing Add:

Choose Save at the top right of the page to enact the changes. After updating the code in the deployment packages, the lambdaupdater Lambda function also uploads a file called dummy_uploader.txt to your S3 bucket, which will now trigger the newpreds Lambda function, populating your website with predictions from the current day’s games.

The userinput Lambda function doesn’t require any trigger configuration because it’s user-initiated and its deployment package is updated daily by the lambdaupdater Lambda function. We’re all finished with the AWS Lambda updates!

Now we’ll briefly go back to your EC2 instance and run the shell script to initiate the upload of the AWS Lambda deployment packages. If your SSH session has ended, connect to your EC2 instance again, then execute the following commands:

cd ~/

After a few minutes, you should see the following files in your S3 bucket:

The zip files are the same as before, but the updates to the Lambda functions have created the empty dummy_uploader.txt file (discussed earlier), as well as a file called new_day_preds.csv. This CSV file contains the prediction data for today’s games that will populate your website. If you happen to complete this tutorial on a day when there are no games, nothing will display in the bottom half of your site, but you will still have full historical information. When games start up again, the site will populate with current games as well.

Add in 2018 NCAA Tournament predictions

If you completed all of Part 1 in this two-part blog series, you should have a file called tourney_outcome.csv in the S3 bucket you used for modeling with Amazon SageMaker. Copy that file from that S3 bucket into the S3 bucket created by the CloudFormation stack using the AWS CLI to populate your website with predictions from this year’s tournament. Replace {your_model_bucket} and {your_cloudformation_bucket} with your specific S3 buckets:

aws s3 cp s3://your_model_bucket/tourney_outcome.csv s3://your_cloudformation_bucket

Now your S3 bucket should have the following content:

If you didn’t complete this section, don’t worry. The site will still populate predictions for games happening on the current day; it just won’t display full 2018 NCAA tournament predictions.

Update index.html and test the website

Next, we’ll update the index.html file to reflect the API Gateway created by the CloudFormation stack to ensure that user requests are properly directed to the Amazon SageMaker endpoints you created in blog post Part 1.

In the CloudFormation console, choose the Resources panel and copy the Physical ID (Rest API ID) associated with PredAPI. Open up the index.html file you downloaded at the beginning of the exercise and update the action for the form element with your Rest API ID and Region.
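For reference, API Gateway invoke URLs follow the pattern https://{rest-api-id}.execute-api.{region}.amazonaws.com/{stage}. The sketch below assembles such a URL; the stage and resource names are placeholders, so keep whatever stage and path already appear in the form’s action attribute and substitute only the Rest API ID and Region.

```python
def api_invoke_url(rest_api_id, region, stage, resource=""):
    """Build an API Gateway invoke URL from its parts."""
    base = f"https://{rest_api_id}.execute-api.{region}.amazonaws.com/{stage}"
    if resource:
        return f"{base}/{resource.lstrip('/')}"
    return base


# Placeholder ID, stage, and resource for illustration only.
print(api_invoke_url("abc123", "us-west-2", "prod", "/predict"))
```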

Save the index.html file to the mycbbpredbucket S3 bucket, and be sure to make the file public. The CloudFormation script configured the S3 bucket as a publicly viewable static website. If you navigate to the following address (replacing your-s3-bucket-name and your-region with your S3 bucket and Region), you should see a form at the top of the page, followed by projected scores from today’s games and (if you uploaded the tourney_outcome.csv file) the expected outcome of the full 2018 NCAA tournament at the bottom.

Congratulations! You’ve built a website that will generate new predictions daily, take in user input for any hypothetical matchup from 2011-2018, and predict the 2018 tournament, all powered by Amazon SageMaker!

Feel free to adapt the process to fit your application. Dig into the code and functions that support the application to make any desired adjustments or improvements. If values are not displaying on the website, make sure everything in the S3 bucket is publicly viewable, and that the bucket contains a file called new_day_preds.csv for today’s predictions and a file called tourney_outcome.csv for predictions from the 2018 tournament. If new_day_preds.csv is missing or outdated, check that the Amazon EC2 cron job is running as intended and the Lambda functions are operating properly.
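One quick way to tell whether new_day_preds.csv is outdated is to compare its last-modified time with the current time. The helper below is a minimal sketch; the 25-hour threshold is an arbitrary assumption that allows for a daily refresh plus some slack.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_modified, now=None, max_age=timedelta(hours=25)):
    """Return True if the file was last written more than max_age ago.

    last_modified must be timezone-aware, for example the "LastModified"
    value returned by boto3's s3.head_object(Bucket=..., Key=...).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - last_modified) > max_age
```

A stale result would suggest checking the cron job on the EC2 instance first, then the Lambda triggers.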

Wesley Pasfield is a Data Scientist with AWS Professional Services. He has an MS from Northwestern and has worked on problems across numerous industries including sports, gaming, consumer electronics, and retail. He is currently working to help customers enable machine learning and artificial intelligence use cases on AWS.