AWS for Industries

NFL Next Gen Stats Expected Punt and Kickoff Return Yards: Q&A with the Amazon Machine Learning Solutions Lab

In this post, get a behind-the-scenes look from the Amazon Machine Learning (ML) Solutions Lab team on how the new Expected Return Yards stat from NFL Next Gen Stats was developed.

Give us the main idea behind Expected Return Yards and how it came together.

In collaboration with AWS over the last five years, the NFL Next Gen Stats (NGS) team has introduced a series of stats that provide an analytical look into the many different aspects of how the game is played. Historically, these new stats have been focused on the offensive and defensive sides of the ball (see these two blog posts for more details: defense coverage classification and predicting fourth-down conversion). This season, we’ve taken those learnings and applied them to special teams and the return game. More specifically, we built two separate models that predict the expected punt and kickoff return yards respectively. The Expected Punt Return Yards model predicts the yards a punt returner is expected to gain if they field the punt, while the Expected Kickoff Return Yards model predicts the yards a kick returner is expected to gain once they field the kickoff.

When building advanced stats, AWS and the NGS teams use a variety of artificial intelligence (AI) and ML techniques and algorithms. We also leveraged existing model architectures to develop new stats. For example, the 2020 Expected Rushing Yards model was a solution designed by Austrian data scientists Philipp Singer and Dmitry Gordeev (2019 Big Data Bowl winners) that leveraged raw player tracking data and deep learning techniques. Two years later, NGS and AWS leveraged this architecture for similar solutions, including the 2021 Expected Points Added (EPA) model, the primary model that drives the NGS Passing Score. Utilizing a similar modeling architecture, we developed expected yards models for the return game: Expected Punt and Kickoff Return Yards.

Architecture of the 2019 Big Data Bowl Winner (source: Kaggle)

On the surface, one might assume there isn’t that much of a difference between punts and kickoffs, but you built separate models for each one. What were some of the variables that led you to take that path? Why wouldn’t one model work?

When we started our experiments, we initially explored combining the punt and kickoff data to train a single model. As in a typical ML problem, we expected that adding more data would result in a better model. However, the model trained on the combined dataset did not perform as expected; it performed worse than the models trained separately on each play type.

One of the reasons was the difference in the distribution of yards gained on punts and kickoffs. From the NFL datasets, we can easily see that the average yardage gained is higher on kickoffs than on punts (see graph below). Another reason was the differences between the two play types in player locations on the field, the proximity of defenders when the returner catches the ball, returner speed, the players’ real-time positions relative to one another, acceleration, and other factors. Because of this, the model could not easily differentiate between punts and kickoffs. For example, when we trained the model with the combined data, the Root Mean Squared Error (RMSE), one of the metrics we used to measure performance, nearly doubled compared to the individual models.

Distribution of yards returned per play type (punt return yards in blue and kickoff return yards in orange). Vertical red lines indicate median value.

As a result, we decided to build separate models for punts and kickoffs. This ended up having a couple of advantages. First, we were able to tune each model independently based on the underlying data, such as the players’ relative positions to one another, and on the resulting validation performance. This gave us the flexibility to run independent experiments for both return types so that each has its own tuned hyperparameters. Second, while performing error analysis for model optimization, it helped us clearly see the strengths and weaknesses of each model for its return type. By analyzing those results independently, we could apply customized procedures that helped boost model performance. To measure model performance, we used multiple metrics, such as RMSE, the correlation between the true and predicted return yards, and the continuous ranked probability score (CRPS), which can be seen as an alternative to the log likelihood that is more robust to outliers.
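For readers who want something concrete, here is a minimal sketch of how CRPS can be computed for a binned predictive distribution over return yards, alongside RMSE for the point prediction. The bin layout and array names are illustrative, not the production pipeline.

```python
import numpy as np

def crps_binned(pred_probs, y_true, bin_edges):
    """Continuous ranked probability score for a binned predictive distribution.

    pred_probs: (n_plays, n_bins) predicted probability mass per yardage bin
    y_true:     (n_plays,) observed return yards
    bin_edges:  (n_bins + 1,) yard values delimiting the bins
    """
    pred_probs, bin_edges = np.asarray(pred_probs), np.asarray(bin_edges)
    cdf = np.cumsum(pred_probs, axis=1)                                # predictive CDF per play
    step = (bin_edges[1:][None, :] >= y_true[:, None]).astype(float)  # observation step function
    widths = np.diff(bin_edges)[None, :]
    return float(np.mean(np.sum((cdf - step) ** 2 * widths, axis=1)))

def rmse(y_pred, y_true):
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))
```

Lower is better for both metrics; CRPS rewards placing probability mass close to the observed yardage rather than only rewarding the single most likely value.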

AWS has been working with the Next Gen Stats team for a number of years to develop new analytics. Talk about the significance of not having to build this new stat from scratch, and instead, leverage existing models and techniques.

Leveraging existing techniques and models has several advantages over starting from scratch. First, it helped us finish the project within a short amount of time. It took us only about six weeks to run all the training experiments and arrive at well-performing models. In a typical ML problem, more time is spent understanding the problem, reviewing the literature, exploring the dataset, and ultimately running different training experiments. In our case, we were able to shorten the project timeline by leveraging existing techniques and models from previously developed stats.

In addition to saving time, leveraging existing techniques and models allowed us to focus on the biggest challenge: the fat-tailed problem. The datasets have very few touchdown events (only two touchdowns out of 865 punt plays and nine out of 1,130 kickoff plays in the test data), yet those plays have an outsized role in the dynamics of the game. So, we wanted a technique that could correctly model those rare events in addition to the more typical returns. The time saved allowed us to understand, integrate, and experiment with the Spliced Binned-Pareto (SBP) distribution, initially used in the NGS Quarterback (QB) Passing Score stat, in our ML pipeline. We also focused on improving the overall performance of the models and applied different smoothing techniques to the predicted probability distribution.

A data distribution with fat tails is common in real-world applications, such as extreme rainfall for flood prediction, where rare events have a significant impact on the overall performance of the models. Using a robust method to accurately model a distribution over extreme events is crucial for good overall performance, because those events carry more weight in the problem. We found that the SBP distribution developed by our colleagues Elena Ehrlich, Laurent Callot, and François-Xavier Aubet could be used effectively in this scenario. SBP extends a flexible binned distribution with a rare-event distribution (the generalized Pareto distribution) at both ends (see figure below; a conceptual sketch of the construction follows the figure). Originally designed for time-series forecasting, SBP robustly and accurately models time-series data with heavy-tailed noise in non-stationary scenarios by handling extreme observations while allowing accurate capture of the full distribution.

Predictive Spliced Binned-Pareto distribution
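The implementation we used lives in GluonTS; the following is only a conceptual sketch, with assumed parameter names and a simplified parameterization, of how a binned body can be spliced with generalized Pareto tails to form one density.

```python
import numpy as np
from scipy.stats import genpareto

def sbp_pdf(x, bin_edges, bin_probs, tail_prob, xi_lower, xi_upper, beta_lower, beta_upper):
    """Density of a spliced binned-Pareto distribution (conceptual sketch).

    The body between bin_edges[0] and bin_edges[-1] is a piecewise-constant (binned)
    density holding 1 - 2 * tail_prob of the mass; each tail carries tail_prob of the
    mass and follows a generalized Pareto distribution with shape xi and scale beta.
    """
    x, bin_edges, bin_probs = np.asarray(x, float), np.asarray(bin_edges), np.asarray(bin_probs)
    pdf = np.zeros_like(x)
    widths = np.diff(bin_edges)

    # Body: piecewise-constant density over the bins
    body = (x >= bin_edges[0]) & (x < bin_edges[-1])
    idx = np.clip(np.searchsorted(bin_edges, x[body], side="right") - 1, 0, len(widths) - 1)
    pdf[body] = (1 - 2 * tail_prob) * bin_probs[idx] / widths[idx]

    # Upper tail: GPD of the exceedance above the upper threshold
    upper = x >= bin_edges[-1]
    pdf[upper] = tail_prob * genpareto.pdf(x[upper] - bin_edges[-1], c=xi_upper, scale=beta_upper)

    # Lower tail: GPD of the exceedance below the lower threshold (mirrored)
    lower = x < bin_edges[0]
    pdf[lower] = tail_prob * genpareto.pdf(bin_edges[0] - x[lower], c=xi_lower, scale=beta_lower)
    return pdf
```

The key property is that the body keeps the flexibility of a binned distribution while the Pareto tails assign realistic, non-negligible probability to rare long returns such as touchdowns.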

Another advantage of starting from an existing model is transfer learning. In most ML problems with a small dataset, it is difficult to build a well-performing model. One of the typical ways to boost model performance in low-data situations is transfer learning: a model already trained on one task is reused to train a model for a similar task that has limited labelled data. Because we had access to pretrained models from the rushing yards project, we were able to use this technique for punt and kickoff returns. Interestingly, it worked well at the beginning, when our models were not yet optimized and tended to overfit. The incremental value disappeared once we solved overfitting with better regularization, so we ultimately did not retain transfer learning in this case.
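As a rough illustration of what that looks like in PyTorch, here is a hypothetical sketch: the network, layer sizes, and checkpoint name are stand-ins rather than the actual NGS models, and the pretrained weights are simply loaded non-strictly before fine-tuning on the return data.

```python
import torch
import torch.nn as nn

# Hypothetical network standing in for the shared pair-wise convolutional architecture.
class ReturnYardsNet(nn.Module):
    def __init__(self, n_features=14, n_bins=199):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(n_features, 128, kernel_size=1), nn.ReLU(),
            nn.Conv2d(128, 160, kernel_size=1), nn.ReLU(),
        )
        self.head = nn.Linear(160, n_bins)

    def forward(self, x):                       # x: (batch, 14, 10, 11) feature tensor
        h = self.backbone(x).amax(dim=(2, 3))   # pool over offense-defense pairs
        return self.head(h)                     # logits over yardage bins

model = ReturnYardsNet()
# Load pretrained rushing-yards weights (illustrative checkpoint name), skipping any
# layers whose shapes differ, then fine-tune on the punt/kickoff tracking data.
state = torch.load("rushing_yards_pretrained.pt", map_location="cpu")
model.load_state_dict(state, strict=False)
```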

What kind of factors go into building this model? Do you look at things like player speed, or kick height or distance, where players are on the field in relation to each other?

We used player tracking data to develop the models. The data contains information about each player on every punt and kickoff play, such as position in X and Y coordinates, speed in yards per second, and direction in degrees. For each pair of opposing players (offense versus defense, excluding the ball carrier), we derived 14 features. Some of the derived features are the X, Y, and Euclidean distances and the X and Y speeds of each player relative to the ball carrier on a given play. We also computed the X and Y distance and speed of each offensive player (except the ball carrier) relative to each of the 11 defensive players, and included each player’s X and Y speed and acceleration among the derived features. Once all the features are computed, the data for each play is transformed into a 10 x 11 x 14 tensor: 10 offensive players (excluding the ball carrier), 11 defensive players, and 14 derived features. The data preprocessing and feature engineering was adapted from the winner of the NFL Big Data Bowl competition on Kaggle.
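To make the shape of that tensor concrete, here is a minimal sketch of the pair-wise feature construction for a single play. It shows only a handful of the 14 features, and field names such as 'x', 'y', 'sx', and 'sy' are assumptions for illustration rather than the actual tracking schema.

```python
import numpy as np

def play_features(off, def_, carrier):
    """Build a (10, 11, n_feat) feature tensor for one play (illustrative subset of features).

    off:     list of 10 dicts for offensive players, excluding the ball carrier
    def_:    list of 11 dicts for defensive players
    carrier: dict for the ball carrier
    """
    feats = np.zeros((len(off), len(def_), 6), dtype=np.float32)
    for i, o in enumerate(off):
        for j, d in enumerate(def_):
            feats[i, j, 0] = d["x"] - carrier["x"]                     # defender x-distance to carrier
            feats[i, j, 1] = d["y"] - carrier["y"]                     # defender y-distance to carrier
            feats[i, j, 2] = np.hypot(feats[i, j, 0], feats[i, j, 1])  # Euclidean distance to carrier
            feats[i, j, 3] = d["sx"] - carrier["sx"]                   # defender x-speed relative to carrier
            feats[i, j, 4] = o["x"] - d["x"]                           # offense-defense x-distance
            feats[i, j, 5] = o["sx"] - d["sx"]                         # offense-defense relative x-speed
    return feats
```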

In summary, the model predicts the expected yardage based on the relative position, speed, acceleration, and other real-time information of each player relative to an opposing player or a ball carrier. In this case, we are not adding any historical information related to the players or teams.

It’s interesting that the Expected Return Yards model will change, and the probability of more yards will increase, as the returner makes it past the first wave of defenders, for example. Can you break down how you’re able to observe that information? Are there other football-related factors that play into the model’s performance?

We did not directly observe it, but it can be inferred from the average probability distribution of the punt return yards at ball reception (see graph below).

The predicted probability (in orange) is somewhat smoother than the observed frequencies (in blue), but we clearly see a peak around zero yards and another around 10 yards. After talking with NFL experts, we attributed this to waves of defenders. When the ball is fielded by the returner, the most likely outcome is that the returner makes no progress at all (the peak at zero). If the returner does make progress, then it becomes more likely that they travel about 10 yards, because they have passed the first wave of defenders and have some room to run before the next wave.

If we continuously modeled the return yards at all times during the play and not just when the ball was fielded, we would see the distribution change. If the ball carrier passes the first wave of defenders, we should see the peak at zero disappear and the next peak get higher. This is an interesting next step that could be taken from our work.

How many plays over how many seasons did you use to build these models? And in many AI/ML use cases, we’re talking tens of thousands of examples. How were you able to build this based on so few examples?

The two datasets contain player tracking information for punt and kickoff plays, including each player’s position (X and Y), speed, direction, acceleration, and other factors, along with the yardage gained or lost on each play. There are about 3,000 punt plays and 4,000 kickoff plays from four NFL seasons (2018-2021). To avoid biasing the models toward zero-return plays, we excluded touchbacks and fair catches, where there is no return. In addition, touchdowns make up less than 1% of the punt and kickoff plays in the datasets. We used the data from the first three seasons for model training and validation and the remaining season for testing.

In addition, to augment the data and account for right and left field positions, we also mirrored the X and Y position values. Mirroring the position values doubled each dataset in size.
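One plausible way to implement that kind of mirror augmentation is sketched below; the field-size constants, the single-axis flip, and the array layout are assumptions for illustration, not the actual preprocessing code.

```python
import numpy as np

FIELD_WIDTH = 53.3    # field width in yards; a flip along the length would use 120.0 instead

def mirror_play(xy, speed_xy, axis=1, size=FIELD_WIDTH):
    """Mirror one play's positions and velocities across a field axis.

    xy:       (n_players, 2) X/Y positions
    speed_xy: (n_players, 2) X/Y velocity components
    Reflecting positions about the field boundary and negating the matching velocity
    component yields a physically consistent mirrored play, doubling the dataset.
    """
    xy_m, v_m = xy.copy(), speed_xy.copy()
    xy_m[:, axis] = size - xy_m[:, axis]
    v_m[:, axis] = -v_m[:, axis]
    return xy_m, v_m
```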

Building the models with limited data was not an easy task because the models are prone to overfitting. K-fold cross-validation was one mechanism we used to make efficient use of the small datasets. This technique has a parameter, k, that refers to the number of groups a given dataset is split into. Depending on the value of k, such as 5 or 10, the data is split into k groups, and models are independently trained and evaluated on the different subsets of the dataset. The overall performance is then the average over the individual models trained on the different dataset splits.
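Here is a minimal sketch of that procedure using scikit-learn's KFold; train_fn and eval_fn are placeholders for whatever training loop and metric (RMSE, CRPS) are plugged in.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, y, train_fn, eval_fn, k=10, seed=0):
    """Train and evaluate one model per fold; return the mean score and the per-fold models."""
    scores, models = [], []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        model = train_fn(X[train_idx], y[train_idx])         # fit on k-1 folds
        scores.append(eval_fn(model, X[val_idx], y[val_idx]))  # score on the held-out fold
        models.append(model)
    return float(np.mean(scores)), models
```

As discussed below, the per-fold models can also be kept and averaged into an ensemble rather than discarded.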

Describe what model smoothing is, how it was applied to this project, and anything unique about what it enabled you to do. Was it regularization of the output distribution, or predicting the whole shape of the distribution? If so, can you describe that for our readers?

Regularization is a central concept in machine learning. When a model is very flexible, it can fit the training data perfectly and essentially memorize it. This often leads to a model that is bad at extrapolating to new data, even when the new data is close to the training data. Regularization techniques limit overfitting and improve model performance on new data. In this project, we were also interested in regularization in a less abstract sense: we wanted the predicted probability distribution to look reasonably smooth when plotted on a graph (not too many big variations from one yard to the next). The good news is that standard regularization techniques do make the predicted probability distribution smoother, so we started by adding early stopping and ensembling. These techniques improved performance on new data and made the predicted distribution smoother, though not entirely smooth.

At this point, we could have stopped, because the performance and smoothness were good, but we wanted to see if we could target the smoothness of the predicted distribution directly, instead of using techniques where smoothness is a side effect. We implemented a custom penalty term that limits the slope of the predicted distribution and added it to the overall optimization criterion. This allowed us to control the level of smoothness independently of other factors and find the optimal level for each dataset. Interestingly, on the punt dataset, adding a small smoothness penalty improved predictive performance. Overall, we concluded that ensembling was sufficient from both the performance and the “look” perspectives, and that not much smoothness penalty, if any, was needed. The other unusual thing we did in this project is the way we created the ensemble. Instead of ensembling several model types, we ensembled the models produced during 10-fold cross-validation. This was a good way to improve our results without adding the complexity of maintaining many different model types.
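To make those two ideas concrete, here is a hedged PyTorch sketch of what a slope penalty on the predicted yardage distribution and a cross-validation ensemble could look like; the weight value and function names are illustrative, not the production code.

```python
import torch

def smoothness_penalty(bin_probs, weight=0.01):
    """Penalize large jumps between adjacent yardage bins of the predicted distribution.

    bin_probs: (batch, n_bins) predicted probabilities; `weight` trades smoothness
    against the main probabilistic loss (the value shown is illustrative).
    """
    slope = bin_probs[:, 1:] - bin_probs[:, :-1]
    return weight * (slope ** 2).sum(dim=1).mean()

# Sketch of the combined criterion used during training:
# loss = main_probabilistic_loss(bin_probs, targets) + smoothness_penalty(bin_probs)

def ensemble_predict(models, x):
    """Average the predicted distributions of the models produced by 10-fold cross-validation."""
    with torch.no_grad():
        return torch.stack([m(x).softmax(dim=-1) for m in models]).mean(dim=0)
```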

We touched on this idea of being able to focus on the pipeline for a longer time, where the team was able to really dive deep into the model’s final output layer and the shape of the predicted distribution, in some subtle ways that you usually don’t get to. Can you elaborate on this a bit?

During a standard proof of concept, we start from scratch and have many tasks: agree on the business objective, agree on the ML objective, explore the data, refine the business and ML objectives, research the literature, prepare the data, develop baseline models, develop advanced models, circle back to the ML and business objectives, explain the models, optimize the hyperparameters, deploy for inference, and so on. To deliver all of this in a short time frame, we allocate more experimentation time to the “first order” design decisions (the ones most likely to impact performance significantly), and the “second order” decisions are often made without explicit experimentation. Here, a lot of the first order decisions (such as how to frame the problem, what data and features to use, and what type of model to use) were already made by our colleagues during previous engagements, so the focus shifted to what are typically second order decisions (such as which particular probability distribution to use to model the yards returned). Of course, what is first or second order depends on the use case, and fitting the extreme events (touchdowns) and the overall appearance of the probability distribution were more important than usual, which was an interesting opportunity for us.

What were some of the tools or techniques you tried, or fixes you had to make along the way to get to the end result? Were there any interesting anecdotes about accuracy, bugs, or regularization techniques and how you overcame these challenges?

The main tool was an internal component of GluonTS that models the SBP distribution. The component is relatively new and designed to be used by GluonTS for time-series forecasting, so we were expecting difficulties in using it for a different purpose (here, the problem is not framed as a time-series forecast). We were not disappointed, in the sense that we had several problems! The most challenging one was an error message from the inverse cumulative probability function that appeared pseudo-randomly in the middle of the training loop. In the middle of a training loop, it is possible to get numerical overflows or underflows, but the message and variable inspection did not confirm this hypothesis. Because it took so long to reach the error, we could not simply try different fixes and rerun; we had to copy the internal code and state and reproduce just that part of the code in a notebook. Once we could reproduce the error easily, we quickly found the problem: numerical errors were creating small gaps or overlaps between the segments that partition the [0, 1] domain. These gaps and overlaps were small enough that the code would run properly for a while, but big enough that some input would eventually hit them. We changed the way the partition was built so that gaps and overlaps were not possible, and the problem was solved. The fix is now in GluonTS.
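The actual fix lives in GluonTS, but the class of bug is easy to illustrate with a toy example: accumulating floating-point bin widths can leave microscopic gaps or overlaps between segments of [0, 1], whereas deriving every edge from a single monotone sequence rules them out by construction.

```python
import numpy as np

n_bins = 100
widths = np.full(n_bins, 1.0 / n_bins)

# Accumulating floating-point widths can leave tiny gaps or overlaps between segments,
# so a query that lands exactly in such a gap fails even though the code "usually" works.
edges_accumulated = np.concatenate([[0.0], np.cumsum(widths)])
print(edges_accumulated[-1] == 1.0)   # may be False due to rounding error

# Building the partition from a single monotone sequence makes gaps and overlaps impossible:
# each segment ends exactly where the next one begins, and the last edge is exactly 1.0.
edges_exact = np.linspace(0.0, 1.0, n_bins + 1)
print(edges_exact[-1] == 1.0)         # True by construction
```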

Even after we fixed the errors, the SBP model was not performing well and we were close to giving up on SBP. Then, we realized that the training procedure was overfitting. All models were affected, but SBP more than the others. We implemented early stopping and all models started performing much better, especially SBP, which started to perform very well.
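Early stopping itself is standard; a generic sketch of the kind of loop we mean looks like the following, where train_step and val_loss are placeholders for the actual training epoch and held-out evaluation.

```python
import copy

def train_with_early_stopping(model, train_step, val_loss, patience=10, max_epochs=200):
    """Stop once validation loss has not improved for `patience` epochs, then restore the best weights."""
    best_loss, best_state, since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_step(model)              # one epoch of optimization
        loss = val_loss(model)         # evaluate on the held-out data
        if loss < best_loss:
            best_loss, best_state, since_best = loss, copy.deepcopy(model.state_dict()), 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    model.load_state_dict(best_state)
    return model
```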

Either as a standalone response or sprinkled in above, which AWS technologies (like Amazon SageMaker and other AI/ML tools) did you use to develop this new stat and how?

We trained and evaluated our models in Amazon SageMaker notebook instances. The data and model artifacts and files are stored in Amazon Simple Storage Service (Amazon S3) buckets. We also used the SBP distribution provided by GluonTS. GluonTS is a Python package for probabilistic time-series modeling, but the SBP distribution is not specific to time series and we were able to re-use it for regression. PyTorch is the main deep learning framework that was used in the project.


Learn more about Next Gen Stats powered by AWS, and hear more insights on this new stat from Mike Band of NFL Next Gen Stats in this blog post.

Tesfagabir Meharizghi

Tesfagabir Meharizghi is a Data Scientist at the Amazon ML Solutions Lab where he helps AWS customers across various industries such as healthcare and life sciences, manufacturing, automotive, and sports and media, accelerate their use of machine learning and AWS cloud services to solve their business challenges.

Marc Van Oudheusden

Marc van Oudheusden is a Senior Data Scientist with the Amazon ML Solutions Lab at AWS. He works with AWS customers to solve business problems with artificial intelligence and machine learning. His current research interests include temporal graphs and recommendation systems.

Panpan Xu

Panpan Xu is a Senior Applied Scientist and Manager with the Amazon ML Solutions Lab at AWS. She is working on research and development of Machine Learning algorithms for high-impact customer applications in a variety of industrial verticals to accelerate their AI and cloud adoption. Her research interest includes model interpretability, causal analysis, human-in-the-loop AI and interactive data visualization.