In the news

Expected Punt & Kickoff Return Yards - MLSL Q&A

A behind-the-scenes look with the ML Solutions Lab at Expected Return Yards, the latest Next Gen Stat from AWS and the NFL, which uses AI to take you inside the huddle to better understand punt and kickoff returns.

ML Solutions Lab: Panpan Xu, Tesfagabir Meharizghi, Marc van Oudheusden


1) Give us the main idea behind Expected Return Yards and how it came together.

In collaboration with AWS over the last 5 years, the NFL Next Gen Stats (NGS) team has introduced a series of stats that provide an analytical look into the many different aspects of how the game is played. Historically, these new stats have focused on the offensive and defensive sides of the ball (see these two blog posts for more details: defense coverage classification and predicting 4th down conversion). This season, however, we’ve taken those learnings and applied them to special teams and the return game. More specifically, we built two separate models that predict the expected punt and kickoff return yards, respectively. The punt return model predicts the yards a punt returner is expected to gain if they field the punt, while the kick return model predicts the yards a kick returner is expected to gain once they field the kickoff.
 
When building advanced stats, AWS and the NGS teams have used a variety of AI/ML techniques and algorithms. We also leveraged existing model architectures to develop new stats. For example, the 2020 Expected Rushing Yards model was a solution designed by Austrian data scientists Philipp Singer & Dmitry Gordeev (2019 Big Data Bowl winners) that leveraged raw player-tracking data and deep-learning techniques. Two years later, NGS & AWS leveraged this architecture for similar solutions, including the 2021 expected EPA model, the primary model that drives the NGS Passing Score (2021). Utilizing a similar modeling architecture, we developed expected yards models for the return game – Expected Punt and Kick Return Yards.
Architecture of the 2019 Big Data Bowl Winner (source: Kaggle)

2) On the surface, one might assume there isn’t that much of a difference between punts and kickoffs, but you built separate models for each one. What were some of the variables with each that led you to take that path? Why wouldn’t one model work?

When we started our experiments, we initially explored combining the punt and kickoff data to train a single model. As in a typical ML problem, we expected that adding more data would result in a better model. However, the model trained on the combined dataset did not perform as expected: it had worse performance than the models trained separately.

One of the reasons was the difference in the distribution of yards gained on punts and kickoffs. From the NFL datasets, we can easily see that the average yardage gained is higher on kickoffs than on punts (see graph below). Another reason was the differences in player locations on the field, the proximity of defenders when the returner catches the ball, returner speed, players’ real-time positions relative to one another, acceleration, and so on. Because of these factors, the model could not easily differentiate between punts and kickoffs. For example, when we trained the model with the combined data, the Root Mean Squared Error (RMSE), which is one of the metrics we used to measure performance, almost doubled compared to the individual models.

Figure 1
Distribution of yards returned per play type (punt return yards in blue and kickoff return yards in orange). Vertical red lines indicate the median values.

As a result, we decided to build separate models for punts and kickoffs. This ended up having a couple of advantages. First, we were able to tune the models independently based on the underlying data, such as the players’ relative positions to one another, and the resulting validation performance. This gave us more flexibility in running independent experiments for both return types, so that each has its own tuned hyper-parameters. Second, while performing error analysis for model optimization, it helped us clearly see the strengths and weaknesses of the model for each return type. By analyzing those results independently, we could apply customized procedures that helped boost each model’s performance. To measure model performance, we used multiple metrics, such as RMSE, the correlation between the true and predicted return yards, and the continuous ranked probability score (CRPS), which can be seen as an alternative to the log likelihood that is more robust to outliers.
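To make the evaluation concrete, here is a minimal sketch of how two of these metrics can be computed from a predicted per-yard probability distribution. The yard grid, unit bin width and toy distribution are illustrative assumptions, not the production evaluation code.

```python
import numpy as np

def crps_discrete(prob, yard_grid, y_true):
    """CRPS for a discrete distribution over yard bins (unit bin width assumed)."""
    cdf = np.cumsum(prob)                        # predictive CDF
    step = (yard_grid >= y_true).astype(float)   # empirical CDF of the observed return
    return float(np.sum((cdf - step) ** 2))      # squared CDF distance, summed over bins

def rmse(y_pred, y_true):
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Toy example: a distribution peaked near 0 and 10 yards, scored against an 8-yard return.
yards = np.arange(-10, 100)
probs = np.full(yards.size, 1e-4)
probs[yards == 0] += 0.4
probs[yards == 10] += 0.3
probs /= probs.sum()
print(crps_discrete(probs, yards, y_true=8))
```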

3) AWS has been working with the Next Gen Stats team for a number of years to develop new analytics. Talk about the significance of not having to build this new stat from scratch and instead being able to leverage existing models and techniques.

Leveraging existing techniques and models has several advantages over starting from scratch. First, it helped us finish the project within a short amount of time: it took us only about six weeks to run all the training experiments and get well-performing models. In a typical ML problem, much of the time is spent understanding the problem, reviewing the literature, exploring the dataset and, ultimately, running different training experiments. In our case, however, we were able to shorten the project time by building on techniques and models from the previously developed stats.
 
In addition to saving time, leveraging existing techniques and models also allowed us to focus on the biggest challenge: the fat-tailed problem. As explained earlier, the datasets contain very few touchdown events (only 2 of the 865 punt test plays and 9 of the 1,130 kickoff test plays), yet these plays have an outsized role in the dynamics of the game. So, we wanted a technique that could correctly model those rare events in addition to the more typical returns. The time saved allowed us to properly understand, integrate and experiment with the Spliced Binned-Pareto distribution, which was initially used in the QB passing score NGS stat, in our ML pipeline. It also let us focus on improving the overall performance of the models and on applying different smoothing techniques to the predicted probability distribution.
 
A data distribution with fat tails is common in real-world applications such as extreme rainfall for flood prediction, where rare events have a significant impact on the overall performance of the models. Using a robust method to accurately model the distribution over extreme events is crucial for better overall performance, as those events carry a lot of weight in the problem. We found that the Spliced Binned-Pareto (SBP) distribution developed by our colleagues Elena Ehrlich, Laurent Callot, and François-Xavier Aubet could be used effectively in this scenario. SBP extends a flexible binned distribution with a rare-event distribution (the generalized Pareto distribution) at both ends (see figure below). Originally designed for time series forecasting, SBP robustly and accurately models time series data with heavy-tailed noise in non-stationary scenarios, handling extreme observations while accurately capturing the full distribution.
Predictive Spliced Binned-Pareto distribution
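As a rough intuition for how such a distribution is assembled, the sketch below evaluates the density of a binned body spliced with generalized Pareto tails. This is a conceptual illustration, not the GluonTS implementation; the splice points, tail mass and tail parameters are made-up defaults.

```python
import numpy as np
from scipy.stats import genpareto

def spliced_density(x, bin_edges, bin_probs, lower_q, upper_q,
                    tail_mass=0.05, tail_xi=0.3, tail_scale=5.0):
    """Density of a binned body spliced with GPD tails (illustrative parameters).
    bin_probs should sum to 1 - 2 * tail_mass, the probability mass of the body."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    widths = np.diff(bin_edges)
    body_density = np.asarray(bin_probs) / widths                 # piecewise-constant body
    idx = np.clip(np.searchsorted(bin_edges, x, side="right") - 1,
                  0, len(bin_probs) - 1)
    dens = body_density[idx]

    upper = x > upper_q                                            # rare long returns
    dens[upper] = tail_mass * genpareto.pdf(x[upper] - upper_q,
                                            c=tail_xi, scale=tail_scale)
    lower = x < lower_q                                            # rare large losses
    dens[lower] = tail_mass * genpareto.pdf(lower_q - x[lower],
                                            c=tail_xi, scale=tail_scale)
    return dens

# Toy example: body between -5 and 30 yards, GPD tails beyond those splice points.
edges = np.linspace(-5, 30, 36)                                    # 35 one-yard bins
probs = np.full(35, 0.9 / 35)                                      # body carries 0.9 of the mass
print(spliced_density([0.0, 12.0, 60.0], edges, probs, lower_q=-5, upper_q=30))
```
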
Another advantage of starting from an existing model is transfer learning. In most ML problems with a small dataset, it is difficult to build a well-performing model, and one of the typical ways to boost performance in low-data situations is transfer learning: a model already trained on one task is reused as a starting point for a similar task with limited labelled data. Because we had access to pretrained models from the rushing yards project, we were able to try this technique for the punt and kickoff returns. Interestingly, it worked well at the beginning, when our models were not yet optimized and had a tendency to overfit. The incremental value disappeared once we addressed overfitting with better regularization, so we ultimately did not retain transfer learning in this case.
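For readers unfamiliar with the mechanics, this is roughly what such a transfer-learning experiment can look like in PyTorch. The ReturnYardsNet class, the checkpoint path and the choice of which layers to freeze are hypothetical stand-ins used for illustration only.

```python
import torch
import torch.nn as nn

class ReturnYardsNet(nn.Module):
    """Minimal stand-in for the return-yards architecture (hypothetical)."""
    def __init__(self, n_yard_bins=110):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(14, 64, kernel_size=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, n_yard_bins)   # logits over possible return yards

    def forward(self, x):                        # x: (batch, 14, 10, 11)
        return self.head(self.backbone(x))

model = ReturnYardsNet()

# Reuse weights from a model pretrained on the rushing-yards task
# (the checkpoint path below is hypothetical):
# state = torch.load("rushing_yards_pretrained.pt")
# model.load_state_dict(state, strict=False)     # load whichever layers match

# Freeze the shared backbone and fine-tune only the output head at first.
for p in model.backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```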

4) What kind of factors go into building this model? Do you look at things like player speed, height or distance of the kick, or where players are on the field in relation to each other?

We used player tracking data to develop the models. The data contains player information for each punt/kickoff play, such as position in X and Y coordinates, speed in yards/second, and direction in degrees. For each pair of opposing players (offense vs. defense, excluding the ball carrier), we derived 14 features. Some of the derived features are the x, y and Euclidean distance and the x and y speed of each player relative to the ball carrier on a given play. We also computed the x and y distance and speed of each offensive player (except the ball carrier) relative to each of the 11 defensive players, and we included the x and y speed and acceleration of each player among the derived features. Once all the features are computed, the data is transformed into a 10x11x14 tensor: 10 offensive players (excluding the ball carrier), 11 defensive players and 14 derived features. The data preprocessing and feature engineering were adapted from the winner of the NFL Big Data Bowl competition on Kaggle.
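The sketch below shows one way such a 10x11x14 tensor could be assembled from raw positions, speeds and accelerations. The exact composition and ordering of the 14 features here are illustrative assumptions, not the production feature list.

```python
import numpy as np

def play_features(off, dfn, carrier):
    """off: (10, 6) array of [x, y, sx, sy, ax, ay] for offensive players
    (excluding the ball carrier), dfn: (11, 6) for defenders, carrier: (6,)."""
    feats = np.zeros((10, 11, 14), dtype=np.float32)
    for i, o in enumerate(off):
        for j, d in enumerate(dfn):
            rel_def_carrier = d[:2] - carrier[:2]        # defender vs. ball carrier
            rel_def_off = d[:2] - o[:2]                  # defender vs. offensive player
            feats[i, j] = np.concatenate([
                rel_def_carrier,                         # 2: x, y distance to carrier
                [np.linalg.norm(rel_def_carrier)],       # 1: Euclidean distance to carrier
                d[2:4] - carrier[2:4],                   # 2: speed relative to carrier
                rel_def_off,                             # 2: x, y distance to offense
                d[2:4] - o[2:4],                         # 2: speed relative to offense
                d[2:4],                                  # 2: defender speed
                d[4:6],                                  # 2: defender acceleration
                [np.linalg.norm(rel_def_off)],           # 1: Euclidean distance to offense
            ])
    return feats                                         # shape (10, 11, 14)

# Toy example with random tracking values for one play.
off, dfn, carrier = np.random.rand(10, 6), np.random.rand(11, 6), np.random.rand(6)
print(play_features(off, dfn, carrier).shape)            # (10, 11, 14)
```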

In summary, the model predicts the expected yardage based on the relative position, speed, acceleration and other real-time information of each player relative to an opposing player or the ball carrier. We do not add any historical information about the players or teams.

5) It’s interesting that the expected return yards will change, with the probability of more yards increasing as the returner makes it past the first wave of defenders, for example. Can you break down how you’re able to observe that information? And are there other football-related factors that play into the model’s performance?
 
We did not directly observe it, but it can be inferred from the average probability distribution of the punt return yards at ball reception (see graph below).
The predicted probability (in orange) is somewhat smoother than the observed frequencies (in blue), but we clearly see a peak around 0 yards and another around 10 yards. After talking with NFL experts, we explained this by waves of defense: when the ball is fielded by the returner, the most likely outcome is that the returner makes no progress at all (the peak at 0). If he does make progress, it then becomes more likely that he travels about 10 yards, because he has passed the first wave of defenders and has some room to run before the next wave.
 
If we modeled the return yards continuously at all times during the play (and not only when the ball is fielded), we would see the distribution change as the play unfolds. Once the ball carrier passes the first wave of defenders, we should see the peak at 0 disappear and the next peak grow. This is an interesting next step that could be taken from our work.

6) Break down the dataset for me: how many plays over how many seasons did you use to build these models? In many AI/ML use cases, we’re talking tens of thousands of examples. How were you able to build this based on so few examples?

The two datasets contain player tracking information for punt and kickoff plays, with each player’s position (X and Y), speed, direction, acceleration, and so on, as well as the yardage gained or lost on each play. There are about 3,000 punt plays and 4,000 kickoff plays from four NFL seasons (2018-2021). To avoid biasing the models towards zero-return plays, we excluded touchbacks and fair catches, where there is no return. In addition, fewer than 1% of the plays in the datasets are punt or kickoff return touchdowns. We used the data from the first three seasons for model training/validation and the remaining season for testing.

In addition, to augment the data and account for left and right field positions, the X and Y position values were also mirrored. Mirroring the position values let us double each dataset in size.
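A minimal sketch of that augmentation is shown below, reflecting each play across the width of the field. The column names, the choice of axis and the FIELD_WIDTH constant are assumptions made for illustration.

```python
import pandas as pd

FIELD_WIDTH = 53.3  # NFL field width in yards

def mirror_play(tracking: pd.DataFrame) -> pd.DataFrame:
    """Return the left/right mirror image of a play's tracking rows."""
    mirrored = tracking.copy()
    mirrored["y"] = FIELD_WIDTH - mirrored["y"]   # reflect positions across the field width
    mirrored["sy"] = -mirrored["sy"]              # flip the Y component of speed
    mirrored["ay"] = -mirrored["ay"]              # ... and of acceleration
    return mirrored

# Toy example: two tracking rows plus their mirrored copies, doubling the data.
tracking_df = pd.DataFrame({"y": [10.0, 40.0], "sy": [1.5, -2.0], "ay": [0.1, -0.3]})
augmented = pd.concat([tracking_df, mirror_play(tracking_df)], ignore_index=True)
print(augmented)
```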

Building the models with limited data was not an easy task, as the models are prone to overfitting. K-fold cross-validation was one mechanism we used to make efficient use of the small dataset. The technique has a parameter, k, that refers to the number of groups a given dataset is split into. Depending on the value of k, such as 5 or 10, the data is split into k groups, and models are independently trained and evaluated on the different subsets of the dataset. The overall performance is then the average over the individual models trained on the different dataset splits.
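In code, the scheme looks roughly like the sketch below, here using scikit-learn's KFold. The train_and_evaluate callable is a hypothetical placeholder for whatever trains one model and returns it together with its validation score.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(plays, train_and_evaluate, k=10):
    """Train k models on k train/validation splits and average their scores."""
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    models, scores = [], []
    for train_idx, val_idx in kf.split(plays):
        model, val_score = train_and_evaluate(plays[train_idx], plays[val_idx])
        models.append(model)
        scores.append(val_score)
    # The k models can later be ensembled, as described in question 7.
    return models, float(np.mean(scores))
```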

7) Can you please describe what model smoothing is, how it was applied to this project, and anything unique about what it enabled you to do? Was it the regularization of the output distribution/predicting the whole shape of the distribution? And if so, can you describe that for our readers?
 
Regularization is a central concept in machine learning. When a model is very flexible, it can fit the training data perfectly and essentially memorize it. This often leads to a model that is bad at extrapolating to new data, even if the new data is close to the training data. Regularization techniques limit over-fitting and improve model performance on new data. In this project we were also interested in regularization in a less abstract sense: we wanted the predicted probability distribution to look somewhat smooth when plotted on a graph (not too many big variations from one yard to the next). The good news is that standard regularization techniques do make the predicted probability distribution smoother, so we started by adding early stopping and ensembling. These techniques improved the performance on new data and made the predicted distribution smoother, but not totally smooth.
 
At this point we could have stopped, because the performance and smoothness were good, but we wanted to see if we could specifically target the smoothness of the predicted distribution instead of relying on techniques where smoothness is a side effect. We implemented a custom penalty term that limits the slope of the predicted distribution and added it to the general optimization criterion. This allowed us to control the level of smoothness independently of other factors and to find the optimal level of smoothness for each dataset. Interestingly, on the punt dataset a small smoothness penalty improved the predictive performance. Overall, we concluded that ensembling was sufficient from both the performance and “look” perspectives and that little smoothness penalty, if any, was needed. The other unusual thing we did in this project is the way we created the ensemble: instead of ensembling several model types, we ensembled the models produced during 10-fold cross-validation. This was a good way to improve our results without the added complexity of maintaining many different model types.
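The penalty itself can be as simple as the sketch below: penalize large yard-to-yard jumps in the predicted probability vector and add the term to the usual training loss. The function names and the smooth_lambda weight are illustrative, not the exact penalty the team used.

```python
import torch

def smoothness_penalty(probs: torch.Tensor) -> torch.Tensor:
    """probs: (batch, n_yard_bins) predicted distribution for each play."""
    slopes = probs[:, 1:] - probs[:, :-1]     # change from one yard bin to the next
    return (slopes ** 2).sum(dim=-1).mean()   # average squared slope over the batch

def total_loss(base_loss: torch.Tensor, probs: torch.Tensor,
               smooth_lambda: float = 0.01) -> torch.Tensor:
    """Base objective (e.g. negative log likelihood) plus the smoothness term."""
    return base_loss + smooth_lambda * smoothness_penalty(probs)
```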
 
8) We touched on this idea of being able to focus on the pipeline for a longer time, where the team was able to really dive deep into the final layer of the model’s output and the predicted distribution in some really subtle ways you usually don’t get to. Can you elaborate on this a bit?
 
During a standard POC we start from scratch and have very many tasks: agree on the business objective, agree on the ML objective, explore the data, refine the business and ML objectives, research the literature, prepare the data, develop baseline models, develop advanced models, circle back to the ML and business objectives, explain the models, optimize the hyper-parameters, deploy for inference, etc. To deliver all of this in a short time frame, we allocate more experimentation time to the “first order” design decisions (the ones most likely to impact performance significantly), and the “second order” decisions are often made without explicit experimentation. Here, a lot of the first order decisions (How to frame the problem? What data and features to use? What type of model to use? ...) were already made by our colleagues during previous engagements, and the focus was on things that are typically second order decisions (What particular probability distribution should we use to model the yards returned?). Of course, what is first or second order depends on the use case, and here the fitting of extreme events (touchdowns) and the overall appearance of the probability distribution were more important than usual, which was an interesting opportunity for us.

9) What were some of the tools/techniques you tried or fixes you had to make along the way to get to the end result? Were there any interesting anecdotes about accuracy, bugs, regularization techniques, etc., and can you describe how you overcame these challenges?
 
The main tool was an internal component of GluonTS that models the Spliced Binned-Pareto distribution. The component is relatively new and was designed to be used by GluonTS for time series forecasting, so we were expecting difficulties in using it for a different purpose (here the problem is not framed as a time-series forecast). We were not disappointed, in the sense that we had several problems! The most challenging one was an error message from the inverse cumulative probability function that appeared pseudo-randomly in the middle of the training loop. Numerical overflows or underflows are possible in the middle of a training loop, but the message and variable inspection did not confirm this hypothesis. Because it took so long to reach the error, we could not simply try different fixes and rerun; we had to copy the internal code and state and reproduce just that part of the code in a notebook. Once we could reproduce the error easily, we quickly found the problem: numerical errors were creating small gaps or overlaps between the segments that partition the [0, 1] domain. These gaps and overlaps were small enough that the code would run properly for a while, but big enough that some input would eventually hit them. We changed the way the partition is built so that gaps and overlaps are not possible, and the problem was solved. The fix is now in GluonTS: https://github.com/awslabs/gluonts/commit/8a8bed57eae93509afb646ccab7434ad2bb2f4fd.
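To illustrate the failure mode (not the actual GluonTS patch), the sketch below contrasts a fragile way of building a [0, 1] partition, where each segment recomputes its own endpoints, with a construction in which adjacent segments share the exact same edge by design.

```python
import numpy as np

def fragile_partition(widths):
    """Each segment computes its own start and end independently; rounding can
    leave a tiny gap or overlap between one segment's end and the next's start."""
    segments = []
    for i, w in enumerate(widths):
        start = float(np.sum(widths[:i]))
        segments.append((start, start + w))      # end may not bit-match the next start
    return segments

def robust_partition(widths):
    """All segments share one cumulative array of edges, so segment i ends exactly
    where segment i + 1 begins, by construction."""
    edges = np.concatenate([[0.0], np.cumsum(widths, dtype=float)])
    edges[-1] = 1.0                              # pin the last edge to exactly 1
    return list(zip(edges[:-1], edges[1:]))

widths = np.full(100, 0.01)                      # 100 equal-width bins on [0, 1]
print(fragile_partition(widths)[:2], robust_partition(widths)[:2])
```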
 
Even after we fixed the errors, the SBP model was not performing well, and we were close to giving up on SBP. Then we realized that the training procedure was over-fitting. All models were affected, but SBP more than the others. We implemented early stopping, and all models started performing much better, especially SBP, which went on to perform very well.
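For completeness, here is a minimal early-stopping sketch of the kind referred to above: stop when the validation loss has not improved for a set number of epochs and restore the best weights. The model, train_one_epoch and validation_loss arguments are placeholders supplied by the caller.

```python
import copy

def fit_with_early_stopping(model, train_one_epoch, validation_loss,
                            max_epochs=200, patience=10):
    """Train until the validation loss stops improving for `patience` epochs."""
    best_loss, best_state, since_best = float("inf"), None, 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        val = validation_loss(model)
        if val < best_loss:
            best_loss, best_state, since_best = val, copy.deepcopy(model.state_dict()), 0
        else:
            since_best += 1
            if since_best >= patience:
                break                             # no improvement for `patience` epochs
    if best_state is not None:
        model.load_state_dict(best_state)         # restore the best checkpoint
    return model, best_loss
```
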
10) Either as a standalone response or sprinkled in above, which AWS technologies (like SageMaker and other AI/ML tools) did you use to develop this new stat and how?
 
We trained and evaluated our models on Amazon SageMaker notebook instances. The data and model artifacts/files are stored in an Amazon S3 bucket. We also used the SBP distribution provided by GluonTS. GluonTS is a Python package for probabilistic time series modeling, but the SBP distribution is not specific to time series, and we were able to reuse it for regression. PyTorch was the main deep learning framework used in the project.