ML Solutions Lab: Panpan Xu, Tesfagabir Meharizghi, Marc van Oudheusden
In collaboration with AWS over the last 5 years, the NFL Next Gen Stats (NGS) team has introduced a series of stats that provide an analytical look into the many different aspects of how the game is played. Historically, these new stats have focused on the offensive and defensive sides of the ball (see these two blog posts for more details: defense coverage classification and predicting 4th down conversion). This season, however, we’ve taken those learnings and applied them to special teams and the return game. More specifically, we built two separate models that predict the expected punt and kickoff return yards, respectively. The punt return model predicts the yards a punt returner is expected to gain if they field the punt, while the kick return model predicts the yards a kick returner is expected to gain once they field the kickoff.
2) On the surface, one might assume there isn’t that much of a difference between punts and kickoffs but you built separate models for each one. What were some of the variables with each that led you to taking that path? Why wouldn’t one model work?
When we started our experiments, we initially explored combining punt and kickoff data to train a single model. As in a typical ML problem, we expected that adding more data would result in a better model. However, the model trained on the combined dataset did not perform as expected: it performed worse than the models trained separately.
One reason was the difference in the distribution of gained yardage between punts and kickoffs. From the NFL datasets, we can easily see that the average gained yardage is higher on kickoffs than on punts (see graph below). Another reason was the difference between the two play types in player location on the field, proximity of defenders when the returner catches the ball, returner speed, real-time position of players in relation to one another, acceleration, etc. Because of these factors, the model could not easily differentiate between punts and kickoffs. For example, when we trained the model with combined data, the Root Mean Squared Error (RMSE), which is one of the metrics we used to measure performance, almost doubled compared to the individual models.
As a result, we decided to build separate models for punts and kickoffs. This ended up having a couple of advantages. First, we were able to tune the models independently based on the underlying data, such as the players’ relative position to one another, and the resulting validation performance. This gave us more flexibility in running independent experiments for both return types, so that each model has its own tuned hyperparameters. Second, while performing error analysis for model optimization, it helped us clearly see each model’s strengths and weaknesses for its return type. By analyzing those results independently, we used customized procedures that helped boost each model’s performance. To measure model performance, we used multiple metrics such as RMSE, the correlation between the true and predicted return yards, and the continuous ranked probability score (CRPS), which can be seen as an alternative to the log likelihood that is more robust to outliers.
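The three metrics named above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the team's evaluation code: the function names are hypothetical, and the CRPS shown is the discrete form that compares a predicted cumulative distribution over a grid of yard values against a step function at the observed yardage. The source does not say which CRPS formulation was used.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted return yards."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_correlation(y_true, y_pred):
    """Pearson correlation between true and predicted return yards."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def crps_discrete(pred_cdf, observed, grid):
    """Discrete CRPS for one play (assumed formulation).

    pred_cdf[i] is the predicted probability that the return is <= grid[i].
    The score is the mean squared gap between that CDF and a step function
    that jumps from 0 to 1 at the observed yardage. Lower is better.
    """
    total = 0.0
    for cdf_value, yard in zip(pred_cdf, grid):
        step = 1.0 if yard >= observed else 0.0
        total += (cdf_value - step) ** 2
    return total / len(grid)
```

A forecast that puts all of its probability on the observed yardage scores a CRPS of 0, while a single badly missed play adds a bounded penalty, which is one way to read the "more robust to outliers" remark.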
4) What kind of factors go into building this model? Do you look at things like player speed, height or distance of the kick, where players are on the field in relation to each other?
We used player tracking data to develop the models. The data contains information on each player for every punt/kickoff play, such as position in X and Y coordinates, speed in yards/second, and direction in degrees. For each pair of opposing players (offense-defense, excluding the ball carrier), we derived 14 features. Some of the derived features are the x, y, and Euclidean distance and the x and y speed of each player relative to the ball carrier in a given play. We also computed the x and y distance and speed of each offensive player (except the ball carrier) relative to each of the 11 defensive players. We also included the x and y speed and acceleration of each player among the derived features. Once all the features were computed, the data was transformed into a 10 x 11 x 14 tensor: 10 offensive players (excluding the ball carrier), 11 defensive players, and 14 derived features. The data preprocessing and feature engineering were adapted from the winner of the NFL Big Data Bowl competition on Kaggle.
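To make the shape of that input concrete, here is a minimal sketch of how pairwise features for one play might be assembled. It is illustrative only: the `Player` class, the function names, and the assumed direction convention (0 degrees pointing along +y, increasing clockwise) are assumptions, and the sketch computes 9 features per pair rather than the 14 used in the actual models, whose exact list is not given here.

```python
import math
from dataclasses import dataclass


@dataclass
class Player:
    x: float          # position along the field, in yards
    y: float          # position across the field, in yards
    speed: float      # yards/second
    direction: float  # degrees (assumed: 0 = +y axis, clockwise)


def velocity(p):
    """Split speed into x and y components under the assumed convention."""
    rad = math.radians(p.direction)
    return p.speed * math.sin(rad), p.speed * math.cos(rad)


def pairwise_features(returner, offense, defense):
    """Return a nested list shaped [len(offense)][len(defense)][9]."""
    rvx, rvy = velocity(returner)
    tensor = []
    for off in offense:
        ovx, ovy = velocity(off)
        row = []
        for dfn in defense:
            dvx, dvy = velocity(dfn)
            dx, dy = dfn.x - returner.x, dfn.y - returner.y
            row.append([
                dx, dy, math.hypot(dx, dy),  # defender vs. ball carrier: position
                dvx - rvx, dvy - rvy,        # defender vs. ball carrier: speed
                off.x - dfn.x, off.y - dfn.y,  # blocker vs. defender: position
                ovx - dvx, ovy - dvy,          # blocker vs. defender: speed
            ])
        tensor.append(row)
    return tensor
```

With 10 offensive players and 11 defensive players, the result has shape 10 x 11 x 9; adding the remaining derived features would bring the last dimension to 14.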
In summary, the model predicts the expected yardage based on the relative position, speed, acceleration and other real-time information of each player relative to an opposing player or a ball carrier. In this case, we are not adding any historical information related to the players or teams.
6) Break down the dataset for me, how many plays over how many seasons did you use to build these models? And in many AI/ML use cases, we’re talking tens of thousands of examples. How were you able to build this based on so few examples?
The two datasets contain player tracking information for punt and kickoff plays, including each player’s position (X and Y), speed, direction, acceleration, etc. They also contain the yardage gained or lost on each play. There are about 3,000 punt plays and 4,000 kickoff plays from four NFL seasons (2018-2021). To avoid biasing the model toward zero-return plays, we excluded touchbacks and fair catches, where there is no return. In addition, fewer than 1% of the punt and kickoff plays in the datasets resulted in touchdowns. We used the data from the first three seasons for model training/validation and the remaining season for testing.
To augment the data and account for plays on both the right and left sides of the field, we also mirrored the X and Y position values. Mirroring the position values doubled the size of each dataset.
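A minimal sketch of this kind of mirroring augmentation follows. The exact transform is not spelled out here, so the sketch assumes a left/right flip across the long axis of the field: each y position is reflected across the field width (53 1/3 yards) and the y component of velocity changes sign, while the return yardage label stays the same. The tuple layout and function names are hypothetical.

```python
FIELD_WIDTH_YARDS = 160 / 3  # an NFL field is 160 feet (53 1/3 yards) wide


def mirror_play(players):
    """Flip one play left/right (assumed transform).

    players is a list of (x, y, vx, vy) tuples, one per player.
    """
    return [(x, FIELD_WIDTH_YARDS - y, vx, -vy) for x, y, vx, vy in players]


def augment(plays):
    """Append a mirrored copy of every (players, return_yards) example."""
    mirrored = [(mirror_play(players), yards) for players, yards in plays]
    return plays + mirrored
```

Applying `augment` to a list of plays returns twice as many examples, and mirroring a play twice recovers the original coordinates.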
Building the models with limited data was not an easy task, because a model trained on so few examples is prone to overfitting. K-fold cross-validation was one mechanism we used to make efficient use of the small dataset. The technique has a parameter, k, that sets the number of groups (folds) the dataset is split into. For a given value of k, such as 5 or 10, the data is split into k groups; k models are then trained independently, each one evaluated on a different held-out group and trained on the remaining groups. The overall performance is the average of the scores of the individual models trained on the different dataset splits.
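The procedure described above can be sketched as follows. This is a generic illustration in plain Python rather than the training code used for these models; `k_fold_indices`, `cross_validate`, and the `fit` and `score` callables are hypothetical names, and in practice a library utility would typically do the splitting.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k groups of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(features, targets, k, fit, score):
    """Train k models, each validated on a different fold, and average the scores.

    fit(train_x, train_y) returns a model;
    score(model, val_x, val_y) returns a number such as RMSE.
    """
    scores = []
    for train_idx, val_idx in k_fold_indices(len(targets), k):
        model = fit([features[i] for i in train_idx],
                    [targets[i] for i in train_idx])
        scores.append(score(model,
                            [features[i] for i in val_idx],
                            [targets[i] for i in val_idx]))
    return sum(scores) / len(scores)
```

Every play ends up in a validation fold exactly once, so all of the limited data contributes to both training and evaluation.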