Improving NFL player health using machine learning with AWS Batch
The National Football League (NFL) is the most popular sports league in America and home to 1,500+ professional athletes. The NFL is committed to better understanding the frequency and severity of player injuries in order to reduce their occurrence and make the game of football safer. This commitment led to the NFL Player Health and Safety (PH&S) initiative, which is described in more detail on the NFL’s website.
A core focus of the NFL PH&S initiative is to reduce helmet impacts among its athletes and to mitigate their effects when they do occur. To accomplish this, they first needed better insight into how often helmet collisions occur at the individual player level over the course of several games, spanning multiple seasons. These insights would empower stakeholders to make meaningful, strategic decisions that prioritize the sport’s safety.
In this post, we’ll show you how the NFL, in partnership with AWS Professional Services, leveraged the scalable compute of AWS to run their ML workloads at scale and produce the first comprehensive dataset of helmet impacts across multiple NFL seasons.
Historically, human annotators were tasked with sampling a very small subset of plays and labeling individual impacts, frame by frame, to provide a minimal level of insight to leaders. However, manual annotation at this scale has some serious limitations. First, cost: human annotators are expensive to employ and slow to produce accurate labels. It takes even the most experienced, accurate annotators about an hour to produce quality labels for a single play.
This brings us to the second issue: accuracy. Human annotators demonstrate high variability in labeled outcomes between one another (and even with themselves when relabeling the same play).
Finally, it’s difficult and expensive to generate high-quality helmet impact labels at scale, since only a small subset of plays can feasibly be sampled and annotated by human annotators. For example, it would take over 270 annotators six months (each working 24 hours per day, 7 days per week) to label just a single season. Given these limitations, the NFL needed an alternative solution to identify helmet impacts.
To gain the needed insights without the limitations accompanying manual annotation, the NFL decided to use machine learning (ML) at scale in the AWS cloud. This was to be the first, fully comprehensive, historical dataset of its kind and would require immense resources to achieve. Running their workload using AWS Batch and other services allowed the NFL to minimize cost, optimize for reliability, and maximize throughput. The resulting catalog of helmet impacts allows NFL leaders, league owners, and coaches to make informed decisions to improve game safety through better equipment standards, rule updates, and enhanced coaching strategies.
The NFL’s machine learning workload takes videos of NFL plays, such as the one displayed in Video 1, together with NFL Next Gen Stats player-tracking data, and identifies and assigns helmet impacts. Combining and leveraging these video and player-tracking data requires multiple steps, which can be summarized into three primary tasks as follows.
Video 1 – This is a raw input video example. Videos such as this serve as input to the workflow that identifies and assigns helmet impacts to players.
- Snap detection – We determine when in the video the ball is snapped so that videos can be aligned to each other and to the NFL Next Gen Stats data. This step uses an image segmentation ML model to identify players on the field and stabilize the image with respect to camera motion. The result is then fed to a change point detection model to determine the frame when the players start moving, which corresponds to the ball being snapped. This step requires a GPU instance to run the image segmentation efficiently.
- Helmet detection and player assignment – For this step, a fine-tuned helmet detector, using a GPU, identifies helmets at every frame of the video and tracks them through the play. To assign them to specific players, we next match the detected player tracks to the known player positions observed by the NFL Next Gen Stats player-tracking system. The result is the set of x-y coordinates for every player’s helmet throughout the play.
- Impact classification – We use a deep learning-based action recognition model to identify when helmets are undergoing an impact. The input to this step is a set of cropped frames centered around a particular helmet detected in the previous step.
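To make the flow between the three tasks concrete, here is a minimal sketch of the per-video inference sequence in Python. All function names, return shapes, and the placeholder values are our own hypothetical stand-ins, not the NFL’s actual code; in production each step runs as a separate AWS Batch job on a GPU instance.

```python
# Hypothetical sketch of the three-task inference flow for one play video.
# Function names and data shapes are illustrative placeholders only.

def detect_snap(video_frames):
    """Return the frame index where the ball is snapped (stub)."""
    # Real step: image segmentation to find players, stabilization for
    # camera motion, then change point detection on player movement.
    return 42

def detect_and_assign_helmets(video_frames, snap_frame, player_ids):
    """Return {player_id: [(frame, x, y), ...]} helmet tracks (stub)."""
    # Real step: a fine-tuned helmet detector tracks helmets per frame,
    # then matches tracks to Next Gen Stats player positions.
    return {pid: [(snap_frame, 0.0, 0.0)] for pid in player_ids}

def classify_impacts(video_frames, helmet_tracks):
    """Return [(player_id, frame)] impact events (stub)."""
    # Real step: crops frames around each detected helmet and runs a
    # deep learning action recognition model.
    return [(pid, trk[0][0]) for pid, trk in helmet_tracks.items()]

def run_inference(video_frames, player_ids):
    snap = detect_snap(video_frames)
    tracks = detect_and_assign_helmets(video_frames, snap, player_ids)
    return classify_impacts(video_frames, tracks)
```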
Once a game play’s video and respective NFL Next Gen Stats data have been processed through all three steps, impacts and player assignments are available to help inform NFL PH&S decision makers. To demonstrate these results, we’ve overlaid those produced for the input video (Video 1) in Video 2.
Video 2 – This is an example of the video shared in Video 1 with helmet bounding box, player assignment, and classified helmet impact results overlaid onto each frame. Notice the solution only produces results for frames after the ball snap occurs. The video has been slowed after the snap, and more so surrounding each impact, for the reader’s benefit and interpretation.
Scaling with AWS Batch
Each of these tasks requires a different combination of memory and compute resources, so we matched each one with the Amazon EC2 instance family best suited for it. Since each job uses a GPU for deep learning tasks, we run instances in the P3 and G4dn families. While P3 instances are optimized for model training, their NVIDIA Tesla V100 GPUs and higher ratio of CPU per GPU make them an appealing choice for debugging and quick tasks where speed matters most. The G4dn family, on the other hand, offers NVIDIA T4 GPUs at a lower cost per task.
Since we are mostly a team of data scientists and machine learning engineers, we wanted a managed way to implement AWS best practices including:
- Running workloads across multiple availability zones
- Allowing as wide a variety of instance types as possible for each task
- Maintaining a strong security posture by running the entire workload in a VPC
- Scaling down the entire cluster to zero vCPU when it’s not in use to save on cost
AWS Batch supported these requirements with just a few lines of code (AWS CDK in our case), allowing us to spend our time focusing on improving the models’ accuracy rather than debugging networking or a multi-Availability Zone setup.
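As a hedged illustration of how little configuration those four requirements take, the sketch below builds the request for a managed Batch compute environment using the shape of the `boto3` `create_compute_environment` API: subnets spanning multiple Availability Zones, security groups for the VPC, whole GPU instance families for flexibility, and `minvCpus` of 0 so the cluster scales to zero when idle. The names, ARNs, and IDs are ours, not the NFL’s production configuration (the actual deployment used AWS CDK).

```python
def build_compute_environment_request(name, subnet_ids, security_group_ids,
                                      instance_role_arn, max_vcpus):
    """Build kwargs for boto3's batch.create_compute_environment.

    All resource IDs below are illustrative placeholders; substitute
    your own VPC subnets, security groups, and instance role.
    """
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "computeResources": {
            "type": "EC2",
            "minvCpus": 0,                     # scale to zero when idle
            "maxvCpus": max_vcpus,
            "instanceTypes": ["g4dn", "p3"],   # whole families = widest choice
            "subnets": subnet_ids,             # one subnet per AZ
            "securityGroupIds": security_group_ids,
            "instanceRole": instance_role_arn,
        },
    }

request = build_compute_environment_request(
    name="gpu-inference",
    subnet_ids=["subnet-aaa", "subnet-bbb", "subnet-ccc"],  # three AZs
    security_group_ids=["sg-0123456789abcdef0"],
    instance_role_arn="arn:aws:iam::111122223333:instance-profile/ecsInstanceRole",
    max_vcpus=9248,
)
# To create it for real:
# import boto3; boto3.client("batch").create_compute_environment(**request)
```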
We configured our AWS Batch environment with two compute environments to match the G4dn and P3 instance families, totaling 9,248 vCPU across three Availability Zones in a single AWS Region. This setup gives us the flexibility to change the priority of the G4dn and P3 compute environments while maintaining access to the widest array of instance types in case of limited AWS capacity. For example, when a new model is ready and we want to repopulate data for entire NFL seasons, we prioritize G4dn instances to minimize total cost. But if we are experimenting on a few plays, we prioritize P3 instances, as shown in Figure 1, to have the shortest possible iteration time.
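Switching priorities this way comes down to the order of the compute environments attached to the job queue: Batch tries them in ascending `order`. The sketch below builds the request in the shape of `boto3`’s `create_job_queue` API; the queue name and environment ARNs are hypothetical placeholders.

```python
def build_job_queue_request(queue_name, env_arns_in_priority_order):
    """Build kwargs for boto3's batch.create_job_queue (or update_job_queue).

    AWS Batch schedules onto compute environments in ascending `order`,
    so listing the G4dn environment first minimizes cost, while listing
    the P3 environment first minimizes iteration time.
    """
    return {
        "jobQueueName": queue_name,
        "priority": 1,
        "computeEnvironmentOrder": [
            {"order": i + 1, "computeEnvironment": arn}
            for i, arn in enumerate(env_arns_in_priority_order)
        ],
    }

# Bulk backfill of past seasons: prefer the cheaper G4dn environment.
bulk = build_job_queue_request("inference", ["g4dn-env-arn", "p3-env-arn"])
# Quick experiments on a few plays: prefer the faster P3 environment.
fast = build_job_queue_request("inference", ["p3-env-arn", "g4dn-env-arn"])
```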
In the future, we plan to experiment with G5 and P4 instance families which can have up to 3.3x higher performance for ML workloads than G4 instances and 2.5x higher performance for ML workloads than P3 instances respectively.
Serverless architecture design patterns
As we mentioned in the Task Overview section, it was important to split the workload into three different tasks. But coordinating these steps while maintaining our ability to scale, handling the occasional failure, and providing visibility into each underlying step was a glaring challenge. That’s where AWS Step Functions comes in.
Since an AWS Batch job can be submitted directly from a Step Functions state machine, we were able to start simple and adapt as our needs changed. Today, we utilize a design, shown in Figure 2, consisting of an inference Step Functions state machine (Block C) that organizes the results from related videos (i.e., those belonging to plays from the same NFL game) using a Step Functions Map state where each map iteration represents a different video. From within each map iteration, we invoke a standardized, nested state machine (Block B) which is responsible for submitting jobs to AWS Batch while standardizing logging, error handling, retry, and caching logic.
This design approach simplifies maintenance, improves scalability, and promotes modularity, ensuring that our solution easily scales to meet the demands of the project. Since an inference Step Functions state machine (Block C) execution can contain anywhere from 1 to 150+ plays (the count is determined at execution launch), we use the Step Functions Map state to define our inference steps only once and then dynamically scale operations as the input data requires. Furthermore, we designed the nested state machine (Block B) to standardize supporting engineering and operations logic before and after Batch job submission while yielding to specific task needs. Specifically, AWS Lambda and AWS Batch job definition resource names (ARNs) are passed to the nested Step Functions state machine each time it is executed from an inference step associated with one of our primary tasks. This is how we ensure our solution is highly scalable while maximizing reusability by any type of task, even those belonging to non-inference pipelines.
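In Amazon States Language (ASL) terms, the outer Map over videos and the synchronous invocation of the nested state machine look roughly like the structure below, expressed here as a Python dict. The state names, ARNs, and input paths are placeholders of our own, not the NFL’s actual definition.

```python
# Rough ASL shape of the inference pipeline's Map state (Block C).
# State names, ARNs, and paths are hypothetical placeholders.
inference_map_state = {
    "ProcessVideos": {
        "Type": "Map",
        "ItemsPath": "$.videos",       # one Map iteration per play video
        "MaxConcurrency": 40,          # Inline-mode Map limit at the time
        "Iterator": {
            "StartAt": "RunNestedPipeline",
            "States": {
                "RunNestedPipeline": {
                    "Type": "Task",
                    # Synchronously run the nested state machine (Block B),
                    # which submits the Batch job and standardizes logging,
                    # error handling, retry, and caching logic.
                    "Resource": "arn:aws:states:::states:startExecution.sync:2",
                    "Parameters": {
                        "StateMachineArn": "arn:aws:states:us-east-1:111122223333:stateMachine:nested-task",
                        "Input": {"video.$": "$"},
                    },
                    "End": True,
                }
            },
        },
        "End": True,
    }
}
```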
While the Map state does allow the inference Step Functions state machine (Block C) to dynamically scale parallel workloads, we took additional steps to push parallelism even further and make the best use of AWS Batch.
Given Step Functions’ Map state limitation of no more than 40 concurrent Map iterations at any given time, we decided to define our inference pipeline operation in a way that would mitigate the effects of this limitation while maintaining a simple operations tracking strategy.
Before we go on, a quick aside: the Step Functions Map state limit described here applied to Map’s Inline mode at the time we performed this work. Our design predates the announcement of Step Functions Distributed Map, which supports a maximum concurrency of 10,000 Map iterations at any given time. We encourage readers with similar use cases to investigate this option for much better parallelism.
We chose to run an inference Step Functions state machine execution on subsets of all play videos belonging to a game. So, if a game has 300 unique videos where, for example, half of the videos were recorded from one viewing angle and the other half from another, we’d run two separate executions of the inference state machine, each processing 150 videos. Prioritizing parallelism in our operational design enabled the NFL to maximize their AWS Batch Compute Environment resource utilization, whether processing an entire season’s worth of data or only a couple games at once.
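The grouping logic itself is simple; a minimal sketch (with hypothetical video IDs and angle labels of our own) shows how a game’s videos can be split into one inference execution per viewing angle:

```python
from collections import defaultdict

def plan_executions(game_videos):
    """Split a game's videos into one inference execution per viewing angle.

    `game_videos` is a list of (video_id, angle) pairs. Grouping by angle
    keeps each Step Functions execution's Map input to a manageable size
    while letting multiple executions run in parallel. Illustrative only.
    """
    by_angle = defaultdict(list)
    for video_id, angle in game_videos:
        by_angle[angle].append(video_id)
    return [{"angle": a, "videos": v} for a, v in by_angle.items()]

# A hypothetical game: 150 end zone videos and 150 sideline videos.
videos = [(f"play{i:03d}_EZ", "endzone") for i in range(150)] + \
         [(f"play{i:03d}_SL", "sideline") for i in range(150)]
executions = plan_executions(videos)
# Yields two executions of 150 videos each, one per camera angle.
```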
As we alluded to already, it’s important to have visibility into the status of our executions when we’re processing so much data with AWS Batch. Doing so helps us understand the conditions of our executions, improve our debugging processes, and ensure that our solution maintains an accepted level of “explainability.” To address this need, we created engineering hooks, using AWS Lambda, Amazon DynamoDB, and Amazon CloudWatch, throughout the inference (Block C) and nested (Block B) Step Functions state machines to track custom events specific to our use case. These mechanisms laid the foundation for a separate observability pipeline which provides structure to data processing status, raises errors when they occur, and provides long-term persistence of all metadata needed for thorough review, audit, or debug.
Retaining and consolidating all metadata associated with our executions is especially important when using AWS Batch, since Batch job descriptions are deleted from the Batch API 7 days after a job’s completion. By preserving this information in searchable form, we can easily refer back to any previous execution for debugging purposes or to gain insights from past runs. It’s also worth noting that, because our pipeline’s usage is spiky, with somewhat unpredictable periods of virtually no utilization broken up by bursts of extremely high utilization, we chose DynamoDB on-demand capacity to maximize reliability and reduce tracking costs.
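One way to preserve a job record before the Batch API expires it is to snapshot the output of `describe_jobs` into DynamoDB. The sketch below is a minimal illustration under our own assumptions; the table attribute names are our convention, not an official schema, and `table` is a `boto3` DynamoDB Table resource.

```python
import json
import time

def snapshot_job_metadata(job_description, table):
    """Persist one Batch job description before the Batch API deletes it.

    `job_description` is one entry from batch.describe_jobs()['jobs'];
    `table` is a boto3 DynamoDB Table resource (or anything exposing
    put_item). Attribute names here are illustrative placeholders.
    """
    table.put_item(Item={
        "pk": job_description["jobId"],       # partition key
        "status": job_description["status"],
        "capturedAt": int(time.time()),       # when we took the snapshot
        "raw": json.dumps(job_description),   # full record for later audits
    })
```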
Finally, we designed our orchestration workflows to use Batch resources efficiently and to reduce cost, wasted compute, and time lost to redundant reprocessing. By skipping snap detection, helmet detection & player assignment, and impact classification Batch job submissions when previous Batch jobs already produced output from the same input data, we save more than $1,000 per hour in On-Demand EC2 costs. We accomplished this by profiling task input data, metadata, code versions, and model versions as part of our nested Step Functions state machine’s (Block B) auxiliary logic. If the task profile matches one from a previous, successful execution, the Batch job is skipped and the pre-existing output data is passed as the output of the current nested state machine execution. If no successful match is found, due to a change in task metadata, code, the model, or upstream tasks, the state machine submits the Batch job to produce new results.
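The profiling idea can be sketched as a content fingerprint over everything that affects a task’s output. The function names and fields below are our own illustrative choices, not the NFL’s implementation: any change to inputs, metadata, code version, or model version yields a new profile and forces a fresh Batch job, while an exact match allows the cached output to be reused.

```python
import hashlib
import json

def task_profile(input_keys, metadata, code_version, model_version):
    """Fingerprint a task so identical reruns can be skipped (illustrative).

    Serializes everything that influences the task's output into a
    canonical JSON string, then hashes it.
    """
    payload = json.dumps({
        "inputs": sorted(input_keys),   # e.g., S3 keys of input data
        "metadata": metadata,
        "code": code_version,
        "model": model_version,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def should_skip(profile, completed_profiles):
    """Skip the Batch job if this exact profile already succeeded."""
    return profile in completed_profiles
```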
Inference and beyond
Using AWS Batch in conjunction with nested Step Functions state machines to automate the step-by-step execution of tasks across the project’s various pipelines further promotes reusability of individual components and, in turn, the design’s scalability beyond inference. This high modularity, and the dependencies between the different pipelines used on the project, are showcased in the figure above.
Referring to Block D of the figure, whenever the Evaluation Step Functions state machine is triggered to generate evaluation metrics for a given set of task models, it invokes the Inference Step Functions state machine (Block C) automatically to generate the assigned helmet impact predictions for a specified subset of videos. These results produced by the Inference pipeline are then consolidated and evaluated against a set of model evaluation metrics by the Evaluation pipeline.
Similarly, our design enables the Training Step Functions state machine (Block A) to leverage the standardized, nested Step Functions state machine (Block B) for triggering feature generation, hyper-parameter tuning, and training AWS Batch jobs. This way, training pipeline tasks may also take advantage of the same tracking, error handling, retry, and caching logic discussed already.
Through these design choices and implementation strategies, we produced a massively scalable orchestration framework around AWS Batch. NFL PH&S support teams benefited from a highly robust set of automated workflows that leverage serverless resources on AWS, and stakeholders quickly received a comprehensive dataset that would guide them to make powerful, informed decisions to improve player health and safety.
Through AWS Batch, the NFL addressed one of its biggest pain points: identifying helmet impacts in its video data. Compared to hiring human annotators, our solution on AWS Batch achieves 1) $700,000/season in cost savings, 2) a 90% reduction in hours of manual labor, and 3) a 12% improvement in accuracy over human labelers. In addition, AWS Batch allows our solution to scale in massively parallel ways, which let us compress more than 24 years’ worth of computation time into less than 6 weeks. With this solution, the NFL was able to create the first comprehensive dataset of helmet impacts for all players dating back to 2016. This helps inform and guide NFL league owners and coaches in their decision-making process to make the game safer for all players.