AWS for M&E Blog

PGA TOUR® enhances AI commentary by operationalizing generative AI

This blog was co-authored by David Provan, Principal Architect, and Trey Pezzetti, Product Manager, PGA TOUR.


Professional golf presents unique broadcasting challenges: tournaments feature up to 144 players competing across 100-150 acres, with 18 holes representing 18 distinct fields of play. Over multiple days of competition, this can total more than 10,000 shots captured each day.

These challenges are amplified when the PGA TOUR executes Every Shot Live (ESL), a feature that provides live video coverage of every shot that every player makes. From the very first shot to the very last shot played in the tournament, everything is covered by over 40 video streams, using approximately 120 cameras during THE PLAYERS Championship. These video streams are then distributed to broadcast partners and rights holders around the world.

Scott Gutterman, SVP of Digital and Broadcast Technologies, explains how it’s important for the PGA TOUR to enable fans to consume content on a wide variety of platforms and in multiple formats. “We want to deliver what fans want, like the ability to follow their favorite player through the platform of their choice. When we do that everything else falls into place. Modern AWS [Amazon Web Services] technology enables us to deliver customized content to fans, matching their viewing preferences and player interests across platforms.”

At THE PLAYERS Championship in 2025, the PGA TOUR unveiled automated AI commentary in the TOURCAST application. TOURCAST can be accessed through the mobile app and the PGA TOUR website. Here we’ll focus on the PGA TOUR’s operational considerations in taking this application to production.

Monitoring and deploying generative AI (gen AI) systems present unique challenges compared to traditional software. When gen AI foundation models process user inputs and generate outputs, engineers need to track not just standard metrics like latency and throughput, but also the quality and reliability of the model’s decisions and outputs. This visibility becomes even more critical for generative AI systems that create text, images, and code where traditional metrics fall short. We’ll discuss how the PGA TOUR performs observability and deployment on one of their critical gen AI production applications, AI commentary.

Observability in generative AI

Observability lets you collect, correlate, aggregate, and analyze telemetry from your network, infrastructure, and applications. While many of these metrics are consistent across all types of workloads, there are additional considerations for generative AI. It’s helpful to break the collected data down into different levels (a brief sketch follows the list):

  1. Component-level metrics: Consumption metrics such as input tokens, output tokens, throughput, and cost.
  2. End user and product-level metrics: Information such as acceptance of content, large language model (LLM)-as-a-Judge outputs (defined later), and other business signals that are useful for understanding how your application is performing.
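To make the two levels concrete, here is a minimal sketch (our illustration, not the PGA TOUR’s code) that captures Level 1 consumption metrics from an Amazon Bedrock Converse API response and pairs them with a Level 2 business signal. The model ID, prompt, and field names are assumptions for illustration only.

```python
import boto3

# Hypothetical example: one Bedrock call, instrumented at both metric levels.
# The model ID and prompt are illustrative assumptions.
bedrock = boto3.client("bedrock-runtime")

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{
        "role": "user",
        "content": [{"text": "Summarize this golf shot: 310-yard drive, center fairway."}],
    }],
)

# Level 1 (component): consumption metrics returned with every Converse response.
usage = response["usage"]
level_1 = {
    "input_tokens": usage["inputTokens"],
    "output_tokens": usage["outputTokens"],
    "latency_ms": response["metrics"]["latencyMs"],
}

# Level 2 (end user / product): business signal, for example whether the
# commentary was accepted for publication (decided by a downstream evaluation step).
level_2 = {"commentary_accepted": True}

print(level_1, level_2)
```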

Deployments in generative AI

Continuous integration/continuous deployment (CI/CD) pipelines are crucial for decreasing the time to value (TTV) between a code push and delivered business value. While most of the process for deploying gen AI systems is the same as for traditional cloud applications, some key components differ. Prompts to an LLM function as soft code, but should be treated more like a traditional machine learning operations (MLOps) model deployment. Prompts are tested with evaluations rather than traditional unit or integration tests.
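As a rough sketch of what “evaluations instead of unit tests” can look like in a pipeline, the following hypothetical CI gate scores a prompt change against a curated evaluation set and fails the build if the pass rate drops below a threshold. The file path, record fields, stand-in scoring check, and threshold are assumptions, not the PGA TOUR’s actual pipeline.

```python
import json
import sys

# Hypothetical CI gate: score a prompt change against a curated evaluation set
# and block deployment on regression. Paths, fields, and threshold are
# illustrative assumptions.
EVAL_RESULTS = "evals/commentary_candidates.jsonl"  # produced by an earlier pipeline step
PASS_THRESHOLD = 0.85  # assumed minimum fraction of passing cases

def passes(case: dict) -> bool:
    # Stand-in for an LLM-as-a-Judge call (covered later in this post):
    # here, a naive check that the expected facts appear in the generated text.
    generated = case["generated_commentary"].lower()
    return all(fact.lower() in generated for fact in case["expected_facts"])

def main() -> int:
    with open(EVAL_RESULTS) as f:
        cases = [json.loads(line) for line in f]
    passed = sum(passes(case) for case in cases)
    score = passed / len(cases)
    print(f"Evaluation pass rate: {score:.1%} ({passed}/{len(cases)})")
    return 0 if score >= PASS_THRESHOLD else 1  # non-zero exit fails the CI stage

if __name__ == "__main__":
    sys.exit(main())
```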

How the PGA TOUR implements generative AI operations

The PGA TOUR’s primary observability platform is Amazon CloudWatch. Events and metrics are collected from various sources and aggregated to create a holistic view of how the system is performing. Amazon CloudWatch alarms are triggered when certain thresholds are met.
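As one possible way to wire up such an alarm, the sketch below uses boto3 to alarm on a custom error-count metric. The namespace, metric name, threshold, and SNS topic ARN are placeholders, not the PGA TOUR’s actual configuration.

```python
import boto3

# Hypothetical alarm on a custom metric; all names, thresholds, and the SNS
# topic ARN are placeholders.
cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="ai-commentary-red-flag-count",
    Namespace="AICommentary",                # assumed custom namespace
    MetricName="RedFlagCount",               # assumed custom metric
    Statistic="Sum",
    Period=300,                              # evaluate in 5-minute windows
    EvaluationPeriods=1,
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ai-commentary-alerts"],
    TreatMissingData="notBreaching",
)
```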

Prompts are stored in code repositories, allowing data scientists and software developers to deploy their respective updates in a unified code base while maintaining consistency in testing and rollback capabilities. The following diagram provides a high-level overview of the AI commentary system, deployment process, and observability stack.

The architecture diagram shows how developers and data scientists collaborate in the AI commentary system’s CI/CD pipeline. When developers push source code, or data scientists push prompts or evaluation datasets, to the Git repository, the CI/CD pipeline is triggered: build, unit/integration tests, and evaluations against the evaluation datasets. When tests and evaluations succeed, the code is deployed into the development environment, where end-to-end tests are performed before the code is promoted to production. The diagram also shows Level 1 and Level 2 metrics from the AI commentary system flowing into an observability stack that builds Amazon CloudWatch dashboards and sends insights to Amazon SNS for notifications.

Figure 1: Technical architecture of AI Commentary.

Level 1: Component metrics

Many of the Level 1 metrics are collected by the services themselves and pushed to Amazon CloudWatch. For example, Amazon Bedrock publishes input token count, output token count, latency, throttling, and other metrics by default.
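These model-invocation metrics land in the AWS/Bedrock CloudWatch namespace, so they can be pulled into dashboards or scripts without extra instrumentation. A minimal sketch follows, assuming a Claude model ID purely for illustration.

```python
from datetime import datetime, timedelta, timezone
import boto3

# Pull hourly input-token counts that Amazon Bedrock publishes to CloudWatch
# by default. The model ID is an illustrative assumption.
cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="InputTokenCount",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-haiku-20240307-v1:0"}],
    StartTime=now - timedelta(hours=24),
    EndTime=now,
    Period=3600,
    Statistics=["Sum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], int(point["Sum"]))
```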

PGA TOUR’s observability dashboards

Once deployments are successful, the PGA TOUR uses a CloudWatch dashboard to monitor how the system is performing. It provides insight into the full path of every shot, from the TOUR’s data feed through to generative AI commentary generation.

The dashboard shows graphs of shot feed rate, API resource utilization, API request rate, API Gateway errors, red flag counts, and API error rates.

Figure 2: System metrics dashboard.

Level 2: End user and product metrics

Level 2 is where we start to get into gen AI-specific metrics, namely subjective evaluations (content quality and understanding: does it talk about golf correctly?) and validation metrics (error rates and accuracy). With free-form text, traditional metrics become less effective. Using a pattern called LLM-as-a-Judge, we can pass AI-generated content to a second LLM to generate metrics that would otherwise be impossible to produce at scale.

With this approach, the PGA TOUR can evaluate whether the generated commentary uses PGA TOUR data correctly, and create other metrics critical for approving or denying incoming commentary. These custom metrics are then pushed to Amazon CloudWatch to provide a deeper understanding of how the system is performing.
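A minimal sketch of pushing judge-derived scores as custom metrics with put_metric_data follows; the namespace, metric names, and dimension values are illustrative assumptions rather than the PGA TOUR’s actual configuration.

```python
import boto3

# Hypothetical custom metrics derived from an LLM-as-a-Judge pass.
# Namespace, metric names, and dimension values are illustrative assumptions.
cloudwatch = boto3.client("cloudwatch")

judge_result = {"factuality": 0.92, "coherence": 0.88, "flags": 1}

cloudwatch.put_metric_data(
    Namespace="AICommentary/Judge",
    MetricData=[
        {
            "MetricName": name,
            "Value": value,
            "Unit": "None",
            "Dimensions": [{"Name": "Tournament", "Value": "THE PLAYERS Championship"}],
        }
        for name, value in judge_result.items()
    ],
)
```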


The dashboard shows a table listing the tournament, player, round, hole, and shot, together with the corresponding commentary narrative, distinguishing published commentary from commentary rejected due to discrepancies.

Figure 3: Gen AI specific metrics dashboard.

LLM-as-a-Judge

As previously described, generative AI outputs require special metrics to measure their performance. Traditional metrics become less effective when the outputs of your system are qualitative and free-form.

In a single tournament the PGA TOUR collects information on over 10,000 shots each day, making human evaluation infeasible. The solution is to use another LLM to evaluate the outputs against a grading rubric, “judging” the generated results. Using this approach, the PGA TOUR captures metrics such as factuality and coherence. These metrics are used for:

  • Tracking accuracy/context
  • Making decisions on whether commentary should be pushed to the commentary feed

If the LLM-as-a-Judge evaluation system identifies three or more flags in a piece of content, it is recorded and rejected.
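Conceptually, that publish/reject decision can be as simple as counting the flags the judge raises. The hypothetical sketch below assumes the judge model is asked to return its flags as JSON; the rubric wording, judge model ID, and response handling are assumptions, not the TOUR’s actual prompt.

```python
import json
import boto3

# Hypothetical LLM-as-a-Judge call that returns flags as JSON; the rubric,
# judge model ID, and threshold handling are illustrative assumptions.
bedrock = boto3.client("bedrock-runtime")
MAX_FLAGS = 2  # three or more flags means the commentary is recorded and rejected

JUDGE_PROMPT = """You are grading golf commentary against shot data.
Shot data: {shot_data}
Commentary: {commentary}
Return JSON only: {{"flags": ["<short description of each factual or coherence problem>"]}}"""

def approve(commentary: str, shot_data: dict) -> bool:
    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",  # assumed judge model, different from the generator
        messages=[{
            "role": "user",
            "content": [{"text": JUDGE_PROMPT.format(
                shot_data=json.dumps(shot_data), commentary=commentary)}],
        }],
    )
    text = response["output"]["message"]["content"][0]["text"]
    flags = json.loads(text)["flags"]
    return len(flags) <= MAX_FLAGS  # True -> publish, False -> record and reject
```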

Alignment is key for an LLM-as-a-Judge approach. First, a hand-graded subset of answers is curated by human experts based on the judging criteria. The LLM-as-a-Judge prompt is then run against the hand-graded evaluation set to ensure it aligns with expectations.
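One simple way to quantify that alignment is an agreement rate between the judge’s decisions and the human labels on the hand-graded set. The sketch below assumes each record already carries both labels; the field names and sample values are invented for illustration.

```python
# Hypothetical alignment check: compare judge decisions against hand-graded
# human labels. Field names and values are illustrative assumptions.
hand_graded = [
    {"shot_id": "r1-h17-s1", "human_approves": True,  "judge_approves": True},
    {"shot_id": "r1-h17-s2", "human_approves": False, "judge_approves": True},
    {"shot_id": "r1-h18-s1", "human_approves": True,  "judge_approves": True},
]

agreement = sum(
    case["human_approves"] == case["judge_approves"] for case in hand_graded
) / len(hand_graded)

print(f"Judge/human agreement: {agreement:.1%}")
# If agreement is too low, the judging prompt (or rubric) is revised and re-run
# against the same hand-graded set before it is trusted in production.
```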

The dashboard compares total published commentaries with unpublished commentaries and shows shot distribution by error score, attribute distribution, and the expected versus extracted attribute values for unpublished commentary.

Figure 4: Summary/Error attribution dashboard.

Evaluations (Gold standard dataset)

Arguably the most critical component of gen AI system deployments is running evaluations. Without evaluations, we can’t answer some of the most critical questions: “Which model should I use?”, “Are my changes improving the commentary?”, and “Am I causing a regression?”

The same approach used for creating observability metrics can be reused for evaluations, with one additional consideration. Outside of a live event, we can define what the responses should look like. This reference answer can be used to guide the LLM-as-a-Judge prompt in determining whether a prompt’s output is better or worse across a validation dataset. The PGA TOUR periodically runs previous tournament data through the AI commentary system to evaluate and validate the system’s responses.
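In other words, because the expected commentary is known ahead of time, the offline judge can grade against a reference answer rather than shot data alone. A hypothetical reference-guided rubric might look like the following; the wording and the 1-5 scale are assumptions, and the template would be sent through the same Converse call pattern shown earlier.

```python
# Hypothetical reference-guided judge prompt used during offline evaluation;
# the wording and 1-5 scale are illustrative assumptions.
REFERENCE_JUDGE_PROMPT = """You are comparing a candidate golf commentary to a
reference commentary written for the same shot.

Shot data: {shot_data}
Reference commentary: {reference}
Candidate commentary: {candidate}

Score the candidate from 1 (much worse than the reference) to 5 (as good as or
better than the reference) on factual accuracy and readability.
Return JSON only: {{"score": <1-5>, "reasoning": "<one sentence>"}}"""
```

Scores from a prompt like this can then be averaged across the validation set to compare two prompt versions, or two candidate models, against the gold standard dataset.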

Key learnings

Following are some of the key learnings gathered while creating the PGA TOUR’s automated AI commentary in the TOURCAST application:

  • Use a different model to grade your results: Models have an affinity for their own answers. For example, if you grade Anthropic’s Claude 3.7 Sonnet results with the same model, it will favor its own answers and will not evaluate the results fairly. Using a different model to grade your outputs removes that bias.
  • Use evaluations to determine which model to use: Once evaluations are curated, you can use them to inform your model choice. Start with the cheapest model, run an evaluation against it, and move to a more capable model only if the scores fall short. Using curated evaluations in this way allows the PGA TOUR to mix and match models to right-size workloads for each prompt, balancing intelligence, latency, and cost. Paired with the broad choice of models within Amazon Bedrock and a unified invocation API, model selection becomes straightforward, as sketched after this list.
  • Treat evaluations more like benchmarks than unit tests: While it is customary to expect a 100 percent pass rate in unit tests, it’s rare to get 100 percent on an entire evaluation set. Each organization, depending on the use case, must decide on a passing threshold for its evaluations. It is, however, common to see a significant degradation in evaluation scores when a bad prompt is introduced.

Conclusion

The PGA TOUR orchestrates more than 50 golf tournaments annually. By embracing generative AI powered by AWS, the PGA TOUR not only continues to innovate, but also elevates the fan experience.

This deep dive into the PGA TOUR’s generative AI observability system demonstrates how organizations can effectively monitor and evaluate AI-generated content in production. By implementing a two-level metrics approach (from basic component metrics to sophisticated LLM-as-a-Judge evaluations), comprehensive tracing, and a gold standard dataset for testing, the PGA TOUR has created a robust framework that keeps its AI commentary system reliable and high-quality.

The PGA TOUR’s approach shows that while monitoring generative AI systems presents unique challenges, combining traditional observability practices with AI-specific evaluation methods can provide the visibility and quality control needed for production-grade AI applications. Using generative AI has improved operational efficiency and employee productivity across the organization.

Learn more about how the PGA TOUR is using AWS Cloud infrastructure to deliver dynamic tournament experiences for fans, players, and media partners, or contact an AWS Representative to learn how we can help accelerate your business.


Tanner McRae

Tanner McRae is a Senior Applied AI Architect on AWS's Applied AI and Data Architecture (AIDA) team. With more than 7 years of experience in AI/ML, he helps customers architect scalable solutions spanning from the application layer to model training and development.

Murali Baktha

Dr. Murali Baktha is the Global Golf Solution Architect at AWS, spearheading pivotal initiatives involving generative AI, data analytics, and cutting-edge cloud technologies. Murali works with key executives and technology owners to understand customers’ business challenges and designs solutions to address them. He has an MBA in Finance from UConn and a doctorate from Iowa State University.