AWS for M&E Blog

How Warner Bros. Discovery uses audio analysis to improve data accuracy and enrich the fan experience

This blog is co-authored by Steve Cockett, Sports Data Adviser – Tech, at Warner Bros. Discovery and Andrea Fanfani, Principal Product Manager – Tech, at Warner Bros. Discovery.

In this blog post, we explain how Warner Bros. Discovery (WBD), a global media and entertainment company, used serverless components from Amazon Web Services (AWS) to improve the accuracy of data streams used to power fan experiences for the WBD streaming platform.

Warner Bros. Discovery’s TNT Sports over-the-top (OTT) streaming platform broadcasts top-tier sports to millions of customers around the world every week. Warner Bros. Discovery enriches the fan experience by adding timeline markers to soccer videos. This lets viewers jump to points of interest in the video by clicking on the relevant marker.

A screenshot of the TNT streaming platform showing a soccer match with timeline markers for match events.

Accurately aligning data feeds with the video presents a number of challenges. The timing accuracy of these markers is critical to the fan experience. A marker placed too late (later in the video than the action occurred) means a user who clicks it might miss the action. A marker placed too early (ahead of the video) would result in a “spoiler”, where the user sees the marker before the action happens on screen.

WBD ensures precision by manually setting kickoff markers and monitoring for accuracy. Today, by adopting an AWS serverless data processing approach, WBD benefits from improved operating efficiency through an automated alert system, significantly increasing the number of matches a single operator can cover.

The challenge

There are two key challenges that affect data accuracy for markers in live sports.

Challenge 1 – Latency

Video feeds and data feeds are both subject to latency as part of their delivery from event to end user. These two paths diverge at the stadium and are only resynchronised on the viewer’s end user device. While different camera feeds may be genlocked (synchronised to a common signal clock) to ensure individual cameras are synced for the final production, data feeds may not necessarily use the same clock. In addition, data markers are often captured manually by third-party observers, known as ‘Loggers’, who watch the broadcast feed. Data capture times are therefore subject to variations in transmission latency and in the speed of operation of the Loggers.

A diagram showing how differing latencies cause an offset in kickoff timing between video and data sources.

As shown in the previous diagram, a timing offset may exist between events in the video feed and the data feed. This is solved by calculating the offset at an anchor point (such as kickoff) in both feeds and applying the same offset to subsequent markers.

Challenge 2 – Accuracy

For the most part, sports data feeds are still curated manually by humans. The volume of video data coupled with the speed at which processing must take place means fully automated detection is often not feasible for general fan experience applications. Markers created by humans are subject to some degree of timing accuracy issues despite best efforts, especially given the need to provide data points in near real time.

A diagram showing how low accuracy causes timing issues between video and data feeds.

As shown in the previous diagram, even once the kickoff anchor point is aligned, data markers may still drift ahead of or behind the video.

The solution

Analysing audio using spectrograms

The solution works by converting the audio into ‘mel’ spectrograms. Mel spectrograms use frequency bands that are equally spaced on the mel scale. This approximates the human auditory system’s logarithmic response to pitch more closely than the linearly spaced frequency bands used in standard spectrograms.

The following example shows a mel spectrogram for audio during a penalty goal. The audio peaks (in dark red) clearly show when the whistle is blown, the ball is kicked, and the crowd’s reaction once the goal is scored.

A mel spectrogram annotated to show where actions such as a whistle sound, a ball kick and a goal appear in the audio.
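The conversion step described above can be sketched with NumPy alone. This is a minimal illustration, not WBD’s implementation: the FFT size, hop length, band count, and sample rate are all assumed values.

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale: approximately linear below 1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters whose centres are equally spaced on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fb[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fb[i - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mel_spectrogram(audio, sr, n_fft=1024, hop=512, n_mels=64):
    # Windowed short-time power spectra, pooled into mel bands, in dB.
    frames = np.array([audio[i:i + n_fft] * np.hanning(n_fft)
                       for i in range(0, len(audio) - n_fft, hop)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return 10.0 * np.log10(mel_filterbank(n_mels, n_fft, sr) @ power.T + 1e-10)

# Example: one second of a whistle-like 4 kHz tone sampled at 16 kHz.
sr = 16000
tone = np.sin(2 * np.pi * 4000 * np.arange(sr) / sr)
spec = mel_spectrogram(tone, sr)  # shape: (n_mels, n_frames)
```

The tone’s energy lands in the mel band covering 4 kHz, which is exactly the property the whistle detection step relies on. A production pipeline would typically use a dedicated audio library for this conversion.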

Using audio analysis instead of processing frames of video has several advantages:

  • Audio is fast to process. The solution’s audio processing algorithms run about 25 times faster than real time: 10 seconds of audio takes 0.4 seconds to process.
  • The solution relies primarily on the data in the audio, using computer vision only as a second level of validation. This greatly reduces the processing cost of the solution and removes the operational overhead associated with managing custom AI/ML models.
  • Audio is generally representative of the exact time in a match. Instant replays show past events, but normally maintain live stadium audio.
  • By converting the audio to spectrograms, relatively little data is processed, improving cost efficiency.

Solving latency challenges by detecting kickoff

Kickoff is a known time in the data feed, but not in the video feed. To align the two feeds, kickoff must be detected in the video. The first step to achieve this is to detect the referee’s whistle that signals the start of the match.

This is described in the following graph:

A mel spectrogram showing an audio clip with a whistle being blown. Below, a graph shows how data analysis pinpoints the moment the whistle is blown.

A mel spectrogram is created from an audio segment. Intense samples in the expected whistle frequency band (3.6 kHz to 4.1 kHz) are then isolated. The average of all frequency bands is subtracted from the whistle band, producing a one-dimensional signal in which whistle sounds stand out from the background noise. When the signal crosses a configurable threshold, a whistle is considered detected.
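A sketch of this band-isolation step, assuming a spectrogram in dB and a known mapping from bins to frequencies. The band edges mirror the values in the text; the threshold and the synthetic input are illustrative:

```python
import numpy as np

def whistle_signal(spec_db, freqs, band=(3600.0, 4100.0)):
    # spec_db: (n_bins, n_frames) spectrogram in dB; freqs: bin centre
    # frequencies in Hz. Subtracting the all-band average from the whistle
    # band suppresses broadband noise such as crowd sound.
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return spec_db[in_band].mean(axis=0) - spec_db.mean(axis=0)

def detect_whistle(signal, threshold=10.0):
    # Frame indices where the whistle signal crosses the threshold.
    return np.flatnonzero(signal > threshold)

# Synthetic example: 100 frames of flat -60 dB background, with a burst of
# energy in the whistle band at frames 40-45.
freqs = np.linspace(0, 8000, 257)
spec = np.full((257, 100), -60.0)
band_bins = (freqs >= 3600) & (freqs <= 4100)
spec[np.ix_(band_bins, np.arange(40, 46))] = -10.0
hits = detect_whistle(whistle_signal(spec, freqs))
```

On real audio, a short burst of in-band energy rises well above the all-band average, while energy spread across the whole spectrum largely cancels out.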

Whistle sounds may occur before or after the kickoff whistle, so to further refine the results, Amazon Rekognition is used on the video frame at the detected whistle time to validate the result. Rekognition object detection checks that the camera in use is the high, wide “Camera 01” angle (normally shown during kickoff) and validates that a minimum number of players are on the pitch.

As a final failsafe, Amazon Rekognition text detection is used to read the match clock graphic, which is typically added to the broadcast feed in a corner of the image. If the kickoff whistle is missed, the elapsed time shown on the match clock is subtracted from the time of the video frame to recover the kickoff position.
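The two validation checks can be illustrated against the response shapes returned by Rekognition’s detect_labels and detect_text APIs. The helper names, the minimum player count, and the MM:SS clock format are assumptions for illustration; the boto3 calls that would produce the responses are omitted:

```python
import re

def enough_players(labels_response, min_players=15):
    # labels_response mimics the shape of a Rekognition detect_labels result:
    # each label may carry one Instance per detected occurrence.
    for label in labels_response.get("Labels", []):
        if label["Name"] == "Person":
            return len(label.get("Instances", [])) >= min_players
    return False

def kickoff_from_clock(text_response, frame_time_s):
    # Parse an MM:SS match clock from a Rekognition detect_text result and
    # subtract the elapsed time to recover the kickoff position.
    for det in text_response.get("TextDetections", []):
        m = re.fullmatch(r"(\d{1,2}):(\d{2})", det["DetectedText"])
        if m:
            elapsed = int(m.group(1)) * 60 + int(m.group(2))
            return frame_time_s - elapsed
    return None

# Example response fragments in the detect_labels / detect_text shapes.
labels = {"Labels": [{"Name": "Person", "Instances": [{}] * 20}]}
texts = {"TextDetections": [{"DetectedText": "12:34"}]}
```

With a frame at 1000 seconds and a clock reading 12:34, the kickoff position resolves to 246 seconds into the video.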

Solving marker accuracy using crowd excitement

Using ambient crowd audio to align data events to video is an elegant solution to resolve marker timing issues. After all, there are tens of thousands of fans in the stadium telling us exactly when points of interest occur and it is sensible to use their noise for this purpose.

This is demonstrated in the following graphs:

A mel spectrogram showing an audio clip with a crowd cheering. Below, a graph shows how data analysis pinpoints the moment the crowd cheers.

The first graph shows the mel spectrogram extracted from the audio. The dark red crowd noise at roughly 12 seconds shows the crowd reaction from the penalty goal.

The second graph shows a derived excitement score. The mel bands where the crowd noise is most prominent are averaged. This average crowd intensity is combined with the change in crowd intensity (the difference between each sample and the next) to give an overall picture of crowd excitement.
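A sketch of such a score, assuming a mel spectrogram in dB and a known set of crowd-prominent bands; the equal weighting of intensity and its frame-to-frame change is an assumed parameter:

```python
import numpy as np

def excitement_score(spec_db, crowd_bands, w_delta=1.0):
    # spec_db: (n_mels, n_frames) mel spectrogram in dB.
    # crowd_bands: indices of the mel bands where crowd noise is prominent.
    intensity = spec_db[crowd_bands, :].mean(axis=0)
    # Frame-to-frame change, padded so the shapes match.
    delta = np.diff(intensity, prepend=intensity[0])
    return intensity + w_delta * delta

# Synthetic example: quiet crowd bands with a sudden swell at frame 12.
spec = np.full((64, 25), -50.0)
spec[10:30, 12:] = -20.0  # crowd bands 10-29 jump in level from frame 12
score = excitement_score(spec, np.arange(10, 30))
```

The delta term makes the score peak at the onset of the swell rather than across the whole sustained cheer, which is what marker alignment needs.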

When a new data marker is received, it is adjusted to the crowd excitement peak within a configurable time window. For example, if the marker showed the goal happening at 10 seconds in the window above, it would be corrected to 12 seconds, where the excitement peak is found (marked by a green cross).
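The correction itself reduces to a peak search in a window around the incoming marker. A minimal sketch, with frame indices and window size chosen to mirror the 10-second/12-second example:

```python
import numpy as np

def snap_marker(excitement, marker_frame, window):
    # Move the marker to the highest excitement score within +/- window frames.
    lo = max(marker_frame - window, 0)
    hi = min(marker_frame + window + 1, len(excitement))
    return lo + int(np.argmax(excitement[lo:hi]))

# Excitement peak at frame 12; the incoming marker says frame 10.
score = np.zeros(30)
score[12] = 8.0
corrected = snap_marker(score, 10, window=5)
```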

AWS Processing Architecture

Warner Bros. Discovery was able to prototype the solution rapidly on AWS. The following describes the serverless, event-based architecture used for processing:

A diagram showing the AWS serverless architecture.

HLS (a common video streaming format) video segments are streamed into the “Media Source” Amazon S3 bucket. This triggers an indexing function that extracts the segment’s embedded timecode and writes a record containing this information to the “Indexed Segments” Amazon DynamoDB table. If kickoff has not yet been detected, the segment is also processed by the kickoff detection function, the result of which is stored in the “Kickoff Status” DynamoDB table.

Data markers are captured by Amazon EventBridge, which in turn triggers a mapping function to route each action type to a specific action function. In this way, logic can be maintained independently for different action types, for instance, goals versus yellow cards.

The correction is applied to the data marker and saved in the “Updated Sports Feed” DynamoDB table for publishing and reporting. The correction delta values are also written to Amazon CloudWatch metrics so that dashboards and alarms can be created. In this way, operators are alerted when data events drift too far from the video.
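The metrics step might look like the following sketch, which builds a payload in the shape expected by CloudWatch’s PutMetricData API. The namespace, metric name, and dimension are hypothetical, and the boto3 call is shown only as a comment:

```python
import datetime

def correction_metric(marker_time_s, corrected_time_s, match_id):
    # Record how far the original marker drifted from the detected
    # video position, so dashboards and alarms can track it per match.
    delta = corrected_time_s - marker_time_s
    return {
        "Namespace": "SportsMarkers",
        "MetricData": [{
            "MetricName": "CorrectionDelta",
            "Dimensions": [{"Name": "MatchId", "Value": match_id}],
            "Timestamp": datetime.datetime.now(datetime.timezone.utc),
            "Value": delta,
            "Unit": "Seconds",
        }],
    }

# In a deployed function this payload would be passed to boto3, e.g.:
# boto3.client("cloudwatch").put_metric_data(**correction_metric(10.0, 12.6, "match-001"))
payload = correction_metric(10.0, 12.6, "match-001")
```

A CloudWatch alarm on the CorrectionDelta metric is then enough to notify an operator when corrections grow unusually large.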

Results and accuracy

The following table shows test results for a single match. In this case, the events of the match were observed and recorded manually by a human. If the audio processing logic detected the event within 1,000 ms of the human-detected time, the result is considered a pass.

A table of results showing the performance of the algorithm vs human results.

Of the 25 match events, 24 were detected correctly. The one failed detection, for a shot on goal, was due to the shot being very poor: the ball rolled slowly towards the goalkeeper, who picked it up and kicked it away. The crowd cheer came at the point the goalkeeper cleared the ball, roughly 2.6 seconds after the “shot” itself. This is reflected in the low crowd excitement score.

The results are ordered by crowd excitement, as the top two results show. These actually form the same event, as the goalkeeper deflected the first shot and the goal was scored immediately afterwards. A benefit of this approach is that match events can be ranked by crowd excitement. As shown in the list, goals typically elicit the most excitement, followed by cards, then shots on goal. This provides insight into which events were most meaningful to the crowd.

Conclusion

Warner Bros. Discovery was able to build a working prototype for marker data correction in a matter of weeks. This rapid innovation was made possible by the elasticity and simplicity of AWS serverless compute and managed services. Unlocking the rich data of audio in the workflow has enabled a cost-effective solution that runs in near real time and will ultimately improve the end fan experience.

Andrew Lee

Andrew is a Senior Media Cloud Architect with AWS Professional Services based in Amsterdam, The Netherlands. In his spare time he loves playing guitar, reading a good book or traveling around the world.

James Kellar

James is a Principal Consultant with AWS Professional Services based in the United Kingdom. Outside of AWS he can be found playing bass with his Irish folk band or grunge with his daughter’s 90s band.

Vivek Thacker

Vivek is a Senior Engagement Manager with AWS Professional Services based in London, United Kingdom. When not managing engagements, Vivek is out with family and friends, scouring London for the best food on offer.