AWS for M&E Blog

Automate broadcast video monitoring using machine learning on AWS  

Monitoring service providers for broadcast and over-the-top (OTT) livestreams perform a large number of quality checks. These range from low-level signal errors to high-level issues such as incorrect content. Traditional live media analyzer software focuses on quality checks at the signal level, such as the ETSI TR 101 290 Priority 1 and 2 checks. Higher-level quality checks, such as verifying program content, subtitles, or audio language, are performed by human operators who constantly watch the broadcast stream for issues. As the number of broadcast video streams grows, it is challenging and costly to scale this manual monitoring effort to support additional channels and programs.

Recent advances in artificial intelligence (AI) can automate many of the higher-level monitoring tasks that were once entirely manual. With assistance from AI-based detections, human operators can focus on higher-level tasks, respond to issues faster, and monitor a larger number of channels with higher quality. In this blog post, we walk through an example application that uses AWS AI services such as Amazon Rekognition to analyze the content of an HTTP Live Streaming (HLS) video stream. It performs an example set of monitoring checks in near real time (under 15 seconds). We also discuss how the example application can be extended to support additional use cases.

Prerequisites

We have shared the sample application code in the GitHub repo. If you would like to deploy the application in your account and customize it for your own monitoring use case, make sure you have the following in place:

  • An AWS account that you have administrative permissions for
  • A GitHub account
  • Python 3 and pipenv installed
  • Prior experience with AWS CloudFormation and Python (advised, but not required, to deploy the solution)

Before diving into the solution, we also recommend that you familiarize yourself with the technologies and standards used throughout this post.

Solution Overview

A broadcast quality control solution needs to monitor various aspects of the livestream to make sure:

  • Video – there are no blank screens, pixelation, or blocky artifacts; movement is smooth; and the correct content is playing according to schedule
  • Audio – audio is present, at the appropriate volume, with correct lip sync and the correct language
  • Logo and graphic overlays – the logo is present and not cropped, overlays are correct according to schedule, and everything refreshes correctly
  • Closed captions – captions are available, with correct timing, subtitles that match the audio, and the correct language
  • Embedded metadata (e.g. SCTE) – metadata is present, with correct ad insertion

While some of these checks can be performed using traditional image and audio analysis algorithms, many are well-suited for detection using custom machine learning (ML) models. The sample application implements a subset of these checks, representative of different approaches for automation:

  • Audio silence detection – based on volume threshold, no machine learning required.
  • Station logo verification – identifying known logos from images is well suited for Convolutional Neural Networks (CNN) based ML models. In this application, we leveraged Amazon Rekognition Custom Labels to build an object detection model for this feature.
  • Correct video content verification (domain specific) – determining whether the correct program is playing according to schedule is a complex task that is best answered by breaking the question down into more specific problems. You can use an ensemble of different ML models to help a computer identify the program being streamed. To demonstrate this approach, we implemented content identification for a specific domain: livestreamed team sports. The system can verify the content of a sports event by combining the following:
    • High-level program type: Does the video look like a sports program? If so, is the right sport being played (e.g. soccer)? To answer this question, we built a custom image classification model using Rekognition Custom Labels.
    • Sports team identification: If the video is showing the correct sport, can we verify that the correct match is being streamed? For team sports such as soccer or football, there are a variety of visual and audio signals that identify the teams playing. In the sample application, two different ML approaches are combined to answer this question: we used the text-in-image extraction feature of Amazon Rekognition to look for team names or abbreviations on screen, and we leveraged Rekognition Custom Labels to train a model to recognize specific soccer team logos.

Solution Architecture

The solution architecture for the application consists of three main components:

  • A video ingestion pipeline where HLS streams produced by AWS Elemental MediaLive are stored in an Amazon Simple Storage Service (Amazon S3) bucket
  • A video processing pipeline orchestrated by AWS Step Functions that performs monitoring checks on extracted frames and audio from each video segment
  • A web application that demonstrates the real-time status and details of each monitoring check being performed on the video stream

 

(Broadcast Video Monitoring architecture)

Video ingestion

In our sample broadcast monitoring application, AWS Elemental MediaLive ingests and transcodes source content into HLS format. The source content may come from a number of upstream systems, such as a streaming camera, a contribution encoder appliance, or another HLS stream. For development and testing of the application, we opted to use an MP4 file stored on Amazon S3 as the input source for the MediaLive channel. Using an S3 file input, along with the auto-looping capability of MediaLive, is an easy way to create a test livestream.

Another useful feature of AWS Elemental MediaLive is the option to write HLS output files to Amazon S3. With this option enabled, every time a new .ts segment file is generated for the livestream, MediaLive writes it to an S3 object and uploads a new version of the .m3u8 playlist file. This allows you to set up AWS Lambda triggers directly on these S3 upload events to start the processing workflow for each new HLS segment. We discuss this further in the next section.

It’s important to note that although MediaLive’s HLS output to S3 is used here to simplify video ingestion with a push model, you can also use a poll model. This requires setting up a fleet of long-running Amazon EC2 instances or AWS Fargate containers that continuously poll and download new segments of each HLS stream that needs monitoring. The polling method requires more engineering effort and operational overhead, but it allows more flexibility and can be used to monitor a livestream produced by any media server without introducing additional transcoding.

Deploy the video ingestion and processing pipeline

If you want to deploy the video ingestion and backend processing pipeline of the sample application in your account, use the following steps (otherwise, skip to the next section to continue reading):

  1. Use the following button to launch a CloudFormation stack: 
  2. Select the “Next” button to continue
  3. In Step 2: Specify stack details, review the stack parameters. These settings configure the source of the HLS stream that the AWS Elemental MediaLive pipeline produces and the application monitors. Keep the defaults to generate a test stream from the sample MP4 file we provide on S3, or change the settings here to point to your own video files or streams. Once the stack is created, you can change the input configuration at any time in the AWS Elemental MediaLive console; the MediaLive pipeline lets you switch between input sources seamlessly as long as you stop the pipeline before making changes.

    (Video ingestion and processing pipeline CloudFormation stack parameters)

  4. Click the “Next” button. On the Step 3: Configure stack options page, keep all defaults, and click Next again
  5. On the Step 4: Review page, select the checkboxes to acknowledge that CloudFormation may create IAM resources and may require the CAPABILITY_AUTO_EXPAND capability, and then click “Create stack”.
  6. You can continue reading the rest of this blog while you wait for the stack to finish launching (this could take about 10 minutes, due to the creation of a CloudFront distribution for video playback).

More details on the deployment can be found in the README.md of the GitHub repo.

Event driven processing of new video segments

Events generated by new video segments drive the monitoring checks in the video processing pipeline. Every time a new version of the .m3u8 playlist file is written to S3 by AWS Elemental MediaLive, it triggers an AWS Lambda function, StartSfnFunction (source code here). The Lambda function then kicks off a Step Functions state machine that executes the analysis workflow on the new HLS segment.
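
The exact trigger code lives in the repo, but a minimal sketch of this pattern might look like the following (the STATE_MACHINE_ARN environment variable and the input field names are illustrative assumptions, not the exact interface used by the sample application):

import json
import os
import urllib.parse

import boto3

sfn = boto3.client('stepfunctions')


def lambda_handler(event, context):
    """Triggered by S3 upload events for new .m3u8 playlist versions."""
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        version_id = record['s3']['object'].get('versionId')

        # Hand the playlist location to the state machine; its first step
        # ("Parse Manifest") downloads and parses this playlist.
        sfn.start_execution(
            stateMachineArn=os.environ['STATE_MACHINE_ARN'],  # assumed env var
            input=json.dumps({
                's3Bucket': bucket,
                's3Key': key,
                's3VersionId': version_id,
            })
        )

(A sketch of a Lambda handler that starts a Step Functions execution for each playlist update)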

The AWS Step Functions workflow consists of processing steps, each implemented by a separate AWS Lambda function. In production, the number of monitoring checks required and the media sampling rate may vary significantly depending on the livestream schedule and content. By building the processing pipeline from serverless and modular components, the architecture enables easy and cost-effective elastic scaling to support fluctuating demand. The sample application also includes a simple feature-flagging system that allows individual monitoring checks to be disabled at runtime (controlled by the environment variables of the StartSfnFunction Lambda function).

When creating AWS Step Functions state machines, there are two types to choose from: Express or Standard workflows (you can learn more about their differences in the documentation here). The Express Workflow type is specifically optimized for high-volume, short-duration use cases. While we recommend using Express Workflows in a production environment to monitor video streams at scale, we found the graphical workflow UI of the Standard type convenient for initial development and testing. The workflow type is chosen when a state machine is created and can be configured by a parameter in CloudFormation (see example).

The following is a graphical representation of an example execution of the monitoring workflow on a specific HLS segment. Defining the workflow in Step Functions allows you to run different monitoring checks in parallel to minimize end-to-end detection latency. In our testing, each segment in the HLS stream is configured with a duration of 6 seconds, which is typical for live streaming. The processing workflow is able to complete all the analysis steps for each segment in less than 15 seconds. If you need lower latency for your use case, you can reduce the HLS segment size to 2-4 seconds. Doing so reduces the time spent in steps that need to read the whole segment (such as Extract Frames). However, consider the additional load that shorter segments place on the system, such as an increased number of AWS Lambda function invocations, S3 put/get requests, and so on.

In the next sections, we discuss the major processing steps in detail.

(Processing pipeline state machine graph)

 

Parse HLS manifest

As the preceding workflow diagram shows, the processing pipeline starts with a “Parse Manifest” step that downloads and parses the .m3u8 HLS manifest file (also known as the playlist file). With S3 object versioning enabled on the bucket storing the HLS output files produced by AWS Elemental MediaLive, every update to the manifest file results in a new S3 object version. This allows each update to the manifest file to be processed by the event-driven workflow at least once.

With the exception of the master manifest (learn more here), the manifest file for a given bitrate contains a list of the most recent media segments in the stream and their timing information. See the following example:

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:478
#EXT-X-PROGRAM-DATE-TIME:2020-03-12T21:08:45.867Z
#EXTINF:6.00000,
test_1_00478.ts
#EXT-X-PROGRAM-DATE-TIME:2020-03-12T21:08:51.867Z
#EXTINF:6.00000,
test_1_00479.ts
#EXT-X-PROGRAM-DATE-TIME:2020-03-12T21:08:57.867Z
#EXTINF:6.00000,
test_1_00480.ts
#EXT-X-PROGRAM-DATE-TIME:2020-03-12T21:09:03.867Z
#EXTINF:6.00000,
test_1_00481.ts
#EXT-X-ENDLIST

 

(an example of an HLS manifest file) 
Parsing the manifest file allows the application to identify the latest media segment (“test_1_00481.ts” in the preceding example), its program time (“2020-03-12T21:09:03.867Z”), and its duration (6 seconds), and to perform subsequent analysis on that segment.

The EXT-X-PROGRAM-DATE-TIME tag must be present in the HLS manifest file for the application to determine the timestamp of each media segment and compare it with the expected schedule metadata. In AWS Elemental MediaLive, this is enabled in the “Manifest and Segments” section of the HLS output configuration.

(AWS Elemental MediaLive configuration required to produce the datetime tag)

 

The source code for the manifest parsing logic can be found in the project GitHub repo here.
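
For reference, a minimal sketch of this parsing step, using the open source m3u8 library, could look like the following (the function and field names are illustrative, and the repo implementation may differ):

import boto3
import m3u8

s3 = boto3.client('s3')


def latest_segment(bucket, key):
    """Download the playlist from S3 and return details of the newest segment."""
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
    playlist = m3u8.loads(body)

    # The last entry in the media playlist is the most recently added segment
    segment = playlist.segments[-1]
    return {
        'uri': segment.uri,                              # e.g. test_1_00481.ts
        'duration': segment.duration,                    # e.g. 6.0 seconds
        'program_date_time': segment.program_date_time,  # from EXT-X-PROGRAM-DATE-TIME
    }

(A sketch of parsing the HLS manifest with the m3u8 library)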

Match expected program metadata

In order for the broadcast monitoring application to determine whether the current video segment has the correct station logo and is playing the correct content, it needs to compare what it detects in the stream against what the schedule expects. In the sample application, the expected program metadata is stored in an Amazon DynamoDB table, video-processing-dev-Schedule:

(DynamoDB view of the expected program metadata)

The “Find Expected Program” step in the workflow reads the timestamp and duration of each media segment and looks up the corresponding program and its metadata. It also dynamically disables monitoring checks that are not relevant to the scheduled program. For example, in the first row of the preceding table, the expected program is a news program. Because the sports identification check is not applicable to this type of program, this step sets a configuration flag to skip that check in later processing.

The source code for this step can be found on GitHub here. To populate the schedule DynamoDB table with example values for the demo video footage we supplied, follow the instructions in broadcast-monitoring/scripts/README.md (GitHub link).

Note that the preceding example table and code use relative start and end times for program timing, because our test video stream is produced by playing a recorded video on a loop. These timestamps would be replaced by absolute date and time values in a production environment.
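
As an illustration, a lookup against such a schedule table might be sketched as follows (the table key and attribute names are assumptions for this example, not the exact schema used in the repo):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('video-processing-dev-Schedule')


def find_expected_program(stream_id, segment_offset_seconds):
    """Return the schedule entry whose time window covers the segment."""
    resp = table.query(KeyConditionExpression=Key('Stream_ID').eq(stream_id))
    for item in resp['Items']:
        if item['Start_Time'] <= segment_offset_seconds < item['End_Time']:
            return item  # expected program type, station logo, teams, etc.
    return None

(A sketch of the expected program lookup against the DynamoDB schedule table)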

Implement Monitoring Check #1: Audio silence monitoring

Monitoring “loudness” is a capability typically found in traditional live media analyzer software. In this sample application, we implemented a silence check to demonstrate how comparable monitoring can be performed on the audio track of the livestream. The implementation leverages the ffmpeg library to detect loudness and periods of silence in each HLS segment.

Although ML is not used for this check, the AudioDetection function is an example of how to perform checks at the video segment level. The AWS Step Functions workflow invokes this function with the S3 location of the transport stream (.ts) file, supplied in the state machine data passed between steps. As shown in the following code sample, this sample application also demonstrates the use of Lambda Layers to include the ffmpeg executable as a layer sourced from an application in the AWS Serverless Application Repository.

 

ffmpeglambdalayer:
  Type: AWS::Serverless::Application
  Properties:
    Location:
      ApplicationId: arn:aws:serverlessrepo:us-east-1:496010403454:applications/ffmpeg-lambda-layer-python3
      SemanticVersion: 0.0.3

...


AudioDetectionFunction:
  Type: AWS::Serverless::Function
  Properties:
    CodeUri: ../src/audio_detect/app/
    Handler: main.lambda_handler
    Role: !GetAtt ProjectLambdaRole.Arn
    Layers:
      - !GetAtt ffmpeglambdalayer.Outputs.ffmpegLayerArn

(Example of using a Lambda Layer from a serverless application repo app)

 

While simple, this functionality can easily be extended in your own solution, for example to extract the audio track for use with Amazon Transcribe to detect languages, monitor for keywords, or support a host of other use cases.
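
For illustration, a minimal sketch of the silence check might invoke ffmpeg’s silencedetect filter on the downloaded segment as follows (the binary path and thresholds are assumptions; the AudioDetection function in the repo is the authoritative implementation):

import re
import subprocess

FFMPEG = '/opt/bin/ffmpeg'  # assumed location of the binary provided by the Lambda layer


def detect_silence(ts_path, noise_threshold='-60dB', min_duration=1.0):
    """Return the durations (in seconds) of silent periods found in the segment."""
    cmd = [
        FFMPEG, '-i', ts_path,
        '-af', f'silencedetect=noise={noise_threshold}:d={min_duration}',
        '-f', 'null', '-',
    ]
    # ffmpeg writes filter output to stderr
    result = subprocess.run(cmd, stderr=subprocess.PIPE, text=True)
    durations = re.findall(r'silence_duration:\s*([\d.]+)', result.stderr)
    return [float(d) for d in durations]

(A sketch of running ffmpeg’s silencedetect filter on a .ts segment)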

Video frame sampling

All of the image-based monitoring checks (e.g. station logo verification, sports type and team identification in the example application) require running ML models on frames extracted from the video segment. To maximize efficiency, the “Extract Frames” step in the Step Functions workflow extracts frames from the video using OpenCV and persists them in S3, storing their metadata in DynamoDB. The set of extracted frames is then consumed by multiple subsequent processing steps.

To increase the confidence of ML-based detections and reduce noise caused by outlier false positives, the workflow evaluates inference results across multiple frames from one media segment. It only reports a check as failing when the percentage of problematic frames exceeds a threshold. On the other hand, it is not necessary to run the ML models on every single frame of the video stream, which would be costly. Our demo application samples frames at a rate configurable through an environment variable of the FrameExtractor function.
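
A minimal sketch of this frame sampling step, assuming OpenCV is available in the Lambda environment and using an illustrative environment variable name, might look like the following:

import os

import boto3
import cv2

s3 = boto3.client('s3')
SAMPLE_INTERVAL_SECONDS = float(os.getenv('FRAME_SAMPLE_INTERVAL', '1.0'))


def extract_frames(ts_path, bucket, key_prefix):
    """Sample frames from the segment at a fixed interval and upload them as JPEGs."""
    cap = cv2.VideoCapture(ts_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(fps * SAMPLE_INTERVAL_SECONDS), 1)

    keys, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            encoded, jpeg = cv2.imencode('.jpg', frame)
            if encoded:
                key = f'{key_prefix}/frame_{index:05d}.jpg'
                s3.put_object(Bucket=bucket, Key=key, Body=jpeg.tobytes())
                keys.append(key)
        index += 1
    cap.release()
    return keys

(A sketch of sampling frames from a video segment with OpenCV and persisting them to S3)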

Implement Monitoring Check #2: Station Logo Monitoring

To identify station logos in the extracted frames, the solution uses Amazon Rekognition Custom Labels. Paired with a process to rapidly generate datasets, Rekognition Custom Labels enables a custom object detection model to be quickly trained, validated, and deployed for this solution. We dive deeper into the details in the following sections.

Generating Training Images

Training a custom ML model typically requires thousands of examples and days or weeks of tuning to produce confident detections. Amazon Rekognition Custom Labels can build a model that achieves similar results with only tens to hundreds of training images. To train the station logo model, we synthetically generated training images by overlaying different station logos on various background images. To build a robust model that can detect a logo appearing in production video streams with some variation (e.g. different sizes, greyscale, opacity), we applied a technique called image augmentation.

To make training data generation a repeatable process, we wrote a script (generate-logo-images.py) that automates these steps. For each station logo we want the model to recognize, the script first augments the logo using the imgaug library and then overlays it on different background images at randomly selected positions. The following figure shows the result of an augmented AWS Elemental logo overlaid on background images (the bounding boxes drawn on the images are for illustration only). This process provides a repeatable mechanism to easily scale the number of images available for a training dataset.

(Images with augmented logos overlaid, used for Rekognition Custom Labels training)
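
A simplified sketch of this generation step is shown below; it augments a logo with imgaug, pastes it at a random position on a background image, and records the bounding box. It illustrates the technique rather than reproducing the exact logic of generate-logo-images.py.

import random

import imgaug.augmenters as iaa
import numpy as np
from PIL import Image

augmenter = iaa.Sequential([
    iaa.Resize((0.5, 1.5)),           # vary logo size
    iaa.Grayscale(alpha=(0.0, 1.0)),  # sometimes desaturate
    iaa.Multiply((0.7, 1.3)),         # vary brightness
])


def compose(logo_path, background_path, opacity=0.9):
    """Overlay an augmented logo on a background and return the image and bounding box."""
    logo = Image.open(logo_path).convert('RGB')
    augmented = Image.fromarray(augmenter(image=np.array(logo)))
    background = Image.open(background_path).convert('RGB')

    # Random top-left position that keeps the logo fully inside the frame
    x = random.randint(0, background.width - augmented.width)
    y = random.randint(0, background.height - augmented.height)
    region = background.crop((x, y, x + augmented.width, y + augmented.height))
    background.paste(Image.blend(region, augmented, opacity), (x, y))

    bbox = {'left': x, 'top': y, 'width': augmented.width, 'height': augmented.height}
    return background, bbox

(A sketch of synthetic training image generation with imgaug and Pillow)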

Create a dataset and train in Amazon Rekognition Custom Labels

An Amazon Rekognition Custom Labels project dataset consists of images, assigned labels, and bounding boxes that you use to train and test a custom model. You create and manage datasets using the Custom Labels console. For experimentation and small datasets, you can upload images to the console, then manually label them and draw the bounding boxes. You can also create a dataset by importing a SageMaker Ground Truth manifest file. If you have a collection of images with station logos, you can use a data labeling service such as Amazon SageMaker Ground Truth to create a manifest file. Because we artificially generated the training images with known labels (station logos) and bounding boxes, the generation script outputs a Ground Truth .manifest file used to create the training and test datasets.
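
For context, one line of an object detection Ground Truth-style manifest, as our generation script might emit it, could look like the following sketch (the label attribute name "logo-detection", the job name, and the class map are illustrative assumptions):

import json
from datetime import datetime, timezone


def manifest_line(s3_uri, bbox, image_width, image_height, label='AWS Elemental'):
    """Return one JSON line describing an image, its label, and its bounding box."""
    return json.dumps({
        'source-ref': s3_uri,
        'logo-detection': {
            'image_size': [{'width': image_width, 'height': image_height, 'depth': 3}],
            'annotations': [{'class_id': 0, **bbox}],  # bbox: left/top/width/height in pixels
        },
        'logo-detection-metadata': {
            'objects': [{'confidence': 1}],
            'class-map': {'0': label},
            'type': 'groundtruth/object-detection',
            'human-annotated': 'yes',
            'creation-date': datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%S'),
            'job-name': 'synthetic-logo-generation',
        },
    })

(A sketch of generating a Ground Truth-style manifest line for an object detection dataset)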

For the sample application, we trained models using a generated dataset with an 80/20 split for training and testing. The following image displays training and validation results for a custom model trained with 800 training images and 200 test images in about 2 hours; Amazon Rekognition Custom Labels handled all of the heavy lifting.

(Training and validation results from an object detection Custom Labels training job)

 

Make inferences using Amazon Rekognition Custom Labels

Once you are satisfied with model performance, it’s time to run the model for inference in the processing pipeline. A number of factors determine the throughput of a Custom Labels inference unit. To make sure we provisioned enough capacity for the pipeline in this use case, we set MinInferenceUnits to 2 to provide a buffer for processing video segments. Frames extracted from each video segment are processed in parallel by the LogoDetection Lambda function. As with the AudioDetection function, state machine data is supplied to each function invocation with the program information and the S3 location of the image to be processed.

{
      "parsed": {...},
      "config": {...},
      "frame": {
        ...
        "DateTime": "2020-01-23T21:36:35.290000Z",
        "Chunk": "test_1_00016.ts",
        "Millis_In_Chunk": 0,
        "Frame_Num": 0,
        "S3_Bucket": "livestream-artifact-bucket",
        "S3_Key": "frames/test_video_single_pipeline/test_1/original/2020/01/23/21/36:35:290000.jpg"
      }
    }

(Example state machine data passed to Step Functions states)
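
Before the pipeline can call the model, the trained model version has to be started. A minimal sketch of that provisioning step, with an illustrative model ARN and the MinInferenceUnits value described above, might look like this:

import boto3

rekognition = boto3.client('rekognition')

# Illustrative ARN of the trained Custom Labels model version
MODEL_ARN = 'arn:aws:rekognition:us-east-1:111111111111:project/station-logos/version/v1/1600000000000'

# Starting a model version is asynchronous; MinInferenceUnits=2 provisions extra
# capacity so parallel frame-level inference calls are less likely to be throttled.
rekognition.start_project_version(
    ProjectVersionArn=MODEL_ARN,
    MinInferenceUnits=2,
)

(A sketch of starting the Custom Labels model version before running inference)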

With a model trained and started, the following code demonstrates how the frame data is prepared and sent to the detect_custom_labels function for inference.

    frame_info = event['frame']
    bucket = frame_info['S3_Bucket']
    key = frame_info['S3_Key']
    min_confidence = int(os.getenv('LOGO_MIN_CONFIDENCE', 60))
    model_arn = os.getenv('LOGO_MODEL_ARN')

    ...

    img_data = {'S3Object': {'Bucket': bucket, 'Name': key}}

    with DDBUpdateBuilder(...) as update_builder:
        try:
            response = rekognition.detect_custom_labels(
                Image=img_data, 
                MinConfidence=min_confidence,
                ProjectVersionArn=model_arn
            )
        except ClientError as e:
            ...
        else:
            result = response.get('CustomLabels', [])
            ...

(Code sample showing Rekognition API call used in a Lambda function to detect custom labels)

A single call to detect_custom_labels using data extracted from the input event is all that’s necessary to perform inference on a frame with the custom model. The inference results are compared against the logo expected for the scheduled program and written to DynamoDB. And that’s it: Amazon Rekognition Custom Labels takes care of the heavy lifting for inference.

Implement Monitoring Check #3: Sports type verification

We’ve shown how we used Amazon Rekognition Custom Labels to detect logos in still images; now we show how to use it to verify sports content. A broadcaster may handle many different streams from various providers and must make sure that the streamed content corresponds to the programmed content. To verify that the correct type of sport is playing, we again leverage Custom Labels. Unlike the station logo detection use case, which required a custom object detection model, here we built a custom image classification model to identify the type of sport.

Generating training data for image classification

A given sport or activity looks visually similar regardless of which teams are playing or which competitor is winning or losing. Custom Labels provides the ability to build an image classification model from a dataset of labeled images. For this use case, preparing the data is the most important part, and it is easily achieved by structuring the source images in S3:

S3-bucket
└── sports
    ├── soccer
    │   ├── .
    │   └── .
    ├── .
    ├── .
    └── basketball
        ├── .
        └── .

(Example of S3 folder structure to use custom labels classification functionality)

Although outside the scope of this post, generating the datasets involved using AWS Batch jobs and OpenCV to extract frames from videos of the five test sports and store the images in S3.

Training and Inference

(Training and validation results from a Custom Labels training job)

 

As seen in the preceding figure from the Rekognition console, we trained the sports detection model with 5 different labels and roughly 5,000 images in approximately 2 hours. Aside from the differences in dataset creation, the steps to deploy a classification model are the same as for the logo detection model. Reviewing the SportsDetectFunction implementation in the sample application, you can see that inference is performed in a similar fashion.

Implement Monitoring Check #4: Sports Team Verification

For sports team verification, the sample application demonstrates combining two different ML techniques to perform a monitoring check:

  • Extract text from frames that matches team names
  • Detect appearances of sports team logos

For the first method, we rely on the fact that broadcasts of team sports commonly show graphic overlays or banners with team names or their abbreviations (see the following example):

(Example of a scoreboard overlay used to detect teams)

The TeamDetection Lambda function leverages the text-in-image feature of Amazon Rekognition to find text present in the extracted video frames. It then matches the text against a collection of known team names, abbreviations, and nicknames. While this text matching is rather simplistic in the sample application, it can be extended to support a more sophisticated lookup approach using a database such as Elasticsearch and techniques such as fuzzy matching.
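
A minimal sketch of this text-based check, using the Amazon Rekognition DetectText API and a small illustrative lookup table, might look like the following:

import boto3

rekognition = boto3.client('rekognition')

# Illustrative mapping of on-screen abbreviations to team names
KNOWN_TEAMS = {
    'SEA': 'Seattle Sounders',
    'POR': 'Portland Timbers',
}


def detect_teams(bucket, key, min_confidence=80.0):
    """Return the set of known teams whose abbreviations appear in the frame."""
    response = rekognition.detect_text(
        Image={'S3Object': {'Bucket': bucket, 'Name': key}}
    )
    found = set()
    for detection in response['TextDetections']:
        if detection['Type'] != 'WORD' or detection['Confidence'] < min_confidence:
            continue
        word = detection['DetectedText'].upper()
        if word in KNOWN_TEAMS:
            found.add(KNOWN_TEAMS[word])
    return found

(A sketch of matching Rekognition text detections against known team abbreviations)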

To add detection confidence, and to cover times when the scoring banners with team names are not visible on screen, the sample application also identifies logos of sports teams appearing on screen. Especially in close-up camera angles, team and league logos are visible on players’ uniforms, as seen in the example frame below:

(Extracted frame with team logo clearly visible on a player’s jersey)

 

In the sample application, we again leveraged Amazon Rekognition Custom Labels to perform this team logo detection, very similar to the station logo detection detailed earlier.

By combining results from two ML methods that use different types of information, the system can make detections with higher confidence when the results from both methods match.
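
As a rough illustration of this ensemble decision, assuming hypothetical sets of teams found by each method, the combination logic could be sketched as:

def verify_teams(expected_teams, text_hits, logo_hits):
    """Combine text-based and logo-based detections into one check result."""
    expected = set(expected_teams)
    # Agreement between both methods gives high confidence; a hit from either
    # method alone still passes the check, but with lower confidence.
    if expected & text_hits & logo_hits:
        return {'status': 'PASS', 'confidence': 'high'}
    if expected & (text_hits | logo_hits):
        return {'status': 'PASS', 'confidence': 'medium'}
    return {'status': 'FAIL', 'confidence': 'low'}

(A sketch of combining the two detection methods into a single team verification result)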

Visualize monitoring results using the demo web application

To examine and visualize the results of the monitoring checks, we developed a demo web application using AWS Amplify, AWS AppSync, and Vue.js. The web app frontend uses a GraphQL subscription over WebSockets to receive updates on the analysis results of each new HLS media segment. When you click a specific segment to see more detailed results, you can inspect the information extracted versus what was expected for each sampled frame, along with the confidence scores of each evaluation. You can also replay the video of the selected segment, powered by the time-shifted viewing feature of AWS Elemental MediaPackage.

The following is a screenshot of the sample app detecting sections of the video stream with lost audio. Notice that the two sports-related checks are disabled because a news program is scheduled for the given time.

(Web application view where audio is not detected and sports monitoring is disabled. Media credit: news video clip originally from the China News Service, shared under the CC BY 3.0 license)

 

And here’s another example screenshot of the application monitoring a sports livestream. The Sample Inspector pane on the right shows how the selected segment passes all the monitoring checks:

(Web application view with check results for a video segment frame)

 

Deploy the web application and run the sample application

If you followed the instructions to deploy the video ingestion and processing pipeline using AWS CloudFormation, you can now set up the accompanying web application and test the monitoring pipeline end to end. It’s important to note that because several features of the sample application (sports detection, logo detection, etc.) rely on custom models built in Amazon Rekognition, they are not enabled by default. To run the application without these features, use the following steps:

  1. To set up the demo web application in your account using the AWS Amplify Console, follow the steps in the GitHub README.md (before doing so, make sure the CloudFormation stack for the backend processing pipeline has finished launching).
  2. Populate the expected programming schedule table in DynamoDB. If you are using the test source video we provide, use the provided script and sample schedule by running the commands below. If you are using your own video, adjust the content accordingly.
    cd broadcast-monitoring 
    pipenv run python scripts/load_csv_to_ddb.py scripts/schedule.csv video-processing-Schedule
     
  3. In the DynamoDB console, verify the video-processing-Schedule table is populated
  4. Start the media processing pipeline. Go to the MediaLive console and start the MediaLive channel created by the CloudFormation stack to kick off HLS stream production. (AWS Elemental MediaLive channel starting)
  5. Go to the AWS Amplify console, find the URL of the web application, and open the web app in Chrome or Firefox.
  6. Register a login using your email. After verifying your email with a verification code, you should be able to log in to the web app.

To enable logo and sports detection features, you can train your own custom model in Amazon Rekognition and provide the model ARNs to the corresponding Lambda functions. Read more on how to do so here.

Resource clean-up

If you followed the steps to launch the sample application, remove the deployed infrastructure by following the steps below to avoid unwanted charges:

  • Go to the AWS CloudFormation console and delete the root stack containing the backend resources for the Amplify web app, whose name starts with amplify-broadcast
  • Delete the stack for the media ingestion and processing pipeline, broadcast-monitoring
  • Go to the AWS Amplify console and delete the web application.

Go beyond the sample application

In this post, we walked through the components and considerations for building a livestream video monitoring application using AWS AI services. The sample application implements just a small subset of the video monitoring tasks that can be automated using AI/ML. With the modular processing framework based on AWS Step Functions, you can easily expand the application with additional monitoring tasks using building blocks provided by AWS AI services, for example: