AWS for M&E Blog

[Updated 3/15/2021] Inserting ad breaks into video content using Amazon Rekognition, AWS Elemental MediaConvert and AWS Elemental MediaTailor

Customers are often overwhelmed by the amount of undifferentiated heavy lifting involved in preparing their media for monetization and streaming. The traditional ad insertion process is repetitive, error-prone, and time-consuming, yet ads must be inserted in a non-intrusive way so that the audience's attention is not disrupted when ad content is presented. A cloud-based monetization workflow makes it easy to identify suitable insertion slots for ad breaks without hours of manual content review.

For example, we want to avoid placing ads in the middle of a scene, a sentence, or the end credits. Instead, we want ads to be placed between scenes, when nobody is speaking, when the screen goes black for an extended period of time, or when there is enough context change, so that the ad insertion is not disruptive.

In a previous post, we demonstrated how to use AWS services to automatically detect black frames, insert ad breaks into video content, and stream the content with AWS Elemental MediaTailor. In this post, we expand on that solution by incorporating additional features to detect the most suitable ad insertion slots using AWS Elemental MediaConvert and Amazon Rekognition.

The following picture shows the result of ingesting a media file through the solution presented in this article. When the solution detects a fade to black, it inserts a related ad.

The result of the ingestion of a media file via the presented solution.

This implementation is built on top of the Media Insights Engine (MIE), an AWS solution to analyze video, audio, and images using AWS artificial intelligence and media services. MIE allows you to define workflows that encode your media content using MediaConvert, apply computer vision algorithms using Rekognition, and perform additional processing through your own custom operators. Its most recent version (0.1.8) includes the Video Segment Detection feature, which can detect frame-accurate black frames, shot changes, end credits, and SMPTE (Society of Motion Picture and Television Engineers) color bars in video.

In this tutorial, you will learn how to detect:

  • Black frames (with silence)
  • Silence
  • Shot changes
  • End credits


To get the most out of this blog post, we recommend familiarizing yourself with the technologies and standards used throughout it before diving into the solution.

  • AWS Elemental MediaConvert is a file-based video transcoding service with broadcast-grade features.
  • HLS is an HTTP adaptive bitrate streaming communications protocol.
  • AWS Elemental MediaTailor is a content personalization and monetization service.
  • Video Multiple Ad Playlist (VMAP) is an XML template that content owners can use to specify the structure of an ads inventory. It determines when and how many ads should be displayed in a stream. MediaTailor uses VMAP files to inject ads into your content. Ads included in an ad inventory may be described by using the Video Ad Serving Template (VAST) specification.
  • Amazon Rekognition Segment Detection APIs easily identify useful segments of video such as color bars and end credits.

Solution Overview

In a previous article, we built a solution based on the third-party open source software FFmpeg, deployed on custom Amazon ECS tasks running on AWS Fargate. Customers that prefer a managed, serverless approach for shot and technical cue detection can now leverage the Rekognition Segment APIs. The APIs allow customers to easily analyze large volumes of videos and detect markers such as black frames or shot changes. The APIs then return SMPTE timecodes and timestamps for each detection.

The results of this analysis are used in this solution to estimate the best positions for ad insertion. The output of the solution is a URL to an HLS playlist of the ads-featured stream. Server-side ad insertion is done via MediaTailor. Alternatively, the VMAP file generated by the solution can be used for client-side ad insertion.

Solution Architecture

Architecture Diagram. This solution makes use of custom MIE operators to convert the video to an HLS playlist, find the top three slots for ad insertion, and generate a VMAP file. The HLS playlist and the VMAP file are stored on S3 and made available via CloudFront.


Once a video file is uploaded to an input bucket in Amazon S3, an AWS Lambda function is triggered and starts the AWS Step Functions workflow:

  • The first stage makes use of a custom-built MIE operator to start and monitor the conversion of the original video file to HLS format (.m3u8 manifest files and .ts video segments) and the extraction of loudness information from the video’s audio track. This operation is carried out by a MediaConvert job.
  • In stage two, six MIE operators run in parallel to extract semantic and technical information from the uploaded file using Rekognition. In particular, the Technical Cues Detection operator and Shot Detection operators make use of Rekognition Segment APIs.
  • In stage three, a custom MIE operator collects the output from previous stages and detects candidate slots for ad insertion. A timestamp and fitness score are calculated for each slot.
  • In the last stage, another custom MIE operator chooses the top three candidate slots, produces a VMAP file with ads inserted, and stores it in the Dataplane Bucket on Amazon S3. This is the file that MediaTailor uses to generate the HLS manifest with both content and ads.

The HLS manifest generated from the uploaded content is then made available for streaming via Amazon CloudFront and featured with ads by MediaTailor. Alternatively, the VMAP file is hosted on S3 and made available via CloudFront for client-side ads insertion.

MIE Workflow

In this section, we dive deep into the MIE operators that make up the MIE Step Functions workflow.

AWS Step Function MIE Workflow

Video Transcoding Operator

The Video Transcoding Operator is a custom MIE operator that executes these four operations via MediaConvert:

  1. Converts the uploaded content into an HLS playlist with 1-second-long segments so that it can be used with MediaTailor to insert ads.
  2. Converts the uploaded content into a format that is suitable for Amazon Rekognition consumption (lower resolution MPEG-4).
  3. Extracts audio tracks as MPEG and loudness measurements as a CSV file. The MPEG track can be used with Amazon Transcribe to generate subtitles and for further semantic analysis with Amazon Comprehend. The loudness track is used for silence detection.
  4. Stores its outputs in the MIE dataplane for the further analysis carried out by the next operators.

Labels, Celebrities, Face Detection, and Content Moderation Operators

These operators are included in MIE and use Amazon Rekognition APIs to extract semantic information from the provided content.

  • The Label Detection Operator detects objects, events, concepts, or activities and provides them as labels.
  • The Face Detection Operator returns information about where faces are detected in a video, facial landmarks such as the position of eyes, and detected emotions such as happy or sad.
  • The Celebrity Detection Operator gets tracking information for celebrities as they appear throughout the video.
  • The Content Moderation Operator analyzes images and stored videos for adult and violent content.

The semantic information extracted by these operators can be used to inform the Ad Decision Server (ADS) about what ad to feature in the stream. This solution uses S3 as the ADS to host static VMAP responses. We’ll dive deep on how to build a dynamic ADS and reporting service in an upcoming blog post.

Shot Detection and Technical Cues Operators

These two operators analyze content from a technical perspective. They both use Rekognition’s Segment Detection features to provide two detection options:

  • TECHNICAL_CUE, for the detection of silent black frames and end credits
  • SHOT, for the detection of shot changes

Black frames with a soundtrack or voice-over are considered content by Rekognition and are not captured by the TECHNICAL_CUE option. However, the SHOT option reports them as shot changes. This solution uses the results returned by these operators to choose the best positions for ad insertion.
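To make the shape of these results more concrete, here is a minimal sketch of how a Rekognition GetSegmentDetection-style response could be mapped to candidate slots. The function name, the slot dictionary layout, and the choice of using each segment's start time as the slot timestamp are assumptions made for illustration; they are not taken from the solution's code. In the API response, the segment types appear as TECHNICAL_CUE and SHOT.

```python
def segments_to_slots(response):
    """Map a GetSegmentDetection-style response to a list of candidate slots.

    Hypothetical helper: the slot layout ({"Timestamp", "Type"}) is an
    assumption for this example, not the solution's actual data model.
    """
    slots = []
    for segment in response.get("Segments", []):
        if segment["Type"] == "TECHNICAL_CUE":
            cue_type = segment["TechnicalCueSegment"]["Type"]
            if cue_type == "BlackFrames":
                slot_type = "BlackFrame"
            elif cue_type == "EndCredits":
                slot_type = "EndCredits"
            else:
                continue  # other cues (for example ColorBars) are ignored here
        elif segment["Type"] == "SHOT":
            slot_type = "ShotChange"
        else:
            continue
        # Assumption: the slot is placed at the start of the detected segment
        slots.append({
            "Timestamp": segment["StartTimestampMillis"] / 1000.0,
            "Type": slot_type,
        })
    return sorted(slots, key=lambda slot: slot["Timestamp"])
```

The input is expected to be the dictionary returned by the API (or one page of it), with `StartTimestampMillis` expressed in milliseconds from the start of the video.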

Slot Detection Operator

The Slot Detection Operator takes the results from previous stages to detect candidate slots. While results from Shot Detection and Technical Cues Detection are ready to be consumed as candidate slots, the detection of silent intervals requires a further step.

Silence Analysis is performed on the loudness track extracted via a MediaConvert job with the following configuration:

  • Audio Normalization Algorithm: ITU-R BS.1770-2: EBU R-128
  • Algorithm Control: Measure only
  • Loudness Logging: Log
  • Loudness Threshold: -50 dB
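As a rough illustration, the first three settings above correspond to the AudioNormalizationSettings block of a MediaConvert audio description. The sketch below expresses them as a Python dictionary, as it might be embedded in a job definition. The variable names are made up for this example, and the loudness threshold is left out because this post does not state which API field it maps to.

```python
# Audio normalization in "measure only" mode: loudness is logged, audio is not altered
audio_normalization_settings = {
    "Algorithm": "ITU_BS_1770_2",        # ITU-R BS.1770-2
    "AlgorithmControl": "MEASURE_ONLY",  # measure loudness without correcting it
    "LoudnessLogging": "LOG",            # write the loudness measurements to a log
}

# Hypothetical fragment of an audio description carrying these settings
audio_description = {
    "AudioNormalizationSettings": audio_normalization_settings,
}
```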

The silence detection process checks that the audio signal loudness is lower than a customizable threshold. We set this threshold to -50 dB. You can learn more about silence detection with MediaConvert jobs in this dedicated blog post. Silent intervals detected this way become candidate slots for ad insertion.
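The thresholding step can be sketched as follows. This is a simplified illustration, not the solution's implementation: it assumes the loudness log has already been parsed into a time-ordered list of (time in seconds, loudness) pairs, and the function name and `min_duration` parameter are hypothetical.

```python
def find_silent_intervals(samples, threshold_db=-50.0, min_duration=0.0):
    """Return (start, end) intervals where loudness stays below threshold_db.

    samples: list of (time_seconds, loudness) tuples sorted by time.
    An interval ends at the first sample at or above the threshold, or at
    the last silent sample if the track ends while still silent.
    """
    intervals = []
    start = None
    last_silent = None
    for time_s, loudness in samples:
        if loudness < threshold_db:
            if start is None:
                start = time_s  # silence begins here
            last_silent = time_s
        elif start is not None:
            intervals.append((start, time_s))  # silence ended at this sample
            start = None
    if start is not None:
        intervals.append((start, last_silent))
    # Optionally discard intervals that are too short to be useful
    return [(s, e) for s, e in intervals if e - s >= min_duration]
```

Raising `min_duration` filters out brief pauses, such as gaps between words, that are unlikely to be good ad slots.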

For each candidate slot, metadata from 2 seconds before and after its timestamp is retrieved among the results of the Labels, Celebrities, Face Detection, and Content Moderation Operators.
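A minimal sketch of this windowing step is shown below. It assumes each detection is a dictionary with a `Timestamp` in seconds; that layout and the function name are illustrative assumptions rather than the MIE data model.

```python
def get_context(timestamp, detections, window=2.0):
    """Split detections into those just before and just after a slot.

    detections: list of dicts with a "Timestamp" key in seconds (assumed layout).
    Detections falling exactly on the slot timestamp are counted as "After".
    """
    before = [d for d in detections
              if timestamp - window <= d["Timestamp"] < timestamp]
    after = [d for d in detections
             if timestamp <= d["Timestamp"] <= timestamp + window]
    return {"Before": before, "After": after}
```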

Now that a list of candidate slots is available and enriched with semantic metadata, an algorithm is used to assign a confidence score to each. Only the three slots with the highest score are selected for ad insertion.

The criteria we’re using to calculate the fitness score are:

  • We determine a base confidence score as a function of the slot type. In this implementation, we have used the following base scores:
        "Silence": 0.7,
        "BlackFrame": 0.8,
        "ShotChange": 0.7,
        "EndCredits": 1.0
  • If the distance of the candidate slot is <0.5s from the next available slot, the two slots are consolidated and their combined score is increased.
  • If the distance to the next slot is >0.5s and <30s, the slot confidence score is lowered.
  • If the slot appears in the first 25% of the video, the slot confidence score is lowered.
  • The intersection over union between the sets of Label metadata from before and after each slot is calculated as an approximate measure of context change associated with the slot. The higher the context change, the more likely the slot could represent a scene change, so the confidence score is increased accordingly.
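# Score adjustment: labels before and after
# (excerpt from the scoring loop; requires `import math` at module level)
slot["Context"] = __get_context_metadata(slot["Timestamp"], asset_metadata)
pre_labels = {label["Name"] for label in slot["Context"]["Labels"]["Before"]}
post_labels = {label["Name"] for label in slot["Context"]["Labels"]["After"]}
if pre_labels or post_labels:  # skip when both sets are empty (avoids division by zero)
    # 1 - intersection over union: 1.0 means the labels changed completely
    distance = 1.0 - (len(pre_labels.intersection(post_labels)) / len(pre_labels.union(post_labels)))
    # The fourth power damps small context changes before the score is combined
    slot["Score"] = __disjunction(slot["Score"], math.pow(distance, 4.0))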

slot_detection/ lines 59 through 65.
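The criteria above can be sketched end to end as follows. This is an illustration under stated assumptions, not the solution's code: `disjunction` is assumed to be a probabilistic OR (a + b - a*b), since the helper's body is not shown in this post, and the function names and slot layout are hypothetical. The penalties for nearby slots and for slots in the first 25% of the video are left out because their amounts are not given.

```python
import math

# Base scores per slot type, as listed above
BASE_SCORES = {"Silence": 0.7, "BlackFrame": 0.8, "ShotChange": 0.7, "EndCredits": 1.0}


def disjunction(a, b):
    """Assumed combination rule: probabilistic OR, stays within [0, 1]."""
    return a + b - a * b


def context_change(pre_labels, post_labels):
    """1 - intersection over union of the two label sets (0.0 if both are empty)."""
    union = pre_labels | post_labels
    if not union:
        return 0.0
    return 1.0 - len(pre_labels & post_labels) / len(union)


def score_slot(slot_type, pre_labels, post_labels):
    """Base score raised according to the context change around the slot."""
    distance = context_change(set(pre_labels), set(post_labels))
    return disjunction(BASE_SCORES[slot_type], math.pow(distance, 4.0))


def consolidate(slots, max_gap=0.5):
    """Merge slots closer than max_gap seconds, combining their scores."""
    merged = []
    for slot in sorted(slots, key=lambda s: s["Timestamp"]):
        if merged and slot["Timestamp"] - merged[-1]["Timestamp"] < max_gap:
            merged[-1]["Score"] = disjunction(merged[-1]["Score"], slot["Score"])
        else:
            merged.append(dict(slot))
    return merged


def top_slots(slots, count=3):
    """Keep the highest-scoring slots."""
    return sorted(slots, key=lambda s: s["Score"], reverse=True)[:count]
```

As a worked example, a silence slot with labels {Car, Road, Person} before and {Person, Beach, Sea} after has an intersection of 1 and a union of 5, so the context change is 0.8. Raised to the fourth power this gives 0.4096, and combining it with the base score of 0.7 yields 0.7 + 0.4096 - 0.7 × 0.4096 ≈ 0.823. With identical labels before and after, the score stays at 0.7.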

VMAP Generation

Once semantic and technical information is extracted from the uploaded content, the top three slots are calculated, and the HLS playlist is ready for streaming, another custom MIE operator deployed by this solution, the VMAP Generation Operator, can build the ads manifest file.

The VMAP file generated using this operator is used by MediaTailor to perform server-side ad insertion, or you can use it for the same purpose on the client-side.
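For readers unfamiliar with VMAP, the following sketch builds a small VMAP 1.0 document with Python's standard library. It is an assumption-laden illustration, not the operator's code: the function names and break identifiers are hypothetical, and each ad URL is assumed to point at a VAST response referenced through an AdTagURI element, which may differ from how the solution embeds its ads.

```python
import xml.etree.ElementTree as ET

VMAP_NS = "http://www.iab.net/videosuite/vmap"


def format_offset(seconds):
    """Format a time in seconds as a VMAP timeOffset (HH:MM:SS.mmm)."""
    millis = int(round(seconds * 1000))
    hours, rem = divmod(millis, 3600000)
    minutes, rem = divmod(rem, 60000)
    secs, ms = divmod(rem, 1000)
    return "{:02d}:{:02d}:{:02d}.{:03d}".format(hours, minutes, secs, ms)


def build_vmap(ad_breaks):
    """Build a VMAP document from a list of (offset_seconds, ad_url) pairs."""
    ET.register_namespace("vmap", VMAP_NS)
    root = ET.Element("{%s}VMAP" % VMAP_NS, {"version": "1.0"})
    for index, (seconds, ad_url) in enumerate(ad_breaks, start=1):
        ad_break = ET.SubElement(root, "{%s}AdBreak" % VMAP_NS, {
            "timeOffset": format_offset(seconds),
            "breakType": "linear",
            "breakId": "break-{}".format(index),  # hypothetical identifier
        })
        source = ET.SubElement(ad_break, "{%s}AdSource" % VMAP_NS, {
            "id": "ad-source-{}".format(index),   # hypothetical identifier
            "allowMultipleAds": "false",
            "followRedirects": "true",
        })
        tag = ET.SubElement(source, "{%s}AdTagURI" % VMAP_NS, {"templateType": "vast3"})
        tag.text = ad_url  # assumed to be the URL of a VAST response
    return ET.tostring(root, encoding="unicode")
```

Calling `build_vmap([(83.5, "https://example.com/ad1.xml")])` produces one ad break at `00:01:23.500`; the example URL is a placeholder.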

This operator is also responsible for selecting the ads that best match the processed content by calculating the context similarity between the ad insertion slot and the ads contained in the Ad Server Bucket. It follows an approach similar to the one used to measure context change during slot detection (intersection over union).

The ads used in this operator and their context labels are configured in an included JSON file that references ads contained in a mock Ad Server bucket provided by AWS Elemental.

The following code snippet performs the selection.

import random

# `ads` is expected to be the list of ads (each with 'labels' and 'url')
# loaded from the included JSON file.

def __select_ad(labels):
    print('labels: {}'.format(labels))
    # Searching ads to find the one with most similar labels
    top_similarity = -1.0
    top_ad = None
    slot_labels = set(labels)
    random.shuffle(ads) # Shuffle to return a random ad in case none has similarity
    for ad in ads:
        print('ad: {}'.format(ad))
        ad_labels = set(ad['labels'])
        union = slot_labels.union(ad_labels)
        # Intersection over union; 0.0 when both label sets are empty (avoids division by zero)
        similarity = len(slot_labels.intersection(ad_labels)) / len(union) if union else 0.0
        if similarity > top_similarity:
            top_similarity = similarity
            top_ad = ad
    print('top_ad: {}'.format(top_ad))
    print('top_similarity: {}'.format(top_similarity))
    # Return URL to selected ad video file
    return top_ad['url']

vmap_generation/ lines 143 through 160

Deploying the Solution

This solution deploys in two phases. In the first phase, you will deploy the Media Insights Engine (MIE) to your AWS account. You can install MIE in your AWS account by following this guide. If you have already installed MIE in your account (version 0.1.8 or later), you can skip to the next step.

In the next phase, you will deploy the custom stack alongside MIE. Clone this repository: it contains a SAM template that deploys the stack containing custom Lambda functions, custom MIE operators, a CloudFront distribution to serve private MIE dataplane assets, and the MediaTailor campaign configuration.

Stack Parameters

The custom stack template needs some parameter values from MIE. You will need to check your MIE stack on the AWS CloudFormation console to obtain these values.

MediaInsightsEnginePython38Layer: This is the ARN of the MIE Lambda Layer used for slot detection. You can find this among the outputs of the MIE main CloudFormation stack deployed in the previous phase.

WorkflowCustomResourceArn: This is the ARN of the MIE custom resource. You can find this among the outputs of the MIE main CloudFormation stack deployed in the previous phase.

WorkflowEndpoint: This is the name of the Lambda function that handles the MIE Workflow API. You can find this as APIHandlerName among the outputs of the MIE Workflow API CloudFormation nested stack.

DataplaneEndpoint: This is the name of the Lambda function that handles the MIE Dataplane API. You can find this as APIHandlerName among the outputs of the MIE Dataplane API CloudFormation nested stack.

DataplaneBucket: This is the bucket that stores the HLS playlists, segments, and VMAP files. You can find this among the outputs of the MIE Dataplane CloudFormation nested stack.

Building and Deploying

Once you’ve cloned the repository, issue the following command in a terminal. Use the value from the MediaInsightsEnginePython38Layer parameter.

sam build --parameter-overrides 'ParameterKey=MediaInsightsEnginePython38Layer,ParameterValue=[Layer ARN obtained from MIE stack]'

SAM CLI command to build the application

After the Lambda function code is built, run the following command to deploy the stack. A guided procedure will ask you to provide the parameter values.

sam deploy --guided

SAM CLI command to deploy the application

Stack Outputs

After the stack is deployed, look at the command outputs and take note of their values, as you’ll need them later to access the assets generated by the solution.

InputBucket: The Input Bucket where videos will be uploaded

CloudFrontHLSPlaybackPrefix: URL prefix to access the MediaTailor configuration through CloudFront and to access the HLS stream with ads inserted

CloudFrontDomainName: Domain name URL from CloudFront that can be used to access the raw HLS stream and generated VMAP file for client-side ad insertion

Running the Solution

Start the workflow by uploading your video file to the Input Bucket, using the following command or in the console.

aws s3 cp [your video file].mp4 s3://[InputBucket output from CloudFormation]

AWS CLI command to upload the video file to the input bucket

The workflow will start automatically. Check the Step Functions console to monitor the workflow execution.

The picture shows the AWS Step Function console and where to fetch relevant information on the workflow

AWS Step Function console

Expand the Input section to get the AssetID.

AWS Step Function console – detail of Input

With the AssetID, you can now access the HLS playlist, the VMAP file, and the ads-featured stream via MediaTailor.

Ad Insertion Playback

Server-Side Ads Insertion

In this solution, we create a MediaTailor Configuration at deployment time via a CloudFormation custom resource.

   Type: Custom::MediaTailorConfig
   Properties:
     ServiceToken: !GetAtt MediaTailorConfigFunction.Arn
     ConfigurationName: !Sub "${AWS::StackName}-config"
     VideoContentSource: !Sub "https://${DataplaneBucket}"
     AdDecisionServer: !Sub "https://${DataplaneBucket}[player_params.asset_id]/vmap/ad_breaks.vmap"

template.yml lines 404 through 410

The custom resource wraps the invocation of a Lambda function that uses boto3 to create, update, and delete a MediaTailor configuration.

res = mediatailor.put_playback_configuration(
    # ... (other arguments elided in this excerpt)
    CdnConfiguration={
        'AdSegmentUrlPrefix': cdn_ad_prefix,
        'ContentSegmentUrlPrefix': cdn_content_prefix
    })

mediatailor/ lines 29 through 38

The parameters passed to the put_playback_configuration method map to the properties of the MediaTailorConfig CloudFormation custom resource.

You can consume the HLS playlist created by MediaTailor via CloudFront using the following URL format.

https://[CloudFrontHLSPlaybackPrefix output from CloudFormation]/assets/[AssetId]/hls/playlist.m3u8?ads.asset_id=[AssetId]

MediaTailor HLS playback URL format

For example, you could use the following URL to test in an HLS player.

Although Rekognition provides frame-accurate positions for shot and cue detection, MediaTailor can only insert ad breaks between the HLS stream chunks, so the actual position of the ads in the stream might not have the same precision. To address this, tune your MediaConvert profiles to produce smaller HLS segments, or use the results of the Rekognition Segment analysis to split the main video file into multiple HLS playlists.
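The effect of segment length on precision can be illustrated with a small calculation. The sketch below assumes uniform segment durations and that a break lands on the nearest segment boundary; both are simplifying assumptions for illustration, not documented MediaTailor behavior, and the function name is hypothetical.

```python
import math


def snap_to_segment_boundary(timestamp, segment_length=1.0):
    """Return the segment boundary nearest to timestamp (ties round up).

    Assumes all segments have the same duration, in seconds.
    """
    return math.floor(timestamp / segment_length + 0.5) * segment_length


# A detection at 83.4 s with 1-second and 6-second segments
detected = 83.4
error_short = abs(snap_to_segment_boundary(detected, 1.0) - detected)
error_long = abs(snap_to_segment_boundary(detected, 6.0) - detected)
```

Under these assumptions the placement error is at most half a segment, which is why shorter segments bring the ad break closer to the detected position.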

Client-Side Ads Insertion

For client-side ad insertion, you can consume the VMAP file generated by the solution for each video with a VMAP compatible player via CloudFront.

Retrieve the URL to the VMAP file generated by the workflow using this URL format:

https://[CloudFrontDomainName Output from CloudFormation]/assets/[AssetId]/vmap/ad_breaks.vmap

VMAP file URL format

For example, you can use the following VMAP file to test your client-side ad insertion:

Retrieve the URL to the raw HLS playlist using this URL format:

https://[CloudFrontDomainName Output from CloudFormation]/assets/[AssetId]/hls/playlist.m3u8

HLS playlist URL format
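The URL formats used in this section and in the server-side section can be assembled with a small helper such as the one below. The function name and the example values are hypothetical; the inputs are the CloudFrontHLSPlaybackPrefix and CloudFrontDomainName stack outputs and the AssetId taken from the Step Functions console.

```python
def playback_urls(hls_playback_prefix, cloudfront_domain, asset_id):
    """Build the three URLs described in this post for a given asset."""
    return {
        # HLS stream with ads inserted server-side by MediaTailor
        "mediatailor_hls": "https://{}/assets/{}/hls/playlist.m3u8?ads.asset_id={}".format(
            hls_playback_prefix, asset_id, asset_id),
        # VMAP file for client-side ad insertion
        "vmap": "https://{}/assets/{}/vmap/ad_breaks.vmap".format(
            cloudfront_domain, asset_id),
        # Raw HLS playlist without ads
        "raw_hls": "https://{}/assets/{}/hls/playlist.m3u8".format(
            cloudfront_domain, asset_id),
    }
```

For instance, `playback_urls("prefix.example.com", "d111111abcdef8.cloudfront.net", "my-asset-id")["vmap"]` returns `https://d111111abcdef8.cloudfront.net/assets/my-asset-id/vmap/ad_breaks.vmap`; all three argument values here are placeholders.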

For example, you can use the following HLS playlist to test your client-side ads insertion:


In this blog post, we demonstrated how to use the Amazon Rekognition Segment APIs to automatically find ad insertion slots. We built a solution able to ingest media content, analyze it, feature it with ads, and make it available for streaming using a host of AWS cloud services. Stay tuned for more blog posts about how to efficiently monetize your video workflows on AWS.