AWS for M&E Blog

Media2Cloud on AWS Guidance: Scene and ad-break detection and contextual understanding for advertising using generative AI

Overview

Contextual advertising matches advertisements to the content a user consumes, creating a personalized advertising experience. It involves three key players: publishers (website or content owners), advertisers, and consumers. Publishers provide the platform and content, while advertisers create contextually tailored ads. Consumers engage with the content, and relevant ads are displayed based on that context. A challenge in implementing contextual advertising is the insertion of ads into streaming video on demand (VOD) content. Traditionally, this process relied on manual tagging by human experts, who analyzed the content and assigned relevant keywords or categories to it. This approach is time-consuming, subjective, and may not capture the full context or nuances of the content. While traditional AI and machine learning (ML) solutions can automate this process, they often require extensive training data and can be expensive and limited in their capabilities.

Generative AI, powered by large language models, offers a promising solution to overcome these challenges. By leveraging these models’ vast knowledge and contextual understanding, broadcasters and content producers can automatically generate accurate and comprehensive contextual insights and taxonomies for their media assets. This approach enables a streamlined process, effective ad targeting, and monetization of media archives.

In this blog post, we explore one of the new features of the Media2Cloud on AWS Guidance V4: scene and ad break detection and contextual understanding of ad breaks. We demonstrate step-by-step how to create contextually relevant insights and taxonomies for advertising using generative AI on Amazon Web Services (AWS). This allows broadcasters and content producers to monetize their media assets effectively, extract greater value from their media archives, unlock new revenue streams, and deliver personalized and engaging advertising experiences to audiences.

Watch the Media2Cloud scene and ad break detection demonstration video from NAB 2024.

Key terms and definitions

Frame – a single image extracted from the video content

Shot – a continuous sequence of frames between two edits or cuts that defines one action

Scene – a continuous sequence of action taking place in a specific location and time, consisting of a series of shots

Chapter – a logical division of the video storyline, consisting of a series of shots and conversations on a similar topic

WebVTT – a file format used to store timed text track data, such as subtitles or captions, for video content on the web

The Interactive Advertising Bureau (IAB) Content Taxonomy – standardized metadata categories and subcategories that enable advertising platforms, publishers, and advertisers to target and match ads with relevant content effectively

Global Alliance for Responsible Media (GARM) Taxonomy – a standardized categorization of sensitive content topics that advertisers can avoid or map to specific brand suitability settings in digital advertising

Solution overview

The AWS team tested various techniques in pursuit of an optimal design: self-hosted image captioning models, large language models (LLMs) to summarize transcriptions and detected labels, in-context learning to classify scene summaries according to the IAB Content Taxonomy Version 3, and embeddings for similarity search. During this intensive testing period, we saw remarkable advancements in generative AI. Models evolved rapidly, becoming faster, more cost-effective, and increasingly capable, allowing us to converge on a design built around the cutting-edge Anthropic Claude 3 multimodal foundation model.

Figure 1 describes a generic workflow of the solution using Amazon Transcribe, an AWS AI service, and foundation models (FMs) from Amazon Bedrock.

Steps overview:

  1. The user uploads a media asset to Amazon Simple Storage Service (Amazon S3).
  2. Generate audio chapter points: We use Amazon Transcribe, an automatic speech recognition (ASR) service, to generate a transcription from the audio dialogue of the media asset, then employ the Anthropic Claude 3 Haiku model to analyze the conversation and identify chapter points based on significant topic changes.
  3. In parallel, generate a scene grid from video frames: We sample frames from the video and use the Amazon Titan Multimodal Embeddings model to group frames into shots and then group shots into scenes based on visual similarity.
  4. Align scenes and audio chapters: Align video scenes with audio chapters to identify unobtrusive breaks for ad insertion.
  5. Generate contextual responses: Send the scene grid and transcription to the Anthropic Claude 3 model in Amazon Bedrock to generate relevant contextual responses, such as scene descriptions, sentiment, and relevant IAB or other custom taxonomy categories.

For this blog post, a sample notebook is available detailing these steps. We walk through the steps and sample code snippets in the following section.

Prerequisites to run the sample notebook

  • An AWS account with requisite permissions, including access to Amazon Bedrock, Amazon SageMaker, and Amazon S3 for file uploads.
  • Permission to manage model access in Amazon Bedrock, as this solution requires the Anthropic Claude 3 Sonnet and Anthropic Claude 3 Haiku models.
  • Use the default Python 3 kernel on Amazon SageMaker Studio with a recommended ml.t3.medium CPU instance. Set up a domain for Amazon SageMaker Studio (refer to the documentation).
  • Install third-party libraries, including FFmpeg, OpenCV (opencv-python), and webvtt-py, before executing the code sections (follow the instructions); example setup commands are shown after this list.
  • Use the Meridian short film from Netflix Open Content under the Creative Commons Attribution 4.0 International Public License as the example video.
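
The exact setup depends on your environment; the sample notebook’s own instructions take precedence. As a rough guide, and assuming a conda-based SageMaker Studio image, the dependencies could be installed from a notebook cell like this:

%pip install webvtt-py opencv-python
# FFmpeg is a system-level dependency; on a conda-based image it can be installed with:
%conda install -y -c conda-forge ffmpeg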

Generate the chapter points from audio

After uploading the video to Amazon S3, we use Amazon Transcribe and a foundation model from Amazon Bedrock to automatically generate conversational chapter points, helping track when conversation topics start and end in the video. Amazon Transcribe converts speech to text and generates a transcription, which is then downloaded and formatted into the WebVTT format.

[
    {
        'text': 'So these guys just', 
        'start': '00:00:26.860', 
        'end': '00:00:28.260', 
        'start_ms': 26860, 
        'end_ms': 28260
     }, 
     {
        'text': 'disappeared.', 
        'start': '00:00:28.569', 
        'end': '00:00:29.459', 
        'start_ms': 28569, 
        'end_ms': 29459
     }, 
     ...
]
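
A minimal sketch of how this step could be implemented with boto3 and webvtt-py follows; the bucket, key, and job names are placeholders, and the sample notebook’s actual helper functions may differ. Here we use Amazon Transcribe’s subtitle feature to obtain WebVTT output directly, then parse it into the caption segments shown above.

import time
import boto3
import webvtt

transcribe = boto3.client("transcribe")

# Placeholder names; replace with your own bucket, key, and job name
job_name = "meridian-transcription"
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={"MediaFileUri": "s3://YOUR_BUCKET/videos/meridian.mp4"},
    MediaFormat="mp4",
    LanguageCode="en-US",
    Subtitles={"Formats": ["vtt"]},          # ask Transcribe to also produce WebVTT
    OutputBucketName="YOUR_BUCKET",
)

# Poll until the asynchronous job finishes
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if job["TranscriptionJob"]["TranscriptionJobStatus"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(10)

def to_ms(timestamp):
    # "HH:MM:SS.mmm" -> milliseconds
    hours, minutes, seconds = timestamp.split(":")
    return int((int(hours) * 3600 + int(minutes) * 60 + float(seconds)) * 1000)

# After downloading the .vtt file from Amazon S3, parse it into caption segments
segments = [
    {
        "text": caption.text,
        "start": caption.start,
        "end": caption.end,
        "start_ms": to_ms(caption.start),
        "end_ms": to_ms(caption.end),
    }
    for caption in webvtt.read("meridian.vtt")
]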

Next, the transcript is passed to the Anthropic Claude 3 Haiku model from Amazon Bedrock. The model analyzes the transcript and suggests conversational chapter points in a specific JSON format. In the prompt, we specify that each chapter should contain a start and end timestamp, along with a reason describing the topic. The prompts for the Anthropic Claude 3 Haiku model follow:

System prompt

You are a media operation assistant who analyses movie transcripts in WebVTT
format and suggests chapter points based on the topic changes in the conversations.
It is important to read the entire transcript.

Messages

[
    {
        'content': 'Here is the transcripts in <transcript> tag:\n'
                '<transcript>{transcript}\n</transcript>\n',
        'role': 'user'
    },
    {
        'content': 'OK. I got the transcript. What output format?',
        'role': 'assistant'
    },
    {
        'content': 'JSON format. An example of the output:\n'
                '{"chapters": [{"start": "00:00:10.000", "end": "00:00:32.000", '
                '"reason": "It appears the chapter talks about..."}]}\n',
        'role': 'user'
    },
    {
        'content': '{', 'role': 'assistant'
    }
 ]
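
As a hedged illustration, the system prompt and messages above could be sent to the Anthropic Claude 3 Haiku model through the Amazon Bedrock Runtime API roughly as follows; the max_tokens value and variable names are ours, and the sample notebook may structure the call differently.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

system_prompt = (
    "You are a media operation assistant who analyses movie transcripts in WebVTT "
    "format and suggests chapter points based on the topic changes in the conversations. "
    "It is important to read the entire transcript."
)

# `transcript` holds the WebVTT text produced in the previous step
messages = [
    {"role": "user", "content": f"Here is the transcripts in <transcript> tag:\n<transcript>{transcript}\n</transcript>\n"},
    {"role": "assistant", "content": "OK. I got the transcript. What output format?"},
    {"role": "user", "content": 'JSON format. An example of the output:\n{"chapters": [{"start": "00:00:10.000", "end": "00:00:32.000", "reason": "It appears the chapter talks about..."}]}\n'},
    # Pre-filling the assistant turn with "{" nudges the model to return raw JSON
    {"role": "assistant", "content": "{"},
]

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "system": system_prompt,
        "messages": messages,
    }),
)

# Re-attach the pre-filled "{" before parsing the model output
completion = json.loads(response["body"].read())["content"][0]["text"]
chapters = json.loads("{" + completion)["chapters"]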

To ensure the model’s output accurately reflects the original transcript, the output JSON is post-processed to merge any overlapping chapter timestamps and align the chapter boundaries with the actual caption timestamps from the WebVTT file. Following is an example of a chapter in the final JSON output:

{
    "chapters":[
        {
            'start': '00:00:26.860',
            'end': '00:01:00.529',
            'start_ms': 26860,
            'end_ms': 60529,
            'reason': 'This section discusses the disappearance of three men - a school '
                    'teacher, an insurance salesman, and a retiree. It introduces a '
                    'witness who saw a strange occurrence on a rock near El Matador '
                    "beach around the time of the last man's disappearance.",
        },
        .....
    ]
}
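
A simplified sketch of that post-processing follows, assuming the to_ms() helper and the caption segments from the WebVTT step above; the notebook’s actual merge and alignment logic may be more involved.

# Convert the model's chapter timestamps to milliseconds
for chapter in chapters:
    chapter["start_ms"] = to_ms(chapter["start"])
    chapter["end_ms"] = to_ms(chapter["end"])

def merge_overlapping(chapters):
    """Merge chapters whose timestamp ranges overlap."""
    chapters = sorted(chapters, key=lambda c: c["start_ms"])
    merged = [chapters[0]]
    for chapter in chapters[1:]:
        last = merged[-1]
        if chapter["start_ms"] <= last["end_ms"]:
            last["end_ms"] = max(last["end_ms"], chapter["end_ms"])
            last["reason"] = f'{last["reason"]} {chapter["reason"]}'
        else:
            merged.append(chapter)
    return merged

def snap_to_captions(chapters, segments):
    """Align chapter boundaries with the actual caption timestamps."""
    for chapter in chapters:
        inside = [s for s in segments
                  if s["start_ms"] >= chapter["start_ms"] and s["end_ms"] <= chapter["end_ms"]]
        if inside:
            chapter["start"], chapter["start_ms"] = inside[0]["start"], inside[0]["start_ms"]
            chapter["end"], chapter["end_ms"] = inside[-1]["end"], inside[-1]["end_ms"]
    return chapters

chapters = snap_to_captions(merge_overlapping(chapters), segments)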

Generate scene grid from video frames

In parallel with audio processing, we prepare the visual elements through similarity analysis of visual embeddings generated with the Amazon Titan Multimodal Embeddings (TME) model from Amazon Bedrock. First, we extract video frames at one frame per second with a 392×220 pixel resolution, optimized for visual quality and computational efficiency through numerous experiments. These extracted frames are passed to the TME model to generate embeddings capturing visual features and semantics. The embeddings are compared to group visually similar frames into shots. Following are some shot examples:

In this process, we sample one frame per second, then apply cosine similarity to adjacent frames to group 719 frame images into 118 shots, representing camera shot changes. One-frame-per-second downsampling was chosen based on past experience but can be calibrated for high-motion, high-frame-rate videos.
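
A hedged sketch of the frame extraction, embedding, and adjacent-frame comparison follows. The Titan Multimodal Embeddings model ID is the standard Bedrock identifier, but the file name and the 0.8 shot threshold are illustrative assumptions rather than values from the guidance.

import base64
import json
import boto3
import cv2
import numpy as np

bedrock = boto3.client("bedrock-runtime")

def embed_frame(jpeg_bytes):
    """Return a Titan Multimodal Embeddings vector for one frame image."""
    body = json.dumps({"inputImage": base64.b64encode(jpeg_bytes).decode("utf-8")})
    response = bedrock.invoke_model(modelId="amazon.titan-embed-image-v1", body=body)
    return np.array(json.loads(response["body"].read())["embedding"])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Extract roughly one frame per second at 392x220 and embed each sampled frame
cap = cv2.VideoCapture("meridian.mp4")
step = int(round(cap.get(cv2.CAP_PROP_FPS)))
embeddings, frame_meta, index = [], [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % step == 0:
        ok, jpeg = cv2.imencode(".jpg", cv2.resize(frame, (392, 220)))
        embeddings.append(embed_frame(jpeg.tobytes()))
        frame_meta.append({"timestamp_ms": int(cap.get(cv2.CAP_PROP_POS_MSEC))})
    index += 1
cap.release()

# Group consecutive frames into shots whenever adjacent similarity drops below a threshold
SHOT_THRESHOLD = 0.8   # assumed value; calibrate for your content
shots = [[0]]
for i in range(1, len(embeddings)):
    if cosine(embeddings[i - 1], embeddings[i]) >= SHOT_THRESHOLD:
        shots[-1].append(i)
    else:
        shots.append([i])

# Record each frame's shot ID for the scene-grouping step
for shot_id, frame_ids in enumerate(shots):
    for i in frame_ids:
        frame_meta[i]["shot"] = shot_id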

Even after identifying individual camera shots, there may still be too many semantically similar shots depicting the same setting. To further cluster these into distinct scenes, we expand frame comparison beyond adjacent frames. By looking at similar frames across an expanded time window, we can identify shots that are likely part of the same contiguous scene. We calculate pairwise similarity scores between all frames within a given time window. Frames with similarity scores above a certain threshold are considered part of the same scene group. This process is performed recursively across all frames in a shot. The time window size and similarity threshold are calibrated parameters that can significantly impact scene boundary detection accuracy. In our example, a 3-minute time window and a 0.85 similarity threshold gave the best scene clustering results across our video samples.

Technically, we accomplish scene grouping by first indexing all video frames using TME again and storing the embeddings along with their shot information and timestamps into a vector database, as illustrated in the following figure.

We then perform a recursive similarity search against this indexed frame corpus. For each frame, we find all other frames within a 3-minute time window in both directions with greater than 85% contextual similarity based on their vector representations. The shot information for these highly similar frames is recorded. This process iterates over all frames in a shot to compile a list of contextually similar shots, and then repeats across all shots. The compiled results look like this example:

shot 1 → 2, 3, 4
shot 2 → 1, 3
shot 3 → 2, 4, 5
shot 7 → 8, 9

Finally, we run a reduction process to group shots that are mutually identified as highly similar into distinct scene groups as follows:

shot 1, 2, 3, 4, 5 → scene 1
shot 7, 8, 9 → scene 2
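
For illustration, the following is a simplified in-memory version of the window search and reduction steps, reusing the embeddings, shots, frame_meta, and cosine() helpers from the sketch above. The 3-minute window and 0.85 threshold come from the text; the guidance uses a vector database for the search, which changes only the lookup, not the grouping logic.

WINDOW_MS = 3 * 60 * 1000      # 3-minute window in both directions
SCENE_THRESHOLD = 0.85

def similar_shots(frame_idx):
    """Shots containing frames highly similar to this frame within the time window."""
    t0 = frame_meta[frame_idx]["timestamp_ms"]
    hits = set()
    for j, meta in enumerate(frame_meta):
        if j != frame_idx and abs(meta["timestamp_ms"] - t0) <= WINDOW_MS:
            if cosine(embeddings[frame_idx], embeddings[j]) >= SCENE_THRESHOLD:
                hits.add(meta["shot"])
    return hits

# Build a shot -> related shots map, e.g. {1: {2, 3, 4}, 2: {1, 3}, ...}
related = {}
for shot_id, frame_ids in enumerate(shots):
    related[shot_id] = set()
    for i in frame_ids:
        related[shot_id] |= similar_shots(i)
    related[shot_id].discard(shot_id)

# Make the relation symmetric so grouping does not depend on search direction
undirected = {shot_id: set(neighbors) for shot_id, neighbors in related.items()}
for shot_id, neighbors in related.items():
    for n in neighbors:
        undirected[n].add(shot_id)

# Reduction: merge mutually related shots (connected components) into scene groups
scene_groups, assigned = [], set()
for shot_id in undirected:
    if shot_id in assigned:
        continue
    stack, component = [shot_id], set()
    while stack:
        current = stack.pop()
        if current not in component:
            component.add(current)
            stack.extend(undirected[current])
    assigned |= component
    scene_groups.append(sorted(component))   # e.g. shots [1, 2, 3, 4, 5] -> one scene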

This allows us to segment the initially detected shot boundaries into higher-level semantic scene boundaries based on visual and temporal coherence. The end-to-end process is illustrated in the following diagram.

After this process, the 118 shots are grouped into 25 unique scenes. Following are some example scene grids:

Align scene and chapter

At this point, we have separately processed the visual and audio cues from the video. Now, we bring them together and ensure that the transcription chapters align with the scene breaks. We do not want to insert an ad during an ongoing conversation or scene. To create alignment, we iterate over each conversational chapter, represented by its start and end timestamps, and a text description summarizing the topic. For each chapter, the code identifies the relevant video scenes that overlap or fall within the chapter’s timestamp range. The output is a list of chapters, where each chapter contains a list of scene IDs representing the video scenes aligned with the corresponding audio conversation. After the alignment process, we have combined visual and audio cues into 20 final chapters. The identified breaks are what the system suggests as ideal places for ad insertion. In real-world applications, we recommend surfacing these breaks as suggestions to the operator and having a human-in-the-loop step to confirm the final ad breaks.
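
A simplified sketch of the alignment logic follows; scene objects with scene_id, start_ms, and end_ms fields are assumed, and the notebook’s data structures may differ.

def align_chapters_to_scenes(chapters, scenes):
    """For each conversational chapter, collect the scenes that overlap its time range."""
    aligned = []
    for chapter in chapters:
        scene_ids = [
            scene["scene_id"]
            for scene in scenes
            # Overlap test: the scene starts before the chapter ends
            # and ends after the chapter starts
            if scene["start_ms"] < chapter["end_ms"] and scene["end_ms"] > chapter["start_ms"]
        ]
        aligned.append({**chapter, "scene_ids": scene_ids})
    return aligned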

A UI example of a suggested ad break for human review.

Generate the contextual response

The last step in the process is sending the aligned visual and audio data to Anthropic Claude 3 Haiku to generate contextual information for each chapter. This innovative approach takes advantage of the Anthropic Claude 3 family’s multimodal capabilities. From our testing, these models demonstrate the ability to capture minute details from large images and follow image sequences when provided with appropriate instructions.

To prepare the input for Anthropic Claude 3 Haiku, we first assemble video frames associated with each chapter and create a composite image grid. Through experimentation, we found the optimum image grid ratio is 7 rows by 4 columns, assembling a 1568 x 1540-pixel image that fits under Anthropic Claude 3 Haiku’s 5 MB image file size limit while preserving enough detail in each individual frame tile. You can assemble multiple images if needed.
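
A hedged sketch of assembling the 7×4 composite grid with Pillow from the 392×220 frame tiles follows; the file naming and resizing details are ours, not the notebook’s.

from PIL import Image

TILE_W, TILE_H = 392, 220
COLS, ROWS = 4, 7              # 4 columns x 7 rows -> 1568 x 1540 pixels

def build_grid(frame_paths):
    """Paste up to 28 frame images, in order, into a single composite grid."""
    grid = Image.new("RGB", (COLS * TILE_W, ROWS * TILE_H), "black")
    for idx, path in enumerate(frame_paths[: COLS * ROWS]):
        tile = Image.open(path).resize((TILE_W, TILE_H))
        grid.paste(tile, ((idx % COLS) * TILE_W, (idx // COLS) * TILE_H))
    return grid

# A chapter with more than 28 sampled frames can be split across several grids
grid = build_grid(chapter_frame_paths)     # chapter_frame_paths: frame files for one chapter
grid.save("chapter_001_grid.jpg", "JPEG", quality=80)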

Subsequently, we feed the composite images, transcription, IAB Content Taxonomy definitions, and GARM taxonomy definitions into the prompt to generate descriptions, sentiment, IAB taxonomy, GARM taxonomy, and other relevant information in a single query to the Anthropic Claude 3 Haiku model. We can adapt this approach to any taxonomy or custom labeling use case without the need to train a model each time, showcasing the true power of this approach. The final output can be presented to a human reviewer for confirmation if needed. Following is an example of a composite image grid and the corresponding contextual output for a specific chapter.

Description: The scene depicts a man, likely a detective or law enforcement officer, driving alone in a black sedan through a desert landscape. The sequence of frames shows the man’s face and expressions as he drives, conveying a sense of focus, determination, and perhaps some unease or concern. The scene has a dark, moody tone, with the man’s suit and the car’s interior creating a somber, serious atmosphere. Overall, the scene suggests a narrative of a detective or agent investigating a case or situation, with the desert setting adding a sense of isolation and remoteness. (95%)

sentiment: Neutral (85%)
iab_taxonomy: Automotive (90%)
garm_taxonomy: Crime & Harmful acts to individuals and Society, Human Right Violations (85%)
brands_and_logos: None
relevant_tags: Detective, Investigation, Noir, Isolation, Suspense
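
For reference, a single multimodal request producing this kind of output could look roughly like the following; the prompt wording and the chapter_transcript, iab_definitions, and garm_definitions variables are illustrative placeholders rather than the notebook’s actual prompt.

import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

with open("chapter_001_grid.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# chapter_transcript, iab_definitions, and garm_definitions are placeholder variables
prompt = (
    "The image is a grid of frames sampled in order from one chapter of a video. "
    f"The chapter transcript is in <transcript>{chapter_transcript}</transcript>. "
    f"Using the IAB taxonomy in <iab>{iab_definitions}</iab> and the GARM taxonomy in "
    f"<garm>{garm_definitions}</garm>, return JSON with description, sentiment, "
    "iab_taxonomy, garm_taxonomy, brands_and_logos, and relevant_tags fields, each "
    "with a confidence score."
)

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/jpeg", "data": image_b64}},
                {"type": "text", "text": prompt},
            ],
        }],
    }),
)
contextual = json.loads(response["body"].read())["content"][0]["text"]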

Cost breakdown

Finally, it’s important to consider the cost implications of this process. For this 12-minute video clip, we achieved cost-effective results by leveraging Amazon’s services and generative AI capabilities. The entire process, from frame extraction to ad break detection to contextual metadata generation, took 25 seconds and incurred a total cost of ~$0.36. Following is the detailed cost breakdown for each step:

When it comes to video processing, the cost and processing time primarily depend on the complexity of the video content, specifically the number of shots and scene transitions. Complexity can vary significantly across different types of video, making it challenging to provide an accurate cost estimate based solely on the video duration. We do not recommend extrapolating the cost linearly based on video length (e.g., assuming a 1-hour video will cost five times as much as a 12-minute video), as this may lead to inaccurate estimates and potential budgeting issues. Instead, we advise conducting a thorough analysis of similar video types before calculating cost. By analyzing videos with comparable content and complexity, you can gain a realistic understanding of the expected processing requirements and associated costs.

To provide a general reference point, we conducted tests on 1-hour TV footage. The average processing time for this type of content is approximately 30 minutes, with an estimated cost of $0.75 for 26 scenes. However, this figure serves only as an estimate and may vary considerably depending on the specific characteristics of your video.

Cleanup

To avoid incurring AWS charges after testing the guidance, make sure you delete the following resources:

  • Amazon SageMaker Studio Domain

Conclusion

In conclusion, this blog post demonstrates the powerful potential of generative AI in creating contextually relevant insights and taxonomies for advertising. By using the latest Anthropic Claude 3 models from Amazon Bedrock, we leverage their multimodal capabilities to account for both audio and visual cues simultaneously. This enables us to summarize effectively, extract metadata, and generate taxonomies from media assets. This approach not only improves the efficiency of the contextualization process but also reduces associated costs. For instance, with one hour of TV footage, we reduced the processing cost to $0.75 for 26 scenes with Anthropic Claude 3 Sonnet and $0.06 for 26 scenes with Anthropic Claude 3 Haiku, while simultaneously simplifying the solution implementation. This advancement allows broadcasters and publishers to extend this contextualizing approach to a broader range of media assets, unlocking new monetization opportunities for their existing content.

We encourage you to check out other resources to explore more generative AI use cases for the media and entertainment industry.

Ken Shek

Ken Shek is an AWS Principal Solutions Architect specializing in Data Science and Analytics for the Global Media, Entertainment, Games, and Sports industries. He assists media customers in designing, developing, and deploying workloads on the AWS Cloud using best practices. Passionate about artificial intelligence and machine learning use cases, he has built the Media2Cloud on AWS guidance to help hundreds of customers ingest and analyze content, enriching its value.

Alex Burkleaux

Alex Burkleaux is a Sr. AI/ML Specialist Solutions Architect at AWS. She helps customers use AI Services to build media solutions. Her industry experience includes over-the-top video, database management systems, and reliability engineering.

James Wu

James Wu is a Senior AI/ML Specialist Solutions Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in the marketing and advertising industries.

Amit Kalawat

Amit Kalawat is a Senior Solutions Architect at Amazon Web Services based out of New York. He works with enterprise customers as they transform their business and journey to the cloud.