AWS for M&E Blog

Analyzing and tagging video for optimal ad targeting and placement


The quantity of video content is expanding rapidly, driven by both professional publishers and the contributions of social media users. As users increasingly demand content that resonates with their interests, the role of recommendation engines becomes critical. For video, these engines rely on rich meta-information to deliver relevant suggestions. However, the sheer volume of content makes manual analysis impractical.

Meticulous tagging and description of videos by creators for recommendation engines poses its own set of challenges. Failure to categorize videos effectively makes it challenging to offer viewers material that interests them, which can lead to reduced engagement. In addition, the inability to precisely target advertisements can diminish the effectiveness of ad campaigns.

In this blog post, we delve into techniques aimed at automating the creation of tags and descriptions for video content. These tags and descriptions can be used to feed content recommendation engines or advertising platforms. We also explore how these techniques can be extended to determine the optimal placement of advertisements in a video.

An architecture diagram showing Amazon Transcribe, Amazon Rekognition, and Amazon Bedrock

Experimenting with Analysis Techniques on AWS

There are two ways to explore the analysis techniques described in this blog post:

  1. Use the open-source Media2Cloud guidance: Media2Cloud is a pre-built package that leverages many of the techniques covered here for analyzing media content. The guidance is available on the AWS Solutions Library. You can read more about the generative AI features within Media2Cloud here. With Media2Cloud, you can generate metadata, analyze images, detect segments, and identify potential ad placement points.
  2. Try the sample code: If you want to experiment with the code yourself, we’ve provided a Jupyter notebook with sample code in this blog post. You can access the notebook from the AWS Samples GitHub repository. The notebook has been tested on Amazon SageMaker, but you can run it in any environment with the necessary prerequisites. The notebook allows you to analyze your own video files, generate tags, and determine the best ad placement locations.

Whether you use the pre-built Media2Cloud guidance or the sample code, these resources will help you get started with applying generative AI and machine learning techniques to analyze and optimize your media content on AWS.

Matching relevant content

The first part of this blog post discusses automation for tagging video content, for which there are many use cases. Automation can make it easier for users to search for content. It can also be used to build a profile of viewer behavior and topics of interest. Or it can be used to ensure advertisements are relevant to the video’s content.

We use both the speech and visuals in the video to generate relevant information. This dual approach ensures a comprehensive understanding of the content, leading to enriched metadata. This not only facilitates efficient organization and retrieval, it also opens avenues for personalized recommendations and targeted advertising.

Analyzing Speech

While speech may not be prevalent in every video, it is a significant component for a vast majority. Amazon Transcribe is a fully managed automatic speech recognition (ASR) service from Amazon Web Services (AWS) that supports over 100 languages. In our use case, Amazon Transcribe is used to capture a transcription of the speech in a video and the timestamps when speech occurs. The timestamps are used later to determine where breaks in the content are located.
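As a rough sketch of how those timestamps can be used, the following parses an Amazon Transcribe result document (the JSON the service writes out, with per-word `start_time`/`end_time` fields) and finds the silent gaps between speech. The `min_gap` threshold and helper names are our own illustrative choices, and we assume the transcript JSON has already been downloaded.

```python
def speech_segments(transcript_json):
    """Extract (start, end) times in seconds for each spoken word
    from an Amazon Transcribe result document."""
    segments = []
    for item in transcript_json["results"]["items"]:
        # Punctuation items carry no timestamps; skip them.
        if item["type"] != "pronunciation":
            continue
        segments.append((float(item["start_time"]), float(item["end_time"])))
    return segments

def silence_gaps(segments, min_gap=2.0):
    """Return (start, end) gaps between consecutive speech segments
    that last at least min_gap seconds."""
    gaps = []
    for (_, prev_end), (next_start, _) in zip(segments, segments[1:]):
        if next_start - prev_end >= min_gap:
            gaps.append((prev_end, next_start))
    return gaps
```

The gaps returned here feed directly into the break-point scoring discussed later in this post.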

The transcribed text is fed into a large language model (LLM) for processing. An LLM is a foundation model that can be used for general-purpose language understanding and generation. By feeding our transcription into an LLM, it is possible to ask questions of the transcription, such as the topic of the video or what tags could be applied.

Amazon Bedrock is a fully managed AWS service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon via a single API.

In this use case, we use Amazon Bedrock to analyze our transcription.

An example of asking questions of the transcription using Amazon Bedrock.

The input to an LLM is known as a prompt, and each model has different conventions for how best to format it. An example prompt to analyze a transcription could be:

Here is a transcript from a video:
<Enter the transcription>

Analyze this transcript and determine what type of content this is and describe what is happening in the transcription.

This will produce a text analysis of what is happening in the video.
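As a rough illustration, a prompt like this can be sent to a Claude model through the Amazon Bedrock Runtime `invoke_model` API. This is a minimal sketch: the model ID and `max_tokens` value are illustrative choices, and the request body follows the Anthropic messages format used by Claude models on Bedrock.

```python
import json

def build_prompt_body(transcript, max_tokens=512):
    """Build an Anthropic-style request body asking the model to
    analyze a video transcript."""
    prompt = (
        "Here is a transcript from a video:\n"
        f"{transcript}\n\n"
        "Analyze this transcript and determine what type of content this is "
        "and describe what is happening in the transcription."
    )
    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

def analyze_transcript(transcript, model_id="anthropic.claude-3-sonnet-20240229-v1:0"):
    """Send the prompt to Amazon Bedrock and return the model's text reply."""
    import boto3  # imported lazily; this call requires AWS credentials
    client = boto3.client("bedrock-runtime")
    response = client.invoke_model(
        modelId=model_id,
        body=json.dumps(build_prompt_body(transcript)),
    )
    payload = json.loads(response["body"].read())
    return payload["content"][0]["text"]
```

Separating the request-building helper from the API call makes it easy to iterate on the prompt wording without invoking the model.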

You can take this one step further and ask for a set of tags to be produced that can be used as metadata for content or advertisement matching engines.

Here is a transcript from a video:
<Enter the transcription>

What are the top three keywords you would use to describe the content above? Format them as a comma-separated list.

Example code is provided in the accompanying notebook, which you can use to experiment with different prompts for your use case.

If using the Media2Cloud guidance, you can read about the generative AI functionality here.

Analyzing Video

What if your video content lacks speech, or the spoken word fails to convey the essence of the video? In such cases, we can supplement speech-based analysis by analyzing visual elements of the content and generating a caption for every scene in the video.

It is possible to use an AI model to generate a caption of our image—in other words, an image-to-text model. In the accompanying notebook, we use the multi-modal capabilities of the Claude 3 model via Amazon Bedrock to perform our image-to-caption processing. Claude 3 models have sophisticated vision capabilities and can extract understanding from an array of visual formats.
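To sketch what such a request looks like, the following builds a multimodal request body for a Claude 3 model on Amazon Bedrock: the frame is base64-encoded and passed alongside a short captioning instruction. The model ID, the caption prompt wording, and the helper names are illustrative assumptions, not the notebook's exact code.

```python
import base64
import json

def build_caption_body(image_bytes, media_type="image/jpeg", max_tokens=300):
    """Build a multimodal request body asking Claude 3 (via Amazon Bedrock)
    to caption a single video frame."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                # The frame itself, sent inline as base64 data.
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": media_type,
                            "data": encoded}},
                # The captioning instruction (illustrative wording).
                {"type": "text",
                 "text": "Describe what is happening in this video frame "
                         "in one or two sentences."},
            ],
        }],
    }

def caption_frame(image_bytes, model_id="anthropic.claude-3-sonnet-20240229-v1:0"):
    """Send one frame to Amazon Bedrock and return the generated caption."""
    import boto3  # imported lazily; this call requires AWS credentials
    client = boto3.client("bedrock-runtime")
    response = client.invoke_model(
        modelId=model_id,
        body=json.dumps(build_caption_body(image_bytes)),
    )
    return json.loads(response["body"].read())["content"][0]["text"]
```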

However, the model works on still images rather than videos. Therefore, we propose breaking down the video into a series of still frames that are fed into the AI model for captioning. While a straightforward approach might involve capturing frames at regular intervals, this method could either overlook rapidly changing scenes or flood the analysis with redundant frames.

Amazon Rekognition is an image and video analysis service that detects objects, scenes, and activities. Amazon Rekognition Segment Detection is a feature of Amazon Rekognition that detects changes of scene (or ‘shots’) within a video. By leveraging this capability, we can precisely pinpoint all the scenes in our video, allowing us to process only relevant still frames.
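A minimal sketch of this step: start a SHOT-type segment detection job on a video in Amazon S3, then convert the segments in the `get_segment_detection` response into timestamps. The bucket and key are placeholders, and polling for job completion is omitted for brevity.

```python
def start_shot_detection(bucket, key):
    """Kick off shot detection for a video stored in S3 and return the job ID."""
    import boto3  # imported lazily; this call requires AWS credentials
    rekognition = boto3.client("rekognition")
    job = rekognition.start_segment_detection(
        Video={"S3Object": {"Bucket": bucket, "Name": key}},
        SegmentTypes=["SHOT"],
    )
    return job["JobId"]

def shot_boundaries(segments):
    """Convert SHOT segments from a get_segment_detection response into
    (start, end) times in seconds."""
    shots = []
    for seg in segments:
        # Segment detection can also return TECHNICAL_CUE segments; keep shots only.
        if seg.get("Type") != "SHOT":
            continue
        shots.append((seg["StartTimestampMillis"] / 1000.0,
                      seg["EndTimestampMillis"] / 1000.0))
    return shots
```

The start time of each shot tells us which frames to extract for captioning, and the boundaries between shots are reused later as break-point indicators.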

By feeding each still frame into the model, we create a caption for each scene. In this illustrative example, we feed in a video of the re:Invent 2023 keynote.

An example graphic from the re:Invent keynote

Generated Caption: This image appears to be from a technology or business conference or event. It shows a large stage with blue lighting and a presenter standing in the middle. On the screen behind the presenter, there is text that states “Over 80% of Unicorns run on AWS”.


An example graphic from the re:Invent keynote

Generated Caption: This image depicts a stage at what appears to be an AWS (Amazon Web Services) tech event or conference called “AWS re:Invent”. The main stage has a large screen displaying the AWS re:Invent logo with vibrant purple and reddish lighting effects. On the stage, there is a person standing behind a podium or lectern, likely giving a presentation or speech to the audience seated in the darkened auditorium area visible in the foreground.

We then use an LLM in a similar fashion to the previous steps to analyze what is happening in the video.

Here are a set of captions of what is seen in a video:
The captions from the video

Summarize what is happening in the video based on the captions above.

Accompanying code is provided in the notebook for testing this method on your own videos.

If using the Media2Cloud guidance, you can read how to generate captions for images and detect scenes here.

Advertisement placement

Another crucial aspect of optimizing the viewing experience involves strategically placing advertisements within videos. The goal is not just relevance but also seamless integration. Breaks in content for advertising are best when aligned with the natural rhythm of the content. These breaks could manifest during moments of silence, scene transitions, or segments without dialogue.

Detecting break points

The techniques presented previously with Amazon Transcribe and Amazon Rekognition are also useful for determining where natural breaks in the content occur.

In the first stage, we used Amazon Transcribe to obtain a transcription of the speech—including the start and end timestamps of speech segments. This is used to infer when speech is not occurring.

In the second stage, we used Amazon Rekognition to detect a change in scene using the Segment Detection feature. This detects where a significant change in the shot occurs and can serve as an indicator of where there is a break in the content.

An additional indicator of a break could be a change in the volume levels of the content. Periods of low audio intensity could potentially indicate break points.
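One simple way to derive this indicator, sketched below with illustrative helper names: compute the RMS level of each one-second window of audio samples, then flag the quietest fraction of seconds. This assumes the audio has already been decoded into a flat list of samples.

```python
import math

def rms_per_second(samples, sample_rate):
    """Compute the RMS level of each one-second window of audio samples."""
    levels = []
    for start in range(0, len(samples), sample_rate):
        window = samples[start:start + sample_rate]
        levels.append(math.sqrt(sum(s * s for s in window) / len(window)))
    return levels

def low_volume_seconds(levels, fraction=0.10):
    """Return the indices (seconds) of the quietest `fraction` of the video."""
    ranked = sorted(range(len(levels)), key=lambda i: levels[i])
    cutoff = max(1, int(len(levels) * fraction))
    return set(ranked[:cutoff])
```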

By combining these three indicators, a ‘score’ is devised for each second (or less if desired) of video as to its suitability for a break in content.

| Time (Seconds) | No Speech | Shot Transition | Volume (Lowest 10%) | Score |
|----------------|-----------|-----------------|---------------------|-------|
| 10             | 0         | 0               | 0                   | 0     |
| 11             | 0         | 0               | 0                   | 0     |
| 12             | 1         | 0               | 0                   | 1     |
| 13             | 1         | 1               | 1                   | 3     |
| 14             | 1         | 0               | 1                   | 2     |
| 15             | 0         | 0               | 0                   | 0     |

In the preceding table, at the 13-second mark of the video there is a period with no speech, a shot transition, and a period of low volume. This indicates an opportune moment for ad placement, aligned with a natural break in the video content.
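The scoring in the table can be sketched as a simple per-second sum of the three 0/1 indicators; the helper names are our own:

```python
def break_scores(no_speech, shot_transition, low_volume):
    """Combine three per-second 0/1 indicators into a break-suitability score."""
    return [a + b + c for a, b, c in zip(no_speech, shot_transition, low_volume)]

def best_break(scores, offset=0):
    """Return (time, score) for the highest-scoring second.

    `offset` is the timestamp of the first entry, so scores can describe
    any window of the video.
    """
    i = max(range(len(scores)), key=lambda i: scores[i])
    return offset + i, scores[i]
```

Running this on the indicator values from the table reproduces the result described above: second 13 scores highest.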

If you want to learn more about how to accurately place breaks in your content using the Media2Cloud guidance, please reference the implementation guide.

Example code is included in the notebook with suggestions on how this could be implemented.


For the techniques described in this post, there is no need to process high-quality, high-definition content. Doing so would increase the processing time and, ultimately, the cost of analyzing the video.

AWS Elemental MediaConvert is a file-based video transcoding service with broadcast-grade features. This service can be used to compress and reduce the resolution of the video before processing. An example of how to use AWS Elemental MediaConvert for this purpose is provided in the accompanying notebook.
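For orientation, a MediaConvert job that produces a low-resolution analysis proxy might use settings shaped like the following. This is an illustrative fragment, not the notebook's exact configuration: the S3 paths are placeholders, and the resolution, quality level, and bitrate are arbitrary choices for a small proxy.

```python
# Illustrative MediaConvert job settings: transcode the source into a
# 640x360 H.264 proxy for analysis. Bucket paths are placeholders.
job_settings = {
    "Inputs": [{
        "FileInput": "s3://your-bucket/source/video.mp4",
        "AudioSelectors": {"Audio Selector 1": {"DefaultSelection": "DEFAULT"}},
    }],
    "OutputGroups": [{
        "Name": "Proxy",
        "OutputGroupSettings": {
            "Type": "FILE_GROUP_SETTINGS",
            "FileGroupSettings": {"Destination": "s3://your-bucket/proxy/"},
        },
        "Outputs": [{
            "ContainerSettings": {"Container": "MP4", "Mp4Settings": {}},
            "VideoDescription": {
                "Width": 640,
                "Height": 360,
                "CodecSettings": {
                    "Codec": "H_264",
                    "H264Settings": {
                        "RateControlMode": "QVBR",
                        "QvbrSettings": {"QvbrQualityLevel": 7},
                        "MaxBitrate": 1_000_000,
                    },
                },
            },
        }],
    }],
}
```

Submitting these settings with `create_job` additionally requires the account-specific endpoint returned by `describe_endpoints` and an IAM role that MediaConvert can assume to read and write the S3 objects.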


With the growth in the quantity of video content, automating the analysis and tagging of content is crucial. By combining speech and visual analysis tools like Amazon Transcribe, Amazon Rekognition, and Amazon Bedrock with the power of large language models, we have presented techniques to create richer metadata for your content. This enables more personalized recommendations and targeted advertising. Additionally, ad placement within videos becomes less intrusive by identifying natural breaks using speech, scene transitions, and volume changes. Collectively, these strategies contribute to sustained user engagement and effective monetization of your platform.

Andrew Thomas


Andrew is a Senior Solutions Architect working within the Strategic Accounts team at AWS. With 15 years of experience helping customers adopt new technologies, he guides them through architecting end-to-end solutions spanning infrastructure, AI, and big data.