AWS for M&E Blog

Unlock the value of Media Archives with AWS

Efficiently managing vast media archives remains a significant challenge for content owners, particularly when insufficient metadata hinders quick asset discovery and reuse. Traditional approaches to archive enrichment, whether through human effort, machine learning (ML), or both, often prove prohibitively expensive or time-consuming.

In this post, we describe an approach that combines generative AI with ML to significantly reduce costs and improve the efficiency of media asset enrichment and management. While our examples focus on news archives, the same strategies can be applied to other content types.

Introduction

The wide range of events a news organization covers is reflected in its content library: political events, wars, interviews, accidents, crimes, segments about the economy, health, celebrities, weather, sports teams…a growing list of events and stories that capture our history and inform our present day. News organizations rely on their content libraries to provide context to current events:

  • The new political candidate: What did (s)he say about immigration in previous speeches and interviews?
  • The incoming “storm of the decade”: How was the community impacted during the last major storm?
  • The athlete that was just drafted to a top team: Do we have good footage from his or her high school games?

Unfortunately, for every recording of a pivotal speech, there’s likely footage of an empty podium. The incredible storm footage from twenty years ago may be submerged in a sea of mundane weather B-roll. For every incredible shot of a high school star making a goal, there are multiple nondescript wide-angle clips.

Metadata is an essential pillar in content search and discoverability, but the fast-paced nature of news production often leads to sparse metadata. It’s not unusual to see XDCAM discs labeled “Storm 07/03/04”, files named “CRASH_REC”, and legacy database exports with timecodes and cryptic comments. Some news organizations have staff dedicated to wrangling metadata from various sources and logging clips, and some are now leveraging AI/ML to help automatically generate metadata. These approaches may provide great ROI for high-value content, but how do we sort through all the chaff so we’re not wasting time and resources logging empty podium shots or unlikely-to-be-used B-roll?

The following strategy will help optimize costs while enriching archived and future content within your content libraries.

Strategy

We’ll focus on optimizing costs for both the metadata enrichment and ongoing storage of assets. We will minimize metadata enrichment costs by leveraging existing metadata and applying the services that will provide the best value for a particular asset. Furthermore, we will discuss how the anticipated usage determines the storage tier.

Figure A: Architecture diagram for contextual analysis and asset tiering, featuring Amazon Bedrock and AWS AI/ML services.

Figure A illustrates an example workflow where:

  1. Content is digitized and the video file is uploaded to Amazon Simple Storage Service (Amazon S3).
  2. A composite image grid is generated from the video. If a transcript doesn’t already exist, one is generated with Amazon Transcribe. Both the image grid and transcript are sent to Anthropic Claude in Amazon Bedrock—using generative AI to provide a low-cost contextual analysis. Later we’ll demonstrate the tremendous amount of metadata that Anthropic Claude Haiku in Amazon Bedrock can cost-effectively provide.
  3. Business logic is applied to the contextual analysis to classify assets into one of four tiers. Appropriate enrichment (performed with Amazon Rekognition) and storage policies are applied to each tier of content.

Each tier specifies the content type, enrichment requirements, AWS storage class, and retrieval details:

  • Gold: breaking news content; additional enrichment; Amazon S3 Glacier Instant Retrieval (millisecond retrieval)
  • Silver: content accessed roughly yearly; additional enrichment; Amazon S3 Glacier Flexible Retrieval (retrieval in minutes to hours)
  • Bronze: rarely accessed content; no additional enrichment; Amazon S3 Glacier Deep Archive (retrieval within 12 hours)
  • Marked for Deletion

Table 1: Example classification and tiering summary.

In this example, gold and silver-tiered content have additional enrichment applied and are stored in more readily accessible storage. We’re simplifying this example by having the classification tier determine both the additional enrichment and the storage class to be used.

Customers’ environments can use different tier classifications for enrichment and for storage, which better accommodates the range of assets they manage. Assets that are valuable enough to justify additional enrichment may not need to be retrievable within milliseconds. Additional details about enrichment and storage strategies are provided in the following sections.
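
As an illustration, the following is a minimal sketch of tier-assignment business logic in Python. The field names, values, and thresholds are hypothetical assumptions, not the exact rules used in our testing:

```python
from datetime import date

def classify_tier(analysis: dict) -> str:
    """Map generative AI contextual analysis output to a classification tier.

    Field names and thresholds are illustrative; real business logic would
    reflect an organization's own editorial and business requirements.
    """
    if analysis.get("news_classification") == "breaking news":
        return "Gold"
    broadcast_date = analysis.get("broadcast_date")
    if broadcast_date and broadcast_date >= date(2020, 1, 1):
        return "Silver"
    if analysis.get("topics"):          # still-usable segment or B-roll content
        return "Bronze"
    return "Marked for Deletion"        # e.g., empty podium shots
```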

Operational Design

Figure B: Frame from an archive weather report. Content provided by CBS News and Stations.

Varying levels of metadata can be extracted from on-screen elements such as people, scenes, text, and graphics. It is also important to leverage existing metadata, including labels on film canisters or tape, segment guides, and transcripts, which often have a high information density. The video and audio of the asset are analyzed using Amazon Bedrock large language models (LLMs) for near real-time evaluation.

Analyzing Audio

Amazon Transcribe is an automatic speech recognition (ASR) service that generates a transcript from the audio dialogues of an asset, such as the news in our example. If a transcript exists, this step can be skipped, and the existing transcript can be leveraged for analysis.
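
A minimal sketch of starting an asynchronous transcription job with the AWS SDK for Python (Boto3); the bucket, key, and job name are illustrative placeholders:

```python
import boto3

transcribe = boto3.client("transcribe")

# Start an asynchronous transcription job for the digitized asset.
transcribe.start_transcription_job(
    TranscriptionJobName="weather-report-archive-001",   # illustrative name
    Media={"MediaFileUri": "s3://media-archive/digitized/weather_report.mp4"},
    MediaFormat="mp4",
    IdentifyLanguage=True,   # let Amazon Transcribe detect the spoken language
    OutputBucketName="media-archive-metadata",
)
```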

Analyzing Video

To generate a contextual response using foundation models (FMs) from Amazon Bedrock, it is important to align the video and audio data before sending them to the FM for analysis. Using AWS Elemental MediaConvert frame capture, a composite image grid is created from frames extracted from the video to prepare the input for analysis. Because the composite grid presents the extracted frames in sequence, the FM can be instructed to reason about the temporal structure of the video.
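
As a sketch of the compositing step, the following tiles frames (assumed already extracted by a MediaConvert frame capture output) into a single grid image using the Pillow library; the paths, grid width, and tile size are illustrative:

```python
from pathlib import Path
from PIL import Image

def build_image_grid(frame_paths, columns=5, tile_size=(320, 180)):
    """Tile extracted video frames, in temporal order, into one composite grid."""
    rows = -(-len(frame_paths) // columns)   # ceiling division
    grid = Image.new("RGB", (columns * tile_size[0], rows * tile_size[1]))
    for i, path in enumerate(frame_paths):
        frame = Image.open(path).resize(tile_size)
        grid.paste(frame, ((i % columns) * tile_size[0], (i // columns) * tile_size[1]))
    return grid

# Frames written by the MediaConvert frame capture output (illustrative path)
frames = sorted(Path("frames/").glob("*.jpg"))
build_image_grid(frames).save("composite_grid.jpg")
```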

The composite images, transcript, taxonomy definitions, and other relevant information (such as news and broadcast classification) are then presented as a single query to Anthropic Claude 3 Haiku in Amazon Bedrock. LLMs can cost-effectively analyze and summarize content and generate new metadata that can be used for classifications.

Submitting a single prompt with multiple questions to the LLM allows it to summarize the video into a concise description based on the asset’s audio and video. For news assets, we have also used the Interactive Advertising Bureau (IAB) classification to gain additional contextual information. This enhances understanding, improves search and discovery, and provides the context necessary for media management.

Contextual information that can be gained for news assets includes, but is not limited to:

  • Description
  • Anchors and reporters
  • Broadcast date
  • Show
  • Topics
  • Statistics
  • Themes
  • Notable quotes
  • On-screen celebrities and personalities
  • News classification
  • Broadcast classification
  • Technical cues
  • Language
  • Brands and logos
  • Relevant tags

This approach can be adapted based on the content type.
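
To make this concrete, here is a minimal sketch of a single Amazon Bedrock query that combines the composite image grid and the transcript. The model ID is the published Claude 3 Haiku identifier; the file paths, prompt wording, and output fields are illustrative assumptions:

```python
import base64
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

with open("composite_grid.jpg", "rb") as f:
    grid_b64 = base64.b64encode(f.read()).decode()

with open("transcript.txt") as f:
    transcript = f.read()

# One prompt, many questions: request all contextual fields in a single call.
prompt = (
    "The image is a grid of video frames in temporal order; the text below is "
    "the transcript. Return JSON with: description, anchors_and_reporters, "
    "broadcast_date, show, topics, themes, notable_quotes, "
    "on_screen_personalities, news_classification, broadcast_classification, "
    "language, brands_and_logos, and relevant_tags.\n\nTranscript:\n" + transcript
)

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1500,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64",
                                             "media_type": "image/jpeg",
                                             "data": grid_b64}},
                {"type": "text", "text": prompt},
            ],
        }],
    }),
)
analysis = json.loads(response["body"].read())["content"][0]["text"]
```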

Figure C: Output of generative AI contextual analysis. Content provided by CBS News and Stations.

As shown in the output of the generative AI contextual analysis (Figure C), using the LLM to analyze media assets provides detailed descriptive metadata which can be used to enhance search and discovery and provide the necessary information for asset tiering.

Video Classification

Based on the output of the generative AI contextual analysis and business logic, further content analysis may or may not be necessary to extract time-series based metadata.

This strategy can enhance the search and discovery of content while also driving efficiencies as a content repository grows. With a growing repository, it is recommended to implement a media management strategy so content can be stored across the appropriate storage tiers.

Based on a news organization’s business requirements, there may be a need to keep content generated after a certain date, or aligned to current events, stored in Amazon S3 Glacier Instant Retrieval. This storage tier is designed for rarely accessed data that still needs immediate access with retrievals in milliseconds. In contrast, B-roll footage or segment content from over 20 years ago can be stored in lower-cost Amazon S3 Glacier Deep Archive, where retrieval time is within 12 hours.
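
As a sketch, applying a tier’s storage class can be as simple as rewriting the object with a new storage class (lifecycle rules are another common mechanism); the bucket and key here are illustrative:

```python
import boto3

s3 = boto3.client("s3")

# Move an asset classified as Bronze into S3 Glacier Deep Archive by
# copying it over itself with a new storage class.
s3.copy_object(
    Bucket="media-archive",
    Key="broll/weather_1996.mxf",
    CopySource={"Bucket": "media-archive", "Key": "broll/weather_1996.mxf"},
    StorageClass="DEEP_ARCHIVE",
    MetadataDirective="COPY",
)
```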

Identifying People in News

Although the LLM can identify celebrities, such as nationally known broadcast reporters, there are a number of people who may not be identified. Two approaches can be used to identify people in the news: the Amazon Rekognition RecognizeCelebrities API, and the IndexFaces and SearchFaces APIs.

The Amazon Rekognition celebrity recognition API is designed to automatically recognize celebrities and well-known personalities in images and videos using machine learning. However, there are often cases where local celebrities (such as news anchors, meteorologists, and field correspondents) are not identified by the celebrity recognition API.

In such cases, Amazon Rekognition offers face “Collections”, which are used for managing information related to faces. A custom collection can be created in which each unique face is stored as a face embedding. Detected faces are compared against the collection using the SearchFaces API, and faces that have not been captured previously are added with the IndexFaces API. This custom face collection then acts as a discoverable database of face vectors.
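
A minimal sketch of this two-step approach with Boto3, using the SearchFacesByImage variant of the search API; the collection name and match threshold are illustrative, and error handling (for example, frames with no detectable face) is omitted:

```python
import boto3

rekognition = boto3.client("rekognition")
COLLECTION_ID = "news-faces"   # illustrative collection name

def identify_or_index(image_bytes: bytes) -> list[str]:
    """Try celebrity recognition first, then fall back to the custom collection."""
    celebs = rekognition.recognize_celebrities(Image={"Bytes": image_bytes})
    if celebs["CelebrityFaces"]:
        return [c["Name"] for c in celebs["CelebrityFaces"]]

    # Search the custom collection for a previously indexed face.
    match = rekognition.search_faces_by_image(
        CollectionId=COLLECTION_ID,
        Image={"Bytes": image_bytes},
        FaceMatchThreshold=90,
        MaxFaces=1,
    )
    if match["FaceMatches"]:
        return [match["FaceMatches"][0]["Face"]["FaceId"]]

    # Unknown face: index it so it becomes discoverable in future searches.
    indexed = rekognition.index_faces(
        CollectionId=COLLECTION_ID, Image={"Bytes": image_bytes}, MaxFaces=1
    )
    return [record["Face"]["FaceId"] for record in indexed["FaceRecords"]]
```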

In testing news assets, roughly 15 distinct faces were observed per 30-minute news program.

Uncovering Efficiencies

Processing at Fixed Versus Dynamic Intervals

Processing at Fixed Intervals

Processing at a fixed frame rate (for example, one frame per second, or one frame every two seconds) can certainly capture a large amount of metadata, but it comes with trade-offs in the cost of computer vision API calls. The advantages of this approach include comprehensive metadata capture, more on-screen text captured, and full correlation between visual metadata and the transcript. The following image (Figure D) is an example of processing at a fixed frame rate.

The composite image grid in Figure D is composed of frames extracted from the archived weather report at fixed intervals, such as one frame every second or every two seconds. The sequence begins with a news anchor introducing the winter weather story, followed by a map of the weather pattern. Additional reporters and meteorologists then provide commentary across nine successive scenes: the anchor in studio, a snowy airport, a national map, a split screen of the anchor and a remote field correspondent, the correspondent alone, near-whiteout traffic, another national map, the meteorologist at the in-studio map, and finally local traffic with cars being pushed through the snow.

Figure D: Composite image grid created when processing content at fixed intervals. Content provided by CBS News and Stations.

Dynamic Frame Analysis

Configuring the metadata enrichment framework to process news media assets dynamically reduces cost by limiting the number of frames sent to Amazon Bedrock and Amazon Rekognition. The solution measures the Hamming distance between perceptual hashes computed for each extracted image frame to decide when a frame has changed significantly, and only calls Amazon Rekognition APIs when a visual change indicates the need for new analysis.
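
A minimal sketch of this frame-change detection using the open-source imagehash library; the hash type and distance threshold are illustrative assumptions:

```python
from pathlib import Path

import imagehash
from PIL import Image

def select_changed_frames(frame_paths, threshold=12):
    """Keep only frames whose perceptual hash differs enough from the last kept frame."""
    selected, last_hash = [], None
    for path in frame_paths:
        phash = imagehash.phash(Image.open(path))
        # Subtracting two image hashes yields their Hamming distance.
        if last_hash is None or phash - last_hash > threshold:
            selected.append(path)
            last_hash = phash
    return selected

frames = sorted(Path("frames/").glob("*.jpg"))
changed = select_changed_frames(frames)
print(f"Analyzing {len(changed)} of {len(frames)} frames")
```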

While this method reduces the cost of computer vision APIs, it may miss on-screen text that appears and disappears between API calls.

To provide a general reference point, we conducted tests on news segments ranging from five minutes to one hour. In testing, this approach reduced API calls by an average of 83%.

Additional testing results:

  • High density content, such as sports highlights and war footage, saw roughly a 70% reduction in Amazon Rekognition Image API calls.
  • Low density content, such as press conferences and one-on-one interviews, saw roughly a 90% reduction in Amazon Rekognition Image API calls.

Figure E shows the same composite image grid of the archived weather report, this time with dynamic frame analysis applied: of the 90 frames extracted, only 20 were analyzed. The highlighted frames show the news anchor in the studio, winter weather impacting travel, and a storm radar map, along with the reporters and meteorologists covering the winter weather.

Figure E: Dynamic frame analysis. Content provided by CBS News and Stations.

Media Management and Storage Tiering

Further efficiencies are realized by tiering assets. Using the contextual information derived from the asset, we can implement a more efficient media management and enrichment strategy.

Pricing Breakdown

To provide a general reference point, we will use the 5-minute 36-second news segment clip referenced earlier (shown in Figures D and E) to map out pricing for each classification tier. The size of the video used is 12.3 GB. It is important to note that dynamic frame analysis was used during the metadata enrichment stage: of the 668 frames extracted from the video, only 147 were used in the analysis, a 78% reduction in Amazon Rekognition Image API calls.

The gold tier analysis includes the following processes, AWS services, and example costs:

  • Frame embedding: Amazon Bedrock (Multimodal Embedding)
  • Transcription: Amazon Transcribe
  • Contextual analysis: Amazon Bedrock (Anthropic Claude Haiku)
  • Time series-based metadata enrichment: Amazon Rekognition Image and Video APIs
  • Storage: Amazon S3 Glacier Instant Retrieval
  • Total cost: $1.6597 with transcription; $1.5253 without

Table 2: Example Cost Scenario for Gold Tier Analysis.

The silver tier analysis includes:

  • Frame embedding: Amazon Bedrock (Multimodal Embedding)
  • Transcription: Amazon Transcribe
  • Contextual analysis: Amazon Bedrock (Anthropic Claude Haiku)
  • Time series-based metadata enrichment: Amazon Rekognition Image APIs
  • Storage: Amazon S3 Glacier Flexible Retrieval
  • Total cost: $1.0860 with transcription; $0.8220 without

Table 3: Example Cost Scenario for Silver Tier Analysis.

The bronze tier analysis includes:

  • Frame embedding: Amazon Bedrock (Multimodal Embedding)
  • Transcription: Amazon Transcribe
  • Contextual analysis: Amazon Bedrock (Anthropic Claude Haiku)
  • Storage: Amazon S3 Glacier Deep Archive
  • Total cost: $0.1729 with transcription; $0.0385 without

Table 4: Example Cost Scenario for Bronze Tier Analysis.

First-year cost summary (one-time enrichment cost plus annual storage cost), with and without transcription:

  • Gold: $2.20 with transcription; $2.07 without
  • Silver: $1.57 with transcription; $0.61 without
  • Bronze: $0.46 with transcription; $0.20 without

Table 5: Example Cost Model for Archive Weather Report.

As shown by the example cost scenarios, the proposed approach provides cost savings for both metadata enrichment and ongoing storage. Depending on the content type, the storage tier and the level of analysis may vary, all of which affect the total cost. It is important to note that we do not recommend extrapolating the analysis cost based on content length.

For current AWS product and service pricing, see https://aws.amazon.com/pricing/.

Conclusion

Organizations want to ensure that archived assets are not only preserved, but also utilized to their full potential. By following the strategy we’ve described, organizations can significantly reduce the costs associated with enriching and storing archival content while maintaining high standards of accessibility. This not only optimizes the management of vast content repositories, it also empowers organizations to uncover new opportunities for content discovery, reuse, and potential monetization.

Contact an AWS Representative to learn how we can help accelerate your business.

Visit the following links to learn more about additional media and entertainment industry use cases:

Vince Palazzo

Vince Palazzo is a Sr Solutions Architect at Amazon Web Services. He is focused on helping enterprise Media and Entertainment customers build and operate workloads securely on AWS.

Ali Amoli

Ali Amoli is a Media & Entertainment Industry Specialist at AWS. He has spent his career working with M&E customers, engineers, and product teams around the world to innovate and overcome technical, workflow, and business challenges.