AWS for M&E Blog

Video semantic search with AI on AWS

Media customers often have large video libraries, with thousands, if not millions, of videos that they need to search to discover content that can be reused and monetized. Finding a video in repositories of this scale is traditionally a time-consuming, repetitive, manual task. In addition, customers need to be able to search for video footage based on particular scenes, actions, concepts, people, and objects using natural language. Therefore, users need a solution that does more than traditional text search on video transcriptions.

A recent class of generative AI models makes it possible to build these sorts of semantic video search applications. These models can process and understand video, audio, image, and text. This multimodal capability enables scalable, semantic video search that will streamline content production and enhance user experiences.

Video semantic search enables content discovery, efficient archiving and retrieval, and streamlined repurposing of video content through intelligent analysis of topics, entities, and context within the footage, at scale. This can drive cost efficiency, productivity gains, and scalability to a wide range of media and entertainment customers.

We’ll demonstrate how to build a complete workload to solve this business problem. Our solution leverages Amazon Web Services (AWS) artificial intelligence (AI), machine learning (ML), and generative AI services such as Amazon Bedrock, Amazon Rekognition, and Amazon Transcribe. These services allow users to upload and index videos for semantic search. Several models are utilized, including Amazon Nova, a new generation of state-of-the-art foundation models available exclusively in Amazon Bedrock, to create a powerful and efficient video semantic search solution.

Solution overview

A high-level functional workflow for video semantic search is shown in the following diagram.

A high-level functional workflow of Video Semantic Search solution: from Media Archive to User Interface, showing video processing, embedding generation and retrieval process with vector database.

Figure 1: A functional diagram for video semantic search.

There are two main workflows shown in Figure 1:

  1. Media content ingestion (shown in red arrows)
  2. Media search workflow (shown in green arrows)

Ingestion begins when a media object (for example, a video file) is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket. This action triggers an AWS Lambda function that takes the media file and extracts contextual information using AWS AI/ML services. This data is transformed into an embedding using a pre-trained embedding model. An embedding is a numerical representation of the given input in the form of a vector. These embeddings capture the semantic meaning of the video frames, audio segments, and transcriptions. The workflow stores the embeddings in a vector database, which enables highly efficient similarity searches over a large amount of information.

In the media search workflow, the user’s natural language search query is transformed into an embedding using the same pre-trained embedding model. This embedding is then used to perform a semantic search query against the vector database to retrieve the most semantically similar media content. The search workflow supports multimodal queries, such as text and image, because the pre-trained embedding model is able to transform any of these media types into an embedding. It also supports traditional keyword search to provide a more robust, flexible, and user-friendly experience, serving a wider range of search scenarios and user intentions. Additionally, we utilize reranking techniques to further improve search relevance, thereby enhancing the overall search experience and accuracy.

Walkthrough

The following diagram illustrates the complete architecture of the video semantic search workload.

End-to-end architecture diagram of the video semantic search solution.

Figure 2: Video semantic search architecture diagram.

 

Starting from the user experience, the architecture contains a straightforward static web application hosted in Amazon S3. To serve the static website, you deploy an Amazon CloudFront distribution and use origin access control (OAC) to restrict access to the Amazon S3 origin. With Amazon Cognito, you protect the web application from unauthenticated users.
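This infrastructure is typically provisioned through deployment templates rather than application code, but the following boto3 sketch (with a hypothetical bucket name and distribution ARN) illustrates the bucket policy that OAC relies on: only the CloudFront service principal, scoped to your distribution, is allowed to read the site objects.

import json
import boto3

s3_client = boto3.client("s3")

# Hypothetical identifiers for illustration only.
website_bucket = "vss-webapp-bucket"
distribution_arn = "arn:aws:cloudfront::111122223333:distribution/EDFDVBD6EXAMPLE"

# Allow only the CloudFront distribution (via OAC) to read the static site objects.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCloudFrontServicePrincipalReadOnly",
            "Effect": "Allow",
            "Principal": {"Service": "cloudfront.amazonaws.com"},
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{website_bucket}/*",
            "Condition": {"StringEquals": {"AWS:SourceArn": distribution_arn}},
        }
    ],
}

s3_client.put_bucket_policy(Bucket=website_bucket, Policy=json.dumps(bucket_policy))
Python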

You use Amazon API Gateway as the entry point for all near real-time communication between the frontend and the backend of the video semantic search workload. This is where requests to create, read, update, delete (CRUD) or run workflows begin.

The user interface allows the user to upload video files to an S3 bucket using an Amazon S3 pre-signed URL, which triggers the API requests that start the media content ingestion workflow. To support concurrent processing of multiple uploaded videos, you leverage Amazon Simple Queue Service (Amazon SQS) for efficient request management and parallel ingestion pipelines. The API requests invoke an AWS Lambda function that puts media ingestion tasks into an Amazon SQS queue. An Amazon SQS queue subscriber (in this case, another AWS Lambda function) processes each message to trigger an AWS Step Functions workflow.
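The exact implementation is not shown here; the following minimal sketch (with hypothetical environment variable names) illustrates the pattern: one Lambda function issues the pre-signed upload URL, another enqueues the ingestion task, and the queue subscriber starts the Step Functions execution.

import json
import os
import boto3

s3_client = boto3.client("s3")
sqs_client = boto3.client("sqs")
sfn_client = boto3.client("stepfunctions")

def create_upload_url(event, context):
    """Return a pre-signed PUT URL so the browser can upload the video directly to S3."""
    video_name = json.loads(event["body"])["video_name"]
    url = s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": os.environ["bucket_videos"], "Key": video_name},
        ExpiresIn=3600,
    )
    return {"statusCode": 200, "body": json.dumps({"upload_url": url})}

def enqueue_ingestion(event, context):
    """API-invoked handler: queue one ingestion task per uploaded video."""
    video_name = json.loads(event["body"])["video_name"]
    sqs_client.send_message(
        QueueUrl=os.environ["ingestion_queue_url"],
        MessageBody=json.dumps({"video_name": video_name}),
    )
    return {"statusCode": 200, "body": json.dumps({"status": "queued"})}

def start_ingestion(event, context):
    """SQS subscriber: start one Step Functions execution per queued message."""
    for record in event["Records"]:
        sfn_client.start_execution(
            stateMachineArn=os.environ["ingestion_state_machine_arn"],
            input=record["body"],
        )
Python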

Let’s focus more on the AWS Step Functions media content ingestion workflow, as it is the core part of the solution.

The Step Functions workflow begins with two parallel processes:

  1. Create an Amazon Transcribe job to generate a transcription for the video
  2. Create an Amazon Rekognition job to detect shot segments in the video

The following is an example of our workload’s Amazon Transcribe output in JSON format stored in an S3 bucket:

{
    "jobName": "203b2cad-ed24-4670-8ba6-b9c836d7c48b",
    "accountId": "6393590*****",
    "results": {
        "transcripts": [{
                "transcript": "AWS is the world's most comprehensive and broadly adopted cloud platform..."}],
        "items": [{
                "start_time": "2.369","end_time": "2.95",
                "alternatives": [{
                        "confidence": "0.902","content": "AWS"}],
                "type": "pronunciation"
            },
            ...
        ]
    },
    "status": "COMPLETED"
}
JSON
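The transcription itself comes from a standard Amazon Transcribe batch job. A minimal sketch of starting one (the bucket and job name parameters are illustrative) looks like the following:

import boto3

transcribe_client = boto3.client("transcribe")

def startTranscriptionJob(bucket_videos, video_name, bucket_transcripts, job_name):
    # Start an asynchronous batch transcription job; the JSON result is written back to S3.
    transcribe_client.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": f"s3://{bucket_videos}/{video_name}"},
        MediaFormat="mp4",
        IdentifyLanguage=True,
        OutputBucketName=bucket_transcripts,
    )
    return job_name
Python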

A video shot is a series of interrelated consecutive pictures taken contiguously by a single camera that represents a continuous action in time and space. You use the Amazon Rekognition Video segment detection API to detect multiple shots in the original video, capturing the start, end, and duration of each shot as shot metadata. The following code snippet demonstrates this process:

import boto3

rek_client = boto3.client("rekognition")

def startSegmentDetection(bucket_videos, video_name, sns_rekognition_topic_arn, sns_rekognition_role):
    # Only keep shots detected with at least this confidence.
    min_shot_confidence = 80.0

    # Start an asynchronous shot-detection job; completion is published to the SNS topic.
    response = rek_client.start_segment_detection(
        Video={"S3Object": {"Bucket": bucket_videos, "Name": video_name}},
        NotificationChannel={
            "RoleArn": sns_rekognition_role,
            "SNSTopicArn": sns_rekognition_topic_arn,
        },
        SegmentTypes=["SHOT"],
        Filters={
            "ShotFilter": {"MinSegmentConfidence": min_shot_confidence},
        },
    )

    startJobId = response["JobId"]
    return startJobId
Python

Upon completion of the shot segment detection, an AWS Lambda function is automatically triggered to extract image frames for each identified shot segment at a specific frame rate (such as 1 FPS). Each segment is processed in parallel using an AWS Step Functions Map state for maximum efficiency.
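How the frames are extracted is an implementation detail; one common option is to run ffmpeg (packaged in a Lambda layer or container image) between each shot’s start and end timestamps. The following is a minimal sketch, assuming ffmpeg is available on the function’s PATH:

import os
import subprocess
import boto3

s3_client = boto3.client("s3")

def extract_shot_frames(video_path, shot_start, shot_end, bucket_images, jobId, shot_id, fps=1):
    """Sample frames from one shot at the given rate and upload them to S3."""
    duration = shot_end - shot_start
    output_pattern = f"/tmp/{shot_id}_%03d.png"
    # Seek to the shot start and sample frames for the shot duration at the requested rate.
    subprocess.run(
        ["ffmpeg", "-ss", str(shot_start), "-i", video_path,
         "-t", str(duration), "-vf", f"fps={fps}", output_pattern],
        check=True,
    )
    frame_keys = []
    for file_name in sorted(os.listdir("/tmp")):
        if file_name.startswith(f"{shot_id}_") and file_name.endswith(".png"):
            key = f"{jobId}/{file_name}"
            s3_client.upload_file(f"/tmp/{file_name}", bucket_images, key)
            frame_keys.append(key)
    return frame_keys
Python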

These shot-segmented image frames are stored in a designated S3 bucket for subsequent processing. The following is an example of frames extracted from a particular shot:

An example of a video shot segment as a tiled image. The three images show two men talking in a workshop setting.

Figure 3: Example of a video shot segment.

 

In order to understand the visual information from each video segment, you first use the Amazon Rekognition celebrity recognition API to detect any celebrities present within the frames of a particular shot.

def startCelebrityDetection(bucket_images, key_image):
    # Recognize celebrities in a single frame image stored in S3 (rek_client as defined earlier).
    response = rek_client.recognize_celebrities(
        Image={"S3Object": {"Bucket": bucket_images, "Name": key_image}}
    )
    # Keep only confident matches and return a comma-separated list of unique names.
    min_confidence = 80.0
    celebrities = set()
    for celebrity in response.get("CelebrityFaces", []):
        if celebrity.get("MatchConfidence", 0.0) >= min_confidence:
            celebrities.add(celebrity["Name"])
    celebrities = ", ".join(celebrities)
    return celebrities
Python

Additionally, you leverage the foundation model Amazon Nova in Amazon Bedrock to detect private figures or characters by analyzing text labels or titles that appear when a person is shown in the video shot. Although we are using Amazon Nova as the primary large language model (LLM), the solution is open for customers to use other LLMs, such as Anthropic’s Claude 3.5 Sonnet. Using Amazon Nova complements the celebrity detection functionality, expanding the solution’s ability to identify both public and private individuals in the media content. Here’s an example implementation:

import os
import boto3

s3_client = boto3.client("s3")
bedrock_client = boto3.client("bedrock-runtime")

def recognise_person_name(bucket_images, jobId, frames):
    person_names = []
    prompt = """You are provided with a frame image extracted from a video shot. Analyze the frame and identify any person names present in the shot.
    - If person names are recognized, list them separated by commas, with no additional context or text.
    - If no person names are recognized, respond with "No names recognized."
    - Do not include any other information or context in your response.
    """

    model_id = os.environ["bedrock_model"]
    for frame in frames:
        message = {
            "role": "user",
            "content": [
                {"text": prompt},
            ],
        }
        # Attach the frame image so the model can read any on-screen titles or name labels.
        s3_object = s3_client.get_object(Bucket=bucket_images, Key=f"{jobId}/{frame}.png")
        image_content = s3_object["Body"].read()
        message["content"].append(
            {"image": {"format": "png", "source": {"bytes": image_content}}}
        )

        messages = [message]
        inferenceConfig = {
            "maxTokens": 512,
            "temperature": 0.1,
        }

        response = bedrock_client.converse(modelId=model_id, messages=messages, inferenceConfig=inferenceConfig)
        # Collect any names the model returns for this frame.
        output_text = response["output"]["message"]["content"][0]["text"]
        if "No names recognized" not in output_text:
            person_names.extend(name.strip() for name in output_text.split(","))

    return person_names
Python

To propagate public and private figures across video frames, you leverage the Amazon Titan Multimodal Embeddings model in Amazon Bedrock to generate image embeddings from video frames. This approach is effective when titles are absent, and it addresses the challenge Amazon Rekognition faces in detecting celebrities when image angles obscure faces. You then calculate the cosine similarity between these image embeddings. If two video frames share a high similarity (such as close to one) and one contains recognized celebrities or private figures, you can infer that the same celebrities or private figures are highly likely to appear in the similar frame. This method enhances detection accuracy by utilizing the contextual and visual similarity between video frames, allowing for more robust recognition even in challenging scenarios.

import base64
import json
import os

# s3_client, bedrock_client, and the OpenSearch client ("client") are created as in the earlier snippets.

def get_titan_image_embedding(bucket_images, jobId, embedding_model, image_name):
    # Convert a single frame image into a multimodal embedding with Amazon Titan.
    s3_object = s3_client.get_object(Bucket=bucket_images, Key=jobId + "/" + image_name)
    image_content = s3_object["Body"].read()
    base64_image_string = base64.b64encode(image_content).decode()

    accept = "application/json"
    content_type = "application/json"
    body = json.dumps(
        {"inputImage": base64_image_string}
    )
    response = bedrock_client.invoke_model(
        body=body, modelId=embedding_model, accept=accept, contentType=content_type
    )
    response_body = json.loads(response["body"].read())
    embedding = response_body.get("embedding")
    return embedding

embedding = get_titan_image_embedding(
    bucket_images, jobId, os.environ["aws_bedrock_image_embedding_model"], f"{value['shot_keyFrame']}.png"
)

# Retrieve previously indexed shots whose key frames are visually similar to this one.
query = {
    "size": k,
    "query": {"knn": {"shot_vector": {"vector": embedding, "k": k}}},
    "_source": [
        "jobId",
        "video_name",
        "shot_id",
        "shot_startTime",
        "shot_endTime",
        "shot_publicFigures",
        "shot_privateFigures",
    ],
}
response = client.search(body=query, index=jobId)
hits = response["hits"]["hits"]
for hit in hits:
    # Propagate figures only from highly similar frames (similarity score of 0.9 or above).
    if hit["_score"] >= 0.9:
        public_figures = [name.strip() for name in hit["_source"]["shot_publicFigures"].split(",")]
        for name in public_figures:
            shot_publicFigures.add(name)

        private_figures = [name.strip() for name in hit["_source"]["shot_privateFigures"].split(",")]
        for name in private_figures:
            shot_privateFigures.add(name)
Python
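If you prefer to compare two frame embeddings directly instead of going through the vector index, the cosine similarity calculation itself is only a few lines of NumPy. A minimal sketch:

import numpy as np

def cosine_similarity(embedding_a, embedding_b):
    # Cosine similarity of two embedding vectors: values close to 1 mean near-identical content.
    a = np.array(embedding_a, dtype=float)
    b = np.array(embedding_b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example: propagate detected figures only between highly similar frames.
# if cosine_similarity(frame_embedding, key_frame_embedding) >= 0.9:
#     shot_publicFigures.update(frame_public_figures)
Python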

In the next step, you leverage the foundation model Amazon Nova in Amazon Bedrock to generate a comprehensive description of each video shot based on the shot’s frame images. The description covers all visible objects, logos, and text, as well as any celebrities and private figures detected in the previous steps.

prompt = f"""
    Provide a detailed description of a video shot based on the given frame images. Focus on creating a cohesive narrative of the entire shot rather than describing each frame individually.

    Incorporate the following elements in your description: 
    1. Visual elements:
    - Describe all visible objects, text, and characters in detail.
    - For any characters present, include:
        • Age
        • Emotional expressions
        • Clothing and accessories
        • Physical appearance
        • Any actions, movements or gestures

    2. Setting and atmosphere:
    - Provide details about the time, location, and overall ambiance.
    - Mention any relevant background elements that contribute to the scene.

    3. Incorporate provided information:
    - Seamlessly integrate details about public figures and private figures if available.
    - If this information is not provided, rely solely on the visual elements.

    Skip the preamble; go straight into the description."""

for index, value in enumerate(shot_frames):
    prompt += f"Frame {index}: Public figures: {value['frame_publicFigures']}; Private figures: {value['frame_privateFigures']}\n"
    
model_id = "us.amazon.nova-pro-v1:0"
message = {
    "role": "user",
    "content": [{"text": prompt},],
}
for index, value in enumerate(shot_frames):
    s3_object = s3_client.get_object(Bucket=bucket_images, Key=f"{jobId}/{value['frame']}.png")
    image_content = s3_object["Body"].read()
    message["content"].append({"image": {"format": "png", "source": {"bytes": image_content}}})

messages = [message]
inferenceConfig = {
    "maxTokens": 512,
    "temperature": 0.1,
}

response = bedrock_client.converse(
    modelId=model_id, messages=messages, inferenceConfig=inferenceConfig
)
output_message = response["output"]["message"]
shot_description = output_message["content"][0]["text"]
Python

The following is an example of the output generated by the preceding code, which uses the Amazon Nova Pro model to create a detailed description of a video shot based on multiple frame images and additional context, such as detected celebrities.

The shot features Werner Vogels, a bald middle-aged man with a gray beard, standing in a well-lit workshop. He is wearing a dark blue hoodie and has a cheerful, calm expression. Vogels stands confidently in front of a small airplane, seemingly engaged in a discussion with another person about its features. The airplane is predominantly white, with visible components such as the propeller ...

In the final steps of the ingestion workflow, you aggregate all the metadata produced in the previous steps. You use an Amazon Bedrock text embedding model, as well as a multimodal embedding model, to create embeddings of the shots’ descriptions, relevant audio transcriptions, and frame images, representing all the contextual details pertaining to each shot segment. The vector embeddings are stored along with the related video shot segment metadata in an Amazon OpenSearch Serverless vector database.
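A minimal sketch of this final step follows, assuming Amazon Titan Text Embeddings V2 as the text embedding model and reusing the clients and helper functions from the earlier snippets; the shot metadata variables are illustrative and come from the preceding steps.

import json

def get_titan_text_embedding(text, embedding_model="amazon.titan-embed-text-v2:0"):
    # Turn a shot description or transcript into a vector with a Bedrock text embedding model.
    response = bedrock_client.invoke_model(
        body=json.dumps({"inputText": text}),
        modelId=embedding_model,
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response["body"].read())["embedding"]

# shot_description, shot_transcript, shot_publicFigures, shot_privateFigures, shot_keyFrame,
# and image_embedding_model come from the previous steps; "client" is the OpenSearch client.
document = {
    "jobId": jobId,
    "video_name": video_name,
    "shot_id": shot_id,
    "shot_startTime": shot_startTime,
    "shot_endTime": shot_endTime,
    "shot_description": shot_description,
    "shot_publicFigures": ", ".join(shot_publicFigures),
    "shot_privateFigures": ", ".join(shot_privateFigures),
    "shot_transcript": shot_transcript,
    "shot_image_vector": get_titan_image_embedding(
        bucket_images, jobId, image_embedding_model, f"{shot_keyFrame}.png"
    ),
    "shot_desc_vector": get_titan_text_embedding(shot_description),
    "shot_transcript_vector": get_titan_text_embedding(shot_transcript),
}
client.index(index=jobId, body=document)
Python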

In the media semantic search workflow, when a user provides a search query either as text or as an image through the web user interface, an Amazon API Gateway endpoint triggers an AWS Lambda function. This function invokes the multimodal embedding model through the Amazon Bedrock API to transform the user query into an embedding. This embedding captures the semantic meaning of the user’s search query. It’s then used as a search parameter against the Amazon OpenSearch Serverless vector database to retrieve the most relevant video shot segments and segments’ metadata, including start and end times of the video shots. Users have the flexibility to enhance their search with a hybrid search by combining the semantic search capability with traditional keyword searches across other fields in the OpenSearch Serverless database to find the most relevant content more precisely and efficiently.

The following sample code sets up an OpenSearch Service index with fields for video metadata and vector representations of shot images, descriptions and transcripts.

index_body = {
    "mappings": {
        "properties": {
            "jobId": {"type": "text"},
            "video_name": {"type": "text"},
            "shot_id": {"type": "text"},
		"shot_startTime": {"type": "text"},
		"shot_endTime": {"type": "text"},
            "shot_description": {"type": "text"},
            "shot_publicFigures": {"type": "text"},
            "shot_privateFigures": {"type": "text"},
            "shot_transcript": {"type": "text"},
            "shot_image_vector": {
                "type": "knn_vector",
                "dimension": len_embedding,
                "method": {
                    "engine": "nmslib",
                    "space_type": "cosinesimil",
                    "name": "hnsw",
                    "parameters": {"ef_construction": 512, "m": 16},
                },
            },
            "shot_desc_vector": {
                "type": "knn_vector",
                "dimension": len_embedding,
                "method": {
                    "engine": "nmslib",
                    "space_type": "cosinesimil",
                    "name": "hnsw",
                    "parameters": {"ef_construction": 512, "m": 16},
                },
            },
		"shot_transcript_vector": {
                "type": "knn_vector",
                "dimension": len_embedding,
                "method": {
                    "engine": "nmslib",
                    "space_type": "cosinesimil",
                    "name": "hnsw",
                    "parameters": {"ef_construction": 512, "m": 16},
                },
            },
        }
    },
    "settings": {
        "index": {
            "number_of_shards": 2,
            "knn.algo_param": {"ef_search": 512},
            "knn": True,
        }
    },
}
response = client.indices.create(index, body=index_body)
Python

This additional sample search query performs a semantic similarity search on both shot descriptions and transcripts, applying different boost values to each, and returns the most relevant results.

aoss_query = {
    "size": k,
    "query": {
        "bool": {
            "should": [
                {
                    "script_score": {
                        "query": {"match_all": {}},
                        "script": {
                            "lang": "knn",
                            "source": "knn_score",
                            "params": {
                                "field": "shot_desc_vector",
                                "query_value": text_embedding,
                                "space_type": "cosinesimil",
                            },
                        },
                        "boost": 2.0,
                    }
                },
                {
                    "script_score": {
                        "query": {"match_all": {}},
                        "script": {
                            "lang": "knn",
                            "source": "knn_score",
                            "params": {
                                "field": "shot_transcript_vector",
                                "query_value": text_embedding,
                                "space_type": "cosinesimil",
                            },
                        },
                        "boost": 1.0,
                    }
                },
            ],
            "minimum_should_match": 1,
        }
    },

    "_source": [
        "jobId",
        "video_name",
        "shot_id",
	  "shot_startTime",
	  "shot_endTime"
        "shot_description",
        "shot_publicFigures",
        "shot_privateFigures",
        "shot_transcript"
    ],
}

response = client.search(body=aoss_query, index=aoss_index)
hits = response["hits"]["hits"]
Python
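The hybrid search mentioned earlier can be built on top of the same query by adding a conventional keyword clause to the bool should list before running the search. A minimal sketch, where user_query_text is an illustrative variable holding the raw text the user typed:

# Hybrid search (illustrative): append a keyword clause to the same bool "should" list
# before calling client.search, so exact terms (names, on-screen text, transcript words)
# also contribute to the relevance score.
keyword_clause = {
    "multi_match": {
        "query": user_query_text,
        "fields": ["shot_description", "shot_transcript", "shot_publicFigures", "shot_privateFigures"],
        "boost": 1.0,
    }
}
aoss_query["query"]["bool"]["should"].append(keyword_clause)
Python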

Last but not least, you can utilize Cohere Rerank 3.5 in Amazon Bedrock through the Rerank API to improve search relevance and content ranking. This step helps ensure that the most relevant results are prioritized, enhancing the overall quality and accuracy of the search results.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-west-2")
rerank_model_id = "cohere.rerank-v3-5:0"
model_package_arn = f"arn:aws:bedrock:us-west-2::foundation-model/{rerank_model_id}"

def rerank_results(user_query, docs, num_results):
    # Wrap each retrieved document as an inline JSON source for the Rerank API.
    sources = []
    for doc in docs:
        sources.append(
            {
                "inlineDocumentSource": {
                    "jsonDocument": doc,
                    "type": "JSON",
                },
                "type": "INLINE",
            }
        )
    # Re-score the documents against the original user query with Cohere Rerank 3.5.
    response = bedrock_agent_runtime.rerank(
        queries=[{"type": "TEXT", "textQuery": {"text": user_query}}],
        sources=sources,
        rerankingConfiguration={
            "type": "BEDROCK_RERANKING_MODEL",
            "bedrockRerankingConfiguration": {
                "numberOfResults": min(num_results, len(docs)),
                "modelConfiguration": {
                    "modelArn": model_package_arn,
                },
            },
        },
    )
    return response["results"]
Python

All the media, profiling, and task metadata are stored in an Amazon DynamoDB NoSQL database, which allows users to keep track of task status and other relevant information. The solution stores videos in Amazon S3, which offers durable, highly available, and scalable data storage at low cost. You also leverage Amazon CloudWatch and Amazon EventBridge to monitor every component in near real time and take responsive actions during the Step Functions workflow.
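A minimal sketch of how a task’s status might be recorded in DynamoDB (the table and attribute names are illustrative):

import time
import boto3

dynamodb = boto3.resource("dynamodb")
task_table = dynamodb.Table("vss-ingestion-tasks")  # hypothetical table name

def update_task_status(jobId, video_name, status):
    # Each ingestion task keeps its latest status so the web application can display progress.
    task_table.put_item(
        Item={
            "jobId": jobId,
            "video_name": video_name,
            "status": status,
            "updated_at": int(time.time()),
        }
    )
Python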

User Interface

The following image illustrates the user interface of the video semantic search web application. The frontend is built on Cloudscape, an open-source design system for the cloud.

User experience of the video semantic search solution. It shows the entering of a search: Show all the scenes with waving flags. Once the search is entered it shows all the video clips under the search window that meet the criteria. The video scrolls to show that several different types of videos were found. The search field then shows the search of: goalkeeper making a save. Again several different video clips are returned based on the search. A final example is entered: Werner Vogels handshaking other people. Again several different video clips that meet the criteria are shown under the search bar.

Figure 4: User experience of the video semantic search solution.

 

 

Example search results from the video semantic search. It shows the following search in the search bar at the top: Werner Vogels driving a car. Underneath the search shows the video clips which meet this search criteria.

Figure 5: Example search result from the video semantic search.

Conclusion

We demonstrated how to use AWS AI/ML and generative AI services to build a highly scalable semantic video search solution with a serverless architecture. Our solution allows you to take video files with no metadata or tags and run them through an automated workflow that extracts metadata and contextual information that can be used to rediscover your content. This is particularly important within the media and publishing industry, where customers need to streamline content creation workflows, enhance the user experience, and achieve productivity gains in generating short-form video content.

The solution integrates a user experience component hosted using Amazon CloudFront with Amazon Cognito for authentication, along with a request management flow to handle user interactions and video processing requests. The solution includes a media analysis workflow that leverages AI/ML services on AWS, including Amazon Rekognition for shot detection and celebrity recognition. It also uses generative AI foundation models in Amazon Bedrock for video understanding. Finally, the solution integrates with Amazon OpenSearch Serverless as a vector database to allow users to query and retrieve relevant video content efficiently.

To get started with this video semantic search solution, check out the sample code in our GitHub repository. The repository includes AWS CloudFormation templates and step-by-step instructions to help you deploy the solution in your own AWS account.

Contact an AWS Representative to learn how we can help accelerate your business.

Vu San Ha Huynh

Vu San Ha Huynh is a Solutions Architect at AWS. He has a PhD degree in Computer Science and enjoys working on different innovative projects to help support large Enterprise customers.

Adrian Daniel

Adrian Daniel is a Solutions Architect at AWS working with customers in the Telco, Media & Entertainment, Games, and Sports industry. He helps media customers in designing, developing and deploying workloads on the AWS Cloud using best practices. His current focus area is helping customers solve their media use cases with artificial intelligence and machine learning.

Mark Watkins

Mark Watkins is a Principal Solutions Architect within the Telco, Media & Entertainment, Games, and Sports team, specialising on Publishing solutions using AI.

Wei Teh

Wei Teh is a Senior Solutions Architect specializing in generative AI and machine learning technologies. He works closely with organizations to accelerate their AWS cloud adoption, focusing on innovative AI and ML solutions.