AWS for M&E Blog

Compute on-screen time using machine learning tools from AWS

For linear broadcast television, it is common to rerun successful programs to fill available time slots. Often, episodes may need to be edited down and, at times, entire scenes may need to be removed to conform to an allotted time slot. This creates complexity because under a particular type of contract, artists have performance rights (also called related rights), which means they are paid proportionally for their contribution to the work.

Cast compensation for edited reruns needs to be adjusted to the actual on-screen time in a broadcast. In the past, broadcasters had entire teams dedicated to watching edited reruns in order to manually recalculate an actor’s screen time using a stopwatch and spreadsheets. This was an error-prone and time-consuming process.

In this blog post, we describe how to use Amazon Rekognition to automate the identification and tracking of artists' screen time, making the process faster and more reliable. We encourage you to run the solution yourself. The AWS CloudFormation template provided in the repository automates the deployment of the resources. Download the source code from the GitHub repository and follow the instructions in the README file.

Solution overview

The solution segments video into shots, extracting one frame per shot, using Amazon Rekognition, a fully managed computer vision service that requires no machine learning expertise to use. Then, it applies face comparison and person detection to label camera-facing artists and to tag unidentified artists who face away from the camera in these frames.

The following diagram depicts the solution architecture.

Architecture Diagram showing the system architecture: The user adds pictures to an Amazon S3 bucket, triggering a new artist event that creates and updates an Amazon Rekognition face collection. The user adds an episode to an Amazon S3 bucket, triggering an AWS Step Functions workflow that extracts frames and identifies faces.

This solution uses the following services:

  • Amazon Rekognition
  • Amazon S3
  • AWS Lambda
  • AWS Step Functions
  • Amazon DynamoDB

Solution steps

Part 1: Amazon Rekognition collection management

The solution uses a collection of artists’ faces in Amazon Rekognition to identify when they appear in a video. After creating the collection with faces of the artists in the series, you can search for face matches (faces in the collection that also appear in the videos). Refer to Searching faces in a collection.

You need to create a collection before running the video processing workflow. Each series must have its own collection, and you should follow the Recommendations for facial comparison input images. We recommend adding at least five images: full frontal, 90 degrees to each side, and 45 degrees to each side. If the character wears heavy makeup or undergoes a drastic change in appearance, add images of those looks to the collection as well to improve identification accuracy.

The solution leverages Amazon S3 and AWS Lambda to manage the collection. Whenever a new image is written to the S3 bucket following a naming pattern, an Amazon S3 trigger invokes a Lambda function that adds the image to the collection. The first time an image is written, a new collection is created. If an image is removed, the function removes it from the collection. Refer to Using an Amazon S3 trigger to invoke a Lambda function and Adding faces to a collection.

Amazon Rekognition currently only supports relating multiple pictures to the same named entity by using the ExternalImageId property. That property, along with other metadata, is stored in an Amazon DynamoDB table by the Lambda function.

The solution leverages the Amazon S3 folder structure to store information about the artist-role combination. Ideally, however, the ExternalImageId will reference an external API or collection that contains this information.

To add images to the collection, users place them under the faces folder into separate subfolders, one for each artist. Subfolders are named following the pattern Artist_Name-Role_Name. A dash (-) separates the name of the artist and the name of the role (character). An underscore (_) represents spaces between first and last names. The names must not include any special characters, in line with Amazon Rekognition property naming conventions. Following is an example:

Katherine_Franco-Kathy
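
The repository’s Lambda function handles this indexing; the following is a minimal sketch of how such a function could look, assuming the faces/Artist_Name-Role_Name/ key layout described above, a COLLECTION_ID environment variable, and a hypothetical ArtistFaces DynamoDB table. It handles only newly added images, and the names are illustrative rather than the repository’s exact code.

import os
import urllib.parse

import boto3

rekognition = boto3.client("rekognition")
table = boto3.resource("dynamodb").Table(os.environ.get("FACES_TABLE", "ArtistFaces"))
COLLECTION_ID = os.environ.get("COLLECTION_ID", "my-series-collection")


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # faces/Katherine_Franco-Kathy/front.jpg -> "Katherine_Franco-Kathy"
        external_image_id = key.split("/")[1]
        artist, role = external_image_id.split("-", 1)

        # Create the collection on first use; ignore the error if it already exists.
        try:
            rekognition.create_collection(CollectionId=COLLECTION_ID)
        except rekognition.exceptions.ResourceAlreadyExistsException:
            pass

        # Index the face and relate it to the artist-role pair via ExternalImageId.
        response = rekognition.index_faces(
            CollectionId=COLLECTION_ID,
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            ExternalImageId=external_image_id,
            MaxFaces=1,
            QualityFilter="AUTO",
        )

        # Store the artist/role metadata alongside the Rekognition face ID.
        for face in response["FaceRecords"]:
            table.put_item(
                Item={
                    "FaceId": face["Face"]["FaceId"],
                    "ExternalImageId": external_image_id,
                    "Artist": artist.replace("_", " "),
                    "Role": role.replace("_", " "),
                }
            )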

Once all the pictures are uploaded and the collection is complete, the video processing workflow can start.

Part 2: Video processing workflow

An Amazon S3 trigger starts the video processing workflow when a video is uploaded into the Amazon S3 bucket inside the episodes folder.

The video processing workflow first segments the video into shots. Then, it applies face comparison and person detection to label camera-facing artists and to tag unidentified artists who face away from the camera in these shots. The final results are stored in an Amazon DynamoDB table.

This solution leverages AWS Step Functions for the video processing workflow and the logic runs in Python Lambda functions. The following is a screenshot of a workflow execution taken from the AWS Console.

Screenshot from the Step Functions AWS Console. Shows a four-step workflow with the following steps: 1. Start Shot Detection; 2. Extract Frames Rekognition; 3. Record Job ID; 4. Process Key Frames.

Part 3: Shot frame extraction

The workflow first segments the video into shots by using the Amazon Rekognition Video shot detection feature. A shot is a series of interrelated consecutive pictures taken by a single camera and represents continuous action in time and space. For more information, please refer to Shot detection.

Amazon Rekognition allows the user to specify the confidence level for shot detection. A higher confidence detects fewer shots (segments). We recommend using confidence values greater than 85%. Lower values don’t generally improve results and produce segments that are not necessary for this use case.

The Lambda function uses the asynchronous StartSegmentDetection and GetSegmentDetection API operations to start the video segmentation job and fetch the results. Segment detection accepts videos stored in an Amazon S3 bucket and returns a JSON output. Refer to Using the Amazon Rekognition Segment API.
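
As a rough illustration, the following sketch starts a shot detection job and collects the shot timestamps. For simplicity it polls GetSegmentDetection; the actual workflow could instead use the NotificationChannel parameter (Amazon SNS) or a Step Functions wait loop.

import time

import boto3

rekognition = boto3.client("rekognition")


def detect_shots(bucket, key, min_confidence=99.0):
    job = rekognition.start_segment_detection(
        Video={"S3Object": {"Bucket": bucket, "Name": key}},
        SegmentTypes=["SHOT"],
        Filters={"ShotFilter": {"MinSegmentConfidence": min_confidence}},
    )
    job_id = job["JobId"]

    # Poll until the asynchronous job finishes, then page through the results.
    shots, next_token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = rekognition.get_segment_detection(**kwargs)
        if response["JobStatus"] == "IN_PROGRESS":
            time.sleep(10)
            continue
        if response["JobStatus"] == "FAILED":
            raise RuntimeError(response.get("StatusMessage", "Segment detection failed"))
        shots.extend(
            (segment["StartTimestampMillis"], segment["EndTimestampMillis"])
            for segment in response["Segments"]
            if segment["Type"] == "SHOT"
        )
        next_token = response.get("NextToken")
        if not next_token:
            return shots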

The JSON output of the video segmentation job includes timestamps of the shots. The Lambda function uses OpenCV, a computer vision and image processing library, to extract the frames (images) in those timestamps (one frame per shot). Extracted frames are stored in Amazon S3.
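
The following is a minimal sketch of that frame extraction step, assuming the video has already been downloaded to local storage (for example, /tmp in a Lambda function packaged with OpenCV); the function and variable names are illustrative.

import cv2
import boto3

s3 = boto3.client("s3")


def extract_frames(local_video_path, shots, bucket, prefix):
    capture = cv2.VideoCapture(local_video_path)
    frame_keys = []
    for index, (start_ms, _end_ms) in enumerate(shots):
        # Position the decoder at the shot's start timestamp (in milliseconds).
        capture.set(cv2.CAP_PROP_POS_MSEC, start_ms)
        success, frame = capture.read()
        if not success:
            continue
        local_frame = f"/tmp/shot_{index:04d}.jpg"
        cv2.imwrite(local_frame, frame)
        key = f"{prefix}/shot_{index:04d}.jpg"
        s3.upload_file(local_frame, bucket, key)
        frame_keys.append(key)
    capture.release()
    return frame_keys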

In the third step of the workflow execution (Record Job Id), details such as job id, the path to the video, and other necessary information are stored in the Amazon DynamoDB table.

Part 4: Frame processing

The last step of the workflow performs a face comparison and person detection. Each of the frames extracted and stored in Amazon S3 is processed by calling Amazon Rekognition Image to compare faces against the collection and to detect labels in the stored frames.

For each frame, the solution calls Amazon Rekognition Image to detect faces and labels and saves the analysis result into an Amazon DynamoDB table. The frames are processed in parallel using AWS Step Functions Map state processing modes, which allow for up to 40 parallel executions. The step only succeeds if every frame is successfully processed.

Each frame’s processing starts by calling the DetectFaces API operation to detect faces in the frame. To match a face in an image with a face registered in a collection, the workflow must provide an image with exactly one face; otherwise, Amazon Rekognition uses only the largest face in the image. Because there are likely images with several artists facing the camera, the solution detects every face in the image and provides Amazon Rekognition with a cropped version containing only the face submitted for analysis. For more information, please refer to Detecting and analyzing faces.

The DetectFaces API operation returns a list of bounding boxes indicating where Amazon Rekognition has identified a face. For each face detected, the solution crops the frame and calls the SearchFacesByImage API operation to match the face with one in the face collection. If it finds a match, the solution identifies the name and role of the artist and stores this information, along with the bounding box provided by Amazon Rekognition, in the Amazon DynamoDB table.
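
The following sketch illustrates this crop-and-search pattern with the DetectFaces and SearchFacesByImage API operations; the collection ID, similarity threshold, and result structure are illustrative assumptions rather than the repository’s exact code.

import io

import boto3
from PIL import Image

rekognition = boto3.client("rekognition")
s3 = boto3.client("s3")


def identify_faces(bucket, frame_key, collection_id, threshold=90.0):
    detected = rekognition.detect_faces(
        Image={"S3Object": {"Bucket": bucket, "Name": frame_key}}
    )

    frame_bytes = s3.get_object(Bucket=bucket, Key=frame_key)["Body"].read()
    frame = Image.open(io.BytesIO(frame_bytes))
    width, height = frame.size

    results = []
    for face in detected["FaceDetails"]:
        # Bounding boxes are ratios of the image dimensions; convert to pixels.
        box = face["BoundingBox"]
        left, top = int(box["Left"] * width), int(box["Top"] * height)
        right = left + int(box["Width"] * width)
        bottom = top + int(box["Height"] * height)

        cropped = io.BytesIO()
        frame.crop((left, top, right, bottom)).save(cropped, format="JPEG")

        try:
            # Search the collection using only the cropped face.
            match = rekognition.search_faces_by_image(
                CollectionId=collection_id,
                Image={"Bytes": cropped.getvalue()},
                FaceMatchThreshold=threshold,
                MaxFaces=1,
            )
        except rekognition.exceptions.InvalidParameterException:
            # No searchable face in the crop (for example, too small or blurry).
            results.append({"Artist": None, "Role": None, "BoundingBox": box})
            continue

        if match["FaceMatches"]:
            external_id = match["FaceMatches"][0]["Face"]["ExternalImageId"]
            artist, role = external_id.split("-", 1)
            results.append({"Artist": artist, "Role": role, "BoundingBox": box})
        else:
            results.append({"Artist": None, "Role": None, "BoundingBox": box})
    return results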

The solution also labels generic person shapes present in the images. In a shot, someone might face away from the camera, but that artist still counts as being present in the shot. To detect people in the image, the workflow uses the DetectLabels API operation to identify a defined set of labels in an image. The types of labels vary from cars to food and beverages to behavior and conversations. From the response, we keep only the labels of type Person.

Finally, the solution uses the bounding box coordinates to cross-reference detected people with the results from the face search to determine whether anyone lacks a facial identification. This happens when a person isn’t facing the camera, or when the face doesn’t match a face in the collection. People without a facial identification are stored in Amazon DynamoDB as unidentified persons.
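
A minimal sketch of this cross-referencing step could look like the following; the containment test and its tolerance are illustrative assumptions rather than the repository’s exact logic.

import boto3

rekognition = boto3.client("rekognition")


def _contains(person_box, face_box, tolerance=0.05):
    # True if the face bounding box falls (roughly) inside the person bounding box.
    return (
        face_box["Left"] >= person_box["Left"] - tolerance
        and face_box["Top"] >= person_box["Top"] - tolerance
        and face_box["Left"] + face_box["Width"]
        <= person_box["Left"] + person_box["Width"] + tolerance
        and face_box["Top"] + face_box["Height"]
        <= person_box["Top"] + person_box["Height"] + tolerance
    )


def find_unidentified_persons(bucket, frame_key, identified_faces, min_confidence=80.0):
    response = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": frame_key}},
        MinConfidence=min_confidence,
    )
    # Keep only the bounding boxes of "Person" label instances.
    person_boxes = [
        instance["BoundingBox"]
        for label in response["Labels"]
        if label["Name"] == "Person"
        for instance in label["Instances"]
    ]

    # Any person box that does not contain an identified face is marked Unknown.
    unidentified = []
    for person_box in person_boxes:
        has_face = any(
            _contains(person_box, face["BoundingBox"]) for face in identified_faces
        )
        if not has_face:
            unidentified.append({"Artist": "Unknown", "BoundingBox": person_box})
    return unidentified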

Part 5: Consuming the video processing workflow results

After running a video processing workflow, the identified artists in each shot are available in an Amazon DynamoDB table. Unidentified people, and the start and end times of each shot, are also available along with the bounding boxes.

Next, you can calculate the total screen time of each artist by reading the results from Amazon DynamoDB and adding together the duration of each shot in which the artist appears. You can also create an application for humans to audit and edit the results. Manual verification is especially useful for naming unidentified persons.
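
As a simple illustration, the following sketch sums shot durations per artist; the item layout is a hypothetical simplification of what the workflow stores in DynamoDB, not its exact schema.

from collections import defaultdict


def total_screen_time(shot_items):
    seconds_per_artist = defaultdict(float)
    for item in shot_items:
        duration = (item["EndTimestampMillis"] - item["StartTimestampMillis"]) / 1000.0
        # Count each artist at most once per shot, even if detected in several boxes.
        for artist in set(item["Artists"]):
            seconds_per_artist[artist] += duration
    return dict(seconds_per_artist)


# Example usage with two shots:
shots = [
    {"StartTimestampMillis": 0, "EndTimestampMillis": 8000,
     "Artists": ["Katherine_Franco-Kathy"]},
    {"StartTimestampMillis": 8000, "EndTimestampMillis": 15000,
     "Artists": ["Katherine_Franco-Kathy", "James_Burleson-Steve"]},
]
print(total_screen_time(shots))  # {'Katherine_Franco-Kathy': 15.0, 'James_Burleson-Steve': 7.0}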

Solution improvements

This solution does not include an application to consume the results or any user interface. We suggest you improve the solution by developing an application on top of it.

The following is a screenshot example of a web application developed for an AWS customer. The results are from the video processing workflow for Forgive Me Not, a short movie written, directed, and edited by Vu Nguyen, and produced by Blue Light Pictures. The original movie is on Vimeo, licensed under Creative Commons: By Attribution 3.0, and we made no alterations to the content. The application shows the bounding boxes and the shots’ start and end times. In the segment depicted (eighth shot, starting at 0:41), the workflow successfully detected the actor James Burleson but could not identify the other person in the scene, who is facing away from the camera. This person is labeled as Unknown. You could manually identify the latter, editing and enhancing the results.

Screenshot from a web application that utilizes the results produced by the video processing workflow. It shows a frame of the video where an actor faces the camera and an unidentified person faces away from camera. There are bounding boxes surrounding the actor’s face and the outline of the unidentified person.

If you want to create an application similar to the one in the screenshot, we recommend serving the results stored in Amazon DynamoDB through an API created with Amazon API Gateway, a fully managed service that makes it easier for developers to create, publish, maintain, monitor, and secure APIs at any scale. A JavaScript application can then fetch data from the API.

Example: Screen time computation and comparing to manual task

To understand how this solution compares to a manual approach, both in analysis time and in the computed values, we performed two analyses for Forgive Me Not: manual, logging the timestamps at which each artist appeared on and left the screen; and assisted, where we ran our workflow and played the role of the user, reviewing the Unknown tags and replacing them with the artist’s name.

Role       Artist              Manual (sec)   Assisted (sec)
Kathy      Katherine Franco    317            371
Steve      James Burleson      270            295
Tommy      Dillon Fontana      40             24
Customer   Josh Schewerin      25             28
Jannette   Jennifer Pilarcik   58             52
Attorney   Lindsay Norman      133            99
Cop        Freddy Garcia       75             75
Analysis time                  1 h 30 min     12 min

The results are close to the manual approach. The difference between the assisted and the manual calculations appears when using a higher confidence level for the shot detection task. For our analysis, we used 99% confidence, so only very well-defined shot changes are extracted from the video. While this value works for the majority of videos, it can miss smoother scene transitions, which are then not detected by Amazon Rekognition. Try different confidence levels to determine which one suits your use case.

While the assisted results sometimes differ from the manual ones, the manual approach is very error-prone and time consuming. This tool speeds up the analysis and produces similar results.

Cost analysis

The following table provides a cost breakdown example for deploying this solution with the default parameters in the US East (N. Virginia) Region, excluding free tier, to process one series assuming:

  • 26 episodes (1 season), each with a 42-minute duration
  • 830 frames extracted per episode
  • Frames of 2 MB in size
  • 1 detect faces call per frame
  • 4 face search calls on average per frame
  • 30 artists to detect, with 5 face pictures each
  • Episode files in MP4 format, 4 GB in size
  • 1 workflow run per episode

Region                 Description                      Service                               Per Season (USD)
US East (N. Virginia)  Shot Detection – detect scenes   Rekognition Video                     54.6
US East (N. Virginia)  Facial Recognition               Rekognition Image                     107.9
US East (N. Virginia)  Face Collection                  Rekognition Image                     0.1515
US East (N. Virginia)  Workflow – process frames        AWS Lambda                            0.26
US East (N. Virginia)  Frame storage                    S3 Standard                           1.01
US East (N. Virginia)  Video storage                    S3 Standard                           2.39
US East (N. Virginia)  Analysis results storage         DynamoDB on-demand capacity           0.61
US East (N. Virginia)  Workflow runs                    Step Functions – standard workflows   0.42
Total                                                                                         167.34
Per episode                                                                                   6.44

View this estimate using the AWS Pricing Calculator by clicking on the link: Compute on-screen time using machine learning tools from AWS (Cost Estimate).

Conclusion

In this post, we walk through a solution that uses machine learning to track the on-screen time of a person in a video or a series of videos. The primary services used are Amazon Rekognition and serverless components, like AWS Lambda, AWS Step Functions, and Amazon DynamoDB. Media and entertainment companies can use the solution to automate the process of related rights calculation for a TV show rerun. The solution is also customizable, allowing for extensions and improvements.

Refer to the GitHub repository for the source code and the full setup instructions. Clone, change, deploy, and run it yourself.

Luiza Hagemann

Luiza Hagemann is a Prototyping Architect at the AWS Prototyping and Cloud Engineering team in Brazil. She has previously worked as a Software Engineer in the internet industry managing highly available, data-intensive applications.

Rafael Werneck

Rafael Werneck is a Senior Prototyping Architect at AWS Prototyping and Cloud Engineering, based in Brazil. Previously, he worked as a Software Development Engineer on Amazon.com.br and Amazon RDS Performance Insights.

Rafael Ribeiro Martins

Rafael Ribeiro Martins is a Senior Technical Program Manager at AWS in Brazil. He helps customers envision the art of the possible on AWS by working with them on innovative prototyping engagements. With over 10 years of experience in project management for technical programs, he currently focuses on project management related to emerging technologies.