AWS Machine Learning Blog

Automated video editing with YOU as the star!

Have you ever wanted to find a specific person among hours of video footage?

Perhaps you’re preparing a video for a 21st birthday celebration, wanting to find happy memories of your birthday child. Maybe you are scouring video footage, looking to see what a specific employee did on their last day at work. Or it may be that you want to produce a highlight reel of your own efforts at the Nathan’s Hot Dog Eating Contest.

In this blog post, you will learn how to combine the capabilities of Amazon Rekognition Video and Amazon Elastic Transcoder to automatically convert a long video into a highlight video showing all footage of a given person.

A Simple Demonstration

To demonstrate this process, I will use our Day in the Life of an AWS Technical Trainer video. If you watch the video, you will notice that it features several people talking to camera, training customers, and walking around the office.

The video was run through the process described later in this blog post, which automatically produced a video of a specifically selected person. Watch these output videos to see the final products:

In fact, the video of MJ looks like a single, continuous take because two separate scenes were automatically joined together. You’ll have to look closely to discover where the scenes were stitched together!

The Process

This is the overall process used to create a highlight video:

  1. Create a face collection in Amazon Rekognition, teaching it the people that should be recognized.
  2. Use Amazon Rekognition Video to search for faces in a saved video file.
  3. Collect the individual timestamps where faces were recognized and convert them into clips of defined duration.
  4. Stitch together a new video using Amazon Elastic Transcoder.

Each step is explained below.

Step 1: Create a face collection

An Amazon Rekognition face collection contains information about faces you want to recognize in pictures and videos. A collection can be created with the AWS SDKs or the AWS Command Line Interface (AWS CLI) by calling CreateCollection. I used the create-collection command:

$ aws rekognition create-collection --collection-id trainers

Individual faces can then be loaded into the collection with the index-faces command, either by passing the image directly or by referencing an object stored in Amazon S3:

$ aws rekognition index-faces \
    --collection-id trainers \
    --image "S3Object={Bucket=<my-bucket>,Name=john.jpg}" \
    --external-image-id John

The ExternalImageId can be used to attach a friendly name to the face. This ID is then returned whenever that face is detected in a picture or video.

For my video, I obtained individual pictures of each person from our staff directory or by copying their picture out of a frame from the video. I then ran the index-faces command to load a face for each person.
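
If you have several people to index, the same step can be scripted with boto3. Here is a minimal sketch; the bucket placeholder and the list of names are illustrative assumptions, not values from my run:

import boto3

rekognition = boto3.client('rekognition', region_name='ap-southeast-2')

# Hypothetical staff photos, one image per person, stored in S3
people = {'John': 'john.jpg', 'MJ': 'mj.jpg'}

for name, key in people.items():
  # ExternalImageId attaches the friendly name that is returned on a match
  rekognition.index_faces(
    CollectionId='trainers',
    Image={'S3Object': {'Bucket': '<my-bucket>', 'Name': key}},
    ExternalImageId=name
  )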

Amazon Rekognition does not actually save the faces detected. Instead, the underlying detection algorithm first detects the faces in the input image and extracts facial features into a feature vector, which is stored in the backend database. Amazon Rekognition uses these feature vectors when performing face match and search operations.
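
You can confirm what the collection holds, face IDs and the external image IDs but no actual images, by listing its contents. A minimal boto3 sketch:

import boto3

rekognition = boto3.client('rekognition', region_name='ap-southeast-2')

# Each entry contains a FaceId, detection metadata, and the ExternalImageId
# supplied at indexing time; the source photo itself is not returned
for face in rekognition.list_faces(CollectionId='trainers')['Faces']:
  print(face['FaceId'], face.get('ExternalImageId'))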

Step 2: Perform a face search on the video

Amazon Rekognition can recognize faces in pictures, saved videos, and streaming videos. For this project, I wanted to find faces in a video stored in Amazon S3, so I called StartFaceSearch:

$ aws rekognition start-face-search \
    --video "S3Object={Bucket=<my-bucket>,Name=trainers.mp4}" \
    --collection-id trainers

The face search will take several minutes depending upon the length of the video. If desired, a notification can be sent to an Amazon Simple Notification Service topic when the search is complete, which can then trigger a follow-up process.
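
For example, the face search can be started with a notification channel attached. This sketch assumes an existing SNS topic and an IAM role that Amazon Rekognition can use to publish to it (both ARNs are placeholders):

import boto3

rekognition = boto3.client('rekognition', region_name='ap-southeast-2')

response = rekognition.start_face_search(
  Video={'S3Object': {'Bucket': '<my-bucket>', 'Name': 'trainers.mp4'}},
  CollectionId='trainers',
  # Rekognition publishes a message to this topic when the search finishes
  NotificationChannel={'SNSTopicArn': '<topic-arn>', 'RoleArn': '<role-arn>'}
)
job_id = response['JobId']  # needed later when calling GetFaceSearch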

The result of the face search is a list of people, timestamps, and face matches:

{
  "Persons": [
    {
      "Timestamp": 7560,
      "FaceMatches": [
        {
          "Face": {
            "BoundingBox": {
              "Width": 0.6916670203208923,
              "Top": 0.1909089982509613,
              "Left": 0.14166699349880219,
              "Height": 0.46111100912094116
            },
            "FaceId": "6b62481e-06fa-48ea-a892-b8684548958b",
            "ExternalImageId": "John",
            "Confidence": 99.99750518798828,
            "ImageId": "ad98b04a-6b06-5ca5-b5ce-4db389c65c18"
          },
          "Similarity": 89.65567016601562
        }
      ]
    },
    {
      "Timestamp": 7640,
      ...

All faces detected in the video are listed, but whenever Amazon Rekognition Video finds a face that matches one in the face collection, it includes the friendly name of that face. Thus, I merely need to look for any records with an ExternalImageId of ‘John’.

A confidence rating (shown above as 99.99%) and a similarity score (89.7%) are also returned and can be used to reduce the chance of false matches.
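
For instance, matches could be discarded below a chosen similarity threshold before any clips are built. A small sketch, where the 90 percent cut-off is an arbitrary assumption rather than a value used in my run:

SIMILARITY_THRESHOLD = 90.0  # assumed cut-off; tune for your footage

def strong_matches(person):
  # Keep only the face matches whose similarity clears the threshold
  return [m for m in person.get('FaceMatches', [])
          if m['Similarity'] >= SIMILARITY_THRESHOLD]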

The Face Search output can be quite large. For my 3-minute 17-second video, it provided 984 timestamps where faces were found, out of which 124 identified ‘John’ as being in the frame.

Step 3: Convert timestamps into scenes

The remaining task is to produce an output video showing only the scenes that feature a specified person. However, the Face Search output is simply a list of timestamps where that person appears. How, then, can I produce a new video based on these timestamps?

The answer is to use Amazon Elastic Transcoder, which can combine multiple clips into a single output video. This is known as clip stitching. The clips can come from multiple source videos or, in my case, from multiple parts of the same input video.

To convert the timestamps from Amazon Rekognition Video into inputs for Elastic Transcoder, I wrote a Python script that does the following:

  1. Retrieves the Face Search results by calling GetFaceSearch.
  2. Extracts the timestamps whenever a specified person appears:
    [99800, 99840, 100000, 100040, ...]
  3. Converts the timestamps into scenes where the person appears, recording the start and end timestamps of each scene:
    [(99800, 101480), (127520, 131760), ...]
  4. Converts the scenes into the format required by Elastic Transcoder:
    [
      {'Key': 'trainers.mp4', 'TimeSpan': {'StartTime': '99.8', 'Duration': '1.68'}},
      {'Key': 'trainers.mp4', 'TimeSpan': {'StartTime': '127.52', 'Duration': '4.24'}},
      ...
    ]
'''
Extract timestamps from Amazon Rekognition Video Face Search
Then use Amazon Elastic Transcoder to stitch the clips together
'''

import boto3

# Connect to Amazon Rekognition
client = boto3.client('rekognition', region_name = 'ap-southeast-2')

# Retrieve the face search results
person_to_find = 'Karthik'
timestamps=[]

search = client.get_face_search(JobId='...', SortBy='INDEX')

while True:
  for person in search['Persons']:
    for face_match in person.get('FaceMatches', []):
      if face_match['Face']['ExternalImageId'] == person_to_find:
        timestamps.append(person['Timestamp'])

  # Retrieve the next set of results
  next_token = search.get('NextToken')
  if next_token is None:
    break
  search = client.get_face_search(JobId='...', SortBy='INDEX', NextToken=next_token)

'''
The timestamps array now looks like:
[99800, 99840, 100000, 100040, ...]
'''

# Break into scenes with start & end times
scenes=[]
start = 0

for timestamp in timestamps:
  if start == 0:
    # First timestamp
    start = end = timestamp
  else:
    # More than 1 second between timestamps? Then scene has ended
    if timestamp - end > 1000:
      # If the scene is at least 1 second long, record it
      if end - start >= 1000:
        scenes.append((start, end))
      # Start a new scene at the current timestamp
      start = end = timestamp
    else:
      # Extend scene to current timestamp
      end = timestamp

# Append final scene if it is at least 1 second long
if (start != 0) and (end - start >= 1000):
  scenes.append((start, end))

'''
The scenes array now looks like:
[(99800, 101480), (127520, 131760), ...]
'''

# Convert into format required by Amazon Elastic Transcoder
inputs=[]
for scene in scenes:
  start, end = scene
  inputs.append({
    'Key': 'trainers.mp4',
    'TimeSpan': {
      'StartTime': str(start/1000.),
      'Duration': str((end-start)/1000.)
    }
  })

'''
The inputs array now looks like:
[
  {'Key': 'trainers.mp4', 'TimeSpan': {'StartTime': '99.8', 'Duration': '1.68'}},
  {'Key': 'trainers.mp4', 'TimeSpan': {'StartTime': '127.52', 'Duration': '4.24'}},
  ...
]
'''

# Call Amazon Elastic Transcoder to stitch together a new video
client = boto3.client('elastictranscoder', region_name = 'ap-southeast-2')

job = client.create_job(
  PipelineId = '...',
  Inputs=inputs,
  Output={'Key': person_to_find + '.mp4', 'PresetId': '...'}
)

The script finishes by calling the Elastic Transcoder CreateJob API to create the output video. A few seconds later, the video appears in my Amazon S3 bucket. I then ran the script again with a different person’s name to create another video specific to that person.
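
Rather than watching the bucket, the script could also wait for the job programmatically. Here is a minimal sketch using the job_complete waiter provided by the boto3 Elastic Transcoder client, reusing the client, job, and person_to_find variables from the script above:

# Block until Elastic Transcoder reports that the stitching job has finished
waiter = client.get_waiter('job_complete')
waiter.wait(Id=job['Job']['Id'])
print(person_to_find + '.mp4 is now in the pipeline output bucket')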

That’s how you can be the star of your very own video!


About the Author

John Rotenstein is a Senior Technical Trainer at Amazon Web Services. He specializes in creating hands-on labs that allow customers to gain practical experience in using AWS services. He lives in Sydney, Australia and has a penchant for Escape Rooms.