Find Distinct People in a Video with Amazon Rekognition

Note: AWS released Amazon Rekognition Video on November 29, 2017, and it is now the preferred approach for analyzing videos and finding distinct people. Nevertheless, we continue to make this blog post available for educational purposes, as an example of how to use Amazon Rekognition.

Amazon Rekognition makes it easy to detect, search for, and compare faces in images to find matches. In this post, we show how to use Amazon Rekognition to find distinct people in a video and identify the frames that they appear in. You could use face detection in videos, for example, to identify actors in a movie, find relatives and friends in a personal video library, or track people in video surveillance.

First, we explain how the serverless solution finds distinct people in a video. Then, we explain how to implement the solution in your AWS account with AWS CloudFormation and to test it with a sample video.

How it works

The following diagram shows how this solution works:

Amazon Rekognition currently supports image analysis only. Therefore, we need to extract frames of the input video as images. To create these video thumbnails, we use Amazon Elastic Transcoder, a service that makes it easy to convert media files in the cloud with no need to manage the underlying infrastructure.

This is what happens in greater detail:

You upload a video file into an S3 bucket.
Amazon S3 invokes the first of the two AWS Lambda functions to create a new job in Amazon Elastic Transcoder (the code for this follows this list).
The Elastic Transcoder job creates video thumbnails in .png format for every second of input video and uploads them into the S3 bucket. (It also creates a transcoded video, which we don’t use for this post.)
When the job completes, Elastic Transcoder sends a notification to an SNS topic, and Amazon Simple Notification Service (Amazon SNS) invokes the second Lambda function.

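# Imports needed by this snippet (Python 3)
import os
from datetime import datetime
from urllib.parse import unquote_plus

import boto3

# Retrieve the key for the S3 object that caused this function to be triggered
key = unquote_plus(event['Records'][0]['s3']['object']['key'])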
filename = key.split('/')[-1]

# Create a new transcoding job. Files created by Elastic Transcoder start with 'elastictranscoder/[filename]/[timestamp]_'
timestamp = datetime.utcnow().strftime('%Y-%m-%d_%H-%M-%S')

client = boto3.client('elastictranscoder')
response = client.create_job(
  PipelineId=os.environ['PipelineId'],
  Input={'Key': key},
  OutputKeyPrefix='elastictranscoder/{}/{}_'.format(filename, timestamp),
  Output={
    'Key': 'transcoded-video.mp4',
    'ThumbnailPattern': 'thumbnail-{count}',
    'PresetId': os.environ['PresetId']
  }
)
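
To make the key handling above concrete, here is a minimal Python 3 sketch, assuming the standard S3 event notification layout. The helper name build_output_prefix and the sample event are illustrative and not part of the original solution; the clock is passed in as a parameter only so that the output is reproducible.

```python
from datetime import datetime, timezone
from urllib.parse import unquote_plus


def build_output_prefix(event, now=None):
    """Return (key, prefix) for the first S3 record in a Lambda event.

    Hypothetical helper that mirrors the snippet above.
    """
    # S3 URL-encodes object keys in event notifications (spaces become '+')
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])
    filename = key.split('/')[-1]
    now = now or datetime.now(timezone.utc)
    timestamp = now.strftime('%Y-%m-%d_%H-%M-%S')
    return key, 'elastictranscoder/{}/{}_'.format(filename, timestamp)


# Made-up event containing only the fields that the helper reads
sample_event = {'Records': [{'s3': {'object': {'key': 'input/my+video.mp4'}}}]}
key, prefix = build_output_prefix(sample_event, datetime(2017, 6, 1, 12, 0, 0))
print(key)     # input/my video.mp4
print(prefix)  # elastictranscoder/my video.mp4/2017-06-01_12-00-00_
```

Because the prefix embeds both the file name and a timestamp, uploading the same video twice produces two separate sets of thumbnails.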

The second Lambda function creates a new collection in Amazon Rekognition. A collection is a container for the faces that Amazon Rekognition detects in images when you call the IndexFaces operation. Note that the image bytes don’t persist in Amazon Rekognition. Instead, Amazon Rekognition extracts and stores facial features in the collection. The function then retrieves the list of thumbnail objects that Elastic Transcoder created for that video in the S3 bucket and does the following:
1. Calls the IndexFaces operation for each thumbnail. The solution uses concurrent threads to increase the throughput of requests to Amazon Rekognition and to reduce the time needed to complete the operation. In the end, the collection contains one entry for every face detected across all of the thumbnails.

# Create a new collection. I use the job ID for the name of the collection
collectionId = sns_msg['jobId']
rekognition.create_collection(CollectionId=collectionId)

# Retrieve the list of thumbnail objects in the S3 bucket
thumbnailKeys = []
prefix = sns_msg['outputKeyPrefix']
prefix += sns_msg['outputs'][0]['thumbnailPattern'].replace('{count}', '')

paginator = s3.get_paginator('list_objects')
response_iterator = paginator.paginate(
  Bucket=os.environ['Bucket'],
  Prefix=prefix
)
for page in response_iterator:
  # A page has no 'Contents' key when no objects match the prefix
  thumbnailKeys += [i['Key'] for i in page.get('Contents', [])]

# Call the IndexFaces operation for each thumbnail
faces = {}
indexFacesQueue = Queue()

def index_faces_worker():
  rekognition = boto3.client('rekognition', region_name=os.environ['AWS_REGION'])

  while True:
    key = indexFacesQueue.get()
    
    try:
      # Derive the frame number from the thumbnail key (for example,
      # '...thumbnail-00042.png' gives 42) before calling IndexFaces,
      # because it is passed as the ExternalImageId
      frameNumber = int(key[:-4][-5:])

      response = rekognition.index_faces(
        CollectionId=collectionId,
        Image={'S3Object': {
          'Bucket': os.environ['Bucket'],
          'Name': key
        }},
        ExternalImageId=str(frameNumber)
      )

      # Store information about returned faces in a local variable
      for face in response['FaceRecords']:
        faceId = face['Face']['FaceId']
        faces[faceId] = {
          'FrameNumber': frameNumber,
          'BoundingBox': face['Face']['BoundingBox']
        }

    # Put the key back in the queue if the IndexFaces operation failed
    except Exception:
      indexFacesQueue.put(key)

    indexFacesQueue.task_done()

# Start CONCURRENT_THREADS threads
for i in range(CONCURRENT_THREADS):
  t = Thread(target=index_faces_worker)
  t.daemon = True
  t.start()

# Wait for all thumbnail objects to be processed
for key in thumbnailKeys:
  indexFacesQueue.put(key)
indexFacesQueue.join()
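
The worker above puts a failed key back in the queue, so a key that can never be indexed would be retried indefinitely. As a hedged alternative (not the approach used in the original solution), the following sketch uses concurrent.futures.ThreadPoolExecutor with a bounded number of attempts. The names process_all, max_attempts, and fake_index are hypothetical; in real code, the function passed in would wrap the call to rekognition.index_faces.

```python
from concurrent.futures import ThreadPoolExecutor


def process_all(items, func, workers=4, max_attempts=3):
    """Apply func to every item concurrently, retrying failures a few times.

    Returns (results, failed): a dict of item -> result, and the list of
    items that still failed after max_attempts tries.
    """
    def attempt(item):
        last_error = None
        for _ in range(max_attempts):
            try:
                return item, func(item), None
            except Exception as error:  # retry on any per-item failure
                last_error = error
        return item, None, last_error

    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input items
        for item, value, error in pool.map(attempt, items):
            if error is None:
                results[item] = value
            else:
                failed.append(item)
    return results, failed


# Stand-in for a call such as rekognition.index_faces (no AWS access needed):
# it fails once for one key and always fails for another
seen = set()

def fake_index(key):
    if key == 'thumbnail-00002.png' and key not in seen:
        seen.add(key)
        raise RuntimeError('throttled')
    if key == 'broken.png':
        raise ValueError('not an image')
    return int(key[:-4][-5:])  # frame number, as in the code above


keys = ['thumbnail-00001.png', 'thumbnail-00002.png', 'broken.png']
results, failed = process_all(keys, fake_index)
print(results)  # {'thumbnail-00001.png': 1, 'thumbnail-00002.png': 2}
print(failed)   # ['broken.png']
```

Returning the failed items explicitly lets the caller decide whether to log them, skip them, or abort, instead of blocking on the queue.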

2. For each face stored in the collection, calls the SearchFaces operation to find similar faces that match it with a confidence higher than 97%. The following code shows how this works:

searchFacesQueue = Queue()

def search_faces_worker():
  rekognition = boto3.client('rekognition', region_name=os.environ['AWS_REGION'])
  
  while True:
    faceId = searchFacesQueue.get()

    try:
      response = rekognition.search_faces(
        CollectionId=collectionId,
        FaceId=faceId,
        FaceMatchThreshold=97,
        MaxFaces=256
      )
      matchingFaces = [i['Face']['FaceId'] for i in response['FaceMatches']]

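      # Keep the face only if it has at least one matching face;
      # otherwise, delete it from the local variable 'faces'
      if len(matchingFaces) > 0:
        faces[faceId]['MatchingFaces'] = matchingFaces
      else:
        del faces[faceId]

    # Put the face ID back in the queue if the SearchFaces operation failed
    except Exception:
      searchFacesQueue.put(faceId)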

    searchFacesQueue.task_done()

for i in range(CONCURRENT_THREADS):
  t = Thread(target=search_faces_worker)
  t.daemon = True
  t.start()

for faceId in list(faces):
  searchFacesQueue.put(faceId)
searchFacesQueue.join()

3. Groups the matching faces into distinct people. It starts from the first face that appears in the video and associates that face with a personId of 1. Then, it recursively propagates the personId to the matching faces. In other words, if faceA matches faceB and faceB matches faceC, the function decides that faceA, faceB, and faceC correspond to the same person and assigns them all the same personId. To avoid false positives, the Lambda function propagates the personId from faceA to faceB only if at least two of the faces that match faceB also match faceA. When the personId 1 has fully propagated, the function associates a personId of 2 with the next face appearing in the video that has no personId yet. It continues this process until all of the faces have a personId. The following code shows how this works:

# Sort the list of faces in the order of which they appear in the video
def getKey(item):
  return item[1]
facesFrameNumber = {k: v['FrameNumber'] for k, v in faces.items()}
faceIdsSorted = [i[0] for i in sorted(facesFrameNumber.items(), key=getKey)]

# Identify unique people and detect the frames in which they appear
def propagate_person_id(faceId):
  for matchingId in faces[faceId]['MatchingFaces']:
    if 'PersonId' not in faces[matchingId]:

      # Count the faces that match matchingId and that also match faceId
      numberMatchingLoops = 0
      for matchingId2 in faces[matchingId]['MatchingFaces']:
        if faceId in faces[matchingId2]['MatchingFaces']:
          numberMatchingLoops = numberMatchingLoops + 1

      # Propagate the PersonId only if at least two such faces exist
      if numberMatchingLoops >= 2:
        personId = faces[faceId]['PersonId']
        faces[matchingId]['PersonId'] = personId
        propagate_person_id(matchingId)

personId = 0
for faceId in faceIdsSorted:
  if not 'PersonId' in faces[faceId]:
    personId = personId + 1
    faces[faceId]['PersonId'] = personId
    propagate_person_id(faceId)
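
To see why the two-common-matches rule helps, consider a toy example. The following sketch is an iterative variant of the recursive function above (it uses an explicit stack, which avoids deep recursion but is otherwise meant to give the same labels). The helpers assign_person_ids and make_faces, and the face IDs, are made up for illustration. The data contains two groups of four mutually matching faces, plus one spurious match between a4 and b1.

```python
def assign_person_ids(faces):
    """Label each face with a PersonId using the two-common-matches rule.

    `faces` maps faceId -> {'FrameNumber': int, 'MatchingFaces': [faceId, ...]}.
    """
    ordered = sorted(faces, key=lambda face_id: faces[face_id]['FrameNumber'])
    person_id = 0
    for start in ordered:
        if 'PersonId' in faces[start]:
            continue
        person_id += 1
        faces[start]['PersonId'] = person_id
        stack = [start]
        while stack:
            face_id = stack.pop()
            for candidate in faces[face_id]['MatchingFaces']:
                if 'PersonId' in faces[candidate]:
                    continue
                # Count the candidate's matches that also match face_id
                shared = sum(
                    1 for other in faces[candidate]['MatchingFaces']
                    if face_id in faces[other]['MatchingFaces']
                )
                if shared >= 2:
                    faces[candidate]['PersonId'] = person_id
                    stack.append(candidate)
    return {face_id: face['PersonId'] for face_id, face in faces.items()}


def make_faces(groups, extra_links=()):
    """Build a toy `faces` dict: every face matches the others in its group."""
    faces = {}
    frame = 0
    for group in groups:
        for face_id in group:
            frame += 1
            faces[face_id] = {
                'FrameNumber': frame,
                'MatchingFaces': [other for other in group if other != face_id],
            }
    for a, b in extra_links:  # symmetric one-off matches between groups
        faces[a]['MatchingFaces'].append(b)
        faces[b]['MatchingFaces'].append(a)
    return faces


toy_faces = make_faces(
    [['a1', 'a2', 'a3', 'a4'], ['b1', 'b2', 'b3', 'b4']],
    extra_links=[('a4', 'b1')],
)
labels = assign_person_ids(toy_faces)
print(labels)
# {'a1': 1, 'a2': 1, 'a3': 1, 'a4': 1, 'b1': 2, 'b2': 2, 'b3': 2, 'b4': 2}
```

The single a4-b1 match is not enough to merge the two groups, because no other face that matches b1 also matches a4. Note that, under this rule, a face needs at least three matches inside its group before the identifier can reach it.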

In our solution, we arbitrarily chose to return only the people who appear in at least five consecutive frames. The Lambda function creates a JSON file with the following structure and uploads it to the S3 bucket:

{
  "People": [
    {
      "Frames": [
        {
          "FrameNumber": number,
          "FrameTimePosition": "HH:MM:SS",
          "BoundingBox": { 
            "Height": number,
            "Left": number,
            "Top": number,
            "Width": number
          }
        },
        ...
      ]
    },
    ...
  ]
}
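
The following sketch shows one way the five-consecutive-frames filter and the JSON structure above could be produced from the faces dictionary. It is an illustration under stated assumptions, not the code of the original solution: the helper names are hypothetical, "five consecutive frames" is read as a run of five consecutive frame numbers, and the time position assumes one thumbnail per second with thumbnail 1 taken at 00:00:00.

```python
import json


def longest_run(frame_numbers):
    """Length of the longest run of consecutive integers in frame_numbers."""
    best = current = 0
    previous = None
    for frame in sorted(set(frame_numbers)):
        current = current + 1 if previous is not None and frame == previous + 1 else 1
        best = max(best, current)
        previous = frame
    return best


def frame_time_position(frame_number, seconds_per_frame=1):
    """Format a frame number as HH:MM:SS (assumes frame 1 is at 00:00:00)."""
    total = (frame_number - 1) * seconds_per_frame
    return '{:02d}:{:02d}:{:02d}'.format(total // 3600, total % 3600 // 60, total % 60)


def build_people(faces, min_consecutive=5):
    """Group faces by PersonId and keep people with a long enough run."""
    faces_by_person = {}
    for face in faces.values():
        faces_by_person.setdefault(face['PersonId'], []).append(face)

    people = []
    for person_id in sorted(faces_by_person):
        person_faces = sorted(faces_by_person[person_id], key=lambda f: f['FrameNumber'])
        if longest_run(f['FrameNumber'] for f in person_faces) < min_consecutive:
            continue
        people.append({'Frames': [{
            'FrameNumber': f['FrameNumber'],
            'FrameTimePosition': frame_time_position(f['FrameNumber']),
            'BoundingBox': f['BoundingBox'],
        } for f in person_faces]})
    return {'People': people}


# Toy input: person 1 appears in frames 1-5, person 2 in frames 7, 8, and 10
box = {'Height': 0.2, 'Left': 0.4, 'Top': 0.1, 'Width': 0.15}
toy_faces = {}
for n in [1, 2, 3, 4, 5]:
    toy_faces['p1-{}'.format(n)] = {'PersonId': 1, 'FrameNumber': n, 'BoundingBox': box}
for n in [7, 8, 10]:
    toy_faces['p2-{}'.format(n)] = {'PersonId': 2, 'FrameNumber': n, 'BoundingBox': box}

result = build_people(toy_faces)
print(len(result['People']))                                   # 1
print(result['People'][0]['Frames'][4]['FrameTimePosition'])   # 00:00:04
document = json.dumps(result, indent=2)  # text that could be uploaded to S3
```

Only person 1 is returned, because person 2 never appears in more than two consecutive frames.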

It also creates and uploads a visual representation to the S3 bucket. You will see an example in the next section. Finally, the Lambda function deletes the collection from Amazon Rekognition.

Implementing and testing the solution

To implement and test the solution in your AWS account, you will use AWS CloudFormation to provision the required resources in the US East (N. Virginia) Region.

CloudFormation creates the following resources:

An S3 bucket that stores input videos, video thumbnails, and the files created with this solution.
An SNS topic where Elastic Transcoder publishes an event when a job completes.
An IAM role that grants Elastic Transcoder the required permissions to access Amazon S3 and Amazon SNS.
A pipeline and a preset in Elastic Transcoder. The pipeline is a queue for Elastic Transcoder jobs that defines how input and output files are stored in Amazon S3 and which notifications to send. The preset specifies settings, including thumbnail settings, for transcoding media files.
An IAM role that grants Lambda the required permissions to access Amazon S3 and Amazon Rekognition.
A Lambda function that Amazon S3 invokes when a new video is uploaded into the S3 bucket.
The second Lambda function that Amazon SNS invokes. This Lambda function processes the video thumbnails to find distinct people.

Some of the resources that AWS CloudFormation creates are custom resources. Therefore, AWS CloudFormation creates the related Lambda functions and IAM roles for Lambda beforehand.

To deploy and test the solution

Choose Create stack to create an AWS CloudFormation stack. Then, follow the on-screen instructions.
After creating these resources, AWS CloudFormation copies the video Democratizing LoRaWAN and IoT with The Things Network into the S3 bucket, so you don’t need to upload a video manually to test the solution. Copying the video into the bucket triggers the solution. It can take up to 10 minutes after you start creating the stack for the solution to process the video.
After the video has been processed, in the AWS CloudFormation console, choose Outputs and note the name of the S3 bucket.
Open the Amazon S3 console to browse the objects in this S3 bucket. You should see a new folder called output, which contains two files: the JSON document and the visual representation of each face in .png format, as follows:
The solution has detected seven people in the video. For each person, the visual representation shows four randomly selected views of that person’s face and red vertical lines that indicate where that person appears in a frame.
You can now clean up the resources by deleting the AWS CloudFormation stack. AWS CloudFormation does not delete the S3 bucket because it contains objects. You need to delete the S3 bucket manually.

Conclusion

In this post, we’ve shown how to use Amazon Rekognition, Amazon Elastic Transcoder, AWS Lambda, and Amazon S3 to identify people who appear in a video and to detect the frames in which they appear.

You can adapt this solution to your own requirements. For example, you could return additional attributes for the people that the solution finds, like an estimated age range or their name if they are famous individuals or celebrities.

If you have comments, submit them in the Comments section. If you have questions, start a new thread on the Amazon Rekognition forum.

Next Steps

Take your knowledge to the next level. Learn how to classify a large number of images with Amazon Rekognition and AWS Batch.

About the Authors

Nicolas Malaval is a Consultant for AWS Professional Services. He lives in Paris and works with our enterprise customers, helping them adopt cloud technology and innovate with AWS.

Rudy Krol is a Solution Architect for Amazon Web Services. He gained experience in software development before joining AWS. He is now specialized in serverless and IoT, helping our customers in France embrace the latest technologies on their innovative projects.
