AWS AI Blog

Analyze Emotion in Video Frame Samples Using Amazon Rekognition on AWS

by Cyrus Wong

This guest post is by AWS Community Hero Cyrus Wong. Cyrus is a Data Scientist at the Hong Kong Vocational Education (Lee Wai Lee) Cloud Innovation Centre. He has achieved all 7 AWS Certifications and enjoys sharing his AWS knowledge with others through open-source projects, blog posts, and events.

HowWhoFeelInVideo is an application that analyzes faces detected in sampled video clips to interpret the emotion or mood of the subjects. It identifies faces, analyzes the emotions displayed on those faces, generates corresponding emoji overlays on the video, and logs emotion data. The application accomplishes all of this within a serverless architecture using Amazon Rekognition, AWS Lambda, AWS Step Functions, and other AWS services.

HowWhoFeelInVideo was developed as part of a research project at the Hong Kong Vocational Education (Lee Wai Lee) Cloud Innovation Centre.  The project is focused on childcare, elder care, and community services. However, emotion analysis can be used in many areas, including rehabilitative care, nursing care, and applied psychology. My initial focus has been on applying this technology to the classroom.

In this post, I explain how HowWhoFeelInVideo works and how to deploy and use it.

How it works

Teachers like me can use HowWhoFeelInVideo to get an overall measure of a student’s mood (e.g., happy, calm, or confused) while taking attendance. The instructor can use this data to adjust his or her focus and approach to enhance the teaching experience. This research project is just beginning. I will update this post after I receive additional results.

To use HowWhoFeelInVideo, a teacher sets up a basic classroom camera to take each student’s attendance using face identification. The camera also captures how students feel during class. Teachers can also use HowWhoFeelInVideo to prevent students from falsely reporting attendance.

Architecture and design

HowWhoFeelInVideo is a serverless application built using AWS Lambda functions. Five of the Lambda functions are included in the HowWhoFeelInVideo state machine. AWS Step Functions streamlines coordinating the components of distributed applications and microservices using visual workflows. This simplifies building and running multi-step applications.

The HowWhoFeelInVideo state machine starts with the startFaceDetectionWorkFlowLambda function, which is triggered by an Amazon S3 PUT object event. startFaceDetectionWorkFlowLambda passes the following input into the execution:

{
    "bucket": "howwhofeelinvideo",
    "key": "Test2.mp4"
}
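The downstream workflow can be expressed in Amazon States Language. The following is only an illustrative sketch of such a definition — the state names, Lambda ARNs, and wiring are assumptions for this post, not the project's exact state machine:

```json
{
  "Comment": "Illustrative sketch of a HowWhoFeelInVideo-style state machine",
  "StartAt": "ExtractFrames",
  "States": {
    "ExtractFrames": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessVideo",
      "Next": "CascadesFaceDetection"
    },
    "CascadesFaceDetection": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:CascadesFaceDetection",
      "Next": "WaitForResults"
    },
    "WaitForResults": {
      "Type": "Wait",
      "Seconds": 10,
      "Next": "ComposeResults"
    },
    "ComposeResults": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ComposeResults",
      "End": true
    }
  }
}
```

Each Task state maps to one Lambda function, and the Wait state gives asynchronously invoked workers time to finish writing to Amazon S3.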

With Step Functions, you can give each state a meaningful name, which makes workflows easy to understand. You can also monitor the processing pipeline from the AWS Management Console.

The HowWhoFeelInVideo state machine is available in the us-east-1 AWS Region. The video processing tasks are implemented using FFmpeg.
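As an illustration of the FFmpeg step, a frame-sampling invocation might look like the following. The paths and the sampling rate are assumptions for this post, not the project's exact command:

```shell
# Sample one frame every 5 seconds (rate is illustrative) into numbered JPEGs
/tmp/ffmpeg -i /tmp/Test2.mp4 -vf fps=1/5 /tmp/frames/%04d.jpg
```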

Behind the scenes

Before you start using HowWhoFeelInVideo, you need to understand how it works and a few general principles.

When you need to use a pre-built program such as FFmpeg, you can run it as a new process in Lambda:

  1. Copy the program to the /tmp directory.
  2. Call the shell and use chmod to give it execute permission.
  3. Call the shell to run the program.

Lambda saves files for data processing in the /tmp directory, which is limited to 512 MB. To improve performance, Lambda reuses containers between invocations. This means that files in the /tmp directory might persist and take up space during the next Lambda call. Therefore, you should always remove old files from /tmp, either at the beginning or the end of each step.

Face analysis is triggered by the ProcessImage Lambda function, which is written in Scala. The ProcessImage function processes only one image at a time. It performs the following tasks:

  1. Downloads an image from an S3 bucket.
  2. Calls Amazon Rekognition to detect faces and emotions (with the detectFaces operation).
  3. Crops each face from the image using the bounding box provided by the detectFaces operation.
  4. Attempts to identify the owner of each face (using the searchFacesByImage operation) in the specified face collection.
  5. Joins the emotion and face identification results.
  6. Creates an emoji face overlay image and emotion report records.
  7. Uploads the emoji face overlay image and emotion report records to an S3 bucket.
Because AWS Lambda bills for memory usage in 100-ms increments and Amazon Rekognition charges per request, the system is designed to run at maximum concurrency. I pay the same price whether I process all screen capture images at once or one by one!
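To see why, here is the arithmetic with made-up numbers (Lambda's billable unit is GB-seconds; the frame count, duration, and memory size below are illustrative):

```javascript
// Lambda cost is proportional to allocated memory x billed duration,
// summed over all invocations; concurrency does not change the total.
function gbSeconds(frames, secondsPerFrame, memoryMB) {
  return frames * secondsPerFrame * (memoryMB / 1024);
}
// 120 frames x 2 s each at 1024 MB is 240 GB-seconds whether the
// invocations run one by one or all at once; only wall-clock time differs.
```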

The Cascades Face Detection step asynchronously invokes the ProcessImage Lambda function for each screen capture image nearly in parallel. Each ProcessImage function calls Amazon Rekognition for each face detected.

The following is a parallel map function that invokes the ProcessImage function for each image frame:

// Assumes the AWS SDK for JavaScript, with bucket, prefix, keys, and
// callback provided by the surrounding handler.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

let invokeLambda = (key) => new Promise((resolve, reject) => {
    let data = JSON.stringify({bucket: bucket, key: prefix + "/" + key});
    let params = {
        FunctionName: process.env['ProcessImage'], /* required */
        Payload: data /* required */
    };
    lambda.invoke(params, (err, data) => {
        if (err) reject(err);   // an error occurred
        else     resolve(data); // successful response
    });
});

let invokeLambdaPromises = keys.map(invokeLambda);
Promise.all(invokeLambdaPromises).then(() => {
    let pngKeys = keys.map(key => key.split(".")[0] + ".png");
    let data = {bucket: bucket, prefix: prefix, keys: pngKeys};
    console.log("invokeLambdaPromises complete!");
    callback(null, data);
}).catch(err => {
    console.log("invokeLambdaPromises failed!");
    callback(err);
});

The following is a parallel map function that gets the face ID in Scala:

// Parallelize the search requests; faces not found in the collection
// are labeled "????".
import scala.collection.JavaConverters._

val faceMatchAndBoundBoxAndEmotion = faceImagesAndBoundBoxAndEmotion.par.map(f => {
  searchFacesByImage(f._1) match {
    case Some(face) => {
      val id = face.getFaceMatches.asScala.headOption match {
        case Some(a) => a.getFace.getExternalImageId // owner found in the collection
        case None => "????"                          // no match above the threshold
      }
      (id, f._2, f._3)
    }
    case None => ("????", f._2, f._3)                // no search result at all
  }
})
faceMatchAndBoundBoxAndEmotion.seq

The following service map shows the dependency trees with trace data that I can use to drill into specific services or issues. It provides a view of the connections between services in the application and aggregated data for each service, including average latency and failure rates.

The following is a latency distribution histogram for an Amazon Rekognition API call:

Latency is the amount of time between the start of a request and when it completes. A histogram shows a distribution of latencies. This latency distribution histogram shows duration on the x-axis, and the percentage of requests that match each duration on the y-axis.

I set the maximum execution time of the ProcessImage function to 1.5 minutes and added a 10-second wait step in the state machine to ensure that all images and emotion records are ready in Amazon S3.

The following Lambda cascading timeline shows how processing operates in a highly parallel manner:

The result

The result includes a single output image:

An output record from a single image:

[
  {"seq": "test5/0036", "id": "????", "happy": 11.956384658813477, "sad": 0.0, "angry": 0.0, "confused": 26.754457473754883, "disgusted": 0.0, "surprised": 16.45158576965332, "calm": 0.0, "unknown": 0.0},
  {"seq": "test5/0036", "id": "2astudent21", "happy": 40.610809326171875, "sad": 3.8441836833953857, "angry": 0.0, "confused": 11.73412799835205, "disgusted": 0.0, "surprised": 0.0, "calm": 0.0, "unknown": 0.0},
  {"seq": "test5/0036", "id": "????", "happy": 97.30420684814453, "sad": 19.768024444580078, "angry": 0.0, "confused": 0.0, "disgusted": 0.0, "surprised": 0.0, "calm": 0.7546186447143555, "unknown": 0.0}
]

An output report for all images in .csv format:

Note:

I don’t index images of all of my students’ faces in the video. Each unindexed face is allocated an unknown faceId.

With this report, you can easily aggregate data on overall student satisfaction and determine how each individual feels throughout the course or the event. For health research, we plan to objectively record emotional feedback during class for Special Education Needs (SEN) students when we use a different teaching method. For a class of non-SEN students, we import the CSV report into a database and aggregate it with the following simple SQL statement:

SELECT Report.id AS Student,
       Count(Report.seq) AS Attended,
       Sum(Report.happy) AS SumOfhappy,
       Sum(Report.sad) AS SumOfsad,
       Sum(Report.angry) AS SumOfangry,
       Sum(Report.confused) AS SumOfconfused,
       Sum(Report.disgusted) AS SumOfdisgusted,
       Sum(Report.surprised) AS SumOfsurprised,
       Sum(Report.calm) AS SumOfcalm,
       Sum(Report.unknown) AS SumOfunknown
FROM Report
GROUP BY Report.id;

Demo Video Output

The following video is my TV interview with four of my students about Hong Kong Open Data; it has been processed using HowWhoFeelInVideo.

The step that extracts frames from the video must complete within the maximum Lambda execution time of 5 minutes, so you cannot directly process a long video. However, it is easy to create fragmented MP4 files with Amazon Elastic Transcoder and run the analysis over the MP4 fragments.

Overall AWS X-Ray Service Map

Source code for HowWhoFeelInVideo is available on GitHub.

Deploying HowWhoFeelInVideo

Deployment is simple. I created an AWS CloudFormation template with the AWS Serverless Application Model (AWS SAM). AWS SAM is a specification for describing Lambda-based applications. It offers a syntax designed specifically for expressing serverless resources. To deploy the application, perform the following steps:

    1. In Amazon Rekognition, create a face collection named student.
    2. Use the AWS CLI to store faces in the collection.
    3. Create an S3 source bucket in the us-east-1 AWS Region.
    4. Download the three files in the Deployment folder on GitHub.
    5. Upload the two source packages from the Deployment folder, FaceAnalysis-assembly-1.0.jar and ProcessVideoLambda_latest.zip, into the S3 source bucket.
    6. In the AWS CloudFormation console, choose Create Stack.
    7. Select Upload a template to Amazon S3, choose HowWhoFeelInVideo.yaml, and then choose Next.
    8. Specify the following parameters, and choose Next.
        1. Stack name: howwhofeelinvideo (a unique name for the stack in your AWS Region)
        2. CollectionId: student, the name of the indexed face collection that you created in step 1
        3. FaceMatchThreshold: Type 70. The face match threshold ranges from 0 to 100. It specifies the minimum confidence required for two faces to be considered a match.
        4. PackageBucket: The name of the S3 source bucket that you created in step 3.
        5. VideoBucketName: The name of the bucket that you want to create. Uploading .mp4 and .mov files to this bucket starts the workflow. The bucket name must be globally unique. When you delete the AWS CloudFormation stack, this bucket remains.
    9. On the Options page, choose Next.
    10. Select all acknowledgment boxes, and choose Create Change Set.
    11. When the change set has been created, choose Execute.
    12. Wait while the AWS CloudFormation stack is created.
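For steps 1 and 2, the Amazon Rekognition CLI commands look like the following. The bucket, image, and ID values are placeholders, not the project's actual data:

```shell
# Step 1: create the face collection
aws rekognition create-collection --collection-id student --region us-east-1

# Step 2: index a student's face; the ExternalImageId becomes the id
# that appears in the emotion report
aws rekognition index-faces \
    --collection-id student \
    --image '{"S3Object":{"Bucket":"my-faces-bucket","Name":"student01.jpg"}}' \
    --external-image-id student01 \
    --region us-east-1
```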

Try your deployment

  1. Sign in to your AWS account and open the Amazon S3 console.
  2. In the S3 console, upload a short video into the video bucket. If you are not familiar with the Amazon S3 console, follow this tutorial: How Do I Upload Files and Folders to an S3 Bucket?
  3. In a new browser tab, open the AWS Step Functions console.
  4. When you see the task running, choose State Machines. You might need to refresh the browser.
  5. Select the execution instance of the state machine that is running. You will see an animation of the process.
  6. When the process has completed, refresh the Amazon S3 console. A new folder appears.
  7. Choose the new folder.
  8. In the Search box, type video, and then open or download the video file.
  9. To get the report, in the Search box, type result, and open or download the report file.

Conclusion

HowWhoFeelInVideo can help us understand the emotions of all of the people captured in a video. It has a variety of applications, including education, training, rehabilitative care, and customer interactions. Deployment is simple with an AWS CloudFormation template. Just take a video with your smartphone and upload it to an S3 bucket. In a few minutes, you’ll get the emotion analytics report!

This project has been developed in collaboration with four of my students from the IT114115 Higher Diploma in Cloud and Data Centre Administration: Ng Ka Yin, Lai Kam To, Karlos Lam, and Pang Chin Wing. Also, thanks to the AWS Academy curriculum, which helps my students learn how to use AWS services!

 


Additional Reading

Learn how to create a serverless solution for video frame analysis and alerting with Amazon Rekognition.