From 2D to 3D: Building a Scalable Human Mesh Recovery Pipeline with Amazon SageMaker AI

An image of basketball players and a 3D animation

In the ever-evolving landscape of computer graphics and animation, the ability to automatically generate realistic 3D human animations from video data could transform how digital content is created. From immersive fitness experiences to cutting-edge movie productions, the demand for accurate and lifelike digital human representations has never been greater. However, converting real-world human movements into detailed 3D mesh data has traditionally been time-consuming and resource-intensive, often requiring specialized hardware and complex software pipelines.

As organizations increasingly seek to leverage advanced computer vision technologies, the demand for robust 3D human digitization solutions continues to grow. This article explores the recent development of a scalable Human Mesh Recovery (HMR) pipeline architected on AWS to process high volumes of video data while maintaining enterprise-grade reliability and performance.

Introduction to Human Mesh Recovery

Human Mesh Recovery is a computer vision technique that aims to reconstruct the 3D pose and shape of a human body from visual data like images or videos. HMR uses a parametric human body model, like Skinned Multi-Person Linear (SMPL), to estimate model parameters. The parametric human body model represents the human body as a mesh defined by pose and shape parameters.
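
To make "pose and shape parameters" concrete, the sketch below lays out the parameter sizes of the standard SMPL model: 10 shape coefficients, one axis-angle rotation for each of 24 joints (72 pose values), and a mesh of 6,890 vertices. The class name SmplParameters is hypothetical and used only for illustration. It validates parameter sizes and does not compute a mesh.

```python
from dataclasses import dataclass
from typing import List

# Sizes from the standard SMPL body model.
NUM_SHAPE_COEFFS = 10             # shape coefficients (betas)
NUM_JOINTS = 24                   # 23 body joints plus the global root orientation
NUM_POSE_VALUES = NUM_JOINTS * 3  # one axis-angle rotation (3 values) per joint
NUM_VERTICES = 6890               # vertices in the mesh the model produces


@dataclass
class SmplParameters:
    """Hypothetical container for one person's SMPL parameters in one frame."""

    betas: List[float]  # body shape
    pose: List[float]   # joint rotations, flattened axis-angle

    def __post_init__(self) -> None:
        if len(self.betas) != NUM_SHAPE_COEFFS:
            raise ValueError(f"expected {NUM_SHAPE_COEFFS} shape values")
        if len(self.pose) != NUM_POSE_VALUES:
            raise ValueError(f"expected {NUM_POSE_VALUES} pose values")


# All-zero parameters describe the mean body shape in the template pose.
neutral = SmplParameters(
    betas=[0.0] * NUM_SHAPE_COEFFS,
    pose=[0.0] * NUM_POSE_VALUES,
)
```

An HMR method estimates these two vectors (plus camera parameters) for each detected person, and the body model turns them into the final mesh.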

Because of the challenging nature of HMR, it is an ongoing research topic with new and innovative approaches regularly published. One of the main challenges in HMR is accurately detecting the human form in images and videos where the body is occluded by other objects, in an unusual pose, or in an environment with poor background or lighting conditions. Another challenge is that reconstructing detailed 3D meshes is computationally expensive and time-consuming, especially when the input is a video with a human in every frame. Reducing data requirements, improving efficiency, and identifying humans in the input data are key focuses of HMR research.

Recent HMR progress enables constructing accurate digital 3D humans from a single image or video, even if the real person is occluded by other objects or people. HMR techniques have furthered research into predicting the motion of humans through time, using newer AI models, such as diffusion models, to predict the pose and shape of the human at a future time in a video. These techniques make HMR applicable to 3D human animation.

Overview of Score-Guided HMR (ScoreHMR)

At the heart of our solution lies Score-Guided Human Mesh Recovery (ScoreHMR), an approach to 3D human pose and shape reconstruction. Unlike traditional optimization techniques, ScoreHMR uses diffusion models to capture and reconstruct human body parameters from input images. This approach allows for accurate single-frame model fitting, multi-view reconstruction without camera calibration, and video sequence reconstruction. The main advantages of ScoreHMR are that it achieves strong performance on challenging datasets by effectively leveraging the image data, and that it outperforms traditional optimization-based model fitting methods. The diffusion model technique also captures a more diverse distribution of human poses than previous regression-based methods.

ScoreHMR was published by a research group from Rutgers University. To learn more about their work, see their publication, Score-Guided Diffusion for 3D Human Recovery. The author of this post and the work discussed throughout is in no way affiliated with Rutgers University or the original researchers.

Scaling ScoreHMR on AWS

Processing large volumes of video data to extract 3D human representations is a computationally intensive task that can quickly become a bottleneck, especially as the volume of data grows. This is where AWS comes into play, providing the scalability and compute power to handle demanding workloads.

This scalable Human Mesh Recovery pipeline is designed as a serverless architecture, leveraging multiple AWS services, including AWS Lambda, Amazon S3, Amazon SQS, and Amazon SageMaker AI. This combination lets the solution scale with demand, processing growing volumes of video data without compromising performance or efficiency.

Figure 1 – Raw video for processing – football players

Amazon S3 is the data ingestion source, storing the raw video data that the pipeline needs to process. When a new video file is uploaded to the S3 bucket, an event notification is sent to Amazon SQS to queue a processing request. AWS Lambda functions are used at multiple stages of the pipeline:

An AWS Lambda function is triggered by the Amazon SQS queue to preprocess the video data from Amazon S3 and prepare it for inference with the ScoreHMR model.
This AWS Lambda function also invokes the Amazon SageMaker AI asynchronous endpoint with the preprocessed data to run inference using the ScoreHMR model.
AWS Lambda functions are also used to handle success/failure notifications from Amazon SageMaker AI and update metadata in Amazon DynamoDB accordingly.
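
The queue-to-endpoint step described above can be sketched as a Lambda handler. This is a minimal sketch under stated assumptions, not the pipeline's actual code: the endpoint name and content type are hypothetical, the preprocessing step is omitted, and the optional runtime argument exists only so the function can be exercised without AWS credentials. The call itself, invoke_endpoint_async on the boto3 sagemaker-runtime client, takes an S3 input location rather than an inline payload.

```python
import json
from urllib.parse import unquote_plus

ENDPOINT_NAME = "scorehmr-async"  # hypothetical endpoint name


def extract_s3_uris(sqs_event: dict) -> list:
    """Return an s3:// URI for every object named in an SQS-delivered S3 event."""
    uris = []
    for sqs_record in sqs_event.get("Records", []):
        s3_event = json.loads(sqs_record["body"])
        # S3 sends a test message without "Records" when a notification
        # is first configured, so default to an empty list.
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 event notifications.
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            uris.append(f"s3://{bucket}/{key}")
    return uris


def handler(event, context, runtime=None):
    """Submit each uploaded video to the asynchronous endpoint."""
    if runtime is None:
        import boto3  # imported lazily so the parsing logic has no AWS dependency

        runtime = boto3.client("sagemaker-runtime")

    output_locations = []
    for uri in extract_s3_uris(event):
        response = runtime.invoke_endpoint_async(
            EndpointName=ENDPOINT_NAME,
            InputLocation=uri,
            ContentType="video/mp4",  # assumed input format
        )
        output_locations.append(response["OutputLocation"])
    return {"outputLocations": output_locations}
```

The call returns as soon as the request is queued. The result appears later at the returned output location, which is why the pipeline relies on notifications rather than waiting inside the Lambda function.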

Amazon SageMaker AI hosts and manages the infrastructure for running the ScoreHMR model. The model is deployed as an asynchronous endpoint, which allows processing of large video payloads that may take several minutes. The Amazon SageMaker AI endpoint queues the incoming requests and automatically scales compute resources based on traffic.
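
What makes an endpoint asynchronous is the AsyncInferenceConfig block in its endpoint configuration. The sketch below builds such a request as a plain dictionary. The helper name, the variant name, the instance type, and the concurrency value are illustrative assumptions, not settings taken from the original solution.

```python
def build_async_endpoint_config(
    config_name: str,
    model_name: str,
    output_s3_uri: str,
    success_topic_arn: str,
    error_topic_arn: str,
    instance_type: str = "ml.g5.xlarge",  # assumed GPU instance type
) -> dict:
    """Build keyword arguments for the SageMaker create_endpoint_config call."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
            }
        ],
        "AsyncInferenceConfig": {
            # One video at a time per instance; an assumption for a heavy model.
            "ClientConfig": {"MaxConcurrentInvocationsPerInstance": 1},
            "OutputConfig": {
                # Where SageMaker writes each inference result.
                "S3OutputPath": output_s3_uri,
                # SNS topics notified when a request succeeds or fails.
                "NotificationConfig": {
                    "SuccessTopic": success_topic_arn,
                    "ErrorTopic": error_topic_arn,
                },
            },
        },
    }
```

The resulting dictionary can be passed as boto3.client("sagemaker").create_endpoint_config(**config), after which the endpoint is created from that configuration.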

Figure 2 – Processed video – football players 3D reconstruction

Asynchronous Inference is a capability in Amazon SageMaker AI that queues incoming requests and processes them asynchronously. This option is ideal for requests with large payload sizes (up to 1GB), long processing times (up to one hour), and near real-time latency requirements. Asynchronous Inference enables you to save on costs by autoscaling the endpoint instance count to zero when there are no requests to process, so you only pay when your endpoint is processing requests.
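
Scaling to zero is configured through Application Auto Scaling rather than on the endpoint itself. The sketch below builds the two requests involved: a scalable target with a minimum capacity of zero, and a target-tracking policy on the endpoint's backlog metric. The helper name, policy name, maximum instance count, and backlog target are assumptions chosen for illustration.

```python
def build_scale_to_zero_requests(
    endpoint_name: str,
    variant_name: str = "AllTraffic",
    max_instances: int = 4,             # assumed upper bound
    backlog_per_instance: float = 5.0,  # assumed target queue depth per instance
):
    """Return (scalable_target, scaling_policy) keyword-argument dictionaries."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    # MinCapacity of 0 is what allows the endpoint to release all instances.
    scalable_target = {**common, "MinCapacity": 0, "MaxCapacity": max_instances}
    scaling_policy = {
        **common,
        "PolicyName": f"{endpoint_name}-backlog-tracking",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": backlog_per_instance,
            "CustomizedMetricSpecification": {
                # Queued requests divided by the number of running instances.
                "MetricName": "ApproximateBacklogSizePerInstance",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
                "Statistic": "Average",
            },
        },
    }
    return scalable_target, scaling_policy
```

With client = boto3.client("application-autoscaling"), the two dictionaries would be passed to client.register_scalable_target(**scalable_target) and client.put_scaling_policy(**scaling_policy). AWS documentation also describes an additional policy for scaling out from zero instances when requests arrive, which is omitted here.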

Today, ScoreHMR and Amazon SageMaker AI do not provide functionality for splitting large payloads, such as full-length videos over 1GB or one hour in duration. To automate video splitting as part of this solution, Amazon Bedrock Data Automation with a multi-modal Amazon Nova foundation model could be used to detect scene changes in the input video. Once smaller video clips have been created, event-driven approaches, like Amazon S3 Event Notifications, could be used to invoke the SageMaker endpoint.
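
As a simpler stand-in for scene detection, clip boundaries can be planned with fixed-length windows. The sketch below is that swapped-in technique, not the scene-based approach described above. The function name and the 10-minute default are arbitrary assumptions, and cutting the video at these boundaries is left to a separate tool.

```python
from typing import List, Tuple


def plan_clips(duration_s: float, max_clip_s: float = 600.0) -> List[Tuple[float, float]]:
    """Split a video of duration_s seconds into (start, end) windows.

    Every window is at most max_clip_s seconds long, and the windows
    cover the whole video without gaps or overlap.
    """
    if duration_s <= 0 or max_clip_s <= 0:
        raise ValueError("durations must be positive")

    clips = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_clip_s, duration_s)
        clips.append((start, end))
        start = end
    return clips


# A 25-minute video split into windows of at most 10 minutes.
print(plan_clips(1500.0))
# [(0.0, 600.0), (600.0, 1200.0), (1200.0, 1500.0)]
```

Fixed windows can cut through the middle of an action, which is the main reason a scene-aware splitter is preferable when continuity of motion matters.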

Figure 3 – 3D rendering – football players 3D reconstruction

Once processing is complete, the ScoreHMR model outputs several file types, including the 3D meshes of the tracked humans, vector keypoint data, tracked camera pose and orientation, and a video file with the generated meshes overlaid. The output data is stored in an Amazon S3 bucket, and the SageMaker endpoint publishes a notification to an Amazon SNS topic for listening subscribers. In this case, a Lambda function is invoked upon successful execution of the model and updates the metadata in the DynamoDB table with the output data. The generated 3D meshes and keypoint data can then be used in any 3D application to recreate the human action in the input video for downstream purposes.
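
The notification step can be sketched as a second Lambda handler subscribed to the SNS topic. Several things here are assumptions: the table name, the use of inferenceId as the item key, and the field names read from the notification message (invocationStatus, requestParameters, responseParameters, failureReason), which should be checked against a real notification before use. The optional table argument exists only so the function can be exercised without AWS credentials.

```python
import json

TABLE_NAME = "hmr-jobs"  # hypothetical DynamoDB table name


def parse_notifications(sns_event: dict) -> list:
    """Extract job metadata from SNS records delivered to Lambda."""
    jobs = []
    for record in sns_event.get("Records", []):
        # The SNS message body is a JSON string.
        message = json.loads(record["Sns"]["Message"])
        # Field names below are assumed; verify them against a real message.
        jobs.append(
            {
                "inferenceId": message.get("inferenceId", "unknown"),
                "status": message.get("invocationStatus", "Unknown"),
                "inputLocation": message.get("requestParameters", {}).get("inputLocation"),
                "outputLocation": message.get("responseParameters", {}).get("outputLocation"),
                "failureReason": message.get("failureReason"),
            }
        )
    return jobs


def handler(event, context, table=None):
    """Record the outcome of each inference request in DynamoDB."""
    if table is None:
        import boto3  # imported lazily so parsing has no AWS dependency

        table = boto3.resource("dynamodb").Table(TABLE_NAME)

    jobs = parse_notifications(event)
    for job in jobs:
        # Drop missing fields so success and failure items stay compact.
        table.put_item(Item={k: v for k, v in job.items() if v is not None})
    return {"updated": len(jobs)}
```

Because the same parsing handles both outcomes, one function could serve the success and error topics, although separate functions keep permissions and alerting easier to reason about.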

Solution Overview

Figure 4 – AWS Reference Architecture

The scalable Human Mesh Recovery pipeline reconstructs 3D human pose and shape from video data. At its core, this solution uses the Score-Guided Human Mesh Recovery (ScoreHMR) model, a state-of-the-art approach for solving inverse problems in 3D human mesh recovery. Built on AWS serverless architecture, the pipeline integrates several AWS services, including AWS Lambda, Amazon S3, Amazon DynamoDB, and Amazon SageMaker AI. This serverless architecture enables the pipeline to handle bursts of incoming traffic and scale compute resources out or in automatically based on demand.

AWS Web Application Firewall (AWS WAF) protects the application from common web exploits and bots that can affect availability, compromise security, or consume excessive resources.
Amazon Cognito adds user access controls to this service, handling the sign-in and sign-out processes. Once signed in, a user can be authorized to make requests to the backend.
Amazon API Gateway is configured to act as the front door to the backend app. The API routes user requests to access data and assets.
AWS Lambda functions route queries based on request parameters and perform backend operations.
Amazon S3 is used as an ingestion data source storing the raw video and image data.
When a new file is uploaded to Amazon S3, an event notification is sent to Amazon SQS to queue Lambda invocations.
The Invoke SageMaker Endpoint Lambda function is triggered and makes an inference request to the Amazon SageMaker AI asynchronous endpoint.
Amazon SageMaker AI hosts the ScoreHMR model and makes it available using an asynchronous endpoint. SageMaker manages the infrastructure for running this AI model on AWS.
On success, the SageMaker endpoint publishes to an Amazon SNS topic, which invokes an AWS Lambda function with a success message. This function also updates the metadata in Amazon DynamoDB to record the successful model invocation.
On failure, the SageMaker endpoint publishes to an Amazon SNS topic, which invokes an AWS Lambda function with an error message. This function also updates the metadata in Amazon DynamoDB to record the failed model invocation.
AWS Identity and Access Management (AWS IAM) securely manages identities and access to AWS services and resources.
Amazon CloudWatch provides monitoring, logging, and observability for resources.
AWS X-Ray provides a complete view of requests traced throughout the application.

By leveraging the scalability, performance, and cost-effectiveness of AWS services, this implementation of the scalable Human Mesh Recovery pipeline can handle large-scale video processing workloads efficiently, making it suitable for a wide range of applications that require accurate 3D human mesh recovery.

Future Possibilities

The ability to accurately generate 3D humans from image or video data has immense potential across a wide range of industries. In entertainment and gaming, a scalable pipeline for Human Mesh Recovery could be used to create realistic human animations that enhance user experiences. In the world of sports, this pipeline could revolutionize athlete training and performance analysis by providing detailed 3D representations of movements, allowing coaches and trainers to identify areas for improvement. This technology could help optimize training regimens, improving athlete performance and helping prevent injuries. The applications extend even further into domains like healthcare, where monitoring patient movements can aid in rehabilitation and remote care.

Figure 5 – Processed Video – group breakdancing 3D reconstruction

The integration of powerful AWS cloud services with cutting-edge AI models, such as ScoreHMR, enables the creation of a robust automated solution for 3D human mesh animation. By seamlessly merging state-of-the-art AI technologies and the scalability of the AWS platform into streamlined pipelines, the intricate process of 3D animation becomes more accessible and efficient. This automated pipeline can prove invaluable across diverse industries that require human motion analysis, including entertainment, sports, fashion, and others. It offers the potential to optimize workflows and deliver high-quality, scalable results, regardless of the project’s scope or complexity.

Figure 6 – Processed video – basketball players 3D reconstruction

Ready to get started with your own 3D human mesh animation pipeline? Dive into the Amazon SageMaker AI documentation to learn more about asynchronous AI workflows and ScoreHMR resources to begin building your solution today!

AWS Spatial Computing Blog
