AWS for M&E Blog
Media localization pipeline with voice synthesis and lip synchronization
Media localization is a multi-step process involving translation, voice acting, and cultural adaptation, and it can be time-consuming and costly. With AI, businesses can now create realistic dubbed voices and synchronized lip movements in any language, at scale, unlocking new markets and making content more accessible worldwide.
In our hyper-connected world, businesses are racing to grab a share of a global localization market estimated to be worth a staggering $5.7 billion by 2030. Traditional localization is slow and costly, but solutions like the one in this post, which use AI speech synthesis and lip synchronization, are revolutionizing the industry.
Companies can approach AI-powered dubbing in multiple ways. While some utilize open-source or in-house AI models to synthesize speech and create realistic lip sync animations, others partner with specialized providers like Deepdub Ltd. and Flawless, who offer ready-to-use solutions. Regardless of the approach, these AI technologies preserve the original intent and tone of performances while matching spoken audio across any language. These advanced capabilities not only improve the viewing experience for global audiences, but also reduce costs and speed to market compared to traditional dubbing methods.
We will focus on leveraging two open-source models:
- Voice synthesis using Tortoise-TTS
- Lip syncing using Video-Retalking
We’ll share an end-to-end solution that orchestrates the steps involved in the media localization workflow using a serverless architecture. We will also explore hosting techniques using Amazon SageMaker AI that maximize operational efficiency and optimize cost.
Architecture
The solution is built using Amazon Web Services (AWS) services focused on the following main components:
- Amazon Simple Storage Service (Amazon S3): As the cloud storage for the videos on which media localization is applied.
- Amazon EventBridge: Provides an Amazon S3 event-driven trigger that starts the media localization pipeline execution (a minimal example follows this list).
- AWS Step Functions: A serverless pipeline used to orchestrate the media localization steps. These steps include:
- AWS Lambda: Used for media preprocessing.
- Amazon Transcribe: Extracts the transcription from the original video.
- Amazon Translate: Translates the transcription from the source language to the target language.
- Amazon SageMaker AI: Applies machine learning (ML) voice synthesis and lip sync effects through asynchronous inference endpoints. SageMaker asynchronous endpoints can scale down to zero when not actively used, which optimizes hosting cost.
- Uploading the finalized media content to Amazon S3 for consumption.
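As a minimal illustration of the event-driven trigger described above, the following sketch creates an EventBridge rule that matches uploads of the pipeline configuration file to the input bucket and routes matching events to the Step Functions state machine. The bucket name, state machine ARN, IAM role, and file suffix are hypothetical placeholders, not values defined by the solution.

```python
import json

import boto3

events = boto3.client("events")

# Hypothetical names -- replace with your own bucket, state machine, and IAM role
BUCKET = "media-localization-input"
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:media-localization"
EVENTS_ROLE_ARN = "arn:aws:iam::123456789012:role/eventbridge-invoke-stepfunctions"

# Match S3 "Object Created" events for the pipeline configuration file.
# The bucket must have EventBridge notifications enabled.
rule_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": [BUCKET]},
        "object": {"key": [{"suffix": "pipeline-config.json"}]},
    },
}

events.put_rule(
    Name="media-localization-config-upload",
    EventPattern=json.dumps(rule_pattern),
    State="ENABLED",
)

# Route matching events to the Step Functions state machine
events.put_targets(
    Rule="media-localization-config-upload",
    Targets=[
        {
            "Id": "StartLocalizationPipeline",
            "Arn": STATE_MACHINE_ARN,
            "RoleArn": EVENTS_ROLE_ARN,
        }
    ],
)
```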
The following architecture diagram depicts our workflow in greater detail.
Solution walkthrough
In the following section, we walk through the voice synthesis and lip sync workflow at a high level:
- A user uploads raw video content that requires voice synthesis and lip sync to a specified S3 bucket.
- Alongside the video content, the user creates a pipeline configuration file and uploads it to the same S3 bucket location as the video content. The pipeline configuration serves two purposes. First, its upload triggers the event that initiates the Step Functions workflow for localization. Second, it provides the parameters used by the individual steps within the workflow to process the media data, such as the target language and the S3 locations used by each step (an illustrative sample is sketched after this walkthrough).
- A Step Functions task is triggered to extract the transcription from the given media using Amazon Transcribe. After the transcription job is started, the workflow polls the job status until it completes (a minimal sketch of the transcription and translation calls follows this walkthrough).
- Two steps are then triggered in parallel:
- A language translation job using Amazon Translate. The service translates the transcription returned from the previous step into the target language specified in the pipeline configuration file.
- A Lambda function extracts audio samples from the original media, then uploads them to the specified S3 bucket location. The extracted audio samples are used as reference audio to help guide the voice synthesis model during audio generation. The goal is to leverage the reference audio to generate voices whose tone and intonation match the original speaker.
- A Step Functions task is triggered to synthesize the voice for the transcription translated in the previous step. Voice synthesis uses an open-source AI model called Tortoise-TTS. Given the time needed to generate a new voice with good quality, the model is hosted on AWS as an Amazon SageMaker Asynchronous Inference endpoint, which avoids the 60-second timeout of real-time invocations (a minimal invocation and scale-to-zero sketch follows this walkthrough). The generated voice is stored in the S3 bucket location specified in the pipeline configuration file.
- By default, the Tortoise-TTS model supports voice synthesis for English speakers. Synthesizing voices in languages other than English requires fine-tuning the model to adapt it to the target language. We provide sample code showing how to fine-tune a Tortoise-TTS model in the GitHub repository.
- In this final step, the lip sync process is triggered to generate new media content from the original video and the voice generated in the previous step. The lip sync model is an open-source AI model called Video-Retalking. As with the voice synthesis step, the model is hosted as an Amazon SageMaker Asynchronous Inference endpoint to accommodate the long processing time of video generation. The new content is stored in the S3 bucket location specified in the configuration file.
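The exact schema of the pipeline configuration file is defined in the solution’s repository; the snippet below is only an illustrative sketch with hypothetical parameter names, showing how a configuration covering the target language and the S3 locations used by each step might be created and uploaded alongside the video.

```python
import json

import boto3

s3 = boto3.client("s3")

BUCKET = "media-localization-input"  # hypothetical bucket name

# Illustrative parameters only -- the real schema is defined in the solution repository
pipeline_config = {
    "video_key": "incoming/product-launch.mp4",
    "source_language": "en-US",          # language used by the transcription job
    "target_language": "es",             # language the transcription is translated into
    "reference_audio_prefix": "reference-audio/product-launch/",
    "synthesized_voice_prefix": "synthesized-voice/product-launch/",
    "output_video_prefix": "localized-video/product-launch/",
}

# Uploading the configuration file triggers the EventBridge rule that starts the pipeline
s3.put_object(
    Bucket=BUCKET,
    Key="incoming/pipeline-config.json",
    Body=json.dumps(pipeline_config).encode("utf-8"),
    ContentType="application/json",
)
```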
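In the deployed solution, the transcription and translation steps are driven by the state machine and its Lambda functions. The sketch below only illustrates the underlying Amazon Transcribe and Amazon Translate API calls, including the polling pattern used while the transcription job runs; the job, bucket, and object names are hypothetical.

```python
import time

import boto3

transcribe = boto3.client("transcribe")
translate = boto3.client("translate")

BUCKET = "media-localization-input"        # hypothetical bucket name
JOB_NAME = "product-launch-transcription"  # hypothetical job name

# Start the transcription job on the uploaded video
transcribe.start_transcription_job(
    TranscriptionJobName=JOB_NAME,
    Media={"MediaFileUri": f"s3://{BUCKET}/incoming/product-launch.mp4"},
    MediaFormat="mp4",
    LanguageCode="en-US",
    OutputBucketName=BUCKET,
    OutputKey="transcripts/product-launch.json",
)

# Poll until the job finishes (the state machine does this with a Wait/Choice loop)
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName=JOB_NAME)
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(15)

# Translate the transcript text into the target language from the configuration file
transcript_text = "Welcome to our product launch."  # parsed from the transcript JSON in S3
result = translate.translate_text(
    Text=transcript_text,
    SourceLanguageCode="en",
    TargetLanguageCode="es",
)
translated_text = result["TranslatedText"]
```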
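Finally, the sketch below shows the two SageMaker AI mechanics the walkthrough relies on: invoking an asynchronous inference endpoint with an S3 input payload, and registering a target-tracking scaling policy that lets the endpoint scale down to zero instances when the request backlog is empty. The endpoint and bucket names are hypothetical, and the capacity and cooldown values are illustrative only.

```python
import boto3

runtime = boto3.client("sagemaker-runtime")
autoscaling = boto3.client("application-autoscaling")

ENDPOINT_NAME = "tortoise-tts-async"  # hypothetical asynchronous endpoint name
BUCKET = "media-localization-input"   # hypothetical bucket name

# Invoke the asynchronous endpoint: the request payload and the result both live in S3,
# so long-running jobs are not limited by the 60-second real-time invocation timeout.
response = runtime.invoke_endpoint_async(
    EndpointName=ENDPOINT_NAME,
    InputLocation=f"s3://{BUCKET}/async-requests/product-launch.json",
    ContentType="application/json",
)
print("Result will be written to:", response["OutputLocation"])

# Allow the endpoint variant to scale between 0 and 2 instances
resource_id = f"endpoint/{ENDPOINT_NAME}/variant/AllTraffic"
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=2,
)

# Scale based on the approximate number of queued requests per instance
autoscaling.put_scaling_policy(
    PolicyName="backlog-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": ENDPOINT_NAME}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```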
Demo video
Finally, we’ve provided a demo video showcasing our solution approach. The video contains a few example clips with generated voice synthesis and lip sync in Spanish, French, and Japanese.
Limitations
While the lip sync solution can achieve high-quality results, there are some known limitations based on our experimentation and observations.
- Although the lip sync model works well for many videos, it can still produce noticeable artifacts in some cases. These artifacts are a side effect of an intermediate process that creates a canonical video of the original speaker to enhance consistency in video generation.
- The model can introduce artifacts in some extreme poses, such as when the subject is not facing the camera, or when the subject moves their head rapidly in the video.
- Since the video generation process is performed in a frame-by-frame fashion, there could be small temporal jittering and flashing introduced in the final results.
- When using a fine-tuned voice synthesis model, the quality of the voice synthesis depends on the quality of the data used in the fine-tuning process. General rules of thumb for improving the quality are:
- Use the cleanest voice samples available.
- Use at least 10 hours of voice samples for model fine-tuning.
- The media localization pipeline currently supports lip sync for a single speaker in the video. While the lip sync model supports input videos with multiple people, the transcription job expects a single speaker in the video. Additionally, the pipeline only supports a single target language. Supporting multiple speakers would require speaker diarization (see the sketch after this list) and using a different synthetic voice in the target language for each speaker.
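As a pointer for that multi-speaker extension, Amazon Transcribe can label individual speakers when diarization is enabled on the transcription job. The sketch below uses hypothetical job and bucket names and shows only the relevant settings; it is not part of the current pipeline.

```python
import boto3

transcribe = boto3.client("transcribe")

# Enable speaker diarization so each segment in the transcript carries a speaker label,
# which a multi-speaker pipeline could use to assign a distinct synthetic voice per speaker.
transcribe.start_transcription_job(
    TranscriptionJobName="multi-speaker-transcription",  # hypothetical job name
    Media={"MediaFileUri": "s3://media-localization-input/incoming/panel-discussion.mp4"},
    MediaFormat="mp4",
    LanguageCode="en-US",
    Settings={
        "ShowSpeakerLabels": True,
        "MaxSpeakerLabels": 4,  # upper bound on the number of distinct speakers to identify
    },
)
```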
Conclusion
We demonstrated an end-to-end machine learning workflow on AWS that transforms original media into localized content in the target language using ML voice synthesis and lip-syncing techniques. We started by uploading original content into Amazon S3, then used AWS Step Functions to orchestrate the audio transcription and translation steps with AWS ML services.
We also demonstrated how to host the open-source models for voice synthesis and lip syncing using Amazon SageMaker AI. The entire process is built on a serverless design. Companies can now create localized content at scale, tailoring their messaging and branding to specific regions and cultural contexts.
Contact an AWS Representative to learn how we can help accelerate your business.
Further reading
- Unlock the power of spoken words with automatic speech recognition using Amazon Transcribe.
- Break language barriers with neural machine translation using Amazon Translate.
- An example of deploying a scalable and cost-optimized asynchronous endpoint using Amazon SageMaker AI.