AWS Spatial Computing Blog
Live Translations in the Metaverse
BUILD, DELIVER, MANAGE BLOG SERIES: DELIVER
Imagine putting on your Virtual Reality (VR) headset and communicating with people from around the world, who natively speak French, Japanese and Thai, without a human translator. What if a fledgling start-up could now easily expand their product across borders and into new geographical markets by offering fluid, accurate, live translations in the metaverse across multiple domains like customer support and sales? What happens to your business when you are no longer bound by distance and language?
It’s common today to have virtual meetings with international teams and customers who speak languages ranging from Thai to Hindi to German. Whether the meetings are internal or external, meaning frequently gets lost in complex discussions. Language barriers hinder communication even in the metaverse, where human sight and hearing can be augmented beyond biological constraints.
In this blog post, you will build an application that stitches together three fully managed AWS services: Amazon Transcribe, Amazon Translate, and Amazon Polly. The application is a near real-time, speech-to-speech translation solution that quickly converts a speaker’s live voice into spoken, accurate audio in a target language, even if you don’t have any machine learning expertise.
Solution Overview
The translator solution comprises a simple Unity project that leverages the power of three fully managed Amazon Web Services (AWS) machine learning services. The solution uses the AWS SDK for .NET, as well as the Unity API for asynchronous audio streaming. This project was developed as a VR application for the Meta Quest 2 headset.
The following diagram depicts an overview of this solution.
Here is how the solution works:
- Through the VR Application, the user authenticates using Amazon Cognito. Amazon Cognito lets you add user sign-up, sign-in, and access control to your web and mobile apps quickly and easily. The identity credentials are returned to the VR Application and used to call Amazon Transcribe, Amazon Translate, and Amazon Polly.
- Audio is ingested by the VR Application using the built-in microphone on the VR headset. The spoken audio data is sent to Amazon Transcribe, which converts it to text.
- Amazon Translate receives the text data and translates it to a target language specified by the user.
- The translated text data is sent to Amazon Polly, where it is converted to speech.
- The audio from Amazon Polly is played back by the VR Application allowing the user to hear the translation.
Let’s dive deeper into this solution.
Amazon Transcribe – Streaming Speech to Text
The first service you will use in the stack is Amazon Transcribe, a fully managed service that converts speech to text. Amazon Transcribe has flexible ingestion methods, batch or streaming, because it accepts either stored audio files or streaming audio data. In this post, you will use Transcribing streaming audio, which uses the WebSocket protocol to stream live audio and receive live transcriptions. Currently, these are the supported languages and language-specific features; since you will be working with real-time streaming audio, your application will be able to leverage 12 different languages for streaming audio in and out.
Amazon Transcribe streaming requires Signing AWS API requests to Amazon Transcribe; the service accepts audio data and returns text transcriptions. This text can be visually displayed in the VR application’s rendered UI and passed as input to Amazon Translate.
Amazon Translate: State-of-the-art, fully managed translation API
Next in the stack is Amazon Translate, a translation service that delivers fast, high-quality, affordable, and customizable language translation. As of June 2022, Amazon Translate supports translation across 75 languages, with new language pairs and improvements being made constantly. Amazon Translate uses deep learning models hosted on a highly scalable and resilient AWS Cloud architecture to deliver accurate translations either in real time or batched, depending on your use case.
Using Amazon Translate requires no management of underlying infrastructure and no ML skills. Amazon Translate has several features, such as customizing your translations with custom terminology to improve the handling of industry-specific terms. For more information on Amazon Translate service limits, refer to Guidelines and limits.
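As a minimal sketch of the translation step, the snippet below calls Amazon Translate through the AWS SDK for .NET. The region, language codes, and the `credentials` parameter (the Cognito identity credentials obtained at sign-in) are illustrative assumptions, not requirements of the service.

```csharp
using Amazon;
using Amazon.Runtime;
using Amazon.Translate;
using Amazon.Translate.Model;
using System.Threading.Tasks;

public static class Translator
{
    // Translates a single transcript string into the target language.
    // `credentials` is assumed to come from the Cognito Identity Pool.
    public static async Task<string> TranslateAsync(
        AWSCredentials credentials, string text,
        string sourceLang, string targetLang)
    {
        using var client = new AmazonTranslateClient(credentials, RegionEndpoint.USEast1);
        var response = await client.TranslateTextAsync(new TranslateTextRequest
        {
            Text = text,
            SourceLanguageCode = sourceLang, // e.g. "en"
            TargetLanguageCode = targetLang  // e.g. "fr"
        });
        return response.TranslatedText;
    }
}
```

In the VR application, this call would be awaited off the main thread and the result handed to the Amazon Polly step for playback.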
After the application receives the translated text in the target language, it sends the translated text to Amazon Polly for immediate translated audio playback.
Amazon Polly: Fully managed text-to-speech API
Finally, you send the translated text to Amazon Polly, a fully managed text-to-speech service that can either stream back lifelike audio responses for immediate playback or batch and save them in Amazon Simple Storage Service (Amazon S3) for later use. You can control various aspects of speech, such as pronunciation, volume, pitch, speech rate, and more using standardized Speech Synthesis Markup Language (SSML).
You can synthesize speech for certain Amazon Polly Neural voices, for example using the Newscaster style to make them sound like a TV or radio newscaster. You can also detect when specific words or sentences in the text are being spoken based on the metadata included in the audio stream. This allows the developer to synchronize graphical highlighting and animations, such as the lip movements of an avatar, with the synthesized speech.
You can change the pronunciation of particular words, such as company names, acronyms, or neologisms, for example “P!nk,” “ROTFL,” or “C’est la vie” (when spoken in a non-French voice), using custom lexicons.
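The playback step can be sketched as follows, again with the AWS SDK for .NET. The voice (a French neural voice), region, and output format are assumptions for illustration; the project would pick the voice that matches the user’s chosen target language.

```csharp
using Amazon;
using Amazon.Polly;
using Amazon.Polly.Model;
using Amazon.Runtime;
using System.IO;
using System.Threading.Tasks;

public static class Speaker
{
    // Synthesizes translated text to speech and returns the raw audio
    // bytes, which the Unity application can decode and play back.
    public static async Task<byte[]> SynthesizeAsync(
        AWSCredentials credentials, string translatedText)
    {
        using var client = new AmazonPollyClient(credentials, RegionEndpoint.USEast1);
        var response = await client.SynthesizeSpeechAsync(new SynthesizeSpeechRequest
        {
            Text = translatedText,
            VoiceId = VoiceId.Lea,        // illustrative: a French neural voice
            Engine = Engine.Neural,
            OutputFormat = OutputFormat.Mp3
        });

        using var buffer = new MemoryStream();
        await response.AudioStream.CopyToAsync(buffer);
        return buffer.ToArray();
    }
}
```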
Project Setup
Prerequisites
- An AWS account with permissions to use Amazon Transcribe, Amazon Translate, Amazon Polly, and Amazon Cognito.
- A local machine with Unity 2021+ with Android build modules installed.
- Intermediate level knowledge of C# and Unity development.
- Optionally, a VR headset that is configured for development. This blog assumes you will use a Meta Quest 2.
You will need a Unity supported device with a microphone, speaker and reliable internet connection. A modern laptop will work for development and testing if you do not have a VR headset available.
For reference, this project was built with Unity 2021.3.2f1 using the Universal Render Pipeline (URP) and the Unity XR Interaction Toolkit package for VR locomotion and interaction. To learn more about VR development with Unity, please reference the Unity documentation: Getting started with VR development in Unity.
AWS Back-End
For authorization and authentication of service calls, the application uses an Amazon Cognito User Pool and Identity Pool. The Cognito User Pool is used as a directory that provides sign-up and sign-in options for the application. The Identity Pool grants temporary access to the AWS services, ensuring that they are called only by an authorized identity. As always, follow the principle of least privilege when assigning IAM policies to an IAM user or role.
- Set up a Cognito User Pool. This will allow for users to sign up and sign into their account, using email or username. It is recommended to toggle on the “strong passwords only” settings.
- Along with the User Pool, add an App Client that enables SRP (Secure Remote Password) based password verification in the authentication flow.
- Create a Cognito Identity Pool that points to the User Pool as an identity provider.
- For users to access Amazon Transcribe, Amazon Translate, and Amazon Polly, authorized users in the Identity Pool should assume an IAM role that includes the following IAM policy document.
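A minimal policy for the authenticated role might look like the following. The actions shown match the three service calls the application makes; scope the `Resource` element more tightly where your use case allows.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "transcribe:StartStreamTranscriptionWebSocket",
                "translate:TranslateText",
                "polly:SynthesizeSpeech"
            ],
            "Resource": "*"
        }
    ]
}
```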
Note, the Cognito User Pool ID, User Pool App Client ID, and Identity Pool ID will be required in the Unity application.
Unity Application
Moving onto the Unity side, you will need to create a new Unity project. For this project, you will use Unity 2021.3.2f1 using the Universal Render Pipeline (URP). Once your Unity project is open, follow these steps to prepare and build the application.
- Add in the proper AWS SDK for .NET/C# DLLs by downloading them from the AWS documentation link. API compatibility with .NET Framework (.NET 4.x) is required. Follow this developer guide for detailed instructions on downloading the DLLs: Special considerations for Unity support
- Copy these DLLs to the Assets/Plugins folder in the Unity project.
- Copy the following DLLs alongside the AWS SDK files. Find the download links on the page, Special considerations for Unity support.
- In your Assets directory, create a file called link.xml and copy in the following, verifying that the list of SDKs matches the DLLs you copied in the previous steps.
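An illustrative link.xml is shown below; it prevents the Unity linker from stripping the AWS SDK assemblies. The exact assembly list is an assumption here and must match the DLLs you actually copied into Assets/Plugins.

```xml
<linker>
    <assembly fullname="AWSSDK.Core" preserve="all"/>
    <assembly fullname="AWSSDK.CognitoIdentity" preserve="all"/>
    <assembly fullname="AWSSDK.CognitoIdentityProvider" preserve="all"/>
    <assembly fullname="AWSSDK.SecurityToken" preserve="all"/>
    <assembly fullname="AWSSDK.Translate" preserve="all"/>
    <assembly fullname="AWSSDK.Polly" preserve="all"/>
</linker>
```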
Amazon Cognito
In this project, you will use the Amazon CognitoAuthentication extension library and the Unity UI to build the client-side user authorization process. Since the Cognito User Pool was set up to allow SRP-based authentication, make sure your client-side authorization flow initiates an SRP request. To make calls to Amazon Transcribe, Amazon Translate, and Amazon Polly, you will need to store references to the user’s identity credentials once the user has successfully signed in.
For a more in-depth understanding of working with the Amazon CognitoAuthentication extension library, please reference the documentation for .NET to connect the user to the AWS backend: Amazon CognitoAuthentication extension library examples
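The SRP sign-in flow with the CognitoAuthentication extension library can be sketched as below. The pool ID, app client ID, and region are placeholders you replace with the values from your own Cognito setup.

```csharp
using Amazon;
using Amazon.CognitoIdentityProvider;
using Amazon.Extensions.CognitoAuthentication;
using Amazon.Runtime;
using System.Threading.Tasks;

public static class SignIn
{
    // Initiates the SRP handshake against the Cognito User Pool.
    // On success, the returned AuthFlowResponse carries the ID, access,
    // and refresh tokens in its AuthenticationResult.
    public static async Task<AuthFlowResponse> SignInAsync(
        string userPoolId, string appClientId,
        string username, string password)
    {
        var provider = new AmazonCognitoIdentityProviderClient(
            new AnonymousAWSCredentials(), RegionEndpoint.USEast1);
        var userPool = new CognitoUserPool(userPoolId, appClientId, provider);
        var user = new CognitoUser(username, appClientId, userPool, provider);

        return await user.StartWithSrpAuthAsync(new InitiateSrpAuthRequest
        {
            Password = password
        });
    }
}
```

After a successful sign-in, exchange the ID token through the Identity Pool for the temporary AWS credentials used by the three service clients.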
Amazon Transcribe
To produce real-time transcriptions, Amazon Transcribe streaming requires manual setup to generate a signed message, using AWS Signature Version 4, in event stream encoding. The following steps summarize the process for creating an Amazon Transcribe streaming request; reference the documentation for a more in-depth understanding of additional requirements: Setting up a WebSocket stream.
- To start the streaming session, create a pre-signed URL that includes the operation and parameters as a canonical request.
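The heart of pre-signing that URL is the Signature Version 4 signing-key derivation, an HMAC-SHA256 chain over the date, region, and service. A self-contained sketch (the full canonical request and query-string assembly are omitted here) looks like this:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class SigV4
{
    static byte[] HmacSha256(byte[] key, string data)
    {
        using var hmac = new HMACSHA256(key);
        return hmac.ComputeHash(Encoding.UTF8.GetBytes(data));
    }

    // Derives the SigV4 signing key: chained HMACs over the date stamp
    // (yyyyMMdd), region (e.g. "us-east-1"), and service ("transcribe"
    // for the streaming WebSocket endpoint).
    public static byte[] GetSigningKey(string secretKey, string dateStamp,
                                       string region, string service)
    {
        byte[] kDate    = HmacSha256(Encoding.UTF8.GetBytes("AWS4" + secretKey), dateStamp);
        byte[] kRegion  = HmacSha256(kDate, region);
        byte[] kService = HmacSha256(kRegion, service);
        return HmacSha256(kService, "aws4_request");
    }

    // Hex-encodes a hash for use in the X-Amz-Signature query parameter.
    public static string ToHex(byte[] bytes) =>
        BitConverter.ToString(bytes).Replace("-", "").ToLowerInvariant();
}
```

The final signature is the hex HMAC of the string-to-sign under this key, appended to the WebSocket URL as `X-Amz-Signature`.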
Source Code
See our GitHub repository here: https://github.com/aws-samples/spatial-real-time-translation