AWS Machine Learning Blog

Create video subtitles with translation using machine learning

Businesses around the globe require fast and reliable ways to transcribe audio or video files, often in multiple languages. This content can range from news broadcasts, call center interactions, and job interviews to product demonstrations and even court proceedings. The traditional transcription process is both expensive and lengthy, often involving dedicated staff or services and a high degree of manual effort. That effort is compounded when a multi-language transcript is required, often leaving customers to over-dub the original content with a new audio track.

Natural language processing (NLP) and translation are some of the hardest problems for computers to solve because of the many contextual and idiomatic elements of speech. Historically, this has required a native speaker of both the source and target language to perform a translation. The recently introduced neural-network-based approach to machine translation has brought unprecedented translation accuracy and fluency. While it is not perfect yet, it is now a viable solution for many more use cases than ever before, including this one.

At AWS re:Invent 2017, Andy Jassy introduced several new services in the AI/ML category targeted at solving these problems. To demonstrate some practical ways to use these services to solve the subtitling and translation problem, this blog post focuses on using Amazon Transcribe, Amazon Translate, and Amazon Polly to transcribe, translate, and overdub subtitles and alternate voice tracks onto the original content.

Let’s take a look at the final product, then we’ll provide details of our solution.

High-level approaches to solve the problem

Generally speaking, there are four high-level steps required to create subtitled, translated media:

  1. Obtain a transcript of the content.
  2. Translate the transcript into the target language.
  3. Create the corresponding subtitles from the transcript or translation and create a subtitles file.
  4. Combine all of those pieces into a finished product.

The following diagram illustrates the process in detail:

All of the source code files needed to perform these steps are included in this blog post and can be downloaded from here.

AWS services overview

We’ll use the following services for our use case. I’ll describe the key methods and attributes needed to solve our problem.

Amazon Transcribe

Amazon Transcribe uses advanced deep learning technologies to recognize speech in audio files or video files and transcribe them into text. The output is stored in an Amazon S3 bucket that you specify or a bucket that is managed by Amazon Transcribe. The resulting transcription is a JSON file containing the full transcription of the text that includes word-level time stamping, confidence level, and punctuation. Content creators can easily use time-stamp information to sync the transcript with existing content to produce subtitles for the videos.

Amazon Transcribe can use custom vocabularies to help improve the accuracy of the transcription. You simply need to provide the customized words, jargon, and phrases to Amazon Transcribe and tell the job to use that vocabulary using the API.
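
For illustration, here is a minimal sketch of creating such a vocabulary with the Boto3 SDK. The vocabulary name matches the one used by the transcription job later in this post; the phrases are placeholders for your own terminology:

import boto3

transcribe = boto3.client('transcribe')

# Create a custom vocabulary of domain-specific terms.  The phrases below are
# examples only; supply the words and jargon found in your own content.
transcribe.create_vocabulary(
    VocabularyName='MyVocabulary',
    LanguageCode='en-US',
    Phrases=['Polly', 'Lex', 'Transcribe']
)

# The vocabulary must reach the READY state before a job can reference it
print( transcribe.get_vocabulary( VocabularyName='MyVocabulary' )["VocabularyState"] )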

In this blog post, I am using the API to launch a transcription job, set a custom vocabulary, and then use the result of the job to obtain the transcript. As you’ll see, using the word-level metadata will help us to assemble the subtitles. For more information, see the Amazon Transcribe documentation.

Amazon Translate

Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. It uses deep learning models to deliver more accurate and more natural-sounding translations. The Amazon Translate SDK includes TranslateText, an API action that allows us to get translations in real time.

In this blog post, I am using the Amazon Translate API to translate the text transcription into Spanish and German. For more information, see the Amazon Translate documentation.

Amazon Polly

Amazon Polly is a cloud service that leverages AI/ML technology to synthesize text into lifelike speech. You can simplify using Amazon Polly in your applications with an API tailored to your programming language or platform, such as Java, Python, Go, and others. Amazon Polly provides the ability for you to do the following:

  • Synthesize speech into an audio stream, and cache and replay the speech output at no additional cost
  • Choose from dozens of languages and a wide selection of natural-sounding male and female voices
  • Modify your speech output using lexicons and SSML tags to control attributes such as pronunciation, volume, pitch, and speech rate

In this blog, I am using the Amazon Polly API to voice the translation in Spanish and German. I’ll save the produced audio stream as an MP3 file for inclusion as the audio track in our movie. For more information, see the Amazon Polly documentation.
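
Before we use it in earnest later in this post, here is a minimal, illustrative call to the Polly API; the voice and text are arbitrary choices:

import boto3
from contextlib import closing

polly = boto3.client('polly')

# Ask Amazon Polly to speak a sentence with one of its English voices
response = polly.synthesize_speech( OutputFormat="mp3", Text="Hello from Amazon Polly!", VoiceId="Joanna" )

# The synthesized speech is returned as a binary audio stream
with closing( response["AudioStream"] ) as stream:
    with open( "hello.mp3", "wb" ) as f:
        f.write( stream.read() )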

Preparing our development environment

To follow along with our example, you'll need to set up an operating system environment and an AWS account, download and install the tools we'll need, and, finally, install the source code.

Step 1: Establish an operating system environment

This example should run on any operating system that supports Python 2.7.15 and the ImageMagick tools. I used a Windows 10 environment running in Amazon WorkSpaces. Amazon WorkSpaces is a fully managed, secure, Desktop-as-a-Service solution powered by AWS. With Amazon WorkSpaces, you have the ability to change the virtual hardware configuration through the AWS Management Console. Early on in the process when my code was less efficient, I was running the Power configuration, but as the code stabilized, I was able to reconfigure to use the Performance configuration.

Step 2: Establish an AWS account

You’ll need an Amazon Web Services account to be able to use the AWS services. The process for establishing an account can be found at https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/.

Step 3: Download and Install Python 2.7.15

Python is a programming language used by many developers for scripting batch programs, real time programs, big data, artificial intelligence, advanced analytics, and so on. There are a number of versions available. In this example, I used Python 2.7.15. Python can be downloaded from http://python.org for your target operating system environment. The process for downloading and installing will depend on your operating system, so follow the installation instructions found on the python.org site.

After Python is successfully installed, you can enter the following command to validate that it is correctly installed:

python --version

You should see something like the following:

Python 2.7.15

Step 4: Download and Install the AWS Command Line Interface

The AWS CLI for Windows is used so you can execute AWS commands in a Windows command shell. For example, to copy files to and from Amazon S3, an "aws s3" CLI command can be used. To install the AWS CLI, enter the following command:

pip install awscli

You should see something like the following:

Installing collected packages: awscli
Successfully installed awscli-1.15.43

To test that the command parser is working correctly, enter the following:

aws --version

You should see something like the following:

aws-cli/1.15.43 Python/2.7.15 Windows/10 botocore/1.10.43

Step 5: Configure the AWS CLI

Amazon Web Services takes security very seriously. No access to your account is available without the proper credentials and encrypted keys. For our programs to access AWS services, we’ll need to configure the CLI to be able to automatically sign in. It’s not good security practice to embed your AWS Identity and Access Management (IAM) access key and secret key into your program code. So, the AWS CLI provides a mechanism to associate CLI usage with an IAM user. For an explanation of IAM, and step-by-step instructions to configure the AWS CLI, see the documentation on configuring the CLI.
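
In short, after creating an IAM user with programmatic access, run the following command and supply the access key, secret key, default region, and output format when prompted:

aws configure

You should see prompts like the following:

AWS Access Key ID [None]: ...
AWS Secret Access Key [None]: ...
Default region name [None]: us-east-1
Default output format [None]: json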

Step 6: Download and install the AWS SDK for Python

The AWS SDK for Python (Boto3) is the Python SDK for access to many AWS services. It’s used in the various Python code examples included with this blog post. For more information, see the documentation for Boto 3.

Generally speaking, there are two methods for connecting to an AWS service with Boto3 that I used for this blog post. I recommend that you browse the Boto3 documentation and tutorial to get familiar with its use:

  • Resource – This method provides a direct way to connect to and use an Amazon service. For example, to use Amazon S3, you would use the following code to establish the connection to the resource:
    import boto3
    
    # Let's use Amazon S3
    s3 = boto3.resource('s3') 
  • Client – This method provides a low-level connection to an underlying service. For example, to use an Amazon Simple Queue Service (Amazon SQS) queue, you would use the following code to establish the connection to the low-level client:
    import boto3
    
    # Create a low-level client with the service 
    sqs = boto3.client('sqs')
    

To install Boto3, enter the following command:

pip install boto3

You should see something like the following:

Installing collected packages: boto3
Successfully installed boto3-1.7.43

Step 7: Install “requests” package

The requests package is an easy-to-use package designed to access URIs from within Python. For information on working with requests, see the Python Quickstart. In our example, requests is used to connect to the signed URL created by Amazon Transcribe and to download the resulting transcript JSON data.

To install requests, enter the following command:

pip install requests

You should see something like the following:

Installing collected packages: requests
Successfully installed requests-2.19.1

Step 8: Install ImageMagick

ImageMagick is an open source package used to create, edit, compose, and convert bitmap images across a variety of image formats. MoviePy depends on ImageMagick to create the subtitles that overlay the original video.

Be aware that you need to pay special attention to the installation instructions for your platform found at the ImageMagick website. MoviePy will not work correctly if this is misconfigured.

For this blog post, I downloaded the latest version for Windows 10. To verify that ImageMagick is installed, enter the following command:

D:\Users\...\dev\aiml\MakeVideo>convert

If it is correctly installed and configured, you should see the following:

Version: ImageMagick 7.0.7-35 Q16 x64 2018-05-21 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2018 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php

...

FFmpeg is an open source package that is used to convert audio and video into a wide variety of formats. Both ImageMagick and MoviePy depend on FFmpeg and will typically download it if needed, so it should be installed along with ImageMagick. Note that FFmpeg will download required additional code and updates the first time that it is run in our blog post example code.
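
If FFmpeg is already on your PATH, you can verify it the same way as the other tools (if MoviePy manages its own copy through imageio, this command might not be available, which is fine):

ffmpeg -version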

Step 9: Install MoviePy

MoviePy is an open source Python module for video editing that can be used for basic operations (like cuts, concatenations, and title insertions), video compositing (non-linear editing), video processing, or to create advanced effects. It can read and write the most common video formats. Our example uses MoviePy for subtitling, audio processing (adding foreign language tracks), and compositing all of the pieces together into the final videos.

To install MoviePy, enter the following command:

pip install moviepy

MoviePy depends on a number of other libraries that should be automatically installed through this command. After all of them are installed, you should see something like the following:

Installing collected packages: moviepy
Running setup.py install for moviepy ... done
Successfully installed moviepy-0.2.3.5

Step 10: Test MoviePy

To ensure that all of the pieces are correctly installed, you might want to test MoviePy. For more information, review the short tutorial on the MoviePy site. In the tutorial, a small example program is provided that, with some minor modifications, you can use to test all of the pieces. I recommend that you take the time to read over the installation and tutorial documentation. The example code shown on the MoviePy site is:

# Import everything needed to edit video clips
from moviepy.editor import *

# Load myHolidays.mp4 and select the subclip 00:00:50 - 00:00:60
clip = VideoFileClip("myHolidays.mp4").subclip(50,60)

# Reduce the audio volume (volume x 0.8)
clip = clip.volumex(0.8)

# Generate a text clip. You can customize the font, color, etc.
txt_clip = TextClip("My Holidays 2013",fontsize=70,color='white')

# Say that you want it to appear 10s at the center of the screen
txt_clip = txt_clip.set_pos('center').set_duration(10)

# Overlay the text clip on the first video 
video = CompositeVideoClip([clip, txt_clip])

# Write the result to a file (many options available!)
video.write_videofile("myHolidays_edited.webm")

For testing purposes, replace the “myHolidays.mp4” file with your content. Make sure to adjust the subclip parameters if your content is not at least 60 seconds long. You can also modify the message displayed in the TextClip by changing “My Holidays 2013” to your desired message. Finally, you can modify the output video file name as required and you can also change the output type simply by specifying the extension. So, for example, instead of producing a .webm file, you can simply replace it with .mp4. See the MoviePy documentation for details.

Transcribing video using Amazon Transcribe

Our process begins by creating a transcription job with the original content. The following code is part of the main program that drives the process. Click here for the full source code.

This snippet creates the job, then loops while the status is IN_PROGRESS. After the batch job is complete, we get the transcript response and assign it to the variable "transcript" for further processing. If we reach the "getTranscript" call, then we have successfully transcribed the content.

# Create Transcription Job
response = createTranscribeJob( args.region, args.inbucket, args.infile )

# loop until the job successfully completes
print( "\n==> Transcription Job: " + response["TranscriptionJob"]["TranscriptionJobName"] + "\n\tIn Progress"),

while( response["TranscriptionJob"]["TranscriptionJobStatus"] == "IN_PROGRESS"):
    print( "."),
    time.sleep( 30 )
    response = getTranscriptionJobStatus( response["TranscriptionJob"]["TranscriptionJobName"] )

print( "\nJob Complete")
print( "\tStart Time: " + str(response["TranscriptionJob"]["CreationTime"]) )
print( "\tEnd Time: "  + str(response["TranscriptionJob"]["CompletionTime"]) )
print( "\tTranscript URI: " + str(response["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]) )

# Now get the transcript JSON from AWS Transcribe
transcript = getTranscript( str(response["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]) ) 
# print( "\n==> Transcript: \n" + transcript)

This snippet uses utility functions I created to place all of the Transcribe API calls in one Python file. The functions that follow leverage the Boto3 SDK for Python to invoke the Transcribe API. Detailed information about all of the API operations used can be found in the Transcribe documentation.

import boto3
import uuid
import requests

# purpose: Function to format the input parameters and invoke the Transcribe service
def createTranscribeJob( region, bucket, mediaFile ):

    # Set up the Transcribe client 
    transcribe = boto3.client('transcribe')
    
    # Set up the full uri for the bucket and media file
    mediaUri = "https://" + "s3-" + region + ".amazonaws.com/" + bucket + mediaFile 
    
    print( "Creating Job: " + "transcribe" + mediaFile + " for " + mediaUri )
    
    # Use the uuid functionality to generate a unique job name.  Otherwise, the Transcribe service will return an error
    response = transcribe.start_transcription_job( TranscriptionJobName="transcribe_" + uuid.uuid4().hex + "_" + mediaFile , \
        LanguageCode = "en-US", \
        MediaFormat = "mp4", \
        Media = { "MediaFileUri" : mediaUri }, \
        Settings = { "VocabularyName" : "MyVocabulary" } \
        )
    
    # return the response structure found in the Transcribe Documentation
    return response
    
    
# purpose: simply return the job status    
def getTranscriptionJobStatus( jobName ):
    transcribe = boto3.client('transcribe')
    
    response = transcribe.get_transcription_job( TranscriptionJobName=jobName )
    return response
    
# purpose: get and return the transcript structure given the provided uri
def getTranscript( transcriptURI ):
    # Get the resulting Transcription Job and store the JSON response in transcript
    result = requests.get( transcriptURI )

    return result.text

A deeper look at the response

The response fetched from the transcript URI is a JSON structure that contains a wealth of information about the content. Looking at an excerpt of the beginning of the response that follows, we can see several key items:

  • jobName
  • accountId
  • results – contains the full transcript, including the words and punctuation as a block of text:
    {
        “jobName":"MY_JOB","accountId":"1234567890",
        "results":{"transcripts":
            [{"transcript":"How about language ? You know, i talked earlier about the fact
             that last year we launched both polly and relax, but there are so many other things 
             the builders want to do with language... 
    

It also contains a very detailed structure for each word or punctuation element. For more information on the response structure, see the API Documentation.

... it."}],

"items":[
    {"start_time":"0.260","end_time":"0.480","alternatives":
        [{"confidence":"0.8977","content":"How"}],"type":"pronunciation"},
    {"start_time":"0.480","end_time":"0.750","alternatives":
        [{"confidence":"0.8431","content":"about"}],"type":"pronunciation"},
    {"start_time":"0.750","end_time":"1.440","alternatives":
        [{"confidence":"1.0000","content":"language"}],"type":"pronunciation"},
    {"alternatives":[{"content":"?"}],"type":"punctuation"},
    ...
    ]

Notice that each word element provides a start_time and an end_time. We’ll use this in later steps to create a subtitles file that exactly aligns subtitles with the audio track. Since Amazon Transcribe uses machine learning technologies to recognize words, it provides a confidence score for each identified word. If the content uses a lot of domain-specific terminology, then using a custom vocabulary will help to improve the accuracy of the transcription. Finally, the actual word spoken is found with the “content” tag.

The last tuple shown is for punctuation. Notice that the tuple doesn’t contain timing or confidence information. This is because it is inferred, and it is assumed to be associated with the previous tuple. We’ll need to address this as we process the response to prepare the subtitles file.
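
To make this concrete, the following small, illustrative loop walks the items list and prints each element with its timing, carrying the previous word's end time forward for punctuation. This mirrors the approach we take when building the subtitles later, and it assumes the variable "transcript" holds the JSON text returned by getTranscript:

import json

ts = json.loads( transcript )

lastEnd = "0.000"
for item in ts["results"]["items"]:
    if item["type"] == "pronunciation":
        # a spoken word carries its own timing information
        lastEnd = item["end_time"]
        print item["start_time"], item["end_time"], item["alternatives"][0]["content"]
    else:
        # punctuation has no timing, so associate it with the previous word
        print lastEnd, lastEnd, item["alternatives"][0]["content"]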

Translating a transcript into another language using Amazon Translate

Amazon Translate makes translating text to and from English simple. The following code snippet shows a utility function that invokes the Translate API:

import boto3
import json

...

translate = boto3.client(service_name='translate', region_name='us-east-1', use_ssl=True)

...

def translateTranscript( transcript, sourceLangCode, targetLangCode ):
    # Get the translation in the target language.  We want to do this first so that the translation is in the full context
    # of what is said vs. one phrase at a time.  This really matters in some languages

    # parse the transcript JSON into a Python structure
    ts = json.loads( transcript )

    # pull out the transcript text and put it in the txt variable
    txt = ts["results"]["transcripts"][0]["transcript"]
        
    # call Translate  with the text, source language code, and target language code.  The result is a JSON structure containing the 
    # translated text
    translation = translate.translate_text(Text=txt,SourceLanguageCode=sourceLangCode, TargetLanguageCode=targetLangCode)
    
    return translation
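
The structure returned by translate_text carries the translated text under the "TranslatedText" key, so a caller might use the function along these lines (language codes as used throughout this post):

# Translate the English transcript into Spanish and pull out the translated text
translation = translateTranscript( transcript, "en", "es" )
print( translation["TranslatedText"] )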

For more details about the Translate API, see the Translate Documentation.

Creating subtitle files

Before diving into the specifics of actually creating the subtitles, there are several design considerations we must think through.

Design consideration 1: How do we build a subtitles file?

MoviePy uses a standard subtitles file format known as “SubRip Subtitle” or SRT file. SRT files have the following format:

  • Sequence number followed by a new line
  • The start and end time of the subtitle to three decimal places followed by a new line
  • The text to be displayed on the screen followed by two new lines

Here is an example of the output from our code:

1
00:00:00,260 --> 00:00:02,899
How about language? You know, i talked earlier

2
00:00:02,899 --> 00:00:05,860
about the fact that last year we launched both Polly

...

The Amazon Transcribe JSON structure provides the timing for each word individually. To create the subtitle timing required for the SRT file, we need to first create phrases of about 10 words. Then we write them into the SRT file format, taking the start time from the first word in the phrase and the end time of the last word in the phrase to create the sequenced entry. The following function is used to create the phrases in preparation for writing them to the SRT file.

def getPhrasesFromTranscript( transcript ):

    # This function is intended to be called with the JSON structure output from the Transcribe service.  However,
    # if you only have the translation of the transcript, then you should call getPhrasesFromTranslation instead

    # Now create phrases from the translation
    ts = json.loads( transcript )
    items = ts['results']['items']
    
    #set up some variables for the first pass
    phrase =  newPhrase()
    phrases = []
    nPhrase = True
    x = 0
    c = 0

    print "==> Creating phrases from transcript..."

    for item in items:

        # if it is a new phrase, then get the start_time of the first item
        if nPhrase == True:
            if item["type"] == "pronunciation":
                phrase["start_time"] = getTimeCode( float(item["start_time"]) )
                nPhrase = False
            c += 1
        else:    
            # We need to determine whether this is a pronunciation or punctuation item.
            # Punctuation doesn't contain timing information, so we'll want
            # to set the end_time to whatever the last word in the phrase is.
            # Since we are reading through each word sequentially, we'll set 
            # the end_time if it is a word
            if item["type"] == "pronunciation":
                phrase["end_time"] = getTimeCode( float(item["end_time"]) )
                
        # in either case, append the word to the phrase...
        phrase["words"].append(item['alternatives'][0]["content"])
        x += 1
        
        # now add the phrase to the phrases, generate a new phrase, etc.
        if x == 10:
            #print c, phrase
            phrases.append(phrase)
            phrase = newPhrase()
            nPhrase = True
            x = 0

    # keep any partial phrase left over at the end of the transcript
    if len(phrase["words"]) > 0:
        phrases.append(phrase)

    return phrases

Note that the getTimeCode function is called two-thirds of the way through this snippet. Transcribe provides a start and end time based on the number of seconds that have passed since the beginning of the clip. However, the SRT format requires that time to be in an HH:MM:SS,MMM format. The following getTimeCode snippet does the conversion for us.

def getTimeCode( seconds ):
    # Format and return a string that contains the converted number of seconds in SRT format.
    # Note that the hours field is hard-coded to 00, so this works for content under one hour.
    t_hund = int(seconds % 1 * 1000)
    t_seconds = int( seconds )
    t_secs = ((float( t_seconds) / 60) % 1) * 60
    t_mins = int( t_seconds / 60 )
    return str( "%02d:%02d:%02d,%03d" % (00, t_mins, int(t_secs), t_hund ))
    
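The newPhrase and getPhraseText helpers referenced above, as well as the code that writes the phrases out in the SRT format, are part of the full source download. Minimal sketches consistent with how they are used might look like the following:

import re
import codecs

def newPhrase():
    # An empty phrase structure matching the fields used above
    return { 'start_time': '', 'end_time': '', 'words': [] }

def getPhraseText( phrase ):
    # Join the words of a phrase into a displayable line of text,
    # avoiding a leading space before punctuation items
    out = ""
    for word in phrase["words"]:
        if re.match( '[a-zA-Z0-9]', word ) and out != "":
            out += " " + word
        else:
            out += word
    return out

def writeSRT( phrases, fileName ):
    # Write the sequence number, time codes, and text for each phrase,
    # following the SRT format described earlier.  Use UTF-8 so that
    # translated characters are preserved.
    e = codecs.open( fileName, "w+", "utf-8" )
    x = 1

    for phrase in phrases:
        e.write( str(x) + "\n" )
        e.write( phrase["start_time"] + " --> " + phrase["end_time"] + "\n" )
        e.write( getPhraseText( phrase ) + "\n\n" )
        x += 1

    e.close()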

Design consideration 2: What is the correct translation scope?

We all have seen overdubbed movies in which the lips of the speaker don't line up with the audio. As we prepare subtitles, we could translate word for word, or phrase by phrase, in perfect alignment with the original content. However, that would likely lead to inaccurate, incomplete, or "out of context" translations. This is because each language has differing syntax, word placement, idioms, and phrasing. Also, a translation doesn't necessarily result in the same number of words in each language, hence the out-of-sync lips problem.

To illustrate the point, I used the Amazon Translate console to translate the following English sentence into German and Spanish:

English

Differing syntax, word placement, and idioms distinguish one language from another.


Spanish

La sintaxis, la colocación de palabras y los modismos diferentes distinguen un idioma de otro.


German

Unterschiedliche Syntax, Wortplatzierung und Idiome unterscheiden eine Sprache von einer anderen.

As you can see, neither translation has the same number of words, nor are the words in the same relative position. If we were to only translate 10 words at a time instead of a whole sentence, we would not likely arrive at a translation with the correct overall meaning.

Bottom line: We are going to have to treat subtitles and the accompanying audio for translations differently than the original transcript.

Design consideration 3: Syncing up the audio and subtitles for translations

In our example, we’ll enhance the original video with translated subtitles, and we’ll add an alternate spoken foreign language audio track generated by Amazon Polly. However, that leads us to a potential problem to overcome. Unlike the original Amazon Transcribe output, we don’t know the start and end time for every word because the translation is simply a block of text. Even if we translate a whole sentence at a time, that still doesn’t provide timing; it only provides the translation. Moreover, Amazon Polly will synthesize the speech at a different rate than the original speaker. Therefore, we’ll need to calculate the duration of the spoken phrases. How do we do that?

We’ll let Amazon Polly and MoviePy help us. In this example, we’ll take the following approach:

  1. Using Amazon Translate, get the complete translation from the transcript.
  2. Using our Python code, break up the block of translated text into phrases of approximately 10 words so that the text fits on the screen and can be read quickly enough by the viewer.
  3. For each phrase, call Amazon Polly to get the spoken audio stream and write it to a temporary audio file.
  4. Load the temporary audio file into an AudioClip object from MoviePy.
  5. Return the duration of the AudioClip.

Now that we know the duration of each phrase, we can leverage similar logic to create the SRT file for inclusion in the final video. Note that this code differs from the transcript version in that our start time value is contained in the “seconds” variable, and the end time of the phrase is calculated using the getSecondsFromTranslation function. We add the result to “seconds” to give us the ending time for this phrase. The code looks like this:

def getPhrasesFromTranslation( translation, targetLangCode ):

    # Now create phrases from the translation
    words = translation.split()
    
    #set up some variables for the first pass
    phrase =  newPhrase()
    phrases = []
    nPhrase = True
    x = 0
    c = 0
    seconds = 0

    print "==> Creating phrases from translation..."

    for word in words:

        # if it is a new phrase, then get the start_time of the first item
        if nPhrase == True:
            phrase["start_time"] = getTimeCode( seconds )
            nPhrase = False
            c += 1
                
        # Append the word to the phrase...
        phrase["words"].append(word)
        x += 1
        
        
        # now add the phrase to the phrases, generate a new phrase, etc.
        if x == 10:
        
            # For translations, we now need to calculate the end time for the phrase
            psecs = getSecondsFromTranslation( getPhraseText( phrase), targetLangCode, "phraseAudio" + str(c) + ".mp3" ) 
            seconds += psecs
            phrase["end_time"] = getTimeCode( seconds )
        
            phrases.append(phrase)
            phrase = newPhrase()
            nPhrase = True

            x = 0

    # keep any partial phrase left over at the end of the translation
    if len(phrase["words"]) > 0:
        psecs = getSecondsFromTranslation( getPhraseText( phrase), targetLangCode, "phraseAudio" + str(c) + ".mp3" )
        phrase["end_time"] = getTimeCode( seconds + psecs )
        phrases.append(phrase)

    return phrases

The code to get the duration of each phrase is shown below. Note that we request Amazon Polly to speak the translated phrase into an audio stream. Then we write the temporary audio stream to disk in order to load it as an AudioFileClip from MoviePy. After it’s loaded, we can determine the exact duration and use that to calculate the timing of the subtitles relative to the translated audio track.

def getSecondsFromTranslation( textToTranslate, targetLangCode, audioFileName ):

    # Set up the Amazon Polly service
    client = boto3.client('polly')
    
    # Use the translated text to create the synthesized speech
    response = client.synthesize_speech( OutputFormat="mp3", SampleRate="22050", Text=textToTranslate, VoiceId=getVoiceId( targetLangCode ) )
    
    # Write the stream out to disk so that we can load it into an AudioClip
    writeAudioStream( response, audioFileName )
    
    # Load the temporary audio clip into an AudioFileClip
    audio = AudioFileClip( audioFileName)
        
    # return the duration    
    return audio.duration
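
The getVoiceId function used above simply maps a target language code to an Amazon Polly voice for that language. A minimal version covering the two target languages in this post might look like this (the specific voices are my choice; any voice for the language will work):

def getVoiceId( targetLangCode ):
    # Map a target language code to an Amazon Polly voice.
    # Penelope is a US Spanish voice; Marlene is a German voice.
    voices = { "es": "Penelope", "de": "Marlene" }
    return voices[ targetLangCode ]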

Creating streamed audio files from your translation using Amazon Polly

When we request Amazon Polly to synthesize text, we have several options for specifying the type, sample rate, etc., for the output. For our example, I chose an MP3 audio stream as the desired output because MoviePy can open an MP3 file as an AudioFileClip for inclusion as an alternate voice track for our final video.

After we receive a successful response from Amazon Polly, we can write the audio stream to our temporary file:

...

    # Use the translated text to create the synthesized speech
    response = client.synthesize_speech( OutputFormat="mp3", SampleRate="22050", Text=translated_txt, VoiceId=voiceId)
    
    if response["ResponseMetadata"]["HTTPStatusCode"] == 200:
        print( "\t==> Successfully called Polly for speech synthesis")
        writeAudioStream( response, audioFileName )
    else:
        print( "\t==> Error calling Polly for speech synthesis")
    
...


from contextlib import closing

def writeAudioStream( response, audioFileName ):
    
    # Take the resulting stream and write it to an mp3 file
    if "AudioStream" in response:
        with closing(response["AudioStream"]) as stream:
            output = audioFileName
            writeAudio( output, stream )

Now we’ll write the binary bytes to disk. I’ve created this utility method to be used in preparing the SRT file and also when creating the full audio clip for the final movie.

import sys

def writeAudio( output_file, stream ):

    bytes = stream.read()
    
    print "\t==> Writing ", len(bytes), "bytes to audio file: ", output_file
    try:
        # Open a file for writing the output as a binary stream
        with open(output_file, "wb") as file:
            file.write(bytes)
        
        if file.closed:
            print "\t==>", output_file, " is closed"
        else:
            print "\t==>", output_file, " is NOT closed"
    except IOError as error:
        # Could not write to file, exit gracefully
        print(error)
        sys.exit(-1)

Using MoviePy to create a composite video

As we discussed earlier, MoviePy is a comprehensive library for video composition and production. It leverages ImageMagick and FFmpeg (among others) for key functionality in building text titles, animation, audio, and videos. In our case, we use it for a few of its features:

  • Read an SRT file to create subtitles and create a subtitle clip
  • Read the alternate audio track created with Amazon Polly
  • Composite all of the clips together (original content, subtitles, and alternate audio track) into the finished product

To accomplish these tasks, we’ll create a function called “createVideo”. We will go through the key parts here, then put it all together in the next section.

def createVideo( originalClipName, subtitlesFileName, outputFileName, alternateAudioFileName, useOriginalAudio=True ):
    ...
    
    clip = VideoFileClip(originalClipName)
  
    ...

This creates a VideoFileClip from the video file we passed in as the original content. Next we will determine whether or not we need to replace the audio portion of the clip.

    if useOriginalAudio == False:
        audio = AudioFileClip(alternateAudioFileName)
        audio = audio.subclip( 0, clip.duration )
        audio = audio.set_duration( clip.duration )
        clip = clip.set_audio( audio )
    else:
        print strftime( "\t" + "%H:%M:%S", gmtime()), "Using original audio track..."

If we are using the alternate audio track such as the one generated by Amazon Polly, we’ll load the audio file as an AudioFileClip. The subclip and set_duration lines are ensuring that the audio clip is only as long as the video clip. If we were writing a professional package, perhaps we would want to be more sophisticated about handling the length. For our example, this works fine. Finally, we replace the audio from the video clip with the audio clip.

Next we’ll create the subtitles.

# Create a lambda function that will be used to generate the subtitles for each sequence in the SRT 
generator = lambda txt: TextClip(txt, font='Arial-Bold', fontsize=24, color='white')

# read in the subtitles files
print "\t" + strftime("%H:%M:%S", gmtime()), "Reading subtitle file: " + subtitlesFileName 
subs = SubtitlesClip(subtitlesFileName, generator)
print "\t\t==> Subtitles duration before: " + str(subs.duration)
subs = subs.subclip( 0, clip.duration - .001)
subs = subs.set_duration( clip.duration - .001 )
print "\t\t==> Subtitles duration after: " + str(subs.duration)
print "\t" + strftime("%H:%M:%S", gmtime()), "Reading subtitle file complete: " + subtitlesFileName 

print "\t" + strftime( "%H:%M:%S", gmtime()), "Creating Subtitles Track..."
annotated_clips = [annotate(clip.subclip(from_t, to_t), txt) for (from_t, to_t), txt in subs]

print "\t" + strftime( "%H:%M:%S", gmtime()), "Creating composited video: " + outputFileName
# Overlay the text clip on the first video clip
final = concatenate_videoclips( annotated_clips ) 

print "\t" + strftime( "%H:%M:%S", gmtime()), "Writing video file: " + outputFileName 
final.write_videofile(outputFileName)

First, we need to define a Python lambda function (not the same as an AWS Lambda function!) that will be used to create a TextClip with the supplied font information. Then, we’ll read the SRT file to create the SubtitlesClip called subs. Since SubtitlesClip is still a somewhat experimental part of MoviePy, we need to trim it to be slightly shorter than the overall video clip. Otherwise, it may result in runtime errors.

After the subtitles are created, we’ll create an array of subtitled clips called annotated_clips. Next, we concatenate all of the clips into one final clip. Finally, we’ll write the subtitled video and audio out to a new video file. Voila!
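
The annotate function used to build annotated_clips is adapted from the MoviePy examples: it composites the subtitle text over a subclip of the video for the duration of that subclip. A version along those lines looks like this:

def annotate( clip, txt, txt_color='white', fontsize=24, font='Arial-Bold' ):
    # Composite the subtitle text over the bottom of the clip
    txtclip = TextClip( txt, fontsize=fontsize, font=font, color=txt_color )
    cvc = CompositeVideoClip( [clip, txtclip.set_pos(('center', 'bottom'))] )
    return cvc.set_duration( clip.duration )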

Putting it all together

Now that we’ve seen the parts, let’s put it all together into a batch file and main driver program.

makevideo.bat

cls
python translatevideo.py -region us-east-1 -inbucket my-aiml-test/ -infile AWS_reInvent_2017.mp4 -outbucket my-aiml-test/ -outfilename subtitledVideo -outfiletype mp4 -outlang es de

Note that in this example, I ignore the specified output bucket. This is because I wanted to demonstrate that the output can be managed by Amazon Transcribe and made available via a pre-signed URL. You can optionally output the JSON transcription to a bucket you specify instead.
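
If you would rather have Amazon Transcribe write the transcript JSON to your own bucket, the StartTranscriptionJob API accepts an OutputBucketName parameter. As a hedged variation on the createTranscribeJob call shown earlier (the bucket name is illustrative and must be the bucket name only, without an "s3://" prefix):

    response = transcribe.start_transcription_job( TranscriptionJobName="transcribe_" + uuid.uuid4().hex + "_" + mediaFile , \
        LanguageCode = "en-US", \
        MediaFormat = "mp4", \
        Media = { "MediaFileUri" : mediaUri }, \
        OutputBucketName = "my-aiml-test", \
        Settings = { "VocabularyName" : "MyVocabulary" } \
        )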

translatevideo.py

For this program, I used the argparse package to be able to drive the program with command line arguments.

# name: translatevideo.py
# author: Rob Dachowski
# purpose: This code drives the process to create a transcription job, translate it into another language,
# create subtitles, use Amazon Polly to speak an alternate audio track, and finally put it all together
# into a new video.

import argparse
from transcribeUtils import *
from srtUtils import *
import time
from videoUtils import *
from audioUtils import *

# Get the command line arguments and parse them
parser = argparse.ArgumentParser( prog='translatevideo.py', description='Process the video found in the input file and write it out to the output file')
parser.add_argument('-region', required=True, help="The AWS region containing the S3 buckets" )
parser.add_argument('-inbucket', required=True, help='The S3 bucket containing the input file')
parser.add_argument('-infile', required=True, help='The input file to process')
parser.add_argument('-outbucket', required=True, help='The S3 bucket that will contain the output files')
parser.add_argument('-outfilename', required=True, help='The file name without the extension')
parser.add_argument('-outfiletype', required=True, help='The output file type.  E.g. mp4, mov')
parser.add_argument('-outlang', required=True, nargs='+', help='The language codes for the desired output.  E.g. en = English, de = German')        
args = parser.parse_args()

# print out parameters and key header information for the user
print( "==> translatevideo.py:\n")
print( "==> Parameters: ")
print("\tInput bucket/object: " + args.inbucket + args.infile )
print( "\tOutput bucket/object: " + args.outbucket + args.outfilename + "." + args.outfiletype )

print( "\n==> Target Language Translation Output: " )

for lang in args.outlang:
    print( "\t" + args.outbucket + args.outfilename + "-" + lang + "." + args.outfiletype)  
   
# Create Transcription Job
response = createTranscribeJob( args.region, args.inbucket, args.infile )

# loop until the job successfully completes
print( "\n==> Transcription Job: " + response["TranscriptionJob"]["TranscriptionJobName"] + "\n\tIn Progress"),

while( response["TranscriptionJob"]["TranscriptionJobStatus"] == "IN_PROGRESS"):
    print( "."),
    time.sleep( 30 )
    response = getTranscriptionJobStatus( response["TranscriptionJob"]["TranscriptionJobName"] )

print( "\nJob Complete")
print( "\tStart Time: " + str(response["TranscriptionJob"]["CreationTime"]) )
print( "\tEnd Time: "  + str(response["TranscriptionJob"]["CompletionTime"]) )
print( "\tTranscript URI: " + str(response["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]) )

# Now get the transcript JSON from AWS Transcribe
transcript = getTranscript( str(response["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]) ) 
# print( "\n==> Transcript: \n" + transcript)

# Create the SRT File for the original transcript and write it out.  
writeTranscriptToSRT( transcript, 'en', "subtitles-en.srt" )  
createVideo( args.infile, "subtitles-en.srt", args.outfilename + "-en." + args.outfiletype, "audio-en.mp3", True)

# Now write out the translation to the transcript for each of the target languages
for lang in args.outlang:
    writeTranslationToSRT(transcript, 'en', lang, "subtitles-" + lang + ".srt" )     
    
    #Now that we have the subtitle files, let's create the audio track
    createAudioTrackFromTranslation( args.region, transcript, 'en', lang, "audio-" + lang + ".mp3" )
    
    # Finally, create the composited video
    createVideo( args.infile, "subtitles-" + lang + ".srt", args.outfilename + "-" + lang + "." + args.outfiletype, "audio-" + lang + ".mp3", False)
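
The createAudioTrackFromTranslation function referenced above is included in the full source download rather than reproduced here. Conceptually, it reuses getPhrasesFromTranslation, which already synthesizes each phrase to a temporary "phraseAudioN.mp3" file via Amazon Polly, and then concatenates those clips into the single MP3 track consumed by createVideo. A simplified sketch under those assumptions:

from moviepy.editor import AudioFileClip, concatenate_audioclips

def createAudioTrackFromTranslation( region, transcript, sourceLangCode, targetLangCode, audioFileName ):
    # Translate the transcript and break it into phrases; this also writes
    # a temporary "phraseAudio<n>.mp3" file for each phrase via Amazon Polly
    translation = translateTranscript( transcript, sourceLangCode, targetLangCode )
    phrases = getPhrasesFromTranslation( translation["TranslatedText"], targetLangCode )

    # Load each temporary phrase clip and concatenate them into one audio track
    clips = [ AudioFileClip( "phraseAudio" + str(i) + ".mp3" ) for i in range( 1, len(phrases) + 1 ) ]
    track = concatenate_audioclips( clips )

    # Write the combined track out as the alternate-language MP3 for the video
    track.write_audiofile( audioFileName )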

Other use cases

Now that we’ve seen how to use Amazon Transcribe, Amazon Translate, Amazon Polly, and Amazon S3 to create subtitled and translated videos from a transcript generated with video content, let’s think about some other ways we may be able to apply these technologies.

  • Call center transcriptions – Since Amazon Transcribe can recognize multiple voices, it can be used to identify the caller and agent in a customer service call. Not only can call centers analyze transcripts to improve service quality, they can also use the transcripts to develop subtitles for future training videos as well.
  • Translating a podcast into another language – Ever wanted to expand your podcast listener base, but didn’t speak another language? Similar to the other use cases, use Amazon Transcribe to obtain a transcript of the podcast, then translate it using Amazon Translate, and “speak” it with Amazon Polly. Keep in mind that if your podcast contains technical jargon, you can add a custom lexicon to Amazon Polly to improve the pronunciation.

Conclusion

In this blog post, we’ve demonstrated a working example using Amazon Transcribe, Amazon Translate, and Amazon Polly to solve the subtitling and translation problems. We’ve also talked through some of the key design issues you’ll face when taking this example to the next level for your own projects. With the rapid progress being made in NLP technologies, it will be fun to see how far this can go in the near future.


About the Author

Rob Dachowski is a Solutions Architect aligned with the AWS National Systems Integrator team. With many years of experience architecting, designing, and implementing large-scale systems, Rob has helped enterprise customers migrate to the AWS Cloud. As a professional musician and radio show producer, he has a keen interest in audio and video production tools.