AWS for M&E Blog

“What was that?” Increasing subtitle accuracy for live broadcasts using Amazon Transcribe

Including accurate subtitles for your live content is crucial for an improved viewer experience and for meeting accessibility requirements. While most transcriptions for live broadcast are still generated by human stenographers, these processes scale poorly and can be expensive. Amazon Transcribe is an automatic speech recognition (ASR) service that converts speech to text, making it easy for you to add subtitles to your live media content without any prior machine learning (ML) experience. In this post, we’ll walk through several considerations for improving the quality of subtitles with Amazon Transcribe before final broadcast to viewers.

Amazon Transcribe can process a file or operate on a live stream of audio in near real time. If you’re interested in using Amazon Transcribe to subtitle on-demand files (batch processing), please explore our many posts on the AWS Machine Learning Blog.

For live content, Amazon Transcribe offers streaming transcriptions, which let you stream audio to Amazon Transcribe and receive a stream of words and phrases in near real time. These can then be processed, converted into your desired caption format, and combined with the outgoing broadcast.

If you already have an Amazon Transcribe streaming solution deployed in your account, skip forward to the section “Latency vs. accuracy.”

Subtitles for live media—high-level design

Figure 1: Sample high-level architecture overview

Figure 1 illustrates a sample architecture in which live audio is streamed to an application we’ll call the Transcribe Relay App (TRA). In this example, we use AWS Fargate, a serverless, pay-as-you-go compute engine, to run the TRA. The TRA is responsible for passing the audio stream to Amazon Transcribe, which analyzes the audio and returns a stream of transcript text. This raw text can then be processed to enhance accuracy and converted into a desired subtitle format, such as WebVTT, SRT, or JSON. Subtitles can be forwarded to your preferred solution for combining them with the outgoing video and audio broadcast.
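
If WebVTT is your target format, converting a transcript segment into a cue is mostly a matter of timestamp formatting. The following is a minimal sketch in Python; the helper names are our own, and the start and end times are assumed to be the second-based offsets that accompany Amazon Transcribe streaming results.

```python
# Minimal sketch: format one transcript segment as a WebVTT cue.
# Assumes you already have the segment text and its start/end times
# (in seconds) from an Amazon Transcribe streaming result.

def to_vtt_timestamp(seconds: float) -> str:
    """Convert seconds to a WebVTT timestamp (HH:MM:SS.mmm)."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{secs:06.3f}"

def to_vtt_cue(start: float, end: float, text: str) -> str:
    """Build a single WebVTT cue block."""
    return f"{to_vtt_timestamp(start)} --> {to_vtt_timestamp(end)}\n{text}\n"

# Example usage: a WebVTT file is the "WEBVTT" header followed by cues.
header = "WEBVTT\n"
cue = to_vtt_cue(12.34, 15.10, "and if you hold onto the shift keys")
print(header + "\n" + cue)
```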

More details on how to implement your own TRA, including input formats and sample responses from Amazon Transcribe, can be found in the streaming documentation. Fully operational sample code for a TRA is in the Amazon IVS Auto-Captions Web Demo project on GitHub, where it is called the transcribe-server.
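
As a minimal illustration of the same flow, here is a sketch using the open-source amazon-transcribe Python SDK. The region, language, and audio settings are assumptions, and the audio_chunks() generator is a stand-in for your own audio source; here it simply reads raw 16 kHz, 16-bit PCM from standard input.

```python
# Minimal sketch of the streaming flow using the open-source
# amazon-transcribe Python SDK (pip install amazon-transcribe).
import asyncio
import sys

from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler
from amazon_transcribe.model import TranscriptEvent


class SubtitleHandler(TranscriptResultStreamHandler):
    async def handle_transcript_event(self, transcript_event: TranscriptEvent):
        # Each event carries zero or more results; partial results are
        # revised as Amazon Transcribe gathers more context.
        for result in transcript_event.transcript.results:
            for alternative in result.alternatives:
                label = "partial: " if result.is_partial else "final:   "
                print(label + alternative.transcript)


async def audio_chunks(chunk_size: int = 3200):
    """Placeholder audio source: read raw 16 kHz, 16-bit PCM from stdin."""
    loop = asyncio.get_running_loop()
    while True:
        chunk = await loop.run_in_executor(None, sys.stdin.buffer.read, chunk_size)
        if not chunk:
            break
        yield chunk


async def stream_audio():
    client = TranscribeStreamingClient(region="us-east-1")
    stream = await client.start_stream_transcription(
        language_code="en-US",
        media_sample_rate_hz=16000,
        media_encoding="pcm",
    )

    async def send_audio():
        async for chunk in audio_chunks():
            await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    handler = SubtitleHandler(stream.output_stream)
    await asyncio.gather(send_audio(), handler.handle_events())


# asyncio.run(stream_audio())
```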

Once you have a simple streaming flow working, it’s time to begin optimizing your results.

Latency vs. accuracy

Humans use not only the sounds that make up individual words but also context—where each word falls relative to the other words being spoken—to understand what’s being said. Understanding context is crucial for good transcription, especially when words may sound the same or similar to each other. For example, the difference between the last word in the sentence “Can the dog hear?” versus “Bring the dog here” hinges on the first word of each sentence.

The more time the transcriber has to listen and gather context, the more accurate the transcription will be. This has implications for running an ASR service, like Amazon Transcribe, on live content. When passed a complete audio file, Amazon Transcribe can gather all the context in each sentence before generating a transcription. But in any system for live streaming and broadcast, the audio is coming in near real time, and subtitles need to appear as close as possible to the action on screen. This reduces the time available for Amazon Transcribe to gather context.

Amazon Transcribe streaming returns individual words as it identifies them and improves accuracy by revising earlier words in the phrase as it gains more context. Here’s an example of how a transcript returned by Amazon Transcribe might change over time; words that are initially wrong are corrected as more context arrives:

and if you held onto the ships
and if you held onto the shift
and if you hold onto the shift key
and if you hold onto the shift keys

While waiting for all the context to arrive would be great, sometimes you have only a few seconds before your subtitles need to be sent for final broadcast. Amazon Transcribe streaming features Partial Results Stabilization, which lets you restrict revisions to only the last few words in a phrase. This means you can tune the trade-off between speed and accuracy: if time is short, you can quickly generate subtitles; if there is more time, you can wait a few more seconds for a potentially more accurate transcription.
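
Continuing the Python sketch above, stabilization is a pair of settings on the streaming request. The parameter names below mirror the streaming API’s EnablePartialResultsStabilization and PartialResultsStability fields; check the exact keyword arguments supported by your SDK version.

```python
# Sketch: tune the latency/accuracy trade-off on the streaming request.
stream = await client.start_stream_transcription(
    language_code="en-US",
    media_sample_rate_hz=16000,
    media_encoding="pcm",
    enable_partial_results_stabilization=True,
    # "high" locks words in sooner (lower latency, potentially lower accuracy);
    # "low" allows more revisions (higher latency, potentially higher accuracy).
    partial_results_stability="high",
)
```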

Improving accuracy

Regardless of whether you expect your subtitles to be available in 1 second or 5 seconds, it is important that they are as accurate as possible. The common measure of accuracy in speech recognition systems is the word error rate (WER), which is the proportion of transcription errors that the system makes relative to the number of words said. The lower the WER, the more accurate the system. Read our blog post on evaluating an automatic speech recognition service for an in-depth description of WER and strategies for measuring the accuracy of your transcriptions.
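
WER is commonly computed as the number of substitutions, deletions, and insertions divided by the number of words in the reference transcript. If you want a quick, self-contained way to spot-check an ASR transcript against a human-written reference, the sketch below computes WER with a word-level edit distance; for rigorous evaluation, follow the methodology in the blog post above.

```python
# Minimal sketch: compute word error rate (WER) between a reference
# transcript and an ASR hypothesis using word-level edit distance.
# WER = (substitutions + deletions + insertions) / words in reference.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("bring the dog here", "bring the dog hear"))  # 0.25
```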

We can separate best practices for accuracy into three stages: pretranscription, during transcription, and posttranscription.

Pretranscription: Audio clarity

The harder it is to hear what’s being said, the harder it is to understand, and the same is true for Amazon Transcribe. Final sound mixes from auto races, sporting events, and other loud venues often contain background noise that obscures commentary and other dialogue, reducing the accuracy of transcriptions. For the best results, send the audio tracks that contain dialogue directly to Amazon Transcribe before combining them with background audio or music.

Pretranscription: Audio fidelity

Low sample rates and compression can introduce artifacts that reduce transcript quality. Amazon Transcribe streaming accepts audio with sample rates as low as 8,000 Hz, but for best results, we recommend sample rates of 16,000 Hz or higher. Likewise, though Amazon Transcribe streaming can accept OPUS-encoded audio in an Ogg container, we recommend using one of the lossless formats currently supported: FLAC or 16-bit PCM.
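
One common way to meet these recommendations is to extract and resample the dialogue track before it reaches the TRA. The sketch below uses ffmpeg from Python to produce a raw 16 kHz, mono, 16-bit PCM stream; the input URL is a placeholder, and ffmpeg is assumed to be installed on the host.

```python
# Sketch: extract a 16 kHz, mono, signed 16-bit PCM stream with ffmpeg
# and read it chunk by chunk for forwarding to Amazon Transcribe.
import subprocess

ffmpeg = subprocess.Popen(
    [
        "ffmpeg",
        "-i", "srt://example.com:9000",  # placeholder input (file, RTMP, SRT, ...)
        "-vn",            # drop video, keep audio only
        "-ac", "1",       # mono
        "-ar", "16000",   # 16,000 Hz sample rate
        "-f", "s16le",    # raw signed 16-bit little-endian PCM
        "pipe:1",
    ],
    stdout=subprocess.PIPE,
)

while True:
    chunk = ffmpeg.stdout.read(3200)  # ~100 ms of audio at 16 kHz, 16-bit mono
    if not chunk:
        break
    # forward chunk to Amazon Transcribe (see the streaming sketch above)
```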

During transcription: Domain-specific models

The base models provided by Amazon Transcribe are effective across wide ranges of content and may already be suited to your broadcast. If needed, you can further improve accuracy for your specific domains—for example, racing, soccer games, or talk shows—by building a custom model specialized for each domain. Amazon Transcribe offers multiple tools for crafting models that match your unique content.

First and foremost, the base language model must match the language in your content. Many languages have different regional accents, and for some languages, Amazon Transcribe offers models specific to those regional dialects. If the speakers in your content have British accents, you’ll get better accuracy by specifying British English (en-GB) instead of US English (en-US). See the documentation for a full list of languages and dialects supported by Amazon Transcribe streaming and batch.

Accuracy can be further improved by implementing a custom vocabulary or a custom language model for your domain. At the time of writing, Amazon Transcribe streaming supports using one or the other for an audio input, but not both. Below, we describe each in more detail, along with reasons to choose one or the other depending on your situation.

A custom vocabulary is a table of domain-specific words and phrases that Amazon Transcribe uses to improve accuracy. These are usually proper nouns, technical terms, sports jargon, and other terms that the Amazon Transcribe base model fails to recognize. More detail on implementing custom vocabularies is available in the documentation.

We recommend using custom vocabularies in cases where there are words and phrases unique to the domain, particularly when those words can change rapidly—for example, rotating sets of international player names in soccer/football—and you don’t have enough domain-specific text data to train a custom language model. Formula 1 Racing (F1) chose to use custom vocabularies for its live streaming service, F1 TV. For more details on F1’s implementation and why it chose custom vocabularies, see this AWS blog post.

Custom vocabularies improve accuracy based on word sounds and can’t provide Amazon Transcribe with the context in which those words generally appear. As an example, if you wanted to define a vocabulary rule for the proper noun Los Angeles, you could specify either a SoundsLike value, such as “loss-ann-gel-es”, or an International Phonetic Alphabet (IPA) representation, “lɔs æn ʤəl əs.” Because of the high variability in how people write out the syllables that mimic how a word sounds, we recommend using IPA in your custom vocabularies for the greatest consistency wherever possible.
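
As a sketch of what this looks like in practice, the following uses boto3 to create a custom vocabulary from a tab-delimited table stored in Amazon S3 and then references it on the streaming request. The bucket, file, and vocabulary names are placeholders, and the example table row is illustrative only.

```python
# Sketch: create a custom vocabulary from a tab-delimited table in S3.
# The table columns (Phrase, IPA, SoundsLike, DisplayAs) follow the
# custom vocabulary table format described in the documentation.
import boto3

transcribe = boto3.client("transcribe", region_name="us-east-1")

transcribe.create_vocabulary(
    VocabularyName="soccer-vocabulary",
    LanguageCode="en-GB",
    # Tab-separated file, for example:
    # Phrase        IPA             SoundsLike    DisplayAs
    # Los-Angeles   lɔs æn ʤəl əs                 Los Angeles
    VocabularyFileUri="s3://my-subtitle-assets/vocab/soccer.txt",
)
# Wait until get_vocabulary(...)["VocabularyState"] is "READY" before use.

# Later, on the streaming side (amazon-transcribe SDK):
# stream = await client.start_stream_transcription(
#     language_code="en-GB",
#     media_sample_rate_hz=16000,
#     media_encoding="pcm",
#     vocabulary_name="soccer-vocabulary",
# )
```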

Custom language models allow you to upload text data to train an Amazon Transcribe model that targets the specific domain of your media content. Unlike custom vocabularies, custom language models can use the context associated with domain-specific words when transcribing audio, greatly improving overall accuracy.

Gathering the text data to build a good custom language model can require much more work than a custom vocabulary but can provide significant increases in accuracy. Text data to train the custom models must be specific to the kind of media that you plan on transcribing and can include your pre-existing website content, long-form text and articles, or training manuals. If available, we highly recommend using human-generated transcripts of media you’ve already broadcast. These will be an exact fit for the domain you’re training the model for.

More details on building a custom language model can be found in the documentation, and the custom language models blog post provides a guide on how to build and evaluate your first model.
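
For reference, the following sketch starts custom language model training with boto3. The model name, S3 locations, and IAM role are placeholders; the role must grant Amazon Transcribe access to your training text.

```python
# Sketch: kick off custom language model training from text data in S3.
import boto3

transcribe = boto3.client("transcribe", region_name="us-east-1")

transcribe.create_language_model(
    LanguageCode="en-US",
    # WideBand for audio sampled at 16 kHz or higher, NarrowBand for lower rates.
    BaseModelName="WideBand",
    ModelName="talk-show-clm",
    InputDataConfig={
        "S3Uri": "s3://my-subtitle-assets/clm/training-text/",
        # Optional: held-out, domain-specific text used to tune the model.
        "TuningDataS3Uri": "s3://my-subtitle-assets/clm/tuning-text/",
        "DataAccessRoleArn": "arn:aws:iam::123456789012:role/TranscribeClmAccess",
    },
)

# Training can take several hours; poll describe_language_model until the
# ModelStatus is COMPLETED, then pass language_model_name on streaming calls.
```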

Posttranscription: Automated postprocessing

Even after defining a custom vocabulary or building a custom language model, some erroneous transcriptions can slip through. Variations, such as a speaker’s accent, can introduce recurring mistakes in the transcript. For example, a speaker whose catchphrase is “Let’s put the issue to bed” might be consistently transcribed as “Let’s put the issue Tibet” when using the en-US model. If the word Tibet is unlikely to appear in that speaker’s program, you can safely replace occurrences of it with the correct phrase for that content before sending the subtitles out to broadcast or additional review.

As with the domain-specific models above, it is important to create postprocessing rules that are content specific. Attempting to create a single set of rules for all domains could increase the number of errors in your subtitles by performing substitutions when it is not appropriate.
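
A simple way to implement this is a per-domain table of whole-word substitution rules applied to the transcript text before it is formatted into subtitles. The rules below are illustrative only.

```python
# Sketch: apply content-specific substitution rules to the raw transcript.
import re

# One rule set per domain; applying talk-show rules to racing content
# could introduce new errors, so keep them separate.
TALK_SHOW_RULES = {
    r"\bissue Tibet\b": "issue to bed",
}

def apply_rules(transcript: str, rules: dict) -> str:
    """Replace known mis-transcriptions using case-insensitive, whole-word rules."""
    for pattern, replacement in rules.items():
        transcript = re.sub(pattern, replacement, transcript, flags=re.IGNORECASE)
    return transcript

print(apply_rules("Let's put the issue Tibet.", TALK_SHOW_RULES))
# -> Let's put the issue to bed.
```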

This technique has been used to great effect by F1, which uses custom postprocessing rules to perform text replacement of common subtitle errors in its racing content. Check out this AWS blog post for a deep dive on the work done to produce high-quality subtitles for F1.

Posttranscription: Human review

For high-priority content or anything else that requires the strictest accuracy, it can be valuable to conduct human review and even correction of subtitles prior to broadcast. This technique bridges the gap between ASR systems and full human transcription, allowing you to use people with a strong grasp of the language who might not have the speed and skill of a professional stenographer.

Human review can be implemented in two tiers, depending on your need and time-to-broadcast constraints. For the tightest schedules where there won’t be any time for edits before subtitles need to go out with the broadcast, customers like F1 have implemented a human-in-the-loop system that allows an operator a few seconds to block a word or phrase from appearing in the outgoing broadcast. Words and phrases that are blocked by the operators are tracked to later improve the models used for that domain, improving the overall subtitle quality over time.
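
As a simplified illustration of this pattern (not a description of any production system), the sketch below holds subtitles in a short delay buffer, strips any phrases an operator has flagged, and logs them so they can feed later model improvements.

```python
# Simplified illustration of an operator block list with a delay buffer.
import re
import time
from collections import deque

DELAY_SECONDS = 5.0
blocked_phrases: set = set()   # updated live by an operator UI (assumption)
blocked_log: list = []         # feed this back into model tuning
pending: deque = deque()       # (arrival time, subtitle text)

def submit(subtitle: str) -> None:
    """Queue a subtitle as it arrives from the transcription pipeline."""
    pending.append((time.monotonic(), subtitle))

def release_ready() -> list:
    """Release subtitles whose delay window has elapsed, minus blocked phrases."""
    ready = []
    now = time.monotonic()
    while pending and now - pending[0][0] >= DELAY_SECONDS:
        _, text = pending.popleft()
        for phrase in blocked_phrases:
            pattern = re.compile(re.escape(phrase), re.IGNORECASE)
            if pattern.search(text):
                blocked_log.append(phrase)
                text = pattern.sub("", text)
        ready.append(" ".join(text.split()))  # tidy leftover spacing
    return ready
```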

In scenarios where there is a bit more time before the live show goes out, systems have been built that allow for editing and revision of subtitles, including nudging of time stamps to better align subtitles with on-screen content. This is an active growth space, with companies like CaptionHub integrating with Amazon Transcribe to provide tools that allow near-real-time work with ASR-generated subtitles. Check out the CaptionHub post on the AWS blog for more details on what can be done for full human-in-the-loop editing.

Conclusion

Subtitles are an essential part of viewer experience. Amazon Transcribe helps you deliver high-quality live-video content with accessible subtitling. In this post we explained how to get started with Amazon Transcribe streaming and described some of the best practices AWS Professional Services has used to help our customers improve the quality of their subtitles. AWS Professional Services has teams specializing in media and entertainment who are ready to help you develop a live subtitling system using Amazon Transcribe that meets your unique needs and domain. For more information, see the AWS Professional Services page or reach out to us through your account manager.

Aekta Joshi

Aekta is a Principal Media & Entertainment Architect for ProServe Media and Entertainment at AWS. She helps customers and partners in broadcast accelerate their success in the AWS Cloud.

Paul Dulac

Paul is a Media Application Architect, ProServe at AWS.

Tiju Kochery

Tiju is a Media Cloud Architect, ProServe at AWS.