AWS Machine Learning Blog
Build a custom vocabulary to enhance speech-to-text transcription accuracy with Amazon Transcribe
Amazon Transcribe is a fully-managed automatic speech recognition (ASR) service that makes it easy for developers to add speech-to-text capabilities to applications. Depending on your use case, you may have domain-specific terminology that doesn’t transcribe properly (e.g. “EBITDA” or “myocardial infarction”). In this post, we will show you how to leverage the custom vocabulary feature – by leveraging custom pronunciations and custom display forms – to enhance transcription accuracy of domain-specific words or phrases that are relevant to your use case.
Custom vocabulary is a powerful feature that helps users transcribe terms that would otherwise not be part of our general ASR service. For instance, your use case may involve brand names or proper names that are not normally part of a language model’s regular lexicon, like in the case of “Hogwarts”. In this case, it would not only be helpful to be able to add the custom text, but also be able to inform our ASR service on the pronunciation to help our system better recognize unfamiliar terms. On a related note, perhaps you have a term, say, “Lotus” which is a brand name of a car. Naturally, we recognize “lotus” as a flower already. But for your use case, you’d like to have the word transcribed with proper capitalization in the context of recognizing it as a make or model of a vehicle. You can therefore use the recently added custom display forms to achieve this.
First, check out the following video for a demo and then read on to walk through some examples of using both custom pronunciation and also custom display forms.
First, we’ve recorded a sample audio and stored it in an S3 bucket (this is a pre-requisite and can be achieved by following documentation). For reference, here’s the audio file’s ground truth transcript:
“Hi, my name is Paul. And I’m calling in about my order of two separate LEGO toys. The first one is the Harry Potter Hogwarts Castle that has a cool Grindelwald mini-fig. The second set is a model of the Lotus Elise car. I placed the orders on the same day. Can you tell me when they will be arriving please?”
As you can see, there are some very specific brand names and custom terms. Let’s see what happens when we pass the audio sample through Amazon Transcribe as is. First, let’s sign into the AWS Console and create a new transcription job:
Then, in the next screen, I’ll name my transcription job and reference the S3 bucket in which my sample audio is stored. I’ve selected the language model as US English and identified the file format as WAV. I’ll leave the sample rate blank as that’s optional. And also notice I deliberately left the custom vocabulary field blank, because we want to run a baseline transcription job without using the feature to see performance accuracy as is. I’ve left all of the remaining fields as default, since those are features we’re not interested in using for this baseline test. Then I’ll hit “Create Job” to initiate the transcription.
In the next screen you’ll see that the transcription job has completed with a preview window showing you the output text: “Hi. My name is Paul, and I’m calling in about my order of two separate Lego toys. The first one is the Harry Potter Hogwarts Castle that has a cool, Grendel walled many fig. The second set is a model of the lotus, At least car. I placed the orders on the same day. Can you tell me when they will be arriving, please? Thanks.”
Looks like the transcription output did pretty well overall, except it missed “Grindelwald”, “mini-fig”, and “Lotus Elise”. Additionally, it didn’t capture “LEGO” properly with full capitalization. No surprise, as these are pretty content-specific custom terms.
So, let’s see how we can use the custom vocabulary feature’s custom pronunciation to enhance the transcription output. First, we need to prepare a vocabulary file, which not only lists the custom terms (Phrase), but also indicates the corresponding pronunciations.
Using any simple text editor, I am going to create a new custom vocabulary file. And then type in the terminology (Phrase), the corresponding pronunciation (IPA, while International Phonetic Alphabet guidelines or orthography using SoundsLike), and then any output format of my preference (DisplayAs). In the text editor, I’ve configured the white bars to indicate when typing a tab for blanks where there are no inputs desired. Here’s what the vocabulary text file looks like in my text editor. Notice I basically augmented any of the words that were missed in the baseline transcription. I’ll save the file as “paul-sample-vocab.”
So now, all I have to do is upload this text file via the Amazon Transcribe Console by uploading the vocabulary file and clicking “Create Vocabulary”:
We can confirm that the custom vocabulary was successfully generated, as it will be visible in the custom vocabulary list:
Ok, so now we can start another new transcription job for the same audio file, but this time, we’ll invoke the custom vocabulary text file to see the accuracy results. The process is the same as we had been through before, except this time we will actually designate a custom vocabulary “paul-sample-vocab”. And of course, I’ll name the transcription job something different from the first one, like “customer-call-with-vocabulary”:
Let’s take a look at the new transcription results now!
Here’s the transcription output:
“Hi. My name is Paul, and I’m calling in about my order of two separate LEGO toys. The first one is the Harry Potter Hogwarts Castle. That has a cool Grindelwald mini-fig. The second set is a model of the Lotus Elise car. I placed the orders on the same day. Can you tell me when they will be arriving, please?”
We’ve not only correctly transcribed custom formal nouns such as “Grindelwald” but also custom terms like “mini-fig” which are specific to LEGO toys. And look at that, we also were able to properly capitalize “LEGO” as it is spelled as a brand, along with proper casing for “Lotus Elise” as well.
Custom vocabulary should be used in a targeted manner, meaning that the more specific a list of terms is when applied to specific audio recordings, the better the transcription result. We don’t recommend flooding a single vocabulary file with more than 300 words. The feature is available in all regions where Transcribe is available today. Refer to the Region Table to see the full list.
For more information, refer to the Amazon Transcribe technical documentation. If you have any questions, please leave them in the comments.
About the authors
Paul Zhao is a Product Manager at AWS Machine Learning. He manages the Amazon Transcribe service. Outside of work, Paul is a motorcycle enthusiast and avid woodworker.
Yibin Wang is a Software Development Engineer at Amazon Transcribe. Outside of work, Yibin likes to travel and explore new culinary experiences.