AWS Machine Learning Blog

Boost transcription accuracy of class lectures with custom language models for Amazon Transcribe

Many universities transcribe their recorded class lectures and later create captions from the transcripts. Amazon Transcribe is a fully managed automatic speech recognition (ASR) service that makes it easy to add speech-to-text capabilities to voice-enabled applications. Transcribe helps increase accessibility and improve content engagement and learning outcomes by connecting with both auditory and visual learners.

One common problem we see is the difficulty of accurately transcribing specialized or domain-specific content, such as biology lectures. For these cases, Amazon Transcribe offers custom language models (CLMs). This feature lets you submit a corpus of text data to train a custom language model that targets a domain-specific use case. Using a CLM is easy because it capitalizes on data that you already possess (such as website content, curricula, and lesson plans). In this post, we show how you can harness readily available content to train a CLM in Amazon Transcribe and boost transcription accuracy on scientific subjects like biology. Because the model is custom, you can apply the same approach to create a CLM for your own subject of interest.

The main purpose of this post is to show how easily you can download data from Wikipedia to generate a training corpus for a CLM.

In this post, we refer to a few publicly available biology audio lectures from MIT. Without a CLM, Amazon Transcribe might misrecognize advanced scientific terms such as the following:

“Prokaryotic cells” as “Pro carry ah tick cells”

“Endoplasmic reticulum” as “Endo Plas Mick Ridiculous um”

“Vacuoles” as “Vac u ALS”

“Flagella” as “Flu Gela”

These results shouldn’t be interpreted as a full representation of the Amazon Transcribe service performance—it’s just one instance for a very specific example.

Solution overview

With the CLM feature in Amazon Transcribe, you can build your own custom model for your class course content and improve the transcription accuracy of your class lectures.

The CLM feature in Amazon Transcribe involves three stages for building and using a custom model:

  1. Prepare training data
  2. Train a CLM
  3. Transcribe an audio file using the CLM and evaluate the results

Prepare training data

The Amazon Transcribe CLM feature requires training data that is specific to that particular domain. In our example, we require training data specific to biology. We can obtain training data from various sources. In our case we obtained it from Wikipedia using the following code. We can further improve the CLM’s accuracy using ground truth transcripts as tuning data. For more information, see Improving domain-specific transcription accuracy with custom language models.

Written in Python, this code pulls various biology-related articles from Wikipedia, and requires you to provide a few key terms related to the domain of interest. It then fetches the Wikipedia article for each key term if one exists, and skips the term otherwise. Then our training data is ready. In this example, the code creates 137 separate text files when it finishes. You can upload these text files to a folder in an Amazon Simple Storage Service (Amazon S3) bucket.

!pip3 install beautifulsoup4

!pip3 install nltk

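import nltk
# Sentence tokenizer data used by tokenize.sent_tokenize below
nltk.download('punkt')
# Assumption: recent NLTK releases look for 'punkt_tab' instead of 'punkt'
nltk.download('punkt_tab')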

from nltk import tokenize
import re

import urllib.request
from bs4 import BeautifulSoup

# Create a list of key terms related to biology
keywords_list = ["Abdominal cavity", "Absorption", "Acclimation", \
                 "Achondroplasia", "Acid", "Behaviour", "ACTH", \
                 "Adrenocorticotropic", "Hormone", "Aerobic", \
                 "Amoeba", "Amoeboid", "Anabolism", "Anabolic", \
                 "Anaerobic", "Anagen", "Anastomosis", "Anatomy", \
                 "Anterior", "Articulate", "Blastodisc", \
                 "Blastoderm", "Binocular", "Bolus", "Boli", \
                 "Catabolism", "Catabolic", "Caudal", "Choana", \
                 "Coelom", "Columnar", "Epithelium", "Conical", \
                 "Corium", "Cranial", "Dimorphism", "Distal", \
                 "Dorsal", "Ectoderm", "Electrolyte", \
                 "Endocardium", "Endoderm", "Entoderm", \
                 "Gamete", "Germ", "Gonads", "Gonadotropins", \
                 "Heterophile", "Homeothermic", "Hyperthermia", \
                 "Hypothermia", "Ingest", "Infection", "Infestation", \
                 "Lateral", "Longitudinal", "Lunar", "Median", \
                 "Meiosis", "Chromosome", "Metabolism", "Living organism", \
                 "Mitosis", "Mesoderm", "Myocardium", "Neo", \
                 "Ovum", "Paleo", "Respiration", "Papilla", \
                 "Papillae", "Exocrine gland", "Peri", \
                 "Pericardial", "Heart", "Peritoneal", \
                 "Intestine", "Abdomen", "PH", "Phagocyte", \
                 "White blood cell", "Foreign body", "Bacteria", \
                 "Physiology", "Organism", "Plantar", "Pleural", \
                 "Lung", "Poikilothermic", "Animal", "Body", \
                 "Polymorphonuclear", "Nucleus", "Posterior", \
                 "Proximal", "Pulmonary", "Veins", "Purkinje fibres", \
                 "Muscle", "Fibres", "Sagittal", "Tissue", "Sebaceous", \
                 "Serous", "Membrane", "Squamous", "Syncytium", \
                 "Protoplasm", "Telogen", "Thoracic", "Cavity", \
                 "Body cavity", "Diaphragm", "Transverse", "Ventral", \
                 "Virulent", "Disease", "Biology", "Human cell", \
                 "Animal cell", "Cell structure", "Zoology", "DNA", \
                 "Plant cell", "Biophysics", "Cell and molecular biology", \
                 "Computational biology", "Ecology", "Evolution", \
                 "Environmental biology", "Forensic biology", \
                 "Genetics", "Marine biology", "Microbiology", \
                 "Biosciences", "Natural science", "Neurobiology"]

# Purge any duplicates from list
keywords_list = list(set(keywords_list))

print("Size of keyword list =", len(keywords_list))

# Write the extracted text to a local file named after the keyword
def output_to_file(data, keyword):
  file_location = "./"+keyword+".txt"
  # The with statement closes the file automatically
  with open(file_location, "w", encoding="utf-8") as f:
    f.write(data)

# Helper method to get html text from wikipedia
def extract_html(keyword):
  try:
    # The with statement closes the connection automatically
    with urllib.request.urlopen("https://en.wikipedia.org/wiki/"+keyword) as fp:
      return fp.read().decode("utf8")
  except Exception:
    # Most commonly an HTTP error because no article has this title
    print("Page for "+keyword+" could not be downloaded")
    return None

# Helper method to extract data from html text
def get_data(html):
  extracted_data = []
  soup = BeautifulSoup(html, 'html.parser')
  # Keep only paragraph text, split into sentences
  for data in soup.find_all('p'):
    res = tokenize.sent_tokenize(data.text)
    for txt in res:
      # Drop parenthesized and bracketed spans such as citation markers
      txt2 = re.sub(r"[\(\[].*?[\)\]]", "", txt)
      txt2 = txt2.strip()
      if len(txt2)>0:
        extracted_data.append(txt2)
  return extracted_data

# Download data from wikipedia to local text files
count = 0
for keyword in keywords_list:
  # Wikipedia article URLs use underscores in place of spaces
  keyword = keyword.replace(" ","_")
  html = extract_html(keyword)
  if html:
    count += 1
    data = "\n".join(get_data(html))
    output_to_file(data,keyword)
print("Was able to download text for "+ str(count) + " out of "+str(len(keywords_list))+" keywords")
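The upload step mentioned earlier can also be scripted. The following is a minimal sketch, assuming the boto3 SDK is installed and AWS credentials are configured; the bucket name and prefix in the example call are placeholders, and the helper names `corpus_keys` and `upload_corpus` are ours, not part of any AWS API.

```python
from pathlib import Path


def corpus_keys(local_dir, prefix):
    """Map each local .txt file to the S3 key it would be uploaded to."""
    prefix = prefix.strip("/")
    keys = {}
    for path in sorted(Path(local_dir).glob("*.txt")):
        keys[path] = f"{prefix}/{path.name}" if prefix else path.name
    return keys


def upload_corpus(local_dir, bucket, prefix, s3_client=None):
    """Upload every .txt file in local_dir to s3://bucket/prefix/."""
    if s3_client is None:
        import boto3  # lazy import: only needed for a real upload
        s3_client = boto3.client("s3")
    uploaded = []
    for path, key in corpus_keys(local_dir, prefix).items():
        s3_client.upload_file(str(path), bucket, key)
        uploaded.append(key)
    return uploaded


# Example call (hypothetical bucket and prefix):
# upload_corpus(".", "my-clm-bucket", "biology/training-data")
```

Keeping the key mapping in its own function makes it easy to check which files would be uploaded before touching the bucket.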

Train a custom language model

We use this training data to train our CLM in Amazon Transcribe. To do so, we can use the AWS Management Console, the AWS Command Line Interface (AWS CLI), or the AWS SDK. The methodology shown in this post uses the console.

  1. On the Amazon Transcribe console, choose Custom language model in the navigation pane.
  2. Choose Train model.
  3. For Name, enter a name for your model.
  4. For Language, choose the language of your model (for this post, we choose English, US).
  5. For Base model, if your audio files have a sample rate of 16 kHz or greater, select Wide band; otherwise, select Narrow band.
  6. For Training data, enter the S3 folder path for your training data.
  7. Create an AWS Identity and Access Management (IAM) role if you don’t have an existing role with the required permissions.
  8. Choose Train model.

Your model should be ready after a few hours. Make sure that your training data is in UTF-8 format. For more information, see Improving domain-specific transcription accuracy with custom language models.
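If you prefer the AWS SDK to the console, the same training request can be sketched with boto3. This is an illustrative sketch, not the method used in this post: the model name, S3 URI, and role ARN are placeholders, and the helper names `build_clm_request` and `train_clm` are ours. It relies on the `create_language_model` and `describe_language_model` calls of the boto3 Transcribe client.

```python
def build_clm_request(model_name, training_s3_uri, role_arn,
                      language_code="en-US", base_model="WideBand",
                      tuning_s3_uri=None):
    """Assemble the keyword arguments for create_language_model."""
    input_config = {"S3Uri": training_s3_uri, "DataAccessRoleArn": role_arn}
    if tuning_s3_uri:
        # Optional ground truth transcripts used as tuning data
        input_config["TuningDataS3Uri"] = tuning_s3_uri
    return {
        "ModelName": model_name,
        "LanguageCode": language_code,
        "BaseModelName": base_model,  # "WideBand" or "NarrowBand"
        "InputDataConfig": input_config,
    }


def train_clm(request, client=None):
    """Start training and return the model status reported right afterwards."""
    if client is None:
        import boto3  # lazy import: only needed for a real request
        client = boto3.client("transcribe")
    client.create_language_model(**request)
    response = client.describe_language_model(ModelName=request["ModelName"])
    return response["LanguageModel"]["ModelStatus"]


# Example call (all values are placeholders):
# request = build_clm_request(
#     "biology-clm",
#     "s3://my-clm-bucket/biology/training-data/",
#     "arn:aws:iam::123456789012:role/ExampleTranscribeRole",
# )
# print(train_clm(request))
```

Because training takes a while, you would typically call `describe_language_model` again later and wait for the status to change from `IN_PROGRESS` to `COMPLETED` before using the model.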

When your model is ready, you can use it to create transcriptions.
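As a rough sketch of what using the model looks like through the SDK, a transcription job can reference the CLM by name in its `ModelSettings`. The job name, media URI, and model name below are placeholders, and the helper names are ours; the sketch assumes boto3 is installed and the model has finished training.

```python
def build_job_request(job_name, media_uri, model_name, language_code="en-US"):
    """Assemble the keyword arguments for start_transcription_job with a CLM."""
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,  # must match the language of the CLM
        "Media": {"MediaFileUri": media_uri},
        "ModelSettings": {"LanguageModelName": model_name},
    }


def start_clm_job(request, client=None):
    """Start the job and return its initial status."""
    if client is None:
        import boto3  # lazy import: only needed for a real request
        client = boto3.client("transcribe")
    response = client.start_transcription_job(**request)
    return response["TranscriptionJob"]["TranscriptionJobStatus"]


# Example call (all values are placeholders):
# request = build_job_request(
#     "biology-lecture-1-clm",
#     "s3://my-clm-bucket/audio/lecture1.mp3",
#     "biology-clm",
# )
# print(start_clm_job(request))
```

Running the same audio once with and once without `ModelSettings` gives the two transcripts needed for a side-by-side comparison.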

Transcribe and evaluate the results

In this section, we compare the transcription output from standard Amazon Transcribe with the CLM output.

We used a sample biology audio file as input to show how the CLM improves the results. The words highlighted in red show errors in transcription, and the ones highlighted in green show how those errors are fixed by the CLM.

Snippet 1 – Ground Truth
Outside the nucleus, the ribosomes and the rest of the organelles float around in cytoplasm, which is the jelly like substance. Ribosomes may wander freely within the cytoplasm or attach to the endoplasmic reticulum sometimes abbreviated as ER.

Snippet 1 – Standard Amazon Transcribe
Outside the nucleus, the ribosomes and the rest of the organelles float around in cytoplasm, which is the jelly like substance. Ribosomes may wander freely within the cytoplasm or attach to the end a plasma critical, Um, sometimes abbreviated as E. R.

Snippet 1 – Amazon Transcribe with CLM
Outside the nucleus, the ribosomes and the rest of the organelles float around in cytoplasm, which is the jelly like substance. Ribosomes may wander freely within the cytoplasm or attach to the endoplasmic reticulum, sometimes abbreviated as E. R.

Snippet 2 – Ground Truth
Another unique feature in some cells is flagella. Some bacteria have flagella. A flagellum is like a little tail that can help a cell move or propel itself.

Snippet 2 – Standard Amazon Transcribe
Another unique feature in some cells is flat Gela. Some bacteria have fled. Gela, a flagellum, is like a little tail that can help us sell, move or propel itself.

Snippet 2 – Amazon Transcribe with CLM
Another unique feature in some cells is flagella. Some bacteria have flagella. A flagellum is like a little tail that can help a cell move or propel itself.

To demonstrate this further, we downloaded several publicly available biology audio lectures from MIT, namely lectures 1, 3, and 4. Results from this exercise are reported in the following table using word error rate (WER) as a metric. WER is a standard metric used to measure transcription accuracy, where accuracy = (1.0 – WER). In this test, we used the asr-evaluation Python module for WER calculations.
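To make the metric concrete, here is a minimal sketch of a WER calculation: the word-level edit distance (substitutions, insertions, and deletions) divided by the number of words in the reference. This is not the asr-evaluation module used for the reported numbers; it only lowercases and splits on whitespace, so its results may differ from tools that also normalize punctuation.

```python
def word_error_rate(reference, hypothesis):
    """Return word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "some bacteria have flagella"
standard = "some bacteria have fled gela"
with_clm = "some bacteria have flagella"

print("Standard WER:", word_error_rate(reference, standard))
print("CLM WER:", word_error_rate(reference, with_clm))
print("CLM accuracy:", 1.0 - word_error_rate(reference, with_clm))
```

In this toy example, the standard output needs one substitution and one insertion to match the four-word reference, while the CLM output matches it exactly.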

| Sample | Standard Amazon Transcribe WER | Amazon Transcribe CLM WER | Standard Amazon Transcribe Accuracy | Amazon Transcribe CLM Accuracy | # Words | Words Improved by CLM | WER Improvement |
|---|---|---|---|---|---|---|---|
| Sample 1 | 9.5% | 7.4% | 90.5% | 92.6% | 5,678 | 119 | 22% |
| Sample 2 | 13.2% | 11.6% | 86.8% | 88.4% | 7,578 | 121 | 12% |
| Sample 3 | 12.2% | 10.4% | 87.8% | 89.6% | 7,534 | 135 | 15% |
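The WER improvement figures are consistent with a relative reduction, that is, the drop in WER divided by the standard WER. The short sketch below recomputes it from the reported WER values; the relative-reduction formula is our reading of the reported numbers rather than something stated explicitly, and the function name is ours.

```python
def relative_wer_improvement(standard_wer, clm_wer):
    """Relative reduction in WER achieved by the CLM."""
    return (standard_wer - clm_wer) / standard_wer


# (standard WER, CLM WER) as reported for each sample
reported = {
    "Sample 1": (0.095, 0.074),
    "Sample 2": (0.132, 0.116),
    "Sample 3": (0.122, 0.104),
}

for sample, (standard_wer, clm_wer) in reported.items():
    improvement = relative_wer_improvement(standard_wer, clm_wer)
    print(f"{sample}: accuracy {1.0 - standard_wer:.1%} -> {1.0 - clm_wer:.1%}, "
          f"relative WER improvement {improvement:.0%}")
```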

As is evident from the results, transcription accuracy improved through the use of CLM. The following are some of the transcription errors that the CLM fixed:

“file a Chinese” corrected to “Phylogenies”

“Metas Oona” corrected to “Metazoa”

“File Um” corrected to “Phylum”

“Endo plans particular” corrected to “Endoplasmic reticulum”

A lower WER is better. These WERs aren’t representative of overall Amazon Transcribe performance; the numbers are relative, intended to demonstrate the benefit of custom models over generic models, and are specific to these audio samples. Although the generic Amazon Transcribe engine performed decently on the sample audio from the biology domain, the CLM we built from our training data performed better, improving more than 100 words in each sample. These comparative results are unsurprising: the more relevant training and tuning data a model receives, the more tailored it is to the specific domain and use case.

Conclusion

In this post, we showed how the custom language model feature of Amazon Transcribe can improve transcription accuracy on difficult specialized audio topics, such as biology lectures. Further improvements are possible by using course materials such as textbooks and relevant articles as additional training data, and by using some of the ground truth audio transcripts as tuning data.

You can also use the custom vocabulary feature in Amazon Transcribe in conjunction with a CLM to provide pronunciation hints for particularly troublesome words. For more information, see Custom vocabularies.

As you start building a CLM for your use case, make sure that you train it on appropriate data for that particular subject. You can use the code provided in this post to source domain-specific tuning or training data from public websites such as Wikipedia. Try it out yourself and let us know how you do in the comments!


About the Author

Raju Penmatcha is a Senior AI/ML Specialist Solutions Architect at AWS. He works with education, government, and nonprofit customers on machine learning and artificial intelligence-related projects, helping them build solutions using AWS. Outside of work, he likes watching movies and exploring new places.