What is Audio File Transcription?

Organizations require audio transcriptions at scale for various use cases, ranging from organized meeting notes to healthcare applications. Modern AI technologies can transcribe audio to text, transforming various accents and conversations between multiple speakers into accurate, formatted documents. This guide explores methods to transcribe audio to text for enterprise and small business needs.

Speech-based communication is critical for humans to fully understand one another. Voice is a fast, point-in-time method to communicate ideas, information, instructions, and emotions. Recording and transcribing voice communications via audio-to-text converters has become essential for recall, accuracy, and further work. When you transcribe audio to text, important information can be retained, searched, analyzed, and remixed for faster insights and instant integration into business processes.

In the past, a person would listen to an audio recording and type out its content, pausing and replaying the recording as needed to produce an accurate transcript. Law firms, doctors, researchers, and other professional offices employed typing pools to perform this manual work of transcribing voice notes into text.

Now, machines can transcribe audio almost instantly via an audio-to-text converter. Instead of relying on human effort, speech-to-text (STT) technology converts audio files into written text files. The resulting text can be read as-is, summarized with AI, automatically actioned within other software systems, analyzed in isolation or as part of a broader corpus, and much more. The applications of audio-to-text converters are boundless.

What are audio file transcription technologies?

Audio files may contain various speakers, accents, and domain-specific words. Audio recordings can also vary in sound quality. Converting spoken words to text requires both comprehension of spoken language and knowledge of syntax and grammar to produce readable output.

Older audio-to-text converter software made frequent mistakes and produced difficult-to-read transcripts that lacked proper structure and contained word and grammar errors. Modern audio-to-text converter software performs far better, producing text that closely matches the spoken word, with proper written structure and grammar.

Amazon Transcribe is a fully managed service that converts speech to text using automatic speech recognition (ASR) technology. It can handle various speech characteristics, including variations in speaking rate, pitch, and volume. It can transcribe in over 100 languages, plugging into developer workflows and AWS infrastructure for enterprise audio-to-text requirements.

How to get started with audio transcription?

Two main methods exist to transcribe audio to text, and the choice depends on whether the media is pre-recorded or live. Batch transcription is used for transcribing pre-recorded audio files, and streaming transcription is used for transcribing live media streams.

Amazon Transcribe supports single-channel and dual-channel audio for both batch and streaming audio and video transcription types.

Both batch and streaming audio-to-text transcriptions produce output in JSON format. The fields provided in the output depend on the features you include in your transcription request. At a minimum, your transcript contains each transcribed word along with its start time, end time, type, vocabulary filter match, and confidence score for verifiability. Other fields include speaker labels, alternative words, channels, and more.
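
As a rough illustration of how these fields can be consumed, the following Python sketch reads a hand-written sample shaped like a batch transcript and lists the words whose confidence falls below a threshold. The sample values, the 0.8 threshold, and the function name low_confidence_words are assumptions made for this example; real output contains more fields and varies with the features requested.

```python
import json

# Hand-written sample shaped like a batch transcript (illustrative values only).
SAMPLE = """
{
  "jobName": "example-job",
  "results": {
    "transcripts": [{"transcript": "Hello world."}],
    "items": [
      {"type": "pronunciation", "start_time": "0.00", "end_time": "0.42",
       "alternatives": [{"confidence": "0.99", "content": "Hello"}]},
      {"type": "pronunciation", "start_time": "0.42", "end_time": "0.91",
       "alternatives": [{"confidence": "0.62", "content": "world"}]},
      {"type": "punctuation",
       "alternatives": [{"confidence": "0.0", "content": "."}]}
    ]
  }
}
"""


def low_confidence_words(transcript, threshold=0.8):
    """Return (word, start_time, confidence) for spoken words below the threshold."""
    flagged = []
    for item in transcript["results"]["items"]:
        if item["type"] != "pronunciation":
            continue  # punctuation items carry no timestamps
        best = item["alternatives"][0]
        confidence = float(best["confidence"])  # numbers arrive as strings
        if confidence < threshold:
            flagged.append((best["content"], float(item["start_time"]), confidence))
    return flagged


transcript = json.loads(SAMPLE)
full_text = transcript["results"]["transcripts"][0]["transcript"]
flagged = low_confidence_words(transcript)
print(full_text)
print(flagged)
```

A list like this can be handed to a human reviewer so that only the uncertain words need to be checked against the recording.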

Streaming transcriptions

Streaming transcription is used to transcribe audio streams in real-time. The Amazon Transcribe streaming transcription service supports FLAC and PCM signed 16-bit little-endian audio (not WAV) as preferred formats, along with Ogg Opus. Set a sample rate that matches the audio file to avoid audio-to-text errors.
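
Because streaming expects raw PCM rather than a WAV container, a recording stored as WAV needs its header removed and its sample rate read before it is sent. The sketch below shows one way to do that with Python's standard wave module and then split the audio into short chunks. The helper names, the 100 ms chunk duration, and the silent test recording are assumptions for illustration; sending the chunks to the streaming service is not shown.

```python
import io
import wave


def read_pcm_from_wav(wav_bytes):
    """Return (sample_rate_hz, channels, raw_pcm_bytes) from 16-bit PCM WAV data."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        return wav.getframerate(), wav.getnchannels(), wav.readframes(wav.getnframes())


def chunk_pcm(pcm, sample_rate_hz, channels, chunk_ms=100):
    """Split raw PCM into chunks of about chunk_ms milliseconds, aligned to whole frames."""
    frames_per_chunk = sample_rate_hz * chunk_ms // 1000
    bytes_per_chunk = frames_per_chunk * channels * 2  # 2 bytes per 16-bit sample
    return [pcm[i:i + bytes_per_chunk] for i in range(0, len(pcm), bytes_per_chunk)]


def make_silent_wav(sample_rate_hz=16000, seconds=1):
    """Build a small mono WAV of silence in memory, used here only as test input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00\x00" * sample_rate_hz * seconds)
    return buffer.getvalue()


rate, channels, pcm = read_pcm_from_wav(make_silent_wav())
chunks = chunk_pcm(pcm, rate, channels)
print(rate, channels, len(pcm), len(chunks))
```

The sample rate returned by read_pcm_from_wav is the value to pass as the stream's sample rate, which avoids the mismatch errors mentioned above.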

You can use the AWS Management Console, HTTP/2, WebSockets, and various AWS SDKs for streaming transcriptions, depending on how you would like to use the transcription tool.

A streaming audio transcription walkthrough with the AWS Management Console is explained below.

Select Real-time transcription in the left navigation pane.
Select options like language, speaker identification, content removal, and customizations before starting your stream.
Click the Start streaming button to begin recording. The transcript appears in real time in the Transcription output box below.

Once the audio recording conversion is complete, you can click the Download full transcript button for a free download of the JSON file transcript.

Batch file transcription

Batch transcription is used to transcribe one or more existing media files stored in an Amazon S3 cloud storage bucket. With the batch service, you can queue up to 10,000 transcription jobs, which are processed in first-in, first-out order. Multiple jobs can be processed concurrently, with the number of simultaneous jobs depending on your subscription.

Batch transcription supports FLAC and WAV (with PCM 16-bit encoding) as preferred formats. However, other formats like AMR, M4A, MP3, MP4, Ogg, and WebM are also supported. Make sure to set a sample rate that matches the audio file to avoid audio-to-text errors.

You can use the AWS CLI, AWS Management Console, and various AWS SDKs to convert audio to text using the batch transcription process.
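
For the SDK route, a minimal Python sketch might look like the following. It assembles the arguments for the start_transcription_job call in the boto3 SDK and keeps the network call in a separate function. The job name, bucket, file path, region, and helper names are placeholders assumed for this example, and running submit requires boto3 and valid AWS credentials.

```python
def build_batch_request(job_name, media_uri, media_format,
                        language_code="en-US", output_bucket=None):
    """Assemble keyword arguments for a start_transcription_job call."""
    request = {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "MediaFormat": media_format,
        "Media": {"MediaFileUri": media_uri},
    }
    if output_bucket is not None:
        # When omitted, the transcript goes to a service-managed bucket.
        request["OutputBucketName"] = output_bucket
    return request


def submit(request, region_name="us-east-1"):
    """Send the request to Amazon Transcribe (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the builder works without it

    client = boto3.client("transcribe", region_name=region_name)
    return client.start_transcription_job(**request)


# Placeholder names: replace with your own job name and S3 location.
request = build_batch_request(
    job_name="example-meeting-notes",
    media_uri="s3://amzn-s3-demo-bucket/recordings/meeting.flac",
    media_format="flac",
)
print(request)
```

Keeping the request builder separate from the call makes it easy to inspect or log the job parameters before anything is submitted.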

A batch audio transcription walkthrough with the AWS Management Console is explained below.

Upload the media file you want to transcribe into an Amazon S3 bucket.
Select Transcription jobs in the left navigation pane. This takes you to a list of your transcription jobs.
Select Create job and fill in the fields on the Specify job details page.
Once you’ve configured the job, click the Create job button to begin.
Return to the Transcription jobs page, where you can see the status of your job.
Select the linked filepath in the right column under Output data location to view your JSON file transcript.

Note: If you chose a service-managed bucket for output, you can see a Transcription preview pane on your transcription job's information page, along with a Download button for your JSON audio-to-text file.

Complete the following pages during configuration.

Input data

On the Input data page, Input file location on S3 is the path to your audio file in an existing S3 bucket, and Output data is either a service-managed S3 bucket or your own S3 bucket.

Configure job

The Configure job page allows you to select customizations such as channel identification, content redaction and filtering, and custom vocabulary.
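
When a job is created through an SDK instead of the console, the same customizations are expressed as request parameters. The sketch below builds the optional Settings and ContentRedaction arguments that the boto3 start_transcription_job call accepts. The helper name build_job_options and the vocabulary and filter names are assumptions for this example, and the named vocabulary and filter would need to exist in your account already.

```python
def build_job_options(channel_identification=False, redact_pii=False,
                      vocabulary_name=None, vocabulary_filter_name=None,
                      vocabulary_filter_method="mask"):
    """Return optional keyword arguments to merge into a batch job request."""
    options = {}
    settings = {}
    if channel_identification:
        settings["ChannelIdentification"] = True
    if vocabulary_name:
        settings["VocabularyName"] = vocabulary_name
    if vocabulary_filter_name:
        settings["VocabularyFilterName"] = vocabulary_filter_name
        settings["VocabularyFilterMethod"] = vocabulary_filter_method
    if settings:
        options["Settings"] = settings
    if redact_pii:
        options["ContentRedaction"] = {
            "RedactionType": "PII",
            "RedactionOutput": "redacted",
        }
    return options


# Hypothetical vocabulary and filter names, used only for illustration.
options = build_job_options(
    channel_identification=True,
    redact_pii=True,
    vocabulary_name="example-product-terms",
    vocabulary_filter_name="example-profanity-filter",
)
print(options)
```

The returned dictionary can be merged into the other job arguments before the request is sent.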

What are some additional transcription capabilities?

Amazon Transcribe has a range of additional features for creating more useful, secure, and accurate transcripts when you convert audio or video files.

Custom vocabularies and language models

Users can create custom vocabularies and language models to accurately capture and transcribe audio with domain-specific brand names, acronyms, technical words, and jargon. Custom language models benefit large organizations with thriving internal language ecosystems or highly specialized, technical industries.

Custom vocabularies are user-created files that demonstrate how to pronounce specific words. For example, a project named VX02Q can be added to a custom vocabulary with the pronunciation V.X.-zero-two-Q.

Custom language models allow the audio-to-text model to complete extra training on an existing dataset to understand the context of domain-specific language. For example, if you train your model with a text upload of climate science research papers, your model may learn that 'ice floe' is a more likely word pair than 'ice flow'. Similarly, if you are referencing a product named ‘Bzntry’, an audio file dataset with multiple mentions of “bee-zen-tree” will automatically match the audio with the word output.

Batch and streaming audio-to-text transcription both support custom vocabularies and custom language models.

Automatic moderation

A custom vocabulary filter allows you to mask, replace, or tag ("vocabularyFilterMatch": true) a specific word or word combination in the JSON transcript output.

Examples:

Mask profane words with three asterisks (***)
Replace a pre-launch secret product name with the word ‘NewProduct’
Count the number of tags labeled “um” or “like” in a transcript to help a speaker hone their public speaking skills

Batch and streaming audio-to-text transcription both support vocabulary filters.
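
The filler-word example can be sketched in a few lines of Python. The code below counts items carrying the "vocabularyFilterMatch": true tag. The sample items are hand-written, and the assumption that the flag sits on each word item next to its "alternatives" list should be checked against your own transcript output.

```python
from collections import Counter


def count_tagged_words(items):
    """Count words whose transcript item is tagged as a vocabulary filter match."""
    counts = Counter()
    for item in items:
        if item.get("vocabularyFilterMatch") is True:
            word = item["alternatives"][0]["content"].lower()
            counts[word] += 1
    return counts


# Hand-written items for illustration; only the fields used above are included.
SAMPLE_ITEMS = [
    {"type": "pronunciation", "vocabularyFilterMatch": True,
     "alternatives": [{"content": "Um"}]},
    {"type": "pronunciation", "vocabularyFilterMatch": False,
     "alternatives": [{"content": "revenue"}]},
    {"type": "pronunciation", "vocabularyFilterMatch": True,
     "alternatives": [{"content": "like"}]},
    {"type": "pronunciation", "vocabularyFilterMatch": True,
     "alternatives": [{"content": "um"}]},
    {"type": "punctuation", "alternatives": [{"content": "."}]},
]

filler_counts = count_tagged_words(SAMPLE_ITEMS)
print(filler_counts.most_common())
```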

PII redaction and identification

Personally identifiable information (PII) can be automatically redacted and tagged in audio-to-text transcripts. This matters for businesses that store sensitive information, as PII can fall under strict confidentiality laws.

PII types recognized by Amazon Transcribe include names, addresses, email addresses, phone numbers, bank account details, PINs, and Social Security numbers. The audio-to-text converter replaces each redacted word with [PII] in the main text body of your transcript, and counts and categorizes it by type in the "redactions" JSON field.
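
After redaction, a quick sanity check on the transcript text can be useful before it is stored. The sketch below counts [PII] placeholders and looks for email-like strings that may have slipped through. The function names and the deliberately simple email pattern are assumptions for this example, not a substitute for the service's own redaction.

```python
import re

PLACEHOLDER = re.compile(r"\[PII\]")
# Deliberately simple pattern; it will not match every valid email address.
EMAIL_LIKE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def count_redactions(text):
    """Count how many [PII] placeholders appear in the transcript text."""
    return len(PLACEHOLDER.findall(text))


def has_email_like_text(text):
    """Report whether any email-like string remains in the transcript text."""
    return EMAIL_LIKE.search(text) is not None


redacted = "My name is [PII] and you can reach me at [PII]."
print(count_redactions(redacted), has_email_like_text(redacted))
```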

Subtitling

Amazon Transcribe allows users to generate WebVTT (*.vtt) and SubRip (*.srt) subtitle files to pair with videos, alongside the regular output JSON file. Subtitles are displayed at the same time as text is spoken in the audio or video file, and remain visible until there is a natural pause in the audio or the speaker finishes talking.
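
The two subtitle formats differ mainly in small details such as the timestamp separator: SubRip uses a comma before the milliseconds, while WebVTT uses a period. The sketch below formats one cue in either style to make the difference concrete. The function names and sample cue are assumptions for this example; the service generates complete subtitle files for you, and a real WebVTT file also begins with a WEBVTT header line.

```python
def format_timestamp(seconds, fmt):
    """Format seconds as HH:MM:SS,mmm (srt) or HH:MM:SS.mmm (vtt)."""
    if fmt not in ("srt", "vtt"):
        raise ValueError("fmt must be 'srt' or 'vtt'")
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    separator = "," if fmt == "srt" else "."
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{millis:03d}"


def format_cue(index, start, end, text, fmt):
    """Build one numbered subtitle cue as a block of text."""
    timing = f"{format_timestamp(start, fmt)} --> {format_timestamp(end, fmt)}"
    return f"{index}\n{timing}\n{text}\n"


print(format_cue(1, 0.0, 3.5, "Welcome to the quarterly review.", "srt"))
print(format_cue(1, 0.0, 3.5, "Welcome to the quarterly review.", "vtt"))
```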

Toxicity detection

Amazon Transcribe can be used to identify and classify toxic language. Toxic content is flagged and classified across seven categories: sexual harassment, hate speech, threat, abuse, profanity, insult, and graphic. Amazon Transcribe draws on audio cues such as tone and pitch, in addition to the words themselves, to deliver extra context to conversations.

Call analytics

Amazon Transcribe offers a special API for customer service and sales calls. You can use it to gain insights on customer and agent sentiment, call drivers, phrase mentions, non-talk time, interruptions, talk speed, real-time issue detection, and conversation summarization. Amazon Transcribe can also perform post-call audio recording redaction, replacing PII with silence for stored calls.

Medical transcription

Amazon Transcribe offers HIPAA-compliant APIs that provide accurate medical-language audio-to-text transcriptions from audio files while prioritizing patient data privacy and security. It is useful in clinician-patient interactions, where note-taking is time-consuming, distracting, and disruptive.

How can AWS support your audio transcription needs?

Audio-to-text transcription takes voice from a point-in-time communication method to a stored, searchable, analyzable, and highly valuable data source. Organizations using speech recognition to transcribe audio are finding significant benefits in productivity, training, customer service, sales, and more.

Embedding the Amazon Transcribe audio-to-text converter within your organization ensures voice recordings retain value and multiply their useful applications. Take a look at the range of AI solutions on AWS to help you build and scale apps faster and stronger.
