AWS News Blog

Amazon Transcribe – Accurate Speech To Text At Scale

Today we’re launching a private preview of Amazon Transcribe, an automatic speech recognition (ASR) service that makes it easy for developers to add speech-to-text capabilities to their applications. As bandwidth and connectivity improve, more and more of the world’s data is stored in video and audio formats, and people are creating and consuming that data faster than ever before. Businesses need a way to derive value from all of that rich multimedia content. With Amazon Transcribe you can replace the costly process of manual transcription with an efficient and scalable API.

You can analyze audio files stored in Amazon Simple Storage Service (S3) in many common formats (WAV, MP3, FLAC, etc.) by starting a job with the API. You’ll receive detailed and accurate transcriptions with timestamps for each word, as well as inferred punctuation. During the preview you can use the asynchronous transcription API to transcribe speech in English or Spanish.

Companies are looking to derive value from both their existing catalogs and their incoming data. By transcribing these stored media, companies can:

  • Analyze customer call data
  • Automate subtitle creation
  • Target advertising based on content
  • Enable rich search capabilities on archives of audio and video content

You can start a transcription job easily with the AWS Command Line Interface (CLI), AWS SDKs, or the Amazon Transcribe console.

Amazon Transcribe currently has three mostly self-explanatory API actions:

  • StartTranscriptionJob
  • GetTranscriptionJob
  • ListTranscriptionJobs
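
As a rough sketch of the third action, here is one way to collect the names of jobs in a given state. The helper name list_job_names is my own (hypothetical), and I'm assuming the response carries a TranscriptionJobSummaries list and an optional NextToken for paging, which is how the boto3 client returns it:

```python
def list_job_names(client, status='COMPLETED'):
    """Return the names of all transcription jobs with the given status.

    `client` is expected to behave like boto3.client('transcribe').
    """
    names = []
    kwargs = {'Status': status}
    while True:
        page = client.list_transcription_jobs(**kwargs)
        for summary in page.get('TranscriptionJobSummaries', []):
            names.append(summary['TranscriptionJobName'])
        # Keep requesting pages until the service stops returning a token
        token = page.get('NextToken')
        if not token:
            return names
        kwargs['NextToken'] = token
```

With real credentials you would call it as list_job_names(boto3.client('transcribe')).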

Here’s a quick Python script that starts a job and polls until the job is finished:

from __future__ import print_function
import time
import boto3

# Create a client for the Amazon Transcribe API
transcribe = boto3.client('transcribe')
job_name = "RandallTest1"
job_uri = "https://s3-us-west-2.amazonaws.com/randhunt-transcribe-demos/test.flac"

# Start an asynchronous transcription job for the audio file in S3
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': job_uri},
    MediaFormat='flac',
    LanguageCode='en-US',
    MediaSampleRateHertz=44100
)

# Poll every 5 seconds until the job either completes or fails
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(5)
print(status)
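
The loop above polls forever if a job never leaves the in-progress state. If that worries you, here is a minimal sketch of the same polling idea with an upper bound on the wait. The function name wait_for_job, its parameters, and the default timeout are my own assumptions, not part of the service:

```python
import time


def wait_for_job(client, job_name, poll_seconds=5, timeout_seconds=600,
                 sleep=time.sleep):
    """Poll GetTranscriptionJob until the job finishes or the timeout passes.

    `client` is expected to behave like boto3.client('transcribe').
    Returns the 'TranscriptionJob' dict once the status is COMPLETED or FAILED.
    """
    waited = 0
    while True:
        response = client.get_transcription_job(TranscriptionJobName=job_name)
        job = response['TranscriptionJob']
        if job['TranscriptionJobStatus'] in ('COMPLETED', 'FAILED'):
            return job
        if waited >= timeout_seconds:
            raise RuntimeError(
                "Job %s still not finished after %d seconds" % (job_name, waited))
        sleep(poll_seconds)
        waited += poll_seconds
```

The sleep argument is only there so the waiting can be swapped out in tests; in normal use you would leave it at the default.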

The result of a completed job includes a link to an Amazon S3 presigned URL that contains our transcription in JSON format:

{
  "jobName": "RandallTest1",
  "results": {
    "transcripts": [{"transcript": "Hello World", "confidence": 1}],
    "items": [
      {
        "start_time": "0.880", "end_time": "1.300",
        "alternatives": [{"confidence": 0.91, "word": "Hello"}]
      },
      {
        "start_time": "1.400", "end_time": "1.620",
        "alternatives": [{"confidence": 0.84, "word": "World"}]
      }
    ]
  },
  "status": "COMPLETED"
}

As you can see, you get timestamps and confidence scores for each word.
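
To show how you might use those fields, here is a small sketch that pulls the per-word timings out of a transcript document. It assumes the JSON looks exactly like the sample above (the field names may differ in other versions of the service), and the function name word_timings and its min_confidence parameter are hypothetical:

```python
import json


def word_timings(transcript_json, min_confidence=0.0):
    """Return (word, start_seconds, end_seconds, confidence) tuples.

    Only the first alternative of each item is used, and items whose
    confidence falls below `min_confidence` are skipped.
    """
    doc = json.loads(transcript_json)
    words = []
    for item in doc['results']['items']:
        best = item['alternatives'][0]
        confidence = float(best['confidence'])
        if confidence < min_confidence:
            continue
        # Timestamps arrive as strings of seconds, so convert them to floats
        words.append((best['word'],
                      float(item['start_time']),
                      float(item['end_time']),
                      confidence))
    return words
```

Raising min_confidence gives you a quick way to flag or drop words the service was less sure about before handing the text to a subtitle or search pipeline.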

Whether alone or combined with other Amazon AI services, this is a powerful service and I can’t wait to see what our customers build with it! Sign up for the preview today.

Randall

P.S.
You might have noticed this lends itself well to AWS Step Functions and I thought the same. Here’s a workflow I might use: