Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for AWS customers to add speech-to-text capabilities to their applications.
In this tutorial, you learn how to add privacy to your transcriptions using the automatic content redaction feature of Amazon Transcribe.
A popular use case for Amazon Transcribe is the automatic transcription of customer calls (call centers, telemarketing, and so on) to build data sets for downstream analytics and natural language processing tasks, such as sentiment analysis. In these cases, personally identifiable information (PII) may need to be removed to protect privacy and comply with local laws and regulations. Within an organization, you may also want to limit which transcription data is exposed to team members with differing levels of access or viewing permissions. Accurate redaction of sensitive text is hard to achieve at scale because manual efforts are tedious, error-prone, and time-consuming. Amazon Transcribe addresses this by supporting automatic redaction of PII from transcriptions.
You can use the automatic content redaction feature through the AWS Management Console or through the API. This tutorial walks you through each option.
In this tutorial, you learn how to:
- Upload an audio file to Amazon S3 for transcription
- Create and start an Amazon Transcribe job
- Clean up tutorial resources
- Review code to complete the tutorial tasks using the API
The cost of completing this tutorial is less than $1.
Note: Automatic content redaction currently only supports US English transcriptions and is a premium feature that is available in all AWS Regions where Amazon Transcribe operates today.
About this Tutorial | |
---|---|
Time | 10 minutes |
Cost | Less than $1 |
Use Case | Machine Learning |
Products | Amazon Transcribe |
Audience | Developer |
Level | Intermediate |
Last Updated | February 11, 2021 |
Before you begin
You must have an AWS account to complete this tutorial. If you do not already have an account, choose Sign up for AWS and create a new account.
Step 1. Upload an audio file for transcription to Amazon S3
Complete the following steps to create an Amazon S3 bucket and upload a sample audio file for transcription. You can download the sample audio file here: content-redaction-sample.wav
If you prefer to use your own audio file, we recommend a lossless format, such as FLAC or WAV, with PCM 16-bit encoding. Use a sample rate of 8,000 Hz for low-fidelity audio and 16,000 Hz for high-fidelity audio. Amazon Transcribe also supports MP3, MP4, Ogg, WebM, and AMR formats.
Note: For more information, see How Amazon Transcribe Works in the Amazon Transcribe documentation.
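If you bring your own WAV file, a quick local check can confirm that it matches the recommendation above before you upload it. The following is a minimal sketch that uses only the Python standard library `wave` module, which reads uncompressed PCM WAV files. The helper names `describe_wav` and `meets_recommendation` are hypothetical and are not part of any AWS SDK.

```python
import wave

# Sample rates recommended in this tutorial:
# 8,000 Hz for low-fidelity audio and 16,000 Hz for high-fidelity audio.
RECOMMENDED_RATES_HZ = (8000, 16000)


def describe_wav(path):
    """Return basic properties of an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "bits_per_sample": wav.getsampwidth() * 8,  # sample width is in bytes
            "duration_s": wav.getnframes() / rate,
        }


def meets_recommendation(info):
    """True if the file is 16-bit PCM at one of the recommended sample rates."""
    return (
        info["bits_per_sample"] == 16
        and info["sample_rate_hz"] in RECOMMENDED_RATES_HZ
    )
```

For example, `meets_recommendation(describe_wav("my-call.wav"))` returns `False` for a 44,100 Hz recording, which tells you that the file differs from the recommended sample rates. The file name here is a placeholder.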
a. Sign in to the Amazon S3 console and choose Create bucket.
Note: You can also use one of your existing S3 buckets.
b. On the Create bucket page, in the Bucket name field, type a unique bucket name. For Region, choose a Region where Amazon Transcribe is available. Keep the remaining default settings and choose Create bucket.
Note: This step is optional. If you do not choose to enable bucket versioning, you must acknowledge that bucket versioning is not in place.
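The bucket creation and upload can also be done programmatically. The following is a minimal sketch, assuming the AWS SDK for Python (Boto3) is installed and credentials are configured. The helper names, the Region value, and the object key are illustrative assumptions. The helpers take the S3 client as a parameter so that they can be exercised without AWS access.

```python
def create_bucket(s3_client, bucket_name, region):
    """Create an S3 bucket in the given Region.

    For us-east-1, S3 expects no LocationConstraint; other Regions need one.
    """
    params = {"Bucket": bucket_name}
    if region != "us-east-1":
        params["CreateBucketConfiguration"] = {"LocationConstraint": region}
    s3_client.create_bucket(**params)


def upload_audio(s3_client, local_path, bucket_name, object_key):
    """Upload the audio file and return the S3 URI used as the job input."""
    s3_client.upload_file(local_path, bucket_name, object_key)
    return f"s3://{bucket_name}/{object_key}"


def main():
    # Imported here so the helpers above can be used without Boto3 installed.
    import boto3

    region = "us-east-1"  # assumption: pick a Region where Amazon Transcribe is available
    bucket = "<bucket-name>"  # placeholder: replace with a globally unique name
    s3 = boto3.client("s3", region_name=region)
    create_bucket(s3, bucket, region)
    uri = upload_audio(
        s3, "content-redaction-sample.wav", bucket, "content-redaction-sample.wav"
    )
    print("S3 URI:", uri)
```

After you replace the placeholder bucket name, calling `main()` prints the S3 URI that you can use as the input of the transcription job.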
Step 2. Create an Amazon Transcribe transcription job
In this step, you create your transcription job in the Amazon Transcribe console. Amazon Transcribe analyzes audio files that contain speech and uses advanced machine learning techniques to transcribe the voice data into text. Amazon Transcribe's automatic content redaction feature automatically redacts sensitive personally identifiable information (PII) from your transcription results. It replaces each identified instance of PII with a [PII] tag in the transcript.
Complete the following steps to start a transcription job with automatic content redaction.
Note: For more information, see Create a Transcription Job and Automatic Content Redaction in the Amazon Transcribe documentation.
- In the Job settings box, for Name, type tutorial-transcription-job.
- In the Input data section, paste your S3 URI from Step 1i. (If you do not have the S3 URI, choose Browse S3 and browse to the content-redaction-sample.wav file in your S3 bucket.)
- In the Output data section, choose Service-managed S3 bucket.
Then, choose Next.
Step 3. Clean up
In this step, you terminate the resources you used in this tutorial.
Important: Terminating resources that are not actively being used reduces costs and is a best practice. Not terminating your resources will result in charges to your account.
Delete the transcription job:
- Open the Amazon Transcribe console.
- In the left navigation pane, choose Transcription jobs.
- Choose tutorial-transcription-job, and then choose Delete.
- Choose Delete to confirm.
Delete the audio file and S3 bucket:
- Open the S3 console.
- Select the S3 bucket you created for this tutorial, and choose Empty. Type permanently delete and choose Empty. Choose Exit.
- Select the S3 bucket you created for this tutorial and choose Delete. Type the name of the bucket and choose Delete bucket.
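The same cleanup can be scripted. The following is a minimal sketch, assuming Boto3 clients for Amazon Transcribe and Amazon S3 and a bucket without versioning enabled (a versioned bucket also needs its object versions deleted). The helper names are hypothetical.

```python
def delete_job(transcribe_client, job_name):
    """Delete the transcription job created for this tutorial."""
    transcribe_client.delete_transcription_job(TranscriptionJobName=job_name)


def empty_and_delete_bucket(s3_client, bucket_name):
    """Delete every object in an unversioned bucket, then the bucket itself.

    Returns the number of objects that were deleted.
    """
    deleted = 0
    while True:
        # Each listing returns at most 1,000 keys, so list again after each delete.
        page = s3_client.list_objects_v2(Bucket=bucket_name)
        objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if not objects:
            break
        response = s3_client.delete_objects(
            Bucket=bucket_name, Delete={"Objects": objects}
        )
        if response.get("Errors"):
            raise RuntimeError(f"Could not delete some objects: {response['Errors']}")
        deleted += len(objects)
    s3_client.delete_bucket(Bucket=bucket_name)
    return deleted
```

With real clients, this might be called as `delete_job(boto3.client("transcribe"), "tutorial-transcription-job")` followed by `empty_and_delete_bucket(boto3.client("s3"), "<bucket-name>")`. Deletion is permanent, so check the bucket name before you run it.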
Step 4. Review code for Amazon Transcribe API
This step shows you the programmatic (API) option for completing Step 1 and Step 2. The following script uses the AWS SDK for Python (Boto3) to start the transcription job and get the job results.
Review the following script to learn more about using the Amazon Transcribe API to create a transcription job with automatic content redaction.
Note: For more information, see Getting Started Using the API in the Amazon Transcribe documentation.
Start transcription job
This script uses the start_transcription_job() method to start an asynchronous job that transcribes speech to text. As part of the method call, you provide information about the job, including the job name, the S3 bucket location and object key of the audio file, the language code, and optional configuration. In this code sample, the optional configuration enables content redaction for PII.
To learn more about the Amazon Transcribe specific boto3 methods, see TranscribeService in the Boto3 documentation.
import boto3
import time

# Create the Amazon Transcribe client using your default credentials and Region.
transcribe = boto3.client('transcribe')

job_name = "<Job Name>"
audio_file = "s3://<bucket-name>/<object-key of the audio file to be transcribed>"

# Start an asynchronous transcription job with PII redaction enabled.
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': audio_file},
    LanguageCode='en-US',
    ContentRedaction={
        'RedactionType': 'PII',
        # Produce both the redacted and the full (unredacted) transcript.
        'RedactionOutput': 'redacted_and_unredacted'
    }
)

# Poll every 5 seconds until the job completes or fails.
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(5)

if status['TranscriptionJob']['TranscriptionJobStatus'] == 'COMPLETED':
    print("Full Transcript available at -> "+status['TranscriptionJob']['Transcript']['TranscriptFileUri'])
    print("Redacted Transcript available at -> "+status['TranscriptionJob']['Transcript']['RedactedTranscriptFileUri'])
else:
    print("Transcription Job Failed.")
The following code shows sample output of the transcription job script.
Not ready yet...
Not ready yet...
Not ready yet...
Full Transcript available at -> https://s3.<region>.amazonaws.com/aws-transcribe-<region>/<account>/<job-name>/asrOutput.json
Redacted Transcript available at -> https://s3.<region>.amazonaws.com/aws-transcribe-<region>/<account>/<job-name>/asrOutputRedacted.json
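The polling loop in the script runs until the job reaches a final state. If you want an upper bound on the wait, one option is a small helper with a timeout. The following is a minimal sketch. The name `wait_for_job` and its default values are assumptions for illustration, and the status lookup is passed in as a function so that the helper does not depend on a live client.

```python
import time


def wait_for_job(get_status, poll_seconds=5, timeout_seconds=600,
                 sleep=time.sleep, clock=time.monotonic):
    """Call get_status() until it returns COMPLETED or FAILED.

    Raises TimeoutError if the job is still running after timeout_seconds.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in ("COMPLETED", "FAILED"):
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"Job is still {status} after {timeout_seconds} seconds"
            )
        sleep(poll_seconds)
```

With the client from the script, the status function could be `lambda: transcribe.get_transcription_job(TranscriptionJobName=job_name)['TranscriptionJob']['TranscriptionJobStatus']`.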
Review job output
- asrOutput.json: This file contains the full transcript.
- asrOutputRedacted.json: This file contains the redacted transcript.
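To read either file in code, you can download the JSON document and pull out the transcript text. The following is a minimal sketch, assuming that the URI printed by the script can be fetched directly over HTTPS while it is valid, and that the document stores the text under `results.transcripts`. The helper names are hypothetical.

```python
import json
import urllib.request


def fetch_json(uri):
    """Download a JSON document from a URI and parse it."""
    with urllib.request.urlopen(uri) as response:
        return json.load(response)


def transcript_text(document):
    """Join the text of every entry under results.transcripts."""
    return " ".join(
        entry["transcript"] for entry in document["results"]["transcripts"]
    )
```

For example, `transcript_text(fetch_json(redacted_uri))` returns the redacted text with each PII instance replaced by [PII], where `redacted_uri` is the value of RedactedTranscriptFileUri from the job status.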
Review redaction
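In the redacted transcript, an item that Amazon Transcribe classified as PII shows up as the following:
{
    "start_time":"13.96",
    "end_time":"20.05",
    "alternatives":[
        {
            "content":"[PII]",
            "redactions":[
                {
                    "confidence":"1.0"
                }
            ]
        }
    ],
    "type":"pronunciation"
}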
However, non-PII pronunciation/punctuation would show up as the following:
{
    "start_time":"26.97",
    "end_time":"27.19",
    "alternatives":[
        {
            "confidence":"1.0",
            "content":"card"
        }
    ],
    "type":"pronunciation"
}
The purpose of these fields is to provide as much information as possible about each pronunciation or punctuation item detected in the speech. Notice that in the redacted transcript, Amazon Transcribe marks a pronunciation item that is classified as PII as "content":"[PII]". In the unredacted transcript, the same item is marked as "content":"4444333321111".
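To find where redaction occurred in the audio, you can scan the items for the [PII] tag. The following is a minimal sketch, assuming the items have the shape shown above and are taken from the `results.items` list of the redacted transcript. The helper name `redacted_spans` is hypothetical.

```python
PII_TAG = "[PII]"


def redacted_spans(items):
    """Return (start, end) times in seconds for each item tagged as [PII]."""
    spans = []
    for item in items:
        contents = [alt.get("content") for alt in item.get("alternatives", [])]
        if PII_TAG in contents:
            spans.append((float(item["start_time"]), float(item["end_time"])))
    return spans


# The two sample items from this section.
sample_items = [
    {
        "start_time": "13.96",
        "end_time": "20.05",
        "alternatives": [
            {"content": "[PII]", "redactions": [{"confidence": "1.0"}]}
        ],
        "type": "pronunciation",
    },
    {
        "start_time": "26.97",
        "end_time": "27.19",
        "alternatives": [{"confidence": "1.0", "content": "card"}],
        "type": "pronunciation",
    },
]

print(redacted_spans(sample_items))  # [(13.96, 20.05)]
```

The time ranges could then be used, for example, to review or mute the matching sections of the original recording.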
Congratulations
You have successfully created a transcription of an audio file and redacted sensitive personally identifiable information (PII) by using Amazon Transcribe.
Recommended next steps
Learn more
Learn more about Amazon Transcribe by reading the Amazon Transcribe Developer Guide.
Learn about Amazon Transcribe Medical
See the Amazon Transcribe Medical page for more information.