Add privacy to your transcriptions with Amazon Transcribe

Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for AWS customers to add speech-to-text capabilities to their applications.

In this tutorial, you learn how to add privacy to your transcriptions using the automatic content redaction feature of Amazon Transcribe.

A popular use case for Amazon Transcribe is the automatic transcription of customer calls (call centers, telemarketing, and so on) to build data sets for downstream analytics and natural language processing tasks, such as sentiment analysis. In these cases, personally identifiable information (PII) may need to be removed to protect privacy and comply with local laws and regulations. Within an organization, you may also want to limit which transcription data is visible to team members with different access levels or viewing permissions. Accurate redaction of sensitive text is hard to achieve at scale, because manual efforts are tedious, error prone, and time consuming. Fortunately, Amazon Transcribe supports automatic redaction of PII from transcriptions.

You can use the automatic content redaction feature through the AWS Management Console or through the API. This tutorial walks you through each option.

In this tutorial, you learn how to:

  1. Upload an audio file to be transcribed to Amazon S3
  2. Create and start an Amazon Transcribe job
  3. Clean up tutorial resources
  4. Review code that completes the tutorial tasks using the API

The cost of completing this tutorial is less than $1.

Note: Automatic content redaction currently only supports US English transcriptions and is a premium feature that is available in all AWS Regions where Amazon Transcribe operates today.

About this Tutorial
Time: 10 minutes
Cost: Less than $1
Use Case: Machine Learning
Products: Amazon Transcribe
Audience: Developer
Level: Intermediate
Last Updated: February 11, 2021

Before you begin

You must have an AWS account to complete this tutorial. If you do not already have an account, choose Sign up for AWS and create a new account.

Step 1. Upload an audio file for transcription to Amazon S3

Complete the following steps to create an Amazon S3 bucket and upload a sample audio file for transcription. You can download the sample audio file here: content-redaction-sample.wav

If you prefer to use your own audio file, we recommend that you use a lossless format, such as FLAC or WAV, with PCM 16-bit encoding. Use a sample rate of 8,000 Hz for low-fidelity audio and 16,000 Hz for high-fidelity audio. Amazon Transcribe also supports MP3, MP4, Ogg, WebM, and AMR formats.
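If you plan to use your own WAV file, you can check it against these recommendations before uploading. The following is a minimal sketch that uses only the Python standard library `wave` module. The function names `describe_wav` and `meets_recommendation` are hypothetical helpers introduced for illustration, and the check treats 8,000 Hz and 16,000 Hz as the only accepted sample rates, which is a simplification of the guidance above.

```python
import wave


def describe_wav(path):
    """Return the sample rate (Hz), bit depth, and channel count of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "bit_depth": wav.getsampwidth() * 8,  # sample width is reported in bytes
            "channels": wav.getnchannels(),
        }


def meets_recommendation(info):
    """True for 16-bit audio sampled at 8,000 Hz or 16,000 Hz (simplified check)."""
    return info["bit_depth"] == 16 and info["sample_rate_hz"] in (8000, 16000)
```

For example, `meets_recommendation(describe_wav("my-call.wav"))` returns `True` or `False` for a hypothetical local file named my-call.wav. A file that fails the check may still be transcribed, since the values above are recommendations.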

Note: For more information, see How Amazon Transcribe Works in the Amazon Transcribe documentation.


a. Sign in to the Amazon S3 console and choose Create bucket.

Note: You can also use one of your existing S3 buckets.

b. On the Create bucket page, in the Bucket name field, type a unique bucket name. For Region, choose a Region where Amazon Transcribe is available. Keep the remaining default settings and choose Create bucket.

c. In the list of Buckets, choose your newly created bucket. (Or, choose View details in the bucket creation banner.) 

d. In the bucket details view, choose Upload.

e. On the Upload page, in the Files and folders section, choose Add files. Then, browse to and open the content-redaction-sample.wav file.
 
Note: Make sure you have first downloaded the content-redaction-sample.wav file to your local hard drive.

f. In the Destination section, choose Enable Bucket Versioning.

Note: This step is optional. If you do not choose to enable bucket versioning, you must acknowledge that bucket versioning is not in place.


g. Choose Upload.
A status message appears showing the upload progress.

h. Choose the newly uploaded file to open the details view.

i. In the object details view, choose Copy S3 URI and make note of the value. You need this value in the next step.
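The console steps above can also be sketched with the AWS SDK for Python (Boto3). This is an illustrative sketch, not part of the original tutorial script: the helper names, the `us-east-1` Region, and the `<bucket-name>` placeholder are assumptions, and the helpers take the S3 client as a parameter so they can be tried with any client object. It does not enable bucket versioning.

```python
def s3_uri(bucket, key):
    """Build the S3 URI that the transcription job expects as input."""
    return f"s3://{bucket}/{key}"


def create_bucket_and_upload(s3_client, region, bucket, local_path, key):
    """Create the bucket in `region`, upload the file, and return its S3 URI."""
    if region == "us-east-1":
        # us-east-1 is the default location and rejects an explicit LocationConstraint.
        s3_client.create_bucket(Bucket=bucket)
    else:
        s3_client.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    s3_client.upload_file(local_path, bucket, key)
    return s3_uri(bucket, key)


def main():
    import boto3  # imported here so the helpers above do not require the SDK

    region = "us-east-1"  # assumption: choose a Region where Amazon Transcribe is available
    s3 = boto3.client("s3", region_name=region)
    uri = create_bucket_and_upload(
        s3,
        region,
        "<bucket-name>",  # placeholder: bucket names must be globally unique
        "content-redaction-sample.wav",
        "content-redaction-sample.wav",
    )
    print(uri)
```

Calling `main()` with valid AWS credentials prints the S3 URI of the uploaded file, which is the same value you copy from the console in the last step above.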

Step 2. Create an Amazon Transcribe transcription job

In this step, you create your transcription job in the Amazon Transcribe console. Amazon Transcribe analyzes audio files that contain speech and uses advanced machine learning techniques to transcribe the voice data into text. Amazon Transcribe's automatic content redaction feature automatically redacts sensitive personally identifiable information (PII) from your transcription results. It replaces each identified instance of PII with a [PII] tag in the transcript.

Complete the following steps to start a transcription job with automatic content redaction.

Note: For more information, see Create a Transcription Job and Automatic Content Redaction in the Amazon Transcribe documentation.


a. Open the Amazon Transcribe console and choose Launch Amazon Transcribe.

b. In the left navigation pane, choose Transcription jobs. Then, choose Create job.

c. On the Specify job details page, specify the following:
  • In the Job settings box, for Name, type tutorial-transcription-job.
  • In the Input data section, paste your S3 URI from Step 1i. (If you do not have the S3 URI, choose Browse S3 and browse to the content-redaction-sample.wav file in your S3 bucket.)
  • In the Output data section, choose Service-managed S3 bucket.

Then, choose Next.

d. On the Configure job page, in the Content removal box, choose Automatic content redaction and select Include unredacted transcript in job output. Then, choose Create.

e. Wait for the status of your job to change from In progress to Complete, then choose the tutorial-transcription-job.

In the Transcription preview section of the job details page, you can see that Amazon Transcribe has replaced all personally identifiable information (PII) in the transcript with [PII].

Optionally, download a copy of the transcript. At the top of the Job details view, choose Download full transcript, and choose the unredacted or redacted version.

Step 3. Clean up

In this step, you terminate the resources you used in this tutorial.

Important: Terminating resources that are not actively being used reduces costs and is a best practice. Not terminating your resources will result in charges to your account.


Delete the transcription job:

  1. Open the Amazon Transcribe console.
  2. In the left navigation pane, choose Transcription jobs.
  3. Choose the tutorial-transcription-job, then choose Delete.
  4. Choose Delete to confirm.

Delete the audio file and S3 bucket:

  1. Open the S3 console.
  2. Select the S3 bucket you created for this tutorial, and choose Empty. Type permanently delete and choose Empty. Choose Exit.
  3. Select the S3 bucket you created for this tutorial and choose Delete. Type the name of the bucket and choose Delete bucket.
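The same cleanup can be sketched with Boto3. This is a hedged example rather than part of the original tutorial: `clean_up` is a hypothetical helper, and it expects a Transcribe client plus a Boto3 S3 `Bucket` resource, for instance `boto3.client('transcribe')` and `boto3.resource('s3').Bucket('<bucket-name>')`. It deletes object versions as well as current objects, on the assumption that you may have enabled bucket versioning in Step 1.

```python
def clean_up(transcribe_client, bucket_resource, job_name):
    """Delete the transcription job, then empty and delete the S3 bucket."""
    # Remove the job and its service-managed results from Amazon Transcribe.
    transcribe_client.delete_transcription_job(TranscriptionJobName=job_name)

    # A bucket must be empty before it can be deleted. Delete current objects,
    # then any remaining versions and delete markers (relevant if versioning is on).
    bucket_resource.objects.all().delete()
    bucket_resource.object_versions.all().delete()

    bucket_resource.delete()
```

This permanently removes everything in the bucket, so only point it at the bucket you created for this tutorial.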

Step 4. Review code for Amazon Transcribe API

This step shows how to complete the transcription tasks programmatically through the API instead of the console. The following script uses the AWS SDK for Python (Boto3) to start the transcription job and get the job results. It assumes that the audio file is already stored in Amazon S3.

Review the following script to learn more about using the Amazon Transcribe API to create a transcription job with automatic content redaction.

Note: For more information, see Getting Started Using the API in the Amazon Transcribe documentation.


Start transcription job

The following script starts the transcription job with content redaction and prints the job metadata when the transcription is complete.

This script uses the start_transcription_job() method to start an asynchronous job to transcribe speech to text. As part of the method call, you need to provide some information about the job, including the job name, S3 bucket location, object key of the audio file, language code, and optional configuration. In this code sample, optional configuration includes content redaction for PII.

To learn more about the Amazon Transcribe specific boto3 methods, see TranscribeService in the Boto3 documentation.

import time

import boto3

# Create an Amazon Transcribe client using your default AWS credentials and Region.
transcribe = boto3.client('transcribe')

job_name = "<Job Name>"
audio_file = "s3://<bucket-name>/<object-key of the audio file to be transcribed>"

# Start an asynchronous job that redacts PII and also keeps the unredacted transcript.
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={'MediaFileUri': audio_file},
    LanguageCode='en-US',
    ContentRedaction={
        'RedactionType': 'PII',
        'RedactionOutput': 'redacted_and_unredacted'
    }
)

# Poll every 5 seconds until the job reaches a terminal state.
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    job_status = status['TranscriptionJob']['TranscriptionJobStatus']
    if job_status in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(5)

# Print the location of both transcripts on success.
if job_status == 'COMPLETED':
    transcript = status['TranscriptionJob']['Transcript']
    print("Full Transcript available at -> " + transcript['TranscriptFileUri'])
    print("Redacted Transcript available at -> " + transcript['RedactedTranscriptFileUri'])
else:
    print("Transcription Job Failed.")

The following code shows sample output of the transcription job script. 

Not ready yet...
Not ready yet...
Not ready yet...
Full Transcript available at -> https://s3.<region>.amazonaws.com/aws-transcribe-<region>/<account>/<job-name>/asrOutput.json
Redacted Transcript available at -> https://s3.<region>.amazonaws.com/aws-transcribe-<region>/<account>/<job-name>/asrOutputRedacted.json
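The polling loop in the script runs until the job reaches COMPLETED or FAILED, with no upper bound on the wait. If you want a bounded wait, one option is a small helper like the following sketch. The name `wait_for_job`, its parameters, and the default timeout of 600 seconds are assumptions made for illustration. It relies only on the get_transcription_job() call already used in the script.

```python
import time


def wait_for_job(client, job_name, timeout_s=600.0, interval_s=5.0, sleep=time.sleep):
    """Poll a transcription job until it completes, fails, or the timeout elapses.

    Returns the 'TranscriptionJob' dictionary from the final response.
    Raises TimeoutError if the job is still running after `timeout_s` seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        response = client.get_transcription_job(TranscriptionJobName=job_name)
        job = response["TranscriptionJob"]
        if job["TranscriptionJobStatus"] in ("COMPLETED", "FAILED"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_name!r} did not finish within {timeout_s} seconds")
        sleep(interval_s)
```

With the client from the script, `job = wait_for_job(transcribe, job_name)` could replace the while loop, and `job['TranscriptionJobStatus']` then tells you whether the job completed or failed.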

Review job output

The output produces links to two transcript files (which are the same files you produce when creating a transcription job through the Amazon Transcribe console). The transcripts are JSON files that contain data and metadata about the transcription output of Amazon Transcribe.
  • asrOutput.json: This file contains the full transcript.
  • asrOutputRedacted.json: This file contains the redacted transcript.
These files are formatted to provide information about each pronunciation/punctuation that is detected as part of the speech. You can consume this JSON data in any way you like in the downstream applications you develop.
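As one example of consuming this JSON, the following sketch loads a transcript file and extracts the full transcript text. It assumes the transcript URI printed by the script can be fetched directly over HTTPS, and that the file has the usual Amazon Transcribe layout with `results.transcripts` and `results.items`. The function names are hypothetical helpers.

```python
import json
import urllib.request


def load_transcript(uri):
    """Download a transcript JSON file and parse it into a dictionary."""
    with urllib.request.urlopen(uri) as response:
        return json.load(response)


def transcript_text(data):
    """Return the full transcript as a single string."""
    return data["results"]["transcripts"][0]["transcript"]


def count_item_types(data):
    """Count how many items of each type (pronunciation, punctuation) were detected."""
    counts = {}
    for item in data["results"]["items"]:
        counts[item["type"]] = counts.get(item["type"], 0) + 1
    return counts
```

For instance, `transcript_text(load_transcript(uri))` returns the text of whichever file `uri` points to, so passing the redacted URI yields text in which PII is replaced with [PII].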

Review redaction

In the redacted transcript JSON, PII data appears as follows:

{
   "start_time":"13.96",
   "end_time":"20.05",
   "alternatives":[
      {
         "content":"[PII]",
         "redactions":[
            {
               "confidence":"1.0"
            }
         ]
      }
   ],
   "type":"pronunciation"
}

However, non-PII pronunciation/punctuation would show up as the following:

{
   "start_time":"26.97",
   "end_time":"27.19",
   "alternatives":[
      {
         "confidence":"1.0",
         "content":"card"
      }
   ],
   "type":"pronunciation"
}

The purpose of these fields is to provide as much information as possible about each pronunciation or punctuation item detected in the speech. Notice that in the redacted transcript, Amazon Transcribe marks a pronunciation that is classified as PII as "content":"[PII]". In the unredacted transcript, however, the same pronunciation is marked as "content":"4444333321111".
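To locate the redacted portions of a recording, you can scan the items of the redacted transcript for the [PII] tag. The following sketch assumes each item has the shape shown in the samples above. `redacted_spans` is a hypothetical helper, and `items` is the list of item dictionaries from the transcript JSON.

```python
def redacted_spans(items, tag="[PII]"):
    """Return (start_seconds, end_seconds) for every item whose content is the PII tag."""
    spans = []
    for item in items:
        alternatives = item.get("alternatives", [])
        if any(alt.get("content") == tag for alt in alternatives):
            # Times are stored as strings in the transcript JSON.
            spans.append((float(item["start_time"]), float(item["end_time"])))
    return spans
```

Applied to the two sample items above, the function returns a single span covering 13.96 to 20.05 seconds, because only the first item contains the [PII] tag.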

Congratulations

You have successfully created a transcription of an audio file and redacted sensitive personally identifiable information (PII) by using Amazon Transcribe.

Learn more

Learn more about Amazon Transcribe by reading the Amazon Transcribe Developer Guide.

Learn about Amazon Transcribe Medical

See the Amazon Transcribe Medical page for more information.

Explore other machine learning services tutorials