AWS Machine Learning Blog
How to redact PII data in conversation transcripts
Customer service interactions often contain personally identifiable information (PII) such as names, phone numbers, and dates of birth. As organizations incorporate machine learning (ML) and analytics into their applications, using this data can provide insights on how to create more seamless customer experiences. However, the presence of PII often restricts the use of this data. In this blog post, we will review a solution to automatically redact PII data from a customer service conversation transcript.
Let’s take an example conversation between a customer and a call center agent.
Agent: Hi, thank you for calling us today. Whom do I have the pleasure of speaking with today?
Caller: Hello, my name is John Stiles.
Agent: Hi John, how may I help you?
Caller: I haven’t received my W2 statement yet and wanted to check on its status.
Agent: Sure, I can help you with that. Can you please confirm the last four digits of your Social Security number?
Caller: Yes, it’s 1111.
Agent: Ok. I’m pulling up the status now. I see that it was sent out yesterday, and the estimated arrival is early next week. Would you like me to turn on automated alerts so you can be notified of any delays?
Caller: Yes, please.
Agent: The number we have on file for you is 555-456-7890. Is that still correct?
Caller: Yes, it is.
Agent: Great. I have turned on automated notifications. Is there anything else I can assist you with, John?
Caller: No, that’s all. Thank you.
Agent: Thank you, John. Have a great day.
In this brief interaction, there are several pieces of data that would generally be considered PII, including the caller’s name, the last four digits of their Social Security number, and the phone number. Let’s review how we can redact this PII data in the transcript.
Solution overview
We will create an AWS Step Functions state machine, which orchestrates an Amazon Comprehend PII redaction job. Amazon Comprehend is a natural-language processing (NLP) service that uses machine learning to uncover valuable insights and connections in text, including the ability to detect and redact PII data.
You will provide the transcripts in the input Amazon S3 bucket. The transcripts are in the format used by Contact Lens for Amazon Connect. You will also specify an output S3 bucket, which stores the redaction output as well as intermediate data. The intermediate data are micro-batched versions of the input data. For example, if there are 10,000 conversations to be redacted, the workflow will split them into 10 batches of 1,000 conversations each. Each batch is stored using a unique prefix, which is then used as the input source for Comprehend. The Step Functions map state is used to execute these redaction jobs in parallel by calling the StartPiiEntitiesDetectionJob API. This approach allows you to run multiple jobs in parallel rather than individual jobs in sequence. Since the job is implemented as a Step Functions state machine, it can be triggered to run manually or automatically as part of a daily process.
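The micro-batching step described above can be sketched roughly as follows. This is an illustration only, not the code that the solution's Lambda functions run: the function names, the `batch-NNNN/` prefix scheme, the `output/` location, and the role ARN are all assumptions. The request fields follow the parameters of Comprehend's StartPiiEntitiesDetectionJob API, and the chosen values (one document per file, redaction-only mode, all entity types) are one plausible configuration.

```python
from typing import Dict, List


def split_into_batches(keys: List[str], batch_size: int = 1000) -> Dict[str, List[str]]:
    """Group object keys into micro-batches, each under its own prefix."""
    batches: Dict[str, List[str]] = {}
    for start in range(0, len(keys), batch_size):
        # Hypothetical naming scheme; any unique prefix per batch would do.
        prefix = f"batch-{start // batch_size:04d}/"
        batches[prefix] = keys[start:start + batch_size]
    return batches


def build_redaction_job_request(bucket: str, prefix: str, role_arn: str) -> dict:
    """Build keyword arguments for comprehend.start_pii_entities_detection_job."""
    return {
        "InputDataConfig": {
            "S3Uri": f"s3://{bucket}/{prefix}",
            "InputFormat": "ONE_DOC_PER_FILE",
        },
        # Assumed output location; the real workflow may lay this out differently.
        "OutputDataConfig": {"S3Uri": f"s3://{bucket}/output/{prefix}"},
        "Mode": "ONLY_REDACTION",
        "RedactionConfig": {
            "PiiEntityTypes": ["ALL"],
            "MaskMode": "REPLACE_WITH_PII_ENTITY_TYPE",
        },
        "DataAccessRoleArn": role_arn,
        "LanguageCode": "en",
    }
```

With boto3 installed and credentials configured, each request could be submitted with `boto3.client("comprehend").start_pii_entities_detection_job(**request)`. In the deployed solution, the Step Functions map state fans these calls out in parallel.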
You can learn more about how Comprehend detects and redacts PII data in this blog post.
Deploy the sample solution
First, sign in to the AWS Management Console in your AWS account.
You will need an S3 bucket with some sample transcript data to redact and another bucket for output. If you don’t have existing sample transcript data, follow these steps:
- Navigate to the Amazon S3 console.
- Choose Create bucket.
- Enter a bucket name, such as text-redaction-data-<your-account-number>.
- Accept the defaults, and choose Create bucket.
- Open the bucket you created, and choose Create folder.
- Enter a folder name, such as “sample-data” and choose Create folder.
- Click on your new folder name to open it.
- Download the SampleData.zip file.
- Open the .zip file on your local computer and then drag the folder to the S3 bucket you created.
- Choose Upload.
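If you prefer to script the upload rather than use the console, the steps above could be approximated as shown below. This is a sketch under assumptions: `upload_folder` is a hypothetical helper, the bucket is assumed to exist already, and the S3 client is passed in so the function can be exercised without AWS access. It relies only on the boto3 S3 client's `upload_file(Filename, Bucket, Key)` method.

```python
import os


def upload_folder(s3_client, local_dir: str, bucket: str, prefix: str) -> list:
    """Upload every file under local_dir to s3://bucket/prefix/ and return the keys."""
    uploaded = []
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            # Build an S3 key with forward slashes regardless of the local OS.
            rel = os.path.relpath(path, local_dir).replace(os.sep, "/")
            key = f"{prefix.rstrip('/')}/{rel}"
            s3_client.upload_file(path, bucket, key)
            uploaded.append(key)
    return uploaded


# Example call (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   upload_folder(boto3.client("s3"), "SampleData",
#                 "text-redaction-data-<your-account-number>", "sample-data")
```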
Now click the following link to deploy the sample solution to US East (N. Virginia):
This will create a new AWS CloudFormation stack.
Enter the Stack name (e.g., pii-redaction-workflow), the name of the S3 input bucket containing the input transcript data, and the name of the S3 output bucket. Choose Next and add any tags that you want for your stack (optional). Choose Next again and review the stack details. Select the checkbox to acknowledge that AWS Identity and Access Management (IAM) resources will be created, and then choose Create stack.
The CloudFormation stack will create an IAM role with the ability to list and read objects from the input S3 bucket, and write objects to the output bucket. You can further customize the role per your requirements. It will also create a Step Functions state machine, and several AWS Lambda functions used by the state machine.
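To make the permissions described above concrete, here is a minimal sketch of an IAM policy document that grants list and read access on the input bucket and write access on the output bucket. It is not the policy that the CloudFormation template actually creates; the function name and bucket names are placeholders, and the real role may include additional permissions.

```python
import json


def build_redaction_role_policy(input_bucket: str, output_bucket: str) -> dict:
    """Return an IAM policy document: list/read on input, write on output."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # List the objects in the input bucket.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{input_bucket}"],
            },
            {   # Read the transcripts from the input bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{input_bucket}/*"],
            },
            {   # Write the redaction output to the output bucket.
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{output_bucket}/*"],
            },
        ],
    }


if __name__ == "__main__":
    policy = build_redaction_role_policy("my-input-bucket", "my-output-bucket")
    print(json.dumps(policy, indent=2))
```

Note that `s3:ListBucket` applies to the bucket ARN itself, while the object-level actions apply to the `/*` resource.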
After a few minutes, your stack will be complete, and then you can examine the Step Functions state machine that was created as part of the CloudFormation template.
Run a redaction job
To run a job, navigate to Step Functions in the AWS console, select the state machine, and choose Start execution.
Next, provide the input arguments to run the job. For the job input, provide the name of your input S3 bucket as the S3InputDataBucket value, the folder name as the S3InputDataPrefix value, the name of your output S3 bucket as the S3OutputDataBucket value, and the folder to store the results as the S3OutputDataPrefix value. Then choose Start execution.
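The execution input is a JSON document with the four keys named above. The sketch below builds that document and, optionally, starts the execution programmatically. The helper names, bucket names, and folder names are placeholders, and the Step Functions client is passed in as an argument so the example does not require AWS access. The `start_execution(stateMachineArn=..., input=...)` call is the standard boto3 Step Functions method.

```python
import json


def build_execution_input(input_bucket: str, input_prefix: str,
                          output_bucket: str, output_prefix: str) -> str:
    """Serialize the state machine input using the key names from this post."""
    return json.dumps(
        {
            "S3InputDataBucket": input_bucket,
            "S3InputDataPrefix": input_prefix,
            "S3OutputDataBucket": output_bucket,
            "S3OutputDataPrefix": output_prefix,
        },
        indent=2,
    )


def start_redaction(sfn_client, state_machine_arn: str, payload: str) -> str:
    """Start one execution and return its ARN."""
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn, input=payload
    )
    return response["executionArn"]


if __name__ == "__main__":
    # Placeholder values; substitute your own bucket and folder names.
    print(build_execution_input("my-input-bucket", "sample-data",
                                "my-output-bucket", "redacted"))
```

The printed JSON can be pasted into the Start execution dialog, or passed to `start_redaction` with `boto3.client("stepfunctions")` and your state machine's ARN.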
As the job executes, you can monitor its status in the Step Functions graph view. It will take a few minutes to run the job. Once the job is complete, you will see the output for each of the jobs in the Execution input and output section of the console. You can use the output URI to retrieve the output of a job. If multiple jobs were executed, you can copy the results of all jobs to a destination bucket for further analysis.
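Retrieving a result from an output URI might look like the following sketch. It assumes the URI has the usual `s3://bucket/key` form and points at a single object; the helper names and the example URI are hypothetical. The S3 client is passed in, and the code uses only the boto3 S3 client's `download_file(Bucket, Key, Filename)` method.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_output(s3_client, uri: str, local_path: str) -> str:
    """Download the object at the given S3 URI to local_path."""
    bucket, key = split_s3_uri(uri)
    s3_client.download_file(bucket, key, local_path)
    return local_path
```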
Let’s take a look at the redacted version of the conversation that we started with.
Agent: Hi, thank you for calling us today. Whom do I have the pleasure of speaking with today?
Caller: Hello, my name is [NAME].
Agent: Hi [NAME], how may I help you?
Caller: I haven’t received my W2 statement yet and wanted to check on its status.
Agent: Sure, I can help you with that. Can you please confirm the last four digits of your Social Security number?
Caller: Yes, it’s [SSN].
Agent: Ok. I’m pulling up the status now. I see that it was sent out yesterday, and the estimated arrival is early next week. Would you like me to turn on automated alerts so you can be notified of any delays?
Caller: Yes, please.
Agent: The number we have on file for you is [PHONE]. Is that still correct?
Caller: Yes, it is.
Agent: Great. I have turned on automated notifications. Is there anything else I can assist you with, [NAME]?
Caller: No, that’s all. Thank you.
Agent: Thank you, [NAME]. Have a great day.
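The bracketed placeholders in the transcript above are consistent with replacing each detected span with its entity type. The sketch below shows how such a replacement can work given character offsets. It is an illustration of the idea, not Comprehend's implementation: the entity dictionaries mirror the `Type`, `BeginOffset`, and `EndOffset` fields that Comprehend returns for detected PII entities, but the values here are written by hand.

```python
def redact(text: str, entities: list) -> str:
    """Replace each [BeginOffset, EndOffset) span with its bracketed entity type."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = text[:ent["BeginOffset"]] + f"[{ent['Type']}]" + text[ent["EndOffset"]:]
    return text


line = "The number we have on file for you is 555-456-7890. Is that still correct?"
start = line.index("555-456-7890")
# Hand-written entity for illustration; not real Comprehend output.
entities = [{"Type": "PHONE", "BeginOffset": start, "EndOffset": start + len("555-456-7890")}]
print(redact(line, entities))
# The number we have on file for you is [PHONE]. Is that still correct?
```

Processing the spans from the end of the string backward avoids having to adjust offsets as the text changes length.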
Clean up
To avoid ongoing charges, you may want to clean up the resources created by the CloudFormation template once you are done. To do so, delete the deployed CloudFormation stack, and delete the S3 bucket with the sample transcript data if you created one.
Conclusion
With customers demanding seamless experiences across channels and also expecting security to be embedded at every point, the use of Step Functions and Amazon Comprehend to redact PII data in text conversation transcripts is a powerful tool at your disposal. Organizations can speed time to value by using the redacted transcripts to analyze customer service interactions and glean insights to improve the customer experience.
Try using this workflow to redact your data and leave us a comment!
About the author
Alex Emilcar is a Senior Solutions Architect in the Amazon Machine Learning Solutions Lab, where he helps customers build digital experiences with AWS AI technologies. Alex has over 10 years of technology experience in roles ranging from developer and infrastructure engineer to solutions architect. In his spare time, Alex likes to spend time reading and doing yard work.