Detecting and redacting PII using Amazon Comprehend
Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to find insights and relationships like people, places, sentiments, and topics in unstructured text. You can now use Amazon Comprehend ML capabilities to detect and redact personally identifiable information (PII) in customer emails, support tickets, product reviews, social media, and more. No ML experience is required. For example, you can analyze support tickets and knowledge articles to detect PII entities and redact the text before you index the documents in a search solution. The indexed documents are then free of PII entities. Redacting PII entities helps you protect privacy and comply with local laws and regulations.
Customer use case: TeraDact Solutions
TeraDact Solutions has already put this new feature to work. TeraDact Solutions’ software offers a robust alternative for secure information sharing in a world of ever-increasing compliance and privacy concerns. With its signature Information Identification & Presentation (IIaP™) capabilities, TeraDact’s tools provide the user with a safe information sharing environment. “Using Amazon Comprehend for PII redaction with our tokenization system not only helps us reach a larger set of our customers but also helps us overcome the shortcomings of rules-based PII detection which can result in false alarms or missed details. PII detection is critical for businesses and with the power of context-aware NLP models from Comprehend we can uphold the trust customers place in us with their information. Amazon is innovating in ways to help push our business forward by adding new features which are critical to our product suite.” said Chris Schrichte, CEO, TeraDact Solutions, Inc.
Detecting PII in Amazon Comprehend
When you analyze text using Amazon Comprehend real-time analysis, Amazon Comprehend automatically identifies PII, as summarized in the following table.
|PII entity category||PII entity types|
For each detected PII entity, you get the type of PII, a confidence score, and the begin and end offsets. These offsets help you locate PII entities in your documents so that your document processing can redact them at the secure storage layer or in downstream solutions.
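As a minimal sketch of how these offsets can drive redaction in your own document processing, the following Python function replaces each detected span with its entity type. The function name, the min_score parameter, and the sample text are illustrative assumptions; the entity dictionaries only mirror the fields described above (Type, Score, BeginOffset, EndOffset), and the sketch assumes the offsets index characters of a Python string.

```python
from typing import Dict, List


def redact_with_entity_type(text: str, entities: List[Dict], min_score: float = 0.0) -> str:
    """Replace each detected PII span with its type in square brackets.

    Hypothetical helper: 'entities' is assumed to be a list of dicts with
    Type, Score, BeginOffset, and EndOffset keys.
    """
    redacted = text
    # Process spans from the end of the text backwards so that replacing
    # one span does not shift the offsets of spans that come before it.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if entity["Score"] < min_score:
            continue  # skip low-confidence detections
        start, end = entity["BeginOffset"], entity["EndOffset"]
        redacted = redacted[:start] + "[" + entity["Type"] + "]" + redacted[end:]
    return redacted


# Illustrative sample data (not real service output)
sample_text = "Hello Jane Doe, your email is jane@example.com."
sample_entities = [
    {"Type": "NAME", "Score": 0.99, "BeginOffset": 6, "EndOffset": 14},
    {"Type": "EMAIL", "Score": 0.99, "BeginOffset": 30, "EndOffset": 46},
]
print(redact_with_entity_type(sample_text, sample_entities))
```

Working backwards through the spans is a design choice that keeps the original offsets valid without recomputing them after each replacement.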
Analyzing text on the Amazon Comprehend console
To get started with Amazon Comprehend, all you need is an AWS account. To use the console, complete the following steps:
- On the Amazon Comprehend console, in the Input text section, select Built-in.
- For Input text, enter your text.
- Choose Analyze.
- On the Insights page, choose the PII tab.
The PII tab shows color-coded text to indicate different PII entity types, such as name, email, address, phone, and others. The Results section shows more information about the text. Each entry shows the PII entity, its type, and the level of confidence Amazon Comprehend has in this analysis.
Analyzing text via the AWS CLI
To perform real-time analysis using the AWS CLI, enter the following code:
To view the output, open the JSON response object and look at the detected PII entities. For each entity, the service returns the type of PII, confidence score metric, BeginOffset, and EndOffset. See the following code:
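For readers who prefer an SDK over the CLI, the same real-time analysis can be sketched with the AWS SDK for Python (boto3), whose Comprehend client exposes detect_pii_entities. The helper names detect_pii and summarize are hypothetical; the client is passed in as a parameter so the functions can be exercised without AWS credentials.

```python
from typing import Any, Dict, List


def detect_pii(client: Any, text: str, language_code: str = "en") -> List[Dict]:
    """Call the DetectPiiEntities operation and return the detected entities.

    'client' is assumed to be a Comprehend client, for example one created
    with boto3.client("comprehend") in a supported Region.
    """
    response = client.detect_pii_entities(Text=text, LanguageCode=language_code)
    return response["Entities"]


def summarize(text: str, entities: List[Dict]) -> List[str]:
    """Format each entity as 'TYPE (score): matched text' for quick review."""
    return [
        "{} ({:.2f}): {}".format(e["Type"], e["Score"], text[e["BeginOffset"]:e["EndOffset"]])
        for e in entities
    ]
```

With boto3 installed and credentials configured, a call such as summarize(text, detect_pii(boto3.client("comprehend"), text)) would list the type, confidence score, and matched text for each entity.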
Asynchronous PII redaction batch processing on the Amazon Comprehend console
You can redact documents by using Amazon Comprehend asynchronous operations. Two redaction modes are available: Replace with PII entity type replaces each PII entity with its entity type, and Replace with character masks each PII entity by replacing its characters with a character of your choice (!, #, $, %, &, *, or @).
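To make the Replace with character mode concrete, here is a small local sketch of character masking. It is an illustration only, not the service's implementation: the function name is hypothetical, and it assumes every character inside a detected span (including spaces) is replaced, which may differ from how the service treats whitespace.

```python
from typing import Dict, List

# Mask characters listed for the Replace with character mode
ALLOWED_MASK_CHARACTERS = set("!#$%&*@")


def mask_entities(text: str, entities: List[Dict], mask_character: str = "*") -> str:
    """Overwrite every character of each detected span with mask_character."""
    if mask_character not in ALLOWED_MASK_CHARACTERS:
        raise ValueError("mask_character must be one of ! # $ % & * @")
    chars = list(text)
    for entity in entities:
        # Masking keeps the text length unchanged, so offsets stay valid
        for i in range(entity["BeginOffset"], entity["EndOffset"]):
            chars[i] = mask_character
    return "".join(chars)


# Illustrative sample data (not real service output)
masked = mask_entities(
    "Call 555-0100 now",
    [{"Type": "PHONE", "Score": 0.99, "BeginOffset": 5, "EndOffset": 13}],
)
print(masked)
```

Unlike replacement with the entity type, masking preserves the length of the document, which can matter if downstream tools rely on character positions.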
To analyze and redact large documents and large collections of documents, ensure that the documents are stored in an Amazon Simple Storage Service (Amazon S3) bucket and start an asynchronous operation to detect and redact PII in the documents. The results of the analysis are returned in an S3 bucket.
- On the Amazon Comprehend console, choose Analysis jobs.
- Choose Create job.
- On the Create analysis job page, for Name, enter a name (for this post, we enter comprehend-blog-redact-01).
- For Analysis type, choose Personally identifiable information (PII).
- For Language, choose English.
- In the PII detection settings section, for Output mode, select Redactions.
- Expand PII entity types and select the entity types to redact.
- For Redaction mode, choose Replace with PII entity type.
Alternatively, you can choose Replace with character to replace PII entities with a character of your choice (!, #, $, %, &, *, or @).
- In the Input data section, for Data source, select My documents.
- For S3 location, enter the S3 path for pii-s3-input.txt.
This text file has the same example content we used earlier for real-time analysis.
- In the Output data section, for S3 location, enter the path to the output folder in Amazon S3.
Make sure you choose the correct input and output paths based on how you organized your documents.
- In the Access permissions section, for IAM role, select Create an IAM role.
The job needs an AWS Identity and Access Management (IAM) role with the required permissions to access the input and output S3 buckets. The role is created and propagated when you create the job.
- For Permissions to access, choose Input and Output S3 buckets.
- For Name suffix, enter a suffix for your role (for this post, we enter ComprehendPIIRole).
- Choose Create job.
The job comprehend-blog-redact-01 appears in the list of analysis jobs along with its job status.
When the job status changes to Completed, you can access the output file to view the output. The pii-s3-input.txt file has the same example content we used earlier, and the redaction mode we chose replaces each PII entity with its PII entity type. Your output looks like the following text:
If you have very long entity types, you may prefer to mask PII with a character. If you choose to replace PII with the character *, your output looks like the following text:
Asynchronous PII redaction batch processing via the AWS CLI
To perform the PII redaction job using the AWS CLI, enter the following code:
The request yields the following output:
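As a hedged illustration of what such a request contains, the following Python sketch builds the parameters for the StartPiiEntitiesDetectionJob operation as exposed by boto3 (start_pii_entities_detection_job). The helper names, bucket paths, and role ARN are placeholders, and ONE_DOC_PER_FILE is an assumed input format rather than a setting stated in this post.

```python
from typing import Any, Dict, List, Optional

ALLOWED_MASK_CHARACTERS = set("!#$%&*@")


def build_redaction_job_request(
    job_name: str,
    input_s3_uri: str,
    output_s3_uri: str,
    role_arn: str,
    pii_entity_types: List[str],
    mask_character: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for start_pii_entities_detection_job."""
    redaction_config: Dict[str, Any] = {"PiiEntityTypes": pii_entity_types}
    if mask_character is None:
        # Corresponds to the "Replace with PII entity type" console option
        redaction_config["MaskMode"] = "REPLACE_WITH_PII_ENTITY_TYPE"
    else:
        if mask_character not in ALLOWED_MASK_CHARACTERS:
            raise ValueError("mask_character must be one of ! # $ % & * @")
        # Corresponds to the "Replace with character" console option
        redaction_config["MaskMode"] = "MASK"
        redaction_config["MaskCharacter"] = mask_character
    return {
        "JobName": job_name,
        "LanguageCode": "en",
        "Mode": "ONLY_REDACTION",
        # Assumption: each input file is treated as one document
        "InputDataConfig": {"S3Uri": input_s3_uri, "InputFormat": "ONE_DOC_PER_FILE"},
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        "DataAccessRoleArn": role_arn,
        "RedactionConfig": redaction_config,
    }


def start_redaction_job(client: Any, request: Dict[str, Any]) -> str:
    """Submit the job with a Comprehend client and return its JobId."""
    response = client.start_pii_entities_detection_job(**request)
    return response["JobId"]


# Placeholder values for illustration only
example_request = build_redaction_job_request(
    job_name="comprehend-blog-redact-001",
    input_s3_uri="s3://example-input-bucket/pii-s3-input.txt",
    output_s3_uri="s3://example-output-bucket/redacted/",
    role_arn="arn:aws:iam::111122223333:role/example-comprehend-role",
    pii_entity_types=["ALL"],
)
```

Separating request construction from submission makes it possible to review or log the parameters before any job is started.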
To monitor the job request, enter the following code:
The following output shows that the job is complete:
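Monitoring can also be automated. The sketch below polls the DescribePiiEntitiesDetectionJob operation (describe_pii_entities_detection_job in boto3) until the job reaches a terminal status. The function name, polling interval, and retry limit are illustrative assumptions, and the client is passed in so the loop can be tried with a stub.

```python
import time
from typing import Any, Dict

# Statuses after which the job no longer changes state
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_job(
    client: Any, job_id: str, poll_seconds: float = 30.0, max_polls: int = 120
) -> Dict[str, Any]:
    """Poll the job until it finishes and return its properties.

    Raises TimeoutError if no terminal status is seen within max_polls checks.
    """
    for _ in range(max_polls):
        response = client.describe_pii_entities_detection_job(JobId=job_id)
        properties = response["PiiEntitiesDetectionJobProperties"]
        if properties["JobStatus"] in TERMINAL_STATUSES:
            return properties
        time.sleep(poll_seconds)  # wait before checking again
    raise TimeoutError("Job {} did not finish after {} polls".format(job_id, max_polls))
```

The returned properties include JobStatus and OutputDataConfig, so a caller can check for COMPLETED before reading the redacted output.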
After the job is complete, the output file is plain text (same as the input file). Other Amazon Comprehend asynchronous jobs (such as start-entities-detection-job) have an output file called output.tar.gz, which is a compressed archive that contains the output of the operation. In contrast, start-pii-entities-detection-job retains the same folder and file structure as the input. For our comprehend-blog-redact-001 job, the input file pii-s3-input.txt has a corresponding pii-s3-input.txt.out file with the redacted text in the job's output folder. You can find the Amazon S3 location in the output from monitoring the job: the JSON element PiiEntitiesDetectionJobProperties.OutputDataConfig.S3Uri points to the location of pii-s3-input.txt.out, which contains the content redacted with PII entity types.
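The naming convention described here can be captured in a small helper. This is a sketch under the assumption that the S3Uri reported for the job is the folder that directly contains the .out files; the function name and the example bucket and prefix are hypothetical.

```python
from typing import Tuple
from urllib.parse import urlparse


def redacted_object_location(output_s3_uri: str, input_file_name: str) -> Tuple[str, str]:
    """Return (bucket, key) of the redacted copy of input_file_name.

    Assumes the redacted file is named '<input file name>.out' and sits
    directly under the job's output S3 URI.
    """
    parsed = urlparse(output_s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Expected an s3://bucket/prefix URI")
    prefix = parsed.path.lstrip("/")
    if prefix and not prefix.endswith("/"):
        prefix += "/"  # treat the URI as a folder
    return parsed.netloc, prefix + input_file_name + ".out"


# Placeholder URI for illustration only
bucket, key = redacted_object_location(
    "s3://example-output-bucket/redacted/job-output/", "pii-s3-input.txt"
)
print(bucket, key)
```

The resulting bucket and key could then be passed to an S3 client's get_object call to read the redacted text.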
As of this writing, the PII detection feature in Amazon Comprehend is available for US English in the following Regions:
- US East (Ohio)
- US East (N. Virginia)
- US West (Oregon)
- Asia Pacific (Mumbai)
- Asia Pacific (Seoul)
- Asia Pacific (Singapore)
- Asia Pacific (Sydney)
- Asia Pacific (Tokyo)
- EU (Frankfurt)
- EU (Ireland)
- EU (London)
- AWS GovCloud (US-West)
About the Author
Sriharsha M S is an AI/ML specialist solutions architect in the Strategic Specialist team at Amazon Web Services. He works with strategic AWS customers who are taking advantage of AI/ML to solve complex business problems. He provides technical guidance and design advice to implement AI/ML applications at scale. His expertise spans application architecture, big data, analytics, and machine learning.