Detecting and redacting PII using Amazon Comprehend

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to find insights and relationships like people, places, sentiments, and topics in unstructured text. You can now use Amazon Comprehend ML capabilities to detect and redact personally identifiable information (PII) in customer emails, support tickets, product reviews, social media, and more. No ML experience is required. For example, you can analyze support tickets and knowledge articles to detect PII entities and redact the text before you index the documents in a search solution. The indexed documents, and therefore the search results, are then free of PII entities. Redacting PII entities helps you protect privacy and comply with local laws and regulations.

Customer use case: TeraDact Solutions

TeraDact Solutions has already put this new feature to work. TeraDact Solutions’ software offers a robust alternative for secure information sharing in a world of ever-increasing compliance and privacy concerns. With its signature Information Identification & Presentation (IIaP™) capabilities, TeraDact’s tools provide the user with a safe information sharing environment. “Using Amazon Comprehend for PII redaction with our tokenization system not only helps us reach a larger set of our customers but also helps us overcome the shortcomings of rules-based PII detection which can result in false alarms or missed details. PII detection is critical for businesses and with the power of context-aware NLP models from Comprehend we can uphold the trust customers place in us with their information. Amazon is innovating in ways to help push our business forward by adding new features which are critical to our product suite.” said Chris Schrichte, CEO, TeraDact Solutions, Inc.

In this post, I cover how to use Amazon Comprehend to detect PII and redact the PII entities via the AWS Management Console and the AWS Command Line Interface (AWS CLI).

Detecting PII in Amazon Comprehend

When you analyze text using Amazon Comprehend real-time analysis, Amazon Comprehend automatically identifies PII, as summarized in the following table.

PII entity category	PII entity types
Financial	BANK_ACCOUNT_NUMBER, BANK_ROUTING, CREDIT_DEBIT_NUMBER, CREDIT_DEBIT_CVV, CREDIT_DEBIT_EXPIRY, PIN
Personal	NAME, ADDRESS, PHONE, EMAIL, AGE
Technical security	USERNAME, PASSWORD, URL, AWS_ACCESS_KEY, AWS_SECRET_KEY, IP_ADDRESS, MAC_ADDRESS
National	SSN, PASSPORT_NUMBER, DRIVER_ID
Other	DATE_TIME

For each detected PII entity, you get the type of PII, a confidence score, and the begin and end offsets. These offsets locate each PII entity in your documents, so your document processing can redact it in secure storage or in downstream solutions.
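
As a rough illustration of how the offsets can drive redaction in your own document processing, the following Python sketch replaces each detected span with its bracketed entity type. The function name, the sample sentence, and the sample scores are hypothetical, and the sketch assumes the offsets index characters of the Python string (which holds for plain ASCII text like this sample).

```python
def redact_with_entity_type(text, entities):
    """Replace each detected span with its bracketed PII type, e.g. [NAME]."""
    redacted = text
    # Work from the last entity to the first so that earlier offsets stay
    # valid while the string length changes.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        begin, end = entity["BeginOffset"], entity["EndOffset"]
        redacted = redacted[:begin] + f"[{entity['Type']}]" + redacted[end:]
    return redacted


# Hypothetical sample in the shape of the Entities list described above.
sample_text = "My name is Van Bokhorst Serdar and my pin is 123456."
sample_entities = [
    {"Type": "NAME", "Score": 0.99, "BeginOffset": 11, "EndOffset": 30},
    {"Type": "PIN", "Score": 0.99, "BeginOffset": 45, "EndOffset": 51},
]

print(redact_with_entity_type(sample_text, sample_entities))
# prints: My name is [NAME] and my pin is [PIN].
```

Replacing from the end of the text backward avoids having to recompute offsets after each substitution.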

Analyzing text on the Amazon Comprehend console

To get started with Amazon Comprehend, all you need is an AWS account. To use the console, complete the following steps:

On the Amazon Comprehend console, in the Input text section, select Built-in.
For Input text, enter your text.
Choose Analyze.

On the Insights page, choose the PII tab.

The PII tab shows color-coded text to indicate different PII entity types, such as name, email, address, phone, and others. The Results section shows more information about the text. Each entry shows the PII entity, its type, and the level of confidence Amazon Comprehend has in this analysis.

Analyzing text via the AWS CLI

To perform real-time analysis using the AWS CLI, enter the following code:

aws comprehend detect-pii-entities \
--language-code en \
--text \
"Good morning, everybody. My name is Van Bokhorst Serdar, and today I feel like sharing a whole lot of personal information with you. Let's start with my Email address SerdarvanBokhorst@dayrep.com. My address is 2657 Koontz Lane, Los Angeles, CA. My phone number is 818-828-6231. My Social security number is 548-95-6370. My Bank account number is 940517528812 and routing number 195991012. My credit card number is 5534816011668430, Expiration Date 6/1/2022, my C V V code is 121, and my pin 123456. Well, I think that's it. You know a whole lot about me. And I hope that Amazon comprehend is doing a good job at identifying PII entities so you can redact my personal information away from this document. Let's check."

To view the output, open the JSON response object and look at the detected PII entities. For each entity, the service returns the type of PII, confidence score metric, BeginOffset, and EndOffset. See the following code:

{
    "Entities": [
        {
            "Score": 0.9996334314346313,
            "Type": "NAME",
            "BeginOffset": 36,
            "EndOffset": 55
        },
        {
            "Score": 0.9999902248382568,
            "Type": "EMAIL",
            "BeginOffset": 167,
            "EndOffset": 195
        },
        {
            "Score": 0.9999983310699463,
            "Type": "ADDRESS",
            "BeginOffset": 211,
            "EndOffset": 245
        },
        {
            "Score": 0.9999997615814209,
            "Type": "PHONE",
            "BeginOffset": 265,
            "EndOffset": 277
        },
        {
            "Score": 0.9999996423721313,
            "Type": "SSN",
            "BeginOffset": 308,
            "EndOffset": 319
        },
        {
            "Score": 0.9999984502792358,
            "Type": "BANK_ACCOUNT_NUMBER",
            "BeginOffset": 347,
            "EndOffset": 359
        },
        {
            "Score": 0.9999974966049194,
            "Type": "BANK_ROUTING",
            "BeginOffset": 379,
            "EndOffset": 388
        },
        {
            "Score": 0.9999991655349731,
            "Type": "CREDIT_DEBIT_NUMBER",
            "BeginOffset": 415,
            "EndOffset": 431
        },
        {
            "Score": 0.9923601746559143,
            "Type": "CREDIT_DEBIT_EXPIRY",
            "BeginOffset": 449,
            "EndOffset": 457
        },
        {
            "Score": 0.9999997615814209,
            "Type": "CREDIT_DEBIT_CVV",
            "BeginOffset": 476,
            "EndOffset": 479
        },
        {
            "Score": 0.9998345375061035,
            "Type": "PIN",
            "BeginOffset": 492,
            "EndOffset": 498
        }
    ]
}
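
To see what a response like this refers to, you can slice the input text with each pair of offsets. The following Python sketch does that for the first two entities and the opening sentences of the example text; the function name and the `min_score` threshold are illustrative assumptions, not part of the service.

```python
import json

# First two entities from the response above.
response_json = """
{
    "Entities": [
        {"Score": 0.9996334314346313, "Type": "NAME", "BeginOffset": 36, "EndOffset": 55},
        {"Score": 0.9999902248382568, "Type": "EMAIL", "BeginOffset": 167, "EndOffset": 195}
    ]
}
"""

# Opening sentences of the example input text (no leading space).
text_start = (
    "Good morning, everybody. My name is Van Bokhorst Serdar, and today I feel "
    "like sharing a whole lot of personal information with you. Let's start "
    "with my Email address SerdarvanBokhorst@dayrep.com."
)


def list_detected_pii(text, response, min_score=0.5):
    """Return (type, matched text, score) for entities at or above min_score."""
    found = []
    for entity in response["Entities"]:
        if entity["Score"] >= min_score:
            span = text[entity["BeginOffset"]:entity["EndOffset"]]
            found.append((entity["Type"], span, entity["Score"]))
    return found


for pii_type, span, _score in list_detected_pii(text_start, json.loads(response_json)):
    print(pii_type, "->", span)
```

Filtering on the confidence score is one way to trade missed detections against false alarms; the right threshold depends on your data.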

Asynchronous PII redaction batch processing on the Amazon Comprehend console

You can redact documents by using Amazon Comprehend asynchronous operations. Choose the redaction mode Replace with PII entity type to replace each PII entity with its entity type, or choose Replace with character to mask the characters of each PII entity with a character of your choice (!, #, $, %, &, *, or @).
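
To make the character-masking mode concrete, the following Python sketch imitates its effect locally on text for which you already have entity offsets. It is only an illustration of the behavior, not the service's implementation; the function name, the sample sentence, and the sample scores are hypothetical.

```python
# Mask characters listed above as valid choices.
ALLOWED_MASK_CHARACTERS = set("!#$%&*@")


def mask_with_character(text, entities, mask_character="*"):
    """Overwrite every character of each detected span with mask_character."""
    if mask_character not in ALLOWED_MASK_CHARACTERS:
        raise ValueError(f"unsupported mask character: {mask_character!r}")
    chars = list(text)
    for entity in entities:
        for index in range(entity["BeginOffset"], entity["EndOffset"]):
            chars[index] = mask_character
    return "".join(chars)


# Hypothetical sample entities with begin and end offsets into sample_text.
sample_text = "My name is Van Bokhorst Serdar and my pin is 123456."
sample_entities = [
    {"Type": "NAME", "Score": 0.99, "BeginOffset": 11, "EndOffset": 30},
    {"Type": "PIN", "Score": 0.99, "BeginOffset": 45, "EndOffset": 51},
]

print(mask_with_character(sample_text, sample_entities, "*"))
```

Because each character is overwritten in place, the masked text keeps the length of the original, so the offsets remain valid after masking.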

To analyze and redact large documents and large collections of documents, ensure that the documents are stored in an Amazon Simple Storage Service (Amazon S3) bucket and start an asynchronous operation to detect and redact PII in the documents. The results of the analysis are returned in an S3 bucket.

On the Amazon Comprehend console, choose Analysis jobs.
Choose Create job.

On the Create analysis job page, for Name, enter a name (for this post, we enter comprehend-blog-redact-01).
For Analysis type, choose Personally identifiable information (PII).
For Language, choose English.

In the PII detection settings section, for Output mode, select Redactions.
Expand PII entity types and select the entity types to redact.
For Redaction mode, choose Replace with PII entity type.

Alternatively, you can choose Replace with character to replace PII entities with a character of your choice (!, #, $, %, &, *, or @).

In the Input data section, for Data source, select My documents.
For S3 location, enter the S3 path for pii-s3-input.txt.

This text file has the same example content we used earlier for real-time analysis.

In the Output data section, for S3 location, enter the path to the output folder in Amazon S3.

Make sure you choose the correct input and output paths based on how you organized your documents.

In the Access permissions section, for IAM role, select Create an IAM role.

The job needs an AWS Identity and Access Management (IAM) role with the required permissions to access the input and output S3 buckets. The console creates this role and propagates it for the job.

For Permissions to access, choose Input and Output S3 buckets.
For Name suffix, enter a suffix for your role (for this post, we enter ComprehendPIIRole).
Choose Create job.

You can see the job comprehend-blog-redact-01 with the job status In progress.

When the job status changes to Completed, you can access the output file to view the results. The pii-s3-input.txt file has the same example content we used earlier, and the Replace with PII entity type redaction mode replaces each PII entity with its entity type. Your output looks like the following text:

Good morning, everybody. My name is [NAME], and today I feel like sharing a whole lot of personal information with you. Let's start with my Email address [EMAIL]. My address is [ADDRESS] My phone number is [PHONE]. My Social security number is [SSN]. My Bank account number is [BANK_ACCOUNT_NUMBER] and routing number [BANK_ROUTING]. My credit card number is [CREDIT_DEBIT_NUMBER], Expiration Date [CREDIT_DEBIT_EXPIRY], my C V V code is [CREDIT_DEBIT_CVV], and my pin [PIN]. Well, I think that's it. You know a whole lot about me. And I hope that Amazon comprehend is doing a good job at identifying PII entities so you can redact my personal information away from this document. Let's check.

If you have very long entity types, you may prefer to mask PII with a character. If you choose to replace PII with the character *, your output looks like the following text:

Good morning, everybody. My name is *******************, and today I feel like sharing a whole lot of personal information with you. Let's start with my Email address ****************************. My address is ********************************** My phone number is ************. My Social security number is ***********. My Bank account number is ************ and routing number *********. My credit card number is ****************, Expiration Date ********, my C V V code is ***, and my pin ******. Well, I think that's it. You know a whole lot about me. And I hope that Amazon comprehend is doing a good job at identifying PII entities so you can redact my personal information away from this document. Let's check.

Asynchronous PII redaction batch processing via the AWS CLI

To perform the PII redaction job using the AWS CLI, enter the following code:

aws comprehend start-pii-entities-detection-job \
 --input-data-config S3Uri="s3://ai-ml-services-lab/public/labs/comprehend/pii/input/redact/pii-s3-input.txt"  \
 --output-data-config S3Uri="s3://ai-ml-services-lab/public/labs/comprehend/pii/output/redact/"  \
 --mode "ONLY_REDACTION" \
 --redaction-config PiiEntityTypes="BANK_ACCOUNT_NUMBER","BANK_ROUTING","CREDIT_DEBIT_NUMBER","CREDIT_DEBIT_CVV","CREDIT_DEBIT_EXPIRY","PIN","EMAIL","ADDRESS","NAME","PHONE","SSN",MaskMode="REPLACE_WITH_PII_ENTITY_TYPE" \
 --data-access-role-arn "arn:aws:iam::<ACCOUNTID>:role/service-role/AmazonComprehendServiceRole-ComprehendPIIRole" \
 --job-name "comprehend-blog-redact-001" \
 --language-code "en"

The request yields the following output:

{
    "JobId": "e41101e2f0919a320bc0583a50f86b5f",
    "JobStatus": "SUBMITTED"
}
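
If you prefer Python, the same request can be assembled as keyword arguments for the boto3 Comprehend client's `start_pii_entities_detection_job` method. The following is a minimal sketch: the helper name `build_redaction_job_request` is hypothetical, and the actual call is left commented out because it needs boto3, AWS credentials, and your own account ID in the role ARN.

```python
def build_redaction_job_request(input_s3_uri, output_s3_uri, role_arn, job_name,
                                entity_types, mask_character=None):
    """Build keyword arguments for start_pii_entities_detection_job."""
    redaction_config = {"PiiEntityTypes": list(entity_types)}
    if mask_character is None:
        # Replace each entity with its type, as in the CLI example above.
        redaction_config["MaskMode"] = "REPLACE_WITH_PII_ENTITY_TYPE"
    else:
        # Mask each entity with the chosen character instead.
        redaction_config["MaskMode"] = "MASK"
        redaction_config["MaskCharacter"] = mask_character
    return {
        "InputDataConfig": {"S3Uri": input_s3_uri},
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        "Mode": "ONLY_REDACTION",
        "RedactionConfig": redaction_config,
        "DataAccessRoleArn": role_arn,
        "JobName": job_name,
        "LanguageCode": "en",
    }


request = build_redaction_job_request(
    input_s3_uri="s3://ai-ml-services-lab/public/labs/comprehend/pii/input/redact/pii-s3-input.txt",
    output_s3_uri="s3://ai-ml-services-lab/public/labs/comprehend/pii/output/redact/",
    # Replace <AccountID> with your own account ID before submitting.
    role_arn="arn:aws:iam::<AccountID>:role/service-role/AmazonComprehendServiceRole-ComprehendPIIRole",
    job_name="comprehend-blog-redact-001",
    entity_types=["BANK_ACCOUNT_NUMBER", "BANK_ROUTING", "CREDIT_DEBIT_NUMBER",
                  "CREDIT_DEBIT_CVV", "CREDIT_DEBIT_EXPIRY", "PIN", "EMAIL",
                  "ADDRESS", "NAME", "PHONE", "SSN"],
)

# Submitting the job (requires boto3 and AWS credentials):
# import boto3
# comprehend = boto3.client("comprehend")
# response = comprehend.start_pii_entities_detection_job(**request)
# print(response["JobId"], response["JobStatus"])
```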

To monitor the job request, enter the following code:

aws comprehend describe-pii-entities-detection-job --job-id "e41101e2f0919a320bc0583a50f86b5f"

The following output shows that the job is complete:

{
    "PiiEntitiesDetectionJobProperties": {
        "JobId": "e41101e2f0919a320bc0583a50f86b5f",
        "JobName": "comprehend-blog-redact-001",
        "JobStatus": "COMPLETED",
        "SubmitTime": <SubmitTime>,
        "EndTime": <EndTime>,
        "InputDataConfig": {
            "S3Uri": "s3://ai-ml-services-lab/public/labs/comprehend/pii/input/redact/pii-s3-input.txt",
            "InputFormat": "ONE_DOC_PER_LINE"
        },
        "OutputDataConfig": {
            "S3Uri": "s3://ai-ml-services-lab/public/labs/comprehend/pii/output/redact/<AccountID>-PII-e41101e2f0919a320bc0583a50f86b5f/output/"
        },
        "RedactionConfig": {
            "PiiEntityTypes": [
                "BANK_ACCOUNT_NUMBER",
                "BANK_ROUTING",
                "CREDIT_DEBIT_NUMBER",
                "CREDIT_DEBIT_CVV",
                "CREDIT_DEBIT_EXPIRY",
                "PIN",
                "EMAIL",
                "ADDRESS",
                "NAME",
                "PHONE",
                "SSN"
            ],
            "MaskMode": "REPLACE_WITH_PII_ENTITY_TYPE"
        },
        "LanguageCode": "en",
        "DataAccessRoleArn": "arn:aws:iam::<AccountID>:role/service-role/AmazonComprehendServiceRole-ComprehendPIIRole",
        "Mode": "ONLY_REDACTION"
    }
}
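
Rather than rerunning the describe command by hand, you can poll until the job reaches a terminal status. The following Python sketch assumes a boto3 Comprehend client is passed in as `client`; the function name, polling interval, and poll limit are illustrative choices.

```python
import time

# Statuses after which the job will not change any further.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_pii_job(client, job_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll describe_pii_entities_detection_job until the job finishes.

    Returns the PiiEntitiesDetectionJobProperties dict of the finished job,
    or raises TimeoutError if max_polls is reached first.
    """
    for _ in range(max_polls):
        response = client.describe_pii_entities_detection_job(JobId=job_id)
        properties = response["PiiEntitiesDetectionJobProperties"]
        if properties["JobStatus"] in TERMINAL_STATUSES:
            return properties
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


# Usage (requires boto3 and AWS credentials):
# import boto3
# comprehend = boto3.client("comprehend")
# properties = wait_for_pii_job(comprehend, "e41101e2f0919a320bc0583a50f86b5f")
# print(properties["JobStatus"], properties["OutputDataConfig"]["S3Uri"])
```

Passing the client and the `sleep` function as parameters keeps the helper easy to exercise with a stub before you point it at a real job.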

After the job is complete, the output file is plain text (the same as the input file). Other Amazon Comprehend asynchronous jobs (for example, start-entities-detection-job) have an output file called output.tar.gz, which is a compressed archive that contains the output of the operation. In contrast, start-pii-entities-detection-job retains the folder and file structure of the input. For our comprehend-blog-redact-001 job, the input file pii-s3-input.txt has a corresponding pii-s3-input.txt.out file with the redacted text in the job's output folder. You can find the Amazon S3 location in the output from monitoring the job; the location in the JSON element PiiEntitiesDetectionJobProperties.OutputDataConfig.S3Uri contains the file pii-s3-input.txt.out, which holds the content redacted with PII entity types.
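
Based on that naming pattern, a small Python helper can derive the bucket and key of the redacted file from the job's output location. This is a sketch that assumes the `.out` file sits directly under the reported output URI, as described above; the function name is hypothetical and the account ID 111122223333 is a placeholder.

```python
from urllib.parse import urlparse


def redacted_output_location(output_s3_uri, input_file_name):
    """Return (bucket, key) of the redacted copy of input_file_name."""
    parsed = urlparse(output_s3_uri)
    # Normalize the prefix so it has no leading slash and exactly one trailing slash.
    prefix = parsed.path.strip("/")
    key = f"{prefix}/{input_file_name}.out" if prefix else f"{input_file_name}.out"
    return parsed.netloc, key


bucket, key = redacted_output_location(
    "s3://ai-ml-services-lab/public/labs/comprehend/pii/output/redact/"
    "111122223333-PII-e41101e2f0919a320bc0583a50f86b5f/output/",
    "pii-s3-input.txt",
)
print(bucket)
print(key)
```

The resulting bucket and key can then be passed to whichever S3 download tool you already use.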

Conclusion

As of this writing, the PII detection feature in Amazon Comprehend is available for US English in the following Regions:

US East (Ohio)
US East (N. Virginia)
US West (Oregon)
Asia Pacific (Mumbai)
Asia Pacific (Seoul)
Asia Pacific (Singapore)
Asia Pacific (Sydney)
Asia Pacific (Tokyo)
EU (Frankfurt)
EU (Ireland)
EU (London)
AWS GovCloud (US-West)

Take a look at the pricing page, give the feature a try, and please send us feedback either via the AWS forum for Amazon Comprehend or through your usual AWS support contacts.

About the Author

Sriharsha M S is an AI/ML specialist solution architect in the Strategic Specialist team at Amazon Web Services. He works with strategic AWS customers who are taking advantage of AI/ML to solve complex business problems. He provides technical guidance and design advice to implement AI/ML applications at scale. His expertise spans application architecture, big data, analytics, and machine learning.
