Build a data pipeline to automatically discover and mask PII data with AWS Glue DataBrew

Personally identifiable information (PII) data handling is a common requirement when operating a data lake at scale. Businesses often need to mitigate the risk of exposing PII data to the data science team while not hindering the productivity of the team to get to the data they need in order to generate valuable data insights. However, there are challenges in striking the right balance between data governance and agility:

Proactively identifying the dataset that contains PII data, if it’s not labeled by the data providers
Determining to what extent the data scientists can access the dataset
Minimizing chances that the data lake operator is visually exposed to PII data when they process the data

To help overcome these challenges, we can build a data pipeline that automatically scans data upon its arrival to the data lake, then further masks the portion of data that is labeled as PII data. Automating the PII data scanning and masking tasks helps prevent human actors processing the data while the PII data is still presented in plain text, yet still provides data consumers timely access to the newly arrived dataset.

To build a data pipeline that can automatically handle PII data, you can use AWS Glue DataBrew. DataBrew is a no-code data preparation tool with pre-built transformations to automate data preparation tasks. It natively supports PII data identification, entity detection, and PII data handling features. In addition to its visual interface for no-code data preparation, it offers APIs to let you orchestrate the creation and running of DataBrew profile jobs and recipe jobs.

In this post, we illustrate how you can orchestrate DataBrew jobs with AWS Step Functions to build a data pipeline to handle PII data. The pipeline is triggered by Amazon Simple Storage Service (Amazon S3) event notifications sent to Amazon EventBridge whenever there is a new data object lands in a S3 bucket. We also include an AWS CloudFormation template for you to deploy as a reference.

Solution overview

The following diagram describes the solution architecture.

Architecture Diagram

The solution includes a S3 bucket as the data input bucket and another S3 bucket as the data output bucket. Data uploaded to the data input bucket sends an event to EventBridge to trigger the data pipeline. The pipeline is composed of a Step Functions state machine, DataBrew jobs, and an AWS Lambda function used for reading the results of the DataBrew profile job.

The solution workflow includes the following steps:

A new data file is uploaded to the data input bucket.
EventBridge receives an object created event from the S3 bucket, and triggers the Step Functions state machine.
The state machine uses DataBrew to register the S3 object as a new DataBrew dataset, and creates a profile job. The profile job results, including the PII statistics, are written to the data output bucket.
A Lambda function reads the profile job results and returns whether the data file contains PII data.
If no PII data is found, the workflow is complete; otherwise, a DataBrew recipe job is created to target the columns that contain PII data.
When running the DataBrew recipe job, DataBrew uses the secret (a base64 encoded string, such as TXlTZWNyZXQ=) stored in AWS Secrets Manager to hash the PII columns.
When the job is complete, the new data file with PII data hashed is written to the data output bucket.

Prerequisites

To deploy the solution, you should have the following prerequisites:

An AWS account
An AWS user with AWS Identity and Access Management (IAM) permissions to manage AWS resources including Amazon S3, DataBrew, Step Functions, Lambda, and Secrets Manager

Deploy the solution using AWS CloudFormation

To deploy the solution using the CloudFormation template, complete the following steps.

Sign in to your AWS account.
Choose Launch Stack:
Navigate to one of the AWS Regions where DataBrew is available (such as us-east-1).
For Stack name, enter a name for the stack or leave as default (automate-pii-handling-data-pipeline).
For HashingSecretValue, enter a secret (which is base64 encoded during the CloudFormation stack creation) to use for data hashing.
For PIIMatchingThresholdValue, enter a threshold value (1–100 in terms of percentage, default is 80) to indicate the desired percentage of records the DataBrew profile job must identify as PII data in a given column, so that the data in the column is further hashed by the subsequent DataBrew PII recipe job.
Select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create stack.

CloudFormation Quick Launch Page

The CloudFormation stack creation process takes around 3-4 minutes to complete.

Test the data pipeline

To test the data pipeline, you can download the sample synthetic data generated by Mockaroo. The dataset contains synthetic PII fields such as email, contact number, and credit card number.

Data Preview

The sample data contains columns of PII data as an illustration; you can use DataBrew to detect PII values down to the cell level.

On the AWS CloudFormation console, navigate to the Outputs tab for the stack you created.
Choose the URL value for AmazonS3BucketForGlueDataBrewDataInput to navigate to the S3 bucket created for DataBrew data input.
Choose Upload.
Choose Add files to upload the data file you downloaded.
Choose Upload again.
Return to the Outputs tab for the CloudFormation stack.
Choose the URL value for AWSStepFunctionsStateMachine.

You’re redirected to the Step Functions console, where you can review the state machine you created. The state machine should be in a Running state.

In the Executions list, choose the current run of the state machine.

A graph inspector visualizes which step of the pipeline is being run. You can also inspect the step input and output of each step completed.
For the provided sample dataset, with 8 columns containing 1,000 rows of records, the whole run takes approximately 7–8 minutes.

Data pipeline details

While we’re waiting for the steps to complete, let’s explain more of how this data pipeline is built. The following figure is the detailed workflow of the Step Functions state machine.

Step Functions Workflow

The key step in the state machine is the Lambda function used to parse the DataBrew profile job result. The following code is a snippet of the profile job result in JSON format:

{
    "defaultProfileConfiguration": {...
    },
    "entityDetectorConfigurationOverride": {
        "AllowedStatistics": [...
        ],
        "EntityTypes": [
            "USA_ALL",
            "PERSON_NAME"
        ]
    },
    "datasetConfigurationOverride": {},
    "sampleSize": 1000,
    "duplicateRowsCount": 0,
    "columns": [
        {...
        },
        {
            "name": "email_address",
            "type": "string",
            "entity": {
                "rowsCount": 1000,
                "entityTypes": [
                    {
                        "entityType": "EMAIL",
                        "rowsCount": 1000
                    }
                ]
            }...
        }...
    ]...
}

Inside columns, each column object has the property entity if it’s detected to be a column containing PII data. rowsCount inside entity tells us how many rows out of the total sample are identified as PII, followed by entityTypes to indicate the type of PII identified.

The following is the Python code used in the Lambda function:

import json
import boto3
import os

def lambda_handler(event, context):

  s3Bucket = event["Outputs"][0]["Location"]["Bucket"]
  s3ObjKey = event["Outputs"][0]["Location"]["Key"]

  s3 =boto3.client('s3')
  glueDataBrewProfileResultFile = s3.get_object(Bucket=s3Bucket, Key=s3ObjKey)
  glueDataBrewProfileResult = json.loads(glueDataBrewProfileResultFile['Body'].read().decode('utf-8'))
  columnsProfiled = glueDataBrewProfileResult["columns"]
  PIIColumnsList = []

  for item in columnsProfiled:
    if "entityTypes" in item["entity"]:
      if (item["entity"]["rowsCount"]/glueDataBrewProfileResult["sampleSize"]) >= int(os.environ.get("threshold"))/100:
        PIIColumnsList.append(item["name"])

  if PIIColumnsList == []:
    return 'No PII columns found.'
  else:
    return PIIColumnsList

To summarize what the logic of the Lambda function is, a for-loop is implemented to aggregate a list of column names, in which the ratio of PII rows over the total sample size of that column is larger than or equal to the threshold value set earlier in the CloudFormation stack creation step. The Lambda function returns the list of column names to the Step Functions state machine to author a DataBrew recipe that masks only the columns in the returned list, instead of all the columns of the dataset. This way, we retain the content of non-PII columns for the data consumer while not exposing the PII data in plain text.

Step Functions Workflow Studio

We use CRYPTOGRAPHIC_HASH in this solution for the Operation parameter of the DataBrew CreateRecipe step. Because the profile job result and threshold value have already been used to determine which columns contain PII data to mask, the recipe step doesn’t include the parameter entityTypeFilter to enforce all rows of the columns getting hashed. Otherwise, some rows in the column might not be hashed by the operation if the particular rows of data are not identified by DataBrew as PII.

If your dataset potentially contains free-text columns such as doctor notes and email body, it would be beneficial to include the parameter entityTypeFilter in an additional recipe step to handle the free-text columns. For more information, refer to the values supported for this parameter.

To customize the solution further, you can also choose other PII recipe steps available from DataBrew to mask, replace, or transform the data in approaches best suited for your use cases.

Data pipeline results

After a deeper dive into the solution components, let’s check if all the steps in the Step Functions state machine are complete and review the results.

Navigate to the Datasets page on the DataBrew console to view the data profile result of the dataset you just uploaded.

Five columns of the dataset have been identified as columns containing PII data. Depending on the threshold value you set when creating the CloudFormation stack (the default is 80), the column spoken_language wouldn’t be included in the PII data masking step because only 14% of the rows were identified as a name of a person.

Navigate to the Jobs page to inspect the output of the data masking step.
Choose 1 output to see the S3 bucket containing the data output.
Choose the value for Destination to navigate to the S3 bucket.

The data output S3 bucket contains a .json file, which is the data profile result you just reviewed in JSON format. There is also a folder path that contains the data output of the PII data masking task.

Choose the folder path.
Select the CSV file, which is the output of the DataBrew recipe job.
On the Actions menu, choose Query with S3 Select.
In the SQL query section, choose Run SQL query.

The query results sampled five rows from the data output of the DataBrew recipe job; the columns identified as PII (full_name, email_address, and contact_phone_number) have been masked. Congratulations! You have successfully produced a dataset from a data pipeline that detects and masks PII data automatically.

Clean up

To avoid incurring future charges, delete the resources you created as part of this post.

On the AWS CloudFormation console, delete the stack you created (default name is automate-pii-handling-data-pipeline).

Conclusion

In this post, you learned how to build a data pipeline that automatically detects PII data and masks the data accordingly when a new data file arrives in an S3 bucket. With DataBrew profile jobs, you can develop logics with low code to run automatically on the profile results. For this post, our job determined which columns to mask. You can also author the DataBrew recipe job in an automated approach, which helps limit occasions when human actors can access the PII data while it’s still in plain text.

You can learn more about this solution and the source code by visiting the GitHub repository. To learn more about what DataBrew can do in handling PII data, refer to Introducing PII data identification and handling using AWS Glue DataBrew and Personally identifiable information (PII) recipe steps.

About the Author

Author

Samson Lee is a Solutions Architect with a focus on the data analytics domain. He works with customers to build enterprise data platforms, discovering and designing solutions on AI/ML use cases. Samson also enjoys coffee and wine tasting outside of work.

AWS Big Data Blog

Build a data pipeline to automatically discover and mask PII data with AWS Glue DataBrew

Solution overview

Prerequisites

Deploy the solution using AWS CloudFormation

Test the data pipeline

Data pipeline details

Data pipeline results

Clean up

Conclusion

About the Author

Resources

Follow

Learn

Resources

Developers

Help