Integrating Redaction of FinServ Data into a Machine Learning Pipeline
Financial companies process hundreds of thousands of documents every day. These include loan and mortgage statements that contain large amounts of confidential customer information.
Data privacy requires that sensitive data be redacted to protect the customer and the institution. Redacting digital and physical documents by hand is time-consuming and labor-intensive, and the inadvertent release of personal information can be devastating for both the customer and the institution. Having automated processes in place reduces the likelihood of a data breach.
In this post, we discuss how to automatically redact personally identifiable information (PII) fields from your financial services (FinServ) data using the machine learning (ML) capabilities of Amazon Comprehend together with Amazon Athena. This helps you comply with federal regulations and meet customer expectations.
Protecting data and complying with regulations
Protecting PII is crucial to complying with regulations and standards such as the California Consumer Privacy Act (CCPA), Europe’s General Data Protection Regulation (GDPR), and the Payment Card Industry Data Security Standard (PCI DSS).
In Figure 1, we show how structured and unstructured sensitive data stored in AWS data stores can be redacted before it is made available to data engineers and data scientists for feature engineering and building ML models, in compliance with your organization’s data security policies.
This section explains each step presented in Figure 1 and the AWS services used:
- By using services like AWS DataSync, AWS Storage Gateway, and AWS Transfer Family, data can be ingested into AWS using a batch or streaming pattern. This data lands in an Amazon Simple Storage Service (Amazon S3) bucket, which we call “raw data” in Figure 1.
- To detect whether the raw data bucket contains any sensitive data, use Amazon Macie. Macie is a fully managed data security and data privacy service that uses ML and pattern matching to discover and protect your sensitive data in AWS. When Macie discovers sensitive data, you can configure it to tag the affected objects with an Amazon S3 object tag, so that they are identified as containing sensitive data before they progress to the next stage of the pipeline. Refer to the Use Macie to discover sensitive data as part of automated data pipelines blog post for detailed instructions on building such a pipeline.
- This tagged data lands in a “scanned data” bucket, where we use Amazon Comprehend, a natural language processing (NLP) service that uses ML to uncover information in unstructured data. Amazon Comprehend works on unstructured text documents and can redact sensitive fields such as credit card numbers, dates of birth, Social Security numbers, passport numbers, and more. Refer to the Detecting and redacting PII using Amazon Comprehend blog post for step-by-step instructions on building such a capability.
- If your pipeline requires redaction for specific use cases only, you can use the information in Introducing Amazon S3 Object Lambda – Use Your Code to Process Data as It Is Being Retrieved from S3 to redact sensitive data. Using this operation, an AWS Lambda function will intercept each GET request. It will redact data as necessary before it goes back to the requestor. This allows you to keep one copy of all the data and redact the data as it is requested for a specific workload. For further details, refer to the Amazon S3 Object Lambda Access Point to redact personally identifiable information (PII) from documents developer guide.
- When you want to join multiple datasets from different data sources, use an Athena federated query. Using user-defined functions (UDFs) with Athena federated query helps you redact data in Amazon S3 or in other data sources, such as an online transaction store like Amazon Relational Database Service (Amazon RDS), a data warehouse solution like Amazon Redshift, or a NoSQL store like Amazon DocumentDB. Athena supports UDFs, which enable you to write custom functions and invoke them in SQL queries. UDFs allow you to perform custom processing such as redacting sensitive data, compressing and decompressing data, or applying customized decryption. To learn how to set this up, refer to the Redacting sensitive information with user-defined functions in Amazon Athena blog post.
- Redacted data lands in another S3 bucket that is now ready for any ML pipeline consumption.
- Using AWS Glue DataBrew, you can prepare the data without writing any code. You can choose reusable recipes from over 250 pre-built transformations to automate data preparation tasks, and run them as jobs that can be scheduled based on your requirements.
- Data is then used by Amazon SageMaker Data Wrangler to do feature engineering on curated data in data preparation (step 6). SageMaker Data Wrangler offers over 300 pre-configured data transformations, such as convert column type, one hot encoding, impute missing data with mean or median, rescale columns, and date/time embedding, so you can transform your data into formats that can be effectively used for models without writing a single line of code.
- The output of the SageMaker Data Wrangler job is stored in Amazon SageMaker Feature Store, a purpose-built repository where you can store and access features to name, organize, and reuse them across teams. SageMaker Feature Store provides a unified store for features during training and real-time inference without the need to write additional code or create manual processes to keep features consistent.
- Use ML features in SageMaker notebooks or SageMaker Studio for ML training on your redacted data. A SageMaker notebook instance is an ML compute instance running the Jupyter Notebook App. Amazon SageMaker Studio is a web-based, integrated development environment for ML that lets you build, train, debug, deploy, and monitor your ML models. SageMaker Studio is integrated with SageMaker Data Wrangler.
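To make the Amazon Comprehend redaction step more concrete, here is a minimal Python sketch, not the implementation from the referenced blog posts. It assumes the entity list has the shape returned by the Amazon Comprehend DetectPiiEntities API (a list of entries with Type, Score, BeginOffset, and EndOffset) and that detected spans do not overlap. The function names redact_pii and detect_and_redact, the [TYPE] placeholder format, and the 0.5 score threshold are illustrative assumptions.

```python
from typing import Dict, List


def redact_pii(text: str, entities: List[Dict], min_score: float = 0.5) -> str:
    """Replace each detected PII span with a [TYPE] placeholder.

    `entities` is assumed to follow the shape of the "Entities" list in a
    DetectPiiEntities response: dicts with "Type", "Score", "BeginOffset",
    and "EndOffset" keys. Spans are assumed not to overlap.
    """
    # Keep only confident detections; min_score is an illustrative threshold.
    kept = [e for e in entities if e["Score"] >= min_score]
    # Work from the end of the text so that earlier offsets stay valid.
    for entity in sorted(kept, key=lambda e: e["BeginOffset"], reverse=True):
        start, end = entity["BeginOffset"], entity["EndOffset"]
        text = text[:start] + f"[{entity['Type']}]" + text[end:]
    return text


def detect_and_redact(text: str, language_code: str = "en") -> str:
    """Call Amazon Comprehend, then redact. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so redact_pii can be used offline

    comprehend = boto3.client("comprehend")
    response = comprehend.detect_pii_entities(Text=text, LanguageCode=language_code)
    return redact_pii(text, response["Entities"])
```

Calling detect_and_redact on the text of a document in the “scanned data” bucket would return a copy that could be written to the redacted data bucket. Splitting detection from replacement keeps the replacement logic testable without AWS access.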
Federal regulations require that financial institutions protect customer data. To achieve this, redact sensitive fields in your data.
In this post, we showed you how to use AWS services, including Amazon Comprehend and Amazon Athena, to meet these requirements. These services allow data engineers and data scientists in your organization to safely consume this data for machine learning pipelines.