AWS Partner Network (APN) Blog
How to Scale Data Tokenization with AWS Glue and Protegrity
By Muneeb Hasan, Sr. Partner Solution Engineer – Protegrity
By Venkatesh Aravamudan, Partner Solutions Architect – AWS
By Tamara Astakhova, Sr. Partner Solutions Architect – AWS
In the current era of big data, where data grows exponentially and arrives from many sources, consolidating it into one system is a major challenge for companies. That’s why many companies use AWS Glue to build extract, transform, load (ETL) workflows that load their data into data lakes.
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.
AWS Glue can connect to more than 70 data sources and provides centralized data cataloging. Companies can immediately search and query the cataloged data using services like Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
Many compliance regulations require the protection of confidential data, including personally identifiable information (PII), and any solution must implement those protections. Building a secure data pipeline is therefore a top priority for a modern business.
Amazon Web Services (AWS) has collaborated with Protegrity, an AWS Partner with Competencies in Security and Data and Analytics, to enable organizations with strict security requirements to protect their data while still obtaining powerful insights.
In this post, we will demonstrate how data tokenization for data in transit is performed using Protegrity’s Cloud API and AWS Glue.
Protegrity for AWS Glue
Protegrity is a global leader in data security and provides data tokenization for AWS Glue by employing a cloud-native, serverless architecture.
The solution scales elastically to seamlessly meet AWS Glue’s on-demand, intensive workload processing. Serverless tokenization with Protegrity delivers data security with the performance that organizations need for sensitive data protection and on-demand scale.
About Tokenization
Tokenization is a non-mathematical approach to protecting data while preserving its type, format, and length. Tokens appear similar to the original value and can keep sensitive data fully or partially visible for data processing and analytics.
Historically, vault-based tokenization has used a database table to create lookup pairs that associate a token with encrypted sensitive information.
Protegrity Vaultless Tokenization (PVT) uses innovative techniques to eliminate data management and scalability problems typically associated with vault-based tokenization. Using AWS Glue with Protegrity, data can be tokenized or de-tokenized (re-identified) with Protegrity Cloud API depending on the user’s role and the governing Protegrity security policy.
Below is an example of tokenized (de-identified) PII data that preserves potential analytic usability.
The email address is tokenized while the domain name is kept in the clear, and the date of birth (DOB) is tokenized except for the year. The other fields shown in green are fully tokenized. This tokenization strategy still permits age-based analytics on the balance, credit, and medical fields.
Figure 1 – Example tokenized data.
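For instance, a customer record could look like this before and after tokenization (the token values below are fabricated for illustration and are not actual Protegrity output):

Name: Jane Doe → Wqas Kfo
Email: jane.doe@example.com → krw.vfh@example.com
DOB: 1985-04-12 → 1985-09-27

Note how the email domain and the birth year survive tokenization, which is what keeps the data usable for analytics.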
Solution Overview and Architecture
The AWS Glue job uses a custom transform to call the Protegrity protection function in AWS Lambda. When you run the job, each slice in your AWS Glue job batches the applicable rows after filtering and sends those batches to your Lambda function. A mapping file associated with the job defines the fields to be protected.
The federated user identity is included in the AWS Identity and Access Management (IAM) user policy that allows reading the mapping file and invoking Lambda. The Lambda function checks the request against the Protegrity security policy to determine whether the user is permitted to perform the operation.
The number of parallel requests to Lambda scales linearly with the number of slices in your AWS Glue job, with up to 10 invocations per slice in parallel.
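Conceptually, each batch sent to the Lambda function carries the rows to transform plus context from the mapping file. The exact request schema is defined by Protegrity’s Cloud API, so treat the following shape as illustrative only:

{
  "policy_user": "etl_user",
  "columns": [{"name": "email", "operation": "protect", "data_element": "deEmail"}],
  "rows": [
    {"name": "Jane Doe", "email": "jane.doe@example.com", "dob": "1985-04-12"}
  ]
}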
Figure 2 – AWS Glue and Protegrity Cloud Protect API architecture.
Set Up Tokenization with Protegrity and AWS Glue
For this post, here are the prerequisites:
- Protegrity Cloud API installed on AWS.
- An AWS Glue role with permissions to invoke the protect Lambda function and read the mapping file.
- An AWS Glue database with the data to protect.
We also assume you have a Protegrity account and have set up Protegrity Serverless in your account. The Lambda function used in this post can be obtained from Protegrity by visiting AWS Marketplace and searching for the Protegrity S3 Accelerator.
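As a sketch, an IAM policy attached to the AWS Glue job role could grant just these two permissions; the function ARN, account ID, bucket name, and key below are placeholders:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "lambda:InvokeFunction",
      "Resource": "arn:aws:lambda:us-east-1:111122223333:function:protegrity-cloud-api"
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-config-bucket/mapping.json"
    }
  ]
}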
Step 1: Create Mapping File and Upload to Amazon S3
A mapping file is essential for Protegrity to perform protect/unprotect operations in an AWS Glue job. It defines the following:
- lambda_function_name: The Cloud API Lambda function name.
- batchsize: How many rows to send to the Cloud API in each batch.
- policy_user: The user for the protect operation in the Protegrity policy.
- columns: The columns to protect.
- operation: Whether to protect or unprotect.
- data_element: The data element to protect the column with; it must exist in the Protegrity policy.
Below is a sample mapping file:
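The exact schema is defined by Protegrity, so treat this as an illustrative shape built only from the fields listed above; substitute your own function name, user, columns, and data elements:

{
  "lambda_function_name": "protegrity-cloud-api",
  "batchsize": 100,
  "policy_user": "etl_user",
  "columns": [
    {"name": "email", "operation": "protect", "data_element": "deEmail"},
    {"name": "dob", "operation": "protect", "data_element": "deDate"},
    {"name": "name", "operation": "protect", "data_element": "deName"}
  ]
}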
Step 2: Create ETL Job in AWS Glue Studio
In the AWS Glue console, select ETL jobs > Visual ETL from the left menu and click Create.
Figure 3 – Create ETL job in AWS Glue Studio.
Step 3: Add Input Data Source
The next step is to select the Amazon Simple Storage Service (Amazon S3) file to be loaded and provide the details in Data source properties.
Figure 4 – Add input data source properties.
Step 4: Create Custom Transform
Next, we’ll create a custom transform. The glue.py file contains the code; copy and paste it starting at the second line (under the function definition).
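The glue.py supplied by Protegrity is the code to use in practice. Purely to illustrate the pattern, here is a minimal sketch of a custom transform that reads the mapping file, batches rows, and calls a tokenization Lambda; the payload shape and the S3 bucket/key are assumptions:

import json
import boto3
from pyspark.sql import Row
from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # Take the first (and only) frame from the incoming collection.
    df = dfc.select(list(dfc.keys())[0]).toDF()
    fields = df.schema.fieldNames()

    # Load the mapping file; bucket and key are placeholders.
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-config-bucket", Key="mapping.json")
    mapping = json.loads(obj["Body"].read())

    def protect_partition(rows):
        # boto3 clients are not serializable, so create one per partition.
        lam = boto3.client("lambda")

        def send(batch):
            # Assumed payload shape; the real schema is defined by the Cloud API.
            payload = {"policy_user": mapping["policy_user"],
                       "columns": mapping["columns"],
                       "rows": batch}
            resp = lam.invoke(FunctionName=mapping["lambda_function_name"],
                              Payload=json.dumps(payload))
            return json.loads(resp["Payload"].read())["rows"]

        batch = []
        for row in rows:
            batch.append(row.asDict())
            if len(batch) == mapping["batchsize"]:
                for out in send(batch):
                    yield Row(*[out[f] for f in fields])
                batch = []
        if batch:
            for out in send(batch):
                yield Row(*[out[f] for f in fields])

    protected = df.rdd.mapPartitions(protect_partition).toDF(df.schema)
    dyf = DynamicFrame.fromDF(protected, glueContext, "protected")
    return DynamicFrameCollection({"protected": dyf}, glueContext)

In Glue Studio the function definition line is pre-generated in the code editor, which is why the instructions above say to paste under it.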
The screenshot below shows the Custom Transform node containing the glue.py code.
Figure 5 – Create custom transformation code.
Step 5: Transform – SelectFromCollection
Select the Transform node, choose the Node properties tab, and select the appropriate Protegrity Cloud API to protect or unprotect the data.
Figure 6 – Add Protegrity Cloud API code to protect or unprotect data.
Step 6: Add Data Target
Select the target, which in this case is an S3 location, and choose the format and compression if needed.
Figure 7 – Add data target properties.
Step 7: Set Up ETL Job
The IAM role must be able to invoke the Protegrity Cloud API Lambda function and to read the mapping file.
Below is an example of job parameters:
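For instance, the job could take a single parameter pointing at the mapping file (the parameter name and bucket are illustrative):

Key: --mapping_file
Value: s3://my-config-bucket/mapping.json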
The screenshot below shows the job parameters and their values being set in the AWS Glue job.
Figure 8 – Set up ETL job.
Step 8: Save and Run
When all steps are complete, save and run the AWS Glue job. The status “Succeeded” appears when the job has run successfully.
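If you prefer the AWS CLI to the console, a saved job can be started with the standard start-job-run command; the job and parameter names below are placeholders:

aws glue start-job-run \
    --job-name protegrity-tokenize-job \
    --arguments '{"--mapping_file":"s3://my-config-bucket/mapping.json"}'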
Recommendations and Considerations
- Limit the number of workers based on your workload.
- Enable JSON Web Token (JWT) authorization on the Cloud API.
- Generate a JWT from your authentication service for each AWS Glue job.
Conclusion
In this post, we demonstrated how to integrate AWS Glue jobs with Protegrity’s Cloud API to support scalable, performant tokenization and detokenization.
For details on how to create the Cloud Protect API using AWS Lambda, refer to the AWS Marketplace listing for Protegrity. To learn more about the Protegrity solution, visit the Protegrity website.
Protegrity – AWS Partner Spotlight
Protegrity is an AWS Partner that provides fine-grained data protection capabilities (tokenization, encryption, masking) for sensitive data and compliance.