AWS Partner Network (APN) Blog

How to Scale Data Tokenization with AWS Glue and Protegrity

By Muneeb Hasan, Sr. Partner Solution Engineer – Protegrity
By Venkatesh Aravamudan, Partner Solutions Architect – AWS
By Tamara Astakhova, Sr. Partner Solutions Architect – AWS


In the current era of big data, where data grows exponentially and arrives from many different sources, consolidating it into one system is a major challenge for companies. That’s why many companies use AWS Glue to build extract, transform, load (ETL) workflows that load their data into data lakes.

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.

AWS Glue can connect to more than 70 data sources and provides centralized data cataloging. Companies can immediately search and query the cataloged data using services like Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.

Many compliance regulations govern the protection of confidential data, including personally identifiable information (PII), and these requirements should be addressed as part of any solution. Building a secure data pipeline is therefore a top priority for a modern business.

Amazon Web Services (AWS) has collaborated with Protegrity, an AWS Partner with Competencies in Security and Data and Analytics, to enable organizations with strict security requirements to protect their data while still obtaining powerful insights from it.

In this post, we will demonstrate how data tokenization for data in transit is performed using Protegrity’s Cloud API and AWS Glue.

Protegrity for AWS Glue

Protegrity is a global leader in data security and provides data tokenization for AWS Glue by employing a cloud-native, serverless architecture.

The solution scales elastically to meet the demands of AWS Glue’s on-demand, intensive workload processing. Serverless tokenization with Protegrity delivers data security with the performance and on-demand scale that organizations need for sensitive data protection.

About Tokenization

Tokenization is a non-mathematical approach to protecting data while preserving its type, format, and length. Tokens appear similar to the original value and can keep sensitive data fully or partially visible for data processing and analytics.

Historically, vault-based tokenization has used a database table to create lookup pairs that associate a token with encrypted sensitive information.

Protegrity Vaultless Tokenization (PVT) uses innovative techniques to eliminate data management and scalability problems typically associated with vault-based tokenization. Using AWS Glue with Protegrity, data can be tokenized or de-tokenized (re-identified) with Protegrity Cloud API depending on the user’s role and the governing Protegrity security policy.

Below is an example of tokenized or de-identified PII data preserving potential analytic usability.

The email is tokenized while the domain name is kept in the clear, and the date of birth (DOB) is tokenized except for the year. The other fields in green are fully tokenized. This example tokenization strategy still allows age-based analytics on balance, credit, and medical data.

Protegrity-Glue-Tokenization-1

Figure 1 – Example tokenized data.
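
As a simple illustration of this strategy, a record might look like the following before and after tokenization. All values below are invented for this example and are not taken from the figure:

# Hypothetical example; all values are invented for illustration.
original  = {"patient_name": "Jane Smith", "email": "jane.smith@example.com", "dob": "1984-03-17", "ssn": "123-45-6789"}
tokenized = {"patient_name": "Wfqk Trzad", "email": "qzwkv.rtmlp@example.com", "dob": "1984-11-29", "ssn": "806-21-3497"}
# The email domain and the year of birth remain in the clear; the other values are tokenized.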

Solution Overview and Architecture

The AWS Glue job uses a custom transform to call the Protegrity protection function in AWS Lambda. When you run the job, each AWS Glue slice filters the applicable rows, groups them into batches, and sends those batches to your Lambda function. A mapping file associated with the job defines the fields to be protected.

The federated user identity is granted permissions in its AWS Identity and Access Management (IAM) policy to read the mapping file and invoke the Lambda function. The Lambda function checks the request against the Protegrity security policy to determine whether the user is permitted to perform the operation.

The number of parallel requests to Lambda scales linearly with the number of slices in your AWS Glue job, with up to 10 invocations per slice running in parallel.

Protegrity-Glue-Tokenization-2

Figure 2 – AWS Glue and Protegrity Cloud Protect API architecture.
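
Based on the custom transform code in Step 4, each Lambda invocation receives a payload shaped roughly like the following; the values shown here are placeholders:

# Hypothetical payload shape, derived from the custom transform in Step 4; values are placeholders.
import json

payload = {
    "path": "/protect",                      # or "/unprotect"
    "headers": {},
    "body": json.dumps({
        "data_element": "deTokAlpha",        # data element defined in the Protegrity policy
        "data": ["Jane Smith", "John Doe"],  # one batch of values for a single column
        "user": "glue_service_user"          # policy user from the mapping file
    })
}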

Set Up Tokenization with Protegrity and AWS Glue

For this post, here are the prerequisites:

  • Protegrity Cloud API installed on AWS.
  • An AWS Glue role with permissions to invoke the protect Lambda function and to read the mapping file.
  • An AWS Glue database with the data to protect.

We also assume you have a Protegrity account and have set up Protegrity Serverless in your account. The Lambda function used in this post can be acquired from Protegrity by visiting AWS Marketplace and looking for the Protegrity S3 Accelerator.

Step 1: Create Mapping File and Upload to Amazon S3

A mapping file is essential for Protegrity to perform the protect/unprotect operations for an AWS Glue job. It identifies the following:

  • lambda_function_name: The Cloud API Lambda function name.
  • batchsize: How many rows to send to the Cloud API in each batch.
  • policy_user: The user for the protect operation in the Protegrity policy.
  • columns: The columns to protect.
  • operation: Protect or unprotect.
  • data_element: The data element to protect the column with; it must exist in the Protegrity policy.

Below is a sample mapping file:

{
  "lambda_function_name": "Protegrity_Protect_RESTAPI_Cloud-Protect-API-ETL",
  "batchsize": 25000,
  "policy_user": "glue_service_user",
  "columns":{
      "patient_name":{
         "operation":"protect",
         "data_element":"deTokAlpha"
      },
      "medical_record_number":{
         "operation":"protect",
         "data_element":"deTokAlpha"
      },
      "patient_id":{
         "operation":"protect",
         "data_element":"deTokAlpha"
      }
   }
}
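
The mapping file then needs to be uploaded to Amazon S3 so the AWS Glue job can read it. Here is a minimal sketch using boto3; the bucket name and key are placeholders:

# Minimal sketch; bucket name and key are placeholders.
import boto3

s3 = boto3.client("s3")
s3.upload_file("mapping.json", "my-mapping-bucket", "glue/mapping.json")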

Step 2: Create ETL Job in AWS Glue Studio

In the AWS Glue console, select ETL jobs > Visual ETL from the left menu and click Create.

Protegrity-Glue-Tokenization-3

Figure 3 – Create ETL job in AWS Glue Studio.

Step 3: Add Input Data Source

The next step is to select the Amazon Simple Storage Service (Amazon S3) file which needs to be loaded and provide the details in Data source properties.

Protegrity-Glue-Tokenization-4

Figure 4 – Add input data source properties.

Step 4: Create Custom Transform

Next, we’ll create a custom transform. The glue.py file contains the code; copy and paste it starting at the second line (under the function definition).

def MyTransform (glueContext, dfc) -> DynamicFrameCollection:
    import asyncio
    import json
    import sys
    import boto3
    import botocore
    from awsglue.utils import getResolvedOptions
    
    DEFAULT_BATCHSIZE = 25000
    
    # Read the mapping file from the given S3 bucket and key
    def extractMapping(bucket, key):
        try:
            s3Resource = boto3.client('s3')
            fileobj = s3Resource.get_object(Bucket=bucket, Key=key)
            file_content = fileobj['Body'].read().decode('utf-8')
            return json.loads(file_content)
        except botocore.exceptions.ClientError as err:
            print('Error Message: {}'.format(err.response['Error']['Message']))
            return None
    
    async def protect_data(lambda_function_name, values, data_element, method="protect", *, user, query_id="0"):
        if not values:
            return values
        payload = {}
        body = {}
        body["data_element"] = data_element
        body["data"] = values
        # -- Policy User ---
        body["user"] = user
        # -- End Policy User ---
        # Serialize the body after all fields (including the policy user) are set
        payload["body"] = json.dumps(body)
        payload["path"] = f"/{method}"
        payload["headers"] = {}
        try:
            client = boto3.client('lambda')
            response = client.invoke(FunctionName=lambda_function_name, Payload=json.dumps(payload))
            response_payload = json.loads(response['Payload'].read().decode())
            json_response = json.loads(response_payload.get('body', {}))
            if not json_response.get("success"):
                errmsg = json_response.get("error_msg")
                raise RuntimeWarning(f"Cloud API response error: {errmsg}")
            protected = json_response.get('results')
            return protected
        except Exception as ex:
            print(f'protect_data--{query_id}--exception: {ex}')
            raise ex
        return values
    
    def divide_list(l, n): 
        for i in range(0, len(l), n):  
            yield l[i:i + n] 
    
    async def process_column(lambda_function_name, batchsize, it_list, column_name, data_element, method, policyuser, query_id):
        listChunks = divide_list(it_list, batchsize)
        i = 0
        for listchunk in listChunks:
            data_to_protect = [x["record"][column_name] for x in listchunk]
            protected_data = await protect_data(lambda_function_name, data_to_protect, data_element, method, user=policyuser, query_id=query_id)
            # apply the protect values
            for idx in range(len(protected_data)):
                it_list[i]["record"][column_name] = protected_data[idx]
                i += 1
    
    async def protect_async(it_list, mapping_dic):
        # Use STS to look up the caller identity as a fallback policy user
        sts_client = boto3.client('sts')
        batchsize = mapping_dic.get("batchsize", DEFAULT_BATCHSIZE)
        aws_user_id = sts_client.get_caller_identity()['UserId']
        lambda_function_name = mapping_dic.get("lambda_function_name", None)
        if not lambda_function_name:
            raise ValueError("Required field 'lambda_function_name' is missing in mapping.json")
        print(f'lambda_function_name: {lambda_function_name}, batchsize: {batchsize}')
        columns = mapping_dic.get("columns", {})
        if not columns:
            raise ValueError("No columns to protect in mapping.json")
        coroutines = []
        for clmn in columns:
            coroutines.append(process_column(lambda_function_name, 
                                                batchsize, 
                                                it_list, 
                                                clmn, mapping_dic["columns"][clmn]["data_element"], 
                                                method=mapping_dic["columns"][clmn]["operation"],
                                                policyuser=mapping_dic.get("policy_user", aws_user_id),
                                                query_id=args['JOB_NAME']))
        await asyncio.gather(*coroutines, return_exceptions=True)
    
    def protect_by_key(it):
        import asyncio
        loop = asyncio.get_event_loop()
        list_it = list(it)
        loop.run_until_complete(protect_async(list_it, mapping_dic))
    
        return list_it
    
    dyF = dfc.select(list(dfc.keys())[0])
     
    args_internal = getResolvedOptions(sys.argv, ["s3_mapping_bucket", "s3_mapping_file"])
    s3_mapping_bucket = args_internal["s3_mapping_bucket"]
    s3_mapping_file = args_internal["s3_mapping_file"]
    print(f"s3_mapping_file: {s3_mapping_bucket}, s3_mapping_file: {s3_mapping_file}")
    mapping_dic = extractMapping(s3_mapping_bucket, s3_mapping_file)
    print(json.dumps(mapping_dic))
        
    newcustomerdyc = dyF.mapPartitions(protect_by_key, transformation_ctx="GlueProtectTransform")
    return (DynamicFrameCollection({"CustomTransform0": newcustomerdyc}, glueContext))

The screenshot below shows the Custom Transform containing the code from the snippet above.

Protegrity-Glue-Tokenization-5

Figure 5 – Create custom transformation code.

Step 5: Transform – SelectFromCollection

Select the Transform node, and then choose the Node properties tab and select the appropriate Protegrity Cloud API to protect or unprotect data.

Protegrity-Glue-Tokenization-6

Figure 6 – Add Protegrity Cloud API code to protect or unprotect data.

Step 6: Add Data Target

Select the target, which will be an S3 location in this case. Select the format and compression if needed.

Protegrity-Glue-Tokenization-7

Figure 7 – Add data target properties.

Step 7: Set Up ETL Job

The IAM role should be able to invoke the Protegrity Cloud API and be able to read the mapping file.

Below is an example of the job parameters and IAM role requirements:

Job parameters:

  • --s3_mapping_bucket – The S3 bucket name where the mapping file is stored.
  • --s3_mapping_file – The path to the mapping file.

IAM role:

  • Must be able to invoke the Lambda Cloud API.
  • Must be able to read the mapping file.

The screenshot below shows the job parameters and their values being set in the AWS Glue job.

Protegrity-Glue-Tokenization-8

Figure 8 – Set up ETL job.
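
As a minimal sketch, an IAM policy attached to the job role might look like the following; the Region, account ID, function name, bucket, and key are placeholders:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "lambda:InvokeFunction",
      "Resource": "arn:aws:lambda:us-east-1:111122223333:function:Protegrity_Protect_RESTAPI_Cloud-Protect-API-ETL"
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-mapping-bucket/glue/mapping.json"
    }
  ]
}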

Step 8: Save and Run the Job

When all steps have been completed, save and run the AWS Glue job. The run status will show “Succeeded” when the job completes successfully.
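
You can also start the job programmatically. Below is a minimal sketch using boto3; the job name and argument values are placeholders:

# Minimal sketch; job name and argument values are placeholders.
import boto3

glue = boto3.client("glue")
response = glue.start_job_run(
    JobName="protegrity-tokenization-job",
    Arguments={
        "--s3_mapping_bucket": "my-mapping-bucket",
        "--s3_mapping_file": "glue/mapping.json"
    }
)
print(response["JobRunId"])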

Recommendations and Considerations

  • Limit the number of workers based on your workload.
  • Enable JSON Web Token (JWT) authorization on the Cloud API.
  • Generate a JWT from your authentication service for each AWS Glue job (see the sketch below).
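
If JWT authorization is enabled on the Cloud API, the token would be attached to each request. The sketch below is a rough illustration only; the header name, token source, and exact contract are assumptions rather than the documented Protegrity Cloud API interface:

# Rough sketch only; the header name and token source are assumptions,
# not the documented Protegrity Cloud API contract.
import json

def add_jwt(payload, jwt_token):
    # Attach the token to the (currently empty) headers section built in the custom transform
    payload["headers"]["Authorization"] = f"Bearer {jwt_token}"
    return payload

payload = {"path": "/protect", "headers": {}, "body": json.dumps({"data_element": "deTokAlpha", "data": ["example"], "user": "glue_service_user"})}
payload = add_jwt(payload, "<token issued by your authentication service>")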

Conclusion

In this post, we demonstrated how to integrate AWS Glue jobs with the Protegrity Cloud API to support scalable, performant tokenization and detokenization.

For details on how to create Cloud Protect API using AWS Lambda, refer to the AWS Marketplace listing for Protegrity. To learn more about the Protegrity solution, visit the Protegrity website.



Protegrity – AWS Partner Spotlight

Protegrity is an AWS Partner that provides fine-grained data protection capabilities (tokenization, encryption, masking) for sensitive data and compliance.

Contact Protegrity | Partner Overview | AWS Marketplace