AWS Partner Network (APN) Blog

Data Tokenization with Amazon Redshift Dynamic Data Masking and Protegrity

By Alexandre Charlet, Principal Partner Solution Engineer – Protegrity
By Venkatesh Aravamudan, Partner Solution Architect – AWS


In this digital era where data is the new currency, safeguarding sensitive information has become paramount. As organizations increasingly rely on cloud solutions, the need for robust data protection measures has never been more critical.

The seamless integration between Amazon Redshift Dynamic Data Masking (DDM) and Protegrity’s Redshift Protector effectively protects sensitive data using Protegrity Vaultless Tokenization (PVT). This native integration, along with Protegrity’s solutions for both on-premises and AWS cloud services, enables AWS customers to manage sensitive data protection at rest, in transit, and during use.

This post reviews key concepts related to Protegrity Vaultless Tokenization, Amazon Redshift Dynamic Data Masking, and Protegrity’s Redshift Protector. It includes architecture overviews and code examples, demonstrating the integration between Amazon Redshift Dynamic Data Masking and Protegrity’s Redshift Protector.

Protegrity is an AWS Specialization Partner and AWS Marketplace Seller with AWS Competencies in Security and Data and Analytics. Protegrity works alongside AWS security services to protect specific data fields using pseudonymization, encryption, and/or masking techniques.

Amazon Redshift

Amazon Redshift is a fast, fully managed, cloud-native data warehouse that makes it simple and cost-effective to analyze data using standard SQL and your existing business intelligence tools.

Many enterprises today are leveraging Amazon Redshift to analyze and unlock value from their data. To help these companies ensure sensitive data is protected at every step of its lifecycle, Redshift and Protegrity have collaborated to provide customers with cloud-native, serverless data security.

Amazon Redshift’s Lambda User Defined Function (UDF) support makes it easy to integrate with external big data analytics and data enrichment platforms. Lambda UDFs can be written in any supported programming language, such as Java, Go, PowerShell, Node.js, C#, Python, Ruby, or custom runtimes.
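To make this concrete, the following is a minimal sketch of how a Lambda function is registered as a scalar Lambda UDF in Redshift; the Lambda function name and IAM role ARN are illustrative placeholders rather than the objects created by the Protegrity installation.

-- Minimal sketch: register a Lambda function as a Redshift scalar UDF.
-- The Lambda name and IAM role ARN below are placeholders, not the
-- resources deployed by Protegrity.
CREATE EXTERNAL FUNCTION pty_unprotect_name(VARCHAR)
RETURNS VARCHAR
STABLE
LAMBDA 'pty-redshift-protector'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-lambda-invoke-role';

Once registered, the function can be called from SQL like any other scalar function, which is what the views and masking policies shown later in this post rely on.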

This collaboration enables organizations with strict security requirements to protect their data and derive greater insights without compromising data privacy.

Protegrity for Amazon Redshift

Protegrity provides data tokenization for Amazon Redshift by employing a cloud-native, serverless architecture. The solution scales elastically to meet Redshift’s on-demand, intensive workload processing seamlessly. Serverless tokenization with Protegrity delivers data security with the performance that organizations need for sensitive data protection and on-demand scale.

Amazon Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver fast performance for even the most demanding and unpredictable workloads, and you pay only for what you use. Just load data and start querying right away in Amazon Redshift Query Editor v2 or in your favorite business intelligence (BI) tool and continue to enjoy the best price performance and familiar SQL features in an easy-to-use, zero administration environment.

Protegrity’s tokenization solution supports Amazon Redshift Serverless and provisioned Amazon Redshift in the same way.

Amazon Redshift Dynamic Data Masking

Amazon Redshift extends its security features with support for Dynamic Data Masking, which simplifies the process of protecting sensitive data in your Redshift data warehouse. With DDM, you control access to your data through SQL-based masking policies that determine how Redshift returns sensitive data to the user at query time.

With this capability, you can create masking policies to define consistent, format-preserving, and irreversible masked data values. You can apply masking to a specific column or a list of columns in a table. You also have the flexibility of choosing how to show the masked data. For example, you can completely hide all of the information in the data, replace partial real values with wildcard characters, or define your own way to mask the data using SQL expressions, Python UDFs, or Lambda UDFs.

Additionally, you can apply conditional masking based on other columns, which selectively protects the column data in a table based on the values in other columns. When you attach a policy to a table, the masking expression can be applied to one or more of its columns.
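As an illustration of these options, the sketch below uses a plain SQL expression to keep only the last four digits of a Social Security number visible, and attaches the policy conditionally on the value of another column. The table, column, and role names are hypothetical.

-- Hypothetical sketch: partial masking with a SQL expression,
-- conditioned on the value of another column (country).
CREATE MASKING POLICY mask_ssn_partial
WITH (ssn VARCHAR(11), country VARCHAR(2))
USING (
    CASE
        WHEN country = 'US' THEN 'XXX-XX-' || SUBSTRING(ssn, 8, 4)
        ELSE ssn
    END
);
--
ATTACH MASKING POLICY mask_ssn_partial
ON customer_pii (ssn)
USING (ssn, country)
TO ROLE analyst;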

Protegrity leverages Redshift’s DDM feature to create a user-friendly integration for developers and transparent access for end users. Protegrity helps customers go one step further than DDM by tokenizing sensitive data at rest across the overall data flow and detokenizing it only for specific requirements, leveraging Amazon Redshift external functions.

Protegrity Vaultless Tokenization

Tokenization is a non-mathematical approach to protecting data while preserving its type, format, and length. Tokens appear similar to the original value and can keep sensitive data fully or partially visible for data processing and analytics.

Historically, vault-based tokenization uses a database table to create lookup pairs that associate a token with encrypted sensitive information. Protegrity Vaultless Tokenization uses innovative techniques to eliminate data management and scalability problems typically associated with vault-based tokenization. Using Amazon Redshift with Protegrity, data can be tokenized or de-tokenized (re-identified) with SQL depending on the user’s role and the governing Protegrity security policy.

PVT supports all types of characters—for example, Chinese, Japanese, and Arabic—and gives customers the flexibility to define which sets of characters are protected.

Below is an example of applying a Protegrity security policy that tokenizes personally identifiable information (PII) while preserving the potential analytic usability of the data.


Figure 1 – Example tokenized data.

In Figure 1, the tokenized value of the name maintains the same Arabic alphabet as its original value. The date of birth (DOB) is tokenized while leaving the year unchanged (or “in the clear”). The phone number maintains the first four digits in the clear. Non-sensitive data does not need to be tokenized.

Role-based access control (RBAC) is applied to all users who attempt to access protected data, and their privileges determine whether they see tokens or the original data. Protegrity RBAC allows customers to apply the principle of least privilege.


Figure 2 – Example of least privilege principle to sensitive data.

Figure 2 shows which sensitive data elements are visible to users based on their access permissions. Data is protected by default and at rest, either fully or partially depending on business and data requirements.

A data scientist may only have access to the value of the city in the clear, the first four digits of the phone number, and the year of the date of birth. The name value is still protected, and the Social Security Number (SSN) is masked.

A financial controller might have access to all of the protected data in the clear.
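On the Redshift side, this least-privilege model can be expressed by attaching different masking policies to different roles, with priorities resolving overlaps; who ultimately sees clear text still depends on the Protegrity security policy evaluated by the Lambda protector. A minimal sketch, assuming the fake_pii table and unprotect_ssn_masking_policy used later in this post, and a hypothetical financial_controller role:

-- Hypothetical sketch: everyone sees the stored token by default,
-- while the financial_controller role gets the detokenizing policy.
CREATE MASKING POLICY keep_ssn_tokenized
WITH (ssn VARCHAR(256))
USING (ssn);  -- pass-through: returns the token exactly as stored
--
ATTACH MASKING POLICY keep_ssn_tokenized
ON fake_pii (ssn)
TO PUBLIC
PRIORITY 10;
--
ATTACH MASKING POLICY unprotect_ssn_masking_policy
ON fake_pii (ssn)
TO ROLE financial_controller
PRIORITY 20;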

Protegrity Integration Overview and Architecture

Lambda UDFs are architected to perform efficiently and securely. When you invoke a Lambda UDF, Amazon Redshift batches the applicable rows after filtering and sends those batches to your Lambda function. The federated user identity is included in the payload, which Lambda compares with the Protegrity security policy to determine whether partial or full access to data is permitted.

The number of parallel requests to Lambda scales as Amazon Redshift Serverless automatically scales. To learn more, see the Amazon Redshift architecture.

Figure 3 – Integration of Protegrity with Amazon Redshift DDM.

Users access data through the Amazon Redshift DDM feature. Protegrity functions are called by Amazon Redshift DDM to detokenize sensitive data only for users and privileges that have been previously defined in the Protegrity Data Protection Policy. The Protegrity Enterprise Security Administrator (ESA) is the centralized appliance where the Data Protection Policy is defined.
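From the end user’s point of view the integration is transparent; a plain query is enough. In the illustrative example below (using the fake_pii table from the walkthrough that follows), an authorized user receives clear values because DDM invokes the Protegrity UDF at query time, while everyone else receives tokens or masked values:

-- No special syntax for end users: DDM decides per user whether the
-- Protegrity UDF is invoked to detokenize the column.
SELECT first_name, ssn
FROM fake_pii
LIMIT 10;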

Setting Up Protegrity Within Amazon Redshift DDM

In this walkthrough, we will embed Protegrity UDFs into Amazon Redshift DDM. Previous setup steps (like Protegrity UDF creation) are covered in this AWS blog post about Protegrity for Amazon Redshift.

The following example shows how the masking policies are created and attached to the table and column:

-- Sample query to create a masking policy and attach it to one column of a specific table.
-- The pty_xxx objects are UDFs previously defined; they point to the Protegrity Redshift Protector based on serverless Lambda.
CREATE MASKING POLICY unprotect_name_masking_policy
WITH (first_name    VARCHAR(256))
USING (pty.unprotect_Name(first_name));
--
ATTACH MASKING POLICY unprotect_name_masking_policy
ON fake_pii (first_name)
TO PUBLIC;
--
CREATE MASKING POLICY unprotect_ssn_masking_policy
WITH (ssn    VARCHAR(256))
USING (pty.unprotect_SSN(ssn));
--
ATTACH MASKING POLICY unprotect_ssn_masking_policy
ON fake_pii (ssn)
TO PUBLIC;

-- Check that the masking policies have been created and attached
SELECT * FROM svv_masking_policy;
SELECT * FROM svv_attached_masking_policy;
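If a policy needs to be rolled back, for example while testing, it can be detached again. A minimal sketch reusing the names from the block above:

-- Detach the policy from the column when it is no longer needed
DETACH MASKING POLICY unprotect_ssn_masking_policy
ON fake_pii (ssn)
FROM PUBLIC;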

Comparing Protegrity Before and After Amazon Redshift DDM Release

We assume at this stage that:

  • Sensitive data has already been protected earlier in the data flow. Typically, data protection will occur before data lands in cloud data storage.
  • Protegrity UDFs have already been created.

Before Amazon Redshift DDM:

Protegrity has been designed to minimize the impact on existing data consumers. To do that, most Protegrity customers create database views that enforce the Protegrity data protection policy at runtime.

The view’s definition integrates Protegrity UDFs:

-- Query to create the view with embedded Protegrity UDFs.
-- The pty_xxx objects are UDFs previously defined; they point to the Protegrity Redshift Protector based on serverless Lambda.
CREATE OR REPLACE VIEW "clinical_trial"."health_patient_record_v" AS
SELECT
    record_id,
    pty_unprotect_name(patient_name) AS patient_name,
    pty_unprotect_nin(nin) AS nin,
    patient_id,
    pty_unprotect_medcondition(med_condition_1) AS med_condition_1,
    pty_unprotect_medcondition(med_condition_2) AS med_condition_2,
    height,
    weight,
    blood_type,
    rhd,
    allergy_1,
    allergy_2,
    previous_condition,
    primary_clinic,
    primary_physician,
    cost_center
FROM
    "clinical_trial"."health_patient_record";

While creating new views is relatively simple, it requires extra work for developers and results in changes to the database schema and structure. For end users, the view name can be the same as the original table name, thus minimizing changes to how users access the data.
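One common way to keep the object name unchanged for end users is to publish the view in a separate consumer schema under the original table name and grant users access only to that schema. The schema and group names in the sketch below are hypothetical:

-- Hypothetical sketch: expose the view under the original table name in a
-- consumer schema, and grant end users access to that schema only.
CREATE SCHEMA IF NOT EXISTS clinical_trial_consumer;
--
CREATE OR REPLACE VIEW clinical_trial_consumer.health_patient_record AS
SELECT * FROM "clinical_trial"."health_patient_record_v";
--
GRANT USAGE ON SCHEMA clinical_trial_consumer TO GROUP analysts;
GRANT SELECT ON clinical_trial_consumer.health_patient_record TO GROUP analysts;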

After Amazon Redshift DDM:

Creating masking policies and attaching them to the right tables and columns is a more efficient mechanism than creating new views. Further, masking is applied consistently regardless of how the data is accessed, whether through direct end-user queries, BI platforms, or other connected data flows.

Let’s look at a few examples of how masking policy definitions that integrate Protegrity UDFs are created and attached to tables:

-- Query to create and attach masking policies that include Protegrity UDFs.
-- The pty_xxx objects are UDFs previously defined; they point to the Protegrity Redshift Protector based on serverless Lambda.

CREATE MASKING POLICY unprotect_name_masking_policy
WITH (patient_name    VARCHAR(256))
USING (pty.pty_unprotect_Name(patient_name));
--
ATTACH MASKING POLICY unprotect_name_masking_policy
ON "clinical_trial"."health_patient_record" (patient_name)
TO PUBLIC;
--
CREATE MASKING POLICY unprotect_nin_masking_policy
WITH (nin    VARCHAR(256))
USING (pty.pty_unprotect_nin(nin));
--
ATTACH MASKING POLICY unprotect_nin_masking_policy
ON "clinical_trial"."health_patient_record" (nin)
TO PUBLIC;
--
CREATE MASKING POLICY unprotect_medcondition_masking_policy
WITH (med_condition    VARCHAR(256))
USING (pty.pty_unprotect_medcondition(med_condition));
--
ATTACH MASKING POLICY unprotect_medcondition_masking_policy
ON "clinical_trial"."health_patient_record" (med_condition_1)
TO PUBLIC;
--
ATTACH MASKING POLICY unprotect_medcondition_masking_policy
ON "clinical_trial"."health_patient_record" (med_condition_2)
TO PUBLIC;

The complexity of implementing and applying such masking policies is very low for database developers. For end users, there is no change to how they access the data. If a user has permission to see data in detokenized form, detokenization is applied at runtime.

Performance

Performance is the same for both approaches described above. Please refer to this AWS blog post about data tokenization with Amazon Redshift and Protegrity.

Conclusion

In this post, we described how users can leverage Amazon Redshift Dynamic Data Masking (DDM) together with Protegrity serverless UDFs to deliver a high-performance, low-impact data protection solution for any number of valuable business use cases.

  • For more details on how to create an Amazon Redshift Lambda UDF, refer to the documentation.
  • To learn more about the Protegrity Serverless solution, visit the Protegrity website.
  • To read about other integrations, such as AWS Glue, see this blog post.

Protegrity’s Cloud Protect for Amazon Redshift is available for purchase through AWS Marketplace.



Protegrity – AWS Partner Spotlight

Protegrity is an AWS Specialization Partner that provides fine-grained data protection capabilities (tokenization, encryption, masking) for sensitive data and compliance.

Contact Protegrity | Partner Overview | AWS Marketplace