AWS Big Data Blog

Apply fine-grained access and transformation on the SUPER data type in Amazon Redshift

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.

Amazon Redshift, a cloud data warehouse service, supports attaching dynamic data masking (DDM) policies to paths of SUPER data type columns, and uses the OBJECT_TRANSFORM function with the SUPER data type. SUPER data type columns in Amazon Redshift contain semi-structured data like JSON documents. Previously, data masking in Amazon Redshift only worked with regular table columns, but now you can apply masking policies specifically to elements within SUPER columns. For example, you could apply a masking policy to mask sensitive fields like credit card numbers within JSON documents stored in a SUPER column. This allows for more granular control over data masking in Amazon Redshift. Amazon Redshift gives you more flexibility in how you apply data masking to protect sensitive information stored in SUPER columns containing semi-structured data.

With DDM support in Amazon Redshift, you can do the following:

  • Define masking policies that apply custom obfuscation policies, such as masking policies to handle credit card, personally identifiable information (PII) entries, HIPAA or GDPR needs, and more
  • Transform the data at query time to apply masking policies
  • Attach masking policies to roles or users
  • Attach multiple masking policies with varying levels of obfuscation to the same column in a table and assign them to different roles with priorities to avoid conflicts
  • Implement cell-level masking by using conditional columns when creating your masking policy
  • Use masking policies to partially or completely redact data, or hash it by using user-defined functions (UDFs)

In this post, we demonstrate how a retail company can control the access of PII data stored in the SUPER data type to users based on their access privilege without duplicating the data.

Solution overview

For our use case, we have the following data access requirements:

  • Users from the Customer Service team should be able to view the order data but not PII information
  • Users from the Sales team should be able to view customer IDs and all order information
  • Users from the Executive team should be able to view all the data
  • Staff should not be able to view any data

The following diagram illustrates how DDM support in Amazon Redshift policies works with roles and users for our retail use case.

The solution encompasses creating masking policies with varying masking rules and attaching one or more to the same role and table with an assigned priority to remove potential conflicts. These policies may pseudonymize results or selectively nullify results to comply with retailers’ security requirements. We refer to multiple masking policies being attached to a table as a multi-modal masking policy. A multi-modal masking policy consists of three parts:

  • A data masking policy that defines the data obfuscation rules
  • Roles with different access levels depending on the business case
  • The ability to attach multiple masking policies on a user or role and table combination with priority for conflict resolution

Prerequisites

To implement this solution, you need the following prerequisites:

Prepare the data

To set up our use case, complete the following steps:

  1. On the Amazon Redshift console, choose Query editor v2 under Explorer in the navigation pane.

If you’re familiar with SQL Notebooks, you can download the SQL notebook for the demonstration and import it to quickly get started.

  1. Create the table and populate contents:
    -- 1- Create the orders table
    drop table if exists public.order_transaction;
    create table public.order_transaction (
     data_json super
    );
    
    -- 2- Populate the table with sample values
    INSERT INTO public.order_transaction
    VALUES
        (
            json_parse('
            {
            "c_custkey": 328558,
            "c_name": "Customer#000328558",
            "c_phone": "586-436-7415",
            "c_creditcard": "4596209611290987",
            "orders":{
              "o_orderkey": 8014018,
              "o_orderstatus": "F",
              "o_totalprice": 120857.71,
              "o_orderdate": "2024-01-01"
              }
            }'
            )
        ),
        (
            json_parse('
            {
            "c_custkey": 328559,
            "c_name": "Customer#000328559",
            "c_phone": "789-232-7421",
            "c_creditcard": "8709000219329924",
            "orders":{
              "o_orderkey": 8014019,
              "o_orderstatus": "S",
              "o_totalprice": 9015.98,
              "o_orderdate": "2024-01-01"
              }
            }'
            )
        ),
        (
            json_parse('
            {
            "c_custkey": 328560,
            "c_name": "Customer#000328560",
            "c_phone": "276-564-9023",
            "c_creditcard": "8765994378650090",
            "orders":{
              "o_orderkey": 8014020,
              "o_orderstatus": "C",
              "o_totalprice": 18765.56,
              "o_orderdate": "2024-01-01"
              }
            }
            ')
        );

Implement the solution

To satisfy the security requirements, we need to make sure that each user sees the same data in different ways based on their granted privileges. To do that, we use user roles combined with masking policies as follows:

  1. Create users and roles, and add users to their respective roles:
    --create four users
    set session authorization admin;
    CREATE USER Kate_cust WITH PASSWORD disable;
    CREATE USER Ken_sales WITH PASSWORD disable;
    CREATE USER Bob_exec WITH PASSWORD disable;
    CREATE USER Jane_staff WITH PASSWORD disable;
    
    -- 1. Create User Roles
    CREATE ROLE cust_srvc_role;
    CREATE ROLE sales_srvc_role;
    CREATE ROLE executives_role;
    CREATE ROLE staff_role;
    
    -- note that public role exists by default.
    -- Grant Roles to Users
    GRANT ROLE cust_srvc_role to Kate_cust;
    GRANT ROLE sales_srvc_role to Ken_sales;
    GRANT ROLE executives_role to Bob_exec;
    GRANT ROLE staff_role to Jane_staff;
    
    -- note that regualr_user is attached to public role by default.
    GRANT ALL ON ALL TABLES IN SCHEMA "public" TO ROLE cust_srvc_role;
    GRANT ALL ON ALL TABLES IN SCHEMA "public" TO ROLE sales_srvc_role;
    GRANT ALL ON ALL TABLES IN SCHEMA "public" TO ROLE executives_role;
    GRANT ALL ON ALL TABLES IN SCHEMA "public" TO ROLE staff_role;
  2. Create masking policies:
    -- Mask Full Data
    CREATE MASKING POLICY mask_full
    WITH(pii_data VARCHAR(256))
    USING ('000000XXXX0000'::TEXT);
    
    -- This policy rounds down the given price to the nearest 10.
    CREATE MASKING POLICY mask_price
    WITH(price INT)
    USING ( (FLOOR(price::FLOAT / 10) * 10)::INT );
    
    -- This policy converts the first 12 digits of the given credit card to 'XXXXXXXXXXXX'.
    CREATE MASKING POLICY mask_credit_card
    WITH(credit_card TEXT)
    USING ( 'XXXXXXXXXXXX'::TEXT || SUBSTRING(credit_card::TEXT FROM 13 FOR 4) );
    
    -- This policy mask the given date
    CREATE MASKING POLICY mask_date
    WITH(order_date TEXT)
    USING ( 'XXXX-XX-XX'::TEXT);
    
    -- This policy mask the given phone number
    CREATE MASKING POLICY mask_phone
    WITH(phone_number TEXT)
    USING ( 'XXX-XXX-'::TEXT || SUBSTRING(phone_number::TEXT FROM 9 FOR 4) );
  3. Attach the masking policies:
    • Attach the masking policy for the customer service use case:
      --customer_support (cannot see customer PHI/PII data but can see the order id , order details and status etc.)
      
      set session authorization admin;
      
      ATTACH MASKING POLICY mask_full
      ON public.order_transaction(data_json.c_custkey)
      TO ROLE cust_srvc_role;
      
      ATTACH MASKING POLICY mask_phone
      ON public.order_transaction(data_json.c_phone)
      TO ROLE cust_srvc_role;
      
      ATTACH MASKING POLICY mask_credit_card
      ON public.order_transaction(data_json.c_creditcard)
      TO ROLE cust_srvc_role;
      
      ATTACH MASKING POLICY mask_price
      ON public.order_transaction(data_json.orders.o_totalprice)
      TO ROLE cust_srvc_role;
      
      ATTACH MASKING POLICY mask_date
      ON public.order_transaction(data_json.orders.o_orderdate)
      TO ROLE cust_srvc_role;
    • Attach the masking policy for the sales use case:
      --sales —> can see the customer ID (non phi data) and all order info
      
      set session authorization admin;
      
      ATTACH MASKING POLICY mask_phone
      ON public.order_transaction(data_json.customer.c_phone)
      TO ROLE sales_srvc_role;
    • Attach the masking policy for the staff use case:
      --Staff — > cannot see any data about the order. all columns masked for them ( we can hand pick some columns) to show the functionality
      
      set session authorization admin;
      
      ATTACH MASKING POLICY mask_full
      ON public.order_transaction(data_json.orders.o_orderkey)
      TO ROLE staff_role;
      
      ATTACH MASKING POLICY mask_pii_full
      ON public.order_transaction(data_json.orders.o_orderstatus)
      TO ROLE staff_role;
      
      ATTACH MASKING POLICY mask_pii_price
      ON public.order_transaction(data_json.orders.o_totalprice)
      TO ROLE staff_role;
      
      ATTACH MASKING POLICY mask_date
      ON public.order_transaction(data_json.orders.o_orderdate)
      TO ROLE staff_role;

Test the solution

Let’s confirm that the masking policies are created and attached.

  1. Check that the masking policies are created with the following code:
    -- 1.1- Confirm the masking policies are created
    SELECT * FROM svv_masking_policy;

  2. Check that the masking policies are attached:
    -- 1.2- Verify attached masking policy on table/column to user/role.
    SELECT * FROM svv_attached_masking_policy;

Now you can test that different users can see the same data masked differently based on their roles.

  1. Test that the customer support can’t see customer PHI/PII data but can see the order ID, order details, and status:
    set session authorization Kate_cust;
    select * from order_transaction;

  2. Test that the sales team can see the customer ID (non PII data) and all order information:
    set session authorization Ken_sales;
    select * from order_transaction;

  3. Test that the executives can see all data:
    set session authorization Bob_exec;
    select * from order_transaction;

  4. Test that the staff can’t see any data about the order. All columns should masked for them.
    set session authorization Jane_staff;
    select * from order_transaction;

Object_Transform function

In this section, we dive into the capabilities and benefits of the OBJECT_TRANSFORM function and explore how it empowers you to efficiently reshape your data for analysis. The OBJECT_TRANSFORM function in Amazon Redshift is designed to facilitate data transformations by allowing you to manipulate JSON data directly within the database. With this function, you can apply transformations to semi-structured or SUPER data types, making it less complicated to work with complex data structures in a relational database environment.

Let’s look at some usage examples.

First, create a table and populate contents:

--1- Create the customer table 

DROP TABLE if exists customer_json;

CREATE TABLE customer_json (
    col_super super,
    col_text character varying(100) ENCODE lzo
) DISTSTYLE AUTO;

--2- Populate the table with sample data 

INSERT INTO customer_json
VALUES
    (
        
        json_parse('
            {
                "person": {
                    "name": "GREGORY HOUSE",
                    "salary": 120000,
                    "age": 17,
                    "state": "MA",
                    "ssn": ""
                }
            }
        ')
        ,'GREGORY HOUSE'
    ),
    (
        json_parse('
              {
                "person": {
                    "name": "LISA CUDDY",
                    "salary": 180000,
                    "age": 30,
                    "state": "CA",
                    "ssn": ""
                }
            }
        ')
        ,'LISA CUDDY'
    ),
     (
        json_parse('
              {
                "person": {
                    "name": "JAMES WILSON",
                    "salary": 150000,
                    "age": 35,
                    "state": "WA",
                    "ssn": ""
                }
            }
        ')
        ,'JAMES WILSON'
    )
;
-- 3 select the data 

SELECT * FROM customer_json;

Apply the transformations with the OBJECT_TRANSFORM function:

SELECT
    OBJECT_TRANSFORM(
        col_super
        KEEP
            '"person"."name"',
            '"person"."age"',
            '"person"."state"'
           
        SET
            '"person"."name"', LOWER(col_super.person.name::TEXT),
            '"person"."salary"',col_super.person.salary + col_super.person.salary*0.1
    ) AS col_super_transformed
FROM customer_json;

As you can see in the example, by applying the transformation with OBJECT_TRANSFORM, the person name is formatted in lowercase and the salary is increased by 10%. This demonstrates how the transformation makes is less complicated to work with semi-structured or nested data types.

Clean up

When you’re done with the solution, clean up your resources:

  1. Detach the masking policies from the table:
    -- Cleanup
    --reset session authorization to the default
    RESET SESSION AUTHORIZATION;
  2. Drop the masking policies:
    DROP MASKING POLICY mask_pii_data CASCADE;
  3. Revoke or drop the roles and users:
    REVOKE ROLE cust_srvc_role from Kate_cust;
    REVOKE ROLE sales_srvc_role from Ken_sales;
    REVOKE ROLE executives_role from Bob_exec;
    REVOKE ROLE staff_role from Jane_staff;
    DROP ROLE cust_srvc_role;
    DROP ROLE sales_srvc_role;
    DROP ROLE executives_role;
    DROP ROLE staff_role;
    DROP USER Kate_cust;
    DROP USER Ken_sales;
    DROP USER Bob_exec;
    DROP USER Jane_staff;
  4. Drop the table:
    DROP TABLE order_transaction CASCADE;
    DROP TABLE if exists customer_json;

Considerations and best practices

Consider the following when implementing this solution:

  • When attaching a masking policy to a path on a column, that column must be defined as the SUPER data type. You can only apply masking policies to scalar values on the SUPER path. You can’t apply masking policies to complex structures or arrays.
  • You can apply different masking policies to multiple scalar values on a single SUPER column as long as the SUPER paths don’t conflict. For example, the SUPER paths a.b and a.b.c conflict because they’re on the same path, with a.b being the parent of a.b.c. The SUPER paths a.b.c and a.b.d don’t conflict.

Refer to Using dynamic data masking with SUPER data type paths for more details on considerations.

Conclusion

In this post, we discussed how to use DDM support for the SUPER data type in Amazon Redshift to define configuration-driven, consistent, format-preserving, and irreversible masked data values. With DDM support in Amazon Redshift, you can control your data masking approach using familiar SQL language. You can take advantage of the Amazon Redshift role-based access control capability to implement different levels of data masking. You can create a masking policy to identify which column needs to be masked, and you have the flexibility of choosing how to show the masked data. For example, you can completely hide all the information of the data, replace partial real values with wildcard characters, or define your own way to mask the data using SQL expressions, Python, or Lambda UDFs. Additionally, you can apply conditional masking based on other columns, which selectively protects the column data in a table based on the values in one or more columns.

We encourage you to create your own user-defined functions for various use cases and achieve your desired security posture using dynamic data masking support in Amazon Redshift.


About the Authors

Ritesh Kumar Sinha is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 15+ years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Omama Khurshid is an Acceleration Lab Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build reliable, scalable, and efficient solutions. Outside of work, she enjoys spending time with her family, watching movies, listening to music, and learning new technologies.