AWS Big Data Blog

Restrict Amazon Redshift Spectrum external table access to Amazon Redshift IAM users and groups using role chaining

With Amazon Redshift Spectrum, you can query the data in your Amazon Simple Storage Service (Amazon S3) data lake using a central AWS Glue metastore from your Amazon Redshift cluster. This capability extends your petabyte-scale Amazon Redshift data warehouse with virtually unbounded storage, which allows you to scale to exabytes of data cost-effectively. Like Amazon EMR, you get the benefits of open data formats and inexpensive storage, and you can scale out to thousands of Redshift Spectrum nodes to pull, filter, project, aggregate, group, and sort data. Like Amazon Athena, Redshift Spectrum is serverless and there’s nothing to provision or manage. You pay only $5 for every 1 TB of data scanned. This post discusses how to configure Amazon Redshift security to enable fine-grained access control using role chaining to achieve high-fidelity, user-based permission management.
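Because Redshift Spectrum pricing is usage-based, estimating query cost is simple arithmetic. The following sketch assumes the $5-per-TB rate quoted above; check current AWS pricing for your Region before relying on this number:

```python
def spectrum_scan_cost_usd(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Estimate Redshift Spectrum cost for a query from bytes scanned.

    Assumes the $5-per-TB rate quoted in this post; actual pricing
    varies by Region and over time.
    """
    tb = bytes_scanned / 1024**4  # bytes -> TB (binary)
    return tb * usd_per_tb

# A query that scans 2 TB costs about $10.
print(spectrum_scan_cost_usd(2 * 1024**4))  # 10.0
```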

As you start using the lake house approach, which integrates Amazon Redshift with the Amazon S3 data lake using Redshift Spectrum, you need more flexibility when it comes to granting access to different external schemas on the cluster. For example, in the following use case, you have two Redshift Spectrum schemas, SA and SB, mapped to two databases, A and B, respectively, in an AWS Glue Data Catalog, in which you want to allow access for the following when queried from Amazon Redshift:

  • Select access for schema SA only to IAM user group Grp1
  • Select access for schema SB only to IAM user group Grp2
  • No access for IAM user group Grp3 to schemas SA and SB

By default, the policies defined under the AWS Identity and Access Management (IAM) role assigned to the Amazon Redshift cluster manage Redshift Spectrum table access, which is inherited by all users and groups in the cluster. This IAM role associated with the cluster can’t easily be restricted to different users and groups. This post details the configuration steps necessary to achieve fine-grained authorization policies for different users in an Amazon Redshift cluster and control access to different Redshift Spectrum schemas and tables using IAM role chaining. With role chaining, you don’t have to modify the cluster; you can make all modifications on the IAM side, and adding new roles doesn’t require any changes in Amazon Redshift. Even when using AWS Lake Formation, as of this writing, you can’t achieve this level of isolated access control on Redshift Spectrum schemas and tables. For more information about cross-account queries, see How to enable cross-account Amazon Redshift COPY and Redshift Spectrum query for AWS KMS–encrypted data in Amazon S3.

Prerequisites

This post uses a TPC-DS 3 TB public dataset from Amazon S3, cataloged in AWS Glue by an AWS Glue crawler, and an example retail department dataset. To get started, you must complete the following prerequisites. The first two are outside the scope of this post; you can use your own cluster and a dataset in your own Amazon S3 data lake.

  1. Create an Amazon Redshift cluster with or without an IAM role assigned to the cluster.
  2. Create an AWS Glue Data Catalog with a database using data from the data lake in Amazon S3, with either an AWS Glue crawler, Amazon EMR, AWS Glue, or Athena. The database should have one or more tables pointing to different Amazon S3 paths. This post uses the industry-standard TPC-DS 3 TB dataset, but you can also use your own dataset.
  3. Create IAM users and groups to use later in Amazon Redshift:
    • Create new IAM groups named grpA and grpB without any policies.
    • Create users a1 and b1 and add them to groups grpA and grpB, respectively. Use lower-case usernames.
    • Add the following policy to all the groups you created to allow IAM users temporary credentials when authenticating against Amazon Redshift:
      {
          "Version": "2012-10-17",
          "Statement": {
              "Effect": "Allow",
              "Action": "redshift:GetClusterCredentials",
              "Resource": "arn:aws:redshift:us-east-1:0123456789:dbuser:*/*"
          }
      }

      For additional security, you may want to restrict this policy to specific database users and groups in the cluster rather than allowing all of them.

  4. Create corresponding database users and groups locally on the Amazon Redshift cluster without any password:
    • To create user a1, enter the following code:
      create user a1 password disable;
    • To create grpA, enter the following code:
      create group grpA with user a1;
    • Repeat these steps for user b1 and add the user to grpB.
  5. Install a JDBC SQL query client, such as SQL Workbench/J, on the client machine.
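The GetClusterCredentials policy shown above grants temporary credentials for any database user. A more restrictive variant, sketched below with this post’s example cluster placeholder and user/group names, scopes the permission to user a1 and group grpA only:

```json
{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": "redshift:GetClusterCredentials",
        "Resource": [
            "arn:aws:redshift:us-east-1:0123456789:dbuser:<YourRedshiftClusterName>/a1",
            "arn:aws:redshift:us-east-1:0123456789:dbgroup:<YourRedshiftClusterName>/grpa"
        ]
    }
}
```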

Use case

In the following use case, you have an AWS Glue Data Catalog with a database named tpcds3tb. Tables in this database point to Amazon S3 under a single bucket, but each table is mapped to a different prefix under the bucket. The following screenshot shows the different table locations.

You use the tpcds3tb database and create a Redshift Spectrum external schema named schemaA. You create groups grpA and grpB with different IAM users mapped to the groups. The goal is to grant different access privileges to grpA and grpB on external tables within schemaA.

This post presents two options for this solution:

  • Use the Amazon Redshift grant usage statement to grant grpA access to external tables in schemaA. The group can then access all tables in the data lake defined in that schema, regardless of which Amazon S3 locations the tables map to.
  • Configure role chaining to Amazon S3 external schemas that isolates group access to specific data lake locations and denies access to tables in the schema that point to different Amazon S3 locations.

Isolating user and group access using the grant usage privilege

You can use the Amazon Redshift grant usage privilege on schemaA, which allows grpA access to all objects under that schema. You don’t grant any usage privilege to grpB; users in that group should see access denied when querying.

  1. Create an IAM role named mySpectrum for the Amazon Redshift cluster to allow Redshift Spectrum to read Amazon S3 objects using the following policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:Get*",
                    "s3:List*"
                ],
                "Resource": "*"
            }
        ]
    }
  2. Add a trust relationship to allow users in Amazon Redshift to assume roles assigned to the cluster. See the following code:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "redshift.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
  3. Choose your cluster name.
  4. Choose Properties.
  5. Choose Manage IAM roles.
  6. For Available IAM roles, choose your new role. If you don’t see the role listed, select Enter ARN and enter the role’s ARN.
  7. Choose Done (see the following screenshot).
  8. Use the Amazon Redshift JDBC driver that includes the AWS SDK, which you can download from the Amazon Redshift console (see the following screenshot), and connect to the cluster using the IAM connection string from a SQL client such as SQL Workbench/J. The screenshots below show the JAR file type to select and the connection configuration in SQL Workbench/J.
  9. As an Amazon Redshift admin user, create external schemas with schemaA mapped to the AWS Glue database tpcds3tb (you use the IAM role you created earlier to allow Redshift Spectrum access to Amazon S3). See the following code:
    create external schema schemaA from data catalog
    database 'tpcds3tb'
    iam_role 'arn:aws:iam::0123456789:role/mySpectrum' region 'us-east-1';
  10. Verify the schema is in the Amazon Redshift catalog with the following code:
    select * from svv_external_schemas;
  11. Grant usage privilege to grpA. See the following code:
    Grant usage on schema schemaA to group grpA;
  12. Query tables in schemaA as user a1 in grpA using your SQL client. See the following code:
    Select * from schemaA.customer limit 3;
    Select * from schemaA.call_center limit 3;

    The following screenshot shows the successful query results.

  13. Query tables in schemaA as user b1 in grpB. The following screenshot shows the error message you receive.

This option gives great flexibility to isolate user access on Redshift Spectrum schemas, but what if user b1 is authorized to access one or more tables in that schema but not all of them? The second option provides more granular access control policies.

Isolating user and group access using IAM policies and role chaining

You can use IAM policies mapped to IAM roles with a trust relationship to specific users and groups based on Amazon S3 location access, and assign those roles to the cluster. For this use case, grpB is authorized to access only the table catalog_page, located at s3://myworkspace009/tpcds3t/catalog_page/, and grpA is authorized to access all tables under s3://myworkspace009/tpcds3t/ except catalog_page. The following steps help you configure for this security requirement.
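To make the intended access matrix concrete, the following sketch simulates how S3-prefix-based policies partition table access. It uses this post’s example bucket and prefixes; real enforcement happens in IAM, not in application code like this:

```python
# Illustrative only: IAM evaluates these rules, not your application code.
GRPA_ALLOWED_PREFIXES = [
    "myworkspace009/tpcds3t/customer/",
    "myworkspace009/tpcds3t/call_center/",
    # ... the remaining non-catalog_page table prefixes from the grpA policy
]
GRPB_ALLOWED_PREFIXES = ["myworkspace009/tpcds3t/catalog_page/"]

def can_read(key: str, allowed_prefixes: list) -> bool:
    """Return True if an S3 object key falls under an allowed prefix."""
    return any(key.startswith(p) for p in allowed_prefixes)

# grpA can read customer data but not catalog_page; the reverse holds for grpB.
print(can_read("myworkspace009/tpcds3t/customer/part-0001", GRPA_ALLOWED_PREFIXES))       # True
print(can_read("myworkspace009/tpcds3t/catalog_page/part-0001", GRPA_ALLOWED_PREFIXES))   # False
print(can_read("myworkspace009/tpcds3t/catalog_page/part-0001", GRPB_ALLOWED_PREFIXES))   # True
```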

The following diagram depicts how role chaining works.

You first create IAM roles with policies specific to grpA and grpB. The first role is a generic cluster role that allows users to assume this role using a trust relationship defined in the role.
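When you later create the external schemas, the chain is expressed as a comma-separated list of role ARNs in the iam_role parameter, with the cluster role first and the group-specific role after it. A small helper, shown here as a sketch with this post’s example role names, makes the format explicit:

```python
def build_iam_role_chain(*role_arns: str) -> str:
    """Join role ARNs into the comma-separated string Amazon Redshift
    expects in the iam_role parameter of CREATE EXTERNAL SCHEMA.

    Order matters: the first role must be attached to the cluster, and
    each subsequent role must trust the one before it in the chain.
    """
    return ",".join(role_arns)

chain = build_iam_role_chain(
    "arn:aws:iam::0123456789:role/myblog-redshift-assumeRole",
    "arn:aws:iam::0123456789:role/myblog-grpA-role",
)
print(chain)
```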

  1. On the IAM console, create a new role. See the following code:
    RoleName: myblog-redshift-assumeRole
  2. Add the following two policies to this role:
    • Add the managed policy named AWSGlueConsoleFullAccess. You might consider adding your own inline policy with least privileges instead of this managed policy for more restricted security needs.
    • Add an inline policy called myblog-redshift-assumerole-inline with the following rules:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "Stmt1487639602000",
                  "Effect": "Allow",
                  "Action": [
                      "sts:AssumeRole"
                  ],
                  "Resource": "arn:aws:iam::0123456789:role/*"
              }
          ]
      }
  3. Add a trust relationship that allows the users in the cluster to assume this role. You can choose to limit this to specific users as necessary. See the following code:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "redshift.amazonaws.com"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringLike": {
              "sts:ExternalId": "arn:aws:redshift:us-east-1:0123456789:dbuser:<YourRedshiftClusterName>/*"
            }
          }
        }
      ]
    }
  4. Create a new Redshift-customizable role specific to grpA with a policy allowing access only to the Amazon S3 locations this group is permitted to read. Make sure you omit the Amazon S3 location for the catalog_page table; you don’t want to authorize this group to view that data.
    • Name the role myblog-grpA-role.
  5. Add the following two policies to this role:
    • Add the managed policy named AWSGlueConsoleFullAccess to the role. (Note: You might consider adding your own inline policy with least privileges instead of this managed policy for more restricted security needs.)
    • Add an inline policy named myblog-grpA-access-policy with the following rules (modify it to fit your security needs and allow minimal permissions):
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "AllowS3",
                  "Effect": "Allow",
                  "Action": [
                      "s3:*"
                  ],
                  "Resource": [
                      "arn:aws:s3:::myworkspace009/tpcds3t/call_center/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/catalog_returns/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/catalog_sales/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/customer/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/customer_address/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/customer_demographics/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/date_dim/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/item/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/promotion/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/store/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/store_sales/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/web_page/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/web_sales/*"
                  ]
              }
          ]
      }
  6. Add a trust relationship explicitly listing all users in grpA so that only they can assume this role (choose the Trust relationships tab and edit it to add the following policy, updating the relevant account details). The trust relationship has to be updated for each user added to this role, or you can build a new role per user; it is fairly easy to script this update for each new user.
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::0123456789:role/myblog-redshift-assumeRole"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "arn:aws:redshift:us-east-1:0123456789:dbuser:<YourRedshiftClusterName>/a1"
            }
          }
        }
      ]
    }
  7. Create another Redshift-customizable role specific to grpB with a policy restricting access only to the Amazon S3 locations where this group is allowed access.
    • Name the role myblog-grpB-role.
  8. Add the following two policies to this role. Create managed policies reflecting the data access per DB group and attach them to the roles that are assumed on the cluster:
    • Add the managed policy named AWSGlueConsoleFullAccess to the role. You might consider adding your own inline policy with least privileges instead of this managed policy for more restricted security needs.
    • Add an inline policy named myblog-grpB-access-policy with the following rules (modify it to fit your security needs and allow minimal permissions):
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "AllowS3",
                  "Effect": "Allow",
                  "Action": [
                      "s3:*"
                  ],
                  "Resource": [
                      "arn:aws:s3:::myworkspace009/tpcds3t/catalog_page/*"
                  ]
              }
          ]
      }
  9. Add a trust relationship explicitly listing all users in grpB so that only they can assume this role (choose the Trust relationships tab and edit it to add the following policy, updating the relevant account details). As before, this trust relationship has to be updated for each user added to the role, or you can build a role per user; it is fairly easy to script this update.
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::0123456789:role/myblog-redshift-assumeRole"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "arn:aws:redshift:us-east-1:0123456789:dbuser:<YourRedshiftClusterName>/b1"
            }
          }
        }
      ]
    }
  10. Attach the three roles to the Amazon Redshift cluster and remove any other roles mapped to the cluster. If you don’t find any roles in the drop-down menu, use the role ARN.
  11. As an admin user, create a new external schema for grpA and grpB, respectively, using role chaining with the roles you created.
    • For grpA, enter the following code:
      create external schema rc_schemaA from data catalog
      database 'tpcds3tb'
      iam_role 'arn:aws:iam::0123456789:role/myblog-redshift-assumeRole,arn:aws:iam::0123456789:role/myblog-grpA-role' region 'us-east-1';
      
      grant usage on schema rc_schemaA to group grpA;
    • For grpB, enter the following code:
      create external schema rc_schemaB from data catalog
      database 'tpcds3tb'
      iam_role 'arn:aws:iam::0123456789:role/myblog-redshift-assumeRole,arn:aws:iam::0123456789:role/myblog-grpB-role' region 'us-east-1';
      
      grant usage on schema rc_schemaB to group grpB;
  12. Query the external schemas as users in grpA and grpB.
    • To query the customer table and catalog_page table as user a1 in grpA, enter the following code:
      select * from rc_schemaA.customer limit 3;
      select * from rc_schemaA.catalog_page limit 3;

The following screenshot shows the query results; user a1 can access the customer table successfully.

The following screenshot shows that user a1 can’t access catalog_page.

    • Query the customer table and catalog_page table as user b1 in grpB.

The following screenshot shows that user b1 can access catalog_page.

The following screenshot shows that user b1 can’t access the customer table.
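As noted in the role-configuration steps above, each group role’s trust relationship must be updated whenever a user is added. A minimal sketch of automating that edit follows; the role and cluster names are this post’s example values, and in practice you would fetch the document with boto3’s iam.get_role and push it back with iam.update_assume_role_policy:

```python
import json

def add_user_external_id(trust_policy: dict, cluster_user_arn: str) -> dict:
    """Add a dbuser ARN to the sts:ExternalId condition of a trust policy.

    Converts a single-string ExternalId to a list when a second user is
    added; IAM accepts a list of values for StringEquals conditions.
    """
    stmt = trust_policy["Statement"][0]
    cond = stmt.setdefault("Condition", {}).setdefault("StringEquals", {})
    ids = cond.get("sts:ExternalId", [])
    if isinstance(ids, str):
        ids = [ids]
    if cluster_user_arn not in ids:
        ids.append(cluster_user_arn)
    cond["sts:ExternalId"] = ids
    return trust_policy

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::0123456789:role/myblog-redshift-assumeRole"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {
            "sts:ExternalId": "arn:aws:redshift:us-east-1:0123456789:dbuser:mycluster/a1"
        }},
    }],
}
# Add a hypothetical second user a2 to the grpA role's trust policy.
updated = add_user_external_id(
    policy, "arn:aws:redshift:us-east-1:0123456789:dbuser:mycluster/a2"
)
print(json.dumps(updated["Statement"][0]["Condition"]["StringEquals"]["sts:ExternalId"]))
```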

Conclusion

This post demonstrated two different ways to isolate user and group access to external schemas and tables. With the first option, using grant usage statements, the granted group has access to all tables in the schema, regardless of which Amazon S3 data lake paths the tables point to. This approach makes it easy to grant access, but it can’t allow or deny access to specific tables in that schema.

With the second option, you manage user and group access at the grain of Amazon S3 objects, which gives more control of data security and lowers the risk of unauthorized data access. This approach has some additional configuration overhead compared to the first approach, but can yield better data security.

In both approaches, building the right governance model upfront for Amazon S3 paths, external schemas, and table mapping, based on how groups of users access them, is paramount to providing the best security with low operational overhead.

Special acknowledgment goes to AWS colleague Martin Grund for his valuable comments and suggestions.

About the Authors

Harsha Tadiparthi is a Specialist Sr. Solutions Architect, AWS Analytics. He enjoys solving complex customer problems in Databases and Analytics and delivering successful outcomes. Outside of work, he loves to spend time with his family, watch movies, and travel whenever possible.

Harshida Patel is a Data Warehouse Specialist Solutions Architect with AWS.