AWS Big Data Blog
Use Amazon Redshift Spectrum with row-level and cell-level security policies defined in AWS Lake Formation
Data warehouses and data lakes are key to an enterprise data management strategy. A data lake is a centralized repository that consolidates your data in any format at any scale and makes it available for different kinds of analytics. A data warehouse, on the other hand, has cleansed, enriched, and transformed data that is optimized for faster queries. Amazon Redshift is a fast, petabyte-scale cloud data warehouse that powers a lake house architecture, which enables you to query the data in a data warehouse and an Amazon Simple Storage Service (Amazon S3) data lake using familiar SQL statements and gain deeper insights.
Data lakes often contain data for multiple business units, users, locations, vendors, and tenants. Enterprises want to share their data while balancing compliance and security needs. To satisfy compliance requirements and to achieve data isolation, enterprises often need to control access at the row level and cell level. For example:
- If you have a multi-tenant data lake, you may want each tenant to be able to view only those rows that are associated to their tenant ID
- You may have data for multiple portfolios in the data lake and you need to control access for various portfolio managers
- You may have sensitive information or personally identifiable information (PII) that can be viewed by users with elevated privileges only
AWS Lake Formation makes it easy to set up a secure data lake and access controls for these kinds of use cases. You can use Lake Formation to centrally define security, governance, and auditing policies, thereby achieving unified governance for your data lake. Lake Formation supports row-level security and cell-level security:
- Row-level security allows you to specify filter expressions that limit access to specific rows of a table to a user
- Cell-level security builds on row-level security by allowing you to apply filter expressions on each row to hide or show specific columns
Amazon Redshift is the fastest and most widely used cloud data warehouse. Amazon Redshift Spectrum is a feature of Amazon Redshift that enables you to query data from and write data back to Amazon S3 in open formats. You can query open file formats such as Parquet, ORC, JSON, Avro, CSV, and more directly in Amazon S3 using familiar ANSI SQL. This gives you the flexibility to store highly structured, frequently accessed data in an Amazon Redshift data warehouse, while also keeping up to exabytes of structured, semi-structured, and unstructured data in Amazon S3. Redshift Spectrum integrates with Lake Formation natively. This integration enables you to define data filters in Lake Formation that specify row-level and cell-level access control for users on your data and then query it using Redshift Spectrum.
In this post, we present a sample multi-tenant scenario and describe how to define row-level and cell-level security policies in Lake Formation. We also show how these policies are applied when querying the data using Redshift Spectrum.
Solution overview
In our use case, Example Corp has built an enterprise data lake on Amazon S3. They store data for multiple tenants in the data lake and query it using Redshift Spectrum. Example Corp maintains separate AWS Identity and Access Management (IAM) roles for each of their tenants and wants to control access to the multi-tenant dataset based on their IAM role.
Example Corp needs to ensure that the tenants can view only those rows that are associated to them. For example, Tenant1 should see only those rows where tenantid = 'Tenant1', and Tenant2 should see only those rows where tenantid = 'Tenant2'. Also, tenants can only view sensitive columns such as phone, email, and date of birth associated to specific countries.
The following is a screenshot of the multi-tenant dataset we use to demonstrate our solution. It has data for two tenants: Tenant1 and Tenant2. tenantid is the column that distinguishes the data associated with each tenant.
To solve this use case, we implement row-level and cell-level security in Lake Formation by defining data filters. When Example Corp’s tenants query the data using Redshift Spectrum, the service checks filters defined in Lake Formation and returns only the data that the tenant has access to.
Lake Formation metadata tables contain information about data in the data lake, including schema information, partition information, and data location. You can use them to access underlying data in the data lake and manage that data with Lake Formation permissions. You can apply row-level and cell-level security to Lake Formation tables. In this post, we provide a walkthrough using a standard Lake Formation table.
The following diagram illustrates our solution architecture.
The solution workflow consists of the following steps:
- Create IAM roles for the tenants.
- Register an Amazon S3 location in Lake Formation.
- Create a database and use AWS Glue crawlers to create a table in Lake Formation.
- Create data filters in Lake Formation.
- Grant access to the IAM roles in Lake Formation.
- Attach the IAM roles to the Amazon Redshift cluster.
- Create an external schema in Amazon Redshift.
- Create Amazon Redshift users for each tenant and grant access to the external schema.
- Users Tenant1 and Tenant2 assume their respective IAM roles and query data against their external schemas in Amazon Redshift using the SQL query editor or any SQL client.
Prerequisites
This walkthrough assumes that you have the following prerequisites:
- An Amazon Redshift cluster
- An S3 bucket
- A Lake Formation administrator role
Create IAM roles for the tenants
Create IAM roles Tenant1ReadRole and Tenant2ReadRole for users with elevated privileges for the two tenants, with Amazon Redshift as the trusted entity, and attach the following policy to both roles:
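The original policy document is not reproduced here; the following is a minimal sketch of the permissions such a role needs so that Redshift Spectrum can request Lake Formation-governed data and read the catalog. Treat the action list as an assumption and scope it to your resources as appropriate.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "LakeFormationAndGlueReadAccess",
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions"
            ],
            "Resource": "*"
        }
    ]
}
```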
Register an Amazon S3 location in Lake Formation
We use the sample multi-tenant dataset SpectrumRowLevelFiltering.csv. Complete the following steps to register the location of this dataset in Lake Formation:
- Download the dataset and upload it to the Amazon S3 path s3://<your_bucket>/order_details/SpectrumRowLevelFiltering.csv.
- On the Lake Formation console, choose Data lake locations in the navigation pane.
- Choose Register location.
- For Amazon S3 path, enter the S3 path of your dataset.
- For IAM role, choose either the AWSServiceRoleForLakeFormationDataAccess service-linked role (the default) or the Lake Formation administrator role mentioned in the prerequisites.
- Choose Register location.
Create a database and a table in Lake Formation
To create your database and table, complete the following steps:
- Sign in to the AWS Management Console as the data lake administrator.
- On the Lake Formation console, choose Databases in the navigation pane.
- Choose Create database.
- For Name, enter rs_spectrum_rls_blog.
- If Use only IAM access control for new tables in this database is selected, uncheck it.
- Choose Create database.
Next, you create a new data lake table.
- On the AWS Glue console, choose Crawlers in navigation pane.
- Choose Add crawler.
- For Crawler name, enter order_details.
- For Specify crawler source type, keep the default selections.
- For Add data store, choose Include path, and choose the S3 path to the dataset (s3://<your_bucket>/order_details/).
- For Choose IAM role, choose Create an IAM role, with the suffix rs_spectrum_rls_blog.
- For Frequency, choose Run on demand.
- For Database, choose the database you just created (rs_spectrum_rls_blog).
- Choose Finish to create the crawler.
- Grant CREATE TABLE permissions and DESCRIBE/ALTER/DELETE database permissions to the IAM role that was created for the crawler.
- To run the crawler, in the navigation pane, choose Crawlers.
- Select the crawler order_details and choose Run crawler.
When the crawler is complete, you can find the table order_details created under the database rs_spectrum_rls_blog in the AWS Glue Data Catalog.
- On the AWS Glue console, in the navigation pane, choose Databases.
- Select the database rs_spectrum_rls_blog and choose View tables.
- Choose the table order_details.
The following screenshot is the schema of the order_details table.
Create data filters in Lake Formation
To implement row-level and cell-level security, first you create data filters. Then you choose that data filter while granting SELECT permission on the tables. For this use case, you create two data filters: one for Tenant1 and one for Tenant2.
- On the Lake Formation console, choose Data catalog in the navigation pane, then choose Data filters.
- Choose Create new filter.
Let’s create the first data filter, filter-tenant1-order-details, restricting the rows Tenant1 is able to see in the table order_details.
- For Data filter name, enter filter-tenant1-order-details.
- For Target database, choose rs_spectrum_rls_blog.
- For Target table, choose order_details.
- For Column-level access, select Include columns and then choose the following columns: c_emailaddress, c_phone, c_dob, c_firstname, c_address, c_country, c_lastname, and tenantid.
- For Row filter expression, enter tenantid = 'Tenant1' and c_country in ('USA','Spain').
- Choose Create filter.
- Repeat these steps to create another data filter, filter-tenant2-order-details, with row filter expression tenantid = 'Tenant2' and c_country in ('USA','Canada').
Grant access to IAM roles in Lake Formation
After you create the data filters, you need to attach them to the table to grant access to a principal. First let’s grant access to order_details to the IAM role Tenant1ReadRole using the data filter we created for Tenant1.
- On the Lake Formation console, in the navigation pane, under Permissions, choose Data Permissions.
- Choose Grant.
- In the Principals section, select IAM users and roles.
- For IAM users and roles, choose the role Tenant1ReadRole.
- In the LF-Tags or catalog resources section, choose Named data catalog resources.
- For Databases, choose rs_spectrum_rls_blog.
- For Tables, choose order_details.
- For Data filters, choose filter-tenant1-order-details.
- For Data filter permissions, choose Select.
- Choose Grant.
- Repeat these steps with the IAM role Tenant2ReadRole and data filter filter-tenant2-order-details.
Attach the IAM roles to the Amazon Redshift cluster
To attach your roles to the cluster, complete the following steps:
- On the Amazon Redshift console, in the navigation menu, choose CLUSTERS, then select the name of the cluster that you want to update.
- On the Actions menu, choose Manage IAM roles. The IAM roles page appears.
- Either choose Enter ARN and enter the ARN of the Tenant1ReadRole IAM role, or choose the Tenant1ReadRole IAM role from the list.
- Choose Add IAM role.
- Choose Done to associate the IAM role with the cluster. The cluster is modified to complete the change.
- Repeat these steps to add the Tenant2ReadRole IAM role to the Amazon Redshift cluster.
Amazon Redshift allows you to attach up to 50 IAM roles to a cluster to access other AWS services.
Create an external schema in Amazon Redshift
Create an external schema on the Amazon Redshift cluster, one for each IAM role, using the following code:
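The original code block is not included above; the DDL below is a sketch of the two external schemas, one per tenant role. The account ID and Region are placeholders you must replace with your own values.

```sql
-- External schema for Tenant1, mapped to the Tenant1ReadRole IAM role
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_tenant1
FROM DATA CATALOG
DATABASE 'rs_spectrum_rls_blog'
IAM_ROLE 'arn:aws:iam::<your_aws_account_id>:role/Tenant1ReadRole'
REGION '<your_region>';

-- External schema for Tenant2, mapped to the Tenant2ReadRole IAM role
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_tenant2
FROM DATA CATALOG
DATABASE 'rs_spectrum_rls_blog'
IAM_ROLE 'arn:aws:iam::<your_aws_account_id>:role/Tenant2ReadRole'
REGION '<your_region>';
```

Because each schema is bound to a tenant-specific IAM role, queries through spectrum_tenant1 are evaluated against the Lake Formation permissions granted to Tenant1ReadRole, and likewise for Tenant2.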
Create Amazon Redshift users for each tenant and grant access to the external schema
Complete the following steps:
- Create Amazon Redshift users to restrict access to the external schemas (connect to the cluster with a user that has permission to create users, or a superuser) using the following code:
- Create the read-only role (tenant1_ro) to provide read-only access to the spectrum_tenant1 schema:
- Grant usage on the spectrum_tenant1 schema to the read-only tenant1_ro role:
- Now assign the user to the read-only tenant1_ro role:
- Repeat the same steps to grant permissions to the user tenant2_user:
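The original SQL for the steps above is not reproduced; the following is a sketch using Amazon Redshift role-based access control (CREATE ROLE / GRANT ROLE). The passwords are placeholders, and the role and user names match those used in the steps.

```sql
-- Step 1: create one Amazon Redshift user per tenant (run as a superuser)
CREATE USER tenant1_user PASSWORD '<YourStrongPassword1>';
CREATE USER tenant2_user PASSWORD '<YourStrongPassword2>';

-- Step 2: create a read-only role for Tenant1
CREATE ROLE tenant1_ro;

-- Step 3: grant usage on the spectrum_tenant1 external schema to the role
GRANT USAGE ON SCHEMA spectrum_tenant1 TO ROLE tenant1_ro;

-- Step 4: assign the Tenant1 user to the read-only role
GRANT ROLE tenant1_ro TO tenant1_user;

-- Repeat the same steps for Tenant2
CREATE ROLE tenant2_ro;
GRANT USAGE ON SCHEMA spectrum_tenant2 TO ROLE tenant2_ro;
GRANT ROLE tenant2_ro TO tenant2_user;
```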
Tenant1 and Tenant2 users run queries using the SQL editor or a SQL client
To test the permission levels for different users, connect to the database using the query editor with that user.
In the Query Editor on the Amazon Redshift console, connect to the cluster with tenant1_user and run the following query:
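The original query is not shown; any SELECT against the external table behaves the same way, because the Lake Formation data filter is applied on the service side. For example:

```sql
-- Tenant1 sees only rows where tenantid = 'Tenant1',
-- and only the columns allowed by filter-tenant1-order-details
SELECT * FROM spectrum_tenant1.order_details;

-- Returns no rows: the data filter hides Tenant2's records from Tenant1
SELECT * FROM spectrum_tenant1.order_details
WHERE tenantid = 'Tenant2';
```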
In the following screenshot, tenant1_user is only able to see records where the tenantid value is Tenant1, and only the customer PII fields specific to the US and Spain.
To validate the Lake Formation data filters, the following screenshot shows that Tenant1 can’t see any records for Tenant2.
Reconnect to the cluster using tenant2_user and run the following query:
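As with Tenant1, the original query is not shown; an equivalent example against the Tenant2 schema is:

```sql
-- Tenant2 sees only rows where tenantid = 'Tenant2',
-- and only the columns allowed by filter-tenant2-order-details
SELECT * FROM spectrum_tenant2.order_details;
```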
In the following screenshot, tenant2_user is only able to see records where the tenantid value is Tenant2, and only the customer PII fields specific to the US and Canada.
To validate the Lake Formation data filters, the following screenshot shows that Tenant2 can’t see any records for Tenant1.
Conclusion
In this post, you learned how to implement row-level and cell-level security on an Amazon S3-based data lake using data filters and access control features in Lake Formation. You also learned how to use Redshift Spectrum to access the data from Amazon S3 while adhering to the row-level and cell-level security policies defined in Lake Formation.
You can further enhance your understanding of Lake Formation row-level and cell-level security by referring to Effective data lakes using AWS Lake Formation, Part 4: Implementing cell-level and row-level security.
To learn more about Redshift Spectrum, refer to Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required.
For more information about configuring row-level access control natively in Amazon Redshift, refer to Achieve fine-grained data security with row-level access control in Amazon Redshift.
About the authors
Anusha Challa is a Senior Analytics Specialist Solutions Architect at AWS. Her expertise is in building large-scale data warehouses, both on premises and in the cloud. She provides architectural guidance to our customers on end-to-end data warehousing implementations and migrations.
Ranjan Burman is an Analytics Specialist Solutions Architect at AWS.