Introducing hybrid access mode for AWS Glue Data Catalog to secure access using AWS Lake Formation and IAM and Amazon S3 policies
AWS Lake Formation helps you centrally govern, secure, and globally share data for analytics and machine learning. With Lake Formation, you can manage access control for your data lake data in Amazon Simple Storage Service (Amazon S3) and its metadata in AWS Glue Data Catalog in one place with familiar database-style features. You can use fine-grained data access control to verify that the right users have access to the right data down to the cell level of tables. Lake Formation also makes it simpler to share data internally across your organization and externally. Further, Lake Formation integrates with AWS analytics services such as Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, and AWS Glue ETL for Apache Spark. These services allow querying Lake Formation managed tables, thus helping you extract business insights from the data quickly and securely.
Before the introduction of Lake Formation and its database-style permissions for data lakes, you had to manage access to your data in the data lake and its metadata separately through AWS Identity and Access Management (IAM) policies and S3 bucket policies. This IAM and Amazon S3 access control mechanism is more complex and less granular than Lake Formation permissions. Migrating from it to Lake Formation also took time, because a given database or table in the data lake could have its access controlled by either IAM and S3 policies or Lake Formation policies, but not both. In addition, a data lake typically serves many use cases, and migrating all of them from one permissions model to the other in a single step without disruption was challenging for operations teams.
To ease the transition of data lake permissions from an IAM and S3 model to Lake Formation, we’re introducing a hybrid access mode for AWS Glue Data Catalog. For details, see the What’s New post and the documentation. This feature lets you secure and access the cataloged data using both Lake Formation permissions and IAM and S3 permissions. Hybrid access mode allows data administrators to onboard Lake Formation permissions selectively and incrementally, focusing on one data lake use case at a time. For example, say you have an existing extract, transform, and load (ETL) data pipeline that uses IAM and S3 policies to manage data access. Now you want to allow your data analysts to explore or query the same data using Amazon Athena. You can grant access to the data analysts using Lake Formation permissions, including fine-grained controls as needed, without changing access for your ETL data pipelines.
Hybrid access mode allows both permission models to exist for the same database and tables, providing greater flexibility in how you manage user access. While this feature opens two doors for a Data Catalog resource, an IAM user or role can access the resource using only one of the two permissions. After Lake Formation permission is enabled for an IAM principal, authorization is completely managed by Lake Formation and existing IAM and S3 policies are ignored. AWS CloudTrail logs provide the complete details of the Data Catalog resource access in Lake Formation logs and S3 access logs.
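One way to picture the rule above is as a lookup against the set of opt-ins. The following sketch is a simplified mental model only, not how Lake Formation is implemented; the function name, the type alias, and the account ID are hypothetical.

```python
from typing import Set, Tuple

# (principal ARN, resource name) pairs that an administrator has opted in.
OptIns = Set[Tuple[str, str]]


def permission_model(principal: str, resource: str, opt_ins: OptIns) -> str:
    """Return which permission model authorizes this principal on this resource."""
    if (principal, resource) in opt_ins:
        # Opted-in principals are authorized by Lake Formation only.
        return "LAKE_FORMATION"
    # Everyone else keeps using their existing IAM and S3 policies.
    return "IAM_AND_S3"


# Hypothetical account ID; only Data-Analyst is opted in for hybridsalesdb.
opt_ins: OptIns = {("arn:aws:iam::111122223333:role/Data-Analyst", "hybridsalesdb")}
```

In this model, the same database resolves to a different permission model depending on which role asks, which is the behavior the two scenarios in this post rely on.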
In this blog post, we walk you through the instructions to onboard Lake Formation permissions in hybrid access mode for selected users while the database is already accessible to other users through IAM and S3 permissions. We review the instructions to set up hybrid access mode within an AWS account and between two accounts.
Scenario 1 – Hybrid access mode within an AWS account
In this scenario, we walk you through the steps to start adding users with Lake Formation permissions for a database in Data Catalog that’s accessed using IAM and S3 policy permissions. For our illustration, we use two personas:
- Data-Engineer, who has coarse-grained permissions through an IAM policy and an S3 bucket policy to run an AWS Glue ETL job, and
- Data-Analyst, whom we will onboard with fine-grained Lake Formation permissions to query the database using Amazon Athena.
Scenario 1 is depicted in the following diagram, where the Data-Engineer role accesses the database hybridsalesdb using IAM and S3 permissions, while the Data-Analyst role accesses the database using Lake Formation permissions.
To set up Lake Formation and IAM and S3 permissions for a Data Catalog database with Hybrid access mode, you must have the following prerequisites:
- An AWS account that isn’t used for production applications.
- Lake Formation already set up in the account and a Lake Formation administrator role or a similar role to follow along with the instructions in this post. For example, we’re using a data lake administrator role called LF-Admin. To learn more about setting up permissions for a data lake administrator role, see Create a data lake administrator.
- A sample database in the Data Catalog with a few tables. For example, our sample database is called hybridsalesdb and has a set of eight tables, as shown in the following screenshot. You can use any of your datasets to follow along.
Personas and their IAM policy setup
There are two personas that are IAM roles in the account: Data-Engineer and Data-Analyst. Their IAM policies and access are described as follows.
The following IAM policy on the Data-Engineer role allows access to the database and table metadata in the Data Catalog.
The following IAM policy on the Data-Engineer role grants data access to the underlying Amazon S3 location of the database and tables.
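As an illustration only, the following sketch builds policy documents of the kind described above. It is not the exact policy used for this post: the account ID, Region, and bucket name are hypothetical placeholders, and the list of actions is an assumption that you should narrow to what your ETL job actually needs.

```python
import json

# Hypothetical placeholders; replace with your own values.
ACCOUNT_ID = "111122223333"
REGION = "us-east-1"
BUCKET = "example-hybrid-sales-bucket"
DATABASE = "hybridsalesdb"


def catalog_metadata_policy() -> dict:
    """Read access to the database and table metadata in the Data Catalog."""
    glue_arn = f"arn:aws:glue:{REGION}:{ACCOUNT_ID}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:GetPartitions",
                ],
                "Resource": [
                    f"{glue_arn}:catalog",
                    f"{glue_arn}:database/{DATABASE}",
                    f"{glue_arn}:table/{DATABASE}/*",
                ],
            }
        ],
    }


def s3_data_policy() -> dict:
    """Object and listing access to the S3 location behind the tables."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{BUCKET}/*"],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{BUCKET}"],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(catalog_metadata_policy(), indent=2))
    print(json.dumps(s3_data_policy(), indent=2))
```

Both documents are coarse grained: they cover the whole database and the whole bucket, which is the contrast with the table-level Lake Formation grants set up later for Data-Analyst.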
Data-Engineer also has access to the AWS Glue console through the AWS managed policy arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess, and has a restrictive iam:PassRole permission to run an AWS Glue ETL script, as shown below.
The following policy is also added to the trust policy of the Data-Engineer role to allow AWS Glue to assume the role to run the ETL script on behalf of the role.
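For orientation, a trust policy and a scoped-down pass-role statement of this kind could look like the following sketch. This is an assumption about the general shape, not the exact documents used for this post; the role ARN in the usage note is hypothetical.

```python
def glue_trust_policy() -> dict:
    """Trust policy statement that lets the AWS Glue service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def pass_role_policy(role_arn: str) -> dict:
    """Allow passing only the given role, and only to AWS Glue."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": role_arn,
                "Condition": {
                    "StringEquals": {"iam:PassedToService": "glue.amazonaws.com"}
                },
            }
        ],
    }
```

For example, `pass_role_policy("arn:aws:iam::111122223333:role/Data-Engineer")` limits the permission to a single role, so the persona cannot hand a more privileged role to a job.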
See AWS Glue studio set up for additional permissions required to run an AWS Glue ETL script.
Data-Analyst role has the data lake basic user permissions as described in Assign permissions to Lake Formation users.
Data-Analyst has permissions to write Athena query results to an S3 bucket that isn’t managed by Lake Formation and Athena console full access using the AWS managed policy
Set up Lake Formation permissions for Data-Analyst
Complete the following steps to configure your data location in Amazon S3 with Lake Formation in hybrid access mode and grant access to the Data-Analyst role.
- Sign in to the AWS Management Console as a Lake Formation administrator role.
- Go to Lake Formation.
- Select Data lake locations from the left navigation bar under Administration.
- Select Register location and provide the Amazon S3 location of your database and tables. Provide an IAM role that has access to the data in the S3 location. For more details see Requirements for roles used to register locations.
- Select the Hybrid access mode under Permission mode and choose Register location.
- Select Data lake locations under Administration from the left navigation bar. Review that the registered location shows as Hybrid access mode for Permission mode.
- Select Databases from Catalog on the left navigation bar. Choose hybridsalesdb, or your own database that has the data in the S3 location that you registered in the preceding step. From the Actions drop down menu, select Grant.
- On the Grant window, select Data-Analyst for IAM users and roles. Under LF-Tags or catalog resources, select Named Data Catalog resources and select hybridsalesdb for Databases.
- Under Database permissions, select Describe. Under Hybrid access mode, select the checkbox Make Lake Formation permissions effective immediately. Choose Grant.
- Again, select Databases from Catalog on the left navigation bar. Choose hybridsalesdb. Select Grant from the Actions drop down menu.
- On the Grant window, select Data-Analyst for IAM users and roles. Under LF-Tags or catalog resources, choose Named Data Catalog resources and select hybridsalesdb for Databases.
- Under Tables, select three tables, including hybridsales_order, from the drop down.
- Under Table permissions, select Select and Describe permissions for the tables.
- Select the checkbox under Hybrid access mode to make the Lake Formation permissions effective immediately.
- Choose Grant.
- Review the granted permissions by selecting Data lake permissions under Permissions on the left navigation bar. Filter Data permissions by Principal = Data-Analyst.
- On the left navigation bar, select Hybrid access mode. Verify that the opted-in Data-Analyst shows up for the hybridsalesdb database and the three tables.
- Sign out from the console as the Lake Formation administrator role.
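The console steps above have rough SDK equivalents. The following is a minimal sketch that assumes a recent boto3 version whose Lake Formation client provides `register_resource` (with `HybridAccessEnabled`), `grant_permissions`, and `create_lake_formation_opt_in`. We assume here that, through the API, granting and opting in are separate calls. All ARNs in the usage note are hypothetical, and the helper names are ours.

```python
from typing import Dict, List, Tuple

Call = Tuple[str, Dict]  # (Lake Formation client method name, keyword arguments)


def onboarding_calls(bucket_arn: str, register_role_arn: str, principal_arn: str,
                     database: str, tables: List[str]) -> List[Call]:
    """List the API calls that mirror the console steps, in order."""
    principal = {"DataLakePrincipalIdentifier": principal_arn}
    db_resource = {"Database": {"Name": database}}
    calls: List[Call] = [
        # Register the S3 location in hybrid access mode.
        ("register_resource", {"ResourceArn": bucket_arn,
                               "RoleArn": register_role_arn,
                               "HybridAccessEnabled": True}),
        # Database-level Describe, then opt the principal in for the database.
        ("grant_permissions", {"Principal": principal, "Resource": db_resource,
                               "Permissions": ["DESCRIBE"]}),
        ("create_lake_formation_opt_in", {"Principal": principal,
                                          "Resource": db_resource}),
    ]
    for table in tables:
        tbl_resource = {"Table": {"DatabaseName": database, "Name": table}}
        # Table-level Select and Describe, then opt in for that table.
        calls.append(("grant_permissions", {"Principal": principal,
                                            "Resource": tbl_resource,
                                            "Permissions": ["SELECT", "DESCRIBE"]}))
        calls.append(("create_lake_formation_opt_in", {"Principal": principal,
                                                       "Resource": tbl_resource}))
    return calls


def apply(calls: List[Call]) -> None:
    """Run the calls; needs boto3 and Lake Formation administrator credentials."""
    import boto3  # imported lazily so the planning code runs without AWS access

    client = boto3.client("lakeformation")
    for method, kwargs in calls:
        getattr(client, method)(**kwargs)
```

For example, `onboarding_calls("arn:aws:s3:::example-hybrid-sales-bucket", "arn:aws:iam::111122223333:role/LF-Register-Role", "arn:aws:iam::111122223333:role/Data-Analyst", "hybridsalesdb", ["hybridsales_order"])` yields five calls that you can review before passing them to `apply`. Separating the plan from its execution is a design choice of this sketch, so that the grants can be inspected first.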
Validating Lake Formation permissions for Data-Analyst
- Sign in to the console as Data-Analyst.
- Go to the Athena console. If you’re using Athena for the first time, set up the query results location to your S3 bucket as described in Specifying a query result location.
- Run preview queries on the table from the Athena query editor.
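If you prefer to script the validation, the preview can be approximated with the Athena `start_query_execution` API. The sketch below only builds the request; the output location in the usage note is a hypothetical bucket, and the helper names are ours.

```python
from typing import Dict


def preview_query(database: str, table: str, limit: int = 10) -> str:
    """Build a preview query similar to the one the Athena console generates."""
    return f'SELECT * FROM "{database}"."{table}" LIMIT {limit}'


def start_query_request(database: str, table: str, output_location: str) -> Dict:
    """Keyword arguments for the Athena start_query_execution call."""
    return {
        "QueryString": preview_query(database, table),
        "QueryExecutionContext": {"Catalog": "AwsDataCatalog", "Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Usage, run with Data-Analyst credentials (requires boto3):
#   import boto3
#   request = start_query_request("hybridsalesdb", "hybridsales_order",
#                                 "s3://example-athena-results/")
#   boto3.client("athena").start_query_execution(**request)
```

A query on one of the granted tables should succeed, while a query on a table that wasn't granted to Data-Analyst should be denied, because this role is now authorized through Lake Formation only.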
Validating IAM and S3 permissions for Data-Engineer
- Sign out as Data-Analyst and sign back in to the console as Data-Engineer.
- Open the AWS Glue console and select ETL jobs from the left navigation bar.
- Under Create job, select Spark script editor. Choose Create.
- Download and open the sample script provided here.
- Copy and paste the script into your studio script editor as a new job.
- Edit the catalog_id, database, and table_name to suit your sample.
- Save and Run your AWS Glue ETL script by providing the IAM role of Data-Engineer to run the job.
- After the ETL script succeeds, you can select the output logs link from the Runs tab of the ETL script.
- Review the table’s schema, top 20 rows, and the total number of rows and columns from the Amazon CloudWatch logs.
Thus, you can add Lake Formation permissions to a new role to access a Data Catalog database without interfering with another role that is accessing the same database through IAM and S3 permissions.
Scenario 2 – Hybrid access mode set up between two AWS accounts
This is a cross-account sharing scenario where a data producer shares a database and its tables to a consumer account. The producer provides full database access for an AWS Glue ETL workload on the consumer account. At the same time, the producer shares a few tables of the same database to the consumer account using Lake Formation. We walk you through how you can use hybrid access mode to support both access methods.
- Cross-account sharing of a database or table location that’s registered in hybrid access mode requires the producer or the grantor account to be in version 4 of cross-account sharing in the catalog setting to grant permissions on the hybrid access mode resource. When moving from version 3 to version 4 of cross-account sharing, existing Lake Formation permissions aren’t affected for database and table locations that are already registered with Lake Formation (Lake Formation mode). For new data set location registration in hybrid access mode and new Lake Formation permissions on this catalog resource, you will need version 4 of cross-account sharing.
- The consumer or recipient account can use other versions of cross-account sharing. If your accounts are using version 1 or version 2 of cross-account sharing and if you want to upgrade, follow Updating cross-account data sharing version settings to first upgrade the catalog setting of cross-account sharing to version 3, before upgrading to version 4.
The producer account set up is similar to that of scenario 1 and we discuss the extra steps for scenario 2 in the following section.
Set up in producer account A
Data-Engineer role is granted Amazon S3 data access using the producer’s S3 bucket policy and Data Catalog access using the producer’s Data Catalog resource policy.
The S3 bucket policy in the producer account follows:
The Data Catalog resource policy in the producer account is shown below. You also need the glue:ShareResource IAM permission for AWS Resource Access Manager (AWS RAM) to enable cross-account sharing.
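To give a sense of their shape, the following sketch builds a bucket policy and a Data Catalog resource policy of the kind described above. It is an illustration under assumptions, not the exact policies used for this post: the helper names are ours, the action lists are a guess at a reasonable minimum, and the Region, account IDs, bucket, and role ARN you pass in are your own.

```python
from typing import Dict

GLUE_READ_ACTIONS = ["glue:GetDatabase", "glue:GetTable",
                     "glue:GetTables", "glue:GetPartitions"]


def producer_bucket_policy(bucket: str, consumer_role_arn: str) -> Dict:
    """S3 bucket policy that lets the consumer's Data-Engineer role read data."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": consumer_role_arn},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        }],
    }


def producer_catalog_policy(region: str, producer_account_id: str,
                            consumer_role_arn: str, database: str) -> Dict:
    """Data Catalog resource policy: metadata access plus the AWS RAM statement."""
    glue_arn = f"arn:aws:glue:{region}:{producer_account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Direct, IAM-style access for the consumer's ETL role.
                "Effect": "Allow",
                "Principal": {"AWS": consumer_role_arn},
                "Action": GLUE_READ_ACTIONS,
                "Resource": [f"{glue_arn}:catalog",
                             f"{glue_arn}:database/{database}",
                             f"{glue_arn}:table/{database}/*"],
            },
            {   # Lets AWS RAM share catalog resources for Lake Formation grants.
                "Effect": "Allow",
                "Principal": {"Service": "ram.amazonaws.com"},
                "Action": "glue:ShareResource",
                "Resource": [f"{glue_arn}:table/*/*",
                             f"{glue_arn}:database/*",
                             f"{glue_arn}:catalog"],
            },
        ],
    }
```

If you apply the catalog policy with the AWS Glue `put_resource_policy` API while Lake Formation cross-account grants also exist, you will likely need to set its `EnableHybrid` parameter to `'TRUE'`; check the AWS Glue API reference for your case.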
Setting the cross-account version and registering the S3 bucket
- Sign in to the Lake Formation console as an IAM administrator role or a role with IAM permissions to the PutDataLakeSettings() API. Choose the AWS Region where you have your sample data set in an S3 bucket and its corresponding database and tables in the Data Catalog.
- Select Data catalog settings from the left navigation bar under Administration. Select Version 4 from the dropdown menu for Cross account version settings. Choose Save.
Note: If there are any other accounts in your environment that share catalog resources to your producer account through Lake Formation, upgrading the sharing version might impact them. See <title of documentation page> for more information.
- Sign out as IAM administrator and sign back in to the Lake Formation console as a Lake Formation administrator role.
- Select Data lake locations from the left navigation bar under Administration.
- Select Register location and provide the S3 location of your database and tables.
- Provide an IAM role that has access to the data in the S3 location. For more details about this role requirement, see Requirements for roles used to register locations.
- Choose the Hybrid access mode under Permission mode, and then choose Register location.
- Select Data lake locations under Administration from the left navigation bar. Confirm that the registered location shows as Hybrid access mode for Permission mode.
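The version change in the steps above can also be made with the `PutDataLakeSettings()` API. Because that call replaces the whole settings object, the sketch below reads the current settings first and changes only the `CROSS_ACCOUNT_VERSION` parameter. The helper name is ours, and the boto3 usage is shown as comments because it needs credentials.

```python
import copy
from typing import Dict


def with_cross_account_version(settings: Dict, version: str = "4") -> Dict:
    """Return a copy of the data lake settings with the sharing version set."""
    updated = copy.deepcopy(settings)  # leave the caller's dict untouched
    updated.setdefault("Parameters", {})["CROSS_ACCOUNT_VERSION"] = version
    return updated


# Usage (requires boto3 and permission to call PutDataLakeSettings):
#   import boto3
#   client = boto3.client("lakeformation")
#   current = client.get_data_lake_settings()["DataLakeSettings"]
#   client.put_data_lake_settings(
#       DataLakeSettings=with_cross_account_version(current))
```

Copying the existing settings keeps the data lake administrators and other values in place, which a request built from scratch could silently drop.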
Granting cross-account permissions
The steps to share the database hybridsalesdb to the consumer account are similar to the steps to set up scenario 1.
- In the Lake Formation console, select Databases from Catalog on the left navigation bar. Choose hybridsalesdb, or your own database that has the data in the S3 location that you registered previously. From the Actions drop down menu, select Grant.
- Select External accounts under Principals and provide the consumer account ID. Select Named catalog resources under LF-Tags or catalog resources. Choose hybridsalesdb for Databases.
- Select Describe for Database permissions and for Grantable permissions.
- Under Hybrid access mode, select the checkbox for Make Lake Formation permissions effective immediately. Choose Grant.
Note: Selecting the checkbox opts in the consumer account’s Lake Formation administrator roles to use Lake Formation permissions without interrupting the consumer account’s IAM and S3 access to the same database.
- Repeat step 2 up to database selection to grant permission to the consumer account ID for table level permission. Select any three tables from the drop-down menu for table level permission under Tables.
- Select Select under Table permissions and Grantable permissions. Select the checkbox for Make Lake Formation permissions effective immediately under Hybrid access mode. Choose Grant.
- Select the Data lake permissions on the left navigation bar. Verify the granted permissions to the consumer account.
- Select the Hybrid access mode on the left navigation bar. Verify the opted-in resources and principal.
You have now enabled cross-account sharing using Lake Formation permissions without revoking access from the IAMAllowedPrincipals virtual group.
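The cross-account grants above map onto the Lake Formation `grant_permissions` API, with the consumer account ID as the principal and the same permissions repeated as grantable. The following is a minimal sketch that only builds the requests; the helper names are ours and the account ID in the usage note is hypothetical.

```python
from typing import Dict, List


def cross_account_grant(consumer_account_id: str, resource: Dict,
                        permissions: List[str]) -> Dict:
    """Grant request for an external account, including grantable permissions."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": resource,
        "Permissions": permissions,
        # Lets the consumer's administrator grant onward to its own roles.
        "PermissionsWithGrantOption": permissions,
    }


def share_requests(consumer_account_id: str, database: str,
                   tables: List[str]) -> List[Dict]:
    """Describe on the database, then Select on each shared table."""
    requests = [cross_account_grant(consumer_account_id,
                                    {"Database": {"Name": database}},
                                    ["DESCRIBE"])]
    for table in tables:
        resource = {"Table": {"DatabaseName": database, "Name": table}}
        requests.append(cross_account_grant(consumer_account_id, resource,
                                            ["SELECT"]))
    return requests


# Usage (requires boto3 and Lake Formation administrator credentials):
#   import boto3
#   client = boto3.client("lakeformation")
#   for request in share_requests("444455556666", "hybridsalesdb",
#                                 ["hybridsales_order"]):
#       client.grant_permissions(**request)
```

The console checkbox that makes the permissions effective immediately has no counterpart in these requests; with the SDK, we assume the opt-in is a separate `create_lake_formation_opt_in` call for the same principal and resource.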
Set up in consumer account B
In scenario 2, the Data-Analyst and Data-Engineer roles are created in the consumer account similar to scenario 1, but these roles access the database and tables shared from the producer account.
The Data-Engineer role also has permissions to create and run an Apache Spark job in AWS Glue Studio.
Data-Engineer has the following IAM policy that grants access to the producer account’s S3 bucket, which is registered with Lake Formation in hybrid access mode.
Data-Engineer has the following IAM policy that grants access to the consumer account’s entire Data Catalog and the producer account’s database hybridsalesdb and its tables.
Data-Analyst has IAM policies similar to those in scenario 1, granting basic data lake user permissions. For additional details, see Assign permissions to Lake Formation users.
Accepting AWS RAM invites
- Sign in to the Lake Formation console as a Lake Formation administrator role.
- Open the AWS RAM console. Select Resource shares from Shared with me on the left navigation bar. You should see two invites from the producer account, one for database level share and one for table level share.
- Select each invite, review the producer account ID, and choose Accept resource share.
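The same acceptance can be scripted with the AWS RAM `get_resource_share_invitations` and `accept_resource_share_invitation` APIs. The sketch below filters a response for pending invitations from the producer account; the helper name and the sample account ID are hypothetical.

```python
from typing import Dict, List


def pending_invitation_arns(invitations: List[Dict],
                            producer_account_id: str) -> List[str]:
    """ARNs of pending invitations that were sent by the producer account."""
    return [
        inv["resourceShareInvitationArn"]
        for inv in invitations
        if inv.get("status") == "PENDING"
        and inv.get("senderAccountId") == producer_account_id
    ]


# Usage in the consumer account (requires boto3):
#   import boto3
#   ram = boto3.client("ram")
#   invitations = ram.get_resource_share_invitations()["resourceShareInvitations"]
#   for arn in pending_invitation_arns(invitations, "111122223333"):
#       ram.accept_resource_share_invitation(resourceShareInvitationArn=arn)
```

Filtering on the sender account mirrors the manual review of the producer account ID, so that invitations from unexpected accounts aren't accepted by accident.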
Granting Lake Formation permissions to Data-Analyst
- Open the Lake Formation console. As a Lake Formation administrator in the consumer account, you should see the database and tables shared from the producer account.
- Select Databases from the Data catalog on the left navigation bar. Select the radio button on the database hybridsalesdb and select Create resource link from the Actions drop down menu.
- Enter rl_hybridsalesdb as the name for the resource link and leave the rest of the selections as they are. Choose Create.
- Select the radio button for rl_hybridsalesdb. Select Grant from the Actions drop down menu.
- Grant Describe permissions on the resource link to Data-Analyst.
- Again, select the radio button on rl_hybridsalesdb from the Databases under Catalog in the left navigation bar. Select Grant on target from the Actions drop down menu.
- Select Data-Analyst for IAM users and roles, and keep the already selected database.
- Select Describe under Database permissions. Select the checkbox for Make Lake Formation permissions effective immediately under Hybrid access mode. Choose Grant.
- Select the radio button on rl_hybridsalesdb from Databases under Catalog in the left navigation bar. Select Grant on target from the Actions drop down menu.
- Select Data-Analyst for IAM users and roles. Select All tables of the database hybridsalesdb. Select Select under Table permissions.
- Select the checkbox for Make Lake Formation permissions effective immediately under Hybrid access mode.
- View and verify the permissions granted to Data-Analyst from the Data lake permissions tab on the left navigation bar.
- Sign out as Lake Formation administrator role.
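As a rough SDK counterpart to the console steps above, the following sketch lists the AWS Glue and Lake Formation calls involved. It assumes that the resource link is created with the AWS Glue `create_database` API and a `TargetDatabase`, and that the grants on target reference the producer account through `CatalogId`. The helper name is ours and the account IDs in the usage note are hypothetical.

```python
from typing import Dict, List, Tuple

Call = Tuple[str, str, Dict]  # (boto3 client name, method name, keyword arguments)


def consumer_calls(producer_account_id: str, database: str, link_name: str,
                   analyst_role_arn: str) -> List[Call]:
    """Resource link creation followed by the grants for Data-Analyst."""
    principal = {"DataLakePrincipalIdentifier": analyst_role_arn}
    target_db = {"Database": {"CatalogId": producer_account_id, "Name": database}}
    target_tables = {"Table": {"CatalogId": producer_account_id,
                               "DatabaseName": database,
                               "TableWildcard": {}}}  # all tables of the database
    return [
        # Resource link in the consumer catalog that points at the shared database.
        ("glue", "create_database", {"DatabaseInput": {
            "Name": link_name,
            "TargetDatabase": {"CatalogId": producer_account_id,
                               "DatabaseName": database}}}),
        # Describe on the resource link itself.
        ("lakeformation", "grant_permissions", {
            "Principal": principal,
            "Resource": {"Database": {"Name": link_name}},
            "Permissions": ["DESCRIBE"]}),
        # Grant on target: Describe on the shared database.
        ("lakeformation", "grant_permissions", {
            "Principal": principal, "Resource": target_db,
            "Permissions": ["DESCRIBE"]}),
        # Grant on target: Select on all tables of the shared database.
        ("lakeformation", "grant_permissions", {
            "Principal": principal, "Resource": target_tables,
            "Permissions": ["SELECT"]}),
    ]


# Usage (requires boto3 and Lake Formation administrator credentials):
#   import boto3
#   for client_name, method, kwargs in consumer_calls(
#           "111122223333", "hybridsalesdb", "rl_hybridsalesdb",
#           "arn:aws:iam::444455556666:role/Data-Analyst"):
#       getattr(boto3.client(client_name), method)(**kwargs)
```

The opt-in that the console checkbox performs isn't part of this list; with the SDK, we assume it is a separate `create_lake_formation_opt_in` call per target resource.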
Validate Lake Formation permissions as Data-Analyst
- Sign back in to the console as Data-Analyst.
- Open the Athena console. If you’re using Athena for the first time, set up the query results location to your S3 bucket as described in Specifying a query result location.
- In the Query Editor page, under Data, select AwsDataCatalog for Data source. For Tables, select the three dots next to any of the table names. Select Preview Table to run the query.
- Sign out as Data-Analyst.
Validate IAM and S3 permissions for Data-Engineer
- Sign back in to the console as Data-Engineer.
- Using the same steps as scenario 1, verify IAM and S3 access by running the AWS Glue ETL script in AWS Glue Studio.
You’ve added Lake Formation permissions to a new role, Data-Analyst, without interrupting existing IAM and S3 access for Data-Engineer in a cross-account sharing use case.
If you’ve used sample datasets from your S3 for this blog post, we recommend removing relevant Lake Formation permissions on your database for the Data-Analyst role and cross-account grants. You can also remove the hybrid access mode opt-in and remove the S3 bucket registration from Lake Formation. After removing all Lake Formation permissions from both the producer and consumer accounts, you can delete the Data-Analyst and Data-Engineer IAM roles.
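If you script the cleanup, one possible order is to remove the opt-in, revoke the grant, and deregister the location last. The sketch below lists the corresponding Lake Formation API calls (`delete_lake_formation_opt_in`, `revoke_permissions`, `deregister_resource`) for a single principal and database. The ordering is our assumption rather than a documented requirement, and the helper name and ARNs are hypothetical.

```python
from typing import Dict, List, Tuple

Call = Tuple[str, Dict]  # (Lake Formation client method name, keyword arguments)


def cleanup_calls(principal_arn: str, database: str, permissions: List[str],
                  bucket_arn: str) -> List[Call]:
    """Undo the opt-in, the grant, and the location registration, in that order."""
    principal = {"DataLakePrincipalIdentifier": principal_arn}
    resource = {"Database": {"Name": database}}
    return [
        ("delete_lake_formation_opt_in", {"Principal": principal,
                                          "Resource": resource}),
        ("revoke_permissions", {"Principal": principal, "Resource": resource,
                                "Permissions": permissions}),
        ("deregister_resource", {"ResourceArn": bucket_arn}),
    ]


# Usage (requires boto3 and Lake Formation administrator credentials):
#   import boto3
#   client = boto3.client("lakeformation")
#   for method, kwargs in cleanup_calls(
#           "arn:aws:iam::111122223333:role/Data-Analyst", "hybridsalesdb",
#           ["DESCRIBE"], "arn:aws:s3:::example-hybrid-sales-bucket"):
#       getattr(client, method)(**kwargs)
```

Table-level grants and cross-account grants need the same treatment with their own resources and principals before the IAM roles are deleted.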
Currently, only a Lake Formation administrator role can opt in other users to use Lake Formation permissions for a resource, since choosing whether a user’s access goes through Lake Formation or through IAM and S3 permissions is an administrative task that requires full knowledge of your organizational data access setup. Further, you can grant permissions and opt in at the same time using only the named-resource method and not LF-Tags. If you’re using LF-Tags to grant permissions, we recommend you use the Hybrid access mode option on the left navigation bar to opt in (or the equivalent CreateLakeFormationOptIn() API using the AWS SDK or AWS CLI) as a subsequent step after granting permissions.
In this blog post, we went through the steps to set up hybrid access mode for Data Catalog. You learned how to onboard users selectively to the Lake Formation permissions model. The users who had access through IAM and S3 permissions continued to have their access without interruptions. You can use Lake Formation to add fine-grained access to Data Catalog tables to enable your business analysts to query using Amazon Athena and Amazon Redshift Spectrum, while your data scientists can explore the same data using Amazon SageMaker. Data engineers can continue to use their IAM and S3 permissions on the same data to run workloads using Amazon EMR and AWS Glue. Hybrid access mode for the Data Catalog enables a variety of analytical use cases for your data without data duplication.
To get started, see the documentation for hybrid access mode. We encourage you to check out the feature and share your feedback in the comments section. We look forward to hearing from you.
About the authors
Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She likes building data lake solutions for AWS customers and partners. When not on the keyboard, she explores the latest science and technology trends and spends time with her family.