AWS Big Data Blog
Use IAM runtime roles with Amazon EMR Studio Workspaces and AWS Lake Formation for cross-account fine-grained access control
Amazon EMR Studio is an integrated development environment (IDE) that makes it straightforward for data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark. EMR Studio provides fully managed Jupyter notebooks and tools such as Spark UI and YARN Timeline Server via EMR Studio Workspaces. You can attach an EMR Studio Workspace to an EMR cluster, and use the compute power of the EMR cluster and run data science jobs on the cluster. Data is often stored in data lakes managed by AWS Lake Formation, enabling you to apply fine-grained access control through a simple grant or revoke mechanism.
We’re happy to introduce runtime roles for EMR Studio Workspaces. You can now define a runtime role and assign it to an EMR cluster when attaching an EMR Studio Workspace. The jobs on the EMR cluster will use this runtime role to access AWS resources. After configuring a runtime role, you can also use Lake Formation and apply fine-grained data access control for the jobs submitted by the EMR Studio Workspace.
Previously, when attaching EMR Studio Workspaces to EMR clusters, all Workspaces had to use the same AWS Identity and Access Management (IAM) role—namely, the cluster’s Amazon Elastic Compute Cloud (Amazon EC2) instance profile. Therefore, all Workspaces attached to the same EMR cluster had the same data access. To control access to data sources, each EMR Studio Workspace had to use a different EMR cluster, and multiple EMR instance profiles were needed.
Starting with the release of Amazon EMR 6.11, you can choose a runtime role when attaching an EMR Studio Workspace to an EMR cluster. This runtime role scopes down access at the Workspace level. Your Apache Livy and Apache Spark jobs that run from the EMR Studio Workspaces will have permission to access only the data and resources permitted by policies attached to the runtime role. Also, when data is accessed from data lakes managed with Lake Formation, you can enforce fine-grained data access control using Lake Formation permissions. Because Workspaces with different levels of data access can share one EMR cluster, you no longer need a separate cluster and instance profile for each, which reduces operational overhead.
In this post, we demonstrate how to configure runtime roles for EMR Studio Workspaces and attach a Workspace to an EMR cluster with runtime roles. Because large enterprises typically use multiple AWS accounts, and many of those accounts might need access to a data lake managed by a single AWS account, our example uses two AWS accounts. We explain how to control access to EMR Studio runtime roles, manage data access across accounts in a data lake via Lake Formation, and enforce table-level and column-level permissions to the EMR runtime roles.
Solution overview
To demonstrate fine-grained access control, we create a sample AWS Glue database named company and manage the database permission in Lake Formation. The database consists of two separate tables:
- employees – This table stores information about the company’s employees, including employee ID, name, department, and salary
- products – This table stores information about the products sold by the company, including product ID, name, category, and price
To demonstrate data access control, we consider the following data users:
- Alice, a data scientist in the sales team – She should have read-only access to all columns in the products table and to selected columns (uid, name, and department) in the employees table
- Bob, a data scientist in the human resources team – He should have read-only access to all columns in the employees table and should not have access to the products table
To demonstrate cross-account data sharing, we consider two accounts:
- Data producer account – We refer to this account as 123456789012 in this post. This account manages the raw data in Amazon Simple Storage Service (Amazon S3) and writes data to the data lake. The company database and tables should be in this account.
- Data consumer account – We refer to this account as 111122223333 in this post. This account is accessed directly by the users for data analysis and doesn’t have write access to the data. This account should be accessible by Alice and Bob.
The architecture is implemented as follows:
- The data producer account manages a data lake. Raw data is stored in S3 buckets and catalogued in the AWS Glue Data Catalog.
- Lake Formation in the data producer account governs the data access via the Data Catalog, and provides cross-account data sharing with the data consumer account.
- Lake Formation in the data consumer account governs access to the shared data, both at the table level and through fine-grained Lake Formation permissions. For more information, refer to Methods for fine-grained access control.
- EMR Studio Workspaces in the data consumer account use runtime roles when running jobs on an EMR cluster.
- The EMR cluster connects to Glue Data Catalog in the data consumer account and queries the data from the data lake through cross-account data sharing.
The following diagram illustrates this architecture.
In the following sections, we go through the steps to share data across accounts via Lake Formation, run an EMR Studio Workspace with runtime roles, and demonstrate fine-grained access control.
Prerequisites
You should have the following prerequisites:
- The AWS Command Line Interface (AWS CLI) installed and configured, or access to AWS CloudShell.
- Access to the data producer account and data consumer account with adequate permissions to create and deploy AWS CloudFormation stacks, upload files to S3 buckets, accept shared resources in AWS Resource Access Manager (AWS RAM), and other actions taken in this post.
- Access to IAM roles or users that are Lake Formation data lake administrators in both the data producer and data consumer accounts. For instructions, refer to Create a data lake administrator.
- PEM certificates, which can be used for in-transit encryption with Amazon EMR. The zipped certificates should be uploaded to an Amazon S3 location in the data consumer account.
Create the infrastructure in the data producer account
Complete the following steps to create the infrastructure resources:
- Log in to the data producer AWS account (123456789012).
- Choose Launch Stack to deploy a CloudFormation template to create the necessary resources.
- For DataLakeBucketSuffix, enter the suffix for the S3 bucket used by the data lake. The whole S3 bucket name to be created will be {AwsAccountId}-{AwsRegion}-{DataLakeBucketSuffix}.
- After the CloudFormation stack is created, navigate to the Outputs tab of the stack and capture the value of DataLakeS3Bucket to use in the next step.
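The naming pattern and the output lookup above can also be scripted. The following is a minimal sketch, not part of the original walkthrough: the helper names, the example Region, and the example suffix are hypothetical, and the stack name is whatever you chose when launching the template. The boto3 import is deferred so the pure helpers can be used without the SDK installed.

```python
def data_lake_bucket_name(account_id: str, region: str, suffix: str) -> str:
    # Mirrors the naming pattern described above: account ID, Region, suffix.
    return f"{account_id}-{region}-{suffix}"


def find_output(outputs: list, key: str) -> str:
    # `outputs` has the shape of Stacks[0]["Outputs"] in a DescribeStacks response.
    for item in outputs:
        if item["OutputKey"] == key:
            return item["OutputValue"]
    raise KeyError(f"stack output {key!r} not found")


def read_stack_output(stack_name: str, key: str) -> str:
    # Requires AWS credentials for the account that owns the stack.
    import boto3  # imported lazily; only needed for the live call

    cfn = boto3.client("cloudformation")
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    return find_output(stack.get("Outputs", []), key)


if __name__ == "__main__":
    # Hypothetical Region and suffix, for illustration only.
    print(data_lake_bucket_name("123456789012", "us-east-1", "datalake"))
```

Calling read_stack_output with your stack name and the key DataLakeS3Bucket should return the same value shown on the Outputs tab.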
Create data files and upload them to Amazon S3 in the data producer account
Configure your AWS CLI to use the IAM identity with permission to upload to DataLakeS3BucketName in the data producer AWS account (123456789012), or you can sign in to CloudShell using the AWS Management Console. Complete the following steps:
- On your local machine, move to a directory of your choice with the cd command, for example, cd ~.
- Run the script with chmod 744 create_sample_data.sh && ./create_sample_data.sh <DataLakeS3BucketName>.
The script will create a subdirectory tmp in your current working directory, create the test data in CSV files, and upload the files to the DataLakeS3BucketName S3 bucket.
Set up Lake Formation in the data producer account
In this section, we walk through the steps to set up Lake Formation in the data producer account.
Set up Lake Formation cross-account data sharing version settings
Lake Formation supports multiple data sharing versions. For this post, we use version 3. To learn more about the differences between data sharing versions, refer to Updating cross-account data sharing version settings. To change the data sharing version, see To enable the new version.
Register the Amazon S3 location as the data lake location
When you register an Amazon S3 location with Lake Formation, you specify an IAM role with read/write permissions on that location. After registering, when EMR clusters request access to this Amazon S3 location, Lake Formation will supply temporary credentials of the provided role to access the data. We already created the role LakeFormationCompanyDatabaseDataAccessRole for this purpose in the previous step. To register the Amazon S3 location as the data lake location, complete the following steps:
- Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account (123456789012).
- In the navigation pane, choose Data lake locations under Administration.
- Choose Register location.
- For Amazon S3 path, enter s3://<DataLakeS3BucketName>/company-database.
- For IAM role, enter LakeFormationCompanyDatabaseDataAccessRole.
- For Permission mode, select Lake Formation.
- Choose Register location.
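If you prefer to script the registration, the console steps above correspond roughly to the Lake Formation RegisterResource API. This is a sketch under assumptions: the helper names are hypothetical, and the role ARN is assumed to have no IAM path, so verify it against the role created by the stack.

```python
def register_location_request(bucket: str, role_arn: str) -> dict:
    # Keyword arguments for the Lake Formation RegisterResource call.
    return {
        "ResourceArn": f"arn:aws:s3:::{bucket}/company-database",
        "RoleArn": role_arn,
        # Use the provided role instead of the service-linked role.
        "UseServiceLinkedRole": False,
    }


def register_location(bucket: str, role_arn: str) -> None:
    # Run as a data lake administrator in the data producer account.
    import boto3  # imported lazily; only needed for the live call

    lakeformation = boto3.client("lakeformation")
    lakeformation.register_resource(**register_location_request(bucket, role_arn))


if __name__ == "__main__":
    # Assumed ARN layout (no IAM path); the bucket name is a placeholder.
    role = "arn:aws:iam::123456789012:role/LakeFormationCompanyDatabaseDataAccessRole"
    print(register_location_request("<DataLakeS3BucketName>", role))
```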
Revoke permissions granted to IAMAllowedPrincipals
The IAMAllowedPrincipals group includes any IAM users and roles that are allowed access to your Data Catalog resources by your IAM policies. To enforce the Lake Formation model, we need to revoke permission from IAMAllowedPrincipals using the following steps:
- Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account.
- In the navigation pane, choose Data lake permissions under Permissions.
- Filter permissions by Database = company and Principal = IAMAllowedPrincipals.
- Select all the permissions given to the principal IAMAllowedPrincipals and choose Revoke.
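The same revocation can be sketched with the Lake Formation RevokePermissions API, where the group is addressed by the identifier IAM_ALLOWED_PRINCIPALS. This sketch assumes the group holds the ALL (Super) permission on the database and on each table, which is only the case when the default IAM-only access settings were in effect when the resources were created. If it holds something else, the call fails, so check the existing permissions first. The helper names are hypothetical.

```python
IAM_ALLOWED_PRINCIPALS = {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"}


def revoke_requests(database: str, tables: list) -> list:
    # One RevokePermissions request for the database, then one per table.
    requests = [
        {
            "Principal": IAM_ALLOWED_PRINCIPALS,
            "Resource": {"Database": {"Name": database}},
            "Permissions": ["ALL"],  # assumption: the default Super permission
        }
    ]
    for table in tables:
        requests.append(
            {
                "Principal": IAM_ALLOWED_PRINCIPALS,
                "Resource": {"Table": {"DatabaseName": database, "Name": table}},
                "Permissions": ["ALL"],
            }
        )
    return requests


def revoke_all(database: str, tables: list) -> None:
    # Run as a data lake administrator in the data producer account.
    import boto3  # imported lazily; only needed for the live call

    lakeformation = boto3.client("lakeformation")
    for request in revoke_requests(database, tables):
        lakeformation.revoke_permissions(**request)


if __name__ == "__main__":
    for request in revoke_requests("company", ["employees", "products"]):
        print(request)
```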
Set up application integration settings
To enforce permissions for the EMR cluster, you need to register a session tag value with Lake Formation. Lake Formation uses this session tag to authorize callers and provide access to the data lake. We register Amazon EMR as the session tag value. This value will be referenced in the security configuration when creating the EMR cluster.
Set up the session tag using the following steps:
- Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account.
- Choose Application integration settings under Administration in the navigation pane.
- Select Allow external engines to filter data in Amazon S3 locations registered with Lake Formation.
- For Session tag values, enter Amazon EMR.
- For AWS account IDs, enter the data consumer AWS account ID (111122223333).
- Choose Save.
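These settings map onto fields of the Lake Formation data lake settings. Because PutDataLakeSettings replaces the whole settings object, the sketch below reads the current settings and merges the new values into a copy before writing them back. The function names are hypothetical, and the sketch is an illustration rather than the method used by the post.

```python
import copy


def with_external_filtering(settings: dict, session_tag: str, account_id: str) -> dict:
    # Return a copy of `settings` with external data filtering enabled for one account.
    updated = copy.deepcopy(settings)
    updated["AllowExternalDataFiltering"] = True

    allow_list = updated.setdefault("ExternalDataFilteringAllowList", [])
    entry = {"DataLakePrincipalIdentifier": account_id}
    if entry not in allow_list:
        allow_list.append(entry)

    tags = updated.setdefault("AuthorizedSessionTagValueList", [])
    if session_tag not in tags:
        tags.append(session_tag)
    return updated


def apply_external_filtering(session_tag: str, account_id: str) -> None:
    # Run as a data lake administrator; existing settings are preserved.
    import boto3  # imported lazily; only needed for the live call

    lakeformation = boto3.client("lakeformation")
    current = lakeformation.get_data_lake_settings()["DataLakeSettings"]
    lakeformation.put_data_lake_settings(
        DataLakeSettings=with_external_filtering(current, session_tag, account_id)
    )


if __name__ == "__main__":
    print(with_external_filtering({}, "Amazon EMR", "111122223333"))
```

The merge is idempotent, so running it twice leaves a single allow-list entry and a single session tag value.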
Share the database and tables to the data consumer account
We now grant permissions to the data consumer AWS account, including grantable permissions. This allows the Lake Formation data lake administrator in the data consumer account to control access to the data within the account.
Grant database permissions to the data consumer account
Complete the following steps:
- Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account.
- In the navigation pane, choose Databases.
- Select the database company, and on the Actions menu, under Permissions, choose Grant.
- In the Principals section, select External accounts and enter the data consumer AWS account (111122223333).
- In the LF-Tags or catalog resources section, choose company for Databases.
- In the Database permissions section, select Describe for both Database permissions and Grantable permissions.
This allows the data lake administrator in the data consumer account to describe the database and grant describe permissions to other principals in the data consumer account.
- Choose Grant.
Grant table permissions to the data consumer account
Complete the following steps:
- Open the Lake Formation console with the Lake Formation data lake administrator in the data producer account.
- In the navigation pane, choose Tables.
- Select the products table, which belongs to the company database, and on the Actions menu, under Permissions, choose Grant.
- In the Principals section, select External accounts and enter the data consumer AWS account (111122223333).
- In the LF-Tags or catalog resources section, select Named data catalog resources and specify the following:
  - For Databases, choose company.
  - For Tables, choose products and employees.
- In the Table permissions section, choose Select and Describe for both Table permissions and Grantable permissions.
This allows the data lake administrator in the data consumer account to select and describe the tables, and grant select and describe table permissions to other principals in the data consumer account.
- In the Data permissions section, select All data access.
- Choose Grant.
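The database and table grants to the consumer account can be expressed as Lake Formation GrantPermissions requests, with the account ID as the principal and the same permissions repeated as grantable. This is a minimal sketch with hypothetical helper names. It assumes the calls run in the data producer account, so no CatalogId is set on the resources.

```python
def cross_account_grants(consumer_account: str, database: str, tables: list) -> list:
    # Build GrantPermissions requests that share a database and its tables.
    principal = {"DataLakePrincipalIdentifier": consumer_account}
    grants = [
        {
            "Principal": principal,
            "Resource": {"Database": {"Name": database}},
            "Permissions": ["DESCRIBE"],
            "PermissionsWithGrantOption": ["DESCRIBE"],
        }
    ]
    for table in tables:
        grants.append(
            {
                "Principal": principal,
                "Resource": {"Table": {"DatabaseName": database, "Name": table}},
                "Permissions": ["SELECT", "DESCRIBE"],
                "PermissionsWithGrantOption": ["SELECT", "DESCRIBE"],
            }
        )
    return grants


def share_with_consumer(consumer_account: str, database: str, tables: list) -> None:
    # Run as a data lake administrator in the data producer account.
    import boto3  # imported lazily; only needed for the live call

    lakeformation = boto3.client("lakeformation")
    for grant in cross_account_grants(consumer_account, database, tables):
        lakeformation.grant_permissions(**grant)


if __name__ == "__main__":
    for grant in cross_account_grants("111122223333", "company", ["products", "employees"]):
        print(grant)
```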
Now we have finished setting up the data producer account.
Set up the infrastructure in the data consumer account
Complete the following steps to create the infrastructure resources:
- Log in to the data consumer account (111122223333).
- Choose Launch stack to deploy a CloudFormation template to create the necessary resources.
- For Release Label, enter the Amazon EMR release label to use, which must be emr-6.11 or later.
- For InstanceType, choose the instance type for the EMR cluster, such as r4.4xlarge.
- For EMRS3BucketNameSuffix, enter the S3 bucket suffix to store EMR cluster logs and EMR notebook files. The full S3 bucket name to be created will be {AWSAccountId}-{AWSRegion}-{EMRS3BucketNameSuffix}.
- For S3PathToInTransitCertificate, enter the S3 path for the .zip file that contains the .pem files used for in-transit encryption.
For instructions on creating the .zip file that contains the .pem files and uploading them to your S3 bucket, refer to Providing certificates for encrypting data in transit with Amazon EMR encryption.
- After the CloudFormation stack is created, navigate to the Outputs tab of the stack.
- Capture the value of EMRStudioLink to use to sign in to EMR Studio.
Accept the resource share in the data consumer account
To access shared resources, you must accept the invitation first.
- Open the AWS RAM console of the data consumer account with the IAM identity that has AWS RAM access.
- In the navigation pane, choose Resource shares under Shared with me.
You should see two pending resource shares from the data producer account.
- Accept both resource shares.
You should see the company database, employees table, and products table in the Data Catalog.
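Accepting the invitations can also be sketched with the AWS RAM API. The helper names below are hypothetical, and the sketch reads only the first page of invitations, which is an assumption that holds for small accounts but not in general.

```python
def pending_invitation_arns(invitations: list, sender_account: str) -> list:
    # Select invitations that are still pending and were sent by the producer account.
    return [
        invitation["resourceShareInvitationArn"]
        for invitation in invitations
        if invitation["status"] == "PENDING"
        and invitation["senderAccountId"] == sender_account
    ]


def accept_pending(sender_account: str) -> list:
    # Run in the data consumer account with an identity that has AWS RAM access.
    import boto3  # imported lazily; only needed for the live call

    ram = boto3.client("ram")
    # Pagination is ignored here for brevity.
    invitations = ram.get_resource_share_invitations()["resourceShareInvitations"]
    accepted = pending_invitation_arns(invitations, sender_account)
    for arn in accepted:
        ram.accept_resource_share_invitation(resourceShareInvitationArn=arn)
    return accepted
```

Filtering on the sender account ID avoids accepting unrelated invitations that happen to be pending in the same account.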
Set up Lake Formation in the data consumer account
In this section, we walk through the steps to set up Lake Formation in the data consumer account.
Set up application integration settings
Similar to the setup in the data producer account, you need to register Amazon EMR as a session tag. This value is referenced in the security configuration when creating the EMR cluster in the CloudFormation stack.
To do that, complete the following steps:
- Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account (111122223333).
- Choose Application integration settings under Administration in the navigation pane.
- Select Allow external engines to filter data in Amazon S3 locations registered with Lake Formation.
- For Session tag values, enter Amazon EMR.
- For AWS account IDs, enter the data consumer AWS account ID (111122223333).
- Choose Save.
Grant describe permissions to runtime roles on the default database
If you don’t have a default database in Lake Formation, or your default database already has permissions granted to IAMAllowedPrincipals, you can skip this step.
Amazon EMR checks the default database by default. If you already have a default database in Lake Formation, grant the describe permission on it to the runtime roles by completing the following steps:
- Open the Lake Formation console with the Lake Formation data lake administrator user in the data consumer account.
- In the navigation pane, choose Databases.
- Select the default database, verify that the owner account ID is the data consumer account (111122223333), and on the Actions menu, choose Grant.
- In the Principals section, select IAM users and roles.
- For IAM users and roles, choose sales-runtime-role and human-resource-runtime-role.
- For LF-Tags or catalog resources, select Named data catalog resources and choose default for Databases.
- In the Database permissions section, for Database permissions, choose Describe.
- Choose Grant.
Create a resource link for the shared database
To access the database and table resources that were shared by the data producer AWS account, you need to create a resource link in the data consumer AWS account. A resource link is a Data Catalog object that is a link to a local or shared database or table. After you create a resource link to a database or table, you can use the resource link name wherever you would use the database or table name. In the following steps, you create a resource link and grant permissions on it to the runtime role principals. The runtime roles will then access the data in shared databases and underlying tables through the resource link.
To create a resource link, complete the following steps:
- Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
- In the navigation pane, choose Databases.
- Select the company database, verify that the owner account ID is the data producer account (123456789012), and on the Actions menu, choose Create Resource links.
- For Resource link name, enter the name of the resource link (for example, company-shared).
- For Shared database’s region, choose the Region of the company database.
- For Shared database, choose the company database.
- For Shared database’s owner ID, enter the account ID of the data producer account (123456789012).
- Choose Create.
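In API terms, a database resource link is a Glue database whose input names a target database in another catalog. The following sketch uses the Glue CreateDatabase call, with hypothetical helper names. It assumes the shared database is in the same Region as the caller, so no Region is set on the target.

```python
def resource_link_input(link_name: str, owner_account: str, database: str) -> dict:
    # DatabaseInput for Glue CreateDatabase that links to a shared database.
    return {
        "Name": link_name,
        "TargetDatabase": {
            "CatalogId": owner_account,  # account that owns the shared database
            "DatabaseName": database,
        },
    }


def create_resource_link(link_name: str, owner_account: str, database: str) -> None:
    # Run as a data lake administrator in the data consumer account.
    import boto3  # imported lazily; only needed for the live call

    glue = boto3.client("glue")
    glue.create_database(
        DatabaseInput=resource_link_input(link_name, owner_account, database)
    )


if __name__ == "__main__":
    print(resource_link_input("company-shared", "123456789012", "company"))
```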
Grant permissions on the resource link to the runtime role principals
Grant permissions on the resource link to sales-runtime-role and human-resource-runtime-role using the following steps:
- Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
- In the navigation pane, choose Databases.
- Select the resource link (company-shared) and on the Actions menu, choose Grant.
- In the Principals section, select IAM users and roles, and choose sales-runtime-role and human-resource-runtime-role.
- In the LF-Tags or catalog resources section, for Databases, choose company-shared.
- In the Resource link permissions section, select Describe.
This allows the runtime roles to describe the resource link. We don’t make any selections for grantable permissions because runtime roles shouldn’t be able to grant permissions to other principals.
- Choose Grant.
Grant permission on the tables to the runtime role principals
You need to grant permissions on the tables to sales-runtime-role and human-resource-runtime-role to allow data access:
- human-resource-runtime-role should have describe and select permissions on all columns in the employees table, and no permissions on the products table.
- sales-runtime-role should have select permissions on the columns uid, name, and department in the employees table, and describe and select permissions on all columns in the products table.
Grant permission on the employees table to human-resource-runtime-role
Complete the following steps:
- Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
- In the navigation pane, choose Databases.
- Select the resource link (company-shared) and on the Actions menu, choose Grant on Target.
- In the Principals section, select IAM users and roles, then choose human-resource-runtime-role.
- In the LF-Tags or catalog resources section, select Named data catalog resources and specify the following:
  - For Databases, choose company.
  - For Tables, choose employees.
- In the Table permissions section, for Table permissions, select Describe and Select.
- In the Data permissions section, select All data access.
- Choose Grant.
Grant permission on the employees table to sales-runtime-role
Complete the following steps:
- Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
- In the navigation pane, choose Databases.
- Select the resource link (company-shared) and on the Actions menu, choose Grant on Target.
- In the Principals section, select IAM users and roles, then choose sales-runtime-role.
- In the LF-Tags or catalog resources section, select Named data catalog resources and specify the following:
  - For Databases, choose company.
  - For Tables, choose employees.
- In the Table permissions section, for Table permissions, select Select.
- In the Data permissions section, select Column-based access.
- Select Include columns and choose the uid, name, and department columns.
- Choose Grant.
Grant permission on the products table to sales-runtime-role
Complete the following steps:
- Open the Lake Formation console with the Lake Formation data lake administrator in the data consumer account.
- In the navigation pane, choose Databases.
- Select the resource link (company-shared) and on the Actions menu, choose Grant on Target.
- In the Principals section, select IAM users and roles, then choose sales-runtime-role.
- In the LF-Tags or catalog resources section, select Named data catalog resources and specify the following:
  - For Databases, choose company.
  - For Tables, choose products.
- In the Table permissions section, for Table permissions, select Select and Describe.
- In the Data permissions section, select All data access.
- Choose Grant.
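The three table grants above can be summarized as GrantPermissions requests. A full-table grant uses a Table resource, and the column-restricted grant uses a TableWithColumns resource with an explicit column list. This is a sketch with hypothetical helper names. It assumes the runtime roles have no IAM path in their ARNs and that the grants target the producer's catalog, which is what Grant on Target does.

```python
def role_arn(account_id: str, role_name: str) -> str:
    # Assumed ARN layout: a role without an IAM path.
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def table_grant(principal_arn, owner_account, database, table, permissions, columns=None):
    # Build one GrantPermissions request; `columns` restricts SELECT to those columns.
    target = {"CatalogId": owner_account, "DatabaseName": database, "Name": table}
    if columns is None:
        resource = {"Table": target}
    else:
        resource = {"TableWithColumns": {**target, "ColumnNames": list(columns)}}
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": resource,
        "Permissions": list(permissions),
    }


def runtime_role_grants(consumer_account: str, producer_account: str) -> list:
    hr = role_arn(consumer_account, "human-resource-runtime-role")
    sales = role_arn(consumer_account, "sales-runtime-role")
    return [
        table_grant(hr, producer_account, "company", "employees", ["DESCRIBE", "SELECT"]),
        table_grant(sales, producer_account, "company", "employees", ["SELECT"],
                    columns=["uid", "name", "department"]),
        table_grant(sales, producer_account, "company", "products", ["SELECT", "DESCRIBE"]),
    ]


def apply_grants(consumer_account: str, producer_account: str) -> None:
    # Run as a data lake administrator in the data consumer account.
    import boto3  # imported lazily; only needed for the live call

    lakeformation = boto3.client("lakeformation")
    for grant in runtime_role_grants(consumer_account, producer_account):
        lakeformation.grant_permissions(**grant)
```

No request is built for human-resource-runtime-role on the products table, which is how the role ends up with no access to it.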
Log in to EMR Studio and use the EMR Studio Workspace
Switch your role to alice-role or bob-role on the console using different web browsers to test access. Open the EMRStudioLink URL from the CloudFormation stack output to sign in to the EMR Studio with each role, then complete the following steps:
- Choose Workspaces in the navigation pane and choose Create Workspace.
- Enter a name and a description for the Workspace.
- Choose Create Workspace.
A new tab containing JupyterLab will open automatically when the Workspace is ready. Enable pop-ups in your browser if necessary.
- Choose the Compute icon in the navigation pane to attach the EMR Studio Workspace to a compute engine.
- Select EMR cluster on EC2 for Compute type.
- Choose the EMR cluster ID you created with AWS CloudFormation.
- For Runtime role, choose sales-runtime-role if signed in as alice-role. Choose human-resource-runtime-role if signed in as bob-role.
- Choose Attach.
Run code in the EMR Studio Workspace and verify data access
Run the following code in the EMR Studio Workspace with a PySpark kernel after signing in with alice-role or bob-role:
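A minimal sketch of such a check follows. It assumes the resource link is named company-shared, as in the earlier example, and that the PySpark kernel provides the usual spark session. The helper names are hypothetical, and the exact error raised for a table without permissions may differ by release.

```python
def preview_query(database: str, table: str, limit: int = 10) -> str:
    # Backticks are required because the resource link name contains a hyphen.
    return f"SELECT * FROM `{database}`.`{table}` LIMIT {limit}"


def show_tables(spark, database: str = "company-shared",
                tables: tuple = ("employees", "products")) -> None:
    # Query each table and report tables the runtime role cannot read.
    for table in tables:
        try:
            spark.sql(preview_query(database, table)).show()
        except Exception as error:  # for example, insufficient Lake Formation permissions
            print(f"{table}: query failed: {error}")


# In the notebook, `spark` is predefined by the PySpark kernel.
if "spark" in globals():
    show_tables(spark)
```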
You should see different results when using different roles.
According to our data access configuration in Lake Formation, Alice will have full data access for the products table. She can view all the columns except for salary in the employees table.
For Bob, according to our data access configuration in Lake Formation, he will have full data access to the employees table, but he has no access to the products table.
Clean up
When you’re finished experimenting with this solution, clean up your resources:
- Stop and delete the EMR Studio Workspaces created in the data consumer AWS account.
- Delete all the content in the S3 bucket EMRS3Bucket in the data consumer AWS account.
- Delete the CloudFormation stack in the data consumer AWS account.
- Delete all the content in the S3 bucket DataLakeS3Bucket in the data producer AWS account.
- Delete the CloudFormation stack in the data producer AWS account.
Conclusion
This post showed how you can use runtime roles when attaching an EMR Studio Workspace to an EMR cluster to apply cross-account fine-grained data access control with Lake Formation. We also demonstrated how multiple EMR Studio users can connect to the same EMR cluster, each using a runtime role scoped with permissions matching their individual level of access to data.
To learn more about using EMR Studio Workspaces with Lake Formation, refer to Run an EMR Studio Workspace with a runtime role. We encourage you to try out this new functionality, and connect with us if you have any questions or feedback!
About the Authors
Ashley Zhou is a Software Development Engineer at AWS. She is interested in data analytics and distributed systems.
Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building analytics and data mesh solutions on AWS and sharing them with the community.