AWS Big Data Blog

Controlling data lake access across multiple AWS accounts using AWS Lake Formation

When deploying data lakes on AWS, you can use multiple AWS accounts to better separate different projects or lines of business. In this post, we see how the AWS Lake Formation cross-account capabilities simplify securing and managing distributed data lakes across multiple accounts through a centralized approach, providing fine-grained access control to the AWS Glue Data Catalog and Amazon Simple Storage Service (Amazon S3) locations.

Use case

Keeping each business unit’s resources, such as compute and storage, in its own AWS account allows for easier cost allocation and permissions governance. On the other hand, centralizing your Data Catalog in a single account with Lake Formation removes the overhead of managing multiple catalogs in isolated data silos, which simplifies management and improves data availability.

For this post, we use the example of a company with two separate teams:

  • The Analytics team is responsible for data ingestion, validation, and cleansing. After processing the incoming data, they store it on Amazon S3 and use Lake Formation for the Data Catalog, in a primary AWS account.
  • The Business Analyst team is responsible for generating reports and extracting insight from such data. They use Amazon Athena running in a secondary AWS account.

When a secondary account needs to access data, the data lake administrator uses Lake Formation to share data across accounts, which avoids data duplication and silos and reduces complexity. Data can be shared at the database or table level, and the administrator can define which tables and columns each analyst has access to, establishing centralized and granular access control. The following diagram illustrates this architecture.

Architecture overview

We provide two AWS CloudFormation templates to set up the required infrastructure for this data lake use case. Each template deploys the resources in one of the accounts (primary and secondary).

In the primary account, the CloudFormation template loads the sample data in the S3 bucket. For this post, we use the publicly available dataset of historic taxi trips collected in New York City in the month of June 2020, in CSV format. The dataset is available from the New York City Taxi & Limousine Commission, via Registry of Open Data on AWS, and contains information on the geolocation and collected fares of individual taxi trips.

The template also creates a Data Catalog configuration by crawling the bucket using an AWS Glue crawler, and updating the Lake Formation Data Catalog on the primary account.

Prerequisites

To follow along with this post, you must have two AWS accounts (primary and secondary), with AWS Identity and Access Management (IAM) administrator access.

Deploying the CloudFormation templates

To get started, launch the first CloudFormation template in the primary account:

After that, deploy the second template in the secondary account:

You now have the deployment as depicted in the following architecture, and are ready to set up Lake Formation with cross-account access.

Setting up Lake Formation in the primary account

Now that you have the basic infrastructure provisioned by the template, we can dive deeper into the steps required for Lake Formation configuration. First, sign in to the primary account on the AWS Management Console, using the existing IAM administrator role and account.

Assigning a role to our data lake

Lake Formation administrators are IAM users or roles that can grant and delegate Lake Formation permissions on data locations, databases, and tables. The CloudFormation template created an IAM role with the proper IAM permissions, named LakeFormationPrimaryAdmin. Now we need to assign it to our data lake:

  1. On the Lake Formation console, in the Welcome to Lake Formation pop-up window, choose Add administrators.
    1. If the pop-up doesn’t appear, in the navigation pane, under Permissions, choose Admins and database creators.
    2. Under Data Lake Administrators, choose Grant.
  2. For IAM users and roles, choose LakeFormationPrimaryAdmin.
  3. Choose Save.
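The same assignment can also be scripted. The following is a minimal sketch, assuming boto3 is installed and credentials for the primary account are configured; the helper names with_admin and add_admin are hypothetical. Because put_data_lake_settings overwrites the existing settings, the sketch reads the current settings first and appends the new administrator.

```python
import copy


def with_admin(settings: dict, role_arn: str) -> dict:
    """Return a copy of the data lake settings with role_arn listed as an admin."""
    updated = copy.deepcopy(settings)
    admins = updated.setdefault("DataLakeAdmins", [])
    # Avoid adding the same principal twice.
    if not any(a["DataLakePrincipalIdentifier"] == role_arn for a in admins):
        admins.append({"DataLakePrincipalIdentifier": role_arn})
    return updated


def add_admin(role_arn: str) -> None:
    """Read the current settings and write them back with the extra admin."""
    import boto3  # imported lazily so with_admin has no AWS dependency

    lf = boto3.client("lakeformation")
    current = lf.get_data_lake_settings()["DataLakeSettings"]
    lf.put_data_lake_settings(DataLakeSettings=with_admin(current, role_arn))
```

For example, add_admin("arn:aws:iam::111122223333:role/LakeFormationPrimaryAdmin"), where 111122223333 is a placeholder account ID.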

After we assign the Lake Formation administrator, we can assume this role and start managing our data lake.

  1. On the console, choose your user name and choose Switch Roles.

  2. Enter your primary account number and the role LakeFormationPrimaryAdmin.
  3. Choose Switch Role.

For detailed instructions on changing your role, see Switching to a role (console).
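If you prefer to work programmatically, the equivalent of switching roles is an AWS STS AssumeRole call. The sketch below assumes boto3 is installed, that your current identity is allowed to assume the role, and that the role was created without an IAM path; the function names and the session name are hypothetical.

```python
def role_arn(account_id: str, role_name: str) -> str:
    """Build the ARN of an IAM role that has no path prefix."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def session_for_role(account_id: str, role_name: str):
    """Assume the role and return a boto3 session that uses its credentials."""
    import boto3  # imported lazily so role_arn has no AWS dependency

    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn(account_id, role_name),
        RoleSessionName="lake-formation-admin",  # arbitrary session label
    )["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
```

Clients created from the returned session, for example session.client("lakeformation"), then act as the assumed role.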

Adding the Amazon S3 location as a storage layer

Now you’re the Lake Formation administrator. For Lake Formation to implement access control on the data lake, we need to include the Amazon S3 location as a storage layer. Let’s register our existing S3 bucket that contains sample data.

  1. On the Lake Formation console, in the navigation pane, under Register and Ingest, choose Data lake locations.
  2. For Amazon S3 path, choose Browse.
  3. Choose the S3 bucket in the primary account, referenced in the CloudFormation template outputs as S3BucketPrimary.
  4. Choose Register location.
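Registering the location can be sketched with the Lake Formation RegisterResource API as well. This example assumes boto3 and that the Lake Formation service-linked role is an acceptable choice for accessing the bucket; the function names are hypothetical, and the bucket name comes from the S3BucketPrimary stack output.

```python
def register_request(bucket_name: str) -> dict:
    """Build the RegisterResource arguments for a whole S3 bucket."""
    return {
        "ResourceArn": f"arn:aws:s3:::{bucket_name}",
        # Assumption: let Lake Formation use its service-linked role.
        "UseServiceLinkedRole": True,
    }


def register_bucket(bucket_name: str) -> None:
    """Register the bucket as a data lake location (run as the admin role)."""
    import boto3  # imported lazily so register_request has no AWS dependency

    boto3.client("lakeformation").register_resource(**register_request(bucket_name))
```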

Configuring access control

When you deploy the template, an AWS Glue crawler populates the Data Catalog with the database and table pointing to our S3 bucket. By default, Lake Formation adds IAMAllowedPrincipals permissions, which aren’t compatible with cross-account sharing. We must revoke them on our database and table. For this post, we use Lake Formation access control in conjunction with IAM. For more information, see Change Data Catalog Settings.

  1. On the Lake Formation console, in the navigation pane, under Data Catalog, choose Databases.
  2. Choose gluedatabaseprimary.
  3. Choose Edit.
  4. Deselect Use only IAM access control for new tables in this database.
  5. Choose Save.

  6. On the database details page, on the Actions menu, choose Revoke.
  7. For IAM users and roles, choose IAMAllowedPrincipals.
  8. For Database permissions, select Super.

  9. Choose Revoke.
  10. On the database details page, choose View Tables.
  11. Select the table that starts with lf_table.
  12. On the Actions menu, choose Revoke.
  13. For IAM users and roles, choose IAMAllowedPrincipals.
  14. For Table permissions, select Super.
  15. Choose Revoke.
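The two revocations above can be sketched with the RevokePermissions API. In the API, the IAMAllowedPrincipals group is addressed as IAM_ALLOWED_PRINCIPALS and the console's Super permission corresponds to ALL. The function names are hypothetical, and the full table name (the part after lf_table) must be looked up in your own catalog.

```python
IAM_ALLOWED_PRINCIPALS = {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"}


def revoke_requests(database: str, table: str) -> list:
    """Build RevokePermissions arguments for the database and one of its tables."""
    return [
        {
            "Principal": IAM_ALLOWED_PRINCIPALS,
            "Resource": {"Database": {"Name": database}},
            "Permissions": ["ALL"],  # shown as Super on the console
        },
        {
            "Principal": IAM_ALLOWED_PRINCIPALS,
            "Resource": {"Table": {"DatabaseName": database, "Name": table}},
            "Permissions": ["ALL"],
        },
    ]


def revoke_iam_allowed_principals(database: str, table: str) -> None:
    """Send both revocations (run as the Lake Formation admin role)."""
    import boto3  # imported lazily so revoke_requests has no AWS dependency

    lf = boto3.client("lakeformation")
    for request in revoke_requests(database, table):
        lf.revoke_permissions(**request)
```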

You can now see the metadata and Amazon S3 data location in the table details. The CloudFormation template ran an AWS Glue crawler that populated the table.

Granting permissions

Now we’re ready to grant permissions to the Business Analyst users. Because they’re in a separate AWS account, we need to share the database across accounts.

  1. On the Lake Formation console, under Data Catalog, choose Databases.
  2. Select our database.
  3. On the Actions menu, choose Grant.
  4. Select External account.
  5. For AWS account ID or AWS organization ID, enter the secondary account number.
  6. For Table, choose All tables.
  7. For Table permissions, select Select.
  8. For Grantable permissions, select Select.

Grantable permissions are required to allow the principal to pass this grant to other users and roles. For our use case, the secondary account LakeFormationAdministrator grants access to the secondary account BusinessAnalyst. If this permission is revoked on the primary account in the future, all access granted to BusinessAnalyst and LakeFormationAdministrator on the secondary account is also revoked.

For this post, we share the database with a single account. Lake Formation also allows sharing with an AWS organization.

  9. Choose Grant.
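As a rough programmatic counterpart to this grant, the sketch below builds a GrantPermissions request that gives an external account SELECT on all tables of a database, with the grant option. It assumes boto3; the function names are hypothetical, and the account ID is a placeholder for the secondary account number.

```python
def share_all_tables_request(database: str, account_id: str) -> dict:
    """Build GrantPermissions arguments: SELECT on every table, grantable."""
    return {
        # A 12-digit account ID is used as the principal for cross-account grants.
        "Principal": {"DataLakePrincipalIdentifier": account_id},
        "Resource": {"Table": {"DatabaseName": database, "TableWildcard": {}}},
        "Permissions": ["SELECT"],
        # Lets the receiving account's administrator pass the grant on.
        "PermissionsWithGrantOption": ["SELECT"],
    }


def share_all_tables(database: str, account_id: str) -> None:
    """Send the grant (run as the Lake Formation admin in the primary account)."""
    import boto3  # imported lazily so the builder has no AWS dependency

    boto3.client("lakeformation").grant_permissions(
        **share_all_tables_request(database, account_id)
    )
```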

Sharing specific tables across accounts

Optionally, instead of sharing the whole database, you can share specific tables across accounts. You don’t need to share the database to share a table underneath it.

  1. On the Lake Formation console, under Data Catalog, choose Tables.
  2. Select the table that starts with lf_table.
  3. On the Actions menu, choose Grant.
  4. Select External account.
  5. For AWS account ID or AWS organization ID, enter the secondary account number.

You can also choose specific columns to share with the secondary account. For this post, we share five columns.

  6. For Columns, choose Include columns.
  7. For Include columns, choose the following columns:
    1. vendorid
    2. lpep_pickup_datetime
    3. lpep_dropoff_datetime
    4. store_and_forward_flag
    5. ratecodeid
  8. For Table permissions, select Select.
  9. For Grantable permissions, select Select.
  10. Choose Grant.
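A column-level share like this one maps to a GrantPermissions request on a TableWithColumns resource. The sketch below assumes boto3; the function names are hypothetical, and the table name and column list are parameters you supply from your own catalog.

```python
def share_columns_request(database: str, table: str, columns: list, account_id: str) -> dict:
    """Build GrantPermissions arguments: grantable SELECT on selected columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": account_id},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),  # only these columns are shared
            }
        },
        "Permissions": ["SELECT"],
        "PermissionsWithGrantOption": ["SELECT"],
    }


def share_columns(database: str, table: str, columns: list, account_id: str) -> None:
    """Send the grant (run as the Lake Formation admin in the primary account)."""
    import boto3  # imported lazily so the builder has no AWS dependency

    boto3.client("lakeformation").grant_permissions(
        **share_columns_request(database, table, columns, account_id)
    )
```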

Setting up Lake Formation in the secondary account

Now that the primary account setup is complete, let’s configure the secondary account. We access the resource share and create appropriate resource links, pointing to the databases or tables in the primary account. This allows the data lake administrator to grant proper access to the Business Analyst team, which queries the data through Athena. The following diagram illustrates this architecture.

Assigning a role to our data lake

Similar to the primary account, we need to assign an IAM role as the Lake Formation administrator. To better differentiate the roles, this one is named LakeFormationSecondaryAdmin.

  1. On the Lake Formation console, under Permissions, choose Admins and database creators.
  2. Under Data Lake Administrators, choose Grant.
  3. In the pop-up window, choose LakeFormationSecondaryAdmin.
  4. Choose Save.
  5. On the console, switch to the LakeFormationSecondaryAdmin role.

Sharing resources

Lake Formation shares resources (databases and tables) by using AWS Resource Access Manager (AWS RAM). AWS RAM provides a streamlined way to share resources across AWS accounts and also integrates with AWS Organizations. If both primary and secondary accounts are in the same organization with resource sharing enabled, resource shares are accepted automatically and you can skip this step. If not, complete the following steps:

  1. On the AWS RAM console, in the navigation pane, under Shared with me, choose Resource shares.
  2. Choose the Lake Formation share.
  3. Choose Accept resource share.

The resource status switches to Active.
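Accepting the share can also be sketched with the AWS RAM API. This example assumes boto3 and credentials for the secondary account; the function names are hypothetical. Note that it accepts every pending invitation and ignores pagination, so in practice you would likely filter by resourceShareName or senderAccountId first.

```python
def pending_invitation_arns(invitations: list) -> list:
    """Return the ARNs of invitations that are still waiting for a response."""
    return [
        inv["resourceShareInvitationArn"]
        for inv in invitations
        if inv["status"] == "PENDING"
    ]


def accept_pending_invitations() -> list:
    """Accept all pending resource share invitations and return their ARNs."""
    import boto3  # imported lazily so the filter above has no AWS dependency

    ram = boto3.client("ram")
    invitations = ram.get_resource_share_invitations()["resourceShareInvitations"]
    arns = pending_invitation_arns(invitations)
    for arn in arns:
        ram.accept_resource_share_invitation(resourceShareInvitationArn=arn)
    return arns
```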

Creating a resource link

With the share accepted, we can create a resource link in the secondary account. Resource links are Data Catalog virtual objects that link to a shared database or table. The resource link lives in your account and the referenced object it points to can be anywhere else.

  1. On the Lake Formation console, under Data Catalog, choose Databases.
  2. Choose Create database.
  3. Select Resource link.
  4. For Resource link name, enter a name, such as lf-primary-database-rl.
  5. For Shared database, choose gluedatabaseprimary.

The shared database’s owner ID is populated automatically.

  6. Choose Create.
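In API terms, a database resource link is a Data Catalog database whose definition carries a TargetDatabase pointing at the shared database. The following sketch assumes boto3 and uses the AWS Glue CreateDatabase call; the function names are hypothetical, and the owner account ID is a placeholder for the primary account number.

```python
def resource_link_input(link_name: str, owner_account_id: str, shared_database: str) -> dict:
    """Build the DatabaseInput for a resource link to a shared database."""
    return {
        "Name": link_name,
        "TargetDatabase": {
            "CatalogId": owner_account_id,  # account that owns the shared database
            "DatabaseName": shared_database,
        },
    }


def create_resource_link(link_name: str, owner_account_id: str, shared_database: str) -> None:
    """Create the link in the local catalog (run in the secondary account)."""
    import boto3  # imported lazily so the builder has no AWS dependency

    boto3.client("glue").create_database(
        DatabaseInput=resource_link_input(link_name, owner_account_id, shared_database)
    )
```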

You can use this resource link the same way you use database or table references in Lake Formation. The following screenshot shows the resource link listed on the Databases page.

Granting permissions

As the creator of the resource link, at this point only you (IAM role LakeFormationSecondaryAdmin) can view and access this object in the Data Catalog. To grant visibility on the resource link to our Business Analyst users (IAM role LakeFormationSecondaryAnalyst), we need to grant them the Describe permission.

  1. On the Lake Formation console, navigate to the database details page.
  2. On the Actions menu, choose Grant.
  3. For IAM users and roles, choose LakeFormationSecondaryAnalyst.
  4. For Resource Link permissions, select Describe and deselect Super.
  5. Choose Grant.

Granting permissions on a resource link doesn’t grant permissions on the target (linked) database or table, so let’s do it now. For our use case, the analysts only need SQL SELECT capabilities, and only to the specific columns of the table.

  1. In the navigation pane, under Data Catalog, choose Databases.
  2. Select lf-primary-database-rl.
  3. On the Actions menu, choose Grant on Target.
  4. In the Grant permissions dialog box, choose My account.
  5. For IAM users and roles, choose LakeFormationSecondaryAnalyst.
  6. Choose the table that starts with lf_table.
  7. Under Columns, select Include Columns and select the first five columns.
  8. For Table permissions, select Select.
  9. Choose Grant.
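The difference between the two grants in this section can be made concrete with a sketch of the corresponding GrantPermissions requests. The first targets the resource link in the local catalog; the second targets the shared table and therefore names the primary account as CatalogId. This assumes boto3; the function names, account ID, and table name are hypothetical placeholders.

```python
def describe_link_request(link_name: str, principal_arn: str) -> dict:
    """DESCRIBE on the resource link, which lives in the local catalog."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Database": {"Name": link_name}},
        "Permissions": ["DESCRIBE"],
    }


def select_on_target_request(owner_account_id: str, database: str, table: str,
                             columns: list, principal_arn: str) -> dict:
    """SELECT on chosen columns of the shared table in the owner's catalog."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "CatalogId": owner_account_id,  # primary account, not the local one
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def grant(request: dict) -> None:
    """Send one grant (run as LakeFormationSecondaryAdmin)."""
    import boto3  # imported lazily so the builders have no AWS dependency

    boto3.client("lakeformation").grant_permissions(**request)
```

Neither request sets PermissionsWithGrantOption, because the analysts only need to query the data, not pass the access on.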

Accessing the data

With all the Lake Formation grants in place, the users are ready to access the data at the proper level.

  1. In the secondary account, switch to the role LakeFormationSecondaryAnalyst.
  2. On the Athena console, choose Get Started.
  3. On the selection bar, under Workgroup, choose LakeFormationCrossAccount.
  4. Choose Switch workgroup.

The screen refreshes; make sure you are in the right workgroup.

To use Lake Formation cross-account access, you don’t need a separate Athena workgroup. For this post, the CloudFormation template created one to simplify deployment with the proper Athena configuration.

  5. For Data source, choose AwsDataCatalog.
  6. For Database, choose lf-primary-database-rl.
  7. For Tables, choose lf_table_<string>.
  8. On the menu, choose Preview table.

  9. Choose Run query.
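The same preview can be started through the Athena API. The sketch below assumes boto3, credentials for the LakeFormationSecondaryAnalyst role, and a workgroup that already defines a query result location; the function names are hypothetical. The database name is double-quoted in the SQL because it contains hyphens.

```python
def preview_query(database: str, table: str, limit: int = 10) -> str:
    """Build a preview query with quoted identifiers."""
    return f'SELECT * FROM "{database}"."{table}" LIMIT {limit}'


def start_preview(database: str, table: str, workgroup: str) -> str:
    """Start the query in the given workgroup and return its execution ID."""
    import boto3  # imported lazily so preview_query has no AWS dependency

    response = boto3.client("athena").start_query_execution(
        QueryString=preview_query(database, table),
        QueryExecutionContext={"Catalog": "AwsDataCatalog", "Database": database},
        WorkGroup=workgroup,
    )
    return response["QueryExecutionId"]
```

The returned ID can then be passed to get_query_execution and get_query_results to check the status and read the rows.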

You now have a data analyst on the secondary account with access to an S3 bucket in the primary account. The analyst only has access to the five columns we specified earlier.

Data access that is granted by Lake Formation cross-account access is logged in the secondary account AWS CloudTrail log file, and Lake Formation copies the event to the primary account’s log file. For more information and examples of logging messages, see Cross-Account CloudTrail Logging.

Cleaning up

To avoid incurring future charges, delete the CloudFormation stacks in both accounts after you finish testing the solution.

Conclusion

In this post, we went through the process of configuring Lake Formation to share AWS Glue Data Catalog metadata information across AWS accounts.

Large enterprises typically use multiple AWS accounts, and many of those accounts might need access to a data lake managed by a single AWS account. AWS Lake Formation with cross-account access set up enables you to run queries and jobs that can join and query tables across multiple accounts.


About the Authors

Rafael Suguiura is a Principal Solutions Architect at Amazon Web Services. He guides some of the world’s largest financial services companies in their cloud journey. When the weather is nice, he enjoys cycling and finding new hiking trails—and when it’s not, he catches up with sci-fi books, TV series, and video games.

Himanish Kushary is a Senior Big Data Architect at Amazon Web Services. He helps customers across multiple domains build scalable big data analytics platforms. He enjoys playing video games, and watching good movies and TV series.