Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation
Modern data lakes span multiple accounts, AWS Regions, and lines of business within an organization. Companies also have employees and do business across multiple geographic regions, sometimes worldwide. Their data solution must let them share and access data securely across Regions.
The AWS Glue Data Catalog and AWS Lake Formation recently announced support for cross-Region table access. This feature lets users query AWS Glue databases and tables in one Region from another Region using resource links, without copying the metadata in the Data Catalog or the data in Amazon Simple Storage Service (Amazon S3). A resource link is a Data Catalog object that is a link to a database or table.
The AWS Glue Data Catalog is a centralized repository of technical metadata that holds the information about your datasets in AWS, and can be queried using AWS analytics services such as Amazon Athena, Amazon EMR, and AWS Glue for Apache Spark. The Data Catalog is localized to every Region in an AWS account, requiring users to replicate the metadata and the source data in S3 buckets for cross-Region queries. With the newly launched feature for cross-Region table access, you can create a resource link in any Region pointing to a database or table of the source Region. With the resource link in the local Region, you can query the source Region’s tables from Athena, Amazon EMR, and AWS Glue ETL in the local Region.
You can use the cross-Region table access feature of the Data Catalog in combination with the permissions management and cross-account sharing capability of Lake Formation. Lake Formation is a fully managed service that makes it easy to build, secure, and manage data lakes. By using cross-Region access support for Data Catalog, together with governance provided by Lake Formation, organizations can discover and access data across Regions without spending time making copies. Some businesses might have restrictions to run their compute in certain Regions. Organizations that need to share their Data Catalog with businesses that have such restrictions can now create and share cross-Region resource links.
In this post, we walk you through configuring cross-Region database and table access in two scenarios. In the first scenario, we go through an example where a customer wants to access an AWS Glue database in Region A from Region B in the same account. In scenario two, we demonstrate cross-account and cross-Region access where a customer wants to share a database in Region A across accounts and access it from Region B of the recipient account.
Scenario 1: Same account use case
In this scenario, we walk you through the steps required to share a Data Catalog database from one Region to another Region within the same AWS account. For our illustrations, we have a sample dataset in an S3 bucket in the us-east-2 Region and have used an AWS Glue crawler to crawl and catalog the dataset into a database in the Data Catalog of the us-east-2 Region. We share this dataset to the us-west-2 Region. You can use any of your datasets to follow along. The following diagram illustrates the architecture for cross-Region sharing within the same AWS account.
To set up cross-Region sharing of a Data Catalog database for scenario 1, we recommend the following prerequisites:
- An AWS account that is not used for production use cases.
- Lake Formation set up already in the account and a Lake Formation administrator role or a similar role to follow along with the instructions in this post. For example, we are using a data lake administrator role called LF-Admin. The LF-Admin role also has the AWS Identity and Access Management (IAM) permission iam:PassRole on the AWS Glue crawler role. To learn more about setting up permissions for a data lake administrator, see Create a data lake administrator.
- A sample database in the Data Catalog with a few tables. For example, our sample database is called salesdb_useast2 and has a set of eight tables, as shown in the following screenshot.
Set up permissions for us-east-2
Complete the following steps to configure permissions in the us-east-2 Region:
- Log in to the Lake Formation console and choose the Region where your database resides. In our example, it is us-east-2.
- Grant SELECT and DESCRIBE permissions to the LF-Admin role on all tables of the database salesdb_useast2.
- You can confirm if permissions are working by querying the database and tables as the data lake administrator role from Athena.
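The console grant above can also be scripted. The following is a minimal sketch, assuming boto3 and the Lake Formation GrantPermissions API; the account ID 111122223333 in the role ARN is a placeholder, and the helper names are hypothetical. The request is built as a plain dictionary so you can inspect it before sending anything to AWS.

```python
"""Sketch: grant SELECT and DESCRIBE on all tables of a database via Lake Formation."""


def build_all_tables_grant(principal_arn, database):
    """Build a GrantPermissions request that covers every table in `database`."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        # An empty TableWildcard targets all tables in the database.
        "Resource": {"Table": {"DatabaseName": database, "TableWildcard": {}}},
        "Permissions": ["SELECT", "DESCRIBE"],
    }


def grant_in_region(request, region):
    """Send the grant to Lake Formation in the Region that owns the database."""
    import boto3  # imported lazily; building the request needs no AWS SDK

    client = boto3.client("lakeformation", region_name=region)
    return client.grant_permissions(**request)


# Placeholder account ID; replace with your own role ARN.
request = build_all_tables_grant(
    "arn:aws:iam::111122223333:role/LF-Admin", "salesdb_useast2"
)
# grant_in_region(request, "us-east-2")  # uncomment to apply with real credentials
```

The grant is issued against the us-east-2 endpoint because that is where the source database lives.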
Set up permissions for us-west-2
Complete the following steps to configure permissions in the us-west-2 Region:
- Choose the us-west-2 Region on the Lake Formation console.
- Add LF-Admin as a data lake administrator and grant Create database permission to LF-Admin.
- In the navigation pane, under Data catalog, select Databases.
- Choose Create database and select Resource link.
- Enter rl_salesdb_from_useast2 as the name for the resource link.
- For Shared database’s region, choose US East (Ohio).
- For Shared database, choose salesdb_useast2.
- Choose Create.
This creates a database resource link in us-west-2 pointing to the database in us-east-2.
On the Databases page, the Shared resource owner region column shows us-east-2 for the resource link.
Because the LF-Admin role created the resource link rl_salesdb_from_useast2, the role has implicit permissions on the resource link. LF-Admin already has permissions to query the tables in the us-east-2 Region, so there is no need to add a Grant on target permission for LF-Admin. If you are granting permission to another user or role, you need to grant Describe permissions on the resource link rl_salesdb_from_useast2.
- Query the database using the resource link in Athena as LF-Admin.
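The resource link created in the console steps above can be sketched programmatically. This example assumes boto3 and the AWS Glue CreateDatabase API, whose TargetDatabase structure accepts a Region field; the account ID 111122223333 is a placeholder (in the same-account scenario it is your own account ID), and the helper names are hypothetical.

```python
"""Sketch: create a cross-Region database resource link with the Glue API."""


def build_database_resource_link(link_name, source_catalog_id,
                                 source_database, source_region):
    """Build a CreateDatabase request whose target lives in another Region."""
    return {
        "DatabaseInput": {
            "Name": link_name,
            "TargetDatabase": {
                "CatalogId": source_catalog_id,   # account that owns the source
                "DatabaseName": source_database,
                "Region": source_region,          # Region of the source catalog
            },
        }
    }


def create_link(request, local_region):
    """Create the link in the Region you want to query from."""
    import boto3  # imported lazily; building the request needs no AWS SDK

    glue = boto3.client("glue", region_name=local_region)
    return glue.create_database(**request)


link_request = build_database_resource_link(
    link_name="rl_salesdb_from_useast2",
    source_catalog_id="111122223333",  # placeholder account ID
    source_database="salesdb_useast2",
    source_region="us-east-2",
)
# create_link(link_request, "us-west-2")  # uncomment to apply with real credentials
```

Note that the client targets us-west-2, the local Region, while the Region field inside TargetDatabase names the source Region.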
In the preceding steps, we saw how to create a resource link in us-west-2 for a Data Catalog database in us-east-2. You can also create a resource link to the source database in any additional Region where the Data Catalog is available. You can run extract, transform, and load (ETL) scripts in Amazon EMR and AWS Glue by providing the additional Region parameter when referring to the database and table. See the API documentation for GetTable() and GetDatabase() for additional details.
Also, Data Catalog permissions for the database, tables, and resource links and the underlying Amazon S3 data permissions can be managed by IAM policies and S3 bucket policies instead of Lake Formation permissions. For more information, see Identity and access management for AWS Glue.
Scenario 2: Cross-account use case
In this scenario, we walk you through the steps required to share a Data Catalog database from one Region to another Region between two accounts: a producer account and a consumer account. To show an advanced use case, we host the source dataset in us-east-2 of account A and crawl it using an AWS Glue crawler in the Data Catalog in us-east-1. The data lake administrator in account A then shares the database and tables to account B using Lake Formation permissions. The data lake administrator in account B accepts the share in us-east-1 and creates resource links to query the tables from eu-west-1. The following diagram illustrates the architecture for cross-Region sharing between producer account A and consumer account B.
To set up cross-Region sharing of a Data Catalog database for scenario 2, we recommend the following prerequisites:
- Two AWS accounts that are not used for production use cases
- Lake Formation administrator roles in both accounts
- Lake Formation set up in both accounts with cross-account sharing version 3. For more details, refer to the documentation.
- A sample database in the Data Catalog with a few tables
For our example, we continue to use the same dataset and the data lake administrator role LF-Admin for scenario 2.
Set up account A for cross-Region sharing
To set up account A, complete the following steps:
- Sign in to the AWS Management Console as the data lake administrator role.
- Register the S3 bucket in Lake Formation in us-east-1 with an IAM role that has access to the S3 bucket. See registering your S3 location for instructions.
- Set up and run an AWS Glue crawler to catalog the data in the us-east-2 S3 bucket to a Data Catalog database in us-east-1. Refer to AWS Glue crawlers support cross-account crawling to support data mesh architecture for instructions.
The database, as shown in the following screenshot, has a set of eight tables.
- Grant SELECT and DESCRIBE along with grantable permissions on all tables of the database to account B.
- Grant DESCRIBE with grantable permissions on the database.
- Verify the granted permissions on the Data permissions page.
- Log out of account A.
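The two cross-account grants in the steps above can be sketched as follows. This assumes boto3 and the Lake Formation GrantPermissions API, where the principal for a cross-account share is the recipient's 12-digit account ID. The account ID 444455556666 for account B is a placeholder, and the helper names are hypothetical.

```python
"""Sketch: share a database and its tables with another account, with grant option."""


def build_cross_account_grants(consumer_account_id, database):
    """Return the database-level and table-level grants for the consumer account."""
    principal = {"DataLakePrincipalIdentifier": consumer_account_id}
    database_grant = {
        "Principal": principal,
        "Resource": {"Database": {"Name": database}},
        "Permissions": ["DESCRIBE"],
        # Grantable permissions let account B's administrator re-grant locally.
        "PermissionsWithGrantOption": ["DESCRIBE"],
    }
    tables_grant = {
        "Principal": principal,
        "Resource": {"Table": {"DatabaseName": database, "TableWildcard": {}}},
        "Permissions": ["SELECT", "DESCRIBE"],
        "PermissionsWithGrantOption": ["SELECT", "DESCRIBE"],
    }
    return [database_grant, tables_grant]


def apply_grants(requests, region):
    """Issue each grant in the Region where the shared database is cataloged."""
    import boto3  # imported lazily; building the requests needs no AWS SDK

    client = boto3.client("lakeformation", region_name=region)
    for request in requests:
        client.grant_permissions(**request)


grants = build_cross_account_grants("444455556666", "useast2data_salesdb")
# apply_grants(grants, "us-east-1")  # uncomment to apply as account A's administrator
```

Both grants are issued in us-east-1, because that is where account A cataloged the database, even though the underlying data sits in us-east-2.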
Set up account B for cross-Region sharing
To set up account B, complete the following steps:
- Sign in as the data lake administrator on the Lake Formation console in us-east-1.
In our example, we have created the data lake administrator role LF-Admin, similar to previous administrator roles in account A and scenario 1.
- On the AWS Resource Access Manager (AWS RAM) console, review and accept the AWS RAM invites corresponding to the shared database and tables from account A.
The LF-Admin role can see the shared database useast2data_salesdb from the producer account. LF-Admin has access to the database and tables and so doesn’t need additional permissions on the shared database.
- You can grant DESCRIBE on the database and SELECT on All_Tables permissions to any additional IAM principals from the us-east-1 Region on this shared database.
- Open the Lake Formation console in eu-west-1 (or any Region where you have Lake Formation and Athena already set up).
- Choose Create database and create a resource link named rl_useast1db_crossaccount, pointing to the shared database useast2data_salesdb in us-east-1.
You can choose any Region on the Shared database’s region drop-down menu and choose the databases from those Regions.
Because we’re using the data lake administrator role LF-Admin, we can see all databases from all Regions in the consumer account’s Data Catalog. A data lake user with restricted permissions will be able to see only those databases for which they have permissions.
- Because LF-Admin created the resource link, this role has permissions to use the resource link rl_useast1db_crossaccount. For additional IAM principals, grant DESCRIBE permissions on the database resource link rl_useast1db_crossaccount.
- You can now query the database and tables from Athena.
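Querying through the resource link from Athena can also be scripted. The following is a minimal sketch, assuming boto3 and the Athena StartQueryExecution API; the table name customers and the S3 output location are placeholders that do not come from this post, and the helper names are hypothetical.

```python
"""Sketch: query a table through a cross-Region resource link with Athena."""


def build_athena_query(resource_link, table, output_location, limit=10):
    """Build a StartQueryExecution request that reads via the resource link."""
    return {
        # The resource link name is used exactly like a local database name.
        "QueryString": f'SELECT * FROM "{resource_link}"."{table}" LIMIT {int(limit)}',
        "QueryExecutionContext": {"Database": resource_link},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request, region):
    """Start the query in the Region where the resource link was created."""
    import boto3  # imported lazily; building the request needs no AWS SDK

    athena = boto3.client("athena", region_name=region)
    return athena.start_query_execution(**request)["QueryExecutionId"]


query = build_athena_query(
    resource_link="rl_useast1db_crossaccount",
    table="customers",                                   # placeholder table name
    output_location="s3://amzn-s3-demo-bucket/athena/",  # placeholder bucket
)
# run_query(query, "eu-west-1")  # uncomment to run with real credentials
```

The query runs in eu-west-1, the Region of the resource link, and Athena resolves the link to the shared database and its data.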
Cross-Region queries involve Amazon S3 data transfer by the analytics services, such as Athena, Amazon EMR, and AWS Glue ETL. As a result, cross-Region queries can be slower and will incur higher transfer costs compared to queries in the same Region. Some analytics services such as AWS Glue jobs and Amazon EMR may require internet access when accessing cross-Region data from Amazon S3, depending on your VPC setup. Refer to Considerations and limitations for more considerations.
In this post, you saw examples of how to set up cross-Region resource links for a database in the same account and across two accounts. You also saw how to use cross-Region resource links to query in Athena. You can share selected tables from a database instead of sharing an entire database. With cross-Region sharing, you can create a resource link for the table using the Create table option.
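A table-level resource link can be sketched in the same way as a database-level one. This example assumes boto3 and the AWS Glue CreateTable API, whose TargetTable structure accepts a Region field. The local database name, link name, source table name, and account ID are all placeholders for illustration, and the local database is assumed to exist already.

```python
"""Sketch: create a cross-Region resource link to a single table."""


def build_table_resource_link(local_database, link_name, source_catalog_id,
                              source_database, source_table, source_region):
    """Build a CreateTable request whose target table lives in another Region."""
    return {
        "DatabaseName": local_database,  # existing database in the local Region
        "TableInput": {
            "Name": link_name,
            "TargetTable": {
                "CatalogId": source_catalog_id,
                "DatabaseName": source_database,
                "Name": source_table,
                "Region": source_region,
            },
        },
    }


def create_table_link(request, local_region):
    """Create the table link in the Region you want to query from."""
    import boto3  # imported lazily; building the request needs no AWS SDK

    glue = boto3.client("glue", region_name=local_region)
    return glue.create_table(**request)


table_link = build_table_resource_link(
    local_database="local_links_db",   # placeholder local database
    link_name="rl_customers",          # placeholder link name
    source_catalog_id="111122223333",  # placeholder account ID
    source_database="salesdb_useast2",
    source_table="customers",          # placeholder table name
    source_region="us-east-2",
)
# create_table_link(table_link, "us-west-2")  # uncomment to apply with real credentials
```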
There are two key things to remember when using the cross-Region table access feature:
- Grant permissions on the source database or table from its source Region.
- Grant permissions on the resource link from the Region it was created in.
That is, the original shared database or table is always available in the source Region, and resource links are created and shared in their local Region.
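The two rules above amount to routing each grant to the right Regional endpoint. The following sketch, assuming boto3 and the Lake Formation GrantPermissions API, pairs each request with the Region it must be sent to. The DataAnalyst role and account ID are hypothetical, as are the helper names.

```python
"""Sketch: route source grants and resource link grants to their own Regions."""


def plan_grants(principal_arn, source_region, source_database,
                link_region, link_name):
    """Return (region, request) pairs: one for the source, one for the link."""
    principal = {"DataLakePrincipalIdentifier": principal_arn}
    source_grant = {
        "Principal": principal,
        "Resource": {"Table": {"DatabaseName": source_database, "TableWildcard": {}}},
        "Permissions": ["SELECT", "DESCRIBE"],
    }
    link_grant = {
        "Principal": principal,
        "Resource": {"Database": {"Name": link_name}},
        "Permissions": ["DESCRIBE"],
    }
    # Source permissions go to the source Region; link permissions to the link's Region.
    return [(source_region, source_grant), (link_region, link_grant)]


def apply_plan(plan):
    """Send each grant to the Lake Formation endpoint of its Region."""
    import boto3  # imported lazily; building the plan needs no AWS SDK

    for region, request in plan:
        client = boto3.client("lakeformation", region_name=region)
        client.grant_permissions(**request)


plan = plan_grants(
    principal_arn="arn:aws:iam::111122223333:role/DataAnalyst",  # hypothetical role
    source_region="us-east-2",
    source_database="salesdb_useast2",
    link_region="us-west-2",
    link_name="rl_salesdb_from_useast2",
)
# apply_plan(plan)  # uncomment to apply with real credentials
```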
To get started, see Accessing tables across Regions. Share your comments on the post or contact your AWS account team for more details.
About the author
Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She likes building data lake solutions for AWS customers and partners. When not on the keyboard, she explores the latest science and technology trends and spends time with her family.