AWS Big Data Blog

Create AWS Glue Data Catalog views using cross-account definer roles

With AWS Glue Data Catalog views you can create a SQL view in the Data Catalog that references one or more base tables. These multi-dialect views support various SQL query engines, providing consistent access across multiple Amazon Web Services (AWS) services including Amazon Athena, Amazon Redshift Spectrum, and Apache Spark in both Amazon EMR and AWS Glue 5.0.

You can now create Data Catalog views using a cross-account AWS Identity and Access Management (IAM) definer role. A definer role is an IAM role used to create the Data Catalog view and has SELECT permissions on all columns of the underlying base tables. This definer role is assumed by AWS Glue and AWS Lake Formation service principals to vend credentials to the base tables’ data whenever the view is queried. The definer role allows the Data Catalog view to be shared to principals or AWS accounts so that you can share a filtered subset of data without sharing the base tables.

Previously, Data Catalog views required a definer role within the same AWS account as the base tables. The introduction of cross-account definer roles enables Data Catalog view creation in enterprise data mesh architectures. In this setup, database and table metadata is centralized in a governance account, and individual data owner accounts maintain control over table creation and management through their IAM roles. Data owner accounts can now create and manage Data Catalog views in the central governance accounts using their existing continuous integration and continuous delivery (CI/CD) pipeline roles.

In this post, we show you a cross-account scenario involving two AWS accounts: a central governance account containing the tables and hosting the views, and a data owner (producer) account with the IAM role used to create and manage views. We provide implementation details for both the SPARK dialect, using AWS SDK code samples, and the ATHENA dialect, using SQL commands. Using this approach, you can implement sophisticated data governance models at enterprise scale while maintaining operational efficiency across your AWS environment.

Key benefits

Key benefits for cross-account definer roles are as follows:

  • Enhanced data mesh support – Enterprises with multi-account data lakehouse architectures can now maintain their existing operational model where data owner accounts manage table creation and updates using their established IAM roles. These same roles can now create and manage Data Catalog views across account boundaries.
  • Strengthened security controls – By keeping table and view management within data owner account roles:
    • Security posture is enhanced through proper separation of duties.
    • Audit trails become more comprehensive and meaningful.
    • Access controls follow the principle of least privilege.
  • Elimination of data duplication – Data owner accounts can create views in central accounts that:
    • Provide access to specific data subsets without duplicating tables.
    • Reduce storage costs and management overhead.
    • Maintain a single source of truth while enabling targeted data sharing.

Solution overview

An example customer has a database with two transaction tables in their central account, where the catalog and permissions are maintained. With the database shared to the data owner (producer) account, we create a Data Catalog view in the central account on these two tables, using the producer's definer role. The view in the central account can then be shared to additional consumer accounts and queried. We illustrate creating the SPARK dialect using the create-table CLI command, then add the ATHENA dialect for the same view from the Athena console. We also provide AWS SDK sample code for CreateTable() and UpdateTable() with the view definition, and a sample PySpark script to read and verify the view in AWS Glue.

The following diagram shows the table, view, and definer IAM role placements between a central governance account and data producer account.

Prerequisites

To implement this solution, you need the following prerequisites:

  1. Two AWS accounts with AWS Lake Formation set up. For details, refer to Set up AWS Lake Formation. The Lake Formation setup includes registering your IAM admin role as Lake Formation administrator. In the Data Catalog settings, shown in the following screenshot, Default permissions for newly created databases and tables is set to use Lake Formation permissions only. Cross-account version settings is set to Version 4.

  2. Create an IAM role Data-Analyst in the producer account. For the IAM permissions on this role, refer to Data analyst permissions. This role will also be used as the view definer role. Add to this definer role the permissions listed in Prerequisites for creating views.

Create database and tables in the central account

In this step, you create two tables in the central governance account and populate them with a few rows of data:

  1. Sign in to the central account as the admin user. Open the Athena console and set up the Athena query results bucket.
  2. Run the following queries to create two sample Iceberg tables, representing bank customer transactions data:
/* Check if the Database exists, if not create new database. */
CREATE DATABASE IF NOT EXISTS bankdata_icebergdb;

/* Create transaction_table1. Replace the bucket name with your own. */
CREATE TABLE bankdata_icebergdb.transaction_table1 (
  transaction_id string,
  transaction_type string,
  transaction_amount double)
LOCATION 's3://<bucket-name>/bankdata_icebergdb/transaction-table1'
TBLPROPERTIES (
  'table_type'='iceberg',
  'write_compression'='zstd'
);

/*Create transaction_table2*/
CREATE TABLE bankdata_icebergdb.transaction_table2 (
  transaction_id string,
  transaction_location string,
  transaction_date date)
LOCATION 's3://<bucket-name>/bankdata_icebergdb/transaction-table2'
TBLPROPERTIES (
  'table_type'='iceberg',
  'write_compression'='zstd'
);


INSERT INTO bankdata_icebergdb.transaction_table1 (transaction_id, transaction_type, transaction_amount)
VALUES
  ('T001', 'purchase', 50.0),
  ('T002', 'purchase', 120.0),
  ('T003', 'refund', 200.5),
  ('T004', 'purchase', 80.0),
  ('T005', 'withdrawal', 500.0),
  ('T006', 'purchase', 300.0),
  ('T007', 'deposit', 1000.0),
  ('T008', 'refund', 20.0),
  ('T009', 'purchase', 150.0),
  ('T010', 'withdrawal', 75.0);


INSERT INTO bankdata_icebergdb.transaction_table2 (transaction_id, transaction_location, transaction_date)
VALUES
  ('T001', 'Charlotte', DATE '2024-10-01'),
  ('T002', 'Seattle', DATE '2024-10-02'),
  ('T003', 'Chicago', DATE '2024-10-03'),
  ('T004', 'Miami', DATE '2024-10-04'),
  ('T005', 'New York', DATE '2024-10-05'),
  ('T006', 'Austin', DATE '2024-10-06'),
  ('T007', 'Denver', DATE '2024-10-07'),
  ('T008', 'Boston', DATE '2024-10-08'),
  ('T009', 'San Jose', DATE '2024-10-09'),
  ('T010', 'Phoenix', DATE '2024-10-10');
  3. Verify the created tables in the Athena query editor by running a preview.
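The preview can also be scripted with the AWS SDK. The following boto3 sketch (our own illustration, not from the post; the helper names and output location are placeholders) submits a preview query against one of the tables and polls until Athena finishes:

```python
import time

def build_preview_query(database: str, table: str, limit: int = 10) -> str:
    # The same preview SQL the Athena console runs for a table.
    return f'SELECT * FROM "{database}"."{table}" LIMIT {limit}'

def preview_table(database: str, table: str, output_location: str) -> str:
    # Hypothetical helper: submits the preview and polls for completion.
    import boto3  # requires AWS credentials for the central account
    athena = boto3.client("athena")
    query_id = athena.start_query_execution(
        QueryString=build_preview_query(database, table),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(2)

# Example (requires credentials):
# preview_table("bankdata_icebergdb", "transaction_table1",
#               "s3://<bucket-name>/athena-query-results/")
```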

Share the database and tables from central to producer account

In the central governance account, you share the database and the two tables to the producer account and to the Data-Analyst role in the producer account.

  1. Sign in to the Lake Formation console as the Lake Formation admin role.
  2. In the navigation pane, choose Data permissions.
  3. Choose Grant and provide the following information:
    1. For Principals, select External accounts and enter the producer account ID, as shown in the following screenshot.
    2. For Named Data Catalog Resources, select the default catalog and database bankdata_icebergdb, as shown in the following screenshot.
    3. Under Database permissions, select Describe. For Grantable permissions, select Describe.
    4. Choose Grant.
    5. Repeat the preceding steps to grant the producer account definer role Data-Analyst access to the database bankdata_icebergdb and the two tables transaction_table1 and transaction_table2, as follows:
    6. Under Database permissions, grant Create table and Describe permissions.
    7. Under Table permissions, grant Select and Describe on all columns.

With these steps, the central governance account data steward has shared the database and tables to the producer account definer role.
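The same grants can be issued through the AWS SDK. The following boto3 sketch (our own illustration; the account IDs and role ARN are placeholders) assembles the grant_permissions call for the table-level grants described above:

```python
def build_table_grant(central_account_id: str, database: str,
                      table: str, principal_arn: str) -> dict:
    # One table-level grant: SELECT and DESCRIBE for the given principal.
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "Table": {
                "CatalogId": central_account_id,
                "DatabaseName": database,
                "Name": table,
            }
        },
        "Permissions": ["SELECT", "DESCRIBE"],
    }

def grant_tables_to_definer(central_account_id: str, definer_role_arn: str) -> None:
    # Grants the definer role access to both transaction tables.
    import boto3  # requires Lake Formation admin credentials in the central account
    lf = boto3.client("lakeformation")
    for table in ("transaction_table1", "transaction_table2"):
        lf.grant_permissions(
            **build_table_grant(central_account_id, "bankdata_icebergdb",
                                table, definer_role_arn)
        )

# Example (requires credentials):
# grant_tables_to_definer("<central-account-id>",
#                         "arn:aws:iam::<producer-account-id>:role/Data-Analyst")
```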

Steps for producer account

Follow these steps for the producer account:

  1. Sign in to the Lake Formation console on the producer account as the Lake Formation administrator.
  2. In the left navigation pane, choose Databases. A blue banner will appear on the console, showing pending invitations from AWS Resource Access Manager (AWS RAM).
  3. Open the AWS RAM console and review the AWS RAM shares under Shared with me. You will see the AWS RAM shares in a pending state. Select the pending AWS RAM share from the central account and choose Accept resource share. After the resource share request is accepted, the shared database shows up in the producer account.
  4. On the Lake Formation console, select the database. On the Create dropdown list, choose Resource link. Provide a name rl_bank_iceberg and choose Create.
  5. Grant the Data-Analyst role in the producer account Describe permission on the resource link:
    1. In the left navigation pane, choose Data permissions. Choose the Data-Analyst role. Select the resource link rl_bank_iceberg for the database as shown in the following screenshot.
    2. Grant Describe permission on the resource link.

Note: Cross-account Data Catalog views can’t be created using a resource link, although a resource link is needed when using the SDK to create the SPARK dialect.
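Creating the resource link can also be done programmatically. The following boto3 sketch (our own illustration; the helper names are ours) builds the DatabaseInput for a resource link that points at the shared database in the central catalog:

```python
def build_resource_link_input(link_name: str, central_account_id: str,
                              source_database: str) -> dict:
    # A resource link is a database entry whose TargetDatabase points
    # at the shared database in the central account's catalog.
    return {
        "Name": link_name,
        "TargetDatabase": {
            "CatalogId": central_account_id,
            "DatabaseName": source_database,
        },
    }

def create_resource_link(central_account_id: str) -> None:
    import boto3  # requires credentials in the producer account
    glue = boto3.client("glue")
    glue.create_database(
        DatabaseInput=build_resource_link_input(
            "rl_bank_iceberg", central_account_id, "bankdata_icebergdb"
        )
    )

# Example (requires credentials):
# create_resource_link("<central-account-id>")
```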

Next, add the central account Data Catalog as a data source in Athena from the producer account:

  1. Open the Athena console.
  2. On the left navigation pane, choose Data sources and catalogs. Choose Create data source.
    1. Select S3 - AWS Glue Data Catalog.
    2. Choose AWS Glue Data Catalog in another account and name the data source centraladmin.
    3. Choose Next, and then choose Create data source.

After the data source is created, navigate to the query editor and verify that the data source centraladmin appears, as shown in the following screenshot.

The definer role can also now access and query the central catalog database.
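Registering the cross-account data source can likewise be scripted. The following boto3 sketch (our own illustration; the description text is ours) builds the create_data_catalog request corresponding to the console steps above:

```python
def build_data_catalog_request(name: str, central_account_id: str) -> dict:
    # A GLUE-type Athena data catalog that points at another account's
    # Data Catalog through the catalog-id parameter.
    return {
        "Name": name,
        "Type": "GLUE",
        "Description": "Central governance account Data Catalog",
        "Parameters": {"catalog-id": central_account_id},
    }

def register_central_catalog(central_account_id: str) -> None:
    import boto3  # requires credentials in the producer account
    athena = boto3.client("athena")
    athena.create_data_catalog(
        **build_data_catalog_request("centraladmin", central_account_id)
    )

# Example (requires credentials):
# register_central_catalog("<central-account-id>")
```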

Create SPARK dialect view

In this step, you create a view with SPARK dialect, using AWS Glue CLI command create-table:

  1. Sign in to the AWS Management Console in the producer account as the Data-Analyst role. Enter the following command in your CLI environment, such as AWS CloudShell, to create a view with the SPARK dialect:
aws glue create-table --cli-input-json '{
   "DatabaseName": "rl_bank_iceberg",
   "TableInput": {
     "Name": "mdv_transaction1",
     "StorageDescriptor": {
       "Columns": [
         {
           "Name": "transaction_id",
           "Type": "string"
         },
         {
           "Name": "transaction_type",
           "Type": "string"
         },
         {
           "Name": "transaction_amount",
           "Type": "float"
         },
         {
           "Name": "transaction_location",
           "Type": "string"
         },
         {
           "Name": "transaction_date",
           "Type": "date"
         }
       ],
       "SerdeInfo": {}
     },
     "ViewDefinition": {
       "SubObjects": [
         "arn:aws:glue:<your-region>:<your-central-account-id>:table/bankdata_icebergdb/transaction_table1",
         "arn:aws:glue:<your-region>:<your-central-account-id>:table/bankdata_icebergdb/transaction_table2"
        ],
       "IsProtected": true,
       "Representations": [
         {
           "Dialect": "SPARK",
           "DialectVersion": "1.0",
           "ViewOriginalText": "SELECT t1.transaction_id, t1.transaction_type, t1.transaction_amount, t2.transaction_location, t2.transaction_date FROM transaction_table1 t1 JOIN transaction_table2 t2 ON t1.transaction_id = t2.transaction_id WHERE t1.transaction_amount > 100;",
           "ViewExpandedText": "SELECT t1.transaction_id, t1.transaction_type, t1.transaction_amount, t2.transaction_location, t2.transaction_date FROM transaction_table1 t1 JOIN transaction_table2 t2 ON t1.transaction_id = t2.transaction_id WHERE t1.transaction_amount > 100;"
         }
       ]
     }
   }
 }'
  2. Open the Lake Formation console and verify that the view was created. Check the dialect of the view on the SQL definitions tab in the view details.
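The same check can be made with the SDK. This boto3 sketch (our own; the helper names are illustrative) fetches the view through the resource link and lists which dialect representations it carries:

```python
def extract_dialects(table: dict) -> list:
    # Pull the dialect names from a get_table response's Table object.
    view = table.get("ViewDefinition", {})
    return [rep["Dialect"] for rep in view.get("Representations", [])]

def verify_view_dialects() -> list:
    import boto3  # requires the Data-Analyst role's credentials
    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName="rl_bank_iceberg",
                           Name="mdv_transaction1")["Table"]
    return extract_dialects(table)

# Example (requires credentials):
# print(verify_view_dialects())  # expect 'SPARK' among the dialects
```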

Add ATHENA dialect

To add ATHENA dialect, follow these steps:

  1. On the Athena console, select centraladmin from the Data source.
  2. Enter the following SQL script to create the ATHENA dialect for the same view:
ALTER VIEW mdv_transaction1 FORCE ADD DIALECT
AS
SELECT t1.transaction_id, t1.transaction_type, t1.transaction_amount, t2.transaction_location, t2.transaction_date FROM transaction_table1 t1 JOIN transaction_table2 t2 ON t1.transaction_id = t2.transaction_id WHERE t1.transaction_amount > 100

We can’t use the resource link rl_bank_iceberg in the Athena query editor to create or alter a view in the central account.

  3. Verify the added dialect by running a preview in Athena. To run the query, you can use either the resource link rl_bank_iceberg from the producer account catalog or the centraladmin catalog.

The following screenshot shows querying using the resource link of the database in the producer account catalog.

The following screenshot shows querying the view from the producer using the connected catalog centraladmin as the data source.

  4. Verify the dialects on the view by inspecting the table in the Lake Formation console.

You can now query the view as the Data-Analyst role in the producer account, using both Athena and Spark. The view is also visible in the central account to the Lake Formation admin.

You can also create the view with the ATHENA dialect first and then add the SPARK dialect. The SQL syntax to create the view in the ATHENA dialect is shown in the following example:

CREATE PROTECTED MULTI DIALECT VIEW mdv_transaction1
SECURITY DEFINER
AS
SELECT t1.transaction_id, t1.transaction_type, t1.transaction_amount, t2.transaction_location, t2.transaction_date FROM transaction_table1 t1 
JOIN transaction_table2 t2 
ON t1.transaction_id = t2.transaction_id 
WHERE t1.transaction_amount > 100;

The update-table CLI command to add the corresponding SPARK dialect is shown in the following example:

aws glue update-table --cli-input-json '{
    "DatabaseName": "rl_bankdatadb",
    "ViewUpdateAction": "ADD",
    "Force": true,
    "TableInput": {
        "Name": " mdv_transaction1",
        "StorageDescriptor": {
            "Columns": [
                {
                  "Name": "transaction_id",
                  "Type": "string"
                },
                {
                  "Name": "transaction_type",
                  "Type": "string"
                },
                {
                  "Name": "transaction_amount",
                  "Type": "float"
                },
                {
                  "Name": "transaction_location",
                  "Type": "string"
                },
                {
                  "Name": "transaction_date",
                  "Type": "date"
                }
             ],
             "SerdeInfo": {}
         },
         "ViewDefinition": {
         "SubObjects": [
               " "arn:aws:glue:<your-region>:<your-central-account-id>:table/bankdata_icebergdb/transaction_table1",
           "arn:aws:glue:<your-region>:<your-central-account-id>:table/bankdata_icebergdb/transaction_table2" 
],
         "IsProtected": true,
         "Representations": [
             {
                 "Dialect": "SPARK",
                 "DialectVersion": "1.0",
                 "ViewOriginalText": " SELECT t1.transaction_id, t1.transaction_type, t1.transaction_amount, t2.transaction_location, t2.transaction_date FROM transaction_table1 t1 JOIN transaction_table2 t2 ON t1.transaction_id = t2.transaction_id WHERE t1.transaction_amount > 100",
                 "ViewExpandedText": " SELECT t1.transaction_id, t1.transaction_type, t1.transaction_amount, t2.transaction_location, t2.transaction_date FROM transaction_table1 t1 JOIN transaction_table2 t2 ON t1.transaction_id = t2.transaction_id WHERE t1.transaction_amount > 100"
              }
           ]
        }
    }
}'

The following is a sample Python script to create a SPARK dialect view: glueview-createtable.py.

The following code block is a sample AWS Glue extract, transform, and load (ETL) script to access the SPARK dialect of the view from AWS Glue 5.0 in the central account. The AWS Glue job execution role should have the Lake Formation SELECT permission on the AWS Glue view:

from pyspark.context import SparkContext
from pyspark.sql import SparkSession

aws_region = "<your-region>"
aws_account_id = "<your-central-account-id>"
local_catalogname = "spark_catalog"
warehouse_path = "s3://<your-bucket-name>/bankdata_icebergdb/transaction-table1"

spark = SparkSession.builder.appName('query_glue_view') \
    .config('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .config(f'spark.sql.catalog.{local_catalogname}', 'org.apache.iceberg.spark.SparkSessionCatalog') \
    .config(f'spark.sql.catalog.{local_catalogname}.catalog-impl', 'org.apache.iceberg.aws.glue.GlueCatalog') \
    .config(f'spark.sql.catalog.{local_catalogname}.client.region', aws_region) \
    .config(f'spark.sql.catalog.{local_catalogname}.glue.account-id', aws_account_id) \
    .config(f'spark.sql.catalog.{local_catalogname}.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO') \
    .config(f'spark.sql.catalog.{local_catalogname}.warehouse',warehouse_path) \
    .getOrCreate()
spark.sql(f"show databases").show()
spark.sql(f"SHOW TABLES IN {local_catalogname}.bankdata_icebergdb").show()
spark.sql(f"SELECT * FROM {local_catalogname}.bankdata_icebergdb. mdv_transaction1").show()

In the AWS Glue job details, set the following additional job parameters. The first enables Lake Formation fine-grained access for managed tables, and the second enables the Iceberg data lake format:

--enable-lakeformation-fine-grained-access = true
--datalake-formats = iceberg
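If you create the job programmatically, the same parameters go into DefaultArguments. The following boto3 sketch (our own illustration; the job name and script location are placeholders) shows one way to do it:

```python
def build_job_arguments() -> dict:
    # The two job parameters above, expressed as Glue DefaultArguments.
    return {
        "--enable-lakeformation-fine-grained-access": "true",
        "--datalake-formats": "iceberg",
    }

def create_view_reader_job(role_arn: str, script_location: str) -> None:
    import boto3  # requires credentials with glue:CreateJob
    glue = boto3.client("glue")
    glue.create_job(
        Name="query-glue-view",  # hypothetical job name
        Role=role_arn,
        Command={"Name": "glueetl", "ScriptLocation": script_location,
                 "PythonVersion": "3"},
        GlueVersion="5.0",
        DefaultArguments=build_job_arguments(),
    )

# Example (requires credentials):
# create_view_reader_job("arn:aws:iam::<producer-account-id>:role/GlueJobRole",
#                        "s3://<bucket-name>/scripts/query_glue_view.py")
```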

Cleanup

To avoid incurring costs, clean up the resources you used for this post:

  1. Revoke the Lake Formation permissions granted to the Data-Analyst role and Producer account
  2. Drop the Athena tables
  3. Delete the Athena query results from your Amazon Simple Storage Service (Amazon S3) bucket
  4. Delete the Data-Analyst role from IAM

Conclusion

In this post, we demonstrated how to use cross-account IAM definer roles with AWS Glue Data Catalog views. We showed how data owner accounts can create and manage views in a central governance account while maintaining security and control over their data assets. This feature enables enterprises to implement sophisticated data mesh architectures without compromising on security or requiring data duplication.

The ability to use cross-account definer roles with Data Catalog views provides several key advantages:

  • Streamlines view management in multi-account environments
  • Maintains existing CI/CD workflows and automation
  • Enhances security through centralized governance
  • Reduces operational overhead by eliminating the need for data duplication

As organizations continue to build and scale their data lakehouse architectures across multiple AWS accounts, cross-account definer roles for Data Catalog views provide a crucial capability for implementing efficient, secure, and well-governed data sharing patterns.


About the authors

Aarthi Srinivasan


Aarthi is a Senior Big Data Architect at Amazon Web Services (AWS). She works with AWS customers and partners to architect data lake solutions, enhance product features, and establish best practices for data governance.

Sundeep Kumar


Sundeep is a Sr. Specialist Solutions Architect at Amazon Web Services (AWS), helping customers build data lake and analytics platforms and solutions. When not building and designing data lakes, Sundeep enjoys listening to music and playing guitar.