AWS Big Data Blog

How to use streamlined permissions for Amazon S3 Tables and Iceberg materialized views

Apache Iceberg has emerged as the open table format for data lakes. It handles petabyte-scale datasets, lets teams evolve schemas and partitions in place, and supports time travel and incremental processing for data lake management at scale. Amazon S3 Tables provide a fully managed Apache Iceberg table experience in Amazon S3, optimized for analytics workloads, and integrate with the AWS Glue Data Catalog so that AWS analytics services such as Amazon Redshift, Amazon EMR, Amazon Athena, Amazon SageMaker, and AWS Glue can query your data. Together, they form the foundation of a modern data lake architecture on AWS.

S3 Tables integrate with the AWS Glue Data Catalog using AWS Identity and Access Management (IAM)-based authorization. If you manage analytics workloads across these services, you can now define permissions across storage, catalog, and compute in a single IAM policy. This gives teams already using IAM a straightforward path to govern access to S3 Tables resources without changing their existing permission model. For fine-grained access controls, you can opt in to AWS Lake Formation at any time through the AWS Management Console, AWS Command Line Interface (AWS CLI), API, or AWS CloudFormation.

Iceberg materialized views created in the Glue Data Catalog extend this foundation by letting you store pre-computed query results as Iceberg data on Amazon S3. When a query repeats aggregations or joins across large datasets, the engine reads directly from the materialized view’s S3 location rather than reprocessing the base tables. A materialized view can reside in S3 Tables or in an S3 general purpose bucket, independent of where its base tables live, which lets you place pre-computed results wherever fits your access patterns and cost model best.

In this post, we walk through how to set up and manage S3 Tables in the AWS Glue Data Catalog, create and query Iceberg materialized views, and configure access controls that work across your analytics stack with IAM-based authorization.

Solution overview

Architecture diagram showing AWS Glue Data Catalog integration with Amazon Athena, AWS Glue, Amazon Redshift, and Amazon EMR through IAM roles and policies, with Amazon S3 storage and optional AWS Lake Formation governance.

The above architecture illustrates how S3 Tables integrate with AWS Glue Data Catalog using IAM-based authorization, so you can define the necessary permissions across storage, catalog, and query engines in a single IAM policy. This permission model accelerates onboarding for new teams and workloads.

Key architecture components include:

  • Storage layer – Data is stored as Iceberg tables in Amazon S3 Tables.

  • Catalog layer – The AWS Glue Data Catalog serves as the single metadata repository.

  • Compute layer – Amazon Athena, AWS Glue, Amazon Redshift, and Amazon EMR connect to a single Data Catalog to access Iceberg tables.

  • Security – AWS IAM authorizes access to resources across the storage, catalog, and compute layers.

Prerequisites:

To follow along with this post, you must have an AWS account and an IAM role or user with appropriate permissions, as well as familiarity with the following services:

  • IAM
  • AWS Glue Data Catalog
  • Amazon S3
  • Amazon Athena
  • Amazon Redshift
  • Amazon EMR

For the minimum permissions the role or user needs for metadata and data access, refer to the required IAM permissions documentation.

Solution walkthrough

In this walkthrough, you will integrate S3 Tables with the AWS Glue Data Catalog, create Iceberg materialized views, and query data using multiple analytics engines. You will also learn when materialized views help: when complex aggregations are queried frequently while the underlying data continues to change. The walkthrough takes about 45–60 minutes to complete.

Set up S3 Tables and integrate with the Glue Data Catalog

Navigate to Amazon S3 console:

  1. On the left menu, select Table buckets.
  2. Choose the Create table bucket button.

Amazon S3 console showing the Table buckets management page in the US West (N. California) us-west-1 Region with zero table buckets, integration status disabled, and the Create table bucket button highlighted.

  3. On the next screen, enter salesbucket as the bucket name and make sure Enable integration is selected. This step integrates S3 Tables with the AWS Glue Data Catalog.

AWS S3 Create table bucket form with General configuration showing bucket name "salesbucket" and Integration with AWS analytics services section with Enable integration checkbox selected.

  4. Keep the other options as default and choose Create table bucket.
  5. After it is created, you are redirected back to the list of table buckets. Choose the table bucket salesbucket.
  6. Select the Create table with Athena button.
  7. Create a namespace in S3 Tables, which is equivalent to a database in the AWS Glue Data Catalog. Enter sales as the namespace (database) name and choose Create namespace.

Create table with Athena dialog in the Amazon S3 salesbucket console showing namespace configuration with "Create a namespace" selected and namespace name set to "sales."

  8. Choose Create table with Athena, and a new tab opens with the Amazon Athena console.
  9. In the Amazon Athena console, you will see an example query to create a table, plus example statements to insert rows into it. Uncomment the code and run each statement individually by highlighting it. When you are done, the table contains data.

Amazon Athena query editor showing a SQL analytics query on the daily_sales table with results displaying product categories, units sold, total revenue, and average price for February 2024 sales data.
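The generated Athena template follows this general shape. The following is a minimal sketch, not the exact generated code; the column names sale_date, product_category, and sales_amount are assumptions inferred from the query results shown later in this post:

```sql
-- Create a table in the sales namespace (S3 Tables are Iceberg-native,
-- so no table_type property is required here)
CREATE TABLE sales.daily_sales (
    sale_date        DATE,
    product_category STRING,
    sales_amount     DOUBLE
);

-- Load the sample rows queried later in this post
INSERT INTO sales.daily_sales VALUES
    (DATE '2024-01-15', 'Laptop',   900.0),
    (DATE '2024-01-15', 'Monitor',  250.0),
    (DATE '2024-01-16', 'Laptop',  1350.0),
    (DATE '2024-02-01', 'Monitor',  300.0),
    (DATE '2024-02-01', 'Keyboard',  60.0),
    (DATE '2024-02-02', 'Mouse',     25.0),
    (DATE '2024-02-02', 'Laptop',  1050.0),
    (DATE '2024-02-03', 'Laptop',  1200.0),
    (DATE '2024-02-03', 'Monitor',  375.0);
```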

Query S3 Tables and create a materialized view using Amazon EMR:

To run the queries on Amazon EMR, complete the following steps to configure the cluster:

  1. Create an IAM role for the Amazon EMR instance profile following the Amazon EMR Management Guide. Add the following policy and trust relationship to work with materialized views.

In the following policy, replace <REGION> with your AWS Region, <ACCOUNT ID> with your AWS account ID, and <Instance_profile_role> with the name of the Amazon EMR instance profile role.

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Sid":"GlueDataCatalogPermissions",
         "Effect":"Allow",
         "Action":[
            "glue:GetCatalog",
            "glue:GetDatabase",
            "glue:CreateTable",
            "glue:GetTable",
            "glue:GetTables",
            "glue:UpdateTable",
            "glue:DeleteTable"
         ],
         "Resource":[
            "arn:aws:glue:<REGION>:<ACCOUNT ID>:catalog",
            "arn:aws:glue:<REGION>:<ACCOUNT ID>:catalog/s3tablescatalog",
            "arn:aws:glue:<REGION>:<ACCOUNT ID>:catalog/s3tablescatalog/*",
            "arn:aws:glue:<REGION>:<ACCOUNT ID>:database/salesdb",
            "arn:aws:glue:<REGION>:<ACCOUNT ID>:database/salesdb/*",
            "arn:aws:glue:<REGION>:<ACCOUNT ID>:database/s3tablescatalog",
            "arn:aws:glue:<REGION>:<ACCOUNT ID>:database/s3tablescatalog/*",
            "arn:aws:glue:<REGION>:<ACCOUNT ID>:table/s3tablescatalog/*",
            "arn:aws:glue:<REGION>:<ACCOUNT ID>:table/*/*"
         ]
      },
      {
         "Sid":"S3TablesDataAccessPermissions",
         "Effect":"Allow",
         "Action":[
            "s3tables:GetTableBucket",
            "s3tables:GetNamespace",
            "s3tables:GetTable",
            "s3tables:GetTableMetadataLocation",
            "s3tables:GetTableData",
            "s3tables:ListTableBuckets",
            "s3tables:CreateTable",
            "s3tables:PutTableData",
            "s3tables:UpdateTableMetadataLocation",
            "s3tables:ListNamespaces",
            "s3tables:ListTables",
            "s3tables:DeleteTable"
         ],
         "Resource":[
            "arn:aws:s3tables:<REGION>:<ACCOUNT ID>:bucket/*"
         ]
      },
      {
         "Effect":"Allow",
         "Action":"iam:PassRole",
         "Resource":"arn:aws:iam::<ACCOUNT ID>:role/service-role/<Instance_profile_role>"
      }
   ]
}

Add the following statement to the existing trust policy:

{
   "Sid": "",
   "Effect": "Allow",
   "Principal": {
      "Service": "glue.amazonaws.com"
   },
   "Action": "sts:AssumeRole"
}
  2. Launch an Amazon EMR cluster with release 7.12.0 or higher, using the instance profile role created in the previous step, with Iceberg enabled. For more information, refer to Use an Iceberg cluster with Spark.
  3. Connect to the primary node of your Amazon EMR cluster by using SSH, and run the following command to start a Spark application with the required configurations:

Replace <bucket_name> with your table bucket name, <region> with your AWS Region, and <accountid> with your AWS account ID.

spark-sql \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.glue_catalog.type=glue \
  --conf spark.sql.catalog.glue_catalog.warehouse=s3://<bucket_name> \
  --conf spark.sql.catalog.glue_catalog.glue.region=<region> \
  --conf spark.sql.catalog.glue_catalog.glue.id=<accountid>:s3tablescatalog/salesbucket \
  --conf spark.sql.catalog.glue_catalog.glue.account-id=<accountid> \
  --conf spark.sql.catalog.glue_catalog.client.region=<region> \
  --conf spark.sql.optimizer.answerQueriesWithMVs.enabled=true \
  --conf spark.sql.defaultCatalog=glue_catalog
  4. Run the following queries against the daily_sales table:
spark-sql ()> use sales;
spark-sql (sales)> select * from daily_sales;
2024-01-15 Laptop 900.0
2024-01-15 Monitor 250.0
2024-01-16 Laptop 1350.0
2024-02-01 Monitor 300.0
2024-02-01 Keyboard 60.0
2024-02-02 Mouse 25.0
2024-02-02 Laptop 1050.0
2024-02-03 Laptop 1200.0
2024-02-03 Monitor 375.0
  5. Create a materialized view:
CREATE MATERIALIZED VIEW sales_mv as 
SELECT 
    product_category,
    COUNT(*) as units_sold,
    SUM(sales_amount) as total_revenue, 
    AVG(sales_amount) as average_price 
FROM 
    glue_catalog.sales.daily_sales 
GROUP BY 
    product_category;

A newly created materialized view is populated with the initial query results but does not update automatically as the base table data changes. To keep it current, specify a SCHEDULE REFRESH EVERY clause when creating the view. The clause accepts a time interval and unit, so you can define how often the materialized view is recomputed from the base tables.

  6. Add a refresh interval:
CREATE MATERIALIZED VIEW sales_mv 
SCHEDULE REFRESH EVERY 2 HOURS as 
SELECT 
    product_category,
    COUNT(*) as units_sold,
    SUM(sales_amount) as total_revenue, 
    AVG(sales_amount) as average_price 
FROM 
    glue_catalog.sales.daily_sales 
GROUP BY 
    product_category;
  7. Alternatively, you can refresh the materialized view manually.

For manual full refresh, you can use the following command:

REFRESH MATERIALIZED VIEW sales_mv FULL;

For manual incremental refresh, you can use the following command:

REFRESH MATERIALIZED VIEW sales_mv;

For more details, refer to Refreshing materialized views.

  8. Query the materialized view:
spark-sql (sales)> select * from sales_mv;
Keyboard 1 60.0 60.0
Laptop 4 4500.0 1125.0
Mouse 1 25.0 25.0
Monitor 3 925.0 308.3333333333333
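
Because the Spark session above was started with spark.sql.optimizer.answerQueriesWithMVs.enabled=true, aggregate queries on the base table that match the view definition are candidates to be answered from sales_mv instead of rescanning daily_sales. The following is an illustrative example; whether a given query is actually rewritten depends on the engine's matching rules:

```sql
-- This aggregation is covered by the sales_mv definition, so the
-- optimizer can serve it from the materialized view rather than
-- recomputing it from the base table
SELECT
    product_category,
    SUM(sales_amount) AS total_revenue
FROM
    glue_catalog.sales.daily_sales
GROUP BY
    product_category;
```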

After the Iceberg materialized views are created, you can access them using IAM principals that have the required IAM permissions on the Glue Data Catalog resources and their underlying storage.

Iceberg materialized views are flexible in how they combine base tables and access control modes. Base tables can reside in S3 general purpose buckets (with IAM or Lake Formation access control), in S3 Tables (through the s3tablescatalog catalog), or in a combination of the two, all within a single materialized view definition. The materialized view itself can use either IAM or AWS Lake Formation access control, independently of its base tables.

For more details, refer to How materialized views work with AWS Glue.

Query using Athena:

Additionally, you can query the same materialized view from Athena SQL. The following image shows the same query run on Athena and the resulting output.

Amazon Athena query editor showing SELECT query results from the sales_mv materialized view with product category aggregations including Keyboard and Laptop sales data.

Query using Amazon Redshift:

To query the S3 Tables in AWS Glue Data Catalog using Amazon Redshift, you must create a database in the default catalog in Glue Data Catalog that points to the S3 Tables catalog.

  1. On the AWS Glue console, choose Databases, and then choose Add database.

AWS Glue Data Catalog Databases page showing one default database in catalog 466053964652, with the Add database button highlighted.

  2. Choose the Glue Database Resource Link option, enter a name for the database, choose salesbucket as the target catalog and sales as the target database, and then choose Create database.

AWS Glue Create a database form with Glue Database Resource Link selected, name set to "salesdb," target catalog "salesbucket," and target database "sales."

After creating the database, you will see the salesdb resource link under Databases in the AWS Glue Data Catalog.

AWS Glue Data Catalog Databases page showing two databases: "default" and the newly created "salesdb" resource link with source catalog pointing to s3tablescatalog.

Create an IAM role with the following policy for Amazon Redshift schema creation. Replace <REGION> and <ACCOUNTID> with your AWS Region and AWS account ID.

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Sid":"GlueDataCatalogPermissions",
         "Effect":"Allow",
         "Action":[
            "glue:GetCatalog",
            "glue:GetDatabase",
            "glue:CreateTable",
            "glue:GetTable",
            "glue:GetTables",
            "glue:UpdateTable",
            "glue:DeleteTable"
         ],
         "Resource":[
            "arn:aws:glue:<REGION>:<ACCOUNTID>:catalog",
            "arn:aws:glue:<REGION>:<ACCOUNTID>:catalog/s3tablescatalog",
            "arn:aws:glue:<REGION>:<ACCOUNTID>:catalog/s3tablescatalog/*",
            "arn:aws:glue:<REGION>:<ACCOUNTID>:database/salesdb",
            "arn:aws:glue:<REGION>:<ACCOUNTID>:database/salesdb/*",
            "arn:aws:glue:<REGION>:<ACCOUNTID>:database/s3tablescatalog",
            "arn:aws:glue:<REGION>:<ACCOUNTID>:database/s3tablescatalog/*",
            "arn:aws:glue:<REGION>:<ACCOUNTID>:table/s3tablescatalog/*",
            "arn:aws:glue:<REGION>:<ACCOUNTID>:table/*/*"
         ]
      },
      {
         "Sid":"S3TablesDataAccessPermissions",
         "Effect":"Allow",
         "Action":[
            "s3tables:GetTableBucket",
            "s3tables:GetNamespace",
            "s3tables:GetTable",
            "s3tables:GetTableMetadataLocation",
            "s3tables:GetTableData",
            "s3tables:ListTableBuckets",
            "s3tables:CreateTable",
            "s3tables:PutTableData",
            "s3tables:UpdateTableMetadataLocation",
            "s3tables:ListNamespaces",
            "s3tables:ListTables",
            "s3tables:DeleteTable"
         ],
         "Resource":[
            "arn:aws:s3tables:<REGION>:<ACCOUNTID>:bucket/*"
         ]
      }
   ]
}

Create an Amazon Redshift provisioned cluster or Amazon Redshift Serverless workgroup, attaching the IAM role created in the previous step.

To access the AWS Glue Catalog and the resource link, you can now log in to Amazon Redshift as a local user. We use the admin user and Amazon Redshift Query Editor v2.

Amazon Redshift Query Editor v2 interface connected to Serverless workgroup "s3tablesblog" showing 2 native databases and 1 external database with an empty query editor ready for input.

To create the external schema, run the following command. Replace ACCOUNT_ID with your AWS account ID, IAM_ROLE with the IAM role created for schema access, and REGION with your AWS Region.

CREATE EXTERNAL SCHEMA salesdb
FROM DATA CATALOG DATABASE 'salesdb'
IAM_ROLE 'arn:aws:iam::<ACCOUNT_ID>:role/<IAM_ROLE>'
REGION '<REGION>'
CATALOG_ID '<ACCOUNT_ID>';

After you create the external schema, it shows up on the left side under the dev database. The daily_sales table that we created is available, and you can query it directly from Amazon Redshift using a local user.
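
For reference, the query shown in the following screenshot is a plain SELECT through the external schema; the resource link resolves salesdb to the sales namespace in the table bucket:

```sql
-- Query the S3 table through the external schema (resource link)
SELECT * FROM salesdb.daily_sales;
```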

Amazon Redshift Query Editor v2 showing a SELECT query on the daily_sales table in the salesdb schema with 9 rows of results displaying sale dates, product categories, and sales amounts from January–February 2024.

Cleanup:

After completing the walkthrough, remove the resources you created to avoid ongoing charges. These cleanup steps permanently delete your data, including the daily_sales table and the sales_mv materialized view, so make sure you back up anything you need to retain before proceeding.

Clean up the following resources:

  • Remove the Glue Data Catalog resources, including the salesdb resource link
  • Delete the table bucket
  • Delete the Amazon Redshift cluster or Serverless workgroup
  • Terminate the Amazon EMR cluster
  • Delete the IAM roles and policies that you created
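
The table and materialized view created in this walkthrough can be dropped from the same Spark session used earlier. This is a sketch; exact DROP statement support can vary by engine and version:

```sql
-- Drop the materialized view first, then the base table it reads from
DROP MATERIALIZED VIEW sales_mv;
DROP TABLE glue_catalog.sales.daily_sales;
```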

Conclusion

Amazon S3 Tables now integrate with the AWS Glue Data Catalog through IAM-based authorization. By consolidating permissions for storage, catalog, and query engines into a single IAM policy, you can streamline authorization across AWS analytics services such as Amazon Athena, Amazon Redshift, Amazon EMR, and AWS Glue, and build your data lake faster while maintaining enterprise-grade security. For organizations with more granular data access requirements, AWS Lake Formation remains available to layer fine-grained access controls on top of this foundation, configurable through the AWS Management Console, AWS CLI, API, or AWS CloudFormation. This integration lets AWS analytics users rely on IAM and scale their analytics capabilities with reduced operational complexity.

To learn more about S3 Tables and their integration with the Glue Data Catalog, see Amazon S3 Tables integration with AWS analytics services overview and Integrating with Amazon S3 Tables.


About the authors

Ricardo Serafim

Ricardo is a Senior Analytics Specialist Solutions Architect at AWS. He has been helping companies with Data Warehouse solutions since 2007.

Milind Oke

Milind is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.

Pratik Das

Pratik is a Senior Product Manager with AWS Lake Formation. He is passionate about all things data and works with customers to understand their requirements and build delightful experiences. He has a background in building data-driven solutions and machine learning systems.

Srividya Parthasarathy

Srividya is a Senior Big Data Architect on the AWS Lake Formation team. She works with the product team and customers to build robust features and solutions for their analytical data platform. She enjoys building data mesh solutions and sharing them with the community.