AWS Big Data Blog

Introducing AWS Glue crawler and create table support for Apache Iceberg format

Apache Iceberg is an open table format for large datasets in Amazon Simple Storage Service (Amazon S3) and provides fast query performance over large tables, atomic commits, concurrent writes, and SQL-compatible table evolution. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. Iceberg captures metadata information on the state of datasets as they evolve and change over time.

AWS Glue crawlers now support Iceberg tables, making it easier to use the AWS Glue Data Catalog and to migrate from other Iceberg catalogs. AWS Glue crawlers extract schema information and keep the location of Iceberg metadata and schema updates current in the Data Catalog. You can then query the Data Catalog Iceberg tables across all analytics engines and apply AWS Lake Formation fine-grained permissions.

The Iceberg catalog helps you manage a collection of Iceberg tables and tracks the table’s current metadata. Iceberg provides several implementation options for the Iceberg catalog, including the AWS Glue Data Catalog, Hive Metastore, and JDBC catalogs. Customers prefer using or migrating to the AWS Glue Data Catalog because of its integrations with AWS analytical services such as Amazon Athena, AWS Glue, Amazon EMR, and Lake Formation.

With today’s launch, you can create and schedule an AWS Glue crawler to register existing Iceberg tables in the Data Catalog. You provide one or more S3 paths where the Iceberg tables are located, and can optionally set the maximum depth of S3 paths that the crawler traverses. On each run, the crawler inspects each of the S3 paths and catalogs schema information, such as new tables, deletes, and updates to schemas, in the Data Catalog. Crawlers support schema merging across all snapshots and update the latest metadata file location in the Data Catalog so that AWS analytical engines can use it directly.

Additionally, AWS Glue is launching support for creating new (empty) Iceberg tables in the Data Catalog using the AWS Glue console or the AWS Glue CreateTable API. Before this launch, customers who wanted to adopt the Iceberg table format had to generate Iceberg’s metadata.json file on Amazon S3 with a separate PutObject call in addition to CreateTable, or use the create table statement on analytics engines such as Athena and AWS Glue. The new CreateTable API eliminates the need to create the metadata.json file separately and automatically generates it based on the given API input. Also, customers who manage deployments using AWS CloudFormation templates can now create Iceberg tables using the CreateTable API. For more details, refer to Creating Apache Iceberg tables.
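As a rough sketch of what this looks like programmatically, the following builds a Glue CreateTable request whose OpenTableFormatInput field asks Glue to generate the Iceberg metadata.json for a new, empty table. The column names match the product_details table used later in this post; the bucket name is a placeholder, and actually issuing the call requires boto3 and AWS credentials.

```python
# Sketch, not the post's exact code: a Glue CreateTable request for an
# empty Iceberg table. OpenTableFormatInput tells Glue to generate the
# Iceberg metadata.json itself, so no separate PutObject is needed.

def build_iceberg_create_table_request(database, table, location):
    """Return kwargs for glue.create_table() with OpenTableFormatInput."""
    return {
        "DatabaseName": database,
        "TableInput": {
            "Name": table,
            "TableType": "EXTERNAL_TABLE",
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "product_id", "Type": "string"},
                    {"Name": "manufacture_name", "Type": "string"},
                    {"Name": "product_rating", "Type": "int"},
                ],
                "Location": location,
            },
        },
        # Ask Glue to create Iceberg (format version 2) metadata for us
        "OpenTableFormatInput": {
            "IcebergInput": {"MetadataOperation": "CREATE", "Version": "2"}
        },
    }

request = build_iceberg_create_table_request(
    "icebergcrawlerblogdb",
    "product_details",
    "s3://<datalakebucket>/icebergcrawlerblogdb.db/",  # placeholder bucket
)
# To actually create the table:
# import boto3; boto3.client("glue").create_table(**request)
```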

For accessing the data using Athena, you can also use Lake Formation to secure your Iceberg table using fine-grained access control permissions when you register the Amazon S3 data location with Lake Formation. For source data in Amazon S3 and metadata that is not registered with Lake Formation, access is determined by AWS Identity and Access Management (IAM) permissions policies for Amazon S3 and AWS Glue actions.

Solution overview

For our example use case, a customer uses Amazon EMR for data processing and Iceberg format for the transactional data. They store their product data in Iceberg format on Amazon S3 and host the metadata of their datasets in Hive Metastore on the EMR primary node. The customer wants to make product data accessible to analyst personas for interactive analysis using Athena. Many AWS analytics services don’t integrate natively with Hive Metastore, so we use an AWS Glue crawler to populate the metadata in the AWS Glue Data Catalog. Athena supports Lake Formation permissions on Iceberg tables, so we apply fine-grained access for data access.

We configure the crawler to onboard the Iceberg schema to the Data Catalog and use Lake Formation access control for crawling. We apply Lake Formation grants on the database and crawled table to enable analyst users to query the data and verify using Athena.

After we populate the schema of the existing Iceberg dataset to the Data Catalog, we onboard new Iceberg tables to the Data Catalog and load data into the newly created table using Athena. We apply Lake Formation grants on the database and newly created table to enable analyst users to query the data and verify using Athena.

The following diagram illustrates the solution architecture.

Set up resources with AWS CloudFormation

To set up the solution resources using AWS CloudFormation, complete the following steps:

  1. Log in to the AWS Management Console as IAM administrator.
  2. Choose Launch Stack to deploy a CloudFormation template.
  3. Choose Next.
  4. On the next page, choose Next.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create.

The CloudFormation template generates the following resources:

  • VPC, subnet, and security group for the EMR cluster
  • Data lake bucket to store Iceberg table data and metadata
  • IAM roles for the crawler and Lake Formation registration
  • EMR cluster and steps to create an Iceberg table with Hive Metastore
  • Analyst role for data access
  • Athena bucket path for results

  1. When the stack is complete, on the AWS CloudFormation console, navigate to the Resources tab of the stack.
  2. Note down the values of EmrClusterId, DataLakeBucketName, LFRegisterLocationServiceRole, AWSGlueServiceRole, AthenaBucketName, and LFBusinessAnalystRole.
  3. Navigate to the Amazon EMR console and choose the EMR cluster you created.
  4. Navigate to the Steps tab and verify that the steps were run.

This script creates the database icebergcrawlerblogdb using Hive and the Iceberg table product. It uses the Hive Metastore server on Amazon EMR as the metastore and stores the data on Amazon S3.

  1. Navigate to the S3 bucket you created and verify that the data and metadata were created for the Iceberg table.

Some of the resources that this stack deploys incur costs when in use.

Now that the data is on Amazon S3, we can register the bucket with Lake Formation to implement access control and centralize the data governance.

Set up Lake Formation permissions

To use the AWS Glue Data Catalog in Lake Formation, complete the following steps to update the Data Catalog settings to use Lake Formation permissions to control Data Catalog resources instead of IAM-based access control:

  1. Sign in to the Lake Formation console as admin.
    • If this is the first time accessing the Lake Formation console, add yourself as the data lake administrator.
  2. In the navigation pane, under Data catalog, choose Settings.
  3. Deselect Use only IAM access control for new databases.
  4. Deselect Use only IAM access control for new tables in new databases.
  5. Choose Version 3 for Cross account version settings.
  6. Choose Save.

Now you can set up Lake Formation permissions.

Register the data lake S3 bucket with Lake Formation

To register the data lake S3 bucket, complete the following steps:

  1. On the Lake Formation console, in the navigation pane, choose Data lake locations.
  2. Choose Register location.
  3. For Amazon S3 path, enter the data lake bucket path.
  4. For IAM role, choose the role noted from the CloudFormation template for LFRegisterLocationServiceRole.
  5. Choose Register location.

Grant crawler role access to the data location

To grant access to the crawler, complete the following steps:

  1. On the Lake Formation console, in the navigation pane, choose Data locations.
  2. Choose Grant.
  3. For IAM users and roles, choose the role for the crawler.
  4. For Storage locations, enter the data lake bucket path.
  5. Choose Grant.

Create database and grant access to the crawler role

Complete the following steps to create your database and grant access to the crawler role:

  1. On the Lake Formation console, in the navigation pane, choose Databases.
  2. Choose Create database.
  3. Provide the name icebergcrawlerblogdb for the database.
  4. Make sure the Use only IAM access control for new tables in this database option is not selected.
  5. Choose Create database.
  6. On the Action menu, choose Grant.
  7. For IAM users and roles, choose the role for the crawler.
  8. Leave the database specified as icebergcrawlerblogdb.
  9. Select Create table, Describe, and Alter for Database permissions.
  10. Choose Grant.
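The console grant above can also be expressed as a Lake Formation GrantPermissions request. This is a sketch under the assumption that you know the crawler role's ARN (the one shown is a placeholder); the permissions list mirrors the Create table, Describe, and Alter selections from the steps above.

```python
# Sketch: the console database grant for the crawler role, expressed as
# kwargs for lakeformation.grant_permissions(). The role ARN is a placeholder.

def build_crawler_grant(role_arn, database):
    """Return kwargs for lakeformation.grant_permissions()."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Database": {"Name": database}},
        # Matches the Create table, Describe, and Alter console selections
        "Permissions": ["CREATE_TABLE", "DESCRIBE", "ALTER"],
    }

grant = build_crawler_grant(
    "arn:aws:iam::111122223333:role/AWSGlueServiceRole-blog",  # placeholder ARN
    "icebergcrawlerblogdb",
)
# To apply the grant:
# import boto3; boto3.client("lakeformation").grant_permissions(**grant)
```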

Configure the crawler for Iceberg

To configure your crawler for Iceberg, complete the following steps:

  1. On the AWS Glue console, in the navigation pane, choose Crawlers.
  2. Choose Create crawler.
  3. Enter a name for the crawler. For this post, we use icebergcrawler.
  4. Under Data source configuration, choose Add data source.
  5. For Data source, choose Iceberg.
  6. For S3 path, enter s3://<datalakebucket>/icebergcrawlerblogdb.db/.
  7. Choose Add an Iceberg data source.

Support for Iceberg tables is available through the CreateCrawler and UpdateCrawler APIs by adding an IcebergTarget as a target, with the following properties:

  • connectionId – If your Iceberg tables are stored in buckets that require VPC authorization, you can set your connection properties here
  • icebergTables – An array of icebergPaths strings, each indicating the folder in which the metadata files for an Iceberg table reside

See the following code:

{
    "IcebergTarget": {
        "connectionId": "iceberg-connection-123",
        "icebergMetaDataPaths": [
            "s3://bucketA/",
            "s3://bucketB/",
            "s3://bucket3/financedb/financetable/"
        ],
        "exclusions": ["departments/**", "employees/folder/**"],
        "maximumDepth": 5
    }
}
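The same target can be configured through boto3's CreateCrawler call. The following is a sketch under the assumption that the boto3 field names (IcebergTargets, Paths, Exclusions, MaximumTraversalDepth) correspond to this API; the bucket and role ARN are placeholders.

```python
# Sketch: a boto3 CreateCrawler request with an Iceberg target, matching
# the crawler configured in the console steps. Bucket and role ARN are
# placeholders; running this requires boto3 and AWS credentials.

def build_iceberg_crawler_request(name, role_arn, database, paths):
    """Return kwargs for glue.create_crawler() with an Iceberg target."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "IcebergTargets": [
                {
                    "Paths": paths,
                    "Exclusions": ["departments/**"],
                    "MaximumTraversalDepth": 5,  # max S3 depth to traverse
                }
            ]
        },
        # Crawl using Lake Formation credentials, as selected in the console
        "LakeFormationConfiguration": {"UseLakeFormationCredentials": True},
    }

request = build_iceberg_crawler_request(
    "icebergcrawler",
    "arn:aws:iam::111122223333:role/AWSGlueServiceRole-blog",  # placeholder ARN
    "icebergcrawlerblogdb",
    ["s3://<datalakebucket>/icebergcrawlerblogdb.db/"],
)
# To create the crawler:
# import boto3; boto3.client("glue").create_crawler(**request)
```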
  1. Choose Next.
  2. For Existing IAM role, enter the crawler role created by the stack.
  3. Under Lake Formation configuration, select Use Lake Formation credentials for crawling S3 data source.
  4. Choose Next.
  5. Under Set output and scheduling, specify the target database as icebergcrawlerblogdb.
  6. Choose Next.
  7. Choose Create crawler.
  8. Run the crawler.

During each crawl, for each icebergTable path provided, the crawler calls the Amazon S3 List API to find the most recent metadata file under that Iceberg table’s metadata folder and updates the table’s metadata_location parameter to point to that latest metadata file.
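The selection logic described above can be illustrated with a small sketch (this is not the crawler's actual implementation). Iceberg names metadata files with an increasing version prefix, such as 00003-&lt;uuid&gt;.metadata.json, so picking the most recent file from an S3 listing reduces to taking the lexicographic maximum of the .metadata.json keys.

```python
# Illustration only: given keys listed under an Iceberg table's metadata/
# folder, pick the most recent metadata file. Version-prefixed names
# (00001-..., 00002-...) make a lexicographic max sufficient here.

def latest_metadata_file(keys):
    """Return the newest *.metadata.json key from an S3 listing, or None."""
    metadata_keys = [k for k in keys if k.endswith(".metadata.json")]
    return max(metadata_keys) if metadata_keys else None

keys = [
    "product/metadata/00001-aaa.metadata.json",
    "product/metadata/00003-ccc.metadata.json",
    "product/metadata/00002-bbb.metadata.json",
    "product/metadata/snap-123.avro",  # snapshot/manifest files are ignored
]
print(latest_metadata_file(keys))  # -> product/metadata/00003-ccc.metadata.json
```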

The following screenshot shows the details after a successful run.

The crawler was able to crawl the S3 data source and successfully populate the schema for Iceberg data in the Data Catalog.

You can now start using the Data Catalog as your primary metastore and create new Iceberg tables directly in the Data Catalog or using the CreateTable API.

Create a new Iceberg table

To create an Iceberg table in the Data Catalog using the console, complete the steps in this section. Alternatively, you can use a CloudFormation template to create an Iceberg table using the following code:

Type: AWS::Glue::Table
Properties:
  CatalogId: "<account_id>"
  DatabaseName: "icebergcrawlerblogdb"
  TableInput:
    Name: "product_details"
    StorageDescriptor:
      Columns:
        - Name: "product_id"
          Type: "string"
        - Name: "manufacture_name"
          Type: "string"
        - Name: "product_rating"
          Type: "int"
      Location: "s3://<datalakebucket>/icebergcrawlerblogdb.db/"
    TableType: "EXTERNAL_TABLE"
  OpenTableFormatInput:
    IcebergInput:
      MetadataOperation: "CREATE"
      Version: "2"

Grant the IAM role access to the data location

First, grant the IAM role access to the data location:

  1. On the Lake Formation console, in the navigation pane, choose Data locations.
  2. Choose Grant.
  3. For IAM users and roles, choose the admin IAM role.
  4. For Storage location, enter the data lake bucket path.
  5. Choose Grant.

Create the Iceberg table

Complete the following steps to create the Iceberg table:

  1. On the Lake Formation console, in the navigation pane, choose Tables.
  2. Choose Create table.
  3. For Name, enter product_details.
  4. Choose icebergcrawlerblogdb for Database.
  5. Select Apache Iceberg table for Table format.
  6. For Table location, provide the path <datalakebucket>/icebergcrawlerblogdb.db/.
  7. Provide the following schema and choose Upload schema:
    [
         {
             "Name": "product_id",
             "Type": "string"
         },
         {
             "Name": "manufacture_name",
             "Type": "string"
         },
         {
             "Name": "product_rating",
             "Type": "int"
         }
     ]
  8. Choose Submit to create the table.

Add a record to the new Iceberg table

Complete the following steps to add a record to the Iceberg table:

  1. On the Athena console, navigate to the query editor.
  2. Choose Edit settings to configure the Athena query results bucket using the value noted from the CloudFormation output for AthenaBucketName.
  3. Choose Save.
  4. Run the following query to add a record to the table:
    insert into icebergcrawlerblogdb.product_details values('00001','ABC Company',10)
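The same INSERT can be issued through the Athena API rather than the console. This is a sketch: the output location is a placeholder built from the AthenaBucketName stack output, and the actual call requires boto3 and AWS credentials.

```python
# Sketch: the INSERT from the step above as an Athena StartQueryExecution
# request. The results bucket path is a placeholder.

def build_athena_query(sql, database, output_s3):
    """Return kwargs for athena.start_query_execution()."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

query = build_athena_query(
    "insert into icebergcrawlerblogdb.product_details values('00001','ABC Company',10)",
    "icebergcrawlerblogdb",
    "s3://<athenabucketname>/results/",  # placeholder: AthenaBucketName output
)
# To run the query:
# import boto3; boto3.client("athena").start_query_execution(**query)
```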

Configure Lake Formation permissions on the Iceberg table in the Data Catalog

Athena supports Lake Formation permission on Iceberg tables, so for this post, we show you how to set up fine-grained access on the tables and query them using Athena.

Now the data lake admin can delegate permissions on the database and table to the LFBusinessAnalystRole-IcebergBlog IAM role via the Lake Formation console.

Grant the role access to the database and describe permissions

To grant the LFBusinessAnalystRole-IcebergBlog IAM role access to the database with Describe permissions, complete the following steps:

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
  2. Choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Choose the IAM role LFBusinessAnalystRole-IcebergBlog.
  5. Under LF-Tags or catalog resources, choose icebergcrawlerblogdb for Databases.
  6. Select Describe for Database permissions.
  7. Choose Grant to apply the permissions.

Grant column access to the role

Next, grant column access to the LFBusinessAnalystRole-IcebergBlog IAM role:

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
  2. Choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Choose the IAM role LFBusinessAnalystRole-IcebergBlog.
  5. Under LF-Tags or catalog resources, choose icebergcrawlerblogdb for Databases and product for Tables.
  6. Choose Select for Table permissions.
  7. Under Data permissions, select Column-based access.
  8. Select Include columns and choose product_name and price.
  9. Choose Grant to apply the permissions.

Grant table access to the role

Lastly, grant table access to the LFBusinessAnalystRole-IcebergBlog IAM role:

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
  2. Choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Choose the IAM role LFBusinessAnalystRole-IcebergBlog.
  5. Under LF-Tags or catalog resources, choose icebergcrawlerblogdb for Databases and product_details for Tables.
  6. Choose Select and Describe for Table permissions.
  7. Choose Grant to apply the permissions.

Verify the tables using Athena

To verify the tables using Athena, switch to the LFBusinessAnalystRole-IcebergBlog role and complete the following steps:

  1. On the Athena console, navigate to the query editor.
  2. Choose Edit settings to configure the Athena query results bucket using the value noted from the CloudFormation output for AthenaBucketName.
  3. Choose Save.
  4. Run the queries on product and product_details to validate access.

The following screenshot shows column permissions on product.

The following screenshot shows table permissions on product_details.

We have successfully crawled the Iceberg dataset created from Hive Metastore with data on Amazon S3 and created an AWS Glue Data Catalog table with the schema populated. We registered the data lake bucket with Lake Formation and enabled crawling access to the data lake using Lake Formation permissions. We granted Lake Formation permissions on the database and table to the analyst user and validated access to the data using Athena.

Clean up

To avoid unwanted charges to your AWS account, delete the AWS resources:

  1. Sign in to the CloudFormation console as the IAM admin used for creating the CloudFormation stack.
  2. Delete the CloudFormation stack you created.

Conclusion

With the support for Iceberg crawlers, you can quickly move to using the AWS Glue Data Catalog as your primary Iceberg table catalog. You can automatically register Iceberg tables in the Data Catalog by running an AWS Glue crawler, which doesn’t require any DDL or manual schema definition. You can start building your serverless transactional data lake on AWS using the AWS Glue crawler, create new tables using the Data Catalog, and utilize Lake Formation fine-grained access controls for querying Iceberg tables with Athena.

Refer to Working with other AWS services for Lake Formation support for Iceberg tables across various AWS analytical services.

Special thanks to everyone who contributed to this crawler and CreateTable feature launch: Theo Xu, Kyle Duong, Anshuman Sharma, Atreya Srivathsan, Eric Wu, Jack Ye, Himani Desai, Masoud Shahamiri and Sachet Saurabh.

If you have questions or suggestions, submit them in the comments section.


About the authors

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Mahesh Mishra is a Principal Product Manager on the AWS Lake Formation team. He works with many of AWS’ largest customers on emerging technology needs, and leads several data and analytics initiatives within AWS, including strong support for transactional data lakes.