AWS Big Data Blog
Read and write S3 Iceberg tables using the AWS Glue Iceberg REST Catalog from open source Apache Spark
In today’s data-driven world, organizations are constantly seeking efficient ways to process and analyze vast amounts of information across data lakes and warehouses.
Enter Amazon SageMaker Lakehouse, which you can use to unify all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and artificial intelligence and machine learning (AI/ML) applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in place with all Apache Iceberg compatible tools and engines. This opens up exciting possibilities for open source Apache Spark users who want to use SageMaker Lakehouse capabilities. Further, you can secure your data in SageMaker Lakehouse by defining fine-grained permissions, which are enforced across all analytics and ML tools and engines.
In this post, we will explore how to harness the power of open source Apache Spark and configure a third-party engine to work with the AWS Glue Iceberg REST Catalog. The post includes details on how to perform read and write data operations against Amazon S3 tables, with AWS Lake Formation managing metadata and underlying data access using temporary credential vending.
Solution overview
In this post, the customer uses the AWS Glue Data Catalog to centrally manage technical metadata for structured and semi-structured datasets in their organization and wants to enable their data team to use Apache Spark for data processing. The customer will create an AWS Glue database and configure Apache Spark to interact with the Glue Data Catalog using the Iceberg REST API to write and read Iceberg data on Amazon S3, with Lake Formation permissions controlling access.
We will start by running an extract, transform, and load (ETL) script that uses Apache Spark to create an Iceberg table on Amazon S3 and accesses the table through the AWS Glue Iceberg REST Catalog. The ETL script will add data to the Iceberg table and then read it back using Spark SQL. This post also shows how the data can be queried by other data teams using Amazon Athena.
Prerequisites
- Access to an AWS Identity and Access Management (IAM) role that is a Lake Formation data lake administrator in the account that has the Data Catalog. For instructions, see Create a data lake administrator.
- Verify that you have Python version 3.7 or later installed. Check that pip3 version 22.2.2 or later is installed.
- Install or update the latest AWS Command Line Interface (AWS CLI). For instructions, see Installing or updating the latest version of the AWS CLI. Run aws configure to point the AWS CLI to your AWS account.
- Create an S3 bucket to store the customer Iceberg table. For this post, we use the us-east-2 AWS Region and name the bucket ossblog-customer-datalake (see the example command after this list).
- Create an IAM role that will be used in OSS Spark for data access using an AWS Glue Iceberg REST catalog endpoint. Make sure that the role has the AWS Glue and Lake Formation policies as defined in Data engineer permissions. For this post, we use an IAM role named spark_role.
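For example, the bucket used in this post can be created with the AWS CLI (bucket names must be globally unique, so adjust the name if needed):

```bash
# Create the S3 bucket for the Iceberg table data in us-east-2
aws s3 mb s3://ossblog-customer-datalake --region us-east-2
```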
Enable Lake Formation permissions for third-party access
In this section, you will register the S3 bucket with Lake Formation. This step allows Lake Formation to act as a centralized permissions management system for metadata and data stored in Amazon S3, enabling more efficient and secure data governance in data lake environments.
- Create a user defined IAM role following the instructions in Requirements for roles used to register locations. For this post, we use the IAM role LFRegisterRole.
- Register the S3 bucket ossblog-customer-datalake using the IAM role LFRegisterRole by running the following command:
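A registration command along these lines should work (replace <accountid> with your AWS account ID):

```bash
# Register the S3 bucket with Lake Formation, using LFRegisterRole for data access
aws lakeformation register-resource \
  --resource-arn arn:aws:s3:::ossblog-customer-datalake \
  --role-arn arn:aws:iam::<accountid>:role/LFRegisterRole \
  --region us-east-2
```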
Alternatively, you can use the AWS Management Console for Lake Formation:
- Navigate to the Lake Formation console, choose Administration in the navigation pane, and then Data lake locations and provide the following values:
- For Amazon S3 path, select s3://ossblog-customer-datalake.
- For IAM role, select LFRegisterRole.
- For Permission mode, choose Lake Formation.
- Choose Register location.
- In Lake Formation, enable full table access for external engines to access data.
- Sign in as an admin user and choose Administration in the navigation pane.
- Choose Application integration settings and select Allow external engines to access data in Amazon S3 locations with full table access.
- Choose Save.
Set up resource access for the OSS Spark role:
- Create an AWS Glue database called ossblogdb in the default catalog by going to the Lake Formation console and choosing Databases in the navigation pane (a CLI alternative is shown after this list).
- Select the database, choose Edit, and clear the checkbox for Use only IAM access control for new tables in this database.
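If you prefer the AWS CLI, the database can be created with a command like the following (you still need to clear the IAM access control checkbox as described above):

```bash
# Create the ossblogdb database in the default Data Catalog
aws glue create-database \
  --database-input '{"Name": "ossblogdb"}' \
  --region us-east-2
```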
Grant resource permission to the OSS Spark role:
To enable OSS Spark to create and populate the dataset in the ossblogdb database, you will use the IAM role (spark_role) for the Apache Spark instance that you created in the prerequisites section. Apache Spark will assume this role to create an Iceberg table, add records to it, and read from it. To enable this functionality, grant full table access to spark_role and provide data location permission on the S3 bucket where the table data will be stored.
Grant create table permission to the spark_role:
Sign in as the data lake administrator and run the following command using the AWS CLI:
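A grant along the following lines should work (replace <accountid> with your AWS account ID):

```bash
# Grant DESCRIBE and CREATE_TABLE on the ossblogdb database to spark_role
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<accountid>:role/spark_role \
  --permissions "DESCRIBE" "CREATE_TABLE" \
  --resource '{"Database": {"CatalogId": "<accountid>", "Name": "ossblogdb"}}' \
  --region us-east-2
```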
Alternatively, on the console:
- In the Lake Formation console navigation pane, choose Data lake permissions, and then choose Grant.
- In the Principals section, for IAM users and roles, select spark_role.
- In the LF-Tags or catalog resources section, select Named Data Catalog resources:
- Select <accountid> for Catalogs.
- Select ossblogdb for Databases.
- Select DESCRIBE and CREATE TABLE for Database permissions.
- Choose Grant.
Grant data location permission to the spark_role:
Sign in as the data lake administrator and run the following command using the AWS CLI:
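A grant along these lines provides the location permission (replace <accountid> with your AWS account ID):

```bash
# Grant DATA_LOCATION_ACCESS on the registered S3 location to spark_role
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<accountid>:role/spark_role \
  --permissions "DATA_LOCATION_ACCESS" \
  --resource '{"DataLocation": {"ResourceArn": "arn:aws:s3:::ossblog-customer-datalake"}}' \
  --region us-east-2
```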
Alternatively, on the console:
- In the Lake Formation console navigation pane, choose Data Locations, and then choose Grant.
- For IAM users and roles, select spark_role.
- For Storage locations, select the ossblog-customer-datalake bucket.
- Choose Grant.
Set up a Spark script to use an AWS Glue Iceberg REST catalog endpoint:
Create a file named oss_spark_customer_etl.py in your environment with the following content:
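The sketch below is a minimal version of such a script. It assumes Spark 3.5.x with Iceberg 1.6.x, the us-east-2 Region, and illustrative names (the glue_rest catalog alias, a customer table, and a table location under the registered bucket); the catalog properties follow the Apache Iceberg REST catalog conventions for SigV4 signing and credential vending against the AWS Glue endpoint.

```python
from pyspark.sql import SparkSession

# Placeholders: replace with your AWS account ID and Region.
ACCOUNT_ID = "<accountid>"
REGION = "us-east-2"

spark = (
    SparkSession.builder.appName("oss-spark-glue-iceberg-rest")
    # Pull the Iceberg Spark runtime and AWS bundle (versions are illustrative).
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,"
        "org.apache.iceberg:iceberg-aws-bundle:1.6.1",
    )
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    # Define a catalog named glue_rest backed by the Glue Iceberg REST endpoint.
    .config("spark.sql.catalog.glue_rest", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_rest.type", "rest")
    .config("spark.sql.catalog.glue_rest.uri", f"https://glue.{REGION}.amazonaws.com/iceberg")
    .config("spark.sql.catalog.glue_rest.warehouse", ACCOUNT_ID)
    # Sign REST calls with SigV4 and request Lake Formation vended credentials.
    .config("spark.sql.catalog.glue_rest.rest.sigv4-enabled", "true")
    .config("spark.sql.catalog.glue_rest.rest.signing-name", "glue")
    .config("spark.sql.catalog.glue_rest.rest.signing-region", REGION)
    .config(
        "spark.sql.catalog.glue_rest.header.X-Iceberg-Access-Delegation",
        "vended-credentials",
    )
    .config("spark.sql.catalog.glue_rest.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

# Create the Iceberg table in the ossblogdb database, stored in the registered bucket.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_rest.ossblogdb.customer (
        id INT,
        name STRING,
        city STRING)
    USING iceberg
    LOCATION 's3://ossblog-customer-datalake/customer/'
""")

# Write a few sample rows, then read them back.
spark.sql("""
    INSERT INTO glue_rest.ossblogdb.customer
    VALUES (1, 'Alice', 'Seattle'), (2, 'Bob', 'Chicago')
""")
spark.sql("SELECT * FROM glue_rest.ossblogdb.customer").show()

spark.stop()
```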
Launch PySpark locally and validate read/write to the Iceberg table on Amazon S3
Run pip install pyspark. Save the script locally and set the environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN) with temporary credentials for the spark_role IAM role.
Run python /path/to/oss_spark_customer_etl.py
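For example, assuming your current credentials are allowed to assume spark_role, the local setup might look like this (the exported values are the ones returned by the assume-role call):

```bash
pip install pyspark

# Obtain temporary credentials for the spark_role IAM role
# (replace <accountid> with your AWS account ID)
aws sts assume-role \
  --role-arn arn:aws:iam::<accountid>:role/spark_role \
  --role-session-name oss-spark-etl

# Export the credentials returned by the call above
export AWS_ACCESS_KEY_ID=<AccessKeyId>
export AWS_SECRET_ACCESS_KEY=<SecretAccessKey>
export AWS_SESSION_TOKEN=<SessionToken>

# Run the ETL script
python /path/to/oss_spark_customer_etl.py
```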
You can also use Athena to view the data in the Iceberg table:
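For example, a query like the following, run from the Athena query editor, should return the rows written by Spark (customer is the illustrative table name used in the script above):

```sql
SELECT * FROM "ossblogdb"."customer" LIMIT 10;
```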
To enable the other data team to view the content, provide read access to the data team IAM role using the Lake Formation console:
- In the Lake Formation console navigation pane, choose Data lake permissions, and then choose Grant.
- In the Principals section, for IAM users and roles, choose <iam_role>.
- In the LF-Tags or catalog resources section, select Named Data Catalog resources:
- Select <accountid> for Catalogs.
- Select ossblogdb for Databases.
- Select customer for Tables.
- Select DESCRIBE and SELECT for Table permissions.
- Choose Grant.
Sign in as the IAM role and run the same query in the Athena query editor to verify that the role can read the table.
Clean up
To clean up your resources, complete the following steps:
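The exact steps depend on what you created; the following AWS CLI commands are a representative sequence for the resources used in this post (replace <accountid> with your AWS account ID; you can also perform the equivalent steps on the console):

```bash
# Drop the Iceberg table and the database
aws glue delete-table --database-name ossblogdb --name customer
aws glue delete-database --name ossblogdb

# Revoke the Lake Formation grants made to spark_role (repeat for the data team role)
aws lakeformation revoke-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<accountid>:role/spark_role \
  --permissions "DATA_LOCATION_ACCESS" \
  --resource '{"DataLocation": {"ResourceArn": "arn:aws:s3:::ossblog-customer-datalake"}}'

# Deregister the S3 location from Lake Formation
aws lakeformation deregister-resource \
  --resource-arn arn:aws:s3:::ossblog-customer-datalake

# Empty and delete the S3 bucket
aws s3 rm s3://ossblog-customer-datalake --recursive
aws s3 rb s3://ossblog-customer-datalake

# Delete the spark_role and LFRegisterRole IAM roles if they are no longer needed
```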
Conclusion
In this post, we walked through the integration between Apache Spark and the AWS Glue Iceberg REST Catalog for accessing Iceberg tables in Amazon S3, demonstrating how to perform read and write operations using the Iceberg REST API. The beauty of this solution lies in its flexibility: whether you're running Spark on bare metal servers in your data center, in a Kubernetes cluster, or in any other environment, this architecture can be adapted to suit your needs.
About the Authors
Raj Ramasubbu is a Sr. Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 20 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.
Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She works with the product team and customers to build robust features and solutions for their analytical data platform. She enjoys building data mesh solutions and sharing them with the community.
Pratik Das is a Senior Product Manager with AWS Lake Formation. He is passionate about all things data and works with customers to understand their requirements and build delightful experiences. He has a background in building data-driven solutions and machine learning systems in production.