Getting started with Apache Iceberg write support in Amazon Redshift

Many companies store structured data in warehouses for analytics while keeping diverse datasets in data lakes for flexible processing. Until now, maintaining consistency between these systems required complex ETL processes and introduced potential data synchronization challenges.

The new Amazon Redshift Apache Iceberg write support removes these complexities through direct writes to Apache Iceberg tables stored in Amazon S3 and S3 Tables. With this native integration you can write data directly from Redshift queries to your data lake without intermediate ETL steps, facilitate data consistency with ACID-compliant transactions that help optimize query performance with flexible partitioning strategies, and use the familiar Redshift SQL interface when writing to Apache Iceberg tables. For example, you can now run a complex transformation in Redshift and write the results directly to an Apache Iceberg table that other analytics engines like Amazon EMR or Amazon Athena can immediately query. By using this approach you can query the same datasets from both Redshift and other analytics tools without copying data.

In this post, we show how you can use Amazon Redshift to write data directly to Apache Iceberg tables stored in Amazon S3 and S3 Tables for seamless integration between your data warehouse and data lake while maintaining ACID compliance.

“Verisk processes billions of catastrophe risk modeling records using Amazon Redshift and Apache Iceberg, achieving 30% faster query aggregations and significant storage cost reductions”

— Karthick Shanmugam, Associate Vice President, Verisk

Solution overview

You can now create and write directly to Apache Iceberg tables stored in Amazon S3 and S3 Tables using familiar SQL commands in Amazon Redshift. We’ll guide you through configuring permissions for S3 table buckets using AWS Lake Formation. Finally, we’ll analyze customer and order datasets across both Redshift native and Apache Iceberg data formats to derive insights. The workflow is illustrated in the following diagram:

In this post we will walk you through following steps:

Create an external database named customer_db in AWS Glue Data Catalog using Amazon Redshift SQL.
Create an external table named customer in the Glue database and write customer data using Amazon Redshift SQL.
Create table bucket named orders on Amazon S3 Tables to write orders data.
Grant permissions using AWS Lake Formation to an IAM role for reading and writing to the orders table.
Write orders data to the orders Amazon S3 table bucket.

This solution uses the following AWS services:

Prerequisites

Create Amazon Redshift data warehouse (provisioned or Serverless).
Permissions to create database on AWS Glue Data Catalog from Redshift.
Create a new AWS Glue database called customer_db or use an existing database of your choice. If you use an existing database or a different name, replace customer_db with your actual database name in the subsequent commands.
S3 bucket and S3 Table bucket in the same AWS Region as your Redshift cluster.
Have access to an IAM role that is a Lake Formation data lake administrator. For instructions, refer to Create a data lake administrator.

Create IAM role RedshifticebergRole with following policy. Add managed permission for AmazonRedshiftQueryEditorV2.

{
    "Version": "2012-10-17",
    "Statement": [
                    {
                        "Sid": "VisualEditor0",
                        "Effect": "Allow",
                        "Action": "redshift:GetClusterCredentialsWithIAM",
                        "Resource": "arn:aws:redshift:<YOUR-REGION>:<AWS-ACCOUNT-NUMBER>:dbname::<YOUR-REDSHIFT-CLUSTER-NAME>/*"
                    }
                ]
}

Setting up your environment

To set up your environment, complete the following steps.

Creating Apache Iceberg tables in Amazon S3 standard buckets

Connect to Redshift using Query Editor V2.
Create user for the Federated role RedshifticebergRole.
```
Create user IAMR:RedshifticebergRole
```

Verify you have an Amazon Redshift External Schema configured. Run following script on Redshift:

CREATE EXTERNAL SCHEMA demo_iceberg
FROM DATA CATALOG DATABASE 'customer_db'
IAM_ROLE 'arn:aws:iam::<AWS-ACCOUNT-NUMBER>:role/RedshiftCustomizedIcebergRole';

Create external table customer in Apache Iceberg table format in the demo_iceberg external schema created above and then insert data.

Use this two-step approach when you need control over column definitions or plan to append data.

Replace your S3 bucket name in place of <<your-bucket>>.

-- Step 1: Define your table structure
CREATE TABLE demo_iceberg.customer
(
customer_id bigint,
customer_name varchar,
email varchar,
city varchar
)
USING ICEBERG
LOCATION 's3://<<your-bucket>>/iceberg-data/customers/';

-- Step 2: Insert data
              
(1, 'Customer1 Smith', 'Customer1.smith@email.com', 'New York'),
(2, 'Customer2 Johnson', 'Customer2.johnson@email.com', 'Los Angeles'),
(3, 'Customer3 Brown', 'Customer3.brown@email.com', 'Chicago'),
(4, 'Customer4 Davis', 'Customer4.davis@email.com', 'Houston'),
(5, 'Customer5 Wilson', 'Customer5.wilson@email.com', 'Phoenix'),
(6, 'Customer6 Miller', 'Customer6.miller@email.com', 'Philadelphia'),
(7, 'Customer7 Garcia', 'Customer7.garcia@email.com', 'San Antonio'),
(8, 'Customer8 Rodriguez', 'Customer8.rodriguez@email.com', 'San Diego'),
(9, 'Customer9 Martinez', 'Customer9.martinez@email.com', 'Dallas'),
(10, 'Customer10 Anderson', 'Customer10.anderson@email.com', 'San Jose'),
(11, 'Customer11 Taylor', 'Customer11.taylor@email.com', 'Austin'),
(12, 'Customer12 Thomas', 'Customer12.thomas@email.com', 'Jacksonville'),
(13, 'Customer13 Jackson', 'Customer13.jackson@email.com', 'Fort Worth'),
(14, 'Customer14 White', 'Customer14.white@email.com', 'Columbus'),
(15, 'Customer15 Harris', 'Customer15.harris@email.com', 'Charlotte');

-- Step 3: Select data
SELECT * FROM demo_iceberg.customer;

Figure 2: Result from demo_iceberg.customer

Grant access to external schema for user IAMR:RedshifticebergRole:

Grant usage on schema demo_iceberg to "IAMR:RedshifticebergRole";

Create Apache Iceberg tables in Amazon S3 Table buckets

Amazon S3 table buckets are integrated with AWS Lake Formation, which serves as the central authority for managing data access permissions. When working with Apache Iceberg tables, Lake Formation provides a unified security framework that simplifies access control across your entire data lake. This centralized approach makes sure consistent and efficient permission management, alleviating the need to handle permissions in multiple places.

To create an S3 table bucket:

Go to Amazon S3, choose Table buckets in the left navigation pane.
On the Table buckets page, in the Integration with AWS analytics services section, choose Enable integration.
In the Table buckets list, choose the Create table bucket button and enter a name for your table bucket, for example, iceberg-write-blog, and choose Create table bucket. After creation, the bucket will appear in the S3 tables catalog, s3tablescatalog, in the Lake Formation console.
In the AWS Lake Formation console, choose Catalogs, in the Catalogs table select s3tablescatalog to open the detail page for that table.
On the s3tablescatalog details page, under Catalogs, choose the table bucket iceberg-write-blog.
On the iceberg-write-blog details page, under Databases, choose Create database.
Enter the database name iceberg_write_namespace, select the Catalog from the drop down menu, and choose Create database.
Grant a permission to create a table in the database to the Lake Formation IAM role. On the iceberg-write-blog details page select the radio button for iceberg_write_namespace, choose Actions, Grant.
On the Grant permissions page, under Principal type select Principals, under Principals select IAM users and roles, in the IAM users and roles drop down menu select RedshifticebergRole.
For LF-Tags or catalog resources, choose Named Data Catalog resources, for Catalogs select iceberg-write-blog and for Databases select iceberg_write_namespace.
For Database permissions select the checkbox for Create table, Drop, and Describe, then choose Grant.

Creating Apache Iceberg tables in Amazon Redshift using Amazon S3 table buckets

AWS Lake Formation catalogs are automatically mounted on Amazon Redshift data warehouses in same account. Amazon Redshift writes directly to S3 Tables using the auto mounted S3 table catalog. The SQL syntax for writing to Apache Iceberg tables stored in S3 table buckets is similar to the syntax for Apache Iceberg tables stored in S3 standard buckets. The key difference is the auto mounted S3 Table catalog, which supports three-part notation access. This feature alleviates the need to create an EXTERNAL SCHEMA when referencing data lake Apache Iceberg tables residing in S3 Table buckets.

To create the Apache Iceberg table:

Switch to the RedshifticebergRole. To access S3 tables through the Redshift Query Editor V2, you must use a Federated user account, the RedshifticebergRole has been granted the necessary Lake Formation permissions.
Log in to Redshift using the Query Editor V2 Federated user option.

In Query Editor V2, create the table named orders in Apache Iceberg table format:

CREATE TABLE "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders
(
 customer_id BIGINT,
 order_id BIGINT,
 Total_order_amt DECIMAL(10,2),
 Total_order_tax_amt REAL,
 tax_pct DOUBLE PRECISION,
 order_date DATE,
 order_created_at_tz TIMESTAMPTZ,
 is_active_ind BOOLEAN
)
USING ICEBERG 
PARTITIONED BY (DAY(order_date));

Insert data into the table using standard SQL:

INSERT INTO "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders
(order_date, order_id, customer_id, total_order_amt, total_order_tax_amt, tax_pct, order_created_at_tz, is_active_ind)
VALUES
('2024-01-15', 1001, 1, 125.50, 10.04, 0.08, '2024-01-15 10:30:00-06:00', true),
('2024-02-20', 1002, 2, 89.99, 6.75, 0.075, '2024-02-20 14:22:15-06:00', true),
('2024-03-10', 1003, 3, 234.75, 20.78, 0.0885, '2024-03-10 09:15:30-06:00', false),
('2024-04-05', 1004, 1, 67.25, 5.38, 0.08, '2024-04-05 16:45:00-05:00', true),
('2024-05-18', 1005, 4, 156.80, 12.54, 0.08, '2024-05-18 11:20:45-05:00', true),
('2024-06-22', 1006, 5, 45.99, 4.14, 0.09, '2024-06-22 13:10:20-05:00', true),
('2024-07-14', 1007, 2, 312.40, 24.99, 0.08, '2024-07-14 08:35:10-05:00', false),
('2024-08-30', 1008, 6, 78.50, 7.07, 0.09, '2024-08-30 15:25:35-05:00', true),
('2024-09-12', 1009, 3, 199.99, 18.00, 0.09, '2024-09-12 12:40:50-05:00', true),
('2024-10-08', 1010, 7, 523.75, 41.90, 0.08, '2024-10-08 17:15:25-05:00', true),
('2024-10-25', 1011, 4, 92.30, 8.31, 0.09, '2024-10-25 10:05:15-05:00', false),
('2024-11-02', 1012, 8, 167.45, 13.40, 0.08, '2024-11-02 14:50:40-06:00', true),
('2024-11-08', 1013, 1, 34.99, 2.80, 0.08, '2024-11-08 09:30:20-06:00', true),
('2024-11-09', 1014, 9, 445.60, 40.10, 0.09, '2024-11-09 16:20:55-06:00', true),
('2024-11-10', 1015, 5, 278.85, 22.31, 0.08, '2024-11-10 11:45:30-06:00', true);

Create a Redshift local_orders table and insert sample records:

CREATE TABLE dev.public.local_orders
(
customer_id BIGINT,
order_id BIGINT,
Total_order_amt DECIMAL(10,2),
Total_order_tax_amt REAL,
tax_pct DOUBLE PRECISION,
order_date DATE,
order_created_at_tz TIMESTAMPTZ,
is_active_ind BOOLEAN
);


INSERT INTO dev.public.local_orders
(customer_id, order_id, Total_order_amt, Total_order_tax_amt, tax_pct, order_date, order_created_at_tz, is_active_ind)
VALUES
(1001, 5001, 299.99, 24.00, 0.08, '2024-01-15', '2024-01-15 14:30:00-05:00', true),
(1002, 5002, 1250.50, 100.04, 0.08, '2024-01-16', '2024-01-16 09:15:22-05:00', true),
(1003, 5003, 75.25, 6.02, 0.08, '2024-01-16', '2024-01-16 16:45:33-05:00', true),
(1004, 5004, 499.99, 40.00, 0.08, '2024-01-17', '2024-01-17 11:20:45-05:00', true),
(1005, 5005, 149.50, 11.96, 0.08, '2024-01-17', '2024-01-17 13:55:12-05:00', false),
(1002, 5006, 899.99, 72.00, 0.08, '2024-01-18', '2024-01-18 10:05:30-05:00', true),
(1006, 5007, 45.75, 3.66, 0.08, '2024-01-18', '2024-01-18 15:40:18-05:00', true),
(1007, 5008, 1500.00, 120.00, 0.08, '2024-01-19', '2024-01-19 08:25:55-05:00', true),
(1008, 5009, 250.25, 20.02, 0.08, '2024-01-19', '2024-01-19 12:10:40-05:00', true),
(1009, 5010, 725.75, 58.06, 0.08, '2024-01-20', '2024-01-20 14:15:28-05:00', true);

Using the CREATE TABLE AS (CTAS) format, create a table from the existing table with no compression:

CREATE TABLE "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders_new 
using ICEBERG
TABLE PROPERTIES ('compression_type'='uncompressed')
AS
select * from dev.public.local_orders;

Select data with standard SQL using the three-part notation:
```
select * from "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders;
```
You can also use the USE clause to specify the default database (and omit the database name):
```
USE "iceberg-write-blog@s3tablescatalog";

select * from iceberg_write_namespace.orders;
```
The resulting table will look like the following image:

Set a schema search path to further simplify table access by omitting the schema name from the notation:

-- Redshift default database is set to 'iceberg-write-blog@s3tablescatalog'
USE "iceberg-write-blog@s3tablescatalog";

-- Redshift will search 'iceberg_write_namespace' to resolve table orders
set search_path to iceberg_write_namespace;

select * from orders;

Show table:

show table "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders 

--Result
CREATE TABLE "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders 
(
customer_id bigint,
order_id bigint,
total_order_amt decimal(10, 2),
total_order_tax_amt float,
tax_pct double precision,
order_date date,
order_created_at_tz timestamptz,
is_active_ind Boolean
)
USING ICEBERG
PARTITIONED BY (DAY(order_date))
TABLE PROPERTIES ('compression_type'='snappy');

Bringing it together

Let’s demonstrate how to combine data from two sources and show how they can work together in a single query.

Customer data stored in standard S3 buckets
Orders data stored in S3 table buckets

select 
b.order_date,
b.order_id,
b.total_order_amt,
CONVERT_TIMEZONE('America/Los_Angeles', b.order_created_at_tz) AS order_pacific_time,
a.customer_name,
a.email,
a.city
from dev.demo_iceberg.customer a join "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders b
on(a.customer_id = b.customer_id)
where b.order_date between '2024-01-15' and '2024-10-25'
and b.is_active_ind=true;

The result from the consolidated query:

Drop table:

Drop TABLE "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders_new;

Clean up

To avoid ongoing charges, follow these steps in order:

Drop Apache Iceberg tables:

DROP TABLE dev.demo_iceberg.customer;
DROP TABLE "iceberg-write-blog@s3tablescatalog".iceberg_write_namespace.orders;

Remove S3 objects, replace your-bucket with the name of the bucket you created:
```
aws s3 rm s3://<your-bucket>/iceberg/ --recursive
```
Remove Lake Formation permissions, replace your-bucket with the name of the bucket you created:
```
aws lakeformation deregister-resource --resource-arn arn:aws:s3:::<your-bucket>
```

Conclusion

With Apache Iceberg write support in Amazon Redshift you can to build flexible data architectures that combine the performance of a data warehouse with the scalability of a data lake. You can now write data directly to Apache Iceberg tables while maintaining ACID compliance and partitioning for query optimization. You can use Amazon Redshift to create Apache Iceberg tables in your data lake, making them immediately queryable through Amazon EMR or Amazon Athena.

To learn more, review the Amazon Redshift Iceberg integration and Writing to Apache Iceberg tables documentation. Visit the AWS Database Blog for latest updates.

AWS Big Data Blog

Getting started with Apache Iceberg write support in Amazon Redshift

Solution overview

Prerequisites

Setting up your environment

Creating Apache Iceberg tables in Amazon S3 standard buckets

Create Apache Iceberg tables in Amazon S3 Table buckets

Creating Apache Iceberg tables in Amazon Redshift using Amazon S3 table buckets

Bringing it together

Clean up

Conclusion

About the authors

Resources

Follow

Learn

Resources

Developers

Help