Interact with Apache Iceberg tables using Amazon Athena and cross-account fine-grained permissions using AWS Lake Formation

We recently announced support for AWS Lake Formation fine-grained access control policies in Amazon Athena queries for data stored in any supported file format using table formats such as Apache Iceberg, Apache Hudi and Apache Hive. AWS Lake Formation allows you to define and enforce database, table, and column-level access policies to query Iceberg tables stored in Amazon S3. Lake Formation provides an authorization and governance layer on data stored in Amazon S3. This capability requires that you upgrade to Athena engine version 3.

Large organizations often have lines of business (LoBs) that operate with autonomy in managing their business data, which makes sharing data across LoBs non-trivial. These organizations have adopted a federated model, with each LoB having the autonomy to make decisions on its own data. They use the producer/consumer model with a centralized governance layer that enforces access controls. If you are interested in learning more about data mesh architecture, visit Design a data mesh architecture using AWS Lake Formation and AWS Glue. With Athena engine version 3, customers can use the same fine-grained controls for open data frameworks such as Apache Iceberg, Apache Hudi, and Apache Hive.

In this post, we dive deep into a use case where you have a producer/consumer model with data sharing enabled to give restricted access to an Apache Iceberg table that the consumer can query. We’ll discuss row filtering to restrict access to certain rows, column filtering to restrict column-level access, schema evolution, and time travel.

Solution overview

To illustrate the functionality of fine-grained permissions for Apache Iceberg tables with Athena and Lake Formation, we set up the following components:

In the producer account:
- An AWS Glue Data Catalog to register the schema of a table in Apache Iceberg format
- Lake Formation to provide fine-grained access to the consumer account
- Athena to verify data from the producer account
In the consumer account:
- AWS Resource Access Manager (AWS RAM) to create a handshake between the producer Data Catalog and consumer
- Lake Formation to provide fine-grained access to the consumer account
- Athena to verify data from the producer account

The following diagram illustrates the architecture.

Prerequisites

Before you get started, make sure you have the following:

- Two AWS accounts, one acting as the producer and the other as the consumer
- An understanding of AWS Lake Formation and a user that can act as a Lake Formation admin
- An AWS Identity and Access Management (IAM) role with permissions to create S3 buckets, manage the AWS Glue Data Catalog, use AWS Resource Access Manager, and run SQL queries using Amazon Athena (see Adding and removing IAM identity permissions for reference)
- A workgroup that uses Athena engine version 3. For this post, our primary workgroup uses Athena engine version 3.
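
One way to confirm the engine-version prerequisite from a script is to inspect the workgroup configuration. The following is a minimal sketch, assuming the response comes from the boto3 Athena client's `get_work_group` call; the helper name `is_engine_v3` is ours and not part of any AWS API.

```python
def is_engine_v3(work_group_response):
    """Return True if a get_work_group response reports Athena engine version 3.

    Assumption: `work_group_response` is the dict returned by
    boto3.client("athena").get_work_group(WorkGroup="primary").
    """
    engine = (
        work_group_response.get("WorkGroup", {})
        .get("Configuration", {})
        .get("EngineVersion", {})
    )
    # EffectiveEngineVersion is the version Athena actually runs, even when
    # the selected version is AUTO.
    return engine.get("EffectiveEngineVersion") == "Athena engine version 3"


# Usage (requires boto3 and AWS credentials, so it is left as a comment):
# import boto3
# print(is_engine_v3(boto3.client("athena").get_work_group(WorkGroup="primary")))
```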

Data producer setup

In this section, we present the steps to set up the data producer.

Create an S3 bucket to store the table data

We create a new S3 bucket to save the data for the table:

On the Amazon S3 console, create an S3 bucket with a unique name (for this post, we use iceberg-athena-lakeformation-blog).
Create the producer folder inside the bucket to use for the table.
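
If you prefer to script the two console steps above, the requests could look roughly like the following sketch. It only builds the keyword arguments; `s3` is assumed to be a boto3 S3 client, and the helper names `bucket_requests` and `create_table_location` are hypothetical.

```python
def bucket_requests(bucket, region, folder="producer"):
    """Build keyword arguments for S3 create_bucket and put_object."""
    create = {"Bucket": bucket}
    # us-east-1 is the default location and must not be sent as a constraint.
    if region != "us-east-1":
        create["CreateBucketConfiguration"] = {"LocationConstraint": region}
    # A zero-byte object whose key ends in "/" appears as a folder in the console.
    folder_marker = {"Bucket": bucket, "Key": f"{folder}/"}
    return create, folder_marker


def create_table_location(s3, bucket, region):
    """Create the bucket and the producer folder; `s3` is a boto3 S3 client."""
    create, folder_marker = bucket_requests(bucket, region)
    s3.create_bucket(**create)
    s3.put_object(**folder_marker)


# Usage (requires boto3 and AWS credentials):
# import boto3
# create_table_location(boto3.client("s3"), "iceberg-athena-lakeformation-blog", "us-east-1")
```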

Register the S3 path storing the table using Lake Formation

We register the S3 full path in Lake Formation:

Navigate to the Lake Formation console.
If you’re logging in for the first time, you’re prompted to create an admin user.
In the navigation pane, under Register and ingest, choose Data lake locations.
Choose Register location, and provide the S3 bucket path that you created earlier.
Choose AWSServiceRoleForLakeFormationDataAccess for IAM role.

For additional information about roles, refer to Requirements for roles used to register locations.

If you enabled encryption of your S3 bucket, you have to provide permissions for Lake Formation to perform encryption and decryption operations. Refer to Registering an encrypted Amazon S3 location for guidance.

Choose Register location.
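
The registration above can also be expressed as an API request. This is a sketch under a few assumptions: `lakeformation` is a boto3 Lake Formation client, the bucket lives in the standard `aws` partition, and the `producer` prefix is the path being registered. Setting `UseServiceLinkedRole` to true corresponds to choosing AWSServiceRoleForLakeFormationDataAccess on the console.

```python
def register_location_request(bucket, prefix="producer"):
    """Build keyword arguments for Lake Formation register_resource."""
    return {
        # Assumption: standard "aws" partition; other partitions use a different ARN prefix.
        "ResourceArn": f"arn:aws:s3:::{bucket}/{prefix}",
        # Use the Lake Formation service-linked role for data access.
        "UseServiceLinkedRole": True,
    }


def register_location(lakeformation, bucket, prefix="producer"):
    """Register the S3 path; `lakeformation` is a boto3 Lake Formation client."""
    return lakeformation.register_resource(**register_location_request(bucket, prefix))
```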

Create an Iceberg table using Athena

Now let’s use Athena to create a table backed by the Apache Iceberg format:

On the Athena console, choose Query editor in the navigation pane.
If you’re using Athena for the first time, under Settings, choose Manage and enter the S3 bucket location that you created earlier (iceberg-athena-lakeformation-blog/producer).
Choose Save.
In the query editor, enter the following query (replace the location with the S3 bucket that you registered with Lake Formation). Note that we use the default database, but you can use any other database.

CREATE TABLE consumer_iceberg (
  customerid bigint,
  customername string,
  email string,
  city string,
  country string,
  territory string,
  contactfirstname string,
  contactlastname string)
LOCATION 's3://YOUR-BUCKET/producer/' -- *** Change bucket name to your bucket ***
TBLPROPERTIES ('table_type'='ICEBERG')

Choose Run.
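
The same statement can be submitted outside the console. The sketch below assumes `athena` is a boto3 Athena client and that you supply your own database name and query result location; `run_query` is a hypothetical helper that submits one statement and polls until Athena reports a terminal state.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_query(athena, sql, database, output_location,
              workgroup="primary", poll_seconds=1.0):
    """Submit one statement and block until it finishes; return the query ID."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
        WorkGroup=workgroup,
    )["QueryExecutionId"]

    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason given")
        raise RuntimeError(f"Query {query_id} ended in {status['State']}: {reason}")
    return query_id
```

Athena runs one statement per execution, so each statement in this post would be a separate `run_query` call.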

Share the table with the consumer account

To illustrate functionality, we implement the following scenarios:

- Provide access to selected columns
- Provide access to selected rows based on a filter

Complete the following steps:

On the Lake Formation console, in the navigation pane under Data catalog, choose Data filters.
Choose Create new filter.
For Data filter name, enter blog_data_filter.
For Target database, enter lf-demo-db.
For Target table, enter consumer_iceberg.
For Column-level access, select Include columns.
Choose the columns to share with the consumer: country, address, contactfirstname, city, customerid, and customername.
For Row filter expression, enter the filter country='France'.
Choose Create filter.
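
For reference, the data filter defined in the steps above maps onto a single Lake Formation request. The following sketch builds that request with the values used in this post; the account ID is a placeholder, `data_filter_request` is a hypothetical helper, and `lakeformation` in the usage comment is assumed to be a boto3 Lake Formation client.

```python
def data_filter_request(catalog_id, database, table, name, columns, row_filter):
    """Build keyword arguments for Lake Formation create_data_cells_filter."""
    return {
        "TableData": {
            "TableCatalogId": catalog_id,   # account that owns the table
            "DatabaseName": database,
            "TableName": table,
            "Name": name,
            "ColumnNames": list(columns),   # column-level access: included columns
            "RowFilter": {"FilterExpression": row_filter},  # row-level access
        }
    }


BLOG_FILTER = data_filter_request(
    catalog_id="111122223333",  # placeholder producer account ID
    database="lf-demo-db",
    table="consumer_iceberg",
    name="blog_data_filter",
    columns=["country", "address", "contactfirstname",
             "city", "customerid", "customername"],
    row_filter="country='France'",
)

# Usage (requires boto3 and AWS credentials):
# lakeformation.create_data_cells_filter(**BLOG_FILTER)
```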


Now let’s grant access to the consumer account on the consumer_iceberg table.

In the navigation pane, choose Tables.
Select the consumer_iceberg table, and choose Grant on the Actions menu.
Select External accounts.
Enter the external account ID.
Select Named data catalog resources.
Choose your database and table.
For Data filters, choose the data filter you created.
For Data filter permissions and Grantable permissions, select Select.
Choose Grant.
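
The grant above can be sketched as a Lake Formation `grant_permissions` request on the data filter. Account IDs here are placeholders and `grant_filter_request` is a hypothetical helper; for a cross-account grant, the principal identifier is assumed to be the consumer's 12-digit account ID.

```python
def grant_filter_request(producer_account, consumer_account,
                         database, table, filter_name):
    """Build keyword arguments for Lake Formation grant_permissions."""
    return {
        # External account principal: the consumer's AWS account ID.
        "Principal": {"DataLakePrincipalIdentifier": consumer_account},
        "Resource": {
            "DataCellsFilter": {
                "TableCatalogId": producer_account,
                "DatabaseName": database,
                "TableName": table,
                "Name": filter_name,
            }
        },
        # Data filter permissions and grantable permissions are both Select.
        "Permissions": ["SELECT"],
        "PermissionsWithGrantOption": ["SELECT"],
    }


# Usage (requires boto3 and AWS credentials):
# lakeformation.grant_permissions(**grant_filter_request(
#     "111122223333", "444455556666",
#     "lf-demo-db", "consumer_iceberg", "blog_data_filter"))
```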

Data consumer setup

To set up the data consumer, we accept the resource share in AWS RAM and create a resource link to the shared table in Lake Formation. Complete the following steps:

Log in to the consumer account and navigate to the AWS RAM console.
Under Shared with me in the navigation pane, choose Resource shares.
Choose your resource share.
Choose Accept resource share.
Note the name of the resource share to use in the next steps.
Navigate to the Lake Formation console.
If you’re logging in for the first time, you’re prompted to create an admin user.
Choose Databases in the navigation pane, then choose your database.
On the Actions menu, choose Create resource link.
For Resource link name, enter the name of your resource link (for example, consumer_iceberg).
Choose your database and shared table.
Choose Create.
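
The consumer-side steps above can be approximated with two API calls: accepting the pending AWS RAM invitation and creating a resource link in the consumer's Data Catalog. This is a sketch with hypothetical helper names; `ram` and `glue` in the usage comments are assumed to be boto3 clients, and the account ID is a placeholder.

```python
def pending_invitation_arns(invitations_response):
    """Pick the ARNs of pending invitations from a RAM
    get_resource_share_invitations response."""
    return [
        invitation["resourceShareInvitationArn"]
        for invitation in invitations_response.get("resourceShareInvitations", [])
        if invitation.get("status") == "PENDING"
    ]


def resource_link_request(local_database, link_name,
                          producer_account, shared_database, shared_table):
    """Build keyword arguments for Glue create_table that define a resource link."""
    return {
        "DatabaseName": local_database,  # database in the consumer account
        "TableInput": {
            "Name": link_name,
            # TargetTable points at the shared table in the producer's catalog.
            "TargetTable": {
                "CatalogId": producer_account,
                "DatabaseName": shared_database,
                "Name": shared_table,
            },
        },
    }


# Usage (requires boto3 and AWS credentials):
# for arn in pending_invitation_arns(ram.get_resource_share_invitations()):
#     ram.accept_resource_share_invitation(resourceShareInvitationArn=arn)
# glue.create_table(**resource_link_request(
#     "lf-demo-db", "consumer_iceberg",
#     "111122223333", "lf-demo-db", "consumer_iceberg"))
```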

Validate the solution

Now we can run different operations on the tables to validate the fine-grained access controls.

Insert operation

Let’s insert data into the consumer_iceberg table in the producer account, and validate the data filtering works as expected in the consumer account.

Log in to the producer account.
On the Athena console, choose Query editor in the navigation pane.
Use the following SQL to insert data into the Iceberg table. Run one query at a time in the query editor: highlight a single statement, then choose Run (or Run again):

INSERT INTO consumer_iceberg VALUES (1, 'Land of Toys Inc.', 'gladys.rim@rim.org',
'NYC','USA', 'NA', 'James', 'xxxx 118th NE');

INSERT INTO consumer_iceberg VALUES (2, 'Reims Collectables', 'yuki_whobrey@aol.com',
'Reims','France', 'EMEA', 'Josephine', 'Darakjy');

INSERT INTO consumer_iceberg VALUES (3, 'Lyon Souveniers', 'fletcher.flosi@yahoo.com',
'Paris', 'France', 'EMEA','Art', 'Venere');

Use the following SQL to read and select data in the Iceberg table:

SELECT * FROM "lf-demo-db"."consumer_iceberg" limit 10;

Log in to the consumer account.
In the Athena query editor, run the following SELECT query on the shared table:

SELECT * FROM "lf-demo-db"."consumer_iceberg" limit 10;

Based on the filters, the consumer can see only a subset of the columns, and only the rows where the country is France.

Update/Delete operations

Now let’s update one of the rows and delete one from the dataset shared with the consumer.

Log in to the producer account.
Update the city from Reims to Paris, and delete the row where customerid is 3. Run one query at a time:

UPDATE consumer_iceberg SET city = 'Paris' WHERE city = 'Reims';

DELETE FROM consumer_iceberg WHERE customerid = 3;

Verify the updated and deleted dataset:

SELECT * FROM consumer_iceberg;

Log in to the consumer account.
In the Athena query editor, run the following SELECT query on the shared table:

SELECT * FROM "lf-demo-db"."consumer_iceberg" limit 10;

We can observe that only one row is available and the city is updated to Paris.

Schema evolution: Add a new column

Let’s add a new column to the table and verify that the consumer can see it once the data filter is updated.

Log in to the producer account.
Add a new column called geo_loc to the Iceberg table. Run one query at a time in the query editor: highlight a single statement, then choose Run (or Run again):

ALTER TABLE consumer_iceberg ADD COLUMNS (geo_loc string);

INSERT INTO consumer_iceberg VALUES (5, 'Test_user', 'test_user@aol.com',
'Reims','France', 'EMEA', 'Test_user', 'Test_user', 'test_geo');

SELECT * FROM consumer_iceberg;

To provide visibility to the newly added geo_loc column, we need to update the Lake Formation data filter.

On the Lake Formation console, choose Data filters in the navigation pane.
Select your data filter and choose Edit.
Under Column-level access, add the new column (geo_loc).
Choose Save.
Log in to the consumer account.
In the Athena query editor, run the following SELECT query on the shared table:

SELECT * FROM "lf-demo-db"."consumer_iceberg" limit 10;

The new column geo_loc and the additional row are now visible.
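
The data filter edit in the steps above could also be scripted. The sketch below assumes the Lake Formation `get_data_cells_filter` and `update_data_cells_filter` operations as exposed by boto3, and that the filter definition returned by the former can be passed back as `TableData`; `with_added_column` is a hypothetical helper that only prepares the updated definition.

```python
def with_added_column(table_data, column):
    """Return a copy of a data filter definition with one more included column."""
    updated = dict(table_data)
    columns = list(table_data.get("ColumnNames", []))
    if column not in columns:
        columns.append(column)
    updated["ColumnNames"] = columns
    return updated


# Usage (requires boto3 and AWS credentials; account ID is a placeholder):
# current = lakeformation.get_data_cells_filter(
#     TableCatalogId="111122223333", DatabaseName="lf-demo-db",
#     TableName="consumer_iceberg", Name="blog_data_filter")["DataCellsFilter"]
# lakeformation.update_data_cells_filter(
#     TableData=with_added_column(current, "geo_loc"))
```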

Schema evolution: Delete column

Let’s drop a column from the table and verify that the change is reflected for the consumer.

Log in to the producer account.
Alter the table to drop the address column from the Iceberg table. Run one query at a time in the query editor: highlight a single statement, then choose Run (or Run again):

ALTER TABLE consumer_iceberg DROP COLUMN address;

SELECT * FROM consumer_iceberg;

We can observe that the column address is not present in the table.

Log in to the consumer account.
In the Athena query editor, run the following SELECT query on the shared table:

SELECT * FROM "lf-demo-db"."consumer_iceberg" limit 10;

The column address is not present in the table.

Time travel

We have now changed the Iceberg table multiple times. The Iceberg table keeps track of the snapshots. Complete the following steps to explore the time travel functionality:

Log in to the producer account.
Query the system table:

SELECT * FROM "lf-demo-db"."consumer_iceberg$snapshots" limit 10;

We can observe that we have generated multiple snapshots.

Note down one of the committed_at values to use in the next steps (for this example, 2023-01-29 21:35:02.176 UTC).
Use time travel to query the table as it existed at that time. In the query editor, highlight the statement, then choose Run (or Run again):

SELECT * FROM consumer_iceberg FOR TIMESTAMP
AS OF TIMESTAMP '2023-01-29 21:35:02.176 UTC';
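
When you script time travel, it can help to build the query text from values read out of the $snapshots table. The following sketch shows two hypothetical helpers: one for the timestamp form used above, and one for Athena's snapshot-ID form (`FOR VERSION AS OF`), which takes a snapshot_id value from the same system table. The snapshot ID in the usage comment is made up for illustration.

```python
def as_of_timestamp(table, committed_at):
    """Time travel query for a committed_at value such as
    '2023-01-29 21:35:02.176 UTC'."""
    return (
        f"SELECT * FROM {table} "
        f"FOR TIMESTAMP AS OF TIMESTAMP '{committed_at}'"
    )


def as_of_version(table, snapshot_id):
    """Time travel query for a snapshot_id value from the $snapshots table."""
    # int() rejects anything that is not a plain integer snapshot ID.
    return f"SELECT * FROM {table} FOR VERSION AS OF {int(snapshot_id)}"


# Usage:
# as_of_timestamp("consumer_iceberg", "2023-01-29 21:35:02.176 UTC")
# as_of_version("consumer_iceberg", 1234567890123456789)  # illustrative ID
```

These helpers interpolate values directly into SQL, so they are only appropriate for values you read from the table's own metadata, not for untrusted input.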

Clean up

Complete the following steps to avoid incurring future charges:

On the Amazon S3 console, delete the table storage bucket (for this post, iceberg-athena-lakeformation-blog).
In the producer account on the Athena console, run the following commands to delete the tables you created:

DROP TABLE "lf-demo-db"."consumer_iceberg";
DROP DATABASE `lf-demo-db`;

In the producer account on the Lake Formation console, revoke permissions to the consumer account.
Delete the S3 bucket used for the Athena query result location from the consumer account.
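
An S3 bucket must be empty before it can be deleted, and a single `delete_objects` request accepts at most 1,000 keys. If you script the bucket cleanup, a batching helper along these lines may be useful; `delete_batches` is a hypothetical name, `s3` in the usage comment is assumed to be a boto3 S3 client, and the sketch does not handle versioned buckets.

```python
def delete_batches(keys, batch_size=1000):
    """Yield Delete payloads for S3 delete_objects, at most batch_size keys each."""
    keys = list(keys)
    for start in range(0, len(keys), batch_size):
        chunk = keys[start:start + batch_size]
        yield {"Objects": [{"Key": key} for key in chunk], "Quiet": True}


# Usage (requires boto3 and AWS credentials):
# bucket = "iceberg-athena-lakeformation-blog"
# paginator = s3.get_paginator("list_objects_v2")
# keys = [obj["Key"]
#         for page in paginator.paginate(Bucket=bucket)
#         for obj in page.get("Contents", [])]
# for batch in delete_batches(keys):
#     s3.delete_objects(Bucket=bucket, Delete=batch)
# s3.delete_bucket(Bucket=bucket)
```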

Conclusion

With support for cross-account, fine-grained access control policies for formats such as Iceberg, you have the flexibility to work with any format supported by Athena. The ability to perform CRUD operations against the data in your S3 data lake, combined with Lake Formation fine-grained access controls for all tables and formats supported by Athena, provides opportunities to innovate and simplify your data strategy. We’d love to hear your feedback!

About the authors

Kishore Dhamodaran is a Senior Solutions Architect at AWS. Kishore helps strategic customers with their cloud enterprise strategy and migration journey, leveraging his years of industry and cloud experience.

Jack Ye is a software engineer of the Athena Data Lake and Storage team at AWS. He is an Apache Iceberg Committer and PMC member.

Chris Olson is a Software Development Engineer at AWS.

Xiaoxuan Li is a Software Development Engineer at AWS.

Rahul Sonawane is a Principal Analytics Solutions Architect at AWS with AI/ML and Analytics as his area of specialty.

AWS Big Data Blog