AWS Storage Blog
Build a data lake for streaming data with Amazon S3 Tables and Amazon Data Firehose
Businesses are increasingly adopting real-time data processing to stay ahead of user expectations and market changes. Industries such as retail, finance, manufacturing, and smart cities are using streaming data for everything from optimizing supply chains to detecting fraud and improving urban planning. The ability to use data as it is generated has become a critical competitive advantage for businesses, driving demand for a scalable data lake architecture for storing and managing streaming data. More recently, users are increasingly using Apache Iceberg to organize their streaming data in data lakes to use its database-like features, such as schema evolution, time travel, and ACID transactions.
Amazon S3 Tables provide purpose-built storage with a simple, performant, and cost-effective way to store and query Apache Iceberg tables. S3 Tables continuously optimize storage to maximize query performance and minimize costs, making them an excellent choice for businesses looking to streamline their data lake operations without additional infrastructure setup. Businesses can stream and query tables in S3 table buckets through AWS analytics services, such as Amazon Data Firehose and Amazon Athena, by integrating the table buckets with the AWS Glue Data Catalog and AWS Lake Formation. The S3 Tables integration with AWS analytics services for table buckets is in preview. Firehose is a fully managed, serverless service that streams data from various sources to data lakes, data warehouses, and analytics data stores. With built-in support for Iceberg, Firehose can deliver real-time data from multiple sources to Iceberg tables in Amazon S3 without provisioning additional resources or paying for idle streams during off-hours. It streamlines data ingestion by processing streaming records as they arrive, eliminating the multi-step process of writing streaming data in raw formats and then converting it to Apache Iceberg format.
In this post, we walk through building a fully managed data lake using Firehose and S3 Tables to store and analyze real-time streaming data. We use a custom streaming source to deliver data into a table in an S3 table bucket, but the same workflow applies to the other sources supported by Firehose, which are listed in the Amazon Data Firehose user guide.
Solution overview
In this solution, we demonstrate an example where a user ingests streaming data from a source directly into a table in an S3 table bucket. We start by creating an S3 table bucket and integrating it with AWS analytics services through the AWS Glue Data Catalog and AWS Lake Formation. Then, we use Amazon Kinesis Data Generator to simulate and publish real-time data streams to Firehose, and use Athena to view the data that is streamed into the table in the table bucket.
Prerequisites
To follow along, you need the following setup:
- An AWS account with access to the following AWS services:
  - Amazon S3 (S3 Tables)
  - Amazon Data Firehose
  - AWS Glue
  - AWS Lake Formation
  - Amazon Athena
  - AWS Identity and Access Management (IAM)
- Make sure the latest version of AWS Command Line Interface (AWS CLI) is installed and configured.
- Familiarity with the AWS Management Console.
Walkthrough
The following steps walk you through this solution.
1. Create an S3 table bucket and integrate with AWS analytics services
Navigate to Amazon S3 in the Console. Choose Table buckets and then Enable integration if the integration with AWS Analytics services isn’t already enabled, as shown in the following figure. This integration allows users to discover all the tables created in this AWS Region and this account in AWS Glue Data Catalog and AWS Lake Formation, and access them through AWS services, such as Firehose, Athena, Amazon Redshift, and Amazon EMR. When the integration is complete, all existing and future table buckets are automatically added as a sub-catalog, namespaces are organized as databases, and the tables within those namespaces are populated as tables in the AWS Glue Data Catalog. To learn more about this integration, refer to Using Amazon S3 with Analytics services.
Specify a name for your table bucket and continue with Create table bucket, as shown in the following figure.
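If you prefer the AWS CLI, you can also create the table bucket with a single command. The bucket name below is a placeholder; substitute your own:

aws s3tables create-table-bucket \
    --region <region> \
    --name s3tablebucket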
2. Create a namespace in the table bucket
Using the AWS CLI, create a namespace s3tables_demo_namespace in the table bucket you created previously, as shown in the following command. Namespaces are logical constructs that help you organize your tables in a scalable manner.
aws s3tables create-namespace \
    --table-bucket-arn arn:aws:s3tables:<region>:<account-id>:bucket/<s3tablebucket> \
    --namespace s3tables_demo_namespace
3. Create a table in the table bucket
Using the AWS CLI, create a table named s3tables_demo_table in the namespace you just created in the table bucket. When you create a table, you can also define its schema. For this post, we create a table with a schema consisting of three fields: ID, name, and value.
aws s3tables create-table --cli-input-json file://mytabledefinition.json
The following is the sample mytabledefinition.json used to set the table schema.
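A minimal mytabledefinition.json matching the three-field schema above might look like the following sketch; the int/string field types are assumptions, so adjust them to fit your data:

{
    "tableBucketARN": "arn:aws:s3tables:<region>:<account-id>:bucket/<s3tablebucket>",
    "namespace": "s3tables_demo_namespace",
    "name": "s3tables_demo_table",
    "format": "ICEBERG",
    "metadata": {
        "iceberg": {
            "schema": {
                "fields": [
                    { "name": "id", "type": "int", "required": true },
                    { "name": "name", "type": "string" },
                    { "name": "value", "type": "int" }
                ]
            }
        }
    }
}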
The following image shows the sample output from the command:
4. Create a resource link to the namespace
Firehose streams data to tables in databases registered in the default catalog of the AWS Glue Data Catalog. To stream data to tables in S3 table buckets, create a resource link in the default catalog that points to the namespace in the table bucket. A resource link is a Data Catalog object that acts as an alias or pointer to another Data Catalog resource, such as a database or table. To create a resource link using the AWS CLI, provide the namespace you created in Step 2 for DatabaseName, and replace <region> and <account-id> with your values before you run the command.
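A sketch of the create-database call follows; it assumes the resource link is named s3tables_resource_link (the name used in later steps) and that the table bucket's sub-catalog follows the <account-id>:s3tablescatalog/<s3tablebucket> form created by the integration:

aws glue create-database \
    --region <region> \
    --catalog-id "<account-id>" \
    --database-input '{
        "Name": "s3tables_resource_link",
        "TargetDatabase": {
            "CatalogId": "<account-id>:s3tablescatalog/<s3tablebucket>",
            "DatabaseName": "s3tables_demo_namespace"
        }
    }'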
The following image shows the sample output from the command.
5. Create an IAM role for Firehose
Create an IAM role that grants Firehose permissions to perform operations on tables in the default AWS Glue Data Catalog, to back up records that fail during streaming to a general purpose S3 bucket, and to interact with Amazon Kinesis Data Streams. Additionally, depending on your Firehose configuration, you may choose to grant extra permissions for Amazon CloudWatch logging and AWS Lambda function operations. To configure this, navigate to IAM in the Console and create an IAM role with the permissions policy described in the Amazon S3 Tables user guide. Keep track of the role you create, as you need it later to grant AWS Lake Formation permissions.
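As a rough sketch, the permissions policy includes statements along these lines; the resource ARNs and error bucket name are placeholders, and the user guide linked above remains the authoritative reference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GlueTableAccess",
            "Effect": "Allow",
            "Action": ["glue:GetTable", "glue:GetDatabase", "glue:UpdateTable"],
            "Resource": [
                "arn:aws:glue:<region>:<account-id>:catalog",
                "arn:aws:glue:<region>:<account-id>:database/*",
                "arn:aws:glue:<region>:<account-id>:table/*"
            ]
        },
        {
            "Sid": "ErrorBucketBackup",
            "Effect": "Allow",
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::<error-bucket>",
                "arn:aws:s3:::<error-bucket>/*"
            ]
        },
        {
            "Sid": "LakeFormationDataAccess",
            "Effect": "Allow",
            "Action": ["lakeformation:GetDataAccess"],
            "Resource": "*"
        }
    ]
}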
6. Configure AWS Lake Formation permissions
AWS Lake Formation manages access to your table resources. Lake Formation uses its own permissions model that enables fine-grained access control for Data Catalog resources. For Firehose to ingest data into table buckets, the Firehose role (created in step 5) requires DESCRIBE permissions on the resource link (created in step 4) to discover the S3 Tables namespace through the resource link and read/write permission on the underlying table.
To add DESCRIBE permissions on the resource link, navigate to Lake Formation in the Console. Choose Databases in the left menu, and then choose the resource link you created in Step 4. Choose Actions, choose Grant, and then grant Describe permission to the Firehose role, as shown in the following figures. In this example, the Firehose role is named s3FirehoseRole.
To provide read and write permission on specific tables, go back and choose Databases in the left menu, then choose the resource link you created in Step 4. Choose Actions, and then choose Grant on target. Choose the Firehose role, the database, and the tables, then grant Super permission to the Firehose role, as shown in the following figures.
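Equivalently, both grants can be made from the AWS CLI. The following sketch assumes the role and resource names used earlier; note that Super in the console maps to ALL in the API:

aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier="arn:aws:iam::<account-id>:role/s3FirehoseRole" \
    --permissions DESCRIBE \
    --resource '{"Database": {"Name": "s3tables_resource_link"}}'

aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier="arn:aws:iam::<account-id>:role/s3FirehoseRole" \
    --permissions ALL \
    --resource '{"Table": {
        "CatalogId": "<account-id>:s3tablescatalog/<s3tablebucket>",
        "DatabaseName": "s3tables_demo_namespace",
        "Name": "s3tables_demo_table"
    }}'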
7. Set up a Firehose stream
To create a Firehose stream, open Firehose in the Console and choose Create Firehose stream. Choose Direct PUT as the source and Apache Iceberg Tables as the destination. Then, choose a name for your Firehose stream, adhering to the naming conventions displayed in the interface, as shown in the following figure.
Under Destination settings, specify the database and table names to which Firehose should write. If you want the Firehose stream to write to one table only, you can use the Unique key configuration section. To configure this section, choose the resource link (s3tables_resource_link) that you created in Step 4 as your database name, and the table (s3tables_demo_table) you created in Step 3 as the table name, as shown in the following figure. Firehose delivers records to S3ErrorOutputPrefix if it fails to deliver them to the configured table.
Specify an Amazon S3 general purpose bucket to store records that fail to be delivered to your S3 table bucket, as shown in the following figure.
Under IAM role, choose the role you previously created for Firehose in Step 5, then choose Create Firehose stream to create your Firehose stream, as shown in the following figure.
When the stream is created, monitor the Firehose delivery stream status until it changes to Active, as shown in the following figure.
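You can also poll the status from the AWS CLI; the stream name below is whatever you chose when creating the stream:

aws firehose describe-delivery-stream \
    --delivery-stream-name <your-firehose-stream> \
    --query 'DeliveryStreamDescription.DeliveryStreamStatus'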
8. Send streaming data using Kinesis Data Generator
Kinesis Data Generator is an application that allows you to send streaming data to Firehose. To begin, first configure Kinesis Data Generator for your account. Then, set the Region to match your Firehose stream, and choose the Firehose stream created in Step 7. Use a record template that matches the table schema defined in Step 3, as shown in the following figure. Depending on the buffer interval set when creating the Firehose stream, it may take up to 900 seconds for the data to appear in your table. For this post, we leave it at the default value of 300 seconds.
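A template along the following lines matches the id/name/value schema; Kinesis Data Generator fills the {{ }} placeholders with faker.js-style generators, and the specific generators here are illustrative:

{
    "id": {{random.number(10000)}},
    "name": "{{name.firstName}}",
    "value": {{random.number(1000)}}
}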

9. Verify and query data using Athena
To query the data using Athena, you must grant AWS Lake Formation permissions on the S3 table to the user or role you plan to use for Athena queries. In the left navigation pane of the Lake Formation console, choose Data permissions, choose Grant, and then, under Principals, choose the user or role you will use to access Athena. Under LF-Tags or catalog resources, choose Named Data Catalog resources, the Default catalog, and the resource link associated with your S3 table bucket. Then choose the S3 table and grant Super under Table permissions, as shown in the following figure.
To query and verify data ingested from Firehose, run a SELECT command in Athena as shown in the following figure. As long as data is being streamed from Kinesis Data Generator, you should continue to see the row count increasing in this table, confirming successful data ingestion.
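For example, a row count query through the resource link might look like the following, using the database and table names created above:

SELECT COUNT(*) FROM "s3tables_resource_link"."s3tables_demo_table";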
Cleaning up
To avoid future charges, delete the resources you created: the Firehose stream, the table, namespace, and table bucket in Amazon S3 Tables, and the general purpose S3 bucket used for error output.
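For example, after deleting the Firehose stream in the console, the S3 Tables resources created above can be removed from the AWS CLI; order matters, with tables deleted before namespaces and namespaces before the bucket:

aws s3tables delete-table \
    --table-bucket-arn arn:aws:s3tables:<region>:<account-id>:bucket/<s3tablebucket> \
    --namespace s3tables_demo_namespace \
    --name s3tables_demo_table

aws s3tables delete-namespace \
    --table-bucket-arn arn:aws:s3tables:<region>:<account-id>:bucket/<s3tablebucket> \
    --namespace s3tables_demo_namespace

aws s3tables delete-table-bucket \
    --table-bucket-arn arn:aws:s3tables:<region>:<account-id>:bucket/<s3tablebucket>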
Considerations and limitations
Before using Firehose with Apache Iceberg tables, review the applicable considerations and limitations. For more information, see Considerations and limitations in the Amazon Data Firehose documentation.
Conclusion
In this post, we showed you how to build a managed data lake from a streaming data source using Amazon Data Firehose and Amazon S3 Tables. We also showed how to query these tables from S3 table buckets using Amazon Athena. This integration enables businesses to automatically capture, store, and analyze streaming data without managing complex infrastructure, thereby accelerating data-driven decision making.