AWS Storage Blog

Build a data lake for streaming data with Amazon S3 Tables and Amazon Data Firehose

Businesses are increasingly adopting real-time data processing to stay ahead of user expectations and market changes. Industries such as retail, finance, manufacturing, and smart cities use streaming data for everything from optimizing supply chains to detecting fraud and improving urban planning. The ability to act on data as it is generated has become a critical competitive advantage, driving demand for scalable data lake architectures for storing and managing streaming data. Increasingly, users organize their streaming data in data lakes with Apache Iceberg to take advantage of its database-like features, such as schema evolution, time travel, and ACID transactions.

Amazon S3 Tables provide purpose-built storage with a simple, performant, and cost-effective way to store and query Apache Iceberg tables. S3 Tables continuously optimize storage to maximize query performance and minimize costs, making them an excellent choice for businesses looking to streamline their data lake operations without additional infrastructure setup. Businesses can stream to and query tables in S3 table buckets through AWS analytics services, such as Amazon Data Firehose and Amazon Athena, by integrating the table buckets with the AWS Glue Data Catalog and AWS Lake Formation. The S3 Tables integration with AWS analytics services for table buckets is in preview. Firehose is a fully managed, serverless service that streams data from various sources to data lakes, data warehouses, and analytics data stores. With built-in support for Iceberg, Firehose can deliver real-time data from multiple sources to Iceberg tables in Amazon S3 without provisioning additional resources or paying for idle streams. It streamlines ingestion by processing streaming records as they arrive, eliminating the multi-step process of writing streaming data in raw formats and then converting it to Apache Iceberg format.

In this post, we walk through building a fully managed data lake using Firehose and S3 Tables to store and analyze real-time streaming data. We use a custom streaming source to deliver data into a table in an S3 table bucket, but the same workflow can be followed for other sources supported by Firehose that are listed in this Amazon Data Firehose user guide.

Solution overview

In this solution, we demonstrate an example where a user ingests streaming data from a source directly into a table in an S3 table bucket. We start by creating an S3 table bucket and integrating it with AWS analytics services through the AWS Glue Data Catalog and AWS Lake Formation. Then, we use Amazon Kinesis Data Generator to simulate and publish real-time data streams to Firehose, and use Athena to view the data that is streamed into the table in the table bucket.

The architecture diagram of the solution

Prerequisites

To follow along, you need an AWS account with permissions to use Amazon S3 Tables, Amazon Data Firehose, AWS Glue, AWS Lake Formation, Athena, and IAM; the AWS CLI installed and configured; and access to Amazon Kinesis Data Generator to simulate streaming data.

Walkthrough

The following steps walk you through this solution.

  1. Create an S3 table bucket and integrate with AWS Analytics services

Navigate to Amazon S3 in the Console. Choose Table buckets and then Enable integration if the integration with AWS Analytics services isn’t already enabled, as shown in the following figure. This integration allows users to discover all the tables created in this AWS Region and this account in AWS Glue Data Catalog and AWS Lake Formation, and access them through AWS services, such as Firehose, Athena, Amazon Redshift, and Amazon EMR. When the integration is complete, all existing and future table buckets are automatically added as a sub-catalog, namespaces are organized as databases, and the tables within those namespaces are populated as tables in the AWS Glue Data Catalog. To learn more about this integration, refer to Using Amazon S3 with Analytics services.

Create an S3 table bucket and integrate with AWS Analytics services

Specify a name for your table bucket and continue with Create table bucket, as shown in the following figure.

Create table bucket

2. Create a namespace in the table bucket

Using the AWS CLI, create a namespace s3tables_demo_namespace in the table bucket you created previously, as shown in the following command. Namespaces are logical constructs that help you organize your tables in a scalable manner.

aws s3tables create-namespace \
--table-bucket-arn arn:aws:s3tables:<region>:<account-id>:bucket/<s3tablebucket> \
--namespace s3tables_demo_namespace

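To confirm the namespace was created, you can list the namespaces in the bucket; a quick check using the same placeholder values:

aws s3tables list-namespaces \
--table-bucket-arn arn:aws:s3tables:<region>:<account-id>:bucket/<s3tablebucket>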

3. Create a table in the table bucket

Using the AWS CLI, create a table named s3tables_demo_table in the existing namespace in the table bucket. When you create a table, you can also define its schema. For this post, we create a table with a schema consisting of three fields: id, name, and value.

aws s3tables create-table \
--cli-input-json file://mytabledefinition.json

The following is the sample mytabledefinition.json used to set the table schema.

{
    "tableBucketARN": "arn:aws:s3tables:<region>:<account-id>:bucket/<s3tablebucket>",
    "namespace": "s3tables_demo_namespace",
    "name": "s3tables_demo_table",
    "format": "ICEBERG",
    "metadata": {
        "iceberg": {
            "schema": {
                "fields": [
                    {"name": "id", "type": "int", "required": true},
                    {"name": "name", "type": "string"},
                    {"name": "value", "type": "int"}
                ]
            }
        }
    }
}

The following image shows the sample output from the command:

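To verify the table and its metadata, you can retrieve it with get-table; a quick check using the same placeholder values:

aws s3tables get-table \
--table-bucket-arn arn:aws:s3tables:<region>:<account-id>:bucket/<s3tablebucket> \
--namespace s3tables_demo_namespace \
--name s3tables_demo_table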

4. Create a resource link to the namespace

Firehose streams data to tables in databases registered in the default catalog of the AWS Glue Data Catalog. To stream data to tables in S3 table buckets, create a resource link in the default catalog that points to the namespace in the table bucket. A resource link is a Data Catalog object that acts as an alias or pointer to another Data Catalog resource, such as a database or table. To create a resource link using the AWS CLI, provide the namespace you created in Step 2 for DatabaseName, and replace <region> and <account-id> with your values before you run the command.

aws glue create-database --region <region> \
--cli-input-json \
'{
    "CatalogId": "<account-id>",
    "DatabaseInput": {
        "Name": "s3tables_resource_link",
        "TargetDatabase":{
          "CatalogId":"<account-id>:s3tablescatalog/<s3tablebucket>",
          "DatabaseName":"s3tables_demo_namespace",
          "Region": "<region>"
        }
    }
}'

The following image shows the sample output from the command.

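To confirm the resource link is registered in the default catalog, you can fetch it with the AWS Glue CLI:

aws glue get-database --name s3tables_resource_link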

5. Create an IAM role for Firehose

Create an IAM role that grants Firehose permissions to perform operations on tables in the default AWS Glue Data Catalog, to back up failed records to a general purpose S3 bucket during streaming, and to interact with Amazon Kinesis Data Streams. Additionally, depending on your Firehose configuration, you may choose to grant extra permissions for Amazon CloudWatch logging and AWS Lambda function operations. To configure, navigate to IAM in the Console and create an IAM role with the permissions policy as mentioned in this Amazon S3 Tables user guide. Keep track of the role you create, as you need it later to grant AWS Lake Formation permissions.
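
In addition to the permissions policy from the user guide, the role needs a trust policy that lets Firehose assume it. The following is a minimal sketch; the sts:ExternalId condition scopes the trust to your account, and <account-id> is a placeholder for your account ID:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "firehose.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"sts:ExternalId": "<account-id>"}
            }
        }
    ]
}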

6. Configure AWS Lake Formation permissions

AWS Lake Formation manages access to your table resources. Lake Formation uses its own permissions model that enables fine-grained access control for Data Catalog resources. For Firehose to ingest data into table buckets, the Firehose role (created in Step 5) requires DESCRIBE permission on the resource link (created in Step 4) to discover the S3 Tables namespace, and read/write permissions on the underlying table.

To add DESCRIBE permission on the resource link, navigate to Lake Formation in the Console. Choose Databases on the left menu, and then choose the resource link you created in Step 4. Choose Actions, choose Grant, and then grant Describe permission to the Firehose role, as shown in the following figures. In this example, the Firehose role is named s3FirehoseRole.

Grant permissions

To provide read and write permissions on specific tables, go back and choose Databases on the left menu, then choose the resource link you created in Step 4. Choose Actions, and then choose Grant on target. Choose the Firehose role, databases, and tables, then grant Super permission to the Firehose role, as shown in the following figures.

Grant permissions
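
The same DESCRIBE grant can be made from the AWS CLI; a minimal sketch, assuming the role is named s3FirehoseRole as in this example:

aws lakeformation grant-permissions \
--principal DataLakePrincipalArn=arn:aws:iam::<account-id>:role/s3FirehoseRole \
--permissions DESCRIBE \
--resource '{"Database": {"Name": "s3tables_resource_link"}}'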

7. Set up a Firehose stream


To create a Firehose stream, open Firehose in the Console and choose Create Firehose Stream. Choose Direct PUT as the source and Apache Iceberg Tables as the destination. Then, choose a name for your Firehose stream, adhering to the naming conventions displayed in the interface, as shown in the following figure.

Under Destination settings, specify the database and table names to which Firehose should write. If you want the Firehose stream to write to one table only, then you can use the Unique key configuration section. In this section, choose the resource link (s3tables_resource_link) that you created in Step 4 as your database name, and the table (s3tables_demo_table) you created in Step 3 as the table name, as shown in the following figure. Firehose delivers failed records to S3ErrorOutputPrefix if it can't deliver to the configured table.

Destination settings

Specify an Amazon S3 general purpose bucket to store records that fail to be delivered to your S3 table bucket, as shown in the following figure.

Backup settings

Under IAM role, choose the role you previously created for Firehose in Step 5, then choose Create Firehose stream to create your Firehose stream, as shown in the following figure.

Create firehose stream

When the stream is created, monitor the Firehose stream status until it changes to Active, as shown in the following figure.

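You can also poll the stream status from the AWS CLI; replace <stream-name> with the name you chose:

aws firehose describe-delivery-stream \
--delivery-stream-name <stream-name> \
--query 'DeliveryStreamDescription.DeliveryStreamStatus'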

8. Send streaming data using Kinesis Data Generator

Kinesis Data Generator is an application that allows you to send streaming data to Firehose. To begin, first configure Kinesis Data Generator for your account. Then, set the Region to match your Firehose stream, and choose the Firehose stream created in Step 7. Use the following template, which matches the table schema defined in Step 3, as shown in the following figure. Depending on the buffer interval set when creating the Firehose stream, it may take up to 900 seconds for the data to appear in your table. For this post, we leave it at the default value of 300 seconds.

{
    "id": {{random.number(99999)}},
    "name": "{{random.arrayElement( ["CHECKIN","CHECKOUT","PAYMENT_FULL","PAYMENT_PARTIAL"] )}}",
    "value": {{random.number(99999)}}
}
 
Send streaming data using Kinesis Data Generator
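
If you prefer to test without Kinesis Data Generator, you can send a single record that matches the schema directly from the AWS CLI; a minimal sketch, with <stream-name> as a placeholder:

aws firehose put-record \
--delivery-stream-name <stream-name> \
--cli-binary-format raw-in-base64-out \
--record '{"Data":"{\"id\":1,\"name\":\"CHECKIN\",\"value\":42}"}'
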
9. Verify and query data using Athena

To query the data using Athena, you must grant AWS Lake Formation permissions on the S3 table to the user or role you plan to use for Athena queries. In the left navigation pane of the Lake Formation console, choose Data permissions, choose Grant, and then under Principals choose the user or role you will use to access Athena. Under LF-Tags or catalog resources, choose Named Data Catalog resources, the Default catalog, and the resource link associated with your S3 table bucket. Then choose the S3 table and grant Super under Table permissions, as shown in the following figure.

Verify and query data using Athena

To query and verify data ingested from Firehose, run a SELECT command in Athena as shown in the following figure. As long as data is being streamed from Kinesis Data Generator, you should continue to see the row count increasing in this table, confirming successful data ingestion.
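
For example, a query like the following counts the ingested rows, assuming the resource link and table names used in this post:

SELECT COUNT(*) FROM "s3tables_resource_link"."s3tables_demo_table";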


Cleaning up

To avoid future charges, delete the resources you created, including the Firehose stream, the table, namespace, and table bucket in Amazon S3 Tables, and the resource link in the AWS Glue Data Catalog.
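
A minimal cleanup sketch from the AWS CLI, using the placeholder values from this post (delete the table before the namespace, and the namespace before the bucket):

aws firehose delete-delivery-stream --delivery-stream-name <stream-name>

aws s3tables delete-table \
--table-bucket-arn arn:aws:s3tables:<region>:<account-id>:bucket/<s3tablebucket> \
--namespace s3tables_demo_namespace \
--name s3tables_demo_table

aws s3tables delete-namespace \
--table-bucket-arn arn:aws:s3tables:<region>:<account-id>:bucket/<s3tablebucket> \
--namespace s3tables_demo_namespace

aws s3tables delete-table-bucket \
--table-bucket-arn arn:aws:s3tables:<region>:<account-id>:bucket/<s3tablebucket>

aws glue delete-database --name s3tables_resource_link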

Considerations and limitations

Before using Firehose with Apache Iceberg tables, review the applicable considerations and limitations. For more information, see Considerations and limitations.

Conclusion

In this post, we showed you how to build a managed data lake from a streaming data source using Amazon Data Firehose and Amazon S3 Tables. We also showed how to query these tables from S3 table buckets using Amazon Athena. This integration enables businesses to automatically capture, store, and analyze streaming data without managing complex infrastructure, thereby accelerating data-driven decision making.

Swapna Bandla

Swapna Bandla is a Senior Streaming Solutions Architect on the AWS Streaming Specialist SA Team. Swapna has a passion for understanding customers' data and analytics needs, and empowering them to develop cloud-based, well-architected solutions. Outside of work, she enjoys spending time with her family.

Anupriti Warade

Anupriti Warade is a Senior Technical Product Manager on the Amazon S3 team at AWS. She is passionate about helping customers innovate and building solutions that solve problems. Anupriti is based out of Seattle, Washington and enjoys spending time with family and friends, cooking, and creating DIY crafts.

Phaneendra Vuliyaragoli

Phaneendra Vuliyaragoli is a Product Management Lead for Amazon Data Firehose at AWS. In this role, Phaneendra leads the product and go-to-market strategy for Amazon Data Firehose.

Prashant Singh

Prashant Singh is a Software Development Engineer with expertise in databases and data warehouse engines. He has worked on optimizing Apache Iceberg performance and its integration across services such as Amazon EMR (Apache Spark), Amazon Redshift, and Amazon Data Firehose. Prashant is also an active contributor to open source projects, including Apache Spark and Apache Iceberg. Outside of work, he enjoys exploring new places, skiing, and hiking.

Rakesh Ghodasara

Rakesh Ghodasara is a Solutions Architect at AWS in the Auto Manufacturing vertical. He helps customers with application modernization and cloud migrations. His core expertise is in data analytics and cloud architecture. Outside of work, he likes playing table tennis, watching TV, and playing with his daughters.