Guidance for Migrating Tabular Data from Amazon S3 to S3 Tables

Overview

This Guidance demonstrates how to migrate tabular data from Amazon Simple Storage Service (Amazon S3) general purpose buckets to Amazon S3 Tables, purpose-built storage for tabular data. S3 Tables introduces a new bucket type, S3 table bucket, that stores fully managed Apache Iceberg tables to deliver up to three times faster query performance and up to ten times higher transactions per second compared to storing Iceberg tables in Amazon S3 general purpose buckets.

The Guidance sets up an automated migration process for moving Apache Iceberg and Apache Hive tables registered in AWS Glue Data Catalog and stored in Amazon S3 general purpose buckets to Amazon S3 table buckets using AWS Step Functions and Amazon EMR with Apache Spark. With built-in support for Apache Iceberg, you can query tabular data in S3 table buckets with popular query engines including Amazon Athena, Amazon Redshift, and Apache Spark.
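As a rough illustration of the Spark step in this migration, the sketch below builds the Spark catalog properties and a CREATE TABLE ... AS SELECT statement that copies one Glue Data Catalog table into an S3 table bucket. The catalog alias `s3tablesbucket`, the helper names, and the sample ARN, database, namespace, and table names are assumptions for illustration. The sketch also assumes the Glue Data Catalog is reachable through Spark's default `spark_catalog`. The Guidance's own sample code may structure this differently.

```python
"""Sketch: Spark settings and SQL for copying one table into an S3 table bucket.

Assumptions (not taken from the Guidance): the catalog alias, the helper
names, and every sample identifier used here are hypothetical.
"""
import re

CATALOG = "s3tablesbucket"  # hypothetical Spark catalog alias

# Conservative identifier check (lowercase letters, digits, underscores) so
# that names can be interpolated into SQL safely. This is an assumption and
# may be stricter than what a given source database actually allows.
_IDENTIFIER = re.compile(r"^[a-z0-9_]+$")


def _checked(name: str) -> str:
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return name


def s3_tables_spark_conf(table_bucket_arn: str, catalog: str = CATALOG) -> dict[str, str]:
    """Return Spark properties that register a table bucket as an Iceberg catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "software.amazon.s3tables.iceberg.S3TablesCatalog",
        f"{prefix}.warehouse": table_bucket_arn,
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
    }


def ctas_statement(
    source_db: str,
    source_table: str,
    target_namespace: str,
    target_table: str,
    catalog: str = CATALOG,
) -> str:
    """Build a CTAS statement that copies a source table into the table bucket."""
    target = f"{catalog}.{_checked(target_namespace)}.{_checked(target_table)}"
    source = f"spark_catalog.{_checked(source_db)}.{_checked(source_table)}"
    return f"CREATE TABLE IF NOT EXISTS {target} USING iceberg AS SELECT * FROM {source}"


if __name__ == "__main__":
    arn = "arn:aws:s3tables:us-east-1:111122223333:bucket/example-table-bucket"
    for key, value in s3_tables_spark_conf(arn).items():
        print(f"--conf {key}={value}")
    print(ctas_statement("sales_db", "orders", "sales", "orders"))
```

In a Spark job, the properties would typically be passed as `--conf` arguments to `spark-submit` and the statement run with `spark.sql(...)`. A plain CTAS does not carry over partitioning or snapshot history, so a real migration would need to handle those separately.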

How it works

These technical details feature an architecture diagram to illustrate how to effectively use this solution. The architecture diagram shows the key components and their interactions, providing an overview of the architecture's structure and functionality step-by-step.

Get Started

Deploy this Guidance

Sample code

Use sample code to deploy this Guidance in your AWS account

Well-Architected Pillars

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

By using AWS CloudFormation, you gain automated deployment and visibility into all created AWS resources and their deployment status. For monitoring and alerting, AWS Lambda functions write invocation and operational events to Amazon CloudWatch Logs, while Amazon Simple Notification Service (Amazon SNS) sends email notifications about the migration workflow's status. Together, these services support auditing and monitoring of S3 Tables. S3 Tables also provide automated compaction and unreferenced file cleanup. This combination of services helps maintain performance, simplifies troubleshooting, and reduces operational overhead.
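The logging and notification pattern described above might look roughly like the following sketch. The function names, message layout, and sample values are assumptions for illustration and are not taken from the Guidance's sample code. The only service call used is the `publish` method of a boto3 SNS client, which is passed in so the helper can be exercised without AWS credentials.

```python
"""Sketch: log a migration status and publish it to an SNS topic.

Assumptions: helper names and the message layout are hypothetical.
`sns_client` is expected to behave like `boto3.client("sns")`.
"""
import json
import logging

logger = logging.getLogger(__name__)

# Amazon SNS requires subjects to be shorter than 100 characters.
SNS_SUBJECT_MAX = 99


def build_status_notification(topic_arn: str, table_name: str, status: str, detail: str = "") -> dict:
    """Return keyword arguments for an SNS publish call."""
    subject = f"S3 Tables migration {status}: {table_name}"[:SNS_SUBJECT_MAX]
    message = json.dumps(
        {"table": table_name, "status": status, "detail": detail}, indent=2
    )
    return {"TopicArn": topic_arn, "Subject": subject, "Message": message}


def notify(sns_client, topic_arn: str, table_name: str, status: str, detail: str = ""):
    """Log the status (in Lambda, this lands in CloudWatch Logs) and publish it."""
    logger.info("migration status for %s: %s", table_name, status)
    kwargs = build_status_notification(topic_arn, table_name, status, detail)
    return sns_client.publish(**kwargs)
```

Inside a Lambda function, `notify(boto3.client("sns"), topic_arn, "orders", "SUCCEEDED")` would send the email to every confirmed subscriber of the topic.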

Read the Operational Excellence whitepaper

S3 Tables and IAM work together to provide robust security measures. They offer identity-based and resource-based fine-grained access controls so that only authorized users and processes can interact with your data. Data protection is enhanced through encryption at rest and in transit, safeguarding your information throughout the migration process and beyond. IAM is designed for precise control over who can access AWS resources and what actions they can perform, enabling you to maintain strict compliance requirements. By implementing these security features, you can prevent unauthorized access to table data, protect sensitive information, and ensure that your migration process adheres to your organization's security policies and regulatory standards.
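To make the idea of fine-grained, identity-based access concrete, the sketch below builds a read-only IAM policy document scoped to the tables in one table bucket. The helper name, statement ID, and sample ARN are hypothetical, and the action list is an assumption about what a read-only consumer needs. Check the action names and resource ARN format against the current S3 Tables documentation before relying on them.

```python
"""Sketch: a read-only identity-based policy for tables in one table bucket.

Assumptions: the helper name, Sid, and sample ARN are hypothetical, and the
action list should be verified against the S3 Tables documentation.
"""
import json


def read_only_table_policy(table_bucket_arn: str) -> dict:
    """Return a policy document that allows reading data from every table in the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadTablesInBucket",
                "Effect": "Allow",
                "Action": [
                    "s3tables:GetTableData",
                    "s3tables:GetTableMetadataLocation",
                ],
                # Limits the grant to tables in this one table bucket.
                "Resource": f"{table_bucket_arn}/table/*",
            }
        ],
    }


if __name__ == "__main__":
    arn = "arn:aws:s3tables:us-east-1:111122223333:bucket/example-table-bucket"
    print(json.dumps(read_only_table_policy(arn), indent=2))
```

Keeping write actions out of the policy means a principal with only this policy attached can query the migrated tables but cannot modify them.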

Read the Security whitepaper

Lambda automatically scales to handle increasing concurrent requests across multiple Availability Zones (AZs) for high availability. Amazon SNS delivers messages across AZs, while Amazon S3 provides durable, multi-AZ storage for logs. S3 Tables offer automated maintenance, support concurrent operations, and inherit the durability of Amazon S3. Step Functions contributes retry and catch mechanisms for workflow management. AWS Glue Tables provide a serverless way to organize related data. Together, these services support consistent performance, data durability, and automated maintenance throughout the migration, which reduces manual intervention and improves the reliability of your data operations.
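The retry and catch mechanisms mentioned above are declared on individual states in the Amazon States Language. The sketch below builds one such task state as a Python dictionary. The state names, retry settings, and input paths (`$.ClusterId`, `$.SparkSubmitArgs`) are assumptions for illustration, not the Guidance's actual workflow definition.

```python
"""Sketch: a Step Functions task state with Retry and Catch.

Assumptions: state names, retry numbers, and input paths are hypothetical.
The task submits a step to an existing Amazon EMR cluster and waits for it.
"""
import json


def migration_task_state(next_state: str, failure_state: str) -> dict:
    """Return an Amazon States Language task state for the Spark migration step."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
        "Parameters": {
            "ClusterId.$": "$.ClusterId",
            "Step": {
                "Name": "migrate-table",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args.$": "$.SparkSubmitArgs",
                },
            },
        },
        # Retry failed attempts with exponential backoff: 30 s, then 60 s.
        "Retry": [
            {
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 30,
                "MaxAttempts": 2,
                "BackoffRate": 2.0,
            }
        ],
        # After retries are exhausted, route any error to a failure handler
        # and keep the error details alongside the original input.
        "Catch": [
            {
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.error",
                "Next": failure_state,
            }
        ],
        "Next": next_state,
    }


if __name__ == "__main__":
    state = migration_task_state("NotifySuccess", "NotifyFailure")
    print(json.dumps(state, indent=2))
```

Routing the catch to a dedicated failure state gives the workflow one place to send a failure notification, while transient errors are absorbed by the retry policy first.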

Read the Reliability whitepaper

S3 Tables deliver the same durability, availability, scalability, and performance characteristics as Amazon S3 itself, and automatically optimize storage to maximize query performance and minimize cost. Step Functions enhances efficiency by breaking workflows into smaller, manageable tasks and orchestrating them to reduce overall processing time and resource utilization. AWS Glue Tables contribute with their schema-on-read capability, enabling flexible and efficient querying of large datasets. Collectively, these services deliver improved object data storage, query throughput, and transaction processing for analytics workloads compared to traditional Amazon S3 buckets.

Read the Performance Efficiency whitepaper

Lambda provides serverless compute, enabling cost-effective scaling for multiple parallel invocations without the need for provisioned infrastructure. Amazon S3 offers reliable, low-cost object storage, while Amazon SNS delivers messages to multiple subscribers efficiently. S3 Tables significantly reduce operational costs through automated compaction, snapshot management, and cleanup of unreferenced files. This automation eliminates the need for you to build and maintain costly compute clusters for table optimization, a process that traditionally requires skilled development teams and complex systems.

Furthermore, this Guidance combines cost-effective storage with scalable compute and orchestration, while S3 Tables keep Apache Iceberg tables performant without additional infrastructure costs. This approach not only optimizes expenses but also improves reliability and lowers the barrier to entry for modern analytics.

Read the Cost Optimization whitepaper

Lambda, a serverless compute service, provisions resources on demand, reducing energy consumption by eliminating idle infrastructure. Similarly, Amazon SNS offers serverless messaging, delivering messages between applications and subscribers without maintaining always-on servers. Amazon S3 Tables further contribute to sustainability by optimizing storage layout through compaction and removing unnecessary data through automated maintenance, which reduces the storage footprint required for data persistence. By using these serverless and storage-efficient services, your migration process becomes more cost-effective and also aligns with environmental sustainability goals.

Read the Sustainability whitepaper

Disclaimer

The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.

Related Content

Build a managed transactional data lake with Amazon S3 Tables

How Amazon S3 Tables use compaction to improve query performance by up to 3 times

New Amazon S3 Tables: Storage optimized for analytics workloads
