This Guidance demonstrates how to migrate tabular data from Amazon Simple Storage Service (Amazon S3) general purpose buckets to Amazon S3 Tables, storage that is purpose-built for tabular data. S3 Tables introduces a new bucket type, the S3 table bucket, which stores fully managed Apache Iceberg tables and delivers up to three times faster query performance and up to ten times higher transactions per second than Iceberg tables stored in Amazon S3 general purpose buckets.
The Guidance sets up an automated process for migrating Apache Iceberg and Apache Hive tables that are registered in the AWS Glue Data Catalog and stored in Amazon S3 general purpose buckets to Amazon S3 table buckets, using AWS Step Functions and Amazon EMR with Apache Spark. Because S3 table buckets have built-in support for Apache Iceberg, you can query the migrated data with popular engines including Amazon Athena, Amazon Redshift, and Apache Spark.
Architecture Diagram
Step 1
The user deploys this solution using AWS CloudFormation by creating a stack through the AWS Management Console.
Step 2
CloudFormation deploys resources including AWS Lambda functions, AWS Identity and Access Management (IAM) roles, custom resources, an AWS Step Functions state machine, and a PySpark script.
Step 3
The CheckResourceExists Lambda function verifies that the source Amazon S3 bucket and the AWS Glue table to be migrated both exist.
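The existence check in this step can be sketched as follows. This is a minimal illustration, not the code of the CheckResourceExists function itself: the function name, the return shape, and the set of error codes treated as "not found" are assumptions. The clients are passed in as arguments and are expected to be boto3 clients for Amazon S3 and AWS Glue; the sketch relies only on the `head_bucket` and `get_table` calls.

```python
from typing import Any

# Error codes treated as "resource is missing" (an assumption for this sketch).
NOT_FOUND_CODES = {"404", "NoSuchBucket", "EntityNotFoundException"}


def _is_not_found(err: Exception) -> bool:
    """Return True when a boto3 ClientError-style exception means 'missing'."""
    response = getattr(err, "response", None) or {}
    return response.get("Error", {}).get("Code") in NOT_FOUND_CODES


def check_resources_exist(s3: Any, glue: Any, bucket: str,
                          database: str, table: str) -> dict:
    """Report whether the source bucket and the source Glue table exist.

    `s3` and `glue` are expected to be boto3 clients, for example
    boto3.client("s3") and boto3.client("glue").
    """
    result = {"bucket_exists": True, "table_exists": True}
    try:
        s3.head_bucket(Bucket=bucket)
    except Exception as err:
        if not _is_not_found(err):
            raise  # permission or throttling errors should surface
        result["bucket_exists"] = False
    try:
        glue.get_table(DatabaseName=database, Name=table)
    except Exception as err:
        if not _is_not_found(err):
            raise
        result["table_exists"] = False
    return result
```

Errors other than "not found", such as access denied, are re-raised on purpose so that a permissions problem is not mistaken for a missing resource.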
Step 4
CloudFormation creates the EMRLogS3Bucket Amazon S3 bucket, which stores the Amazon EMR cluster logs and the PySpark script used by the Apache Spark jobs on Amazon EMR.
Step 5
The user manually invokes the EMREC2StateMachine Step Functions state machine, which orchestrates the creation of an Amazon EMR cluster and the execution of an Apache Spark job.
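The state machine can be started from the Step Functions console, or programmatically. The sketch below shows one way to do the latter, assuming a boto3 Step Functions client is passed in as `sfn`. The input keys (`sourceDatabase`, `sourceTable`, `targetTableBucketArn`) are hypothetical; whether the deployed state machine reads any execution input at all depends on the deployed template, so check it before relying on this shape.

```python
import json
from datetime import datetime, timezone
from typing import Any


def build_execution_input(source_database: str, source_table: str,
                          target_table_bucket_arn: str) -> str:
    """Serialize migration parameters as a JSON string.

    The key names are hypothetical and are not the Guidance's input schema.
    """
    return json.dumps({
        "sourceDatabase": source_database,
        "sourceTable": source_table,
        "targetTableBucketArn": target_table_bucket_arn,
    }, sort_keys=True)


def start_migration(sfn: Any, state_machine_arn: str,
                    execution_input: str = "{}") -> str:
    """Start one execution and return its ARN.

    `sfn` is expected to be boto3.client("stepfunctions").
    """
    # A timestamped name makes individual runs easy to find in the console.
    name = "migration-" + datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    response = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        name=name,
        input=execution_input,
    )
    return response["executionArn"]
```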
Step 6
The Apache Spark jobs running on the Amazon EMR cluster use the Create Table As Select (CTAS) functionality to migrate data from the source AWS Glue table and source Amazon S3 bucket to the target Amazon S3 table bucket.
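To make the CTAS step concrete, the sketch below builds a Spark SQL `CREATE TABLE ... AS SELECT` statement. It is an illustration rather than the Guidance's PySpark script. It assumes the Spark session has an Iceberg catalog (here called `s3tablesbucket`, a hypothetical name) configured to point at the target table bucket, and that the source AWS Glue table is reachable through the session's default `spark_catalog`. All database, namespace, and table names are placeholders.

```python
import re

# Conservative pattern for catalog, namespace, and table names.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _quote(*parts: str) -> str:
    """Validate each name part and join them as a backtick-quoted identifier."""
    for part in parts:
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"unsupported identifier: {part!r}")
    return ".".join(f"`{part}`" for part in parts)


def build_ctas(target_catalog: str, target_namespace: str, target_table: str,
               source_database: str, source_table: str,
               source_catalog: str = "spark_catalog") -> str:
    """Build a CTAS statement that copies a source table into an Iceberg table."""
    target = _quote(target_catalog, target_namespace, target_table)
    source = _quote(source_catalog, source_database, source_table)
    return f"CREATE TABLE {target} USING iceberg AS SELECT * FROM {source}"


statement = build_ctas("s3tablesbucket", "analytics", "orders",
                       "sales_db", "orders")
print(statement)
```

Inside a PySpark job, the resulting string would be passed to `spark.sql(statement)`. Validating the identifiers before interpolating them avoids building malformed or injected SQL from user-supplied names.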
Step 7
When the migration workflow completes, the EMREC2StateMachine Step Functions state machine sends a notification email to the user through Amazon Simple Notification Service (Amazon SNS).
Step 8
The EMREC2StateMachine Step Functions state machine then terminates the Amazon EMR cluster.
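Steps 5 through 8 can be pictured as a four-state workflow. The sketch below assembles an Amazon States Language definition as a Python dictionary, using the Step Functions service integrations for Amazon EMR and Amazon SNS. It is not the definition that this Guidance deploys: the state names, retry settings, and message text are assumptions, and the cluster configuration in `CreateCluster` is deliberately incomplete, so the output would need real cluster parameters before it could be deployed.

```python
import json


def build_definition(script_uri: str, topic_arn: str) -> dict:
    """Return an illustrative state machine definition as a Python dict."""
    retry = [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 30,
        "MaxAttempts": 2,
        "BackoffRate": 2.0,
    }]
    # If the Spark job fails, record the error but still notify and clean up.
    catch = [{"ErrorEquals": ["States.ALL"], "ResultPath": "$.error",
              "Next": "NotifyUser"}]
    return {
        "StartAt": "CreateCluster",
        "States": {
            "CreateCluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
                # Incomplete on purpose: release label, instances, and roles
                # must be added for a real deployment.
                "Parameters": {"Name": "s3-tables-migration"},
                "ResultPath": "$.cluster",
                "Retry": retry,
                "Next": "RunSparkJob",
            },
            "RunSparkJob": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId.$": "$.cluster.ClusterId",
                    "Step": {
                        "Name": "migrate-table",
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {
                            "Jar": "command-runner.jar",
                            "Args": ["spark-submit", "--deploy-mode",
                                     "cluster", script_uri],
                        },
                    },
                },
                "ResultPath": "$.step",
                "Catch": catch,
                "Next": "NotifyUser",
            },
            "NotifyUser": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {"TopicArn": topic_arn,
                               "Message": "Migration workflow finished."},
                "ResultPath": None,  # serialized as null: keep the state input
                "Next": "TerminateCluster",
            },
            "TerminateCluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster",
                "Parameters": {"ClusterId.$": "$.cluster.ClusterId"},
                "End": True,
            },
        },
    }


definition = build_definition(
    "s3://example-bucket/scripts/migrate.py",  # hypothetical script location
    "arn:aws:sns:us-east-1:111122223333:example-topic",  # hypothetical topic
)
print(json.dumps(definition, indent=2))
```

Routing the `Catch` on the Spark step to the notification state means the user is notified and the cluster is terminated whether the job succeeds or fails, which avoids paying for an idle cluster after an error.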
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
By using CloudFormation, you gain automated deployment and comprehensive visibility into all created AWS resources and their deployment status. To enhance monitoring and alerting, Lambda functions store invocation and operational events in Amazon CloudWatch Logs, while Amazon SNS sends email notifications about the migration workflow's status. These services collectively enable robust auditing and monitoring of S3 Tables. S3 Tables provide automated compaction and unreferenced file cleanup. This combination of services helps ensure optimal performance, facilitates troubleshooting, and minimizes operational overhead, allowing you to maintain excellence in your data management operations.
Security
S3 Tables and IAM work together to provide robust security measures. They offer identity-based and resource-based fine-grained access controls so that only authorized users and processes can interact with your data. Data protection is enhanced through encryption at rest and in transit, safeguarding your information throughout the migration process and beyond. IAM is designed for precise control over who can access AWS resources and what actions they can perform, enabling you to maintain strict compliance requirements. By implementing these security features, you can prevent unauthorized access to table data, protect sensitive information, and ensure that your migration process adheres to your organization's security policies and regulatory standards.
Reliability
Lambda automatically scales to handle increasing concurrent requests across multiple Availability Zones (AZs) for high availability. Amazon SNS delivers messages across AZs, while Amazon S3 provides durable, multi-AZ storage for logs. S3 Tables offer automated maintenance, support for concurrent operations, and inherit the durability of Amazon S3. Step Functions contributes retry and catch mechanisms for workflow management. AWS Glue Tables provide a serverless way to organize related data. These services collectively support consistent performance, data durability, and automated maintenance throughout the migration, minimizing manual intervention and maximizing reliability of your data operations.
Performance Efficiency
S3 Tables deliver the same durability, availability, scalability, and performance characteristics as Amazon S3 itself, and automatically optimize storage to maximize query performance and minimize cost. Step Functions enhances efficiency by breaking workflows into smaller, manageable tasks and orchestrating them to reduce overall processing time and resource utilization. AWS Glue tables contribute a schema-on-read capability, enabling flexible and efficient querying of large datasets. Collectively, these services deliver improved data storage, query throughput, and transaction processing for analytics workloads compared to Amazon S3 general purpose buckets.
Cost Optimization
Lambda provides serverless compute, enabling cost-effective scaling for multiple parallel invocations without the need for provisioned infrastructure. Amazon S3 offers reliable, low-cost object storage, while Amazon SNS delivers messages to multiple subscribers efficiently. S3 Tables significantly reduce operational costs through automated compaction, snapshot management, and cleanup of unreferenced files. This automation eliminates the need for you to build and maintain costly compute clusters for table optimization, a process that traditionally requires skilled development teams and complex systems.
Furthermore, this Guidance combines cost-effective storage with scalable compute and orchestration, while S3 Tables keep Apache Iceberg tables performant without additional infrastructure costs. This approach not only optimizes expenses but also improves reliability and lowers the barrier to entry for modern analytics.
Sustainability
Lambda, a serverless compute service, provisions resources on-demand, reducing energy consumption by eliminating idle infrastructure. Similarly, Amazon SNS offers serverless messaging, efficiently delivering messages between applications and subscribers without maintaining always-on servers. Amazon S3 Tables further contribute to sustainability by optimizing storage layout through compaction and removing unnecessary data through automated maintenance. This approach significantly reduces the storage footprint required for data persistence. By using these serverless and storage-efficient services, your migration process not only becomes more cost-effective but also aligns with environmental sustainability goals, demonstrating a commitment to responsible resource usage in cloud operations.
Related Content
Build a managed transactional data lake with Amazon S3 Tables
How Amazon S3 Tables use compaction to improve query performance by up to 3 times
New Amazon S3 Tables: Storage optimized for analytics workloads
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.