AWS Big Data Blog
Category: Intermediate (200)
Process Apache Hudi, Delta Lake, Apache Iceberg dataset at scale, part 2: Using AWS Glue Studio Visual Editor
June 2023: This post was reviewed and updated for accuracy. AWS Glue supports native integration with Apache Hudi, Delta Lake, and Apache Iceberg. Refer to Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 2: AWS Glue Studio Visual Editor to learn more. Transactional data lake […]
Process Apache Hudi, Delta Lake, Apache Iceberg datasets at scale, part 1: AWS Glue Studio Notebook
August 2023: This post was reviewed and updated for accuracy. AWS Glue supports native integration with Apache Hudi, Delta Lake, and Apache Iceberg. Refer to Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 2: AWS Glue Studio Visual Editor to learn more. Cloud data lakes […]
Accelerate Amazon DynamoDB data access in AWS Glue jobs using the new AWS Glue DynamoDB Export connector
Jan 2024: This post was reviewed and updated for accuracy. Modern data architectures encourage the integration of data lakes, data warehouses, and purpose-built data stores, enabling unified governance and easy data movement. With a modern data architecture on AWS, you can store data in a data lake and use a ring of purpose-built data services […]
Synchronize your AWS Glue Studio Visual Jobs to different environments
June 2023: This post was reviewed and updated for accuracy. AWS Glue has become a popular option for integrating data from disparate data sources due to its ability to integrate large volumes of data using distributed data processing frameworks. Many customers use AWS Glue to build data lakes and data warehouses. Data engineers who prefer […]
Introducing AWS Glue Auto Scaling: Automatically resize serverless computing resources for lower cost with optimized Apache Spark
October 2024: This post has been updated to cover Interactive Sessions support for AWS Glue Auto Scaling. June 2023: This post was reviewed and updated for accuracy. Data created in the cloud has been growing fast recently, so scalability is a key factor in distributed data processing. Many customers benefit from the scalability of […]
Best practices to optimize data access performance from Amazon EMR and AWS Glue to Amazon S3
June 2024: This post was reviewed for accuracy and updated to cover Apache Iceberg. June 2023: This post was reviewed and updated for accuracy. Customers are increasingly building data lakes to store data at massive scale in the cloud. It’s common to use distributed computing engines, cloud-native databases, and data warehouses when you want to […]
Use unsupervised training with K-means clustering in Amazon Redshift ML
June 2023: This post was reviewed and updated for accuracy. Amazon Redshift is a fast, petabyte-scale cloud data warehouse delivering the best price–performance. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. Data analysts and database developers want to use this data to train […]
Simplify data integration pipeline development using AWS Glue custom blueprints
June 2023: This post was reviewed and updated for accuracy. August 2021: AWS Glue custom blueprints are now generally available. Please visit https://docs.aws.amazon.com/glue/latest/dg/blueprints-overview.html to learn more. Organizations spend significant time developing and maintaining data integration pipelines that hydrate data warehouses, data lakes, and lake houses. As data volume increases, data engineering teams struggle to keep up with […]
Get started with the Amazon Redshift Data API
June 2023: This post was reviewed and updated for accuracy. The GitHub repository mentioned in this post is now updated with examples for serverless. Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that enables you to analyze your data at scale. Tens of thousands of customers use Amazon Redshift to […]
Using the Amazon Redshift Data API to interact with Amazon Redshift clusters
June 2023: This post was reviewed and updated for accuracy. July 2021: This post was reviewed and updated to include multi-statement and parameterization support. Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL […]







