AWS Big Data Blog
Category: AWS Glue
Amazon Q data integration adds DataFrame support and in-prompt context-aware job creation
Amazon Q data integration, introduced in January 2024, allows you to use natural language to author extract, transform, and load (ETL) jobs and operations using DynamicFrame, the AWS Glue-specific data abstraction. This post introduces new capabilities for Amazon Q data integration that work together to make ETL development more efficient and intuitive. We’ve added support for DataFrame-based code generation that works across any Spark environment. We’ve also introduced in-prompt context-aware development that applies details from your conversations, working seamlessly with a new iterative development experience.
Accelerate queries on Apache Iceberg tables through AWS Glue auto compaction
In this post, we explore new features of the AWS Glue Data Catalog, which now supports improved automatic compaction of Iceberg tables for streaming data, making it straightforward for you to keep your transactional data lakes consistently performant. Enabling automatic compaction on Iceberg tables reduces their metadata overhead and improves query performance.
Introducing a new unified data connection experience with Amazon SageMaker Lakehouse unified data connectivity
With Amazon SageMaker Lakehouse unified data connectivity, you can confidently connect, explore, and unlock the full value of your data across AWS services and achieve your business objectives with agility. This post demonstrates how SageMaker Lakehouse unified data connectivity helps your data integration workload by streamlining the establishment and management of connections for various data sources.
Building end-to-end data lineage for one-time and complex queries using Amazon Athena, Amazon Redshift, Amazon Neptune and dbt
In this post, we use dbt for data modeling on both Amazon Athena and Amazon Redshift. dbt on Athena supports real-time queries, while dbt on Amazon Redshift handles complex queries, unifying the development language and significantly reducing the technical learning curve. Using a single dbt modeling language not only simplifies the development process but also automatically generates consistent data lineage information. This approach offers robust adaptability, easily accommodating changes in data structures.
Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality
This post explores robust strategies for maintaining data quality when ingesting data into Apache Iceberg tables using AWS Glue Data Quality and Iceberg branches. We discuss two common strategies to verify the quality of published data. We dive deep into the Write-Audit-Publish (WAP) pattern, demonstrating how it works with Apache Iceberg.
Implement historical record lookup and Slowly Changing Dimensions Type-2 using Apache Iceberg
This post explores how to look up the history of records and tables using Apache Iceberg, focusing on Slowly Changing Dimensions (SCD) Type-2. This method creates a new record for each data change while preserving the old ones, thus maintaining a full history. By the end, you’ll understand how to use Apache Iceberg to manage historical records effectively on a typical change data capture (CDC) architecture.
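The SCD Type-2 idea summarized above can be illustrated with a minimal sketch in plain Python. It does not use Apache Iceberg or AWS Glue, and the names `DimRow`, `apply_change`, and `value_as_of` are hypothetical, chosen only for illustration. It assumes each row carries a validity interval, and that a change closes the current row and appends a new one rather than overwriting anything.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class DimRow:
    """One version of a dimension record (hypothetical schema)."""
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the row is still current


def apply_change(rows: List[DimRow], key: str, new_value: str,
                 change_date: date) -> List[DimRow]:
    """Return a new history with the change applied SCD Type-2 style."""
    result = []
    changed = True
    for row in rows:
        if row.key == key and row.valid_to is None:
            if row.value == new_value:
                # Nothing changed: keep the current row as it is.
                changed = False
                result.append(row)
            else:
                # Close the current row instead of overwriting it.
                result.append(replace(row, valid_to=change_date))
        else:
            result.append(row)
    if changed:
        # Append the new current version of the record.
        result.append(DimRow(key, new_value, change_date))
    return result


def value_as_of(rows: List[DimRow], key: str, as_of: date) -> Optional[str]:
    """Look up the value that was valid for `key` on the given date."""
    for row in rows:
        still_open = row.valid_to is None or as_of < row.valid_to
        if row.key == key and row.valid_from <= as_of and still_open:
            return row.value
    return None


history = apply_change([], "c1", "Seattle", date(2024, 1, 1))
history = apply_change(history, "c1", "Boston", date(2024, 6, 1))
```

After these two calls, `history` holds both versions of record `c1`, so `value_as_of` can answer a historical lookup for any date on or after the first `valid_from`.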
Enforce fine-grained access control on data lake tables using AWS Glue 5.0 integrated with AWS Lake Formation
AWS Glue 5.0 supports fine-grained access control (FGAC) based on your policies defined in AWS Lake Formation. FGAC enables you to granularly control access to your data lake resources at the table, column, and row levels. This post demonstrates how to enforce FGAC on AWS Glue 5.0 through Lake Formation permissions.
Use open table format libraries on AWS Glue 5.0 for Apache Spark
Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. In earlier posts, we discussed AWS Glue 5.0 for Apache Spark. In this post, we highlight notable updates on Iceberg, Hudi, and Delta Lake in AWS Glue 5.0.
Introducing AWS Glue 5.0 for Apache Spark
Today, we are launching AWS Glue 5.0, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue 5.0 upgrades the Spark engines to Apache Spark 3.5.2 and Python 3.11, giving you newer Spark and Python releases so you can develop, run, and scale your data integration workloads and get insights faster. This post describes what’s new in AWS Glue 5.0, performance improvements, key highlights on Spark and related libraries, and how to get started on AWS Glue 5.0.
Read and write S3 Iceberg table using AWS Glue Iceberg REST Catalog from Open Source Apache Spark
In this post, we explore how to use open source Apache Spark as a third-party engine configured to work with the AWS Glue Iceberg REST Catalog. The post includes details on how to perform read and write operations against Amazon S3 tables, with AWS Lake Formation managing metadata and underlying data access using temporary credential vending.