Analytics | AWS Big Data Blog

Run Apache Spark Structured Streaming jobs at scale on Amazon EMR Serverless

Amazon EMR Serverless emerges as a pivotal solution for running streaming workloads, enabling the use of the latest open source frameworks like Spark without the need for configuration, optimization, security, or cluster management. In this post, we highlight some of the key enhancements introduced for streaming jobs.

Federate to Amazon Redshift Query Editor v2 with Microsoft Entra ID

In this post, we explore the process of federating into AWS using Microsoft Entra ID and AWS Identity and Access Management (IAM), and how to restrict access to datasets based on permissions linked to AD groups. We guide you through the setup process, and demonstrate how to seamlessly connect to the Redshift Query Editor while making sure data access permissions are accurately enforced based on your Microsoft Entra ID groups.

Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality

This post explores robust strategies for maintaining data quality when ingesting data into Apache Iceberg tables using AWS Glue Data Quality and Iceberg branches. We discuss two common strategies to verify the quality of published data. We dive deep into the Write-Audit-Publish (WAP) pattern, demonstrating how it works with Apache Iceberg.

Implement historical record lookup and Slowly Changing Dimensions Type-2 using Apache Iceberg

This post will explore how to look up the history of records and tables using Apache Iceberg, focusing on Slowly Changing Dimensions (SCD) Type-2. This method creates new records for each data change while preserving old ones, thus maintaining a full history. By the end, you’ll understand how to use Apache Iceberg to manage historical records effectively on a typical CDC architecture.

How REA Group approaches Amazon MSK cluster capacity planning

REA Group, a digital real estate business, uses Amazon Managed Streaming for Apache Kafka (Amazon MSK) and a data streaming platform called Hydro to efficiently share and access large amounts of data across multiple domains and services. This approach allows REA Group to maintain optimal performance and cost-efficiency while scaling to meet growing user demands. In this post, they share their approach to MSK cluster capacity planning.

Simplify data access for your enterprise using Amazon SageMaker Lakehouse

Amazon SageMaker Lakehouse offers a unified solution for enterprise data access, combining data from warehouses and lakes. This post demonstrates how SageMaker Lakehouse integrates scattered data sources, enabling secure enterprise-wide access, and allowing teams to use their preferred tools for predicting and analyzing customer churn. The solution involves multiple data sources, including Amazon S3, Amazon Redshift, and AWS Glue Data Catalog, with AWS Lake Formation managing permissions.

Enforce fine-grained access control on data lake tables using AWS Glue 5.0 integrated with AWS Lake Formation

AWS Glue 5.0 supports fine-grained access control (FGAC) based on your policies defined in AWS Lake Formation. FGAC enables you to granularly control access to your data lake resources at the table, column, and row levels. This post demonstrates how to enforce FGAC on AWS Glue 5.0 through Lake Formation permissions.

Use open table format libraries on AWS Glue 5.0 for Apache Spark

Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. In earlier posts, we discussed AWS Glue 5.0 for Apache Spark. In this post, we highlight notable updates on Iceberg, Hudi, and Delta Lake in AWS Glue 5.0.

Introducing AWS Glue 5.0 for Apache Spark

Today, we are launching AWS Glue 5.0, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue 5.0 upgrades the Spark engines to Apache Spark 3.5.2 and Python 3.11, giving you newer Spark and Python releases so you can develop, run, and scale your data integration workloads and get insights faster. This post describes what’s new in AWS Glue 5.0, performance improvements, key highlights on Spark and related libraries, and how to get started on AWS Glue 5.0.

Read and write S3 Iceberg table using AWS Glue Iceberg Rest Catalog from Open Source Apache Spark

In this post, we will explore how to harness the power of Open source Apache Spark and configure a third-party engine to work with AWS Glue Iceberg REST Catalog. The post will include details on how to perform read/write data operations against Amazon S3 tables with AWS Lake Formation managing metadata and underlying data access using temporary credential vending.

AWS Big Data Blog

Category: Analytics