AWS Big Data Blog

Analyzing your data catalog: Query SageMaker Catalog metadata with SQL

In this post, we demonstrate how to use the metadata export capability in Amazon SageMaker Catalog to run analytics on your catalog, such as reviewing historical changes, monitoring asset growth, and tracking metadata improvements.

Configure a custom domain name for your Amazon MSK cluster enabled with IAM authentication

In the first part of Configure a custom domain name for your Amazon MSK cluster, we discussed why custom domain names are important and described how to configure a custom domain name in Amazon MSK when using SASL_SCRAM authentication. In this post, we discuss how to configure a custom domain name in Amazon MSK when using IAM authentication.

Migrate third-party and self-managed Apache Kafka clusters to Amazon MSK Express and Standard brokers with Amazon MSK Replicator

In this post, we walk you through how to replicate Apache Kafka data from your external Apache Kafka deployments to Amazon MSK Express brokers using MSK Replicator. You will learn how to configure authentication on your external cluster, establish network connectivity, set up bidirectional replication, and monitor replication health to achieve a low-downtime migration.

Building unified data pipelines with Apache Iceberg and Apache Flink

In this post, you build a unified pipeline using Apache Iceberg and Amazon Managed Service for Apache Flink that replaces the dual-pipeline approach. This walkthrough is for intermediate AWS users who are comfortable with Amazon Simple Storage Service (Amazon S3) and AWS Glue Data Catalog but new to streaming from Apache Iceberg tables.

Securely connecting on-premises data systems to Amazon Redshift with IAM Roles Anywhere

In this post, you will learn how to use AWS IAM Roles Anywhere with Amazon Redshift for secure, private connections. This removes the need to expose traffic to the public internet or manage long-lived access keys.

Enhancing Identity Intelligence with Babel Street Match and Amazon OpenSearch

This post explores how combining Babel Street Match with OpenSearch Service helps your organization handle large-scale, multilingual data.

Getting started with Apache Iceberg write support in Amazon Redshift – Part 2

Amazon Redshift now supports DELETE, UPDATE, and MERGE operations for Apache Iceberg tables stored in Amazon S3 and Amazon S3 table buckets. With these operations, you can modify data at the row level, implement upsert patterns, and manage the data lifecycle while maintaining transactional consistency using familiar SQL syntax. You can run complex transformations in Amazon Redshift and write results to Apache Iceberg tables that other analytics engines like Amazon EMR or Amazon Athena can immediately query. In this post, you work with datasets to demonstrate these capabilities in a data synchronization scenario.

Get to insights faster using Notebooks in Amazon SageMaker Unified Studio

In this post, we demonstrate how Notebooks in Amazon SageMaker Unified Studio help you get to insights faster by simplifying infrastructure configuration. You’ll see how to analyze housing price data, create scalable data tables, run distributed profiling, and train machine learning (ML) models within a single notebook environment.

How to use Parquet Column Indexes with Amazon Athena

In this blog post, we use Athena and Amazon SageMaker Unified Studio to explore Parquet Column Indexes and how they can improve Iceberg query performance. We explain what Parquet Column Indexes are, demonstrate their performance benefits, and show you how to use them in your applications.

Implementing Kerberos authentication for Apache Spark jobs on Amazon EMR on EKS to access a Kerberos-enabled Hive Metastore

In this post, we show how to configure Kerberos authentication for Spark jobs on Amazon EMR on EKS, authenticating against a Kerberos-enabled Hive Metastore (HMS) so you can run both Amazon EMR on EC2 and Amazon EMR on EKS workloads against a single, secure HMS deployment.